It’s Store-and-Forward, Jim, but not as we know it.
March 10, 2011
You might have heard of v7’s formal support for store-and-forward and be wondering what this post is about. Well, it’s not about v7’s support for it.
The product feature relates to its ability to recognise when a target system is unavailable, work upstream until it finds the first asynchronous interaction point, and start storing requests there. Requests stay stored until someone intervenes, via a dedicated Business Space widget, to reopen the information flow.
The key points here are: 1) only asynchronous interactions are supported, and 2) human intervention is necessary to restore the flow.
I’m working against a different scenario here. What I want to do is to automatically switch from synchronous to asynchronous processing of selected interactions when a target system is not available. I want synchronous consumers of my service to always get a synchronous response, either complete or partial, depending on whether my respective target systems are reachable or not.
I also want a retry mechanism that will continue accepting requests and process them all ‘offline’ once the failed external system is restored.
And I want the business process to pick up where it left off and carry on with its activity sequence.
So, imagine you apply for a credit card online, and there are 3 key steps to process your application. First we score your risk, then we validate your application and last we fulfill your order.
You can write a short running BPEL process to orchestrate those three services and give the web front end a synchronous response.
Now, suppose risk scoring is a third party service that’s notorious for being down for housekeeping a few hours every day.
Clearly we can’t fulfill your application without having scored your risk, but neither do we want to just tell you to come back later, much later, and that we are sorry but you just wasted your time filling in a form.
What we want is to tell you that your application has been received, it is being processed, and you can look forward to your new credit card arriving in the post real soon (or if you haven’t qualified, a communication to that effect).
So, let’s look at a simple prototype of the short running process, without any store and forward capabilities.
The external systems are implemented as mediation modules and stubbed with Java SCA components. I log the message and create a response from these components.
For the scoring service I did a bit more work. I configured a JNDI string binding that I can manipulate through the admin console, and depending on its value the component throws a modeled fault. This is so I can emulate the system being unavailable.
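To make the idea concrete, here is a minimal Python sketch of such a stub. It is not the actual Java SCA component: the `SETTINGS` dictionary merely stands in for the JNDI string binding, and the names `ScoringUnavailableFault`, `score_risk` and the canned score of 700 are assumptions made for illustration.

```python
class ScoringUnavailableFault(Exception):
    """Stand-in for the modeled fault declared on the scoring interface."""


# Stand-in for the JNDI string binding edited through the admin console.
SETTINGS = {"scoring/available": "true"}


def score_risk(application):
    """Stubbed scoring service: log the request, then reply or fault."""
    print(f"scoring request received: {application}")
    if SETTINGS["scoring/available"].lower() != "true":
        # Emulate the third party system being down for housekeeping.
        raise ScoringUnavailableFault("scoring service is offline")
    # Canned response; a real service would compute the score.
    return {"applicant": application["applicant"], "score": 700}
```

Flipping `SETTINGS["scoring/available"]` to `"false"` plays the role of changing the binding value in the admin console: the next call faults instead of replying.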
I assume you can complete these tasks without assistance.
You can then run some basic tests and confirm that all your modules are hanging together and everything behaves as it should.
So now we can start thinking about how to approach the case when the scoring service is offline.
The first thing you’re going to need is a new module with a long running process implementing a new ScoringService interface with a one way operation taking the same input parameter type as the actual scoring service mediation.
You can think about this asynchronous LRP as a ‘wrapper’ to the synchronous scoring service.
So, this LRP is called asynchronously (there is no reply) and is instantiated once and only once. You will have to work on your correlation properties/set, so requests are routed to the running instance.
On initial request, an instance is created, the request is placed in a list and an attempt is made to call the scoring service. This call is likely to fail (we wouldn’t be here at all otherwise), so the fault handler executes, which puts the process in receive mode. Every additional request is appended to the list and every time we end up putting the process in receive mode again.
If we haven’t received a request for some time, we time out the fault handler so we can probe the Scoring Service.
At some point the Scoring Service will be up and running again and for each pending request we will invoke it, get its response, remove the current pending request from the list and invoke the credit card application short running process, letting it know the scoring activity is now complete (we pass in the score result).
Note that ‘resuming’ the credit card application process does not technically resume anything. It simply creates a new instance but with the scoring data already present.
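The wrapper’s behaviour can be sketched in a few lines of Python. This is only a model of the control flow described above, not BPEL: the class `ScoringWrapper`, its methods, and the `score` and `resume` callables are hypothetical names, and persistence and correlation are left out entirely.

```python
from collections import deque


class ScoringUnavailableFault(Exception):
    """Stand-in for the modeled fault thrown by the scoring service."""


class ScoringWrapper:
    """Models the single long running instance that stores and forwards."""

    def __init__(self, score, resume):
        self.score = score      # call to the synchronous scoring service
        self.resume = resume    # call to the short running process
        self.pending = deque()  # requests that have not been scored yet

    def submit(self, application):
        """One-way operation: store the request, then try to forward."""
        self.pending.append(application)
        self.drain()

    def on_timeout(self):
        """No request arrived for a while, so probe the scoring service."""
        self.drain()

    def drain(self):
        """Forward pending requests in order until one faults."""
        while self.pending:
            application = self.pending[0]
            try:
                result = self.score(application)
            except ScoringUnavailableFault:
                return  # still down: go back to waiting (receive mode)
            self.pending.popleft()
            self.resume(application, result)
```

Each `submit` corresponds to a request routed to the running instance, and `on_timeout` to the fault handler timing out. Requests are handed to `resume` in arrival order once the scoring call succeeds.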
Next you have to modify the short running process so it can detect that the Scoring Service is down, call the async ‘wrapper’ and return a partial response to the client.
When this short running process is called from the UI and the Scoring Service is up, it behaves exactly as before, and the UI receives a complete response.
When the Scoring Service is down, the fault handler runs, the long running process is called, and the UI receives a partial response.
When this short running process is called by the long running one, the scoring invoke is not attempted, the process proceeds with validating and fulfilling the credit card application, and the reply goes back to the long running process, which you can use for generating customer communications.
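The three cases above can be summarised in a rough Python sketch of the short running process. The function name, the `score_result` parameter used to signal a call from the long running process, and the shape of the response dictionaries are all assumptions for illustration, not the actual BPEL.

```python
class ScoringUnavailableFault(Exception):
    """Stand-in for the modeled fault thrown by the scoring service."""


def process_application(application, score, validate, fulfill, defer,
                        score_result=None):
    """Orchestrate score, validate, fulfill; fall back to a partial reply."""
    if score_result is None:
        # Called from the UI: attempt the scoring invoke.
        try:
            score_result = score(application)
        except ScoringUnavailableFault:
            # Fault handler: hand over to the async wrapper, reply partially.
            defer(application)
            return {"complete": False, "status": "received"}
    # Either scoring just succeeded, or the long running process passed
    # the score in, so the scoring invoke is skipped.
    validation = validate(application, score_result)
    order = fulfill(application, validation)
    return {"complete": True, "status": "processed",
            "score": score_result, "order": order}
```

Here `defer` stands for the one-way call to the long running ‘wrapper’, and a non-empty `score_result` marks the call as coming back from it.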
This approach keeps the business logic in a single place (the short running process), and effectively deals with offline treatment of requests when a given system is down.
It also addresses resource management, by creating a single long running process, rather than one for each pending request.
And because the state of a long running process is persisted, all those pending requests survive a server restart, so none of them is lost.
ttfn – gabz