Wednesday, March 11, 2009

LIVE

Ok, we have been LIVE since last Friday and I thought it was time for a short update.

Basically things have gone quite well, except for a lot of problems with AIF and message processing. We have identified some issues with customizations, but we have not come close to solving the biggest issue: batch execution of the AIF message processing services. The weird thing is that they work for some time and then suddenly stop working, without leaving a trace of what went wrong (the status is just set to 'Error' without any error message in QueueManager or in the Exception log). We have done a lot of debugging, but this is very time-consuming and the underlying logic is hard to follow. We have set up the Batch Groups as per the MS definition, routing the batch execution to the correct AOS server (2 non-clustered), and we have tried a lot of different setups for the Batch jobs, with and without dependencies, without any success. The event logs look good and the overall utilization of the server resources is under control.

Part of the complexity is tied to the impersonation logic (RunAs), but this seems to be under control after some permission issues in the first days. We have ended up implementing a manual processing routine that bypasses the impersonation logic to allow debugging. The manual routine works well, but it needs a dedicated resource to act as Queue Manager (not a good solution, but we are able to keep the queue in a controlled state).

I can't go into further details right now, but the new Batch Framework is causing us some pain and this is not what we were hoping for. When the batch jobs run successfully, the performance is acceptable and we are able to process a decent number of messages in either direction. To add some more mystery to the issue, we sometimes have the inbound processing running without problems while the outbound processing fails. Suddenly this changes without any clue or trace of what happened. We are monitoring the AIF lock table closely and are also looking for database locks, without seeing any issues in these areas.

So the hunt continues, and you can expect some more updates about this later, when the issue has hopefully been tracked down and solved.

If anyone has more information on this, you are welcome to leave a comment (I'm still optimistic).

So long
