Monday, March 30, 2009

A few observations 3 weeks after GO LIVE

Ok, now I've been part of a Go Live for AX 2009 SP1 and have gained 3 weeks of operational experience. It's too early to draw any definitive conclusions, but this is a short summary of my key observations, in random order:

  • The decision to go for Windows Server 2008 x64 Standard was right
  • The decision to go for SQL Server 2005 SP2 x64 Standard in an active/passive failover cluster was probably right, except for some breaks in the database communication (exact reason still unknown)
  • AIF performs better than expected, even without utilizing parallel processing on several AOS instances (2,000 messages can easily be processed in a couple of minutes using one unidirectional channel)
  • The BizTalk Adapter hasn't caused us any specific issues so far
  • Pay extra attention when submitting batch jobs to the batch queue (or when constructing new batch jobs), and don't expect all logic to execute automatically under the new Batch Framework: look out for logic tied to the client tier, even in the standard application, and plan for adding some extra logic to keep the code compatible in both interactive and batch mode (see the sketch after this list)
  • Look out for hotfixes from Microsoft (check PartnerSource or CustomerSource frequently) and plan for some delays getting a response (new installations should evaluate the roll-up package released late in February)
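
To illustrate that last point about batch compatibility, here is a minimal X++ sketch of what I mean. The class name DemoBatchTask is made up for the example, and pack/unpack and the dialog methods are left out; runsImpersonated() and isInBatch() are the standard RunBaseBatch members, so treat this as an outline rather than working code from our solution.

    // classDeclaration
    class DemoBatchTask extends RunBaseBatch
    {
    }

    // DemoBatchTask.runsImpersonated()
    // Run server side under the new AX 2009 Batch Framework (the default).
    public boolean runsImpersonated()
    {
        return true;
    }

    // DemoBatchTask.run()
    public void run()
    {
        ;
        // Core processing goes here; keep it free of client-only constructs
        // (forms, dialogs, Box, progress bars) so it can execute on the
        // batch server.

        if (!this.isInBatch())
        {
            // Client-tier feedback is only safe when run interactively.
            Box::info("Processing finished.");
        }
    }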

Bottom line: the Batch Framework is worth paying special attention to during analysis, design, build and testing.

Thursday, March 12, 2009

LIVE, update

Well, we had a breakthrough this morning! At least the batch execution now seems to run as expected. We will conclude after running the batch through the night and tomorrow. Anyway, all news is good news right now.

After investigating the AOS server used to run the AIF batch(es), we discovered some deviations in Performance Manager (a high number of page faults and hence PF Delta). This led us to change the Batch Group and switch to another AOS server. After this we have been able to automatically process inbound and outbound messages. If the batch survives the night and tomorrow, we will take action on the faulting AOS by uninstalling/cleaning it up and installing a new AOS instance. The only load on the suspect server right now is generated by the BizTalk Adapter (.NET Business Connector).

After spending long days in the office since last Friday, I'm heading home more optimistic than ever.

Wednesday, March 11, 2009

LIVE

Ok, we have been LIVE since last Friday and I thought it was time for a short update.

Basically things have gone quite well, except for a lot of problems with AIF and message processing. We have identified some issues with customizations, without coming close to solving the biggest issue: batch execution of the AIF message processing services. The weird thing is that they work for some time and then suddenly stop working, leaving no trace of what went wrong (the status is just set to 'Error' without any error message in the Queue Manager or in the Exception log). We have done a lot of debugging, but this is very time consuming and the underlying logic is hard to follow. We have set up the Batch Groups as per the Microsoft guidelines, routing the batch execution to the correct AOS server (2 non-clustered instances), and we have tried a lot of different setups for the batch jobs, with and without dependencies, without any success. The event logs look good and the overall utilization of the server resources is under control.
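
For anyone experimenting with similar setups, a batch job with dependent tasks is typically wired up along these lines. The task class names and the batch group id are made up for the example; BatchHeader, addTask, addDependency, parmCaption and parmGroupId are standard AX 2009 framework members, so this is a sketch of the technique, not our exact configuration.

    static void CreateAifBatchJob(Args _args)
    {
        BatchHeader      batchHeader;
        DemoInboundTask  inboundTask  = new DemoInboundTask();   // hypothetical RunBaseBatch classes
        DemoOutboundTask outboundTask = new DemoOutboundTask();
        ;
        // Route both tasks to the batch group served by the dedicated AOS.
        inboundTask.batchInfo().parmGroupId('AIF');
        outboundTask.batchInfo().parmGroupId('AIF');

        batchHeader = BatchHeader::construct();
        batchHeader.parmCaption('AIF message processing');
        batchHeader.addTask(inboundTask);
        batchHeader.addTask(outboundTask);

        // Only start outbound processing when inbound has finished.
        batchHeader.addDependency(outboundTask, inboundTask, BatchDependencyStatus::Finished);
        batchHeader.save();
    }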

Part of the complexity is tied to the impersonation logic (RunAs), but this seems to be under control after some permission issues the first few days. We have ended up implementing a manual processing routine that bypasses the impersonation logic to allow debugging (sketched below). The manual routine works well, but it needs a dedicated resource to act as queue manager (not a good solution, but we are able to keep the queue in a controlled state).
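
The manual routine is basically a job that runs the four standard AIF services directly in a client session instead of through the batch server. A simplified sketch along those lines (not our exact code, which has more error handling around it):

    static void AifManualProcessing(Args _args)
    {
        AifGatewayReceiveService     receiveService  = new AifGatewayReceiveService();
        AifInboundProcessingService  inboundService  = new AifInboundProcessingService();
        AifOutboundProcessingService outboundService = new AifOutboundProcessingService();
        AifGatewaySendService        sendService     = new AifGatewaySendService();
        ;
        receiveService.run();   // pick up documents from the adapters
        inboundService.run();   // process the inbound queue
        outboundService.run();  // process the outbound queue
        sendService.run();      // hand documents back to the adapters
    }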

I can't go into further details right now, but the new Batch Framework is causing us some pain, and this is not what we were hoping for. When the batch jobs run successfully, the performance is acceptable and we are able to process a decent number of messages in either direction. To add to the mystery, we sometimes have inbound processing running without problems while we have problems with outbound processing, and then suddenly this changes without any clue or trace of what happened. We are monitoring the AIF lock table closely and are also looking for database locks, without seeing any issues in these areas.
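
Since the jobs flip to 'Error' without leaving a message, a simple way to at least flag when it has happened again is to poll the batch tables. A minimal sketch, assuming the standard BatchJob table and the BatchStatus enum (double-check the names against your own installation):

    static void ListFailedBatchJobs(Args _args)
    {
        BatchJob batchJob;
        ;
        while select batchJob
            where batchJob.Status == BatchStatus::Error
        {
            info(strfmt("Batch job '%1' is in Error status", batchJob.Caption));
        }
    }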

So the hunt continues, and you can expect some more updates about this later, once the issue is hopefully tracked down and solved.

If anyone has more on this, you are welcome to leave a comment (I'm still optimistic).

So long