Sunday, January 13, 2013

4-Node RAC now officially supported; will the silver bullet deliver?

Oracle introduced Real Application Clusters (RAC) back in 2001 with Oracle 9i. In those days, Friends was running on TV, gas-guzzling cars were running on Boston roads and Yantra 2x was running on retailers' back-end systems with Oracle, sans any RAC support. Fast forward to 2006 - Friends had ended its run and The Sopranos was on TV; gas-guzzling sedans were being replaced by hybrid Priuses on Boston roads; and large retailers were running Yantra 7x, with 2x customers either wiped out in the dot-com bust or upgraded to the newer version. More importantly, with 2-node RAC supported by Sterling Distributed Order Management (Yantra was acquired by Sterling in 2005), early adopters like Staples, Gap and Best Buy had taken the 2-node RAC plunge and were asking for support for more nodes to cater to their growing transaction volumes. Fast forward again to 2012. Modern Family is now running on TV, and with the announcement of 4-node RAC support in the IBM Sterling Selling and Fulfillment Suite (SSFS) 9.1 (Sterling was acquired by IBM in 2010), the largest SSFS implementations now have more choices for what to run on their Oracle database systems. To be precise, what is officially supported per the documentation is only 4-node RAC running on Dell PowerEdge M600 servers with two quad-core processors each, for up to 32 processors - which means no official support for IBM Power series, HP Itanium or other processor-based systems.

Is 4-node RAC the silver bullet it is made out to be for your large-volume cross-channel Sterling selling solution? In this article I shall dive into why it took this long for 4-node RAC support to materialize, what it means for your Sterling OMS implementation, and what strategies you can use to make the most of this announcement.

First, the basics - RAC is a shared-disk clustered database: every instance in the cluster has equal access to the database's data on disk. Each instance has its own SGA and background processes, and each session that connects to the cluster database connects to a specific instance in the cluster. The main challenge in the shared-disk architecture is to establish a global memory cache across all the instances in the cluster; otherwise the clustered database becomes IO bound. Oracle establishes this shared cache via a high-speed private network referred to as the cluster interconnect. Sterling OMS is primarily an OLTP application, with the key tables behind the bulk of the Order and Inventory processing transactions being heavily used. Granted, certain functionalities of the Sterling solution behave, or rather are used, in a DSS-like querying manner - specifically the reporting solution, aka Sterling BI, categorized as a DW solution - but that is usually driven off a replicated database and is not the subject of the discussion here. The tables with the heaviest lock contention on the Production transaction schema in any implementation are typically YFS_INVENTORY_ITEM and YFS_ORDER_HEADER.

Competitive studies and independent experiments on RAC (source - various online) have both concluded that certain scenarios - bulk loads, long-running transactions, and high-frequency update applications - are areas where RAC fails to scale out well. Overall performance suffers in these situations because Oracle RAC needs to transfer large amounts of buffer data among the nodes through the interconnect. Applications that make substantial use of serialization (such as Oracle's sequence requests and index updates) also suffer because nodes must wait until operations complete on other nodes before they can continue, so such operations cannot be truly scalable. Incidentally, Oracle has reported excellent near-linear scalability with RAC, including the much-touted TPC-C benchmark where they achieved 1.18 million tpm. Most real-life customized implementations of Sterling OMS show the former set of symptoms (long-running transactions and high-frequency transactions on a particular item or order), and even an out-of-box Sterling implementation uses both indexes and sequences heavily - sequences in particular are used for all primary key generation on all schema tables. Index updates require index leaf blocks to be maintained and passed among multiple nodes.

To optimize Sterling application behavior given these known challenges with the Sterling OMS system and its database design, the Performance Management Guide suggested that 2-node RAC implementations use Jumbo Frames and 10G Ethernet for optimal interconnect traffic times. However, even with those best practices in place, what most implementations and in-house testing discovered was that the application would not scale linearly beyond 2 nodes due to the high global-cache-related latency. Trials and load tests at some of the large customers led to the conclusion that the bulk of the order and inventory update transactions are best handled on one instance. This runs counter to the general published recommendations around cluster load balancing to achieve higher scalability. Once customers and Sterling Performance Consultants discovered the benefits of workload segregation, this approach was further used to take advantage of not just 2 nodes but to scale out to 3+ nodes, although this was not supported with prior versions of Sterling OMS.

The disconnect between Oracle's own benchmarks and the real-world experience of Sterling OMS in the field (or in internal benchmarks) up until SSFS 9.1 can be explained by a combination of the following factors:
1. Benchmarks in spite of their best intentions are skewed favorably and not a realistic representation of the system in Production scenarios where products such as Sterling OMS are customized and integrated with other systems running in older legacy systems or a different data center.
2. New orders, which form the majority of the workload during peak load, cause a high number of inserts into key tables such as YFS_ORDER_HEADER and YFS_ORDER_RELEASE_STATUS. Since multiple transactions such as Create Order, Schedule Order, Release Order and Order Validations can run on multiple JVMs, they all result in updates to the right-most part of the index. High insertion rates are limited by the fact that an index leaf block has to be released by one node before it can be acquired by another.
3. Certain external system interfaces, or highly customized Sterling transactions such as Schedule Order, run longer than under out-of-the-box or benchmark-like conditions. This increases lock holding times.
4. In spite of numerous tweaks and advances to the Hot SKU feature, Inventory Item locking continues to be the Achilles' heel for all high-order-volume implementations. The problem is magnified during events such as Black Friday, when the bulk of the orders are for a limited number of SKUs.
5. Index contention is conventionally addressed using hash partitions or reverse key indexes. Neither of these is supported by Sterling, due to their negative impact on performance (slow query response times) in other conditions.

The last point was addressed partially in 9.1, and more fully in 9.2, with the introduction of randomizing elements within the primary key. I have not had a chance to test the behavior in the field, but internal tests have shown promising results.
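Why randomizing an element of the key helps can be seen with a toy model of index leaf blocks. This is purely an illustration of the principle - it is not the product's actual key-generation algorithm, and the block capacity here is arbitrary:

```python
import random
from collections import Counter

BLOCK_CAPACITY = 100  # toy index leaf block holding 100 keys

def leaf_block(key):
    # Keys sort into leaf blocks by value; model a block as key // capacity.
    return key // BLOCK_CAPACITY

# Sequence-generated keys are monotonically increasing, so concurrent
# inserts from every JVM land in the same right-most leaf block.
seq_keys = range(1_000_000, 1_000_100)
seq_blocks = Counter(leaf_block(k) for k in seq_keys)

# A randomizing element prepended to the key spreads the same inserts
# across many leaf blocks, so nodes contend far less for any one block.
random.seed(42)
rand_keys = [random.randrange(10) * 1_000_000 + k for k in seq_keys]
rand_blocks = Counter(leaf_block(k) for k in rand_keys)

print(len(seq_blocks))   # sequential keys all hit a single hot block
print(len(rand_blocks))  # randomized keys spread across multiple blocks
```

In RAC terms, the single hot block in the first case is exactly the block that must bounce between nodes over the interconnect on every insert.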

What does this mean for your implementation? Do you take the 4-node RAC plunge, or wade cautiously, sticking to a single instance or 2-node RAC? Here is what I suggest -
a. Ascertain the size of your Oracle database through a combination of hardware sizing and capacity planning exercises. Then determine whether that need is best met by a 2-node RAC or whether you truly need more Oracle instances. Even if your load can be handled by a single instance, a 2-node RAC may still give you a more highly available system, especially when it comes to patching and database maintenance.
b. Unless you are planning to use the Dell PowerEdge M600 system for your database needs, you have to run your own set of load and functional tests to ensure that a 3-or-higher-node RAC configuration meets your needs.
c. Even if you are running the supported processor stack for 4-node RAC, you may want to determine the best allocation of Sterling transactions to Oracle instances via the service configuration if your Sterling version is pre-9.2. A single service spread across all instances does not perform best, for the reasons mentioned above. The optimal RAC configuration is best determined by load testing under various workload segregation models. A starting point would be to keep the Order flow related transactions on one instance, inventory updates on another, purges on a third, and so on.
d. If you are running Sterling 9.2, ensure that the primary key randomizing feature is working as expected on the 16 tables where it is enabled by default. Key among them are YFS_ORDER_RELEASE_STATUS and YFS_ORDER_LINE. Insert times, irrespective of the RAC scenario, should be under 10 ms (preferably under 5 ms) and seek (read) times should not exceed 5 ms.

Your "Sterling" Performance Architect can help guide you through these choices. I would love to know what your experience with RAC has been for Sterling so do feel free to write in with your questions or comments.


Sunday, October 7, 2012

When seeing is not believing - Agent flow misconfiguration unraveled



The old saying goes "Seeing is believing", but the other day, while examining an issue in a customer environment, I saw something that made me do a double take - for I could not believe what I saw in the Sterling application configuration. Thus the title of the blog (not to mention my weakness for catchy titles). Read on to find out how the issue was investigated and to learn more about agent/flow configuration internals.

Like most issues it started out mundane - an Invalid Server error from one of the agent logs. The relevant lines from the logs of the agent server AsyncReqAgentServer are pasted below -


<Errors>
    <Error ErrorCode="YCP0223" ErrorDescription="Invalid Server." ErrorRelatedMoreInfo="No Services Configured for this Server: AsyncReqAgentServer">
        <Attribute Name="ErrorCode" Value="YCP0223"/>
        <Attribute Name="ErrorDescription" Value="Invalid Server."/>
        <Attribute Name="ErrorRelatedMoreInfo" Value="No Services Configured for this Server: AsyncReqAgentServer"/>
        <Stack>com.yantra.interop.services.InvalidConfigurationException


The AsyncReqAgentServer is typically used to run the ASYNC_REQ_PROCESSOR transaction. So I did what most of us would do: check out the configuration of the ASYNC_REQ_PROCESSOR transaction. Here is what I saw -


Now you can see what stumped me. On one hand the application configuration was showing one thing, while the same application's logs were vehemently indicating another. Putting on my PE hat, I figured there was more to it than meets the eye and decided to dig a little deeper.

First, I checked if the transaction was indeed running. A quick grep of the agent logs showed that it was running as part of the DefaultAgentServer, as it was the DefaultAgentServer logs that had the "Starting service..." message.
Then I decided to check the other environments to see where it was supposed to be running or configured. In Production, I learnt it was running under the AsyncReqAgentServer. The lower environments were mixed, but in most of them it was running on the DefaultAgentServer.
At this stage a combination of instinct and experience led me to venture a guess that the configuration was probably right in Production and just messed up here and elsewhere - and that I just had to prove it.
So I checked the server configuration instead of the transaction configuration. This is a neat little configuration screen that is not very well known, mostly because it is seldom used. Buried in the Platform Application view > System Administration grouping is the Configured Servers view. It can be used not only to view all the servers defined but also the details of the sub-services or agent criteria configured for each server. Here is a screenshot -



The sub-service list tab shown is accessed by double-clicking and viewing the details of an individual server. Here is what it showed for the AsyncReqAgentServer -

and for the DefaultAgentServer - 

So now it was clear the logs were correct (at least in this scenario), with the ASYNC_REQ_PROCESSOR indeed running as part of the DefaultAgentServer and the AsyncReqAgentServer having no services configured. Thus it was the configuration that was out of whack between the Server and Transaction views. That mystery unravels further if one digs into how these views are displayed and how configuration data is propagated.

The transaction configuration view is based on the YFS_FLOW and YFS_SUB_FLOW tables, whereas the server configuration view and its associated sub-services are built on the YFS_SERVER and YFS_AGENT_CRITERIA tables. Normally these config tables stay in sync when configuration changes are all driven by manual changes. However, in most implementations the Master Config (MC) environment is maintained as the source of config changes, and CDT is used to promote configuration changes to the various environments. A problem in the MC environment (typically a crash or an incorrect data fix) can result in a mis-configuration, which is then promoted to other environments via CDT. Production was spared because it was running an older version of the release and config changes were yet to be promoted there.

Here is a query that I could have used to confirm my observations -

select agent_criteria_id, transaction_key, flow_key, server_key from yfs_agent_criteria
where server_key in (select server_key from yfs_server where server_name = 'AsyncReqAgentServer')

It can be adapted to your situation, e.g. to determine which services are configured under a particular server. So when it comes to Sterling OMS (and perhaps most things in life), if you don't believe what you see, just look further.



Sunday, September 9, 2012

Agent framework scalability and Tuning considerations for high volumes

The core of the Sterling solution for many implementations lies in the Sterling agent framework and the APIs provided for monitoring and order fulfillment. OMS implementations typically use the Schedule Order, Release Order, ConsolidateToShipment and Real Time Availability Monitor agents, to name a few. Although every agent works differently and there is no run_faster parameter available to scale up Sterling agents, there are a few underlying elements that largely control the extent of their scalability. Very little is documented in the public domain on how exactly the agent framework works and to what extent it can scale, so here goes my attempt to demystify agent operations and scalability. This post assumes that you are familiar with OMS nomenclature; if not, you may want to read my earlier post first.

How a Generic Agent works -
A generic Sterling agent is a background batch processing job that does the following -
  1. Checks if there are messages to process in the configured JMS queue. If the queue is empty, posts a getJobs message and goes to Step 2; else goes to Step 4.
  2. Reads the getJobs message and gets the first set of jobs (first batch) from the database using the getJobs method, up to the defined buffer size (the "Number of records to buffer" configuration, which defaults to 5000).
  3. Writes these records back into the configured JMS queue in the form of executeJobs messages, as well as the next getJobs message containing the last fetched record key, such as a TaskQKey.
  4. Retrieves executeJobs messages from the queue and does the necessary processing using the executeJobs method.
  5. After finishing the first batch, gets the next set of jobs (second batch), up to the buffer size, using the last fetched record key in the getJobs message.
  6. Works on the second set of jobs.
  7. Continues the above process till all the present jobs are worked upon.
  8. After all the present jobs are worked upon, waits for a signal, i.e. the agent trigger, to start working again.
  9. Upon getting the signal to start, the agent starts working again, i.e. follows Step 2 to Step 7.
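The steps above can be sketched as a simplified single-process simulation. The in-memory queue here stands in for the JMS queue, and the function names mirror the framework's getJobs/executeJobs terminology; everything else (job keys, counts) is illustrative:

```python
from collections import deque

BUFFER_SIZE = 5000  # the "Number of records to buffer" default

def get_jobs(pending, last_key):
    """Fetch the next batch of job keys after last_key, up to BUFFER_SIZE."""
    return [k for k in pending if k > last_key][:BUFFER_SIZE]

def run_agent(pending_jobs):
    queue = deque()
    executed = []
    queue.append(("getJobs", 0))  # the trigger: post the first getJobs message
    while queue:
        kind, payload = queue.popleft()
        if kind == "getJobs":
            batch = get_jobs(pending_jobs, payload)
            for key in batch:                     # one executeJobs message per record
                queue.append(("executeJobs", key))
            if batch:                             # chain the next getJobs message,
                queue.append(("getJobs", batch[-1]))  # carrying the last fetched key
        else:                                     # process an executeJobs message
            executed.append(payload)
    return executed

jobs = list(range(1, 12001))  # 12,000 pending task-queue records
done = run_agent(jobs)
print(len(done))  # → 12000: all jobs processed, fetched in 5000/5000/2000 batches
```

Because the queue is FIFO, all executeJobs messages of a batch are consumed before the chained getJobs message is picked up - which matches the cycle described in the bullets that follow.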
More details on default agent behavior - 
Triggering an agent is the act of posting a getJobs message to the JMS queue. Triggering may be manual or automatic, i.e. self-triggered. During agent startup, if there are no messages in the queue, the agent automatically triggers itself.
Within the getJobs method, the agent tries to acquire a lock on the YFS_OBJECT_LOCK table for the agent criteria ID.
If the lock is not available, the getJobs method exits and does nothing. This ensures that duplicate sets of records are not retrieved for processing.
If the lock is available, the getJobs method fetches the records which need to be processed.
These records are posted as execute messages to the JMS queue. For each message, depending on the JMS session pooling setting, a new MQ session is created or borrowed to post the message, and then the session is closed or returned to the pool. This default behavior could change in an upcoming version as a result of the testing we undertook for one of our customers.
After the execute messages, one getJobs message is also posted with the last record key, so as to facilitate retrieval of the next batch of messages.
Each thread of the agent picks up execute messages one by one and processes them. Multiple threads of the execute method can run concurrently.
After all the execute messages are consumed, only the getJobs message is left in the queue; an agent thread then uses the getJobs() method to process it and continue the processing cycle.

Scalability concerns and Scaling the Availability Monitor agent -
Are Sterling agents multi-threaded?
Not entirely. The getJobs component of the agent is deliberately made single threaded, via the database lock on YFS_OBJECT_LOCK, to ensure the same set of records is not retrieved and processed multiple times. However, the bulk of the workload is in the executeJobs component, which is multi-threaded and can run in multiple JVMs.

Will my agent scale to meet the peak throughput?
That depends on your volumes. Scaling an agent involves tuning both the getJobs and the executeJobs components. The scaling and tuning of the executeJobs component is a different exercise which varies by use case, so it will not be covered in this post. At low to medium volumes (under 100K jobs/hour), scalability issues lie largely with the executeJobs component, and the default settings that govern agent behavior should work well. If you are using the agent framework to process over 150K "jobs" per hour, there may be challenges with the default implementation. I use the term jobs to denote the message entity: jobs in the case of Schedule Order are distinct orders, and for the Availability Monitor they are distinct inventory items.

What are the elements that affect scaling beyond 100K jobs/hour?
  1. Performance of the getJobs query - The slower the query, the more time is spent retrieving messages.
  2. Time taken to write all of the retrieved executeJobs messages to the queue - The default behavior of creating and closing an MQ session to write each individual message means significant overhead. Using the product HF to enable bulk loading of messages significantly improves the write time per message. Other aspects, such as the persistence setting used for the queue and the network latency between the agent servers and the MQ server, can also affect message write times.
  3. Buffer size of messages to get - The default of 5K may not suffice at very high loads, as it would mean 40 or more executions of the getJobs component to achieve just 200K throughput. Since getJobs is single threaded, there needs to be an optimal number of executions of it.
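The arithmetic behind the last point is easy to verify. Because getJobs is single threaded, its executions form a serial path, and their count is just throughput divided by buffer size (the figures below are the ones quoted above, not measurements):

```python
import math

def getjobs_executions(throughput_per_hour, buffer_size):
    """Number of serial getJobs cycles needed to feed a given hourly throughput."""
    return math.ceil(throughput_per_hour / buffer_size)

# At the 5K default, 200K jobs/hour needs 40 serial getJobs cycles...
print(getjobs_executions(200_000, 5_000))   # → 40
# ...while doubling the buffer to 10K halves the number of serial cycles.
print(getjobs_executions(200_000, 10_000))  # → 20
```

Each of those cycles adds its query time plus queue-write time to the serial path, which is why a larger buffer (within reason) helps at very high volumes.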

Scaling the Real Time Availability Monitor (RTAM) Agent - A case study
At a customer site one of the challenges was to scale the Real Time Availability Monitor agent to do a Partial Sync of inventory at over 250K records/hour. The customer was running Sterling 8.5 HF25, WAS 7 and MQ 7. The following actions were taken to scale the agent from about 150K/hr to around 300K/hr -
1. Tuning the getJobs query - Front-loading the YFS_INVENTORY_ACTIVITY table would heavily skew the test results due to the excessive time spent querying it as part of the getJobs query. Hence, trimming or keeping the inventory activity table under check significantly alters the time taken for getJobs and also more realistically represents the production workload. We also ensured usage of the correct index and updated statistics.
2. Setting the agent queue to non-persistent - We defined the internal JMS queue as non-persistent on MQ, then set the PER(QDEF) option in the scp file for this queue's entry while generating the bindings. Writing each message to the persistent queue takes between 11-20 ms, whereas on the non-persistent queue it is under 5 ms.
3. Enabling JMS session pooling for this agent via the following property in customer_overrides.properties -
yfs.yfs.jms.session.disable.pooling=N
This allows sessions to be borrowed and returned to the pool instead of new ones getting created and closed for each message.
4. Enabling the bulk loader property for the agent framework - This avoids creating and closing sessions for each message posted to the queue. We worked with IBM Sterling support to accomplish this via 8.5 HF48. The below two properties were set in the customer_overrides.properties file -
yfs.agent.bulk.sender.enabled=Y
yfs.agent.bulk.sender.batch.size=50000
The batch size setting of 50000 should be equal to or greater than the maximum buffer size you plan to use across all agents.
5. Running the agents in the same data center as the MQ server - This improves the latency between the two tiers and therefore the overall performance. It may not always be possible if you are running agents in multiple data centers.
6. Increasing the buffer size of records retrieved from the default of 5000 to 10000 - We tested various settings between 5K and 25K and found that overall performance was best at 10000 for our setup. The optimal buffer size may vary with your environment and workload, so run performance tests to determine what works best for your needs.
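The constraint stated in point 4 above - the bulk sender batch size must cover the largest buffer size in use - can be expressed as a quick sanity check. The function name and values here are illustrative; the real settings live in customer_overrides.properties:

```python
def validate_bulk_sender(batch_size, agent_buffer_sizes):
    """True if yfs.agent.bulk.sender.batch.size is >= the largest
    'Number of records to buffer' value used by any agent."""
    return batch_size >= max(agent_buffer_sizes)

# A 50000 batch size covers agents buffering 5K, 10K and 25K records...
print(validate_bulk_sender(50_000, [5_000, 10_000, 25_000]))  # → True
# ...but not an agent configured to buffer 60K records.
print(validate_bulk_sender(50_000, [60_000]))                 # → False
```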

Now that you have a better understanding of how the Sterling agent works you should be in a better position to troubleshoot and scale the agents. Happy testing and tuning! 

Saturday, July 28, 2012

OMS Transaction Framework - Nomenclature and more


After a long break following my first post - longer thanks to the distractions of the Euro Cup and Wimbledon - I am back with a little tidbit on nomenclature related to the Sterling transaction framework. If you have ever been stumped by whether a Sterling process is an agent server or an integration server, this info will help you make the right call.

For many years, as I have worked with various customers, colleagues and partners, I would hear people using various Sterling terms - agent server, integration server, transaction - interchangeably. Although the Sterling OMS world is not what it was in 2000, and the lines are getting blurred as traditional "agent" processes are being implemented as services, I figured I should tackle this topic in my blog. Earlier this week, when one of my colleagues mentioned that this was a topic he too had explained for the n-th time to a new customer and pinged me looking for such a write-up, I figured it was time to put pen to paper, or rather finger to keyboard. (For illustrations, refer to the Sterling product documentation guides - ftp://public.dhe.ibm.com/software/commerce/doc/ssfs/85/Application_Platform_Configuration_Guide.pdf)

Grab a cup of your favorite beverage as this post does get a little long..

Transactions - In software parlance, a transaction usually means a sequence of information exchange and related work (such as database updates) that is treated as a unit for the purposes of satisfying a request while ensuring data integrity. Transactions may be synchronous, such as those running in the UI, or asynchronous, such as batch jobs. In the Sterling world, the product's extensibility and flexibility make the boundaries of a seemingly similar transaction vary from implementation to implementation, even if project teams and customers call it the same thing, such as the Create Order transaction (or simply Create Order). In Sterling these transactions are defined either as an agent criteria or as a service via the Application Configurator. They are executed in background JVMs known as agent servers or integration servers (also referred to as batch jobs), directly from the Sterling UI (traditional console or thick client), or via a web service call from external systems on the application server JVM. Transactions consist of the underlying API and its associated events, user exits and conditions. A successful transaction results in the changes being committed, usually a combination of database updates and messages written to or read from a queue or file. Either the entire transaction is successful, or an error is thrown which causes the entire transaction to be rolled back or an error to be raised for subsequent reprocessing.

In Sterling MCF we can classify processes into the following types of transactions:
1. Time-Triggered transactions or Agents - These are triggered on a scheduled basis to perform repetitive actions, typically invoking APIs to perform database updates. E.g. consolidation of orders to shipments may need to happen around every 30 minutes, so the Consolidate To Shipment time-triggered transaction can be configured to trigger every 30 minutes. Most time-triggered transactions are driven by records in the YFS_TASK_Q table or based on the pipeline. Time-triggered transactions are defined by the Transaction Name and the Agent Criteria. They can be run in single- or multi-threaded mode; they are also called agents, and the servers in which they run are called agent servers. The four types of time-triggered transactions are:
i. Business Process transactions - Responsible for the majority of the processing of entities such as orders (sales/purchase/transfer) and shipments. The entities in every implementation will require one or more business process transactions, such as CONSOLIDATE_TO_SHIPMENT or CLOSE_ORDER, to complete their lifecycle. Understanding the limitations of the Sterling transaction framework and designing for your business needs can help you get the most out of the solution.
ii. Purge transactions - Archive data from live (transaction) tables to history tables, or delete data that does not require archiving. This helps mitigate unrestricted growth of the OMS transactional database. Purges are frequently underestimated in value and in development and testing effort, and overlooked in most implementations, leading to application performance issues.
iii. Task Q Syncher Time-Triggered Transactions - A relatively new addition to the fold, used to update the task queue repository table with the latest list of open tasks to be performed by each corresponding transaction, based on the latest pipeline configuration. Four of these transactions are available - Load Execution, Order Fulfillment, Order Delivery and Order Negotiation.
iv. Monitors - These are circumstance-driven transactions that watch for processes or circumstances that are out of bounds and then raise alerts. Common monitors are those for Order, Shipment, Inventory Availability and Exceptions. Monitoring jobs can be a huge system hog if data is not purged often and if excessive stale entities exist, such as abandoned or erroneous orders.

2. Services or Flows - Transactions that are NOT executed at pre-defined times are called services or flows. (In database and configuration screen titles this name is also used for every transaction in the SDF.) Services can be invoked via a broad set of transports - web service/SOAP, HTTP, JMS, MSMQ, DB, flat file, etc. A service can invoke other services to make a longer chain of services, and a service could include invoking APIs (product or custom), evaluating conditions, making DB updates and so on. Services are processed continuously, subject to thread and resource availability, and are not triggered at any particular time. They can be run in single- or multi-threaded mode, and the servers in which they execute are called integration servers. The most common scenario is the use of services to read messages from a queue inbound to Sterling, for example to create orders flowing in from a web channel.

3. Externally-Triggered Transactions - An externally-triggered transaction is used to map an invoked service to a Sterling transaction and thereby leverage the transaction framework. It is seldom used in the real world, as implementations prefer to just use a service/flow minus the transaction instead.

4. User-Triggered Transactions - A user-triggered transaction is invoked manually through the Application Consoles, a configured alert queue, or an e-mail service. I have never seen it used in the field, so if you are implementing this or the externally-triggered variety, do let me know how it goes.

Composite services - A construct to enable invocation of multiple services in parallel. A very useful concept ever since its addition to the SDF, but it needs careful testing, as implementations can run into issues stemming from funky exception handling or inadequate logging.

Agent Criteria - An element that describes attributes specific to a time-triggered transaction. These attributes include the selection criteria such as organization code, manual or auto triggering, trigger interval and server name. A particular transaction may have one or more agent criteria for processing data for different organizations or other logical groupings. E.g. Schedule Order agent criteria could be used to run scheduling for different organizations at different intervals.

Agent Server - Server JVM on which one or more agent criteria (commonly referred to as agents) can run. Invokes the com.yantra.integration.adapter.IntegrationAdapter class and is started typically by a startIntegrationServer.sh script provided as part of the product installation.

Integration Server - Server JVM on which one or more integration services or flows (commonly referred to as services or mistakenly called agents) are run. Invokes the com.yantra.integration.adapter.IntegrationAdapter class and is started by a startIntegrationServer.sh. 
Yes, you read that right! Both agents and integration services are started by the same class and script, but the server name, service name or agent criteria name and definition control the behavior.

Trigger agent - This is the process that is typically invoked via cron or Control-M jobs to trigger a certain time-triggered transaction at certain points in time, using the triggeragent.sh or triggeragent.cmd script. E.g. to create waves at certain hours of the day in a WMS implementation, the trigger agent job could be invoked to trigger the Create Wave agent, or to run a nightly purge of sales orders we could trigger the Order Purge agent.

Events - Help accomplish specific actions upon a certain business event occurring. E.g. on ON_SUCCESS of Create Shipment we could have an event to send an e-mail to the customer with the shipment details, or ON_BACKORDER of Schedule Order could be used to raise an alert to the Inventory Control business team. Event handlers are configured to associate the required actions with a particular event, and conditions are often used to further customize the action taken. Event handlers can invoke any service to e-mail or raise an exception alert, publish XML to external queues/databases, or invoke custom services. Associated actions are triggered any time the event is raised, when applicable, so use them with caution: an excessive number of complicated actions can prolong a transaction, so use them wisely and tune them well.

User Exits – These enable transactions to invoke custom logic that interacts with external systems synchronously to complete processing. A classic example is the credit card authorization call made from the Payment agent. User exits are frequently a source of trouble when not implemented well, and only care during design and testing can avoid a myriad of issues post production.
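Because a user exit is synchronous, the calling transaction blocks until the external system answers, so a defensive guard such as a timeout is one way to keep a slow payment gateway from stalling the whole agent. The sketch below uses hypothetical names (the real user-exit interfaces are defined by the product) to show just that pattern:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the synchronous user-exit pattern; these are NOT
// the actual Sterling interfaces.  The transaction blocks on the exit, so
// a timeout keeps a slow external call from stalling the whole agent.
public class UserExitSketch {

    // Hypothetical user-exit contract: authorize a payment amount.
    interface PaymentUserExit {
        boolean authorize(String orderNo, double amount) throws Exception;
    }

    // Run the external call on a worker thread so the caller can fail fast
    // instead of hanging when the payment gateway is slow.
    static boolean authorizeWithTimeout(PaymentUserExit exit, String orderNo,
                                        double amount, long timeoutMs) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> f = pool.submit(() -> exit.authorize(orderNo, amount));
            return f.get(timeoutMs, TimeUnit.MILLISECONDS); // TimeoutException if too slow
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A fake, fast gateway that approves anything under $1000.
        PaymentUserExit fastGateway = (order, amt) -> amt < 1000.0;
        System.out.println(authorizeWithTimeout(fastGateway, "Y100001", 250.0, 2000)); // prints true
    }
}
```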


Tuesday, April 17, 2012

Whose problem is it anyway?


Growing up in India, I did not have cable TV at home until my high school days. One of the early shows that caught my attention was the very funny syndicated improvisational game show – "Whose line is it anyway?" In the eponymous round, contestants take turns using their creative instincts and quick-wittedness to “explain or demo” random, quirky-looking props. The toughest OMS problems call for that same kind of creativity (although the scenario and its results are far from funny), and for someone to step up and make sense of the problem – random or otherwise – with a complex software solution that has a seemingly quirky side to it.

When an MQ Queue Full is not an MQ issue –
Here’s a typical problem encountered at a Sterling OMS implementation in the testing phase. A certain transaction, say CREATE_ORDER, fails with the following exception and stack trace –
com.yantra.interop.services.jms.JMSProducer$RetryException: com.ibm.msg.client.jms.DetailedInvalidDestinationException: JMSWMQ2007: Failed to send a message to destination 'CREATE_ORDER_QUEUE'. JMS attempted to perform an MQPUT or MQPUT1; however WebSphere MQ reported an error. Use the linked exception to determine the cause of this error.
        at com.yantra.interop.services.jms.JMSProducer.sendJMSMessage(JMSProducer.java:852)
        at com.yantra.interop.services.jms.JMSProducer.access$700(JMSProducer.java:63)
        ......
JMSCMQ0001: WebSphere MQ call failed with compcode '2' ('MQCC_FAILED') reason '2053' ('MQRC_Q_FULL'). [system]: JMSProducer
com.ibm.mq.MQException: JMSCMQ0001: WebSphere MQ call failed with compcode '2' ('MQCC_FAILED') reason '2053' ('MQRC_Q_FULL').

At first glance this seems to be an MQ issue, calling for the testing team to make a beeline to the WebSphere MQ administrator’s desk. However, a more thorough investigation calls for several additional checks to be done and questions to be answered before pinging the MQ admin:
a. Has the queue been sized appropriately for the environment?
b. Are there processes – Sterling or otherwise – attached to and consuming messages from the queue?
c. Are messages being consumed from the queue at a much slower rate than they are arriving?
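Question (c) is simple arithmetic once you have the queue's current depth, its maximum depth, and the put/get rates (all observable from MQ admin tooling): if producers outpace consumers, the queue will hit MQRC_Q_FULL no matter how large you make it. A back-of-the-envelope sketch, with made-up numbers:

```java
// Back-of-the-envelope check for question (c): if producers outpace
// consumers, when does the queue hit its maximum depth?  All numbers
// here are made up for illustration.
public class QueueFillEstimate {

    // Seconds until the queue is full, or -1 if consumers are keeping up.
    static long secondsUntilFull(long curDepth, long maxDepth,
                                 double putPerSec, double getPerSec) {
        double net = putPerSec - getPerSec;   // net growth in messages/sec
        if (net <= 0) return -1;              // draining or steady: never fills
        return (long) Math.ceil((maxDepth - curDepth) / net);
    }

    public static void main(String[] args) {
        // e.g. max depth 5000, current depth 2000, 50 puts/s vs 30 gets/s:
        System.out.println(secondsUntilFull(2000, 5000, 50, 30)); // prints 150
    }
}
```

In other words, resizing the queue only buys time; the durable fix for a sustained rate mismatch is on the consuming side.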

Other Sterling OMS system and performance problems would entail weeding through many more questions, such as:
a. Is it a browser issue?
b. Is it a database tuning issue?
c. Is it an app server configuration problem?
d. Does the solution/product scale to meet our needs?

Failure to consider all these questions when identifying a root cause often leads to the conclusion that most Sterling OMS system problems are simply “a Sterling issue” (the industry has yet to term this an IBM issue, perhaps reserving that label for its other woes with “traditional” products on the IBM tech stack). Whose problem is that anyway? Or, to be more precise: among the implementation team (developers and testers), the system admin teams (DBA, app server, JMS and AIX admins) and IBM Support, who is going to own it and drive it to resolution? Thus was born, in 2004, the role of a services-focused Yantra/Sterling Performance Engineer (Yantra, as the company was known until the Sterling Commerce acquisition in 2005). The name Performance Engineer, or PE, has stuck: not because all issues require performance tuning, but because nothing else fit either.

How Performance Engineers are like Economists –
Steven Levitt in his best-seller SuperFreakonomics describes economists as being trained to be cold-blooded enough to calmly discuss the trade-offs involved in a global catastrophe, while the rest of us non-economists get a bit more excitable. A good Performance Engineer (Sterling or otherwise) is a lot like that economist. Although he is not called on to explain the implications of a global catastrophe like an earthquake or global warming (a production outage being the biggest catastrophe a PE is called on to solve), he needs to analyze issues calmly and keep emotions – blame, paralysis, confusion, panic, ego – in check while collaborating with the various teams – business users, system administrators, developers and Support – to find a resolution.

Had an economist been regarded as highly as a doctor or an engineer in the Indian middle-class psyche, perhaps I would have gone on to become one. Now, 8 years since I first started as an in-house PE in the QA organization and 12 years since I started there as a Support Engineer, I am still solving Sterling issues and still loving it. This blog attempts to share what I have learnt over the years (and am still learning) about implementing, fixing and tuning Sterling applications. Although it may be difficult to explore all Sterling issues in a simple Q&A format like that of the asktom site hosted by the legendary Tom Kyte (the first “technology” guru I was, and still am, in awe of), I shall experiment to see what can best be shared in this format. I am hoping to review your questions, try to answer some (or at least the most interesting and relevant ones) along with other topics in these pages and, most importantly, nurture the inner "PE" in each of you.

Do let me know your comments on this post & format and what Sterling topics you want to see covered (It will keep me from boring you with personal stories and not-particularly-useful insights).