Not all performance issues are found during formal performance tests. A significant number are found during performance test smoke testing or other debugging activities. The one described below was found during an exercise to see if Splunk would provide any useful information from the application logs being generated on a multi-tier application.
I was setting up a Splunk trial on a client site to see if Splunk would be useful to them and after setting up a Splunk instance and pointing it towards the various log locations, the first step was to work through all the reported errors. It turned out for the sample of logs I was provided, nothing too serious was reported; however while working through these I noticed a very interesting pattern with the logs from the middleware layer. The same request, replied to with the same response, often called many times within the same second. This confused me a bit, but figured this would be a good candidate to show the power of Splunk to the project team. Using the transaction1 command to aggregate all matching messages sent and received within a 1 second period, we end up with a frequency plot like below. Where the number of repeated messages in the transaction is on the y axis, and the number of different transactions is on the x axis. The data in this plot is from a 1 minute slice of the logs.
So in this minute we have 7 sets of messages, all being repeated 46 times? Sounds like something isn’t working as designed.
Digging through the data a bit more, I was able to narrow down which middleware calls were being made the most and handed this information over to the development team. Several weeks later I was informed that I had indeed found a bug.
It turns out that the code wasn’t saving all the data being returned, only a single field on each call. So when the application server wanted all the information about a user, each field required resulted in a round-trip to the middleware (and presumably all the way back to the database). Unfortunately my engagement at the client finished before this fix was pushed into production so I’m unable to report on any outcomes but presumably, faster response times, less logged data and happier application servers and middleware.
This is a good example showing the use of all the information available to improve code bases from a performance perspective. It also shows that you don’t need a big signed-off, covers-all-bases, final solution to enable this data to be useful. This bug was found with about 4 hours work, most of that was installing Splunk and teaching it the format of the logs it was consuming. It also suggests that a regular review of logged data could be quite beneficial in the absence of a full-blown log monitoring solution.
The transaction command is used to group events based on specified fields such as session IDs or IP addresses.