Friday, August 07, 2009

A Service Bottleneck


When I was looking at the time series data from my previous post, I found an interesting issue. First, the above graph represents "connected time," i.e., time people are actually using the resource; otherwise, the value at zero concurrent connections would dwarf the entire graph. The X-axis is the number of concurrent connections, and the Y-axis is the number of seconds served at that level of concurrency. This is actual data, though the description is kept generic.

The coolest part is: IT'S A BELL CURVE. I've read about these in statistics, but this is the first time I've encountered one in the wild. I knew I had a problem when the bell curve did not flatten out on the right tail. The physical limit of the resource is 23 units. As concurrent connections approached 23 units, new connections were denied and subsequently retried. See the "hump" beginning at 18 units? That's the graphical representation of a bottleneck.

Now, how to fix this... it's not practical to solve this problem with a technological solution. 95% of connected time is served before the bottleneck at 18 units, over 90% is served by the first 13 units, and the marginal return of each subsequent unit diminishes.
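To make that arithmetic concrete, here is a minimal sketch of how the cumulative share can be computed from a concurrency histogram. The seconds_by_concurrency mapping and its numbers are hypothetical placeholders, not the actual data:

# Sketch: cumulative share of connected time by concurrency level.
# seconds_by_concurrency maps "concurrent connections" -> "seconds served
# at that level"; the values here are placeholders, not the real data.
seconds_by_concurrency = {1: 500, 2: 900, 3: 1400, 4: 1800, 5: 1500}

total = sum(seconds_by_concurrency.values())
running = 0
for level in sorted(seconds_by_concurrency):
    running += seconds_by_concurrency[level]
    print(f"{level:2d} units: {running / total:6.1%} of connected time")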

The fix is a business control. First, we determined that the build-up was caused by multiple long-running connections that did not behave like the other connections: many of them began and ended at the same time rather than disconnecting at random. Therefore, we modified the behavior of the people using the long-running connections.


Thursday, August 06, 2009

Working with Continuous Time Data

Recently, I was asked to study some time series data. Each record contained a date range, bracketed by DATE_TIME and END_TIME fields. The objective was to find the number of concurrent records at each moment across all the records' date ranges. Each second in each date range must be compared against every other record's range to produce a total for that second, e.g., 3 concurrent events at 8:30:31 AM on July 31st.

First, I tried modeling this in Excel. With relatively few records (around 600), this was extremely quick, practically immediate. I listed the possible times in 1-minute increments. However, as the increment shortened, the number of possible times to check ballooned. Excel lost steam as we approached 10,000 times being checked.
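Outside of Excel, the same brute-force idea looks roughly like this (a sketch in Python, which the post does not actually use; the records and time window are hypothetical):

from datetime import datetime, timedelta

# Hypothetical records: (DATE_TIME, END_TIME) pairs.
records = [
    (datetime(2009, 7, 31, 8, 30, 0), datetime(2009, 7, 31, 9, 15, 0)),
    (datetime(2009, 7, 31, 8, 45, 0), datetime(2009, 7, 31, 8, 50, 0)),
]

# Candidate times listed in 1-minute increments, as in the spreadsheet.
start = datetime(2009, 7, 31, 8, 0, 0)
end = datetime(2009, 7, 31, 10, 0, 0)
step = timedelta(minutes=1)

t = start
while t <= end:
    # Every candidate time is checked against every record, so the work
    # grows as (number of times) x (number of records).
    concurrent = sum(1 for s, e in records if s <= t <= e)
    print(t, concurrent)
    t += step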

Next, I tried a SQL query using an accompanying table with a datetime field. I used a scripting language to walk through each second of each record’s date range and create a record in the accompanying table. Then I could use the following SQL command to pull the data:

SELECT time, COUNT(time) FROM times GROUP BY time ORDER BY time;

This created two bottlenecks: 1) populating the accompanying table, and 2) retrieving the data. As the accompanying table expanded, the GROUP BY statement became a burden.
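A minimal sketch of this intermediate approach, assuming Python and SQLite (the post names neither the scripting language nor the database), and using hypothetical records:

import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE times (time TEXT)")

# Hypothetical source records: (DATE_TIME, END_TIME) pairs.
records = [
    (datetime(2009, 7, 31, 8, 30, 0), datetime(2009, 7, 31, 8, 30, 5)),
    (datetime(2009, 7, 31, 8, 30, 3), datetime(2009, 7, 31, 8, 30, 8)),
]

# Bottleneck 1: one INSERT for every second of every record's date range.
for start, end in records:
    t = start
    while t <= end:
        conn.execute("INSERT INTO times (time) VALUES (?)", (t.isoformat(),))
        t += timedelta(seconds=1)

# Bottleneck 2: the GROUP BY over an ever-growing table.
for time, count in conn.execute(
        "SELECT time, COUNT(time) FROM times GROUP BY time ORDER BY time"):
    print(time, count)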

The solution was to use a SQL aggregating table with a counter field. I ran a massive SQL insert to create a row for every possible time within the range we examined. Then, using a scripting language, I walked through every record and ran the following SQL command for each one:

UPDATE times SET counter = counter + 1 WHERE time >= record.DATE_TIME AND time <= record.END_TIME;
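Sketched the same way (again assuming Python and SQLite, with hypothetical records and a shortened time range), the final approach is:

import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE times (time TEXT PRIMARY KEY, counter INTEGER)")

# One-time, massive creation of every possible time in the examined range.
range_start = datetime(2009, 7, 31, 8, 30, 0)
range_end = datetime(2009, 7, 31, 8, 30, 10)
t = range_start
while t <= range_end:
    conn.execute("INSERT INTO times (time, counter) VALUES (?, 0)", (t.isoformat(),))
    t += timedelta(seconds=1)

# Hypothetical source records: (DATE_TIME, END_TIME) pairs.
records = [
    (datetime(2009, 7, 31, 8, 30, 0), datetime(2009, 7, 31, 8, 30, 5)),
    (datetime(2009, 7, 31, 8, 30, 3), datetime(2009, 7, 31, 8, 30, 8)),
]

# One UPDATE per record bumps the counter for every second it covers;
# ISO-formatted strings compare correctly because they share one format.
for date_time, end_time in records:
    conn.execute(
        "UPDATE times SET counter = counter + 1 WHERE time >= ? AND time <= ?",
        (date_time.isoformat(), end_time.isoformat()),
    )

for time, counter in conn.execute("SELECT time, counter FROM times ORDER BY time"):
    print(time, counter)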

The final solution took a day to run for results from the previous month. It was faster than the two previous methods and created only one bottleneck.