Recently, I was asked to study some time series data. Each record contained a date range, bracketed by a DATE_TIME and an END_TIME field. The objective was to find the number of concurrencies across all records’ date ranges. Each second in each date range had to be checked against every other record’s range to produce a total for that second, e.g. 3 concurrent events at 8:30:31 AM on July 31st.
First, I tried modeling this in Excel. With relatively few records (600) and candidate times listed in 1-minute increments, this was extremely quick, with immediate results. However, as the increment shortened, the number of candidate times ballooned, and Excel lost steam as we approached 10,000 times being checked.
Next, I tried a SQL query against an accompanying table with a datetime field. I used a scripting language to walk through each second of each record’s date range and insert a row into the accompanying table. Then I could use the following SQL command to pull the data:
SELECT time, COUNT(time) FROM times GROUP BY time ORDER BY time;
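A minimal sketch of that approach, assuming Python with SQLite (the post does not name the scripting language or database); the record_ranges data, the times table schema, and the one-second step are illustrative:

import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect("concurrency.db")
conn.execute("CREATE TABLE IF NOT EXISTS times (time TEXT)")

# Hypothetical source data: one (DATE_TIME, END_TIME) pair per record.
record_ranges = [
    (datetime(2011, 7, 31, 8, 30, 0), datetime(2011, 7, 31, 8, 31, 15)),
    # ... one tuple per record
]

# Walk each record's range one second at a time, inserting a row per second.
for start, end in record_ranges:
    t = start
    while t <= end:
        conn.execute("INSERT INTO times (time) VALUES (?)", (t.isoformat(),))
        t += timedelta(seconds=1)
conn.commit()

# Concurrency per second is then a GROUP BY over the accompanying table.
for time, count in conn.execute(
    "SELECT time, COUNT(time) FROM times GROUP BY time ORDER BY time"
):
    print(time, count)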
This created two bottlenecks: 1) populating the accompanying table, and 2) retrieving the data. As the accompanying table expanded, the GROUP BY statement became a burden.
The solution was to use a SQL aggregating table with a counter field. I ran a massive SQL creation of every possible time within the range we examined. Using a scripting language, I walked through every record and ran the following SQL command:
UPDATE times SET counter = counter + 1 WHERE time >= record.DATE_TIME AND time <= record.END_TIME;
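A sketch of the aggregating-table version under the same assumptions (Python with SQLite, illustrative names and study window); the counter rows are created up front for every second in the window, and each record then becomes a single range-bounded UPDATE:

import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect("concurrency.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS times (time TEXT PRIMARY KEY, counter INTEGER DEFAULT 0)"
)

# One-time mass creation of every possible second in the window under study.
window_start = datetime(2011, 7, 1)   # hypothetical study window
window_end = datetime(2011, 7, 2)
rows = []
t = window_start
while t < window_end:
    rows.append((t.isoformat(), 0))
    t += timedelta(seconds=1)
conn.executemany("INSERT OR IGNORE INTO times (time, counter) VALUES (?, ?)", rows)
conn.commit()

# Hypothetical source data: one (DATE_TIME, END_TIME) pair per record.
record_ranges = [
    (datetime(2011, 7, 1, 8, 30, 0), datetime(2011, 7, 1, 8, 31, 15)),
]

# One UPDATE per record bumps the counter for every second the record covers.
for start, end in record_ranges:
    conn.execute(
        "UPDATE times SET counter = counter + 1 WHERE time >= ? AND time <= ?",
        (start.isoformat(), end.isoformat()),
    )
conn.commit()

# Reading the result is a plain scan of the counters; no GROUP BY is needed.
for time, counter in conn.execute(
    "SELECT time, counter FROM times WHERE counter > 0 ORDER BY time"
):
    print(time, counter)

Because the counters are pre-aggregated, retrieval is a straightforward ordered scan rather than a grouping pass over millions of duplicate rows.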
The final solution took a day to run for results from the previous month. It was faster than the two previous methods, and only created one bottleneck.