Let’s set the values for the 7 declared variables. The first 4 variables will need to be set manually by the individual running the query.
The first variable (_TIMEZONE) should represent your local TimeZone.
The second (_RANGE_START_TS_LOCAL) and third (_RANGE_END_TS_LOCAL) variables will represent the range of time using your local TimeZone which you want to analyze.
The fourth variable (_RANGE_INTERVAL_SECONDS) represents the size, in seconds, of the time intervals you want to analyze; a single Point in Time (one second per interval) will be collected for each interval.
Note – It is very important to limit the X-Axis (Time) data points to a reasonable number (something between 60 and 360); otherwise, you will have trouble with the size of the query result export and/or too many X-Axis data points in the final graph.
For analyzing what happened during a 5 minute range of time, it would be appropriate to set _RANGE_INTERVAL_SECONDS = 1, as this would produce 300 data points on the X-Axis…one per second for every second in the defined 5 minute (300 second) range of time.
For analyzing what happened during a 1 hour range of time, it would be appropriate to set _RANGE_INTERVAL_SECONDS = 10, as this would produce 360 data points on the X-Axis…one per every 10 seconds in the defined 1 hr (3,600 second) range of time.
On the same note, for analyzing what happened during a 24 hour range of time, it would be appropriate to set _RANGE_INTERVAL_SECONDS = 300, as this would produce 288 data points on the X-Axis…one per every 300 seconds in the defined 24 hr (86,400 second) range of time.
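In other words, the number of X-Axis data points is simply the length of the analyzed range in seconds divided by _RANGE_INTERVAL_SECONDS (for example, 86,400 / 300 = 288).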
In summary, we are encouraging the user to sacrifice ‘accuracy’ for larger time ranges in order to produce a more ‘readable’ chart. While each plotted data point is 100% accurate for its specific second in time, if we only choose 1 second to visualize for every 300 seconds, then we are producing a chart that is a sampled representation of actual slot consumption and concurrent query counts for the range of time being analyzed.
The fifth variable (_UTC_OFFSET) represents the offset between your locally defined TimeZone and UTC. Let’s make it an expression as opposed to a manually defined literal value because of issues with Daylight Saving Time (DST); otherwise, the user would have to remember to change the literal offset throughout the year as DST changes.
The sixth (_RANGE_START_TS_UTC) and seventh (_RANGE_END_TS_UTC) variables represent the range of time you want to analyze converted into UTC time using the derived _UTC_OFFSET value.
You might be asking yourself, “Why spend so much time declaring and setting variables?” In short, this has been done for readability/supportability and to minimize the amount of manual changes needed every time you run this code for a new range of time.
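To make this concrete, below is a minimal sketch of what the variable block might look like. The types, example values, and the offset expression are assumptions based on the description above, not the verbatim code.

-- Variables set manually by the person running the query (example values assumed):
DECLARE _TIMEZONE STRING DEFAULT 'America/Chicago';
DECLARE _RANGE_START_TS_LOCAL DATETIME DEFAULT DATETIME '2021-08-05 09:00:00';
DECLARE _RANGE_END_TS_LOCAL DATETIME DEFAULT DATETIME '2021-08-05 10:00:00';
DECLARE _RANGE_INTERVAL_SECONDS INT64 DEFAULT 10;

-- Derived values: the offset is computed as an expression so DST is handled
-- automatically, and the local range is converted to UTC for filtering.
DECLARE _UTC_OFFSET INT64 DEFAULT TIMESTAMP_DIFF(
  TIMESTAMP(_RANGE_START_TS_LOCAL),             -- local value read as if it were UTC
  TIMESTAMP(_RANGE_START_TS_LOCAL, _TIMEZONE),  -- local value interpreted in _TIMEZONE
  HOUR);
DECLARE _RANGE_START_TS_UTC TIMESTAMP DEFAULT TIMESTAMP(_RANGE_START_TS_LOCAL, _TIMEZONE);
DECLARE _RANGE_END_TS_UTC TIMESTAMP DEFAULT TIMESTAMP(_RANGE_END_TS_LOCAL, _TIMEZONE);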
Now that all of our variables have been declared and set, we can finally start to analyze the query. The query is built from two derived sets of data aliased as ‘key’ and ‘query_info’.
The ‘key’ derived table
The ‘key’ derived table creates a one-column result set with a single row for every interval (_RANGE_INTERVAL_SECONDS) that exists within the range of time you want to analyze. We are able to do this with a couple of really neat array functions. First, we leverage the GENERATE_TIMESTAMP_ARRAY function, which produces an array (aliased as POINT_IN_TIME) of timestamps between the _RANGE_START_TS_UTC and _RANGE_END_TS_UTC variables for each interval of time defined in _RANGE_INTERVAL_SECONDS.
For example:
_RANGE_START_TS_UTC = ‘2021-08-05 14:00:00.00000’
_RANGE_END_TS_UTC = ‘2021-08-05 15:00:00.00000’
_RANGE_INTERVAL_SECONDS = 60
Using the above inputs, the GENERATE_TIMESTAMP_ARRAY will produce the following array with 61 elements:
[‘2021-08-05 14:00:00.00000’, ‘2021-08-05 14:01:00.00000’, ‘2021-08-05 14:02:00.00000’, … , ‘2021-08-05 15:00:00.00000’]
In order to convert this array of 61 elements into rows, we simply use the UNNEST function.
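Put together, the ‘key’ derived table might look roughly like the sketch below (the column alias is assumed from the description above).

SELECT POINT_IN_TIME          -- one row per interval in the analyzed range
FROM UNNEST(
  GENERATE_TIMESTAMP_ARRAY(
    _RANGE_START_TS_UTC,
    _RANGE_END_TS_UTC,
    INTERVAL _RANGE_INTERVAL_SECONDS SECOND)
) AS POINT_IN_TIME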
Note: The ‘key’ derived table could be considered optional if you are 100% certain that queries were actively running every second of the time range being analyzed; however, if any point in time exists in which nothing was actively running, then your final chart wouldn’t have a data point on the X-Axis to represent that point in time…which makes for a misleading chart. So, to be safe, it is strongly encouraged to use the ‘key’ derived table.
The ‘query_info’ derived table
The ‘query_info’ derived table is relatively straightforward.
In our example, I want to pull Slot Seconds (period_slot_ms / 1000) and Query Count information from the INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT object for every job for each second that matches the timestamps generated in the ‘key’ derived table.
In this particular query, the ‘GROUP BY’ statement isn’t strictly needed, because every job should have a single row per second; therefore, nothing needs to be aggregated, and I simply could have hard-coded a ‘1’ for Query_Count. I left the ‘GROUP BY’ in this example in case you aren’t interested in analysis at the Job_ID level. If you aren’t, you can simply comment out the ‘Job_ID’ field in ‘query_info’, tweak the ‘GROUP BY’ statement accordingly, and comment out ‘Job_ID’ in the outermost query. In doing so, you would still be able to perform user_email-level analysis with the final result set, with accurate Slot Sec and Concurrent Query Count data.
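As a rough sketch, ‘query_info’ looks something like the following; the column aliases and region qualifier are assumptions, and the WHERE clause is covered in the next section.

SELECT
  TIMESTAMP_TRUNC(period_start, SECOND) AS Period_Ts,  -- cleaned-up timestamp used to join to the key table
  job_id AS Job_Id,
  user_email,
  project_id,
  reservation_id,
  SUM(period_slot_ms) / 1000 AS Total_Slot_Sec,
  COUNT(DISTINCT job_id) AS Query_Count                -- effectively 1 per job per second
FROM `region-us`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
-- WHERE <the six filters described in the next section>
GROUP BY 1, 2, 3, 4, 5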
Filters used in the query
We have six filters for this query.
First, in order to minimize the IO scanned to satisfy the query, we are filtering on ‘job_creation_time’ (the underlying value used to partition this data) where the min value is 6 hours earlier than the defined start time (to account for long running jobs) and the max ‘job_creation_time’ is less than the defined end time.
Second, we want to only look at rows with a ‘period_start’ timestamp within our defined range of time to be analyzed.
Third, we only want to look at rows where job_type = ‘QUERY’.
Fourth, in order to avoid double counting, we are excluding ‘scripts’ (as a script parent job_id contains summary information about its children jobs).
Fifth, and this is a personal preference, I don’t want to analyze any rows for a job if it isn’t actively using Slots for the respective Point in Time.
The sixth filter doesn’t actually change the number of rows returned by the final query; however, it provides an increasingly large performance improvement as the value of _RANGE_INTERVAL_SECONDS grows. We first calculate the difference (in seconds) between the _RANGE_START_TS_UTC and the TimeLine object’s Period_Start timestamp. Next, we MOD that value by the _RANGE_INTERVAL_SECONDS value. If the result of the MOD operation does not equal 0, we discard the row, as we know that this respective timestamp will not exist in the ‘key’ timeline we built.
Note – Yes, these rows would have been discarded when we JOIN the ‘key’ and ‘query_info’ tables; however, that approach requires shuffling a lot of potentially unnecessary rows. For instance, if _RANGE_INTERVAL_SECONDS is set to 300 and a query ran for 300 seconds, then we’d be joining 300 rows of ‘query_info’ data for that job only to filter out 299 rows in the subsequent JOIN to the ‘key’ table. With this filter, we discard the 299 unnecessary rows before joining to the ‘key’ table.
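Taken together, the six filters inside ‘query_info’ might look like the sketch below; the exact predicates are assumptions based on the description above, while the column names come from the JOBS_TIMELINE_BY_PROJECT schema.

WHERE job_creation_time >= TIMESTAMP_SUB(_RANGE_START_TS_UTC, INTERVAL 6 HOUR)  -- 1: partition pruning, allowing for long-running jobs
  AND job_creation_time < _RANGE_END_TS_UTC
  AND period_start BETWEEN _RANGE_START_TS_UTC AND _RANGE_END_TS_UTC            -- 2: only rows inside the analyzed range
  AND job_type = 'QUERY'                                                        -- 3: queries only
  AND (statement_type IS NULL OR statement_type != 'SCRIPT')                    -- 4: exclude parent script jobs (avoid double counting)
  AND period_slot_ms > 0                                                        -- 5: actively consuming slots at this point in time
  AND MOD(TIMESTAMP_DIFF(period_start, _RANGE_START_TS_UTC, SECOND),
          _RANGE_INTERVAL_SECONDS) = 0                                          -- 6: keep only timestamps that exist in the key timeline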
Outermost query
In the outermost query, we will LEFT OUTER JOIN the ‘key’ timeline table to our pre-filtered ‘query_info’ table based on the cleaned up TimeStamp values from each table. This needs to be a LEFT OUTER JOIN versus an INNER JOIN to ensure our timeline is continuous, even if we have no matching data in the ‘query_info’ table.
In terms of the SELECT statement, we are using our previously defined _UTC_OFFSET value to convert the UTC timestamps back to our defined TimeZone. We also select the Job_ID, user_email, project_ID, reservation_ID, Total_Slot_Sec, and Query_Count from ‘query_info’. Note, for our two metric columns, we are filling in null values with a 0 so our final graph doesn’t have null data points.
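A sketch of the outermost query, assuming ‘key’ and ‘query_info’ are defined as the derived tables above (for example, in a WITH clause), might look like this:

SELECT
  TIMESTAMP_ADD(key.POINT_IN_TIME, INTERVAL _UTC_OFFSET HOUR) AS Period_Ts,  -- convert back to local time
  query_info.Job_Id,
  query_info.user_email,
  query_info.project_id,
  query_info.reservation_id,
  IFNULL(query_info.Total_Slot_Sec, 0) AS Total_Slot_Sec,      -- fill nulls so the chart has no gaps
  IFNULL(query_info.Query_Count, 0) AS Concurrent_Queries
FROM key
LEFT OUTER JOIN query_info
  ON key.POINT_IN_TIME = query_info.Period_Ts
ORDER BY Period_Ts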
Plotting the chart
Now that we have a query result set, we need to copy and paste the data into Google Sheets or any equivalent spreadsheet application. You can then follow the steps below.
Add a Pivot Table (for Google Sheets, on the menu bar: Data → Pivot table).
In the Pivot table editor, Period_Ts goes in as Rows; Total_Slot_Sec and Concurrent_Queries go in as Values.
Once the Pivot Table is created, it is time to add a chart/visual (for Google Sheets, on the menu bar: Insert → Chart).
Once the chart is inserted, you will see that the concurrent queries and Total_Slot_Sec are on the same axis. Let’s put them on different axes, i.e., add a second Y-axis.
Double-click on the chart and select Customize. Click Series.
Select “Sum of Total_Slot_Sec” and choose Left axis in the Axis selection.
Select “Sum of Concurrent_Queries” and choose Right axis in the Axis selection.
Lastly, change the chart type to a line chart. That’s it, your chart is ready!