We often have customers running ad hoc queries on the databases that we host. Most of the time the queries run fine… sometimes they don’t.
The first step we took towards managing rogue queries was to deploy Resource Governor (only available in Enterprise or Developer editions) to limit their MAXDOP. In general, a rule of thumb is to limit ALL queries to the number of physical cores on a single socket. (The basis for this rule is related to NUMA, and SQL Server’s “NUMA awareness”, which I will not go into here. If you really want the authority on these matters check out Glenn Berry’s work.) By limiting the MAX DOP we can at least take comfort in knowing that all of the CPUs will not be consumed by a bad query.
The next step was to create an alert that fires when the user query (not system, an individual query) reaches a specified CPU threshold, in seconds.
It turns out there is a simple way to do this using the standard alert configurations in SSMS. Unfortunately, the counter behind the alert never changes from zero (at least not in SQL 2008R2SP2CU3). So I scrapped that approach.
The next approach involved SERVICE BROKER, EVENT NOTIFICATIONS and in this case, RESOURCE GOVERNOR. That sounds like a lot of moving parts to create an alert. Fortunately, I found an article by Aaron Bertrand, who leveraged work by Jonathan Kehayias, that provided me with a good foundation to develop my own, extensible alerting framework.
The following is an excerpt from an email that was sent to my internal team. I had to tweak it a little to make it more generic, but the concepts remain the same.
I’ve added the TSQL in the New Downloads section (to the right of this screen) called Event Notifications Framework. You can use that code to build your own set of trace event alerts. I only request that you not remove any of the credits, myself as well as the link to Aaron’s post.
OBJECTIVE: Create a database alert that will fire an email to support whenever an external user query exceeds a CPU threshold, such as 1 minute or 1 hour.
Q: I thought we already had a “longest running transaction” alert. What’s the difference?
A: We do. The LONGEST RUNNING TRANSACTION alert fires when a TRANSACTION (INSERT, UPDATE, DELETE, MERGE) runs for longer than a fixed threshold. For HFC that threshold is 8 hours. There are intrinsic weaknesses in this alert:
- It does NOT apply to SELECT queries, which are usually not part of a TRANSACTION.
- If multiple transactions are running above the 8 hour threshold, the alert will only fire for the first transaction that crosses the threshold. If the second transaction finishes before the first transaction, it will never be recognized.
- Simply put, this counter does nothing to warn us of runaway, ad hoc SELECT queries.
Q: Why didn’t we create a simple “long running SELECT query alert” in the first place?
A: It should be possible to create a high CPU alert using a standard SQL Server alert configuration. The configuration screen below “should” work… but it doesn’t. It turns out that the counter is always zero. It is a bug that was identified in the original release of SQL Server 2008 and has not been corrected in any of the subsequent service packs (SP) or cumulative updates (CU). Don’t worry… I tried it anyway.
Q: How does this “alternative” approach work?
A: First, a couple of SQL Server subsystems need to be configured.
- Deploy Resource Governor (RG). RG makes it real easy to:
- Categorize user logins into certain “buckets” using a “classifier function”. We then use these buckets to tell the SQL Engine how much resources the users in those groups can consume. (In most cases, only when there is contention on the system.)
- “Limit” any user’s query to a specified number of CPU seconds. We can do it when we configure RG by setting the REQUEST_MAX_CPU_TIME_SEC. (In SQL Server, a “request” is just a confusing term for a query of any sort.)
- Next we need to configure SERVICE BROKER (another subsystem of the SQL Server engine) to send out the alert. SERVICE BROKER is a fancy (misleading) name for an internal, queue-based, asynchronous messaging system. Specifically to this alert, when the CPU time for a user query goes above the 5 second threshold, an XML “message” is placed on a special SB queue. Once it is on the queue, it will stay there until some other process comes along and “consumes” it.
- EVENT NOTIFICATIONS, is what makes it possible for the trace event that was set by resource governor to be placed on the SB queue. There is nothing to configure… it just runs. Event notifications was introduced in SQL Server 2005 so it’s hardly new technology.
- A special stored procedure “service” was written that acts as the consumer of the XML events that are placed on the SB queue. Ultimately, it sends out the email alert with the details of the event.
Q: Why do we need SERVICE BROKER/EVENT NOTIFICATIONS to process the event? Shouldn’t Resource Governor have some way to flag an administrator if the CPU threshold is exceeded?
A: For starters, we really would not want RG to be automatically killing queries that exceed a certain CPU threshold. The ONLY response RG does when the CPU threshold is exceeded is raise a “trace event”. Once the trace event is raised, EVENT NOTIFICATIONS places the event on the SB queue. What we DO with that information is entirely up to US, the watchdogs of our systems. There are other “limits” that RG does place a hard stop on, such as the MAX DOP. But in the case of the CPU threshold, it doesn’t make sense. The response to the event simply needs to be customized.
Q: What will the email alert look like?
A: The email message will prominently include:
- The originating server
- The CPU time in milliseconds that the query has been running before the ALERT was processed from SB
- The SPID of the offending query session.
SERVICE BROKER and RESOURCE GOVERNOR are only available in the ENTERPRISE edition of SQL Server. They are also at the top of the list of subsystems that are too often not used (aka, least understood). But their potential value is immense.
Q: If the alert fires, how can we determine which user queries were running at the time of the event? We may not see the alert until several hours after the event occurs.
A: One of the “responses” that has been configured is to call sp_whoisactive and save the results into a table in DBA_BASELINES. This will capture all requests (active queries) that are running on the system. From there we should be able to narrow down the offending query/user very quickly. (Amplification: We have a SQL Agent job that calls Adam Machanic’s sp_whoisactive, and writes the result set into a table.)
Q: Can the CPU threshold fire even if the elapsed time (wall clock) of the query is less than the threshold?
A: Based on my knowledge of query analysis, YES. A CPU bound query, such as one that is dominated by a large SORT operation, that is running on multiple CPUs, could easily use up the 5 seconds of CPU in less than 5 seconds of ELAPSED time. In an extreme case, 5 seconds could be used by 4 CPUs running in parallel in 1.25 seconds.
Q: In the sample email message the reported CPU time is 8.376 seconds. Why didn’t the event fire on 5 seconds, which is the preset threshold value in RESOURCE GOVERNOR?
A: Service Broker is an ASYNCHRONOUS queue manager. This means that it will process the incoming event when it can; which may not be immediately. The delay could be UP TO 5 seconds after the trace event fires. In the real world we will be concerned with queries that are running for at least 15 minutes. Another 5 seconds (max) added to the runtime is insignificant.
Q: If the user queries runs for hours and hours through the night will my inbox be inundated with this alert?
A: No. This alert will fire ONCE and ONLY ONCE for each query that exceeds the CPU threshold.
Q: Will this approach create a high overhead on these busy database servers?
A: Two out of the three components of this system are always running anyways… SERVICE BROKER and EVENT NOTIFICATIONS.
SQL Server uses SB for its own management of processes. All instances of SQL Server have SB running in the background. It’s engineered into the SQL Engine. EVENT NOTIFICATIONS is very lightweight and consumes minimal system resources. It’s running all the time for SQL Trace, Profiler and the DEFAULT trace.
The only “optional” component of this solution, RESOURCE GOVERNOR, which has already been proven to be a valuable tool with managing system resources, is already in place. By all accounts, this approach to converting trace events into alerts has a very low overhead on the database engine. (Note: Resource Gov is NOT needed to process trace events, only the trace event that is used by RG to limit REQUEST_MAX_CPU_TIME_SEC.)
Q: If we now have a FRAMEWORK to process alerts, what other “events” can be added to our pool of interesting alerts?
A: Anything that can be monitored in SQL Profiler (or SQL TRACE) can be monitored and alerted on using EVENT NOTIFICATIONS. For the CPU Usage alert we only needed RESOURCE GOVERNOR to set the max threshold. Otherwise, it isn’t really needed for this framework to function. [I believe DDL triggers can also be processed with this framework but I have not tested any.] Adding any event that is available from sys.trace_events requires two simple steps:
1. Create a new notification. Such as…
2. Add a block of code to the Service Broker “Service” dbo.dba_processEventNotifications. This custom “service”:
- Pops the trace events off of the SB queue.
- Reads the XML out of the SB message.
- Formats the email SUBJECT and BODY content and sends out the DB Mail message that contains the alert information.
Basically, the following block of code would need to be added. The @message section would need to be tweaked to pull out the relevant attributes from the XML message. For example, CPU would not be relevant to an alert triggered by a database file autogrowth, but the database file name would be of interest.
Once again, the source code for this is available in the NEW DOWNLOADS section.
Stay curious, keep learning…