The single most important component of the SQL Server product is the query optimizer. The SQL Server storage engine utilizes a cost-based optimizer to determine optimal query plans. The job of the optimizer is to determine an optimal plan that will result in exceptional performance.
Note: An “optimal plan” is not necessarily the same as the lowest cost in terms of system resources. Because of the exponentially complex number of possible plan choices, the storage engine would have to determine every possible plan in order to determine which plan would run the fastest. The time to compute all of the possible plans would take more time than coming up with a pretty good plan and running it. In a sense, the query plans that SQL Server generates are actually heuristics; really good but not necessarily the absolute fastest plans.
The challenges of the optimizer are numerous. The number of possible combinations that the optimizer has to choose from is exponentially complex. As the number of tables in a single query grow, go grows the number of potential indexes, keys, constraints, etc.
For example, a simple query that has two tables linked together with a single INNER JOIN may present several options for the query optimizer, one of which is to choose the optimal PHYSICAL join operator:
The 3 possible physical join types will take into consideration the relative number of rows between the two tables, if there is an index on one or both of the tables that will allow it to avoid a SORT before joining the tables, which index(s) can be used to return all of the needed columns without doing a bookmark (key) lookup using the clustered index, are there any foreign keys that can be used to guarantee the existence of a value in a column.
Statistics Overview
Every time new rows are inserted into a table or are updated, the profile of the data in the table changes. SQL Server relies on the data profiles as yet another input parameter to determining an optimal query plan. For example, if we have a table with 100,000 rows and one of the columns is the customer BIRTH DATE. The statistics on that table would group the 100,000 birthdates into a histogram (bar chart) with ranges or “bands” of birth dates. These bands of data can be used to determine the selectivity of a query…
If the nature of the data is such that most customers are in the 18-25 age group, the bands of data for those years (such as 15-20 has 35,000 records and 20-25 has 40,000 records) than 75 percent of the customers are in that age range.
If a query restricts the data being selected to all 22 year olds, the SQL engine will know that there are many 22 year old in the data and will probably choose to do a clustered index scan (read each record) rather than try to use an index that has poor selectivity. If the query is looking for 35 year olds, now the selectivity of the data might be better suited to utilize an index to identify the appropriate records.
There are two types of statistics maintained in tables in SQL server:
1. Column statistics
2. Index statistics
Each index has an associated statistics histogram that attempts to profile the data. The greater the selectivity of the data (few records) for a given predicate (WHERE birthdate_date BETWEEN ‘1/1/1985’ and ‘1/31/21986’) the more likely that index will improve the performance of the query (index seek). If the index selectivity is too low (many records qualify), doing a straight index scan will work just as fast.
If a column that does not have an index on it is referenced as part of a predicate (WHERE clause) SQL Server will compute statistics on that column on the fly. The presumption is that predicate will be reused in the future, and subsequent queries will run faster. Column statistics can be easily identified in SQL Server Management Studio because their name starts with _WA.
By default, any INSERTED or UPDATED records into a table will result in the statistics on the column or index to be auto-updated. The auto update functionality can be disabled but it is widely agreed as best practice to leave auto update enabled, which is the default.
Depending on the size of the table, the statistics on each column or index may not be based on every record in that table. For small tables (those that occupy less than 8 pages), all rows in the table are used to determine the statistics (FULLSCAN or 100 PERCENT). As the table size grows (typically more rows but large rows in terms of data type size also impacts the number of rows per database page), and more pages are required to be read in order to read all of the rows, smaller sample sizes are utilized.
For large tables that have many insert and update operations, the statistics will become skewed to the point where they may actually start providing BAD data, which is worse than no data. Query plans that are built on outdated statistics may run so poorly that they never finish.
Note: Please reference the Microsoft white paper on statistics for a true description of stats: Statistics Used by the Query Optimizer in Microsoft SQL Server 2008
But between the time when the statistics are first initialized and the time the statistics are so far out of whack that the plan essentially fails, is a period where the performance of queries are increasingly becoming less efficient. These inefficiencies in query plan performance can result in progressively slower loads and reports.
It is for this reason that one of the universally agreed upon best practice is to update index and column statistics on a periodic basis to ensure accurate and efficient query plan generation. The better information we can give the optimizer, the better are our chances of having solid and consistent query.
Summary and related facts:
- By default, statistics are maintained and auto updated on all indexes and any columns that are used as predicates in a query.
- Due to time constraints, the larger the table, the smaller the sample set of the table is used to compute the statistics.
- Large tables with many updates and inserts will have statistics degrade if quality over time.
- In order to maintain good statistics SQL Server tables require regular maintenance to prevent the statistics from becoming skewed due to inadequate sample sizes.
- SQL Server does not provide any mechanism to selectively update individual statistics. Statistics are updated at the table level. So if a table has 10 columns/indexes with statistics on them, if 3 are out of date there is no way to update only those 3. All 10 of the statistics will need to be refreshed.
- For partitioned tables, there is no way in SQL Server to update the statistics at the partition level. Again, the statistics must be updated for the entire table. (This is widely regarded as a major weakness in Microsoft’s statistics solution. The irony is that partitioned tables tend to be the largest, and often times, only the most recent data changes. Once statistics are achieved on the oldest data, it usually does not change. Microsoft STILL has not announced if this issue is going to be resolved in Denali or not).
- Index maintenance should include index REBUILDS which addresses internal and external fragmentation of the index. Anytime an index is rebuilt (usually as a part of a maintenance plan) the statistics on the LEADING COLUMN of the index is computed with a 100 percent sample rate. In a sense, re-computing the statistics on an index after an index REBUILD is doing the same work twice. However, since we cannot update statistics at the column level, it may be necessary to re-compute the statistics on a table after index maintenance in order to update other columns that were not part of an index rebuild.
- The last date/time the statistic for each column/index is easily determined with the STATS_DATE T-SQL built-in function.
Statistics Updates in Maintenance Plans
Index and column statistics are normally updated as part of the recurring maintenance plan jobs. Tables that have a high percentage of their data changing (inserts and updates) will benefit the most from statistic updates.
Since statistics are updated at the table level, the time to compute the statistics are a function of the number of rows in the table AND the number of column/index statistics on the table. The largest tables, with the highest number of column statistics will take the longest amount of time in the ever decreasing maintenance windows to complete.
The other parameter that determines how fast the statistics can be computed on a table is the sample size used to compute the statistics. Smaller sample sizes will run faster, but higher sample sizes will arrive at statistically correct results.
For databases with large numbers of tables, indexes and column statistics, the time to re-calculate statistics can easily outstrip the amount of time of the maintenance window. In order to address this concern a statistics update procedure has been developed that will allow us to update column and index statistics within fairly narrow maintenance windows, while maintaining fairly high statistic sample rates.
As part of my maintenance plan I am running a TSQL script that will distribute the update of statistics over a multi-day period.
Here’s how it works…
- Instead of attempting to compute all of the statistics that need updating in a single session the column and index statistics will be spread out over several days. Initially, I have chosen four (4) days to distribute the statistic updates, but this could easily be setup as an input parameter.
- In order to evenly distribute the work load over all four days it may be tempting to use the number of rows in a table to estimate the relative length of time to compute the statistics for that table. However, a truer estimate of the time to compute the statistics is actually the number of rows * the number of column/index statistics. I called this factor the effective statistics count. For example, a table with 100000 rows and one index will have an effective stats count = 100000. A different table with 100000 rows but 75 columns with statistics will have a 7,500,000 effective count and will take 75 times longer to compute than the first table.
- In order to break the statistics computations over 4 days a modified NTILE algorithm was used. Basically, all tables that are candidates for statistic updates where ranked by effective statistics count ascending. Then each table was sequentially assigned an NTILE bucket (0-3). The net results are 4 buckets of tables that have a similar aggregated effective stats count.
- Tables are selected as candidates for statistic updates based on the following rules:
a. Any table that has one or more statistics older than 7 days will be updated
b. If the table has a disabled clustered index, the table is excluded from the list
c. Only user tables are included. Any system tables are excluded.
V. In order to balance compute time vs. quality of statistics, the following rules are in place to gradually reduce the sample rate for very large tables:
VI. The results of the statistic updates are logged in [master].dbo.dba_UpdateStatsLog.
By maintaining current and statistically accurate column and index statistics, we will have one less factor to eliminate when trouble shooting query performance issues.
Update Statistics
- USE master;
- GO
-
-
- If Not Exists(Select [object_id] From sys.tables Where name = N'dba_UpdateStatsLog')
- Begin
- – Drop Table dbo.dba_UpdateStatsLog
- Create Table dbo.dba_UpdateStatsLog ( UpdateStats_id int identity(1,1) NOT Null
- , databaseID int NOT Null
- , databaseName nvarchar(128) NOT Null
- , NTILE_id tinyint NOT NULL
- , TableName nvarchar(128) NOT Null
- , objectID int NOT Null
- , LastStatsUpdate smalldatetime NULL
- , NumStats smallint NOT NULL
- , [RowCount] int NOT NULL
- , EffectiveStatCount bigint NOT NULL
- , dateTimeStart smalldatetime NULL
- , durationSeconds int NULL
- , sampleRate tinyint NULL
- CONSTRAINT PK_UpdateStatsLog
- PRIMARY KEY CLUSTERED (UpdateStats_id) );
-
- Print 'dba_UpdateStatsLog Table Created';
- End
-
- IF OBJECTPROPERTY(OBJECT_ID('dbo.dba_UpdateStats'), N'IsProcedure') = 1
- Begin
- DROP PROCEDURE dbo.dba_UpdateStats;
- PRINT 'Procedure dba_UpdateStats dropped';
- End;
- GO
-
- –USE master;
- –GO
- –sp_msforeachdb 'EXEC dbo.dba_load_UpdateStatsLog
- – @database = ''?''
- – , @@num_days_old = 7;'
-
- –drop procedure dba_load_UpdateStatsLog
- CREATE PROCEDURE dbo.dba_load_UpdateStatsLog @databaseName nvarchar(128)
- , @num_days_old tinyint = 7
-
- AS
-
- – @databaseName – Name of DB to have statistics updated.
- – @num_days_old – Ignore any tables that had indexes rebuilt within the last
- – @num_days_old.
- –
- – USAGE: 1. Populate the dba_UpdateStats with tables that need to have status updated:
- – EXEC dbo.dba_load_UpdateStatsLog 'usta', 7
- – 2. Call procedure that updates the stats:
- – EXEC dbo.dba_updateStats
- –
- – SELECT * FROM dba_UpdateStatsLog where datetimestart is null ORDER BY ntile_id
- – select ntile_id, sum(effectivestatcount) FROM dba_UpdateStatsLog where datetimestart is null group by ntile_id
-
- – Credits and References:
- – Concepts such as the use of the log table in the master db and general architecture are
- – borrowed from Michelle Ullman's defrag procedure.
- – http://sqlserverpedia.com/wiki/Index_Maintenance
-
- – Reference Paul Randall's comments on the interaction of index rebuilds and statistic updates:
- – http://technet.microsoft.com/en-us/magazine/2008.08.database.aspx#id0230050
- –
-
- ————————————————————-
- ————————————————————-
- – Revision History:
- –
- – 20110224 D Turpin Original creation.
- – 20110225 D Turpin Limit to tables with DBO schema.
-
- DECLARE @sql_string nvarchar(max);
-
- – Check to see if there are still stats to be updated from the previous cycle:
- IF (SELECT count(*)
- FROM master.dbo.dba_UpdateStatsLog
- WHERE databaseName = @databaseName
- AND dateTimeStart IS NULL) > 0
-
- OR
-
- @databaseName IS NULL
- BEGIN
- PRINT 'Please update the exisiting stats in master.dbo.dba_UpdateStatsLog'
- RETURN
- END
-
- IF @databaseName IS NULL
-
- OR
-
- ( SELECT COUNT(*)
- FROM sys.databases
- WHERE name = ISNULL(@databaseName, '') ) = 0
- BEGIN
- PRINT 'Invalid database name submitted: ' + @databaseName
- RETURN
- END
-
- – Identify which statistics need to be updated:
- – Some ground rules: Only look at user tables
- – Get the row count based on the clustered index or heap
- – Ignore tables with disabled clustered indexes
- – Since stats updates do not work at the partition level (as of 2008R2)
- – sum the rows to get the row count. Otherwise our counts will be off
- – plus we will update the stats on the big table n-partitions times.
- SELECT @sql_string = 'USE ' + @databaseName + ';
- SELECT OBJECT_NAME(t.[object_id]) AS TableName
- , t.[object_id]
- , MIN(STATS_DATE(t.[object_id], s.stats_id)) AS LastStatsUpdate
- , COUNT(s.stats_id) AS [NumTableStats]
- , SUM(p.[rows]) AS [RowCount]
- , SUM(p.[rows]) * COUNT(s.stats_id) AS [EffectiveStatCount]
- INTO ##raw_stats
- FROM sys.stats AS s
- INNER JOIN sys.tables AS t
- ON t.[object_id] = s.[object_id]
- AND t.type = ''U''
- AND t.schema_id = 1
- INNER JOIN sys.indexes AS i
- ON i.[object_id] = s.[object_id]
- AND i.index_id <= 1
- AND i.is_disabled = 0
- INNER JOIN sys.partitions AS p
- ON p.[object_id] = i.[object_id]
- AND p.[index_id] = i.[index_id]
- AND p.[rows] > 0
- WHERE STATS_DATE(t.[object_id], s.stats_id) < DATEADD([day], -' +
- RTRIM(LTRIM(CONVERT(char, @num_days_old))) + ',
- CURRENT_TIMESTAMP)
- GROUP BY t.[object_id]
- , p.[rows];'
-
- EXEC SP_EXECUTESQL @sql_string;
-
- – Spread the workload over 4 days (Reference MOD operator):
- INSERT INTO [master].dbo.dba_UpdateStatsLog (
- databaseID
- , databaseName
- , NTILE_id
- , TableName
- , objectID
- , LastStatsUpdate
- , NumStats
- , [RowCount]
- , EffectiveStatCount )
- SELECT db_id(@databaseName)
- , @databaseName
- , (ROW_NUMBER() OVER(ORDER BY EffectiveStatCount)) % 4 AS NTILE_Modified
- , TableName
- , [object_id]
- , LastStatsUpdate
- , NumTableStats
- , [RowCount]
- , EffectiveStatCount
- FROM ##raw_stats
- ORDER BY 1 ASC;
-
- DROP TABLE ##raw_stats;
-
- RETURN;
-
- ——————————————
- ——————————————
- ——————————————
-
- IF ( SELECT count(*)
- FROM [master].dbo.dba_UpdateStatsLog
- WHERE dateTimeStart IS NULL ) > 0
- EXEC sp_msforeachdb 'EXEC dbo.dba_UpdateStats
- @database = ''?'';'
-
- ————————————————————-
- ————————————————————-
- ————————————————————-
- CREATE PROCEDURE dbo.dba_UpdateStats @databaseName nvarchar(128)
- –, @sample_rate tinyint = 100
-
- AS
-
- – @databaseName – Name of DB to have statistics updated.
- – @sample_rate – Percent of the table to check to compute stats.
- –
- – USAGE: 1. Call procedure that updates the stats:
- – EXEC dbo.dba_updateStats 'USTA' –, 5
- –
- ————————————————————-
- ————————————————————-
- – Revision History:
- –
- – 20110224 D Turpin Original creation.
- – 20110226 D Turpin Handle tables with spaces in the name.
- – Verify the table has not been dropped before updating the statistics.
- –
- –drop procedure dba_UpdateStats
- –
- – The EffectiveStatCount = #Rows in the Table * #Statistics on the table. The
- – concept is large tables statistic updates can explode if they have many columns
- – that have statistics on them.
- –
- – This routine uses a sliding scale to prevent huge tables from monopolizing system
- – resources.
- –
-
- DECLARE @ntile_id tinyint
- , @sql_string nvarchar(max)
- , @UpdateStats_id int
- , @ObjectID int
- , @TableName nvarchar(128)
- , @start_time datetime
- , @EffectiveStatCount bigint
- , @sample_rate tinyint;
-
- – Get the lowest tile for this db:
- SELECT @ntile_id = MIN(ntile_id)
- FROM [master].dbo.dba_UpdateStatsLog
- WHERE databasename = @databaseName
- AND datetimeStart IS NULL;
-
- IF @ntile_id IS NULL
- BEGIN
- PRINT 'No Statistics to be updated. Please run dbo.dba_load_UpdateStatsLog';
- RETURN;
- END
-
- DECLARE stats CURSOR FOR
- SELECT UpdateStats_id
- , ObjectID
- , TableName
- , EffectiveStatCount
- FROM [master].dbo.dba_UpdateStatsLog
- WHERE datetimeStart IS NULL
- AND NTILE_id = @ntile_id;
-
- OPEN stats;
- FETCH NEXT FROM stats INTO @UpdateStats_id, @ObjectID, @TableName, @EffectiveStatCount;
-
- WHILE @@fetch_status <> -1
- BEGIN
-
- SELECT @sample_rate =
- CASE WHEN @EffectiveStatCount > 10000000000 THEN 50 – In PERCENT
- WHEN @EffectiveStatCount BETWEEN 750000000 AND 10000000000 THEN 60
- WHEN @EffectiveStatCount BETWEEN 500000000 AND 7500000000 THEN 70
- WHEN @EffectiveStatCount BETWEEN 250000000 AND 5000000000 THEN 80
- WHEN @EffectiveStatCount BETWEEN 50000000 AND 2500000000 THEN 90
- ELSE 100
- END
-
- SELECT @sql_string = N'USE ' + @databaseName + ';
- IF EXISTS ( SELECT name
- FROM sys.tables
- WHERE [Object_ID] = ' + CONVERT(nchar, @ObjectID) + '
- AND schema_id = 1)
- UPDATE STATISTICS [' + @TableName + N'] WITH SAMPLE ' +
- LTRIM(RTRIM(CONVERT(nchar, @sample_rate))) + ' PERCENT';
-
- SELECT @start_time = CURRENT_TIMESTAMP;
-
- EXEC sp_executesql @sql_string;
-
- UPDATE [master].dbo.dba_UpdateStatsLog
- SET dateTimeStart = @start_time
- , durationSeconds = DATEDIFF(second, COALESCE(@start_time, current_timestamp), current_timestamp)
- , samplerate = @sample_rate
- WHERE UpdateStats_id = @UpdateStats_id;
-
- PRINT @sql_string;
-
- FETCH NEXT FROM stats INTO @UpdateStats_id, @ObjectID, @TableName, @EffectiveStatCount;
-
- END;
-
- CLOSE stats;
- DEALLOCATE stats;
-
- RETURN;