SQL Server Backup Compression Default

In SQL Server 2008, Microsoft added the long-overdue backup compression feature.  You may have noticed the pull-down menu in the backup dialog box that lets you use the DEFAULT SERVER SETTING for compression.


To determine what the DEFAULT value actually is:
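A quick check against sys.configurations does the trick (value_in_use is the running value; 0 = off, 1 = on):

```sql
SELECT  name
        , value
        , value_in_use
FROM    sys.configurations
WHERE   name = 'backup compression default';
```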


Out of the box, SQL Server 2008 backup compression is turned OFF.  To avoid the need to select COMPRESSION every time a backup is taken (assuming that is your intent), you can enable backup compression at the server level:

Enable Backup Compression:
  sp_configure 'backup compression default', 1
  reconfigure
  go

This will simply allow you to not have to worry about explicitly selecting the backup COMPRESSION option in the backup dialog box.

If all of your servers are in SQL Server’s new CENTRAL MANAGEMENT SERVER, you can enable backup compression on all of the servers by executing the above command ONE time.

Stay curious… keep learning…


Monitor Transaction Log Rollback

I recently had a situation where an index was being updated but the process hung in the middle of the job.  Apparently this isn't all that far-fetched, since it appears it has happened to other folks out there (the internet is the best… I rarely have a NEW problem)…

After stopping the query it went into a rollback.  A rollback that wouldn’t end.

How long should the query take to roll back?  How can I tell if the rollback is actually doing something, or whether this thing is never going to finish?

It turns out you can monitor the bytes reserved in the transaction log during a rollback.  As long as the reserved bytes are getting smaller your transaction rollback is working.

In my case, the rollback was NOT working and the reserved bytes remained flat.  The workaround, unfortunately, was to restart the SQL Server service.

Here’s the simple query to test the reserved bytes:

    select      current_timestamp as tod
                , database_transaction_log_bytes_reserved
                , database_transaction_log_bytes_used
    from  sys.dm_tran_database_transactions
    where transaction_id > 1000
    and database_id = 7 -- AdventureWorks
    -- and transaction_id = whatever it is…

Just to prove the point, I’ll demonstrate how it works by running a transaction in AdventureWorks.

The first thing I'll do is set up a query to collect the bytes reserved every 1 second.  The results will be saved to a temp table with the very unique name #t.

Save Bytes Reserved
create table #t (    tod    datetime
                    , bytes_reserved    bigint
                    , bytes_used        bigint )
while 1=1
begin
    insert into #t
    select      current_timestamp as tod
                , database_transaction_log_bytes_reserved
                , database_transaction_log_bytes_used
    from  sys.dm_tran_database_transactions
    where transaction_id > 1000
    and database_id = 7 -- AdventureWorks
    -- and transaction_id = whatever it is…
    waitfor delay '00:00:01'
end


Now I’ll do a little update in AdventureWorks, then rollback the transaction.

Kick off Xaction and Rollback
begin tran

    -- 16s
    update Sales.SalesOrderDetail
    set ModifiedDate = CURRENT_TIMESTAMP
rollback    -- 2s


The update took about 16 seconds to run, followed by a few seconds' pause before I ran the rollback, which took 2 seconds.

Here’s what the result set looks like:



As the update transaction is running we see the transaction log bytes reserved increasing.  Then there is a delay before I did the rollback…  Finally, the last two entries show the reserved bytes decreasing.

Although this tool may not tell you how LONG the ROLLBACK will take, it will at least tell you that it is in fact ROLLING back and is not "hung".

You can narrow down the query a bit by adding the transaction ID to ensure you are monitoring the transaction you need to look at.
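For example, to find the transaction_id belonging to a particular session (session_id 53 below is a hypothetical SPID), the session and database transaction DMVs can be joined:

```sql
-- Map a session to its transaction, then watch the reserved bytes shrink
SELECT  st.session_id
        , dt.transaction_id
        , dt.database_transaction_log_bytes_reserved
FROM    sys.dm_tran_session_transactions st
        INNER JOIN sys.dm_tran_database_transactions dt
            ON dt.transaction_id = st.transaction_id
WHERE   st.session_id = 53;    -- hypothetical SPID of the rolling-back query
```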

Stay curious… Keep learning…

List of Databases

Back in April I posted a query that shows instance information.  This post is the next level down in the hierarchy that I collect on my database platforms each week…  The databases in an instance.

This query is certainly one of the more straightforward queries for capturing database data.  It probably doesn't really need to be executed / collected weekly, but I do it anyway to be consistent with my collection process.

If you plug this query into a SQL Server Central Management server you’ll have a snapshot of all your instances that are part of the CMS.

This query will return the database name, the overall size of the database including all data and log files, the create date, owner, compatibility level, online vs offline, and update stats info.

It's interesting to find databases owned by staff who are no longer around, compatibility levels set to old versions for no obvious reason, and auto create / update stats options that are turned off.

Like they say on the History Channel’s Pawn Star show… you never know WHAT will be coming through that door.

So here’s the query:

SELECT    CONVERT(int, CONVERT(char, current_timestamp, 112)) AS CaptureDate
        , d.database_id
        , d.[name]
        , CAST(SUM(mf.size * 8192.0 / 1000000000) AS NUMERIC(18, 2)) AS [db_size (G)]    -- size is in 8 KB (8192-byte) pages
        , d.recovery_model_desc AS [Recovery Model]
        , d.create_date
        , suser_sname(d.owner_sid) AS [Owner]
        , d.[compatibility_level]
        , d.state_desc AS [State]
        , d.is_auto_create_stats_on
        , d.is_auto_update_stats_on
FROM        sys.databases d
        INNER JOIN sys.master_files mf
            ON    mf.database_id = d.database_id
WHERE    d.database_id > 4    -- Exclude the system databases.
GROUP BY d.database_id
        , d.[name]
        , d.recovery_model_desc
        , d.create_date
        , d.owner_sid
        , d.[compatibility_level]
        , d.state_desc
        , d.is_auto_create_stats_on
        , d.is_auto_update_stats_on

My next post in this series will be on database backups, which is a much more interesting query AND really does need to be monitored weekly, minimum.

Stay curious… keep learning…


How to Drop a TempDB Database File

The SQL Server tempdb is arguably the most important database in any installed instance.  A poorly configured tempdb will trump all of the best design efforts for good OLTP or data warehouse systems.

The standard recommendation is to have one tempdb data file for each CPU.  It's also generally recognized that this rule of thumb is just that, a guess.  For busy OLTP databases one tempdb file for each CPU may make sense.  But not always…
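As a quick sanity check before touching anything, you can compare the CPU count the instance sees against the current tempdb data files (the only assumption here is the 8 KB page math):

```sql
-- How many schedulers/CPUs does this instance see?
SELECT cpu_count FROM sys.dm_os_sys_info;

-- How many tempdb data files exist today, and how big are they?
SELECT  name
        , physical_name
        , size * 8 / 1024 AS size_mb    -- size is in 8 KB pages
FROM    tempdb.sys.database_files
WHERE   type_desc = 'ROWS';
```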

Last week I signed up to do a simple change to a production server's tempdb config.  Specifically, drop 4 data files.  I've recently added tempdb data files to existing databases… how hard can it be to drop them?

It turns out that I was right, and not so right.  My approach to deleting the tempdb data files was spot on; it just didn't work quite the way I anticipated.

In this post I will run through a few scenarios of adding a tempdb data file and then attempting to drop the file.

Just to set the record straight… the tempdb file I am creating is not a fixed size and does allow autogrowth.  That is NOT the point of this exercise.  Please visit other sites for info on configuring tempdb for your application.

Let’s first add a tempdb file to our instance:

Add tempDB file:
alter database tempdb
    ADD FILE (
        NAME = tempdb2
        , FILENAME = N'D:\tempdb\tempdb2.ndf'
        , SIZE = 512MB
        , FILEGROWTH = 50MB )


Now let’s remove the file:

Drop the tempdb file:
-- Try to delete the file just added
alter database tempdb
    remove file tempdb2

Since I’m running this test on a test server (translation, no one else is banging on it) the above code worked fine.  I got the following confirmation:

The file "D:\tempdb\tempdb2.ndf" has been modified in the system catalog. The new path will be used the next time the database is started.
The file 'tempdb2' has been removed.

The next scenario will add a new file to tempdb, write out temp table, then attempt to drop the file:

Drop file after temp table:
alter database tempdb
    add file (
        NAME = tempdb2
        , FILENAME = N'D:\tempdb\tempdb2.ndf'
        , size = 512MB
        , filegrowth = 50MB )

-- Use tempdb before trying to remove the new file
-- (any small query that writes a temp table will do)
select  *
into    #t
from    sys.objects

-- Try to delete the file just added
alter database tempdb
    remove file tempdb2

-- It still worked!
The file "D:\tempdb\tempdb2.ndf" has been modified in the system catalog. The new path will be used the next time the database is started.
The file 'tempdb2' has been removed.

I expected the above test to generate an error message telling me that the tempdb file could not be deleted because it was in use.  After all, that's what I got on the production server…

Maybe the small temp table wasn’t enough to use the new file.

Let’s create a larger temp table and see what happens:

BIG Tempdb Table:
alter database tempdb
    add file (
        NAME = tempdb2
        , FILENAME = N'D:\tempdb\tempdb2.ndf'
        , size = 512MB
        , filegrowth = 50MB )

--drop table #test
-- Let's put significantly more data in tempdb:
CREATE TABLE #test (    id    int identity (1, 1),
                        data char(8000) )

declare @i    int

set @i = 1

while @i <= 50000
begin
    insert into #test (data) values (convert(char, @i))
    select @i = @i + 1
end

I have a rather lengthy query that I use to see how much space is in each database file.  I’ll post it here, but it really has more data/columns than is needed for this post:

Check TempDB Files
-- Is there data in the second file of tempdb?
--drop table #sfs
--drop table #fixed_drives
--drop table #output_table
--drop table #databases
--drop table #dbf
--drop table #fg

-- Save result set from showfilestats
CREATE TABLE #sfs (    fileid            tinyint
                    , filegroupid    tinyint
                    , totalextents    int
                    , usedextents    int
                    , dbfilename    sysname
                    , physfile        varchar(255) );
-- Save result set from sys.database_files
CREATE TABLE #dbf (    [file_id]                int
                    , file_guid                uniqueidentifier
                    , [type]                tinyint
                    , type_desc                nvarchar(60)
                    , data_space_id            int
                    , [name]                sysname
                    , physical_name            nvarchar(260)
                    , [state]                tinyint
                    , state_desc            nvarchar(60)
                    , size                    int
                    , max_size                int
                    , growth                int
                    , is_media_ro            bit
                    , is_ro                    bit
                    , is_sparse                bit
                    , is_percent_growth        bit
                    , is_name_reserved        bit
                    , create_lsn            numeric(25, 0)
                    , drop_lsn                numeric(25, 0)
                    , read_only_lsn            numeric(25, 0)
                    , read_write_lsn        numeric(25, 0)
                    , diff_base_lsn            numeric(25, 0)
                    , diff_base_guid        uniqueidentifier
                    , diff_base_time        datetime
                    , redo_start_lsn        numeric(25, 0)
                    , redo_start_fork_guid    uniqueidentifier
                    , redo_target_lsn        numeric(25, 0)
                    , redo_target_fork_guid    uniqueidentifier
                    , back_lsn                numeric(25, 0) );
-- Save result set from sys.filegroups
CREATE TABLE #fg (    [name]                sysname
                    , data_space_id        int
                    , [type]            char(2)
                    , type_desc            nvarchar(60)
                    , is_default        bit
                    , [filegroup_id]    uniqueidentifier
                    , log_filegroup_id    int
                    , is_read_only        bit );
-- Populate #fixed_drives with data
CREATE TABLE #fixed_drives (DriveLetter    char(1) NOT NULL
                            , FreeMB    int NOT NULL );

INSERT INTO #fixed_drives
    EXEC master..xp_fixeddrives;
CREATE TABLE #output_table (DatabaseName    sysname
                            , FG_Name        sysname
                            , GB_Allocated    numeric(8, 2)
                            , GB_Used        numeric(8, 2)
                            , GB_Available    numeric(8, 2)
                            , DBFilename    sysname
                            , PhysicalFile    sysname
                            , Free_GB_on_Drive    numeric(8, 2) );
SELECT    name AS DBName
INTO    #databases
FROM    sys.databases
WHERE    database_id <= 4
AND        state_desc = 'ONLINE';

DECLARE    @dbname    sysname;

SELECT    @dbname = (    SELECT    TOP (1) DBName FROM    #databases);
DELETE FROM #databases WHERE DBName = @dbname;

WHILE @dbname IS NOT NULL
BEGIN
    -- Clear the work tables so results don't accumulate across iterations
    TRUNCATE TABLE #sfs;
    TRUNCATE TABLE #dbf;
    TRUNCATE TABLE #fg;
    -- Get the file group data
    INSERT INTO #sfs
        EXEC('USE ' + @dbname + '; DBCC showfilestats;');
    INSERT INTO #dbf
        EXEC('USE ' + @dbname + '; SELECT * FROM sys.database_files;');
    INSERT INTO #fg
        EXEC('USE ' + @dbname + '; SELECT * FROM sys.filegroups;');
    -- Wrap it up!
    INSERT INTO #output_table    (DatabaseName
                                , FG_Name
                                , GB_Allocated
                                , GB_Used
                                , GB_Available
                                , DBFilename
                                , PhysicalFile
                                , Free_GB_on_Drive )
        SELECT    @dbname AS DATABASE_NAME
                , fg.name AS [File Group Name]
                , CAST(((sfs.totalextents * 64.0) / 1024000.0) AS numeric(8, 2)) AS GB_Allocated
                , CAST(((sfs.usedextents * 64.0) / 1024000.0) AS NUMERIC(8, 2)) AS GB_Used
                , CAST((((sfs.totalextents - sfs.usedextents) * 64.0) / 1024000.0) AS NUMERIC(8, 2)) AS GB_Available
                , sfs.dbfilename
                , sfs.physfile
                , CAST((fd.FreeMB / 1000.0) AS NUMERIC(8, 2)) AS Free_GB_on_Drive
        FROM        #sfs sfs
                INNER JOIN #dbf dbf
                    ON    dbf.[file_id] = sfs.fileid
                INNER JOIN #fg fg
                    ON fg.data_space_id = sfs.filegroupid
                INNER JOIN #fixed_drives fd
                    ON    fd.DriveLetter = SUBSTRING(sfs.physfile, 1, 1);
    SELECT    @dbname = (    SELECT    TOP (1) DBName FROM    #databases);
    IF @dbname IS NOT NULL
        DELETE FROM #databases WHERE DBName = @dbname;
END

SELECT    CONVERT(int, CONVERT(char, current_timestamp, 112)) AS CaptureDate
        , DatabaseName
        , FG_Name
        , GB_Allocated
        , GB_Used
        , GB_Available
        , DBFilename
        , PhysicalFile
        , Free_GB_on_Drive
FROM    #output_table
ORDER BY DatabaseName
        , FG_Name


OK, this time we have confirmed that we have data in tempdb.  Let’s see if we can drop it:

Drop Non-empty Tempdb file:
-- Now try to remove it!
alter database tempdb
    remove file tempdb2

-- Cool, we have data in tempdb2!!!
The file "D:\tempdb\tempdb2.ndf" has been modified in the system catalog. The new path will be used the next time the database is started.
Msg 5042, Level 16, State 1, Line 1
The file 'tempdb2' cannot be removed because it is not empty.


This is the error I got on my production server.  It turns out, if you actually read the error message (which I can't say I truly did), it does indicate that the system catalog has been updated BUT the change will not take effect until the service is restarted.

By simply restarting the SQL Server service the tempdb file(s) will be removed on startup.

I’ve seen some posts online that attempted to use DBCC to empty the tempdb data file.  For me it didn’t work.
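For reference, the DBCC approach those posts suggest looks roughly like this (again, it did not work in my case):

```sql
USE tempdb;
GO
-- Try to migrate the file's contents to the other tempdb files...
DBCC SHRINKFILE (tempdb2, EMPTYFILE);
-- ...then drop it
ALTER DATABASE tempdb REMOVE FILE tempdb2;
```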

Of course the above approach does require a restart of SQL Server.  Besides the obvious issue of the database going offline for 30 seconds or so (make sure no jobs are running), restarting the service will zero out all of the server counters.  This may be an issue if you're using DMVs to monitor performance.

It’s been quite a while since I did a post… Hopefully this one is worthy of being remembered.

Stay curious… keep learning…


DBA Query: Instance Information

As part of my weekly routine I like to take a holistic view of the production servers I am responsible for.  After backups and maintenance plans, the next most important role of the DBA is to keep an eye on the overall health of the production databases.

I have several queries that I use to provide me with a summary view of the databases.

Today’s post will be a query that provides a nice view at the instance level.  Of course there are always more points to pull, but this is what I culled the query down to.  You may want to add more, or delete something.

Rather than running this query for every instance, all you need to do is set up your central management servers (providing you have at least one SQL2008 or later server).  This will allow you to run the query and get the result set for each server.  For more info on central management servers see BOL.

So here’s my query for instance reporting…  Enjoy.

Instance Properties
DECLARE    @xp_msver table(    id                    int
                            , name                sysname
                            , internal_value    int
                            , character_value    nvarchar(4000) );

INSERT INTO @xp_msver
    EXEC    master.dbo.xp_msver ProcessorCount;
INSERT INTO @xp_msver
    EXEC    master.dbo.xp_msver PhysicalMemory;

SELECT    CONVERT(int, CONVERT(char, current_timestamp, 112)) AS CaptureDate
        , SERVERPROPERTY ('ServerName') AS [ServerName]
        , SERVERPROPERTY ('ComputerNamePhysicalNetBIOS') AS [ComputerNamePhysicalNetBIOS]
        , SERVERPROPERTY ('MachineName') AS [MachineName]
        , SERVERPROPERTY ('InstanceName') AS [InstanceName]
        , SERVERPROPERTY ('Edition') AS [SQL Server Edition]
        , SERVERPROPERTY ('ProductVersion') AS [ProductVersion]
        , SERVERPROPERTY ('ProductLevel') AS [ProductLevel]
        , SERVERPROPERTY ('IsClustered') AS [IsClustered]
        , SERVERPROPERTY ('IsIntegratedSecurityOnly') AS [IsIntegratedSecurityOnly]
        , SERVERPROPERTY ('IsSingleUser') AS [IsSingleUser]
        , SERVERPROPERTY ('LicenseType') AS [LicenseType]
        , SERVERPROPERTY ('NumLicenses') AS [NumLicenses]
        , SERVERPROPERTY ('ProcessID') AS [ProcessID]
        , CAST(xp1.ProcessorCount AS int) AS ProcessorCount
        , CAST(xp2.PhysicalMemory AS int) AS PhysicalMemory
        , dt.DefaultTraceEnabled
        --, df.DefaultTracePath
        , rde.RemoteDACEnabled
        , (    SELECT    create_date
            FROM    sys.databases
            WHERE    database_id = 2) AS SQLServerStartDate
        , (    SELECT    DATEDIFF(day, create_date, current_timestamp)
            FROM    sys.databases
            WHERE    database_id = 2) As DaysSinceSQLServerStart
FROM    (SELECT    character_value AS ProcessorCount FROM @xp_msver WHERE name = 'ProcessorCount') xp1
        CROSS JOIN
         (SELECT    internal_value AS PhysicalMemory FROM @xp_msver WHERE name = 'PhysicalMemory') xp2
        CROSS JOIN
         (SELECT value_in_use AS DefaultTraceEnabled FROM sys.configurations WHERE name = 'default trace enabled') dt
        CROSS JOIN
        (SELECT value_in_use AS RemoteDACEnabled FROM sys.configurations WHERE name = 'remote admin connections') rde
        --CROSS JOIN
        --(SELECT NULL AS DefaultTracePath) df
        --(SELECT VALUE AS DefaultTracePath FROM fn_trace_getinfo(0) WHERE property = 2) df
Stay curious… keep learning.


Monitor Transaction Isolation Levels

As part of a performance improvement effort we decided to make sure our enterprise reporting tools are operating under the READ UNCOMMITTED transaction isolation level (i.e., allowing dirty reads).  While our reporting application experts were certain the tools were at the desired isolation level, we decided to monitor the actual transactions just to be sure.

Monitoring the isolation level seemed like a pretty straightforward, common task.  While most things I don't know how to do off the top of my head are readily available through my preferred search engine, I had a tougher time finding a clear description of how to monitor the isolation level using T-SQL.

There were lots of posts about how to SET the isolation level, how to get the current isolation level for the current session, and even how to get the default isolation level at the instance level.
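For completeness, those session-scoped pieces are one-liners:

```sql
-- Set the isolation level for the current session
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

-- Show the current session's SET options, including the isolation level
DBCC USEROPTIONS;
```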

Without elaborating further, here is how I did it, so I don’t ever have to search for it again…

Monitor Trans Iso Levels
SELECT    original_login_name
        , nt_user_name
        , CASE transaction_isolation_level
            WHEN 0 THEN 'Unspecified'
            WHEN 1 THEN 'ReadUncommitted'
            WHEN 2 THEN 'ReadCommitted'
            WHEN 3 THEN 'Repeatable'
            WHEN 4 THEN 'Serializable'
            WHEN 5 THEN 'Snapshot'
            ELSE '???'
        END AS  [TransactionIsolationLevel]
        , status
        , last_request_start_time
        , *
FROM    sys.dm_exec_sessions
WHERE    session_id > 50
order by 3;

Like most tasks in life, it’s simple once you know how.

Stay curious; keep learning…

An unexpected honor…

The purpose of this blog, for the most part, is to provide me with a paper trail of technical SQL Server topics I feel are worth remembering.  Things that are easier for me to look up later and read in my own explanation, if and when I need to re-learn the topic.  A secondary benefit is that the learning process is enhanced simply by documenting my thoughts.

Today’s topic is not focused on SQL Server, but it is one that I would like to remember, for a long time…

I've been teaching a 6-part course on SQL Server T-SQL querying for about 2 years.  My audience is internal new hires, who often have very technical formal educations but not necessarily in programming or SQL at all.  The concept of the class was born out of my own selfish desire to have my colleagues better equipped to resolve their own data quality issues with SQL.  The alternative was that many of their questions would be fired in my direction.

I've always enjoyed helping new hires…  Oftentimes they are fresh out of college and are about as enthusiastic an audience as you could ever ask for.  My secret intention is to inspire them to take their careers higher and faster than I did myself.  For sure, with their elite degrees they are certainly well prepared, so this is an easy task for me.

My recent batch of “students” was a very bright group of young ladies, four in number… none of them from the US of A.  One is from Romania, two from India, one from China.  The courage it takes for someone in their 20’s, maybe 30’s (one of those questions I don’t dare ask) to leave home and go to a foreign country simply baffles me.

At the end of each course a certificate of completion is presented to each individual who satisfied all of the course requirements.  The certificate is signed by the most senior executive manager in our group and is made part of the employee's permanent training record.

The last class is a little more casual for the students, since it is a demo that takes a slightly deeper look at execution plans.  It's the only class that does not have homework, so everyone is a little lighter in spirit.

In this particular case, however, I decided to defer the certificate presentation to the students…  A group staff meeting was scheduled in two days and I opted to have the certificates presented in front of their peers and managers.

Instead of me giving them their certificate… they gave ME mine…

The image below does not do justice to the actual document.  The art work was all hand done, was on a parchment style paper, rolled up like a scroll, tied with a red ribbon.  Of course when they handed it to me I had no idea what it was.

Suffice it to say I was deeply moved… the spirit of appreciation this group of young ladies expressed far exceeded my wildest imagination.  Just when you think no one is listening or that your contribution is rote, something like this happens to wake you up. 

To all of my own teachers, mentors and professors of the past, those who sometimes pushed me but often pulled and carried me, thank you.

To these young ladies of my present, thank you!

The following image is an excerpt from an email that consisted of a scanned image of the certificate and their personal comments.  It was sent to my management team, up to 3 levels up.  This was one reward I never expected, or even had a hint could happen.


Stay curious, keep learning…



Maintaining Accurate SQL Server Statistics in Small Maintenance Windows

The single most important component of the SQL Server product is the query optimizer. SQL Server uses a cost-based optimizer to determine query plans. The job of the optimizer is to determine an optimal plan that will result in exceptional performance.

Note: An "optimal plan" is not necessarily the one with the lowest cost in terms of system resources. Because of the exponentially complex number of possible plan choices, the engine would have to evaluate every possible plan in order to determine which plan would run the fastest. The time to compute all of the possible plans would take longer than coming up with a pretty good plan and running it. In a sense, the query plans that SQL Server generates are actually heuristics; really good, but not necessarily the absolute fastest plans.

The challenges for the optimizer are numerous. The number of possible combinations that the optimizer has to choose from is exponentially complex. As the number of tables in a single query grows, so grows the number of potential indexes, keys, constraints, etc.

For example, a simple query that joins two tables with a single INNER JOIN presents several options for the query optimizer, one of which is choosing the optimal PHYSICAL join operator:

  • NESTED LOOPS

  • MERGE

  • HASH

The choice among the 3 possible physical join types takes into consideration: the relative number of rows in the two tables; whether there is an index on one or both of the tables that lets the engine avoid a SORT before joining; which index(es) can be used to return all of the needed columns without doing a bookmark (key) lookup against the clustered index; and whether any foreign keys can be used to guarantee the existence of a value in a column.
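One way to see these choices in action is to force each physical operator with a join hint and compare the resulting actual plans (AdventureWorks tables used for illustration):

```sql
-- Let the optimizer pick the physical join on its own...
SELECT  soh.SalesOrderID, sod.ProductID
FROM    Sales.SalesOrderHeader soh
        INNER JOIN Sales.SalesOrderDetail sod
            ON sod.SalesOrderID = soh.SalesOrderID;

-- ...then force each operator and compare the plan costs
SELECT  soh.SalesOrderID, sod.ProductID
FROM    Sales.SalesOrderHeader soh
        INNER JOIN Sales.SalesOrderDetail sod
            ON sod.SalesOrderID = soh.SalesOrderID
OPTION (HASH JOIN);    -- also try: LOOP JOIN, MERGE JOIN
```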

Statistics Overview

Every time new rows are inserted into a table or existing rows are updated, the profile of the data in the table changes. SQL Server relies on these data profiles as yet another input to determining an optimal query plan. For example, suppose we have a table with 100,000 rows and one of the columns is the customer BIRTH DATE. The statistics on that table group the 100,000 birth dates into a histogram (bar chart) with ranges, or "bands," of birth dates. These bands of data can be used to determine the selectivity of a query…

If the nature of the data is such that most customers are in the 18-25 age group, and the bands for those years hold most of the rows (say the 15-20 band has 35,000 records and the 20-25 band has 40,000 records), then 75 percent of the customers are in that age range.

If a query restricts the data being selected to all 22 year olds, the SQL engine will know that there are many 22 year olds in the data and will probably choose to do a clustered index scan (read each record) rather than try to use an index that has poor selectivity. If the query is looking for 35 year olds, the selectivity of the data might be better suited to utilizing an index to identify the appropriate records.

There are two types of statistics maintained on tables in SQL Server:

1. Column statistics

2. Index statistics

Each index has an associated statistics histogram that attempts to profile the data. The greater the selectivity of the data (few records) for a given predicate (WHERE birth_date BETWEEN '1/1/1985' AND '1/31/1986'), the more likely that index will improve the performance of the query (index seek). If the index selectivity is too low (many records qualify), doing a straight index scan will work just as fast.

If a column that does not have an index on it is referenced as part of a predicate (WHERE clause), SQL Server will compute statistics on that column on the fly. The presumption is that the predicate will be reused in the future, so subsequent queries will run faster. Column statistics can be easily identified in SQL Server Management Studio because their names start with _WA.
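A quick way to list those auto-created column statistics is a sketch against sys.stats:

```sql
-- Auto-created column statistics have system-generated _WA_Sys... names
SELECT  OBJECT_NAME(s.[object_id]) AS table_name
        , s.name AS stat_name
        , s.auto_created
FROM    sys.stats s
WHERE   s.name LIKE '[_]WA%';
```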

By default, records INSERTED into or UPDATED in a table will eventually cause the statistics on the affected columns or indexes to be auto-updated. The auto update functionality can be disabled, but it is widely agreed best practice to leave auto update enabled, which is the default.

Depending on the size of the table, the statistics on each column or index may not be based on every record in that table. For small tables (those that occupy fewer than 8 pages), all rows in the table are used to determine the statistics (FULLSCAN, or 100 PERCENT). As the table grows (typically more rows, though wide rows in terms of data type size also reduce the number of rows per database page) and more pages must be read in order to see all of the rows, smaller sample sizes are utilized.

For large tables that have many insert and update operations, the statistics will become skewed to the point where they may actually start providing BAD data, which is worse than no data. Query plans that are built on outdated statistics may run so poorly that they never finish.

Note: Please reference the Microsoft white paper on statistics for a true description of stats: Statistics Used by the Query Optimizer in Microsoft SQL Server 2008

But between the time the statistics are first initialized and the time the statistics are so far out of whack that the plan essentially fails, there is a period where query performance becomes increasingly less efficient. These inefficiencies in query plan performance can result in progressively slower loads and reports.

It is for this reason that one universally agreed-upon best practice is to update index and column statistics on a periodic basis to ensure accurate and efficient query plan generation. The better the information we can give the optimizer, the better our chances of having solid and consistent query performance.

Summary and related facts:

  • By default, statistics are maintained and auto updated on all indexes and any columns that are used as predicates in a query.
  • Due to time constraints, the larger the table, the smaller the sample set of the table is used to compute the statistics.
  • Large tables with many updates and inserts will have statistics degrade if quality over time.
  • In order to maintain good statistics SQL Server tables require regular maintenance to prevent the statistics from becoming skewed due to inadequate sample sizes.
  • SQL Server does not provide any mechanism to selectively update individual statistics. Statistics are updated at the table level. So if a table has 10 columns/indexes with statistics on them, if 3 are out of date there is no way to update only those 3. All 10 of the statistics will need to be refreshed.
  • For partitioned tables, there is no way in SQL Server to update the statistics at the partition level. Again, the statistics must be updated for the entire table. (This is widely regarded as a major weakness in Microsoft’s statistics solution. The irony is that partitioned tables tend to be the largest, and often times, only the most recent data changes. Once statistics are achieved on the oldest data, it usually does not change. Microsoft STILL has not announced if this issue is going to be resolved in Denali or not).
  • Index maintenance should include index REBUILDs, which address internal and external fragmentation of the index.  Anytime an index is rebuilt (usually as part of a maintenance plan) the statistics on the LEADING COLUMN of the index are computed with a 100 percent sample rate.  In a sense, re-computing the statistics on an index after an index REBUILD is doing the same work twice.  However, since we cannot update statistics at the column level, it may be necessary to re-compute the statistics on a table after index maintenance in order to update other columns that were not part of an index rebuild.
  • The last date/time the statistics for each column/index were updated is easily determined with the STATS_DATE T-SQL built-in function.
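As a quick illustration, STATS_DATE can be joined against sys.stats to list the last update time of every statistic on a table (Person.Contact is used here only as an example table name):

```sql
-- Last statistics update date for every statistic on a table.
-- Person.Contact is just an illustrative table name.
SELECT  OBJECT_NAME(s.[object_id]) AS TableName
        , s.name AS StatName
        , STATS_DATE(s.[object_id], s.stats_id) AS LastStatsUpdate
FROM    sys.stats AS s
WHERE   s.[object_id] = OBJECT_ID('Person.Contact')
ORDER BY LastStatsUpdate;
```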

Statistics Updates in Maintenance Plans

Index and column statistics are normally updated as part of the recurring maintenance plan jobs. Tables that have a high percentage of their data changing (inserts and updates) will benefit the most from statistic updates.

Since statistics are updated at the table level, the time to compute the statistics is a function of the number of rows in the table AND the number of column/index statistics on the table.  The largest tables with the highest number of column statistics will take the longest to complete within ever-shrinking maintenance windows.

The other parameter that determines how fast the statistics can be computed on a table is the sample size used to compute them.  Smaller sample sizes run faster, but higher sample sizes produce more accurate results.
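This trade-off is controlled by the sampling clause of UPDATE STATISTICS (the table name below is hypothetical):

```sql
-- Full scan: slowest, but statistically exact.
UPDATE STATISTICS dbo.BigStagingTable WITH FULLSCAN;

-- 10 percent sample: much faster, at the cost of histogram accuracy.
UPDATE STATISTICS dbo.BigStagingTable WITH SAMPLE 10 PERCENT;
```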

For databases with large numbers of tables, indexes and column statistics, the time to re-calculate statistics can easily outstrip the maintenance window.  To address this concern, a statistics update procedure has been developed that allows us to update column and index statistics within fairly narrow maintenance windows while maintaining fairly high sample rates.

As part of my maintenance plan I run a T-SQL script that distributes the update of statistics over a multi-day period.

Here’s how it works…

  1. Instead of attempting to compute all of the statistics that need updating in a single session, the column and index statistics are spread out over several days.  Initially I have chosen four (4) days to distribute the statistic updates, but this could easily be set up as an input parameter.
  2. To evenly distribute the workload over all four days it may be tempting to use the number of rows in a table to estimate the relative length of time to compute the statistics for that table.  However, a truer estimate is the number of rows * the number of column/index statistics.  I call this factor the effective statistics count.  For example, a table with 100,000 rows and one index will have an effective stats count = 100,000.  A different table with 100,000 rows but 75 columns with statistics will have a 7,500,000 effective count and will take 75 times longer to compute than the first table.
  3. In order to break the statistics computations over 4 days a modified NTILE algorithm is used.  Basically, all tables that are candidates for statistic updates are ranked by effective statistics count ascending.  Then each table is sequentially assigned an NTILE bucket (0-3).  The net result is 4 buckets of tables with similar aggregated effective stats counts.
  4. Tables are selected as candidates for statistic updates based on the following rules:

    a.  Any table that has one or more statistics older than 7 days will be updated

    b. If the table has a disabled clustered index, the table is excluded from the list

    c. Only user tables are included. Any system tables are excluded.

  5. In order to balance compute time vs. quality of statistics, a sliding scale of rules gradually reduces the sample rate for very large tables (see the CASE expression on @EffectiveStatCount in dba_UpdateStats below).

  6. The results of the statistic updates are logged in [master].dbo.dba_UpdateStatsLog.

By maintaining current and statistically accurate column and index statistics, we will have one less factor to eliminate when troubleshooting query performance issues.

Update Statistics
  1. USE master;
  2. GO
  5. If Not Exists(Select [object_id] From sys.tables Where name = N'dba_UpdateStatsLog')
  6. Begin
  7.     -- Drop Table dbo.dba_UpdateStatsLog
  8.     Create Table dbo.dba_UpdateStatsLog ( UpdateStats_id        int identity(1,1)   NOT Null
  9.                                         , databaseID            int                 NOT Null
  10.                                         , databaseName            nvarchar(128)       NOT Null
  11.                                         , NTILE_id                tinyint                NOT NULL
  12.                                         , TableName                nvarchar(128)       NOT Null
  13.                                         , objectID                int                 NOT Null
  14.                                         , LastStatsUpdate        smalldatetime        NULL
  15.                                         , NumStats                smallint            NOT NULL
  16.                                         , [RowCount]            int                    NOT NULL
  17.                                         , EffectiveStatCount    bigint                NOT NULL
  18.                                         , dateTimeStart            smalldatetime       NULL
  19.                                         , durationSeconds        int                 NULL
  20.                                         , sampleRate            tinyint                NULL
  21.                                         CONSTRAINT PK_UpdateStatsLog
  22.                                             PRIMARY KEY CLUSTERED (UpdateStats_id)    );
  24.     Print 'dba_UpdateStatsLog Table Created';
  25. End
  27. IF OBJECTPROPERTY(OBJECT_ID('dbo.dba_UpdateStats'), N'IsProcedure') = 1
  28. Begin
  29.     DROP PROCEDURE dbo.dba_UpdateStats;
  30.     PRINT 'Procedure dba_UpdateStats dropped';
  31. End;
  32. GO
  34. --USE master;
  35. --GO
  36. --sp_msforeachdb 'EXEC dbo.dba_load_UpdateStatsLog
  37. --                    @databaseName        = ''?''
  38. --                    , @num_days_old    = 7;'
  40. --drop procedure dba_load_UpdateStatsLog
  41. CREATE PROCEDURE dbo.dba_load_UpdateStatsLog @databaseName    nvarchar(128)
  42.                                             , @num_days_old     tinyint = 7
  44. AS
  46. -- @databaseName - Name of DB to have statistics updated.
  47. -- @num_days_old - Ignore any tables that had indexes rebuilt within the last
  48. --                    @num_days_old days.
  49. -- USAGE: 1. Populate dba_UpdateStatsLog with tables that need to have stats updated:
  50. --                EXEC dbo.dba_load_UpdateStatsLog 'usta', 7
  51. --          2. Call the procedure that updates the stats:
  52. --                EXEC dbo.dba_updateStats
  53. -- SELECT * FROM dba_UpdateStatsLog WHERE datetimestart IS NULL ORDER BY ntile_id
  54. -- SELECT ntile_id, SUM(effectivestatcount) FROM dba_UpdateStatsLog WHERE datetimestart IS NULL GROUP BY ntile_id
  56. -- Credits and References:
  57. -- Concepts such as the use of the log table in the master db and general architecture are
  58. -- borrowed from Michelle Ullman's defrag procedure.
  59. --        http://sqlserverpedia.com/wiki/Index_Maintenance
  61. -- Reference Paul Randal's comments on the interaction of index rebuilds and statistic updates:
  62. --        http://technet.microsoft.com/en-us/magazine/2008.08.database.aspx#id0230050
  64. -------------------------------------------------------------
  65. -------------------------------------------------------------
  66. -- Revision History:
  67. -- 20110224    D Turpin        Original creation.
  68. -- 20110225    D Turpin        Limit to tables with DBO schema.
  70. DECLARE @sql_string    nvarchar(max);
  72. -- Check to see if there are still stats to be updated from the previous cycle:
  73. IF (SELECT    count(*)
  74.     FROM    master.dbo.dba_UpdateStatsLog
  75.     WHERE    databaseName = @databaseName
  76.     AND        dateTimeStart IS NULL) > 0
  78.     OR
  80.     @databaseName IS NULL
  81. BEGIN
  82.     PRINT 'Please update the existing stats in master.dbo.dba_UpdateStatsLog'
  83.     RETURN
  84. END
  86. IF @databaseName IS NULL
  88.     OR
  90.     (    SELECT    COUNT(*)
  91.         FROM    sys.databases
  92.         WHERE    name = ISNULL(@databaseName, '') ) = 0
  93. BEGIN
  94.     PRINT 'Invalid database name submitted: ' + @databaseName
  95.     RETURN
  96. END        
  98. -- Identify which statistics need to be updated:
  99. -- Some ground rules:    Only look at user tables
  100. --                        Get the row count based on the clustered index or heap
  101. --                        Ignore tables with disabled clustered indexes
  102. --                        Since stats updates do not work at the partition level (as of 2008R2)
  103. --                        sum the rows to get the row count.  Otherwise our counts will be off
  104. --                        plus we will update the stats on the big table n-partitions times.
  105. SELECT    @sql_string = 'USE ' + @databaseName + ';
  106.     SELECT    OBJECT_NAME(t.[object_id]) AS TableName
  107.             , t.[object_id]
  108.             , MIN(STATS_DATE(t.[object_id], s.stats_id)) AS LastStatsUpdate
  109.             , COUNT(s.stats_id) AS [NumTableStats]
  110.             , SUM(p.[rows]) AS [RowCount]
  111.             , SUM(p.[rows]) * COUNT(s.stats_id) AS [EffectiveStatCount]
  112.     INTO    ##raw_stats
  113.     FROM        sys.stats AS s
  114.             INNER JOIN sys.tables AS t
  115.                 ON    t.[object_id] = s.[object_id]
  116.                 AND    t.type = ''U''
  117.                 AND    t.schema_id = 1
  118.             INNER JOIN sys.indexes AS i
  119.                 ON    i.[object_id] = s.[object_id]
  120.                 AND    i.index_id <= 1
  121.                 AND    i.is_disabled = 0
  122.             INNER JOIN sys.partitions AS p
  123.                 ON    p.[object_id] = i.[object_id]
  124.                 AND    p.[index_id] = i.[index_id]
  125.                 AND    p.[rows] > 0
  126.     WHERE    STATS_DATE(t.[object_id], s.stats_id) < DATEADD([day], -' +
  127.                                                             RTRIM(LTRIM(CONVERT(char, @num_days_old))) + ',
  128.                                                             CURRENT_TIMESTAMP)
  129.     GROUP BY t.[object_id];'
  132. EXEC SP_EXECUTESQL @sql_string;
  134. -- Spread the workload over 4 days (reference the % modulo operator):
  135. INSERT INTO [master].dbo.dba_UpdateStatsLog (    
  136.                             databaseID
  137.                             , databaseName
  138.                             , NTILE_id
  139.                             , TableName
  140.                             , objectID
  141.                             , LastStatsUpdate
  142.                             , NumStats
  143.                             , [RowCount]
  144.                             , EffectiveStatCount )
  145.     SELECT    db_id(@databaseName)
  146.             , @databaseName
  147.             , (ROW_NUMBER() OVER(ORDER BY EffectiveStatCount)) % 4 AS NTILE_Modified
  148.             , TableName
  149.             , [object_id]
  150.             , LastStatsUpdate
  151.             , NumTableStats
  152.             , [RowCount]
  153.             , EffectiveStatCount
  154.     FROM    ##raw_stats
  155.     ORDER BY 1 ASC;
  157. DROP TABLE ##raw_stats;
  159. RETURN;
  161. ------------------------------------------
  162. ------------------------------------------
  163. ------------------------------------------
  165. IF    (    SELECT    count(*)
  166.         FROM    [master].dbo.dba_UpdateStatsLog
  167.         WHERE    dateTimeStart IS NULL ) > 0
  168.     EXEC sp_msforeachdb 'EXEC dbo.dba_UpdateStats
  169.                             @databaseName = ''?'';'
  171. -------------------------------------------------------------
  172. -------------------------------------------------------------
  173. -------------------------------------------------------------
  174. CREATE PROCEDURE dbo.dba_UpdateStats @databaseName    nvarchar(128)
  175.                                     --, @sample_rate    tinyint = 100
  177. AS
  179. -- @databaseName - Name of DB to have statistics updated.
  180. -- @sample_rate - Percent of the table to check to compute stats.
  181. -- USAGE: 1. Call the procedure that updates the stats:
  182. --                EXEC dbo.dba_updateStats 'USTA'    --, 5
  183. -------------------------------------------------------------
  184. -------------------------------------------------------------
  185. -- Revision History:
  186. -- 20110224    D Turpin        Original creation.
  187. -- 20110226    D Turpin        Handle tables with spaces in the name.
  188. --                            Verify the table has not been dropped before updating the statistics.
  189. --drop procedure dba_UpdateStats
  190. -- The EffectiveStatCount = #Rows in the Table * #Statistics on the table.  The
  191. -- concept is that statistic updates on large tables can explode if they have many
  192. -- columns with statistics on them.
  193. -- This routine uses a sliding scale to prevent huge tables from monopolizing system
  194. -- resources.
  196. DECLARE    @ntile_id                tinyint
  197.         , @sql_string            nvarchar(max)
  198.         , @UpdateStats_id        int
  199.         , @ObjectID                int
  200.         , @TableName            nvarchar(128)
  201.         , @start_time            datetime
  202.         , @EffectiveStatCount    bigint
  203.         , @sample_rate            tinyint;
  205. -- Get the lowest tile for this db:
  206. SELECT    @ntile_id = MIN(ntile_id)
  207. FROM        [master].dbo.dba_UpdateStatsLog
  208. WHERE    databasename = @databaseName
  209. AND        datetimeStart IS NULL;
  211. IF @ntile_id IS NULL
  212. BEGIN
  213.     PRINT 'No Statistics to be updated.  Please run dbo.dba_load_UpdateStatsLog';
  214.     RETURN;
  215. END
  217.     DECLARE stats CURSOR FOR
  218.     SELECT    UpdateStats_id
  219.             , ObjectID
  220.             , TableName
  221.             , EffectiveStatCount
  222.     FROM    [master].dbo.dba_UpdateStatsLog
  223.     WHERE    datetimeStart IS NULL
  224.     AND        NTILE_id = @ntile_id;
  226. OPEN stats;
  227. FETCH NEXT FROM stats INTO @UpdateStats_id, @ObjectID, @TableName, @EffectiveStatCount;
  229. WHILE @@fetch_status = 0
  230. BEGIN
  232.     SELECT    @sample_rate =
  233.         CASE WHEN @EffectiveStatCount > 10000000000                        THEN 50    -- In PERCENT
  234.              WHEN @EffectiveStatCount BETWEEN 750000000 AND 10000000000 THEN 60
  235.              WHEN @EffectiveStatCount BETWEEN 500000000 AND  7500000000 THEN 70
  236.              WHEN @EffectiveStatCount BETWEEN 250000000 AND  5000000000 THEN 80
  237.              WHEN @EffectiveStatCount BETWEEN  50000000 AND  2500000000 THEN 90
  238.              ELSE 100
  239.         END
  241.     SELECT @sql_string = N'USE ' + @databaseName + ';
  242.         IF EXISTS (    SELECT    name
  243.                     FROM    sys.tables
  244.                     WHERE    [Object_ID] = ' + CONVERT(nchar, @ObjectID) + '
  245.                     AND        schema_id = 1)
  246.             UPDATE STATISTICS [' + @TableName + N'] WITH SAMPLE ' +
  247.                 LTRIM(RTRIM(CONVERT(nchar, @sample_rate))) + ' PERCENT';
  249.     SELECT @start_time = CURRENT_TIMESTAMP;
  251.     EXEC sp_executesql @sql_string;
  253.     UPDATE    [master].dbo.dba_UpdateStatsLog
  254.     SET        dateTimeStart = @start_time
  255.             , durationSeconds = DATEDIFF(second, COALESCE(@start_time, current_timestamp), current_timestamp)
  256.             , samplerate = @sample_rate
  257.     WHERE    UpdateStats_id = @UpdateStats_id;
  259.     PRINT @sql_string;
  261.     FETCH NEXT FROM stats INTO @UpdateStats_id, @ObjectID, @TableName, @EffectiveStatCount;
  263. END;
  265. CLOSE stats;
  266. DEALLOCATE stats;
  268. RETURN;

Identify All Computed Columns

Use the following query to identify all computed columns in a database:


List all computed columns
  1. SELECT    object_name(object_id) AS [Table Name]
  2.         , [name] AS [Column Name]
  3. FROM        sys.columns
  4. WHERE    is_computed = 1;
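If you also want to see the expression behind each computed column, the sys.computed_columns catalog view exposes it directly:

```sql
-- Computed columns with their definitions and whether they are persisted.
SELECT    object_name(cc.[object_id]) AS [Table Name]
        , cc.[name] AS [Column Name]
        , cc.[definition]
        , cc.is_persisted
FROM        sys.computed_columns cc
WHERE    objectproperty(cc.[object_id], 'IsUserTable') = 1;
```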


Optimizing Clustered Indexes for Data Warehouse Staging Tables

Most of my database experience over the past 15 years has been in data warehousing.  During this time the confusion about what to pick for table primary keys and more specifically, the clustered index, has either been hotly contested or, more typically, flat out ignored.

By “ignored” I don’t mean the  primary key doesn’t exist.  It’s more that little if any thought goes into it.  It’s “how we do it”.  In most cases the primary key is simply a surrogate column, which is a fancy way of saying it is an IDENTITY column that has absolutely no bearing on the data.

One class of tables in a data warehouse is the ubiquitous “staging” table.  These tables hold the data from the originating data source before any transformations are performed.  The only additional data in the staging tables is the surrogate column (something_id), which is almost always the primary key, and a column to indicate the date the data was loaded (load_date).

Other than the primary key, staging tables do not generally have any other indexes on them.  Once the data is loaded to the target table(s) the data in these tables is simply there as a means to check the correctness of the target load.

If a data warehouse is loaded daily, for example, we will be loading NEW records with an INSERT/SELECT.  Existing records will need to be UPDATEd, based on whatever unique combination of attributes applies.  Regardless, INSERTs and UPDATEs will be driven by the most current data in the staging table; more specifically, the rows with the most current load_date.

In this posting I will look at four (4) different clustered index scenarios for a staging table.  Each of the four scenarios will have the exact same data.  The four scenarios include:

  1. STG_Customer_IDENTITY: Primary key, clustered, on the surrogate key (IDENTITY column).
  2. STG_Customer_IDENTITY_Load_Date: Primary key, clustered, on the surrogate key and the load_date.
  3. STG_Customer_Load_Date_IDENTITY: Same as #2, except the load_date is the first column in the index with the surrogate key second.
  4. STG_Customer_Load_Date: A clustered index on load_date, but no primary key constraint.
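To make the scenarios concrete, here is a sketch of the DDL for scenarios 1 and 4 (the column names are hypothetical; scenarios 2 and 3 differ only in the column list of the primary key):

```sql
-- Scenario 1: clustered primary key on the IDENTITY surrogate.
CREATE TABLE dbo.STG_Customer_IDENTITY (
      customer_id    int identity(1,1)    NOT NULL
    , customer_name  varchar(100)         NOT NULL
    , load_date      smalldatetime        NOT NULL
    , CONSTRAINT PK_STG_Customer_IDENTITY
          PRIMARY KEY CLUSTERED (customer_id) );

-- Scenario 4: clustered index on load_date, no primary key constraint.
CREATE TABLE dbo.STG_Customer_Load_Date (
      customer_id    int identity(1,1)    NOT NULL
    , customer_name  varchar(100)         NOT NULL
    , load_date      smalldatetime        NOT NULL );

CREATE CLUSTERED INDEX CIX_STG_Customer_Load_Date
    ON dbo.STG_Customer_Load_Date (load_date);
```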


Manage File Space using DBCC SHOWFILESTATS

I recently had to allocate some new database files for an existing production database.  It can be confusing to keep track of filegroups, files, and the space available in each group / file.

The confusion is eliminated when running DBCC SHOWFILESTATS.  When concerned about available storage space left at the individual file level you’ll want to look at two columns:

  • Total Extents
  • Used Extents

Remember, an extent is eight 8k (8,192-byte) pages, or 64k total. 

To determine how much space is left in a data file:

Free Space = (Total Extents - Used Extents) * (8 * 8192) bytes.

Note: For large db files the multiplication may cause an arithmetic overflow.  To avoid this, cast the extent counts to bigint (or do the subtraction first and report the result in Mbytes or Gbytes).
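One way to script the calculation (a sketch; verify the DBCC SHOWFILESTATS column list against your SQL Server version before relying on it):

```sql
-- Capture DBCC SHOWFILESTATS output in a temp table, then compute
-- free space per file using bigint math to avoid the overflow.
CREATE TABLE #filestats ( FileId        int
                        , [FileGroup]   int
                        , TotalExtents  bigint
                        , UsedExtents   bigint
                        , [Name]        sysname
                        , [FileName]    nvarchar(260) );

INSERT INTO #filestats
    EXEC ('DBCC SHOWFILESTATS');

SELECT    [Name]
        , [FileName]
        , (TotalExtents - UsedExtents) * 64 / 1024 AS FreeSpaceMB
FROM    #filestats;

DROP TABLE #filestats;
```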



Take some of the mystery out of your physical database layouts with this  simple to use and understand DBCC command.

Stay curious; keep learning…




Mixing INNER and OUTER JOINs

Last month I posted on how the location of the predicate can change the optimizer’s resulting query plan.  The post was inspired by an article by Itzik Ben-Gan in SQL Server Magazine.  Another key takeaway from the article highlights the optimizer’s response to having an INNER JOIN that follows an OUTER JOIN.

In this post I will show how the query plan is not the intended plan and recap why this is the  case.  I know I have personally seen this happen in production code many times over the years and have to admit I never took the time till now to really investigate how the SQL engine handles these queries.

Query 1: All INNER JOINS

I’ll start with the most straightforward and least confusing query, which joins 3 tables: Contact, Order Header (master) and Order Detail.

  1. SELECT    TOP 50
  2.         c.FirstName
  3.         , c.LastName
  4.         , c.EmailAddress
  5.         , oh.SalesOrderID
  6.         , od.OrderQty
  7. FROM        Person.Contact c
  8.         INNER JOIN Sales.SalesOrderHeader oh
  9.             ON    oh.ContactID = c.ContactID
  10.         INNER JOIN Sales.SalesOrderDetail od
  11.             ON    od.SalesOrderID = oh.SalesOrderID;


This basic query will produce the following query plan.  The important part is to note the two Nested Loops Joins for the two join operators in the query.



Query 2 is a Bug: OUTER / INNER JOIN

Now let’s tweak our query a little bit.  In this case, we want to list contact information even if the contact never  placed an order.  Seems simple enough; we just change the join to the ORDER table to an OUTER JOIN.

List Contacts Even If No Order
  1. SELECT    TOP 50
  2.         c.FirstName
  3.         , c.LastName
  4.         , c.EmailAddress
  5.         , oh.SalesOrderID
  6.         , od.OrderQty
  7. FROM        Person.Contact c
  8.         LEFT OUTER JOIN Sales.SalesOrderHeader oh
  9.             ON    oh.ContactID = c.ContactID
  10.         INNER JOIN Sales.SalesOrderDetail od
  11.             ON    od.SalesOrderID = oh.SalesOrderID;


Here’s the query plan for this query:


But wait!  Something is wrong.  Our query has an OUTER JOIN in it but the query plan has two INNER JOINs.  In fact, this query plan is identical to the query that has only INNER JOINs.

For this query, the OUTER JOIN followed by the INNER JOIN is contradictory.  Basically the query optimizer has a couple of choices it can use to handle this query.  It can either enforce the OUTER JOIN and treat the INNER JOIN like an OUTER JOIN, or it can treat the OUTER JOIN like an INNER JOIN.  We can see from the query plan above that the optimizer converts the OUTER JOIN to an INNER JOIN.

Bug Override #1: Two OUTER JOINS

Fortunately there are at least a couple of ways to overcome this anomaly.   The first approach is to explicitly declare the INNER JOIN as an OUTER JOIN.

  1. SELECT    TOP 50
  2.         c.FirstName
  3.         , c.LastName
  4.         , c.EmailAddress
  5.         , oh.SalesOrderID
  6.         , od.OrderQty
  7. FROM        Person.Contact c
  8.         LEFT OUTER JOIN Sales.SalesOrderHeader oh
  9.             ON    oh.ContactID = c.ContactID
  10.         LEFT OUTER JOIN Sales.SalesOrderDetail od
  11.             ON    od.SalesOrderID = oh.SalesOrderID;

The following is the query plan produced by this query.


But there’s something still not quite right here.  The OUTER JOIN connecting the order  DETAIL record with the order HEADER isn’t exactly what we have in mind.  This could result in the contact information and the SalesOrderID being returned with a NULL OrderQty.  In the real world the MASTER and DETAIL records will always be connected, so adding this logic to satisfy the OUTER JOIN of the CONTACT with the ORDER records is a kluge.  If there are any orders that do not have DETAIL records they will be displayed by this query.

Bug Override #2: Separate Table Operators

In our final query we finally get what we are really looking for: a loosely coupled JOIN between CONTACTS and ORDERS, but tightly coupled ORDER master and detail records.  We accomplish this by having the JOIN between the order HEADER and DETAIL tables treated separately relative to the JOIN with the CONTACT table.  This is equivalent to the difference between:

A = B * C + D and A = B * (C + D).

Here’s the query:

Rank JOIN Precedence
  1. SELECT    TOP 50
  2.         c.FirstName
  3.         , c.LastName
  4.         , c.EmailAddress
  5.         , oh.SalesOrderID
  6.         , od.OrderQty
  7. FROM        Person.Contact c
  8.         LEFT OUTER JOIN
  9.             (Sales.SalesOrderHeader oh
  10.                 INNER JOIN Sales.SalesOrderDetail od
  11.                     ON    od.SalesOrderID = oh.SalesOrderID)
  12.             ON    oh.ContactID = c.ContactID;


The query plan FINALLY  gives us what we are really looking for, an INNER JOIN and an OUTER JOIN. 




Mixing INNER and OUTER joins can result in some unintended and unexpected results.  Checking the query plans can reveal nuances of the query that are contrary to the intended result set.  Specifically, an OUTER JOIN followed by an INNER JOIN effectively converts the OUTER JOIN into an INNER JOIN.  A close approximation where two OUTER JOINs are used may work, but it may also generate some unintended records if the data is not squeaky clean.

The  ultimate solution is to compartmentalize the ORDER header and detail JOINs by incorporating parentheses in the JOIN operators.  This attention to detail will produce the exact intended result sets every time.

Stay curious; keep learning…



Equality Filters: ON vs. WHERE

It’s been a while since I actually saw this topic “in the wild,” but we all know it comes up all the time.  Several months back I actually identified this kind of bug in production code.

In the September 2010 issue of SQL Server magazine Itzik Ben-Gan does a nice job (as always) of clarifying the use of predicate filters in the ON clause vs. the WHERE clause.

I teach a class to beginner SQL users so knowing the fine points of these type of topics helps answer the unexpected questions that students ask.  It’s also REALLY helpful to know these points when writing production code.  Being unaware can be dangerous (ignorance is not always acceptable).

So let’s pose the topic as a question: Should a predicate EQUALITY FILTER be in the ON clause or should it be in the WHERE clause?

Of course, as is usually the case with T-SQL, the answer is it depends…

Let’s start with a simple pair of queries that join two tables with an INNER JOIN.  The first query has the EQUALITY filter in the WHERE clause; the second has the filter in the ON clause:

USE adventureworks;
GO

SELECT    c.customerid
        , oh.SalesOrderID
        , oh.OrderDate
FROM        Sales.Customer c
        INNER JOIN Sales.SalesOrderHeader oh
            ON    oh.customerid = c.customerid
WHERE    oh.OrderDate = '7/31/2004';

SELECT    c.customerid
        , oh.SalesOrderID
        , oh.OrderDate
FROM        Sales.Customer c
        INNER JOIN Sales.SalesOrderHeader oh
            ON    oh.customerid = c.customerid
            AND    oh.OrderDate = '7/31/2004';

My standard response to the question is “it’s a matter of style”.  Sometimes that advice is not true.

In this case it is true.  Both of these queries are equivalent.  We can validate that by looking at the execution plans.


Based on the execution plans, it turns out we were correct: placing the equality filter in the ON clause vs. the WHERE clause is a matter of style.  My personal style is to place the filter in the WHERE clause (which is probably a throwback to my ANSI 89 coding days).  It turns out this is not the best choice in all cases.

If we modify our test queries to use an outer join we create a different scenario… the queries are no longer equivalent.

SELECT    c.customerid
        , oh.SalesOrderID
        , oh.OrderDate
FROM        Sales.Customer c
        LEFT OUTER JOIN Sales.SalesOrderHeader oh
            ON    oh.customerid = c.customerid
WHERE    oh.OrderDate = '7/31/2004';

SELECT    c.customerid
        , oh.SalesOrderID
        , oh.OrderDate
FROM        Sales.Customer c
        LEFT OUTER JOIN Sales.SalesOrderHeader oh
            ON    oh.customerid = c.customerid
            AND    oh.OrderDate = '7/31/2004';

If we run the two queries we get 40 rows in the first query and 19845 rows in the second query.  Obviously these queries are not equivalent at all.

Taking a look at the query plans…


From this we can see that not only are the plans strikingly different, but the first query is the same plan as the INNER JOINs above.  In essence, the first query gets converted into an INNER JOIN by the optimizer.

Why is this?  OUTER JOINs are characterized by one (LEFT and RIGHT JOINs) or both (FULL JOIN) tables being preserved.  Without the WHERE filter the LEFT side of the join is preserved; in this case, Sales.Customer.  The intended result set of this query SHOULD include ALL customers, regardless of whether they had an order on 7/31/2004 or not.  If they do NOT have an order then the order information would be NULL.  As written, the optimizer would create the result set with ALL customers and their order information.  The final step of the query execution is to filter the result set down to rows where ORDERDATE = ‘7/31/2004’, which discards the NULL order rows for customers without orders.

The optimizer recognizes that the least expensive plan for this scenario is an INNER JOIN, so that’s exactly how it puts together the plan.

The second OUTER JOIN query has the filter on the ORDER DATE on the RIGHT side of the LEFT JOIN.  By doing this the optimizer knows that all customers will be represented in the final result set but the ORDER information (RIGHT side of the JOIN) will only be populated if the EQUALITY filter is satisfied.


  • For INNER JOINs it doesn’t matter where the EQUALITY filter is located.  The query plans and result sets will be the same.
  • For OUTER JOINs, placing the equality filter in the WHERE clause causes the query plan to be converted to an INNER JOIN.  Probably not the intended behavior…
  • For OUTER JOINs, placing the equality filter in the ON clause (against the non-preserved side of the OUTER JOIN) will give the intended results.

If you don’t want to remember how this works, as a matter of style, ALWAYS PLACE YOUR EQUALITY FILTERS IN THE ON CLAUSE. (Which means I need to adjust my default style!)

This post will help me personally remember this important but subtle point.  As long as we maintain the concept of the PRESERVED side of the OUTER JOIN we don’t have to memorize this; we can understand it.

Please reference Itzik’s article in SQL SERVER Magazine for the inspiration behind this post.

As always, Stay Curious… Keep Learning…


Move Tables to New Filegroup

About a year ago I showed how to move data in a table to a new partition by modifying the clustered index.  Since the clustered index contains the data for the table, moving the clustered index is equivalent to moving the table.

This post is a simpler example of using the clustered index to move the data.  We recently had a need at the office to add more disk space to an existing database.  Rather than increasing the size of the existing filegroups we created one or more new filegroups.

The following script outlines how to do it.

1. Create db on PRIMARY (default) filegroup with a separate log file.

2. Create a secondary filegroup (logical file for db).

3. Add a physical db file to the secondary filegroup

4. Create a table with clustered index constraint on the PRIMARY filegroup and put some data in it.

5. Show the table is in the PRIMARY filegroup.

6. Drop the clustered index constraint and move it to the SECONDARY filegroup.

7. Show the table is now in the secondary filegroup.

<span class="kwrd">CREATE</span> <span class="kwrd">DATABASE</span> test_clustered_index
<span class="kwrd">ON</span> ( name = test1_dat
    , filename = <span class="str">'C:\Documents and Settings\turpind\My Documents\sql_files\tci.mdf'</span>
    , <span class="kwrd">size</span> = 10
    , maxsize = 50
    , filegrowth = 15% )
LOG <span class="kwrd">ON</span> (name = test1_log
        , filename = <span class="str">'C:\Documents and Settings\turpind\My Documents\sql_files\tci.ldf'</span>
        , <span class="kwrd">size</span> = 5
        , maxsize = 50
        , filegrowth = 10% ) ;

<span class="rem">-- Add a second file group</span>
<span class="kwrd">ALTER</span> <span class="kwrd">DATABASE</span> test_clustered_index
    <span class="kwrd">ADD</span> FILEGROUP SecondFG;

<span class="rem">-- Add a new file</span>
<span class="kwrd">ALTER</span> <span class="kwrd">DATABASE</span> test_clustered_index
    <span class="kwrd">ADD</span> <span class="kwrd">FILE</span> ( name = test2_dat
    , filename = <span class="str">'C:\Documents and Settings\turpind\My Documents\sql_files\tci2.ndf'</span>
    , <span class="kwrd">size</span> = 10
    , maxsize = 50
    , filegrowth = 15% )
<span class="kwrd">TO</span> FILEGROUP SecondFG;

<span class="kwrd">use</span> test_clustered_index
<span class="kwrd">go</span>

<span class="kwrd">CREATE</span> <span class="kwrd">TABLE</span> table1 (    id    <span class="kwrd">int</span>    <span class="kwrd">identity</span> (1, 1) <span class="kwrd">not</span> <span class="kwrd">null</span>
                        , datavalue    <span class="kwrd">char</span>(2000)
                        , <span class="kwrd">CONSTRAINT</span> [PK_table1] <span class="kwrd">PRIMARY</span> <span class="kwrd">KEY</span> <span class="kwrd">CLUSTERED</span> (id)
                            <span class="kwrd">ON</span> [<span class="kwrd">PRIMARY</span>] );

<span class="kwrd">INSERT</span> <span class="kwrd">INTO</span> table1     <span class="kwrd">VALUES</span> (replicate(<span class="str">'a'</span>, 2000));
<span class="kwrd">INSERT</span> <span class="kwrd">INTO</span> table1     <span class="kwrd">VALUES</span> (replicate(<span class="str">'b'</span>, 2000));

<span class="rem">-- so what does our database look like</span>
sp_helpdb test_clustered_index

<span class="rem">-- So what is on the  file groups?</span>
<span class="kwrd">SELECT</span>    <span class="kwrd">DISTINCT</span>(object_name(id))
        , filegroup_name(groupid)
<span class="kwrd">FROM</span>    sysindexes
<span class="kwrd">WHERE</span>    object_name(id) = <span class="str">'table1'</span>;

<span class="rem">-- ok, now move the table to the secondary filegroup</span>
<span class="kwrd">ALTER</span> <span class="kwrd">TABLE</span> table1
    <span class="kwrd">DROP</span> <span class="kwrd">CONSTRAINT</span> [PK_table1]
        <span class="kwrd">WITH</span> (MOVE <span class="kwrd">TO</span> SecondFG);

<span class="rem">-- Now what filegroup is the data on?</span>
<span class="kwrd">SELECT</span>    <span class="kwrd">DISTINCT</span>(object_name(id))
        , filegroup_name(groupid)
<span class="kwrd">FROM</span>    sysindexes
<span class="kwrd">WHERE</span>    object_name(id) = <span class="str">'table1'</span>;

The secret to the success of this script is the ALTER TABLE … DROP CONSTRAINT statement’s WITH (MOVE TO …) clause, which rebuilds the clustered index, and therefore the table’s data, on the target filegroup.
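As a side note, the same relocation can be sketched without dropping the constraint at all, by rebuilding the underlying index in place with DROP_EXISTING (the key definition must stay the same):

```sql
-- Rebuild the clustered index that backs PK_table1 on the new filegroup;
-- DROP_EXISTING keeps the constraint while relocating the data.
CREATE UNIQUE CLUSTERED INDEX [PK_table1]
    ON table1 (id)
    WITH (DROP_EXISTING = ON)
    ON SecondFG;
```

This avoids the window where the table briefly has no primary key, at the cost of a slightly less obvious statement.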

Stay curious…  Keep learning…



SQL Server and Muscle Memory

If you learned to play golf by taking lessons, the first swing the golf pro sets you up for feels awkward.  After a while, if you listened to your instructor, your swing becomes more natural: you can stop thinking about every aspect of the swing, you can grip it and rip it, and the ball’s flight is usually on target.

If you learned to play golf by hacking at the driving range without any instruction, you are enforcing the “bad practice makes a bad shot” rule.  After a while you figure out that your self-study efforts aren’t paying off, so now you go to a golf instructor with all of your bad swing habits deeply ingrained.

Maybe you don’t play golf so you can’t relate to a golf swing.

Fold your hands together as if you’re going to pray.  Don’t think about it, just do it.

Now take note… which thumb is on top of the other thumb, your RIGHT or your LEFT?  Whatever it is, change the way you interlock your fingers such that the OTHER thumb is now on top.

For starters you really have to think about it.  The ACT of folding your hands “the other way” takes a conscious effort.  How does it feel?  It’s awkward… it doesn’t feel right, and you instinctively want to fold your hands your “natural” way.

Both of these examples are examples of what kinesiologists call motor engrams.  Motor engrams are learned patterns of muscle movement.  Since the brain is responsible for all muscle movement, both conscious movement patterns such as picking up a pencil, and unconscious movement patterns such as the beating of the heart, motor engrams are ultimately all about training the brain to move the body.

If you do it often enough, you can learn new motor engrams.  In fact, researchers have determined that it takes 200-300 repetitions of a complex motor pattern, such as a golf swing, to make the movement pattern “effortless”.  Unfortunately for the self-taught golfer, it takes 3000-5000 repetitions of a movement pattern to “unlearn” an existing motor engram.  That explains why it’s so much more difficult to develop a new golf swing than to learn one from scratch.

So what does this have to do with SQL Server?

Every day I sit down to my computer and type in my password.  I’ll bet I type my password no less than 10 times a day.  Every time I change my password I have to go through that awkward period where I want to enter my last password and the new password doesn’t feel right yet.

How do you go about picking your passwords?  A combo of your dog’s and cat’s names?  A favorite movie or celebrity?  The day of the week?  Your birth date?

All those things you already know and that’s what makes them attractive for your password.

Why not take a different approach to password selection?  Why not pick your password as something you want to learn/remember?  Recently, I’ve been picking passwords that are based on DMVs and DMFs.  There are a ton of them and I have problems learning them because they tend to be rather long and, frankly, not intuitive.

Does sys.dm_io_virtual_file_stats really flow off of your fingertips?  How about sys.dm_exec_sessions?  For me they do now.  Why?  There was a period of time when they were part of my passwords (you’ll need to throw some case, numbers and special characters in there to make them more hardened).

Every time I enter my password I am training my brain to move my fingers in a pattern that will make me a better SQL developer.  Imagine being a great developer without having to think about it.

Another side benefit…  Changing my password is no longer drudgery.  I usually have a SQL keyword already lined up for me to learn.

Make your passwords a learning experience and develop the motor engrams to give you the SQL Server razor’s edge.

Stay curious… Keep learning…


Add Default Constraint to an Existing Column

This post is going to be short and sweet for two reasons:

1.  It is a short subject and

2.  I’m learning the Dvorak keyboard…

I recently needed to add a default constraint to an existing column.  It’s not something I have to do often, so when I do it I need to look it up.  The website Google sends me to has variations for several platforms, but the Microsoft version isn’t working for me.

So here’s one I can reference any time I need to…

<span class="rem">-- SQL 2005+</span>
<span class="kwrd">ALTER</span> <span class="kwrd">TABLE</span> f_response
  <span class="kwrd">ADD</span> <span class="kwrd">CONSTRAINT</span> new_contact_flag_def
     <span class="kwrd">DEFAULT</span> 0
       <span class="kwrd">FOR</span> new_contact_flag;
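One caveat worth remembering: a default added this way applies only to rows inserted after the constraint exists.  If the existing rows should carry the value too, backfill them explicitly (a sketch, reusing the table and column names above and assuming the existing rows are NULL):

```sql
-- Backfill rows that predate the new default constraint
UPDATE f_response
SET    new_contact_flag = 0
WHERE  new_contact_flag IS NULL;
```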

This Dvorak is going to take some serious practice.

Stay curious… keep learning…



Location of SQL Server Default Trace

This is one of those posts for myself.  Sort of like a self-note.

When things go wrong, especially unexpectedly, sometimes the only clue is the SQL Server default trace.  Where is it?

<span class="kwrd">SELECT</span> * <span class="kwrd">FROM</span> ::fn_trace_getinfo(0)

Now you know…  But do you have permissions to get to the file?  That’s another issue.
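A sketch that goes one step further and reads the events out of the trace file directly (assuming SQL 2005+ and that your account can read the file):

```sql
-- Look up the default trace file, then pull its most recent events
DECLARE @path nvarchar(260);
SELECT  @path = path
FROM    sys.traces
WHERE   is_default = 1;

SELECT TOP 20
        t.StartTime
      , te.name       AS EventName
      , t.DatabaseName
      , t.TextData
FROM    ::fn_trace_gettable(@path, DEFAULT) AS t
JOIN    sys.trace_events AS te
        ON t.EventClass = te.trace_event_id
ORDER BY t.StartTime DESC;
```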

For a more in-depth post on this topic see Adam Haines’ post.

Stay curious… Keep learning…



Parallel Query Plans

This post is actually less about parallel query plans and more about a recap of some lessons learned from a recent project.

I run an in-house T-SQL training class for new hires and other employees who either need to get up to speed with T-SQL or have a personal desire to learn.  In the very first class I show participants how to check the estimated plan cost for their queries.  The reason I do this, despite the contrary opinions of other developers, is to help prevent them from running a query that brings the server to its knees.

I’ve written more than a few of these queries over the years and can confidently say that I usually check the plans of new queries.

I also tell participants that if a query has a high plan cost, it will run slow.  The corollary is not true: if a query has a low cost it may STILL run slow.

Why is that the case?  The reasons are too numerous to list… there may be foreign keys involved that prevent a delete or update from completing while the keys are validated.  Statistics may be out of date.  Statistics samples may not represent the data.  Indexes can be fragmented.  Indexes can be missing.  Etc.

Recently I ran into a situation where a query with a fairly low cost, about 11, took forever to run.  It also consumed all resources making it challenging to get into the server even through the DAC.

Here’s my high level synopsis of what we did to correct the situation.

1.  Update the statistics on the tables involved in the query.  If possible, use the WITH FULLSCAN option to minimize the chance that the default sample size will produce statistics that misrepresent the data.

2.  With parallel query plans, it may end up taking more resources to combine the parallel threads than the benefit gained from running the threads in parallel in the first place.  If you suspect a problem with your parallel query, set MAXDOP to the number of physical processors.  Better yet, set MAXDOP = 1 to see if the query will finish with a single thread.

3.  If #1 and #2 don’t solve your problem, you may need to disable hyper-threading (HT) on your server.  There’s tons of expert info on HT, how it works, and why it can impair performance, so just take some time and do your homework.  To disable HT you’ll need to shut down the server and interrupt the boot process to change the BIOS settings.
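Steps 1 and 2 above can be sketched as follows (the table name is hypothetical; substitute the tables from your own problem query):

```sql
-- Step 1: refresh statistics with a full scan so the sample
-- accurately represents the data
UPDATE STATISTICS dbo.BigTable WITH FULLSCAN;

-- Step 2: re-run the suspect query single-threaded to rule out
-- parallelism as the culprit
SELECT COUNT(*)
FROM   dbo.BigTable
OPTION (MAXDOP 1);
```

If the MAXDOP 1 version finishes in a reasonable time, the cost of merging parallel threads is the likely problem and you can tune the degree of parallelism from there.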

In my case it took a combination of all of the above events to get the query to run in a predictable, reasonable amount of time.  It still is not great but at least it is predictable.

The next step…  Capture the query / ETL trace data and run the trace results through DTA.

The servers I encountered my challenges with were circa-2006 machines running Windows 2003 and SQL Server 2005 SP2.  One had 4 single-core processors; the other had 2 single-core processors.  Had we been running a newer OS, applied SP3 to SQL Server or been running SQL 2008, or had the latest generation of processors installed, we might never have had any problems, so the life of this post may be short.

Regardless, the lessons learned were hard won and worthy of a 20 minute post… if for nothing else, to remind me that even the simplest queries may not be so simple after all.

Stay curious… Keep learning.



Set up Dedicated Administrator Connection (DAC) before you need it!

When I was in elementary school we often had planned “fire drills”.  We would be told of the impending alarms, the alarms would go off, and we’d all exit the school in an orderly fashion using the prescribed path and exits.

Fast forward a hundred years and I’ve discovered why fire drills are important…  So that when the real deal happens we know exactly what to do and where to go.

We recently had a database server become overrun with one or more queries that sent it into a tailspin.  The classic definition of a good operating system, graceful degradation under heavy load, wasn’t holding.  The server was in such bad shape that our admins weren’t able to log into the machine to determine what was going on.

“Normal” connections didn’t even allow an sp_who2 to return its simple result set.

The time to conduct your own DAC fire drill is BEFORE you really need it.  Otherwise, it’s easy to forget that it even exists when the red phones are going off and panicked IMs and emails are coming in from all corners of the business.

For starters, let’s clarify one point:  The DAC is not a specialized account.  It is a connection.  So there is no special login account and / or password you need to remember.  Nor is it a secret back door into SQL Server.  The purpose of the connection is to increase the probability (not guarantee) that a user with ADMIN rights can connect to the server and start KILL-ing off offending sessions that are causing the server to spike.

It ain’t pretty… but can be pretty darn useful.

There are two ways to use the DAC.

1.  When connecting to the DATABASE ENGINE (not the server), prefix the server name with “ADMIN:”.  My personal laptop has an instance of SQL Server called DJF001.  Here’s how my connection to the DAC looks:


You’ll know you have a DAC connection by checking the connection description at the lower right of your query window:


2.  An alternative method that uses fewer system resources is through SQLCMD.   When connecting from SQLCMD just use the “-A” switch “SQLCMD -A”.  Note:  The Storage Engine book indicates to use “/A”… this probably works in 2005 but did not work for me in 2008.
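For example, connecting to my local DJF001 instance over the DAC with a trusted connection would look something like this (instance name from my laptop; substitute your own server\instance):

```shell
SQLCMD -A -S .\DJF001 -E
```

The -A switch requests the dedicated administrator connection, -S names the server and instance, and -E uses Windows authentication.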

The default setting only allows local DAC connections.  If you want to be able to use the DAC from a remote host just set the REMOTE ADMIN CONNECTIONS configuration option.

The DAC is not intended as a general purpose connection, and some commands, such as BACKUP and RESTORE, will not work over it.  Its purpose is to allow an administrator to connect to the machine when all other connection options fail.

Configuring Remote DAC: Unless you are running SQL Express edition, DAC is ready to roll by default.  However, if you want to use the DAC from a remote host you’ll need to configure it.  To configure remote DAC enter the following on the server you want to setup:

sp_configure <span class="str">'remote admin connections'</span>, 1

<span class="kwrd">GO</span>

<span class="kwrd">RECONFIGURE</span>

<span class="kwrd">GO</span>

Conclusion: The next time your server is running out of control think DAC and get connected.  In the meantime, run your own fire drill and make sure you know how to use it.


Inside Microsoft SQL Server 2005: The Storage Engine, Kalen Delaney.  Microsoft Press. 2007.

Microsoft SQL Server 2005: Implementation and Maintenance, Solid Quality Learning, Microsoft Press, 2006.

MSDN, Connecting to the DAC.

Stay curious… Keep Learning…