Side Effects / Benefits of Indexed Views

Indexed views, also known as materialized views, can offer significant performance benefits when properly implemented.  SQL Server BoL certainly highlights the benefits in great detail.

A co-worker, whom I have the highest regard for his technical / SQL abilities recently asked me to take a look at a query he was running that was behaving in a bizarre way.  Appealing to my ego, I eagerly spooled (so bad… sorry but couldn’t resist) my way to his desk to pitch in.

The query seemed simple enough…  A couple joins, and 2 columns in the result set.  The problem was that the WHERE clause had a WHERE column_value = 1 in it, but it wasn’t working.  To make it even more perplexing, the identical query was working on a copy of the database, on a different server.  The copy had an additional days worth of data in it; otherwise the two databases were identical.

We went through the typical checks… data types of the columns, indexes, consistency of the tables between the two databases.  We looked at the compatibility level of the databases.  We looked at the service pack levels of the two engines.  Everything matched up perfectly.

Finally, we looked at the actual query plan that was used.  Voila!  The database that was not returning the expected result set was using a view, an indexed view, to get the data; the other database that was working correctly was using the base table.

So here’s the lesson…  Indexed views may be substituted by the query optimizer if it is determined that the cost of the query will be less than using the base table.  (Note:  This only applies to Enterprise, Developer and Data Center editions of SQL Server).

In our particular case, the indexed view that was chosen by the optimizer filtered out the desired rows the query was intended to select.  The optimizer chose this view, probably because there were way fewer rows in the view than the base table.

Unfortunately for my associate, the database he was working on uses indexed views heavily.  I’ll even go as far as to say indiscriminately.  An additional challenge is that the data is accessed from the database with third party tools that may not have the ability to force the query plan to use the base table.  The good news is that the applications seldom access the data from the base tables directly, so the unintentional substitution of the incorrect index view for the base table may be the exception.

We identified 3 possible workarounds that will guarantee that the optimizer would not substitute the indexed view (would use the base table) even if the query plan cost is less:

1.  Use the table hint WITH (EXPAND).  Not a great option considering the third party reporting tools.  It would also need to be applied to every SELECT statement (and possible UPDATE and DELETE too!)

2.  Downgrade to SQL Server Standard edition.  This was not very popular…

3.  Set one of the SET OPTIONs that are necessary for the index view to be used by the optimizer, such as SET NUMERIC_ROUNDABOUT to ON, instead of OFF.  The other potential candidates are ANSI_NULLS, ANSI_PADDING, ANSI_WARNINGS, ARITHABORT, CONCAT_NULL_YIELDS_NULL and QUOTED_IDENTIFIER

We tested #1 and #3 and they both forced the query optimizer to utilize the base table instead of the indexed view.  Option #2 we’ll leave as an exercise for the reader.

I looked to see if there is an option to over-ride the auto substitution of the view for the base table when the view is created, ie., to emulate the view behavior as if the edition was STANDARD, but did not see anything.

SQL Server BoL describes the many benefits of using indexed views, including the benefit that the query optimizer will use an indexed view instead of the base table if the cost of the plan is the least.  Unfortunately, in our case, the view didn’t contain the data we needed.  I recall reading about horizontally partitioned indexed views, which at the time I interpreted as partitioned tables / views (partition function, partition scheme, etc), so I may need to go back and dig deeper in this area.  By creating a view that is filtering on a column’s value, we have essentially partitioned the base table.

Another cost of indexed views is that the data is actually stored.  By applying the clustered index we are creating a copy of the data.  So if you have several indexed views on a single table, you essentially have several copies of the data, along with the statistics of the data and the indexes and all of the overhead that goes with it.

It only seems reasonable that new records inserted into the base table will also result in inserted rows in the indexed view(s) that are SCHEMA bound to the table.  Likewise with DELETE and UPDATE operations… the cost of these operators will be multiplied by the number of indexed views.

And anyone who indiscriminately implements indexed views that call indexed views… layers of indexed views… should be publicly flogged.  That’s just my opinion and is unlikely to be deployed.

CONCLUSION:  The overuse of indexed views can be like winding a watch with a wrench.  It seems like it’ll work great but something will eventually break.  In a way it’s like the over use of triggers… the invisible stored procedures…  They have their place in production systems, but they should be implemented in moderation and with a full understanding of their positive and negative impacts.  Unintended side effects can have significant costs that totally negate the upside of their use in the first place.

Obviously, this is a big topic and this posting just raises some of the issues.  I’m certain others have hit this issue and have additional methods to avoid some of the ill-intended side effects of improper/over use of indexed views.

Charlotte SQL Server Users Group, Guest Speaker Kevin Kline

Just in time for the Holidays…  Kevin Kline has come to town.

Kevin’s topic at the Charlotte SQL Server Users Group weighed the pros and cons of SQL Server virtualization.  In this post I will note some of the many points that I learned.

Before I dive into the technology details I would like to add my personal comments. This was definitely one of the more enjoyable geek-meets I’ve been to in quite a while.  Our chapter president and by default program coordinator, Peter Shire, has done a superb job of bringing in some top notch SQL experts to our humble group but Kevin’s topics were in alignment with my own personal interests.  Not administration; not hard core programming.  It was more a topic that bridges admin and development.  Let me coin it DevelopMin.

Here are the high points of the meeting, from my perspective…

1.  If you have a bunch of SQL Servers lying around that are predominately low utilization, virtualization may make sense.  Another application for virtual servers would be a development shop that needs to certify their application against a variety of OSs, SQL server versions, service packs, etc.

2.  The overhead of virtualization has been improved since the initial releases.  Much of that improvement is in a reduced CPU utilization IFF you use current generation processors.  The new processors are VM-aware and will perform better in the virtual environment.

3.  If you’re configuring your virtual machines don’t let the software decide what resources to assign to each machine.

4.  Hyper-V was originally named Hyperv.

5.  A side discussion occurred when the topic of setting up the SAN on 64k blocks and how performance degradation can occur if disk alignment is out of sorts.  Apparently the newer controllers can be configured to buffer the data so in some cases, particularly with the newer controllers, this may not be an issue.

6.  In most cases, Microsoft will no longer make the customer reproduce a SQL Server bug in a non-virtualized machine before they will look at the bug.  We’ve come a long way baby.

7.  We got on the topic of how Microsoft clusters are good for availability but adds nothing along the lines of scalability.  I’m sorry, but I was spoiled in the mid-80s when I was working with VAX/VMS clusters (remember Digital Equipment, before they were bought by Compaq and HP?).  Back then, a node in a cluster shared the disk drives but added memory and CPU resources to the cluster.  Additional nodes in the cluster added processing power.  How does Microsoft get away with using the same terminology to mean something that is backward technology?  Why does the IT community put up with it?  Kevin mentioned several products that support the concept of scalable clusters… but I still wonder why it has to be provided by a third party at an additional cost.

8.  Several web sites were authentically referenced including Jimmy May at SQLCAT.com on disk alignment, Linchi Shea’s methodical analysis of disk arrays at sqlblog.com.  What’s really amazing is the depth of knowledge that there is out there.  As soon as you think you know something you find people like this that are in a completely different league.  It is humbling and inspiring, concurrently.

9.  Quest’s LiteSpeed database backup tool is used throughout Microsoft.  Lot’s of companies say their products are covertly used in the inner sanctums of Microsoft.  And Kevin said it on a Microsoft campus in front of Microsoft techs.  It must be true!  What I found intriguing about the product is that an errant implicit UPDATE or DELETE transaction can be rolled back from the transaction log.  Since we use LiteSpeed at my place of employment, and I have been known to skip the WHERE clause once or twice, I couldn’t wait to bounce this off my local admin.  Apparently this only works if the database is in FULL RECOVERY mode; our OLAP dbs are in SIMPLE, and should be.

9.  If you’re stuck in a rut, get out to a local user group meeting.  You’ll either be uplifted and inspired, or you can just go back to your rut.

I’m off to rating the presentation at speakerrate.com.

Happy SQL-ing!

Change column data type; change NULL constraint

Recently I’ve been working on performance issues relating to SELECT queries hitting some very large data warehouse stage tables.  In one instance I encountered a table that had over 300M records where the SELECT statement that reads from the STAGE table had an estimated execution cost of 3500.  I didn’t find this by accident… it was by far the longest part of the entire ETL process.

After doing some testing and validation in a development database I determined that I can get that query plan down do 3 just by re-indexing the table.  In this case the primary key constraint was on the surrogate key, which didn’t help at all when reading recently added data from the table.

Since it is still a work in progress I’ll save the details of the re-index process for a later post, when I have better empirical data.

I also noticed that the data type for the date column ([load_date]) that is being used by the SELECT statement to identify the most recently added records to the data store was a full blown 8-byte datetime.  The earliest date in the table only goes back to 2004 and the values stored in the table are just comprised of the month/day/year.  In other words, if the earliest date is not less than 1900 and we don’t need the 3ms time precision, why not go with a smalldatetime datatype?

At 300M records in the table, and a 4 byte per record savings, a simple conversion of the datetime to smalldatetime datatype will save 1,200M, or 1.2G  That’s just in the storage of this column.  That doesn’t factor in any savings in indexes, backups, tempdb and memory needed to process queries, etc.

Obviously, the database is not SQL Server 2008, otherwise we could have gone with the DATE datatype, which uses 3 bytes instead of the smalldatetime’s 4. 

The query to change the datatype of a column follows:

ALTER TABLE [dbo].[STG_my_table]
      ALTER COLUMN [load_date] smalldatetime NOT NULL;

This simple query accomplishes two objectives:

  1. Converts the   [load_date] column from it’s existing 8-byte datetime to 4-byte smalldatetime.
  2. Makes the column not nullable.

Please reference SQL Server Books Online for the details on which columns can be changed using this technique.  In general, if the data type converts with an implicit conversion it will work.

Also, be mindful that this query will require a large amount of transaction log space, it will take a long time for large tables (over 5 hours in the case of my 300M table) and will lock up the table making it unavailable to production users.

If you need to minimize the unavailability time of the table you’ll need to extract the data to a flat file, re-script the table, and reload the data from the flat file.  Not exactly a painless process.

Of course, it’s always possible to change the column type through SQL Server Management Studio, but if the task is going to be repeated for several tables it’s much easier to just T-SQL it out and go to bed.

Happy SQL-ing!

0 Comments

sp_start_job Runs Asynchronously

Fix a bug… discover another…  Hence the birth of this post.

I had a production SQL Agent job that called a sub-job using the sp_start_job system stored procedure (step 1).  The next step (step 2) involved a simple update of a configuration table date column.

My assumption was that when step #1 finished, step #2 would update the date and we’d be done.  The flaw in my thinking is that step #2, which happened to update dates that step #1 was using, would run AFTER step #1 finished.

Although not explicitly stated in BOL, the sp_start_job is asynchronous.  It has no concept of what the job that it is calling is doing.  Nor does it wait for the job to return a status.  It simply kicks off the job and that’s it.

The following will demonstrate the asynchronous nature of sp_start_job.

STEP 1: Create a job that can be called that will take a little time.  For this test case I’m just going to force the job to execute the WAITFOR to pause for a 1 minute delay.  The name of this job is Sub-Job.

 

image

 

STEP 2:  Create another job that will have two steps, Parent Job.  The first step will call the SUB-JOB job; the second step will do something.  We don’t care what the second step is.  Our goal is to see if the second step executes AFTER the first step completes.

 

image

 

Here’s the call to the sub-job in Step 1.

 

image 

 

The second step happens to be a call to EXEC sp_helpdb.  The only purpose of this call is to do something simple that completes in a short time period.

When we kick off the job it completes almost immediately.  Certainly less than the 1 minute wait that’s in step 1.

Let’s prove our concept by looking at the job history table in msdb:

SELECT  instance_id,
        step_id,
        step_name,
        message,
        run_date,
        run_time,
        run_duration
FROM    msdb.dbo.sysjobhistory
WHERE    instance_id >= 173
ORDER BY 1 ASC;

 

And here’s the results:

image

At instance 173 we see the call to the sub-job.  Note the run_duration = 0.  If the job was synchronous this step should have taken 60 seconds.  At line 174 we note the Step 2 kicking off.

Finally, on 176 we see the sub-job running and taking 1 minute (100).

Had the sp_start_job run synchronously line 173 would have had a duration of 1 minute instead of 0.

CONCLUSION:  The system stored procedure sp_start_job is asynchronous.  If you are using this procedure to kick off other jobs all subsequent steps in the calling job stream should be independent of the subordinate job.

This “feature” could be used to start up several jobs concurrently from a single scheduled agent job.  In most cases, however, the use of sp_start_job from SQL Agent should be used with extreme care.

Happy SQL-ing!

0 Comments

Letters and Money in ISNUMERIC may return 1

A colleague over the weekend sent a note out to the team requesting any kind of insight.  It turns out that he had a job fail because an instance of the T-SQL ISNUMERIC built-in function behaved in an unexpected way.

According to BoL: “ISNUMERIC returns 1 when the input expression evaluates to a valid numeric data type; otherwise it returns 0.”   Valid numeric data types include the four integer types (tiny, small, “regular” and big), decimal and numeric (which are the same), small money and money, and float and real (which are the same).

The money data types offer opportunities to have a whole host of money designations (such as the $, franc, yen, etc. ) to be embedded in the input string that may result in a false positive with ISNUMERIC.

In our particular example the input value was:

SELECT ISNUMERIC('107D084');

which returned 1.

It turns out that the expression ‘107D084’ is a valid scientific notation;  and is ‘107D-084’ too.

The D stands for DOUBLE FLOAT or the 64-bit version of E, which is SINGLE FLOAT.  The D and E can be either UPPER CASE or lower case.  Either will evaluate to a valid numeric expression.

What makes this even more interesting is that there is no DOUBLE FLOAT data type in SQL Server.  Regardless, the DOUBLE FLOAT is a valid numeric, even if we can not store it to that level of precision.

For the record, there are no other valid letters that can used to represent a scientific notion numeric. 

For a comprehensive dissertation on the isnumeric function check out John Magnabosco’s excellent blog contribution.

So what should we do when we get an unexpected scientific notation value in our data stream?  Chances are what you’re really trying to do is test for a valid integer value.  A quick Google search for ISINTEGER will return several peer reviewed user defined functions.

None of this is rocket science, but avoiding a process failure over the weekend is worth the extra effort to test for what you’re really looking for.

Happy SQL-ng!

0 Comments

Stuff You Have To Do vs. Stuff You Want To Do

I’m been bumping and grinding on a couple of data quality / cleansing jobs for so long I can barely remember when the last time was that I brought up an MSDN page to learn something new.  It’s just a matter of grinding out the various types of issues with the data.

All jobs have things that must be done.  After all, even SQL Server techies are really there for a business purpose, not some academic exercise to be on the prowl to learn new and cool things.  Let’s call the part of a job the must be done “stuff you have to do” or “have2do” for short.

If you are lucky, and like your job, your job will have things you want to do.  Lets call them “want2do” for short.  For me my want2do list is growing longer and longer.  I have big ideas to automate application dictionary files with Powershell; creating disk scrubbers with Powershell; vastly improving the performance of some huge tables via a sliding window partition scheme; converting a vast amount of DTS packages to SSIS; really tune my indexes; optimizing my databases by using minimal datatypes; using standardized user defined datatypes, etc.  The list never ends, and seldom ever shortens.  But it’s always characterized by things I have to learn; not the stuff I do 100 times a day.

I have learned that if you wait for the day to come where there are no have2dos before you start on your want2dos you’ll never get there.  Stuff ain’t gonna happen.

I’ve learned a strategy to at least get to my want2do list without neglecting my have2do list.  It’s called get over the guilt and steal some time.  Yes, it actually feels like I’m cheating someone when I divert off my “critical” tasks to tackle something where I might learn something new.  Over time, the feeling of “cheating” has subsided and I realize that I’m working in a more balanced mode.

Plus my manager loves it when I come up with something new.  Even if it’s not ready for prime (aka production) time.

If you’ve been at the same job for a long time, whatever that is to you, and you find yourself watching the clock and grinding out the day waiting to go home, it’s either time to move on to a new job where new challenges will be presented to you, or you need to find something new to challenge you.

My personal rule is: Once you get bored, you’re done.  It’s real tough to get yourself out of the rut.

If you are bored, it’s time to look inward and not outward.  I once heard a motivational speaker retort:  “Thump-ady-bump; thump-ady-bump.  It’s easy to blame the road.”  Blaming the job, the boss, your co-workers or your dog will not solve the root issue…  You’re bored and need a spark to fire your engines.

You might be overworked with have2dos; or your want2do list might be dead.  Either way it’s dooms day for your on the job satisfaction ratings.

I probably am operating on about a 80 percent have2do vs 20 percent want2do.  That seems to be a good balance of work and play, at least for now.

If you’re not learning… if you’re bored… stuck in a rut… what are YOU to do?

Happy SQL-ng!

0 Comments

INSERT to a table with an IDENTITY Column

I’ve been a SQL Practitioner since 1996 when 6.5 was the hot item.  Along the way I’ve learned a bunch of stuff about SQL Server.  Without a doubt, the more I learn, the more I know what I don’t know.  Being humbled on a regular basis is probably a good thing, but sometimes it’s just down right ridiculous what I don’t know.

Most of my database experience is in the data warehouse/mart space, where identity columns are used almost exclusively for the surrogate key.  There have been too many times to count over the years where I’ve had to load data into a table that had an identity column.

For the longest time I was manually typing all of the column names when loading data into a table.  Then I learned through Raman Renak’s article that you can capture the names for all the columns in a table through SSMS just by dragging it from the object explorer to the query window.  I thought that was a breakthrough!  All I had to do was not include the IDENTITY column from the dragged list of columns.

 

image

Now I’ve learned that when doing an INSERT to a table the IDENTITY column does not need to be explicitly listed as long as no other columns are specified in the optional column list.  This makes sense since you can’t specify the value for an IDENTITY column anyway.

Thanks to the folks at SQLServerCentral.com for this time saving lesson…  Rather than reproducing the lesson here just go the SQL Server Central for the original lesson plan!

http://www.sqlservercentral.com/Questions/Tags/T-SQL/T-SQL/

Happy SQL-ing!

0 Comments

Speed Up Startup of SQL Server Management Studio

Have you ever noticed how long it takes to start up SQL Server Management Studio (SSMS)?  I always assumed it was because it is the biggest and most powerful tool in my toy box and it deserves the load time.

You can SIGNIFICANTLY reduce the load time by avoiding Microsoft’s marketing message at startup, the splash screen.

image

To automatically skip over the splash screen follow these simple steps:

1. START ==> RIGHT click the SSMS icon ==> Properties

2. Click the Shortcut tab.

3. Add the option -nosplash at the end of the target string, outside of the quotes.  The final string should look something like:

“C:\Program Files\Microsoft SQL Server\90\Tools\Binn\VSShell\Common7\IDE\SqlWb.exe” -nosplash

4. Hit OK.

5. Start up SSMS and watch your wait time drop!

This also works with MS Visual Studio.

I got this tip from the August 2008 issue of SQL Server magazine, authored by Raman Renak.  The article is worth a review…

http://sqlmag.com/Article/ArticleID/98764/sql_server_98764.html

Happy SQL-ing!

Table Partitioning via the Clustered Index

Today’s post will demonstrate how to partition an existing table without disturbing the existing data in the table by taking advantage of it’s clustered index.

SQL Server Books Online (BOL) is laced with interesting tidbits of information but sometimes falls short of delivering on the specifics.  Let’s face it, there’s no way to demonstrate every combination of options and switches available.  With every new release of SQL Server the number of option combinations is growing exponentially.

Our post today is premised off the following statement in BOL for the ALTER TABLE command:

You can drop the clustered index and move the resulting table to another filegroup or partition scheme in a single transaction by specifying the MOVE TO option.

In this case, as far as I can tell, the details of HOW to move the table to another filegroup or partition scheme is left as an exercise for the reader.  Since I’m the reader and I like to exercise, here we go…

Before diving into the details, here’s the high level route we will take:

    1 Create a new database with a single file group.  This is straight out of the manual… RIGHT CLICK ==> CREATE DATABASE
       
    2 Create a table and populate it with data.  We’ll use a copy of the AdventureWorks.Person.Address table.  The key to this working is that the table has a clustered index on it.
       
    3 Setup the partitioning function and scheme.
       
    4 Use ALTER TABLE to move the clustered index to the new partition scheme.
       

For starters I created a test database called SwitchTest.  It is a stock, out of the box db right from SSMS.  The key point is that all data resides on the PRIMARY filegroup.

image

Now we’ll create a table that’s modeled after the AdventureWorks.Person.Address table and populate it with the data. 

USE SwitchTest;
GO

CREATE TABLE dbo.[2B_Partitioned] (
        [AddressID] [int] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
        [AddressLine1] [nvarchar](60) COLLATE Latin1_General_CS_AS NOT NULL,
        [AddressLine2] [nvarchar](60) COLLATE Latin1_General_CS_AS NULL,
        [City] [nvarchar](30) COLLATE Latin1_General_CS_AS NOT NULL,
        [StateProvinceID] [int] NOT NULL,
        [PostalCode] [nvarchar](15) COLLATE Latin1_General_CS_AS NOT NULL,
        [rowguid] [uniqueidentifier] ROWGUIDCOL NOT NULL CONSTRAINT [DF_Address_rowguid] DEFAULT (newid()),
        [ModifiedDate] [datetime] NOT NULL CONSTRAINT [DF_Address_ModifiedDate] DEFAULT (getdate()),
        CONSTRAINT [PK_AddressID] PRIMARY KEY CLUSTERED (
            [AddressID] ASC )WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
    ) ON [PRIMARY];

-- Populate the Address table from Adventure Works...
INSERT INTO dbo.[2B_Partitioned] (    AddressLine1,
                                    AddressLine2,
                                    City,
                                    StateProvinceID,
                                    PostalCode )
    SELECT    AddressLine1,
            AddressLine2,
            City,
            StateProvinceID,
            PostalCode
    FROM AdventureWorks.Person.Address;

 

This resulted in 19, 614 rows being loaded to our 2B_Partitioned table.  Also note the clustered index on Address_ID!

Now let’s add a second file group, FG2,  to our database.  Since my tired laptop only has a single drive I’ll create it on the same drive as the primary file group.  If this were a production system we would prefer to add the second file group to a different physical drive.

USE [master]
GO
ALTER DATABASE [SwitchTest] ADD FILEGROUP [FG2];

ALTER DATABASE [SwitchTest]
    ADD FILE (    NAME = N'FG2',
                FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\DATA\FG2.ndf',
                SIZE = 2048KB ,
                FILEGROWTH = 1024KB ) TO FILEGROUP [FG2];
GO

 

Setting up the partitioning in SQL Server is a two step process.  Create the partition function; create the partition scheme.

USE SwitchTest;
GO

CREATE PARTITION FUNCTION Move2Part_part_funct (int) AS
    RANGE LEFT FOR VALUES (10000);
GO

CREATE PARTITION SCHEME Move2Part_part_scheme AS
    PARTITION Move2Part_part_funct TO ([Primary], [FG2]);

 

Since the 2B_Partitioned table has about 20,000 records, this partition function will split the data in half.  For more info on setting up partitions please refer to BOL

And now for the work horse of the task.  We’ll use the ALTER TABLE to drop the clustered index and in the same query, move the data to the partition scheme:

USE SwitchTest;
GO

SET STATISTICS TIME ON

ALTER TABLE dbo.[2B_Partitioned]
    DROP CONSTRAINT [PK_AddressID]
    WITH (MOVE TO Move2Part_part_scheme ([AddressID]));

 

How do we know we just partitioned the data?  Look at it!

SELECT    $partition.Move2Part_part_funct([AddressID]) AS [Partition Num],
        count(*) AS [Row Cnt]
FROM    dbo.[2B_Partitioned]
GROUP BY $partition.Move2Part_part_funct([AddressID])
ORDER BY $partition.Move2Part_part_funct([AddressID]);

Using the $partition command we can see that we have 10,000 rows in the first partition and the remaining 9,xxx in the second partition:

 

image

 

The last step towards making this table production ready is to re-create the primary key:

ALTER TABLE dbo.[2B_Partitioned]
    ADD CONSTRAINT [PK_AddressID] PRIMARY KEY ([AddressID]);

CONCLUSION:  SQL Server offers a myriad number of ways to get a job done.  Today we demonstrated how to use a table’s clustered index to convert a non-partitioned table to a partitioned table.  The secret of the task is the WITH (MOVE) clause of the ALTER TABLE command.

For a historical and detailed discussion on table and index partitioning please review Kimberly Tripp’s white paper.

 

Happy SQL-ing!

7-Zip Output Argument

This is a real quickie but it cost me dearly this weekend.  My task was to call 7-zip from within SSIS and extract the contents.

Here’s the basic example that’s all over the Internet:

7z e archive.zip -oC:\Doc

Where:

7z is the executable

e is the argument to perform an extract

archive.zip is the file to be opened and extracted

-o is to extract the output to and

C:\Doc is the extract directory

I’m not much of a DOS command line type but I do have a fair amount of UNIX, Linux, BSD experience.  I’ve used all the tools in the unix space including but not limited to PERL, grep, awk, sed, vi, …  It turns out this experience may have worked against me.

I have never seen a situation where if an argument has no white space between it and the argument value it breaks the command parser.

Bottom line, with 7-zip –o C:\Doc will generate a command line error while -oC:\Doc (no space between “o” ane “C”) works fine.

When it comes to T-SQL I tend to use lots of white space.  Those habits do not necessarily have a positive carryover to 7-Zip.

Happy SQL-ing!

0 Comments

Column DIFF between two tables

I recently had a situation where I needed to compare table columns between the production table and a test/development table.  Chances are if I needed to do it someone else will to.

I like using the EXCEPT SET OPERATOR.  Some people might prefer to use an OUTER JOIN to determine the column differences.  With the EXCEPT operator the sequence of the queries are important.  Think of it as “What’s in the first query that’s not in the second query?”.  Reversing the order of the queries will return a NULL result set.

For this posting I’ll use AdventureWorks.  Let’s first create a table that’s the same as the [HumanResources].[Department] table, except we’ll add the column CreatedDate.

CREATE TABLE [HumanResources].[Department_2](
    [DepartmentID] [smallint] IDENTITY(1,1) NOT NULL,
    [Name] [dbo].[Name] NOT NULL,
    [GroupName] [dbo].[Name] NOT NULL,
    [ModifiedDate] [datetime] NOT NULL,
    [CreatedDate] [datetime] NOT NULL,
 CONSTRAINT [PK_Department_DepartmentID_2] PRIMARY KEY CLUSTERED
(
    [DepartmentID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

 

Now we can check for the newly added column:

SELECT name FROM syscolumns WHERE OBJECT_NAME(id) = 'Department_2'
EXCEPT
SELECT name FROM syscolumns WHERE OBJECT_NAME(id) = 'Department'

 

Which produces our expected output:

image

Of course syscolumns also has the data type information in it so it’s not a huge leap to take this same concept to identify columns that do not have the same data type.

Happy SQL-ing!

Use a View to Control the Size of UPDATE Transactions

In several of my production databases I have to deal with tables that have hundreds of millions of records.  Querying these tables are challenging; doing updates to them can be nightmarish.

I recently had to add a column to a table that had 65 million records.  The table is an aggregate table so the new column needed to be loaded from a table that had 180 million records.

Doing an unmanaged UPDATE would only serve to fill the transaction log and waste a bunch of time and system resources.

After doing a little tinkering I was able to break the transactions into pieces by using a view.  First let’s layout the setup for this.

CREATE TABLE tmp_top_view ( id           int   identity (1, 1),
                            ContactID    int,
                            test_bit     bit,
                            loopcount    int );

INSERT INTO tmp_top_view ( ContactID )
    SELECT    ContactID
    FROM    Person.Contact;

CREATE VIEW v_tmp_top_view AS
    SELECT    TOP 500 id
    FROM    tmp_top_view  WITH (NOLOCK)
    WHERE    test_bit is null;

The tmp_top_view table would be the production table we want to update.  In a production situation it would have tens of millions of records.  Certainly more than you would want to have in a singe UPDATE transaction.

The test_bit  column is the target column we want to update from some other table, or just set to a value.

The loop_count column is just something we’ll use to show how many records get updated per transaction.

The INSERT statement is just populating the table with some dummy data from the AdventureWorks Person.Contact table.  In this case we are inserting 19,972 records into our table.

So far nothing interesting…

The CREATE VIEW is a key part of the solution.  You’ll notice the TOP 500 in the SELECT statement of the view.  This is literally setting the max size of the UPDATE transactions. 

The next key piece of the puzzle is the WHERE clause that filters on NULL test_bit.  This column will be updated in our query, below.  When the column becomes non-NULL it will drop out of the view.

Also, very important, notice the use of the NOLOCK hint.  Without this hint the batch will run much slower.

OK, we’re all setup.  Now let’s get some work done:

DECLARE @rowcount    int,
        @loopcount    int;

SELECT  @rowcount = 1,
        @loopcount = 1;

WHILE @rowcount > 0
BEGIN
    BEGIN TRAN

    UPDATE  t
    SET     test_bit = 1,
            loopcount = @loopcount
    FROM        tmp_top_view t
            INNER JOIN v_tmp_top_view v
                ON v.id = t.id;

    SELECT  @rowcount = @@rowcount,
            @loopcount += 1;

    COMMIT;
END

Observer the test_bit column getting set to one.  The value that we are setting it to does not matter.  What’s important is that it’s no longer NULL.  Once it stops being NULL, it stops being a member of the view.

Fortunately, in practice, the newly updated record doesn’t drop out of the view immediately.  If it did the view would keep adding records to itself and the entire 19,972 record dataset would be updated in one transaction, which is exactly what we are trying to avoid.

In this sample set the data gets updated with 40 transactions; the first 39 are 500 record transactions; the last transaction updates 472 records.

To see how the records were updated use the following query:

SELECT  loopcount,
        COUNT(*) AS [Count of loopcount]
FROM        tmp_top_view
GROUP BY loopcount
ORDER BY loopcount;

 

Finding ways to neatly break up large transactions is an art as well as a science.  What I didn’t discuss here is the 2 hours I spent trying to figure out the optimal transaction size to get all 64M records to update in the fastest time.

A key piece of the performance ingredient is to use the NOLOCK hint in the view definition.

I attempted to populate an indexed temp table to capture a subset of the main table (instead of the TOP 500 view) but found the time to write to the temp table and manage the index was too high.

I took a swag at using an indexed view but that doesn’t fly because of the use of the TOP in the view definition.

What solutions have you devised to break up large UPDATE transactions?  Have you found anything unique about DELETE vs UPDATE vs INSERT transactions?  Send me a pointer to your solutions!

Happy SQL-ing!

Download a file from the web using SSIS

There are some tasks, no matter how seemingly simple, I just would not want to have to do again.  Today’s post is about one of them…

At work we use a data cleansing tool that purchased from the USPS (Post Office) called NCOA (National Change of Address).  All of the big list brokers use the NCOA to update old addresses and keep their lists as current as possible.

One of the requirements for the NCOA is a daily delete file, that is used to make minor changes to the local NCOA database.  As a NCOA service bureau we are licensed and even required to capture the DAILYDEL.DAT file by downloading it from the USPS’s web site.

Since no one in their right mind is going to take on a task of downloading any file on a daily basis, its curious why the USPS doesn’t just issue a script to go get the file.  So I’ll share my code with anyone that needs it.

To be fair I figured this one out with much support of numerous posts from other people.  Since it’s been a while since I got this running I can not be specific as to who you are.

So here’s the objective of the task:   Go to the USPS web site and download a file every day.   The website URL and the file name do not change.  Pretty simple…

Unfortunately SSIS doesn’t have a native transform for HTTP.  As unbelievable as it is, the only way I could figure how to do it was with a EXECUTE SCRIPT task.

Here’s the code:

' Microsoft SQL Server Integration Services Script Task
' Write scripts using Microsoft Visual Basic
' The ScriptMain class is the entry point of the Script Task.
Imports System
Imports System.IO
Imports System.Text
'Imports System.Windows.Forms
Imports Microsoft.SqlServer.Dts.Runtime

Public Class ScriptMain
Public Sub Main()

        ' Get the unmanaged connection object.
        Dim nativeObject As Object = Dts.Connections("HTTP Connection to USPS").AcquireConnection(Nothing)

        ' Create a new HTTP client connection
        Dim connection As New HttpClientConnection(nativeObject)

        ' Save the file from the connection manager to SBDQS01
        Dim filename As String = "\\dqs01\NCOALink\DAILYDEL.DAT"
        connection.DownloadFile(filename, True)

        Dts.TaskResult = Dts.Results.Success
    End Sub
End Class

If you need to capture the NCOA DAILYDEL.DAT file just grab this code, drop it into an EXECUTE SCRIPT, change the path where you want to store it, and voila, you’re done.

I just scheduled a SQL Server Agent job to go get the file every night just before midnight (everything starts at midnight, right?).  The download finishes in about 5 seconds.  It takes me longer than that to wake up my mouse in the morning.

Happy SQL-ing!

0 Comments

Write Row Count to Flat File Using SSIS: Part III

This is the third and final part to this thread on how to capture the number of rows being processed in SSIS to a flat file.

In Part I we learned how not to do it and why it didn’t work.

In Part II we showed how to use the ROW COUNT transformation to populate an SSIS variable, use the DERIVED COLUMN transformation to convert the value of the variable to a column of data, and write the row count to a flat file.  We learned that we needed to use two data flows to do this.  The biggest benefit of this approach is that we can use the same technique of using the DERIVED COLUMN transform to write any SSIS variable out to a flat file, such as an audit file.

In this post we will show a simpler way to capture the row count of a transaction.  The downside is that it will not work for any variable in the package, as in the case of the DERIVED COLUMN.

So let’s get to work…

1.  We’ll only need a single data flow for this solution.  And here’s what it looks like:

image

What’s new here is the use of the AGGREGATE transformation circled in red.  Basically all this transform is doing is the classic T-SQL COUNT(*).

2.  To configure the AGGREGATE TRANSFORM just select the (*) input column.

image

3.  The last step is to map the output of the AGGREGATE to the input of theFLAT FILE DESTINATION transform, or any other target transform you need.

The following execution of the plan shows how the AGGREGATE transform takes the 940 rows and outputs the singe row count row that we were looking for.

image

CONCLUSION:  SSIS provides a powerful set of options when doing just about any task.  That certainly holds true even when doing something as simple as capturing the number of rows in a data flow.  To capture the row count all you really need to do is use the AGGREGATE transform.  If you’re more interested in being able to write any SSIS variable out than use the DERIVED COLUMN transform.

Write Row Count To Flat File Using SSIS: Part II

In yesterday’s post we pulled a Thomas Edison and showed one of 10,000 ways how not to write the number of records being processed using the ROW COUNT transformation.

We thought the process was straight forward, but it didn’t work.  Our approach was:

1.  Create a query that processed some records.

2.  Send the result set of the query to the ROW COUNT transform, to count the rows.

3.  Since the ROW COUNT transform populates a variable, use a DERIVED COLUMN transform to place the variable into a virtual table column.

4.  Write to a flat file destination the value of the ROW COUNT variable.

The result set was a zero for every row that the original query generated.  The reason for the zeros is because the ROW COUNT transform does not populate the SSIS variable with the row count until the DATA FLOW completes.

The second problem with the output was that we didn’t want 188 copies of the row count, we only wanted one.  That means we need to find a way to process 188 records but write out the row count only once.

Today’s post will build on what we learned yesterday to create a package that will actually work.  The trick of making the ROW COUNT transformation usable is to access the variable OUTSIDE of the DATA FLOW that populates it.

Here’s what the control flow will look like for our working package. 

image

The general principal behind this is the first data flow will use the ROW COUNT transformation to populate a RowCount variable.  The second data flow will use the local variable set by the ROW COUNT transform to write the value, one time, to a flat file.

In a production package the first data flow would probably write out the data of an extract file; the second data flow would be used to write out a control file, containing the number of rows expected in the data file and any other related information, such as the extraction date.

Remember, the ROW COUNT transform does not actually set the value of the variable until all the data has passed through.  Hence, the need for the second data flow.

Now let’s take a look under the hood of each of the data flows.

DATA FLOW 1:  Here’s what the first data flow looks like:

image

We re-used yesterday’s query, which is a query that selects about 2 percent of the rows from the AdventureWorks Sales.Customer table.  Then we pump the result set into a ROW COUNT transformation that sets the user variable RowCount variable.

The configuration details of these two transformations were posted in yesterday’s Part I post.

One additional lesson that one might get from this… The ROW COUNT transform can be used as a terminator for your data flow without actually doing anything.  It’s simple to setup and is an option when testing other data source functionality.

DATA FLOW 2:  The second data flow is only slightly more interesting so I’ll go into slightly more detail on it’s configuration.

image

The DERIVED COLUMN and FLAT FILE DESTINATION transforms are literally cut and paste from yesterday’s single data flow approach that didn’t work.  Nothing needed to be changed.

The concept behind this data flow is to map the RowCount user variable that was set AFTER the first data flow exited to a data column and write the value of that column to the output file.

At first I attempted to build this data flow without the OLE DB data source.  Although the package ran, no data was written to the output file.  The reason is because the DERIVED COLUMN transform needs some sort of data driver, even though we only want to utilize the RowCount variable.

In order to get the RowCount variable to be written out I prefixed the DERIVED COLUMN with an OLE DB transform.  Here’s what’s in the OLE DB transform:

image

It doesn’t get any simpler than this.  This simple query does two things:

1.  Generates an input data stream that will allow the DERIVED COLUMN to fire.

2.  Generates a single row so the ROW COUNT will only be written out once.

Just for completeness, here’s the configuration of the DERIVED COLUMN transform:

image

Again, it doesn’t get any simpler than that.

CONCLUSION  This post shows how to use the ROW COUNT transformation, utilizing 2 distinct data flows, to write the row count variable out to a flat file.  Although the focus has been on the RowCount variable, the use of the DERIVED COLUMN transform can be used to write ANY variable out to ANY destination.

In my next post I’ll show how to write the row count of a transform to a flat file without the use of the ROW COUNT transformation at all.

Happy SQL-ing!

0 Comments

Writing Row Count to Flat File in SSIS: Part I

Yesterday I had a simple enough task at hand.  An existing EXTRACT that loads a result set to a DATA file needed to have a corresponding COUNT flat file that contained the number of rows in the extract’s data file.  Not being the most savvy at SSIS I did what anyone else would do get to started: I Googled my request.

I found several examples of using a script to capture the row count and write it out to a file.  After a couple of hours of many false starts I consulted with my resident SSIS Expert and all around go to guy, Matt Austin, who gave me a couple of ideas to work with.

Today’s post will demonstrate how NOT to do it and will explain why this solution does not work.  I don’t know about you but I usually have 2 – 3 wrong ways before I get a good one.

Thankfully, this only applies to SSIS.  T-SQL is usually more or less successful on the 1st try, possibly with a slight modification from my first idea.

How Not To Do It #1: Use the SSIS ROW COUNT Transformation

SSIS has a built-in, ready to go, out of the box transformation called ROW COUNT.  Basically what it does is set the value of a variable to the number of records passing through it.  This was my first approach and it seemed like such a simple task to write the value out to a flat file, until I got to the point of writing the file.

So here’s how I did it:

1.  Create an OLE DB query to pull some records out of a database.  In this case I’ll use AdventureWorks.  I’ll grab about 2 percent of the 19,000+ records in the Sales.Customer table using TABLESAMPLE.

image 

2.  Create an INTEGER variable called RowCount.

3.  Add a ROWCOUNT transform after the OLE DB. Configure the ROWCOUNT transform to store the results to a variable.

image

3.  Add a DERIVED TABLE transformation and attach it after the ROWCOUNT transformation. 

4.  Configure the DERIVED TABLE transform to use the RowCount variable by dragging the USER:RowCount variable down to the “Derived Column Name”.

image

5.  Finally, add the flat file destination and configure the flat file destination to wherever you want the file to be created.  Configure the mappings to include only the RowCount; exclude the other columns:

image

6.  Run the package.  The final product will look like this:

image

7.  Check the output file.  Oh know!!!!  There are 188 zeros in it!

So what went wrong?

1.  We still have 188 rows.  To get this to display just one row with the value of the RowCount variable we could add the Row Sampling transform, which is equivalent to the T-SQL TOP command.  Set the row sample size to 1.

2.  But we still have a problem.  The row count is still zero!  The problem is that the variable that gets set by the ROWCOUNT transformation does not get set until the last row passes through the data flow.  In other words, the user variable is not usable until the data flow exits.  Please reference the MSDN page for the ROW COUNT transform for details.

One of those 2 issues alone might seem like a show stopper.  But not quite and the learning process is working in all its glory.

Tomorrow we will modify this package to give us the row count to the flat file that we are looking for.

Lastly, we will modify the package to output any SSIS variable to a flat file.

Happy SQL-ing!

0 Comments

My Five Reasons for Blogging

My wife of 24 years recently asked me “Why blog?  Are you looking for a new job?”

At first I didn’t have a good answer for her other than “If I were looking for a new job it would be too late for blogging”.  After all, nobody blogs their way into a job overnight.  If you told me I had to write 500 decent blog postings to get a new job I think I’d pass.  It’s a heck of a lot easier to stay right where I’m at.

After a few days of reflection on this innocent question here’s my answer, my final answer, for now, to myself…

For my own consumption.  Anyone who blogs because they think they’re doing something special, or because they know something 95 percent of the rest of the population in their field is in for a rude awakening.  Let’s face it.  It’s rare when a new idea is borne on the Internet.

Ask any fitness professional “What are the two most important ingredients for success?”  Number one is to have someone or something hold you accountable.  Accountability can come in the form of a trainer, coach, teammates, friend waiting for you at the local gym, or the dog waiting for his nightly walk when you come in tired from the day at the office.

The second most important ingredient for fitness success is to journal what you eat and what you do.  Without the journaling you will lose focus of your goals and quickly go back to your old habits.

Fortunately for database professionals we have it a little easier than the chronically overweight desk jockey.  Our blog, or journal, is often a chronology of things we did with a database.  Often times these are things that are worth remembering that we can go back to any time or any place.

Sure, there are probably at least a dozen places on the internet you can go to get the query to count rows in a table using sysindexes.  But when you read your own blog, in your own words, it’s more like watching the Wizard of Oz for the umpteenth time.  It’s more of a recollection process than a re-learning process.

Those who can do; those who can’t teachThe first time I heard that expression years ago my initial reaction was that it was funny and true.  After all, how hard can teaching be?  You already know the answers?

Therein lies the rub.  First of all teachers don’t always know the answers.  Even after years of teaching the same subject new waves of students will have new ways of challenging the established way of thinking.  In order to be a good teacher you not only need to be the subject matter expert, but you have to be able to break down complex topics into simple building blocks that can be easily digested by your student audience.  And when that approach doesn’t work, the good teacher must be prepared to instruct the lesson in a different manner.  When you are able to do that, you have mastered the subject.

Some of the best educators I had in college, and fortunately for me I’ve had many, are still revered in my mind.  These professionals made it possible for me to graduate, one project at a time, by making the learning process a series of incremental gains in knowledge and subject matter skill.  So here’s where I can name names: Professor Jay Gorham (Calculus, Hudson Valley Community College) who made the language of calculus palatable; Professor Jeff Schoonmaker, (Physics, Hudson Valley CC) who made physics cool;  Professors David Bozak and Doug Lea (computer science, SUNY Oswego) who made computer science fun and inspired us with their wit and ease;  Professor David Hemmingdinger (algorithm analysis, Union College) who just blew away his first class at Union with his amazing intelligence; Professor Ralph Jones, Boston College (switching theory) who showed me how hard simple math can really be.  The young professor at Union college that taught artificial intelligence… Sorry I can’t remember your name…  But on our first day of class he asked everyone to tell them their name and from that point forward he remembered everyone’s name without EVER referencing a piece of paper, even if we switched seats.

This is beginning to sound like the Emmys…  These are just the ones I can remember… there were more and I regret not being able to recognize them all here.

Back to topic… Blogging is like teaching a mini-lesson.  Like my teaching masters from my past, the preparation and organization of thoughts that go into a blog reinforces what I already know.  By the end of the blog I know the topic that much better because I’ve just articulated the nuances of the topic as accurately and succinctly as possible (that’s the goal anyway).

“Education is what remains after one has forgotten everything learned in school…”, Paul Chek, How to Eat Move and Be Healthy.  Without a doubt that is one of my all time favorite quotes.  Even though it comes from a health lifestyle and fitness book the meaning is profound and far reaching.  I would be hard pressed to be able to solve even the simplest calculus or physics problem today, but the growth from the process of learning remains in tact.

Blogging is an attempt to keep a little more of what was learned on the job.  In the end, the more that’s left over, the better the value we will all offer to our employers.

I wanna be like MikeWho doesn’t?  Who doesn’t want to be easily recognized as tops in their field?  Why bother being in it otherwise?  Unless you’re just coasting to retirement… 

Sure, if I had my choice I’d rather not go to work everyday.  Unfortunately in order to maintain the lifestyle that I’ve grown accustomed to (home with heat and running water, insurance of all ilk’s, two cars in the garage, food way beyond the basics, going out, etc.) I have to go to work to make $$$.

So if you’re going to go to work, why not be the best you can be?  Michael Jordan made a lot of money playing basketball.  And money is a good incentive.  But the real reward was the awareness that he was the best.  The best in the business and the best he was genetically capable of.

If you’re lucky, all you need to do is develop your skills to your genetic ability, and the best in the business will follow.  If not, at least you did your best.  And you’ll feel good for it.

If you’re a doorman for a luxury apartment house, be the best.

If you sell envelopes, be the best at (Harvey McKay did pretty good for himself)…

If you’re a day care teacher, be the best you can be…

I just happen to be a database weenie.  So I’m always trying to make myself a better database weenie.

To be the best, do what the best do.  In the SQL Server world the best of the best, at least we are so lead to believe, are the MVPs.  These guys and gals are always journaling their journey.  If I want that kind of job security, personal satisfaction and monetary rewards that ultimately follow, I must follow the examples and advice of the leaders.

Steve Jones, from SQL Server Central, advised with a great graphic at a Charlotte SQL Server User’s group to become the red umbrella in a sea of blue (it was a field of blue umbrellas with one red one).  That red umbrella belongs to the person who has made it to the top of their chosen field.  To be that red umbrella holder, you must write, write, write.

A great source of inspiration on how to be the best at what you do is Brad McGehee’s ebook: How to Become an Exceptional DBA.  As soon as you finish reading this post go download Brad’s book.  It’s free and it just may be the kick in the pants you need to really step it up.  Blogging is one of the easiest forms of writing and is a good starting place.  It’s not the end game; but it is participation.

A blog a day keeps the remissions awayI’m still working on this one… but by blogging regularly it will become easier.  Sometimes that means writing something short and sweet.  They all don’t have to be projects (this one was supposed to be a Sunday afternoon quickie).

Blogging is like exercise.  A 15 minute jog once a day is much better than a 2 hour run on Sunday.  They both need to be high enough on the priority list that they get done daily, or pretty close to that.  They need to be woven into the fabric of the day and not thought of as appendages to be done when everything else is finished.

OK, before this gets too carried away…  Let me wrap up by stating what blogging is not:

It’s not…  Letting the world know how smart I am.  I’m not a brain surgeon and I know that and don’t pretend otherwise.

It’s not…  A research paper every time.  Taking everything on as a big project spells doom over the long haul.  Don’t make it harder than need be.  Follow the KISS principal.

It’s not…  Journaling something that no one else in the world knows.  There’s nothing I and only I know that the rest of the caring world doesn’t know.  The internet makes information ubiquitous.  Don’t think you’ve got anything on the rest of the world.  Ain’t gonna happen.

It’s OK…  To be me and write what I think even if it doesn’t make the front page of the Wall Street Journal.  It is what it is….

Now for the $50,000 question…  If anyone stumbles on this…  Why do you blog?

Happy SQL-ing!

0 Comments

My Contribution to the TSQL Challenge #14

What does one do on a Friday night after going out with some friends and having a couple of mugs of suds?  How about doing a T-SQL puzzle?

Jacob Sebastian and a team of MVPs and high-powered database gurus are running a series of challenges.  It’s the closest thing to getting an assignment in college.  The challenges serve no purpose other than to hurt your head and make you think through a problem in a variety of ways.

The winners are chosen based on a variety of parameters such as #reads and writes, the execution plan, etc.  The prize is braggers rights… Sorry, no checks.

Anyway, I took a stab at #14, which is the first time I paid any attention to this.  I hope this isn’t addictive because it’s 3AM on a Saturday right now and I’ve got stuff to do tomorrow.

For a complete description of challenge #14 please go to http://beyondrelational.com/blogs/tc/archive/2009/09/28/tsql-challenge-14-identify-the-longest-sequence-of-characters-in-a-string.aspx.  Those guys did a much better job than me describing the problem.

In general, for a given series of input strings that are stored in a table variable, determine the starting position and length of all of the consecutive series of single characters, and print the results.

I’m posting the results to my blog knowing that someone just might be tempted to snatch my solution and improve it.  It a compact piece of code, using a recursive CTE.  I figure if someone can take my stuff and improve it, all the power to you.  Besides, lets face it, my blog isn’t even known out there.

Anyway, here’s the code:

DECLARE @t TABLE (Data VARCHAR(40) )

INSERT @t (Data) SELECT '9992EDC6-D117-4DEE-B410-4E5FAE46AE97';
INSERT @t (Data) SELECT '0BFC936B-BD9A-4C6A-AFB2-CF3F1752F8B1';
INSERT @t (Data) SELECT '4A73E7EB-7777-4A04-9258-F1E75097977C';
INSERT @t (Data) SELECT '5AAF477C-274D-400D-9067-035968F33B19';
INSERT @t (Data) SELECT '725DA718-30D0-44A9-B36A-89F27CDFEEDE';
INSERT @t (Data) SELECT '8083ED5A-D3B9-4694-BB04-F0B09C588888';

-- Algorithm:    Use a recursive CTE to pop off the last character of a string.
--                Along the way determine sequences of characters and their length.
--                Order the output by the strings with the highest sequences DESC.

WITH string_cte (    [Data_full],            -- Carry the original string 
                    [Data],                    -- This sting will be shortened at the end
                    [Char],                    -- Pull off the last character from the string
                    [Pos],                    -- Position of the character
                    [Length],                -- What is the length of consecutive characters?
                    [Cluster_ID] ) AS        -- Used to identify clusters of consecutive characters
(    SELECT    t.[Data],
            t.[Data],
            SUBSTRING(t.[Data], LEN(t.[DATA]), 1) AS [Char],
            LEN(t.[DATA]) AS [Pos],
            1 AS [Length],
            1 AS [Cluster_ID]
    FROM  @t t

    UNION ALL

    SELECT    cte.[Data_full],
            SUBSTRING(cte.[Data], 1, LEN(cte.[Data]) - 1),
            SUBSTRING(cte.[Data], LEN(cte.[Data]) - 1, 1) AS [Char],
            [Pos] - 1,
            CASE
                WHEN SUBSTRING(cte.[Data], LEN(cte.[Data]), 1) = SUBSTRING(cte.[Data], LEN(cte.[Data]) - 1, 1)
                    THEN [Length] + 1
                ELSE 1
            END AS [Length],
            CASE
                WHEN SUBSTRING(cte.[Data], LEN(cte.[Data]), 1) = SUBSTRING(cte.[Data], LEN(cte.[Data]) - 1, 1)
                    THEN [Cluster_ID]
                ELSE [Cluster_ID] + 1
            END AS [Cluster_ID]
    FROM        string_cte cte
    WHERE LEN(cte.[Data]) > 1 )

SELECT    cte.[Data_full],
        cte.[Char],
        MIN(cte.[Pos]) AS [Pos],
        MAX(cte.[Length]) AS [Len]
FROM        string_cte AS cte
        -- This derived table is used for one purpose: 
        -- To rank the full string by the max # of consecutive characters
        INNER JOIN (SELECT    [Data_full],
                            MAX([Length]) AS sort_order
                    FROM        string_cte
                    GROUP BY [Data_full] ) d
            ON    d.[Data_full] = cte.[Data_full]
WHERE    cte.[Length] > 1
GROUP BY cte.[Data_full],
        cte.[Char],
        cte.[Cluster_ID],
        d.sort_order
ORDER BY d.sort_order DESC,
    cte.data_full ASC,
    MIN(cte.[Pos]) ASC;

And here’s the output:

image

I’ll look forward to the winning solutions to see how they did it.  The only way to really learn from these challenges is to do or at least attempt to do the challenge.  Without doing the actual work it’s like looking at the solutions to the NY Times crossword puzzle…  It’s easy when you know the answer.

Thanks to the creative and energetic souls who are hosting this challenge.

Happy SQL-ing!

0 Comments

How Long Are Your Jobs Running?

Today’s post is going to be short and sweet.  About a year ago I put together a procedure that allows me to see the run times of all of the SQL Agent jobs on a database server.  Some might argue that I got a little carried away when I added the standard deviation and variance statistics, but I can counter that they do have meaning.

Knowing the amount of time a job takes is relevant.  But knowing how much deviation from the “norm” is also useful.  Even more useful, would be to plot the job run times over time.  In this context it might be interesting to spot jobs that are taking longer to run as the data sets grow.

Although I did not attempt to handle the time based plot, getting the Agent runtimes has value.  You could take it one step further and place the call to the stored procedure in an excel file and send it to your manager.  All she would have to do is hit Data==>Refresh.

Just a little background.  For starters, the SQLAgent information is stored in the msdb database.  We’ll capture the job name out of the sysjobs table and the job run times out of the sysjobshistory table.

Regardless, here’s what the code looks like:

USE [msdb]
GO

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

CREATE PROCEDURE [dbo].[usp_jobs_summary_report]
AS

SELECT    j.[name],
        COUNT(jh.run_duration) AS 'Sample Size',
        CONVERT(DECIMAL(4, 2),
            MIN((    (CONVERT(int, SUBSTRING(STR(jh.run_duration, 6, 0), 1, 2)) * 3600) +
                    (CONVERT(int, SUBSTRING(STR(jh.run_duration, 6, 0), 3, 2)) * 60) +
                     CONVERT(int, SUBSTRING(STR(jh.run_duration, 6, 0), 5, 2)))/3660.0))
                        AS 'MIN Runtime (hours)',
        CONVERT(DECIMAL(4, 2),
            MAX((    (CONVERT(int, SUBSTRING(STR(jh.run_duration, 6, 0), 1, 2)) * 3600) +
                    (CONVERT(int, SUBSTRING(STR(jh.run_duration, 6, 0), 3, 2)) * 60) +
                     CONVERT(int, SUBSTRING(STR(jh.run_duration, 6, 0), 5, 2)))/3660.0))
                        AS 'MAX Runtime (hours)',
        CONVERT(DECIMAL(4, 2),
            AVG((    (CONVERT(int, SUBSTRING(STR(jh.run_duration, 6, 0), 1, 2)) * 3600) +
                    (CONVERT(int, SUBSTRING(STR(jh.run_duration, 6, 0), 3, 2)) * 60) +
                     CONVERT(int, SUBSTRING(STR(jh.run_duration, 6, 0), 5, 2)))/3660.0))
                        AS 'AVG Runtime (hours)',
        CONVERT(DECIMAL(4, 2),
            STDEV((    (CONVERT(int, SUBSTRING(STR(jh.run_duration, 6, 0), 1, 2)) * 3600) +
                    (CONVERT(int, SUBSTRING(STR(jh.run_duration, 6, 0), 3, 2)) * 60) +
                     CONVERT(int, SUBSTRING(STR(jh.run_duration, 6, 0), 5, 2)))/3660.0))
                        AS 'Std Dev.',
        CONVERT(DECIMAL(4, 2),
            VAR((    (CONVERT(int, SUBSTRING(STR(jh.run_duration, 6, 0), 1, 2)) * 3600) +
                    (CONVERT(int, SUBSTRING(STR(jh.run_duration, 6, 0), 3, 2)) * 60) +
                     CONVERT(int, SUBSTRING(STR(jh.run_duration, 6, 0), 5, 2)))/3660.0))
                        AS 'Variance'
FROM        sysjobs j
        INNER JOIN
            sysjobhistory jh
                ON    jh.job_id = j.job_id
WHERE    jh.run_status = 1    -- Succeeded
AND        jh.step_id = 0        -- Outcome
GROUP BY j.name
ORDER BY 5 DESC

 

No rocket science here.  Just plain old SQL.

And here’s the useless output from my notebook instance of SQL Server.  I’m happy to be getting the one row I got.  Try running this on a production server… then it’ll have real value.

image

Happy SQL-ing!

0 Comments

Table Row Counts

The easiest way to count the number of rows in a table, from the user perspective, is to use the simple COUNT(*) operator.  A more efficient method to count the number of records in a table is to use SQL Server’s system tables.

Let’s start out with the naive approach and see how it performs.

USE AdventureWorks;
GO

SET STATISTICS TIME ON

DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;

SELECT    COUNT(*)
FROM    Sales.SalesOrderDetail;

 

The SET STATISTICS will provide us with some simple run time performance metrics.

The two DBCC commands ensure us that we are clearing the query plan and data from cache, so we know we’re getting worst case performance results each time, as if it is the first time the query is executed.

When we run this query we learn that there are 121,317 rows.  Here’s the runtime on my tired old laptop:

SQL Server Execution Times:

  CPU time = 40 ms,  elapsed time = 199 ms.

We also learn that the query has to do a non-clustered index scan on all 121k+ rows in the table.  The estimated total subtree cost of this query is 0.376 (not shown in the image, you would need to hover the sprite over the SELECT operator).

 

image

 

Now let’s look at the system table approach to counting the number of rows in a table:

DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;

SELECT    OBJECT_NAME(id),
        rowcnt
FROM        sysindexes
WHERE    OBJECT_NAME(id) = 'SalesOrderDetail'
AND        indid = 1;

 

This query returns the same 121,317 row count but the runtime is less than 1ms:

SQL Server Execution Times:

   CPU time = 0 ms,  elapsed time = 0 ms.

Looking at the execution plan this query does a CLUSTERED index scan against the 125 row sysindex table:  The estimated total subtree cost for this comparable query is .0075 which is just 2 percent of the 0.376 required for the naive approach.

 

image

Now here’s where we really make this exciting (OK, a little overstated)…

When loading a data warehouse it’s often nice to be able to monitor the progress of a table load.  Unfortunately, while the table is being loaded the entire table is probably locked up.  So if you attempt to do a COUNT(*) on the table the query just sits there.

If you use the sysindexes approach you’re not even reading from the table that’s being loaded, and your row count is returned almost immediately.

Now let’s stretch out the benefit of this simple system table query one more time.  Let’s use it to return the row count of every table in the database:

SELECT    OBJECT_NAME(id),
        rowcnt
FROM        sysindexes
WHERE    indid = 1
ORDER BY 2 DESC;

This little query returns the row count for all 125 tables in less than 1 ms.  Try doing that with a singe query (ie., not looped or cursor driven… ugh!!!) with the COUNT operator!

The total subtree cost for this query for 125 tables is .02076; much less than the .376 to run the naive query for just 1 table.

I captured the pic of the total estimated subtree cost below:

image

 

I hope you are convinced, the system table approach requires a bit more typing but is MUCH more efficient to count rows in a table, particularly if the server is not idling away.

Happy SQL-ng!