Location of SQL Server Default Trace

This is one of those posts for myself.  Sort of like a self-note.

When things go wrong, especially unexpectedly, sometimes the only clue is the SQL Server default trace.  Where is it?

SELECT * FROM ::fn_trace_getinfo(0)

 

Now you know…  But do you have permissions to get to the file?  That’s another issue.

For a more in-depth post on this topic see Adam Haines’ post.

Stay curious… Keep learning…

Dave

0 Comments

Parallel Query Plans

This post is actually less about parallel query plans and more about a recap of some lessons learned from a recent project.

I run an in-house T-SQL training class for new hires and other employees who either need to get up to speed with T-SQL or have a personal desire to learn.  In the very first class I show participants how to check the estimated plan cost for their queries.  The reason I do this, despite contrary opinions of other developers, is to possibly help them to prevent from running a query that brings the server to it’s knees.

I’ve written more than a few of these queries over the years and can confidently say that I usually check the plans of new queries.

I also indicate to participants that if a query has a high plan cost, it will run slow.  The corollary of that is not true; If a query has a low cost it may STILL run slow.

Why is that the case?  The reasons are too numerous to tell for sure… there may be foreign keys involved that are preventing a delete or update to complete as the keys are validated.  Statistics may be out of date.  Statistics samples may not represent the data.  Indexes can be fragmented.  Indexes can be missing. etc.

Recently I ran into a situation where a query with a fairly low cost, about 11, took forever to run.  It also consumed all resources making it challenging to get into the server even through the DAC.

Here’s my high level synopsis of what we did to correct the situation.

1.  Update the statistics on the tables involved in the query.  If possible, use the WITH FULLSCAN option to minimize the chance that the default sample size will not result in a misrepresented statistics.

2.  If running parallel query plans, it may end up taking more resources to combine the parallel threads then the benefit of running the threads in parallel in the first place.  If you suspect a problem with your parallel query set the MAXDOP = the number of physical processors.  Better yet, set the MAXDOP = 1 to see if the query will finish with a single thread.

3.  If #1 and #2 don’t solve your problem you may need to disable hyper-threading (HT) on your server.  There’s tons of expert info on HT, how it works, why it can impair performance, so just take some time and do your homework.  In order to disable HT you’ll need to shutdown the server and interrupt the boot up process to change the BIOS settings.

In my case it took a combination of all of the above events to get the query to run in a predictable, reasonable amount of time.  It still is not great but at least it is predictable.

The next step…  Capture the query / ETL trace data and run the trace results through DTA.

The servers I encountered my challenges with were circa 2006 machines running Windows 2003, SQL Server 2005 SP2.  One had 4 single core processors; the other had 2 single core processors.  Had we been running a new OS, applied SP3 to SQL Server or running SQL 2008, or had the latest generation processors installed, we might never have had any problems, so the life of this post may be short.

Regardless, the lessons learned were hard won and worthy of a 20 minute post… if for nothing else, to remind me that even the simplest queries may not be so simple after all.

Stay curious… Keep learning.

Dave

0 Comments

Set up Dedicated Administrator Connection (DAC) before you need it!

When I was in elementary school we often had planned “fire drills”.  We would be told of the impending alarms, the alarms would go off, and we’d all exit the school in an orderly fashion using the prescribed path and exits.

Fast forward a hundred years and I’ve discovered why fire drills are important…  So that when the real deal happens we know exactly what to do and where to go.

We recently had a database server become over run with one or more queries that sent the server into a tail spin.  The classical definition of a good operating system, graceful degradation under heavy load, wasn’t working.  The server was in such bad shape that our admins weren’t able to log into the machine to determine what was going on.

“Normal” connections didn’t even allow an sp_who2 to return it’s simple result set.

The time to conduct your own DAC fire drill is BEFORE you really need it.  Otherwise, it’s easy to forget that it even exists when the red phones are going off and panicked IMs and emails are coming in from all corners of the business.

For starters, let’s clarify one point:  The DAC is not a specialized account.  It is a connection.  So there is no special login account and / or password you need to remember.  Nor is it a secret back door into SQL Server.  The purpose of the connection is to increase the probability (not guarantee) that a user with ADMIN rights can connect to the server and start KILL-ing off offending sessions that are causing the server to spike.

It ain’t pretty… but can be pretty darn useful.

There are two ways to use the DAC.

1.  When connecting to the DATABASE ENGINE (not server), precede “ADMIN:” before the server name.  My personal laptop has an instance of SQL Server called DJF001.  Here’s how my connection to the DAC looks:

image

You’ll know you have a DAC connection by observing your connection description at the lower right of you command window:

image

2.  An alternative method that uses fewer system resources is through SQLCMD.   When connecting from SQLCMD just use the “-A” switch “SQLCMD -A”.  Note:  The Storage Engine book indicates to use “/A”… this probably works in 2005 but did not work for me in 2008.

The default setting only allows local DAC connections.  If you want to be able to use the DAC from a remote host just set the REMOTE ADMIN CONNECTIONS configuration option.

The DAC is not intended as a general purpose connection and all commands, such as backup and restore, will not work.  It’s purpose is to allow an administrator to connect to the machine when all other connection options fail.

Configuring Remote DAC: Unless you are running SQL Express edition, DAC is ready to roll by default.  However, if you want to use the DAC from a remote host you’ll need to configure it.  To configure remote DAC enter the following on the server you want to setup:

sp_configure 'remote admin connections', 1

GO

RECONFIGURE

GO

Conclusion: The next time your server is running out of control think DAC and get connected.  In the meantime, run your own fire drill and make sure you know how to use it.

References:

Inside Microsoft SQL Server 2005: The Storage Engine, Kalen Delaney.  Microsoft Press. 2007.

Microsoft SQL Server 2005: Implementation and Maintenance, Solid Quality Learning, Microsoft Press, 2006.

MSDN, Connecting to the DAC.

Stay curious… Keep Learning…

Dave

Add DEFAULT Constraint to an Existing Column

It doesn’t happen very often but I recently needed to add a default value to an existing column.  Of course MSDN provided me everything I needed but figuring it out from the railroad diagrams took a few tries.

To add to the confusion, the command for SQL Server Compact Edition is different than it is for for Enterprise.  Of course I found Compact Edition code first but needed Enterprise, not knowing they were different.

The code for COMPACT edition follows…  It’s untested…

-- Compact edition.

ALTER TABLE f_response ALTER COLUMN new_contact_flag SET DEFAULT 0;

The code for Enterprise edition… tested…

-- SQL 2005 Enterprise
ALTER TABLE f_response ADD CONSTRAINT new_contact_flag_def
            DEFAULT 0 FOR new_contact_flag;

The next time I need to add a default constraint, here it is.

What discoveries have you made that are worth saving?

Stay curious… keep learning

Dave

0 Comments

Atlanta SQL Saturday #41 from the Rear View Mirror

Just got back from the Atlanta SQL Saturday.  In this case the title is a literal and figurative one, since I just spent 4.5 hours looking at ATL from my car mirrors as I made the commute back to Charlotte.

This was my second SQL Saturday event and once again I was not disappointed.  In fact, I’m kind of jacked…

There were many similarities to Charlotte’s SQL Saturday in April and there were some differences.  The one thing they had in common is they were both great events, from both a speaker and attendee perspective.

Speaker Dinner: Apparently it is tradition for the main sponsor to provide a “speakers dinner” on the Friday night before the main event.  The dinner was held at Bahama Breeze, which was a short and straight forward drive from the recommended hotel.  Making the dinner event easy to find from the hotel is a key point for the non-local speakers.  Not knowing your way around Atlanta, even the suburbs, can be real intimidating if you don’t have a GPS.  The traffic can be very heavy, the lights are long and if the daylight has expired, it can be tough to read the road signs.

I left Charlotte on Friday afternoon and arrived in the greater Atlanta area at 5; the peak Friday rush hour crush.

I’ve made the mistake before of arriving in Atlanta at 5PM on a Friday and paid the price of taking an hour to get through the last 10 miles on the outer beltway, I-285.  Fortunately I was able to never get on the beltway (the way Google Maps suggested) and cut over from exit 107 off I-85, cut through a suburb highway 120 to my hotel destination.  (When I went home I took the Google all-interstate approach and, even with no traffic issues, it seemed much longer).

When I arrived at the restaurant at 7PM the front parking lot was full so I swung around to the rear lot.  On my way to an empty slip I noticed a bunch of computer-nerd-looking folk on a umbrella festooned deck on the side of the building.  Sure enough, it was my crowd.  Never even went inside of the restaurant… there was a foot ramp that took me right up to the food, festivities, and for the most part, new colleagues.

There were a couple of familiar faces…  Kendall Van Dyke from Orlando and Chris Skorlinski from Charlotte.  The atmosphere of the event was very relaxed and the weather was perfect to be hanging out on an outside deck.  Had I not known any better I might have guessed that I was at an ocean side event.

I was promptly greeted by the main organizer, Stu Ainsworth, and the main sponsor reps, Nadine (sorry about the last name) and Janis Griffin from Confio.  I’ve been attending business networking events for years… but it can still be a bit uncomfortable walking into a room or patio full of “strangers”.  The greetings were very welcoming… thank you all.

Even though my intent is to meet everyone there, I basically failed to get past the first table, as usual, and quickly settled into a table with a couple of local gents…  Tejas Patel and Mark (again, sorry about the last name).  In doing so I got well acquainted with Tejas and Mark but missed the mark of the event… to meet all.

Unlike the Charlotte event where the speakers dinner was basically “order anything you want off the menu” of a high end restaurant, the Atlanta event was a buffet of hot, heavy finger food.  It was good stuff (not necessarily from a nutrition perspective, hahahah) and there was plenty, but I tend to not eat right in these networking scenarios.  That being said, I was actually quite hungry after the long drive and probably would have eaten more had I had a dinner instead of a buffet.  I made one trip to the buffet and spent too much time BS-ing with my new acquaintances, and eventually, with Chris from Microsoft.

After dinner and a couple of frosty beers, Stu passed out the speaker shirts, another apparent tradition with SQL Saturdays, and gave us a good rundown of the event.  Stu really did a great job of not only organizing the event, but making everything perfectly clear for us speakers, whom on event day, are supposed to know what’s going on.

The shirts were a very nice golf / polo shirt with the SQL Saturday logo emblazoned on the left breast.  It didn’t explicitly say “Speaker” which will make it a little appropriate to wear in the general public.

Facility: The event was at Microsoft’s sales office in Alpharetta, a not so sleepy bedroom community on the north side of metro Atlanta.  The drive from the hotel was less than 2 miles.  There were ample, professionally printed signs on the roads to guide the mis-guided visitor right into the parking garage.  Once in the parking garage it wasn’t obvious which floor to take the elevator… 1 or 4.  Fortunately I picked up and that was correct.

Going into the facility still was a little uncertain… the door was unlocked but there was no sign of life that this was the correct place.  Once in the building, however, the sounds of a bustling reception area guided me to the main reception.  The line was not too long and the volunteers were cheerfully greeting the guests, issues name badges, issuing some paperwork, and handing out nice screen printed SQL Saturday tee shirts.

The event was held on the 3rd floor, and the elevators where right there.  The only clue I had that the event was on another floor was I saw people taking them while standing in line.  I had to ask one of the volunteers at the desk which floor to take, since I didn’t see any signage indicating which floor to go to.  A minor oversight at worst…

Once on the event floor we were greeted by SQL Server MVP Geoff Hiten who showed up the “speakers room” and a room where the books that were to be used for raffles were.  I grabbed Itzik Ben-Gan and Lubor Kollar’s book T-SQL Querying for my session’s raffle.  I then went into the speakers room and dropped off my computer and raffle book.

The speakers room was a good concept and I noticed that many of the speakers were busily in there working away on, what I assume, their presentations.  They also ate their lunch in there.  I did neither.  For starters it seemed like a sort of “serious” atmosphere, and it also appeared somewhat disconnected from the rest of the conference / attendees.  Sort of like the press room at a golf tournament or perhaps the teachers lounge in a high school.  The idea of a place to drop off my computer and raffle book was a welcome idea that I appreciated.

The sponsor’s display area, as is probably the typical case, was in the main thoroughfare but seemed a bit crowded.  Some of the sponsors in the back may have had a challenging time getting people to visit their table.  Since I wasn’t a sponsor I’ll let them speak to this issue as they see fit.

The classrooms for the sessions consisted of three “large” rooms and 4 smaller rooms.  I’m guessing the large rooms had a capacity for about 100 people; the smaller rooms 40-50.  Overall the rooms seemed to be a nice size; rooms that are too big and predominately empty can be more challenging to create a buzz during a presentation than a packed, smaller room.

Unlike the Charlotte facility, where the rooms were spread out two buildings, the Atlanta rooms were all together.  Rest rooms and food and drink were close, sufficient and plentiful.

Included in the paper work available at the reception desks was a master schedule for the day.  They did a particularly nice job with this because they also included each speaker’s bio and their presentation(s) synopsis.  It also included a modest sized map of the classrooms which was helpful.

Related to the class rooms, they actually hand wrote the presentation titles and start times in the window adjacent to each door for each classroom.  As each presentation was completed someone took the time to erase the presentation so what remained at the top of the list was the next session and time.  Very nice.  Maybe even over the top!

Breakfast / Lunch: I ate breakfast at the hotel so I didn’t pay much attention to the breakfast food that was out.  I did grab a bottle of water and a apple-nut muffin which was quite good.  They had a variety of coffee options, that I think were outsourced by Starbucks.  It seemed like plenty.

By the time lunch rolled around at noon I was hungry again and paid more attention to the food choices.  There were lunch boxes by Jason’s deli which were excellent to the taste buds.  I had a roast beef sandwich that was piled high with meat and tomato slices on a whole grain bread.  No additional condiments were needed.  It camp with a 1/4 slice of dill pickle, a bag of chips and a large chocolate chip cookie.  It was large enough that I was not hungry at all for the rest of the day, though I did grab a snack of crackers-n-cheese around 2:30.

The lunch line was rather long, as would be expected, but moved quickly thanks to the several teenage volunteers that worked the counters.  They really did a great job of keeping everyone, best I can tell, happy and moving forward.  While not a vegetarian myself, I do know they had the forethought to order some vegetarian lunches before hand, which needed to be requested by the person wishing one.  There may have been some vegetarians that missed that point because they didn’t know otherwise.  On the other hand, had the vegetarian box lunches been put out with the roast beef, turkey and ham options, inevitably they would have been mistakenly by someone who wasn’t looking for a veggie option.  Perhaps a sign that indicated “vegetarian available on special request” could have helped.  It wasn’t an issue for me… and I think they did a great job of handling the lunch event quite well.

Later in the afternoon there were crackers, candy bars, water and soda a plenty… No one should have left the event hungry.

Presentations: Since my presentation was dead last, I had the opportunity to attend several presentations throughout the day.  Each classroom had a dedicated proctor who greeted everyone coming into the room and provided them with 2 pieces of paper: a session survey AND a raffle ticket for the book give away.  At the end of the session the proctors collected the surveys and the raffle was drawn.  No survey; no participation in the raffle.  And the best part…  The speakers got the survey results handed to them right there on the spot.  What a GREAT concept.  My survey results will be published in a separate blog post.

I went to Mike Femenella’s session on Intro to Performance Tuning for 3 reasons: I knew Mike was from Charlotte;  I didn’t know Mike;  and the topic was near and dear to my heart.  I got to his first session at least 20 minutes late.  When I got there he was doing some of the things I was doing in my Query Analysis session… working on query plans against the AdventureWorks person.contact table.  He did a good job of explaining what he was doing, then moved on to a discussion on temp table vs. table variables.  My take away was that Mike was a competent presenter and thoroughly knew his material.  Good job Mike!

The depth and breadth of the next presenter, Kevin Boles, was in a league by himself.  Kevin is a no nonsense, highly technical SQL Server engine type of guy.  He presents himself as an expert, and he is.  His “deep dive” session on memory was in fact quite deep.  He also had the best power point deck of the day (that I saw), but he admits and gives credit to a business partner.  (Sorry, didn’t catch the name).  Kevin has already submitted 5 sessions for the SQL Saturday event in Raleigh in September.  He will definitely be on my short list of must see speakers.

The next session on SQL Server R2 was by David Rodriquez, an unabashed salesman from Microsoft.  The world needs sales people, and if I wanted to make lots of money, and who doesn’t, I wish I were one.  David gave an energetic, high level presentation on the features and benefits of R2.  Some of my takeaways…  Master Data Management which may have application to address cleansing and constituent matching and Complex StreamInSight… which may have applications to control system, real time data processing and web log analysis…  And massively parallel data warehousing appliance…  Sort of like a Netezza or Teradata solution.  His session was high level which is exactly what I was looking for.  I just wanted to learn what all the fuss was about the product… and I got it… in an entertaining way.

The post lunch session was by Meredith Ryan-Smith on work / life balance.  It was supposed to be a panel discussion, of which I was one of the members of the panel, but the layout of the room lended itself more to a round table format, which worked out great.  I think everyone had an opportunity to share some aspect of their own personal struggles with juggling family and work demands.  It’s also good to see other people struggling with the same issues as you…

Back to the technical sessions…  The next session was by Louis Davidson.  I went to Louis’s session on database design patterns for really one reason…  I was curious about the speaker himself.   He was at the speaker’s dinner the night before but I never talked to him, regrettably.  He was also in Meredith’s session… In both environs he stood out, in a positive way.

It turns out Louis, in my opinion, is a very interesting chap.  He reminds me of another tech friend…  A very large man, gregarious, super smart, and very witty.  For sure, just like Kevin Boles’s session was the most technical, Louis’s session was the most entertaining.  I’ve found that even some of the best MVP, regularly featured speakers are not really candidates for the National Speaker’s Association.  They know their technical stuff, and some more than others can add pizzazz to the topics, but few really make their sessions fun, to the point of entertaining.  And that’s Louis’s greatest draw.  Don’t get me wrong, his knowledge is insightful and deep, but his energy and wit are contagious.  Louis, the next time we will meet!

T-SQL really is the basis for my original involvement with SQL Server, and it continues today.  That being said, Jennifer McCown’s session on T-SQL Code Sins was right up my alley of interests.  Jennifer has an outstanding, could-be-a-radio-or-TV-celebrity, type of voice.  Lots of energy, inflexion and “in command” presence.  I thoroughly enjoyed the topic and the session.  It was much more interactive then any of the other sessions, including my own, since the nature of the session and the speaker spurred attendees to share their own code horror stories.  It was a fun session, timed in the middle of the afternoon, when our circadian rhythm is otherwise telling us to take an afternoon siesta.

And then there was my session… last one of the day… 4:45.

My Presentation: I arrived at my session 10 minutes prior to start time.  Jen’s previous session actually ended a little early but I ended up in a conversation with a couple other gents in the back of the room continuing with our own renditions of bad code; Atlanta traffic; the US Interstate mess (build more roads, bring more congestion) and who knows what.  Always aware of the time, I figured I better get to my own session in the room next door.

When I walked into the room it was full, settled, and everyone seemed to be looking forward.  So much so that I thought the previous session was still in progress… but there was no one at the podium.  It turned out the were waiting for me; I had a full house…  No seats left, and the proctor was turning people away at the door.  I swear this is the truth… Until I arrived the attendees were concerned that I was pulling a no show.

Since the attendees were all in place, I was able to start my session right away, as soon as I was ready, about 5 minutes early.  After about 5 minutes into the presentation the proctor did let some people in.  It was a little surreal seeing the people in the hallway looking in through the glass adjacent to the doors.

My session in Charlotte’s SQL Saturday was also the last session of the day, and the attendance was full, not not beyond capacity.  Since I know they weren’t coming here to see me, it was confirmation that I really lucked out and picked a topic that appealed to a broad audience.

For sure this was the largest group of people I ever presented to, 10 – 12 chairs per row, at least 10 rows deep (after all, this was one of the “large” rooms) there had to be at least 100 people tightly packed in.

The audience was pleasant, asked questions, and challenged me…  the later not being too hard to do.  I made some mistakes, fumbled with one or two of the demos, but overall believe I got my message across: Execution Plans are a springboard to new avenues of learning.  Thankfully I had some experts in attendance who contributed to the discussion.  I wish I got all of your names… but Rob Volk was one of them.  This only reinforced my mantra: If you want to learn it, teach it.

At the end of the day while walking to the parking garage I met up with 2 people that indicated the session inspired (might be too superfluous of a word but the only one I can think of at the moment) to restart their previously failed attempt to read and understand Grant Fritchey’s book on Execution Plans.

Overall Experience: At this point in time I’m being advised by my wiser half to get outside and enjoy the day.  SQL Saturday ATL #41 was an absolute success for me;  personally, professionally, and I believe for all who attended.

Thanks to Stu, the sponsors, and the numerous captains and volunteers that it took to pull this event together.  I recommend y’all for The Apprentice.    I rekindled some previous relationships, initiated some new ones, and learned a bunch both about databases and, more importantly, people.  If there’s a SQL Saturday coming near you sign up, put it on your calendar, and don’t let anything stop you from attending.  To find out of there’s an event coming to you… go to www.sqlsaturday.com.

I regret not sticking around to help clean up at the end of the event but I was looking down the barrel of a 4 hour plus return trip to Charlotte, braving thunderstorms coming in from the west.

Keep learning… stay curious,

Dave

Composite Index vs. INCLUDE Covering Index

A simple question was posed today while giving a presentation on query plan analysis to colleagues at the office.

What’s the difference between a composite (multi column) index and an INCLUDE covering index?

At the time I was stumped.  Fortunately I don’t have any problem admitting I don’t know something… it happens all the time.

Ultimately, the answer to the question is at http://msdn.microsoft.com/en-us/library/ms190806.aspx

If you don’t feel like ferreting out the answer from  the post I’ll summarize:

1. The composite index (where all of the columns are “key” columns) will carry the data in all levels of the index; the INCLUDE index will only carry the “non key” columns in the leaf node.  Take away: The INCLUDE index will be smaller than the equivalent composite index.

2. The INCLUDE columns(non-key) will not count towards the max index size (900 bytes).  Weak argument since we really should not have indexes that big anyway…

3. Updates to non-key columns in INCLUDE indexes will not contribute to index fragmentation; updates to unused “key” columns in the composite index will result in fragmentation…  IMO, this is a biggie.

4. Assuming low index fragmentation, both composite and INCLUDE indexes will perform comparably.  All columns in the composite index need not be in the query predicate.

Thanks for the lesson… and upping my IC (intellectual curiosity).

Dave

DELETE using SYNONYMS

I had a situation happen at work today that I never would have guessed is possible.  But since it happened, it is possible.

It’s all about how I used a synonym for a table in a DELETE statement.

Since SYNONYMS are not commonly used, at least not in my circle of friends, I’ll take a moment to describe them.  Of course, for the best explanation see BoL.

Synonyms, besides being tough to say, are used in SQL Server to create an alternate name for a database object.  In my case, I like to use synonyms to replace long table names that I frequently use.  I don’t use synonyms in production code, but if I’m testing and doing corrective updates they are fair game.

I’m not going to use the actual tables to demonstrate this anomaly, lest the legal team pay me an unwanted visit.

In this block of code I’ll create a fictitious table, create a synonym for it (“lt” = LONG TABLE), load data into the table using the synonym, and select from the synonym to show the data actually loaded:

USE tempdb;
GO

CREATE TABLE dbo.this_is_one_long_table_name (    id    int,
                            whocaresaboutdata varchar(50));

CREATE SYNONYM lt FOR dbo.this_is_one_long_table_name;

INSERT INTO lt VALUES    ( 1, 'I do'),
                        ( 2, 'Do You?');

SELECT * FROM lt;

 

The result of that query is:

image

 

So far nothing too exciting.

Anyone who has ever had to do anything other than the most simple DELETE knows that DELETE can have two FROM clauses.  It’s awkward, but that’s how it is.

First let’s finish loading our little table…

INSERT INTO dbo.this_is_one_long_table_name (id)
    VALUES    ( 3 ), (4), (5), (6), (7), (8), (9), (0);

SELECT * FROM lt ORDER BY id;

OK, so now we have 0 – 9 loaded into our test table.  We don’t need much data to run this test.

Now let’s delete some data out of our table.  To do that we’ll create another table, delete_these_records, that will have the IDs that we want to delete:

 

CREATE TABLE dbo.delete_these_records ( id    int );
INSERT INTO dbo.delete_these_records VALUES (2), (4), (8);

-- DELETE the data using a JOIN and SYNONYM
BEGIN TRAN
DELETE FROM dbo.this_is_one_long_table_name
FROM    lt lt
        INNER JOIN dbo.delete_these_records dr
        ON dr.id = lt.id;
(10 row(s) affected)    

ROLLBACK;

 

If you’re following this, you’ll expect the above query to delete the rows where the ID IN (2, 4, 8).  Instead, it deleted all of the records in this_is_one_long_table_name.  Another happy time for ROLLBACK.

It turns out, after some experimentation and review of the execution plans, the issue is in the differences in the first and second FROM clauses.  If both clauses consistently use either the full table name in both places OR uses the synonym in both places, the query will work as expected.

For example, both of these queries will work as expected, ie., deletes IDs 2, 4 and 8.

BEGIN TRAN
DELETE FROM dbo.this_is_one_long_table_name
FROM    dbo.this_is_one_long_table_name lt
        INNER JOIN dbo.delete_these_records dr
        ON dr.id = lt.id;
ROLLBACK

BEGIN TRAN
DELETE FROM lt
FROM    lt lt
        INNER JOIN dbo.delete_these_records dr
        ON dr.id = lt.id;
ROLLBACK

 

At this point in time I surmised that if a table has a synonym associated with it, and a DELETE statement uses BOTH the actual table name and the synonym in the same query, the second FROM clause will basically get ignored.

It turns out, that if a careless SQL developer created the second FROM clause that did not link back to the table being deleted, ALL OF THE ROWS WILL BE DELETED.  The second FROM clause and any JOINs in it will be logically ignored.  No warnings.  No errors.  All data is gone!

The following example will delete ALL 10 records.  The second FROM clause does not include a reference to the table in the first FROM clause.  The query passes the parser, algebrizer and optimizer…  and proceeds to delete the entire table.

 

BEGIN TRAN
DELETE FROM dbo.this_is_one_long_table_name
FROM    dbo.delete_these_records d1
        INNER JOIN dbo.delete_these_records d2
        ON d1.id = d2.id;
ROLLBACK

 

I first observed this query behavior in SQL 2005 Enterprise Edition.  The query for this post is in SQL Server 2008 Developer Edition.  The version of SQL Server does not appear to be a factor.

CONCLUSION:  The second FROM clause in a DELETE statement must exactly reference the table name in the first FROM clause.  If it doesn’t, the DELETE will act as if the second FROM clause doesn’t even exist.  (Actually, that’s not quite true either, because a review of the query plan will show the filter taking place in the second FROM clause and then JOINs the entire dataset… resulting in all records being deleted.  So the query is actually doing more work than a straight forward DELETE.)

The use of SYNONYMS MUST BE CONSISTENTLY APPLIED to both FROM clauses.  If one clause uses a SYNONYM and the other the actual table name, the second FROM clause will be logically disconnected from the query and ALL RECORDS IN THE TABLE will be deleted.

Going one step further, we also recognized that if the second FROM clause does not include a table reference to the first FROM clause, that table that is actually being deleted, the DELETE will ignore the second FROM clause and proceed to delete all of the records in the target table.

I’ve looked in MSDN but have not found any indicators that this is a feature…  Have you ever run across this in your adventures?  Do you have a more in-depth explanation as to why the DELETE behaves this way?

I’d love to know more…

Always curious… but too often without answers…

Dave

1 Comment

Google Search Technique

Now that my blog is beginning to take shape I find myself going there to get the details on ideas that I’ve posted.  There’s no easier explanation for a topic than in my own words, even if they are not as perfectly articulated as other places.

Similarly, when working with T-SQL I often want to go to MSDN to get the official Books Online definition.

Google has a simple mechanism to limit the search to a single web site OR to all websites EXCEPT a single site.

For example, if I want the definition of of my search for the missing index DMF to the MSDN website I could enter the following:

Google Site Search

The secret weapon is the SITE qualifier.   Now Google will only look at the MSDN web site.

If I wanted to see what everyone else is saying I could EXCLUDE the MSDN web site by adding a minus before SITE:

Google NOT Site Search

I wish I could say I figured this out on my own or from the Google advanced search info… but I didn’t.  Again, I learned it from Ramesh Meyyappan at SQL Workshops.

Forever curious…

Dave

0 Comments

Use dm_io_virtual_file_stats to Monitor tempdb

I’ve been watching some Webcasts by Ramesh Meyyappan at SQL Workshops on query performance tuning.  This guy is a wizard… inspiring.

Many of his techniques, which he just does like second nature, may require someone to slow things down to really articulate what he does.  Sort of like watching an instant replay during a sports event.  How do those umpires get it so right (most of the time)?  There’s just a huge difference between real-time and the slow-mo, frame-by-frame review.

So here’s my instant replay of one of Ramesh’s techniques…  monitoring tempdb to determine if a query is spilling during a sort or JOIN operation.  In particular we will do so using one of the SQL Server Dynamic Management (DM) functions, sys.dm_io_virtual_file_stats.

If you prefer to use PROFILER to monitor tempdb for sort spills you can do so simply by turning on the  Errors and Warnings Events Selection subcategory.

Sort Warning to monitor tempdb spills

I also highlighted the Hash Warning event which identifies when SQL Server goes into a hash recursion (bad) or hash bailout (very, very bad) during a hash operations such as a hash join or aggregation.

OK, back to the topic…

According to BoL, our DM function accepts 2 input arguments, the ID of the database and the File ID of the file for the database that we are interested in.

If we’re not exactly sure how to determine those input arguments we can wing it by supplying NULL values for the input.

SELECT    *
FROM    sys.dm_io_virtual_file_stats (NULL, NULL);

The output of this query will give us the file stats for every database file on the SQL instance.  Even on my lowly laptop the output is rather daunting if you don’t have any clue about which database is being presented:

sys.dm_io_virtual_file_stats output

When working with DM functions and views there are a few Metadata Functions that will serve to make our life easier.  In our case db_name and file_name.

Let’s modify our query to help us figure out which database and file we need to look at:

SELECT  DB_NAME(dm.database_id) AS [DB_Name],
        FILE_NAME(dm.file_id) AS [File_Name],
        dm.*
FROM    sys.dm_io_virtual_file_stats (NULL, NULL) dm;

All we did here is add two new columns to figure out which database and file we are interested in.  Here’s a snippet of the output:

tempdb database and file IDs

OK, now we know tempdb is database #2.  We also know the data file is 1 (tempdev) and the tempdb log file is 2.  Hopefully that is clear from the screen capture above.

(Note: With SQL Server there are always multiple ways to do the same thing.  Another way we could have figured out the database and file IDs is with the system stored procedure sp_helpfile. We’ll leave that as an exercise for the reader or perhaps do a small posting on that one later on.)

OK, now we can limit the scope of our query to just return what we want to quantify how much tempdb is being used.

SELECT  DB_NAME(dm.database_id) AS [DB_Name],
        FILE_NAME(dm.file_id) AS [File_Name],
        num_of_reads,
        num_of_bytes_read,
        num_of_writes,
        num_of_bytes_written
FROM    sys.dm_io_virtual_file_stats (NULL, NULL) dm
WHERE    dm.database_id = 2
AND        dm.file_id = 1;
Note:  This query will work fine but a simpler version follows, that introduces the use of the DB_ID function for the first argument to our DM.

SELECT  DB_NAME(dm.database_id) AS [DB_Name],
        FILE_NAME(dm.file_id) AS [File_Name],
        num_of_reads,
        num_of_bytes_read,
        num_of_writes,
        num_of_bytes_written
FROM    sys.dm_io_virtual_file_stats (DB_ID('tempdb'), 1) dm;

So how do we actually use this to validate that tempdb is being used?  Simply run the query to capture the file stats data, run a query writes to tempdb, then re-run the DM query.

drop table #demoiostats
drop table #tempdcontact

DBCC dropcleanbuffers;
-- Query 1: Capture a snapshot of the file stats on tempdb
-- before any data is written to it.
SELECT    CURRENT_TIMESTAMP as capture_time,
        DB_NAME(dm.database_id) AS [DB_Name],
        FILE_NAME(dm.file_id) AS [File_Name],
        num_of_reads,
        num_of_bytes_read,
        num_of_writes,
        num_of_bytes_written
INTO    #demoiostats
FROM    sys.dm_io_virtual_file_stats (DB_ID('tempdb'), 1) dm;

-- Query 2: Force a write to tempdb.  In practice we would be more interested
-- in queries that are writing to tempdb as a result of a SORT
-- or HASH operation.
SELECT    FirstName,
        LastName,
        EmailAddress
into    #tempdcontact
FROM    AdventureWorks.Person.Contact;

SELECT    * from #tempdcontact;

-- Query 4: Capture the snapshot AFTER the write to tempdb.
INSERT INTO #demoiostats
    SELECT    CURRENT_TIMESTAMP as capture_time,
            DB_NAME(dm.database_id) AS [DB_Name],
            FILE_NAME(dm.file_id) AS [File_Name],
            num_of_reads,
            num_of_bytes_read,
            num_of_writes,
            num_of_bytes_written
    FROM    sys.dm_io_virtual_file_stats (DB_ID('tempdb'), 1) dm;

-- Query 5: Compute the total number of reads and writes.
SELECT    t2.num_of_reads - t1.num_of_reads AS [Total_Num_of_Reads],
        t2.num_of_bytes_read - t1.num_of_bytes_read AS [Total_Num_of_Bytes_Read],
        t2.num_of_writes - t1.num_of_writes AS [Total_Num_of_Writes],
        t2.num_of_bytes_written - t1.num_of_bytes_written AS [Total_Num_of_Bytes_Written]
FROM        #demoiostats t1
        CROSS JOIN #demoiostats t2
WHERE    t1.capture_time < t2.capture_time;

To recap the above code:

  1. Query 1 captures the file stats counters on tempdb to a temp file #demoiostats.
  2. Query 2 writes to a temp table #tempdcontact.  In practice we would be more interested in a spill to tempdb from either a hash or sort physical operator.
  3. Query 3 does a simple read from the temp table we just created.
  4. Query 4 captures the file stats AFTER the write and read to and from  tempdb.
  5. Query 5: Finally we take the differences between the two sets of counters we just captured.

Here’s the output of our little test:

image

What’s interesting is that we have 4 reads and and 171 writes.  You might be wondering “Reads and writes of what?”

I ran these tests several times and the results are consistent.  We’re doing about 64k per read and 12k per write.

The counts of reads and writes are PHYSICAL reads, or the number of trips to the physical disk.  The definition of LOGICAL reads, which is not what this routine returns, is the number of trips data is sent to/from the data cache.  This is why I added the DBCC dropcleanbuffers before running the routine.  Without clearing the data cache, no increase of read counters would occur because the data would conveniently be sitting in cache.

I found a nice blog post on physical and logical reads by Kasim Wirama.  Check it out for additional details.

Conclusion: The dm_io_virtual_file_stats DMV can be very useful to monitor PHYSICAL read and write operations to the tempdb (or any other db).  It’s also a useful tool to point out the difference between LOGICAL and PHYSICAL reads.  Clearly this tool can be easily used to build optimized queries to avoid spilling to tempdb.

Thanks to Ramesh for providing the lead example to get me thinking about this tool… and the many other posters out there who demonstrate how to really make this tool dance.

Anytime I attempt to post a blog I run the risk of learning something unexpected.  In this case it was the distinction between logical and physical reads.  We all know it’s there… but probably don’t think about it much.

Forever curious…

Dave

Upgrading SQL Server 2008 Trial Version to Developer Edition

After six months of using SQL Server 2008 Trial, the party is finally over; my free trial has expired.  Unlike most trial software applications Microsoft does not present the user with a dialog box indicating that there are only X number of days remaining with the trial version.  The SQL Server service will just shutdown and no longer be able to start up.

It’s not entirely obvious HOW to upgrade from the trial version.  Thankfully there are other places to find this info other than Microsoft.  In a nutshell, all you need to do is go to the MAINTENANCE window, not the INSTALLATION window.

I found the following post by Shahfaisal Muhammed particularly helpful:

http://shahfaisalmuhammed.wordpress.com/2010/02/22/sql-server-2008-evaluation-edition-upgrade/

If you haven’t already done so, you might want to avoid having to edit your registry by upgrading to SP1 before upgrading your trial version.

After the SP1 update and the upgrade I was presented with a nasty window indicating that the server was in script mode and could only be accessed by the admin:

MSSQL 2008 SP1 Install – Server is in script upgrade mode. Only administrator can connect at this time

I found this post by Steven White to be reassuring that the end of the world is not approaching:

http://sqlblogcasts.com/blogs/stevenwhite/archive/2009/06/25/mssql-2008-sp1-install-server-is-in-script-upgrade-mode-only-administrator-can-connect-at-this-time.aspx

Thanks to Shahfaisal and Steven for their postings.  Very helpful.

Stay curious,

Dave

Testing HASHING to Speed Up Queries on Strings Part II

The objective of my last post, Testing HASHING to Speed Up Queries on Strings Part I, was to demonstrate query lookup performance when querying hash columns instead of the original string value.  In my test I chose a small set of about 20,000 email addresses as the base test case.

To recap the last post:

  • A temp table was setup to contain an email address, first and last name data columns.  A CHECKSUM was applied to each email address to compute the email string’s “numerical equivalent”.
  • Using the NTILE ranking function I then divided the data up into 200 near-equally sized “buckets” based on the hash value of the email address.
  • The MIN and MAX hash value was then computed for each hash bucket.
  • Each hash bucket was then assigned a unique value from 1 to 200.

The actual testing of the HASHING involved three simple queries that attempted to select a single record from the table.

Query 1 looked up the email address by computing the CHECKSUM of the email address string, determining which hash bucket the email address was in by checking to the min and max hash values, and then finally looking up the email address in the bucket.

The concept behind this query was that once we knew which bucket the email address was in our search was narrowed down to the 200 records in the bucket.

In practice, what I had hoped would be the fastest query, turned out to be the slowest. The query never even looked at the hash bucket, instead it went right to the index on the actual email address and doing a RID lookup on the min and max hash values.  Down right ugly.

Query 2 simply used the email address’s hash value to lookup the value directly.  This query used an index on the hash column and did an index seek to locate the record.

Query 3 was even more straight forward.  I looked up the email address by the email address string.  The index on the email address was used via an index seek and the row was returned.

Although the query plans used different indexes, queries 2 and 3 had the exact actual SUBTREE cost.  Query 1, the query that had all of the hashing technology built into it, had twice the cost of queries 2 and 3.

————————————————————————————————–

The purpose of this second test was to see if the queries performed differently for a larger data set.  The original data set of 20,000 rows was too small for the query optimizer to justify using the hashing function.

In this second test I loaded the test table with 500,000 rows.  Unfortunately, the 3 query test results were identical.  The direct access methods (query 2 and 3) performed at half the cost of the hashed approach (query 1).

As is so often the case with undirected testing, where the results are anticipated by not known, this test has raised more questions than answers.  More than likely, I’ll be back to revisit this post as I learn how to apply hashing to data that will actually yield improved query performance.

On that note I will raise some questions:

Is it possible 500,000 rows is still to small to make the hashing come into play?

Perhaps the nature of the email strings is not a good example for hashing.  After all, email strings tend to be fairly short in size.  Perhaps hashing would be better suited for longer string values, such as a comments field?

Had the dataset been much larger, would this hashing approach lent itself to partitioning the data around the hash buckets?  Let’s say the data had 500M records (10 times larger) then the hash buckets would have been 20,000 records in size.  Would the storage engine reap the benefits if the optimizer could find the right hash partition, eliminating all other partitions, then find the targeted email address within the 20,000 record partition?  I think it might…

I can’t imagine anyone actually wanting to partition their data by email address but perhaps some other string data would be relevant.  Perhaps GUIDs?

—————————————————————————————————-

Have you had any success leveraging hashing to improve your query performance?  If you have I’d love to hear about it!

Still curious…

Dave

0 Comments

Testing HASHING to Speed Up Queries on Strings Part I

It’s common knowledge that relational database engines do well indexing numeric data, particularly integer data.  String data is a little more tricky and tends to take longer to search / index.

My original intent of this post was to simulate how the SQL Server storage engine might handle statistics on an index, in particular, a column with a character data type.  During the course of the testing I realized that my testing really didn’t align with how SQL Server statistics works, but nonetheless, the hashing component of the exercise was worth while pursuing.

Since I work with fairly large data warehouses I decided to test hashing performance on a column that contains email address information.

So here goes…  The first thing I’m going to do is create a table to store the data in:

use tempdb;
go

-- How Hashing Works
--drop table tmp_hash
CREATE TABLE tmp_hash ( --id            int IDENTITY NOT NULL,
                        hash_bucket    tinyint,
                        min_hash_in_bucket    int,
                        max_hash_in_bucket    int,
                        hash_value    int,
                        email_address    nvarchar(50),
                        first_name        nvarchar(50),
                        last_name        nvarchar(50));

Let’s talk about the structure of this table:

hash_bucket: We will divide the table up into 200 equally sized “buckets” in which a range of hash values will be stored.

min_hash_in_bucket and max_hash_in_bucket:  These columns will hold the smallest and largest hash value in the hash bucket.

hash_value: Contains the CHECKSUM(emailaddress).  The idea of this exercise is to show how SQL Server handles hash values better than string data.

email_address, first_name, last_name: String data for human consumption.

My original thinking was to put a clustered index on the hash_value and to guarantee uniqueness of the index, an identity column was included.  At the end of this exercise we’ll end up with 3 different queries that compare the use of the hash with the email address. In order to make all queries comparative I chose to make all indexes non-clustered.  Of course in the real world we would have a clustered index on the table.

-- Create non-clustered index so comparisons are equivalent
CREATE NONCLUSTERED INDEX idx_nc_tmp_hash_max_min
    ON [dbo].[tmp_hash] (min_hash_in_bucket, max_hash_in_bucket)
    INCLUDE (email_address, first_name, last_name);

--drop index tmp_hash.idx_nc_tmp_hash
-- Create non-clustered index so comparisons are equivalent
CREATE NONCLUSTERED INDEX idx_nc_tmp_hash
    ON [dbo].[tmp_hash] ([hash_value])
    INCLUDE (email_address, first_name, last_name);

-- drop index tmp_hash.idx_tmp_hash_email_address
CREATE NONCLUSTERED INDEX idx_tmp_hash_email_address
    ON [dbo].[tmp_hash] ([email_address])
    INCLUDE (first_name, last_name);

OK, so we added the non-clustered indexes.  Later on we’ll see how these indexes behave when querying either the hash values, the hash range and the email_address string data directly.

Let’s populate the table with data from the AdventureWorks db:

INSERT INTO tmp_hash (  hash_value,
                        email_address,
                        first_name,
                        last_name)
    SELECT  CHECKSUM(EmailAddress),
            EmailAddress,
            FirstName,
            LastName
    FROM    AdventureWorks.Person.Contact;
--(19972 row(s) affected)

 

The next step is where my thinking simulated but deviated from how SQL Server handles statistics.  SQL Server creates histograms of up to 200 columns with the data distribution in each of the columns.  So if the data tends to not distribute evenly across all 200 columns in the histogram the number of records in each of the buckets will vary.

In this test scenario I created 200 buckets for the data but each bucket has the same number of records in it (or nearly so since 19972 does not truly divide evenly by 200).

So here’s how I did it:

-- Now break up the hash_values into 200 equal sized chunks
WITH hash_cte (hash_value, hash_bucket) AS
    (    SELECT hash_value,
                NTILE(200) OVER(ORDER BY hash_value) AS hash_bucket
        FROM    tmp_hash )
UPDATE    t
SET       hash_bucket = d.hash_bucket
FROM        tmp_hash t
        INNER JOIN hash_cte d
            ON    d.hash_value = t.hash_value;        

-- Take a peek at the data so far.            
SELECT    hash_bucket,
        COUNT(*) as [Size of Hash Bucket]
FROM    tmp_hash
GROUP BY hash_bucket
ORDER BY hash_bucket;

 

We simply used the ranking function NTILE to divide up the data into nearly equal sized chunks.

The second query shows the distribution of the data as 100 records in the first 173 buckets and 99 records in buckets 174 – 200.

The next thing we did was updated the table to contain the MIN and MAX hash_values in each of the buckets.  The objective here is to provide the SQL Server storage engine a way to narrow down it’s search for an email address by providing a range of hash values.

-- Update the min hash value in each bucket
WITH min_hash_cte (min_hash_value, hash_bucket) AS
    (   SELECT    MIN(hash_value),
                hash_bucket
        FROM    tmp_hash
        GROUP BY hash_bucket ),
    max_hash_cte (max_hash_value, hash_bucket) AS
    (   SELECT    MAX(hash_value),
                hash_bucket
        FROM    tmp_hash
        GROUP BY hash_bucket )
UPDATE    t
SET        min_hash_in_bucket = [min].min_hash_value,
        max_hash_in_bucket = [max].max_hash_value
FROM        tmp_hash t
        INNER JOIN min_hash_cte [min]
            ON [min].hash_bucket = t.hash_bucket
        INNER JOIN max_hash_cte [max]
            ON [max].hash_bucket = t.hash_bucket;
            

 

Finally, I provide 3 distinct queries to query the table to retrieve not only the email address, but also the person’s name.

-- How do we actually use this to look up an email address?
SELECT    t.first_name,
        t.last_name,
        t.email_address
FROM    tmp_hash t
WHERE    CHECKSUM(N'eduardo22@adventure-works.com')
            BETWEEN t.min_hash_in_bucket AND t.max_hash_in_bucket
AND        t.email_address = N'eduardo22@adventure-works.com';

SELECT    t1.first_name,
        t1.last_name,
        t1.email_address
FROM    tmp_hash t1
WHERE    CHECKSUM(N'eduardo22@adventure-works.com') = t1.hash_value;

-- Is this better than the naive approach?
SELECT    t.first_name,
        t.last_name,
        t.email_address
FROM    tmp_hash t
WHERE    t.email_address = N'eduardo22@adventure-works.com';

 

[Side note: The CHECKSUM(N'eduardo22@adventure-works.com') <> CHECKSUM('eduardo22@adventure-works.com')]

The concept behind the first query is to figure out which bucket the target email address is in.  This should narrow down the range of records to look at to about 100 records, at which point it could choose to do a scan of the data.  This would be similar to a partitioned table implementation that facilitates partition elimination, where all but the required partition of data would be looked at.

The second predicate in the WHERE clause provides the actual record we are looking for.  The intent was to do the actual email lookup only AFTER the appropriate hash_bucket is identified.

What actually happened?  The query optimizer chose to ignore the use of the hash_bucket range and go right to the index on the emailaddress data.  Because we included a range to look at we ended up with the RID lookup.  Had we decided to include the identity column and place a clustered index on it the RID bookmark would have been replaced with a Key Lookup.  Either way, in this case the range of hash values is a waste of energy.

 

image

 

The intent of the second query was to go right to the hash value of the email address.  The result is a much cleaner index seek that has a cost half the size of the first query (0.003 vs 0.006).

 

image

 

Finally, the third query simply accesses the emailaddress column directly.  It’s cost is identical to the second query’s plan even though they are using different indexes.

 

image

 

Looking at the query plans of all 3 queries if run in a batch we confirm that the first query costs 2 times the cost of query #2 and query #3.

 

image

 

CONCLUSION:  Based on the results of this test, it is clear that the hashing of the email address did not result in any performance improvement over the simple, direct access of the email address string.  In fact, the overhead in terms of CPU, disk IO and storage of the hash values and ranges would suggest that the hashing is a total waste of time.  For this data set that is arguably true.

Unfettered, I believe the relatively small data set of 19,972 rows is not sufficient to warrant the SQL optimizer to attempt to take short cuts to get to the data.  My next post on this topic will use a much larger data set of about 500k rows, which is probably more real-world for most data warehouses.  

Always curious,

Dave

0 Comments

Practical Use of the OUTPUT Clause

SQL Server 2005 introduced many new switches and features in T-SQL.  Often times new features are met with “Why would I need to do that?”.  The reality is that the SQL Server development team are not adding features for nothing.

One of my favorites is the OUTPUT clause.  The OUTPUT clause can be used in conjunction with INSERT, UPDATE and DELETE statements.  This post will show how I used the OUTPUT clause in a real world scenario with UPDATE.

Situation:  While working on my data warehouse it was discovered that certain records were not being loaded into the warehouse because one of the lookup tables did not include one or more records that were required for an INNER JOIN.  Since this has been going on over a period of several years it was considered important to reload those records.

Fortunately the source for the incoming data has been stored in a “staging” table so we have all of the original incoming data.  One of the columns in the staging table is the load_date, which stores the date that the data was extracted from the source system.

Only a small number of records are affected each time the warehouse is loaded, but they are scattered over a period of several years of data.  The load routine utilizes the load_date column to identify the records that need to be inserted or updated on the warehouse.

In order to utilize the existing load routine, i.e.., not have to craft a custom loader to identify the missing records, we would like to set the load_date of all of the missing records to the current date and run the load routine for the existing date.

Using the OUTPUT clause we will be able to update the missing stage records with a common load_date value and at the same time save the original load dates for the records in a separate table.  We can then run the ETL to load the missing records.  After the load successfully completes, we will restore the stage table to the original load dates.

Implementation Details: The first step is to create an output table where we will store 3 things: the primary key of the table, the original value of the column that is going to be updated, and the newly updated value of the updated column (optional but can be handy).

We want to save the primary key so we can conveniently do the update later after we’ve verified the results of the ETL load.

CREATE TABLE dbo.tmp_stg_output (stg_pkey_id         int,
                                 load_date_before    datetime,
                                 load_date_after     datetime );

CREATE UNIQUE CLUSTERED INDEX idx_uc_output_clause
    ON   dbo.tmp_stg_output (stg_pkey_id);

 

Next we’ll update the stage table with our desired load_date.  We’ll also include the OUTPUT clause to save the data to our temp table.

 

UPDATE stg
SET    load_date = '03/12/2010'
     OUTPUT inserted.stg_pkey_id,
            deleted.load_date,
            inserted.load_date
     INTO tmp_stg_output
FROM  STG_table stg
WHERE stg.section_code = 2213
AND  stg.district_code = 197l;

 

This is just your run of the mill UPDATE statement except we have the OUTPUT clause that indicates the 3 columns to save to the tmp table.  If you’re not familiar with the special tables INSERTED and DELETED check out BOL, Using the inserted and deleted TablesMy first encounter with these special tables were when I was creating triggers. 

For the purpose of the use of the OUTPUT clause all we need to understand is that the INSERTED table contains the data that is changing.  So in this example, the INSERTED.load_date will save the new load_date value.  Also note, the INSERTED.stg_pkey_id could have been DELETED.stg_pkey_id.  The primary key of the table doesn’t change so the INSERTED and DELETED tables will contain the same value for this column.

The DELETED.load_date will contain what was originally in the table before the update.

At this point we can run the load ETL to load our missing records.  The details of the ETL are not relevant to this post…

Once we’ve validated we have a good load of our missing records we can restore our stage table to it’s original state by using our handy little temp table.

-- Run the load package using the load_date of 20100312

-- Verify the results are correct

-- Set the stg table back to it's original state
UPDATE stg
SET    load_date = t.load_date_before
FROM  dbo.STG_table stg
       INNER JOIN dbo.tmp_stg_output t
         ON   t.stg_pkey_id = stg.stg_pkey_id;

Conclusion:  Using the OUTPUT clause in conjunction with the UPDATE statement we were able to capture the state of our  staging table, update one of it’s columns to allow us to easily perform an ETL load.  Once the ETL finishes and is verified we restore the staging table to it’s original state using the temp table we created and populated with the OUTPUT clause.

Similar capabilities exist for INSERT and DELETE statements.

In our example we used an actual table that was created with the CREATE TABLE statement.  We also could have just as easily used a temp table (#tmp_stg_output, ##tmp_stg_output), or a table variable (@tmp_stg_output).  Obviously the table variable would not have given us the luxury of validating the data because the table is no longer available after the batch completes.

The risk of using #temp tables is that they are dropped when the session ends.  If your session disconnects as a result of a network glitch the data in your temp output table would be lost.

Stay curious – keep learning…

Dave

SQL Saturday #33, Charlotte, NC from the Rear View Mirror

I just wrapped up my first ever attendance at a SQL Saturday.  I was not disappointed.  But not only was I an attendee, but I was also a presenter (speaker) too.

The event was held at the Microsoft campus in Charlotte.  The facility is first class and the principal organizers, Peter Shire and the management and staff at SQL Sentry, could not have found a better venue.

There were 54 scheduled sessions presented across 9 tracks that were delivered by 25-30 speakers.  Most of the speakers were SQL Server MVPs that flew in from all over the country.  And then there was me…

When I signed up for the event I suggested 3 “proposed” sessions: Query Plan Analysis, Implement a Sliding Window Partition, and a session on Profiler / Trace.  I figured one might get selected, maybe.  Fortunately for me the query plan session was picked.

Also, at the time of my “proposed” sessions, there was very few speakers registered.  So I assumed that they were going to have a tough time getting people to run the sessions.  Fast forward 2 weeks later and the speaker roster looked like a Who’s Who of SQL Server MVPs, both local and out of state.  Out-of-towners included Andy Kelly, Andy Leonard, Steve Jones, Andy Warren, Kevin Kline, Jessica Moss and more.  Local MVP talent included John Welch, Alejandro Mesa, Rafael Salas and several top SQL talents from Microsoft itself.  Honestly, had I known the speaker list was going to be of that caliber I would not have had the audacity to inject my name on such a list.  But sometimes Providence has a way of getting us do go outside our comfort zone to take us to the next level.

There were several unexpected benefits of being a speaker at SQL Saturday…  First of all you get to wear a really cool button down shirt at the event.  It was navy blue with the SQL Saturday logo and event number embroidered above the pocket.  It will be a nice keepsake of the event. 

The night before the event there was a “speakers dinner”.  Apparently the speakers dinner is standard protocol…  the main sponsor of the events hosts a dinner for all the speakers at a nice restaurant.  In our case we all went to Red Rocks in Birkdale Village.  Anything and everything on the menu or the bar, was fair game.

The second biggest benefit of being a speaker was the opportunity to meet, in a social setting, the many MVPs of the group.  Like they say about the PGA golf pros… Those guys are good.

Finally, at the top of the list of speaking at a SQL Saturday, is the lessons learned by preparing and delivering a technical presentation to a technology savvy audience.  For me there were many…  One of my favorite sayings is “If you want to learn it, teach it”.  The prep work of organizing your thoughts for the presentation, thinking about how to simplify complex concepts, trying to anticipate questions that may be presented, and then figuring out the answers, all contribute.

As expected, the attendees of the experience did ask some insightful questions that challenged my own knowledge of the subject.  Fortunately I had some help from the attendees who helped explain the answer to the question… To them I will repeat myself…  Thank you.

At first I was a little disappointed that my presentation was in the last time slot of the day… 4-5.  I didn’t expect anyone to stick around to see it, after all, it was one of the first decent Saturdays of the year.  Was I ever wrong…  Although my session was in one of the smaller rooms, it was packed.  My guess is there was between 40 and 60 people in it.  All the seats where taken.  People were sitting on a table that flanked the length of one of the walls, the back wall was lined with folks standing, and a few people were straddled on the floor.

Obviously I had a popular topic… because there weren’t coming to see me, a rookie presenter.  After seeing several sessions though out the day that were closely related to my session, in particular the session by Andy Warren on column and index statistics and Kevin Kline’s session on internals (basically how queries are processed) I was concerned that my session would be too much of a repeat of their messages.

Instead, I do believe, partly because one of the attendees told my so after the event, that my hands on demo filled the gap between the theory and concepts presented by others and the real world.  For that comment I am grateful…

I now have more work to do to improve my own knowledge and the overall presentation.  That only confirms the Learn it – Teach it theory.  But overall the experience was a 10!

Since Saturday I have had several LinkedIn invites from attendees, asked to consider presenting at the SQL Server UG in Columbia, SC, asked to consider presenting a webcast for SQLLunch, and had a couple folks indicate that the presentation has inspired them to look deeper at their own code and at the general topic of query plan analysis.  I know I struggle regularly with trying to get my arms around this 800 pound gorilla we call SQL server.  If 1 person in that room was moved to improve I consider it a homerun.  2 or more… well, that’s a grand slam.

I can ask no more from this event.  The rewards I have personally and professionally received from it far exceeded my expectations.  To all of the participants that stayed late, to those that provided constructive and supportive feedback, and to those that didn’t walk out…   I Thank You!

Stay curious; Keep learning…

Dave

0 Comments

Database Properties

A couple of weeks ago a colleague asked me how to determine the recovery model of a database using T-SQL.  In other words, not using SSMS.  For the purpose of this post I’m going to show how to determine the recovery model for a database using SSMS, and then answer the presented question.

The following screen can be viewed in SSMS by RIGHT clicking on the database in Object Explorer ==> Properties ==> Options.

image

From this we can see that the AdventureWorks database is Simple (the other options are Full and Bulk Logged).

But back to the original question…  How can this be determined using T-SQL?  At the time we were at lunch and I suggested using sp_helpdb.  I wasn’t positive this property was included…  but it turned out to be a good guess:

 

image

If you prefer to identify only the recovery model using T-SQL, such as to be able to save the information in a variable of temp table you can use the following approach:

image

There is an older version of the DATABASEPROPERTYEX called DATABASEPROPERTY.  According to BOL, it is not recommend using DATABASEPROPERTY as it has been scheduled for deprecation.

We can validate that DATABASEPROPERTY is deprecated by tracing deprecation in Profiler:

image

After running  the same query using DATABASEPROPERTY instead of DATABASEPROPERTYEX we observe the following in Profiler:

image

This only confirms what BOL was telling us.

OK, last approach (for today, I’m sure there are more methods) to determine the recovery model for a database:

image

 

Lessons Learned: As is always the case with SQL Server there are always multiple ways to arrive at the same result.  In this case we learned that we can capture a database’s recovery model by using:

  • SSMS
  • sp_helpdb
  • DATABASEPROPERTYEX
  • sys.databases view
    We also learned and confirmed that DATABASEPROPERTY (no EX) will not be supported in versions beyond SQL Server 2008. 
    Finally, we showed how to verify a deprecated command using SQL Profiler.
    Although this post was focused on the recovery model of the database any of these methods will work with any other database properties.  For more information on the entire set of database properties see BOL for DATABASEPROPERTY .

    Stay curious; keep learning…

    Dave

    Finding Uncommitted Transactions

    One way to find out if a transaction is causing a block is with the sp_who2 system stored procedure.  The scope of this procedure is server-wide.  It identifies all connections to the database, the login name, the application that is running, and in the context of locks and blocks, any SPIDs that are causing a block.  The offending SPID will be in the BlkBy column.

    I often find myself using transactions when doing DELETE or UPDATE operations to allow myself the option of ROLLING back the transaction in case something goes awry.  As equally often, I forget to COMMIT the transaction and move on the the next order of business, resulting in a lock on the table and essentially blocking everyone out of it’s use (unless they use the with (nolock) hint in their query).

    If I ran the following query and forgot to commit the transaction what would happen?

    USE AdventureWorks;
    GO
    
    BEGIN TRAN
    INSERT INTO person.contact (FirstName, LastName, EmailAddress, PasswordHash, PasswordSalt)
    VALUES ('Dave', 'Turpin', 'dave.turpin@acm.org', 'goblygook', 'Salt');

     

    The table would lock up until either a commit or rollback happens.  But how would I know if there are any open transactions, either mine or somebody else?  sp_who2 would not indicate any blocks because there aren’t any.  There’s only an open transaction.  But as long as no other processes are tying to use that table there will not be any blocking/locking issues.

    DBCC OPENTRAN will list the earliest open transaction.  In our simple example, the DBCC will return:

    image

    In this case we can see that SPID 54 had a transaction open since 5:08AM.  Now we can take a look at SPID 54, see who it is, and initiate either a rollback or commit.  Chances are if an uncommitted transaction is hanging out there there’s either a major bug in someone’s code, we have a legitimate locking/blocking problem, or its an ad hoc batch.

    Since DBCC OPENTRAN only identifies the earliest open transaction, you may have to resolve someone else’s transaction before actually getting to yours.  Hopefully there will not be any open transactions for very long and the point is moot.  In practice, however, when we need to look for open transactions it’s because something else isn’t working or…  The red phones are lighting up with customer complaints.

    Stay curious; keep learning…

    Dave

    DISTINCT COUNT vs. COUNT DISTINCT

    Late last Friday afternoon I was working on a database quality issue that had been challenging me all day.  Essentially I was looking for any distraction that would give the left side of my brain a break.  Along comes two database colleagues to my rescue…

    These two gents came up to me with a somewhat coy look on their faces and after the usual courtesies along the lines of “Do you have a minute for a SQL question?” proposed the following question:  What’s the difference between “COUNT DISTINCT” and “DISTINCT COUNT”.

    Come on guys, I was hoping for a ping pong or foosball invite…  Not more taxing head ware…

    Regardless, I couldn’t help myself and proceeded to give it some thought.  Of course the easy way would have been to just plug the SQL into an example but I wanted to whiteboard it first before getting the “answer”.

    During the whiteboard sessions a couple lessons were already learned or re-learned:

    1.   The reason the question came up was one of the guys was using a database reporting tool and was not getting the expected results.
    2.   We all get in the habit of doing things the same way… so keep you mind open when reading other people’s code (I re-read Itzik Ben-Gans books repeatedly and I still find new jewels in there that are not even relevant to the lesson).  In my case I tend to use DISTINCT COUNT and not COUNT DISTINCT.

    Of course after the whiteboard session it was back to the desk to proof out our analysis…

    The first thing is to create some simple sample data:

    USE tempdb;
    GO
    
    CREATE TABLE #tab1 ( id tinyint);
    
    INSERT INTO #tab1 VALUES (1);
    INSERT INTO #tab1 VALUES (1);
    INSERT INTO #tab1 VALUES (2);
    INSERT INTO #tab1 VALUES (3);
    INSERT INTO #tab1 VALUES (3);
    INSERT INTO #tab1 VALUES (3);
    INSERT INTO #tab1 VALUES (3);
    INSERT INTO #tab1 VALUES (4);
    INSERT INTO #tab1 VALUES (4);
    INSERT INTO #tab1 VALUES (4);

     

    In this case all we did was insert 10 rows of data into a table; 4 distinct values; occurrences of those distinct values varies between 1 and 4.

    SELECT DISTINCT COUNT

    So let’s run a DISTINCT COUNT example:

    SELECT  DISTINCT COUNT(*)
    FROM    #tab1; 

     

    It doesn’t get much simpler than this.  But does it give us the expected results.  In the case of my report writing colleague: NO.  If you said 10 for the output you are correct.

    If you didn’t guess 10 don’t worry, it is tricky.  The best way to understand WHY the SQL engine returned 10 is to look at the query plan:

    image

    When reading graphical execution plans we always read from RIGHT to LEFT.  The first step the SQL engine does is a table scan on our temp table.  No surprises there… there’s only 10 rows in it, there are no indexes, and SQL probably wouldn’t use an index anyway since the data set is so small.  The trivial execution plan is the only choice.

    The above diagram highlights the pop up box when hovering the cursor over the line that connects the table scan with the stream aggregate (in our case, the only aggregate function is COUNT).  We can see that we have 10 rows going into the aggregate.

     

    image

    Between the stream aggregate and compute scalar we see we have only one row.  And the DISTINCT value of that one row is 10.  Hence SQL returns 10.

    In fact, it would have been just as easy and a whole lot less confusing to run the following query:

    SELECT  COUNT(*)
    FROM    #tab1; 

    In fact, SQL Server’s optimizer generates the identical plan:

    image

    Our lesson learned here:  SQL Server will often clean up our well-intentioned code for us, and we don’t even know it.  (Which is not to say SQL Server is smarter than we are, just real good at what it does).

    SELECT COUNT – DISTINCT

    Now let’s shift our focus to the other half of the question: COUNT-DISTINCT.

    SELECT  COUNT( DISTINCT ID )
    FROM    #tab1;

    Right away you may notice we did not do COUNT (DISTINCT *).  The wildcard doesn’t make any sense in this context.  If it did it would say “count all of the distinct values for all of the columns in the table”… which in our example is trivial, but still will not make it past the parser.

    Here’s the output:

    image

    This is probably more along the lines of what the reporting engineer was trying to do in the first place.  Here we see that SQL Server has returned a count of the number of distinct IDs in our simple little table.

    Here’s the query plan for this simple query:

    image

     

    Everything in this query plan looks identical to the query plan from the SELECT COUNT(*) except it introduced the SORT operator, second from the right.  We’ll also notice the amount of work SQL needs to do has shifted away from the table scan (was 100%, now 22%).  Now the most expensive part of the query is the sort, at 78% of the cost.

    Before the SQL Server engine can count the number of distinct IDs it must first sort them.  This is where the cost of the query is.  If this were a query that would be running in a transactional production system, instead of an ad hoc situation, we would want to do our best to ensure this query is not spilling to tempdb, as disk IO is always more expensive than memory operations.  We would do this by looking at the amount of RAM the query is being GRANTED and the expected row size to be sure enough RAM is being allocated for the query.  There are some tricks that can be used to ensure sufficient memory is allocated.

    The number of rows going into the SORT operator is 10.  The number of rows going into the stream aggregate is 4, which is also the value returned from SQL Server.

    In my opinion a more useful query would include not only the counts by ID but also the ID’s themselves.  We can still get the count of the records with the final COMPUTE clause.

    SELECT  id,
            COUNT(*)
    FROM #tab1
    GROUP BY id
    COMPUTE SUM(COUNT(*));

     

    The execution plan is identical to the previous plan (with only a slight variation in the total subtree cost, on the order of .001 more).

    Here’s the result:

    image

     

    Yet another variation of this already beaten to near oblivion query follows:

    SELECT  DISTINCT id,
            count_by_id
    FROM    (   SELECT    id,
                COUNT (*) OVER (PARTITION BY id) AS count_by_id
                FROM    #tab1) d;

     

    Interestingly, this query will give us the exact same result set as the previous query.  But the query plan is much more interesting, and at nearly twice the cost increase in overall cost (0.0263443 vs  0.0147141 for the previous plan, minus the COMPUTE).  A complete analysis of this plan will be left as an exercise for the reader, or if there is interest, in a future post.

     

    image

     

    So why bother with such a complex query?  The concept may prove to be beneficial with a different set of data.  The sample data is so trivial that the benefits of the COUNT/OVER/PARTITION may not have a chance to shine.

    Lessons Learned:   Although it’s easy to become complacent with doing things the same way all the time, it can be equally challenging and interesting when presented with even the most trivial scenarios.

    Like all things in SQL, there are many ways to arrive at the same result set.  Fortunately for us, the SQL engine’s optimizer is usually a very sophisticated partner and helps us arrive at the optimal solution with the least cost, even when we throw seemingly different queries at it.

    The flip side of that is that a seemingly innocuous change to a query can potentially change the runtime performance of it. 

    While best practices are always a good starting point when architecting a SQL solution, alternative paths should be tested to ensure optimal performance.

    Thanks to Sean and Paul for presenting their question, who got me thinking about this from different points of view.

    Be childlike; Stay Curious…

    Dave

    0 Comments

    Big SQL Fish Across the Big Pond

    If you’re a SQL developer or dba working for a small company it’s pretty easy to become the big fish in the small pond.  But the downside is everything you learn is gained through the school of touch lessons.  You probably don’t have regular exposure to high end SQL expertise, so your knowledge will only grow so much.  Even worse, you might start believing yourself how good you are, until you go out on a technical interview and find out what you don’t know.

    If you work for a large company you have a chance at finding a high end, technical mentor who’s not afraid of giving away all his knowledge (turf) and is truly secure in his / her hard earned skills.  You at least a decent shot at learning from established experts.

    Either way the internet makes it possible to have quality touch time with some of the best names in the business.  Most of that contact will be from their blogs, user group meetings and if you can afford it, conferences.

    I recently received a sponsored mailing from PASS about a web site called sqlworkshops.com where 2 free webcasts can be downloaded.  OK, another self-promoting, self-proclaimed “expert” setting the bait just before the switcher-oo.

    Not this time folks.  The web casts on this site are from a true expert on SQL performance and tuning.  I personally have found these videos as much inspiring as informative.

    So if you work for a small company and want to REALLY find out what you know, check out www.sqlworkshops.com and download the webcasts… then watch just one of them.  After you watch the first one, if you’re like me, you’ll watch it again and again, until you “get it”.  The other ones are still in the queue, waiting to be tapped.

    This guy’s approach is unique, expert and deep.  He calls it level 4 training; I call it awesome.  Whatever you call it will be up to you… just try it.

     

    Happy SQL-ing!

    Dave

    0 Comments

    Escaping a Quote

    I was on my way out the door when a respected colleague bounded over with a quick question.  He needed to query a column in a table that had a single quote embedded in the string.

    This is one of those little SQL tricks that can really slow you down.  I don’t know when or where I learned it but knew right away to escape the single quote with another single quote.

    USE master;
    GO
    
    SELECT 'D'Agostino';

    This doesn’t work because the parser sees the quoted ‘D’ but doesn’t know what to do with the trailing Agostino’ string.

     

    SELECT 'D''Agostino'

     

    This query works!  The first quote is the start of the string.  The second quote is the ESCAPE for the third quote.  Basically the ESCAPE makes the third quote a literal string.  Finally the fourth quote ends the string.  The parser knows what this string is and works just fine.

    SELECT 'D''Agostino'''

    In the above query the result set looks like: D’Agostino’.  At first glance it looks a little awkward but the 3rd from last quote is the ESCAPE for the second from last quote; the last quote simply terminates the string.

    Are there other ways to do the same thing?  In SQL Server there are ALWAYS other ways to get things done and this one is no exception.

    We could use the CHAR function to generate the ASCII character for the single quote.  In this case it will look like this:

    SELECT 'D' + CHAR(39) + 'Agostino'

     

    This is simple stuff, but only when you know it.  If you don’t know it you can spend way too much time figuring it out on your own.

    Happy SQL-ing!

    0 Comments