Late last Friday afternoon I was working on a database quality issue that had been challenging me all day. Essentially I was looking for any distraction that would give the left side of my brain a break. Along comes two database colleagues to my rescue…
These two gents came up to me with a somewhat coy look on their faces and after the usual courtesies along the lines of “Do you have a minute for a SQL question?” proposed the following question: What’s the difference between “COUNT DISTINCT” and “DISTINCT COUNT”.
Come on guys, I was hoping for a ping pong or foosball invite… Not more taxing head ware…
Regardless, I couldn’t help myself and proceeded to give it some thought. Of course the easy way would have been to just plug the SQL into an example but I wanted to whiteboard it first before getting the “answer”.
During the whiteboard sessions a couple lessons were already learned or re-learned:
- The reason the question came up was one of the guys was using a database reporting tool and was not getting the expected results.
- We all get in the habit of doing things the same way… so keep you mind open when reading other people’s code (I re-read Itzik Ben-Gans books repeatedly and I still find new jewels in there that are not even relevant to the lesson). In my case I tend to use DISTINCT COUNT and not COUNT DISTINCT.
Of course after the whiteboard session it was back to the desk to proof out our analysis…
The first thing is to create some simple sample data:
USE tempdb; GO CREATE TABLE #tab1 ( id tinyint); INSERT INTO #tab1 VALUES (1); INSERT INTO #tab1 VALUES (1); INSERT INTO #tab1 VALUES (2); INSERT INTO #tab1 VALUES (3); INSERT INTO #tab1 VALUES (3); INSERT INTO #tab1 VALUES (3); INSERT INTO #tab1 VALUES (3); INSERT INTO #tab1 VALUES (4); INSERT INTO #tab1 VALUES (4); INSERT INTO #tab1 VALUES (4);
In this case all we did was insert 10 rows of data into a table; 4 distinct values; occurrences of those distinct values varies between 1 and 4.
SELECT DISTINCT COUNT
So let’s run a DISTINCT COUNT example:
SELECT DISTINCT COUNT(*) FROM #tab1;
It doesn’t get much simpler than this. But does it give us the expected results. In the case of my report writing colleague: NO. If you said 10 for the output you are correct.
If you didn’t guess 10 don’t worry, it is tricky. The best way to understand WHY the SQL engine returned 10 is to look at the query plan:
When reading graphical execution plans we always read from RIGHT to LEFT. The first step the SQL engine does is a table scan on our temp table. No surprises there… there’s only 10 rows in it, there are no indexes, and SQL probably wouldn’t use an index anyway since the data set is so small. The trivial execution plan is the only choice.
The above diagram highlights the pop up box when hovering the cursor over the line that connects the table scan with the stream aggregate (in our case, the only aggregate function is COUNT). We can see that we have 10 rows going into the aggregate.
Between the stream aggregate and compute scalar we see we have only one row. And the DISTINCT value of that one row is 10. Hence SQL returns 10.
In fact, it would have been just as easy and a whole lot less confusing to run the following query:
SELECT COUNT(*) FROM #tab1;
In fact, SQL Server’s optimizer generates the identical plan:
Our lesson learned here: SQL Server will often clean up our well-intentioned code for us, and we don’t even know it. (Which is not to say SQL Server is smarter than we are, just real good at what it does).
SELECT COUNT – DISTINCT
Now let’s shift our focus to the other half of the question: COUNT-DISTINCT.
SELECT COUNT( DISTINCT ID ) FROM #tab1;
Right away you may notice we did not do COUNT (DISTINCT *). The wildcard doesn’t make any sense in this context. If it did it would say “count all of the distinct values for all of the columns in the table”… which in our example is trivial, but still will not make it past the parser.
Here’s the output:
This is probably more along the lines of what the reporting engineer was trying to do in the first place. Here we see that SQL Server has returned a count of the number of distinct IDs in our simple little table.
Here’s the query plan for this simple query:
Everything in this query plan looks identical to the query plan from the SELECT COUNT(*) except it introduced the SORT operator, second from the right. We’ll also notice the amount of work SQL needs to do has shifted away from the table scan (was 100%, now 22%). Now the most expensive part of the query is the sort, at 78% of the cost.
Before the SQL Server engine can count the number of distinct IDs it must first sort them. This is where the cost of the query is. If this were a query that would be running in a transactional production system, instead of an ad hoc situation, we would want to do our best to ensure this query is not spilling to tempdb, as disk IO is always more expensive than memory operations. We would do this by looking at the amount of RAM the query is being GRANTED and the expected row size to be sure enough RAM is being allocated for the query. There are some tricks that can be used to ensure sufficient memory is allocated.
The number of rows going into the SORT operator is 10. The number of rows going into the stream aggregate is 4, which is also the value returned from SQL Server.
In my opinion a more useful query would include not only the counts by ID but also the ID’s themselves. We can still get the count of the records with the final COMPUTE clause.
SELECT id, COUNT(*) FROM #tab1 GROUP BY id COMPUTE SUM(COUNT(*));
The execution plan is identical to the previous plan (with only a slight variation in the total subtree cost, on the order of .001 more).
Here’s the result:
Yet another variation of this already beaten to near oblivion query follows:
SELECT DISTINCT id, count_by_id FROM ( SELECT id, COUNT (*) OVER (PARTITION BY id) AS count_by_id FROM #tab1) d;
Interestingly, this query will give us the exact same result set as the previous query. But the query plan is much more interesting, and at nearly twice the cost increase in overall cost (0.0263443 vs 0.0147141 for the previous plan, minus the COMPUTE). A complete analysis of this plan will be left as an exercise for the reader, or if there is interest, in a future post.
So why bother with such a complex query? The concept may prove to be beneficial with a different set of data. The sample data is so trivial that the benefits of the COUNT/OVER/PARTITION may not have a chance to shine.
Lessons Learned: Although it’s easy to become complacent with doing things the same way all the time, it can be equally challenging and interesting when presented with even the most trivial scenarios.
Like all things in SQL, there are many ways to arrive at the same result set. Fortunately for us, the SQL engine’s optimizer is usually a very sophisticated partner and helps us arrive at the optimal solution with the least cost, even when we throw seemingly different queries at it.
The flip side of that is that a seemingly innocuous change to a query can potentially change the runtime performance of it.
While best practices are always a good starting point when architecting a SQL solution, alternative paths should be tested to ensure optimal performance.
Thanks to Sean and Paul for presenting their question, who got me thinking about this from different points of view.
Be childlike; Stay Curious…
Dave
Leave a Reply
Using Gravatars in the comments - get your own and be recognized!
XHTML: These are some of the tags you can use:
<a href=""> <b> <blockquote> <code> <em> <i> <strike> <strong>