It’s common knowledge that relational database engines do well indexing numeric data, particularly integer data. String data is a little more tricky and tends to take longer to search / index.
My original intent of this post was to simulate how the SQL Server storage engine might handle statistics on an index, in particular, a column with a character data type. During the course of the testing I realized that my testing really didn’t align with how SQL Server statistics works, but nonetheless, the hashing component of the exercise was worth while pursuing.
Since I work with fairly large data warehouses I decided to test hashing performance on a column that contains email address information.
So here goes… The first thing I’m going to do is create a table to store the data in:
use tempdb; go -- How Hashing Works --drop table tmp_hash CREATE TABLE tmp_hash ( --id int IDENTITY NOT NULL, hash_bucket tinyint, min_hash_in_bucket int, max_hash_in_bucket int, hash_value int, email_address nvarchar(50), first_name nvarchar(50), last_name nvarchar(50));
Let’s talk about the structure of this table:
hash_bucket: We will divide the table up into 200 equally sized “buckets” in which a range of hash values will be stored.
min_hash_in_bucket and max_hash_in_bucket: These columns will hold the smallest and largest hash value in the hash bucket.
hash_value: Contains the CHECKSUM(emailaddress). The idea of this exercise is to show how SQL Server handles hash values better than string data.
email_address, first_name, last_name: String data for human consumption.
My original thinking was to put a clustered index on the hash_value and to guarantee uniqueness of the index, an identity column was included. At the end of this exercise we’ll end up with 3 different queries that compare the use of the hash with the email address. In order to make all queries comparative I chose to make all indexes non-clustered. Of course in the real world we would have a clustered index on the table.
-- Create non-clustered index so comparisons are equivalent CREATE NONCLUSTERED INDEX idx_nc_tmp_hash_max_min ON [dbo].[tmp_hash] (min_hash_in_bucket, max_hash_in_bucket) INCLUDE (email_address, first_name, last_name); --drop index tmp_hash.idx_nc_tmp_hash -- Create non-clustered index so comparisons are equivalent CREATE NONCLUSTERED INDEX idx_nc_tmp_hash ON [dbo].[tmp_hash] ([hash_value]) INCLUDE (email_address, first_name, last_name); -- drop index tmp_hash.idx_tmp_hash_email_address CREATE NONCLUSTERED INDEX idx_tmp_hash_email_address ON [dbo].[tmp_hash] ([email_address]) INCLUDE (first_name, last_name);
OK, so we added the non-clustered indexes. Later on we’ll see how these indexes behave when querying either the hash values, the hash range and the email_address string data directly.
Let’s populate the table with data from the AdventureWorks db:
INSERT INTO tmp_hash ( hash_value, email_address, first_name, last_name) SELECT CHECKSUM(EmailAddress), EmailAddress, FirstName, LastName FROM AdventureWorks.Person.Contact; --(19972 row(s) affected)
The next step is where my thinking simulated but deviated from how SQL Server handles statistics. SQL Server creates histograms of up to 200 columns with the data distribution in each of the columns. So if the data tends to not distribute evenly across all 200 columns in the histogram the number of records in each of the buckets will vary.
In this test scenario I created 200 buckets for the data but each bucket has the same number of records in it (or nearly so since 19972 does not truly divide evenly by 200).
So here’s how I did it:
-- Now break up the hash_values into 200 equal sized chunks WITH hash_cte (hash_value, hash_bucket) AS ( SELECT hash_value, NTILE(200) OVER(ORDER BY hash_value) AS hash_bucket FROM tmp_hash ) UPDATE t SET hash_bucket = d.hash_bucket FROM tmp_hash t INNER JOIN hash_cte d ON d.hash_value = t.hash_value; -- Take a peek at the data so far. SELECT hash_bucket, COUNT(*) as [Size of Hash Bucket] FROM tmp_hash GROUP BY hash_bucket ORDER BY hash_bucket;
We simply used the ranking function NTILE to divide up the data into nearly equal sized chunks.
The second query shows the distribution of the data as 100 records in the first 173 buckets and 99 records in buckets 174 – 200.
The next thing we did was updated the table to contain the MIN and MAX hash_values in each of the buckets. The objective here is to provide the SQL Server storage engine a way to narrow down it’s search for an email address by providing a range of hash values.
-- Update the min hash value in each bucket WITH min_hash_cte (min_hash_value, hash_bucket) AS ( SELECT MIN(hash_value), hash_bucket FROM tmp_hash GROUP BY hash_bucket ), max_hash_cte (max_hash_value, hash_bucket) AS ( SELECT MAX(hash_value), hash_bucket FROM tmp_hash GROUP BY hash_bucket ) UPDATE t SET min_hash_in_bucket = [min].min_hash_value, max_hash_in_bucket = [max].max_hash_value FROM tmp_hash t INNER JOIN min_hash_cte [min] ON [min].hash_bucket = t.hash_bucket INNER JOIN max_hash_cte [max] ON [max].hash_bucket = t.hash_bucket;
Finally, I provide 3 distinct queries to query the table to retrieve not only the email address, but also the person’s name.
-- How do we actually use this to look up an email address? SELECT t.first_name, t.last_name, t.email_address FROM tmp_hash t WHERE CHECKSUM(N'eduardo22@adventure-works.com') BETWEEN t.min_hash_in_bucket AND t.max_hash_in_bucket AND t.email_address = N'eduardo22@adventure-works.com'; SELECT t1.first_name, t1.last_name, t1.email_address FROM tmp_hash t1 WHERE CHECKSUM(N'eduardo22@adventure-works.com') = t1.hash_value; -- Is this better than the naive approach? SELECT t.first_name, t.last_name, t.email_address FROM tmp_hash t WHERE t.email_address = N'eduardo22@adventure-works.com';
[Side note: The CHECKSUM(N'eduardo22@adventure-works.com') <> CHECKSUM('eduardo22@adventure-works.com')]
The concept behind the first query is to figure out which bucket the target email address is in. This should narrow down the range of records to look at to about 100 records, at which point it could choose to do a scan of the data. This would be similar to a partitioned table implementation that facilitates partition elimination, where all but the required partition of data would be looked at.
The second predicate in the WHERE clause provides the actual record we are looking for. The intent was to do the actual email lookup only AFTER the appropriate hash_bucket is identified.
What actually happened? The query optimizer chose to ignore the use of the hash_bucket range and go right to the index on the emailaddress data. Because we included a range to look at we ended up with the RID lookup. Had we decided to include the identity column and place a clustered index on it the RID bookmark would have been replaced with a Key Lookup. Either way, in this case the range of hash values is a waste of energy.
The intent of the second query was to go right to the hash value of the email address. The result is a much cleaner index seek that has a cost half the size of the first query (0.003 vs 0.006).
Finally, the third query simply accesses the emailaddress column directly. It’s cost is identical to the second query’s plan even though they are using different indexes.
Looking at the query plans of all 3 queries if run in a batch we confirm that the first query costs 2 times the cost of query #2 and query #3.
CONCLUSION: Based on the results of this test, it is clear that the hashing of the email address did not result in any performance improvement over the simple, direct access of the email address string. In fact, the overhead in terms of CPU, disk IO and storage of the hash values and ranges would suggest that the hashing is a total waste of time. For this data set that is arguably true.
Unfettered, I believe the relatively small data set of 19,972 rows is not sufficient to warrant the SQL optimizer to attempt to take short cuts to get to the data. My next post on this topic will use a much larger data set of about 500k rows, which is probably more real-world for most data warehouses.
Always curious,
Dave
Leave a Reply
Using Gravatars in the comments - get your own and be recognized!
XHTML: These are some of the tags you can use:
<a href=""> <b> <blockquote> <code> <em> <i> <strike> <strong>