Testing HASHING to Speed Up Queries on Strings Part II

The objective of my last post, Testing HASHING to Speed Up Queries on Strings Part I, was to demonstrate query lookup performance when querying hash columns instead of the original string value.  In my test I chose a small set of about 20,000 email addresses as the base test case.

To recap the last post:

  • A temp table was setup to contain an email address, first and last name data columns.  A CHECKSUM was applied to each email address to compute the email string’s “numerical equivalent”.
  • Using the NTILE ranking function I then divided the data up into 200 near-equally sized “buckets” based on the hash value of the email address.
  • The MIN and MAX hash value was then computed for each hash bucket.
  • Each hash bucket was then assigned a unique value from 1 to 200.

The actual testing of the HASHING involved three simple queries that attempted to select a single record from the table.

Query 1 looked up the email address by computing the CHECKSUM of the email address string, determining which hash bucket the email address was in by checking to the min and max hash values, and then finally looking up the email address in the bucket.

The concept behind this query was that once we knew which bucket the email address was in our search was narrowed down to the 200 records in the bucket.

In practice, what I had hoped would be the fastest query, turned out to be the slowest. The query never even looked at the hash bucket, instead it went right to the index on the actual email address and doing a RID lookup on the min and max hash values.  Down right ugly.

Query 2 simply used the email address’s hash value to lookup the value directly.  This query used an index on the hash column and did an index seek to locate the record.

Query 3 was even more straight forward.  I looked up the email address by the email address string.  The index on the email address was used via an index seek and the row was returned.

Although the query plans used different indexes, queries 2 and 3 had the exact actual SUBTREE cost.  Query 1, the query that had all of the hashing technology built into it, had twice the cost of queries 2 and 3.

————————————————————————————————–

The purpose of this second test was to see if the queries performed differently for a larger data set.  The original data set of 20,000 rows was too small for the query optimizer to justify using the hashing function.

In this second test I loaded the test table with 500,000 rows.  Unfortunately, the 3 query test results were identical.  The direct access methods (query 2 and 3) performed at half the cost of the hashed approach (query 1).

As is so often the case with undirected testing, where the results are anticipated by not known, this test has raised more questions than answers.  More than likely, I’ll be back to revisit this post as I learn how to apply hashing to data that will actually yield improved query performance.

On that note I will raise some questions:

Is it possible 500,000 rows is still to small to make the hashing come into play?

Perhaps the nature of the email strings is not a good example for hashing.  After all, email strings tend to be fairly short in size.  Perhaps hashing would be better suited for longer string values, such as a comments field?

Had the dataset been much larger, would this hashing approach lent itself to partitioning the data around the hash buckets?  Let’s say the data had 500M records (10 times larger) then the hash buckets would have been 20,000 records in size.  Would the storage engine reap the benefits if the optimizer could find the right hash partition, eliminating all other partitions, then find the targeted email address within the 20,000 record partition?  I think it might…

I can’t imagine anyone actually wanting to partition their data by email address but perhaps some other string data would be relevant.  Perhaps GUIDs?

—————————————————————————————————-

Have you had any success leveraging hashing to improve your query performance?  If you have I’d love to hear about it!

Still curious…

Dave

0 Comments

Leave a Reply

You must be logged in to post a comment.

Using Gravatars in the comments - get your own and be recognized!

XHTML: These are some of the tags you can use: <a href=""> <b> <blockquote> <code> <em> <i> <strike> <strong>