Unless your marketing data warehouse is stored in a massively parallel appliance such as Netezza, your data most likely resides on a Symmetric Multi-Processing (SMP) computer in a relational database engine, such as Microsoft SQL Server.
SQL Server is a highly scalable, versatile database product that can serve the needs of a wide range of applications, from the tiniest 100MB database all the way up to multi-terabyte data warehouses. Along with that versatility comes the data architect's responsibility to optimize the configuration of the database for the specific application. The configuration of a transactional database, such as an order processing system, looks very different from that of a heavy-hitting operational data warehouse.
For large-scale data warehouses, such as marketing automation databases, optimization often means providing the end users with the fastest reporting times possible. Slow response times will quickly frustrate them and, inevitably, result in your phone ringing off the hook.
If you are running a large data warehouse, and don’t like having the red phone light up all day long, there are a few simple rules that must be followed in order to keep your customers happy:
- Spread your data across as many physical disk drives as possible and as evenly as possible. If you are building a large data warehouse, don’t put all of your data in a single file group that contains a single file. In order to do this you will need to:
- Create multiple file groups. File GROUPS are logical database containers that are used to hold database files.
- Create multiple database files for each of the file GROUPS. These are the physical database files. Ideally they will be spread across multiple physical drives.
- Rule #1: A database RECORD is always contained in a single database file. It is NEVER spread across multiple database files, even if those database files are in the same file GROUP.
- Rule #2: A database TABLE is always contained in a single database file GROUP.
- Rule #3: SQL Server uses a proportional fill algorithm to spread writes across the multiple database files within a file group. If all database files in the FILE GROUP are the same size, the same number of records will be written to each individual file (the desired effect!). Hence, always create the database files within a file group at the same size.
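The layout described above can be sketched in T-SQL. This is a minimal, illustrative example — the database name (MktDW), file sizes, and drive letters are all assumptions; the point is one filegroup holding four equally sized files on separate physical drives:

```sql
-- Hypothetical warehouse "MktDW": one data filegroup, four equal files,
-- each on what is assumed to be a separate physical drive (F: - I:).
CREATE DATABASE MktDW
ON PRIMARY
    (NAME = MktDW_Primary, FILENAME = 'E:\SQLData\MktDW.mdf',       SIZE = 1GB)
, FILEGROUP FG_Data
    (NAME = MktDW_Data1,   FILENAME = 'F:\SQLData\MktDW_Data1.ndf', SIZE = 100GB, FILEGROWTH = 10GB),
    (NAME = MktDW_Data2,   FILENAME = 'G:\SQLData\MktDW_Data2.ndf', SIZE = 100GB, FILEGROWTH = 10GB),
    (NAME = MktDW_Data3,   FILENAME = 'H:\SQLData\MktDW_Data3.ndf', SIZE = 100GB, FILEGROWTH = 10GB),
    (NAME = MktDW_Data4,   FILENAME = 'I:\SQLData\MktDW_Data4.ndf', SIZE = 100GB, FILEGROWTH = 10GB)
LOG ON
    (NAME = MktDW_Log,     FILENAME = 'J:\SQLLogs\MktDW_Log.ldf',   SIZE = 20GB);
GO
-- Make FG_Data the default filegroup so new tables land there, not in PRIMARY.
ALTER DATABASE MktDW MODIFY FILEGROUP FG_Data DEFAULT;
```

Because the four files start at identical sizes, proportional fill will stripe incoming records across them evenly.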
- Initialize your database files to the maximum size that will be needed to hold your data. This is far easier said than done. It’s very easy to grossly underestimate the amount of data that will flow into your data warehouse. When that happens the data files will auto-grow, which is a SLOW process. Even more important in the context of reporting, the new disk space added to the file group will probably not be contiguous with the original file, which leads to slower reporting.
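If a file does turn out to be undersized, it is better to grow it once, deliberately, during a maintenance window than to let auto-grow fire repeatedly during loads. A sketch, with a hypothetical database and file name:

```sql
-- Grow an undersized file once, to its new full size.
-- Note: SIZE is the new TOTAL size, not an increment.
ALTER DATABASE MktDW
MODIFY FILE (NAME = MktDW_Data1, SIZE = 200GB);
```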
- Use the fewest number of threads possible when loading data. If you set MAXDOP (max degree of parallelism) to 1, the incoming data will be written to contiguous chunks of disk space. Data warehouse reporting typically grabs lots of related records, usually over a time range, which means pulling records that were all loaded at the same time. This is known as a RANGE SCAN. Range scans work best when the data is stored on contiguous blocks of disk. It’s really that simple… By using a single thread to lay the data down on disk, we are sacrificing load performance for improved reporting performance. That’s OK: you only load the data once, and hopefully no people are involved in the load process. The data will be read many times over by people waiting for reports to come out of the system. A slower load time is a fair trade for a faster user experience any day of the week.
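One way to apply this per-load (rather than instance-wide) is a MAXDOP query hint on the load statement itself. A sketch — the table and column names are hypothetical:

```sql
-- Hypothetical load from a staging table into a fact table, single-threaded
-- so the rows land in contiguous extents. TABLOCK can additionally enable
-- minimally logged inserts under the right recovery model.
INSERT INTO dbo.FactOrders WITH (TABLOCK)
SELECT OrderID, CustomerID, OrderDate, Amount
FROM staging.OrdersLoad
OPTION (MAXDOP 1);
```

The hint overrides the server-level max degree of parallelism setting for just this statement, so reporting queries elsewhere can still run in parallel.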
- Maintain the highest-quality statistics on your data that you can. If your tables tend to grow incrementally over time, the auto-statistics feature of SQL Server will work fine, for a while. At some point the low sample rate of the auto stats will result in inaccurate statistics. For that reason it is absolutely imperative that a rock-solid statistics maintenance plan be put in place. And don’t be fooled into thinking that the stock “maintenance plan” wizard in SQL Server will suffice. What happens if a statistic becomes severely out of date? The query optimizer will generate a bad plan for the SQL query, and the query will appear “hung”. The consumer of the data will think the server is busy or that you have a bad design, when in fact you just have outdated statistics.
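A statistics maintenance job can start as simply as checking how old the statistics are and rebuilding stale ones with a full scan instead of the default low sample rate. A sketch against a hypothetical FactOrders table:

```sql
-- How stale are the statistics on this table?
SELECT s.name                               AS stat_name,
       STATS_DATE(s.object_id, s.stats_id)  AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('dbo.FactOrders');

-- Rebuild them with a full scan rather than the default sampled update
UPDATE STATISTICS dbo.FactOrders WITH FULLSCAN;
```

FULLSCAN is expensive on very large tables, so schedule it off-hours and target the tables whose statistics actually drift.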
- Give SQL Server as much memory as you can afford. Whatever you give it, it will use. If your server is a dedicated database server, turn on Lock pages in memory. This will prevent SQL Server from ever giving up memory after it has allocated it. If your server is used for other applications however, locking memory pages may end up starving those applications of memory, which may end up making the database look slow.
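The memory cap itself is set with sp_configure; Lock Pages in Memory is granted to the service account through Windows security policy, not through T-SQL. A sketch assuming a hypothetical dedicated 64 GB server:

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- Leave roughly 6 GB for the OS on an assumed 64 GB dedicated box
EXEC sp_configure 'max server memory (MB)', 59392;
RECONFIGURE;
```

How much to leave for the OS depends on what else runs on the machine; the 6 GB headroom here is an assumption, not a formula.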
- On a related note, separate services whenever possible. SQL Server Integration Services (SSIS), like the SQL Server engine, will grab as much memory as is available. If you just gave all of the system memory to SQL Server (minus a few GB so you don’t starve the OS), what will be left for your ETL loads? Put SSIS on a separate server.
- If you have tables that are loaded once a month and then used for reporting purposes the rest of the month, consider placing these tables in READ ONLY file groups. The SQL engine will reward you with a drastic reduction in locks on your data, and your data consumers will think you are a genius.
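Flipping the filegroup between read-write and read-only brackets the monthly load. A sketch with hypothetical names — note that setting a filegroup READ_ONLY requires exclusive access to the database, so this belongs in the load window:

```sql
-- Before the monthly load: make the filegroup writable again
ALTER DATABASE MktDW MODIFY FILEGROUP FG_Monthly READ_WRITE;

-- (run the monthly load here)

-- After the load completes: lock it back down for the reporting month
ALTER DATABASE MktDW MODIFY FILEGROUP FG_Monthly READ_ONLY;
```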
- For very large tables, say more than 100 million records, consider a horizontal partitioning strategy. SQL Server does not have a nice wizard to set up partitioning; it is very technical to implement and deploy. But if you have a table with 200M records and usually don’t care about any of the data except the most recent month, horizontal partitioning will allow correctly designed queries to ignore all but that month’s data. This is called partition elimination, and it can make a huge difference in query performance.
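The moving parts are a partition function (the boundary values), a partition scheme (which filegroups the partitions live on), and a table created on that scheme. A sketch — table, filegroup, and boundary dates are all illustrative:

```sql
-- Monthly boundaries; RANGE RIGHT puts each boundary date in the partition to its right
CREATE PARTITION FUNCTION pfMonthly (date)
AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

-- Map every partition to one filegroup (they could also be split across several)
CREATE PARTITION SCHEME psMonthly
AS PARTITION pfMonthly ALL TO (FG_Data);

-- The table is partitioned on its date column
CREATE TABLE dbo.FactOrders
(
    OrderID   bigint NOT NULL,
    OrderDate date   NOT NULL,
    Amount    money  NOT NULL
) ON psMonthly (OrderDate);

-- A query that filters on the partitioning column touches only one partition
SELECT SUM(Amount)
FROM dbo.FactOrders
WHERE OrderDate >= '2024-03-01' AND OrderDate < '2024-04-01';
```

Partition elimination only kicks in when the predicate is on the partitioning column, which is what "correctly designed queries" means in practice.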
- Do your housecleaning… Get rid of indexes that are no longer needed. Add new indexes sparingly and with great caution. But do so when it makes sense. I’ve never seen a perfectly indexed database in the initial design. Drop “temp”/”test” tables. Archive old data. Better yet, deploy an automated archiving process. There’s no point in keeping 10 years’ worth of data when your end users only need the current year plus last year.
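For the index portion of the housecleaning, sys.dm_db_index_usage_stats can flag indexes that are maintained on every write but never read. A sketch — note this DMV resets on instance restart, so check it only after a representative workload period:

```sql
-- Nonclustered indexes that have been updated but never seeked, scanned,
-- or looked up since the last instance restart: candidates for removal.
SELECT OBJECT_NAME(i.object_id) AS table_name,
       i.name                   AS index_name,
       us.user_updates
FROM sys.indexes AS i
JOIN sys.dm_db_index_usage_stats AS us
  ON us.object_id   = i.object_id
 AND us.index_id    = i.index_id
 AND us.database_id = DB_ID()
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
  AND i.index_id > 1                                        -- nonclustered only
  AND us.user_seeks + us.user_scans + us.user_lookups = 0   -- never read
  AND us.user_updates > 0                                   -- but still maintained
ORDER BY us.user_updates DESC;
```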
- And last but not least, my personal favorite: NEVER SHRINK a database or database file. If you do, you just fragmented your data and indexes, probably FOREVER. The only way to get rid of that fragmentation will be to move the data to a new file group that isn’t fragmented. The catch-22: if you just did a shrink on your database, it’s probably because you were running low on disk space. If you don’t have the disk space to auto-grow your existing data files, you probably don’t have the disk space to move the data to a new file group. The data will remain fragmented, FOREVER.
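If a shrink has already happened, you can at least measure the damage with sys.dm_db_index_physical_stats, and — when disk space allows — rebuild the clustered index onto a fresh filegroup to move the data out. A sketch with hypothetical table, index, and filegroup names:

```sql
-- Measure fragmentation (LIMITED is the cheapest scan mode)
SELECT OBJECT_NAME(ips.object_id)       AS table_name,
       i.name                           AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id
 AND i.index_id  = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30;

-- Rebuilding the clustered index onto a new filegroup moves the table there
CREATE UNIQUE CLUSTERED INDEX PK_FactOrders
    ON dbo.FactOrders (OrderID)
    WITH (DROP_EXISTING = ON)
    ON FG_Fresh;
```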
Your reporting customers want their data as simply and as quickly as possible. Making that happen only requires following some basic design and maintenance strategies.
What strategies have you employed to avoid a data scare-house?