This editorial was originally published on Mar 5, 2008. It is being re-run as Steve is on vacation.
The Identity property might be one of the most controversial constructs in SQL Server, or at least right up there with the NULL/No NULLs debate. Someone sent me a note a while back about this debate and I thought it might be worth bringing forward in an editorial.
When I was getting started in SQL Server in the early 90s, I learned about primary keys, natural keys, and the ways you go about modeling a system. It all made good sense, especially at the time when I was working with systems that took years to build and were based on very well understood processes with tight controls on the data involved. So I learned to look for natural keys.
A few years later I moved to a smaller company, considerably more agile, that was trying out new ways of doing business. In the process of trying to revamp their systems, I found that things I was told were “unique” or “set in stone” weren’t. Not even close, more like set in Play-Doh. And even then in a very humid environment with no danger of things drying out.
I ran into identity fields at some point and found that they were lightweight, solved a lot of problems, didn’t interfere with natural keys, and above all, were easy for a DBA to remember and work with when building applications or troubleshooting issues.
However, there is often a huge debate out there about the value of identity columns and how they should be used. Some people, like Joe Celko, rail against their use in a relational system. Others use them everywhere, making them clustered primary keys for every table. Many people are somewhere in between, using them where it fits, and sticking with natural keys if they are identifiable. Some people even use identities as their primary key, relegating natural keys to alternates with a separate unique index.
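That last pattern — an identity surrogate as the primary key with the natural key kept as a unique alternate — might look something like this sketch (the table and column names are illustrative, not from any real system):

```sql
-- Identity surrogate as the clustered primary key; the natural key
-- (here a tax ID) is preserved as a unique non-clustered constraint,
-- so uniqueness is still enforced while joins use the small int key.
CREATE TABLE dbo.Customer
(
    CustomerID   int IDENTITY(1,1) NOT NULL,
    TaxID        varchar(20)       NOT NULL,  -- natural key
    CustomerName varchar(100)      NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerID),
    CONSTRAINT UQ_Customer_TaxID UNIQUE NONCLUSTERED (TaxID)
);
```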
I don’t think we put out enough information here, or on any other database site, that would help people better model their data and build relational systems. It’s a hard topic to write about and it’s hard to do it well without making everyone’s eyes glaze over.
But I wanted to put this out there and see what you think about identity columns and their use (or mis-use) in today’s databases. They are definitely a tool that’s available to the SQL Server professional and one that I bet everyone has an opinion on. We surely value those opinions and as with many of these debates, I’m looking forward to learning a thing or two.
The Voice of the DBA Podcasts
The podcast feeds are now available at sqlservercentral.podshow.com to get better bandwidth and maybe a little more exposure :). Comments are definitely appreciated and wanted, and you can get feeds from there.
Overall RSS Feed: or now on iTunes!
- Windows Media Podcast – 27.8MB WMV
- iPod Video Podcast – 22.2MB MP4
- MP3 Audio Podcast – 4.6MB
Today’s podcast features music by Everyday Jones. No relation, but I stumbled on to them and really like the music. Support this great duo at www.everydayjones.com.
I really appreciate and value feedback on the podcasts. Let us know what you like, don’t like, or even send in ideas for the show. If you’d like to comment, post something here. The boss will be sure to read it.
There was a time when natural keys were ideal for SQL Server. The key tuning phrase back then was “avoid hot spots”. Natural keys helped you avoid hot spots in your index or table by spreading out activity so that no one area was getting hit disproportionately more than others.
SQL Server has changed. Today, hot spots are good. SQL Server has evolved to favor hot spots: its caching mechanism favors high activity against the same spot in an index, and sequential, ever-increasing values are very good at creating a hot spot at the end of the index. SQL Server has evolved to favor identity values as primary keys.
Natural keys can definitely be good candidates for promoting this behavior as well. Sometimes they are not.
Single-key, small-value primary keys are also generally ideal for join performance and for index size. Natural keys sometimes fit the bill for this, sometimes not. And please don’t argue that disk space is cheap and therefore I shouldn’t care how big my indexes are. Disk space is not the reason I want my indexes small. It’s about the I/O that SQL Server has to perform and about the amount of memory the indexes use in the buffer pool. I/O is not cheap when it comes to performance.
I believe in using what will work the best. Natural keys if they meet the criteria for performance. Alternate keys, like an identity, if not. Committing oneself to either one ahead of time is just the wrong answer, in my opinion and experience.
Sometimes natural keys can get out of control too. Yesterday I was working on a table that was actually a built-out relation of some other tables, but since some of the keys from the original tables were so wide, I ended up with an 18-field key on my new table that I couldn’t index.
Here is my big question, especially for you Robert. What about using a non-clustered primary key on an Identity column, while clustering on the natural key?
I’m doing this in some tables and I’m not 100% sure if it’s the right way to go or not. I’ve run a variety of tests, but the results are not definitive. One thing I’ve noticed is that in many cases this results in key lookups when querying the data; for example, when joining on the primary key I believe you get a key lookup to the clustered index in order to complete the join.
An example of where I’m using this is a claim transaction table. There is a transaction id, which is an identity column that gives each transaction a unique id, but I’m clustering the transactions on customer id, claim id, transaction date, because that is how the transactions are always aggregated.
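A sketch of that arrangement, with illustrative names (the real table will differ), might look like:

```sql
-- Non-clustered primary key on the identity column, with the
-- clustered index placed on the natural aggregation key instead.
CREATE TABLE dbo.ClaimTransaction
(
    TransactionID   int      IDENTITY(1,1) NOT NULL,
    CustomerID      int      NOT NULL,
    ClaimID         int      NOT NULL,
    TransactionDate datetime NOT NULL,
    Amount          money    NOT NULL,
    CONSTRAINT PK_ClaimTransaction
        PRIMARY KEY NONCLUSTERED (TransactionID)
);

-- Cluster on the columns the transactions are aggregated by.
CREATE CLUSTERED INDEX CIX_ClaimTransaction
    ON dbo.ClaimTransaction (CustomerID, ClaimID, TransactionDate);
```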
My other thought was to use a simple clustered primary key on transaction id, and then create a non-clustered covering index on customer id, claim id, transaction date, including most of the other columns in the table. This would basically be creating two copies of the table ordered in different ways. Not really sure that’s a good idea either though…
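As a sketch of that alternative — assuming a claim-transaction table with the columns described above, under illustrative names — the two indexes might look like:

```sql
-- Simple clustered primary key on the identity column.
ALTER TABLE dbo.ClaimTransaction
    ADD CONSTRAINT PK_ClaimTransaction
    PRIMARY KEY CLUSTERED (TransactionID);

-- Covering non-clustered index for the aggregation pattern;
-- INCLUDE carries the extra columns so queries avoid key lookups.
CREATE NONCLUSTERED INDEX IX_ClaimTransaction_Covering
    ON dbo.ClaimTransaction (CustomerID, ClaimID, TransactionDate)
    INCLUDE (Amount);  -- plus whatever other columns queries need
```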
Hi R. G. As always, there is no one right answer. If you are getting key lookups for your query, then the query requires more columns than just the primary key value. SQL Server will never perform key lookups unless there are columns it can’t find in the smaller index.
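To illustrate with the hypothetical claim-transaction table from the question (assuming a non-clustered primary key on the identity and a clustered index on the natural key), a query like this triggers a key lookup:

```sql
-- The seek on the non-clustered PK index finds the row, but Amount
-- is not stored in that index, so SQL Server performs a key lookup
-- into the clustered index to fetch it.
SELECT TransactionID, Amount
FROM dbo.ClaimTransaction
WHERE TransactionID = 12345;
```

Selecting only `TransactionID` in the same query would need no lookup, because the non-clustered index alone covers it.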
One way to avoid key lookups, like you said, is a covering index. But I wouldn’t automatically add a covering index just because key lookups are occurring. Indexes have a maintenance overhead, and I would want to evaluate if the index is worth the cost of maintaining it.
For example, if key lookups rarely occur and, when they do, the result set is very small, then the index may not be warranted. On the other hand, if the key lookups are happening frequently and they are causing the queries to run a long time, then it sounds like it would be warranted.
Key lookups should only be occurring if the result set is small (generally less than 3% of the table), but on a very large table, that can still be a lot of rows.
My advice would be to run the Database Engine Tuning Advisor (DTA) on a non-production system with a production workload and evaluate the indexes. Or simply try out your proposed changes on a non-production system with a production workload and evaluate how the different strategies affect the workload.
Thanks Robert. What about the general practice of using a non-clustered primary key and clustering on a different set of columns?
It’s a very popular tactic with people who have a hard time changing their way of thinking from always using natural keys to using alternate keys. I think there is definitely a case where using it is appropriate, but it is probably used way more frequently than it should be.
It all depends on your workload and how it accesses the data.