Surrogate Keys or Natural Keys?
by Craig S. Mullins
If you’ve worked with relational database systems for any length of time you’ve probably participated in a discussion (argument?) about the topic of this month’s column, surrogate keys. A great debate rages within the realm of database developers about the use of ‘synthetic’ keys. And if you've ever Googled the term "surrogate key" you know the hornet's nest of opinions that swirls around "out there" about the topic. For those who haven't heard the term, here is my attempt at a quick summary: a surrogate key is a generated unique value that is used as the primary key of a database table; database designers tend to consider surrogate keys when the natural key consists of many columns, is very long, or may need to change.
Proponents of the surrogate key argue that a concise, manageable key is easier to deal with than an unwieldy multi-
Those arguing against surrogate keys outline their disadvantages. Perhaps the strongest claim is the disassociation of the generated surrogate key from the real-
I doubt that there is any “final word” on this topic. Indeed, the argument has been raging on for years and years now. However, when I get to the performance portion of the argument I am somewhat skeptical. Does fewer columns in a join really translate to better performance? I can see how it would be easier to code for a programmer, but not necessarily for the DBMS to optimize. If you’ve created an index on those multiple columns how much “worse” will the performance be, really? Sure, the SQL is more difficult to write, but will a join over 4 or 5 indexed columns perform that much worse than a join on one indexed column? I suppose if the number of columns required for the natural key is absurdly high (e.g. 10 columns???) the impact could be greater. And perhaps if you are swapping a variable length key with a surrogate having a fixed length key performance can be impacted.
At any rate, consider this: for the parent table at least, the natural key columns are still going to be there -
And what about the impact of adding data? If there are a significant number of new rows being added at the same time by different processes there will be locking issues as they all try to put the new data on the same page, unless, of course, your surrogate key is not a sequential number and is, instead, something like the microseconds portion of the current timestamp. But then that must be tested to avoid duplicates, further degrading performance.
The one thing that usually causes me to tend to favor natural keys is just that – they are natural. If the data is naturally occurring it becomes easier for end users to remember it and use it. If it is a randomly generated surrogate nobody will actually know the data. Yes, this can be masked to a great deal based on the manner in which you build your applications to access the data, but ad hoc access becomes quite difficult.
I guess the bottom line on surrogate keys versus natural keys is that “it depends” on a lot of different considerations… and that should really be no surprise to DBAs at all!
From Database Trends and Applications, September 2012.
© 2012 Craig S. Mullins,