Natural vs. surrogate primary keys

Tom Kyte: “If you have a natural key, by all means – use it. The natural key should be immutable and sensible – they are sometimes “rare”, but if I had a document table, document version table – I would definitely use the document_id (might be a surrogate) as the primary key of one and (document_id, version#) as the primary key of the other (and document version might have an association to authors so its primary key is a foreign key elsewhere in the system too)” (from the AskTom thread "Natural Key as Primary Key Vs Surrogate Key")
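Kyte's document/version example is easy to sketch out. A minimal sketch in SQLite, with illustrative table and column names that are my own (only document_id and version# come from the quote): the parent gets a surrogate PK, and the child's natural composite key doubles as its PK and carries the FK.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE document (
    document_id INTEGER PRIMARY KEY,   -- surrogate key
    title       TEXT NOT NULL
);
CREATE TABLE document_version (
    document_id INTEGER NOT NULL REFERENCES document(document_id),
    version_no  INTEGER NOT NULL,
    body        TEXT,
    PRIMARY KEY (document_id, version_no)  -- natural composite key
);
""")
conn.execute("INSERT INTO document (document_id, title) VALUES (1, 'Spec')")
conn.execute("INSERT INTO document_version VALUES (1, 1, 'draft')")
conn.execute("INSERT INTO document_version VALUES (1, 2, 'final')")
n = conn.execute(
    "SELECT COUNT(*) FROM document_version WHERE document_id = 1"
).fetchone()[0]
print(n)  # 2 versions of document 1
```

The composite PK on the child gives version uniqueness for free, and any table referencing a specific version would carry (document_id, version_no) as its FK.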



Just reviewed your other surrogate/natural topics on AskTom. It looks like we are in agreement: it would be nice to use surrogate keys most of the time, but sometimes it depends on the nature of the data, the business, the anticipated use, and so on.

In the ideal world found only in my head, we would have surrogate keys on every entity. Makes for easier join query writing, automation and a few other niceties.

But reality frequently gets in the way.

In my first job as an overcharged but underpaid Andersen Consulting grunt, the tables we used for a large telco had no artificial keys, so every table had a 3-to-8-column natural key. Other than that, the model was fairly well normalized. In our code, where we needed to join 5 to 8 tables together, we frequently ended up with 30 to 40 lines dedicated just to the predicates that hooked them up. Since Andersen’s unit testing standard demanded a separate test condition for every permutation of a WHERE clause, including NULL and min/max values for each, I quickly became an undying fan of single-column surrogate keys.
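The predicate bloat is easy to reproduce. A contrived sketch, with made-up table and column names, counting the equality predicates needed to chain tables together on a three-column natural key versus a single surrogate:

```python
def join_predicates(tables, key_cols):
    """Equality predicates needed to chain `tables` together on `key_cols`."""
    preds = []
    for parent, child in zip(tables, tables[1:]):
        for col in key_cols:
            preds.append(f"{child}.{col} = {parent}.{col}")
    return preds

# Hypothetical 5-table telco join chain.
tables = ["cust", "acct", "line", "usage", "rate"]
nat = join_predicates(tables, ["region_cd", "acct_no", "line_no"])  # 4 joins x 3 cols
sur = join_predicates(tables, ["parent_id"])                        # 4 joins x 1 col
print(len(nat), len(sur))  # 12 vs 4 predicate lines
```

With the 8-table, 8-column worst case from the story, the natural-key version needs 56 predicate lines; the surrogate version needs 7. The testing burden scales the same way.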

But I’ve seen it taken too far as well: creating surrogate keys for everything, including associative/cross-reference entities, where they really didn’t help anything, were never used, and just got in the way.

Couple more stories…

My last company’s previous modeler insisted on using string-based short codes as the PK for most reference/lookup tables, mainly, they told me, to make querying easier for developers. They kept patting me on the head and telling me not to worry about meaningless PKs, that the codes would never change. Three years later they wanted to change one of the codes, and found the update to the child tables would take 6 to 20 hours at every customer’s site. Surrogate keys finally started to make sense to my technical architect, who couldn’t be convinced until that day came.
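The asymmetry is structural: a natural-key code is copied into every child row, so renaming it rewrites every child table, while a surrogate localizes the rename to one lookup row. A toy reproduction in SQLite, with invented table names and a thousand child rows standing in for the multi-hour updates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Natural-key variant: the code itself is the PK, copied into every child row.
CREATE TABLE status_nat (status_cd TEXT PRIMARY KEY, descr TEXT);
CREATE TABLE order_nat  (order_id INTEGER PRIMARY KEY,
                         status_cd TEXT REFERENCES status_nat(status_cd));
-- Surrogate variant: children carry a meaningless id; the code lives in one row.
CREATE TABLE status_sur (status_id INTEGER PRIMARY KEY,
                         status_cd TEXT UNIQUE, descr TEXT);
CREATE TABLE order_sur  (order_id INTEGER PRIMARY KEY,
                         status_id INTEGER REFERENCES status_sur(status_id));
""")
conn.execute("INSERT INTO status_nat VALUES ('OPN', 'Open')")
conn.execute("INSERT INTO status_sur VALUES (1, 'OPN', 'Open')")
conn.executemany("INSERT INTO order_nat VALUES (?, 'OPN')",
                 [(i,) for i in range(1000)])
conn.executemany("INSERT INTO order_sur VALUES (?, 1)",
                 [(i,) for i in range(1000)])

# Renaming the code 'OPN' -> 'OPEN':
conn.execute("UPDATE status_nat SET status_cd = 'OPEN' WHERE status_cd = 'OPN'")
rows_nat = conn.execute(
    "UPDATE order_nat SET status_cd = 'OPEN' WHERE status_cd = 'OPN'").rowcount
rows_sur = conn.execute(
    "UPDATE status_sur SET status_cd = 'OPEN' WHERE status_cd = 'OPN'").rowcount
print(rows_nat, rows_sur)  # 1000 child rows rewritten vs 1 lookup row
```

Even with ON UPDATE CASCADE doing the child updates for you, the engine still rewrites every child row; the surrogate version never touches them.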

The company sold a software system that was, for the most part, similar for all the participants in the energy industry. But there were enough variations that we had to maintain separate copies of the UI and data schema for the nine major markets. Underneath that, we had a common data model for the things that the nine markets shared. Some of these entities were partitioned and handled huge data volume.

Unfortunately, the modeler who preceded me decided to use a multi-column, local natural key for these tables. Every once in a while, we would sell the product to a new customer whose uniqueness requirements didn’t match the common model. So we would have to write a script that re-created the enormous unique key (UK) and roll it out to ALL the customers in ALL the markets to accommodate them. It meant hours of downtime for customers who shouldn’t have been affected by a new customer across the country. I was able to convince them to at least start using market-specific UKs on the tables shared in common. Now I hear they may be factoring the natural key out into a master table, giving it a surrogate, and then using the surrogate as the PK for the partitioned beast. That should ease their woes.
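That factoring-out pattern is worth sketching. Under my assumptions about their model (all names here are invented, loosely energy-flavored), the small master table owns the market-specific natural key as a unique constraint, while the huge child table keys on the surrogate; a new market's uniqueness rules then only ever touch the master:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Small master table owns the natural key...
CREATE TABLE meter_point (
    meter_point_id INTEGER PRIMARY KEY,          -- surrogate
    market_cd      TEXT NOT NULL,
    utility_cd     TEXT NOT NULL,
    service_no     TEXT NOT NULL,
    UNIQUE (market_cd, utility_cd, service_no)   -- natural key lives here only
);
-- ...while the huge child (partitioned, in Oracle) keys on the surrogate.
CREATE TABLE meter_reading (
    meter_point_id INTEGER NOT NULL REFERENCES meter_point(meter_point_id),
    read_ts        TEXT NOT NULL,
    kwh            REAL,
    PRIMARY KEY (meter_point_id, read_ts)
);
""")
conn.execute("INSERT INTO meter_point VALUES (1, 'ERCOT', 'TNMP', 'A-100')")
conn.execute("INSERT INTO meter_reading VALUES (1, '2009-01-01T00:00', 12.5)")
# Changing uniqueness rules for a new market means altering the UNIQUE
# constraint on meter_point only; meter_reading is never rebuilt.
n = conn.execute("SELECT COUNT(*) FROM meter_reading").fetchone()[0]
print(n)
```

Rebuilding a unique index on the small master table is minutes of work instead of the hours of downtime a rebuild on the partitioned table cost every customer.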

So natural/business keys that change can kill your business as well.

In the end, the judgment call for surrogate vs. natural keys depends on the circumstances. And that’s where an experienced data architect/modeler is invaluable.