A Poor Data Model

Recently there was some online complaints about social security numbers (SSNs) in the US being duplicated and re-used by individuals. This is really political gamesmanship, so ignore the political part. Just know that social security numbers appear to be one of the contenders used in many data models.

I found a good piece about how SSNs aren’t unique, and have a mess of problems. Despite this, many people seem to want to use SSNs as a primary or alternate key in their database systems. They also aren’t well secured in many systems, even though we should consider this sensitive PII data.

As we have moved many analog systems to digital ones, we often find that our assumptions about the rules governing data aren’t that well defined. I’ve worked in many systems where data elements were assumed to be unique in some way, but they actually weren’t. I’ve seen invoices, POs, and other “unique numbers” actually duplicated because of simple mistakes by humans. When we try to enforce uniqueness in a database, we end up with problems. Often we actually need to drop keys and make exceptions because the data must be entered.

This has led many people to not create unique constraints or even foreign keys in their systems, and I understand why they don’t. There are real problems when we assume the real world has the same strict structures we implement in code. I’ve seen systems go online and then FKs removed because of poor data quality. It might be a mistake, but it’s also a reality when we find there is existing child data without a parent. We might create a pseudo-parent at times, but we might also decide not to do so if that creates other problems.

I suspect over time the real world will migrate some of their problematic keys to something more robust. However, some that are used widely in older systems, like SSNs, are unlikely to change in my lifetime. There are far too many places where this is in use and I am not sure that there is any consensus to undertake the massive amount of work to implement something else. I suspect even trying to add digits to the value is a task we’ll put off indefinitely.

Be careful of using natural keys in your data models unless you are sure they are really a natural key. To me, I’m better off with some surrogate key in the event that my “natural key” turns out to not have the uniqueness I expected.

Steve Jones

Listen to the podcast at Libsyn, Spotify, or iTunes.

Note, podcasts are only available for a limited time online.

Unknown's avatar

About way0utwest

Editor, SQLServerCentral
This entry was posted in Editorial and tagged . Bookmark the permalink.

2 Responses to A Poor Data Model

  1. Maybe I’m tool old school but I grew up on the idea that your primary Key was always something entirely disconnected from eth data set like a self incrementing integer value. Sure you’d have indexes on critical info but the only to way to ensure uniqueness is if the key is not related to the data.

    I’ve bumped heads with a former employer over the lack of security with regards to customers critical data used in identify theft like Birthdate + Name + SSN. I believe in small, very limited government but private sector abuses like this make it a necessity. Unless required to and or unless they fear the results of not doing so most private sector businesses are not going to make any efforts on their own motivation to do the right thing with regards to customer data. Some will but too many will not without being forced to.

    Like

    • way0utwest's avatar way0utwest says:

      I think old school was that you always have a natural key somewhere, like an SSN. Lots of people I learned from always wanted a natural key, not a surrogate key.

      I realized quickly that was fraught, when we had to add something like birthday to an SSN. That seems silly. Use a surrogate key and then add constraints if you are trying to (for now) ensure uniqueness in your supposed natural key.

      Like

Comments are closed.