On the hashing of keys

In Anchor we follow the established paradigm that an instance in the domain we are modeling should be represented only once in the database. For this reason, the surrogate keys we use as identities of such instances need to be dumb, in the sense that they can neither convey any meaning by themselves nor encode anything that has meaning. We prefer sequences, as they are small in size, cheap to generate, monotonically increasing, and reasonably hard to confuse with something that carries meaning.

As a discouraging example, let’s assume we would like to hash the natural key of a citizen of Sweden using MD5 and use the hashes as identities in our database. First, every citizen is identified by a personal number, of the form:

YYMMDD±NNNC

The date of birth is followed by a delimiter and four digits: a three-digit serial number followed by a final check digit. The serial number is even for men and odd for women. The delimiter is a minus sign if you are younger than 100 years and a plus sign once you get older than that. In other words, not an entirely stable key over time. To complicate things even further, foreigners visiting the country may be given a coordination number, which looks exactly like a personal number except with 60 added to DD. In any situation in which you need to provide your personal number but cannot do so, you may also be given a reserve number. A reserve number must retain a correct birth date but contain at least one letter instead of a digit in the NNNC part.
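For concreteness, the check digit C is computed with the standard Luhn (mod 10) algorithm over the nine digits YYMMDDNNN. A minimal sketch (the function names are ours, and the validation ignores the letter rules for reserve numbers):

```python
def luhn_check_digit(yymmddnnn: str) -> int:
    """Compute the Luhn (mod 10) check digit over the nine digits YYMMDDNNN."""
    total = 0
    for i, ch in enumerate(yymmddnnn):
        d = int(ch) * (2 if i % 2 == 0 else 1)  # weights 2,1,2,1,...
        total += d - 9 if d > 9 else d          # same as summing the product's digits
    return (10 - total % 10) % 10

def is_valid(personal_number: str) -> bool:
    """Validate a number written as YYMMDD-NNNC or YYMMDD+NNNC."""
    digits = personal_number.replace("-", "").replace("+", "")
    return luhn_check_digit(digits[:9]) == int(digits[9])
```

Note that the check digit depends only on the digits, which is why a coordination number (with DD+60) passes the very same check as an ordinary personal number.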

As many drawbacks as this system may have, it is a fact of life that every data warehouse in Sweden has to cope with. For example, the MD5 hash of the coordination number 890162-3286, belonging to someone staying in the country for a longer period of time, is ee0425783590ecb46030e24d806a5cd6. This needs 128 bits of storage, whereas an integer sequence using 32 bits will suffice for the population of Sweden with a healthy margin. Let’s also assume that we have a tie representing the relationship between married people. As there is a risk of divorce, such a tie must be knotted and historized. If we are purists, the key in the knot should also be a hash, even though it could be represented using a single bit: 1 for married and 0 for divorced. The keys in the hashed tie will consume 128 + 128 + 128 = 384 bits, whereas the sequenced tie will consume 32 + 32 + 1 = 65 bits. Ceteris paribus, the hashed tie is almost six times larger. The issue is further accentuated if you look at a plain-text representation of the data. A sequence from 1 to 10 million will use less than 70 million characters in plain text, whereas the 10 million hashes will use no less than 1.2 billion characters. In a situation where the license cost is based on a textual representation of raw data, the hashes would be almost 20 times as expensive.
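The bit arithmetic above can be checked with a few lines of Python; MD5 always produces a 16-byte digest regardless of input:

```python
import hashlib

# One MD5 digest is 16 bytes = 128 bits, whatever the input length.
digest = hashlib.md5(b"890162-3286").digest()
hash_bits = len(digest) * 8

# Keys in the marriage tie: two parties plus the knot.
hashed_tie_bits = 3 * hash_bits      # all three keys hashed
sequenced_tie_bits = 32 + 32 + 1     # two 32-bit sequences + 1-bit knot

print(hashed_tie_bits, sequenced_tie_bits)  # 384 65
print(round(hashed_tie_bits / sequenced_tie_bits))  # 6
```

Multiplied over tens of millions of tie rows, that factor of almost six is pure storage overhead carrying no information.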

To continue the example: after some period of time, the holder of 890162-3286 gains citizenship and the number becomes 890102-3286. The MD5 hash changes as well and is now 049fda914afa455fae66115acd78b616, completely different from before. Before gaining citizenship this person got married, so there are now rows in the tie with the wrong hash. In the sequenced tie we would expect to find the personal number as a historized attribute of the adjoined anchor. No problem here: two personal numbers now refer to the same sequence number, but for different time periods. The resolution with hashed keys would be to introduce another instance in the anchor with the new hash, together with an additional tie indicating that these two instances actually refer to the same thing in real life. Every query must now take this into account, slowing all queries down, and data must be duplicated, resulting in increased maintenance costs and degraded performance. Of course, if you are not interested in keeping a history of changes you can just update the existing rows, but this is a costly operation and may be quite troublesome if foreign keys are declared. We also presumed that such a history is a ‘must have’ requirement, as in any proper data warehouse.

The conclusion is that hashing is a poor choice when keys may change over time, because the hash, while looking like a meaningless string, still carries the remnants of the meaning of the input. This is sufficient to cause trouble! We also saw that the size of a “safe” hash is significantly larger than that of an integer sequence. By “safe” we mean that the hash is complex enough for collisions to be extremely rare. However minuscule the risk of a collision may be, it could still be a completely unacceptable event, should it occur. See Black Swan Theory for more information. The “safeness” of a hash is also proportional to the cost of generating it, so the safer you want to be, the more CPU cycles are spent trying to reassure you.

Published by

Lars Rönnbäck

Co-developer of the Anchor Modeling technique. Programmer of the online modeling tool. Site maintainer. Presenter and trainer.
