On the hashing of keys

In Anchor we follow the established paradigm that an instance in the domain we are modeling should only be represented once in the database. For this reason, the surrogate keys we use as identities of such instances need to be dumb, in the sense that they neither convey any meaning by themselves nor encode anything that has meaning. We prefer sequences, as they are small in size, cheap to generate, monotonically increasing, and reasonably hard to confuse with something that carries meaning.

As a discouraging example, let’s assume we would like to hash the natural key of a citizen of Sweden using MD5 and use the hashes as identities in our database. First, every citizen is identified by a personal number, of the form:

YYMMDD±NNNC

The date of birth is followed by a delimiter and four digits, where the first three constitute a serial number and the last is a check digit. The serial number is even for men and odd for women. The delimiter is a minus sign if you are younger than 100 years old and a plus sign once you get older than that. In other words, this is not an entirely stable key over time. To complicate things even further, foreigners visiting the country may be given a coordination number, which looks exactly like a personal number except that 60 is added to the day of birth (DD+60). In any situation in which you need to provide your personal number but cannot do so, you may also be given a reserve number. A reserve number retains a correct birth date but contains at least one letter instead of a digit in the NNNC part.

As many drawbacks as this system may have, it is a fact of life that every data warehouse in Sweden has to cope with. For example, the MD5 of someone staying in the country for a longer period of time with coordination number 890162-3286 is ee0425783590ecb46030e24d806a5cd6. This can be stored as 128 bits, whereas an integer sequence using 32 bits will suffice for the population of Sweden with a healthy margin. Let’s also assume that we have a tie representing the relationship between married people. As there is a risk of divorce, such a tie must be knotted and historized. If we are purists, the key in the knot should also be a hash, even if it could be represented using a single bit with 1 for married and 0 for divorced. The keys in the hashed tie will consume 128 + 128 + 128 = 384 bits, whereas the sequenced tie will consume 32 + 32 + 1 = 65 bits. Caeteris paribus, the hashed tie is almost six times larger. The issue is further accentuated if you look at a plain text representation of the data. A sequence from 1 to 10 million will use less than 70 million characters in plain text, whereas the 10 million hashes will use no less than 1.2 billion characters. In a situation where the license cost is based on a textual representation of raw data, the hashes would be almost 20 times as expensive.
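
To make the size difference concrete, here is a minimal sketch of the two alternatives in T-SQL (the sequence name is made up for the illustration):

-- MD5 of the coordination number: 16 bytes (128 bits) of storage per key.
select hashbytes('MD5', '890162-3286') as hashed_identity;

-- An integer sequence: 4 bytes (32 bits) per key, enough for the population of Sweden.
create sequence citizen_id as int start with 1 increment by 1;
select next value for citizen_id as sequenced_identity;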

To continue the example, after some period of time, 890162-3286 gains citizenship and becomes 890102-3286. The MD5 changes as well and is now 049fda914afa455fae66115acd78b616, completely different from before. Before the citizenship this person got married, so there are now rows in the tie with the wrong hash. In the sequenced tie we would expect to find the personal number as a historized attribute of the adjoined anchor. No problem here: two personal numbers now refer to the same sequence number, but during different time periods. The resolution with hashed keys would be to introduce another instance in the anchor with the new hash, together with an additional tie indicating that these two instances actually refer to the same thing in real life. Any query must now take this into account, slowing all queries down, and data must be duplicated, resulting in increased maintenance costs and degraded performance. Of course, if you are not interested in keeping a history of changes you can just update the existing rows, but this is a costly operation and may be quite troublesome if foreign keys are declared. We also presumed that such a history is a ‘must have’ requirement, as in any proper data warehouse.
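
As a sketch of how the sequenced solution absorbs the change (the anchor, attribute, and dates below are hypothetical, not taken from an actual model), the same surrogate identity simply gets a new attribute version:

-- Hypothetical historized attribute holding the personal number of a person anchor.
insert into PE_PNR_Person_PersonalNumber (PE_ID, PE_PNR_Person_PersonalNumber, PE_PNR_ChangedAt)
values (42, '890162-3286', '2010-01-01'), -- as a coordination number
       (42, '890102-3286', '2018-01-01'); -- as a personal number, after citizenship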

The conclusion is that hashing is a poor choice when keys may change over time, due to the fact that the hash, while looking like a meaningless string, actually still carries remnants of the meaning of the input. This is sufficient for causing trouble! We also saw that “safe” hashes are significantly larger in size than integer sequences. By “safe” we mean that the hash is complex enough for collisions to be extremely rare. However minuscule the risk of a collision may be, it could still be a completely unacceptable event, should it occur. See Black Swan Theory for more information. The “safeness” of the hash is also proportional to the cost of generating it, so the safer you want to be, the more CPU cycles are used in order to try to reassure you.

Polymorphic Graph Queries in SQL Server

Sometimes I get the question of when Anchor Modeling is not suitable. “Actually, most of the time it is”, is my common answer. However, there are times when the requirements are such that you need a bit of trickery on top of a model. One such case recently emerged at a client. The dilemma was how to ask polymorphic graph queries in SQL Server when you have a network represented as a parent-child relationship in your Anchor model. First, a polymorphic graph query is one in which you want to find nodes with certain properties connected through any number of edges in your network. For example, “find all computers that at any point have a wireless connection between them”. You may think that the new graph table types in SQL Server 2017 would solve this, but alas, they do not support these types of queries (yet).

Fortunately, SQL Server 2008 introduced an often overlooked data type: HIERARCHYID. At first glance it looks disappointing, but it turns out that by using string searches and manipulation, polymorphic queries can be asked. Below is an example that shows how this is done; it should of course be applicable to any type of network, not just ones containing computers, switches, routers, cables and wireless connections. As a small bonus, a hint is also given of how to solve the traveling salesman problem.
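
As a rough sketch of the idea (table and column names are made up here, and the full example is more elaborate), the string representation of a hierarchyid turns connectivity into a prefix search:

-- Hypothetical encoding: every node in the network gets a hierarchyid Path that
-- encodes its chain of parents, so a descendant's path string always starts with
-- the path string of its ancestor.
create table Node (
   NodeId   int primary key,
   NodeType varchar(42) not null, -- 'Computer', 'Switch', 'WirelessConnection', ...
   Path     hierarchyid not null
);

-- "Find all computers that at any point have a wireless connection between them."
select a.NodeId as ComputerA, b.NodeId as ComputerB
from Node a
join Node b
  on b.Path.ToString() like a.Path.ToString() + '%' -- b lies somewhere below a
 and b.NodeId <> a.NodeId
where a.NodeType = 'Computer'
  and b.NodeType = 'Computer'
  and exists (
     select 1
     from Node w
     where w.NodeType = 'WirelessConnection' -- with a wireless edge on the path between them
       and w.Path.ToString() like a.Path.ToString() + '%'
       and b.Path.ToString() like w.Path.ToString() + '%'
  );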

Poor Man’s Auditability

One common question in Anchor Modeling is what to do with values that suddenly disappear. Our answer has been that values do not disappear, but may become unknown, and if such occurrences are of interest then this fact should be modeled explicitly. The exception is when corrections are made, and if you need to keep track of them the solution is to use concurrent-reliance-temporal modeling. However, there may be scenarios in which you only want to keep some simple auditability of corrections (or some bastardization of disappearing values) while still remaining in unitemporal modeling. This is now possible in the test version of the online Anchor Modeler, thanks to the addition of deletability.

In order not to break compatibility with existing code, new columns have been added to the latest views for deletable attributes. For an attribute marked as deletable, the update trigger on the latest view works as follows:

update lST_Stage
set
   ST_NAM_Stage_Name = null,
   ST_NAM_ChangedAt = '2018-05-04',
   Deletable_ST_NAM = 1
where
   ST_ID = 42;

Values can be deleted only when Deletable_ST_NAM is set to 1 in an update. Without that assignment, the update works exactly as before, which is to ignore anything being set to null. The result of the code above is that a new table is created, named ST_NAM_Stage_Name_Deleted, to which all matching rows are moved. Note that this clears out all history. The new table will have one additional column containing the date of the deletion, which is the current date if the attribute is static or if no ST_NAM_ChangedAt is present in the update, or whatever ST_NAM_ChangedAt is set to if it is present. This should give you a “poor man’s auditability” for disappearing values in unitemporal modeling.
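
Hypothetically (the name of the deletion-time column is assumed here, not taken from the generated code), the audit trail can then be inspected like this:

-- The moved rows, with their full history, together with the time of the deletion.
select ST_ID, ST_NAM_Stage_Name, ST_NAM_ChangedAt, ST_NAM_DeletedAt
from ST_NAM_Stage_Name_Deleted
where ST_ID = 42;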

Optimizing BETWEEN joins

Have you ever found yourself needing to do a join on a BETWEEN condition? If so, you have probably also been struggling with performance. The optimizer in SQL Server just cannot seem to produce a good execution plan for BETWEEN. Thanks to a lot of trial and error, we were however able to find a workaround!

This workaround, using a CROSS APPLY together with a SELECT TOP 1 … ORDER BY, should prove useful outside of the Anchor Modeling community as well. The example is a draw consisting of a number of floating point numbers, similar to Fisher’s noncentral hypergeometric distribution, but the technique works just as well for date ranges.
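
A sketch of the pattern follows, using made-up tables for a date range lookup rather than the floating point draw of the actual example:

-- Instead of a join on BETWEEN, which the optimizer struggles with:
--   select e.EventId, p.PeriodId
--   from Event e
--   join Period p on e.EventDate between p.ValidFrom and p.ValidTo
-- pick the single candidate row with CROSS APPLY and TOP 1 ... ORDER BY,
-- assuming that the periods do not overlap:
select e.EventId, p.PeriodId
from Event e
cross apply (
   select top 1 pr.PeriodId, pr.ValidTo
   from Period pr
   where pr.ValidFrom <= e.EventDate
   order by pr.ValidFrom desc -- the latest period starting on or before the event
) p
where e.EventDate <= p.ValidTo; -- discard events that fall in a gap between periods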

Performance improvements

The code generated by the test version of the online modeler has been modified to produce better performing insert triggers and perspectives. Currently, these modifications have only been made for unitemporal models and only affect the historized parts of the model. Thanks to “Live Query Statistics” in SQL Server, a bottleneck in the insert trigger was identified: the scalar-valued functions finding previous and following values. This code has instead been inlined. When writing new BIML generation for the sisula ETL framework, an apparently faster way to find the latest version was found, using ROW_NUMBER and a “superfluous” GROUP BY to retain table elimination. This has only been tested on SQL Server 2016 so far, but should work from version 2008 onwards. As a side note, the SSIS packages generated from the BIML are orders of magnitude faster than the insert triggers, so we recommend using these for large volume data warehouses or when priming with historical data. Please test and report back; if no issues are found, this will go into the production version as a new release.
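
As an illustration of the pattern (names are hypothetical and this is not the generated code itself), finding the latest version of a historized attribute could look like this:

-- ROW_NUMBER picks the latest version per identity; the seemingly superfluous
-- GROUP BY keeps the subquery grouped on its key, which helps the optimizer
-- retain table elimination when the attribute is outer-joined in a perspective.
select ST_ID, ST_NAM_Stage_Name, ST_NAM_ChangedAt
from (
   select
      ST_ID,
      ST_NAM_Stage_Name,
      ST_NAM_ChangedAt,
      row_number() over (partition by ST_ID order by ST_NAM_ChangedAt desc) as ReversedVersion
   from ST_NAM_Stage_Name
   group by ST_ID, ST_NAM_Stage_Name, ST_NAM_ChangedAt
) latest_version
where ReversedVersion = 1;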

Co-locating in an MPP-RDBMS

The idea behind a Massively Parallel Processing Relational Database Management System is that each node that participates in the cluster should be as autonomous as possible. We have earlier shown that, using the available distribution techniques in for example HP Vertica, it is possible to co-locate an instance of an anchor and all of its attributes along with any knots they relate to and any ties the instance participates in. Recently, we were asked how to additionally co-locate closely related instances.

In a very simple thought experiment, this is actually not so difficult to do. Imagine a model of restaurants and their menus. It is reasonable that in many cases an instance of a restaurant and its related menu instances would be part of the same result set. If we can co-locate these instances on the same node, the node will become even more autonomous. However, a simple modular hash on the identities of restaurants and menus will distribute them evenly across the nodes, but will for the most part not keep related instances together.

Let us look at a possible solution using some pseudo code.

Let 3 be the number of nodes in the cluster.
Let s_restaurant be a sequence starting with 1 incremented by 1.
Let s_menu0     be a sequence starting with 3 incremented by 3.
Let s_menu1     be a sequence starting with 1 incremented by 3.
Let s_menu2     be a sequence starting with 2 incremented by 3.
Let {1, 2, 3, 4, 5, 6} be ids of six restaurants.
Let id modulo 3 determine the node on which to locate an instance.
Then restaurants {3, 6} will be co-located on node 0.
Then restaurants {1, 4} will be co-located on node 1.
Then restaurants {2, 5} will be co-located on node 2.
Let s_menu0 generate ids for menus of restaurants on node 0.
Let s_menu1 generate ids for menus of restaurants on node 1.
Let s_menu2 generate ids for menus of restaurants on node 2.
Find the node of restaurant 4 through 4 modulo 3 = 1.
Use s_menu1 to create menus {1, 4, 7, 10} at restaurant 4.
Then menus {1, 4, 7, 10} will be co-located on node 1.

The drawbacks are that you will need as many sequences as you have nodes, and that the number of nodes may not stay the same over time. However, similar thinking may be applied to cover the redistribution of instances when a node is added to the cluster.
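
For concreteness, the sequences in the pseudo code could be declared like this in SQL Server syntax (an MPP platform such as HP Vertica has its own, similar DDL):

-- One menu sequence per node, each yielding only ids congruent to its node number modulo 3.
create sequence s_restaurant start with 1 increment by 1;
create sequence s_menu0 start with 3 increment by 3; -- 3, 6, 9, ...   (id mod 3 = 0)
create sequence s_menu1 start with 1 increment by 3; -- 1, 4, 7, 10, ... (id mod 3 = 1)
create sequence s_menu2 start with 2 increment by 3; -- 2, 5, 8, ...   (id mod 3 = 2)

-- Restaurant 4 lives on node 4 mod 3 = 1, so its menus draw their ids from s_menu1
-- and thereby hash to the same node as the restaurant itself.
select next value for s_menu1 as menu_id;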

Anchor Modeling Academy

We will be bringing some of our courses online, starting with an Introduction to Anchor Modeling. It is a four-hour video course for $299 that will give you a general introduction to Anchor, serving as a good starting point to build from if you are interested in the technique. A link to the course can be found in the sidebar to the left, or you can click here: anchor.teachable.com. This will hopefully be an opportunity for those of you who have shown interest in our courses but have been unable to travel to Sweden in order to take them. We’re open to suggestions on what you would like to see as a second course that builds upon this one, so let us know!


Anchor Modeler 0.99 Released

We are proud to announce the release of version 0.99 of the modeling tool. This version has been in the making for over one year, and code generated from it is, as usual, already in production for a number of data warehouses and system databases. Expect a more user-friendly interface that again works in Chrome, after they unexpectedly and deliberately broke SVG support. The generated code has, among other things, improved trigger logic. For example, if you try to update a static attribute, you will now get a warning rather than a failure. This should make life easier, particularly for those using our sisula ETL Framework for metadata driven Data Warehouse automation. Using this framework together with the latest modeling tool, we have built and put a DW in production in record time: it took less than a week to model, populate three years of history, and start a daily incremental load of a Data Warehouse used in a high security environment.

Anchor Model Generator

A while back Juan-José van der Linden created a script that would reverse-engineer a database into an Anchor model. He was kind enough to donate that script to the community, and it is available in our forum. Now there’s also a second effort which is available in the form of the script below. This is a work in progress and it will be updated with more features in the future. Perhaps we can merge the best features from JJ’s script into this one or the other way around.

Please note that the script will use column statistics in order to determine if knots should be created, so it may take a long time to run when no statistics are available. It will reuse existing statistics, so a second run of the script is much faster. It tries to determine ties based on primary keys and matching column names.

The following script can be used to generate knot loading code, based on the data stored in the descriptions of the knots in the model after running the script above.

The following script can be used to generate source to target mappings for use with the sisula ETL framework, based on the data stored in the descriptions of the attributes in the model after running the script above.

24 New Certified Anchor Modelers

Last week we had the pleasure of certifying 24 new Anchor Modelers in the Netherlands. In other words, if you want to find expertise on Anchor modeling, you should head over to our Directory and contact them! We also held a public session with an introduction to Anchor modeling, showing how it is possible to capture facts like “someone being somewhat sure about something, while someone else is completely sure about its opposite”. We were happy to have participants from Nike, the Dutch Police, Rabobank, Essent, ChainPoint, consulting companies like Free Frogs, Ordina, Cap Gemini, and Kadenza, and a number of freelancers.


Thanks everyone for participating at the courses and presentations, and the great discussions during the breaks!