Schemafull Databases

Over the last decade schemaless databases have become a thing. The argument is that too much work is required at write time to conform information to the schema. Guess what: you have only moved that work higher up the information refinery. In order to digest the information, it will at some point have to conform to some schema anyway, but now you have to do that work at read time instead. Would you rather spend additional time once, when writing, or spend additional time every time you read? You may also have heard me say that ‘unstructured data is just data waiting to be structured’.

That being said, there is a huge problem with schemas, but the problem is not the schema itself. The problem is that information has to be stored according to one rigid schema, rather than alongside many flexible schemas. The rigidity of current schema-based databases often leaves us no option but to peel and mold information before it fits. Obviously, discarding and deforming information is bad, so I do understand the appeal of schemaless databases, even if they fail to address the underlying issue.

The only way to store the bananas above in my stomach is to peel them and mold them through my mouth, acting much like a relational database with a particular schema. The relational database forces me to do a few good things though, things not always done in schemaless databases. Identification has to be performed, where bananas are given identities, so that we know whether this is a new banana or one that is already in the stomach. Thanks to things having identities, we can also relate them to each other, such that all these bananas at one point came from the same bunch above.

Another advantage of identities, along with the ability to relate them to the identities of other things and their properties, is that we can start to talk about the things themselves. Through such metaspeak it is possible to say that a thing is of the Banana type. With this realization, it is hard to understand why a schema should be so rigidly enforced, unless metaspeak is assumed to be autocratically written in stone, while at the same time being univocal, all-encompassing, and future-proof. How true does that ring to anyone living in the real world?

No, let there be diversity. Let us take the good things from relational databases, such as identification and the possibility to express relationships as well as properties, but let us not enforce a schema, taking the best part of the schemaless databases as well. Instead, let there be metaspeak expressed using the same constructs as ordinary speak, and let this metaspeak be spoken in as many ways as there are opinions. Let us talk about #schemafull databases instead!

A #schemafull database is one in which information is stored as pieces of information, posits, each tied to an identified thing, and opinions about these pieces, assertions, made by some identified thing. What class of things a particular thing belongs to is just another piece of information, about which you may be certain to a degree, hold a different opinion than someone else, or later change your mind about. These pluralistic “schemas” live and indefinitely evolve alongside the information. There is no limit to how much additional description these classes and schemas can be given. They could, for example, be parts of different (virtual) layers, such as logical or conceptual, or they could be related hierarchically, such that a Banana is a Fruit.
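To make this a little more concrete, a posit store could be sketched as two relational tables, one for the pieces of information and one for the opinions about them. This is only an illustration with made-up table and column names, not the actual implementation mentioned further down.

-- a minimal sketch (hypothetical names): posits tie identified things, through roles,
-- to a value and a time, while assertions record who believes what about a posit
CREATE TABLE Posit (
  PositID       int IDENTITY PRIMARY KEY,
  AppearanceSet nvarchar(max) NOT NULL,   -- e.g. '{(Banana42, color)}'
  Value         nvarchar(max) NOT NULL,   -- e.g. 'yellow'
  ValueTime     nvarchar(50)  NOT NULL    -- since when, possibly imprecise
);
CREATE TABLE Assertion (
  PositorID     int NOT NULL,             -- the identified thing holding the opinion
  PositID       int NOT NULL REFERENCES Posit (PositID),
  Reliability   decimal(3,2) NOT NULL,    -- from -1 (certain of the opposite) to 1 (certain)
  AssertionTime datetime2 NOT NULL
);
-- the "schema" is just more posits, for example that Banana42 belongs to the Banana class
INSERT INTO Posit (AppearanceSet, Value, ValueTime)
VALUES (N'{(Banana42, class)}', N'Banana', N'2019-01-01');

Classes, layers, and hierarchies then become ordinary rows that can be asserted, doubted, and revised like everything else.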

Surely though, if we are to talk about things and have a fruitful conversation, must we not agree upon something first? We do, of course, and the least that needs to be agreed upon is necessarily identities and roles, and presumptively values. As an example, looking at the posit “This banana has the color yellow”, everyone who asserts this must agree that ‘this banana’ refers to the same thing, or in other words uniquely identifies a particular thing. The role of ‘that having a color’ must have the same meaning for everyone, in how it applies and how it is measured. Finally, ‘yellow’ should be as equally understood as possible.

The reason the value part is less constrained is simply that not all values have a rigorous definition. The color yellow is a good example. I cannot be sure that your yellow is my yellow, unless we define yellow using a range of wavelengths. However, almost none of us run around with spectrometers in our pockets. The de facto usage of yellow is different, even if we can produce and measure it scientifically. We will therefore presume that for two different assertions of the same posit “This banana has the color yellow”, both parties making those assertions share an understanding of ‘yellow’. There is also the possibility of representing imprecision using fuzzy values, such as ‘yellowish’, which ‘brownish’ may to some extent overlap.

It is even possible to define an imprecise Fruitish class, which in the case of the Banana may be a good thing, since bananas botanically are berries. It is also important to notice the difference between imprecision and uncertainty. Imprecision deals with fuzzy posits, whereas uncertainty deals with fuzzy assertions. It is possible to state that “I am certain that Bananas belong to the Fruitish class”, complete certainty about an imprecise value. Other examples are “I am not so sure that Bananas belong to the Fruit class”, uncertainty about a precise value, and “I am certain that Bananas do not belong to the Fruit class”, complete certainty about the negation of a precise value.

A database needs to be able to manage all of this, and we are on our way to building one in which all this will be possible, but we are not there yet. The theory is in place, but the coding has just started. If you know or can learn to program in Rust and want to help out, contact me. You can also read more about #transitional modeling in our scientific paper, and in the articles “Schema by Design”, “The Illusion of a Fact”, “Modeling Consensus and Disagreement”, and “The Slayers of Layers”. Don’t miss Christian Kaul’s “Modeling the Transitional Data Warehouse” either, in which conflicting schemas are exemplified.

There is also an implementation of #transitional modeling in a relational database, of course only intended for educational purposes. Its rigid schema can be seen below…

Stop peeling and molding information and start recording opinions of it instead. Embrace the schemas. Embrace #transitional modeling. Have a banana!

♫ Let’s Twine Again ♫

In our paper Temporal Dimensional Modeling we introduced the concept of a twine. A twine is an efficient set-based algorithm that can be applied when you have one table in which a history of changes has been recorded and another table with related points in time, and you want to know which historical rows were in effect at those points in time.

Let us look at an example. In one table we have stored the weekly billboard rankings of songs, in another we have information about the artists performing the songs, and in a third we have song information. The billboard table references both of these, together with a ranking and a date indicating since when that ranking is in effect.

The ranking is historized weekly, so that if the ranking has changed since last week, a new row with the new rank is added with a later ValidSince date. The table below lists rows for the song “Let’s Twist Again” by Chubby Checker.

Given that we know the song was released on the 19th of June 1961, we see that it went straight into the 25th spot. It reached the 8th spot as its highest ranking, and the latest information we have is that it is currently outside of the top 100. Note that for some weeks there are no rows, since the ranking remained the same. This is a typical example of managing change without introducing duplicated data. Thanks to the fact that Ranking is exhaustive, such that it can always express a valid value, there is no need for a ValidTo column (also known as end-dating). The benefit of avoiding ValidTo is that historization can be done in an insert-only database, avoiding costly update operations.
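For reference in the queries that follow, the two tables involved could be sketched as below. The names are my assumptions based on the columns mentioned in the text, and the artist table is left out for brevity.

-- a sketch of the tables as described in the text; names are assumptions
CREATE TABLE Song (
  SongID      int PRIMARY KEY,
  Title       nvarchar(200) NOT NULL,
  ReleaseDate date NOT NULL
);
CREATE TABLE BillboardRanking (
  SongID     int  NOT NULL REFERENCES Song (SongID),
  Ranking    int  NOT NULL,   -- recorded only when the ranking changes
  ValidSince date NOT NULL,   -- in effect from this date until a later row replaces it
  PRIMARY KEY (SongID, ValidSince)
);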

Now, let’s assume that these tables have lots and lots of data, and we would like to know what ranking each song had one month after its release. This forces us to do a temporally dependent join, or in other words a join in which the join condition involves temporal logic. Picking up the song “Let’s Twist Again” from the Song table gives us the 1961-06-19 ReleaseDate and the SongID 42. Looking at the table above we need to find which ranking was in effect for SongID 42 on 1961-07-19 (one month later). Visually, it is easy to deduce that it was ranked 15th, since that ranking is valid from 1961-07-13 until replaced by the best ranking on 1961-08-20, and 1961-07-19 falls in between those dates.

Unfortunately, relational databases are inherently bad at optimizing ‘in between’ conditions. The best trick before the twine was to use a CROSS APPLY together with a TOP 1, which would give the query optimizer the additional information that only a single version can be in effect at a given point in time. Even with this approach we found ourselves, after four hours of execution time, cancelling a similar query we ran for the tests in our paper. Rewriting the query in numerous different ways using known tricks with subselects, row numbering, last value functions, and the like, produced the same results. It was back to the drawing board.
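For illustration, the CROSS APPLY trick could look roughly like this, using the sketched tables from earlier. It is my sketch of the approach, not the exact query from the paper.

-- roughly the CROSS APPLY + TOP 1 trick: for each song, pick the single
-- ranking version in effect one month after its release
SELECT s.SongID, s.Title, r.Ranking
FROM Song s
CROSS APPLY (
  SELECT TOP 1 br.Ranking
  FROM BillboardRanking br
  WHERE br.SongID = s.SongID
    AND br.ValidSince <= DATEADD(MONTH, 1, s.ReleaseDate)
  ORDER BY br.ValidSince DESC
) r;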

After a while we had a sketch similar to the figure above on a whiteboard. I recalled having written about a very similar situation, where I called for parallel projections in databases. So, what if we could project one timeline onto another and just keep the segment where they match? We started by forming a conjoined timeline, where time points were labelled so we could determine from which timeline they had originated.

Now we just needed to traverse the conjoined timeline and at every B-point pick up the latest value of the A-point. As it turns out, conjoining can be done using a simple union operation, and finding the latest A-point value can be done using a cumulative conditional max operation. Both are performance-efficient operations that, at least theoretically, require only one pass over the underlying tables, avoiding Row-By-Agonizing-Row issues.

The resulting code can be seen above. The union combines the timelines, on which a conditional max operation is performed over a window. The condition limits the values to only A-points and the window makes sure that the largest one is picked up for each song. This also gives us rows for the A (billboard history) timeline in which the Timepoint and ValidSince will always be the same. An outer condition discards these by limiting the result to those relevant for the B (songs) timeline. When looking at the actual execution plan of the query, we can see that the tables have indeed only been scanned once.
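Since the code referenced above is not reproduced here, the following is a sketch of what a twine along those lines could look like, using the assumed table names from before.

-- a sketch of a twine: conjoin the two timelines with a union, carry the latest
-- A-point forward with a cumulative conditional max, and keep only the B-points
WITH ConjoinedTimeline AS (
  SELECT SongID, ValidSince AS Timepoint, 'A' AS Timeline   -- billboard history
  FROM BillboardRanking
  UNION ALL
  SELECT SongID, DATEADD(MONTH, 1, ReleaseDate), 'B'        -- one month after release
  FROM Song
), Twine AS (
  SELECT SongID, Timepoint, Timeline,
         MAX(CASE WHEN Timeline = 'A' THEN Timepoint END)
           OVER (PARTITION BY SongID ORDER BY Timepoint, Timeline
                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS ValidSince
  FROM ConjoinedTimeline
)
SELECT t.SongID, t.Timepoint, br.Ranking
FROM Twine t
JOIN BillboardRanking br
  ON br.SongID = t.SongID
 AND br.ValidSince = t.ValidSince
WHERE t.Timeline = 'B';   -- discard the A-rows, keeping only the songs timeline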

Twining is very performance efficient! To drive the point home, the mentioned query that was cancelled after four hours took only seconds to run when it was rewritten as a twine. If you are worried about the sort in the execution plan, adding some indexes (that basically precalculate the sorts) changes it into a merge join. The tradeoff is that sorting is then done at write time rather than at read time, so what you choose will depend on your particular requirements.

Finally, as an exercise, try to produce the twine for the question: “How many artists had a top 10 hit on their 40th birthday?” Happy Twining!

The Slayers of Layers

Frankenstein montage, using images in the public domain.

Once upon a time, tech had no layers. Tech was obscure, unmaintainable and inaccessible, reserved to the few. Few implementors, few maintainers, and few users. Tech, ingenious as it is, circumvented its issues by introducing layers, and now tech is for the masses. Layers conceal the ugliness underneath, each layer bringing with it understandability, maintainability, and accessibility.

We have the layers to thank for the incredible development that put us in the Information Age. Doubtlessly, the layers have brought with them a lot of good things, but we should not forget that they exist because of a bad thing. Because tech is immature, poorly designed, insufficiently researched, or plainly opportunistic, layers are slapped on like makeup until it is beautiful enough to be sold to the masses. This has led to the rapid development of ugly tech and a tremendous waste of resources.

Unfortunately, efforts in the direction of layerless architectures are often met with resistance. This is understandable. Due to the large sunk costs and the workforce occupied with layers, the protectionism makes them next to impenetrable. How much money is your organisation spending on layers? How much money are you making from layers? Have you ever stopped to question the existence of layers?

It is easy to see the incentives to keep the layers around, but they will have to go. Charles Darwin will make sure they do. The “survival of the fittest” will, over the long run, favour organisations that adopt well designed tech on which they spend less resources than their competitors. The only caveat is that evolution requires time, and right now precious time is spent researching layered architectures, rather than layerless ones. In the eagerness to satisfy the industry, too much effort is steered towards applied science and too little towards pure science.

We need to go back to the drawing board and question every tech that cannot do without layers. Some may even have layers only for the sake of having layers, some cannot be saved, but some are ugly ducklings on their way to becoming beautiful swans. We will have layers for a long time still, and they do serve a purpose while the underlying tech is immature, but we cannot let the layers prevent the tech itself from reaching maturity.

In every discipline there will be one tech to rule them all, and trust me, that tech will be layerless. Which side are you on? Are you going to be slayers of layers?

It is time to rethink the layer. When you see a layer, act with suspicion and question its existence. If nothing else, think about the costs involved in maintaining every additional layer. Above all, we need to go back to the root of the problem and create tech that needs few or no layers to be acceptable.

Temporal Dimensional Modeling

Back in 2012 we introduced a way to “fix” the issues with Slowly Changing Dimensions in Dimensional Modeling. That script was actually sent to Ralph Kimball, but seeing as not much has happened in the following seven years, we decided to become a bit more formal and write a paper about it. It is entitled “Temporal Dimensional Modeling” and can be read on ResearchGate. Even if you are not interested in Dimensional Modeling (disclaimer: we are not either), you can learn a lot about temporality and improper ways to manage it, through the existing SCD types. Of particular general interest is the twine, a clever way to find historically correct relationships, given information that is stored differently.

Here is the abstract from the paper:

One of the prevalent techniques for modeling data warehouses is and has for the last decade been dimensional modeling. As initially defined, it had no constructs for keeping a record of changes and only provided the as-is latest view of available information. Since its introduction and from increasing requirements to record changes, different approaches have been suggested to manage change, mainly in the form of slowly changing dimensions of various types. This paper will show that every existing type of slowly changing dimension may lead to undesired anomalies, either at read or at write time, making them unsuitable for application in performance critical or near real-time data warehouses. Instead, based on current research in temporal database modeling, we introduce temporal dimensions that make facts and dimensions temporally independent, and therefore suffer from none of said anomalies. In our research, we also discovered the twine, a new concept that may significantly improve performance when loading dimensions. Code samples, along with query results showing the positive impact of implementing temporal dimensions compared to slowly changing dimensions are also presented.

The experiments in which performance was measured were done using Transact-SQL in Microsoft SQL Server. The code for reproducing the tests is available on GitHub.

Modeling Consensus and Disagreement

If you didn’t know it before, let me tell you that consensus is a big thing in Sweden. Looking in a dictionary, consensus is defined as “agreement among all the people involved” and it is rare for Swedes to leave a meeting room before it has been reached. Honestly, it’s to the point where meetings can become very tedious, but perhaps the inefficiency of the meeting itself is outweighed by the benefits of having consensus when leaving the room. I think the jury is still out on that one though…

When it comes to databases, there is an unspoken understanding that there is consensus among those who want to retrieve information from them. There is consensus on what the information represents, what it means, and which constraints are imposed upon it. But, can we be sure that is the case? Wouldn’t it be nice if we could write down a timeline on which we could prove there are intervals of consensus interspersed with intervals of disagreement? How great would it be if this could be deduced from the information itself?

This is where transitional modeling comes to the rescue. Let’s dig deeper into its two constructs, the posit and the assertion, which enable the modeling of consensus and disagreement. First, this is the structure of a posit:

[{(id1, role1), …, (idN, roleN)}, value, time]

Every idi is a unique identifier, meaning that it uniquely represents a thing in whatever it is we are modeling. Once a thing has been assigned an id, it belongs to that thing alone for all eternity. In a posit, it is possible that idi = idj for different roles, but the roles must be unique. The value is either a primitive value or an instance of some complex data type. The time is either a primitive or a fuzzy value. Let’s clarify this by looking at some examples:

[{(Arthur, beard color)}, red, 1972-1974]
[{(Arthur, address)}, {street: AStreet, zip code: 1A2B3C, …}, 1972]
[{(Arthur, husband), (Bella, wife)}, <married>, 1999-09-21]

Posits do not differentiate between properties and relationships. They both share the same structure, but properties are easy to recognise since they only have a single role and id. The interval 1972–1974 in the first posit means that the information is imprecise, and expresses that sometime within that interval Arthur grew a red beard, not that his beard was red between those years. If the color of the beard changes, a different posit would express this, along with the time the change occurred. As can be seen, the address is a complex data type in the form of a structure. In the marriage relationship the angled brackets around the value <married> indicate that it is a complex data type, which in this example is a value from an enumeration.

The data types were picked specifically so that some parallels to traditional database modeling techniques can be drawn. That beard color is the name of an attribute in Anchor modeling, possibly on a Person anchor, of which Arthur is an instance having a surrogate key as the unique identifier. That address is a satellite in Data Vault, since the value structure is basically a bunch of attributes, possibly on a Person hub, where Arthur may have some hash or concatenation as its unique identifier. The marriage relationship is a knotted tie in Anchor modeling, where the knot constrains the possible values to those found in the enumeration, or a link in Data Vault, connecting Persons to each other.

Posits are neither true nor false. They just are. This is where the magic happens though. First, in order to be able to reference the posits, let’s give them some names. In practice these could be their memory addresses in an in-memory structure or some identity column in a database table.

p1 = [{(Arthur, beard color)}, red, 1972-1974]
p2 = [{(Arthur, address)}, {street: AStreet, zip code: 1A2B3C, …}, 1972]
p3 = [{(Arthur, husband), (Bella, wife)}, <married>, 1999-09-21]

Now we introduce the positor as someone or something that can have opinions about posits. It could be a human, it could be a machine, or it could be something different, as long as it can produce assertions. An assertion has the following structure:

!(id, posit, reliability, time)

The id is a unique identifier for the positor who has an opinion about a posit. Positors are part of the same body of information and may be described by posits themselves. The reliability expresses with what certainty the positor believes the posit to be true and is a real value between -1 and 1. Finally, the time is when the positor is making the assertion. As it turns out, reliabilities have some interesting properties, such as being symmetric, which can be used to express opposite beliefs. Somewhat sloppily expressed, negative reliabilities correspond to putting a not in front of the value. For example:

!(Me, “beard is red”, -1, Now) = !(Me, “beard is not red”, 1, Now)

This comes in handy, since storing the complement of a value is often unsupported in information stores, such as databases. If another assertion is added for the same posit but by a different positor, it can express consensus or disagreement.

!(Me, p2, 1, Now)
!(You, p2, 1, Now)
!(Me, p3, 0.75, Now)
!(You, p3, 0.5, Now)

So it seems both Me and You are in complete agreement on the address of Arthur and we can declare consensus. However, Me thinks that there is a 75% chance that the marriage posit is true, while You only believe that the chance is 50%. Here some additional guidelines are needed in order to determine if this means consensus or not. At least it can be determined that we are not in complete agreement any longer.

You are, due to the symmetrical nature of reliabilities, stating that there is a 50% chance that Arthur and Bella are not married. It is easy to be misled into believing that this means it could be any value whatsoever, but that is not the case. You are stating that it is either “married” or “not married”, with equal probabilities, but the enumeration could contain an arbitrary number of values, each of which would be a valid case for “not married”, making “married” more likely than the rest. This is not the same as the following assertion:

!(Else, p3, 0, Now)

Here Else is stating that it has no clue whatsoever what the value may be. This is useful, as it makes it possible to retract statements. Let’s say that Else also asserted that:

!(Else, p3, 1, Before)

It seems there was a time before, when Else was certain that the marriage took place as posited. Now Else has changed its mind though. This means that Now there is no consensus, if we look at Me, You, and Else, but Before there was. The full treatment of posits and assertions is available in our latest paper, entitled “Modeling Conflicting, Uncertain, and Varying Information”. It can be read and downloaded from ResearchGate or from the Anchor Modeling homepage. There you can find how You can refrain from being contradictory, for example.
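Purely as an illustration, the assertions above could be stored and checked for consensus along the following lines. The table name, column names, and the concrete timestamps standing in for Before and Now are my own assumptions.

-- a sketch with hypothetical names and dates standing in for Before and Now
CREATE TABLE Assertion (
  Positor       nvarchar(50) NOT NULL,
  Posit         nvarchar(50) NOT NULL,
  Reliability   decimal(3,2) NOT NULL,   -- between -1 and 1
  AssertionTime datetime2    NOT NULL
);
INSERT INTO Assertion VALUES
  (N'Me',   N'p3', 0.75, '2019-06-01'),
  (N'You',  N'p3', 0.50, '2019-06-01'),
  (N'Else', N'p3', 1.00, '2019-01-01'),  -- Before
  (N'Else', N'p3', 0.00, '2019-06-01');  -- Now: Else retracts its earlier opinion

-- take the latest opinion per positor as of a point in time; one simple rule is to
-- declare consensus when everyone who still has an opinion states the same reliability
DECLARE @asOf datetime2 = '2019-06-01';
WITH LatestOpinion AS (
  SELECT Positor, Posit, Reliability,
         ROW_NUMBER() OVER (PARTITION BY Positor, Posit
                            ORDER BY AssertionTime DESC) AS rn
  FROM Assertion
  WHERE AssertionTime <= @asOf
)
SELECT Posit,
       CASE WHEN MIN(Reliability) = MAX(Reliability)
            THEN 'consensus' ELSE 'disagreement' END AS Verdict
FROM LatestOpinion
WHERE rn = 1 AND Reliability <> 0   -- reliability 0 means no opinion at all
GROUP BY Posit;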

 

The Illusion of a Fact

It is funny how limitations, when they have been around for a while, can be turned into beliefs that there is only one way of thinking. The right way. This is the case of databases and the information they store. We have been staring at the limitations of what databases can store for so long that we have started to think that the world works in the same way. Today I want to put such a misconception to rest. Databases store facts, so naturally we look for facts everywhere, but the truth is, in the real world there are very few facts.

The definition of a fact is “a piece of true information” and “things that are true or that really happened, rather than things that are imaginary or not true” according to the Macmillan dictionary. Let me then ask you, what do you know to be true? It is a fact that “the area of a square with the side x is x squared”, although only for squares on a Euclidean plane. Mathematics, as it turns out, is one of the few disciplines in which we actually can talk about truth. This is not the case for science in general though.

Is the statement “There are no aliens on the dark side of the moon” a fact? You would, if asked whether there are aliens on the dark side of the moon, probably answer that there are no aliens there. However, if you were pushed to prove it, you may think otherwise. The reasoning would be that there is this extremely minuscule chance there could be something alien there. We could leave it at that and disqualify the statement as a fact, but let’s not just yet. What is more interesting is why you are almost sure it is a fact.

Back in the days of the ancient Greeks, Aristarchus of Samos suggested that the Earth revolves around the Sun. Heliocentrism was then forgotten for a while, but brought back by the brave Galileo Galilei almost 2000 years later. You have to rely on these guys being right to begin with, and that the moon is not painted on the sky or made of cheese. Then, you have to rely on the Apollo 8 mission, in which astronauts actually observed the dark side. The photographs that were taken further imply that you rely on the science behind imagery and trust that the images have not been tampered with. You also need to rely on aliens not having cloaking devices, on aliens in general seeming unlikely, and on any claimed observations not coming from credible sources.

You can build a tree view of all the things you rely on in order to feel assured that there are no aliens on the dark side of the moon. I just need to put one of them in doubt for the fact to become a non-fact. This illustrates how fragile facts are, and that they therefore constitute a small small small minority of the information we manage on a daily basis. Yet, for the most part we continue to treat all of it as facts.

For this reason, in transitional modeling, the concept of a fact is omitted, and replaced by a posit. A posit has no truth value at all and is merely a syntactical construct. Assuming “There are no aliens on the dark side of the moon” is a posit just means that it is a statement that fits a certain syntax. In order for such a statement to gain meaning and some kind of truth value, someone called a positor must have an opinion about it. A second construct, the assertion, semantically binds a posit to a positor, and expresses the degree of certainty with which the positor believes the statement to be true or not true. Together they express things like ‘Peter the positor is almost completely sure that “There are no aliens on the dark side of the moon”’. Concurrently it may also be the case that ‘Paulina the other positor thinks there is a slight chance that “There actually are aliens on the dark side of the moon”’.

Information is, in this view, factless and instead has two parts, the pieces of information (posits) and the opinions about the pieces (assertions), better representing its true nature. That two such simple constructs can lead to a rich theory, from which other modeling techniques can be derived as special cases, such as Anchor modeling, Data Vault, and the third normal form, may be a bit surprising. Read more about it in our latest scientific paper, entitled “Modeling Conflicting, Uncertain, and Varying Information”. It can be read and downloaded from ResearchGate or from the Anchor Modeling homepage.

Schema by Design

Lately, there’s been a lot of talk about when a schema should be applied to your data. This has led to a division of databases into two camps, those that do schema on write and those that do schema on read. The former is the more traditional, with relational databases as the main proponent, in which data has to be integrated into a determined schema before it can be written. The latter is the new challenger, driven by NoSQL solutions, in which data is stored more or less exactly as it arrives. In all honesty, both are pretty poor choices.

Schema on write imposes too much structure too early, which results in information loss during the process of molding it into a shape that fits the model. Schema on read, on the other hand, is so relaxed in letting the inquiring party make sense of the information that understandability is lost. Wouldn’t it be great if there was a way to keep all information and at the same time impose a schema that makes it understandable? In fact, now there is a way, thanks to the latest research in information modeling and the transitional modeling technique.

Transitional modeling takes a middle road between schema on read and schema on write that I would like to call schema by design. It imposes the theoretical minimum of structure at write time, from which large parts of a schema can be derived. It is then up to modelers, who may even disagree on classifications, to provide enough auxiliary information that it can be understood what the model represents. This “metainformation” becomes a part of the same body of information it describes, and abides by the same rules with the same minimum of structure.

But why stop there? As it turns out, types and identifiers can be described in the same way. They may be disagreed upon, be uncertain, or vary over time, just like information in general, so of course all of that can be recorded. In transitional modeling you can go back to any point in time and answer an inquiry as it would have been answered then, from the point of view of anyone who had an opinion at the time. Actually, it does not even stop there, since constraints over the information, like cardinalities, are also represented in the same way. It all follows the same minimum of structure.

What then is this miraculous structure? Well, it relies on two constructs only, called posits and assertions, both of which are given proper treatment in our latest scientific paper, entitled “Modeling Conflicting, Unreliable, and Varying Information”. It can be read and downloaded from ResearchGate or from the Anchor Modeling homepage. If you have an interest in information modeling, and what the future holds, give it an hour. Trust me, it will be well spent…

Transitional Modeling

Our latest paper is now available, entitled “Modeling Conflicting, Unreliable, and Varying Information”, in which Transitional Modeling is formalized. It can either be viewed and referenced on ResearchGate or downloaded directly from here. Much of what is found in the paper has been part of our courses since we began certifications, and is also available in the online course, but new research from the last couple of years has also been added.

ABSTRACT
Most persistent memories in which bodies of information are stored can only provide a view of that information as it currently is, from a single point of view, and with no respect to its reliability. This is a poor reflection of reality, because information changes over time, may have many and possibly disagreeing origins, and is far from often certain. Hereat, this paper introduces a modeling technique that manages conflicting, unreliable, and varying information. In order to do so, the concept of a “single version of the truth” must be abandoned and replaced by an equivocal theory that respects the genuine nature of information. Through such, information can be seen from different and concurrent perspectives, where each statement has been given a reliability ranging from being certain of its truth to being certain of its opposite, and when that reliability or the information itself varies over time, changes are managed non-destructively, making it possible to retrieve everything as it was at any given point in time. As a result, other techniques are, among them third normal form, anchor modeling, and data vault, contained as special cases of the henceforth entitled transitional modeling.

We hope you all will have fun with transitional modeling, as our research continues, particularly with respect to how it should fit into a database, relational or not.

On the hashing of keys

In Anchor we follow the established paradigm that an instance in the domain we are modeling should only be represented once in the database. For this reason, the surrogate keys we use as identities of such instances need to be dumb in the sense that they can neither convey any meaning by themselves nor be an encoding of anything that has meaning. We prefer sequences, as they are small with respect to size, cheap with respect to the generation of identities, monotonically increasing, and reasonably hard to confuse with something that carries meaning.

As a discouraging example, let’s assume we would like to hash the natural key of a citizen of Sweden using MD5 and use the hashes as identities in our database. First, every citizen is identified by a personal number, of the form:

YYMMDD±NNNC

The date of birth is followed by a delimiter and four digits, where the first three constitute a serial number and the last is a check digit. The serial number is even for men and odd for women. The delimiter is a minus sign if you are younger than 100 years old and a plus sign once you get older than that. In other words, not an entirely stable key over time. To complicate things even further, foreigners visiting the country may be given a coordination number which looks exactly like a personal number, except with DD+60. In any situation in which you need to provide your personal number but cannot do so, you may also be given a reserve number. The way to create a reserve number is that it should retain a correct birth date but contain at least one letter instead of a digit in the NNNC part.
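As a small illustration of the DD+60 rule, the day part can be inspected as follows, here using the coordination number that appears in the next paragraph.

-- a sketch: detecting a coordination number by its day part (DD + 60)
DECLARE @pnr char(11) = '890162-3286';
DECLARE @day int = CAST(SUBSTRING(@pnr, 5, 2) AS int);
SELECT CASE WHEN @day > 60
            THEN 'coordination number, actual day of birth = ' + CAST(@day - 60 AS varchar(2))
            ELSE 'personal number, day of birth = ' + CAST(@day AS varchar(2))
       END AS Interpretation;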

As many drawbacks as this system may have, it is a fact of life that every data warehouse in Sweden has to cope with. For example, the MD5 of someone staying in the country for a longer period of time with coordination number 890162-3286 is ee0425783590ecb46030e24d806a5cd6. This can be stored as 128 bits, whereas an integer sequence using 32 bits will suffice for the population of Sweden with a healthy margin. Let’s also assume that we have a tie representing the relationship between married people. As there is a risk of divorce, such a tie must be knotted and historized. If we are purists, the key in the knot should also be a hash, even if it could be represented using a single bit with 1 for married and 0 for divorced. The keys in the hashed tie will consume 128 + 128 + 128 = 384 bits, whereas the sequenced tie will consume 32 + 32 + 1 = 65 bits. Caeteris paribus, the hashed tie is almost six times larger. The issue is further accentuated if you look at a plain text representation of the data. A sequence from 1 to 10 million will use less than 70 million characters in plain text, whereas the 10 million hashes will use no less than 1.2 billion characters. In a situation where the license cost is based on a textual representation of raw data, the hashes would be almost 20 times as expensive.
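To put some numbers on the size difference, a quick sketch could look like the following. Note that the exact hash value depends on the character encoding of the input, so it may differ from the one quoted above.

-- a sketch comparing key sizes: an MD5 hash of the natural key versus an integer sequence
DECLARE @naturalKey varchar(11) = '890162-3286';
SELECT HASHBYTES('MD5', @naturalKey)              AS HashedKey,           -- 16 bytes = 128 bits
       DATALENGTH(HASHBYTES('MD5', @naturalKey))  AS HashSizeInBytes,     -- 16
       DATALENGTH(CAST(1 AS int))                 AS SequenceSizeInBytes; -- 4 bytes = 32 bits
-- a hashed knotted tie then needs 128 + 128 + 128 = 384 bits per row,
-- whereas the sequenced tie needs 32 + 32 + 1 = 65 bits per row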

To continue the example, after some period of time, 890162-3286 is granted citizenship and becomes 890102-3286. The MD5 changes as well and is now 049fda914afa455fae66115acd78b616, completely different than before. Preceding the citizenship this person got married, so there are now rows in the tie with the old hash. In the sequenced tie we would expect to find the personal number as a historized attribute of the adjoined anchor. No problem here, two personal numbers now refer to the same sequence number, but at different time periods. The resolution with the hashed keys would be to introduce another instance in the anchor with the new hash together with an additional tie indicating that these two instances actually refer to the same thing in real life. Any query must now take this into account, slowing all queries down, and data must be duplicated, resulting in increased maintenance costs and degraded performance. Of course, if you are not interested in keeping a history of changes you can just update the existing rows, but this is a costly operation and may be quite troublesome if foreign keys are declared. We also presumed that such a history is a ‘must have’ requirement, as in any proper data warehouse.

The conclusion is that hashing is a poor choice when keys may change over time, due to the fact that the hash, while looking like a meaningless string, actually still carries the remnants of the meaning of the input. This is sufficient for causing trouble! We also saw that “safe” hashes are significantly larger than integer sequences. By “safe” we assume that the hash is complex enough for clashes to be extremely rare. However minuscule the risk of a collision may be, it could still be a completely unacceptable event, should it occur. See Black Swan Theory for more information. The “safeness” of the hash is also proportional to the cost of generating it, so the safer you want to be, the more CPU cycles are used in order to try to reassure you.

Polymorphic Graph Queries in SQL Server

Sometimes I get the question of when Anchor Modeling is not suitable. “Actually, most of the time it is suitable”, is my common answer. However, there are times when the requirements are such that you need a bit of trickery on top of a model. One such case recently emerged at a client. The dilemma was how to ask polymorphic graph queries in SQL Server when you have a network represented as a parent-child relationship in your Anchor model. First, a polymorphic graph query is one in which you want to find nodes with certain properties connected through any number of edges in your network. For example, “find all computers that at any point have a wireless connection between them”. You may think that the new graph table types in SQL Server 2017 would solve this, but alas, they do not support these types of queries (yet).

Fortunately, an often overlooked data type was introduced in SQL Server 2008: HIERARCHYID. At first glance it looks disappointing, but it turns out that by using string searches and manipulation, polymorphic queries can be asked. Below is an example that shows how this is done, which should of course be applicable to any type of network, and not just ones containing computers, switches, routers, cables and wireless connections. As a small bonus, a hint is also given of how to solve the traveling salesman problem.
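Since that example is not reproduced here, the following is a rough sketch of the idea: walk the parent-child network, build up the path as a string of node and edge types, and answer the polymorphic question with a string search. The table names and types are my own, and the id path happens to follow the canonical HIERARCHYID string format, so it can be parsed into one for further ancestor and descendant operations. This is not the actual solution from the client case, just an illustration of the technique.

-- a sketch: nodes, typed edges, and paths built with string manipulation
CREATE TABLE Node (
  NodeID   int PRIMARY KEY,
  NodeType varchar(20) NOT NULL            -- 'computer', 'switch', 'router', ...
);
CREATE TABLE Edge (
  ParentID int NOT NULL REFERENCES Node (NodeID),
  ChildID  int NOT NULL REFERENCES Node (NodeID),
  EdgeType varchar(20) NOT NULL            -- 'cable', 'wireless', ...
);

WITH Paths AS (
  -- start a path at every computer
  SELECT n.NodeID AS StartID, n.NodeID AS EndID,
         CAST('/' + CAST(n.NodeID AS varchar(10)) + '/' AS varchar(4000)) AS IdPath,
         CAST('/' + n.NodeType + '/' AS varchar(4000)) AS TypePath
  FROM Node n
  WHERE n.NodeType = 'computer'
  UNION ALL
  -- extend along the edges, never revisiting a node
  SELECT p.StartID, e.ChildID,
         CAST(p.IdPath + CAST(e.ChildID AS varchar(10)) + '/' AS varchar(4000)),
         CAST(p.TypePath + e.EdgeType + '/' + c.NodeType + '/' AS varchar(4000))
  FROM Paths p
  JOIN Edge e ON e.ParentID = p.EndID
  JOIN Node c ON c.NodeID = e.ChildID
  WHERE p.IdPath NOT LIKE '%/' + CAST(e.ChildID AS varchar(10)) + '/%'
)
-- “find all computers that at any point have a wireless connection between them”
SELECT StartID, EndID, hierarchyid::Parse(IdPath) AS Path, TypePath
FROM Paths
WHERE TypePath LIKE '/computer/%wireless%/computer/';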