Rethinking the Database

This is the final article in the series “What needs to be agreed upon”“What can be disagreed upon”“What will change and what will remain”, and “What we are”. The series has established the fundamental concepts in #transitional modeling, a theoretical framework for representing the subjectivity, uncertainty, and temporality of information. This is analog to the previously published paper “Modeling Conflicting, Unreliable, and Varying Information”, but here with the assertion converted to a meta-posit. I will now be so bold as to state that all information is subjective, uncertain and temporal in nature.

Having worked with Anchor modeling for 15 years, it had evolved to the point where the old formalization from the paper “Anchor modeling — Agile information modeling in evolving data environments” was no longer valid. I had also come to the point where I started to doubt the relational model as the best way to represent Anchor. It felt as I was working against relational rather than with it as more features were added. A working theory of the beautiful constructs posits and assertions had already been formulated, albeit under other names (attributes and timeline annexes) back in 2012, “Anchor Modeling with Bitemporal Data”. Thanks to these, I had started to think about what a database engine built around those concepts could do.

During the same period, NoSQL has seen its rise and fall, but it wouldn’t have rose at all if there wasn’t some circumstances in which SQL databases did not suffice. I believe it had to do with conformance. In order to get data into an SQL database it has to conform to a table, conform to a candidate key, conform to data types, conform to constraints, conform to rules of integration, conform to being truthful, conform to be free of errors, and conform to last. With this in place, data falls into three categories; non-conforming data that cannot be made to conform, non-conforming data that can be made to conform, and conformingdata. From my own experience, almost all data I was working with fell into the first two categories. If it cannot conform, simply discard, BLOB, or in rare cases, find a fitting data type, such as JSON or XML. If it can be made to conform, write complex logic that molds the data until it fits. If it directly conforms, do a reality check or accept that you have a JBOT-style database.

Here, NoSQL flourished in comparison, with practically zero conformance demands. Just dump whatever into the database. For someone who is spending most of their time writing complex logic that molds the data until it fits, this sounds extraordinarily attractive. The issue here, as it turned out, is that what is no longer your problem suddenly became someone else’s problem. The funny thing is, that someone else didn’t even have a job description at the time, which is why it has taken far too long to realize that “inconsistent conformance on every read” is not such a nifty paradigm. However, we also want to leave the “perfectly consistent conformance on a single write” paradigm behind us.

We are currently at a point where we’ve visited two extremes of a scale on how to conform information in order to store it; totally and not at all. With that in mind, it’s not that difficult to figure out a possible way forward. It has to be somewhere in between the two. I am not the only one who have thought of this. There is currently a plethora of database technologies out there, positioning themselves on this scale. To name a few, there are graph databases, triple stores, semantic fabrics, and the likes. In my opinion, all of these still impose too much conformance in order to store information. This is where I see a place for a transitional database, aiming to minimize conformance requirements, but still provide the mechanics for schemas, constraints, and classifications on write. Different from the others, these are subjective, evolving and possibly late-arriving schemas, constraints and classifications. Similar to “eventual consistency” in a blockchain, a transitional database has “eventual conformance”.

Let’s assume that we have access to a transitional database, built upon posits at its core. What type of queries could we expect to run?

  • Search anywhere for the unique identifier 42, NVP-like search.
  • Search for everything that has the girlfriend role, Graph-like search. 
  • Search for every time 42 was a girlfriend, Graph-like search. 
  • Search for everything nicknamed ‘Jen’, Relational-like search. 
  • Search for all Persons, Relational-like search.
  • Search for all subclasses of Person, Hierarchical-like search.
  • Search as it was on a given date, Temporal-like search. 
  • Search given what we knew on a given date, Bi-Temporal-like search. 
  • Search for disagreements between 42 and 43, Multi-tenant-like search. 
  • Search that which is at least 75% certain, Probabalistic-like search. 
  • Search for corrections made between two dates, Audit-like search. 
  • Search for all model changes made by Jen, Log-like search.
  • Search for how many times consensus has been reached, new feature. 
  • Search for how many times opposite opinions have been expressed, new feature. 
  • Search for individuals that have contradicted themselves, new feature.
  • Search for when a constraint was in place, new feature.

That sure seems like a handy database, given the things it can answer. It’s a shame that it does not yet exist. Or does it? As it happens I am working on precisely such a database, written in the Rust programming language. My goal is to release a working prototype as Open Source by the end of the summer. After that I will need help, so start polishing your Rust now!

What we are

This is a continuation of the articles “What needs to be agreed upon”“What can be disagreed upon”, and “What will change and what will remain”. So far we have learnt to build posits and assert these, in order to create an exhaustive transcript of a discussion. Furthermore, the transcript came alive as a stone tablet, onto which we continuously and non-destructively can capture changes of values or opinions, following the three individuals involved. We also started to glimpse the power of #transitional modeling as a formal framework.

The last article ended with the posit P4 as ({(S44, nickname)}, Jen, 1988). We had already seen a similar posit P2 as ({(J42, nickname)}, Jen, 1988). The way the story had been told, we have presumed that J42 is a female human being. Presumptions only lead to headaches. If not now, then somewhere down the road. The transcript is clearly not exhaustive enough, so we need to rectify this, and at the same time solve the mystery of identity S44.

As it turns out, an utterance about an unrelated topic was made in the discussion. Someone said “Haha, but I wonder what Jen (J42) feels about being hit by Jen (S44)? That storm is about to hit shores tomorrow.” Aha, there’s the person J42 and the storm S44, both nicknamed Jen. In order to tell what things are, we again need to reserve some roles. Let’s reserve the strings ‘thing’ and ‘class’. We can now create two new posits ({(J42, thing), (C1, class)}, active, 1980-02-13) and ({(S44, thing), (C2, class)}, active, 2019-08-10). These connect things to classes, but the classes themselves also need to be explained.

The classes C1 and C2 can be dressed up with a lof of their own information, but let us stay with the basics and only introduce two more posits ({(C1, named)}, Female Child, 2019-08-20) and ({(C2, named)}, Storm, 2019-08-20). But wait, Jen is not a child any longer. Let’s also add ({(C3, named)}, Female Adult, 2019-08-20). If we assume that you become an adult at the age of 18, then ({(J42, thing), (C3, class)}, active, 1998-02-13). The same dereferencing set, but with a different value and a later timepoint, in other words a change. Things may change class as time goes by.

The third party reading the transcript is not much for specifics. Generics are much better. Let’s help her out and add ({(C4, named)}, Person, 2019-08-20) along with ({(J42, thing), (C4, class)}, active, 1980-02-13). The third party can assert these, simultaneously as Jennifer herself asserts the other. There is a difference of opinion, leading to concurrent models, both equally valid. Thinking about it some more, it turns out that these particular classes can actually be related ({(C1, subclass), (C4, superclass)}, active, 2019-08-20) and ({(C3, subclass), (C4, superclass)}, active, 2019-08-20). Both female children and female adults are persons.

Now that we’ve seen some of what #transitional modeling can do, it is still only a theoretical framework. What if there was a database built using posits at its core? This is the topic of the next article, entitled “Rethinking the database”.

What will change and what will remain

This is a continuation of the two articles “What needs to be agreed upon” and “What can be disagreed upon”, in which two people have a discussion, after which a third party is invited to interpret a transcription of it. We have concluded that if the meaning of a posit is universally agreed upon, anyone is nevertheless at liberty to disagree or express doubt towards what it is saying. Opinions about posits are recorded in the transcript itself using assertions, a kind of meta-posit that assigns someone’s confidence level with respect to a posit.

Change is everywhere, and the last article concluded that both the circumstances that posits and assertions describe may change over time. Values change and opinions change. Let us make the transcript a living document, required to capture such changes. Additionally, the transcript much be historically complete, capturing the changes in a non-destructive manner. Anything that goes into the transcript is written in stone.

Grab your chisel, because Jennifer broke up with her boyfriend. Recall that the posit P1 is ({(J42, girlfriend), (B43, boyfriend)}, official, 2019), where J42 is Jennifer and B43 her boyfriend. The posit P3 tells us what happened in 2020 and it looks like this ({(J42, girlfriend), (B43, boyfriend)}, broken up, 2020). Remember the posit P2? It is ({(J42, nickname)}, Jen, 1988). Clearly, they are all different posits, P1 ≠ P2 ≠ P3, but P1 and P3 must share something in order for them to be describing a change that they do not share with P2.

It is actually possible to precisely define change. When two posits share the same set in their first position, but have different values and one time point follows the other, they describe a change. With that in place, P3 is obviously a change from P1. Since the set in P2 differs from that in P1 and P2, it is not a change of either P1 or P3. In #transitional modeling, the set is called a dereferencing set. They remain, indefinitely, while their surroundings may change entirely. Even after J42 is gone, the dereferencing sets in which that identity is found will remain, because we can, of course, have a recollection of things that are no more.

Then how does change affect assertions? Since assertions are posits themselves it works in exactly the same way. Jennifer made the following assertion on the 5th of April in 2019 ({(P1, posit), (J42, determines confidence)}, 0.8, 2019-04-05), stating that it is very likely that she and her boyfriend officially has been an item since 2019. After learning that her “boyfriend” thought otherwise, she changed her mind. Let’s say that they had a serious talk about this on the 21st of September 2019. Jennifer’s revised confidence in P1 can then be expressed through another assertion ({(P1, posit), (J42, determines confidence)}, 0, 2019-09-21). This assertion changes her previous confidence level from 0.8 to 0, so after the 21st she has no clue if they actually were an item or not.

Incidentally, this is how a ‘logical delete’ works in a bitemporal database. Albeit, in those databases there are only two confidence values, 1 (recorded) and 0 (deleted). The database itself is also the only one allowed to have an opinion. In other words, the functionality of a bitemporal database can be described as a small subset of #transitional modeling. When confidences are extended to the continuous interval [0, 1], the functionality approaches that of probabalistic databases. If further extended to include negative confidence, [-1, 1], we approach uncertainty theory, yet to be explored in databases. The fact that anyone may have an opinion is similar to multi-tenant databases. Transitional modeling as a formal base is very powerful, despite it’s simple construction.

Back in the shoes of the third party reading the transcript, we find the posit P4 as ({(S44, nickname)}, Jen, 1988). So, wait, what? There were in fact two different Jens after all, J42 and S44? What exactly is the thing with identity S44? This will be the topic of the next article, entitled “What we are”.

What can be disagreed upon

This is a continuation of the article entitled “What must be agreed upon”, in which two individuals have a discussion, whereafter a transcript appears that a third party is invited to interpret. The transcript consists of a number of posits, for example ({(J42, girlfriend), (B43, boyfriend)}, official, 2019) and ({(J42, nickname)}, Jen, 1988). The syntax of a posit is described as a triple, where the first position is occupied by a set of ordered pairs, each pair being an identity and a role. The set is followed by a value and time point, which both may be imprecise in nature. What must be agreed upon is the syntax of the posit and the semantics of what it expresses. That sums up the conclusions of the earlier article.

Given the title of this article, let us follow up by investigating disagreements, and here comes an important distinction. Even if you understand what a posit is saying, it doesn’t imply that you believe in what the posit is saying. Many different opinions are certainly held towards a statement such as “We are alone in the universe”. So, if we want to talk about posits in the language of #transitional modeling, the posit itself must be given an identity. To talk about ({(J42, girlfriend), (B43, boyfriend)}, official, 2019) we give it the identity P1 and ({(J42, nickname)}, Jen, 1988) will be P2. The identities make it possible to create new posits that talk about other posits; meta-posits if you like.

If posits that talk about other posits live alongside posits that talk about other things, we cannot allow for any confusion with respect to the roles. We will therefore reserve roles for our purposes, say the strings ‘posit’ and ‘determines confidence’. An assertion is a posit exemplified by ({(P1, posit), (J42, determines confidence)}, 0.8, 2019-04-05). The interpretation is that Jennifer (J42), since 2019-04-05, expresses a confidence of 0.8 with respect to if her girlfriend/boyfriend status with B43 was official since 2019 (P1). Similarly ({(P1, posit), (B43, determines confidence)}, -1, 2019-04-05). We will see that those confidence numbers reveal a big conflict!

Confidences, at lease those found in our assertions, fall within the [-1, 1] interval. The mapping between how something is expressed in natural language and the numerical confidence is fuzzy. But, let us assume that 0.8 corresponds to “very likely”. Then Jennifer is saying that it is very likely that B43 officially became her boyfriend in 2019. The twist in the plot is that the boyfriend is of a very different opinion. On the negative scale of confidences, certainty towards the opposite is expressed, and -1 is “completely certain of the opposite”. More precisely, this is equivalent to being completely certain of the complement of a value. In other words, the boyfriend is completely certain, with confidence value 1, of the posit ({(J42, girlfriend), (B43, boyfriend)}, anything but official, 2019).

Tucked in between is a confidence of 0, which we call “complete uncertainty”. Let us assume that the boyfriend is clueless, and instead asserted ({(P1, posit), (B43, determines confidence)}, 0, 2019-04-05). This is interpreted as the boyfriend having no clue whatsoever if P1 is a truthful posit or not. Perhaps memory is failing or the boyfriend chose to forget. Assertions with confidence give us a powerful machinery to express differences of opinon. To recap, confidence 1 means certain of one particular value, -1 certain it’s a value different from one particular value, and 0 that it could be any value whatsoever. Values in between express confidence to a given degree.

The first article mentioned the unlikeliness of Jennifer eternally being in a good mood. There will come a time when her mood is different. Likely when she finds out what her boyfriend is asserting. At that point, maybe the recollection of her boyfriend is better, and he changes his mind. Circumstances definitely change over time, but we haven’t yet seen change in action. This will be the topic of the next article, entitled “What will change and what will remain”.

The observant reader will notice that assertions here are slightly different from in the paper “Modeling conflicting, uncertain, and varying information”. There the assertion is an actual construct, a predicate, different from the posit. Here the assertion is a meta-posit built around a semantical reservation. The reasoning behind the different approach is that if assertions are expressed as posits, they can be talked about as well using another layer of posits; a kind of über-meta-posits.

What needs to be agreed upon

If we are to have a discussion about something there are a few things we first need to agree upon in order to communicate efficiently and intelligibly. A good way to test for what we need is to transcribe a discussion between two individuals and afterwards hand this over to a third non-participating individual, who should be able to unambiguously interpret the transcription.

Let’s say one of the individuals in the discussion frequently talks about ‘Jennifer’, while the other uses ‘Jen’. In order to avoid possible confusion, it should be agreed upon that ‘Jen’ and ‘Jennifer’ is the same person. This can be done by the introduction of a unique identifier, say ‘J42’, such that the transcript may read “I talked to Jen (J42) today.” and “Oh, how is Jennifer (J42) doing?”. We need to agree upon the identities of the things we talk about.

Assuming the respose is “Jen (J42) is doing good.” followed by “I figured as much, new boyfriend and all, so Jennifer (J42) must be good.”. In this case ‘good’ is a value that should mean the same thing to both individuals. However, ‘good’ is not scientifically precise, but there is a consensus and familiarity with the value that let’s us assume that both parties have sufficiently equal definitions to understand each other. Imprecisions that would lead to confusion could of course be sorted out in the discussion itself. If “Jen (J42) is doing so so.”, then “What do you mean ‘so so’, is something the matter?” is a natural response. We need to agree upon values, and that any imprecisions lie within the margin of error with respect to the mutual understanding of their definitions.

Now, if the transcription needs to be done efficiently, leaving out the nuances of natural language, that challenge can be used to test whether or not the two constructs above can capture the essence of any discussion. Using identities and values we can construct pairs, like (J42, Jen), (J42, Jennifer), (J42, good). Forget that you’ve heard anything and put yourself in the shoes of the third party. Apparently, something is missing. There is no way to tell what ‘Jen’, ‘Jennifer’, and ‘good’ are with respect to the thing having the identitiy J42. This can be sorted out by adding the notion of a role that an identity takes on with respect to a value. Using roles, the pairs become triples (J42, nickname, Jen), (J42, first name, Jennifer), and (J42, mood, good). We need to agree upon the roles that identities take on with respect to values.

Surely this is enough? Well, not quite, unless Jen is permanently in a good mood. The triple (J42, mood, good) is not temporally determined, so how could we tell when it applies? The third party may be reading the transcript years later, or Jen may even have broken up with her boyfriend before the discussion took place, unbeknownst to the two individuals talking about her. Using a temporal determinator, interpretable as ‘since when’, the triples can be extended to quadruples (J42, nickname, Jen, 1988), (J42, first name, Jennifer, 1980-02-13), and (J42, mood, good, 9.45 on Thursday morning the 20th of August 2019). The way the point in time is expressed seems to differ in precision in these quadruples. Actually, there is no way to be perfectly precise when it comes to expressing time, due to the nature of time and our way to measure it. It’s not known exactly when Jennifer got her nickname Jen, but it was sometime in 1988. We need to agree upon the points in time when identities took or will take on certain roles with respect to certain values, to some degree of precision.

This is almost all we need, except that the quadruple can only express properties of an individual. What about relationships? Let B43 be the identitiy of Jen’s boyfriend. We may be led to use a quadruple (J42, boyfriend, B43, 2019), but this has some issues. First, B43 is in the third position, where a value is supposed to be, not an identity. If we can overlook this, the second issue is more severe. Can we really tell if B43 is the boyfriend of J42 or if J42 is the boyfriend of B43? The way we introduced the role, as something an identity takes on with respect to a value, the latter alternative is the natural interpretation. Finally, the nail in the coffin, relationsships may involve more than two identities. Where in the quadruple would you put the third party?

The solution is to return to a triple, but where the first position contains a set of ordered pairs ({(J42, girlfriend), (B43, boyfriend)}, official, 2019). This resolves the issue of who is the boyfriend and who is the girlfriend. The second position is again a value and the third is the temporal determinator. Looking back, we can actually consolidate the quadruple used to express properties to a triple as well ({(J42, nickname)}, Jen, 1988). The only difference being that the cardinality of the set is one for properties and more than one for relationships. Triples like these are called posits in #transitional modeling.

We could leave it here and be able to represent a whole lot of information this way. But, let us return to the three individuals we have been talking about. Now that the transcript is in place, consisting of a number of posits, what if they cannot agree upon it being the single version of the truth? What if some of what was written down is not 100% certain? What if someone is of an opposite opinion? This sounds like important information, and it can actually easily be transcribed as well, but it requires another construct, the assertion. This will be the topic of the next article; “What can be disagreed upon”.

Screw the Hammer

About 500BC the greek philosopher Heraclitus said “Panta Rhei”, which translates to “everything flows”. The essence of his philosophy is that if you take a thing and examine it at two different points in time, you can always find something that changed in between the two. This thinking is at the heart of transitional modeling, in that not only does it capture the changes a thing has undergone, but also changes to the interpretation of the examination (making it bitemporal). Furthermore, while a thing is an absolute, an examination is relative to the one performing it. In transitional modeling concurrent and possibly conflicting opinions are allowed. Finally, examinations may lead to imprecise results or there may be uncertainty about the examination process. Both imprecision and uncertainty are well defined and representable concepts in transitional modeling. 

I believe that even if you are not facing requirements that fits the above, yet, you will benefit from understanding the concepts of transitional modeling. Given some screws and a hammer it is way too easy to let the tool decide the solution, which is precisely what has been going on with database modeling.

Many features of transitional modeling can be found in our concurrent-reliance-temporal Anchor modeling. Why not take it for a spin in the online modeling tool?

http://anchormodeling.com/modeler/latest

Schemafull Databases

Over the last decade schemaless databases have become a thing. The argument being that too much work is required at write time conforming information to the schema. Guess what, you only moved that work higher up the information refinery. In order to digest the information, it will at some point have to conform to some schema anyway, but now you have to do the work at read time instead. Would you rather spend additional time once, when writing, or spend additional time every time you read? You may also have heard me say that ‘unstructured data is just data waiting to be structured’.

That being said, there is a huge problem with schemas, but the problem is that information needs to be stored according to one rigid schema, rather than alongside many flexible schemas. The rigidity of current schema based databases often leave us no option but to peel and mold information before it fits. Obviously, discarding and deforming information is bad, so I do understand the appeal of schemaless databases, even if they fail to address the underlying issue.

The only way to store the bananas above in my stomach are to peel them and mold them through my mouth, acting much like a relational database with a particular schema. The relational database forces me to do a few good things though, things not always done in schemaless databases. Identification has to be performed, where bananas are given identities, so that we know if this is a new banana or one that is already in the stomach. Thanks to things having identities, we can also relate them to each other, such that all these bananas at one point came from the same bunch above.

Another advantage of identities, along with the ability to relate these to the identities of other things and their properties, is that we can start to talk about the things themselves. Through such metaspeak it is possible to say that a thing is of the Banana type. With this realization, it is hard to understand why a schema should be so rigidly enforced, unless metaspeak is assumed to be autocratically written in stone, while at the same time being univocal, all-encompassing, and future proof. How true does that not ring to someone living in the real world?

No, let there be diversity. Let us take the good things from relational databases, such as identification and the possibility to express relationships as well as properties, but let us not enforce a schema, taking the best part of the schemaless databases as well. Instead, let there be metaspeak expressed using the same constructs as ordinary speak, and let this metaspeak be spoken in as many ways as there are opinions. Let us talk about #schemafull databases instead!

A #schemafull database is one in which information is stored as pieces of information, posits, each tied to a an identified thing, and opinions about these pieces, assertions, made by some identified thing. What class of things a particular thing belongs to is just another piece of information, about which you may be certain to a degree, hold a different opinion than someone else, or later change your mind about. These pluralistic “schemas” live and indefinitely evolve alongside the information. There is neither no limit to how much additional description these classes and schemas can be given. They could, for example, be parts of different (virtual) layers, such as logical or conceptual, or they could be related hierarchically, such that a Banana is a Fruit.

Surely though, if we are to talk about a things and have a fruitful conversation, must we not agree upon something first? We do, of course, and what we believe to be the least to be agreed upon are necessarily identities and roles, and presumptively values. As an example, looking at the posit “This banana has the color yellow”, everyone who asserts this must agree that ‘this banana’ refers to the same thing, or in other words uniquely identifies a particular thing. The role of ‘that having a color’ must have the same meaning for everyone, in how it applies and how it is measured. Finally, ‘yellow’ should be as equally understood as possible.

The reason the value part is less constrained is simply because not all values have a rigorous definition. The color yellow is a good example. I cannot be sure that your yellow is my yellow, unless we define yellow using a range of wavelengths. However, almost none of us run around with spectrometers in our pockets. The de facto usage of yellow is different, even if we can produce and measure it scientifically. We will therefore presume that for two different assertions of the same posit “This banana has the color yellow”, both making those assertions share an understanding of ‘yellow’. There is also the possibility of representing imprecision using fuzzy values, such as ‘yellowish’, where ‘brownish’ may to some extent overlap it.

It is even possible to define an imprecise Fruitish class, which in the case of the Banana may be a good thing, since bananas botanically are berries. It’s also important to notice the difference between imprecision and uncertainty. Imprecision deals with fuzzy posits, whereas uncertainty deals with fuzzy assertions. It is possible to state that “I am certain that Bananas belong to the Fruitish class”, complete certainty about an imprecise value. Other examples are “I am not so sure that Bananas belong to the Fruit class”, uncertainty about a precise value, and “I am certain that Bananas do not belong to the Fruit class”, complete certainty about the negation of precise value.

A database needs to be able to manage all of this, and we are on our way to building one in which all this will be possible, but we are not there yet. The theory is in place, but the coding has just started. If you know or can learn to program in Rust and want to help out, contact me. You can also read more about #transitional modeling in our scientific paper, and in the articles “Schema by Design“, “The Illusion of a Fact“, “Modeling Consensus and Disagreement“, and “The Slayers of Layers“. Don’t miss Christian Kaul’s “Modeling the Transitional Data Warehouse” either, in which conflicting schemas are exemplified.

There is also an implementation of #transitional modeling in a relational database, of course only intended for educational purposes. The rigid schema of which can be seen below…

Stop peeling and molding information and start recording opinions of it instead. Embrace the schemas. Embrace #transitional modeling. Have a banana!

♫ Let’s Twine Again ♫

In our paper Temporal Dimension Modeling we introduced the concept of a twine. A twine is an efficient set based algorithm that can be applied when you have a table in which you have recorded a history of changes and some other table with related points in time, for which you want to know which historical rows were in effect at those different time points.

Let us look at an example. In one table we have stored the weekly billboard rankings of songs, and in another we have information about the artists performing the songs, and a third table with song information. The billboard table references both together with a ranking and a date indicating since when that ranking came into effect.

The ranking is historized weekly, so that if the ranking has changed since last week, a new row with the new rank is added with a later ValidSince date. The table below lists rows for the song “Let’s Twist Again” by Chubby Checker.

Given that we know the song was released on the 19th of June 1961, we see that it went straight into the 25th spot. It reached the 8th spot as it highest rating, and the latest information we have is that it is currently outside of the top 100. Note that for some weeks there are no rows since the ranking remained the same. This is a typical example of managing change without the introduction of data duplication. Thanks to the fact that Ranking is exhaustive, such that that it can always express a valid value, there is no need for a ValidTo column (also known as end-dating). The benefit of avoiding ValidTo is that historization can be done using a insert-only database, avoiding costly update operations.

Now, let’s assume that these tables have lots and lots of data, and we would like to know what ranking each song had one month after its release. This forces us to do a temporally dependent join, or in other words a join in which the join condition involves temporal logic. Picking up the song “Let’s Twist Again” from the Song table gives us the 1961-06-19 ReleaseDate and the SongID 42. Looking at the table above we need to find which ranking was in effect for SongID 42 on 1969-07-19 (one month later). Visually, it is easy to deduce that it was ranked 15th, since that ranking is valid since 1961-07-13 until replaced by the best ranking on 1961-08-20, and 1969-07-19 falls in between those dates.

Unfortunately relational databases are inherently bad at optimizing in between conditions. The best trick before the twine was to use a CROSS APPLY together with a TOP 1, which would give the query optimizer the additional information that only a single version can be in effect at a given point in time. Using this approach still had us, after four hours of execution time, cancelling a similar query we ran for the tests in our paper. Rewriting the query in numerous different ways using known tricks with subselects, row numbering, last value functions, and the likes, produced the same results. It was back to the drawing board.

After a while we had a sketch similar to the one above on a whiteboard. I recalled having written about a very similar situation, where I called for parallel projections in databases. So, what if we could project one timeline onto another and just keep the segment where they match? We started by forming a conjoined timeline, where time points were labelled so we could determine from which timeline they had originated.

Now we just needed to traverse the conjoined timeline and at every B-point pick up the latest value of the A-point. As it turns out, conjoining can be done using a simple union operation, and finding the latest A-point value can be done using a cumulative conditional max operation. Both performance efficient operations that, at least theoretically, only requires one pass over the underlying tables, avoiding Row-By-Agonizing-Row issues.

The resulting code can be seen to above. The union will combine the timelines, on which a conditional max operation is performed with a window. The condition limits the values to only A-points and the window makes sure it is picking up the the largest one for each song. Now this also gives us rows for the A (billboard history) timeline in which the Timepoint and ValidSince always will be the same. An outer condition discards these by limiting the result to those relevant for the B (songs) timeline. When looking at the actual execution plan of the query, we can see that the tables have indeed only been scanned once.

Twining is very performance efficient! To drive the point home, the mentioned query that was cancelled after four hours only took seconds to run when it was rewritten as a twine. If you are worried about the sort in the execution plan, adding some indexes (that basically precalculate the sorts) changes it into a merge join. The tradeoff is that sorting is done at write time or rather than at read time, so it will depend on your particular requirements what you choose.

Finally, as an exercise, try to produce the twine for the question: “How many artists had a top 10 hit on their 40th birthday?” Happy Twining!

The Slayers of Layers

Frankeinstein montage, using images in the public domain.

Once upon a time, tech had no layers. Tech was obscure, unmaintainable and inaccessible, reserved to the few. Few implementors, few maintainers, and few users. Tech, ingenious as it is, circumvented its issues by introducing layers, and now tech is for the masses. Layers conceal the ugliness underneath, each layer bringing with it understandability, maintainability, and accessibility.

We have the layers to thank for the incredible development that put us in the Information Age. Doubtlessly, the layers have brought with them a lot of good things, but we should not forget that they exist because of a bad thing. Because tech is immature, poorly designed, insufficiently researched, or plainly opportunistic, layers are slapped on like makeup until it is beautiful enough to be sold to the masses. This has led to the rapid development of ugly tech and a tremendous waste of resources.

Unfortunately, efforts in the line of layerless architectures are often met with resistance. This is understandable. Due to the large sunk costs and workforce occupied with layers, the protectionism make them next to impenetrable. How much money is your organisation spending on layers? How much money are you making from layers? Have you ever stopped to question the existence of layers?

It is easy to see the incentives to keep the layers around, but they will have to go. Charles Darwin will make sure they do. The “survival of the fittest” will, over the long run, favour organisations that adopt well designed tech on which they spend less resources than their competitors. The only caveat is that evolution requires time, and right now precious time is spent researching layered architectures, rather than layerless ones. In its eagerness to satisfy the industry, too much effort is steered towards applied science and too little towards pure science.

We need to go back to the drawing board and question the inability of every tech that needs layers. Some may even have layers only for the sake of having layers, some cannot be saved, but some are ugly ducklings on their way to become beautiful swans. We will have layers for a long time still and they do serve a purpose while the underlying tech is immature, but we cannot let the layers prevent the tech itself from reaching maturity.

In every discipline there will be one tech to rule them all, and trust me, that tech will be layerless. Which side are you on? Are you going to be slayers of layers?

It is time to rethink the layer. When you see a layer, act with suspicion and question its existence. If nothing else, think about the costs involved having to maintain every additional layer. Above all, we need to go back to the root of the problem and create tech that needs few or no layers to be acceptable.

Temporal Dimensional Modeling

Back in 2012 we introduced a way to “fix” the issues with Slowly Changing Dimensions in Dimensional Modeling. That script was actually sent to Ralph Kimball, but seeing as not much has happened in the following seven years, we decided to become a bit more formal and write a paper about it. It is entitled “Temporal Dimensional Modeling” and can be read on ResearchGate. Even if you are not interested in Dimensional Modeling (disclaimer: we are not either), you can learn a lot about temporality and improper ways to manage it, through the existing SCD types. Of particular general interest is the twine, a clever way to find historically correct relationships, given information that is stored differently.

Here is the abstract from the paper:

One of the prevalent techniques for modeling data warehouses is and has for the last decade been dimensional modeling. As initially defined, it had no constructs for keeping a record of changes and only provided the as-is latest view of available information. Since its introduction and from increasing requirements to record changes, different approaches have been suggested to manage change, mainly in the form of slowly changing dimensions of various types. This paper will show that every existing type of slowly changing dimension may lead to undesired anomalies, either at read or at write time, making them unsuitable for application in performance critical or near real-time data warehouses. Instead, based on current research in temporal database modeling, we introduce temporal dimensions that make facts and dimensions temporally independent, and therefore suffer from none of said anomalies. In our research, we also discovered the twine, a new concept that may significantly improve performance when loading dimensions. Code samples, along with query results showing the positive impact of implementing temporal dimensions compared to slowly changing dimensions are also presented.

The experiments in which performance was measured was done using Transact-SQL in Microsoft SQL Server. The code for reproducing the tests is available on GitHub.