Towards a Model-Driven Organization (Part 1)

Christian Kaul and Lars Rönnbäck

It’s incredible how many years I wasted associating complexity and ambiguity with intelligence. Turns out the right answer is usually pretty simple, and complexity and ambiguity are how terrible people live with themselves.

David Klion (2018)

Many organizations today struggle with a strong disconnect between their understanding of the work they are doing and the way their IT systems are set up.

Data is distributed over a large number of nonintegrated IT systems and manual interfaces (sometimes called “human middleware”) exist between incompatible applications. Within these applications, data may also be subject to regulations, and compliance is difficult to achieve. We can trace most, if not all, of these issues back to an abundance of unspecific, inflexible, and non-aligned data models underlying the applications these organizations use to conduct their business.

In this article, the first in a series, we briefly describe the issues resulting from this disconnect and their origins within a traditional organization. We then suggest a radical shift, to a model-driven organization, where all applications work towards a single data platform with a unified model. Instead of creating models that mirror the existing organization and its dysfunctions, we suggest first creating a unified model based on the goals of the organization, and thereafter deriving the organizational structure and the necessary applications from it.

Technologically, databases are now appearing in the market that can manage OLTP (operational) and OLAP (analytical) loads simultaneously, with associated app stores and application development frameworks, thereby enabling organizations to become model-driven.

Motivation

All organizations create data. When you’re using computers (and who isn’t these days), everything you do produces data. Therefore it’s not surprising that data is becoming an ever more important asset to manage.

Pretty much all organizations therefore store data in various shapes and forms. Putting this data to good use is the natural next step on the agenda and organizations that are successful in that respect claim to be data-driven.

The transition from giving little attention to data to becoming data-driven has been gradual, and many businesses have yet to organize themselves around data. Rather, data is predominantly organized around business processes. Unification of the available data is done far downstream, after it has passed through the organizational structure, the people in the different departments, the applications they use, and the databases in which they have stored it.

These databases also have their own application-specific models, creating a disparate data structure landscape that is hard to navigate, and unification of these is usually a resource-intensive and ongoing task in an organization. This leads to confusion, frustration and an often abysmal return on investment for data initiatives.

Processes

Ultimately, organizations have some set of goals they wish to fulfill. These can be goals for the organization itself (profit, market share, etc.), but also goals related to their customers (satisfaction, loyalty, etc.), their employees (health, efficiency, etc.), applicable regulations (GDPR, SOX, etc.), or society as a whole (sustainability, equality, etc.).

The organization then structures itself in some way based on a perception of how best to work towards these goals. This perception is often influenced by current management trends, with flavors like functional, matrix, project, composite, and team-based organizational structures. There are also various frameworks associated with these, describing ways of working within the organization, such as ITIL, SAFe, Lean, DevOps, and Six Sigma.

The sheer number of flavors and frameworks gaining and falling in popularity should be a warning sign that something is amiss. We believe that all of these treat the symptoms, but none of them get at the root cause of the problem.

Technology

A heterogeneous application and data store landscape within an organization strongly detracts from achieving a unified view of the data it contains.

There are a plethora of job titles related to dealing with this heterogeneity: enterprise architect, integration architect, data warehouse architect, and the like. There are also different more or less systematic approaches, such as enterprise messaging systems, microservices, master data management, modern data stack, data mesh, data fabric, data lake, data warehouse, data lakehouse, and so on.

Again, the sheer number of titles and approaches, and their gaining and falling in popularity, should be a warning sign that something is amiss. We believe that these, too, treat the symptoms but do not get at the root cause of the problem.

Problem Statement

The problem is that the way an organization is intended to work is usually misaligned with how it actually works, due to a number of factors distancing the ideal way of working from the de-facto way of working.

Some of these factors causing misalignment are:

  • The goals of the organization are vague, fuzzy, and known only to a select few individuals.
  • The de-facto way of working is a legacy from a different time.
  • The de-facto way of working strays from the ideal because of management fads.
  • The de-facto way of working is externally incentivized by vendors who benefit from it.
  • The de-facto way of working is a compromise due to technological limitations.
  • The de-facto way of working is sufficient to be profitable.

In future articles, we will show how these misalignment factors can be addressed in a model-driven organization, bringing its way of working much closer to the ideal.

We also believe that the significant divide between created data and actionable data found in most organizations is debilitating, since actionable data is what in the end creates value for the organization.

Data and Organizations

While products or services tend to leave the organization, data usually does not. It is the remainder of the daily operations, the breadcrumbs of human activity inside the organization, and as such the source from which an organization may learn, adapt and evolve.

If the collective knowledge of an organization only resides in the memories of its employees, it will never be utilized to its full potential. Even worse, given record-high turnover (what some call “the great resignation”), this knowledge is leaving the organization at a dangerously high rate. This is especially harmful because it’s usually not the least competent, least experienced people leaving, quite the opposite.

Harnessing the full potential of the knowledge hidden in its data is therefore a necessity in the “survival of the fittest”-style environment most organizations face today. The survival of the organization depends on it, not figuratively but literally. Therefore, the data an organization creates must be stored, and stored in a way that makes it readily actionable.

The Traditional Approach

Looking at the architecture of a traditional organization (Figure 1), we see that the organizational structure is formed to satisfy its goals.

The people working within this organizational structure buy and sometimes build applications that simplify their daily operations or solve specific problems. These applications create data, often stored in some database local to each application.

Data is then integrated from the disparate models found in the many application databases into a single database with a unified model. Analytics based on this unified model helps people understand what is going on in the organization, and indicators show whether or not it is on the right track to achieving its goals.

Figure 1: Data within a traditional organization.

In this architecture, there is a divide between created data and actionable data. This divide also reduces the capacity with which the organization can assess its progress towards its goals.

Trying to Make Sense of Your Data

Data is created far from where it is analyzed, and data creation is often governed by third-party applications made for organizations in general, not custom-made for a specific organization.

The models those applications have chosen for the data they create rarely align perfectly with the model of a particular business. In order to align data created by different applications into a unified model of the organization, data must be interpreted, transported, and integrated (the dreaded ELT processes of extracting, loading and transforming data).

Application developers usually face fewer requirements than those a unified model should serve. As an example, there is often little to no support for retaining a history of changes in an application; it shows only the current state of things. Any natural progression or correction that happens simply overwrites the existing data. Living up to regulations that require both of these types of changes to be kept historically can significantly raise the complexity of the architecture needed to interpret, transport, and integrate data.

Another aspect complicating the architecture is the need for near real-time analytics. Interpreting, transporting, and integrating data are time-consuming operations, so achieving zero latency is impossible, even with a massive increase in how frequently these processes execute.

Data in the unified model is therefore never immediately actionable. Reducing this lag puts a strain on both the applications and the database serving the unified model, introduces additional challenges when it comes to monitoring and maintenance, and can incur significant cloud compute costs.

Trying to Make Sense of Someone Else’s Model

Applications that are not built in-house are normally built in a way that they are suitable for a large number of organizations. Their database models may therefore be quite extensive, in order to be able to serve many different use cases. These models also evolve with new versions of the applications.

Because of this, it is unusual for all possible data to be interpreted, transported, and integrated into the unified model. Instead, some subset is selected. Because of new requirements or evolving applications, this subset often has to be revised. Adapting to such changes can consume significant portions of the time available for maintaining the unified model.

Maintaining a separate database with a unified model also comes with a monetary cost. Staff with specialist skills are needed to build unified models and the logic for interpreting, transporting, and integrating data, and to maintain these over time. On top of that comes the cost of keeping a separate database to hold the unified model. Depending on whether this is in the cloud or on premise, there may be different costs associated with licensing, storage, compute, and backups.

Fragmentation

In larger or more complex organizations, the specialists can rarely comprehend and be responsible for all sources, given the number of applications used.

This results in hyper-specialization on some specific sources and tasks, which impairs their ability to understand and deliver on requirements that encompass areas outside of their expertise. Hyper-specialization also increases the risks of having single points of failure within the organization.

Making data actionable in the heterogeneous application landscape resulting from the traditional approach outlined above requires a lot of work and carries a significant cost for the organization. There should be a better way, and we’re convinced there is one. We’ll go into more detail in the next article in this series.

Large Scale Anchor Modeling

Quoting the video description:

The Data Vault approach gives the data modelers a lot of options to choose from: how many satellites to create, how to connect hubs with links, what historicity to use, which field to use as a business key. Such flexibility leaves a lot of room for suboptimal modeling decisions.

I want to illustrate some choices (I call them issues) with risks and possible solutions from other modeling techniques, like Anchor Modeling. All issues are based on years of evolving Data Vault and Anchor Modeling data warehouses of 100+ TB in databases such as Vertica and Snowflake.

Speaker: Nikolai Golov is Head of Data Engineering at ManyChat (a SaaS startup with offices in San Francisco and Yerevan), and a lecturer at Harbour Space University in Barcelona (data storage course). He studies modern data modeling techniques, like Data Vault and Anchor Modeling, and their applicability to big data volumes (tens and hundreds of TB). He also, as a consultant, helps companies launch their own analytical/data platforms.

Recorded at the Data Modeling Meetup Munich (DM3), 2022-07-18 https://www.meetup.com/Data-Modeling-DM3

Also recommended are the additional Medium articles by Anton Poliakov: https://medium.com/@yaschiknamail

Atomic Data

We failed. I recently attended the Knowledge Gap conference, where we had several discussions related to data modeling. We all agreed that we are in a distressing situation, both concerning the art as a whole and its place in modern architectures, at least when it comes to integrated data models. As an art, we are seeing a decline both in interest and schooling, with practitioners shying away from its complexity and the topic disappearing from curriculums. Modern data stacks are primarily based on shuffling data, and new architectures, like the data mesh, propose a decentralized organization around data, making integration an even harder task.

When I say we failed, it is because data modeling in its current form will not take off. Sure, we have successful implementations and modelers with both expertise and experience in Ensemble Modeling techniques, like Anchor modeling, Data Vault and Focal. There are, however, not enough of them, and as long as we are not the buzz, opportunities to actually prove that this works, and works well, will wane. We tried, but we’re being pushed out. We can push back, and push back harder, but I doubt we can topple the buzzwall. I won’t stop pushing, but maybe it’s also time to peek at the other side of the wall.

If we begin to accept that there will only be a select few who can build and maintain models, but many more who will push data through the stack or provide data products, is there anything we can do to embrace such a scenario?

Data Whisperers

Having given this some thought, I believe I have found one big issue preventing us from managing data in motion as well as we should. Every time we move data around we also make alterations to its representation. It’s like Chinese Whispers (aka the telephone game), in which we are lucky to retain the original message when it reaches the last recipient, given that the message is whispered from each participant to the next. A piece of information is, loosely speaking, some bundle of stuff with a possible semantic interpretation. What we are doing in almost all solutions today is to, as best we can, pass on and preserve the semantic interpretation, while caring less about the bundle it came in. We are all data whisperers, but in this case that’s a bad thing.

Let’s turn this around. What if we could somehow pass a bundle around without having to understand the possible semantic interpretation? In order to do that, the bundle would have to have some form that would ensure it remained unaltered by the transfer, and that defers the semantic interpretation. Furthermore, whatever is moving such bundles around cannot be surprised by their form (read: like throwing an exception), so this calls for a standard. A standard we do not have. There is no widely adopted standard for messaging pieces of information, and herein lies much of the problem.

The Atoms of Data

Imagine it was possible to create atoms of data. Stable, indivisible pieces of information that can remain unchanged through transfer and duplication, and that can be put into a grander context later. The very same piece could live in a source system, or in a data product layer, or in a data pipeline, or in a data warehouse, or all of the above, looking exactly the same everywhere. Imagine there was both a storage medium and a communication protocol for such pieces. Now, let me explain how this solves many of the issues we are facing.

Let’s say you are only interested in shuffling pieces around. With atomic data pieces you are safe from mangling the message on the way. Regardless of how many times you have moved a piece around, it will have retained its original form. What could have happened in your pipelines, though, is that you have dressed up your pieces with additional pieces, adding context along the way.

Let’s say you are building an integrated enterprise-wide model. Now you are taking lots of pieces and want to understand how these fit into an integrated data model. But the model itself is also information, so it should be possible to describe it using some atoms of its own. The model becomes a part of your sea of atoms, floating alongside the pieces it describes. It is no longer a piece of paper printed from some particular modeling tool. It lives and evolves along with the rest of your data.

Let’s say you are building a data product in a data mesh. Your product will shuffle pieces to data consumers, or readers may be a better word, since pieces need not be destroyed at the receiving side. Some of them may be “bare” pieces, that have not yet been dressed up with a model, some may be dressed up with a product-local model and some may have inherited their model from an enterprise-wide model. Regardless of which, if two pieces from different products are identical, they represent the same piece of information, modeled or not.

Model More Later

Now, I have not been entirely truthful in my description of the data atoms. Passing messages around in a standardized way needs some sort of structure, and whatever that structure consists of must be agreed upon. The more universal such an agreement is, the better the interoperability and the smaller the risk of misinterpreting the message. What exactly this is, the things you have to agree upon, is also a model of sorts. In other words, no messaging without at least some kind of model.

We like to model. Perhaps we even like to model a little bit too much. Let us try to forget about what we know about modeling for a little while, and instead try to find the smallest number of things we have to agree upon in order to pass a message. What, similar to a regular atom, are the elementary particles that data atoms consist of? If we can find this set of requirements and it proves to be smaller than what we usually think of when it comes to modeling, then perhaps we can model a little first and model more later.

Model Little First

As it happens, minimal modeling has been my primary interest and topic of research for the last few years. Those interested in a deeper dive can read up on transitional modeling, in which atomic data pieces are explored in detail. In essence, the whole theory rests upon a single structure: the posit.

posit_thing [{(X_thing, role_1), ..., (Y_thing, role_n)}, value, time]

The posit acts as an atomic piece of data, so we will use it to illustrate the concept. It consists of some elements put together, for which it is desired to have a universal agreement, at least within the scope in which your data will be used.

  • There are one or more things, like X_thing and Y_thing, and the posit itself is also a thing.
  • Each thing takes on a role, like role_1 to role_n, indicating how these things appear.
  • There is a value, which is what appears for the things taking on these roles.
  • There is a time, which is when this value is appearing.

Things, roles, values, and times are the elements of a posit, much like elementary particles build up an atom. Of these, the roles need modeling, and, less commonly, if values or times are of complex types, they may also need modeling. If we focus on the roles, they provide a vocabulary, and it is through these that posits later gain interpretability and relatability to real events.

p01 [{(Archie, beard color)}, "red", '2001-01-01']
p02 [{(Archie, husband), (Bella, wife)}, "married", '2004-06-19']

The two posits above could be interpreted as:

  • When Archie is seen through the beard color role, the value “red” appears since ‘2001-01-01’.
  • When Archie is seen through the husband role and Bella through the wife role, the value “married” appears since ‘2004-06-19’.

Noteworthy here is that what we traditionally separate into properties and relationships is managed by one and the same structure. Relationships in transitional modeling are also properties, but ones that take several things in order to appear.

Now, the little modeling that has to be done, agreeing upon which roles to use, is surely not an insurmountable task. A vocabulary of roles is also easy to document, communicate, and adhere to. Then, with the little modeling out of the way, we’re on to the grander things again.
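
To make the structure concrete, here is a minimal sketch of a posit in Python (an illustration with made-up names, not the actual bareclad implementation), where the appearance set is a frozenset of (thing, role) pairs:

from dataclasses import dataclass
from typing import Any, FrozenSet, Tuple

@dataclass(frozen=True)
class Posit:
    appearance_set: FrozenSet[Tuple[str, str]]  # {(thing, role), ...}
    value: Any                                  # what appears for the things taking on these roles
    time: str                                   # when this value appears

p01 = Posit(frozenset({("Archie", "beard color")}), "red", "2001-01-01")
p02 = Posit(frozenset({("Archie", "husband"), ("Bella", "wife")}), "married", "2004-06-19")

Since the dataclass is frozen and the appearance set is a frozenset, a posit built this way cannot be altered in place, which anticipates the immutability discussed further down.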

Decoupling Classification

Most modeling techniques, at least current ones, begin with entities. Having figured out the entities, a model describing them and their connections is made, and only after this model has been rigidly put into a database are things added. This is where atomic data turns things upside down. With atomic data, lots of things can be added to a database first, and then, at some later point in time, these can be dressed up with more context, like an entity model. The dressing up can also be left to a much smaller number of people if desired (such as integration modeling experts).

p03 [{(Archie, thing), (Person, class)}, "classified", '1989-08-20']

After a while I realize that I have a lot of things in the database that may have a beard color and get married, so I decide to classify these as Persons. Sometime later I also need to keep track of Golf Players.

p04 [{(Archie, thing), (Golf Player, class)}, "classified", '2010-07-01']

No problem here. Multiple classifications can co-exist. Maybe Archie at some point also stops playing golf.

p05 [{(Archie, thing), (Golf Player, class)}, "declassified", '2022-06-08']

Again, not a problem. Classification does not have to be static. While a single long-lasting classification is desirable, I believe we have put too much emphasis on static entity models. Loosening up classification, so that a thing can actually be seen as more than one type of entity and so that classifications can expire over time, allows models to be very specific, yields much more flexibility, and extends the longevity of kept data far beyond what we have seen so far. Remember that our atomic pieces are unchanged and remain, regardless of what we do with their classifications.
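
To illustrate how such loosened-up classification can be worked with, here is a small sketch in Python (my own, using plain tuples for posits) that derives which classes are in effect for a thing at a given point in time from posits like p03 to p05:

# Each posit is (appearance_set, value, time); classification posits use the roles "thing" and "class".
posits = [
    (frozenset({("Archie", "thing"), ("Person", "class")}), "classified", "1989-08-20"),
    (frozenset({("Archie", "thing"), ("Golf Player", "class")}), "classified", "2010-07-01"),
    (frozenset({("Archie", "thing"), ("Golf Player", "class")}), "declassified", "2022-06-08"),
]

def classes_in_effect(thing, as_of, posits):
    latest = {}  # class -> (time, value) of the most recent classification posit up to as_of
    for appearance_set, value, time in posits:
        roles = {role: filler for (filler, role) in appearance_set}
        if roles.get("thing") == thing and "class" in roles and time <= as_of:
            cls = roles["class"]
            if cls not in latest or time > latest[cls][0]:
                latest[cls] = (time, value)
    return {cls for cls, (_, value) in latest.items() if value == "classified"}

print(classes_in_effect("Archie", "2015-01-01", posits))  # {'Person', 'Golf Player'}
print(classes_in_effect("Archie", "2023-01-01", posits))  # {'Person'}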

Multitenancy

Two departments in your organization are developing their own data products. Let us also assume that in this example it makes sense for one department to view Archie as a Person and for the other to view Archie as a Golf Player. We will call the Person department “financial” and it additionally needs to keep track of Archie’s account number. We will call the Golf Player department “member” and it additionally needs to keep track of Archie’s golf handicap. First, the posits for the account number and golf handicap are:

p06 [{(Archie, account number)}, 555-12345-42, '2018-01-01']
p07 [{(Archie, golf handicap)}, 36, '2022-05-18']

These posits may live their entire lives in the different data products and never reside together, or they could be copied to temporarily live together for a particular analysis, or they could permanently be stored right next to each other in an integrated database. It does not matter. The original and any copies will remain identical. With those in place, it’s time to add information about the way each department views these.

p08 [{(p03, posit), (Financial Dept, ascertains)}, 100%, '2019-12-31']
p09 [{(p04, posit), (Member Dept, ascertains)}, 100%, '2020-01-01']
p10 [{(p06, posit), (Financial Dept, ascertains)}, 100%, '2019-12-31']
p11 [{(p07, posit), (Member Dept, ascertains)}, 75%, '2020-01-01']

The posits above are called assertions, and they are metadata, since they talk about other posits. Information about information. An assertion records someone’s opinion of a posit and the value that appears is the certainty of that opinion. In the case of 100%, this corresponds to absolute certainty that whatever the posit is stating is true. The Member Department is less certain about the handicap, perhaps because the source of the data is less reliable.

Using assertions, it is possible to keep track of who thinks what in the organization. It also makes it possible to have different models for different parts of the organization. In an enterprise-wide integrated model, perhaps both classifications are asserted by the Enterprise Dept, or some completely different classification is used. You have the freedom to do whatever you want.

Immutability

Atomic data only works well if the data atoms remain unchanged. You would not want to end up in a situation where a copy of a posit stored elsewhere than the original all of a sudden looks different from it. Data atoms, the posits, need to be immutable. But, we live in a world where everything is changing, all the time, and we are not infallible, so mistakes can be made.

While managing change and immutability may sound like incompatible requirements, it is possible to have both, thanks to the time in the posit and through assertions. Depending on whether you are facing a new version or a correction, it is handled differently. If the beard of Archie turns gray, this is a new version of his beard color. Recalling the posit about its original color, and adding this new information, gives us the following posits:

p01 [{(Archie, beard color)}, "red", '2001-01-01']
p12 [{(Archie, beard color)}, "gray", '2012-12-12']

Comparing the two posits, a version (or natural change) occurs when they have the same things and roles, but a different value at a different time. On the other hand, if someone made a mistake entering Archie’s account number, this needs to be corrected once discovered. Let’s recall the posit with the account number and the Financial Dept’s opinion, then add new posits to handle the correction.

p06 [{(Archie, account number)}, 555-12345-42, '2018-01-01']
p10 [{(p06, posit), (Financial Dept, ascertains)}, 100%, '2019-12-31']
p13 [{(p06, posit), (Financial Dept, ascertains)}, 0%, '2022-06-08']
p14 [{(Archie, account number)}, 911-12345-42, '2018-01-01']
p15 [{(p14, posit), (Financial Dept, ascertains)}, 100%, '2022-06-08']

This operation is more complicated, as it needs three new posits. First, the Financial Dept retracts its opinion about the original account number by changing its opinion to 0% certainty: complete uncertainty. For those familiar with bitemporal data, this is sometimes referred to as a ‘logical delete’. Then a new posit is added with the correct account number, and this new posit is asserted with 100% certainty in the final posit.

Immutability takes a little bit of work, but it is necessary. Atoms cannot change their composition without becoming something else. And, as soon as something becomes something else, we are back to whispering data and inconsistencies will abound in the organization.
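
As a rough sketch of what this looks like in practice (my own illustration, again with posits as plain Python tuples), a correction never touches existing data; it only appends new posits:

# A posit: (id, appearance_set, value, time). Assertions are posits about posits,
# using the reserved roles "posit" and "ascertains", with certainty as the value.
information = [
    ("p06", frozenset({("Archie", "account number")}), "555-12345-42", "2018-01-01"),
    ("p10", frozenset({("p06", "posit"), ("Financial Dept", "ascertains")}), 1.00, "2019-12-31"),
]

def correct(information, old_id, asserter, assertion_time, new_id, appearance_set, value, time):
    # Retract the old posit by asserting it with 0% certainty (a "logical delete").
    information.append((new_id + "_retraction",
                        frozenset({(old_id, "posit"), (asserter, "ascertains")}), 0.00, assertion_time))
    # Add the corrected posit and assert it with 100% certainty. Nothing is updated or deleted.
    information.append((new_id, appearance_set, value, time))
    information.append((new_id + "_assertion",
                        frozenset({(new_id, "posit"), (asserter, "ascertains")}), 1.00, assertion_time))

correct(information, "p06", "Financial Dept", "2022-06-08",
        "p14", frozenset({("Archie", "account number")}), "911-12345-42", "2018-01-01")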

What’s the catch?

All of this looks absolutely great at first glance. Posits can be created anywhere in the organization provided that everyone is following the same vocabulary for the roles, after which these posits can be copied, sent around, stored, classified, dressed up with additional context, opinionated, and so on. There is, however, one catch.

Identifiers.

In the examples above we have used Archie as an identifier for some particular thing. This identifier needs to have been created somewhere. This somewhere is what owns the process of creating things like Archie. Unless this is centralized or strictly coordinated, posits about Archie and Archie-like things cannot be created in different places. There must be universal agreement on what thing Archie represents, and no other thing may be Archie than this thing.

More likely, Archie would be stated through some kind of UID, an organizationally unique identifier. Less readable, but closer to the actual case, would be:

p01 [{(9799fcf4-a47a-41b5-2d800605e695, beard color)}, "red", '2001-01-01']

The requirement for the identifier of the posit itself, p01, is less demanding. A posit depends on each of its elements, so if just one bit of a posit changes, it is a different posit. The implication of this is that identifiers for posits need not be universally agreed upon, since they can be resolved within a body of information and recreated at will. Some work has to be done when reconciling posits from several sources though. We likely do not want to centralize the process of assigning identities to posits, since that would mean communicating every posit from every system to some central authority, more or less defeating the purpose of decentralization.
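
One way to see why posit identifiers can be resolved locally is that a posit is fully determined by its elements, so an identifier can be derived from its content. The sketch below (my own, not any standard) hashes a canonical rendering of a posit:

import hashlib

def posit_id(appearance_set, value, time):
    # Sort the (thing, role) pairs so that the same posit always yields the same identifier,
    # no matter where or how many times it is recreated.
    canonical = repr((sorted(appearance_set), value, time))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Identical posits resolve to the same identifier; the identifiers *inside* the posit,
# like "Archie", still require the universal agreement discussed above.
assert posit_id({("Archie", "beard color")}, "red", "2001-01-01") == \
       posit_id({("Archie", "beard color")}, "red", "2001-01-01")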

Conclusions

If we are to pull off something like the data mesh, there are several key features we need to have:

  • Atomic data that can be passed around, copied, and stored without alteration.
  • As few things as possible that need to be universally agreed upon.
  • Model little first and model more later, dressing up data differently by locality or over time.
  • Immutability so that data remains consistent across the organization.
  • Versions and corrections, while still adhering to immutability.
  • Centralized or strictly coordinated assignment of identifiers for things.

As we have seen, getting all of these requires carefully designed data structures, like the posit, and a sound theory of how to apply them. With the work I have done, I believe we have both. What is still missing are the two things I asked you to imagine earlier, a storage medium and a communication protocol. I am well on the way to producing a storage medium in the form of the bareclad database engine, and a communication protocol should not be that difficult, given that we already have a syntax for expressing posits, as in the examples above.

If you, like me, think this is the way forward, please consider helping out in any way you can. The goal is to keep everything open and free, so if you get involved, expect it to be for the greater good. Get in touch!

We may have failed. But all is definitely not lost.

Information in Effect and Performance

Last week we had some performance issues in a bitemporal model, which by the looks of it were the result of a poorly selected execution plan in SQL Server. The reasoning behind this conclusion was that if parts of the query were first run separately, with the results stored in temp tables that were then used, the issues were gone. This had me thinking though: could something be done in order to get a better plan through the point-in-time views?

I first set about testing different methods of finding the row in effect in a unitemporal solution. In order to do so, a script was put together that creates a test bench along with a number of functions utilizing different methods. This is the script in case you would like to reproduce the test. Note that some tricks had to be employed for some methods in order to retain table elimination (a crucial feature), which may very well have skewed those results towards the negative.

The best performers in this test are the “OUTER APPLY” and “TOP 1 SUBSELECT”. We are already using the “TOP 1 SUBSELECT” variant, and they are almost tied for first place, so perhaps not much can be gained after all. That said, the execution pattern is very different between the two, so it’s hard to draw any conclusions without proper testing for the bitemporal case.

In the bitemporal point-in-time views, the rows-in-effect method has to be used twice: first to find the latest asserted posits, and then, from those, the ones with the latest appearance time. So, I set about testing the four possible combinations of the two best approaches on one million rows in an example model. The results are summarized below.

  • TOP 1 SUBSELECT appearance, TOP 1 SUBSELECT assertion: time to run 8.0 seconds. This is the current approach.
  • OUTER APPLY appearance, OUTER APPLY assertion: time to run 5.1 seconds. Better than current, even if the estimated cost is worse.
  • TOP 1 SUBSELECT appearance, OUTER APPLY assertion: time to run 9.5 seconds. Worse than current.
  • OUTER APPLY appearance, TOP 1 SUBSELECT assertion: time to run 3.9 seconds. Better than current, and lower estimated cost.

Results

The last of the alternatives above cuts the execution time in half for the test we ran. It also has the simplest execution plan of them all. This seems promising, given that our goal was to get the optimizer to pick a good plan in a live and complex environment. I will be rewriting the logic in the generator for bitemporal models during the week to utilize this hybrid method of OUTER APPLY and TOP 1 SUBSELECT.

Temporal Complexity

Having taken a deep dive into our convenience functionalities, which aim to remove most obstacles for working with temporal data, I once again “appreciated” the underlying complexities. This time around I decided to quantify these. Just how difficult is it to introduce time in a database? Is bitemporal comparatively a huge leap in complexity, as I have been touting for years without substantial proof? The answer is here.

Tracking versions is four times as difficult as not tracking anything, and adding corrections on top of that makes it forty times as difficult.

To see how we got to these results, we will use the number of considerations you have to take into account as a measure. This is not exact science, but likely to be sufficiently good to produce a rule of thumb.

No temporality

When you have no intent of storing any history in your database, you will still have the following considerations. The (rough) number of things to consider is printed in parentheses before the description of each consideration.

  • (2) Your key will either match no rows or one row in the database, no prep needed.
  • (2) The value for the key will either be the same or different from the one stored.

Total: 2 × 2 = 4 considerations.

Not so bad, most people can understand some if-else logic for four cases.

Tracking versions (uni-temporal)

Stepping up and adding one timeline in order to track versions, that is, the changes of values over time, many additional concerns arise.

  • (3) Your key will match no rows or up to possibly many rows in the database, some prep may be needed.
  • (2) The value for the key will either be the same or different from the one stored.
  • (3) The time of change may be earlier, the same, or later than the one stored.

Total: 3 × 2 × 3 = 18 considerations.

In other words, tracking versions is more than four times as difficult as just ignoring them altogether. Ignorance is not bliss here though, mark my words.

Tracking versions and corrections (bi-temporal)

Taking the leap, to also keep track of corrections made over time, even more concerns arise.

  • (3) Your key will match no rows or up to possibly many rows in the database, some prep may be needed.
  • (3) The value for the key will either be the same, logically deleted, or different from the one stored.
  • (3) The time of change may be earlier, the same, or later than the one stored.
  • (3) The time of correction may be earlier, the same, or later than the one stored.
  • (2) Your intended operation may be an insert or a logical delete.

Total: 3 × 3 × 3 × 3 × 2 = 162 considerations.

If you managed to pull through the 18 considerations from tracking versions, imagine nine times that effort to track corrections as well. Or, if you came from not tracking anything, prepare yourself for something requiring forty times the mental exercise.

Tracking versions, and who held an opinion about those and their certainty (multi-temporal)

I just had to compare this to transitional modeling as well, for obvious reasons.

  • (3) Your key will match no rows or up to possibly many rows in the database, some prep may be needed.
  • (5) The value for the key will either be the same, logically deleted, held with some degree of certainty (either towards the value itself or its opposite), or different from the one stored.
  • (3) The time of change may be earlier, the same, or later than the one stored.
  • (3) The time of assertion may be earlier, the same, or later than the one stored.
  • (3) Your intended operation may be an insert, a logical delete, or with consideration to existing data result in you contradicting yourself or not.
  • (2) Assertions may be made by one or up to any number of asserters.

Total: 3 × 5 × 3 × 3 × 3 × 2 = 810 considerations.

That’s two hundred times more complex than most databases. It sort of makes me wonder how I ended up picking this as a topic for my research. But, here I am, and hopefully I can contribute to making everything more understandable in the end. In all fairness, many of the considerations actually have trivial outcomes, but those that do not can keep your thought process going for weeks.

Thankfully, in all the scenarios above, much of this logic can be hidden from the end user, thanks to “default” rules being applied by triggers.

Modified Trigger Logic

The uni-temporal triggers have been rewritten in order to take advantage of the performance optimizations discovered for the bi-temporal generator. At the same time, the check constraints have been removed in favor of AFTER triggers, which are more lenient (but still correct) when inserting several versions at once. Early tests indicate the following improvements:

  • Insert 1 million rows into latest view of empty anchor:
    88 seconds with old trigger logic and check constraints
    44 seconds with new logic
  • Insert another 1 million rows with 50% restatements:
    64 seconds with old trigger logic and check constraints
    46 seconds with new logic
  • Insert another 1 million rows with 100% restatements:
    37 seconds with old trigger logic and check constraints
    42 seconds with new logic

As can be seen, the performance difference is almost negligible for the new logic, regardless of the number of restatements. The only test in which the old logic performs slightly better is when every inserted row is a restatement, which is an uncommon (and probably unrealistic) scenario.

The new logic can be tested in the test version of the online modeler, now at version 0.99.9.0.

New Forums

We have migrated to new forum software, since Nabble was going into maintenance mode with an uncertain future. Your user account is still available if you remember and have access to the email address you used when you registered. Click “forgot password” and you will be sent instructions to reset it. Right now you have a random, unguessable password.

The new forum is available here:
Anchor Forum (anchormodeling.com)

We also posted a new topic on filtered indexes here:
Filtered indexes for hot stuff – Anchor Forum (anchormodeling.com)

Bitemporal Generator

We have made some performance improvements to the bitemporal generator (for SQL Server) in the Anchor modeler. Code from the generator has been running in a production environment for a while now without issues, so it should be rather safe to test out. Let us know if you find any issues.

The bitemporal generator is a subset of the concurrent-reliance-temporal generator, aimed at high performance.

Online modeler, test version:
https://www.anchormodeling.com/online-modeler-test-version/

Peridata between Data and Metadata

Somewhere in between data and metadata there is another kind of information, which we will name peridata. Perhaps you have found yourself looking at some piece of information and thinking, is this data or metadata? In this article, not only will you get a precise definition of what is what, but also a term for data living on the fringe of its classification. In order to achieve these definitions, we will turn to the posit, which is the fundamental building block of transitional modeling.

Posits

A posit essentially captures a piece of information. Here are two examples:

p1 = [{(Archie, beard)}, fluffy red, 2020-01-01]
p2 = [{(Archie, husband), (Bella, wife)}, married, 2004-06-19]

The first posit, p1, captures the information that Archie had a fluffy red beard on the 1st of January 2020. The second posit, p2, captures the information that Archie and Bella are married since the 19th of June 2004. Posits can express properties, as in p1, and relationships, as in p2. In transitional modeling, relationships are properties that require more than one thing to take on a value. Such an approach may be unfamiliar, since in most other modeling techniques there are separate constructs for properties and relationships. The proper way to read those two posits, using the notion of roles, is:

When Archie filled the beard role the value ‘fluffy red‘ appeared on 2020-01-01.

When Archie filled the husband role and Bella the wife role the value ‘married‘ appeared on 2004-06-19.

A singular thing filling a singular role gives rise to what we usually call properties or attributes, whereas a combination of things filling a combination of roles gives rise to relationships. Whenever roles are filled, some value appears. In the case of Bella and Archie it could just as well have been ‘divorced’, ‘planned’, or ‘not applicable’. In fact, for the vast majority of people the value would be ‘not applicable’, but we tend to document such posits only in the rare cases where they carry valuable information.

Given the terminology of things (Archie, Bella) and roles (beard, husband, wife), the structure of a posit can be formalized as:

posit = [
  {(thing 1, role 1), ..., (thing n, role n)},
  appearing value, 
  time of appearance
]

The set in the first position of the posit is called an appearance set, followed by the value that appears for that set and its time of appearance. Posits are just pieces of information and there is no requirement that they must be true. After all, there is a lot of untrue information out there and much more, maybe even most, that is uncertain to some degree. We do not want to disqualify any information from being recorded based on its certainty.

Data and Metadata

We will now make the distinction between data and metadata. Given an appearance set, if none of the things it contains are posits, then posits containing that set are classified as data. Correspondingly, given an appearance set, if at least one of the things it contains is a posit, then posits containing that set are classified as metadata. The examples given so far are data, since neither Archie nor Bella is a posit. Instead, one of the most important examples of metadata in transitional modeling is:

p3 = [{(p1, posit), (Bella, ascertains)}, 1.00, 2020-01-02]

There is no way to determine the truthfulness of a posit from the posit alone, so an additional construct is needed. An assertion is a posit that assigns a certainty to another posit. In the example above, Bella ascertains the posit about Archie’s beard, with absolute certainty, on the 2nd of January 2020. This is metadata, since p1 is a posit. Assertions are subjective, and so far we only have Bella’s view of p1. Certainty is expressed by a real number in the interval [-1, 1], where 1 is being absolutely certain of what the posit is stating, 0 is having no idea whatsoever, and -1 is being certain of the opposite of what the posit is stating. If you want to delve deeper into the expressiveness given by this machinery, you can read the paper “Modeling Conflicting, Unreliable, and Varying Information“.

Another common type of metadata, particularly in data warehouses, has to do with from which source posits originated.

p4 = [{(p3, source)}, The Horse's Mouth, 2020-01-01]

There could be a whole range of information related to the posit itself, like who or what recorded it, when it was entered into a database, its associated security or sensitivity, effective constraints at the time, or rules to apply in certain scenarios. These are just some examples, but all of which would be classified as metadata, because they involve a posit in their appearance sets.

Since metadata is also expressed using posits, these can be parts of appearance sets as well. For example, in p4 the assertion p3 is a part of the appearance set, so p4 is also metadata, but on a different “level” than p3, which is itself metadata. In such a case it makes sense to distinguish these as level-1 metadata and level-2 metadata, which could be extended up to any level-n metadata. I believe that going beyond level-1 metadata is unusual in existing implementations, and that there may be few use cases that need additional levels. However, when they are needed, they are probably also very important.

Peridata

While the rules separating data and metadata are clear cut, the way to tell data from peridata is less straightforward. In transitional modeling it is possible to reserve roles for particular purposes. One such example is used for classification.

p5 = [{(Archie, thing), (Person, class)}, active, 1972-08-20]

This posit tells us that Archie belongs to the Person class since 1972-08-20, using the reserved class role. Thanks to classification being expressed through posits, it is possible to disagree on these using assertions. It is also possible to have multiple classifications at once and to let classifications expire or become active at different points in time.

As you can see, there is no posit in the appearance set of p5, so it is not metadata by our previous definition. Although, the model is likely something that traditionally would have been classified as metadata. In order to distinguish this type of data from regular data, we will use the concept of reserved roles. But then, what are reserved roles? Well, you can think of them as being similar to reserved keywords in a programming language. In fact, in the examples so far, the roles posit, ascertains, thing, and class are already reserved in transitional modeling. The roles beard, husband, and wife depend on your domain and are instead something you as a modeler will have to bring into existence.

With this we can get definitions for all three categories.

  1. If at least one of the things contained in an appearance set is a posit, then all posits with this set are classified as metadata.
  2. If at least one of the roles contained in an appearance set is reserved, then all posits with this set are classified as peridata.
  3. If neither of above applies to an appearance set, then all posits with such sets are classified as data.

Peridata exists among your data, but sort of on the fringe, given that it requires these reserved roles. Note that it is possible to have peridata for your metadata as well, when both 1 and 2 apply. Transitional modeling will come with a set of reserved roles, all of which are domain independent, but there will also be an option for end users to reserve roles of their own.
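
The three rules can be captured in a few lines of Python. The sketch below is my own illustration; it assumes we can tell which things are posits and which roles are reserved, and, as noted above, rules 1 and 2 can apply at the same time:

RESERVED_ROLES = {"posit", "ascertains", "thing", "class"}  # reserved roles mentioned so far

def categorize(appearance_set, is_posit):
    categories = set()
    if any(is_posit(thing) for thing, role in appearance_set):
        categories.add("metadata")  # rule 1: at least one thing in the set is a posit
    if any(role in RESERVED_ROLES for thing, role in appearance_set):
        categories.add("peridata")  # rule 2: at least one role in the set is reserved
    return categories or {"data"}   # rule 3: neither of the above applies

is_posit = lambda thing: thing in {"p1", "p2", "p3", "p4", "p5"}

print(sorted(categorize({("Archie", "beard")}, is_posit)))                        # ['data']
print(sorted(categorize({("Archie", "thing"), ("Person", "class")}, is_posit)))   # ['peridata']
print(sorted(categorize({("p1", "posit"), ("Bella", "ascertains")}, is_posit)))   # ['metadata', 'peridata']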

Remarks

Thanks to transitional modeling, we have been able to break down what is traditionally thought of as a single metadata concept into two categories, metadata and peridata. On the fringe of your data you will find peridata, short for peripheral data, which captures such things as the classifications in your domain. Metadata is restricted to those pieces of information that explicitly talk about other pieces of information. Whether this distinction is useful remains to be seen, but it is certainly interesting. In a relational database, for example, the classifications in the modeled domain exist as a schema. Schemas are therefore peridata. Perhaps you can think of other commonly used model artifacts that fall within the scope of peridata or metadata?

On a side note, there are already some indications that the use of reserved roles can improve performance in a database engine based on posits. If you are interested in following the development of such an engine, check out bareclad.

The Infinite Decay of Loyalty

When most businesses think of customers, they think of them as someone with whom they have more than a fleeting engagement. It therefore makes sense to think of engagement lengths, or in other words, for how long a customer remains a customer. If your business falls within this category, you are likely to have asked yourself how long an average customer engagement is. If you also have a valid answer to this question, based on your particular circumstances, then I congratulate you. As it turns out, the question “How long is an average customer engagement length?” is in almost all cases ill-formulated and impossible to answer. All hope is not lost, however, as we shall see.

First, let us address the issue with the question itself. In any business over a certain size, there will be some customers that are loyal to the bone. They will stay with the business no matter what, until the demise of themselves or the business. Let us call this group the “eternals”. For the sake of illustration, even though not entirely mathematically correct, let these represent infinite engagement lengths. Now, remind yourself of how an average is calculated, as the sum of some engagement lengths divided by the number of customers having these lengths. If but one of your customers is an “eternal” the sum will be infinite, with your number of customers remaining finite, yielding an infinite average.

In reality, “eternals” stay for a very long but indefinite time, not infinitely long. Regardless, the previous discussion establishes that an average will be skewed to the point of uselessness or impossible to determine because of these customers. Interestingly, changing the question slightly circumvents the problem. If you instead ask “What is the median customer engagement length?”, it suddenly becomes much more approachable. Recall that the median is the value in the ‘middle’ of an ordered set of numbers. Given the engagement lengths 1, 8, 4, 6, 9, we order these by size to become 1, 4, 6, 8, 9, and conclude that 6 can be found in the middle and is therefore the median value. When the set of numbers has an even count, the median is the average of the two midmost numbers. The important feature of the median is that it is resilient to edge cases. Even if an infinite engagement length is added to the set, the median can still be calculated. This holds true as long as you do not have more than 50% “eternals” in your customer base.

The median engagement length represents the half life of your customer base. For a given cohort, say the customers signing up a certain year, after the median engagement length in years has passed, half of them are expected to remain. That is quite an understandable measure, but one problem still remains. In order to calculate the median, at least half of a cohort must have left. If the median engagement length is indeed measured in years for your business, would you want to wait that long to figure it out? Of course not. Now, this is a scenario I’ve found myself in more than once: with very little data, find a way to figure out the median engagement length. Surprisingly, and somewhat by happenstance, when I was looking for solutions, I stumbled upon what may be a universal pattern for how loyalty evolves over time. You see, most forecasting is done using curve fitting techniques, and finding the right equation is key. If you have only two or three points, there are lots of equations that you can apply, most of which will have very poor predictive power.

Fortunately, I happened to be at a company some 10 years ago where there were five yearly cohorts, whose development I could follow for 1, 2, 3, 4, and 5 years respectively. When plotting these, the first year of every cohort aligned almost perfectly. That indicated to me that there is some universality in the behavior of loyalty. The surprising part was that for four of the cohorts the first two points aligned, for three of them the first three, and so on. Now, this indicates that there is indeed some equation that can describe loyalty at this particular company. When found, it would predict the engagement lengths of whole cohorts with rather good accuracy, even brand new ones it seemed.

Looking at the shape of the curve the points were aligning to, it dropped off quite heavily in the first year, followed by successively smaller drops. The happenstance was that I recognized this type of curve. In a fortunate turn of events I had a couple of years earlier been working with calculations on the radioactivity of matter, and the beginning of this curve looked very much like exponential decay.

In exponential decay there is a fixed amount of time that passes before a cohort is halved. If you restart there, and view this as a new cohort, after the same amount of time it will halve again. Using Excel goal seek (a poor man’s brute forcing) with the formula below for exponential decay, I was able to quickly figure out the half life of the cohorts I had at hand. Since the half life coincides with the median, I was then able to answer the question “What is the median customer engagement length?” with some confidence, even if we had not passed that point in time yet.

The formula is N(t) = N₀ · 2^(-t/h), where N₀ is the original cohort size, t are the points in time at which you know the actual size N(t), and h is the half life constant you need to determine. In fact, looking at it purely mathematically, it is actually possible to determine the average engagement length as well (for pure exponential decay it is h / ln 2 ≈ 1.44 · h), if loyalty were to behave exactly like exponential decay. This is, however, again under the assumption that you have no “eternals” and that your cohort will truncate to zero customers once decay has brought it down to less than the number 1. Wikipedia also notes that the behavior is only well understood as long as the cohort is large:

“Many decay processes that are often treated as exponential, are really only exponential so long as the sample is large and the law of large numbers holds. For small samples, a more general analysis is necessary, accounting for a Poisson process.”
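
Returning to the goal seek, here is a minimal sketch of what it amounts to (my own illustration, with made-up retention numbers), searching for the half life h that makes N(t) = N₀ · 2^(-t/h) best match the observed cohort sizes:

N0 = 1000.0
observed = {1: 707.0, 2: 500.0}  # period -> remaining customers (illustrative numbers)

def decay(t, h):
    return N0 * 2 ** (-t / h)

def fit_half_life(observed, candidates):
    # Poor man's goal seek: pick the candidate h with the smallest squared error.
    return min(candidates, key=lambda h: sum((decay(t, h) - n) ** 2 for t, n in observed.items()))

h = fit_half_life(observed, [x / 100 for x in range(50, 1001)])
print(h)  # close to 2.0, i.e. a median engagement length of about two periods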

Now, some will likely find it extreme to assume that loyalty is decaying exponentially. But, if we dive a bit deeper, it actually turns out to be the most natural assumption. Let us change the approach and instead think of a customer as having a fixed probability to churn during a given time frame. For example, if we are looking at monthly cohorts, let p be the probability that a customer has churned in a month. For simplicity we assume all customers have the same probability to churn, but in reality some will be more likely and others less likely. Even so, there will be an average, corresponding to the actual number of customers lost, around which the individual probabilities are distributed in some fashion. After a month we would then get that (1-p)N₀ customers remain, after two months (1-p)(1-p)N₀, and so on.

This is a recursive formula that produces a series. Interestingly, if we find the correct probability this series can be made to match exponential decay perfectly.

From this we can conclude that if customers have a reasonably similar probability to churn in given time frames, the end result is necessarily exponential decay. If you want to play around with this series and curve you can do so in my online workbook in GeoGebra. Given a half life h, the formula to calculate p is p = 1 - 2^(-1/h). For example, in order to get a half life of two time periods, a churn rate of approximately 29% per period is needed.
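
To see why the match is perfect, note that with p = 1 - 2^(-1/h) each period multiplies the remaining cohort by (1-p) = 2^(-1/h), so after t periods (1-p)^t · N₀ = 2^(-t/h) · N₀, which is exactly the decay curve. A small check in Python (my own sketch):

h = 2.0
p = 1 - 2 ** (-1 / h)   # about 0.293, the ~29% churn per period mentioned above
N0 = 1000.0

remaining = N0
for t in range(7):
    print(t, round(remaining, 2), round(N0 * 2 ** (-t / h), 2))  # the two columns coincide
    remaining *= (1 - p)   # the recursive series: lose a fraction p each period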

Graphs like the one displayed by the exponential decay are called asymptotic, because as time approaches infinity the curve will approach zero. It is not hard to figure out that if the curve instead approached the number of “eternals” it would be an even better fit to the actual conditions. Changing the formula to accommodate this is simple:

N(t) = (N₀ - E) · 2^(-t/h) + E

The formula is very similar to the earlier one, but now with the additional constant E, representing the number of “eternals”. Of course, this is another number that is not known, and the additional degree of freedom makes brute forcing the values harder, but far from impossible. The Excel Solver plugin can do multivariate goal seeks, for example.

The green curve above is using the new formula, with a likely exaggerated 20% eternals. Both of these have the half life set to two time units. Given how closely these overlap before the first halving, they are likely to be inseparable when doing curve fitting early on. They do, however, diverge significantly thereafter, so determining E should become easier shortly after the first halving. Before that, estimating E must be done through other means, like actually engaging with and talking to customers, or in the worst case, through gut feelings.

Note that in the new formula the half life pertains to the time it takes to halve the number of “non-eternals”. In order to get the adjusted value of the constant for a desired overall half life, it must be multiplied by the unwieldy factor below. In the graph above, the value h = 1.41504 gives an actual half life of two time units.
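
That adjustment follows from solving (N₀-E)(1/2)^(T/h) + E = N₀/2 for h, where T is the desired overall half life; under these assumptions the factor is 1/log₂((N₀-E)/(N₀/2-E)), and with E at 20% of N₀ the logarithm works out to log₂(8/3) ≈ 1.41504.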

Assuming that all cohorts will behave like this, and that there is a recurring inflow of new customers, one can investigate the effects this has on a customer base over a longer period of time. If we start by taking the example of decaying cohorts without “eternals” and look at 15 consecutive time periods of acquisition, another surprise is in store for us.

The red curve is the sum of all the individual, gray, cohort curves, so it is in effect what the total customer base will look like. In reality, customers will likely not come in bursts between time periods, but somewhat more continuously. That would just reduce the jaggedness of the curve; it would still retain its general shape. What is particularly interesting about this shape is that it is not constantly growing, even though we are adding the same number of new customers every time period. The customer base grows fast in the beginning, but then the growth stalls. This is a mathematical inevitability.

With a constant inflow of new customers, an exponential decay of loyalty will eventually stall the growth of your customer base.

If you noticed the dotted line in the graph above, it is the upper bound, the largest number of customers you will ever get. This number can actually be calculated using the ratio in the rightmost part of the formula below. With the example of a 29% churn rate per time period, the largest number of customers is between three and four times the cohort size.
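
Summing the geometric series makes the bound explicit: after n time periods the total is

N₀ + (1-p)N₀ + (1-p)^2 N₀ + … + (1-p)^n N₀ = N₀(1-(1-p)^(n+1))/p

which approaches the ratio N₀/p as n grows; with p ≈ 0.29 that is roughly 3.4 times the cohort size.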

Over time, some customers are bound to return after a hiatus, at which point a business may view them as new again. Returning customers, even if the business has forgotten them in the meantime, are just a variation of “eternals”. The graph above is, in other words, only valid when there are no “eternals”, neither constant nor alternating. Let us therefore look at a similar graph for the more true-to-life example of decaying cohorts with “eternals”.

When “eternals” are part of the equation, the growth no longer stalls, and instead becomes more or less linear after an initial phase of more rapid growth. Recall that we use the likely exaggerated 20% in these examples, which is why the line is rather steep. Even so, this indicates that even a small percentage of “eternals” will make a significant difference in the development of your customer base.

Sustained growth of a customer base is only possible when some are eternally loyal.

That being said, growth cannot continue forever, for other reasons. There is a limited number of people living on this planet or, more to the point, a limited number of people in your target market, in which there is also competition for those customers. This places an upper limit on the possible market share any business can get. Even so, understanding the mathematical fundamentals of customer base growth and applying them to your situation can yield early and important insights.

Now, let us return to the dotted line in the final graph and see if we can find its equation. First, the recursive formula will have to be adjusted for the presence of “eternals”, so that it becomes as follows.
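
In the same notation, with only the “non-eternals” decaying, the recursion is

N(t+1) = E + (1-p)(N(t)-E), with N(0) = N₀

which in closed form gives N(t) = (N₀-E)(1-p)^t + E.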

When many such series are summed up, one for each cohort, the resulting total after n time periods becomes the sum of the individual terms up to n.
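
Written out under the same assumptions, that sum is

[(N₀-E)(1-p)^0 + E] + [(N₀-E)(1-p)^1 + E] + … + [(N₀-E)(1-p)^n + E] = (N₀-E)(1-(1-p)^(n+1))/p + E(n+1)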

From this the equation for the linear asymptote can be determined, and that line is described by the following equation, where t is the time passed.
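
As t grows, the (1-p)^(t+1) term vanishes, so the line is approximately

y(t) = E(t+1) + (N₀-E)/p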

With all the intellectually challenging and rather complex work done, what remains is that rather simple equation, which in essence describes the long term behavior of your customer base growth. From it you can easily see that if E = 0 we get the simpler and constant upper bound discussed earlier. We can also see that the steepness of the asymptote is independent of your churn rate, p. Halving the churn rate, for example, will not double your customer base growth. Also, the smaller your churn rate is, the smaller the effect of reducing it further.

Both increasing the number of “eternals” and reducing the churn rate suffer from diminishing returns. A small change will result in a comparatively even smaller change in growth, and the more loyal your customers already are, the smaller the effect will be.

In the graph above, the purple growth comes from halving the churn rate, compared to the blue growth. The orange growth instead comes from doubling the number of “eternals”. The long term effect of doubling the number of “eternals” is a higher sustained growth rate, and had the graph been longer, it would soon have overtaken the halved churn rate. Efforts aimed at producing “eternals” are therefore more important than efforts to reduce general churn.

With all that said, there is still one parameter that we have not tinkered with. Everything so far has relied on the assumption that the inflow is constant, that is, every cohort has the same size. For a mature business, this is not an unlikely scenario. But what if the cohorts themselves grow or shrink? How would that compare to the effects of increasing loyalty? In the graph below, the green growth has a 1% increase in cohort size between every point, and the red growth, similarly, a 1% decrease. Somewhat astoundingly, such a small increase equals the effect of doubling the “eternals”. More frighteningly, with a small decrease, the growth will again almost completely stall. This places the importance of sales in a new perspective.

Efforts to produce an incremental increase in customer inflow vastly outweigh efforts to increase loyalty in terms of effect on growth.
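
To see these effects side by side, below is a minimal simulation sketch in Python of the cohort model described above. The function and parameter names are my own, and the defaults (cohorts of 100 customers, a churn rate of roughly 29% per period, 20% “eternals”) simply reuse the example values from the text.

# A sketch of the cohort model: each period a new cohort arrives, and the
# "non-eternal" part of every existing cohort decays by the churn rate p.

def customer_base(periods, n0=100.0, p=0.29, eternals=20.0, cohort_growth=1.0):
    """Return the total customer base per period.

    n0            -- size of the first cohort
    p             -- churn probability per period for "non-eternals"
    eternals      -- number of "eternals" in the first cohort
    cohort_growth -- factor applied to each subsequent cohort (1.0 = constant inflow)
    """
    totals = []
    sizes = []        # remaining customers per cohort
    everlasting = []  # "eternals" per cohort, who never churn
    size, et = n0, eternals
    for _ in range(periods):
        sizes.append(size)
        everlasting.append(et)
        # only the part above the "eternals" decays
        sizes = [e + (s - e) * (1 - p) for s, e in zip(sizes, everlasting)]
        totals.append(sum(sizes))
        size *= cohort_growth
        et *= cohort_growth
    return totals

if __name__ == "__main__":
    horizon = 40
    scenarios = {
        "baseline":          customer_base(horizon),
        "halved churn":      customer_base(horizon, p=0.145),
        "doubled eternals":  customer_base(horizon, eternals=40.0),
        "1% cohort growth":  customer_base(horizon, cohort_growth=1.01),
        "1% cohort decline": customer_base(horizon, cohort_growth=0.99),
    }
    for name, totals in scenarios.items():
        print(f"{name:17} after {horizon} periods: {totals[-1]:7.1f} customers")

Plotting the returned totals, rather than just the end values, makes it easy to compare the shapes of the curves over whatever horizon is of interest.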

But does this really apply to your business? I cannot answer that question with certainty, but I can say that in the original business where I discovered this ten years ago, recent cohorts still adhere to this behavior, and old ones have not diverged from what was predicted. At the company where I work now, we have also applied this to two other businesses, in completely different domains and at other stages of development. It was a bit of a long shot, but it turns out that the pattern holds true for them as well. Loyalty decays exponentially. This is the reason I am writing this: I suspect that it could be an innate and universal property of loyalty.

I know that most of you won’t go back and start doing calculations, but to those of you who do, please let me know the results!

If this indeed holds true, even within a limited scope, spreading this knowledge should prove valuable for many.