We are proud to announce the release of version 0.99 of the modeling tool. This version has been in the making for over one year, and code generated from it is, as usual, already in production for a number of data warehouses and system databases. Expect a more user friendly interface, that again works in Chrome, after they unexpectedly and deliberately broke SVG support. The generated code has, among other things, improved trigger logic. For example, if you try to update a static attribute, you will now get a warning, rather than a failure. This should make life easier, particularly for those using our sisula ETL Framework for metadata driven Data Warehouse automation. Using this framework with the latest modeling tool we have built and put a DW in production in record time. It took less than a week to model, populate three years of history, and start a daily incremental load of a Data Warehouse used in a high security environment.
A while back Juan-José van der Linden created a script that would reverse-engineer a database into an Anchor model. He was kind enough to donate that script to the community, and it is available in our forum. Now there’s also a second effort which is available in the form of the script below. This is a work in progress and it will be updated with more features in the future. Perhaps we can merge the best features from JJ’s script into this one or the other way around.
Please note that the script will use column statistics in order to determine if knots should be created, so it may take a long time to run when no statistics are available. It will reuse existing statistics, so a second run of the script is much faster. It tries to determine ties based on primary keys and matching column names.
Last week we had the pleasure of certifying 24 new Anchor Modelers in the Netherlands. In other words, if you want to find expertise on Anchor modeling, you should head over to our Directory and contact them! We also held a public session with an introduction to Anchor modeling, showing them how it is possible to capture facts like “someone being somewhat sure about something, while someone else is completely sure about its opposite”. We were happy to have participants from Nike, the Dutch Police, Rabobank, Essent, ChainPoint, consultant companies like Free Frogs, Ordina, Cap Gemini, Kadenza, and a number of freelancers.
Thanks everyone for participating at the courses and presentations, and the great discussions during the breaks!
We had a great time at the 34th International Conference on Conceptual Modeling, also known as ER2015, running a demo station of the online Anchor modeler, standing in for Hans Hultgren and presenting Ensemble Modeling, participating in a panel discussion on Big Data, and presenting our own paper “Big Data Normalization for Massively Parallel Processing Databases” at MoBiD.
Here are some photos of Nikolay Golov presenting our paper. He had a very interesting story to tell of how Anchor has helped Avito scale out their data warehouse as the both the business and requirements grew very rapidly, multiplying the number of sources, rows, and terabytes with a factor of ten over three years.
One common question I am asked is how constraints are implemented in an Anchor modeled database. Temporality make constraints less straightforward than in static tables. With unitemporal history the constraint needs to hold for every possible time slice in the database given every point in changing time (tc). If concurrency and bitemporal history is used, they need to hold for every possible time slice for every positor given every point in bitemporal time (p, tp, tc). The good thing is that a slice behaves just like a static database. It is a snapshot of what the information looked like at that point.
Implementing this using a CHECK constraint would be quite cumbersome. Even if you can limit the points of time that needs to be checked to those actually used in the database they quickly grow to a large number. The better way is to do the checking with an AFTER INSERT trigger. At insert time, it is possible to check only against those slices that are affected by the insert.
Below is a script that generates a unique constraint for the stage name in our example model. Note that time slice checking is done by applying the point-in-time perspective of the affected anchor. This “template” for constraints can of course be extended to include more complex rules, such that two performances cannot be held on the same stage on the same date, found further down in the same script.
Note that in the case of the performance date constraint, we do not have to take time into account, since none of the involved constructs are historized. However, there is one additional important consideration. Since the trigger joins the tie, the tie is expected to have its rows populated before the attributes. In other words, for the trigger to detect duplicates, the loading order must be:
- Generate the desired number of identities in the performance anchor.
- Populate the tie by connecting these identities to the associated stages.
- Populate the attributes on the performance anchor.
If the loading order cannot be guaranteed, a similar constraint must be placed on the tie. In that case, attributes will pass through its trigger if loaded first, since join with the tie “truncates” the result set, but the trigger on the tie will fail because of duplicates. A larger transaction containing the two steps would then still roll back to what the database looked like before the insert.
We are proud to announce the release of version 0.98 of the modeling tool. This version has been in the making for over one year, and is already used in production for a number of data warehouses and system databases. The graphical interface, while retaining the look and feel, has been completely rewritten in animated SVG, which has improved performance and usability over the old canvas implementation. This step was necessary in order to make it easier to extend the tool graphically. The test version, now at 0.99, will make use of this in order to add support for visualising natural keys. A number of changes have also been made to the triggers, utilising nested triggers in order to reduce the size of a single trigger. Remember to disable triggers before doing inserts directly into attribute tables when you use ETL tools, rather than the triggers on the latest view.
The test version has already been given a number of fixes, primarily to align it with the current research in Anchor modeling. Among these are decisiveness, which controls whether a positor may hold multiple beliefs at a given point in bitemporal time, or if only a single belief is possible (which is the default behaviour).
We created the Iceberg Exercise in order to demonstrate a business case in which all features of Anchor modeling could be showcased. It has been presented at a number of conferences, but the actual model and example data with queries had not been published online. These are now made publicly available along with an explanatory video tutorial.
The model below captures most of the requirements given in the documentation of an Iceberg Tracking and Drift Prediction System. Icebergs have attached transmitters in order to keep track of their location. Icebergs may split or merge during their lifetime and enters or exits certain geographical areas. Icebergs and transmitters have a number of attributes, some of which may change over time. The model is concurrent-reliance-temporal in order to capture concurrent, but maybe conflicting, views of an iceberg, such as sightings from passing boats.
After generating the SQL code that creates an implementation of the model, the following script for Microsoft SQL Server, used in the tutorial, can be run in order to create some example data and run some illustrative queries.
The script has been extended to show some other features as well.
A while ago we created a metadata and SQL driven ELT framework for DW automation, particularly aimed at Anchor modeled data warehouses. This is now in use for three data warehouses and has matured enough for public release. The project is Open Source and available on GitHub (click here) and we have put together a playlist of video tutorials introducing the functionality of the framework. The framework uses the same sisula engine as the online modeling tool for generating the SQL code.
The dinner/workshop last night on Ensemble Modeling at Top of Minds was great. It’s always a treat when you put a bunch of people in a room that all have solutions to the same old common problems in Data Warehousing, but where the solutions themselves have taken on very different forms. While different, the meeting was about Ensemble Modeling, and all of the represented techniques; Anchor modeling, Data Vault, Focal Point, Hyperagility, and Head & Version, have some important common denominators that separate them from other techniques, such as 3NF and Dimensional Modeling.
Much of the discussions boiled down to what these denominators are. Whenever there are differences in our approaches, those cannot, for example, be part of the definition of Ensemble Modeling. In our latest paper we view an ensemble as the concept that can remember statements about similar things, usually by storing them in a database. The act of modeling is to define the boundaries between similar and dissimilar things. In other words, an ensemble should capture the essence of things similar enough to not belong to another ensemble, or loosely speaking “be the type of the thing”. In order to single out things belonging to the same ensemble, so called instances, all forms of Ensemble Modeling assume that some part of the thing is immutable. This is the ‘glue’ that holds the thing together through time, or rather, keeps what we know about the thing together. Even if a thing itself has vanished the memory of it can live forever.
This immutable part, joyfully referred to as “the king of the thing” at the dinner, is represented by an anchor table in Anchor, a hub in Data Vault, a focal in Focal Point, a head in Head & Version, and also hub in Hyperagility. It is to this the rest of the ensemble is attached. That rest, the mutable part, is decomposed into one or many acutal parts depending on the technique. Head & Version groups all mutable values, as well as relationships, in a single table. The others have some degree of decomposition, while what drives the separation is different in different techniques. Focal Point has predefined parts corresponding to abstract concept types, Data Vault has parts depending on rate of change and sometimes point of origin, Anchor has one part per role a value plays with respect to thing, and Hyperagility has an extensible name-value-pair nature. In these respects Ensemble Modeling stands out from other techniques, where mutable and immutable parts are not distinguishable.
To conclude the evening we realized that, somehow, all of us have come up with the same fundamental ideas as solutions to problems we have seen and experienced in our careers. Even if these diverge in many respects, it is what they share at their core that has proven so successful where traditional techniques have failed. All in all, I am glad to be able to influence and be influenced by likeminded people, and hope that we have many more sessions yet to come. I suppose there is some truth to the old proverb “great minds think alike” after all.
With the help of Nikolay Golov (avito.ru) support for generating unitemporal Vertica implementations have been added to the test version of the online modeling tool. Vertica is an MPP columnar database, and Avito runs a cluster of 12 nodes with 50TB of data in an Anchor model. A relational Big Data solution that outperforms previously tested NoSQL alternatives!
In Vertica there are three distribution models, and they happen to coincide with Anchor modeling constructs.
The data is available locally (duplicated) on every node in the cluster. This suits knots very well, since they may be joined from both ties and attributes anywhere.
The data is split according to some operation across the nodes. For example, a modulo operation on the identity column in the anchor and the attribute tables could be used to determine on which node data should end up. This keeps an instance of an entity and its history of changes together on the same node.
Data necessary in order to do a cross-node join is stored across the nodes, for all directions of the join. Ties fit this purpose perfectly and does not introduce any overhead thanks to only containing the columns on which joins are done.
Anchor modeling fits MPP extremely well, since the constructs are designed in such a way that the MPP is utilized as efficiently as possible. A knotted attribute or tie can be resolved locally on a node, and a join over a tie cannot be more efficient, while the assembly of the results from different nodes is trivial.
Incidentally, if you search about issues in Vertica, implementations may suffer from having to create lots of projections or broadcasts of wide tables in order to support ad-hoc querying. This is a natural result of using less normalized models. Anchor modeling, on the contrary, works “out of the box” for ad-hoc querying. Furthermore, what should be projected and broadcast is known beforehand, and not tuning work a DBA would have to worry about on a daily basis.
Since Vertica does not have table valued functions or stored procedures, only create table statements are generated so far. We are still working on providing some form of latest views.