Great Minds Think Alike

The dinner/workshop last night on Ensemble Modeling at Top of Minds was great. It’s always a treat when you put a bunch of people in a room who all have solutions to the same old common problems in Data Warehousing, but whose solutions have taken on very different forms. The meeting was about Ensemble Modeling, and while different, all of the represented techniques (Anchor modeling, Data Vault, Focal Point, Hyperagility, and Head & Version) have some important common denominators that separate them from other techniques, such as 3NF and Dimensional Modeling.

Much of the discussion boiled down to what these denominators are. Wherever our approaches differ, those differences cannot, for example, be part of the definition of Ensemble Modeling. In our latest paper we view an ensemble as the concept that can remember statements about similar things, usually by storing them in a database. The act of modeling is to define the boundaries between similar and dissimilar things. In other words, an ensemble should capture the essence of things similar enough not to belong to another ensemble, or loosely speaking “be the type of the thing”. In order to single out things belonging to the same ensemble, so-called instances, all forms of Ensemble Modeling assume that some part of the thing is immutable. This is the ‘glue’ that holds the thing together through time, or rather, keeps what we know about the thing together. Even if the thing itself has vanished, the memory of it can live forever.

This immutable part, joyfully referred to as “the king of the thing” at the dinner, is represented by an anchor table in Anchor, a hub in Data Vault, a focal in Focal Point, a head in Head & Version, and also a hub in Hyperagility. It is to this that the rest of the ensemble is attached. That rest, the mutable part, is decomposed into one or many actual parts, depending on the technique. Head & Version groups all mutable values, as well as relationships, in a single table. The others have some degree of decomposition, although what drives the separation differs between techniques. Focal Point has predefined parts corresponding to abstract concept types, Data Vault has parts depending on rate of change and sometimes point of origin, Anchor has one part per role a value plays with respect to the thing, and Hyperagility has an extensible name-value-pair nature. In these respects Ensemble Modeling stands out from other techniques, in which mutable and immutable parts are not distinguishable.
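As a minimal sketch of the idea in SQL (the table and column names here are hypothetical, following Anchor modeling conventions), the immutable identity and one mutable part could look as follows:

    -- The anchor holds nothing but the immutable identity of the thing.
    CREATE TABLE AC_Actor (
        AC_ID int not null,
        primary key (AC_ID)
    );

    -- A mutable part is kept in its own table, glued to the identity,
    -- so the full history of values can be remembered.
    CREATE TABLE AC_NAM_Actor_Name (
        AC_ID int not null,
        AC_NAM_Actor_Name varchar(42) not null, -- the mutable value
        AC_NAM_ChangedAt datetime not null,     -- when the value came into effect
        primary key (AC_ID, AC_NAM_ChangedAt),
        foreign key (AC_ID) references AC_Actor (AC_ID)
    );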

To conclude the evening we realized that, somehow, all of us have come up with the same fundamental ideas as solutions to problems we have seen and experienced in our careers. Even if these solutions diverge in many respects, it is what they share at their core that has proven so successful where traditional techniques have failed. All in all, I am glad to be able to influence and be influenced by like-minded people, and I hope that we have many more sessions yet to come. I suppose there is some truth to the old proverb “great minds think alike” after all.

[Photo] Ensemble Modeling: Poe Eriksson, Dani Schneider, Lars Boström, Hans Hultgren, Lars Rönnbäck, Patrik Lager

Unitemporal Vertica support in test

With the help of Nikolay Golov (avito.ru), support for generating unitemporal Vertica implementations has been added to the test version of the online modeling tool. Vertica is an MPP columnar database, and Avito runs a cluster of 12 nodes with 50TB of data in an Anchor model: a relational Big Data solution that outperforms previously tested NoSQL alternatives!

In Vertica there are three distribution models, and they happen to coincide with Anchor modeling constructs.

Broadcast
The data is available locally (duplicated) on every node in the cluster. This suits knots very well, since they may be joined from both ties and attributes anywhere.

Segment
The data is split across the nodes according to some operation. For example, a modulo operation on the identity column in the anchor and attribute tables could be used to determine on which node data should end up. This keeps an instance of an entity and its history of changes together on the same node.

Project
Data necessary in order to do a cross-node join is stored across the nodes, for all directions of the join. Ties fit this purpose perfectly and do not introduce any overhead, since they contain only the columns on which joins are done.
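As a rough sketch of how these choices could look in Vertica DDL (the projection, table, and column names here are our own hypothetical examples, not what the tool generates):

    -- Broadcast: replicate a small knot on every node.
    CREATE PROJECTION GEN_Gender_bcast AS
        SELECT * FROM GEN_Gender
        UNSEGMENTED ALL NODES;

    -- Segment: co-locate an anchor and its attribute history on the
    -- same node by hashing the identity column.
    CREATE PROJECTION AC_Actor_seg AS
        SELECT * FROM AC_Actor
        SEGMENTED BY HASH(AC_ID) ALL NODES;

    CREATE PROJECTION AC_NAM_Actor_Name_seg AS
        SELECT * FROM AC_NAM_Actor_Name
        SEGMENTED BY HASH(AC_ID) ALL NODES;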

Anchor modeling fits MPP extremely well, since the constructs are designed in such a way that the MPP architecture is utilized as efficiently as possible. A knotted attribute or tie can be resolved locally on a node, a join over a tie cannot be done more efficiently, and the assembly of the results from different nodes is trivial.

Incidentally, if you search for issues with Vertica, you will find that implementations may suffer from having to create lots of projections or broadcasts of wide tables in order to support ad-hoc querying. This is a natural result of using less normalized models. Anchor modeling, on the contrary, works “out of the box” for ad-hoc querying. Furthermore, what should be projected and broadcast is known beforehand, and is not tuning work a DBA would have to worry about on a daily basis.

Since Vertica does not have table-valued functions or stored procedures, only create table statements are generated so far. We are still working on providing some form of latest views.
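Such a latest view can also be written by hand in plain Vertica SQL. A minimal sketch, assuming the hypothetical anchor and attribute tables named above, could be:

    -- Pair each anchor instance with its most recent attribute value.
    CREATE VIEW lAC_Actor AS
    SELECT
        a.AC_ID,
        n.AC_NAM_Actor_Name,
        n.AC_NAM_ChangedAt
    FROM AC_Actor a
    LEFT JOIN (
        SELECT
            AC_ID,
            AC_NAM_Actor_Name,
            AC_NAM_ChangedAt,
            ROW_NUMBER() OVER (
                PARTITION BY AC_ID
                ORDER BY AC_NAM_ChangedAt DESC
            ) AS recency
        FROM AC_NAM_Actor_Name
    ) n
    ON n.AC_ID = a.AC_ID AND n.recency = 1;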

Unitemporal PostgreSQL support in test

Thanks to the tremendous efforts of Elari Saal, we are ready to test unitemporal PostgreSQL support. It is available in the test version of the online modeling tool. Please report any bugs you find on the GitHub page for the PostgreSQL branch, leave comments on this post, or use the forums.

The test version of the tool is now hosted on GitHub, which gives us much better control over deployment. Please update any old links you may have to the previous host (anchor.co.nf).

Anchor Modeling Certification Course

We are proud to announce the second public Anchor Modeling Certification Course, to be held in Stockholm on June 24th to 26th. Please see the details below for how to sign up for the course and the included certification. The course is held in central Stockholm, and we can help recommend accommodation for foreign participants.

Instructor: Lars Rönnbäck

Duration: 2 days of lectures and workshops followed by ½ day of assisted examination.

Cost: €2500 + VAT

Location: Stockholm, Sweden, hosted by UpToChange (www.uptochange.com).

Date: June 24th to 26th, 2015. Please register before the 10th of June.

Contact and registration: sales@uptochange.com

Target Audience: Data Modelers, Enterprise Architects, Data Warehouse Architects, Business Intelligence Architects. A prerequisite for the class is to have a general familiarity with and understanding of database modeling concepts.

Certification Validity: Your certification will get a version number, specifying which version of the certification you took. A certification of a certain version is valid for life. If and when enough things have changed in Anchor Modeling to warrant a new version of the certification, the version number will be increased, and you can decide whether or not you want to recertify for the new version.

Course Description: The course will cover both uni-temporal (the original) and concurrent-reliance-temporal Anchor Modeling from theoretical and practical points of view, with students building their own models using our modeling tool as exercises. After the course and certification you will be equipped with the knowledge necessary to architect your own solutions using Anchor Modeling.

The course will introduce bitemporal concepts used to handle versioning and corrections of information, while retaining a complete history of such changes. The reliability of information, and how it can be used to solve otherwise difficult modeling challenges, is discussed and introduced in one of the exercises. Storing concurrent and opposing views of the same information, but with different origins, is also handled. The course will also demonstrate how “time traveling” is possible through the use of parametrized views, making it possible to ask complex temporal questions through simple SQL, and how these queries gain excellent performance thanks to modern query optimization techniques. Other advanced topics in Anchor Modeling, such as evolving models over time, design patterns, refactoring, and management of unknown values, are also handled in the course.
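To illustrate the idea of time traveling with a parametrized view (the name below is hypothetical, since the generated names follow your model):

    -- Point-in-time perspective: the information as it looked on a given date.
    SELECT *
    FROM pAC_Actor('2015-06-24')
    WHERE AC_NAM_Actor_Name = 'Arthur';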

Background: Anchor Modeling is an agile modeling technique particularly suited for data environments that change over time. It has been developed in a collaboration between Swedish industry and the Department of Computer Science at Stockholm University. With ten years since the first implementations, along with awarded Open Source tools and scientific papers, the technique has gained worldwide momentum. Anchor Modeling can be implemented in traditional relational databases and is based on current research in entity relationship modeling, database normalization, and temporal databases.

About the instructor: Lars Rönnbäck is one of the authors of the Anchor Modeling technique. He works as a senior consultant at UpToChange and is a research affiliate with Stockholm University. He has been working with some of the largest companies in the Swedish insurance and retail businesses, is an avid presenter at conferences, and has more than ten years of experience in the field of Business Intelligence.

ER 2015 Conference

We will be submitting a paper to the 34th International Conference on Conceptual Modeling (ER 2015), October 19-22, in Stockholm, Sweden. We will also submit a request for a demo station where we will be showing the online modeling tool to participants. This is one of the most respected conferences on data modeling within the academic world. The father of the Entity-Relationship model, Peter Chen, has been involved in the conference from the start. The first conference was held back in 1979, and it has been around the world several times since then. For example, we won the best paper award for Anchor modeling in 2009, when the conference was held in Brazil.

Don’t miss this opportunity to participate! It’s not too late to submit interesting topics for papers, tutorials, demos, and more, or just be a visitor to gain insight into the very latest research. We hope to see you there!

DMZ Europe 2015

On September 28-29, 2015, we will be in Hamburg, Germany to talk about Anchor modeling at Data Modeling Zone. There will be an intro to Anchor modeling aimed at those who do not know it yet, but since it will be based on the very latest research we have done, we would also recommend it to those who are already familiar with Anchor. For those who want a bit more depth, there will also be an advanced half-day workshop, where we look into a case that concurrent-reliance-temporal modeling solves beautifully. Finally, we will also be taking part in a panel discussion about the future of data modeling! We are really excited to be on the road again.

There is also some exciting news we would like to share with you at the conference, such as a 40TB implementation that replaced a NoSQL solution, with significantly better performance for both loading and querying. There are also a couple of Open Source efforts: one being PostgreSQL support for the tool, and the other a metadata-driven data warehouse ETL and automation framework for loading Anchor models. Finally, there is a scientific paper on its way that dives into the theory behind concurrent-reliance-temporal Anchor modeling.

If you have specific questions, there’s always the chance to grab us between or after sessions. Let us know in advance, though, if possible!

Oracle support in test

The test version of the Anchor Modeler has been updated with preliminary Oracle support. The Oracle generation was kindly donated to us by the Swedish bank SBAB. Many thanks to Mikael Backlund (SBAB) and Jonas Åhgren (Xanthe IT) for the work they put into the Oracle generation! It is not a full implementation of uni-temporal yet, so if someone wants to step in and continue the work, please let us know. It is also largely untested by us, so report back if you find something that is broken.

We have also refactored the Layout Engine, in order to make it self-contained and easier to break out and use in other projects. A new feature called equivalence is also being added to uni-temporal (with concurrent-reliance-temporal to follow). Please note that this is still experimental and not yet complete, and the exact implementation may change. Equivalence can be used to create, for example, multilingual or multi-tenant databases. Simply put, it covers use cases where you may have simultaneous and equivalent values in attributes and knots. We introduced the concept in the presentation “Time” back in 2011, but have not had any implementation of it until now.
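As a speculative sketch of what equivalence could look like for a multilingual knot (table and column names are our own hypothetical examples, and the actual generated code may differ):

    -- The knot itself holds only the identities of its values.
    CREATE TABLE GEN_Gender (
        GEN_ID tinyint not null,
        primary key (GEN_ID)
    );

    -- Each knot value may have one equivalent representation per
    -- equivalent identifier, here used as a language identifier.
    CREATE TABLE GEN_EQ_Gender (
        GEN_ID tinyint not null,
        GEN_EQ tinyint not null,          -- e.g. 1 = English, 2 = Swedish
        GEN_Gender varchar(42) not null,  -- 'Male' for 1, 'Man' for 2
        primary key (GEN_ID, GEN_EQ),
        foreign key (GEN_ID) references GEN_Gender (GEN_ID)
    );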

0.98 updated in test

The test version has been updated to 0.98. In this version CRT modeling has had a number of bug fixes applied, which also made it to stable today as version 0.97.4. One of the major features implemented so far in 0.98 is a switch from canvas drawing to native SVG manipulation. The canvas drawing, while very performant, proved to be too difficult to extend. As browsers have matured, SVG has almost caught up in performance, and with SVG, adding new graphical primitives is much easier.

As a result of this, event handling had to be rewritten, so quite a lot has changed in the interface, even if it looks deceptively similar to the canvas version. We would therefore appreciate it if you report any anomalies you may find in 0.98. Some rarely used features had to go, like the drop shadow effect. Others have been refined, like the graphical debugging of the force-directed graph layout algorithm.

This is just the first of many new features in 0.98. Stay tuned!

Version 0.97 RELEASED

Today we released version 0.97 as stable. This brings along a lot of changes and new features. Foremost among these is the switch to a new SQL code generation engine, based on our sisula template language, and of course the new concurrent-reliance-temporal modeling that makes use of it. This technique makes it possible to build a ‘perfect memory’, which is equivalent to having a time machine: you can travel to any moment in time (both forwards and backwards) and see what the information looked like then. Furthermore, you can see the information through the eyes of anyone, along with the reliability with which they stated a fact.

Since the whole SQL generation is new we would not recommend moving databases running 0.96 in production to 0.97 before having done thorough testing. The old version, 0.96, can still be reached here. With this we are also moving on to 0.98 in test. We have a number of exciting things planned for 0.98. Stay tuned!

The 500th revision

We are celebrating the 500th code revision (the first was in 2010), which is now available in the test version of the modeler. With this we are also doing a feature freeze for version 0.97, only fixing potential bugs, with the intent to move it to stable soon. After all, there are already a number of databases and data warehouses using that codebase in production.

Noteworthy features in 0.97 that have been added since our last post (rev 480), along with numerous small fixes and improvements:

Concurrent-reliance-temporal modeling
The CRT generation is feature complete and has been tested, but could use more testing before we can say that it is “bug free”. If you have the possibility, please help us out with this. CRT will make it possible to time travel over your information in three dimensions: over changing time (versions), over positing time (corrections), and over positors (origins). This will definitely be the next big thing in modeling, as everywhere we have shown it, people have realised how it solves problems they had either discarded as too difficult to solve or had found poor workarounds for. A rough sketch of the three dimensions follows below.
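As a rough sketch of the three dimensions on a single attribute table (our own hypothetical names and layout; the generated CRT schema differs in its details):

    -- Each statement records what changed, when it was posited,
    -- by whom, and with what reliability.
    CREATE TABLE AC_NAM_Actor_Name (
        AC_ID int not null,
        AC_NAM_Actor_Name varchar(42) not null,
        AC_NAM_ChangedAt datetime not null,       -- changing time (versions)
        AC_NAM_PositedAt datetime not null,       -- positing time (corrections)
        AC_NAM_Positor tinyint not null,          -- positor (origin)
        AC_NAM_Reliability decimal(3,2) not null, -- how sure the positor is
        primary key (AC_ID, AC_NAM_ChangedAt, AC_NAM_PositedAt, AC_NAM_Positor)
    );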

Business views
The views provide simplified naming and show only values and keys for the different perspectives: latest, current, point, and difference. This should make it easier for those who want to expose the Anchor model directly to end users, like analysts or report builders. With business views the need for marts is even smaller than before, and we expect that many will be able to get rid of the additional maintenance layer that marts constitute.

Checksums
Some of the larger data types in SQL Server come with limitations, such as not being able to take part in an index or to be compared to each other using the equals operator. In order to use such types, for example varchar(max), text, or geography, you can now tick a checkbox and an MD5 checksum will be generated and used instead in indexes, triggers, and comparisons.
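The idea can be sketched in SQL Server as a computed column (the table and column names below are hypothetical; note also that before SQL Server 2016, HASHBYTES is limited to 8000 bytes of input):

    -- Store an MD5 checksum alongside the large value and use it in
    -- indexes and equality comparisons instead of the value itself.
    ALTER TABLE AC_BIO_Actor_Biography ADD
        AC_BIO_Checksum AS CAST(
            HASHBYTES('MD5', AC_BIO_Actor_Biography) AS varbinary(16)
        ) PERSISTED;

    -- Equality can then be tested on the checksums:
    -- WHERE new.AC_BIO_Checksum = old.AC_BIO_Checksum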