Support

Anchor Modeling can be used with almost any relational database, even temporal ones, but four features significantly improve query performance: table elimination, clustered indexes, foreign keys, and statistics. Note that some vendors refer to the first two as ‘join removal’ and ‘index organized tables’. In order to get the best performance in Bitemporal Anchor Modeling a fifth feature, partitioning, is needed. It is not required, but when used it partitions recording time into that which is currently recorded and that which has been erased.

The Anchor Modeler, our online modeling tool, currently only supports the automatic generation of code targeted for Microsoft SQL Server. Support for other vendors is in the works.

Clustered indexes are less important when using solid state disks, but without them the total size of the database, including indexes, will be larger. Table elimination is less important if every query run against the database is crafted by hand rather than automatically generated, for example by an analytical or reporting tool. However, if you want to use the views and table valued functions providing temporal perspectives of the data, table elimination is crucial.

Supported Browsers

Please use a browser that supports the HTML5 canvas and its 2D drawing context if you want to use the online modeling tool. Here is a list of tested browsers.

Browser             Version   Performance   Functionality
Chrome              6 – 15    Fast          Full
Chrome              5         Fast          Almost full
Canary              17        Fast          Full, but expect some crashes
Firefox             8         Fast          Full
Firefox             4         Slow          Full
Firefox             3.6       Slow          Full, linefeed bug in generated SQL
Safari              5+        Good          Almost full, cannot load local files
Opera               11.5      Slow          Almost full
Opera               10.6      Good          Partial
Internet Explorer   9         Fast          Almost no functionality

Table Elimination Support

Anchor Modeling relies on a query optimization feature called table elimination to achieve high performance. Many modern database engines support this technique, as can be seen from the table below. Anchor Modeling can be used with all relational databases, but performance will be better in those that support table elimination.

Database Engine                           Support      Example
Microsoft SQL Server 2005                 full         SQL Server Example
Microsoft SQL Server 2008                 full         SQL Server Example
Oracle 10gR2 Express Edition*             partial      Oracle Example
Oracle 11gR1 Enterprise/Express Edition   full         Oracle Example
IBM DB2 v9.5                              full         IBM DB2 Example
PostgreSQL v8.4 beta                      full         PostgreSQL Example
Teradata v12**                            partial      Teradata Example
MySQL v5.0.70                             none         MySQL Example
MySQL v6.0.10 alpha                       none         MySQL Example
MariaDB v5.1                              full         MariaDB Example
Sybase                                    not tested

Partial support means that table elimination works for non-historized attributes, but not for historized ones. These results are based on the example scripts; it may still be possible to rewrite their SQL to get full support for table elimination. Should anyone manage such a rewrite, please let us know!

* As it turns out, our example for Oracle 10g does not work as intended. While you do get table elimination, statistics cannot be used to get an optimal plan. It seems that support will be limited in 10g and only full in 11g.

** We have gotten reports that Teradata v13 has full table elimination support when using subqueries in the select part of a statement.

Table Elimination Explained

Table elimination is a feature found in many modern query optimizers. Basically, what table elimination does is remove tables from the execution plan when it is unnecessary to include them. This can, of course, only happen under the right circumstances. As an example, let us look at the following query:

select
 A.colA
from
 tableA  A
left outer join
 tableB  B
on
 B.id = A.id;

When using A as the left table we ensure that the query will return at least as many rows as there are in that table. For rows where the join condition, B.id = A.id, is not met, any column selected from tableB would contain a NULL value.

However, the result set could actually contain more rows than what is found in tableA if there are duplicates of the column B.id in tableB. If A contains a row [1, “val1”] and B the rows [1, “other1a”] and [1, “other1b”], then two rows will match the join condition. The only way to know what the result will look like is to actually touch both tables during execution.

Instead, let’s say that tableB contains rows that make it possible to place a unique constraint on the column B.id, for example, as is often the case, a primary key. In this situation we know that we will get exactly as many rows as there are in tableA, since joining with tableB cannot introduce any duplicates. If, furthermore, as in the example query, we do not select any columns from tableB, touching that table during execution is unnecessary. We can remove the whole join operation from the execution plan, as sketched below.
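As a minimal sketch of these circumstances, using the hypothetical tables from the example (the column types are assumptions), the declared primary key on B.id is exactly what allows the optimizer to prove that the join is redundant:

create table tableA (
  id   int not null primary key,
  colA varchar(10) not null
);

create table tableB (
  id   int not null primary key, -- uniqueness is what enables elimination
  colB varchar(10) not null
);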

Let us look at a more advanced query, where some query optimizers fail, and hence only provide partial support for table elimination.

select
 A.colA
from
 tableA  A
left outer join
 tableC  C
on
 C.id = A.id
and
 C.fromDate = (
   select
     max(sub.fromDate)
   from
     tableC sub
   where
     sub.id = A.id
 );

In this example we have added another join condition, which ensures that we only pick the matching row from tableC having the latest fromDate. In this case tableC will contain duplicates of the column C.id, so in order to ensure uniqueness the primary key has to contain the fromDate column as well. In other words the primary key of tableC is (C.id, C.fromDate).

Furthermore, since the subselect ensures that we only pick the latest C.fromDate for a given C.id we know that at most one row will match the join condition. We will again have the situation where joining with tableC cannot affect the number of rows in the result set. Since we do not select any columns from tableC, the whole join operation can be eliminated from the execution plan.
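For reference, a sketch of how tableC could be declared (again with assumed column types); the composite primary key is what guarantees that the subselect matches at most one row:

create table tableC (
  id       int not null,
  fromDate date not null,
  colC     varchar(10) not null,
  primary key (id, fromDate) -- at most one row per id and fromDate
);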

Queries like these arise in two situations: either when you have a denormalized model consisting of a fact table with several related dimension tables, or when you have a highly normalized model where each attribute is stored in its own table. The example with the subselect is common whenever you store historized/versioned data.

Clustered Indexes

Anchor Modeling uses clustered indexes, also called index organized tables, to gain further performance. Many database engines support this type of indexing, which physically orders the data on the media according to the indexed columns. Since the table itself becomes the index, such indexes take up no extra space, contrary to regular indexes. In fact, regular indexes are very rarely needed, and they are not a crucial feature for Anchor Modeling, even though they are present in most database engines.
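As a sketch in Transact-SQL, using the naming convention from the examples further down (the actual code generated by the tool may differ), declaring the primary key as clustered makes the table itself the index:

CREATE TABLE AC_NAM_Actor_Name (
  AC_ID             int not null,
  AC_NAM_Actor_Name varchar(42) not null, -- assumed length
  AC_NAM_ChangedAt  datetime not null,
  PRIMARY KEY CLUSTERED (
    AC_ID ASC,
    AC_NAM_ChangedAt DESC -- the latest value is found first
  )
);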

Over time, or when large amounts of data are loaded at once, clustered indexes may become fragmented. In other words, the organization of the data has become less than optimal, and a defragmentation or recreation of the index is needed to solve the problem. Always check the degree of fragmentation after loading large amounts of data, as well as on a regular basis.
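In SQL Server, for example, the degree of fragmentation can be checked, and the indexes recreated, along these lines (using the attribute table sketched above):

-- check the fragmentation of all indexes on the table
SELECT
  index_id,
  avg_fragmentation_in_percent
FROM
  sys.dm_db_index_physical_stats(
    DB_ID(), OBJECT_ID('dbo.AC_NAM_Actor_Name'), NULL, NULL, 'LIMITED'
  );

-- recreate the indexes if fragmentation is high
ALTER INDEX ALL ON dbo.AC_NAM_Actor_Name REBUILD;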

Note that indexes slow down inserts, so in databases that support it, switching them off before loading large amounts of data and then switching them back on may speed up the loading process significantly, provided that you trust the loading process not to introduce rows that violate the index constraints.
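In SQL Server this is done by disabling and then rebuilding the indexes. Note that disabling a clustered index makes the whole table inaccessible, so this sketch assumes a regular (nonclustered) index with a hypothetical name:

ALTER INDEX IX_Example ON dbo.StagingTable DISABLE;
-- ... load the data here ...
ALTER INDEX IX_Example ON dbo.StagingTable REBUILD; -- switches it back on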

Foreign Keys

Having foreign keys declared is paramount for maintaining referential integrity in the database, but they also improve select query performance. There is a second type of table elimination that can be achieved only when foreign keys are declared. In Anchor Modeling, this type will eliminate anchors from queries in which there is at least one where condition on any of their attributes. Foreign keys also help the optimizer in many other ways.

As with indexes, performance when loading large amounts of data can be improved by temporarily deferring the foreign key checks until after the loading is done.
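In SQL Server, for example, the checks can be deferred per table (again using the attribute table sketched earlier). The WITH CHECK option re-validates the loaded rows and marks the constraints as trusted again, which the optimizer needs for foreign key based table elimination:

-- switch off foreign key checks before the load
ALTER TABLE dbo.AC_NAM_Actor_Name NOCHECK CONSTRAINT ALL;

-- ... load the data here ...

-- re-validate the loaded rows and mark the constraints as trusted
ALTER TABLE dbo.AC_NAM_Actor_Name WITH CHECK CHECK CONSTRAINT ALL;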

Statistics

In order to achieve near optimal execution plans, the query optimizer needs to use statistics about your stored data. This can be illustrated using an example. Below is the latest view for the Actor anchor and its attributes in Transact-SQL code:

CREATE VIEW lAC_Actor AS 
SELECT
  AC.AC_ID,
  AC_NAM.AC_NAM_Actor_Name,
  AC_NAM.AC_NAM_ChangedAt,
  GEN.GEN_ID, 
  GEN.GEN_Gender,
  PLV.PLV_ID,
  PLV.PLV_ProfessionalLevel,
  AC_PLV.AC_PLV_ChangedAt
FROM
  AC_Actor AC
LEFT JOIN
  AC_NAM_Actor_Name AC_NAM
ON
  AC_NAM.AC_ID = AC.AC_ID
AND
  AC_NAM.AC_NAM_ChangedAt = (
    SELECT
      max(sub.AC_NAM_ChangedAt)
    FROM
      AC_NAM_Actor_Name sub
    WHERE
      sub.AC_ID = AC.AC_ID
  )
LEFT JOIN
  AC_GEN_Actor_Gender AC_GEN
ON
  AC_GEN.AC_ID = AC.AC_ID
LEFT JOIN
  GEN_Gender GEN
ON
  GEN.GEN_ID = AC_GEN.GEN_ID
LEFT JOIN
  AC_PLV_Actor_ProfessionalLevel AC_PLV
ON
  AC_PLV.AC_ID = AC.AC_ID
AND
  AC_PLV.AC_PLV_ChangedAt = (
    SELECT
      max(sub.AC_PLV_ChangedAt)
    FROM
      AC_PLV_Actor_ProfessionalLevel sub
    WHERE
      sub.AC_ID = AC.AC_ID
  )
LEFT JOIN
  dbo.PLV_ProfessionalLevel PLV
ON
  PLV.PLV_ID = AC_PLV.PLV_ID;

Now let’s say that we want to answer how many women have reached the highest professional acting level. This can be done using the following query on the latest view:

SELECT
  count(*)
FROM
  lAC_Actor
WHERE
  GEN_Gender = 'Female'
AND
  PLV_ProfessionalLevel = 'Oscar';

There is no need for any information from the Name attribute, so thanks to table elimination it will not be touched during execution. Since we are only interested in two particular non-null values, ‘Female’ and ‘Oscar’, from the other attributes, the anchor can also be eliminated, as foreign keys have been declared. Furthermore, when conditions are set on the attributes like this, the outer joins become inner joins. After all these optimizations, there are two possibilities for what the remaining query may look like, or rather, execute like:

SELECT
  count(*)
FROM
  AC_GEN_Actor_Gender AC_GEN
JOIN
  AC_PLV_Actor_ProfessionalLevel AC_PLV
ON
  AC_GEN.AC_ID = AC_PLV.AC_ID
WHERE
  AC_GEN.GEN_ID = 1
AND
  AC_PLV.PLV_ID = 10;

In the query above, the gender attribute will first be scanned and all women picked out; then this intermediate result set will be joined with their professional levels, and finally all those that do not have the highest level are removed. The other choice is to start with the professional levels:

SELECT
  count(*)
FROM
  AC_PLV_Actor_ProfessionalLevel AC_PLV
JOIN
  AC_GEN_Actor_Gender AC_GEN
ON
  AC_PLV.AC_ID = AC_GEN.AC_ID
WHERE
  AC_PLV.PLV_ID = 10
AND
  AC_GEN.GEN_ID = 1;

In order to determine the optimal join order, that is, which of the two queries above to pick, the optimizer needs statistics. If men and women are evenly distributed, then starting with gender will only remove about half the rows for the first intermediate result set. If there are ten professional levels with most actors at medium levels, then starting with the professional level will remove most rows for the first intermediate result set. This is much more efficient, and the optimizer will strive to find the join order that produces the smallest intermediate result set first, continuing progressively through the joins with as few rows as possible.

When legacy data is migrated, large numbers of new rows will be created, skewing any already gathered statistics. It is therefore a good idea to update all statistics after loading a large amount of data, and on a regular basis if this is not done automatically by the database. Normally, experienced users will know more about the data than the database does, so it may also be a good idea to check the execution plans of queries from time to time to ensure that they are optimal.
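In SQL Server, for example, statistics can be updated for a single table or for the whole database (using a table from the example above; FULLSCAN trades loading time for accuracy):

-- update the statistics of one table, scanning all of its rows
UPDATE STATISTICS dbo.AC_PLV_Actor_ProfessionalLevel WITH FULLSCAN;

-- or refresh sampled statistics for every table in the database
EXEC sp_updatestats;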

20 thoughts on “Support”

  1. Also, believe it or not, Italians write in SQL. Translating SQL into Italian is a funny idea, but somewhat impractical.

    1. Thanks for pointing this out. The translations are automatically generated by Google Translate, but apparently it doesn’t do so well, so I have removed the functionality for now.

  2. I have noticed a huge number of logical reads and scans when deploying a simple schema (1 anchor, 3 static attributes, 1 historized attribute) in CRT mode: 76 entities and only 1 posit/annex per entity and attribute. I get about 3000 scans / 6000 reads per annex table, and a lot of CPU is burned on the top-1 sort. That does not look like high performance to me.

  3. Possible error in insert-trigger:

    In the instead-of-insert-trigger for an anchor (in our case [itCA_Caller]) a case-expression similar to the following occurs for every historized attribute:
    CASE
      WHEN [dbo].[rfCA_BRA_Caller_CallerBranch](
        v.CA_BRA_CA_ID,
        v.CA_BRA_Caller_CallerBranch,
        v.CA_BRA_ChangedAt
      ) = 1
      THEN 'R' -- restatement
      WHEN [BRA].CA_BRA_CA_ID is not null
      THEN 'D' -- duplicate
      ELSE 'N' -- new statement
    END
    The function [rfCA_BRA_Caller_CallerBranch] returns 1 if the last value before CA_BRA_ChangedAt and/or the first value after CA_BRA_ChangedAt is equal to the given CA_BRA_Caller_CallerBranch (for given CA_ID).

    The problem is that in some cases duplicates are marked as restatements (which results in a key-violation error when trying to insert).

    For example: if in our case the combination
    [CA_BRA_CA_ID,CA_BRA_Caller_CallerBranch,v.CA_BRA_ChangedAt] already exists (so it is in fact a duplicate) and there is also an adjacent record, with the same [CA_BRA_CA_ID,CA_BRA_Caller_CallerBranch] but different _Changed_At, then the function [rfCA_BRA_Caller_CallerBranch] returns 1 and because the SQL-Server case-expression returns the result of the FIRST when_expression that evaluates to true, the inserted record is marked as a restatement.

    Is this indeed the case, or am I missing something?

    Kind regards,
    Mark Versteegh


    I think an easy fix would be to move the duplicate-check before the restatement-check.

      1. Thanks for the quick response. I tried the test version of the tool, but now I get a duplicate key-error when trying to insert a changed value for a static attribute.

        In the release version these attributes were left unchanged (if an attribute value already exists for the same anchor ID then no insert takes place), but in the test version there is a left join on the anchor ID and the attribute value, which works fine for duplicates, but handles changed values as ‘new’.

        I would expect the value to be updated in this case, but I’m not 100% sure if this is the intended behavior.

        1. This should now be fixed in the test version. The intended behaviour is now to do nothing when using the triggers on the views, but you can still update a value by directly using the attribute table. A warning message will be shown, though, when you try to update static attributes.

  4. Hi,

    At this moment we need to clean some customer data from our anchor model setup. If we just do an update on the last status of a customer, the history is still there. Is there a quick way to remove all data for a given key?

    Thanks. Sjacco from M-Wise

    1. If you want to physically remove the data I would suggest doing deletes on the latest views. For example:

      declare @r int;
      set @r = 1;
      while (@r > 0)
      begin
        delete lAC_Actor
        where AC_ID in (
          select AC_ID from #AC_ID_toBeDeleted
        );
        set @r = @@ROWCOUNT;
      end

      This will delete both the attribute data and the anchor data. The loop is necessary in order to delete all history in historized attributes. If you also have related data that needs to be deleted you will have to roll your own “cascading delete”.

      create table #PE_ID_toBeDeleted (
        PE_ID int not null
      );

      delete AC_wasCast_PE_in
      output deleted.PE_ID_in
      into #PE_ID_toBeDeleted
      where AC_ID_wasCast in (
        select AC_ID from #AC_ID_toBeDeleted
      );

      After which you can use #PE_ID_toBeDeleted to do a similar looped delete on the lPE_Performance latest view. Of course, you may want to keep performances that involved other actors than the ones deleted. In other words, it is hard to fully automate deletes, since there may be underlying business logic governing them.

  5. I’m attempting to use AM to create 6NF + 3NF views in AWS Redshift. I’m using the PostgreSQL generation scheme, but after the model got a little complicated, the tool fails to generate SQL. Maybe I’m doing something incorrect, or maybe it is a limitation of the modeler? I’d like to have someone take a look at my model to see which of these is the problem.

  6. Hi,

    Your work is awesome! Thanks a lot for sharing it! I’m eager to try “Privacy actions” and “Encryption groups” (could they help with an ABAC implementation?)!

    I am analyzing an architecture on top of the Anchor Modeling for a big Swiss telecommunication company.

    The mnemonic concept is very useful ;) Is it possible to write the descriptions that we insert in the GUI into the _Anchor, _Attribute, _Knot and _Tie views? I only see them in the XML file:

    My address description

    Thanks a lot

    Waldimiro Rossi, Swiss Impact Ltd.

        1. You can consider the SQL Server support to already be production quality. I was hoping to get support for a couple more databases up to better quality before I announce 1.0.
