Support

Anchor Modeling can be used with almost any relational database, even temporal databases, but there are four features that significantly improve query performance: table elimination, clustered indexes, foreign keys and statistics. Note that some vendors refer to the first two as ‘join removal’ and ‘index organized tables’. In order to get the best performance in Bitemporal Anchor Modeling a fifth feature is necessary: partitioning. It is not required, but when used, it partitions recording time into what is currently recorded and what has been erased.

The Anchor Modeler, our online modeling tool, currently only supports the automatic generation of code targeted for Microsoft SQL Server. Support for other vendors is in the works.

Clustered indexes are less important when using solid state disks, but without them the total size of the database, including indexes, will be larger. Table elimination is less important if every query run against the database is crafted by hand rather than automatically generated, for example by an analytical or reporting tool. However, if you want to use the views and table valued functions providing temporal perspectives of the data, table elimination is crucial.

Supported Browsers

Please use a browser that supports the HTML5 canvas and its 2D drawing context if you want to use the online modeling tool. Here is a list of tested browsers.

Browser             Version   Performance   Functionality
Chrome              6 – 15    Fast          Full
Chrome              5         Fast          Almost full
Canary              17        Fast          Full, but expect some crashes
Firefox             8         Fast          Full
Firefox             4         Slow          Full
Firefox             3.6       Slow          Full, linefeed bug in generated SQL
Safari              5+        Good          Almost full, cannot load local files
Opera               11.5      Slow          Almost full
Opera               10.6      Good          Partial
Internet Explorer   9         Fast          Almost no functionality

Table Elimination Support

Anchor Modeling relies on a query optimization feature called table elimination to achieve high performance. Many modern database engines support this technique, which can be seen from the chart below. Anchor Modeling can be used with all relational databases but the performance will be better in those which support table elimination.

Database Engine                            Support      Example
Microsoft SQL Server 2005                  full         SQL Server Example
Microsoft SQL Server 2008                  full         SQL Server Example
Oracle 10gR2 Express Edition*              partial      Oracle Example
Oracle 11gR1 Enterprise/Express Edition    full         Oracle Example
IBM DB2 v9.5                               full         IBM DB2 Example
PostgreSQL v8.4 beta                       full         PostgreSQL Example
Teradata v12**                             partial      Teradata Example
MySQL v5.0.70                              none         MySQL Example
MySQL v6.0.10 alpha                        none         MySQL Example
MariaDB v5.1                               full         MariaDB Example
Sybase                                     not tested

Partial support means support for non-historized attributes, but not for historized ones. These results are based on the example scripts; it may still be possible to rewrite the SQL to get full support for table elimination. Should anyone be able to do such a rewrite, please let us know!

* As it turns out, our example for Oracle 10g does not work as intended. While you do get table elimination, statistics cannot be used to get an optimal plan. It seems that support will be limited in 10g and only full in 11g.

** We have received reports that Teradata v13 has full table elimination support when using subqueries in the select part of a statement.

Table Elimination Explained

Table elimination is a feature found in many modern query optimizers. Basically, what table elimination does is remove tables from the execution plan when it is unnecessary to include them. This can, of course, only happen if the right circumstances arise. Let us for example look at the following query:

select
 A.colA
from
 tableA  A
left outer join
 tableB  B
on
 B.id = A.id;

When using A as the left table we ensure that the query will return at least as many rows as there are in that table. For rows where the join condition, B.id = A.id, is not met the selected column, A.colA, will contain a NULL value.

However, the result set could actually contain more rows than what is found in tableA if there are duplicates of the column B.id in tableB. If A contains a row [1, "val1"] and B the rows [1, "other1a"],[1, "other1b"] then two rows will match in the join condition. The only way to know what the result will look like is to actually touch both tables during execution.

Instead, let’s say that tableB contains rows that make it possible to place a unique constraint on the column B.id, for example, as is often the case, a primary key. In this situation we know that we will get exactly as many rows as there are in tableA, since joining with tableB cannot introduce any duplicates. If, furthermore, as in the example query, we do not select any columns from tableB, touching that table during execution is unnecessary. We can remove the whole join operation from the execution plan.
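
This can be sketched with hypothetical definitions of the two tables; the column names id and colA come from the query above, while the types and the column other are assumptions for illustration. The primary key on tableB is what licenses the elimination:

```sql
-- Hypothetical table definitions; types are assumed for illustration.
CREATE TABLE tableA (
  id    int          NOT NULL,
  colA  varchar(42)  NULL
);

CREATE TABLE tableB (
  id     int          NOT NULL PRIMARY KEY,  -- uniqueness makes elimination possible
  other  varchar(42)  NULL
);
```

With this primary key in place, an optimizer supporting table elimination can execute the earlier query as a simple scan of tableA.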

Let us look at a more advanced query, where some query optimizers fail, and hence only provide partial support for table elimination.

select
 A.colA
from
 tableA  A
left outer join
 tableC  C
on
 C.id = A.id
and
 C.fromDate = (
   select
     max(sub.fromDate)
   from
     tableC sub
   where
     sub.id = A.id
 );

In this example we have added another join condition, which ensures that we only pick the matching row from tableC having the latest fromDate. In this case tableC will contain duplicates of the column C.id, so in order to ensure uniqueness the primary key has to contain the fromDate column as well. In other words the primary key of tableC is (C.id, C.fromDate).
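
As a sketch, the definition of tableC could look as follows; the types and the column colC are assumptions for illustration, but the composite primary key is the point:

```sql
-- Hypothetical definition of tableC; types are assumed for illustration.
CREATE TABLE tableC (
  id        int          NOT NULL,
  fromDate  date         NOT NULL,
  colC      varchar(42)  NULL,
  PRIMARY KEY (id, fromDate)  -- id alone is not unique, so fromDate is included
);
```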

Furthermore, since the subselect ensures that we only pick the latest C.fromDate for a given C.id we know that at most one row will match the join condition. We will again have the situation where joining with tableC cannot affect the number of rows in the result set. Since we do not select any columns from tableC, the whole join operation can be eliminated from the execution plan.

Queries like these arise in two situations. Either when you have a denormalized model consisting of a fact table with several related dimension tables, or when you have a highly normalized model where each attribute is stored in its own table. The example with the subselect is common whenever you store historized/versioned data.

Clustered Indexes

Anchor Modeling uses clustered indexes, also called index organized tables, to gain further performance. Many database engines support this type of indexing, which physically orders the data on the media according to the indexed columns. In other words, such indexes take up no extra space, contrary to regular indexes. In fact, regular indexes are very rarely needed, and they are not a crucial feature for Anchor Modeling, even though they are present in most database engines.

Over time, or when large amounts of data are loaded at once, clustered indexes may become fragmented. In other words, the organization of the data becomes less than optimal, and a defragmentation or recreation of the index is needed in order to solve the problem. Always check the degree of fragmentation after loading large amounts of data, as well as on a regular basis.
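
In SQL Server, the current target of the Anchor Modeler, such a check and the subsequent defragmentation could look roughly as follows; the table name is taken from the example model later in this section:

```sql
-- Check the fragmentation of all indexes on a table.
SELECT
  i.name,
  s.avg_fragmentation_in_percent
FROM
  sys.dm_db_index_physical_stats(
    DB_ID(), OBJECT_ID('AC_Actor'), NULL, NULL, 'LIMITED'
  ) s
JOIN
  sys.indexes i
ON
  i.object_id = s.object_id
AND
  i.index_id = s.index_id;

-- Reorganize for light fragmentation, rebuild for heavy fragmentation.
ALTER INDEX ALL ON AC_Actor REORGANIZE;
ALTER INDEX ALL ON AC_Actor REBUILD;
```

A common rule of thumb is to reorganize below roughly 30 percent fragmentation and rebuild above it, but check the documentation of your particular database engine.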

Note that indexes slow down inserts, so in databases that support it, switching them off before loading large amounts of data and then switching them back on may speed up the loading process significantly, provided that you trust the loading process not to introduce rows that break the index constraint.
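
In SQL Server this could, for example, be done by disabling nonclustered indexes before the load and rebuilding them afterwards; the index name below is hypothetical. Note that disabling a clustered index makes its table inaccessible, so it must be left enabled:

```sql
-- IX_Example is a hypothetical nonclustered index.
ALTER INDEX IX_Example ON AC_NAM_Actor_Name DISABLE;

-- ... load large amounts of data here ...

-- Rebuilding re-enables the index.
ALTER INDEX IX_Example ON AC_NAM_Actor_Name REBUILD;
```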

Foreign Keys

Having foreign keys declared is paramount for maintaining referential integrity in the database, but they also improve select query performance. There is a second type of table elimination that can be achieved only when foreign keys are declared. In Anchor Modeling, this type will eliminate anchors from queries in which there is at least one where condition on any of their attributes. Foreign keys also help the optimizer in many other ways.

As with indexes, performance when loading large amounts of data can be improved by temporarily deferring the foreign key checks until after the loading is done.
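
In SQL Server, for example, foreign key checking can be switched off and on per table as sketched below. Be sure to re-enable with WITH CHECK; otherwise the constraints are left untrusted and the optimizer will not use them for table elimination:

```sql
-- Disable foreign key checks on a table during a bulk load.
ALTER TABLE AC_NAM_Actor_Name NOCHECK CONSTRAINT ALL;

-- ... load large amounts of data here ...

-- Re-enable and revalidate, so the constraints become trusted again.
ALTER TABLE AC_NAM_Actor_Name WITH CHECK CHECK CONSTRAINT ALL;
```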

Statistics

In order to achieve near optimal execution plans, the query optimizer needs to use statistics of your stored data. This can be illustrated using an example. Below is the latest view for the Actor anchor and attributes in Transact-SQL code:

CREATE VIEW lAC_Actor AS 
SELECT
  AC.AC_ID,
  AC_NAM.AC_NAM_Actor_Name,
  AC_NAM.AC_NAM_ChangedAt,
  GEN.GEN_ID, 
  GEN.GEN_Gender,
  PLV.PLV_ID,
  PLV.PLV_ProfessionalLevel,
  AC_PLV.AC_PLV_ChangedAt
FROM
  AC_Actor AC
LEFT JOIN
  AC_NAM_Actor_Name AC_NAM
ON
  AC_NAM.AC_ID = AC.AC_ID
AND
  AC_NAM.AC_NAM_ChangedAt = (
    SELECT
      max(sub.AC_NAM_ChangedAt)
    FROM
      AC_NAM_Actor_Name sub
    WHERE
      sub.AC_ID = AC.AC_ID
  )
LEFT JOIN
  AC_GEN_Actor_Gender AC_GEN
ON
  AC_GEN.AC_ID = AC.AC_ID
LEFT JOIN
  GEN_Gender GEN
ON
  GEN.GEN_ID = AC_GEN.GEN_ID
LEFT JOIN
  AC_PLV_Actor_ProfessionalLevel AC_PLV
ON
  AC_PLV.AC_ID = AC.AC_ID
AND
  AC_PLV.AC_PLV_ChangedAt = (
    SELECT
      max(sub.AC_PLV_ChangedAt)
    FROM
      AC_PLV_Actor_ProfessionalLevel sub
    WHERE
      sub.AC_ID = AC.AC_ID
  )
LEFT JOIN
  dbo.PLV_ProfessionalLevel PLV
ON
  PLV.PLV_ID = AC_PLV.PLV_ID;

Now let’s say that we want to answer how many women have reached the highest professional acting level. This can be done using the following query on the latest view:

SELECT
  count(*)
FROM
  lAC_Actor
WHERE
  GEN_Gender = 'Female'
AND
  PLV_ProfessionalLevel = 'Oscar';

There is no need for any information from the Name attribute, so it will not be touched during execution, thanks to table elimination. Since we are only interested in two particular non-null values, ‘Female’ and ‘Oscar’, from the other attributes, the anchor can also be eliminated, as foreign keys have been declared. Furthermore, when conditions are set on the attributes like this, the outer joins become inner joins. After all these optimizations, there are two possibilities for what the remaining query may look like, or rather, execute like:

SELECT
  count(*)
FROM
  AC_GEN_Actor_Gender AC_GEN
JOIN
  AC_PLV_Actor_ProfessionalLevel AC_PLV
ON
  AC_GEN.AC_ID = AC_PLV.AC_ID
WHERE
  AC_GEN.GEN_ID = 1
AND
  AC_PLV.PLV_ID = 10;

In the query above the gender attribute will first be scanned and all women picked out, then this intermediate result set will be joined with their professional levels, and finally all those that do not have the highest level removed. The other choice is to start with the professional levels:

SELECT
  count(*)
FROM
  AC_PLV_Actor_ProfessionalLevel AC_PLV
JOIN
  AC_GEN_Actor_Gender AC_GEN
ON
  AC_PLV.AC_ID = AC_GEN.AC_ID
WHERE
  AC_PLV.PLV_ID = 10
AND
  AC_GEN.GEN_ID = 1;

In order to determine the optimal join order, that is, which of the two queries above to pick, the optimizer needs statistics. If men and women are evenly distributed, then starting with gender will remove about half the rows for the first intermediate result set. If there are ten professional levels with most actors at medium levels, then starting with the professional level will remove most rows for the first intermediate result set. This is much more efficient, and the optimizer will strive to find the join order that produces the smallest intermediate result set first, continuing progressively through the joins with as few rows as possible.

When legacy data is migrated, large numbers of new rows will be created, skewing any already gathered statistics. Therefore it is a good idea to update all statistics after loading a large amount of data, as well as on a regular basis, if this is not done automatically by the database. Experienced users will often know more about the data than the database, so it may also be a good idea to check the execution plans of queries from time to time to ensure that they are optimal.
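
In SQL Server, for example, statistics can be refreshed per table or for the whole database as sketched below:

```sql
-- Refresh the statistics of a single table, scanning all of its rows.
UPDATE STATISTICS AC_PLV_Actor_ProfessionalLevel WITH FULLSCAN;

-- Or refresh the statistics of every table in the database.
EXEC sp_updatestats;
```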
