Anchor Modeling can be used with almost any relational database, even temporal databases, but there are four features that significantly improve query performance. These are table elimination, clustered indexes, foreign keys and statistics. Note that some vendors refer to the first two as ‘join removal’ and ’index organized tables’. In order to get the best performance in Bitemporal Anchor Modeling a fifth feature is necessary: partitioning. It is not required, but when used partitions recording time over that which is currently recorded and that which has been erased.
The Anchor Modeler, our online modeling tool, currently only supports the automatic generation of code targeted for Microsoft SQL Server. Support for other vendors is in the works.
Clustered indexes are less important when using solid state disks, but if not present the total size of the database including indexes will be larger. Table elimination is less important if every query run against the database is crafted by hand and not automatically generated, for example from an analytical or reporting tool. However, if you want to use the views and table valued functions providing temporal perspectives of the data, table elimination is crucial.
Please use a browsers that supports HTML5 canvas and 2D drawing context if you want to use the online modeling tool. Here is a list of tested browsers.
|Chrome||6 – 15||Fast||Full|
|Canary||17||Fast||Full, but expect some crashes|
|Firefox||3.6||Slow||Full, linefeed bug in generated SQL|
|Safari||5+||Good||Almost full, cannot load local files|
|Internet Explorer||9||Fast||Almost no functionality|
Table Elimination Support
Anchor Modeling relies on a query optimization feature called table elimination to achieve high performance. Many modern database engines support this technique, which can be seen from the chart below. Anchor Modeling can be used with all relational databases but the performance will be better in those which support table elimination.
|Microsoft SQL Server 2005||full||SQL Server Example|
|Microsoft SQL Server 2008||full||SQL Server Example|
|Oracle 10gR2 Express Edition*||partial||Oracle Example|
|Oracle 11gR1 Enterprise/Express Edition||full||Oracle Example|
|IBM DB2 v9.5||full||IBM DB2 Example|
|PostgreSQL v8.4 beta||full||PostgreSQL Example|
|Teradata v12**||partial||Teradata Example|
|MySQL v5.0.70||none||MySQL Example|
|MySQL v6.0.10 alpha||none||MySQL Example|
|MariaDB v5.1||full||MariaDB Example|
Partial support includes support for non-historized attributes, but not for historized. These results are based on the example scripts, for which it still may be possible to rewrite the SQL to get full support for table elimination. Should anyone be able to do such a rewrite, please let us know!
* As it turns out, our example for Oracle 10g does not work as intended. While you do get table elimination, statistics cannot be used to get an optimal plan. It seems that support will be limited in 10g and only full in 11g.
** We have gotten reports that Teradata v13 has full table elimination support when using subqueries in the select part of a statement.
Table Elimination Explained
Table elimination is a feature found in many modern query optimizers. Basically, what table elimination does, is to remove tables from the execution plan when it is unneccessary to include them. This can, of course, only happen if the right circumstances arise. Let us for example look at the following query:
select A.colA from tableA A left outer join tableB B on B.id = A.id;
When using A as the left table we ensure that the query will return at least as many rows as there are in that table. For rows where the join condition, B.id = A.id, is not met the selected column, A.colA, will contain a NULL value.
However, the result set could actually contain more rows than what is found in tableA if there are duplicates of the column B.id in tableB. If A contains a row [1, "val1"] and B the rows [1, "other1a"],[1, "other1b"] then two rows will match in the join condition. The only way to know what the result will look like is to actually touch both tables during execution.
Instead, let’s say that tableB contains rows that make it possible to place a unique constraint on the column B.id, for example and often the case a primary key. In this situation we know that we will get exactly as many rows as there are in tableA, since joining with tableB cannot introduce any duplicates. If further, as in the example query, we do not select any columns from tableB, touching that table during execution is unneccessary. We can remove the whole join operation from the execution plan.
Let us look at a more advanced query, where some query optimizers fail, and hence only provide partial support for table elimination.
select A.colA from tableA A left outer join tableC C on C.id = A.id and C.fromDate = ( select max(sub.fromDate) from tableC sub where sub.id = A.id );
In this example we have added another join condition, which ensures that we only pick the matching row from tableC having the latest fromDate. In this case tableC will contain duplicates of the column C.id, so in order to ensure uniqueness the primary key has to contain the fromDate column as well. In other words the primary key of tableC is (C.id, C.fromDate).
Furthermore, since the subselect ensures that we only pick the latest C.fromDate for a given C.id we know that at most one row will match the join condition. We will again have the situation where joining with tableC cannot affect the number of rows in the result set. Since we do not select any columns from tableC, the whole join operation can be eliminated from the execution plan.
Queries like these arise in two situations. Either when you have a denormalized model consisting of a fact table with several related dimension tables, or when you have a highly normalized model where each attribute is stored in its own table. The example with the subselect is common whenever you store historized/versioned data.
Anchor Modeling use clustered indexes, also called index organized tables, to further gain performance. Many database engines support this type of indexing, which physically order the data on the media according to the indexed columns. In other words, such indexes take up no extra space, contrary to regular indexes. In fact, regular indexes are very rarely needed and they are not a crucial feature for Anchor Modeling, even if present in most database engines.
Over time, or when large amounts of data is loaded at once, clustered indexes may become fragmented. In other words, the organization of the data has become less than optimal and a defragmentation or recreation of the index is needed in order to solve the problem. Always check the degree of fragmentation after loading large amounts of data and on a regular basis.
Note that indexes slow down inserts, so in databases that support it, switching them off before loading large amounts of data and then switching them back on may speed up the loading process significantly, provided that you trust the loading process not to introduce rows that break the index constraint.
Having foreign keys declared are paramount for maintaining referential integrity in the database, but they also improve select query performance. There is a second type of table elimination that can be achieved only when foreign keys are declared. In Anchor Modeling, this type will eliminate anchors from queries in which there are at least one where condition on any of its attributes. They also help the optimizer in many other ways.
As with indexes, performance when loading large amounts of data can be improved by temporarily deferring the foreign key checks until after the loading is done.
In order to achieve near optimal execution plans, the query optimizer needs to use statistics of your stored data. This can be illustrated using an example. Below is the latest view for the Actor anchor and attributes in Transact-SQL code:
CREATE VIEW lAC_Actor AS SELECT AC.AC_ID, AC_NAM.AC_NAM_Actor_Name, AC_NAM.AC_NAM_ChangedAt, GEN.GEN_ID, GEN.GEN_Gender, PLV.PLV_ID, PLV.PLV_ProfessionalLevel, AC_PLV.AC_PLV_ChangedAt FROM AC_Actor AC LEFT JOIN AC_NAM_Actor_Name AC_NAM ON AC_NAM.AC_ID = AC.AC_ID AND AC_NAM.AC_NAM_ChangedAt = ( SELECT max(sub.AC_NAM_ChangedAt) FROM AC_NAM_Actor_Name sub WHERE sub.AC_ID = AC.AC_ID ) LEFT JOIN AC_GEN_Actor_Gender AC_GEN ON AC_GEN.AC_ID = AC.AC_ID LEFT JOIN GEN_Gender GEN ON GEN.GEN_ID = AC_GEN.GEN_ID LEFT JOIN AC_PLV_Actor_ProfessionalLevel AC_PLV ON AC_PLV.AC_ID = AC.AC_ID AND AC_PLV.AC_PLV_ChangedAt = ( SELECT max(sub.AC_PLV_ChangedAt) FROM AC_PLV_Actor_ProfessionalLevel sub WHERE sub.AC_ID = AC.AC_ID ) LEFT JOIN dbo.PLV_ProfessionalLevel PLV ON PLV.PLV_ID = AC_PLV.PLV_ID;
Now let’s say that we want to answer how many women have reached the highest professional acting level. This can be done using the following query on the latest view:
SELECT count(*) FROM lAC_Actor WHERE GEN_Gender = 'Female' AND PLV_ProfessionalLevel = 'Oscar';
There is no need for any information from the Name attribute, so it will not be touched during execution, thanks to table elimination. Since we are only interested in two particular non-null values: ‘Female’ and ‘Oscar’ from the other attributes, the anchor can also be eliminated, as foreign keys have been declared. Furthermore, when conditions are set on the attributes like this, the outer joins become inner joins. After all these optimizations, there are two possibilities for what the remaining query may look like, or rather, execute like:
SELECT count(*) FROM AC_GEN_Actor_Gender AC_GEN JOIN AC_PLV_Actor_ProfessionalLevel AC_PLV ON AC_GEN.AC_ID = AC_PLV.AC_ID WHERE AC_GEN.GEN_ID = 1 AND AC_PLV.PLV_ID = 10;
In the query above the gender attribute will first be scanned and all women picked out, then this intermediate result set will be joined with their professional levels, and finally all those that do not have the highest level removed. The other choice is to start with the professional levels:
SELECT count(*) FROM AC_PLV_Actor_ProfessionalLevel AC_PLV JOIN AC_GEN_Actor_Gender AC_GEN ON AC_PLV.AC_ID = AC_GEN.AC_ID WHERE AC_PLV.PLV_ID = 10 AND AC_GEN.GEN_ID = 1;
In order to determine the optimal join order, which of the two queries above to pick, the optimizer needs statistics. If men and women are evenly distributed, then starting with gender will remove about half the rows for the first intermediate result set. If there are ten professional levels with most actors at medium levels, then starting with the professional level will remove most rows for the first intermediate result set. This is much more efficient, and the optimizer will strive to find the join order that produces the smallest intermediate result set first continuing progressively through the joins with as few rows as possible.
When legacy data is migrated, large numbers of new rows will be created, skewing any already gathered statistics. Therefore it is a good idea to update all statistics after loading a large amount of data and on a regular basis, if not done automatically by the database. Normally, experienced users will always know more about the data than the database, so it may also be a good idea to check the executions plans of queries from time to time to ensure that they are optimal.