Data is Dimensional

Dimensional Modelling for Advanced Data Analytics and Cloud Solutions

Eliminate Data Latency in Analytics

Built-In Delay

Since the beginning of Data Warehousing there have been delays inherent to the process that stretch the meaning of “near real-time” into marketing puffery.

Both Inmon’s and Linstedt’s Architectures intentionally forbid any access to newly received data until it has first been “published” into an Analytic repository. Even Dimensional repositories relied on scheduled batch updates to handle the Transform & Load part. Then the mini-batch concept of frequent small updates reduced delays to “minutes.” All was calm, for a while.

Demand for analytics grew, but growth often required significant investments in capacity that many companies simply could not afford. Data centers are not cheap. Then along came the Internet, the multiplexed fiber-optic lines that exploded data transmission capacity in the late ’90s, and big, cheap computers. With orders-of-magnitude increases in throughput, big data centers “in the cloud” became cheap and practical.

In the cloud, anybody can have access to massive amounts of compute power and storage on a pay-as-you-go basis. Once security concerns were addressed, many organizations moved their analytic capabilities to Cloud Services. These systems can receive transaction information the moment the business system accepts it and make it available for real-time analytics. Businesses demanded faster access to information; now they can achieve it.

Time for a Surrogate

The prime cause of the delay is the use of Surrogate Keys. A simple surrogate key provides massive improvements in query response and flexibility, but in the past it required a lookup to cross-reference the Natural Key with the Surrogate Key. That lookup was slow and required the Dimension to be updated first.

Cut the cross-reference, cut the delay.

Why Surrogates

Natural keys can be complex. They may involve multiple columns, making them prone to error: omit a column in the join and you get a run-away query. They are also inefficient, since comparing two strings takes significantly more work than comparing two binary integers. A surrogate is simple and easy to understand:

  • The key is a single column
  • The key is always a BIGINT
  • It is always an equi-join from Fact to Dimension (foreign key = primary key)

The entire data set can be encapsulated as a view, eliminating any need for users to code joins.  A query against such a view performs the same as a bespoke query written traditionally, and the view avoids the refresh lag of a materialized view.
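
As a minimal sketch of such a view (the fact and dimension names here are hypothetical, and the syntax is generic warehouse SQL):

  -- Encapsulate the star join once so analysts never have to code it themselves.
  CREATE VIEW v_sales AS
  SELECT d.calendar_date,
         c.customer_name,
         c.customer_segment,
         f.quantity,
         f.sales_amount
  FROM   fact_sales   AS f
  JOIN   dim_date     AS d ON f.date_key     = d.date_key      -- BIGINT = BIGINT equi-join
  JOIN   dim_customer AS c ON f.customer_key = c.customer_key; -- BIGINT = BIGINT equi-join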

The difference between comparing two VARCHAR strings and comparing two BIGINT values may not seem like much. On your own computer, you might never notice it. But cloud services charge for compute cycles, and the CPU burns far more of them comparing strings than comparing two BIGINT values. A switch to a surrogate could reduce the compute cost of a query by 20% or more.

Views are also handy for hierarchies.  Users occasionally make mistakes joining through the hierarchy bridge table.  Separate views for the ‘drill up’ and ‘drill down’ directions eliminate the confusion.
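
As a sketch of that idea, assume a hypothetical bridge table dim_org_bridge(ancestor_key, descendant_key, levels_removed) between an organization dimension and the sales fact; the two views below fix the join direction so users cannot get it backwards:

  -- Drill down: pick a parent organization and see the facts of everything beneath it.
  CREATE VIEW v_sales_drill_down AS
  SELECT p.org_name       AS parent_org,
         b.levels_removed,
         f.sales_amount
  FROM   dim_org        AS p
  JOIN   dim_org_bridge AS b ON b.ancestor_key = p.org_key
  JOIN   fact_sales     AS f ON f.org_key      = b.descendant_key;

  -- Drill up: start from the organization that owns the fact and roll up to its ancestors.
  CREATE VIEW v_sales_drill_up AS
  SELECT a.org_name       AS ancestor_org,
         f.sales_amount
  FROM   fact_sales     AS f
  JOIN   dim_org_bridge AS b ON b.descendant_key = f.org_key
  JOIN   dim_org        AS a ON a.org_key        = b.ancestor_key;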

Big Data Solution

This key generation delay was addressed head-on by Big Data engineers by using a Hash Function to convert a Natural Key to a simple integer. Done.

Ok. Is it? Professionals I trust assured me that when using a BIGINT (64 bit) hash, collisions are not a problem. I was skeptical.

The technique of using a 64-bit hash to generate Dimension Primary Keys allows you to perform all Surrogate Foreign Key assignments immediately, as each new row is written to the fact table. Essentially, a pipeline constantly fills the repository; data is loaded and available the moment it arrives.

In the old process, you had to wait until the Dimensions were resolved so that the cross-reference was correct.
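
A sketch of what that pipeline step can look like, assuming a hypothetical HASH64() function that takes a VARCHAR and returns a BIGINT (substitute your platform’s equivalent) and illustrative staging and fact tables:

  -- Assign surrogate foreign keys at load time by hashing the natural keys.
  -- No lookup against the Dimension, so the fact row is queryable immediately.
  INSERT INTO fact_sales (customer_key, date_key, quantity, sales_amount)
  SELECT HASH64(s.customer_number)             AS customer_key,
         HASH64(CAST(s.order_date AS VARCHAR)) AS date_key,
         s.quantity,
         s.sales_amount
  FROM   stage_sales AS s;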

Quantify the Risk

You are probably familiar with the “Birthday Problem”: if there are 23 random people in a room, there is a better than 50% chance that two of them share a birthday. I know it because I had to write a program (in COBOL, no less) to solve it in my sophomore year. It involves a bunch of factorials (365!) that make it a challenge to code.

Given we now have AI, I asked both ChatGPT and Grok the same question, and got the same answer.

Given a dimension with 1 billion unique keys, what is the probability that the next 64-bit random key will collide with one in the table?

Grok’s summary was:

“We’re dealing with a set of 1 billion unique 64-bit keys, and we want to figure out the probability that the next randomly generated key collides with one of the existing ones. This is a classic scenario that reminds me of the birthday problem…”

So, it understood the question.

“The probability is roughly 0.000000005421%.” (Thanks, Mr. Spock, for such a ‘rough’ estimate!)

The second question is:

What is the probability the dimension already had at least 1 collision?

The answer: 2.68% (roughly).

64-bit integers are really big numbers. The odds of a collision are extremely low, and any error a collision may create is minimal, given that “clean” raw data is usually 5% wrong anyway.

To achieve a similar probability of hash collisions as 1 billion entries with a 64-bit hash, you would need approximately 15,259 entries in a 32-bit hash space.
– ChatGPT
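
All three figures can be sanity-checked with the standard birthday-problem approximation (my arithmetic here, not the AI transcripts), where n is the number of keys already stored:

  % chance the next random 64-bit key collides with one of n = 10^9 existing keys
  P_{\text{next}} = \frac{n}{2^{64}} \approx \frac{10^{9}}{1.84\times10^{19}} \approx 5.4\times10^{-11}

  % chance that at least one collision already exists among the 10^9 keys
  P_{\text{any}} \approx 1 - e^{-n(n-1)/(2\cdot 2^{64})} \approx 1 - e^{-0.0271} \approx 0.027 \quad (\text{about } 2.7\%)

  % entries needed in a 32-bit space to reach the same collision probability
  k \approx \sqrt{2\cdot 2^{32}\cdot 0.0271} \approx 15{,}259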

Clean Up

As with any load process, there are still a few loose ends. The Dimensions need to be updated with any new Primary Keys. These are inferred dimension rows, created as placeholders until good data is received. In practice, the number of inferred dimension rows is small, and they are usually resolved in the next Dimension update.
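
A sketch of that clean-up, reusing the hypothetical HASH64() function and table names from earlier; any key seen in staging that is not yet in the Dimension gets an inferred placeholder row:

  -- Insert inferred (placeholder) rows for any hashed keys the Dimension has not seen yet.
  INSERT INTO dim_customer (customer_key, customer_number, customer_name, is_inferred)
  SELECT DISTINCT
         HASH64(s.customer_number) AS customer_key,
         s.customer_number,
         'UNKNOWN'                 AS customer_name,  -- placeholder attribute values
         1                         AS is_inferred     -- flag: overwrite when real data arrives
  FROM   stage_sales AS s
  LEFT JOIN dim_customer AS d ON d.customer_key = HASH64(s.customer_number)
  WHERE  d.customer_key IS NULL;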

It’s Not Personal

A hash like this can never be used as a means of reliable personal identification. The larger hashes used in cryptography can be, but they are stored as long strings, which are slower and clumsier than simple integers as keys. So, if you are a retailer with billions of customers, your analysis may be off by a small fraction of a percent, but it will run faster because it avoids strings as keys.

Cloud database systems with a 64-bit hash function include:

All listed functions receive a VARCHAR as a parameter and return a BIGINT.  Azure (Synapse and standard) does not include such a function; I have listed open-source hash functions at the end of this blog.

What about History?

At this point, a 64-bit hash is an acceptable way of creating a reliable surrogate key. But how do you handle historical Type 2 Dimension tables? How do you load fact tables without knowing the current Type 2 key? On an MPP system, how do you distribute Fact tables?

Create the new Type 2 key by hashing the natural key concatenated with the current timestamp.  Maintain a cross-reference of the Type 1 key to the current Type 2 value; it is used when loading a new fact row. While this adds work, the process only occurs when maintaining the Dimension table, not when loading Facts.
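
A sketch of that maintenance step, again using the hypothetical HASH64() function; the key and table names (customer_key1 for Type 1, customer_key2 for Type 2) are purely illustrative:

  -- An attribute change arrives for natural key 'C-10042': close the current row...
  UPDATE dim_customer
     SET row_end_ts = CURRENT_TIMESTAMP
   WHERE customer_key1 = HASH64('C-10042')
     AND row_end_ts IS NULL;

  -- ...then insert the new version, hashing natural key || timestamp for the Type 2 key.
  INSERT INTO dim_customer
         (customer_key1, customer_key2, customer_number, customer_segment, row_start_ts, row_end_ts)
  VALUES (HASH64('C-10042'),
          HASH64('C-10042' || CAST(CURRENT_TIMESTAMP AS VARCHAR)),
          'C-10042', 'Premium', CURRENT_TIMESTAMP, NULL);

  -- The cross-reference of Type 1 key to current Type 2 key is simply the open rows;
  -- the fact load joins to it on the hashed Type 1 key, so facts never wait.
  CREATE OR REPLACE VIEW xref_customer AS
  SELECT customer_key1, customer_key2 AS current_customer_key2
  FROM   dim_customer
  WHERE  row_end_ts IS NULL;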

Maintaining Type 2 history requires some design considerations in your database:

  1. Store the Type 1 and Type 2 foreign keys in Facts.
    This allows the user to perform current or point-in-time analysis from the same table.
  2. Use a cross-reference when loading facts.
    The Dimension table should be Type 2 and contain both the Type 1 and Type 2 keys. When loading facts for the dimension, build a cross-reference from the keys of the current dimension rows.  The initial dimension row will contain the same value (the Type 1 key) in both key columns.
  3. Maintain one Dimension table.
    This simplifies things for the user: they always join to the same table regardless of the key they use. If you are using an MPP system, distribute on the Type 1 key. Join using the Type 1 key with a filter on the Type 2 key to select the row you want (see the sketch after this list).  On a standard system you can join directly on the Type 2 key.
  4. If you are updating an inferred row, update it in place rather than creating a new version.
  5. Late-arriving attribute changes require the creation of a new Type 2 row.  This can imply updating old Fact rows, which should be discouraged. Consider instead generating offset rows that reverse and restate the affected facts. This applies the change without updating existing rows, preserving 24/7 availability and a record of the change.
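
As a sketch of the query pattern in item 3 (illustrative names again): both tables are distributed on the Type 1 key, so the join stays co-located on an MPP platform, while the Type 2 key simply selects the right version.

  -- Point-in-time analysis: the fact row carries the Type 2 key that was current at load time.
  SELECT d.customer_segment,
         SUM(f.sales_amount) AS sales_amount
  FROM   fact_sales   AS f
  JOIN   dim_customer AS d
         ON  d.customer_key1 = f.customer_key1   -- distribution-friendly equi-join
         AND d.customer_key2 = f.customer_key2   -- picks the version in effect when the fact loaded
  GROUP  BY d.customer_segment;
  -- For current-state analysis, drop the customer_key2 predicate and
  -- filter on d.row_end_ts IS NULL instead.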

Some Don’t Like It

Real-time changes are not something most analysts like or need: the report you run now could differ from the one created 10 minutes ago. An easy workaround is to store an insert timestamp and a “business date” attribute on every row, so the user groups that need stable results can filter on them.
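
A sketch of that workaround, assuming the fact rows carry insert_ts and business_date columns; users who want stable numbers agree on a cut-off and filter on it:

  -- Reproducible report: ignore anything that arrived after the agreed cut-off.
  SELECT business_date,
         SUM(sales_amount) AS sales_amount
  FROM   fact_sales
  WHERE  insert_ts <= TIMESTAMP '2025-01-31 06:00:00'  -- illustrative cut-off for this run
  GROUP  BY business_date;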

In Conclusion

Using a BIGINT hash is a reliable method of creating a surrogate key, even in exceptionally large-volume applications. Doing so will simplify and streamline your load procedures.

The result is that the dependency between Fact table and Dimension table maintenance is removed: Facts can be received as they arrive, and Dimensions cleaned up afterward.

Open Source Hash Functions

Publicly available HASH functions:

https://github.com/JonHanna/SpookilySharp

https://github.com/N-R-K/ChibiHash

https://github.com/google/highwayhash