Dimensional Modelling for Advanced Data Analytics and Cloud Solutions
In my earlier blog I discussed MPP style cloud databases. These databases imitate a physical MPP system on conventional hardware. An MPP requires very specific thoughts on how to arrange the data, so queries run efficiently. But almost all other databases claim they can handle huge amounts of data without ANY preparation. What gives?
Imagine you are faced with multiple sources of similar data and there is no industry standard to identify what the data refers to. Many professions have standard coding systems that are shared by all involved. Examples include diagnosis codes, ISO codes, state codes, ZIP codes, and so on. In this case, each data feed unambiguously…
Natural keys can be complex. It may involve multiple columns making its use prone to error. A column may be omitted in the join causing run-away queries. It is not efficient as it requires significantly more work to compare two strings over two binary integers. A surrogate is simple and easy to understand: The key…
The optimal data schema for parallelization is a Star Schema. Normalized data models are very poor for such systems because all tables are based on a unique primary key. Vendors that encouraged such modeling (Teradata) included an extensive array of bizarre indexing strategies to overcome the issue. So, the machine required a lot of handholding…