Creating a Virtual Analytics Environment
A Comprehensive Guide to Manipulating Any Dataset under a Virtual Logical Interface.
Introduction
In most organizations, the most valuable data analysts—those hired to uncover mission-critical insights—spend up to 80% of their time not on analysis, but on the frustrating, manual labor of finding, cleaning, and reconciling data from a dozen different systems. This massive waste of talent and time is a direct result of a fragmented data landscape. This paper presents a practical architecture to flip that ratio, creating a virtual environment where trusted, analysis-ready data is always at the analyst’s fingertips, allowing them to focus on what truly matters: driving business value.
In today’s data-driven world, the ability to analyze and visualize data effectively is crucial for making informed decisions. A virtual data analytics environment provides a powerful platform for displaying and combining datasets from various sources, enabling users to uncover insights and trends that might otherwise remain hidden.
The key to this environment is that the data it presents is fully vetted, reliable, and accurate. Data sources are clearly identified and current. Data across these sources may be maintained locally or accessed remotely without the users needing to know or care which. Subscription and access rules are maintained for all data sources.
This paper presents an architecture that supports global access to any relational dataset and integrates it with any other compatible relational dataset, whether in-house or from a third-party site.
The intent is to develop a single interface that allows analysts and management to access any dataset regardless of its type or location.
Components of a Virtual Environment
- SQL-like Interface
Having a SQL-like interface enables any analytic tool to access the data with some modification. Any data source can be represented by a table that may be queried and combined with any other compatible dataset. The break from SQL is that you manipulate entire datasets rather than rows; context is managed through shared conforming attributes, not through explicit declaration.
- Virtual Model
This organizes the metadata and creates a data model for queries to reference. Many virtual models may exist to suit multiple applications and security needs.
- Semantic Model
This defines and describes all data in the system. All queries are expressed against the semantic layer, which also defines the structure of the virtual environment.
It provides the clear business definition, name, and characterization of each data element. Such models may be created internally or adopted from published models. It defines what everything means, and it identifies each element’s role in a query: Measure, Attribute, or Media. The latter is sub-typed to identify the appropriate “player”; Media, if presented, functions in a separate stream.
Each data element defined in the model is assigned a UUID, used to identify the data element internally and to coordinate with external models. This allows a model to be created by subject-matter teams as a series of components, which are vetted and combined to create a common semantic terminology.
The data names and semantic definitions may exist in many languages and share the same UUID. (A minimal sketch of such a definition appears after this list.)
- Semantic Mapping Layer
This layer maps semantic terms to data sources. It also cross-references UUIDs received from third-party sources.
Each data source is defined as a simple dataset: a collection of Attributes and Measures as defined in the Semantic Model. The source definition may contain transformation logic that can be pushed down to the physical server.
- Physical Layer
These are the “data sources” referenced by the mapping layer. All access to any data source is read-only.
Implementation of this layer is a local decision. There are a host of existing architectures addressing various means of data storage and retrieval, including schemas, self-identifying sources, and data objects.
These sources are not physical in the truest sense, but present a cohesive “row” to this environment. Relationships are implied by the Attributes shared between datasets. A row may contain a mix of data values and data objects; references to media objects, for example, require a “player” to present them in a separate stream. Coordination between the streams is a matter for the application. Do not cross the streams.
- Security
Security is prevalent throughout the architecture but will be discussed only in a future blog. Outside of that discussion, presume a secure environment.
- Management
This architecture is compatible with many existing methodologies and architectures. For example, if you operate as a Data Mesh, portions of the Semantic and Virtual models can be assigned to different SME groups to manage.
Nonetheless, this will be discussed fully in a future blog.
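To make the Semantic Model and Mapping Layer more concrete, here is a minimal sketch of what a model entry and a source mapping might look like. The class names, fields, and the example element are my own illustrative assumptions, not part of the architecture’s specification.

```python
import uuid
from dataclasses import dataclass
from enum import Enum

class Role(Enum):
    MEASURE = "Measure"
    ATTRIBUTE = "Attribute"
    MEDIA = "Media"

@dataclass
class SemanticElement:
    """One data element in the Semantic Model."""
    element_id: uuid.UUID            # internal ID, also used to coordinate with external models
    role: Role                       # the element's role in a query
    names: dict[str, str]            # business names keyed by language, all sharing the same UUID
    definition: str                  # clear business definition
    media_player: str | None = None  # sub-type hint ("player") for Media elements

@dataclass
class SourceMapping:
    """Maps a semantic element to a column in one physical data source."""
    element_id: uuid.UUID
    source: str                      # the dataset the mapping layer resolves to
    column_expression: str           # transformation logic that may be pushed down to the source

# Hypothetical example: a revenue Measure named in two languages.
revenue = SemanticElement(
    element_id=uuid.uuid4(),
    role=Role.MEASURE,
    names={"en": "Net Revenue", "de": "Nettoumsatz"},
    definition="Invoice amount after discounts and returns.",
)

revenue_mapping = SourceMapping(
    element_id=revenue.element_id,
    source="sales.invoices",
    column_expression="gross_amount - discount_amount - return_amount",
)

print(revenue.names["en"], "->", revenue_mapping.column_expression)
```

Virtual models could then be assembled as collections of these elements and mappings, one per application or security context.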
This series will examine each layer and how it may be constructed.
What Is Data?
In the relational world, data consists of numbers, strings, or BLOBs (Binary Large Objects) such as videos, audio, and other digital products. This data is organized into relational tables. That alone is not significant; it is what the data values represent that allows the system to consolidate disparate datasets into meaningful results.
The question becomes how you describe data to define its purpose and its relation to other data elements. For this I draw on Relational Theory and Dimensional Modeling. From a relational standpoint, the smallest entity to access and manipulate is a dataset, which implies relationships among all data elements in the same row. From Dimensional Modeling, each element is categorized as either a Measure or an Attribute, which defines the data element’s purpose in analysis.
In the Semantic Layer, every data element is defined as a Measure or an Attribute. Elements that can be both are defined twice, with unique names and appropriate definitions. Knowing which role an element plays determines how it is used when performing analysis.
With these characteristics established, combining data sets across various sources becomes a straightforward and systematic exercise.
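As a small illustration of the “defined twice” rule, consider a reading that can serve either role. The names, roles, and banding rule below are hypothetical; the point is that each role gets its own entry and definition in the Semantic Model.

```python
# The same underlying reading registered twice, once per role, each with its own
# name and business definition (illustrative assumption, not a prescribed format).
temperature_reading = {
    "name": "Temperature Reading",
    "role": "Measure",      # aggregated during analysis, e.g. averaged
    "definition": "Observed temperature in degrees Celsius.",
}
temperature_band = {
    "name": "Temperature Band",
    "role": "Attribute",    # used to group and filter other measures
    "definition": "Reading bucketed into Cold / Mild / Hot.",
}

def band(celsius: float) -> str:
    """Example derivation of the Attribute form from the Measure form."""
    if celsius < 10:
        return "Cold"
    if celsius < 25:
        return "Mild"
    return "Hot"

print(temperature_band["name"], "for 18.5 C:", band(18.5))
```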
Quality at 3 Levels
To be successful, I strongly recommend a three-tier structure for all implementations of the models being discussed. This is NOT the medallion architecture, but a traditional three-level promotion of processes: all live data is processed at the Gold Level only, not across all three. Nonetheless, the data sources themselves may be structured according to your requirements; this environment can sit over any collection of databases and data warehouses.
All new data sources begin at the Base Level for development and testing. This includes sources from any third party, regardless of authority or reputation.
Many processes are driven by the metadata that directs them. Metadata changes should likewise begin at the Base Level, then be validated and promoted.
- Base Level
All development and introduction of changes occur here. Any data source, dataset, model, transformation, or piece of metadata must be reviewed and approved before promotion to the QA Level. Access is limited to developers and subject-matter analysts.
- QA Level
Integration testing occurs at this point. This should be the primary responsibility of the customer or department receiving the new functionality. Deficiencies should be documented and evaluated; a release may be returned to the Base Level for rework prior to acceptance.
After acceptance, the changes are scheduled for release to production.
- Gold Level
This is the production system. It only permits access to fully vetted and accepted data from approved sources.
All write access at the QA and Gold Levels is controlled by DevOps.
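A minimal sketch of how promotion between the three levels might be enforced follows. The level names come from the text above, but the functions, actor roles, and one-level-at-a-time rule are illustrative assumptions.

```python
from enum import Enum

class Level(Enum):
    BASE = 1   # development and testing; open to developers and subject-matter analysts
    QA = 2     # integration testing by the receiving customer or department
    GOLD = 3   # production; fully vetted data from approved sources only

def can_write(level: Level, actor: str) -> bool:
    """Hypothetical write rule: only DevOps writes above the Base Level."""
    if level is Level.BASE:
        return actor in {"developer", "analyst", "devops"}
    return actor == "devops"

def promote(change: dict, target: Level) -> dict:
    """Promote a reviewed change one level at a time: Base -> QA -> Gold."""
    if target.value != change["level"].value + 1:
        raise ValueError("changes are promoted one level at a time")
    if not change.get("approved"):
        raise ValueError("a change must be reviewed and approved before promotion")
    # Approval is reset so the next level performs its own review before release.
    return {**change, "level": target, "approved": False}

change = {"name": "new third-party sales feed", "level": Level.BASE, "approved": True}
change = promote(change, Level.QA)
print(change["name"], "is now at", change["level"].name,
      "- DevOps write access:", can_write(change["level"], "devops"))
```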
The Dataset
The dataset is the atomic unit and the key object of this architecture. Like atoms, there are many different kinds of datasets containing different things. Unlike atoms, they are not inseparable (OK, that’s what they thought about atoms around the turn of the 19th century; it’s a weak analogy. Apologies. The editor has been sacked.). Combining two datasets may discard incompatible attributes. New datasets are introduced in the Base Level environment, where the data is reviewed for technical content, business value, completeness, and any other concern. The content needs to conform to the Semantic Model and may require additional transformations to do so.
The Base Level is essentially a sandbox for introducing new data to the environment. It should be open to technical and business audiences so they can discover and understand new sources of data. Proper management of this level is important to maintaining a reliable and trusted release schedule.
Some of these transformations could occur in the source itself, or they may be implemented as “column expressions” in the Mapping model.
Non-conforming data elements are not promoted. This may result in a subset of available data elements being defined in the QA and Gold Levels. Failure to promote a data element means it does not appear in a higher-level repository and cannot be accessed.
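The following sketch shows one way the “non-conforming elements are not promoted” rule could be applied when a new dataset is reviewed at the Base Level. The shape of the semantic model, the column names, and the column expression are all assumptions for illustration.

```python
# Semantic model fragment: element name -> role (shape assumed for illustration).
semantic_model = {
    "order_date": "Attribute",
    "region": "Attribute",
    "net_revenue": "Measure",
}

# Optional "column expressions" in the Mapping model that derive a semantic
# element from raw source columns: element -> (expression, columns it consumes).
column_expressions = {
    "net_revenue": ("gross_amount - discount_amount", {"gross_amount", "discount_amount"}),
}

# A candidate dataset introduced at the Base Level, described by its raw columns.
candidate_columns = {"order_date", "region", "gross_amount", "discount_amount", "free_text_notes"}

promoted, consumed = set(), set()
for element in semantic_model:
    if element in candidate_columns:              # direct one-to-one mapping
        promoted.add(element)
    elif element in column_expressions:           # derived via a column expression
        _expression, inputs = column_expressions[element]
        if inputs <= candidate_columns:
            promoted.add(element)
            consumed |= inputs

rejected = candidate_columns - promoted - consumed  # non-conforming: never promoted

print("Defined in QA/Gold:", sorted(promoted))      # conforming elements only
print("Left at the Base Level:", sorted(rejected))  # e.g. free_text_notes never reaches Gold
```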
You Don’t Join
The basic rule for combining star schemas is to aggregate each fact table on the common dimensions and union the results. In this case, the dimensions are the distinct attributes in the query. If the attributes are conforming, the results should be usable; if not, some aggregates will be skewed and distorted by incompatible attribute values.
The “COMBINE” (ex-JOIN) clause of the query would list the datasets, with optional clauses describing how to manage exceptions.
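Here is a minimal sketch of what “aggregate each fact on the shared attributes, then union” could look like. The COMBINE syntax in the comment is only a mock-up of the idea, and the pandas implementation, table names, and columns are illustrative assumptions.

```python
import pandas as pd

# Hypothetical query against the virtual model (syntax is a mock-up, not a spec):
#   SELECT region, net_revenue, support_cost
#   COMBINE sales_orders, support_tickets
# Instead of joining rows, each dataset is aggregated on the shared conforming
# attributes and the aggregated results are unioned.

sales_orders = pd.DataFrame({
    "region": ["East", "East", "West"],
    "net_revenue": [100.0, 150.0, 80.0],
})
support_tickets = pd.DataFrame({
    "region": ["East", "West", "West"],
    "support_cost": [20.0, 5.0, 10.0],
})

def combine(datasets: list[pd.DataFrame], attributes: list[str]) -> pd.DataFrame:
    """Aggregate each dataset on the shared attributes, then union the results."""
    aggregated = [d.groupby(attributes, as_index=False).sum(numeric_only=True) for d in datasets]
    unioned = pd.concat(aggregated, ignore_index=True)
    # A final aggregation collapses the union to one row per attribute combination;
    # min_count=1 keeps measures a dataset never supplied as NaN rather than 0.
    return unioned.groupby(attributes, as_index=False).sum(min_count=1)

print(combine([sales_orders, support_tickets], ["region"]))
```

Because each input is reduced to the same conforming attributes before the union, no row-level join keys are needed.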
The dataflow for a basic query is as follows:
1. The query is expressed against the Semantic Model through the SQL-like interface.
2. The Virtual and Mapping layers resolve each semantic term to one or more data sources.
3. A physical query is issued to each source dataset.
4. Each result set is conformed and aggregated on the shared attributes.
5. The conformed results are combined and returned to the requesting tool.
For item #3, “physical” means SQL queries against what are assumed to be actual datasets. In practice, these may be a view over a flattened star schema, a cached pre-aggregation, or anything else. A result set may include media that may be activated in a separate stream.
It is conceivable that steps 4 through 5 may occur wholly in active memory, significantly improving performance. Assembling results is fairly straightforward; its simplicity is due to the fact that each result is a simple dataset and the data values are conformed. The only real options are aggregation methods for Measures and allocation rules for missing Attributes.
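As one illustration of an allocation rule for a missing Attribute, a simple option is to place a dataset’s measures under a default member when the source cannot supply a requested attribute. Both the rule and the data below are assumptions; proportional allocation would be another choice.

```python
import pandas as pd

# The query asks for net_revenue by region and channel, but this source only knows region.
web_sales = pd.DataFrame({
    "region": ["East", "West"],
    "net_revenue": [250.0, 80.0],
})

requested_attributes = ["region", "channel"]

def conform(df: pd.DataFrame, attributes: list[str], default: str = "Unallocated") -> pd.DataFrame:
    """One possible allocation rule: fill a missing attribute with a default member."""
    out = df.copy()
    for attr in attributes:
        if attr not in out.columns:
            out[attr] = default          # every measure value lands under the default member
    return out.groupby(attributes, as_index=False).sum(numeric_only=True)

print(conform(web_sales, requested_attributes))
```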
The Semantic Layer should accommodate generic and specific references to media of all types. These are supported by a more direct data flow, as well as local caching and other features. For example, a query requesting data and related media may be split into separate data and media streams. Separate does not imply independent, and the mapping should include any conditions, triggers, or dependencies.
Coming up
The next post in this series will look at the Semantic and Mapping layers: what capabilities they should have and the challenges of integrating disparate data. Later I will look at constructing physical interfaces to improve performance and at how the final integration occurs. Finally, I will look at managing such an environment, possible commercialization opportunities to support this architecture, and current products and where they may fit.

