Building the Environment
Opening the Box
In theory, a virtual data environment sounds perfect: a single interface to all your data. But how does it actually work? What happens when you drag “Sales” and “Inventory” into a query and click “run”? This post lifts the hood on the machine. We’ll trace the complete path of a query—from the moment an analyst logs in to the final, combined result set—and introduce the powerful new concept of “Allocation” that finally solves the problem of analyzing data at different levels of detail.
Clear Separation
The key feature of this approach is the separation between the Virtual Environment and the actual data. Aside from metadata interchange, the output of the interface is always a data set, and it is always read-only. You can continue to use any methodology you have in place to maintain your current infrastructure, although it may require creating views to flatten the data structures you wish to represent.
There will be a standard interface protocol with metadata exchange allowing new Sources to be introduced locally or provided by 3rd parties.
Effect of Transactions
This environment does not recognize transaction states. Transaction states are a characteristic of the Source and its definition.
In general, there are 5 different types of sources:
- STREAM sources are continuous. MEDIA sources are handled by the appropriate driver.
- CURRENT sources change frequently: anything updated more often than ‘DAILY’.
- DAILY sources change at a regularly scheduled frequency.
- STABLE sources are those that do not change very often.
- FIXED sources do not change.
But remember, Source is a logical term. You can define multiple Sources for the same physical feed. For example, suppose you are receiving raw transaction data, cleaning it up, and storing it in a local database. In this case, you could define multiple Source objects for the same content: a STREAM Source providing the raw data (not recommended), a second STREAM Source with a cleaned version of the data, and a third CURRENT Source using the local database.
All STREAM sources are cached locally to support reuse. The Source definition should contain metadata that specifies cache parameters. These would be set by an Administrator. An Administrator should be able to manage this cache and redirect to a new Source that references the same content created outside of the environment.
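To make this concrete, here is a minimal sketch of what a Source definition might carry; the field names and values are illustrative assumptions, not a defined schema:

```python
# Hypothetical sketch of Source metadata, including cache parameters for a
# STREAM source. All field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class SourceDefinition:
    name: str                 # logical Source name
    source_type: str          # STREAM | CURRENT | DAILY | STABLE | FIXED
    connection: str           # URL or connection string for the physical feed
    cache: dict = field(default_factory=dict)  # set by an Administrator

raw_stream = SourceDefinition(
    name="Transactions (raw)",
    source_type="STREAM",
    connection="kafka://feeds/transactions",
    cache={"retention_hours": 24, "max_size_gb": 50, "shareable": True},
)

clean_current = SourceDefinition(
    name="Transactions (cleaned)",
    source_type="CURRENT",
    connection="postgresql://warehouse/transactions_clean",
)
```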
Raw Data
Being able to see and evaluate raw data is critical to testing new Sources. An analyst should be able to see a raw view which exposes the details of the transformation logic. While an internal system may provide conformed data elements, minimizing the need for such an ability, a 3rd party may not. This assists in validating transformation logic in the Mapping Layer. The new Source is mapped to the data elements the logic represents. You can then perform comparative analysis between the current and new Source values.
This capability moves development testing out of the real world and into the virtual world. You can easily modify transformation logic and see a result.
What about Media?
The nature of Media data elements is a function of the Source itself. Each Source has metadata that describes the nature of the Media: LIVE, ACTIVE or STATIC. Live media cannot repeat or shift time, while Active media can. Static media is fixed in time (photos, documents, etc). All such sources may have time as an attribute; the environment would allow the analyst to manipulate the media presentation as appropriate.
As a Source may have many data elements, there could be many Media types. The system should provide a means of synchronizing Media presentation among the Sources. This also implies the system should cache LIVE sources as the analyst manipulates the controls.
For example, an analyst is examining insurance claims. He has two Sources: one with data about all claims, the other with various media about each claim. Both will have attributes to associate with each other: date, time, claim number, and so on. Presentation can be synchronized with matching data elements.
One way to do this is to define new data elements in the Semantic Layer. The purpose of these data elements is to synchronize Media. You may define ‘Video Timestamp’, its format, time precision and so forth. Add the data element to the Source containing the Media object. The Media object is then explicitly synchronized to these data elements. The system will alert the synchronized media players when the data values change.
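A rough sketch of that idea follows: a synchronization data element (a hypothetical ‘Video Timestamp’) alerts registered media players whenever its value changes. The class and method names are assumptions for illustration, not the system's API:

```python
# Illustrative Media synchronization: players register against a shared
# synchronization data element and are alerted when its value changes.
class SyncElement:
    def __init__(self, name: str):
        self.name = name
        self.value = None
        self._players = []

    def register(self, player) -> None:
        self._players.append(player)

    def set_value(self, value) -> None:
        self.value = value
        for player in self._players:     # alert every synchronized player
            player.seek(self.name, value)

class MediaPlayer:
    def __init__(self, label: str):
        self.label = label

    def seek(self, element: str, value) -> None:
        print(f"{self.label}: repositioning on {element} = {value}")

video_ts = SyncElement("Video Timestamp")
video_ts.register(MediaPlayer("claim video"))
video_ts.register(MediaPlayer("claim photos"))
video_ts.set_value("2021-06-01T14:32:05Z")
```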
Media by Reference
In the system any Media data element is a JSON string containing the URL of the object and metadata providing technical information about the object, as well as access information if it is an external source.
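As a hedged illustration, such a JSON string might look like the following; the exact fields are assumptions, not a fixed format:

```python
import json

# A hypothetical Media data element: a URL to the object plus technical and
# access metadata. Field names and values are illustrative only.
media_element = json.dumps({
    "url": "https://media.example.com/claims/12345/dashcam.mp4",
    "media_type": "video/mp4",
    "nature": "STATIC",              # LIVE | ACTIVE | STATIC
    "duration_seconds": 94,
    "access": {"method": "bearer_token", "scope": "claims:read"},
})
```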
If a Media data element is viewed, it will be cached as received. If other analysts request the same object, they will use the same cache. Any analyst may view/hear any portion of the object that has been cached. Live media will be cached and shared as well.
How Does it Work?
Ok. You have built a Semantic Layer, complete with working and tested Mappings. How does all this work? Let’s walk through the path of a Query:
1. Log In
An analyst must be authenticated. Their permissions determine which data sets they may access, as well as which data elements, whether they may access atomic (granular) level data, and so on.
This informs the Dictionary as to what it can present. Using the Obsolete Element Mapping, it reassigns or removes obsolete UUIDs from the list.
2. Present the World
The analyst is shown the sources that are available to query, and the data elements available from each source, with a clear description of its content. The analyst should be able to review all metadata about the source, as well as find and associate other sources.
3. Compile
The analyst selects a list of data elements from a single Source. In addition, the analyst specifies bespoke aggregations of some Measures, as well as a Media data element.
Using the Equivalence Mapping, the system will find the UUID used in the Source definition. The system will generate a query for that specific Source. There may be many interfaces between this system and the “real world”. I assume generic SQL interfaces will be the most common.
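A minimal sketch of that compile step, assuming a simple lookup table and generic SQL generation; the UUIDs, column names, and table are all hypothetical:

```python
# The Equivalence Mapping translates the semantic-layer UUIDs the analyst
# selected into the column names a specific Source understands, then a
# generic SQL query is generated from them. Names are illustrative.
equivalence_mapping = {
    # semantic UUID          -> column in this Source
    "a7f3-attr-store":        "store_id",
    "c91d-meas-sales":        "sales_amt",
    "e42b-media-receipt":     "receipt_media_json",
}

def compile_query(selected_uuids, table: str) -> str:
    columns = [equivalence_mapping[u] for u in selected_uuids]
    return f"SELECT {', '.join(columns)} FROM {table}"

sql = compile_query(["a7f3-attr-store", "c91d-meas-sales"], table="sales_daily")
# -> "SELECT store_id, sales_amt FROM sales_daily"
```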
4. Execute
The Source receives the generated SQL query and executes it.
The response will vary depending on the nature of the Source. If it is a batched real-time source, it may not respond until a row or time limit has been exceeded. Other than a possible delay, such resolution is transparent to the user.
When it does respond, Media data elements return as complex data objects. These objects could be simple JSON arrays of URLs and metadata, or something more complex.
5. Generate Results
The result set is summarized at this point if it cannot be done in the Execute phase. Media objects are examined, URLs verified and presented as separate streams. A stream may be bookmarked and synchronized with another data element in the dataset. At this point the system may hide or obfuscate data elements based on the analyst’s access permissions.
This allows the interface to decide how it presents these added streams. Anything is possible.
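As noted above, the system may hide or obfuscate data elements at this stage based on the analyst's permissions. A minimal sketch of that idea, with a made-up permission structure:

```python
# Illustrative only: drop or mask data elements in the result set based on
# the analyst's permissions. The permission structure is an assumption.
def apply_permissions(rows, permissions):
    masked = []
    for row in rows:
        out = {}
        for element, value in row.items():
            if element in permissions.get("hidden", set()):
                continue                          # drop the element entirely
            elif element in permissions.get("obfuscated", set()):
                out[element] = "***"              # keep the column, mask the value
            else:
                out[element] = value
        masked.append(out)
    return masked

rows = [{"claim_no": "C-1001", "patient_name": "J. Doe", "paid_amt": 1250.00}]
perms = {"hidden": {"patient_name"}, "obfuscated": set()}
print(apply_permissions(rows, perms))
# [{'claim_no': 'C-1001', 'paid_amt': 1250.0}]
```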
6. Finish
The analyst would have diverse options to consume and save the results. This could include a full download of the media to alternative servers.
Not So Difficult
As you can see, once all the pieces are in place, creating results from a single dataset is fairly straightforward. In a multi-source query, each source is queried in parallel and cached. The result sets are combined based on the query.
Of course, it was a simple example. But what about a query that requires multiple sources? How does this work out?
Combining Multiple Sources
Let’s review:
- The Semantic Layer has organized data elements into Attributes, Measures, and Media
- A data element conforms across multiple Sources
Sources are combined based on the Dimensional rule: Aggregate Measures to a common grain (i.e. all matching Attributes). This leads to 4 different cases:
Case 1 – Combination by Union
This is the simplest form. The two datasets are aggregated on shared Attributes.
Case 2 – Combination by Shared Context
Combine measures from different sources on shared Attributes.
Case 3 – Combination by Allocation
Allocate Measures in one source to match the higher grain of the other source.
Case 4 – Cross Product
In this situation there are no common Attributes across the Sources. Such a request should be rejected by the system, or it should ask whether you want a simple aggregate of the Measures. Media data elements are ignored.
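Before the worked example, here is a rough sketch of the Dimensional rule in action for Case 2, using pandas as an assumed stand-in for the combining engine; the attribute and measure names are invented:

```python
# Measures from two sources are aggregated to their common grain (the shared
# Attributes) and then combined side by side. Column names are hypothetical.
import pandas as pd

source1 = pd.DataFrame({
    "A_A": ["east", "east", "west"], "A_B": ["q1", "q2", "q1"],
    "M_A": [100, 120, 90],
})
source2 = pd.DataFrame({
    "A_A": ["east", "west", "west"], "A_B": ["q1", "q1", "q2"],
    "M_C": [7, 5, 4],
})

common = ["A_A", "A_B"]                       # shared Attributes = common grain
agg1 = source1.groupby(common, as_index=False)["M_A"].sum()
agg2 = source2.groupby(common, as_index=False)["M_C"].sum()
combined = agg1.merge(agg2, on=common, how="outer")   # Measures side by side
print(combined)
```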
Example: Combine 3 Sources
This is a hypothetical example. Data element names have been masked to protect their identity. All Attribute names begin with A-, all Measures with M-, and all Media objects with X-.
The analyst sits down and chooses Source 1.

As you can see, Source 1 provides 4 Attributes (A-A, A-B, A-C, A-D), 2 Measures (M-A, M-B), and a Media object (X-A).
Next, the analyst decides to add Source 2 to the query.

This source does not contain A-C, M-A, or M-B, and it adds A-E, A-F, M-C, M-D, and X-B.
When you combine these, it transforms the avenues of analysis, guiding the analyst through the available options.

The moment Source 2 is introduced, the display changes for the analyst. Incompatible data elements are shown in Yellow. If the analyst does nothing, it will produce the result set shown. Elements shown in Green are fully compatible, while Yellow indicates an element may not have content from all sources. It does not indicate it may be null; all data elements have the potential to be null.
The default result set includes all common Attributes, all Measures, and all Objects. Let’s say X-A represents a video feed while X-B contains photos. It is reasonable to expect the analyst to be interested in seeing both. Same thing with Measures. Measures with common Attributes are directly comparable.
Then the analyst adds a 3rd source:

By default, the result set of all three sources is:

The Attributes in common (i.e. the granularity) have not changed; it added M-E and X-C and flagged all the Media elements as missing from some sources.
Reviving Attributes
In this example, you will notice that the default data set continually loses Attributes as you add more Sources. This affects the granularity of your results. But what if A-C is critical to your work? Do you remove the source that does not have it, or do you allocate Measures for that Attribute?
The following matrix shows the relationship between Sources and Data Elements:

The X’s show where a data element appears in each source. These symbols show which measures need to be allocated to suit the grain. For example, if I MUST have attribute A-C in my analysis, I need to allocate measures M-A, M-B, and M-E. At this point there should be various options, the default being to present no values for those measures at that grain (i.e. NULL). Or the analyst can define a means to allocate the Measure. The same is true for attributes A-E and A-F.
Allocation rules are the only thing the Analyst needs to provide to perform the query. They could be saved in the system and reused as needed. Every other specification is optional. The process will default and combine sources as previously explained.
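Allocation rules could take many forms. As a hedged illustration, here is one possible rule, proportional allocation, sketched in Python with invented names; the real system might offer this among other techniques:

```python
# Illustrative only: a measure that exists at a coarser grain (a hypothetical
# M_C without attribute A_C) is spread across the finer grain, proportionally
# to a measure that does carry A_C.
import pandas as pd

fine = pd.DataFrame({            # source that carries A_C
    "A_A": ["east", "east", "west"],
    "A_C": ["red", "blue", "red"],
    "M_A": [60, 40, 90],
})
coarse = pd.DataFrame({          # source without A_C
    "A_A": ["east", "west"],
    "M_C": [10, 5],
})

merged = fine.merge(coarse, on="A_A", how="left")
weight = merged["M_A"] / merged.groupby("A_A")["M_A"].transform("sum")
merged["M_C_allocated"] = merged["M_C"] * weight
print(merged[["A_A", "A_C", "M_A", "M_C_allocated"]])
# east/red gets 6.0, east/blue gets 4.0, west/red gets 5.0
```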
Internal Systems
If a data source is fully under internal control, you could enhance the source to include missing data elements. After some perfunctory updates to the Semantic Layer, the revised data source is ready to go.
Working with Media
To fit Media into this environment, it needs to be “relationalized”. To that end, any Media element is stored as a JSON string. Actual content is stored as a URL and needs to be handled by an appropriate display process. The system will provide normal controls as well as synchronizing content with a data element in both feeds (such as time, claim number, location…).
This puts the creation of such JSON strings on the source provider. If the source is providing detailed data, it should be a simple pass-thru. But, if it is aggregating, it should aggregate the Media value as an array of Media objects. This results in a JSON array of Media objects.
The reason detail feeds have fewer issues is that the environment will handle the aggregation internally and present all available Media to the analysts.
As it is a JSON object, the content defines what can be accessed and how, and it can reference the correct driver to effect its presentation.
Content Filtering
This system functions with sets. As such, sets can and will exist to filter and organize data content. For example, the World Health Organization (WHO) defines the ICD-10 coding systems for Diagnoses and Procedures. The CDC maintains the list for US diagnoses, and CMS oversees inpatient procedures.
These organizations provide datasets with standardized groupings of codes to identify certain conditions. Such lists can be made available for this environment to consume as a 3rd party Source. Essentially, all it needs is a URL and access permission.
The notion of lists goes a long way to assist correct analysis. The advantage is that such lists, when coordinated through the Semantic Layer, provide an easy way to apply them to a query. A simple list would contain Attribute values and test whether a data value is or is not in the list. In this environment, it could also be a list of Objects that perform complex logic.
A typical enterprise does not face the complexity of dealing with ICD coding but has its own needs. For example, it might create a list of product groups to track the performance of a campaign. Lists can group attributes, conform data values, or provide additional attributes (context) to an analysis.
Having worked in a lot of different industries, the longest WHERE clause I have seen was in Health Care. But multi-page WHERE clauses lurk everywhere. Many of these involve complex logic so such lists alone may not be sufficient.
The Source metadata will specify the source as a filter, and the logic to apply. The logic JSON string is compiled and applied against the source data to filter results. But most of the time, a simple list works best.
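For the simple-list case, the mechanics amount to a membership test. A minimal sketch, with made-up codes and field names:

```python
# A hypothetical filter Source: a simple list of Attribute values (here an
# invented set of ICD-10 codes defining a condition group) applied as a
# membership test against the source data.
diabetes_codes = {"E10.9", "E11.9", "E13.9"}     # illustrative values only

def in_list(rows, attribute, allowed):
    """Keep rows whose attribute value appears in the list."""
    return [row for row in rows if row.get(attribute) in allowed]

claims = [
    {"claim_no": "C-1", "diagnosis": "E11.9"},
    {"claim_no": "C-2", "diagnosis": "I10"},
]
print(in_list(claims, "diagnosis", diabetes_codes))
# [{'claim_no': 'C-1', 'diagnosis': 'E11.9'}]
```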
Semi-Structured Data
The interface is a dataset. For purposes of this interface, the object structure needs to be flattened out. Undefined data elements are ignored and are not consumed.
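One way that flattening could work, sketched with hypothetical element names; nested keys become dotted column names and anything not defined in the Semantic Layer is dropped:

```python
# Minimal sketch of flattening a semi-structured object into dataset columns.
def flatten(obj: dict, defined: set, prefix: str = "") -> dict:
    row = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, defined, prefix=f"{name}."))
        elif name in defined:
            row[name] = value                    # mapped data element
        # undefined elements are ignored and not consumed
    return row

defined_elements = {"claim.number", "claim.amount"}
record = {"claim": {"number": "C-7", "amount": 1200, "notes": "free text"}}
print(flatten(record, defined_elements))
# {'claim.number': 'C-7', 'claim.amount': 1200}
```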
As with any data source, data elements must be mapped and defined in the Semantic Layer. That is the purpose of the Base and QA data layers. Once a data source reaches the Gold Level, it has become a trusted and well-defined source.
A security facility should allow simultaneous access to multiple levels of data and metadata. This allows an analyst to evaluate a new data source against production data.
Quality Control
Since this is a virtual environment, the Base Level, QA Level, and Gold Level can be virtual as well. Any testing can be performed using Gold Level data if it does not create Gold Level objects.
There should be the ability to create containers to serve as ‘test environments’ that override Sources and redirect them to a different Level. That helps avoid a lot of the data movement traditionally found when setting up a QA environment. This would allow multiple independent QA sessions at the same time, streamlining the ability to test expansions to the environment.
AI Analyst Interface
The analyst should be guided by the system as sources are selected. The list of data elements in a source should display an element differently if it is common to all sources, shared with some, or unique to that source. It should know when allocations are necessary and request guidance. An AI-type assistant may aid with that.
The session between the System and the Analyst should be fully interactive and controlled by the analyst, with the System providing constructive feedback in creating and executing the query. It also requires the System to alert the analyst to missing (unavailable) data sources. The Source definition itself may provide actions that should be taken if the source is unavailable, such as an alternative Source.
The Source interface should support a periodic pinging action to verify something’s on the other side. This should allow it to actively monitor the Source and provide alerts and recovery functions should the connection become lost. An Administrator should be able to take a Source off-line, blocking any access to that Source. A Source may also take itself off-line (logically), if its metadata requires it, due to contractual terms.
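A rough sketch of such a monitor; the endpoint, interval, and fallback Source are hypothetical, and a real implementation would hang off the Source metadata rather than a loop like this:

```python
# Illustrative periodic ping against a Source endpoint, with a simple
# alert-and-failover behaviour. All names and parameters are assumptions.
import time
import urllib.request

def ping(url: str, timeout: float = 5.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:
        return False

def monitor(source):
    while True:
        if not ping(source["health_url"]):
            print(f"ALERT: {source['name']} unreachable")
            if source.get("fallback"):
                print(f"redirecting to alternative Source {source['fallback']}")
        time.sleep(source.get("ping_interval_seconds", 60))

# monitor({"name": "Claims feed", "health_url": "https://feeds.example.com/health",
#          "fallback": "Claims feed (replica)", "ping_interval_seconds": 300})
```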
Query Language?
All Human queries should be interactive. There are a lot of ways to do that today, and probably a lot more tomorrow. To facilitate voice communication and provide a means to document and transport a query, there needs to be specific terminology for the types of operations.
COMBINE [<source>,<source>,…]
This establishes what you are looking at. The source list defines which data elements you can work with, what needs allocation, and what is not available.
The source list may have a hierarchy of sets within sets defining complex sources. Such definitions can be stored as “views” for easy reuse.
<SOURCE>
The source declaration is the name of the source. It may be aliased and used to reference specific data elements in that source.
…ALLOCATE [<measure>,<measure>,…] ON [<attribute>, <attribute>,…] USING <expression>
This defines how a specific source should be allocated to create measures for missing attributes of that source. Further analysis is needed to figure out how to express this; there are a variety of techniques to allocate a value.
…CALCULATE <measure> AS <expression>
This creates a new measure specific to this source. It can be used to separate measures from different sources or perform a calculation to create a measure compatible with another source. Probably include a RECALCULATE clause to change the data element value.
…RENAME <measure> AS <alias name>
This changes the name of the data element in that source to separate it from other sources. For example, doing a comparative report, such as “Same Store Sales”, where you are combining two sources, one containing current data and the other last year’s data. Such name changes exist solely within the query presentation and do not affect the Semantic Level.
Global Qualifiers
The CALCULATE expression can be applied to the entire set. It follows the source list and defines new measures using calculations against other measures.
Since the language needs to represent what the analyst decides to do, it needs to support additional features, such as sorting, filtering, and mathematical functions. The interface should warn the analyst when they specify a function that cannot be calculated. This is an issue with summarized data sources.
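Putting the pieces together, a complete request in this notation might read something like the following. This is only a sketch: the source names, data element names, allocation expression, and exact placement of the per-source clauses are all hypothetical.

COMBINE [CurrentSales C, LastYearSales L]
…L RENAME M-A AS LastYear_M-A
…L ALLOCATE [LastYear_M-A] ON [A-C] USING proportional(M-A)
CALCULATE M-Growth AS M-A - LastYear_M-A

Read aloud, this says: combine the two sources, rename last year’s measure so it stays distinct, allocate it down to the A-C grain in proportion to the current measure, and then calculate a growth measure across the combined set.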
Conclusion
In this environment, the goal is to encourage the analyst to combine different sources. More context means you can drill deeper into the data. The flat dataset interface creates a clean break between the real and virtual worlds, letting you expand or change your physical systems whenever you need to.
The result is a more productive analyst. They don’t need to hunt through schemas or write SQL. They simply pick and combine Sources.
In the next blog, I will look at what this means for Management and how you handle 3rd party sources.

