What is Semantics?
“Semantics, within the realm of data analytics, refers to the meaning and interpretation of data elements and their relationships. Instead of viewing data as mere numbers or strings, semantics provides context—explaining what the data represents, how it should be understood, and how different pieces of data relate to each other logically. By establishing clear semantics, analysts can ensure that data from varied sources is interpreted consistently, facilitating meaningful analysis, accurate insights, and effective data integration across virtual environments.” – Copilot
That’s an accurate definition. The underlying concept is fundamental—establishing clear definitions is the prerequisite for meaningful data integration. Once we have a clear understanding of what we need, the challenge shifts to execution. That is the role of the Mapping Layer.
The Data Dictionary
The core component of the Semantic Layer is the Data Dictionary (for lack of a better term). The dictionary defines individual data elements, each identified by a UUID (Universally Unique Identifier) that serves as its immutable identifier throughout the environment.
Within the Dictionary, these data elements may be organized and presented in any manner that is feasible. Complementing them are data source definitions, each of which enumerates the data elements its source contains. A data source defines a row, implying that all data elements in the source are related.
This source-to-data-element relationship mimics Kimball’s Bus Matrix, a mapping of fact tables to dimension tables. This metadata drives the user interface to suggest other datasets with similar granularity.
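To make that concrete, here is a minimal Python sketch of a dictionary entry and a source definition. Every field name is my own illustration, not a prescribed schema.

```python
import uuid

# A hypothetical dictionary entry; field names are illustrative, not a schema.
store_sales = {
    "uuid": str(uuid.uuid4()),   # the immutable identifier used everywhere
    "name": "Store Sales",
    "type": "Measure",           # see the element types below
    "description": "Gross sales recorded at a store for a period",
}

# A hypothetical source definition enumerating the elements in one row,
# echoing the Bus Matrix idea of mapping facts to dimensions.
daily_sales = {
    "name": "Daily Store Sales",
    "elements": [store_sales["uuid"], "uuid-store", "uuid-date"],  # placeholders
}
```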
Types of Data Elements
A data element may be one of three types:
- Attribute: provides context to the Measures and Media.
- Measure: a value (usually numeric) that reflects the magnitude of an event.
- Media (custom objects): an image, video, audio clip, or other object that requires separate handling. Media are not used directly in analysis but may enhance the information available to the analyst about an event.
In general, these are complex JSON objects containing URLs, the media type, and attributes such as time, size, description, and frame rate, with the rest depending on the media type. There may also be a security package required to access the media.
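As a rough sketch, a Media element might look something like this; the fields shown are assumptions based on the description above, not a fixed format.

```python
# A hypothetical Media element payload for a video clip. The exact shape
# would vary by media type; every field here is illustrative.
media_object = {
    "uuid": "uuid-media-illustrative",
    "media_type": "video/mp4",
    "url": "https://example.com/clips/incident-42.mp4",
    "attributes": {
        "time": "2024-06-01T13:05:00Z",
        "size_bytes": 18_432_771,
        "description": "Loading dock camera, aisle 3",
        "frame_rate": 29.97,
    },
    # Optional security package for gated access; contents are assumed.
    "security": {"scheme": "signed-url", "expires": "2024-06-01T14:05:00Z"},
}
```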
Building the Dictionary
This can be one of the most difficult undertakings in any organization. I have observed heated discussions between departments over the subtle wording of a definition. So start small and narrow, and create some early successes. Find a champion to support the implementation; demonstrating its worth leads to more support.
I will go into more detail on how to manage such a task in a later blog.
I see the dictionary as a separate object with UUIDs identifying each data element. Such a dictionary can come from anywhere: professional organizations, standards groups, and so on. As these evolve and “leading” dictionaries gain support, mapping addendums can be produced to equate elements between two different dictionaries. AI may be a good tool for comparing dictionaries and identifying differences.
Addressing Context
One issue is dealing with context in a Measure’s definition. Is “Last Year Same Store Sales” its own data element, or is it “Store Sales” bounded by a date? Generally, a Measure should avoid any context in its definition, but sometimes that is unavoidable. Also, there is no reason not to have both: the former element could implicitly bound the value of the latter.
Another example: a multinational corporation may have a “standard” currency for its corporate reporting. Internally, it stores two measures: one valued in the standard currency and the other in the local currency. Both would be defined in the dictionary.
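A minimal sketch of the “have both” approach, with illustrative field names of my own choosing:

```python
# Hypothetical dictionary entries: a context-free measure, plus a derived
# element that implicitly bounds it by date and scope.
store_sales = {
    "uuid": "uuid-store-sales",   # illustrative placeholder
    "name": "Store Sales",
    "type": "Measure",
}

ly_same_store_sales = {
    "uuid": "uuid-ly-same-store-sales",
    "name": "Last Year Same Store Sales",
    "type": "Measure",
    # The implicit bound: defined in terms of the context-free element.
    "derived_from": store_sales["uuid"],
    "bound": {"date_offset_years": -1, "scope": "same_store"},
}
```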
Defining Hierarchies
The Dictionary should allow the definition of parent/child relationships between Attributes. For example, a City belongs to a State. These relationships can be chained to create multi-level hierarchies.
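One simple way to picture this is a child-to-parent map that can be walked upward. This sketch and its placeholder UUIDs are purely illustrative:

```python
# A hypothetical chained hierarchy: City -> State -> Country.
# Each entry maps a child attribute's UUID to its parent's UUID.
hierarchy = {
    "uuid-city": "uuid-state",
    "uuid-state": "uuid-country",
}

def ancestors(attr_uuid, parents):
    """Walk the chain upward to list an attribute's full lineage."""
    chain = []
    while attr_uuid in parents:
        attr_uuid = parents[attr_uuid]
        chain.append(attr_uuid)
    return chain

print(ancestors("uuid-city", hierarchy))  # ['uuid-state', 'uuid-country']
```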
The Mapping Layer
The interface between the Virtual World and Real Data is the Mapping Layer.
When a new data source is added to the environment, it is described in terms of the data elements it contains. A source may be fixed in time, such as ‘2020 Census Data’, or contain no detail rows at all, such as ‘Total net monthly Global trade for 2000-2020’.
Introducing a New Source
While working in the Base layer, new sources are added manually, or they are absorbed seamlessly into the environment if they share the Data Dictionary.
The source definition is packaged as a JSON object. It contains an array of data element UUIDs, transformation logic, casting rules, and so on. Source-level transformations should be optimized as much as possible; it almost goes without saying that ample memory and parallel processing go a long way toward creating snappy applications.
The source definition should define the key aspects of the source (a sketch follows the list below). This includes:
- Where is it located?
- Should it be cached?
- What data elements does it contain?
- Is it detailed or a summary?
- If a summary, does it provide cardinality?
- Transformation logic to conform an element’s value.
- What summaries can I produce?
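Pulling the list together, a hypothetical source definition might look like this. Every field name is an assumption for illustration, not a fixed schema:

```python
# A hypothetical source definition touching each aspect in the list above.
source_definition = {
    "name": "Daily Sales Summary",
    "location": "s3://warehouse/daily_sales/",    # where is it located?
    "cache": {"enabled": True, "ttl_hours": 24},  # should it be cached?
    "elements": ["uuid-store", "uuid-date", "uuid-store-sales"],
    "grain": "summary",                           # detailed or a summary?
    "provides_cardinality": True,                 # needed for algebraic functions
    "transforms": [
        # conform an element's value to the dictionary's definition
        {"element": "uuid-store-sales", "cast": "decimal(18,2)"},
    ],
    "supported_aggregations": ["SUM", "MIN", "MAX", "AVG"],
}
```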
For example, a common use case is a daily sales summary that is used widely throughout the organization. Such a dataset could be generated nightly and cached in memory for the day. Other features may include the ability to provide cross-references for tuning a process.
Including transformation logic is especially important when dealing with third-party sources. While you can ensure an internal source provides the right data as needed, the same is not true of third-party sources. This is potentially a major security hole that needs to be isolated and protected from bad actors.
The last item is determined by the characteristics of the source. Generally, there are three categories of aggregation functions:
- Distributive Functions (SUM, MIN, MAX…)
- Algebraic Functions (AVG, Standard Deviation…)
- Holistic Functions (COUNT(DISTINCT), MEDIAN, PERCENTILE)
Distributive functions are fully executable against any source. Algebraic functions require a summary source to provide at least a cardinality value. Holistic functions must have detailed source data, although approximation methods exist for some calculations. The system must recognize these constraints and warn the user when a requested function will not work against a given source.
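A quick sketch shows why cardinality matters for algebraic functions. Here AVG is composed correctly from partial sums and row counts, while averaging the partial averages would be wrong; the numbers are made up for illustration.

```python
# Each summary row carries a pre-aggregated SUM plus a row count, so a
# correct global AVG can be composed without the detail rows.
summary_rows = [
    {"sum_sales": 1200.0, "row_count": 40},
    {"sum_sales": 100.0,  "row_count": 10},
]

total = sum(r["sum_sales"] for r in summary_rows)   # distributive: 1300.0
count = sum(r["row_count"] for r in summary_rows)   # the cardinality: 50
average = total / count                             # algebraic AVG: 26.0

# Naively averaging the two partial averages gives (30 + 10) / 2 = 20,
# which is wrong; without cardinality, AVG cannot be composed correctly.
# MEDIAN is holistic: no fixed-size summary suffices, so it needs detail
# rows (or an approximation), and the system should warn the user.
```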
Obsolete Element Mapping
We all make mistakes, so there needs to be a way of correcting them. Even if your promotion strategy is rock solid, stuff happens. This mapping is a list of obsolete UUIDs, each paired with the UUID that replaces it. When the system displays available elements, it presents the substitute data element, or nothing if the substitute is null.
Because this item and the next are mappings, the dictionary itself is never changed directly, and the mappings provide a means to audit the environment. These mappings should be secured to prevent malicious changes, as such changes can create chaos for the organization.
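A minimal sketch of how resolution might work, assuming a simple old-to-new map with null marking a retired element:

```python
# Hypothetical obsolete-element mapping: old UUID -> replacement UUID,
# or None if the element is retired outright. The dictionary itself is
# never modified, which preserves an audit trail.
obsolete = {
    "uuid-old-sales": "uuid-store-sales",
    "uuid-retired-flag": None,
}

def resolve(element_uuid, obsolete_map):
    """Follow replacements until reaching a live element (or None)."""
    while element_uuid in obsolete_map:
        element_uuid = obsolete_map[element_uuid]
        if element_uuid is None:
            return None   # retired: display nothing
    return element_uuid
```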
Equivalence Mapping
I envision a variety of Dictionaries created internally or by third parties. Most likely they will use UUIDs that don’t match up. So, a means should be provided to map an external UUID to your internal UUID. This mapping may also include transformation logic to conform data values.
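A sketch of such a mapping, with an assumed cents-to-dollars conversion standing in for the transformation logic:

```python
# Hypothetical equivalence mapping between an external dictionary's UUIDs
# and internal ones, optionally with logic to conform the external value.
equivalence = {
    "ext-uuid-revenue": {
        "internal": "uuid-store-sales",
        # assumed: the external source reports cents; conform to dollars
        "transform": lambda v: v / 100,
    },
}

def conform(ext_uuid, value, mapping):
    """Translate an external element and its value into internal terms."""
    entry = mapping[ext_uuid]
    return entry["internal"], entry["transform"](value)

print(conform("ext-uuid-revenue", 12599, equivalence))
# ('uuid-store-sales', 125.99)
```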
Putting it Together
So, you have sources, now what?
My next blog will take a query and break down how this whole thing would work. I will also delve into possible mechanisms to allow a query to move into a new environment without significant changes.
We will also take a look at the interaction between the environment and a data source and different ways of interacting with it.

