Managing Implementation
Where You At?
Creating a Semantic Layer sits far along the Analytic Environment Maturity scale. Few organizations have succeeded, and fewer still are realizing the benefits of understanding and leveraging their data assets. This architecture acknowledges that challenge and offers a mechanism to have 3rd party providers, such as professional organizations, supply a large part of the content. In AI terms, this becomes a ‘small language model’.
Bill Inmon discusses a similar concept in a recent post (1) BUILDING THE BUSINESS LANGUAGE MODEL – BLM – PART 1 | LinkedIn. What he outlines would go a long way toward providing the bulk of a Semantic Layer. With UUIDs identifying data elements, you are free to alter the jargon to improve understanding without changing its meaning. It would also be a key assistant to the analyst in locating and analyzing related data.
The Virtual Environment I am describing does not yet exist, and its biggest issues may be resolved by business-focused language models. So, regardless of how the future unfolds, resolving Semantics across a large organization sets the stage for implementing a modern analytic environment that improves an Analyst’s productivity by 300% or more.
Review
In the last blog post, I discussed classifying Data Elements as either Attribute, Measure, or Media. This describes how each Data Element should be used. That is:
- Match Attributes across Sources
- Aggregate Measures across Attributes
- Media is a reference to external content such as documents, pictures, video, or audio. Each Media element is represented by a data object containing a URL and other metadata. When aggregated, these objects are accumulated into an array and made accessible through appropriate players.
This allows the environment to present basic analysis as Sources are chosen. Once the selection is complete, all the Analyst needs to decide is how to filter the data, the granularity, and the formulas used to aggregate Measures. In other words, the Analyst does analyst stuff, not programmer stuff.
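To make these roles concrete, here is a minimal Python sketch of how a combine step might apply them. The UUIDs, field names, and aggregation choices are illustrative assumptions, not the environment's actual interface.

```python
from collections import defaultdict

# Hypothetical element metadata: UUID -> role (Attribute, Measure, Media).
ELEMENT_ROLES = {
    "a1b2-store": "Attribute",
    "c3d4-sales": "Measure",
    "e5f6-receipt-scan": "Media",
}

def combine(sources):
    """Merge rows from multiple Sources keyed on shared Attributes.

    Measures are summed and Media objects are accumulated into arrays,
    mirroring the roles listed above."""
    attrs = [u for u, r in ELEMENT_ROLES.items() if r == "Attribute"]
    merged = defaultdict(lambda: {"measures": defaultdict(float),
                                  "media": defaultdict(list)})
    for source in sources:
        for row in source:
            key = tuple(row.get(a) for a in attrs)           # match Attributes across Sources
            for uuid, value in row.items():
                role = ELEMENT_ROLES.get(uuid)
                if role == "Measure":
                    merged[key]["measures"][uuid] += value    # aggregate Measures
                elif role == "Media":
                    merged[key]["media"][uuid].append(value)  # accumulate Media references
    return merged

pos = [{"a1b2-store": "S01", "c3d4-sales": 120.0}]
returns = [{"a1b2-store": "S01", "c3d4-sales": -20.0,
            "e5f6-receipt-scan": {"url": "https://example.org/r1.pdf"}}]
print(dict(combine([pos, returns])))
```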
Virtual Reality
The embodiment of this environment is metadata. Everything discussed is about what information is needed to describe different capabilities (i.e. a Schema) of Sources and Data Elements.
When describing a Data Element, the two definitive properties are its UUID and Data Type. Everything else is, essentially, documentation. The Data Type (Attribute, Measure, Media) defines its role in analysis. This allows the environment to combine multiple sources and produce results with minimal input from the Analyst.
The UUID is the identity of a Data Element. If a UUID appears in many Sources, they are the same Data Element. If it is an Attribute, it is used to integrate the Sources. Measures and Media are aggregated.
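As an illustration, a Data Element descriptor might look like the following. Only the uuid and data_type fields are definitive; the rest is documentation, and every name and value here is assumed for the sketch.

```python
# A minimal sketch of a Data Element descriptor (field names are illustrative,
# not a fixed schema).
data_element = {
    "uuid": "3f2c9a4e-hypothetical",       # identity: same UUID in any Source = same Data Element
    "data_type": "Measure",                # Attribute | Measure | Media
    "name": "Net Sales Amount",
    "description": "Sales net of returns, in the Source's local currency.",
    "security": {"classification": "internal", "source_nation": "CA"},
}
```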
Security
Not my forte, but security is a critical issue. Each Data Element should carry a security classification segment that, combined with the Analyst’s own security classification, controls whether its content is presented. The system should provide multiple ways to obfuscate any data element based on the analyst’s clearance. The classification should include details such as the source nation, so that access can be adjusted to local laws.
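For example, obfuscation could be driven by comparing the element's classification with the Analyst's clearance. This is a minimal sketch under assumed clearance levels and masking strategies; nothing here is prescribed by the architecture.

```python
# Assumed clearance ordering and masking strategies, purely for illustration.
CLEARANCE_ORDER = ["public", "internal", "restricted", "secret"]

def present(value, element_classification, analyst_clearance, strategy="redact"):
    """Return the value, a masked form, or nothing, depending on clearance."""
    if CLEARANCE_ORDER.index(analyst_clearance) >= CLEARANCE_ORDER.index(element_classification):
        return value
    if strategy == "redact":
        return "***"
    if strategy == "partial":
        return str(value)[:2] + "…"        # show only a leading fragment
    return None                            # suppress the element entirely

print(present("4111 1111 1111 1111", "restricted", "internal", strategy="partial"))
```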
A new data set developed by an analyst should be vetted and promoted to the Gold Level like any other data set. When presenting results, the Source interface should provide information such as timeliness, validation level, and identification of any lower-level (Base, QA) Data Elements involved.
The environment should address all security protocols and masking situations. Any Analyst assigned to the Gold Level will only have access to approved data through this system. Without developer credentials, there is no ‘backdoor’ to Source data systems.
Physical Data Structures
Any performance this environment can deliver depends heavily on the physical structure of the datasets. One could consider ‘end of day’ snapshots of a Source, maintained by the environment in cache, to stabilize financial analysis. The environment should give an Administrator the ability to direct how the cache is used. Strategies could be based on the Source, account, availability, or other parameters.
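As a sketch, such a cache directive might itself be expressed as metadata. The keys, strategy name, and schedule below are assumptions, not a defined interface.

```python
# An illustrative cache directive an Administrator might register.
cache_policy = {
    "source": "GL_TRANSACTIONS",
    "strategy": "end_of_day_snapshot",   # stabilize financial analysis on a daily cut-off
    "scope": {"account_range": ["1000", "4999"]},
    "refresh": "18:00 UTC",
    "fallback": "live_query",            # if the snapshot is unavailable, query the Source directly
}
```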
At the extreme, you could define your Sources as individual tables in a fully normalized database and define each primary key and its foreign counterparts in other Sources as the same data element. Everything would function correctly, but performance could be dismal. It is therefore important that a data source is as complete as possible, so that a COMBINE only occurs with other complete sources.
This requirement suits all current cloud deployment architectures (Dimensional, Data Lake, Data Lakehouse, etc.), so it becomes a matter of prioritizing what datasets you implement.
I would present a Dimensional Model as individual ‘flattened’ Star Schema Views. If you are using a bridge table, you will need to create an ‘up’ and a ‘down’ version of the Source with different Views. Any reasonable database will resolve the view by eliminating unnecessary joins and projecting only the columns being queried.
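As an illustration, with invented table and column names, a flattened Star Schema View could be registered as follows; the Python is just a container for the DDL the environment would hold as metadata.

```python
# A sketch of registering a Dimensional Model Source as a flattened view.
flattened_view_ddl = """
CREATE VIEW v_sales_flat AS
SELECT f.sale_uuid, d.date_key, s.store_region, p.product_category, f.net_amount
FROM   fact_sales f
JOIN   dim_date    d ON d.date_key    = f.date_key
JOIN   dim_store   s ON s.store_key   = f.store_key
JOIN   dim_product p ON p.product_key = f.product_key
"""
# The environment would expose v_sales_flat as one Source; the database's
# optimizer prunes joins and columns that a given query does not reference.
```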
A similar technique may be used to access semi-structured data. However, it is reasonable to expect the interface to provide a connector to such sources and allow the data source’s content to be defined.
Mechanics
The Base level is where all development takes place. Nothing in the Base layer should be considered stable or correct. The QA and Gold Levels are strictly read-only except for Administrators handling metadata releases to that level.
Tests can be run in containers that amalgamate the different levels, so that test sources are combined with Gold Level sources.
Analysts can produce output. It is maintained in local storage and is (by default) accessible as a Base level source. All output from this system is tagged with the sources used. Any analysis that uses a non-Gold Level source should not be considered actionable; a source must first be fully vetted and promoted through a management process.
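For instance, tagging output with its sources makes the Gold-only rule easy to check downstream. The field names below are illustrative assumptions.

```python
# Output tagged with the Sources used, so consumers can verify that
# everything involved was Gold Level before acting on it.
analysis_output = {
    "result_set": "monthly_margin_by_region",
    "sources": [
        {"name": "GL_TRANSACTIONS", "level": "Gold"},
        {"name": "analyst_adjustments", "level": "Base"},   # non-Gold => not actionable
    ],
}
actionable = all(s["level"] == "Gold" for s in analysis_output["sources"])
print(actionable)
```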
Implementing a Semantic Layer
Success Factors
Building a complete and effective Semantic Layer is a long-term project that may not show practical results soon. In a word, this is infrastructure: a long-term solution.
To achieve success, the project requires the following commitments:
- High Level Support: Budget and authority from C-level executives to move the project forward.
- Ownership: Departments need to take ownership of data in their domain. It is no longer an “IT problem”. Designate Subject Matter Experts within the Departments to guide the development of the Dictionary.
- Cooperation: Professional respect is paramount to reaching consensus among different teams. Different operational areas of an organization may have vastly different wording for the same thing. A multi-national organization has this challenge even within a single native language.
- Buy-In: A commitment to success from participating Departments. The best way to get this moving is to identify two or three departments with the most to gain and implement a ‘bare bones’ integration to demonstrate its utility.
- Technology: AI can be a big win here. Its ability to assess and compare large volumes of text, transcribe spoken text, and translate languages can go a long way toward aiding the building of the Semantic Layer.
As this architecture takes hold, one would imagine commercial offerings that provide a Semantic Layer for specific industries. As these products become accepted, common use of a compatible Dictionary will lead to seamless integration of a vast variety of data. A 3rd party Source that utilizes a common Dictionary may be integrated with the environment almost instantly.
However, in practice, Dictionaries will vary, each with its own collection of UUIDs. To integrate a new Source, you map the Source’s UUIDs to known UUIDs in your Dictionary. Each mapping defines an equivalence between the two Dictionaries: the two Data Elements represent the same thing.
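A minimal sketch of such a mapping, with hypothetical UUID values:

```python
# Map a 3rd-party Source's UUIDs onto the local Dictionary.
uuid_equivalence = {
    # vendor Dictionary UUID        -> local Dictionary UUID
    "b7e1-vendor-customer-id":         "9c44-local-customer-id",
}

def localize(row):
    """Rewrite a Source row so its keys use the local Dictionary's UUIDs."""
    return {uuid_equivalence.get(uuid, uuid): value for uuid, value in row.items()}

print(localize({"b7e1-vendor-customer-id": "C-1001", "unmapped-element": 42}))
```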
As implementations of this architecture mature, one can expect a consolidation and standardization of Dictionaries. A mature market is likely to converge on a common Dictionary, permitting instant integration of a new Source.
Divide The Effort
Developing the Dictionary requires a lot of effort from a lot of people with different talents. Here are some of the key roles.
- The Subject Matter Expert: The person who knows their business area as well as anyone, and often the same person with little spare time. The appointed person should be allowed to prioritize this project. Their role is to facilitate meetings and recommend the final resolution of differences.
- Process Expert: A person who fully understands a specific business process. They investigate and advise on where and when data is created, and on semantic differences between collection points.
- Contributors: Those who provide definitions, organized around specific business events or documents. Each group gets a list of data elements to define. These lists are consolidated by the group and reviewed by the Subject Matter and Process Experts.
As the project progresses to an enterprise implementation, individuals can become a bottleneck. Be aware of this and reassign priorities or supplement resources as needed. However, as the process evolves, the clarity gained drives interest in completing the work.
Be Gradual
As mentioned earlier, start small. Pick two business events that are commonly compared and use those as the scope of the project. The first effort need only be a strawman implementation of a general understanding of the content: a proof of concept.
The entirety of the virtual environment is driven by metadata. Physical data is not touched in this phase. Therefore, any previous work can be enhanced with updates to the metadata. The metadata will be governed by a JSON Schema with objects defined for particular features or capabilities.
New functionality is added by extending the metadata, which directs the system on how to implement it. When released, such enhancements apply to every instance of the Data Element.
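As a sketch, a new capability might be introduced as an additional schema fragment. The "unit_conversion" capability and its properties are invented for illustration, and the actual schema would live as JSON rather than Python.

```python
# A hypothetical capability added to the metadata schema, written in
# JSON Schema style as a Python dict.
unit_conversion_extension = {
    "$id": "https://example.org/schemas/unit_conversion.json",
    "type": "object",
    "properties": {
        "target_unit": {"type": "string"},
        "factor": {"type": "number"},
    },
    "required": ["target_unit", "factor"],
}
# Once released to the Gold Level, any Data Element carrying a
# "unit_conversion" object would gain the behaviour everywhere it appears.
```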
Treat these changes as you would a software update. When a change is promoted to QA, test plans should be developed to demonstrate that it works as intended. A lot of this could be done with AI simulating Sources and evaluating results.
In Conclusion
The virtual environment described here is not a product you can buy today. However, current and future data environments rely on a functional understanding of the data available, and that understanding is a foundational capability you must build. It is an investment in infrastructure, driven entirely by metadata, with the clear and tangible goal of liberating your analysts.
By starting small with a committed team and proving the value of a common semantic understanding, you pave the way for a transformative shift. The path requires dedication, but the destination is an analytics ecosystem where data is seamlessly integrated, and analysts are free to deliver insights at a velocity previously unimaginable.
There are a lot of bits and pieces of technology that solve parts of the problem, so an integrated system that could (potentially) access anything is well within our reach.
Over time, I will look into the state of the art. Many vendors offer pieces of the solution, but there is nothing preventing a comprehensive resolution that sits above current implementations. Furthermore, AI is not necessary for query execution. It can be a significant aid in creating filters and allocation algorithms, but the query itself is executed by well-established optimization techniques using an established plan. That plan also provides an audit trail for the query content.

