This is an abbreviated and updated version of a presentation from Ontotext's Knowledge Graph Forum 2023 by Peter Hopfgartner, CEO at Ontopic.
At the end of last year, DB-Engines ranked Databricks as the fastest-growing database system, with Snowflake a close second. Databricks is a great place to store data: it has a rich computational framework based on Apache Spark, and it is scalable and cost-effective. All in all, data lakehouses seem to solve many data problems, which is why more and more companies are investing in them.
Does this mean that data lakehouses like Databricks will replace knowledge graph technology? I don't think so, mainly because they lack semantics and because the data stored in them can be very heterogeneous. As you can see in the diagram below, Databricks offers great solutions for storing and accessing data, but it is missing the top of the pyramid.
Using Databricks for knowledge graph construction
There are multiple ways to build a knowledge graph using Databricks.
Extract Transform Load
Typically, data needs to be loaded into a knowledge graph from other data sources. An established way is to extract data from the data lakehouse, apply transformation rules, and load the results into the graph database. This process is known as Extract-Transform-Load (ETL).
I don't recommend writing this transformation logic in a general-purpose programming language, because hand-written code scales poorly as the data grows in size and complexity. Instead, I recommend specialized software with clean, declarative transformation rules that scale well with the complexity and size of the data.
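To make this more concrete, here is a minimal sketch of such a declarative rule in R2RML, the W3C mapping standard mentioned later in this article. The sensors table, its id and name columns, and the ex: ontology are hypothetical names chosen purely for illustration:

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.org/ontology#> .

# Hypothetical rule: each row of the "sensors" table becomes one ex:Sensor.
<#SensorMapping>
    rr:logicalTable [ rr:tableName "sensors" ] ;
    rr:subjectMap [
        rr:template "http://example.org/sensor/{id}" ;
        rr:class ex:Sensor
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:label ;
        rr:objectMap [ rr:column "name" ]
    ] .
```

The rule only declares which rows become which triples; the mapping engine takes care of executing it, regardless of data volume.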
With ETL you have your data in the graph database and you can start analyzing it.
However, there are some use cases where this approach may be too costly, for example, if your data is very large, frequently updated, or infrequently queried. In these cases, it is often decided not to load the data into the knowledge graph, but to use an approach that doesn't require moving the data at all.
Data virtualization
A virtual knowledge graph answers queries in the same way as a traditional knowledge graph. The big difference is that it doesn’t get the data from the graph database, but directly from the source.
Today, virtual knowledge graph engines have become very smart. They can limit the amount of data read from the underlying source to only what is strictly necessary, and thanks to many query optimizations they can be very performant.
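As a sketch of what this looks like in practice, consider the following SPARQL query against a virtual knowledge graph. The ex: ontology and the Bolzano location are hypothetical names; the point is that the engine rewrites the query into SQL over the underlying tables:

```sparql
PREFIX ex: <http://example.org/ontology#>

# Only sensors located in Bolzano are ever read from the source.
SELECT ?sensor ?label WHERE {
  ?sensor a ex:Sensor ;
          ex:label ?label ;
          ex:locatedIn ex:Bolzano .
}
```

A capable engine translates this into a single SQL query along the lines of SELECT id, name FROM sensors WHERE city = 'Bolzano', pushing the filter down to the source so that only the matching rows are ever read.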
However, there are still use cases where this approach wouldn't be ideal. Maybe you need some kind of analysis that only works well in a graph database, or you don't want to put additional query load on the original data source.
A hybrid approach
As we can see, both ETL and data virtualization have advantages and disadvantages. But if we combine them, they can cover many more scenarios. A hybrid approach allows you to decide for each dataset whether you want to use it from the graph database (materialized) or virtually.
A good example is time series sensor data. If you have a large amount of time series data, you can keep the information about the sensors in the graph database and have the actual sensor data in a dedicated time series database. The data in the time series database is then used to create the virtual part of the graph. In this way, you optimize storage, reduce costs, and have a very convenient way to query all the data.
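As a sketch, a single SPARQL query over such a hybrid graph can combine both parts transparently. All names below are hypothetical; the materialized and virtual triples are queried in exactly the same way:

```sparql
PREFIX ex:  <http://example.org/ontology#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?label ?time ?value WHERE {
  # Materialized part: sensor metadata stored in the graph database.
  ?sensor a ex:Sensor ;
          ex:label ?label .
  # Virtual part: observations mapped on the fly from the time series store.
  ?obs ex:sensor ?sensor ;
       ex:timestamp ?time ;
       ex:value ?value .
  FILTER (?time >= "2024-01-01T00:00:00Z"^^xsd:dateTime)
}
```

The time filter is pushed down to the time series database, so only the relevant slice of measurements is read at query time.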
You can use the same transformation rules for both ETL and the virtual approach. This allows you to change your mind at any time about how your data enters the knowledge graph. For example, you may decide that you need one of your datasets more often than expected and prefer to have it in the graph database. The data is then moved to the graph database using exactly the same rules as were used for virtualization.
Ontopic Studio
Creating transformation rules can be done in different ways, and the productivity and robustness of these approaches vary widely. At Ontopic, we believe that this software should be accessible not only to specialists but also to people who don't have a deep understanding of semantic technologies. To address this, we have developed Ontopic Studio, an environment that focuses entirely on mapping rule design and is best suited for virtualization at scale.
As we already said, the virtual approach has three important characteristics:
- data is not moved
- data is always as fresh as it is in the data source
- you always work with the latest transformation rule
These characteristics are the main reasons for choosing a virtual knowledge graph as the environment for designing transformation rules. It lets you view the data, map the original data to an ontology, and then test your mapping on the fly. You can easily change your rules and immediately see the impact. This is much more agile than traditional ETL, where you have to rerun the whole ETL process to see the effect of a change.
As I already said, the mapping rules can be used for both ETL and virtualization. They are based on W3C standards and can be exported as R2RML, allowing you to work with many other tools.
Ontopic Studio also provides an environment where multiple people can work on the design of your enterprise knowledge graph. This is important because the people who work with the data usually have a lot of knowledge and you don’t want to miss out on it.
It’s also easy to revisit mappings and see if they behave as expected. The debugging experience is good and it’s simple to communicate where the data comes from.
Another priority for us was to provide a no-code experience. The situations where you need to write even a single line of SQL are very rare. Everything is done through the UI and non-technical users can easily work in the environment.
The Ontotext/Ontopic bundle
To take things even further, we have created a bundle together with Ontotext. The core component is GraphDB, and our virtualization technology enables virtual access to data lakehouses. You can also easily create rules to move data into GraphDB.
An additional benefit is that this is a single product commercialized by Ontotext. It means you only have one point of contact for purchasing and support, making the process much smoother.
To sum it up
Data lakehouses and knowledge graphs are complementary and allow you to build comprehensive solutions for the efficient use of data.
The former handle enterprise-scale data volumes very efficiently, with all the advantages we know, such as easy data loading, a single point of access, consistent access control, and great data processing.
The GraphDB/Ontopic bundle adds to this the well-known power of knowledge graphs to make data easier to explore, analyze, and consume by humans and machines. The flexibility gained by combining virtualization with the graph database allows you to get results much faster and at a lower cost, while staying open to future optimizations.
The reusability of mapping rules not only supports this dual use, but also scales well with data volume and complexity. It reduces development time and error rates, and it favors team collaboration and the inclusion of subject-matter experts in the process of knowledge graph creation.
Originally published at https://www.ontotext.com on June 6, 2024.