Migrating From LPG to RDF Graph Model

Ontotext
May 8, 2024

This post discusses approaches and techniques to migrate an LPG model to an RDF model along with data and query migration.

In the first part of this series, we discussed the basics of LPG and RDF graph models and their pros and cons in building knowledge graphs. The second part of the series discusses how to migrate LPG models to RDF as well as the associated data and queries.

LPG to RDF Migration

Migration from LPG to RDF involves migrating the data model, the data, and the queries. In LPG, there is no schema, which often leads to data integrity issues and to an implicit expectation that a node or edge with a given label will always have certain properties.

In contrast, in an RDF statement (a.k.a. triple) of the form Subject-Predicate-Object, labels are human-readable names attached to “things”, while the predicate is a universally identified “edge type” with associated constraints and multilingual labels. The Object can be a primitive value (a “literal” in RDF terminology) or another resource that can itself be the Subject of other statements.

With RDF it is advisable to leverage OWL, with its familiar vocabulary of classes, inheritance, instances, and properties, and to use SHACL to specify data constraints and enforce data integrity. To translate from LPG to RDF, one needs to obey the logical foundation of RDF, which is that every property and every edge is a logical statement about something. One can attach metadata to node properties, and even meta-metadata about that metadata.

As always, we need to start small, iterate, test, and extend the model based on use cases and requirements without looking to make it complete and exhaustive at the beginning.

Figure 1 shows the Semantic stack and components.

Example Scenario: Movie DB

Let us consider a fragment from the movies domain: the movie “Alice in Wonderland”, with the actor Johnny Depp playing the role of the “Mad Hatter”. Figure 2 shows a side-by-side comparison of the LPG and RDF data models for the movie domain.

Some of the key modeling observations and differences between LPG and RDF include:

  • RDF is fine-grained: 2 nodes and 1 edge in LPG translate to 5 nodes and 6 edges in RDF.
  • The richness and verbosity of RDF allow schema and metadata to live together, accessible through the same query language.
  • Nodes in LPG translate to OWL instance entities while labels are modeled as OWL classes.
  • Node properties in LPG translate to data properties where the Object in the triple is a literal of primitive type.
  • Edges become object properties where the Object could be an individual entity.
  • Edge labels translate to OWL properties.
  • Edge properties can be modeled as statement properties.
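The mapping rules in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: triples are plain tuples, and the `movie:` prefix, the dictionary layout of the LPG data, and the helper names `node_to_triples` and `edge_to_triple` are invented for this example rather than taken from any tool.

```python
# Illustrative sketch: triples are (subject, predicate, object) tuples.
# The "movie:" prefix and the dict layout of the LPG data are assumptions.

def node_to_triples(node):
    """Turn one LPG node into RDF-style triples."""
    subject = f"movie:{node['id']}"
    triples = []
    # Each node label becomes a class membership (rdf:type) statement.
    for label in node["labels"]:
        triples.append((subject, "rdf:type", f"movie:{label}"))
    # Each node property becomes a data property with a literal object.
    for key, value in node["properties"].items():
        triples.append((subject, f"movie:{key}", value))
    return triples


def edge_to_triple(edge):
    """Turn one LPG edge into an object-property triple between two entities."""
    return (f"movie:{edge['start']}", f"movie:{edge['type']}", f"movie:{edge['end']}")


depp = {"id": "JohnnyDepp", "labels": ["Actor"], "properties": {"name": "Johnny Depp"}}
alice = {"id": "AliceInWonderland", "labels": ["Movie"],
         "properties": {"title": "Alice in Wonderland"}}
acted = {"start": "JohnnyDepp", "type": "acted-in", "end": "AliceInWonderland"}

triples = node_to_triples(depp) + node_to_triples(alice) + [edge_to_triple(acted)]
for triple in triples:
    print(triple)
```

Edge properties are left out of this sketch on purpose, since they need statement-level metadata rather than a plain triple.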

Identifiers

In LPG, node and edge identifiers are not universal. In contrast, in RDF everything other than a literal (primitive type), including relationships, has an associated universal identifier: a URI (Uniform Resource Identifier) or IRI (Internationalized Resource Identifier). These can be lengthy, so prefixes are used to shorten them. For example, :Person is used instead of http://www.example.com/ontology/Person.
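As a rough illustration of how a prefixed name relates to its full identifier, the following Python sketch expands a name using a prefix table. The `PREFIXES` table and the `expand` helper are assumptions made for this example; real RDF tooling performs this expansion when parsing.

```python
# Sketch of prefix expansion; the PREFIXES table is an assumption for this example.
PREFIXES = {"": "http://www.example.com/ontology/"}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a prefixed name such as ':Person' into a full IRI."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return prefixes[prefix] + local


print(expand(":Person"))
```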

Node Labels to Classes

A node label in LPG can be translated to an OWL class in RDF. For example, the Person in the movie data can be declared in RDF as a triple like :Person a owl:Class.

Labels in LPG lack semantics and can have multiple intentions. To incorporate context and semantics, we need to create an OWL property to represent a :facet with a label or typed entity like categories from a taxonomy. Alternatively, one can select industry standards like SKOS to model taxonomies and controlled vocabularies in RDF.

An OWL entity can belong to a number of classes and can contain multiple values for a given property. An LPG model with multiple labels that form a conceptual hierarchy is best modeled in OWL by declaring only the most specific class, with the more general classes inferred automatically from the hierarchy. For example, if every Actor is a Person, the declaration in RDF can be

:Actor rdfs:subClassOf :Person

This will automatically infer an Actor to be a Person due to the subclass declaration.
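To make the inference step concrete, here is a minimal Python sketch of what a reasoner does with the subclass declaration. It is not how a triplestore implements reasoning; the tuple representation, the `inferred_types` helper, and the sample instance `:JohnnyDepp` typed as `:Actor` are assumptions for illustration.

```python
def inferred_types(entity, triples):
    """Return all classes of `entity`, following rdfs:subClassOf transitively."""
    superclasses = {}
    for s, p, o in triples:
        if p == "rdfs:subClassOf":
            superclasses.setdefault(s, set()).add(o)
    found = set()
    # Start from the explicitly declared types and walk up the hierarchy.
    todo = [o for s, p, o in triples if s == entity and p == "rdf:type"]
    while todo:
        cls = todo.pop()
        if cls not in found:
            found.add(cls)
            todo.extend(superclasses.get(cls, ()))
    return found


triples = {
    (":Actor", "rdfs:subClassOf", ":Person"),
    (":JohnnyDepp", "rdf:type", ":Actor"),
}
print(sorted(inferred_types(":JohnnyDepp", triples)))
```

Only the most specific class is stated in the data; `:Person` is derived from the hierarchy.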

Node Properties

RDF properties are global. To take advantage of validation and data integrity capabilities, it is best to define constraints in appropriate classes to remove ambiguity for a given property.

For example, the property “favorite” in some cases is the name of a favorite character, but in others, it’s a number indicating how many people marked a movie as a favorite. A semantically clean approach to model it would be to create two properties, for instance, “favorite-character” and “favorite-count”.

Edge to Object Properties

Edges in LPG are translated into object properties in RDF, that is, properties whose values are entities identified with a URI rather than plain values / literals. Object properties in OWL can also be used to specify constraints like domain and range.

An edge in RDF, like LPG, has only one label, which is the URI of the property. However, with OWL one can organize properties hierarchically. For example, one can declare:

:lead-actor-in rdfs:subPropertyOf :acted-in

to separate lead actors from supporting ones.

This needs to be defined only once. With inference enabled, querying for people who “acted-in” some movie will return both lead and supporting actors, while querying for those who “lead-actor-in” will return only the lead actors. This is a powerful concept, not available in LPGs, whereby one can create a hierarchy of properties associated with an entity.
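The effect of a property hierarchy on querying can be sketched as follows. This is a simplified Python model, not a triplestore: `subjects_of` is a hypothetical helper, and `:SomeSupportingActor` is a made-up entity used only to show the difference between the two queries.

```python
def subjects_of(prop, triples):
    """Subjects that have `prop` or any (transitive) sub-property of it."""
    props, changed = {prop}, True
    while changed:
        changed = False
        for s, p, o in triples:
            # Collect every property declared as a sub-property of one we already have.
            if p == "rdfs:subPropertyOf" and o in props and s not in props:
                props.add(s)
                changed = True
    return {s for s, p, o in triples if p in props}


data = {
    (":lead-actor-in", "rdfs:subPropertyOf", ":acted-in"),
    (":JohnnyDepp", ":lead-actor-in", ":AliceInWonderland"),
    (":SomeSupportingActor", ":acted-in", ":AliceInWonderland"),  # hypothetical entity
}

print(sorted(subjects_of(":acted-in", data)))
print(sorted(subjects_of(":lead-actor-in", data)))
```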

Edges in RDF do not have identifiers though edge labels are URIs. To refer to an edge in a query, one has to spell out the whole relationship with both of its ends. This may be verbose, but it preserves the intuition that the essence of a relationship is the whole of: subject, predicate, and object — an irreducible construct.

Properties on Edges with RDF-Star

Properties on edges have been a standard feature of LPGs, while this capability is a relatively new addition to the Semantic Web standards with RDF-Star.

Unlike in LPG, metadata attached to relationships in RDF is not limited to primitive types. With RDF-Star one can attach a property of any data type to an edge, including one whose value is another entity. For example, there can be an edge property “introduced by”, whose value is the node in the graph representing another entity.

An RDF relationship can freely participate in other relationships, recursively and at any nesting depth. This is extremely valuable for provenance modeling, a common need in enterprise data architectures.

For example, the role an actor plays in a movie, such as the character details, can be modeled as a property of the acted-in relationship, which can be expressed as:

<< :BenAffleck :acted-in :BatmanVsSuperman >> :as-character :Batman

The relationship becomes a node in a “meta graph”.
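One way to picture a statement acting as a node is to let a triple appear in the subject position of another triple. The Python sketch below does this with nested tuples. It illustrates the idea only and is not an RDF-Star implementation; the `annotations` helper and the `:SomeProducer` entity are assumptions for the example.

```python
# A quoted triple is modeled as a tuple that is itself used as a subject.
acted = (":BenAffleck", ":acted-in", ":BatmanVsSuperman")

graph = {
    acted,                                       # the plain statement
    (acted, ":as-character", ":Batman"),         # metadata about the statement
    (acted, ":introduced-by", ":SomeProducer"),  # hypothetical entity-valued metadata
}


def annotations(statement, graph):
    """Return the metadata attached to a statement as a dict."""
    return {p: o for s, p, o in graph if s == statement}


print(annotations(acted, graph))
```

The metadata values here are entities rather than primitives, which is the point the section makes about RDF-Star.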

LPG-based models often blow up in size because they cannot use entities as metadata values on a relationship, which requires workarounds to model such scenarios appropriately.

Edge to Data Properties

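Edge properties are converted to data properties attached to RDF statements. For example, with concrete identifiers in place of the actor and the movie (INSERT DATA does not accept variables), the SPARQL update for this would be:

INSERT DATA { << :SomeActor :acted-in :SomeMovie >> :salary 1000000 }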

The declarative aspect of RDF ensures that one never has to worry about inserting duplicate data in the knowledge graph. Inserting the same triple in a triple store is a NoOp.
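The no-op behavior follows from a graph being a set of statements. A minimal Python sketch, assuming a set of tuples stands in for the triple store and `insert` is a hypothetical helper:

```python
store = set()


def insert(triple):
    """Add a triple; return True only if the store actually changed."""
    before = len(store)
    store.add(triple)
    return len(store) > before


first = insert((":JohnnyDepp", ":acted-in", ":AliceInWonderland"))
second = insert((":JohnnyDepp", ":acted-in", ":AliceInWonderland"))
print(first, second, len(store))
```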

Query Migration

Cypher is the query language for LPG and SPARQL is the query language for RDF; both are pattern-oriented. The meat of a query is the graph pattern: a graph structure with variables that are instantiated by matching the pattern against the data, plus the variables to return as results. Everything that can be done in Cypher can be done losslessly in SPARQL.

The basic structure of a Cypher query is:

MATCH <graph pattern> RETURN <result>

And the equivalent form in SPARQL:

SELECT <result> WHERE { <graph pattern> }

The graph pattern is a subgraph where nodes or edges are unknowns, represented as variables that are referred to in the result portion of the query. Both Cypher and SPARQL offer constructs to create the graph pattern.

Example Query Conversion

Let’s take a Cypher query and build an equivalent one in SPARQL.

The objective of our example query in plain English is to find all male actors who have played the same character in a movie and who have also worked with the same director while earning at least a seven-figure salary.

Actors are the main entities of interest, and the constraints are applied through their relationships with other things in the graph and through their properties. Here, for example, we are trying to match pairs of actors by some common aspects. Hence the pattern needs two variables referring to two distinct entities of type actor.

Cypher Query

In Cypher, this looks like:

MATCH (a1:Actor {gender:"male"})-[:HAS_ROLE]->(r1:Role)-[:FOR_CHARACTER]->(c:Character)
      <-[:FOR_CHARACTER]-(r2:Role)<-[:HAS_ROLE]-(a2:Actor {gender:"male"}),
      (a1)-[:ACTED_IN]->()-[:DIRECTED_BY]->(d),
      (a2)-[:ACTED_IN]->()-[:DIRECTED_BY]->(d)
WHERE r1.salary > 1000000 AND r2.salary > 1000000
RETURN a1, a2

using the variables a1 and a2 for the two actors, r1 and r2 for their respective roles, c for the common character and d for the common director. The first portion of the pattern is a long path connecting actor a1 to actor a2 via their roles of playing c. Then we list the linkage to a common director separately for each actor.

SPARQL Query

An actor playing a given character in a movie is expressed in a fundamentally different way in RDF.

The SPARQL version of the query looks like:

select ?a1 ?a2 where {
  ?a1 a :Actor ; :gender "male" .
  ?a2 a :Actor ; :gender "male" .
  << ?a1 :acted-in ?m1 >> :as-character ?character ; :salary ?a1_salary .
  << ?a2 :acted-in ?m2 >> :as-character ?character ; :salary ?a2_salary .
  ?director ^:directed-by/^:acted-in ?a1, ?a2 .
  filter (?a1_salary > 1000000 && ?a2_salary > 1000000)
}

Note that in Cypher, variables appear on the LHS of a colon with their label on the RHS. In SPARQL, variables are syntactically recognized from the ? prefix and labels don’t have any special status.

The colon in front of the various names denotes the empty (default) prefix. In a query with multiple namespaces, one would spell out the prefixes, for example movies:Actor instead of :Actor, movies:as-character instead of :as-character, and so forth.

Query Migration Analysis

In both queries, a graph pattern is a sequence of constructs “and-ed” together, each describing the form of a graph edge or a node or a longer path. In Cypher, they are separated by a comma and in SPARQL by a dot.

In Cypher, the node properties are specified in curly braces:

(a1:Actor {gender:"male"})

while, in SPARQL, they are specified as a statement. Conceptually, RDF unifies attributes and relationships, and both are properties stated as subject-predicate-object statements.

The semicolon ‘;’ in SPARQL lets one list multiple properties for a given subject. For instance, the following constrains the variable ?a1 to be something of type Actor and also to have “male” as the value of the property gender:

?a1 a :Actor; :gender "male"

To specify an edge in Cypher, one uses dashes and arrows to specify the graph structure and the directionality of the relationship:

(a1)-[:HAS_ROLE]->(r1)-[:FOR_CHARACTER]->(c1)

or equivalently

(c1)<-[:FOR_CHARACTER]-(r1)<-[:HAS_ROLE]-(a1)

In SPARQL, this chaining looks like:

?a1 :has-role/:for-character ?c1

or equivalently, using ^ to reverse the direction of the edge

?c1 ^:for-character/^:has-role ?a1

In Cypher, the direction of the edge is indicated with the arrow direction <- or -> while in SPARQL the caret ^ is used to reverse the direction of a property.
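To show what a sequence path with an optional inverse step means operationally, here is a small Python sketch that walks such a path over a set of tuples. It covers only the `/` and `^` operators discussed above, and `follow` and `eval_path` are hypothetical helpers, not part of any SPARQL engine.

```python
def follow(start_nodes, step, triples):
    """Follow one path step; a leading '^' walks the edge backwards."""
    inverse = step.startswith("^")
    prop = step.lstrip("^")
    result = set()
    for s, p, o in triples:
        if p != prop:
            continue
        if not inverse and s in start_nodes:
            result.add(o)
        elif inverse and o in start_nodes:
            result.add(s)
    return result


def eval_path(start, path, triples):
    """Evaluate a '/'-separated sequence path from a single start node."""
    nodes = {start}
    for step in path.split("/"):
        nodes = follow(nodes, step, triples)
    return nodes


triples = {
    (":a1", ":has-role", ":r1"),
    (":r1", ":for-character", ":c1"),
}

print(eval_path(":a1", ":has-role/:for-character", triples))
print(eval_path(":c1", "^:for-character/^:has-role", triples))
```

Both calls traverse the same two edges, once forwards from the actor and once backwards from the character.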

Reading carefully through the SPARQL query reveals that in the SPARQL version the path ?a1 :has-role/:for-character ?c1 is absent.

This is because the RDF model does not need the extra role node. It leverages the RDF-Star meta-level capability to treat edges as nodes and link them to anything required.

In this case, to match the character that must be played by both actors, the graph pattern in the LPG version requires two separate role nodes that link to the same character, and the query is “ASCII drawing” the whole thing.

The SPARQL version relies on modeling the variation and does not need the intermediate role node, as that criterion can be specified as metadata on the acted-in edge itself.

Filtering conditions are similar in both languages: they filter out matches that do not satisfy the logical expression. Note that in SPARQL, since everything is a graph pattern, separate variables have to be bound for the salaries. Finally, it is worth mentioning that both Cypher and SPARQL support grouping and aggregation functions, as SQL does.
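The logic of the example query can also be traced in plain Python over a toy dataset in which character and salary are stored as metadata on the acted-in edge. All names and figures below are invented for illustration. Unlike the queries above, this sketch also skips self-pairs and mirrored duplicates.

```python
# Toy data: every identifier and salary here is hypothetical.
MALE_ACTORS = {":ActorA", ":ActorB", ":ActorC"}
DIRECTED_BY = {":Movie1": ":DirectorX", ":Movie2": ":DirectorX", ":Movie3": ":DirectorY"}

# (actor, movie) -> metadata on the acted-in edge
ACTED_IN = {
    (":ActorA", ":Movie1"): {"character": ":Hero", "salary": 2_000_000},
    (":ActorB", ":Movie2"): {"character": ":Hero", "salary": 1_500_000},
    (":ActorC", ":Movie3"): {"character": ":Hero", "salary": 3_000_000},
}


def directors(actor):
    """Directors of all movies the actor appeared in."""
    return {DIRECTED_BY[movie] for (a, movie) in ACTED_IN if a == actor}


def matching_pairs(min_salary=1_000_000):
    pairs = set()
    for (a1, _m1), e1 in ACTED_IN.items():
        for (a2, _m2), e2 in ACTED_IN.items():
            if a1 >= a2:
                continue  # skip self-pairs and mirrored duplicates
            if a1 not in MALE_ACTORS or a2 not in MALE_ACTORS:
                continue
            if e1["character"] != e2["character"]:
                continue
            if e1["salary"] <= min_salary or e2["salary"] <= min_salary:
                continue
            if directors(a1) & directors(a2):  # at least one common director
                pairs.add((a1, a2))
    return pairs


print(matching_pairs())
```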

Data Migration

Triplestore vendors offer tools for mapping and importing data into the RDF model. Ontotext provides Ontotext Refine, a GUI-based drag-and-drop tool to map headers from CSV and other file types to an ontology. It lets users preview results and automatically generates SPARQL queries for data loading.

The open source CLI tool Tarql treats rows in the input file as a pattern match. The first line of the file gives the names of variables and then each row instantiates their values.

For example, take a CSV file of nodes with labels and properties as shown below:

"id","labels","born","name","released","tagline","title","_start","_end","_type"
"1",":Person","1964","Keanu Reeves","","","",,,
"2",":Person","1967","Carrie-Anne Moss","","","",,,
"3",":Person","1961","Laurence Fishburne","","","",,,
...
"128",":Movie","","","1999","Walk a mile you'll never forget.","The Green Mile",,,
"92",":Movie","","","2006","Based on the extraordinary true story of one man's fight for freedom","RescueDawn",,,
...

The Tarql file specifies the mapping of the nodes and labels to instances of OWL classes shown in the code below.

# movies.tarql file
prefix movie: <http://www.example.com/ontologies/movie/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Turn nodes with label :Person into instances of the class Person
construct {
  ?thing a movie:Person ;
         rdfs:label ?name .
}
where {
  filter (?labels = ":Person")
  bind (tarql:expandPrefixedName(concat("movie:", ?id)) as ?thing)
}

# Turn nodes with label :Movie into instances of the class Movie
construct {
  ?thing a movie:Movie ;
         rdfs:label ?title .
}
where {
  filter (?labels = ":Movie")
  bind (tarql:expandPrefixedName(concat("movie:", ?id)) as ?thing)
}

The RDF structure is specified in the construct portion. This pattern is instantiated as the output, for every row in the input CSV file. Variables (prefixed with ?) are taken from the first row of the CSV file (the header), while others such as ?thing are created in the where clause. The BIND clause constructs a URI from the id column in the input by using the prefixes declared at the beginning of the file.

Once the LPG graph is extracted to a CSV file, this translation/mapping from LPG to RDF model can be easily done using queries as shown above.
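For readers who want to see the row-by-row logic without Tarql, the same mapping can be approximated with Python's standard csv module. This is a sketch under assumptions: the `LABEL_MAP` table, the `rows_to_triples` helper, and the shortened sample (without the `_start`, `_end` and `_type` columns) are made up for the example, and the output is a list of tuples rather than serialized RDF.

```python
import csv
import io

BASE = "http://www.example.com/ontologies/movie/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Maps an LPG label to (class local name, column holding the display label).
LABEL_MAP = {":Person": ("Person", "name"), ":Movie": ("Movie", "title")}

SAMPLE = '''"id","labels","born","name","released","tagline","title"
"1",":Person","1964","Keanu Reeves","","",""
"128",":Movie","","","1999","Walk a mile you'll never forget.","The Green Mile"
'''


def rows_to_triples(csv_text):
    """Instantiate a type triple and a label triple for every mapped row."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        mapping = LABEL_MAP.get(row["labels"])
        if mapping is None:
            continue  # rows whose label has no mapping are skipped
        cls, label_column = mapping
        thing = BASE + row["id"]  # build the IRI from the id column
        triples.append((thing, RDF_TYPE, BASE + cls))
        triples.append((thing, RDFS_LABEL, row[label_column]))
    return triples


for triple in rows_to_triples(SAMPLE):
    print(triple)
```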

Main Takeaways

Migrating an LPG to RDF is relatively straightforward. It starts with migrating the existing LPG domain model to RDF leveraging industry standards and RDF capabilities to establish the conceptual backbone. Thereafter the data is migrated from an LPG store to a triple store. Finally the Cypher queries are migrated to SPARQL.

Migrating the model consumes most of the time and resources, and requires RDF semantic engineers, data modelers, and domain experts. Data migration is achieved with tools associated with the triple stores, while query migration requires semantic engineers to rewrite the queries and test them semantically.

Sumit Pal, Strategic Technology Director at Ontotext

Originally published at https://www.ontotext.com on May 8, 2024.
