SHACL-ing the Data Quality Dragon I: the Problem and the Tools

Ontotext
Nov 10, 2023

Learn how SHACL can help you tame your unruly data graphs

RDF has matured beyond the restrictive and controlled environment of academia. It is common nowadays to see environments where various actors have access to the same database and can write their own data. Cross-institutional efforts such as the Transparency Energy Knowledge Graph (TEKG) depend on the interoperability of multiple datasets owned by different enterprises.

While everyone may subscribe to the same design decisions and agree on an ontology, there may be differences in the data quality. Sometimes there is no room for error. In such situations, data must be validated. The most popular solution to this problem in the RDF world is SHACL, the SHApes Constraint Language.

“The old ways” — how validation used to work

Prior to the introduction of SHACL, validation was mostly manual or reliant on OWL (Web Ontology Language) constraints. OWL constraints, however, are counterintuitive. They are designed for reasoning and follow the open-world assumption: in addition to the data we have, there may be other data that we have not seen yet. As a result, data that appears to break a constraint is not necessarily reported as a violation. Consider the following constraint:

foaf:name owl:maxCardinality 1

On the surface, this states that each subject can have at most one foaf:name value. However, a reasoner does not read it as a validation rule. It reads it as a fact about the world and uses it to draw further conclusions: faced with two values for a property limited to one, for example, a reasoner may conclude that both values denote the same thing instead of reporting an error. Some reasoners can also check data for compliance with OWL constraints, but there is no standardized way to report the violations.
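
For reference, OWL itself attaches a cardinality limit to a class through an owl:Restriction rather than directly to the property. The following is a minimal sketch in Turtle of how the shorthand above could be spelled out. It assumes the limit is meant for foaf:Person, which the original line does not state.

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Assumption: the limit applies to foaf:Person.
# This is an axiom that a reasoner uses for inference, not a validation rule.
foaf:Person rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty foaf:name ;
    owl:maxCardinality "1"^^xsd:nonNegativeInteger
] .
```

How a given reasoner reacts to data that goes against such an axiom, and whether it reports anything at all, depends on the reasoner.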

In such an environment, most tools relied on workarounds. A restrictive data-input frontend can be a cheap solution, but it is hard to enforce when users can write SPARQL or RDF directly. This gave rise to SPARQL-based constraints: a data engineer would write a SPARQL query (typically one that selects invalid values), and it would be evaluated periodically or run automatically on each transaction. Such setups were implementation-specific, which kept the approach from becoming scalable and reusable. Since then, SHACL has come to the forefront as a W3C standard. Its key strength is that it is declarative, which makes it more portable.

Nowadays, SHACL is the standard for RDF validation. There are tools for running SHACL on the client side and server side, for specific datasets, but also incrementally on a database. Ontotext’s RDF database engine GraphDB mostly takes the latter approach to ensure smooth operation for large datasets.

Structure of a SHACL shape

Just like OWL, SHACL is itself expressed as RDF data. Each constraint is defined with a few triples. Constraints are grouped into a SHACL shape. Shapes are very similar to classes in OWL. Each shape is described with a target that shows what its constraints apply to.

By far, the most popular target type is sh:targetClass. This closely corresponds to the owl:Class or rdfs:Class concepts. With inference, SHACL can also employ the class hierarchy to apply constraints for a root class on its subclasses.

Let’s build a very simple test dataset. It will consist of two things — the data to be validated and the shapes containing the validation constraints.

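@prefix ex:   <http://example.org/> .   # assumed namespace, for illustration
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

ex:Alice a foaf:Person ;
    foaf:name "Alice" ;
    ex:age "two" .
ex:Bob foaf:knows ex:Alice .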

This states that we have one person, identified as ex:Alice, an instance of the class foaf:Person. Her name is Alice and her age is the string "two". A second node, ex:Bob, knows Alice, but nothing else is said about him.

Target

Even with such a simple model, we have several different ways in which we can specify that we want to target a particular instance of the class Person for validation.

  • sh:targetNode ex:Alice - this specifically targets the node representing the person Alice. No other instance of Person would be validated. This construct is very efficient if used sparingly, but since it's also very specific, it's rarely used.
  • sh:targetClass foaf:Person - this targets all instances of the class Person (and, as mentioned above, any subclasses). It's the most popular approach to targeting in SHACL.
  • sh:targetSubjectsOf foaf:name - this targets all subjects that have a foaf:name value. It is mostly useful when we want to impose some validation on that specific subject-predicate pair - for example, to validate only those instances of ex:NamedIndividual that have a foaf:name, when an ex:NamedIndividual is not guaranteed to have one.
  • sh:targetObjectsOf foaf:knows - this targets all objects that are known by something else (that is, objects that are the foaf:knows value for any subject). Just like the previous example, this is mostly useful for validating that predicate-object pair.
  • sh:target [ a sh:SPARQLTarget ; sh:prefixes ex: ; sh:select "SELECT ?this WHERE { ?this a foaf:Person . }" ] - an arbitrary SPARQL query specifies the target. This is often called a SHACL-SPARQL target (formally, a SPARQL-based target from the SHACL Advanced Features). It gives you a lot of flexibility at the cost of performance, since executing complex SPARQL can be taxing on the RDF engine.
  • rsx:targetShape [ sh:path rdf:type ; sh:hasValue foaf:Person ] - a less flexible but substantially more performant alternative to SPARQL targets. It comes from the Eclipse RDF4J SHACL Extensions (rsx), which are specific to RDF4J and GraphDB.

For our sample case, we can use sh:targetClass. It is straightforward and performant, and we don't need a more complicated targeting mechanism.
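
To make the core target types concrete, here is a sketch of four shapes that differ only in how they pick their focus nodes. The shape names and the constraints attached to them are made up for illustration, and the ex: namespace is an assumption.

```turtle
@prefix ex:   <http://example.org/> .   # assumed namespace, for illustration
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .

# Validates exactly one node: ex:Alice.
ex:AliceOnlyShape a sh:NodeShape ;
    sh:targetNode ex:Alice ;
    sh:property [ sh:path foaf:name ; sh:minCount 1 ] .

# Validates every instance of foaf:Person (and of its subclasses).
ex:PersonShape a sh:NodeShape ;
    sh:targetClass foaf:Person ;
    sh:property [ sh:path foaf:name ; sh:minCount 1 ] .

# Validates every node that is the subject of a foaf:name triple.
ex:NamedThingShape a sh:NodeShape ;
    sh:targetSubjectsOf foaf:name ;
    sh:property [ sh:path foaf:name ; sh:maxCount 1 ] .

# Validates every node that is the object of a foaf:knows triple.
# There is no sh:path here, so sh:class is checked on the focus node itself.
ex:KnownThingShape a sh:NodeShape ;
    sh:targetObjectsOf foaf:knows ;
    sh:class foaf:Person .
```

Against the test dataset, the first three shapes select ex:Alice, and so does the last one, because she is the object of ex:Bob's foaf:knows triple. ex:Bob is not selected by any of them.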

Validation path

Shapes typically perform validation on the properties related to a given subject. In our example, this would be either Alice’s name or age. This is controlled by the path attribute. In cases where it is missing, validation will be carried out on the subject itself. This is sometimes used for validating the IRI pattern via regular expressions. The paths follow the same patterns you can use in pure SPARQL:

  • sh:path - this option can be used to express single-step paths, but also complex paths using the standard SPARQL path operators. Sequence paths are written as lists in parentheses, such as sh:path ( ex:stepOne ex:stepTwo ). For example, the following construct is used for validating the members of RDF lists: sh:path ( [ sh:zeroOrMorePath rdf:rest ] rdf:first ). The remaining constructs are used inside the value of sh:path, as in sh:path [ sh:inversePath foaf:knows ].
  • sh:inversePath - equivalent to the ^ operator in SPARQL. Validates the subject of a given triple.
  • sh:alternativePath - equivalent to the | operator in SPARQL. Takes a list of paths as an object, for example sh:alternativePath ( foaf:name rdfs:label ).
  • sh:zeroOrMorePath - equivalent to the * operator in SPARQL.
  • sh:oneOrMorePath - equivalent to the + operator in SPARQL.
  • sh:zeroOrOnePath - equivalent to the ? operator in SPARQL.

For our test set, we can simply use sh:path. The complexity of the data is limited, so no complex paths are necessary.
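
For comparison, the path constructs listed above could be written out as follows. This is only a sketch: ex:nickname and ex:tags are hypothetical properties, the constraints are placeholders, and the ex: namespace is an assumption. The SPARQL equivalent of each path is given in the comments.

```turtle
@prefix ex:   <http://example.org/> .   # assumed namespace, for illustration
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

ex:PathExamplesShape a sh:NodeShape ;
    sh:targetClass foaf:Person ;

    # Single step: foaf:name
    sh:property [ sh:path foaf:name ; sh:datatype xsd:string ] ;

    # Sequence: foaf:knows/foaf:name (names of the people this person knows)
    sh:property [ sh:path ( foaf:knows foaf:name ) ; sh:datatype xsd:string ] ;

    # Inverse: ^foaf:knows (whoever knows this person)
    sh:property [ sh:path [ sh:inversePath foaf:knows ] ; sh:nodeKind sh:IRI ] ;

    # Alternative: foaf:name|ex:nickname (values of either property)
    sh:property [ sh:path [ sh:alternativePath ( foaf:name ex:nickname ) ] ; sh:minCount 1 ] ;

    # One or more: foaf:knows+ (everyone reachable through foaf:knows)
    sh:property [ sh:path [ sh:oneOrMorePath foaf:knows ] ; sh:class foaf:Person ] ;

    # List members: ex:tags/rdf:rest*/rdf:first (every item of the ex:tags list)
    sh:property [
        sh:path ( ex:tags [ sh:zeroOrMorePath rdf:rest ] rdf:first ) ;
        sh:datatype xsd:string ;
    ] .
```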

Constraint

A shape is incomplete without some sort of constraint. SHACL has many different options for this. There are several types of constraints:

  • Type constraints - validating the datatype (for literals, like checking that an amount value is an integer), class (for objects, like checking that an employee's reportsTo value is an instance of Employee), or node kind (IRI, blank node, literal, or any combination of two) of a given value. The node constraint (sh:node) also fits this group: a value passes this constraint only if it conforms to the given SHACL shape (for example, not only does an employee need to report to another Employee, but the person they report to must also pass validation against a shape of their own).
  • Cardinality constraints — min and max count. For example, each person needs to have at least one name and at most one date of birth.
  • Range constraints - min and max exclusive and inclusive. Those should be applied only to values that can be ordered, such as numbers and dates. “Has value” and “in” also fit in this category. “Has value” (sh:hasValue) specifies that a subject-predicate pair must have at least the given value among its values. “In” (sh:in) specifies that every value of the pair must come from a limited set of values.
  • String constraints — min and max length, pattern (regular expressions), and language validations.
    -sh:flags - used in conjunction with sh:pattern for passing REGEX flags, such as "i" to ignore case.
    -sh:languageIn - asserts that each value must be a literal with a language tag that matches one of the listed language tags or their subtags. Let's take an example: sh:languageIn ( "en" "mi" ). A value is valid if its language tag is @en, @mi, or a subtag, such as @en-gb. Any value that is not a literal, has no language tag, or has a different language tag, such as @de, is reported as a violation.
    -sh:uniqueLang - a boolean flag that states that you can't have multiple values in the same language. For example, each country should have only one name per language - e.g., "Germany"@en and "Deutschland"@de, but not both "Deutschland"@de and "Germany"@de.
  • Property pair constraints - compare the values of two subject-predicate pairs. The comparison can be equals (sh:equals), does not equal (sh:disjoint), less than (sh:lessThan), or less than or equals (sh:lessThanOrEquals). If the second pair has no values while the first one does, the equality constraint is not satisfied, but the other three are.
  • Logical constraints — not, and, or and xone, used for chaining other constraints in complex rules.
  • Qualified constraints — used to qualify that a specific number of objects need to conform to a given shape.
    -sh:qualifiedValueShape - the value shape to conform to. For example, if we want to define a hand, we can use sh:qualifiedValueShape [ sh:class ex:Finger ] and sh:qualifiedValueShape [ sh:class ex:Thumb ].
    -sh:qualifiedValueShapesDisjoint - when using multiple qualified value shapes, this specifies if a given validation target can fit both groups. In our hand example, a Finger and a Thumb would be disjoint.
    -sh:qualifiedMinCount and sh:qualifiedMaxCount - the minimum and the maximum number of objects that fit the qualified value shape. In the example, a hand would have 4 fingers and 1 thumb.
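
The constraint families above might be combined as in the following sketch. Everything beyond foaf:name and ex:age is hypothetical (ex:email, ex:birthDate, ex:deathDate, ex:digit, the class names, and the chosen limits), and the ex: namespace is an assumption.

```turtle
@prefix ex:   <http://example.org/> .   # assumed namespace, for illustration
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

ex:PersonConstraintsShape a sh:NodeShape ;
    sh:targetClass foaf:Person ;

    # Type, cardinality, and range: at most one integer age between 0 and 150.
    sh:property [
        sh:path ex:age ;
        sh:datatype xsd:integer ;
        sh:maxCount 1 ;
        sh:minInclusive 0 ;
        sh:maxInclusive 150 ;
    ] ;

    # Cardinality and logical: at least one name, each a plain or language-tagged string.
    sh:property [
        sh:path foaf:name ;
        sh:minCount 1 ;
        sh:or ( [ sh:datatype xsd:string ] [ sh:datatype rdf:langString ] ) ;
    ] ;

    # String: a case-insensitive pattern match.
    sh:property [
        sh:path ex:email ;
        sh:pattern "^[a-z0-9.]+@[a-z0-9.]+$" ;
        sh:flags "i" ;
    ] ;

    # Language: every label is tagged en or mi, with at most one label per language.
    sh:property [
        sh:path rdfs:label ;
        sh:languageIn ( "en" "mi" ) ;
        sh:uniqueLang true ;
    ] ;

    # Property pair: every birth date is earlier than every death date.
    sh:property [ sh:path ex:birthDate ; sh:lessThan ex:deathDate ] .

# Qualified: a hand has exactly four fingers and exactly one thumb,
# and no digit may be counted as both.
ex:HandShape a sh:NodeShape ;
    sh:targetClass ex:Hand ;
    sh:property [
        sh:path ex:digit ;
        sh:qualifiedValueShape [ sh:class ex:Finger ] ;
        sh:qualifiedValueShapesDisjoint true ;
        sh:qualifiedMinCount 4 ;
        sh:qualifiedMaxCount 4 ;
    ] ;
    sh:property [
        sh:path ex:digit ;
        sh:qualifiedValueShape [ sh:class ex:Thumb ] ;
        sh:qualifiedValueShapesDisjoint true ;
        sh:qualifiedMinCount 1 ;
        sh:qualifiedMaxCount 1 ;
    ] .
```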

There are several constraints that we could use for the test case. For our example, we’ll constrain ourselves to sh:datatype.

Putting it all together

In the previous three sections, we have described how to select a specific instance for validation, how to target its properties, and how to set the constraints that need to be satisfied. This is all that is needed for a validation case.

In our small test dataset, we had a somewhat dubious age set for ex:Alice - the string "two". Strings are not a good representation of numerical data. A much more appropriate datatype would be an integer, so we will add that as a constraint.

This gives us the following SHACL definition:

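@prefix ex:   <http://example.org/> .   # assumed namespace, for illustration
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

ex:ExampleOne a sh:NodeShape ;
    sh:targetClass foaf:Person ;
    sh:property [
        sh:path ex:age ;
        sh:datatype xsd:integer ;
    ] .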

This is enough to work with a SHACL engine of our choosing.
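
Running this shape against the test dataset should flag Alice's age, since "two" is a plain string and not an xsd:integer. ex:Bob is not checked at all, because he is not declared to be a foaf:Person. A validation report for this case would look roughly like the sketch below. The exact output differs between engines, which normally also add sh:sourceShape and often a sh:resultMessage, and the ex: namespace is an assumption.

```turtle
@prefix ex: <http://example.org/> .   # assumed namespace, for illustration
@prefix sh: <http://www.w3.org/ns/shacl#> .

[] a sh:ValidationReport ;
    sh:conforms false ;
    sh:result [
        a sh:ValidationResult ;
        sh:resultSeverity sh:Violation ;
        sh:focusNode ex:Alice ;        # the node that was validated
        sh:resultPath ex:age ;         # the property that failed
        sh:value "two" ;               # the offending value
        sh:sourceConstraintComponent sh:DatatypeConstraintComponent ;
    ] .
```

Changing the data to ex:age 2 (an integer literal in Turtle) would make the report conform.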

Non-validating properties

Besides the constraints previously listed, there are additional ways to control the validation. Of major interest are ways to change the RDF assumption from an open-world to a closed-world one. In a closed-world environment — such as standard relational databases — each property that is not explicitly defined is considered a violation.

A good example would be a tightly scoped object, such as a study about a new drug: a study should have a sponsor, a supervisor, and at least one participant, description, and subject, and nothing else. This can be set using sh:closed. Since we don't necessarily want to write constraints for every property that may appear on a valid study, the sh:closed component works together with sh:ignoredProperties. The ignored properties component points to a list of properties that don't have any validations associated with them but are still allowed on a closed shape.
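
The study example could be sketched as the closed shape below. The ex:Study class and its property names are hypothetical, as is the ex: namespace. rdf:type is listed under sh:ignoredProperties because every targeted study necessarily has it, while no property shape mentions it.

```turtle
@prefix ex:  <http://example.org/> .   # assumed namespace, for illustration
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sh:  <http://www.w3.org/ns/shacl#> .

ex:StudyShape a sh:NodeShape ;
    sh:targetClass ex:Study ;
    sh:closed true ;                       # any property not listed below is a violation
    sh:ignoredProperties ( rdf:type ) ;    # allowed, but not otherwise constrained
    sh:property [ sh:path ex:sponsor ;     sh:minCount 1 ; sh:maxCount 1 ] ;
    sh:property [ sh:path ex:supervisor ;  sh:minCount 1 ; sh:maxCount 1 ] ;
    sh:property [ sh:path ex:participant ; sh:minCount 1 ] ;
    sh:property [ sh:path ex:description ; sh:minCount 1 ] ;
    sh:property [ sh:path ex:subject ;     sh:minCount 1 ] .
```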

A shape can be deactivated using sh:deactivated. Since this lets you keep information on the shape, it can be more useful than deleting it outright. If you are validating some newly imported data and getting alerts about hundreds of invalid values, temporarily deactivating some of your shapes makes it easier to focus on certain types of errors before you do the final cleanup of the dataset.

The severity of a shape can also be set. sh:severity controls how a failure is labeled in the validation report: a shape can have a severity of sh:Info, sh:Warning, or sh:Violation (the default). Different severity settings can be valuable when analyzing collections of reports for patterns. In incremental use cases, sh:severity is not usable, as the database should always reject violations. The report also contains a message, which can be configured using sh:message. If it is not set, you will get the engine's default messages for constraint violations.
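
As a sketch of these settings, the first shape below downgrades the age check to a warning with a custom message, and the second one is switched off without being deleted. The shape names and the message text are made up for illustration, and the ex: namespace is an assumption.

```turtle
@prefix ex:   <http://example.org/> .   # assumed namespace, for illustration
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Reported as a warning, with a custom message instead of the engine default.
ex:AgeShape a sh:NodeShape ;
    sh:targetClass foaf:Person ;
    sh:property [
        sh:path ex:age ;
        sh:datatype xsd:integer ;
        sh:severity sh:Warning ;
        sh:message "Age should be an integer."@en ;
    ] .

# Kept in the shapes graph but skipped during validation.
ex:NameShape a sh:NodeShape ;
    sh:targetClass foaf:Person ;
    sh:deactivated true ;
    sh:property [ sh:path foaf:name ; sh:minCount 1 ] .
```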

Finally, there are a few properties intended for describing and grouping the shapes better. Those are sh:name, sh:description, sh:group, sh:order, and sh:defaultValue. It's important to note that none of these have any effect on the validation. Instead, they provide metadata about the shapes. sh:defaultValue, in particular, has no specific semantics defined. Its implementation and function are left to the specific SHACL engine.
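
These metadata properties are typically attached to property shapes, for instance to drive a form or a documentation page. The following is only a sketch: the group, the labels, and the default value are made up, and the ex: namespace is an assumption.

```turtle
@prefix ex:   <http://example.org/> .   # assumed namespace, for illustration
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# A group that related property shapes can point to.
ex:PersonalDetails a sh:PropertyGroup ;
    rdfs:label "Personal details" ;
    sh:order 0 .

ex:PersonFormShape a sh:NodeShape ;
    sh:targetClass foaf:Person ;
    sh:property [
        sh:path foaf:name ;
        sh:name "name" ;
        sh:description "The person's full name." ;
        sh:group ex:PersonalDetails ;
        sh:order 1 ;
    ] ;
    sh:property [
        sh:path ex:age ;
        sh:datatype xsd:integer ;     # the only line here that affects validation
        sh:name "age" ;
        sh:group ex:PersonalDetails ;
        sh:order 2 ;
        sh:defaultValue 0 ;           # interpretation is left to the engine or UI
    ] .
```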

Conclusion

Knowing the enemy is half the battle, and knowing yourself is the other half. In this post, we have acquainted you with the dragon of invalid data and given you a quick overview of the wide array of SHACL constraints you can apply to combat it.

In the next few posts we’ll look at some validation reports, tips for greater validation efficiency, and architectural approaches to using SHACL validation in a large-scale RDF knowledge graph system. So stay tuned!

Meanwhile, you can download GraphDB and use the documentation to get started doing SHACL validation yourself!

Radostin Nanov, Solution/System Architect at Ontotext

Originally published at https://www.ontotext.com on November 10, 2023.
