SHACL-ing the Data Quality Dragon II: Application, Application, Application!


Applying SHACL to your data and handling the output

Immanuel Kant’s quote “Theory without practice is empty” is especially relevant in applied sciences such as computing. In the first part of this series of technological posts, we talked about what SHACL is and how you can set up validation for your data.

Now, we are diving into the more exciting part — actually putting the theory into practice and seeing the fruits of our labors in a validation report.

Tackling the data quality issue — in bulk or incrementally

There are two main approaches to validating your data, and which one you get depends on the specific implementation: SHACL does not force these details on the engines that implement the standard.

The first and most common approach is bulk validation. Every engine is inherently capable of working in bulk mode, whether on a file or on already ingested data: you present your dataset and your shapes, and validation runs over the entire dataset. The downside is performance. If your data is evolving, you are always re-validating the whole dataset. Engines that offer bulk validation usually have an API for invoking it explicitly.

The less common approach is incremental validation, which is the default setting in Ontotext GraphDB. It follows the transactional approach to adding data: with each update transaction, the shapes are invoked and the relevant data is pulled both from the database and from the transaction context (the data being added or removed by the operation). This makes validation much more efficient, but it means the data inside the database must always be valid, rendering the Warning and Info severities unusable. Engines that offer incremental validation must tie the validation process to their transaction mechanism.
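
To make this concrete, here is a minimal sketch of a SPARQL update that an incremental engine such as GraphDB would reject at commit time, assuming the age shape from the first post is loaded (the ex: namespace is the example one used throughout this post):

PREFIX ex: <https://example.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# The ex:age value is a string, not an xsd:integer, so the
# transaction fails validation and is rolled back.
INSERT DATA {
  ex:Alice a foaf:Person ;
    ex:age "two" .
}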

The validation report

Recall the data and shapes we created in our previous post. When we combine them, we can invoke a SHACL engine’s validation. The output will be a validation report. This report can either be ephemeral or stored inside the database that backs the validation engine, if any. GraphDB follows the first approach: the report is given back as a REST response and logged as an error within the application logs. It is up to you to process this result. Since the report is natively RDF data, you can simply feed it into the database and then process it with SPARQL. Alternatively, some database engines may directly store the report data automatically.

In the best case, you have no errors and the validation report is short. In that case, you’ll get a triple saying that sh:conforms is true. To check whether the validation itself was relevant and performing as expected, you can rely on the sh:shapesGraphWellFormed property, which is true if all the SHACL shapes are syntactically valid, and false otherwise.
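
If you do persist the report, a quick conformance check is a one-line ASK query. Here is a sketch, assuming the report triples have been loaded into a repository and that your engine emits both properties:

PREFIX sh: <http://www.w3.org/ns/shacl#>

# True only if the data conformed and the shapes were well-formed.
ASK WHERE {
  ?report a sh:ValidationReport ;
    sh:conforms true ;
    sh:shapesGraphWellFormed true .
}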

Assuming that something went wrong and you have invalid data, in addition to a sh:conforms value of false, your report will contain a number of sh:ValidationResult instances, associated with the report by a sh:result predicate.

[ a sh:ValidationResult ;
  sh:resultSeverity sh:Violation ;
  sh:sourceConstraintComponent sh:DatatypeConstraintComponent ;
  sh:sourceShape _:n125 ;
  sh:focusNode ex:Alice ;
  sh:value "two" ;
  sh:resultPath ex:age ;
  sh:resultMessage "Value does not have datatype xsd:integer" ;
] .

Validation report results

Let’s break this result down. We have already talked about sh:resultSeverity when we discussed sh:severity in the first part of this series. The sh:sourceConstraintComponent property indicates which constraint was violated; in this case, the problem relates to sh:DatatypeConstraintComponent. The sh:focusNode property points to the instance that triggered the violation, and sh:resultPath identifies, via the predicate chain, the property with the invalid value. The sh:value is the offending object, and sh:resultMessage is a human-readable explanation that you can reuse in end-user error messages. For at least some engines, the report can also contain a sh:defaultValue, if defined, pointing to a value that would not trigger a violation, e.g., "1" in our example.
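
Since all of these are ordinary RDF properties, processing a stored report is plain SPARQL. Here is a sketch that lists every violation with its key fields (the OPTIONALs cover results that omit a path or value):

PREFIX sh: <http://www.w3.org/ns/shacl#>

SELECT ?focusNode ?path ?value ?message WHERE {
  ?result a sh:ValidationResult ;
    sh:focusNode ?focusNode ;
    sh:resultMessage ?message .
  OPTIONAL { ?result sh:resultPath ?path }
  OPTIONAL { ?result sh:value ?value }
}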

The sh:sourceShape property is a bit more complex. It points to the violated SHACL shape, which is typically identified by a blank node. That is fine when the validation results and the SHACL shapes are kept in the same database. However, blank nodes are temporary identifiers that can change, and the results may live physically separate from the shapes. This can be a headache, because another system might use a different identifier for the same blank node, making it difficult to verify that the two refer to the same thing. The resolution is simple: give your property shapes a proper IRI. Here's what our modified shape looks like:

ex:ExampleOne a sh:NodeShape ;
  sh:targetClass foaf:Person ;
  sh:property ex:datatypeShape .

ex:datatypeShape a sh:PropertyShape ;
  sh:path ex:age ;
  sh:datatype xsd:integer .

Now, our result would contain the specific sh:sourceShape identified as ex:datatypeShape. That's much more reliable.
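
One practical payoff of the stable IRI: you can filter a stored report for the results of exactly this shape. A sketch:

PREFIX sh: <http://www.w3.org/ns/shacl#>
PREFIX ex: <https://example.org/>

SELECT ?focusNode ?value WHERE {
  ?result a sh:ValidationResult ;
    sh:sourceShape ex:datatypeShape ;
    sh:focusNode ?focusNode ;
    sh:value ?value .
}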

Pitfalls and implementation specifics

You should now have a working understanding of SHACL. However, it has the benefits and drawbacks of a relatively new standard. This means that each implementation may differ significantly. We have already touched upon a few areas that you need to be aware of, but this is an excellent opportunity to reiterate them.

Incremental vs. bulk validation

Be aware of your engine’s capabilities — bulk validation can save you some overhead, but, with large datasets, it should be a one-off affair, since it always has to process the whole database. On the other hand, incremental validation makes sh:severity irrelevant and adds overhead to each transaction. At the same time, it offers flexibility in how you ingest data.

GraphDB can support either approach. By default, it uses incremental validation. However, you can force bulk validation through the Java Transactions API. Here’s an example of setting bulk validation:

try (RepositoryConnection conn = rep.getConnection()) {
    conn.begin(ShaclSail.TransactionSettings.ValidationApproach.Bulk);
    // ....
    conn.commit();
}

Still, if you don’t use the Java API directly, you can achieve the same result by deleting and reinserting your SHACL schema, or by using the REST API. Incremental validation is triggered by any update transaction, and if that transaction adds a whole schema, it essentially validates the database in bulk.
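
As a sketch of the delete-and-reinsert route: GraphDB, like RDF4J, keeps shapes in a dedicated named graph (http://rdf4j.org/schema/rdf4j#SHACLShapeGraph), so a SPARQL update that clears and repopulates that graph re-triggers validation of the whole database. Assuming that setup:

PREFIX sh: <http://www.w3.org/ns/shacl#>
PREFIX ex: <https://example.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Clearing and re-adding the shapes graph causes the whole
# database to be validated against the re-inserted shapes.
CLEAR GRAPH <http://rdf4j.org/schema/rdf4j#SHACLShapeGraph> ;

INSERT DATA {
  GRAPH <http://rdf4j.org/schema/rdf4j#SHACLShapeGraph> {
    ex:ExampleOne a sh:NodeShape ;
      sh:targetClass foaf:Person ;
      sh:property ex:datatypeShape .
    ex:datatypeShape a sh:PropertyShape ;
      sh:path ex:age ;
      sh:datatype xsd:integer .
  }
}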

Processing reports

It’s rare that a report would be fed directly into the database backing the SHACL engine. When you need to do that, you have to intercept the report from the response. But if you are validating the whole database, the REST response may be prohibitively large. For this reason, GraphDB has sampling capabilities that let you restrict both the overall number of violations reported and the number of results per specific shape. It’s possible to have thousands of results for the same systematic issue propagated through the whole database; this is not useful, and it makes the output of a validation report harder to understand.
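
If you do load a full report into a repository, a grouping query along these lines surfaces the systematic offenders without reading thousands of individual results; a sketch:

PREFIX sh: <http://www.w3.org/ns/shacl#>

SELECT ?shape (COUNT(?result) AS ?violations) WHERE {
  ?result a sh:ValidationResult ;
    sh:sourceShape ?shape .
}
GROUP BY ?shape
ORDER BY DESC(?violations)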

Besides sampling the report and avoiding blank nodes, another thing to keep in mind is that the full report might be confined to the engine logs. This is particularly true if you are using a browser UI for your validation, since browsers can struggle to process a very large report.

SHACL-SPARQL performance

When using SHACL-SPARQL, one of the advanced features of SHACL, make sure the SPARQL queries in your shape validation constraints are efficient. These queries are executed for every instance that matches the target, which can be computationally expensive with complex validations and permissive targets. It can be more beneficial to make the SHACL target as specific as possible — essentially, to only ever target instances that will be violating the constraints.

For example, the SPARQL query used in the following shape only targets foaf:Person instances whose ex:age value is not an integer:

ex: a owl:Ontology ;
  sh:declare [
    sh:prefix "ex" ;
    sh:namespace "https://example.org/"^^xsd:anyURI
  ] ,
  <....>

ex:ExampleOne a sh:NodeShape ;
  sh:target [
    a sh:SPARQLTarget ;
    sh:prefixes foaf: , ex: , xsd: ;
    sh:select """
      select $this where {
        $this a foaf:Person ;
          ex:age ?o .
        FILTER(DATATYPE(?o) != xsd:integer)
      }
    """
  ] ;
  sh:sparql [
    a sh:SPARQLConstraint ;
    sh:prefixes ex: ;
    sh:select """
      select $this ?label ?value where {
        $this rdfs:label ?label ;
          ex:age ?value
      }
    """ ;
    sh:message "The person {?label} has a wrong age, {?value}"
  ] .

As you can see, the actual selection of invalid instances is done by the sh:SPARQLTarget, while the sh:SPARQLConstraint is used to form an error message with the offending value.

Custom and unsupported constraint components

Each SHACL engine may have its own set of custom and unsupported constraint components, so refer to the documentation of your engine of choice. GraphDB supports the rsx and dash extensions, but it does not support comparison constraints. There is also a SHACL test suite you can run against your engine to check its level of compliance with the standard.

Conclusion

In this post, we put the theory of what SHACL is into practice and examined an example validation report. As a starting point for practicing even more, you can use SHACL with GraphDB out of the box by simply creating a SHACL-enabled repository. If you want to develop your shapes quickly in a responsive UI, we can recommend a frontend tool such as the excellent online SHACL playground.

Once you start using SHACL, you can be sure that you are doing the most for your data quality.

Stay tuned for our next post where we’ll get down to the gritty details of how GraphDB does its SHACL validation!

Radostin Nanov, Solution/System Architect at Ontotext

Originally published at https://www.ontotext.com on November 17, 2023.

