GraphDB: Semantic Text Similarity for Identifying Related Terms & Documents

Ontotext
Jul 11, 2019

Databases are very good at answering precise and complex questions. Still, they fail at simple human cognitive tasks such as: “Show me articles similar to the currently selected one” or “Find resources that look very close to this one”.

GraphDB’s Similarity plugin is a promising new feature that brings cognitive awareness to Ontotext’s semantic graph database by leveraging the knowledge graph links. The similarity indices are a fuzzy match heuristic based on statistical semantics, which is particularly useful when retrieving the closest related texts or when grouping a cluster of graph nodes based on their topology.

The plugin integrates the Semantic Vectors library and its underlying Random Indexing algorithm. The core idea behind semantic vector models is that words and concepts are represented by points in a mathematical space. Words that co-occur frequently, documents composed of similar words, and concepts with similar or related meanings are placed close to one another in that space. As a result, using vector distances and other algebraic operations, we can find similar texts and graph nodes in the database.
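The article does not spell out the distance measure, but the standard choice in such vector models (including the Semantic Vectors library) is cosine similarity. As a sketch, for two document or term vectors a and b:

```latex
\operatorname{sim}(\mathbf{a},\mathbf{b})
  = \cos\theta
  = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}}\;\sqrt{\sum_{i=1}^{n} b_i^{2}}}
```

A score of 1 means the two vectors point in the same direction (maximally similar), while a score near 0 means they are essentially unrelated.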

Let’s go through some of the main types of semantic text similarity searches with a simple but representative example.

Data

For this blog post, we will import into GraphDB a subset of the DailyMed database, which contains drug listings with their title, indications text and company, and a subset of PubMed citations for biomedical literature with their title and abstract text.

Preparation

To work quickly with our data, first we need to create an autocomplete index. Our model consists of drugs and citations, each of which has its own title. Now we want to find these objects by autocompleting their URIs based on exact terms found in their titles. We can do this by creating an autocomplete index with the predicates for titles.

Once the index is built, navigating through our data is easy. Now, let’s explore the DailyMed objects for “aspirin”.

Similarity for DailyMed drugs

In the screenshot below, we can see that each DailyMed item is represented by title, indicationText and company. Using similarity indexing, we can now create a vector model by indexing all indication texts and then we can find similar drugs based on the similarity of their indications.

To do that, we go to Explore -> Similarity -> Create Similarity index. The most important part here is the Data Query, which tells the index how to map text to documents. We should also keep in mind that once the index is built, the vector for the document will be created based on the text selected for it in the query. This gives us the flexibility to use different predicates for each index or combine them by concatenating the texts, if necessary.

In our case, we want to model drugs as documents and find similar drug instances based on the similarity of their indications text.
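A Data Query for this setup might look roughly as follows. The index expects the query to bind a ?documentID and a ?documentText variable; the class and predicate IRIs below are placeholders standing in for the ones in the imported DailyMed dataset:

```sparql
# Data Query sketch: each drug is one document, its indications are its text.
# The dailymed IRIs are hypothetical placeholders for the imported vocabulary.
SELECT ?documentID ?documentText WHERE {
    ?documentID a <http://example.org/dailymed/Drug> ;
        <http://example.org/dailymed/indicationText> ?documentText .
}
```

If we wanted a single index over several predicates, we could instead concatenate their values into ?documentText, as mentioned above.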

Here, we should also pay attention to the minfrequency parameter. It filters out terms that occur fewer than three times across all the texts, which removes misspelled words and other noise.

Term to Term

The difference between statistical semantic similarity and full-text search is easily explained with the Term to Term search, which returns similar terms. These are terms that occur frequently in the same context, which indicates that they are similar or very closely related. In other words, semantic similarity treats as similar not only documents that contain the same words, but also documents that contain related terms. This enables a richer search experience.

To explore Term to Term similarity, let’s look for related or similar words to “headache”.
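Expressed in SPARQL, a Term to Term search against a text similarity index looks roughly like this; the query shape follows the plugin’s similarity vocabulary, while the index name drug_indications is an assumption for this example:

```sparql
PREFIX similarity: <http://www.ontotext.com/graphdb/similarity/>
PREFIX similarity-index: <http://www.ontotext.com/graphdb/similarity/instance/>

# Find terms that occur in contexts similar to "headache".
SELECT ?term ?score WHERE {
    ?search a similarity-index:drug_indications ;   # assumed index name
        similarity:searchTerm "headache" ;
        similarity:searchParameters "" ;
        similarity:termResult ?result .
    ?result similarity:value ?term ;
        similarity:score ?score .
}
```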

Term to Document

Term to Document similarity search is more powerful than full-text search because we can find similar documents even when they do not contain exactly the same terms, only similar ones. Each term, document or set of terms has a corresponding vector in the space, and in the end we compare the distances between these vectors.

To explore Term to Document similarity, let’s find all DailyMed documents related to sunburn prevention.
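The corresponding SPARQL uses the same pattern as the Term to Term search, but asks for document results instead of term results (again, the index name is an assumption):

```sparql
PREFIX similarity: <http://www.ontotext.com/graphdb/similarity/>
PREFIX similarity-index: <http://www.ontotext.com/graphdb/similarity/instance/>

# Find documents whose indication texts are close to these search terms.
SELECT ?documentID ?score WHERE {
    ?search a similarity-index:drug_indications ;   # assumed index name
        similarity:searchTerm "sunburn prevention" ;
        similarity:searchParameters "" ;
        similarity:documentResult ?result .
    ?result similarity:value ?documentID ;
        similarity:score ?score .
}
```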

As we can see, this returns a lot of results with high scores, but their URIs do not mean much to us. So it’s good to modify the index by configuring the Search SPARQL query to also fetch the title and indications for each result document.

To do that, we click on View SPARQL Query and then copy and paste it in the SPARQL editor. This allows us to modify the query by fetching the title and indications for each result.

If we want to fetch more data from GraphDB for each similarity document result, we can change the Search Query of the similarity index. To do that, we can create a copy of the existing similarity index and set the following query there:
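A modified Search Query might look roughly like the sketch below. The ?index, ?searchType, ?query, ?parameters and ?resultType variables are bound by the similarity search page itself, and the dailymed predicate IRIs are placeholders for the ones in the imported dataset:

```sparql
PREFIX similarity: <http://www.ontotext.com/graphdb/similarity/>

SELECT ?documentID ?score ?title ?indications WHERE {
    ?search a ?index ;
        ?searchType ?query ;
        similarity:searchParameters ?parameters ;
        ?resultType ?result .
    ?result similarity:value ?documentID ;
        similarity:score ?score .
    # Extra columns for each result; predicate IRIs are placeholders.
    OPTIONAL {
        ?documentID <http://example.org/dailymed/title> ?title ;
            <http://example.org/dailymed/indicationText> ?indications .
    }
}
```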

Now we can see the information in the additional columns for each result in the similarity search page.

By default, 20 results are fetched from the index, but we can change this by modifying the search options.
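For example, the result count can be raised by passing a search parameter in the query; the -numsearchresults flag comes from the underlying Semantic Vectors library, so treat this snippet as an assumption and check the GraphDB documentation for the exact option names:

```sparql
# Sketch: fetch 50 results instead of the default 20.
?search a similarity-index:drug_indications ;   # assumed index name
    similarity:searchTerm "sunburn prevention" ;
    similarity:searchParameters "-numsearchresults 50" ;
    similarity:documentResult ?result .
```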

Document to Document

The Document to Document search provides one of the most interesting scenarios.

For example, we can find drugs similar to a selected one based on vectors built on top of their indications text. Here, we find that drugs like “ibuprofen” share exactly the same indication text among their variations sold by different companies.
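In SPARQL, a Document to Document search starts from a document URI rather than a term; the document URI and index name below are placeholders for this example:

```sparql
PREFIX similarity: <http://www.ontotext.com/graphdb/similarity/>
PREFIX similarity-index: <http://www.ontotext.com/graphdb/similarity/instance/>

# Find documents whose indication text is closest to the selected drug's.
SELECT ?documentID ?score WHERE {
    ?search a similarity-index:drug_indications ;   # assumed index name
        similarity:searchDocumentID <http://example.org/dailymed/drug/ibuprofen-1> ;
        similarity:searchParameters "" ;
        similarity:documentResult ?result .
    ?result similarity:value ?documentID ;
        similarity:score ?score .
}
```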

We can also find drugs with the same indications sold by the same company. Or different drugs with similar indications.

Document to Term

Last but not least, we can use the Document to Term search to find the most relevant terms for a document. These are the more specific terms that occur rarely in other documents and are most significant for this one.

To explore Document to Term similarity, let’s look for the most specific indications that occur with “aspirin” in our document.
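This search combines a document on the input side with terms on the output side; as before, the document URI and index name are placeholders:

```sparql
PREFIX similarity: <http://www.ontotext.com/graphdb/similarity/>
PREFIX similarity-index: <http://www.ontotext.com/graphdb/similarity/instance/>

# Find the terms most characteristic of the selected aspirin document.
SELECT ?term ?score WHERE {
    ?search a similarity-index:drug_indications ;   # assumed index name
        similarity:searchDocumentID <http://example.org/dailymed/drug/aspirin-1> ;
        similarity:searchParameters "" ;
        similarity:termResult ?result .
    ?result similarity:value ?term ;
        similarity:score ?score .
}
```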

Similarity for PubMed abstracts

Now let’s explore the PubMed documents set and find documents similar to a term or a phrase.

The more data we have, the more precise our vector model and search results will be. Building a similarity index on top of the PubMed abstract texts provides a richer semantic model, as the vectors have more context to be trained on.

For example, we can see that “cancer” in the context of PubMed abstract texts has a different similarity context than “cancer” in the context of medication indications. This shows the importance of the data we use for training the vectors.

Daily Med indications

PubMed abstracts

However, more vectors to search through also means slower response times. We can tune the vector dimension and type: a bigger dimension means better results but slower searches, while a smaller dimension is faster but less accurate. Binary vectors are faster still, but require a bigger dimension. How we configure the parameters for our indexes depends on our data (see the GraphDB documentation for more details about the index parameters).

You can find the indexes of these examples at: http://similarity.ontotext.com/similarity.

Quick Takeaway

Although it is easy for humans to decide if two or more texts are related based on the similarity of the words they contain and our cognitive associations, this is not a trivial task for computers. To aid that, GraphDB’s Similarity plugin aims to enrich the RDF graph with different types of semantic similarity indices, based on a highly scalable vector space model.

The plugin allows us to define various indices covering specific types of documents, specific attributes and property paths, etc. Most importantly, it allows us to perform statistical analysis and get more results based on the matching of semantically close terms and documents.

Desislava Hristova

Originally published at https://www.ontotext.com on July 11, 2019.

