Semantic Search with a Pre-Trained Neural Transformer Model Using Document, Sentence and Token Level Embeddings

Some time ago I worked on an enterprise search project, where we were tasked with improving the performance of an enterprise Solr search deployment. We recommended various improvements based on classic NLP techniques. One of the items on the agenda was deep learning language model based semantic search. Unfortunately we never got to it because of time and budgetary constraints.

Recently I got a chance to experiment with the pre-trained BERT Transformer model for semantic search. I experimented with various similarity algorithms for query and document vector embeddings. I will share my findings in this post along with suggestions on how to integrate it with Solr or ElasticSearch to boost performance. The Python script is available in my open source GitHub project avenir.

Semantic Search

Traditional search is lexical and based on literal keyword match. One popular algorithm for the lexical approach is BM25. These literal match based algorithms are incapable of handling the intent of the user and the semantic meaning of the keywords. Semantic search fills those gaps. Generally, semantic search improves recall while sacrificing precision. Semantic search is effective in handling the following.

  • Intent of the user as conveyed through the keywords
  • Conceptual meaning of the keywords

Here is a simple example of what semantic search can achieve. If you type in the keywords “Steve Jobs”, it will fetch documents about Apple and iPhone with high ranks even if the documents don’t explicitly contain the name “Steve Jobs”.

There have been attempts in the past to enhance search engines with ontologies and various NLP techniques, with some success. Our focus in this post will be on using pre-trained deep learning language models for semantic search.

Early Deep Learning Language Models

In deep learning based language modeling, the artifacts of a document, i.e. the document itself, its sentences and its words, have high dimensional vector space representations where semantically related items are close to each other in that vector space.

Some of the early embedding examples are Word2Vec and GloVe. They both use feed forward neural networks. In Word2Vec the context of a word is learnt because a sequence of words is fed as input or output to the network. GloVe learns the co-occurrence matrix of word pairs with a feed forward neural network. The internal representation of a word, in the form of network weights, is the vector space embedding. One advantage GloVe has over Word2Vec is that it also learns global context in addition to the local context learnt by both.

One of the problems with these techniques is that there is only one vector representation of a word. In other words, all contexts for a word are squashed into one global context. Current generation neural models overcome this limitation. Also, the current breed of networks has a more complex architecture, generating better quality embeddings.
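To make the single vector limitation concrete, here is a minimal sketch using the gensim library (my own illustration, not part of the experiments in this post). Whichever sentence the word “apple” appears in, Word2Vec returns the same single vector for it.

# Minimal sketch: static word embeddings with gensim's Word2Vec (illustrative only).
from gensim.models import Word2Vec

# Two very different contexts for the word "apple"
sentences = [
    ["apple", "released", "a", "new", "iphone"],
    ["i", "ate", "a", "ripe", "apple", "for", "breakfast"],
]

# Train a tiny Word2Vec model (gensim 4.x API)
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Word2Vec stores exactly one vector per word, regardless of context
vec = model.wv["apple"]
print(vec.shape)                                  # (50,)
print(model.wv.most_similar("apple", topn=3))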

Transformer based Deep Learning Language Model

The Transformer is a neural architecture pattern for sequence to sequence modeling. It includes an encoder-decoder pair. The encoder and decoder typically have multiple layers. Encoders and decoders can be unidirectional or bidirectional. Instead of using an RNN or LSTM, the Transformer uses an attention mechanism. It’s considered to be superior in quality compared to earlier RNN or LSTM based sequence modeling networks.

Attention is a mechanism to decide, for a given location in a given layer, how attention will be focussed on various locations of the previous layer or the same layer. It’s essentially a sophisticated way of defining context, i.e. what is relevant and what is not. For sequence modeling there are 3 kinds of attention mechanisms, as below (a minimal sketch of the underlying computation follows the list).

  1. Encoder decoder attention
  2. Encoder self attention
  3. Decoder self attention
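Here is a minimal numpy sketch of scaled dot-product attention, the core computation inside all three of these attention mechanisms. It is my own illustration, not code from the post’s script.

# Minimal sketch of scaled dot-product attention with numpy (illustrative only).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d), K: (n_k, d), V: (n_k, d_v). Returns (n_q, d_v)."""
    d = Q.shape[-1]
    # Attention scores: how much each query position attends to each key position
    scores = Q @ K.T / np.sqrt(d)
    # Softmax over the key dimension turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum of values: the context-aware representation
    return weights @ V

# Self attention: queries, keys and values all come from the same sequence
x = np.random.rand(5, 8)   # 5 token embeddings of size 8
out = scaled_dot_product_attention(x, x, x)
print(out.shape)           # (5, 8)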

For language modeling problems, pre-trained Transformer models are typically used. Since they generate a language model, the decoder of the Transformer is not necessary. These models are pre-trained on vast publicly available corpora. The three most popular models are BERT from Google, the GPT family of models from OpenAI and ELMo from AllenNLP. ELMo, unlike the other two, uses LSTM.

We will be using the pre-trained BERT model for our experiments. Since BERT does language modeling, the decoder is not necessary. Instead it has a classification layer on top of the encoder output, used for training. BERT, as the name suggests, is bidirectional, i.e. it learns from the left as well as the right context of a word. BERT uses two kinds of training, as below.

  1. Some words are masked in the input and predicted in the output
  2. In sentence pair training, for some pairs the next sentence is replaced with a random sentence, and whether the second sentence actually follows the first is predicted in the output

Both training techniques are unsupervised, since no labeled data is used. The second training technique in the list has chatbot and question answering (QA) use cases as its objective.
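As a quick illustration of the first (masked word) objective, the Hugging Face transformers library, an alternative packaging to the spaCy one used later in this post, exposes a fill-mask pipeline on top of a pre-trained BERT model. This is purely illustrative and not part of the post’s script.

# Minimal sketch of BERT's masked word prediction using the Hugging Face
# transformers library (illustrative only).
from transformers import pipeline

# fill-mask runs BERT's masked language modeling head
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts likely words for the [MASK] position from both left and right context
for prediction in unmasker("Steve Jobs was the founder of [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))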

Semantic Search with BERT

We can think of the semantic search problem as follows. By applying BERT encoding to a query and a document we get two tensors as output. The shape of the query tensor is (q, 768) where q is the number of tokens in the query string. Similarly, the shape of the document tensor is (d, 768) where d is the number of tokens in the document. In the BERT model used here, the encoded vector size for a token is 768. The problem boils down to finding the similarity between the two tensors.
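Here is a minimal sketch of how those two tensors can be obtained. I use the Hugging Face transformers library for illustration, whereas the post’s script uses BERT as packaged by spaCy, described later.

# Minimal sketch: obtaining query and document token embedding tensors from a
# pre-trained BERT model with the Hugging Face transformers library (illustrative;
# the post's script uses BERT as packaged by spaCy).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def encode(text):
    """Return a (num_tokens, 768) tensor of contextual token embeddings."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state has shape (1, num_tokens, 768); drop the batch dimension
    return outputs.last_hidden_state.squeeze(0)

query_tensor = encode("iphone sale")                                   # shape (q, 768)
doc_tensor = encode("Apple reported strong iPhone sales last quarter") # shape (d, 768)
print(query_tensor.shape, doc_tensor.shape)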

We need to find the similarity between the two tensors, which will be the search score of the document for the given query. We can find similarity by considering various levels of granularity of a document, as below.

  1. document
  2. sentence
  3. word

A document will have multiple sentences. You can think of the query string as just one sentence. I have tried the following similarity calculation algorithms for the query and document tensors at various levels of granularity (a sketch of a few of them follows the list).

  1. Average document level: Similarity is calculated between the whole query and the whole document. Token tensors are averaged for both query and document
  2. Average sentence level: Similarity is found between the query sentence and all sentences in the document and then averaged
  3. Median sentence level: Similarity is found between the query sentence and all sentences in the document and then the median is taken
  4. Max sentence level: Similarity is found between the query sentence and all sentences in the document and then the maximum is taken
  5. Max token level: Similarity is found between the query tokens and all tokens in the document and then the maximum is taken
  6. Average token level: Similarity is found between the query tokens and all tokens in the document and then the average is taken
  7. Median token level: Similarity is found between the query tokens and all tokens in the document and then the median is taken
  8. Max average token level: Similarity is found between a query token and all tokens in the document and then averaged. This is repeated for all query tokens. The maximum of the averages is taken
  9. Average max token level: Similarity is found between a query token and all tokens in the document and then the maximum is taken. This is repeated for all query tokens. The average of the max values is taken
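Here is a minimal sketch of a few of these algorithms, using cosine similarity over the token embedding tensors. It is my own illustration of the ideas above, not the exact code from the script.

# Minimal sketch of a few of the similarity algorithms above, using cosine
# similarity over token embedding tensors (illustrative, not the script's exact code).
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def avg_document_level(query, doc):
    # Average all token vectors of query and document, then compare
    return cosine(query.mean(axis=0), doc.mean(axis=0))

def avg_token_level(query, doc):
    # Average of all pairwise token similarities
    return float(np.mean([cosine(q, d) for q in query for d in doc]))

def max_avg_token_level(query, doc):
    # For each query token, average its similarity to all document tokens, then take the max
    per_query = [np.mean([cosine(q, d) for d in doc]) for q in query]
    return float(np.max(per_query))

# query and doc are (num_tokens, 768) arrays, e.g. from the BERT encoding sketch above
query = np.random.rand(3, 768)
doc = np.random.rand(60, 768)
print(avg_document_level(query, doc), avg_token_level(query, doc), max_avg_token_level(query, doc))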

BERT creates a special token [CLS], which stands for a whole sentence. The encoding of this token is the whole sentence encoding. It could be used, for example, for sentence classification.
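Continuing the earlier Hugging Face sketch (again purely illustrative), the [CLS] encoding is simply the first row of the token embedding tensor returned by encode():

# The [CLS] token is always the first token, so its contextual encoding is the
# first row of the (num_tokens, 768) tensor from the earlier encode() sketch.
cls_embedding = encode("Steve Jobs founded Apple.")[0]   # shape (768,)
print(cls_embedding.shape)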

Search Performance Test Results

I have a small toy corpus comprising only 10 documents, each containing between 50 and 100 words. The documents center around the following topics.

  1. apple computer
  2. iPhone
  3. smartphone
  4. fruit apple
  5. fruit peach

The queries are as follows

  1. iPhone sale
  2. steve jobs
  3. smartphone
  4. apple nutrition
  5. fruit food value

The second query is the kind of query that illustrates the power of semantic search. Although none of the documents contained the query string steve jobs, all documents related to Apple and iPhone showed up at the top of the search results, because they are semantically related.

I ran all 5 queries with the 9 aforementioned similarity algorithms. I evaluated the results by perusing them and judging how the order of the 10 documents in the search result deviated from the order I was expecting. Obviously, it’s not technically rigorous. Nonetheless, it provides some good insight into the similarity algorithms. I found the following similarity algorithms to be the best, in order of performance.

  1. Average token level
  2. Max average  token level
  3. Average sentence level
  4. Max sentence level

Because my test setup was very limited, it won’t be wise to accept my observations in a broader context. If you have an enterprise Solr or ElasticSearch deployment and want to enhance it with deep learning based semantic search, you should run your own tests as follows.

  1. Use a large corpus from your domain
  2. Run representative queries with all the similarity algorithms
  3. Measure performance with proper metrics, e.g. precision, recall etc (a small metric sketch follows the list)
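For the last step, here is a minimal sketch of precision and recall at k; the relevant document set is a hypothetical, manually judged list used only for illustration.

# Minimal sketch of precision@k and recall@k for evaluating search results
# (relevant_docs is a hypothetical, manually judged set of relevant document ids).
def precision_at_k(ranked_docs, relevant_docs, k):
    top_k = ranked_docs[:k]
    return len(set(top_k) & set(relevant_docs)) / k

def recall_at_k(ranked_docs, relevant_docs, k):
    top_k = ranked_docs[:k]
    return len(set(top_k) & set(relevant_docs)) / len(relevant_docs)

ranked = ["doc6", "doc4", "doc2", "doc7", "doc8"]
relevant = {"doc4", "doc6", "doc9"}
print(precision_at_k(ranked, relevant, 3), recall_at_k(ranked, relevant, 3))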

The Python script works as follows

  1. Loads BERT large model
  2. Encodes all 10 documents using loaded BERT model
  3. Enters a loop prompting the user for a query.
  4. Returns query results

Here is some sample output

akash:misc pranab$ ./ssearch.py  ssma
using sentence max similarity
loading BERT transformer model
encoding all docs
enter a search query or q to quit: iphone sale

sorted search result
doc6 score 11.165314
doc4 score 10.993372
doc2 score 10.461808
doc7 score 10.118464
doc8 score 9.900048
doc3 score 9.870462
doc1 score 9.836157
doc5 score 9.611451
doc10 score 8.561757
doc9 score 7.774605

I have used the pre-trained BERT transformer as packaged and made available through the spaCy NLP library.

Integration with ElasticSearch or Solr

ElasticSearch or Solr will provide a search score for any query, document pair. How do we reconcile the semantic search based score with the native score? There are many ways to do that. One solution is to take a weighted average of the native search score and the semantic search score and re-order the search results. All documents should be encoded, and the encoded tensors stored and indexed in ElasticSearch.
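Here is a minimal sketch of that weighted average approach; the weight and the min-max normalization are my own illustrative choices, not something prescribed by Solr or ElasticSearch.

# Minimal sketch: re-ranking by a weighted average of the native (lexical) score
# and the semantic similarity score. Scores are min-max normalized first so the
# two scales are comparable (weights and normalization are illustrative choices).
def normalize(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

def rerank(native_scores, semantic_scores, weight=0.6):
    native = normalize(native_scores)
    semantic = normalize(semantic_scores)
    combined = {d: weight * native[d] + (1 - weight) * semantic[d] for d in native}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

native_scores = {"doc1": 12.3, "doc2": 9.8, "doc3": 7.1}
semantic_scores = {"doc1": 0.42, "doc2": 0.77, "doc3": 0.55}
print(rerank(native_scores, semantic_scores))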

ElasticSearch has started supporting vector embedding based search out of the box, and from what I could tell, this is work in progress.

Vector Similarity Based Retrieval

In literal word match based search, all documents are indexed by the words they contain. At query time, all documents that have any of the keywords in the query are fetched, and the relevance scoring algorithm is run on those documents to rank them.

Retrieval based on vector similarity is a lot more challenging. There are some Python libraries available, e.g. annoy and milvus. Between the two, milvus seems to be more mature. Here are the steps for using milvus for indexing and retrieval, combined with the similarity algorithm of your choice.

Steps for indexing documents

  1. Get the document tensor with BERT encoding for each document
  2. Squash each document tensor into a vector by taking the average of the rows
  3. Index each document vector into milvus (a sketch follows the list)
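Here is a minimal sketch of the indexing steps. A plain in-memory numpy matrix stands in for the milvus index, since the exact milvus API depends on the version you install; with milvus you would insert the same squashed vectors into a collection.

# Minimal sketch of the indexing steps. A numpy matrix stands in for the milvus index.
import numpy as np

doc_ids = []
doc_vectors = []      # the stand-in "index" of squashed document vectors
doc_tensors = {}      # keep full tensors for the re-ranking step later

def index_document(doc_id, doc_tensor):
    """doc_tensor: (num_tokens, 768) BERT encoding of the document."""
    # Squash the token-level tensor into a single vector by averaging the rows
    doc_vec = doc_tensor.mean(axis=0)
    doc_ids.append(doc_id)
    doc_vectors.append(doc_vec)
    doc_tensors[doc_id] = doc_tensor

for i in range(10):
    index_document(f"doc{i + 1}", np.random.rand(80, 768))   # random stand-in tensors
index_matrix = np.vstack(doc_vectors)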

Steps for retrieving documents for a query

  1. Get the query tensor with BERT encoding
  2. Squash the query tensor into a vector by taking the average of the rows
  3. Retrieve the nearest neighbor documents based on the query vector. You can specify how many nearest neighbors to retrieve
  4. Run the similarity algorithm of your choice from the aforementioned list on the retrieved documents and rank them based on the similarity score (see the sketch after the list)
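And here is a matching sketch of the retrieval steps, continuing from the indexing sketch above. Brute-force cosine nearest neighbors stands in for the milvus query, and the re-ranking reuses the avg_token_level function from the earlier similarity sketch.

# Minimal sketch of the retrieval steps, continuing from the indexing sketch above.
# Brute-force cosine nearest neighbors stands in for the milvus query.
def retrieve(query_tensor, top_n=5):
    # Squash the query tensor into a vector the same way as the documents
    query_vec = query_tensor.mean(axis=0)
    # Cosine similarity of the query vector against every indexed document vector
    sims = index_matrix @ query_vec / (
        np.linalg.norm(index_matrix, axis=1) * np.linalg.norm(query_vec))
    nearest = np.argsort(-sims)[:top_n]
    candidates = [doc_ids[i] for i in nearest]
    # Re-rank the candidates with a finer-grained similarity, e.g. average token level
    scored = [(d, avg_token_level(query_tensor, doc_tensors[d])) for d in candidates]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)

print(retrieve(np.random.rand(3, 768)))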

Wrapping Up

We have walked through a process of doing semantic search with the pre-trained BERT transformer model. There are various ways to compute tensor similarity, and we have experimented with them to evaluate them, albeit on a small corpus. Here is a tutorial document, if you want to play around with it.
