Semantic Search with a Pre-Trained Neural Transformer Model Using Document, Sentence and Token Level Embeddings

Some time ago I worked on an enterprise search project, where we were tasked with improving the performance of an enterprise Solr search deployment. We recommended various improvements based on classic NLP techniques. One of the items on the agenda was deep learning language model based semantic search. Unfortunately we never got to it because of time and budgetary constraints.

Recently I got a chance to experiment with the pre-trained BERT Transformer model for semantic search. I experimented with various similarity algorithms for query and document vector embeddings. I will share my findings in this post along with suggestions on how to integrate it with Solr or ElasticSearch to boost performance. The Python script is available in my open source GitHub project avenir.

Semantic Search

Traditional search is lexical and based on literal keyword match. One popular algorithm for the lexical approach is BM25. These literal match based algorithms are incapable of handling the intent of the user and the semantic meaning of the keywords. Semantic search fills those gaps. Generally, semantic search improves recall while sacrificing precision. Semantic search is effective in handling the following.

  • Intent of the user as conveyed through the keywords
  • Conceptual meaning of the keywords

Here is a simple example of what semantic search can achieve. If you type in the keywords “Steve Jobs”, it will fetch documents about Apple and iPhone with high ranks even if the documents don’t explicitly contain the name “Steve Jobs”.

There have been attempts in the past to enhance search engines with ontologies and various NLP techniques, with some success. Our focus in this post will be on using pre-trained deep learning language models for semantic search.

Early Deep Learning Language Models

In deep learning based language modeling, the artifacts of a document, i.e. the document itself, its sentences and its words, have high dimensional vector space representations where semantically related items are close to each other in that vector space.

Some of the early embedding examples are Word2Vec and GloVe. They both use feed forward neural networks. In Word2Vec the context of a word is learnt because a sequence of words is fed as input or output to the network. GloVe learns the co-occurrence matrix of word pairs with a feed forward neural network. The internal representation of a word, in the form of network weights, is the vector space embedding. One advantage GloVe has over Word2Vec is that it also learns global context in addition to the local context learnt by both.

One of the problems with these techniques is that there is only one vector representation of a word. In other words, all contexts for a word are squashed into one global context. Current generation neural models overcome this limitation. Also, the current breed of networks has a more complex architecture, generating better quality embeddings.
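To make the single vector limitation concrete, here is a minimal sketch using the gensim library (my own illustration, not part of the experiments in this post). Whichever sentence the word “apple” appears in, Word2Vec returns the same single vector for it.

# Minimal sketch: static word embeddings with gensim's Word2Vec (illustrative only).
from gensim.models import Word2Vec

# Two very different contexts for the word "apple"
sentences = [
    ["apple", "released", "a", "new", "iphone"],
    ["i", "ate", "a", "ripe", "apple", "for", "breakfast"],
]

# Train a tiny Word2Vec model (gensim 4.x API)
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Word2Vec stores exactly one vector per word, regardless of context
vec = model.wv["apple"]
print(vec.shape)                                  # (50,)
print(model.wv.most_similar("apple", topn=3))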

Transformer based Deep Learning Language Model

The Transformer is a neural architecture pattern for sequence to sequence modeling. It includes an encoder-decoder pair. The encoder and decoder typically have multiple layers. Encoders and decoders can be unidirectional or bidirectional. Instead of using an RNN or LSTM, the Transformer uses an attention mechanism. It’s considered to be superior in quality compared to earlier RNN or LSTM based sequence modeling networks.

Attention is a mechanism to decide, for a given location in a given layer, how attention will be focussed on various locations of the previous layer or the same layer. It’s essentially a sophisticated way of defining context, i.e. what is relevant and what is not. For sequence modeling there are 3 kinds of attention mechanisms, as below (a minimal sketch of the underlying computation follows the list).

  1. Encoder decoder attention
  2. Encoder self attention
  3. Decoder self attention
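Here is a minimal numpy sketch of scaled dot-product attention, the core computation inside all three of these attention mechanisms. It is my own illustration, not code from the post’s script.

# Minimal sketch of scaled dot-product attention with numpy (illustrative only).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d), K: (n_k, d), V: (n_k, d_v). Returns (n_q, d_v)."""
    d = Q.shape[-1]
    # Attention scores: how much each query position attends to each key position
    scores = Q @ K.T / np.sqrt(d)
    # Softmax over the key dimension turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum of values: the context-aware representation
    return weights @ V

# Self attention: queries, keys and values all come from the same sequence
x = np.random.rand(5, 8)   # 5 token embeddings of size 8
out = scaled_dot_product_attention(x, x, x)
print(out.shape)           # (5, 8)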

For language modeling problems, pre-trained Transformer models are typically used. Since they generate a language model, the decoder of the Transformer is not necessary. These models are pre-trained on vast publicly available corpora. The three most popular models are BERT from Google, the GPT family of models from OpenAI and ELMo from AllenNLP. ELMo, unlike the other two, uses LSTM.

We will be using the pre-trained BERT model for our experiments. Since BERT does language modeling, the decoder is not necessary. Instead it has a classification layer on top of the encoder output, used for training. BERT, as the name suggests, is bidirectional, i.e. it learns from the left as well as the right context of a word. BERT uses two kinds of training, as below.

  1. Some words are masked in the input and predicted in the output
  2. In sentence pair training, for some pairs the next sentence is replaced with a random sentence, and whether the second sentence actually follows the first is predicted in the output

Both training techniques are unsupervised, since no labeled data is used. The second training technique in the list has chatbot and question answering (QA) use cases as its objective.
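As a quick illustration of the first (masked word) objective, the Hugging Face transformers library, an alternative packaging to the spaCy one used later in this post, exposes a fill-mask pipeline on top of a pre-trained BERT model. This is purely illustrative and not part of the post’s script.

# Minimal sketch of BERT's masked word prediction using the Hugging Face
# transformers library (illustrative only).
from transformers import pipeline

# fill-mask runs BERT's masked language modeling head
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts likely words for the [MASK] position from both left and right context
for prediction in unmasker("Steve Jobs was the founder of [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))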

Semantic Search with BERT

We can think of the semantic search problem as follows. By applying BERT encoding to a query and a document we get two tensors as output. The shape of the query tensor is (q, 768) where q is the number of tokens in the query string. Similarly, the shape of the document tensor is (d, 768) where d is the number of tokens in the document. In the BERT model used here, the encoded vector size for a token is 768. The problem boils down to finding the similarity between the two tensors.
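Here is a minimal sketch of how those two tensors can be obtained. I use the Hugging Face transformers library for illustration, whereas the post’s script uses BERT as packaged by spaCy, described later.

# Minimal sketch: obtaining query and document token embedding tensors from a
# pre-trained BERT model with the Hugging Face transformers library (illustrative;
# the post's script uses BERT as packaged by spaCy).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def encode(text):
    """Return a (num_tokens, 768) tensor of contextual token embeddings."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state has shape (1, num_tokens, 768); drop the batch dimension
    return outputs.last_hidden_state.squeeze(0)

query_tensor = encode("iphone sale")                                   # shape (q, 768)
doc_tensor = encode("Apple reported strong iPhone sales last quarter") # shape (d, 768)
print(query_tensor.shape, doc_tensor.shape)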

We need to find the similarity between the two tensors, which will be the search score of the document for the given query. We can find similarity by considering various levels of granularity of a document, as below.

  1. document
  2. sentence
  3. word

A document will have multiple sentences. You can think of the query string as just one sentence. I have tried the following similarity calculation algorithms for the query and document tensors at various levels of granularity (a sketch of a few of them follows the list).

  1. Average document level: Similarity is calculated between the whole query and the whole document. Token tensors are averaged for both query and document
  2. Average sentence level: Similarity is found between the query sentence and all sentences in the document and then averaged
  3. Median sentence level: Similarity is found between the query sentence and all sentences in the document and then the median is taken
  4. Max sentence level: Similarity is found between the query sentence and all sentences in the document and then the maximum is taken
  5. Max token level: Similarity is found between the query tokens and all tokens in the document and then the maximum is taken
  6. Average token level: Similarity is found between the query tokens and all tokens in the document and then the average is taken
  7. Median token level: Similarity is found between the query tokens and all tokens in the document and then the median is taken
  8. Max average token level: Similarity is found between a query token and all tokens in the document and then averaged. This is repeated for all query tokens. The maximum of the averages is taken
  9. Average max token level: Similarity is found between a query token and all tokens in the document and then the maximum is taken. This is repeated for all query tokens. The average of the max values is taken
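Here is a minimal sketch of a few of these algorithms, using cosine similarity over the token embedding tensors. It is my own illustration of the ideas above, not the exact code from the script.

# Minimal sketch of a few of the similarity algorithms above, using cosine
# similarity over token embedding tensors (illustrative, not the script's exact code).
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def avg_document_level(query, doc):
    # Average all token vectors of query and document, then compare
    return cosine(query.mean(axis=0), doc.mean(axis=0))

def avg_token_level(query, doc):
    # Average of all pairwise token similarities
    return float(np.mean([cosine(q, d) for q in query for d in doc]))

def max_avg_token_level(query, doc):
    # For each query token, average its similarity to all document tokens, then take the max
    per_query = [np.mean([cosine(q, d) for d in doc]) for q in query]
    return float(np.max(per_query))

# query and doc are (num_tokens, 768) arrays, e.g. from the BERT encoding sketch above
query = np.random.rand(3, 768)
doc = np.random.rand(60, 768)
print(avg_document_level(query, doc), avg_token_level(query, doc), max_avg_token_level(query, doc))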

BERT creates a special token [CLS], which stands for a whole sentence. The encoding of this token is the whole sentence encoding. It could be used, for example, for sentence classification.
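Continuing the earlier Hugging Face sketch (again purely illustrative), the [CLS] encoding is simply the first row of the token embedding tensor returned by encode():

# The [CLS] token is always the first token, so its contextual encoding is the
# first row of the (num_tokens, 768) tensor from the earlier encode() sketch.
cls_embedding = encode("Steve Jobs founded Apple.")[0]   # shape (768,)
print(cls_embedding.shape)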

Search Performance Test Results

I have a small toy corpus comprising only 10 documents, each containing between 50 and 100 words. The documents center around the following topics.

  1. apple computer
  2. iPhone
  3. smartphone
  4. fruit apple
  5. fruit peach

The queries are as follows

  1. iPhone sale
  2. steve jobs
  3. smartphone
  4. apple nutrition
  5. fruit food value

The second query is the kind of query that illustrates the power of semantic search. Although none of the documents contained the query string steve jobs, all documents related to Apple and iPhone showed up at the top of the search results, because they are semantically related.

I ran all 5 queries with the 9 aforementioned similarity algorithms. I evaluated the results by perusing them and judging how the order of the 10 documents in the search result deviated from the order I was expecting. Obviously, it’s not technically rigorous. Nonetheless, it provides some good insight into the similarity algorithms. I found the following similarity algorithms to be the best, in order of performance.

  1. Average token level
  2. Max average  token level
  3. Average sentence level
  4. Max sentence level

Because my test setup was very limited, it won’t be wise to accept my observations in a broader context. If you have an enterprise Solr or ElasticSearch deployment and want to enhance it with deep learning based semantic search, you should run your own tests as follows.

  1. Use a large corpus from your domain
  2. Run representative queries with all the similarity algorithms
  3. Measure performance with proper metrics, e.g. precision, recall etc (a small metric sketch follows the list)
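For the last step, here is a minimal sketch of precision and recall at k; the relevant document set is a hypothetical, manually judged list used only for illustration.

# Minimal sketch of precision@k and recall@k for evaluating search results
# (relevant_docs is a hypothetical, manually judged set of relevant document ids).
def precision_at_k(ranked_docs, relevant_docs, k):
    top_k = ranked_docs[:k]
    return len(set(top_k) & set(relevant_docs)) / k

def recall_at_k(ranked_docs, relevant_docs, k):
    top_k = ranked_docs[:k]
    return len(set(top_k) & set(relevant_docs)) / len(relevant_docs)

ranked = ["doc6", "doc4", "doc2", "doc7", "doc8"]
relevant = {"doc4", "doc6", "doc9"}
print(precision_at_k(ranked, relevant, 3), recall_at_k(ranked, relevant, 3))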

The Python script works as follows

  1. Loads BERT large model
  2. Encodes all 10 documents using loaded BERT model
  3. Enters a loop prompting the user for a query.
  4. Returns query results

Here is some sample output

akash:misc pranab$ ./ssearch.py  ssma
using sentence max similarity
loading BERT transformer model
encoding all docs
enter a search query or q to quit: iphone sale

sorted search result
doc6 score 11.165314
doc4 score 10.993372
doc2 score 10.461808
doc7 score 10.118464
doc8 score 9.900048
doc3 score 9.870462
doc1 score 9.836157
doc5 score 9.611451
doc10 score 8.561757
doc9 score 7.774605

I have used the pre-trained BERT transformer as packaged and made available through the spaCy NLP library.

Integration with ElasticSearch or Solr

ElasticSearch or Solr will provide a search score for any query, document pair. How do we reconcile the semantic search based score with the native score? There are many ways to do that. One solution is to take a weighted average of the native search score and the semantic search score and re-order the search results. All documents should be encoded, and the encoded tensors stored and indexed in ElasticSearch.
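Here is a minimal sketch of that weighted average approach; the weight and the min-max normalization are my own illustrative choices, not something prescribed by Solr or ElasticSearch.

# Minimal sketch: re-ranking by a weighted average of the native (lexical) score
# and the semantic similarity score. Scores are min-max normalized first so the
# two scales are comparable (weights and normalization are illustrative choices).
def normalize(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

def rerank(native_scores, semantic_scores, weight=0.6):
    native = normalize(native_scores)
    semantic = normalize(semantic_scores)
    combined = {d: weight * native[d] + (1 - weight) * semantic[d] for d in native}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

native_scores = {"doc1": 12.3, "doc2": 9.8, "doc3": 7.1}
semantic_scores = {"doc1": 0.42, "doc2": 0.77, "doc3": 0.55}
print(rerank(native_scores, semantic_scores))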

ElasticSearch has started supporting vector embedding based search out of the box, and from what I could tell, this is work in progress.

Vector Similarity Based Retrieval

In literal word match based search, all documents are indexed by the words they contain. At query time, all documents that have any of the keywords in the query are fetched, and the relevance scoring algorithm is run on those documents to rank them.

Retrieval based on vector similarity is a lot more challenging. There are some Python libraries available, e.g. annoy and milvus. Between the two, milvus seems to be more mature. Here are the steps for using milvus for indexing and retrieval, combined with the similarity algorithm of your choice.

Steps for indexing documents

  1. Get the document tensor with BERT encoding for each document
  2. Squash each document tensor into a vector by taking the average of the rows
  3. Index each document vector into milvus (a sketch follows the list)
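Here is a minimal sketch of the indexing steps. A plain in-memory numpy matrix stands in for the milvus index, since the exact milvus API depends on the version you install; with milvus you would insert the same squashed vectors into a collection.

# Minimal sketch of the indexing steps. A numpy matrix stands in for the milvus index.
import numpy as np

doc_ids = []
doc_vectors = []      # the stand-in "index" of squashed document vectors
doc_tensors = {}      # keep full tensors for the re-ranking step later

def index_document(doc_id, doc_tensor):
    """doc_tensor: (num_tokens, 768) BERT encoding of the document."""
    # Squash the token-level tensor into a single vector by averaging the rows
    doc_vec = doc_tensor.mean(axis=0)
    doc_ids.append(doc_id)
    doc_vectors.append(doc_vec)
    doc_tensors[doc_id] = doc_tensor

for i in range(10):
    index_document(f"doc{i + 1}", np.random.rand(80, 768))   # random stand-in tensors
index_matrix = np.vstack(doc_vectors)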

Steps for retrieving documents for a query

  1. Get the query tensor with BERT encoding
  2. Squash the query tensor into a vector by taking the average of the rows
  3. Retrieve the nearest neighbor documents based on the query vector. You can specify how many nearest neighbors to retrieve
  4. Run the similarity algorithm of your choice from the aforementioned list on the retrieved documents and rank them based on the similarity score (see the sketch after the list)
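And here is a matching sketch of the retrieval steps, continuing from the indexing sketch above. Brute-force cosine nearest neighbors stands in for the milvus query, and the re-ranking reuses the avg_token_level function from the earlier similarity sketch.

# Minimal sketch of the retrieval steps, continuing from the indexing sketch above.
# Brute-force cosine nearest neighbors stands in for the milvus query.
def retrieve(query_tensor, top_n=5):
    # Squash the query tensor into a vector the same way as the documents
    query_vec = query_tensor.mean(axis=0)
    # Cosine similarity of the query vector against every indexed document vector
    sims = index_matrix @ query_vec / (
        np.linalg.norm(index_matrix, axis=1) * np.linalg.norm(query_vec))
    nearest = np.argsort(-sims)[:top_n]
    candidates = [doc_ids[i] for i in nearest]
    # Re-rank the candidates with a finer-grained similarity, e.g. average token level
    scored = [(d, avg_token_level(query_tensor, doc_tensors[d])) for d in candidates]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)

print(retrieve(np.random.rand(3, 768)))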

Wrapping Up

We have walked through a process of doing semantic search with the pre-trained BERT transformer model. There are various ways to compute tensor similarity, and we have experimented with them to evaluate them, albeit on a small corpus. Here is a tutorial document, if you want to play around with it.
