How to Build A Vertical Search Engine Using Python?

Elham Amini
5 min readDec 13, 2020

--

What is a vertical search engine?

Sometimes Google, or better to say a generalized search engine is not the most efficient way to retrieve information! Suppose you would like to analyze people’s reviews for one of the restaurants in your town. In this case, searching through a specific search engine such as Yelp is more efficient than Googling. We can say that, a vertical search engine focuses on one particular subject.

A search engine for airplane crashes

Despite the significant role of air travel in human life, there have been many fatal airplane crashes throughout history. There are various reasons that can lead to an airplane crash including but not limited to adverse weather, unexpected fire, and equipment failure. Understanding the underlying reasons can help researchers, policy makers and engineers to provide safer flights. There are also many passengers that are curious to know the reasons for crashes so that they can make a better decision to book their flight. However, it would take a long time to retrieve mentioned information from a general search engine. In this article, we are going to build a vertical search engine for airplane crashes.

Dataset (Airplane Crashes from 1902 to 2020)

The datasets are scrapped from ​planecrashinfo​ website using python and stored in SQL databases. The Data Frame has ​4980​ entries, including crashes’ date and a document that explains why the crash had happened. The following shows one of a sample document for one of the 💔 crashes that happened in 2020:

The plane took off from Tehran International Airport at 06:11. After reaching 20 km from the airport and at an altitude of 7,900 ft. the plane suddenly crashed to the ground. The aircraft disintegrated leaving a 300 m long path of wreckage. A misaligned missile battery, miscommunication between troops and their commanders and a decision to fire without authorization all led to Iran’s Revolutionary Guard shooting down a Ukrainian jetliner.

Setup the environment

The main libraries that we use to build the search engine are:

Download the list of all packages required and install them by running:

pip install -r requirements.txt

Preprocessing

To start we need to clean crash documents for a better search. We do the following steps to prepare data:

1- Lower casing

This step is lower casing all words. For instance, “This is the earliest known airliner hijacking” will be transfer to:

this is the earliest known airliner hijacking

2- Tokenizing each document

This step is splitting documents into words or tokens.

This, is, the, earliest, known, airliner, hijacking

3- Removing stop words

This step is removing stop words such as a, the, and is. The above document will be transfer to:

earliest, known, airliner, hijacking

4- Stemming

stemming is the process of transferring words to their word base. After stemming, The above document will be transfer to:

earliest, known, airliner, hijack

All crash documents are stored in a python dictionary. We need to open the dictionary, read all crash documents (dictionary values), and perform the preprocessing steps on them.

TF-IDF

How we can say a document is relevant to a query? You might answer “Term Frequency” a.k.n (TF). However, certain terms have little or no discriminating power in determining relevance. For instance, a collection of documents on the airplane crashes is likely to have the term crash in almost every document, which can not discriminative.

Inverse Document Frequency (IDF) evaluates the relevance of a term by measuring how often it appears in a collection of documents. Therefore the TF of a rare term might be high, while the IDF of a frequent term might likely to be low.

Ranking documents

To determine the relevance of a document to a give search query we use Okapi BM25, which is a ranking function. In simple words, bm25 is the product of a variant form of Term Frequency (TF), a variant form of Inverse Document Frequency (IDF), and a variant form of Query Term Frequency (QTF) as follows: TF*IDF*QTF

Given a search query term, documents with higher bm25 score will be retrieved as most relevant documents.

Python Implementation

We use stopwords and PorterStemmer modules from nltk library to perform the preprocessing. Then, we use BM25Okapi from rank-bm25 library to perform the ranking step. You can see a piece of python code of the mentioned steps in the following:

User Search Query

Each time that user search for a query, we need to perform the same preprocessing on the query as we did on the collection of document to have an apple to apple comparison. Then we pass the process query to the bm25 ranking function to retrieve the top n relevant documents. get_top_n module from rank-bm25 documents perfectly does this job:

bm25.get_top_n(tokenized_query, corpus, n=1)

To sum up

In this article, we designed a domain specific search engine for airplane crashes. The objective was enabling the user to search for a query such as “pilot error” and retrieve most relevant crashes to that query.

To prepare the corpus, we used a dictionary containing date of crashes as keys and document of crashes as values. We used nltk library to remove stopwords and stemming documents tokens in lower case format. Then passed a list of clean tokenized formats of documents to the ranking module of rank_bm25 library to get the inverted index of documents.

What I would like to emphasize is the significance of preprocessing. Before doing a complete preprocessing, the search engine cannot retrieve relevant documents. For instance, if the query is “crashes due to hijacking”, the search engine would retrieve documents with the words such as “hijacking”, but not a document containing the word “hijack” or “hijacked”. You can access all the data and codes needed to set up airplane crashes search engine in this GitHub repository.

Next…

In the next part, we’re going to use Flask to build a nice interface for our search engine.

--

--

Elham Amini

I am a Data Science student at the University of Michigan.