Build a Search Engine with TF-IDF

TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The TF-IDF value increases proportionally with the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general. TF-IDF is one of the most popular term-weighting schemes today.
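
As a quick illustration (with made-up numbers, purely for intuition): a term that occurs 3 times in a 100-word document, and appears in 10 of a corpus's 1,000 documents, would score:

import math

# Made-up numbers, for illustration only
tf = 3 / 100               # the term occurs 3 times in a 100-word document
idf = math.log(1000 / 10)  # it appears in 10 of 1,000 documents (natural log)
print(round(tf * idf, 4))  # 0.1382

We will make all of this precise below.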

1. Bag of words

Tokenize a sentence

from nltk.tokenize import TreebankWordTokenizer

sentence = "The faster Harry got to the store, the faster Harry, the faster, would get home."
tokenizer = TreebankWordTokenizer()
token_sequence = tokenizer.tokenize(sentence.lower())
print(token_sequence)
['the', 'faster', 'harry', 'got', 'to', 'the', 'store', ',', 'the', 'faster', 'harry', ',', 'the', 'faster', ',', 'would', 'get', 'home', '.']

Count the tokens

We only care about the unique words and how many times each one occurs in the sentence.

from collections import Counter
bag_of_words = Counter(token_sequence)
print(bag_of_words)
Counter({'the': 4, 'faster': 3, ',': 3, 'harry': 2, 'got': 1, 'to': 1, 'store': 1, 'would': 1, 'get': 1, 'home': 1, '.': 1})

What are the most common words?

word_list = bag_of_words.most_common()
print(word_list)
[('the', 4), ('faster', 3), (',', 3), ('harry', 2), ('got', 1), ('to', 1), ('store', 1), ('would', 1), ('get', 1), ('home', 1), ('.', 1)]

The number of times a word occurs in a document is called its term frequency, or TF. Let's compute an example, normalizing the count by the number of unique words in the document.

harry_count = bag_of_words['harry']
unique_words = len(bag_of_words)

print(round(harry_count / unique_words, 4))

0.1818
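
Note that we divided by the number of unique words here. Another common convention, which the kite example below also uses, normalizes by the total token count instead:

total_tokens = len(token_sequence)  # 19 tokens, punctuation included
print(round(harry_count / total_tokens, 4))  # 0.1053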

Let's compute another example. Take a look at the document below.

A kite is traditionally a tethered heavier-than-air craft with wing surfaces that react against the air to create lift and drag. A kite consists of wings, tethers, and anchors. Kites often have a bridle to guide the face of the kite at the correct angle so the wind can lift it. A kite’s wing also may be so designed so a bridle is not needed; when kiting a sailplane for launch, the tether meets the wing at a single point. A kite may have fixed or moving anchors. Untraditionally in technical kiting, a kite consists of tether-set-coupled wing sets; even in technical kiting, though, a wing in the system is still often called the kite.

The lift that sustains the kite in flight is generated when air flows around the kite’s surface, producing low pressure above and high pressure below the wings. The interaction with the wind also generates horizontal drag along the direction of the wind. The resultant force vector from the lift and drag force components is opposed by the tension of one or more of the lines or tethers to which the kite is attached. The anchor point of the kite line may be static or moving (such as the towing of a kite by a running person, boat, free-falling anchors as in paragliders and fugitive parakites or vehicle).

The same principles of fluid flow apply in liquids and kites are also used under water. A hybrid tethered craft comprising both a lighter-than-air balloon as well as a kite lifting surface is called a kytoon.

Kites have a long and varied history and many different types are flown individually and at festivals worldwide. Kites may be flown for recreation, art or other practical uses. Sport kites can be flown in aerial ballet, sometimes as part of a competition. Power kites are multi-line steerable kites designed to generate large forces which can be used to power activities such as kite surfing, kite landboarding, kite fishing, kite buggying and a new trend snow kiting. Even Man-lifting kites have been made.

kite_text = "A kite is traditionally a tethered heavier-than-air craft with wing surfaces that react against the air to create lift and drag. A kite consists of wings, tethers, and anchors. Kites often have a bridle to guide the face of the kite at the correct angle so the wind can lift it. A kite's wing also may be so designed so a bridle is not needed; when kiting a sailplane for launch, the tether meets the wing at a single point. A kite may have fixed or moving anchors. Untraditionally in technical kiting, a kite consists of tether-set-coupled wing sets; even in technical kiting, though, a wing in the system is still often called the kite. The lift that sustains the kite in flight is generated when air flows around the kite's surface, producing low pressure above and high pressure below the wings. The interaction with the wind also generates horizontal drag along the direction of the wind. The resultant force vector from the lift and drag force components is opposed by the tension of one or more of the lines or tethers to which the kite is attached. The anchor point of the kite line may be static or moving (e.g., the towing of a kite by a running person, boat, free-falling anchors as in paragliders and fugitive parakites or vehicle). The same principles of fluid flow apply in liquids and kites are also used under water. A hybrid tethered craft comprising both a lighter-than-air balloon as well as a kite lifting surface is called a kytoon. Kites have a long and varied history and many different types are flown individually and at festivals worldwide. Kites may be flown for recreation, art or other practical uses. Sport kites can be flown in aerial ballet, sometimes as part of a competition. Power kites are multi-line steerable kites designed to generate large forces which can be used to power activities such as kite surfing, kite landboarding, kite fishing, kite buggying and a new trend snow kiting. Even Man-lifting kites have been made."

from collections import Counter
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(kite_text.lower())
token_count = Counter(tokens)
print(token_count)

Counter({'the': 26, 'a': 20, 'kite': 16, ',': 15, 'and': 10, 'of': 10, 'kites': 8, 'is': 7, 'in': 7, 'or': 6, 'wing': 5, 'to': 5, 'be': 5, 'as': 5, 'lift': 4, 'have': 4, 'may': 4, 'at': 3, 'so': 3, 'can': 3, 'also': 3, 'kiting': 3, 'are': 3, 'flown': 3, 'tethered': 2, 'craft': 2, 'with': 2, 'that': 2, 'air': 2, 'consists': 2, 'tethers': 2, 'anchors.': 2, 'often': 2, 'bridle': 2, 'wind': 2, "'s": 2, 'designed': 2, ';': 2, 'when': 2, 'for': 2, 'moving': 2, 'technical': 2, 'even': 2, 'called': 2, 'surface': 2, 'pressure': 2, 'drag': 2, 'force': 2, 'by': 2, 'which': 2, '.': 2, 'used': 2, 'power': 2, 'traditionally': 1, 'heavier-than-air': 1, 'surfaces': 1, 'react': 1, 'against': 1, 'create': 1, 'drag.': 1, 'wings': 1, 'guide': 1, 'face': 1, 'correct': 1, 'angle': 1, 'it.': 1, 'not': 1, 'needed': 1, 'sailplane': 1, 'launch': 1, 'tether': 1, 'meets': 1, 'single': 1, 'point.': 1, 'fixed': 1, 'untraditionally': 1, 'tether-set-coupled': 1, 'sets': 1, 'though': 1, 'system': 1, 'still': 1, 'kite.': 1, 'sustains': 1, 'flight': 1, 'generated': 1, 'flows': 1, 'around': 1, 'producing': 1, 'low': 1, 'above': 1, 'high': 1, 'below': 1, 'wings.': 1, 'interaction': 1, 'generates': 1, 'horizontal': 1, 'along': 1, 'direction': 1, 'wind.': 1, 'resultant': 1, 'vector': 1, 'from': 1, 'components': 1, 'opposed': 1, 'tension': 1, 'one': 1, 'more': 1, 'lines': 1, 'attached.': 1, 'anchor': 1, 'point': 1, 'line': 1, 'static': 1, '(': 1, 'e.g.': 1, 'towing': 1, 'running': 1, 'person': 1, 'boat': 1, 'free-falling': 1, 'anchors': 1, 'paragliders': 1, 'fugitive': 1, 'parakites': 1, 'vehicle': 1, ')': 1, 'same': 1, 'principles': 1, 'fluid': 1, 'flow': 1, 'apply': 1, 'liquids': 1, 'under': 1, 'water.': 1, 'hybrid': 1, 'comprising': 1, 'both': 1, 'lighter-than-air': 1, 'balloon': 1, 'well': 1, 'lifting': 1, 'kytoon.': 1, 'long': 1, 'varied': 1, 'history': 1, 'many': 1, 'different': 1, 'types': 1, 'individually': 1, 'festivals': 1, 'worldwide.': 1, 'recreation': 1, 'art': 1, 'other': 1, 'practical': 1, 'uses.': 1, 'sport': 1, 'aerial': 1, 'ballet': 1, 'sometimes': 1, 'part': 1, 'competition.': 1, 'multi-line': 1, 'steerable': 1, 'generate': 1, 'large': 1, 'forces': 1, 'activities': 1, 'such': 1, 'surfing': 1, 'landboarding': 1, 'fishing': 1, 'buggying': 1, 'new': 1, 'trend': 1, 'snow': 1, 'kiting.': 1, 'man-lifting': 1, 'been': 1, 'made': 1})

Hmm, you can see that the, a, and, and many other stop words occur far too often. Let's apply a stop-word cleanup.

import nltk

nltk.download('stopwords', quiet=True)
stopwords = nltk.corpus.stopwords.words('english')
tokens = [x for x in tokens if x not in stopwords]
kite_counts = Counter(tokens)
print(kite_counts)

Counter({'kite': 16, ',': 15, 'kites': 8, 'wing': 5, 'lift': 4, 'may': 4, 'also': 3, 'kiting': 3, 'flown': 3, 'tethered': 2, 'craft': 2, 'air': 2, 'consists': 2, 'tethers': 2, 'anchors.': 2, 'often': 2, 'bridle': 2, 'wind': 2, "'s": 2, 'designed': 2, ';': 2, 'moving': 2, 'technical': 2, 'even': 2, 'called': 2, 'surface': 2, 'pressure': 2, 'drag': 2, 'force': 2, '.': 2, 'used': 2, 'power': 2, 'traditionally': 1, 'heavier-than-air': 1, 'surfaces': 1, 'react': 1, 'create': 1, 'drag.': 1, 'wings': 1, 'guide': 1, 'face': 1, 'correct': 1, 'angle': 1, 'it.': 1, 'needed': 1, 'sailplane': 1, 'launch': 1, 'tether': 1, 'meets': 1, 'single': 1, 'point.': 1, 'fixed': 1, 'untraditionally': 1, 'tether-set-coupled': 1, 'sets': 1, 'though': 1, 'system': 1, 'still': 1, 'kite.': 1, 'sustains': 1, 'flight': 1, 'generated': 1, 'flows': 1, 'around': 1, 'producing': 1, 'low': 1, 'high': 1, 'wings.': 1, 'interaction': 1, 'generates': 1, 'horizontal': 1, 'along': 1, 'direction': 1, 'wind.': 1, 'resultant': 1, 'vector': 1, 'components': 1, 'opposed': 1, 'tension': 1, 'one': 1, 'lines': 1, 'attached.': 1, 'anchor': 1, 'point': 1, 'line': 1, 'static': 1, '(': 1, 'e.g.': 1, 'towing': 1, 'running': 1, 'person': 1, 'boat': 1, 'free-falling': 1, 'anchors': 1, 'paragliders': 1, 'fugitive': 1, 'parakites': 1, 'vehicle': 1, ')': 1, 'principles': 1, 'fluid': 1, 'flow': 1, 'apply': 1, 'liquids': 1, 'water.': 1, 'hybrid': 1, 'comprising': 1, 'lighter-than-air': 1, 'balloon': 1, 'well': 1, 'lifting': 1, 'kytoon.': 1, 'long': 1, 'varied': 1, 'history': 1, 'many': 1, 'different': 1, 'types': 1, 'individually': 1, 'festivals': 1, 'worldwide.': 1, 'recreation': 1, 'art': 1, 'practical': 1, 'uses.': 1, 'sport': 1, 'aerial': 1, 'ballet': 1, 'sometimes': 1, 'part': 1, 'competition.': 1, 'multi-line': 1, 'steerable': 1, 'generate': 1, 'large': 1, 'forces': 1, 'activities': 1, 'surfing': 1, 'landboarding': 1, 'fishing': 1, 'buggying': 1, 'new': 1, 'trend': 1, 'snow': 1, 'kiting.': 1, 'man-lifting': 1, 'made': 1})

That looks better. You can easily see some important terms: kite, kites, wing, lift. Just by looking at these counts, you already learn something about this document.

2. Vectorizing

How do we compute a document vector?

Here's the document vector for the kite example:

document_vector = []
doc_length = len(tokens)

for key, value in kite_counts.most_common():
    document_vector.append(value / doc_length)

document_vector

[0.07207207207207207,
0.06756756756756757,
0.036036036036036036,
...
0.0045045045045045045,
0.0045045045045045045,
0.0045045045045045045]

Hmm, interesting. This is the vector for a single document, but we need more than one document to do any useful math. Let's build another simple example. Say I have these three documents.

doc_0 = "The faster Harry got to the store, the faster Harry, the faster, would get home."
doc_1 = "Harry is hairy and faster than Jill."
doc_2 = "Jill is not as hairy as Harry."

Now, convert those documents to tokens:

tokens_0 = tokenizer.tokenize(doc_0.lower())
tokens_1 = tokenizer.tokenize(doc_1.lower())
tokens_2 = tokenizer.tokenize(doc_2.lower())
lexicon = set(tokens_0 + tokens_1 + tokens_2)

print(lexicon)
print(len(lexicon))

{'hairy', 'jill', 'harry', 'faster', 'the', 'would', 'get', 'is', 'got', 'and', 'than', ',', '.', 'not', 'as', 'store', 'home', 'to'}
18

All three document vectors must have the same length: 18, the size of the lexicon.

from collections import OrderedDict

# I create a zero vector as a template for all the vectors in this example
zero_vector = OrderedDict((token, 0) for token in lexicon)
zero_vector

OrderedDict([('hairy', 0), ('jill', 0), ('harry', 0), ('faster', 0), ('the', 0), ('would', 0), ('get', 0), ('is', 0), ('got', 0), ('and', 0), ('than', 0), (',', 0), ('.', 0), ('not', 0), ('as', 0), ('store', 0), ('home', 0), ('to', 0)])

Here's how the three document vectors are created:

import copy

document_vectors = []
for doc in [doc_0, doc_1, doc_2]:
    # Don't edit the template directly; clone it for each new vector
    vec = copy.copy(zero_vector)

    tokens = tokenizer.tokenize(doc.lower())
    token_counts = Counter(tokens)

    for key, value in token_counts.items():
        # compute TF, normalized by the lexicon size
        vec[key] = value / len(lexicon)
    document_vectors.append(vec)
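
As a quick check (values worked out by hand from the counts above), every vector has one entry per lexicon term:

print(len(document_vectors[0]))      # 18, one entry per lexicon term
print(document_vectors[0]['harry'])  # 2 occurrences / 18 = 0.111...
print(document_vectors[2]['harry'])  # 1 occurrence  / 18 = 0.0555...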

Now that documents are vectors, how do we measure how similar two documents are? With the cosine of the angle between their vectors.

Cosine Similarity Implementation

\[\begin{align*} \cos \theta &= \frac{A \cdot B}{|A| |B|} \end{align*}\]

import math

def cosine_sim(vec1, vec2):
    """
    Since our vectors are dictionaries, let's convert them to lists for easier math.
    """
    vec1 = [val for val in vec1.values()]
    vec2 = [val for val in vec2.values()]

    dot_prod = 0
    for i, v in enumerate(vec1):
        dot_prod += v * vec2[i]

    mag_1 = math.sqrt(sum([x**2 for x in vec1]))
    mag_2 = math.sqrt(sum([x**2 for x in vec2]))

    return dot_prod / (mag_1 * mag_2)

A cosine similarity of 1 represents vectors that point in exactly the same direction. In NLP, that means the two documents use their words in the same proportions, which we take as a sign that they mean similar things.

A cosine similarity of 0 represents two vectors that share no components; they are orthogonal.

A cosine similarity of -1 represents two vectors that are anti-similar: they point in opposite directions. In NLP, that would mean two documents talking about opposite things. But we will never see negative values with TF vectors, because word counts can never be negative.
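
As a sanity check on the implementation, here are two hypothetical toy vectors (not from our corpus): a vector compared with itself scores 1, and two vectors with no shared components score 0:

from collections import OrderedDict

# Hypothetical toy vectors, for illustration only
v_a = OrderedDict([('x', 1.0), ('y', 2.0), ('z', 0.0)])
v_b = OrderedDict([('x', 0.0), ('y', 0.0), ('z', 3.0)])

print(cosine_sim(v_a, v_a))  # ~1.0 (up to floating-point error): same direction
print(cosine_sim(v_a, v_b))  # 0.0: no shared components, orthogonal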

So, let's verify the three vectors above:

print(cosine_sim(document_vectors[0], document_vectors[1]))
print(cosine_sim(document_vectors[0], document_vectors[2]))
print(cosine_sim(document_vectors[1], document_vectors[2]))

0.3162277660168379
0.14142135623730948
0.5590169943749475

We can easily see that doc_1 and doc_2 have the highest similarity score; they are talking about something similar.

3. Topic Modeling

Word counts are useful, but on their own they tell us nothing about a document relative to all the other documents in the corpus. Inverse document frequency (IDF) fills that gap: it is computed from document counts across the entire corpus.

\[\begin{align*} tf(t, d) &= \frac{count(t)}{count(d)} \end{align*}\]

\[\begin{align*} idf(t, D) &= \log \frac{count(D)}{count(D_t)} \end{align*}\]

\[\begin{align*} tfidf(t, d, D) &= tf(t, d) * idf(t, D) \end{align*}\]

The more times a word appears in a document, the more important it is to that document. On the other hand, the more documents that contain a word, the less important that word is across the entire corpus.
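
A quick worked example against our three documents (using the plain ratio for IDF, without the log, to match the code below): "store" appears only in doc_0, while "harry" appears in all three, so "store" earns a much higher weight.

# "store" is in 1 of 3 documents; "harry" is in all 3
idf_store = 3 / 1  # 3.0
idf_harry = 3 / 3  # 1.0

# In doc_0: tf = count / len(lexicon), and the lexicon has 18 terms
print(round((1 / 18) * idf_store, 4))  # 0.1667
print(round((2 / 18) * idf_harry, 4))  # 0.1111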

Compute the TF-IDF vectors for the example above:

document_tfidf_vectors = []
documents = [doc_0, doc_1, doc_2]
for doc in documents:
    vec = copy.copy(zero_vector)

    tokens = tokenizer.tokenize(doc.lower())
    token_counts = Counter(tokens)

    for key, value in token_counts.items():
        docs_containing_key = 0
        for _doc in documents:
            if key in _doc.lower():  # lowercase the document so 'harry' matches 'Harry'
                docs_containing_key += 1
        tf = value / len(lexicon)
        if docs_containing_key:
            idf = len(documents) / docs_containing_key  # plain ratio IDF; the log is omitted for simplicity
        else:
            idf = 0
        vec[key] = tf * idf
    document_tfidf_vectors.append(vec)

Let's try a simple query:

query = "How long does it take to get to the store?"
query_vec = copy.copy(zero_vector)

tokens = tokenizer.tokenize(query.lower())
token_counts = Counter(tokens)

for key, value in token_counts.items():
    docs_containing_key = 0
    for _doc in documents:
        if key in _doc.lower():
            docs_containing_key += 1
    if docs_containing_key == 0:  # We didn't find that token in the lexicon; go to the next key
        continue
    tf = value / len(tokens)
    idf = len(documents) / docs_containing_key
    query_vec[key] = tf * idf

print(round(cosine_sim(query_vec, document_tfidf_vectors[0]), 4))
print(round(cosine_sim(query_vec, document_tfidf_vectors[1]), 4))
print(round(cosine_sim(query_vec, document_tfidf_vectors[2]), 4))

0.5201
0.0
0.0

It's clear that doc_0 is the only document that matches the query: "The faster Harry got to the store, the faster Harry, the faster, would get home."
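
To turn these scores into an actual search result, here's a minimal sketch that ranks the documents and picks the best match:

# Score every document against the query and pick the winner
scores = [cosine_sim(query_vec, vec) for vec in document_tfidf_vectors]
best = max(range(len(scores)), key=lambda i: scores[i])
print(f"Best match: doc_{best} with score {scores[best]:.4f}")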

What an interesting way to search for relevant documents. Note that this search had to evaluate the entire corpus against the query, so it's an O(N) algorithm in the number of documents. Not bad, is it?

4. Tools

Let's simplify our implementation by using existing tools.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = documents
vectorizer = TfidfVectorizer(min_df=1)
model = vectorizer.fit_transform(corpus)

print(model.todense().round(2))

[[0.16 0. 0.48 0.21 0.21 0. 0.25 0.21 0. 0. 0. 0.21 0. 0.64 0.21 0.21]
[0.37 0. 0.37 0. 0. 0.37 0.29 0. 0.37 0.37 0. 0. 0.49 0. 0. 0. ]
[0. 0.75 0. 0. 0. 0.29 0.22 0. 0.29 0.29 0.38 0. 0. 0. 0. 0. ]]
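
To run our earlier query through the same pipeline (a minimal sketch; sklearn's cosine_similarity stands in for our hand-rolled cosine_sim):

from sklearn.metrics.pairwise import cosine_similarity

# Vectorize the query with the SAME fitted vectorizer, then score all documents
query_tfidf = vectorizer.transform(["How long does it take to get to the store?"])
scores = cosine_similarity(query_tfidf, model)
print(scores.round(2))  # doc_0 should again come out on top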