With the recent rise of RAG, it has become popular to rely on embedding models. You might have heard of OpenAI’s Ada model, or of models on Hugging Face such as nomic-embed-text.

The idea of converting textual data to a vector is not new. In fact, we often use it while training classifiers, through techniques such as TF-IDF or even simple word counts.

We do this because computers can only understand numbers, not text or images.

Once text is represented numerically, computers can perform mathematical operations on it. This is commonly used for tasks such as:

  1. Classification
  2. Clustering
  3. Similarity Measurement
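
Taking similarity measurement as an example: once two pieces of text are represented as vectors, comparing them reduces to simple arithmetic. Below is a minimal sketch of cosine similarity in plain Python; the three-dimensional example vectors are made up purely for illustration (real embeddings have hundreds or thousands of dimensions).

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: dot product divided by
    # the product of their magnitudes. Ranges from -1 to 1,
    # where 1 means the vectors point in the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" of two similar sentences.
doc1 = [0.2, 0.9, 0.4]
doc2 = [0.1, 0.8, 0.5]
print(cosine_similarity(doc1, doc2))  # close to 1.0: very similar
```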

Let us look at one such embedding method.

TF-IDF (Term Frequency - Inverse Document Frequency)

TF-IDF is a statistical measure that evaluates how important a word is to a document relative to a collection of documents (the corpus).

Term Frequency is how often a word appears in a document. A higher frequency suggests greater importance.

$$\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$

where $f_{t,d}$ is the number of times term $t$ appears in document $d$, and the denominator is the total number of terms in $d$.
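
As a minimal sketch (assuming each document has already been tokenized into a list of lowercase words), term frequency can be computed like this:

```python
from collections import Counter

def tf(term: str, doc: list[str]) -> float:
    # Number of times `term` appears in the document,
    # normalized by the document's total number of terms.
    return Counter(doc)[term] / len(doc)

doc = "the cat sat on the mat".split()
print(tf("the", doc))  # 2 occurrences / 6 terms = 0.333...
print(tf("mat", doc))  # 1 / 6 = 0.166...
```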

However, term frequency alone overweights common words such as “the” or “and”, which appear everywhere without carrying much meaning. This is where Inverse Document Frequency (IDF) comes in:

$$\mathrm{IDF}(t, D) = \log\frac{N}{|\{d \in D : t \in d\}|}$$

where $N$ is the total number of documents in the corpus $D$, and the denominator is the number of documents containing the term $t$.

This reduces the weight of words that are common across the corpus while increasing the weight of words that are rare.
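
Putting both pieces together, here is a rough from-scratch sketch. It reuses the `tf` function from the earlier snippet and assumes every queried term appears in at least one document, so the IDF denominator is never zero.

```python
import math
from collections import Counter

def tf(term: str, doc: list[str]) -> float:
    # Same term-frequency sketch as above.
    return Counter(doc)[term] / len(doc)

def idf(term: str, corpus: list[list[str]]) -> float:
    # log of (total documents / documents containing the term).
    # Assumes `term` appears in at least one document.
    docs_with_term = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / docs_with_term)

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    return tf(term, doc) * idf(term, corpus)

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs make great pets".split(),
]
# "the" occurs in two of the three documents, so it is down-weighted;
# "mat" occurs in only one, so it scores higher despite appearing less often.
print(tf_idf("the", corpus[0], corpus))  # ~0.135
print(tf_idf("mat", corpus[0], corpus))  # ~0.183
```

In practice you would reach for a library such as scikit-learn’s `TfidfVectorizer`, which implements a smoothed variant of these formulas and handles tokenization for you.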