Is cosine similarity always the best choice for text embedding search?
While implementing text embedding search for a Retrieval Augmented Generation (RAG) demo, I wondered if cosine similarity is always the best similarity measure to use. There are other measures, including the dot product and Euclidean distance (L2). It turns out that all three of these measures produce the same rankings, as long as the vectors are normalized.
What are text embeddings?
Text embeddings are arrays of numbers (vectors) that capture the meaning and relationships of words and concepts. Think of these vectors as coordinates: embeddings close to each other represent text with similar meanings. This makes them suitable for powering document search.
To compute a text embedding, you can use an open model or a proprietary one. The MTEB leaderboard provides a great overview of the different options. MTEB is a benchmark that compares the performance of text embedding models across different types of tasks. For my demo, I used Vertex AI text embeddings.
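As a rough sketch, computing an embedding with the Vertex AI Python SDK looks something like this. The model name and the project setup are assumptions on my part; check the Vertex AI documentation for the current model versions.

```python
# A rough sketch of computing a text embedding with the Vertex AI Python SDK.
# Assumes you have already run `vertexai.init(project=..., location=...)` and
# that the model name below is still available; check the docs to be sure.
import numpy as np
from vertexai.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("textembedding-gecko")
result = model.get_embeddings(["What is a text embedding?"])

embedding = np.array(result[0].values)
print(embedding.shape)            # a 768-dimensional vector
print(np.linalg.norm(embedding))  # ~1.0, i.e. the vector is normalized
```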
Implementing text embedding search
The recipe for implementing text embedding search is straightforward. Begin by computing text embeddings for all (chunks of) your documents and storing them in a vector database. When searching, compute the text embedding of the query and find the embeddings in the database that are closest to it. For that, you need a distance function.
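Here is a minimal end-to-end sketch of that recipe. It uses the open-source sentence-transformers library instead of a managed API (the model name "all-MiniLM-L6-v2" is just one example of a small open model from the MTEB leaderboard), and a plain numpy array stands in for the vector database.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Embed the document chunks and store them (here: an in-memory matrix).
documents = [
    "Cosine similarity measures the angle between two vectors.",
    "Paris is the capital of France.",
    "Euclidean distance treats vectors as coordinates.",
]
doc_embeddings = model.encode(documents, normalize_embeddings=True)

# 2. Embed the query and find the closest document embeddings.
query = "What is the capital of France?"
query_embedding = model.encode([query], normalize_embeddings=True)[0]

# For normalized vectors, the dot product is the cosine similarity.
scores = doc_embeddings @ query_embedding
best_first = np.argsort(-scores)
for i in best_first:
    print(f"{scores[i]:.3f}  {documents[i]}")
```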
Three most common distance functions
Here are the three most common distance functions (a small numpy sketch of each follows the list):
- Cosine similarity: measures the similarity between two vectors based on the angle between them, ignoring their lengths entirely.
- Dot product: comparable to cosine similarity, but it also takes the lengths of the vectors into account.
- Euclidean distance (L2): the straight-line distance between two vectors, treating them as coordinates in space.
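To make the definitions concrete, here is a small numpy sketch that computes all three measures for a pair of toy vectors (not real embeddings):

```python
import numpy as np

# Two toy vectors.
a = np.array([0.3, 0.8, 0.5])
b = np.array([0.6, 0.1, 0.7])

# Cosine similarity: based on the angle only, vector lengths are ignored.
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product: like cosine similarity, but sensitive to vector lengths.
dot_product = np.dot(a, b)

# Euclidean (L2) distance: straight-line distance between the two points.
euclidean_distance = np.linalg.norm(a - b)

print(cosine_similarity, dot_product, euclidean_distance)
```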
For normalized vectors, all three measures produce the same rankings
Here's the key insight: if your model outputs normalized vectors, these three distance functions produce exactly the same rankings. For unit vectors, cosine similarity is identical to the dot product, and the squared Euclidean distance equals 2 minus twice the dot product, so sorting by any of the three yields the same order (ascending for the distance, descending for the similarities). The Wikipedia page on cosine similarity explains this in more detail.
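A quick numerical sanity check, as a sketch that uses random unit vectors in place of real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random unit vectors standing in for normalized embeddings.
docs = rng.normal(size=(100, 768))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.normal(size=768)
query /= np.linalg.norm(query)

dot = docs @ query
cosine = dot / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
l2 = np.linalg.norm(docs - query, axis=1)

# Higher similarity should mean smaller distance: the orderings coincide.
assert np.array_equal(np.argsort(-dot), np.argsort(-cosine))
assert np.array_equal(np.argsort(-dot), np.argsort(l2))
print("All three measures rank the documents identically.")
```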
It looks like it's common for text embedding models to produce normalized vectors. The Vertex AI text embeddings are normalized for the full-length 768-dimensional vectors, and so are the OpenAI text embeddings.
So yes, cosine similarity works great for normalized vectors. In theory, the dot product might be slightly faster to compute, since it skips the division by the vector norms.
For non-normalized vectors, it depends
If your model outputs non-normalized vectors, the best distance measure likely depends on the task type (classification, search, or others) and on how the model works, so check its documentation.