Wietse Venema's blog

Is cosine similarity always the best choice for text embedding search?

While working on a recent Retrieval Augmented Generation (RAG) demo with text embedding search, I wondered if cosine similarity is always the best similarity measure to use. I figured other measures such as dot product and Euclidean distance (L2) might be just as suitable. Before I knew it I was refreshing my linear algebra. Let’s lay it all out, starting with some context.

What are text embeddings

Text embeddings are arrays of numbers (vectors) that capture the meaning of words and concepts and the relationships between them. You can also think of vectors as coordinates: text embeddings that are close to each other are derived from text that has similar meaning, which makes them ideal for implementing document search.

To compute a text embedding, you can use an open source model or a proprietary one. The MTEB leaderboard provides a great overview. MTEB is a benchmark that compares the performance of text embedding models over different types of tasks. For my demo, I used Vertex AI text embeddings.

The recipe for implementing text embedding search is straightforward. Begin by computing text embeddings for all (chunks of) your documents and store them in a vector database. When searching, you compute the text embedding of the query, and find text embeddings in the database that are closest to the query’s embedding. For that, you need a distance function.
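The recipe above can be sketched as a brute-force search over a small in-memory array. This is only an illustration: the `top_k` helper and the toy 3-dimensional vectors are made up for this example, and a real setup would get its vectors from an embedding model and store them in a vector database.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Return the indices and scores of the k documents closest to the query.

    Hypothetical helper: uses cosine similarity and scans every document.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    order = np.argsort(-scores)[:k]     # highest similarity first
    return order, scores[order]

# Toy "embeddings"; real ones would come from an embedding model.
doc_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
query_vec = np.array([1.0, 0.0, 0.0])

indices, scores = top_k(query_vec, doc_vecs, k=2)
print(indices, scores)
```

A vector database does the same job, typically with an approximate index so it does not have to compare the query against every stored vector.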

Distance functions

Let’s focus on the most common distance functions:

Cosine similarity
Measures the similarity between vectors based on the angle they form. This measure completely ignores the length of the vectors.
Dot product
The dot product is comparable to cosine similarity, but it does take the length of the vectors into account.
Euclidean distance (L2)
This function returns the straight-line distance between two vectors, treating them as coordinates. You might also see this referred to as L2 distance.
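To make the three definitions concrete, here is a minimal NumPy sketch. The function names are my own, chosen for this example. Note that cosine similarity and dot product grow as vectors get more similar, while Euclidean distance shrinks.

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by both lengths, so only the angle matters.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot_product(a, b):
    # Depends on the angle and on the lengths of both vectors.
    return np.dot(a, b)

def euclidean_distance(a, b):
    # Length of the difference vector (L2 norm).
    return np.linalg.norm(a - b)

a = np.array([1.0, 0.0])
b = np.array([3.0, 0.0])  # same direction as a, three times as long

print(cosine_similarity(a, b))   # ignores the difference in length
print(dot_product(a, b))         # grows with the length of b
print(euclidean_distance(a, b))  # sees a and b as two distinct points
```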

For normalized vectors, all three measures produce the same rankings

Here’s the key insight. If the output of your model is a normalized vector, these three distance functions provide exactly the same rankings. Read this Wikipedia page on cosine similarity to learn why.
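The short version of the argument, for two unit-length vectors a and b (my own summary, not a quote from that page):

```latex
\begin{align*}
\cos(a, b) &= \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = a \cdot b
  && \text{since } \lVert a \rVert = \lVert b \rVert = 1 \\
\lVert a - b \rVert^2 &= \lVert a \rVert^2 + \lVert b \rVert^2 - 2\, a \cdot b = 2 - 2\, a \cdot b
\end{align*}
```

So cosine similarity and dot product are the same number, and Euclidean distance shrinks exactly as the dot product grows. Sorting by highest cosine similarity, highest dot product, or lowest Euclidean distance therefore gives the same order.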

It looks like it’s common for text embedding models to produce normalized vectors. The Vertex AI text embeddings are normalized for the full-length 768-dimensional vectors, and so are the OpenAI text embeddings.
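If you are not sure whether your model returns normalized vectors, you can check the lengths yourself. A small sketch, assuming the embeddings are rows of a NumPy array; `is_normalized` and `normalize` are hypothetical helper names, not part of any embedding API.

```python
import numpy as np

def is_normalized(vectors, tol=1e-6):
    """True if every row has length 1, within a small tolerance."""
    norms = np.linalg.norm(vectors, axis=1)
    return bool(np.allclose(norms, 1.0, atol=tol))

def normalize(vectors):
    """Scale every row to length 1."""
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Toy vectors standing in for model output.
raw = np.array([[3.0, 4.0], [1.0, 0.0]])

print(is_normalized(raw))             # the first row has length 5
print(is_normalized(normalize(raw)))  # every row now has length 1
```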

So yes, cosine similarity works great for normalized vectors. In theory, dot product might be faster to compute, because it skips the division by the vector lengths that cosine similarity requires.

For non-normalized vectors, it depends

If your model outputs non-normalized vectors, the best distance measure likely depends on the task type (classification, search, or others) and how the model works, so check its documentation.
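To see why the choice matters here, consider a toy example with made-up 2-dimensional vectors in which the three measures disagree about the best match. It does not tell you which measure is right for your model, only that they are no longer interchangeable.

```python
import numpy as np

query = np.array([1.0, 0.0])
docs = np.array([
    [1.0, 0.0],   # document 0: same direction as the query, length 1
    [3.0, 3.0],   # document 1: 45 degrees off, but much longer
])

cosine = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
dot = docs @ query
euclidean = np.linalg.norm(docs - query, axis=1)

best_by_cosine = int(np.argmax(cosine))        # highest similarity wins
best_by_dot = int(np.argmax(dot))              # highest dot product wins
best_by_euclidean = int(np.argmin(euclidean))  # lowest distance wins

print(best_by_cosine, best_by_dot, best_by_euclidean)
```

Cosine similarity and Euclidean distance pick document 0, while dot product picks document 1 because its length outweighs the worse angle.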