Caleb Knox

Vector Search: Revolutionizing the Way We Find Information

Moving from Keywords to Concepts with Vector Search

In the ever-evolving world of information retrieval, search strategies and algorithms have long relied on keyword matching to help users find the results they are looking for. This approach can be extremely accurate when a user knows the exact content of the item they are seeking, but it is limited by how precise the query is and struggles to interpret more abstract inputs. As the demand for more intuitive search experiences grows, a shift toward vector-based search has emerged as a promising solution.


Vector search is being used across multiple disciplines, most notably in search engines. There are plenty of other business applications as well, such as recommendation systems, which use vector search to provide personalized suggestions based on users' preferences and online behavior. Vector search is also being applied in critical areas such as fraud detection, where it helps identify suspicious patterns, and drug discovery, where it helps researchers identify potential drug candidates by comparing molecular structures and matching compounds to specific diseases or conditions.



The Brains of the Operation: How Vector Embeddings Work


At the core of vector search are embeddings—mathematical representations of data such as words, sentences, images, or audio—created by deep learning models. These embeddings map data into a high-dimensional vector space where semantically similar items are positioned close to one another. For instance, if the input is a word like "dog," an embedding model will produce a vector that places it in close proximity to other semantically related terms like "puppy," "cat," or "animal." The proximity of these vectors in the multi-dimensional space reflects their underlying semantic and contextual similarity.


To understand this concept more deeply, let's look at an example in action. The graph below displays the vectorization of three different sentences:


A: The quick brown fox jumps over the lazy dog.

B: The speedy hazel fox leaps above the resting puppy.

C: The hotel boasts a very high standard of comfort.


Vectors A and B are relatively close to each other in direction and magnitude, indicating a high degree of semantic similarity. This aligns with expectations, as both sentences describe a similar fictitious scenario with slight variations in vocabulary. Vector C, however, points in a distinct direction, diverging significantly from the other two vectors. This suggests a clear semantic difference, as the third sentence focuses on describing hotel comfort rather than a scene involving animals. The spread in vector directions effectively demonstrates the model's ability to capture similarities and differences based on sentence context, grouping related concepts together while differentiating unrelated ones.


Encoding a vector embedding with more dimensions typically allows for a deeper semantic and contextual understanding; however, higher dimensionality comes at the cost of increased storage requirements and slower processing. While a vector embedding can be any size, most are generated with between 50 and 1,024 dimensions.


It’s also important to note that while we’re seeing this in 3D, the vectors shown in the graph were originally generated as 384-dimensional vectors. To visualize them here, those 384 dimensions had to be reduced down to just three. This reduction captures the general relationships, but it’s a simplified view. In reality, the vectors in the full 384-dimensional space map far more nuanced differences and connections among the sentences.
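As a rough sketch of how such a reduction could be done (the use of PCA via scikit-learn is an assumption about tooling, not something prescribed by this post):

# A minimal sketch: project 384-dimensional sentence embeddings down to 3D for plotting.
# Assumes sentence-transformers and scikit-learn are installed.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "The speedy hazel fox leaps above the resting puppy.",
    "The hotel boasts a very high standard of comfort.",
]

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
embeddings = model.encode(sentences)                       # shape: (3, 384)

reduced = PCA(n_components=3).fit_transform(embeddings)   # shape: (3, 3)
print(reduced)  # three 3D points that can be plotted on a graph like the one above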


To generate an embedding such as the ones in the graphing scenario, we must either build or use a machine learning model. There are many publicly available vector embedding models, and they vary both in how they encode the data they represent and in the quantity and type of information they were trained on. Here are some of the most commonly used:


Word2Vec


  • Advantages

    • Efficient and Scalable: It can be trained efficiently on large datasets and provides well-constructed word vectors.

    • Handles Large Vocabularies: The various techniques employed under the hood allow the model to handle large vocabularies without any significant computational burden.

  • Drawbacks

    • Context Independent: It does not consider the context of words; each word has a single fixed vector, which makes it struggle with words with multiple meanings.

    • Word Limited: It is trained for word embeddings rather than sentence embeddings, which drastically limits its use cases.
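As a quick illustration (a minimal sketch using the gensim library, which is an assumed choice rather than something this post prescribes), training and querying a small Word2Vec model looks roughly like this:

# Minimal Word2Vec sketch using gensim; the toy corpus is a placeholder.
from gensim.models import Word2Vec

# Each sentence is a list of tokens; real training uses millions of sentences.
corpus = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "puppy", "played", "with", "the", "dog"],
    ["the", "hotel", "room", "was", "comfortable"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1)

vector = model.wv["dog"]                 # a single fixed 50-dimensional vector for the word "dog"
print(model.wv.most_similar("dog"))      # words whose vectors lie closest to "dog"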


BERT / SBERT


  • Advantages

    • Contextual Understanding: These models generate embeddings based on the entire sentence's context, allowing them to handle words with multiple meanings effectively.

    • Highly Available: There are many pre-trained BERT and SBERT models available for local download, with varying sizes depending on hardware constraints and processing requirements.

  • Drawbacks

    • Resource Intensive: These models are computationally expensive and require significant memory and processing power, particularly for tasks like fine-tuning.

    • Complexity in Usage: Fine-tuning and tweaking model parameters for specific tasks often requires expertise in machine learning and NLP, making it less approachable for beginners.


OpenAI Embedding Model


  • Advantages

    • High Quality: OpenAI’s embedding model is trained on vast and diverse data, making the embeddings high quality and capable of generalizing across a wide variety of domains.

    • Rich in Context: Like BERT and SBERT, it captures rich contextual information, providing high-performance embeddings for tasks such as semantic search or clustering.

  • Drawbacks

    • Costs Money: OpenAI's embedding model usage involves a pay-per-use cost, which can add up, particularly for large-scale projects or continuous use cases.

    • Limited Customizability: Since it is not open-source and fine-tuning options are limited, it is difficult to adapt the model to very specialized or domain-specific contexts.

    • Privacy and Control Concerns: Using OpenAI's model means sending data to an external service, which may raise privacy concerns or introduce compliance issues in certain industries.
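For completeness, here is a minimal sketch of calling OpenAI's embedding endpoint; the model name and client usage follow OpenAI's Python SDK and are assumptions, not details from this post:

# Minimal sketch of generating an embedding with OpenAI's Python SDK.
# Assumes the openai package is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The quick brown fox jumps over the lazy dog.",
)

embedding = response.data[0].embedding   # a list of floats returned by the API
print(len(embedding))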


While we have mainly been focusing on vector embeddings as a way to represent textual information, the truth is that anything can be stored as a vector. This includes images, video, and audio, allowing complex data types to be represented as numerical arrays that capture essential patterns and relationships for tasks like similarity comparison, search, and classification. An additional benefit is the ability to easily search across data types: natural language queries can be used to find images, and interesting combinations such as video-to-audio search are possible as well!
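As a sketch of cross-modal search (using a CLIP model exposed through the SentenceTransformers library, which is an assumed choice; the image path is a placeholder), text and images can be embedded into the same space and compared directly:

# Minimal cross-modal sketch: embed an image and a text query into one shared vector space.
# Assumes sentence-transformers and Pillow are installed.
from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer("clip-ViT-B-32")

image_embedding = model.encode(Image.open("dog_photo.jpg"))
text_embedding = model.encode("a photo of a dog playing in the grass")

print(util.cos_sim(image_embedding, text_embedding))  # higher values mean the text matches the image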



Pre-trained Models and Fine-Tuning

Using pre-trained embedding models is an effective way to save both time and cost, and it lowers the barrier to entry for working with artificial intelligence. There are libraries that can generate high-quality embeddings in just a few lines of code.


Example

Here is an example using the SentenceTransformers Python library, with a pre-trained model hosted on Hugging Face. The code converts a sentence into a vector embedding using the multi-qa-MiniLM-L6-cos-v1 model, which can be used for tasks like similarity search or text analysis.
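A minimal sketch of that snippet might look like this (the example sentence is a placeholder), with the print call below then displaying the result:

# Encode one sentence into a 384-dimensional embedding.
# Assumes the sentence-transformers package is installed; the model name is given above.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
embedding = model.encode("The quick brown fox jumps over the lazy dog.")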


print(embedding)

The last line in this code snippet prints out the embedding, which is an array of floating point numbers: [0.04385032877326012, -0.0050516980700194836, …, 0.12051314115524292]. The array has been condensed for simplicity; the original output array has a length of 384.


Fine-Tuning Models

Oftentimes these pre-trained models are trained on certain types of data (e.g., news articles) that yield solid general context; this is extremely useful when the goal is to embed text that does not fall under one topic or category. For many use cases, however, there is a huge benefit to be gained by fine-tuning a pre-trained model on additional data that is specific to your use case. Fine-tuning essentially means taking new data and feeding it into a model so that it can understand the context of that information on top of the general understanding it already has.
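A minimal sketch of fine-tuning with the SentenceTransformers training API is shown below; the training pairs, labels, and output path are placeholders, and real fine-tuning uses far more data:

# Fine-tune a pre-trained sentence embedding model on a tiny, domain-specific dataset.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

# Placeholder training pairs: label 1.0 means the texts should land close together, 0.0 means far apart.
train_examples = [
    InputExample(texts=["reset my password", "how do I change my login credentials"], label=1.0),
    InputExample(texts=["reset my password", "the hotel boasts a very high standard of comfort"], label=0.0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# One pass over the tiny dataset; real fine-tuning uses far more data and epochs.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0)
model.save("fine-tuned-model")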



Exploring the Vector Space: How Results Are Ranked

In order to make vector embeddings useful, we must have a way to compare their closeness in the vector space. There are multiple ways to go about doing this:


Cosine Similarity


How it's measured: Cosine similarity gives the cosine of the angle between two vectors and produces a value in the range [-1, 1]. This metric measures only the orientation of the vectors and disregards their magnitude. A value close to 1 indicates similarity, while values close to -1 indicate dissimilarity. A value of zero means the vectors are orthogonal (they point in completely unrelated directions).


Use case: Cosine similarity has wide applications in information retrieval, text analysis, and anywhere the magnitude of a vector does not matter; for example, computing document similarity regardless of document length.


Euclidean Distance


How it's measured: Euclidean distance calculates the straight-line separation between two points, or rather, vectors in n-dimensional space. The value returned belongs to the range [0, ∞). The larger the distance, the less similar the vectors are.


Use Case: Euclidean distance finds wide application in clustering scenarios, for example the K-nearest neighbors algorithm, where the distance between data points is a key consideration in establishing relationships.


Dot Product


How it’s measured: The dot product finds similarity by multiplying the corresponding elements of two vectors and then adding all the results together. In general, the dot product can return any value in the range (-∞, ∞); however, many vector embedding models normalize vectors to a magnitude of one. In that case, the resulting dot product is identical to the cosine similarity and falls into the range [-1, 1].


Use case: The dot product is often used in machine learning, especially in recommendation systems and neural networks, where it helps quantify the similarity between feature vectors or compute weighted sums.


Example

This code demonstrates how to use the SentenceTransformers library to calculate similarity metrics between sentences based on their vector embeddings. The model used, multi-qa-MiniLM-L6-cos-v1, converts text into 384-dimensional vectors, which are then compared using various mathematical techniques: cosine similarity, Euclidean distance, and dot product. These metrics help measure how similar the sentences are to one another.
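A minimal sketch of that comparison might look like the following; the embeddings are normalized to unit length here (an assumption), which is why the dot product matches the cosine similarity in the output below:

# Embed the three sentences and compare them pairwise with three similarity metrics.
# Assumes sentence-transformers and numpy are installed.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "The speedy hazel fox leaps over the resting puppy.",
    "The hotel boasts a very high standard of comfort.",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        a, b = embeddings[i], embeddings[j]
        print(f"Sentences being compared: {sentences[i]} and {sentences[j]}")
        print("Cosine Similarity:", np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        print("Euclidean Distance:", np.linalg.norm(a - b))
        print("Dot Product:", np.dot(a, b))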



Sentences being compared: The quick brown fox jumps over the lazy dog. and The speedy hazel fox leaps over the resting puppy. 


Cosine Similarity: 0.7698067261367332

Euclidean Distance: 0.6785179107904203

Dot Product: 0.7698067387390093


Sentences being compared: The quick brown fox jumps over the lazy dog. and The hotel boasts a very high standard of comfort. 


Cosine Similarity: 0.06580566983400683

Euclidean Distance: 1.3668901190986225

Dot Product: 0.06580566762770222


Sentences being compared: The speedy hazel fox leaps over the resting puppy. and The hotel boasts a very high standard of comfort. 


Cosine Similarity: 0.01766820197002008

Euclidean Distance: 1.4016646343052688

Dot Product: 0.017668203327981143


K-Nearest Neighbors (KNN) Search

A K-nearest neighbor search is a method used to find the K closest points to a given query in a vector space. It is commonly employed in vector search to return the most similar results based on some distance metric. To find the K-nearest neighbors of a query vector, there are two main strategies:


Exact Nearest Neighbor (ENN) Search

This type of search is the simplest and most precise method of finding the closest neighbors in the vector space. Using one of the previously mentioned distance metrics, every single vector is compared to the query vector and the top K results are returned. While this method guarantees that the true nearest neighbors are found, it is also computationally expensive and often time intensive.
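A minimal brute-force sketch of an exact search might look like this (the stored vectors, query, and k value are placeholders):

# Exact nearest neighbor search: compare the query against every stored vector.
import numpy as np

def exact_nearest_neighbors(query, vectors, k=3):
    # Normalize so the dot product equals cosine similarity.
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    similarities = vectors @ query
    top_k = np.argsort(similarities)[::-1][:k]   # indices of the k most similar vectors
    return top_k, similarities[top_k]

stored = np.random.rand(10_000, 384)   # pretend these are 10,000 document embeddings
query = np.random.rand(384)
print(exact_nearest_neighbors(query, stored, k=5))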


Approximate Nearest Neighbor (ANN) Search

To combat the high computational and time requirements of an ENN search, a more clever algorithm is needed. The most popular and widely used ANN algorithm is Hierarchical Navigable Small Worlds (HNSW), which relies on a multi-layered graph and is used in popular vector databases such as MongoDB, Pinecone, and Weaviate. HNSW builds multiple levels of graphs, starting from a sparse, high-level overview down to a base layer containing every vector. At each level, the search finds the closest matches and follows their path to the next layer, making it much more efficient than a brute-force search. This hierarchy allows HNSW to quickly find the closest vectors, giving a good balance of speed and accuracy without needing to exhaustively search through every data point. The illustration below showcases HNSW in action.

The nodes in blue indicate a path followed by the algorithm. In the last layer, the vectors surrounding the query vector are the top K results returned. One important parameter in HNSW is the exploratory factor (ef), which controls how many nodes are considered during the search. A higher value of ef means that more nodes are explored, leading to better accuracy since the search is more thorough. However, this also increases the computational time. Choosing the right ef value allows for balancing accuracy and speed depending on the specific use case.
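As an illustration of these parameters, a standalone library such as hnswlib (an assumed choice; the vector databases mentioned above implement HNSW internally) exposes them directly:

# Minimal HNSW sketch using hnswlib; the data and parameter values are placeholders.
import hnswlib
import numpy as np

dim = 384
data = np.random.rand(10_000, dim).astype(np.float32)   # placeholder embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(data), ef_construction=200, M=16)  # graph construction parameters
index.add_items(data)

index.set_ef(50)   # ef: higher values explore more nodes, improving accuracy at the cost of speed
labels, distances = index.knn_query(np.random.rand(dim).astype(np.float32), k=5)
print(labels, distances)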


Managing a Vector Database in MongoDB: Key Challenges and Considerations

Vector search and storage present unique challenges that differ from the traditional structured database queries developers are accustomed to. One primary issue is the complexity of high-dimensional data. Vectors often contain hundreds, if not thousands, of dimensions, making them computationally intensive to query and index efficiently. Additionally, indexing the data so that matches can be found for a query requires specialized methods and design, including hierarchical navigable small worlds (HNSW) structures and specific similarity metrics. While MongoDB handles most of these intricacies for the developer, such as generating the search index based on configurable parameters, there are still a few tasks that must be handled manually. For example, MongoDB does not handle the vectorization of text or queries, and developers must also create and configure the search index themselves, ensuring it aligns with the needs of their database.


When to Use Certain Vector Sizes


Small Vectors (50-199 dimensions): Suitable for simple tasks like document classification or small-scale recommendation systems. These offer faster query times and lower storage costs but may lack accuracy.


Medium Vectors (200-499 dimensions): Commonly used for general purpose semantic search and recommendation. They offer a balance between accuracy and resource efficiency.


Large Vectors (500+ dimensions): Ideal for complex tasks like image recognition, detailed similarity searches, or precision data analysis. Larger vectors provide greater accuracy but at the cost of increased computation and storage requirements.


Creating a Search Index

Creating a vector search index involves optimizing your database for fast retrieval of similar vectors, and choosing the right similarity metric, like cosine similarity or Euclidean distance, to compare them. This allows you to tailor the search process to your specific application, balancing speed and accuracy based on the chosen metric.


Example

This code demonstrates how to work with vector embeddings in MongoDB, covering both insertion and search. First, vectors are stored in a MongoDB collection using the insert_many() function, allowing embeddings to be efficiently saved for future use. To enable vector search, a vector index is created using cosine similarity, allowing the database to quickly retrieve the most similar vectors. A query embedding is then matched against the stored vectors using a pipeline, which returns the top K results ranked by similarity.


Inputting Vectors into a MongoDB Database
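A minimal sketch of this step might look like the following; the connection string, database, and collection names are placeholders:

# Store sentences and their embeddings in a MongoDB collection.
# Assumes pymongo and sentence-transformers are installed and an Atlas cluster is available.
from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net")
collection = client["vector_demo"]["sentences"]

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "The speedy hazel fox leaps over the resting puppy.",
    "The hotel boasts a very high standard of comfort.",
]

documents = [
    {"text": s, "embedding": model.encode(s).tolist()}   # store the vector as a plain list of floats
    for s in sentences
]
collection.insert_many(documents)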



Adding a Search Index to a MongoDB Vector Database
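A minimal sketch of defining the vector search index from Python; the definition format follows Atlas Vector Search conventions, a recent pymongo version is assumed, and the index and field names are placeholders:

# Create an Atlas Vector Search index over the "embedding" field.
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

collection = MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net")["vector_demo"]["sentences"]

index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",        # the document field holding the vector
                "numDimensions": 384,       # must match the embedding model's output size
                "similarity": "cosine",     # cosine, euclidean, or dotProduct
            }
        ]
    },
    name="vector_index",
    type="vectorSearch",
)
collection.create_search_index(model=index_model)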



Searching a MongoDB Vector Database
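A minimal sketch of the query pipeline; the index name matches the placeholder above, and numCandidates and limit are tunable:

# Run a vector search: embed the query and rank stored documents by similarity.
from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
collection = MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net")["vector_demo"]["sentences"]

query_embedding = model.encode("A fast fox jumping over a sleepy dog").tolist()

pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",
            "path": "embedding",
            "queryVector": query_embedding,
            "numCandidates": 100,   # how many candidates the ANN search considers
            "limit": 3,             # the top K results to return
        }
    },
    {"$project": {"_id": 0, "text": 1, "score": {"$meta": "vectorSearchScore"}}},
]

for result in collection.aggregate(pipeline):
    print(result)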



Future Trends in Vector Search

Vector embeddings for improved search started showing up in the 2010s in search engines such as Google. No longer did a search engine rely solely on word matching; instead, it could infer meaning from a query. This transition began with Google's release of Word2Vec in 2013. The shift to vector search made engines smarter because they now understand the context behind words rather than relying purely on keyword matches.


Since then, there have been many advancements in artificial intelligence, such as publicly available generative AI models like ChatGPT that output human-like responses based on input prompts. These LLMs are trained on massive corpora that allow them to generate text on most topics; however, there is information these models lack. First, staying up to date requires constantly retraining the model on new data, which is expensive, so models are often behind. Second, models lack any private information an organization might have. This is where retrieval augmented generation picks up the slack.


Retrieval Augmented Generation

Retrieval augmented generation (RAG) is a method that combines the strengths of vector search and generative models to enhance query responses. In RAG, relevant information is first retrieved from a knowledge base using a KNN vector search, and the retrieved content is used as context for a generative language model. This approach allows the model to generate more accurate and context-aware responses by grounding the generation in real, up-to-date data rather than relying solely on the model's training knowledge.


Example

This code demonstrates a retrieval augmented generation (RAG) approach by combining vector search with generative language models. It uses MongoDB to store fictitious company policy embeddings generated by a pre-trained SentenceTransformer model. When a query is made, it’s embedded and compared to stored policies using cosine similarity. The most relevant policy is retrieved and appended as context to the query, which is then sent to a language generation API (Gemini in this case) for an enhanced response, incorporating the company's handbook information.


Inputting the Data for Future Retrieval
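A minimal sketch of this step; the connection details are placeholders, and the policy text comes from the example output shown further below:

# Store a fictitious company policy along with its embedding for later retrieval.
from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
client = MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net")
policies = client["company_handbook"]["policies"]

policy_text = ("Clients who cancel their subscription within the first 10 days are eligible "
               "for a full refund minus a $50 processing fee.")
policies.insert_one({"text": policy_text, "embedding": model.encode(policy_text).tolist()})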



Calling the Generative AI Model with Included Context
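A minimal sketch of this step, assuming the google-generativeai package and a placeholder model name; the retrieval step is abbreviated here to the policy stored above rather than a full vector search:

# Send the query plus the retrieved policy context to Gemini and print the grounded answer.
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
llm = genai.GenerativeModel("gemini-1.5-flash")

query = "What is the subscription cancellation policy?"
retrieved_policy = ("Clients who cancel their subscription within the first 10 days are eligible "
                    "for a full refund minus a $50 processing fee.")  # in practice, fetched via vector search

prompt = f"{query} Use this information from the company handbook: {retrieved_policy}"
response = llm.generate_content(prompt)
print(response.text)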



Query & Context: 

What is the subscription cancellation policy? — with this information from the company handbook: Clients who cancel their subscription within the first 10 days are eligible for a full refund minus a $50 processing fee.


Answer: 

The subscription cancellation policy states that:


  • Clients can receive a full refund of their subscription fee if they cancel within the first 10 days.

  • However, there is a $50 processing fee deducted from the refund.


In other words, clients who cancel their subscription within the initial 10-day period will be eligible for a partial refund, with $50 subtracted from the total amount.


Real Time Vector Search

Another future trend in vector search is real-time implementation, allowing systems to deliver instant results as users interact with various platforms. Real-time vector search is a simple idea that is very difficult to implement in practice. Creating live recommendation systems requires both powerful hardware and the ability to execute large vector comparisons at lightning speed. However, advancements in computing power over the past few years have led to a spike in the number of predictive tools in our everyday applications like Google Docs, Microsoft Word, Spotify, and more.


Conclusion

As technology gets more advanced, managing all the data we generate is becoming a bigger challenge. That’s why vectorization is such a game changer for developers. In the past, it was all about keeping data as organized as possible, but now we’re learning to work with a mix of both structured and unstructured data. Vectorization helps make sense of it all, allowing us to connect and reference different types of information more easily and efficiently.



