Understanding Semantic Search: How Vectors Power Meaning-Based Queries
Imagine you have a massive amount of text stored in a database. It is broken up into blocks, where each block might be a blog post, a Reddit thread, or a customer support ticket. Now, let's say you want to build a search feature for this content. You want users to type in a query, similar to how they would on Google, and retrieve the items most relevant to their question.
In this post, we are going to explore why traditional search methods often fail at this task and how semantic search using vectors offers a powerful solution.
The Limitations of Keyword Search
To understand why we need semantic search, we first need to look at how a basic keyword search operates. Suppose a user comes along and asks, "Tell me about fruit." A keyword search engine takes these specific words and looks for them directly in your content database.
If you have articles that explicitly contain the word "fruit," the search engine will find them. It might highlight the word in green and return those results. This works reasonably well for basic queries where the user knows exactly what vocabulary the content uses.
However, this approach falls apart quickly, particularly with a large, diverse dataset.
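The keyword approach described above boils down to literal word matching. Here is a minimal sketch (the documents and function names are invented for illustration):

```python
import re

# Two toy documents: one about fruit, one about Apple the company.
documents = [
    "Eating plenty of fruit and vegetables is part of a balanced diet.",
    "The new MacBook Pro ships with an Apple silicon chip.",
]

def tokenize(text):
    # Lowercase and split into whole words, dropping punctuation.
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_search(query, docs):
    # A document matches if it shares at least one word with the query.
    terms = tokenize(query)
    return [d for d in docs if terms & tokenize(d)]

print(keyword_search("Tell me about fruit", documents))
# Finds the fruit article, because "fruit" appears verbatim.

print(keyword_search("Should I eat apples?", documents))
# Returns nothing: neither document contains the literal word "apples",
# even though the first one is clearly relevant.
```

The second query illustrates the core weakness: the fruit article is semantically relevant but shares no exact vocabulary with the query, so keyword matching misses it entirely.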
Figure 1: Keyword search struggles with ambiguity, matching "apples" to technology articles instead of nutritional information.
Consider the example shown above. Let's say a user asks, "Should I eat apples?"
Interestingly, the word "apple" might not appear in a general article about a balanced diet or bananas, even though those articles are semantically relevant to the topic of fruit. However, the word "Apple" (capitalized) appears frequently in articles about MacBook Pros and Snapdragon chips because it refers to the technology company.
Because keyword search engines simply match strings of text, there is a very high chance that asking "Should I eat apples?" will return technical articles about laptops rather than advice on fruit and vegetables. This is exactly where keyword search breaks down and where semantic search comes in.
What is Semantic Search?
Semantic search is all about searching by meaning rather than just matching characters.
To visualize this, imagine that every item in your database is given a tag that describes the meaning of that article. You might tag articles about bananas and balanced diets with "Fruit Facts." Conversely, you would tag articles about MacBooks and processors as "Computer Facts."
Figure 2: Semantic search conceptually involves tagging content with labels that represent its underlying topic.
In this system, we have the raw text, but we also have a label indicating the topic. When a user asks, "Should I eat apples?", we don't just look for the word "apple." We try to determine the label or intent of the search query.
We determine that the user is asking about "Fruit Facts." Consequently, we go to our database and pull out all the content related to that topic. This is the fundamental idea behind semantic search: it determines the meaning of the input query and finds articles with a similar meaning.
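The tagging idea above can be sketched in a few lines. In reality a model would infer the query's topic; here the classifier is hardcoded purely to illustrate the lookup (all article texts and labels are made up):

```python
# Each article carries a label describing its topic.
tagged_articles = {
    "Bananas are rich in potassium.": "Fruit Facts",
    "A balanced diet includes fresh produce.": "Fruit Facts",
    "The MacBook Pro uses an Apple silicon chip.": "Computer Facts",
}

def classify_query(query):
    # Stand-in for a real model that infers the query's intent.
    return "Fruit Facts" if "eat" in query.lower() else "Computer Facts"

def semantic_lookup(query):
    # Retrieve every article whose label matches the query's inferred topic.
    label = classify_query(query)
    return [text for text, tag in tagged_articles.items() if tag == label]

print(semantic_lookup("Should I eat apples?"))
# Returns both fruit articles, even though neither mentions "apples".
```

Notice that the word "apples" never appears in the results; matching happens at the level of topic, not text.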
How Vectors Make Meaning Computable
We know we want to search by meaning, but how do we implement this technically? We cannot manually tag millions of database records. This is where vectors come in.
If you can represent each piece of content in your database as a vector (an array of numbers), you can perform simple mathematics to find the closest matches to your search query.
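Once content is represented as vectors, "finding the closest match" reduces to a similarity computation. A common choice is cosine similarity; here is a minimal sketch using invented three-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical
    # direction (same meaning), values near 0 mean unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings", invented for illustration.
fruit_article = [0.9, 0.1, 0.2]
tech_article = [0.1, 0.8, 0.7]
query = [0.8, 0.2, 0.1]  # imagine this encodes "Should I eat apples?"

print(cosine_similarity(query, fruit_article))  # high: similar meaning
print(cosine_similarity(query, tech_article))   # low: different topic
```

Ranking the database by this score and taking the top results is, at its core, what a vector search engine does.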
Plotting Meaning on a Graph
Let's visualize this on a two-dimensional graph. Imagine each point on the graph represents a record in your database. Ideally, these points are not randomly distributed. Instead, articles with similar meanings are grouped together.
Figure 3: Representing text as coordinates allows us to group similar concepts, such as Fruit and Computers, into distinct clusters.
If we can map the meaning of the text to a position on this graph, we can see clear clusters. The articles about computers appear in one area (perhaps the top right), and the articles about fruit appear in another (the bottom left).
When a user searches for "Do apples grow on trees," we create a vector from that search query and plot it on the same graph. We then look for the vectors that are closest to that specific point. In this case, the search query would land right next to the fruit articles. We simply pull those records from the database and return them as the result.
If the user searched for something regarding MacBooks, that vector would land near the computer cluster, and we would return the tech articles.
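The nearest-neighbor lookup described above can be sketched directly with two-dimensional points. The coordinates and article titles below are invented; a real system would derive them from an embedding model:

```python
import math

# Invented 2D coordinates: fruit articles cluster in one region,
# computer articles in another.
records = {
    "All about bananas": (1.0, 1.2),
    "Building a balanced diet": (1.3, 0.9),
    "MacBook Pro review": (8.1, 7.9),
    "Snapdragon vs Apple silicon": (7.8, 8.3),
}

def nearest(query_point, k=2):
    # Sort all records by Euclidean distance to the query and keep the top k.
    return sorted(
        records,
        key=lambda title: math.dist(query_point, records[title]),
    )[:k]

# "Do apples grow on trees?" lands near the fruit cluster.
print(nearest((1.1, 1.0)))
```

A query vector near the computer cluster, say `(8.0, 8.0)`, would return the two tech articles instead.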
From 2D to Multi-Dimensional Space
While the examples above use a simple X and Y axis, real-world vector search is much more complex. We don't use just two dimensions; we use hundreds or even thousands. A typical vector might exist in 512-dimensional space or higher.
Conceptually, however, the idea remains the same. We are grouping things by meaning in multi-dimensional space. We take a search query, plot it into that space, and find the nearest neighbors.
A Real-World Example with Embeddings
To demonstrate this with actual data, I used a Large Language Model (LLM) to convert a few English words into vectors. This process is often called generating "embeddings."
I used the OpenAI embedding model to capture the meaning of specific words and map them onto a 2D space for visualization.
Figure 4: A real-world example using OpenAI embeddings shows how semantically similar words like 'cat', 'lion', and 'tiger' naturally cluster together.
As you can see in the graph above, the model has successfully grouped related concepts. The words "cat," "tiger," and "lion" are clustered tightly together in the top left corner. Meanwhile, "train" and "helicopter" (transportation modes) are grouped at the bottom. Words that are less related, like "mouse" or "space," appear as outliers or in their own distinct positions.
The important takeaway here is that even in this simplified view, the math has managed to capture the semantic relationships between words. If we were to introduce a new word, like "panther," and plot it on this graph, it would almost certainly appear near the existing cat/lion/tiger cluster.
By identifying that proximity, we can programmatically determine that a "panther" is semantically similar to a "tiger," allowing us to return relevant results even if the user never typed the word "tiger."
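The pipeline behind Figure 4 has two steps: get a high-dimensional embedding per word, then project the embeddings down to 2D for plotting. The sketch below uses small invented vectors in place of real model output so it runs offline, and reduces them with PCA via NumPy's SVD:

```python
import numpy as np

# In practice these vectors would come from an embedding model
# (e.g. OpenAI's embeddings API); here they are invented 4-dimensional
# stand-ins so the example is self-contained.
words = ["cat", "lion", "tiger", "train", "helicopter"]
embeddings = np.array([
    [0.90, 0.80, 0.10, 0.10],  # cat
    [0.80, 0.90, 0.20, 0.10],  # lion
    [0.85, 0.85, 0.10, 0.20],  # tiger
    [0.10, 0.20, 0.90, 0.80],  # train
    [0.20, 0.10, 0.80, 0.90],  # helicopter
])

# PCA via SVD: center the data, then project onto the top two
# principal axes to get 2D coordinates for plotting.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
points_2d = centered @ vt[:2].T

for word, (x, y) in zip(words, points_2d):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```

Even with these toy vectors, the animal words land near each other and far from the vehicles, which is exactly the clustering behavior visible in the figure.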
Recap
- Keyword Search Limitations: Traditional search fails when the user's vocabulary doesn't match the database text exactly or when words have multiple meanings (ambiguity).
- Semantic Search: This approach focuses on the intent and meaning of the query rather than just the syntax.
- Vectors: We convert text into arrays of numbers (vectors) to represent their meaning mathematically.
- Proximity equals Similarity: By plotting these vectors in multi-dimensional space, we can find relevant content by simply locating the data points closest to the search query.
In future posts, we will dive deeper into the code required to build these graphs and how to apply these concepts to real data from sources like Wikipedia.