mhirschberg/ftsdemo

Vector Search demo

Prerequisites

Create a free Couchbase Capella account.

Create a Project:

image

Then click the project name and click the Create cluster button. This opens the cluster creation dialog; simply hit the blue Create cluster button:

image

After about 5 minutes, your cluster is ready to use.

Now click on the cluster name and then on Import data. We're going to import a sample data set illustrating the well-known RGB model.

image

Select Load sample data and choose the color-vector-sample bucket. Then hit the Import button.

image

Sample document

Each color is a JSON document. Let's open one and understand its structure.
Click Data tools, select the Documents tab, and set the context to the color-vector-sample bucket, color scope and rgb collection.
Now get the 000080 document and open it.

image

There you can see the following fields:

  • id: the hex code of the color
  • color: the name of the color
  • brightness: a calculation of the brightness to the human eye
  • colorvect_l2: vector based on the RGB color
  • description: a text describing the color
  • embedding_model: the model used to encode the embedding_vector_dot vector
  • embedding_vector_dot: vector based on the field “description”, encoded via the text-embedding-ada-002 OpenAI model
  • verbs: list of qualifiers

Now, what is the color described in the document?
Let's use a color picker (in this example I used whatever Google found first) to find out.
You can use either the hex code (the document key 000080) or the RGB values stored in the colorvect_l2 field (0, 0, 128).

image

The rgb collection contains 153 documents. This is obviously just a fraction of the 16M+ colors you can encode with RGB. Here are some examples of the colors available in the rgb collection.

image
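The link between the document key and the colorvect_l2 values, and the size of the RGB space mentioned above, can be checked with a few lines of Python. This is only a sketch; the helper names hex_to_rgb and rgb_to_hex are made up for this example and are not part of the demo.

```python
def hex_to_rgb(hex_code):
    """Convert a 6-digit hex color code such as '000080' to [R, G, B]."""
    hex_code = hex_code.lstrip("#")
    return [int(hex_code[i:i + 2], 16) for i in (0, 2, 4)]


def rgb_to_hex(rgb):
    """Convert [R, G, B] (each 0..255) back to a 6-digit lowercase hex code."""
    return "".join(f"{channel:02x}" for channel in rgb)


navy = hex_to_rgb("000080")
total_colors = 256 ** 3  # 3 channels with 256 possible values each

print(navy)                    # [0, 0, 128]
print(rgb_to_hex([0, 0, 64]))  # 000040
print(total_colors)            # 16777216
```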

Simple Vector Similarity Queries

Create your first Vector Search index

We are interested in running search queries that leverage the "colorvect_l2" vector, which contains the 3 RGB dimensions of each color.
This is a very simple example that has the benefit of being easy to understand and small enough to display the entire vector.
Go to Data Tools -> Search page and click Create Search Index.

image

Choose the Quick Mode (default), set the context to the color-vector-sample bucket, color scope and rgb collection. Name your index rgb_idx.

image

In the schema area of the Type Mappings section, click on the field colorvect_l2.
This will automatically populate the right area with the type, dimension, similarity metric, Optimized for configuration settings.
Now, click Add to Index.

image

image

Click Create Index. Your index rgb_idx is now available in the list of Search indexes.

Run your first Vector Search Query

At the end of the row of your index, click the search icon 🔎.
This will open a page to run your search queries.

image

Let's find the top 3 nearest colors to the color Navy, which is encoded with the vector [0, 0, 128].
In the search area, run the following search query:

{
  "knn": [
    { "field": "colorvect_l2",  "vector": [0, 0, 128],"k": 3 }
  ]
}

You should see the following results:

image
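To build some intuition for what the knn clause does, here is a minimal brute-force sketch in Python. It ranks a hand-picked subset of five colors (only those whose RGB values appear in this walkthrough, not the full 153-document collection) by Euclidean distance to the query vector. The names COLORS and knn are made up for illustration. The Search service relies on an index rather than a full scan, and the real results may include colors that are missing from this toy subset.

```python
import math

# Toy subset: only colors whose RGB values are mentioned in this walkthrough.
COLORS = {
    "navy": [0, 0, 128],
    "midnight blue": [25, 25, 112],
    "black": [0, 0, 0],
    "dark purple": [80, 0, 80],
    "white": [255, 255, 255],
}


def knn(query, k, colors=COLORS):
    """Return the names of the k colors closest to query (Euclidean distance)."""
    ranked = sorted(colors, key=lambda name: math.dist(query, colors[name]))
    return ranked[:k]


top3 = knn([0, 0, 128], 3)
print(top3)  # ['navy', 'midnight blue', 'dark purple']
```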

What are those color ids referring to?
Let's tweak our Vector index a bit to get the colors displayed as well.
Click Back to List then click your rgb_idx to open its definition.
Click the field color, which contains the name of each color in the JSON documents.
In the Type Mapping configuration area, check Include in search results. Click Add to Index and then Update Index at the bottom of the screen.

image

Now, run the same query as above, this time including the color field.

{
  "knn": [
    { "field": "colorvect_l2",  "vector": [0, 0, 128],"k": 3 }
  ],
  "fields": ["color"]
}

You should see the following results:

image

But wait, are those results accurate?
To check that out, here is a table of both the Vector Query and results.
The cells of the table are simply filled with their hex codes using a color picker.

image

As expected, navy itself is returned first: the color is already present in the database, so it is the most similar color.
Remember that an exact match gets extra boosting.
You can see that in the score itself: navy scores 1.7976931348623157e+308, which is far larger than the scores of the other results.
That said, the next 2 colors are very similar to navy!
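The score shown for the exact match is not an arbitrary number: it is the largest finite value a 64-bit floating-point number can hold. The short check below confirms this. Reading it as a sentinel that the engine assigns to zero-distance matches is an assumption based on this value, not something stated in this demo.

```python
import sys

exact_match_score = 1.7976931348623157e+308  # score displayed for navy

# sys.float_info.max is the largest finite double-precision float.
is_max_double = exact_match_score == sys.float_info.max
print(is_max_double)  # True
```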

Run Simple Vector Search Query

We now want to run a Vector Search Query with a vector that does not exist already in the database.
First, let's find out if the color vector [0,0,64] exists in the database. A simple RGB color picker tells us that this color is #000040.

image

Let's verify that this color doesn't exist in our database.
Click Data tools, select Documents tab, there set the context to color-vector-sample bucket, color scope and rgb collection.
Try to get a document with the key 000040 - there's none, so we're good to go.

image

Now, click Search and then, at the end of the row of your index, click the search icon 🔎.
Let's run our query to retrieve the top 3 nearest colors to [0,0,64] and review the results:

{
  "knn": [
       {"field": "colorvect_l2", "vector": [0,0,64], "k": 3}
  ],
  "fields": ["color"]
}

Results are returned even though [0,0,64] itself is not in the database.

image

Interestingly enough, midnight blue comes out as the top result, while we would probably consider black more similar to the color [0,0,64] when simply comparing the two colors with our own eyes. Why is the vector search returning black only in third position, then?

First of all, what is the similarity metric used to calculate the similarity between the colorvect_l2 color vectors? Go back to your rgb_idx index definition and take a closer look at the type mapping for the field colorvect_l2. The metric is l2_norm, also known as the Euclidean distance.

image

Let's first add the RGB vector of each color to the table of results as well. You can get that information either from a color picker (enter the hex code and it will provide the RGB values) or by clicking each document in the list of results and looking at its colorvect_l2 vector field.

image

Let's run some simple math here.
The Euclidean distance between two 3-dimensional vectors $[x_1, y_1, z_1]$ and $[x_2, y_2, z_2]$ is $\sqrt{(x_2-x_1)^2+(y_2-y_1)^2+(z_2-z_1)^2}$.

  • The distance between [0,0,64] and midnight blue [25,25,112] is $\sqrt{25^2+25^2+(112-64)^2} \approx 59.6$
  • The distance between [0,0,64] and black [0,0,0] is $\sqrt{64^2} = 64$.

In other words, midnight blue is indeed closer to the color [0,0,64] than black with respect to the L2 similarity metric.
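The two distances above can be reproduced with Python's standard library. This is just a quick check of the arithmetic; math.dist computes the Euclidean distance between two points.

```python
import math

query = [0, 0, 64]
midnight_blue = [25, 25, 112]
black = [0, 0, 0]

d_midnight = math.dist(query, midnight_blue)  # sqrt(25^2 + 25^2 + 48^2)
d_black = math.dist(query, black)             # sqrt(64^2)

print(round(d_midnight, 1))  # 59.6
print(d_black)               # 64.0
```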

Now, take a closer look at the score of each result that is displayed on the right side.

image

Note that both black and navy have the same score.
Running the same simple math on navy, we get that the distance between [0,0,64] and navy [0,0,128] is $\sqrt{(128-64)^2} = 64$.

Navy and black are equally similar to [0,0,64] with respect to the L2 similarity metric.
That's why they get the same score in the list of results.
So you might see navy in either second or third position, depending on the mood of the search engine at the time you run this query.
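The effect of the tie can be illustrated with a small Python sketch. Black and navy are at exactly the same distance from the query, so a ranking based on distance alone cannot order them; the final order depends on whatever tie-breaking the ranking code happens to apply. Here the tie-break is simply the input order (Python's sort is stable), which is an illustration only and says nothing about how the Search service breaks ties.

```python
import math

query = [0, 0, 64]
candidates = {
    "midnight blue": [25, 25, 112],
    "black": [0, 0, 0],
    "navy": [0, 0, 128],
}
distances = {name: math.dist(query, vec) for name, vec in candidates.items()}

# Tied entries keep their input order because sorted() is stable.
order_a = sorted(["black", "navy", "midnight blue"], key=distances.get)
order_b = sorted(["navy", "black", "midnight blue"], key=distances.get)

print(distances["black"] == distances["navy"])  # True
print(order_a)  # ['midnight blue', 'black', 'navy']
print(order_b)  # ['midnight blue', 'navy', 'black']
```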

Number of Results of a Vector Query

Up to now, we have run vector queries returning the top 3 nearest neighbors of the query vector.
We used the k parameter of the knn vector query to specify that.
Let's explore the number of results of a vector query and its practical implications in more detail.

Let's get crazy and change the k parameter of the vector query to 153, which is the total number of documents in our collection rgb:

{
  "knn": [
       {"field": "colorvect_l2", "vector": [0,0,64], "k": 153}
  ],
  "fields": ["color"]
}

Notice that the query actually returns 153 results. Of course, the last results have a low score compared to the first ones.
That's very intuitive, as the color white is not similar at all to [0,0,64].
The question now is: what is a good value for k so that the results are relevant for the application and ultimately the end user?
Imagine your application is a recommendation engine. Providing many recommendations is great, but only if they bring some value to the customer.
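The same behavior shows up in a brute-force sketch: when k is at least the number of documents, every document comes back, ordered from nearest to farthest. The five-color SAMPLE below is a toy subset built from RGB values quoted in this walkthrough, and rank_all is a made-up helper name.

```python
import math

# Toy subset of the collection (the real one has 153 documents).
SAMPLE = {
    "navy": [0, 0, 128],
    "midnight blue": [25, 25, 112],
    "black": [0, 0, 0],
    "dark purple": [80, 0, 80],
    "white": [255, 255, 255],
}


def rank_all(query, k, colors=SAMPLE):
    """Return up to k (name, distance) pairs, nearest first."""
    ranked = sorted(
        ((name, math.dist(query, vec)) for name, vec in colors.items()),
        key=lambda pair: pair[1],
    )
    return ranked[:k]


results = rank_all([0, 0, 64], 153)  # k is larger than the sample size
print(len(results))    # 5
print(results[0][0])   # midnight blue
print(results[-1][0])  # white
```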

Now, let's change the k parameter of the vector query to 5.

{
  "knn": [
       {"field": "colorvect_l2", "vector": [0,0,64], "k": 5}
  ],
  "fields": ["color"]
}
image

You can see with the naked eye that the relevance of the results decreases rapidly. The question is: how fast? And at which point should they no longer be considered relevant for the application?
Let's take a look at the last result, dark purple #500050, and compare its score with the score of the top result, midnight blue.

image

What does it mean that dark purple got a score of 0.00015024038461538462? Not much in itself.
The point is that the interpretation of similarity is relative. In other words, midnight blue is closer to [0,0,64] than dark purple: its score is almost twice as high.
But there is no way to define what "close enough" would be. This is why most applications would simply retrieve the top 3 results.
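One way to make sense of the number: the score displayed for dark purple coincides with the reciprocal of its squared Euclidean distance to the query, since 80^2 + 16^2 = 6656 and 1/6656 ≈ 0.00015024. The sketch below checks this and computes the ratio between the two scores. Treat the 1/d² formula as an observation drawn from the values in this walkthrough, not as a documented scoring rule; inverse_squared_l2 is a made-up helper name.

```python
import math

query = [0, 0, 64]
dark_purple = [80, 0, 80]        # hex 500050
midnight_blue = [25, 25, 112]


def inverse_squared_l2(a, b):
    """Assumed score: reciprocal of the squared Euclidean distance."""
    return 1.0 / sum((x - y) ** 2 for x, y in zip(a, b))


score_dark_purple = inverse_squared_l2(query, dark_purple)  # 1 / 6656
score_midnight = inverse_squared_l2(query, midnight_blue)   # 1 / 3554
ratio = score_midnight / score_dark_purple                  # 6656 / 3554

print(math.isclose(score_dark_purple, 0.00015024038461538462))  # True
print(round(ratio, 2))  # 1.87
```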

Running more Advanced Vector Queries

Run Vector Search Queries with Multiple Vectors

Run the following search query and review the results:

{
  "knn": [
      { "field": "colorvect_l2",  "vector": [0, 0, 128],  "k": 3 },
      { "field": "colorvect_l2",  "vector": [0, 0, 64],   "k": 3 }
  ],
  "fields": ["color"]
}

The result is the union of both vector queries. Notice that the knn_operator: or is implicit here.
This is the default behavior when you put multiple vector queries in the knn array field.

Results with vector [0,0,128] only:

image

Notice that navy is at the very top because this is an exact match.
As we discussed previously, the color navy gets the highest score of 1.7976931348623157e+308, compared to the scores of the other results.

Let's now explore a search query combining multiple vector queries with AND.
Run the following search query and review the results.

{
    "knn": [
       { "field": "colorvect_l2", "vector": [0, 0, 128],  "k": 3},
       { "field": "colorvect_l2",  "vector": [0, 0, 64],  "k": 3}
  ],
  "knn_operator": "and",
  "fields": ["color"]
}

The result is the intersection of both vector queries: navy and midnight blue are returned from both vector queries.
This is knn_operator: and. Let's double check those results in more detail.

Results with vector [0,0,128]:

image

Results are the intersection between results from vector [0,0,128] and vector [0,0,64]

image
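The difference between the implicit or and the explicit and can be sketched with plain Python sets. Each knn entry is simulated by a brute-force top-3 over a toy subset of five colors (RGB values taken from this walkthrough), then the two result sets are combined. The names SAMPLE and top_k are invented for this example, and the real index contains many more colors, so the actual union will differ from the one printed here.

```python
import math

SAMPLE = {
    "navy": [0, 0, 128],
    "midnight blue": [25, 25, 112],
    "black": [0, 0, 0],
    "dark purple": [80, 0, 80],
    "white": [255, 255, 255],
}


def top_k(query, k, colors=SAMPLE):
    """Brute-force stand-in for one entry of the knn array."""
    ranked = sorted(colors, key=lambda name: math.dist(query, colors[name]))
    return set(ranked[:k])


near_128 = top_k([0, 0, 128], 3)
near_64 = top_k([0, 0, 64], 3)

union = near_128 | near_64         # default behavior ("or")
intersection = near_128 & near_64  # "knn_operator": "and"

print(sorted(union))         # ['black', 'dark purple', 'midnight blue', 'navy']
print(sorted(intersection))  # ['midnight blue', 'navy']
```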

Let's now explore a search query combining multiple vector queries with explicit boosting.

{
  "knn": [
    { "field": "colorvect_l2", "vector": [0, 0, 127], "k":3, "boost": 0.1},
    { "field": "colorvect_l2", "vector": [0, 99, 0], "k":3, "boost": 4.0}
  ],
  "fields": ["color"]
}

Here we boost the query vector [0, 99, 0] (its boost value is greater than 1) and deboost the query vector [0, 0, 127] (its boost value is less than 1).
Results with vector [0,99,0] and boost=4.0. They should get a higher score.

image

Combined results: boosted results from [0,99,0] and deboosted results from [0,0,127].

image

Notice that while most of the colors coming from the deboosted side (the blue side) are at the bottom of the list, navy is still quite high in the results.
The reason is that navy, encoded with [0,0,128], is so close to the query vector [0, 0, 127] that it gets a very high score anyway.
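A rough sketch of why navy survives the deboost, under two assumptions that are not confirmed by this demo: the base score is the reciprocal of the squared Euclidean distance, and boost simply multiplies that score. The green candidates use the standard web color values for dark green and green; whether they are part of the rgb collection is also an assumption.

```python
def inverse_squared_l2(a, b):
    """Assumed base score: reciprocal of the squared Euclidean distance."""
    return 1.0 / sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical candidates retrieved by each side of the query.
blue_side = {"navy": [0, 0, 128]}
green_side = {"dark green": [0, 100, 0], "green": [0, 128, 0]}

scores = {}
for name, vec in blue_side.items():
    scores[name] = 0.1 * inverse_squared_l2([0, 0, 127], vec)  # deboosted
for name, vec in green_side.items():
    scores[name] = 4.0 * inverse_squared_l2([0, 99, 0], vec)   # boosted

ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['dark green', 'navy', 'green']
```

Even with a boost of 0.1, navy is only one unit away from its query vector, so its score stays above that of a boosted result that sits farther from its own query vector.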

Run Hybrid Search and Vector Query

We want to run the query below. This is a hybrid search query between the traditional search side (query) and the vector search side (knn).
But before we can do that, we need to create a search index that covers this query.

Click Create Search Index, name it hybrid_idx and select the usual context.

Select the field brightness and check the Include in search results checkbox. Click Add To Index.
Select the field colorvect_l2. Click Add To Index.
Finally, select the field color and check the Include in search results checkbox. Click Add To Index.

image

Click Create Index at the bottom of the page.

Now, let's run the hybrid search and vector query. Insert the query in the hybrid_idx index search area, and review the results.

{
  "query": {
        "field": "brightness", "min": 70,  "max": 80,
        "inclusive_min": false,  "inclusive_max": true  },
  "knn": [
      {"field": "colorvect_l2", "vector": [0.0, 0.0, 108.0],  "k": 5}
   ],
  "fields": ["color","brightness"],
  "size": 5
}

The query (traditional) side search and the knn (vector search) side are OR’d.
If the same IDs are found on both sides the results are boosted to the top. Let's double-check those results.
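The range part of the query uses an exclusive lower bound and an inclusive upper bound. The small Python predicate below spells out which brightness values qualify; in_range is a made-up helper that mirrors the min, max, inclusive_min and inclusive_max settings of the query above.

```python
def in_range(value, min_value, max_value, inclusive_min, inclusive_max):
    """Mimic a numeric range check with configurable bound inclusiveness."""
    above = value >= min_value if inclusive_min else value > min_value
    below = value <= max_value if inclusive_max else value < max_value
    return above and below


def brightness_matches(brightness):
    # Same settings as the query: min 70 (exclusive), max 80 (inclusive).
    return in_range(brightness, 70, 80, inclusive_min=False, inclusive_max=True)


print(brightness_matches(70))    # False
print(brightness_matches(70.5))  # True
print(brightness_matches(80))    # True
print(brightness_matches(80.1))  # False
```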

Run the following query against the hybrid_idx index to get the top 5 results with vector [0,0,108] together with their brightness.

{
    "knn": [  {"field": "colorvect_l2",  "vector": [0,0,108], "k": 5} ],
  "fields": ["color","brightness"]
}

The results clearly have brightness values that are not between 70 and 80.

image

Now, let's examine the results of the combined search query below.
They all have a brightness between 70 and 80, but since the query color [0,0,108] itself has a brightness that seems to be close to navy's (around 14), it is no wonder that the results are not that close to [0,0,108].

image

Run Hybrid SQL++ and Search Query

Now, let's say that the meaning of the query you are looking for is "I want colors similar to [0,0,108] AND having a brightness between 10 and 20".
This is where running a hybrid SQL++ and Search query comes in handy.

Go to the Query page, set the right context, run the following SQL++ query and review the results.

SELECT color, brightness
FROM rgb AS t1
WHERE
   brightness <= 20 AND brightness>=10
AND
  SEARCH (t1, {
    "query": {  "match_none": {} },
    "knn": [{ "field": "colorvect_l2", "vector": [0.0, 0.0, 108.0],"k": 3 }]
    }
  )
image

The SQL++ side and the Search (a pure vector search in this example) side are AND’d via the SQL++ syntax.
In this case, while you were asking for the top 3 results from the vector side (search), only 2 results are returned because only those colors also satisfy the condition of brightness between 10 and 20.
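The AND semantics can be pictured as a filter applied on top of the vector results. In the sketch below, the three candidates and their brightness values are invented placeholders (they are not taken from the dataset); only the 10 to 20 brightness condition comes from the query above.

```python
# Hypothetical top-3 results from the vector side, with made-up brightness values.
knn_top3 = [
    {"color": "candidate A", "brightness": 14.0},
    {"color": "candidate B", "brightness": 17.5},
    {"color": "candidate C", "brightness": 42.0},
]

# The SQL++ predicate: brightness <= 20 AND brightness >= 10.
kept = [doc for doc in knn_top3 if 10 <= doc["brightness"] <= 20]

print([doc["color"] for doc in kept])  # ['candidate A', 'candidate B']
```

Because the brightness condition is applied in addition to the k limit, the final result can contain fewer than k rows.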

Review the execution time and plan for this query. In this database, it took 778.8 ms to run the query.

image

Notice that the index advisor suggests another index to speed up this query. Accept the suggestion and build the index.

image

Re-run the query and observe the improved execution time.

image

This time, the query performs an IntersectScan between the FTS index and the new GSI index.
Previously, the execution plan leveraged only the FTS index.

About

A demo of Couchbase's vector search
