cat articles/vector-search-params

Measuring speed, data size, and accuracy for vector search algorithms and quantization parameters

Recently, more use cases convert sentences into feature vectors such as embeddings. When searching for similar vectors, a few thousand vectors usually require almost no special thought. From tens of thousands of vectors, you often use approximate nearest neighbor algorithms such as HNSW to speed up search. From millions of vectors, you often combine optimization techniques such as quantization to keep the data size practical.

A simplified and wide image representing the concepts of Artificial Neural Networks (ANN) and quantum compression. The image should feature a minimalist design, with clean and clear lines. For the ANN part, depict a few interconnected nodes and simple pathways in a subtle color scheme. For the quantum compression aspect, include basic quantum-inspired symbols and patterns, focusing on a more straightforward and less complex design. The overall composition should be elegant, with a balance between simplicity and the representation of these advanced technological concepts.
A simplified and wide image representing the concepts of Artificial Neural Networks (ANN) and quantum compression. The image should feature a minimalist design, with clean and clear lines. For the ANN part, depict a few interconnected nodes and simple pathways in a subtle color scheme. For the quantum compression aspect, include basic quantum-inspired symbols and patterns, focusing on a more straightforward and less complex design. The overall composition should be elegant, with a balance between simplicity and the representation of these advanced technological concepts.

These optimizations for similar vector search, such as HNSW, IVF, and quantization, create tradeoffs among search speed, data size, and accuracy. When thinking about optimization strategies under those tradeoffs, I often see recall@10 or recall@100 used as the reported accuracy metric. For example, Choose the k-NN algorithm for your billion-scale use case with OpenSearch evaluates with recall@10, and Byte-quantized vectors in OpenSearch uses recall@100.

If search results are combined with information other than embeddings, or if a reranker re-sorts the results, recall@10 or recall@100 may be fine. But when using retrieval for RAG, I do not often put top-10 results into the LLM or Reader in-context. In my own use, top-3 or top-5 is more common. So I evaluated recall@1, @2, @3, and @5 with FAISS, representative algorithms, and quantization, and measured search speed, data size, and accuracy, or recall. I used FAISS as the library. Even outside FAISS, major vector search engines usually implement the main algorithms and quantization methods, and OpenSearch can also use FAISS internally as a vector search engine.

Algorithms and parameters in vector search databases

Before measuring, here is a review of algorithms and parameters used when creating indexes. Major examples include IVF and HNSW search algorithms, compression with PQ, or Product Quantization, and each of their parameters. The following descriptions are GPT-4 output with some edits. For nearest neighbor search itself, I recommend reading Professor Matsui's Theory and Applications of Approximate Nearest Neighbor Search Using Graphs, which explains it thoroughly.

  • HNSW (Hierarchical Navigable Small World)
    • Approach: HNSW uses a graph-based approach. Each node, or data point, has links to neighboring nodes, and search is performed efficiently through those links.
    • Characteristics:
- Fast search: The graph's hierarchical structure enables very fast approximate nearest neighbor search.
- Dynamic addition: New data points can be added dynamically.
- High accuracy: It provides high accuracy compared with many other approximate algorithms.
- Memory usage: Because of the graph structure, memory usage may be relatively large.
  • Parameter: M
- The maximum number of neighboring nodes each data point has.
  • IVF (Inverted File Index)
    • Approach: IVF divides data into multiple clusters and creates a separate index for each cluster. During search, it identifies the clusters closest to the query and searches only inside those clusters.
    • Characteristics:
- Efficient large-scale search: It can make search efficient for large datasets.
- Scalability: It can be applied to large datasets and is scalable.
- Customizable: Many parameters can be customized, such as the number of clusters, nlist, and the quantization level.
- Memory usage: It is often more memory-efficient than HNSW, though this depends on the number of clusters and quantization level.
  • Parameter: nlist
- The number of clusters.
  • Comparison of HNSW and IVF
    • Accuracy: HNSW generally provides higher accuracy than IVF, but memory usage tends to increase.
    • Speed: HNSW enables fast search, while IVF is more scalable and memory-efficient for large datasets.
    • Use cases: HNSW is suitable when accuracy is important or for realtime search. IVF is suitable for large datasets or limited memory resources.
  • Product Quantization (PQ)
    • Product Quantization is a technique for efficiently compressing high-dimensional vectors. It includes the following steps:
- Vector splitting: Each vector is split into multiple lower-dimensional subvectors.
- Subvector quantization: Each subvector is quantized using a small separate codebook, or predefined set of values. Each subvector is mapped to the nearest value in that codebook.
- Compressed representation: Finally, the original vector is represented as a combination of these quantized subvectors.

The main parameter for HNSW is M; for IVF it is nlist, and also mbit, though that did not appear above. When using PQ, the parameter is the number of subvectors. The number of subvectors must divide the original vector dimension. For example, if the original dimension is 384, possible values include 32, 64, and 96. These parameters are needed before training, but the number of graph nodes or clusters searched can be decided at search execution time with parameters such as efSearch for HNSW and nprobe for IVF.

In a FAISS + Python environment, index_factory() lets you create indexes from strings like this, which is convenient:

faiss.index_factory(d, "IVF2048,PQ64") # nlist = 2048, PQ = 64
faiss.index_factory(d, "HNSW32,PQ64") # M = 32, PQ = 64

Dataset and code

This time I used ANN_SIFT1M, which is often used in FAISS code. As the name says, it is a dataset of 1M, or one million, 128-dimensional vectors. I used 10,000 search queries from it and measured recall@N. FAISS can also use GPU for search, but most searches will probably run on CPU, so I used a CPU, Ryzen 9 5950X.

The benchmark code is bench_gpu_sift1m_ivf_hnsw.py. If you put this source file into the benchs directory of a cloned FAISS repository, place the dataset appropriately, and run it, it should reproduce the benchmark.

Measurement results

Results
Results

The result spreadsheet has two sheets. Showing everything would be too large, so one sheet contains extracted data.

As described at the beginning, IVF, HNSW, and PQ, or quantization, show tradeoffs. IVF is outstandingly memory-efficient with small data size, but its speed and accuracy are worse than HNSW. However, once the data no longer fits in memory, HNSW speed will likely degrade too. The recall@1 to recall@100 values are interesting. Recall@100 and recall@10 approach 1.0 fairly quickly, but if recall@3 or recall@5, meaning whether the same data appears in the top 3 or 5 results, is important for your system, you need to choose parameters carefully. Also, when using PQ, the probability that top-1 matches under recall@1 can be fairly low. If top-1 is important, you need to think carefully about what to do.

Summary

For systems where top-3 or top-5 matters in RAG, judging that everything is fine based on a larger recall metric without using a reranker may diverge from the results you actually want. The point is obvious, but choose metrics based on what you want to do, and set optimal parameters that fit those metrics. These results are only for the SIFT1M dataset, so different datasets should produce different results.

There is no best parameter that works without thought. But for vectors beyond one million, especially if they may reach tens of millions, I feel that IVF + quantization is a good direction. nlist around 1024 seems good because it is near the square root of the actual count, so for ten million items something like 4096 may be appropriate. PQ should be as large as possible; for example, for 128-dimensional vectors, 64 seems good. If accuracy and speed are important and the scale is up to a few million vectors, HNSW + PQ can also be handled with a practical data size, so it should be considered too.

The FAISS benchs directory includes measurements beyond IVF and HNSW, such as a benchmark for dimensionality reduction with PCA. It contains quite a few measurements for things you may want to do in vector search, so it is worth a look.

Extra: a Kaggle competition mistake

In the already-finished Kaggle competition LLM Science Exam, many solutions improved scores by putting top 1 to 5 RAG results into the prompt. I did that too. At the time, I chose FAISS parameters fairly casually, so the data size became quite small, but I did not notice that accuracy had actually dropped a fair amount. I would like to tell my past self to measure properly and think about parameters when accuracy matters.

cat related_articles/vector-search-params.yaml

  1. Releasing Japanese SPLADE v2, a Strong Retrieval Model for Texts Under 512 TokensI released japanese-splade-v2, an improved Japanese SPLADE model with strong JMTEB retrieval scores for documents up to 512 tokens.
  2. Releasing High-Performance Japanese Rerankers, and What Rerankers AreI released Japanese reranker models trained for search reranking and explain how rerankers improve retrieval quality after initial vector or keyword search.
  3. Building Japanese Wikipedia embeddings and a FAISS index for RAGI created embeddings for about 5.5 million Japanese Wikipedia passages and published FAISS indexes that can be used easily for RAG-style retrieval and question answering experiments.