ColBERT reaches e5-large-level performance on a Japanese RAG task

created 2024-02-02

JaColBERT, a recently released ColBERT model pretrained for Japanese, looked promising, so I evaluated it on the AI-Ou Q&A RAG task that I normally use as a benchmark.

https://docs.google.com/spreadsheets/d/1eSYzxzIfN3uMIpFKDGCTQsIxuWYELBtD49LQbl88GUE/edit#gid=140790548

It scored only slightly below multilingual-e5-large. That is impressive given the small amount of training data and the model size: a 12-layer BERT, roughly the same size as multilingual-e5-small.

Reading the ColBERT implementation and papers

That made me interested in ColBERT, so I read the papers and implementation.

ColBERT does not follow the usual approach of producing one sentence embedding (with SentenceTransformer or similar) and comparing similarities. It performs token-based similarity search instead. The final hidden layer holds a contextual representation for every token, so ColBERT computes similarity from those token-level representations rather than from a single sentence vector.

The similarity is calculated with a method called MaxSim. For each query token, it takes the maximum cosine similarity against all document token outputs, then sums those maxima over the query tokens. The MaxSim calculation itself is simple.
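
As a minimal sketch of MaxSim, the following uses made-up 2-dimensional vectors in place of real token outputs. The function names `normalize` and `maxsim` are illustrative and not taken from the ColBERT code.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def maxsim(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best cosine similarity to any document token."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    score = 0.0
    for qv in q:
        score += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return score

# Toy "token embeddings" (made-up values, not model output).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has an exact match for both query tokens
doc_b = [[1.0, 1.0], [-1.0, 0.0]]             # only partial matches

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b)  # doc_a scores higher than doc_b
```

Because each query token only looks for its single best match, extra unrelated tokens in a document do not lower its score.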

The query and the document have to be encoded separately. The model is a 12-layer BERT with a custom head, a linear layer with 128 output dimensions. BERT's final hidden output has 768 dimensions, and the linear layer reduces it to 128.
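
The shape change through the linear head can be sketched as below. The dimensions 4 and 2 stand in for 768 and 128, and the weights are made up. The sketch also assumes a bias-free projection followed by L2 normalization, which makes a later dot product equal to cosine similarity; treat both as assumptions rather than a description of the actual code.

```python
import math

HIDDEN_DIM, OUT_DIM = 4, 2  # stand-ins for 768 and 128

def project(token_vecs, weight):
    """Bias-free linear layer: each token vector (HIDDEN_DIM) -> (OUT_DIM)."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weight]
            for vec in token_vecs]

def l2_normalize(vec):
    """Scale to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

# OUT_DIM rows of HIDDEN_DIM made-up weights.
weight = [
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, -1.0],
]

# One hidden vector per token (made-up values, three tokens).
hidden_states = [
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 1.0, 0.0, 0.0],
    [2.0, 0.0, 2.0, 0.0],
]

token_embeddings = [l2_normalize(v) for v in project(hidden_states, weight)]
print(len(token_embeddings), len(token_embeddings[0]))  # one OUT_DIM vector per token
```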

In the actual implementation, the only thing that distinguishes a query from a document is a special token inserted after [CLS]: the prefix is [CLS][unused0] for a query and [CLS][unused1] for a document. The encoder itself is the same.
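
The marker scheme can be illustrated on already-tokenized text. `add_marker` is a hypothetical helper, the input is assumed to be a list of token strings, and the trailing [SEP] is assumed from standard BERT input formatting rather than from the text above.

```python
QUERY_MARKER = "[unused0]"
DOC_MARKER = "[unused1]"

def add_marker(tokens, is_query):
    """Wrap tokens in BERT-style special tokens, with a query/document marker after [CLS]."""
    marker = QUERY_MARKER if is_query else DOC_MARKER
    # Assumption: a trailing [SEP], as in ordinary BERT inputs.
    return ["[CLS]", marker] + list(tokens) + ["[SEP]"]

query_tokens = add_marker(["東京", "の", "天気"], is_query=True)
doc_tokens = add_marker(["東京", "は", "晴れ", "です"], is_query=False)
print(query_tokens)
print(doc_tokens)
```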

Once both are encoded, MaxSim is computed, and the document with the highest score is taken as the best match for the query. In the ColBERT implementation, punctuation symbols and padding tokens on the document side are masked out and ignored in the calculation.
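
A rough sketch of that masking step follows. It assumes unit-length toy vectors, a `[PAD]` padding token, and a skip list built from ASCII punctuation only (a Japanese model would presumably need Japanese punctuation too). The helper names are illustrative.

```python
import string

PAD = "[PAD]"
SKIP = set(string.punctuation)  # assumption: ASCII punctuation only

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def doc_mask(tokens):
    """True for document tokens that should take part in MaxSim."""
    return [tok != PAD and tok not in SKIP for tok in tokens]

def masked_maxsim(query_vecs, doc_vecs, mask):
    """MaxSim over unit-length vectors, ignoring masked-out document tokens."""
    kept = [dv for dv, keep in zip(doc_vecs, mask) if keep]
    if not kept:
        return 0.0
    return sum(max(dot(qv, dv) for dv in kept) for qv in query_vecs)

doc_tokens = ["cat", ",", "dog", PAD]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.0, 1.0]]  # made-up values
query_vecs = [[0.0, 1.0]]

mask = doc_mask(doc_tokens)
masked = masked_maxsim(query_vecs, doc_vecs, mask)
unmasked = masked_maxsim(query_vecs, doc_vecs, [True] * len(doc_tokens))
print(mask, masked, unmasked)
```

In this toy case the comma and the padding token happen to match the query token best, so leaving them in would inflate the score.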

Solving search-time performance issues

With ordinary sentence-vector search, approximate nearest neighbor (ANN) search stays fast even over hundreds of millions of documents. ColBERT, however, scores with MaxSim over token similarities rather than a single sentence vector, so ANN cannot be applied as-is.

The ColBERTv2 paper describes how to build an index that solves this problem and allows fast nearest neighbor search. As far as I can tell, it compresses the vectors in several ways, computes centroids with k-means, and searches starting from those centroids. The implementation imports FAISS, so at first I assumed it used a FAISS index directly, but FAISS is used only to compute the cluster centroids with k-means. Once the index is built, searches are fast.
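
To make the centroid idea concrete, here is a heavily simplified sketch, not the actual ColBERTv2 algorithm. It assumes the centroids are already given (hand-picked here instead of computed with k-means), skips vector compression entirely, and uses hypothetical names throughout. Each document token is assigned to its nearest centroid, a query only looks at documents sharing a centroid with one of its tokens, and exact MaxSim reranks those candidates.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest_centroid(vec, centroids):
    """Index of the centroid with the highest dot product to vec."""
    return max(range(len(centroids)), key=lambda i: dot(vec, centroids[i]))

def build_index(docs, centroids):
    """Map each centroid id to the doc ids that have a token assigned to it."""
    inverted = {i: set() for i in range(len(centroids))}
    for doc_id, token_vecs in docs.items():
        for vec in token_vecs:
            inverted[nearest_centroid(vec, centroids)].add(doc_id)
    return inverted

def maxsim(query_vecs, doc_vecs):
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def search(query_vecs, docs, centroids, inverted, k=1):
    """Collect candidates via centroids, then rerank them with exact MaxSim."""
    candidates = set()
    for q in query_vecs:
        candidates |= inverted[nearest_centroid(q, centroids)]
    ranked = sorted(candidates,
                    key=lambda doc_id: maxsim(query_vecs, docs[doc_id]),
                    reverse=True)
    return ranked[:k]

# Made-up unit vectors; real centroids would come from k-means over token embeddings.
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
docs = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],
    "doc_b": [[0.8, 0.6], [1.0, 0.0]],
    "doc_c": [[-1.0, 0.0], [-0.6, -0.8]],
}
inverted = build_index(docs, centroids)
query = [[0.8, 0.6], [0.0, 1.0]]
results = search(query, docs, centroids, inverted, k=2)
print(results)
```

The point of the structure is that `doc_c` is never scored at all: none of its tokens share a centroid with the query, so the expensive MaxSim step only runs on a small candidate set.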

Building the index for the 5.5 million passages of the AI-Ou Q&A RAG task above took about five hours on a Ryzen 3900 + RTX 3090 machine. Note that FAISS is quite slow unless you use the GPU build, faiss-gpu.

Trainable with little data

According to the JaColBERT report, the model was trained from bert-base-japanese-v3 for 10 hours on 8 NVIDIA L4 GPUs with 10 million triplets. With that little data and that short a training time, the possibilities are exciting.

ColBERT's problems and practical difficulty

Reading the implementation made me realize that ColBERT is not something you can pick up casually, because both the processing and the implementation code are complex. RAGatouille addresses this by making ColBERT quick to use, even with zero configuration. I used RAGatouille for this evaluation too.

RAGatouille can create and search indexes, of course, and it can also train models with its Trainer. It supports modern integrations as well, such as acting as a LangChain retriever.

Can ColBERT be used on a production search server?

One concern with ColBERT is whether it can be operated in production. As of early February 2024, it appears that you have to implement and operate your own search API server, which is not something to take on lightly.

However, the RAGatouille documentation says that the search engine Vespa will support it soon. If that happens, operation should become much easier. Adding data to an existing index also still seems to be experimental, but once that works reliably, ColBERT should become a reasonable option to consider for production.

Above all, if you can train on your own domain data at low cost, it may be usable as high-quality RAG retrieval for your own data. For that kind of use case, I would actively consider it.

Closing

So this was a note saying that ColBERT is impressive. People researching information retrieval probably already know ColBERT, but I did not, so learning about it felt fresh. I probably would not have become interested without JaColBERT, so I am grateful to Benjamin Clavié, its author. He is also the author of RAGatouille, which is very helpful.