cat articles/japanese-reranker-release

Releasing High-Performance Japanese Rerankers, and What Rerankers Are

created 2024-04-02

💡 A newer version is available: Releasing very small, fast, and practical Japanese rerankers: japanese-reranker-tiny and xsmall v2

There were very few rerankers trained specifically for Japanese, so I created a family of reranker models that properly learn Japanese. The family includes models from small to large sizes.

The evaluation results are below. As of early April 2024, I think these are among the strongest publicly available models on Japanese reranking tasks, partly because almost no rerankers trained on Japanese had been published.

Model name	layers	hidden_size	JQaRA	JaCWIR	MIRACL	JSQuAD
japanese-reranker-cross-encoder-xsmall-v1	6	384	0.6136	0.9376	0.7411	0.9602
japanese-reranker-cross-encoder-small-v1	12	384	0.6247	0.939	0.7776	0.9604
japanese-reranker-cross-encoder-base-v1	12	768	0.6711	0.9337	0.818	0.9708
japanese-reranker-cross-encoder-large-v1	24	1024	0.7099	0.9364	0.8406	0.9773
japanese-bge-reranker-v2-m3-v1	24	1024	0.6918	0.9372	0.8423	0.9624

Technical details about how these rerankers were created are in Japanese Reranker Technical Report.

What Is a Reranker?

A reranker, as the name suggests, reranks documents. Given a query, it reorders documents by relevance. You might wonder how this differs from sorting by similarity between text embeddings. In practice, embedding similarity can also be used for ranking, but there are two important differences.

Rerankers Have Strong Reranking Performance

Text embeddings represent the query and document in the same vector space and measure similarity. This makes them efficient at large scale because document vectors can be computed offline in advance.

Rerankers, however, are specialized for reranking. The models I created use a CrossEncoder architecture that evaluates a query and document as one pair. This allows the model to understand finer nuances and contextual relationships between the query and document. As a result, relevant documents are more likely to move higher in the ranking.

Rerankers Cannot Be Precomputed and Are Slow

If rerankers are more accurate, one might ask why not evaluate everything with a reranker instead of using embeddings. The issue is that rerankers use both the query and document as input. With text embeddings, document vectors can be computed offline in advance. At search time, only the query vector needs to be computed.

A reranker, or CrossEncoder, cannot precompute document-only representations in the same way. If there are only 100 candidate documents, evaluating all of them online may be fine. As the number of documents grows, however, evaluating every candidate at search time becomes impractical.

Where Rerankers Fit

This does not mean rerankers are unusable in real-world search. A common approach is to first retrieve the top 100 related documents using an efficient offline-computable method such as text embeddings, and then use a reranker to reorder those 100 candidates more accurately.

Quoted from Retrieve & Re-Rank Pipeline.

The same idea is also described in articles such as Google Cloud's Your RAGs powered by Google Search technology, part 2, in the section on Deep re-ranking. Reranking is an important technique.

How Much Can Reranking Improve Retrieval Scores?

In a previous article, Taking on the First AI-Ou Quiz Competition with Vector Search Only: Evaluating Multiple Japanese Embeddings on a Q&A Task, I converted about 5.5 million documents into vectors with several text embedding models and evaluated retrieval with approximate nearest neighbor search using IVFPQ. The accuracy results for those embedding models are below.

JAQKET v1 vector search: Japanese embeddings evaluation

Now let's take the top 100 results from that approximate nearest neighbor search and rerank them with the smallest and fastest model in this family, the xsmall reranker.

rerank - japanese-reranker-cross-encoder-small-v1

The results improve substantially. Because the task searches about 5.5 million documents at practical speed with approximate nearest neighbor search, accuracy is lower than exhaustive embedding similarity search. Reranking those results with a reranker leads to a large score improvement.

OpenAI text embeddings, for example, often do not score especially well on Japanese information retrieval tasks, but reranking improves their scores considerably.

Next, let's rerank with the larger large reranker model.

rerank - japanese-reranker-cross-encoder-large-v1

The score improves further. If you have enough compute, using a larger model is reasonable, but reranking latency increases with model size. The time required to evaluate JaCWIR on an RTX 3090 was as follows.

Model name	layers	hidden_size	Runtime (sec)
japanese-reranker-cross-encoder-xsmall-v1	6	384	196
japanese-reranker-cross-encoder-small-v1	12	384	265
japanese-reranker-cross-encoder-base-v1	12	768	481
japanese-reranker-cross-encoder-large-v1	24	1024	1253
japanese-bge-reranker-v2-m3-v1	24	1024	1173

The xsmall and large models differ by about 6x in speed. There is a clear tradeoff between performance and latency, so it is important to choose a reranker according to the needed accuracy and speed. In runtime search systems, reranker latency often matters.

For evaluation results against many other models, see Japanese Reranker Technical Report. Also note that although the reranking evaluation in this article uses the test data from the AI-Ou Quiz competition and was not directly trained on it, these models did train on JQaRA, a dataset made from the competition's dev and unused data. This may make scores easier to improve.

Rerankers Are Surprisingly Important

I decided to build Japanese rerankers because, when searching over millions of documents, combining embeddings plus approximate nearest neighbor search with a reranker produced much better results than embeddings plus ANN alone. At that time I was using the multilingual reranker cross-encoder-mmarco-mMiniLMv2-L12-H384-v1. If a multilingual model could improve accuracy that much, I thought a model properly trained on Japanese might improve it further.

Rerankers require online computation, which is both a drawback and an advantage. The drawback is computational cost. A benefit other than accuracy is that you do not need to recompute precomputed data. If you want to replace a text embedding model with a better one, existing document vectors stored in a database must be changed carefully in production, and recomputing hundreds of millions of vectors can be expensive. A reranker is more like replacing the sorting algorithm. It can often be swapped in without changing precomputed data.

I have also observed that rerankers can improve performance substantially when trained on domain data for the task being solved. This suggests a useful split: use a general-purpose model for text embeddings, and use a domain-specific model for reranking.

This article introduced Japanese reranker models and explained what rerankers are. Much of the current attention is on training and using LLMs, but as LLM usage expands, I think search will increasingly be optimized for AI rather than humans, and information retrieval will become even more important.

Rerankers will likely become an essential tool for improving retrieval. I hope this article helps more people become interested in rerankers and information retrieval.

This article was lightly edited from text generated by Claude 3 Opus based on my draft.