cat articles/japanese-reranker-v2

Releasing Small, Fast, and Practical Japanese Rerankers: tiny, xsmall, small, and base v2

created 2025-05-08

I released very small Japanese reranker models, japanese-reranker-tiny-v2 and japanese-reranker-xsmall-v2. In information retrieval systems, rerankers improve the precision of search results, but model size and compute cost are practical challenges.

🆕 Update, 2025-07-10: I also added moderately small rerankers, japanese-reranker-small-v2 and japanese-reranker-base-v2.

These models are built with minimal layer counts and parameter counts, and they run at practical speed even on CPU and Apple silicon. This makes it possible to improve RAG system accuracy without expensive GPU resources, and should make them useful for edge deployment and production environments that require low latency. In evaluation, they achieve competitive scores even compared with larger models.

What Rerankers Are, and Why Small Rerankers Matter

A reranker is a model that evaluates the relevance between a question, or query, and documents, then reorders the documents by relevance. Its strength is that it can evaluate complex relationships that ordinary embedding search may miss. In particular, CrossEncoder architectures take the query and document as one input pair, allowing finer-grained nuance and contextual understanding.

Releasing High-Performance Japanese Rerankers, and What Rerankers Are

Small rerankers matter for several reasons. First, a reranker must evaluate every combination of a query and candidate document. Reranking 100 candidate documents requires 100 model inferences. Smaller models therefore directly improve throughput and reduce latency.

Small models can also run in resource-limited environments. They can run at realistic speed on CPU-only environments, edge devices, and mobile devices, improving the practicality of RAG systems. In server environments, they also reduce GPU memory usage and make it easier to share GPU resources, improving cost efficiency.

Ask! NIKKEI RAG Search Technology Deep Dive

Small rerankers therefore provide important benefits in speed, cost, and resource efficiency, and can play a useful role in practical RAG systems.

Benchmark Performance

The benchmark results are below. Considering their model size, the tiny and xsmall v2 models perform quite well. Among larger models, ruri-v3-reranker-310m is clearly strong. The fact that these high-performing models are based on ModernBERT likely contributes to the improvement.

Japanese models have learned the tendencies of JQaRA, a quiz-style dataset, which puts bge-reranker-v2-m3 at a disadvantage. This is also an example of how much a reranker score can improve when the domain task is learned appropriately.

Model name	avg	JQaRA	JaCWIR	MIRACL	JSQuAD
japanese-reranker-tiny-v2	0.8138	0.6455	0.9287	0.7201	0.9608
japanese-reranker-xsmall-v2	0.8699	0.7403	0.9409	0.8206	0.9776
japanese-reranker-small-v2	0.8856	0.7633	0.9586	0.8385	0.9821
japanese-reranker-base-v2	0.8930	0.7845	0.9603	0.8425	0.9845
japanese-reranker-cross-encoder-xsmall-v1	0.8131	0.6136	0.9376	0.7411	0.9602
japanese-reranker-cross-encoder-small-v1	0.8254	0.6247	0.9390	0.7776	0.9604
japanese-reranker-cross-encoder-base-v1	0.8484	0.6711	0.9337	0.8180	0.9708
japanese-reranker-cross-encoder-large-v1	0.8661	0.7099	0.9364	0.8406	0.9773
japanese-bge-reranker-v2-m3-v1	0.8584	0.6918	0.9372	0.8423	0.9624
bge-reranker-v2-m3	0.8512	0.6730	0.9343	0.8374	0.9599
ruri-v3-reranker-310m	0.9171	0.8688	0.9506	0.8670	0.9820

Inference Speed

The table below shows inference time for reranking about 150,000 pairs with the Hugging Face Transformers library. Tokenization time is excluded, so this is pure model inference time. I used an M4 Max for MPS and CPU measurements, an RTX 5090 for GPU, and FlashAttention 2 for ModernBERT-family models on GPU.

japanese-reranker-tiny-v2 and xsmall-v2 are clearly fast. ruri-v3-reranker-310m is also fast for its size, likely because FlashAttention 2 is effective. Other models can also use FlashAttention 2 through tools such as text-embeddings-inference, and may run faster than in this evaluation.

Model name	Layers	Hidden size	Speed (GPU)	Speed (MPS)	Speed (CPU)
japanese-reranker-tiny-v2	3	256	2.1s	82s	702s
japanese-reranker-xsmall-v2	10	256	6.5s	303s	2300s
japanese-reranker-small-v2	13	384	15.2s
japanese-reranker-base-v2	19	512	32.5s
japanese-reranker-cross-encoder-xsmall-v1	6	384	20.5s
japanese-reranker-cross-encoder-small-v1	12	384	40.3s
japanese-reranker-cross-encoder-base-v1	12	768	96.8s
japanese-reranker-cross-encoder-large-v1	24	1024	312.2s
japanese-bge-reranker-v2-m3-v1	24	1024	310.6s
bge-reranker-v2-m3	24	1024	310.7s
ruri-v3-reranker-310m	25	768	81.4s

The benchmark script is here.

I also publish models converted to ONNX for CPU use, so with ONNX and ARM quantized models, they should be usable even in edge environments such as Raspberry Pi.

Short Technical Report

The training data for japanese-reranker-tiny-v2, xsmall-v2, small-v2, and base-v2 is based on the dataset used to train hotchpotch/japanese-splade-v2, plus hard negatives and some additional private data. The large improvement over v1 likely comes from using ruri-v3-pt-30m, a ModernBERT-based model pretrained for the target task, using several times more data than v1, and extracting higher-quality data with hard negatives, including filtering positives and negatives with scores from various rerankers.

For the Tiny model's parameter extraction source, I evaluated sbintuitions/modernbert-ja-30m and cl-nagoya/ruri-v3-pt-30m. ModernBERT alternates global attention and local attention layers. For example, modernbert-ja-30m has 10 layers, where [0,3,6,9] are global attention layers and the others are local attention layers.

At first I expected all global attention layers to work best, but including layers 3, 6, and 9 generally made results worse. Including layers close to the output also made results worse. The table below shows reranking evaluation results for models trained on the same dataset. Results including layers close to the output, such as 6 and 9, were much worse and training was stopped early, so they are not included. Layer 0 alone did not produce useful performance.

name	JQaRA	miracl	jsquad	JaCWIR
modernbert-ja-30m + full layers	0.7261	0.8095	0.9752	0.9420
modernbert-ja-30m + layer 0,2,4	0.6455	0.7185	0.9588	0.9265
modernbert-ja-30m + layer 0,2	0.6171	0.6784	0.9516	0.9155
modernbert-ja-30m + layer 0	0.2515	0.4416	0.3172	0.0738
ruri-v3-pt-30m + full layers (= xsmall-v2)	0.7403	0.8206	0.9776	0.9409
ruri-v3-pt-30m + layer 0,2,4 (= tiny-v2)	0.6455	0.7201	0.9608	0.9287
ruri-v3-pt-30m + layer 0,1,3	0.6405	0.7124	0.9552	0.9211
ruri-v3-pt-30m + layer 0,3	0.6177	0.6619	0.9482	0.9076

From these results, I published ruri-v3-pt-30m as xsmall, and ruri-v3-pt-30m + layer 0,2,4 as tiny. small-v2 and base-v2 are based on ruri-v3-pt-70m and ruri-v3-pt-130m, respectively. Model merging slightly improves performance, but I did not use it this time.

Closing

This article introduced the small, lightweight, and practical Japanese reranker models japanese-reranker-tiny-v2, japanese-reranker-xsmall-v2, japanese-reranker-small-v2, and japanese-reranker-base-v2. The tiny and xsmall models run at practical speed on CPU and Apple silicon, and can improve search accuracy for local RAG systems without requiring expensive GPU resources. Running them on GPU also enables fast responses.

Recent high-performance encoder models such as ModernBERT make it easier to build practical models with stronger performance. I hope this article contributes to the further development of Japanese language processing technology.