Technical Report on Building Japanese Rerankers
This is a technical report on building Japanese reranker, or CrossEncoder, models. For an explanation of what rerankers are, see the article "Releasing High-Performance Japanese Rerankers, and What Rerankers Are".
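The models created are:
- japanese-reranker-cross-encoder-xsmall-v1
- japanese-reranker-cross-encoder-small-v1
- japanese-reranker-cross-encoder-base-v1
- japanese-reranker-cross-encoder-large-v1
- japanese-bge-reranker-v2-m3-v1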
How CrossEncoders Are Trained
A CrossEncoder can be trained as a simple regression task. Each input is a query and a passage joined by a separator such as a SEP token, in the form query text[SEP]passage text, and is labeled 1.0 for a positive pair and 0.0 for a negative pair. For concrete training code, the SentenceTransformers CrossEncoder training examples are easy to understand.
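As a minimal sketch of that labeling scheme, the snippet below turns one query with its positives and negatives into labeled text pairs. The names `PairExample` and `build_pair_examples` are hypothetical, and the literal `[SEP]` string is only for illustration: in practice the tokenizer usually inserts its own separator token when given a text pair.

```python
from dataclasses import dataclass

SEP = "[SEP]"  # illustrative separator; the real token depends on the tokenizer


@dataclass
class PairExample:
    text: str     # "query[SEP]passage"
    label: float  # 1.0 for a positive pair, 0.0 for a negative pair


def build_pair_examples(query, positives, negatives, sep=SEP):
    """Create one regression example per (query, passage) pair."""
    examples = [PairExample(f"{query}{sep}{p}", 1.0) for p in positives]
    examples += [PairExample(f"{query}{sep}{n}", 0.0) for n in negatives]
    return examples


examples = build_pair_examples(
    "富士山の高さは?",
    positives=["富士山の標高は3776メートルである。"],
    negatives=["琵琶湖は日本最大の湖である。"],
)
```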
Performance improves significantly when multiple negatives, or hard negatives, are trained in the same batch as the positive. FlagEmbedding's reranker trainer is a useful reference for this approach.
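One common way to train a positive together with its negatives is a softmax cross entropy over the scores of the whole group, with the positive as the target class. The sketch below shows that loss in plain Python under this assumption. It is an illustration of the idea, not the code used for these models, and `group_cross_entropy` is a hypothetical name.

```python
import math


def group_cross_entropy(scores, positive_index=0):
    """Softmax cross entropy over one group of reranker scores.

    scores: raw model scores (logits) for one positive and its negatives.
    positive_index: position of the positive passage within the group.
    """
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[positive_index]


# The positive scores well above the 15 negatives: the loss is small.
good = group_cross_entropy([5.0] + [0.0] * 15)
# The positive scores below the negatives: the loss is large.
bad = group_cross_entropy([0.0] + [5.0] * 15)
```

Because every negative in the group competes with the positive inside the same softmax, a hard negative that scores close to the positive contributes directly to the loss.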
Training Datasets
Training requires datasets of questions, positives, and negatives. I used one positive and 15 hard negatives per item, for a group of 16 examples. The datasets were:
- JQaRA: 7,270 records from dev and unused
- JSQuAD:
  - 62,859 records from train
  - Additional Wikipedia passages for hard-negative mining
- miracl: 6,984 Japanese records from train
- mmarco: 346,413 filtered Japanese records from train
- mr_tydi:
  - 3,697 Japanese records from train
  - The Japanese MIRACL data contains many records overlapping with this mr_tydi data
- Wikipedia lead sections:
  - 40,130 pairs of Wikipedia titles and lead paragraphs
  - Hard-negative mining also used only Wikipedia lead paragraphs
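The grouping of one positive with 15 hard negatives can be sketched as follows. The function name `make_group`, the dictionary layout, and the padding rule for records with fewer than 15 mined negatives are all assumptions made for illustration.

```python
import random

GROUP_SIZE = 16  # 1 positive + 15 hard negatives, as described above


def make_group(query, positive, hard_negatives, group_size=GROUP_SIZE, rng=None):
    """Build one training group with the positive at index 0."""
    if not hard_negatives:
        raise ValueError("at least one hard negative is required")
    rng = rng or random.Random(0)
    n_neg = group_size - 1
    if len(hard_negatives) >= n_neg:
        negatives = rng.sample(hard_negatives, n_neg)
    else:
        # Assumption: pad by sampling with replacement when too few were mined.
        negatives = [rng.choice(hard_negatives) for _ in range(n_neg)]
    return {"query": query, "passages": [positive] + negatives, "positive_index": 0}


group = make_group("q", "pos", [f"neg{i}" for i in range(30)])
```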
Evaluation Datasets
The models were evaluated with:
- JQaRA:
  - 2,000 test records
  - Metric: NDCG@10, as defined for JQaRA evaluation
- JSQuAD:
  - 4,442 validation records
  - 19 negatives added by hard-negative mining from Wikipedia, evaluated with MAP@10 over 20 total candidates
- miracl:
  - 704 records from dev, filtered to records with at least 9 negatives
  - 1 positive and 9 negatives, evaluated with MAP@10
  - Japanese MIRACL has some overlap between dev and train, so training more on train tends to raise dev evaluation
- JaCWIR:
  - 5,000 eval records
  - Metric: MAP@10, as defined for JaCWIR reranker evaluation
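For readers unfamiliar with the two metrics, here is a generic sketch of per-query average precision at 10 and NDCG@10 over a ranked list of relevance labels. The official JQaRA and JaCWIR evaluation scripts may differ in details such as normalization, so treat this as an illustration of the metric definitions rather than a reproduction of those evaluations.

```python
import math


def average_precision_at_k(ranked_labels, k=10):
    """Average precision over the top-k results of one query.

    ranked_labels: relevance labels (1 = relevant, 0 = not) in ranked order.
    Assumption: normalize by min(number of relevant items, k).
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label > 0:
            hits += 1
            precision_sum += hits / rank
    denom = min(sum(1 for label in ranked_labels if label > 0), k)
    return precision_sum / denom if denom else 0.0


def ndcg_at_k(ranked_labels, k=10):
    """NDCG over the top-k results, using linear gain and a log2 discount."""
    def dcg(labels):
        return sum(label / math.log2(i + 2) for i, label in enumerate(labels[:k]))

    ideal = dcg(sorted(ranked_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal else 0.0


# One positive among 20 candidates, ranked second by the reranker.
labels = [0, 1] + [0] * 18
ap = average_precision_at_k(labels)
ndcg = ndcg_at_k(labels)
```

MAP@10 is the mean of the per-query average precision. With a single positive per query, as in the JSQuAD and MIRACL setups above, average precision reduces to the reciprocal rank of the positive when it lands in the top 10.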
Hard-Negative Mining
Hard negatives are examples that a model is likely to mistakenly judge as positives, even though they are actually negative. Actively mining them increases the diversity and difficulty of the training data and can improve model accuracy.
For these models, I mined hard negatives with BM25 and multiple SentenceTransformer models. Treating the search as a semantic textual similarity task, I extracted passages that are semantically similar to the positives but are actually negatives, then sampled randomly from the high-similarity candidates.
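The sampling step can be sketched as below, assuming a retriever (BM25 or an embedding model) has already produced similarity scores for the candidate passages. The function name `mine_hard_negatives` and the `top_n=100` cutoff are hypothetical; the text above does not state the cutoff that was actually used.

```python
import random


def mine_hard_negatives(candidates, positive_ids, top_n=100, num_samples=15, seed=0):
    """Sample hard negatives at random from the most similar non-positive passages.

    candidates: list of (passage_id, similarity) pairs from any retriever.
    positive_ids: set of passage ids known to be positives for the query.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    # Keep only high-similarity passages that are not labeled positive.
    pool = [pid for pid, _ in ranked if pid not in positive_ids][:top_n]
    rng = random.Random(seed)
    return rng.sample(pool, min(num_samples, len(pool)))


# Toy scores: passage p0 is the positive, and similarity decreases with the index.
candidates = [(f"p{i}", 1.0 - i / 1000) for i in range(500)]
negatives = mine_hard_negatives(candidates, positive_ids={"p0"})
```

Sampling at random from the top candidates, rather than always taking the very highest, is one way to reduce the chance of repeatedly picking unlabeled passages that are in fact correct answers.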
Pretrained Base Models
The following pretrained models were used as bases. For BAAI/bge-reranker-v2-m3, training on all data reduced generalization, so I randomly sampled 10,000 records each from mMARCO, JSQuAD, and Wikipedia lead sections, while using all records from the other datasets.
- japanese-reranker-cross-encoder-xsmall-v1
  - Microsoft mMiniLMv2-L6-H384
  - 6 layers, 384 hidden size
- japanese-reranker-cross-encoder-small-v1
  - Microsoft mMiniLMv2-L12-H384
  - 12 layers, 384 hidden size
- japanese-reranker-cross-encoder-base-v1
  - cl-nagoya/sup-simcse-ja-base
  - tohoku-nlp/bert-base-japanese-v3
  - A merged model from models trained from both sources
  - 12 layers, 768 hidden size
- japanese-reranker-cross-encoder-large-v1
  - cl-nagoya/sup-simcse-ja-large
  - tohoku-nlp/bert-large-japanese-v2
  - A merged model from models trained from both sources
  - 24 layers, 1024 hidden size
- japanese-bge-reranker-v2-m3-v1
  - BAAI/bge-reranker-v2-m3
  - 24 layers, 1024 hidden size
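The subsampling used for the BAAI/bge-reranker-v2-m3 fine-tune (10,000 random records each from mMARCO, JSQuAD, and Wikipedia lead sections, everything from the other datasets) can be sketched like this. The dataset keys and the `subsample` helper are hypothetical names, and integer ranges stand in for the real records.

```python
import random

CAPPED = {"mmarco", "jsquad", "wikipedia_lead"}  # hypothetical dataset keys
CAP = 10_000


def subsample(datasets, capped=CAPPED, cap=CAP, seed=0):
    """Randomly cap the listed datasets and keep the others in full."""
    rng = random.Random(seed)
    result = {}
    for name, records in datasets.items():
        if name in capped and len(records) > cap:
            result[name] = rng.sample(records, cap)
        else:
            result[name] = list(records)
    return result


# Record counts taken from the training dataset list.
datasets = {
    "jqara": list(range(7_270)),
    "jsquad": list(range(62_859)),
    "miracl": list(range(6_984)),
    "mmarco": list(range(346_413)),
    "mr_tydi": list(range(3_697)),
    "wikipedia_lead": list(range(40_130)),
}
sampled = subsample(datasets)
total = sum(len(records) for records in sampled.values())
```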
Handling Overfitting
During CrossEncoder training, I found a trade-off. Because Wikipedia passages were used as hard negatives, scores on Wikipedia-derived tasks such as JQaRA, JSQuAD, and Japanese MIRACL kept improving, while generalization outside the Wikipedia domain degraded as training continued. To monitor this balance, I created JaCWIR, an out-of-domain dataset not included in the training data, and used it for evaluation during training.
Training beyond 1 epoch caused overfitting, so training was limited to 1 epoch.
Training Parameters
The main model training used roughly the following parameters:
- batch_size: 512 with gradient accumulation
  - Since 16 examples form one group, the actual batch contains 512 * 16 = 8192 positive and negative examples
- warmup_ratio: 0.25
- Scheduler: cosine
- Optimizer: paged_adamw_32bit
- learning_rate:
  - xsmall = 2e-04
  - small = 5e-04
  - base = 8e-05
  - large = 3e-05
- Loss:
  - Cross entropy
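To make the schedule concrete, the sketch below computes the learning rate at a given step, assuming the common combination of linear warmup over the first 25% of steps followed by cosine decay to zero. The exact scheduler implementation used for these models is not stated here, so the shape is an assumption, and `lr_at_step` and `total_steps=1000` are hypothetical.

```python
import math

GROUPS_PER_BATCH = 512
GROUP_SIZE = 16
EXAMPLES_PER_BATCH = GROUPS_PER_BATCH * GROUP_SIZE  # 8192 query-passage pairs


def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.25):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Learning rate for the base model (peak 8e-05) at a few points in training.
total_steps = 1000
schedule = [lr_at_step(s, total_steps, 8e-05) for s in (0, 125, 250, 625, 1000)]
```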
Using Large Models as Teachers
For xsmall and small, I also used inference outputs from japanese-reranker-cross-encoder-large-v1 and japanese-bge-reranker-v2-m3-v1 as teacher labels. Teacher outputs are continuous inference values, such as pos=0.98 and negs=[0.02, 0.07, ...], so they can be used as regression targets rather than only 0 and 1. Using teacher outputs gave a small score improvement. MSE loss was used for this training.
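A minimal sketch of the regression objective with teacher labels follows. The teacher values echo the pos=0.98 and negs=[0.02, 0.07, ...] example, with the remaining numbers made up for illustration. How outputs from the two teacher models were combined is not described above, so the sketch uses a single list of teacher scores.

```python
def mse_loss(student_scores, teacher_scores):
    """Mean squared error between student predictions and teacher soft labels."""
    if len(student_scores) != len(teacher_scores):
        raise ValueError("score lists must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_scores, teacher_scores)]
    return sum(squared) / len(squared)


# Teacher scores for one positive followed by three negatives (illustrative values).
teacher = [0.98, 0.02, 0.07, 0.30]
hard_labels = [1.0, 0.0, 0.0, 0.0]

# A student that reproduces the teacher exactly has zero loss against the teacher
# labels, but a nonzero loss against the hard 0/1 labels.
loss_vs_teacher = mse_loss(teacher, teacher)
loss_vs_hard = mse_loss(teacher, hard_labels)
```

The soft targets carry information that 0/1 labels discard, for example that the fourth passage is a harder negative than the second.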
Creating Mix Models
Changing datasets, score parameters, and seeds produces diverse training results. Linearly combining separately trained models can improve performance by adding diversity. I confirmed score improvements by combining multiple trained models. I used LM_Cocktail for model merging.
One caveat is that the merged model has a smaller output standard deviation, so there may be some performance degradation during quantization or similar processing.
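The linear combination itself is simple: each parameter of the merged model is a weighted average of the same parameter across the source models. The sketch below illustrates this with plain Python lists standing in for weight tensors. It does not use or reproduce the LM_Cocktail API, and `merge_state_dicts` is a hypothetical name.

```python
def merge_state_dicts(state_dicts, weights):
    """Weighted average of parameters from models with identical architectures.

    state_dicts: list of {parameter_name: list of floats}.
    weights: one mixing weight per model; they are expected to sum to 1.
    """
    if len(state_dicts) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    merged = {}
    for name, values in state_dicts[0].items():
        merged[name] = [
            sum(w * sd[name][i] for sd, w in zip(state_dicts, weights))
            for i in range(len(values))
        ]
    return merged


model_a = {"classifier.weight": [1.0, 3.0]}
model_b = {"classifier.weight": [3.0, 5.0]}
merged = merge_state_dicts([model_a, model_b], weights=[0.5, 0.5])
```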
Evaluation Results
The CrossEncoder evaluation results are below. BAAI/bge-reranker-v2-m3 already has strong multilingual generalization and high Japanese performance. If model size is not an issue, I think it is a good base model to fine-tune for reranker training, even with only a few thousand examples.
Scores on these evaluation datasets tend to increase when a model is trained on the corresponding public train data. The models created here were trained on the train data of every dataset except JaCWIR, so they have learned the tendencies of those datasets, and the scores should be read with that in mind.
| Model Name | JQaRA | JaCWIR | MIRACL | JSQuAD |
|---|---|---|---|---|
| japanese-reranker-cross-encoder-xsmall-v1 | 0.6136 | 0.9376 | 0.7411 | 0.9602 |
| japanese-reranker-cross-encoder-small-v1 | 0.6247 | 0.939 | 0.7776 | 0.9604 |
| japanese-reranker-cross-encoder-base-v1 | 0.6711 | 0.9337 | 0.818 | 0.9708 |
| japanese-reranker-cross-encoder-large-v1 | 0.7099 | 0.9364 | 0.8406 | 0.9773 |
| japanese-bge-reranker-v2-m3-v1 | 0.6918 | 0.9372 | 0.8423 | 0.9624 |
| bge-reranker-v2-m3 | 0.673 | 0.9343 | 0.8374 | 0.9599 |
| bge-reranker-large | 0.4718 | 0.7332 | 0.7666 | 0.7081 |
| bge-reranker-base | 0.2445 | 0.4905 | 0.6792 | 0.5757 |
| cross-encoder-mmarco-mMiniLMv2-L12-H384-v1 | 0.5588 | 0.9211 | 0.7158 | 0.932 |
| shioriha-large-reranker | 0.5775 | 0.8458 | 0.8084 | 0.9262 |
| bge-m3+all | 0.576 | 0.904 | 0.7926 | 0.9226 |
| bge-m3+dense | 0.539 | 0.8642 | 0.7753 | 0.8815 |
| bge-m3+colbert | 0.5656 | 0.9064 | 0.7902 | 0.9297 |
| bge-m3+sparse | 0.5088 | 0.8944 | 0.6941 | 0.9184 |
| JaColBERTv2 | 0.5847 | 0.9185 | 0.6861 | 0.9247 |
| multilingual-e5-large | 0.554 | 0.8759 | 0.7722 | 0.8892 |
| multilingual-e5-small | 0.4917 | 0.869 | 0.7025 | 0.8565 |
| bm25 | 0.458 | 0.8408 | 0.4387 | 0.9002 |
This article was lightly edited from text generated by Claude 3 Opus based on my notes and instructions.