
Technical Report on Building Japanese Rerankers

created 2024-04-02

This technical report describes how I built Japanese reranker (CrossEncoder) models. For an explanation of what rerankers are, see the article "Releasing High-Performance Japanese Rerankers, and What Rerankers Are."

The models created are:

Model name	layers	hidden_size
hotchpotch/japanese-reranker-cross-encoder-xsmall-v1	6	384
hotchpotch/japanese-reranker-cross-encoder-small-v1	12	384
hotchpotch/japanese-reranker-cross-encoder-base-v1	12	768
hotchpotch/japanese-reranker-cross-encoder-large-v1	24	1024
hotchpotch/japanese-bge-reranker-v2-m3-v1	24	1024

How CrossEncoders Are Trained

A CrossEncoder can be trained as a simple regression task. Each input joins a query and a passage with a separator such as a SEP token, in the form query text[SEP]passage text, and is labeled 1.0 if the passage is a positive and 0.0 if it is a negative. For concrete training code, the SentenceTransformers CrossEncoder training examples are easy to follow.
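
As a rough illustration of the labeling scheme described above, the sketch below builds regression examples for one query. The names `PairExample` and `build_pair_examples` are hypothetical, and the literal `[SEP]` string is for display only; in practice a tokenizer typically inserts the model's own separator token when it is given a (query, passage) pair.

```python
from dataclasses import dataclass


@dataclass
class PairExample:
    """One (query, passage) pair with its regression target."""
    query: str
    passage: str
    label: float  # 1.0 for a positive passage, 0.0 for a negative one

    def as_text(self, sep: str = "[SEP]") -> str:
        # Display-only join; real tokenizers add their own separator token.
        return f"{self.query}{sep}{self.passage}"


def build_pair_examples(query, positives, negatives):
    """Label every positive passage 1.0 and every negative passage 0.0."""
    examples = [PairExample(query, p, 1.0) for p in positives]
    examples += [PairExample(query, n, 0.0) for n in negatives]
    return examples


examples = build_pair_examples(
    "What is the capital of Japan?",
    positives=["Tokyo is the capital of Japan."],
    negatives=["Osaka is a city in the Kansai region."],
)
for ex in examples:
    print(ex.label, ex.as_text())
```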

Performance improves significantly when multiple negatives (hard negatives) are trained in the same batch as their positive. FlagEmbedding's reranker trainer is a useful reference for this approach.
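
One common way to train a positive together with its negatives is to treat each group as a single classification problem: the model scores every passage in the group, and a softmax cross-entropy loss pushes the positive's score above the others. The sketch below computes that loss for one group in plain Python. It assumes the positive sits at index 0, `group_cross_entropy` is a hypothetical name, and this illustrates the idea rather than reproducing FlagEmbedding's code.

```python
import math


def group_cross_entropy(scores, positive_index=0):
    """Softmax cross-entropy over one group of candidate scores.

    scores: raw model scores (logits), one per passage in the group.
    Returns -log(softmax(scores)[positive_index]).
    """
    # Subtract the max before exponentiating for numerical stability.
    max_score = max(scores)
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_sum_exp - scores[positive_index]


# A group of 16: one positive (index 0) and 15 negatives.
untrained = [0.0] * 16            # the model cannot tell the passages apart
trained = [5.0] + [0.0] * 15      # the positive clearly outranks the negatives

print(group_cross_entropy(untrained))
print(group_cross_entropy(trained))
```

With identical scores the loss equals log(16), the value for a uniform guess over 16 candidates, and it falls as the positive's score rises relative to the negatives.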

Training Datasets

Training requires datasets of questions, positives, and negatives. Each training item is a group of 16 examples: one positive and 15 hard negatives. The datasets were:

JQaRA: 7,270 records from the dev and unused splits
JSQuAD:
- 62,859 records from train
- Additional Wikipedia passages for hard-negative mining
miracl: 6,984 Japanese records from train
mmarco: 346,413 filtered Japanese records from train
mr_tydi:
- 3,697 Japanese records from train
- The Japanese MIRACL data contains many records overlapping with this mr_tydi data
Wikipedia lead sections:
- 40,130 pairs of Wikipedia titles and lead paragraphs
- Hard-negative mining also used only Wikipedia lead paragraphs
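
To make the group structure concrete, here is a minimal sketch of assembling one training group of 16 from a query, its positive, and a pool of mined hard negatives. The function name, the dictionary layout, and the choice to raise an error when fewer than 15 negatives are available are all assumptions for illustration; the report does not say how short pools were handled.

```python
import random


def make_group(query, positive, hard_negatives, group_size=16, seed=0):
    """Return one training group: the positive first, then sampled negatives."""
    num_negatives = group_size - 1
    if len(hard_negatives) < num_negatives:
        raise ValueError("not enough hard negatives to fill the group")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    negatives = rng.sample(hard_negatives, num_negatives)
    return {
        "query": query,
        "passages": [positive] + negatives,
        "positive_index": 0,
    }


group = make_group(
    "query text",
    "positive passage",
    [f"hard negative {i}" for i in range(30)],
)
print(len(group["passages"]))
```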

Evaluation Datasets

The models were evaluated with:

JQaRA:
- 2,000 test records
- Metric: NDCG@10, as defined for JQaRA evaluation
JSQuAD:
- 4,442 validation records
- 19 negatives added by hard-negative mining from Wikipedia, evaluated with MAP@10 over 20 total candidates
miracl:
- 704 records from dev, filtered to records with at least 9 negatives
- 1 positive and 9 negatives, evaluated with MAP@10
- Japanese MIRACL has some overlap between dev and train, so more training on the train split tends to raise dev scores
JaCWIR:
- 5,000 eval records
- Metric: MAP@10, as defined for JaCWIR reranker evaluation
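
For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them from a ranked list of binary relevance labels. It is an assumption that these textbook definitions match the exact evaluation scripts of JQaRA and JaCWIR; in particular, average precision is normalized here by min(number of relevant passages, k), and NDCG uses binary gains.

```python
import math


def labels_in_ranked_order(scores, labels):
    """Reorder relevance labels by descending model score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def average_precision_at_k(ranked_labels, k=10):
    """AP@k for one query; ranked_labels[i] is 1 if the i-th result is relevant."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_labels[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    denom = min(sum(1 for rel in ranked_labels if rel), k)
    return precision_sum / denom if denom else 0.0


def ndcg_at_k(ranked_labels, k=10):
    """NDCG@k with binary gains and a log2 position discount."""
    def dcg(labels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(labels[:k]))

    ideal = dcg(sorted(ranked_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal > 0 else 0.0


def mean_average_precision(all_ranked_labels, k=10):
    """MAP@k: the mean of AP@k over all queries."""
    return sum(average_precision_at_k(r, k) for r in all_ranked_labels) / len(all_ranked_labels)


# One positive and nine negatives; the model ranks the positive second.
scores = [0.7, 0.9] + [0.1] * 8
labels = [1] + [0] * 9
ranked = labels_in_ranked_order(scores, labels)
print(average_precision_at_k(ranked), ndcg_at_k(ranked))
```

With a single positive per query, as in the JSQuAD and MIRACL setups above, AP@10 reduces to the reciprocal rank of the positive when it appears in the top 10, and 0 otherwise.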

Hard-Negative Mining

Hard negatives are examples that a model is likely to mistakenly judge as positives, even though they are actually negative. Actively mining them increases the diversity and difficulty of the training data and can improve model accuracy.

For these models, I mined hard negatives with BM25 and multiple SentenceTransformer models. Using semantic textual similarity, I extracted texts that are semantically similar to the positives but are actually negatives, then randomly sampled from the high-similarity candidates.
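
The sampling step can be sketched as follows: take candidates already scored by a retriever (BM25 or an embedding model), drop the known positives, keep the most similar remainder, and sample randomly from that pool. The function name, the `top_n=100` pool size, and the `(doc_id, similarity)` input format are assumptions for illustration; the report does not state the actual thresholds used.

```python
import random


def sample_hard_negatives(scored_candidates, positive_ids, top_n=100,
                          num_samples=15, seed=0):
    """Pick random negatives from the most similar non-positive candidates.

    scored_candidates: list of (doc_id, similarity) pairs from any retriever.
    positive_ids: set of doc_ids known to be correct answers for the query.
    """
    negatives = [(doc_id, sim) for doc_id, sim in scored_candidates
                 if doc_id not in positive_ids]
    # Highest similarity first, then keep only the top_n as the sampling pool.
    negatives.sort(key=lambda pair: pair[1], reverse=True)
    pool = [doc_id for doc_id, _ in negatives[:top_n]]
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return rng.sample(pool, min(num_samples, len(pool)))


# Toy retrieval result: doc0 is the positive, similarity decreases with the index.
candidates = [(f"doc{i}", 1.0 - i / 1000) for i in range(1000)]
hard_negatives = sample_hard_negatives(candidates, positive_ids={"doc0"})
print(hard_negatives)
```

Sampling at random from the pool, rather than always taking the top-ranked candidates, gives more varied negatives and makes it less likely that the same near-duplicate passages are chosen every time.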

Pretrained Base Models

The following pretrained models were used as bases. For BAAI/bge-reranker-v2-m3, training on all data reduced generalization, so I randomly sampled 10,000 records each from mMARCO, JSQuAD, and Wikipedia lead sections, while using all records from the other datasets.
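
The subsampling used for BAAI/bge-reranker-v2-m3 can be sketched as below: cap three of the datasets at 10,000 randomly chosen records and keep the rest whole. The record counts come from the dataset list in this report, while the dictionary keys, the function name, and the use of integers as stand-ins for real records are assumptions for illustration.

```python
import random


def build_training_mix(datasets, capped_names, cap=10_000, seed=0):
    """Randomly cap the named datasets at `cap` records; keep the others whole."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    mix = {}
    for name, records in datasets.items():
        if name in capped_names and len(records) > cap:
            mix[name] = rng.sample(records, cap)
        else:
            mix[name] = list(records)
    return mix


# Integers stand in for real records; sizes follow the training dataset list.
datasets = {
    "jqara": range(7_270),
    "jsquad": range(62_859),
    "miracl": range(6_984),
    "mmarco": range(346_413),
    "mr_tydi": range(3_697),
    "wikipedia_lead": range(40_130),
}
mix = build_training_mix(datasets, capped_names={"mmarco", "jsquad", "wikipedia_lead"})
for name, records in mix.items():
    print(name, len(records))
```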

japanese-reranker-cross-encoder-xsmall-v1
- Microsoft mMiniLMv2-L6-H384
- 6 layers, 384 hidden size
japanese-reranker-cross-encoder-small-v1
- Microsoft mMiniLMv2-L12-H384
- 12 layers, 384 hidden size
japanese-reranker-cross-encoder-base-v1
- cl-nagoya/sup-simcse-ja-base
- tohoku-nlp/bert-base-japanese-v3
- A merge of models trained from each of the two bases above
- 12 layers, 768 hidden size
japanese-reranker-cross-encoder-large-v1
- cl-nagoya/sup-simcse-ja-large
- tohoku-nlp/bert-large-japanese-v2
- A merge of models trained from each of the two bases above
- 24 layers, 1024 hidden size
japanese-bge-reranker-v2-m3-v1
- BAAI/bge-reranker-v2-m3
- 24 layers, 1024 hidden size

Handling Overfitting

During CrossEncoder training, I found that using Wikipedia passages as hard negatives improved scores on Wikipedia-based tasks such as JQaRA, JSQuAD, and Japanese MIRACL, but generalization outside the Wikipedia domain degraded as training continued. To keep the two in balance, I created JaCWIR, an out-of-domain dataset that is not included in the training data, and evaluated on it during training.

Training beyond 1 epoch caused overfitting, so training was limited to 1 epoch.

Training Parameters

The main model training used roughly the following parameters:

batch_size: 512 with gradient accumulation
- Since 16 examples form one group, the actual batch contains 512 * 16 = 8192 positive and negative examples
warmup_ratio: 0.25
Scheduler: cosine
Optimizer: paged_adamw_32bit
learning_rate:
- xsmall = 2e-04
- small = 5e-04
- base = 8e-05
- large = 3e-05
Loss:
- Cross entropy
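
The batch arithmetic and the learning-rate schedule above can be sketched in a few lines. This assumes linear warmup from zero followed by cosine decay to zero, which is a common default for a "cosine" scheduler with a warmup ratio; the report does not state the exact variant, and `total_steps = 1000` is a hypothetical value chosen only to make the numbers easy to read.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.25):
    """Learning rate with linear warmup followed by cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warmup phase.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Effective batch: 512 groups, each with 1 positive and 15 negatives.
groups_per_batch = 512
examples_per_group = 16
examples_per_batch = groups_per_batch * examples_per_group
print(examples_per_batch)

total_steps = 1000  # hypothetical; depends on dataset size and batch size
for step in (0, 125, 250, 625, 1000):
    print(step, lr_at_step(step, total_steps, peak_lr=3e-05))
```

With warmup_ratio 0.25, the first quarter of training is spent ramping up to the peak rate, and the remaining three quarters decay it.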

Using Large Models as Teachers

For xsmall and small, I also used inference outputs from japanese-reranker-cross-encoder-large-v1 and japanese-bge-reranker-v2-m3-v1 as teacher labels. Teacher outputs are continuous inference values, such as pos=0.98 and negs=[0.02, 0.07, ...], so they can be used as regression targets rather than only 0 and 1. Using teacher outputs gave a small score improvement. MSE loss was used for this training.
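
The difference between hard labels and teacher labels can be seen in a small sketch of the MSE objective. The student scores below are made-up numbers, the teacher scores follow the pos/negs example above, and `mse_loss` is a hypothetical helper; it assumes student and teacher scores are on the same 0 to 1 scale.

```python
def mse_loss(predictions, targets):
    """Mean squared error between student scores and target scores."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


# One positive followed by two negatives.
student_scores = [0.90, 0.10, 0.20]   # made-up student outputs
hard_labels = [1.0, 0.0, 0.0]         # plain 0/1 regression targets
teacher_scores = [0.98, 0.02, 0.07]   # continuous teacher outputs used as targets

print(mse_loss(student_scores, hard_labels))
print(mse_loss(student_scores, teacher_scores))
```

The teacher targets carry graded information, for example that one negative is somewhat closer to the query than another, which 0/1 labels cannot express.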

Creating Mix Models

Changing datasets, score parameters, and seeds produces diverse training results. Linearly combining separately trained models can improve performance by adding diversity. I confirmed score improvements by combining multiple trained models. I used LM_Cocktail for model merging.
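
The arithmetic behind a linear merge is a weighted average of each parameter across models with the same architecture. The sketch below shows it with plain Python lists standing in for weight tensors; it does not use LM_Cocktail's API, and the function name, the toy parameter names, and the requirement that weights sum to 1 are assumptions for illustration.

```python
def merge_linear(state_dicts, weights):
    """Weighted average of parameters from models with identical shapes.

    state_dicts: list of {parameter_name: list of floats}, one per model.
    weights: one mixing weight per model; expected to sum to 1.
    """
    if len(state_dicts) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    merged = {}
    for name in state_dicts[0]:
        # Group the i-th value of this parameter from every model together.
        columns = zip(*(sd[name] for sd in state_dicts))
        merged[name] = [sum(w * v for w, v in zip(weights, col)) for col in columns]
    return merged


# Two toy "models" trained with different seeds or data.
model_a = {"classifier.weight": [1.0, 3.0], "classifier.bias": [0.0]}
model_b = {"classifier.weight": [3.0, 5.0], "classifier.bias": [1.0]}
merged = merge_linear([model_a, model_b], weights=[0.5, 0.5])
print(merged)
```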

One caveat is that the merged model has a smaller output standard deviation, so there may be some performance degradation during quantization or similar processing.

Evaluation Results

The CrossEncoder evaluation results are below. BAAI/bge-reranker-v2-m3 already has strong multilingual generalization and high Japanese performance. If model size is not an issue, I think it is a good base model to fine-tune for reranker training, even with only a few thousand examples.

Scores on these evaluation datasets tend to increase when a model is trained on the corresponding public train data. The models created here have learned the tendencies of the train data for every dataset except JaCWIR, so keep that in mind when comparing their scores with those of the other models.

Model Name	JQaRA	JaCWIR	MIRACL	JSQuAD
japanese-reranker-cross-encoder-xsmall-v1	0.6136	0.9376	0.7411	0.9602
japanese-reranker-cross-encoder-small-v1	0.6247	0.939	0.7776	0.9604
japanese-reranker-cross-encoder-base-v1	0.6711	0.9337	0.818	0.9708
japanese-reranker-cross-encoder-large-v1	0.7099	0.9364	0.8406	0.9773
japanese-bge-reranker-v2-m3-v1	0.6918	0.9372	0.8423	0.9624
bge-reranker-v2-m3	0.673	0.9343	0.8374	0.9599
bge-reranker-large	0.4718	0.7332	0.7666	0.7081
bge-reranker-base	0.2445	0.4905	0.6792	0.5757
cross-encoder-mmarco-mMiniLMv2-L12-H384-v1	0.5588	0.9211	0.7158	0.932
shioriha-large-reranker	0.5775	0.8458	0.8084	0.9262
bge-m3+all	0.576	0.904	0.7926	0.9226
bge-m3+dense	0.539	0.8642	0.7753	0.8815
bge-m3+colbert	0.5656	0.9064	0.7902	0.9297
bge-m3+sparse	0.5088	0.8944	0.6941	0.9184
JaColBERTv2	0.5847	0.9185	0.6861	0.9247
multilingual-e5-large	0.554	0.8759	0.7722	0.8892
multilingual-e5-small	0.4917	0.869	0.7025	0.8565
bm25	0.458	0.8408	0.4387	0.9002

This article was lightly edited from text generated by Claude 3 Opus based on my notes and instructions.