Releasing Japanese BERT RetroMAE Models and Evaluating Them on Downstream Retrieval Tasks

created 2024-10-30

Neural retrieval models that capture semantic similarity between queries and documents are important for search tasks. However, conventional language models such as BERT are mainly pretrained on token-level tasks, so their sentence-level representations are not always well developed. RetroMAE, proposed in the paper "RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder", is a pretraining method designed specifically for retrieval.

For this article, I created and released Japanese BERT models pretrained with RetroMAE and evaluated them on the retrieval tasks of JMTEB.

model_name	Avg.	jagovfaqs 22k	jaqket	mrtydi	nlp_journal abs_intro	nlp_journal title_abs	nlp_journal title_intro
bert-base-japanese-v3	0.7266	0.6532	0.6236	0.4521	0.8774	0.9732	0.7803
bert-base-japanese-v3 retromae	0.7352	0.6631	0.6632	0.4526	0.8893	0.9722	0.7708
ruri-pt-base retromae	0.7397	0.6678	0.6691	0.4667	0.8931	0.9605	0.7812

The results show improvements in the average score and on most individual tasks, confirming the usefulness of RetroMAE. The training method is also practical because it is unsupervised and requires only raw text.

About RetroMAE

RetroMAE uses a masked auto-encoder approach with three main design choices:

A new workflow that applies different masks to the input sentence
An asymmetric encoder-decoder structure
Different mask ratios for the encoder and decoder

These choices allow the model to learn representations that understand document meaning more deeply and support effective retrieval. RetroMAE also performs well on benchmarks such as BEIR and MS MARCO. The high-performing multilingual dense embedding model BAAI/bge-m3 also uses RetroMAE pretraining.

There is also a later method, RetroMAE v2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models, also called DupMAE. This article covers RetroMAE.

Workflow with Different Masks

RetroMAE applies two different masks to the input sentence. The encoder generates a sentence embedding from the first masked input. The decoder then combines the second masked input with that sentence embedding to reconstruct the original sentence.

Asymmetric Encoder and Decoder

RetroMAE intentionally uses an asymmetric structure. The encoder is BERT's full 12-layer Transformer, which has enough capacity to capture the meaning of the input sentence. The decoder, on the other hand, is only a very simple one-layer Transformer. This simple decoder makes the reconstruction task harder, encouraging the encoder to learn higher-quality sentence embeddings.

The one-layer decoder also introduces a special mechanism called enhanced decoding. It prepares two inputs: a query that combines the sentence embedding and position embedding, and a context that combines the sentence embedding, token embedding, and position embedding. It then applies an attention mask according to position. This allows all input tokens to be reconstruction targets while each token is reconstructed from its own context, enabling efficient training even with a shallow decoder.
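The position-specific attention mask can be sketched as follows. This is a minimal illustration under my own simplifying assumptions, not the retromae_pretrain implementation: position 0 stands for the sentence embedding and is always visible, every other context position is visible with probability 1 - mask_ratio, and a token is never allowed to attend to itself. The function name and the boolean-matrix representation are hypothetical.

```python
import random

def enhanced_decoding_mask(seq_len, mask_ratio, rng):
    """Build a seq_len x seq_len boolean matrix.

    visible[i][j] is True when the query at position i (the token being
    reconstructed) may attend to context position j.
    """
    visible = []
    for i in range(seq_len):
        # Each row samples its own set of visible context positions.
        row = [rng.random() >= mask_ratio for _ in range(seq_len)]
        # Assumption: position 0 carries the sentence embedding, always visible.
        row[0] = True
        if i != 0:
            # A token must not see itself, or reconstruction becomes trivial.
            row[i] = False
        visible.append(row)
    return visible

mask = enhanced_decoding_mask(seq_len=8, mask_ratio=0.5, rng=random.Random(0))
for row in mask:
    print("".join("1" if v else "0" for v in row))
```

Because every row has its own sampled context, every token can serve as a reconstruction target in a single forward pass, which is the efficiency point made above.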

Different Mask Ratios

The encoder uses a moderate mask ratio, 15-30%, so it can retain most of the information in the input sentence. The decoder uses a more aggressive mask ratio, 50-70%. With this high mask ratio, the decoder cannot easily reconstruct the input from the masked input alone, so it must rely heavily on the sentence embedding produced by the encoder. This forces the encoder to learn deeper semantic understanding.
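As a rough sketch of the two-mask workflow, the snippet below masks the same token sequence twice with different ratios. It is a simplification: the helper names are hypothetical, masking is done by plain replacement with a [MASK] string, and details such as special tokens or BERT's random-token replacement are ignored. The default ratios of 30% and 50% are the ones I used for training, described later in this article.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, ratio, rng):
    """Replace round(len(tokens) * ratio) randomly chosen tokens with [MASK]."""
    n_mask = round(len(tokens) * ratio)
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in masked_positions else t for i, t in enumerate(tokens)]

def retromae_inputs(tokens, enc_ratio=0.3, dec_ratio=0.5, seed=0):
    """Return (encoder_input, decoder_input), masked independently."""
    rng = random.Random(seed)
    encoder_input = mask_tokens(tokens, enc_ratio, rng)  # light masking
    decoder_input = mask_tokens(tokens, dec_ratio, rng)  # aggressive masking
    return encoder_input, decoder_input

tokens = ["日本", "語", "の", "検索", "モデル", "を", "事前", "学習", "し", "た"]
enc, dec = retromae_inputs(tokens)
print(enc)
print(dec)
```

With 10 tokens, the encoder input loses 3 tokens and the decoder input loses 5, so the decoder has to lean on the sentence embedding to fill in what its own input no longer contains.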

Pretraining Japanese RetroMAE Models

The original paper trains on English Wikipedia, BookCorpus, and MS MARCO. For Japanese, I used the following datasets, which cover similar kinds of text:

(A) Japanese Wikipedia: hpprc/jawiki-paragraphs
(A) jawiki-books: hpprc/jawiki-books-paragraphs
(B) Japanese MQA: hpprc/mqa-ja
(B) JSNLI: shunk031/jsnli

For Wikipedia and jawiki-books, I used only paragraphs and did not include titles. For MQA, I concatenated query and document. For JSNLI, I removed spaces.
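The MQA and JSNLI preprocessing can be illustrated roughly like this. The function names are hypothetical, and two details are assumptions on my part for the sake of the example: the query and document are joined with no separator, and only ASCII spaces are removed.

```python
def mqa_to_text(query, document, sep=""):
    """Join a query and its document into one training text.

    The empty separator is an assumption made for this sketch.
    """
    return query + sep + document

def remove_spaces(text):
    """Remove ASCII spaces, e.g. from whitespace-segmented Japanese text."""
    return text.replace(" ", "")

print(mqa_to_text("営業時間は？", "平日9時から18時までです。"))
print(remove_spaces("これ は ペン です 。"))
```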

Instead of training from scratch, I used tohoku-nlp/bert-base-japanese-v3 and cl-nagoya/ruri-pt-base as the base models for RetroMAE training. ruri-pt-base is a pretrained model based on bert-base-japanese-v3 and further trained with contrastive learning. Because the MLM decoder layer is lost in that process, I used a model in which the decoder layer weights were copied over from bert-base-japanese-v3.
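Conceptually, copying the decoder layer weights amounts to merging two sets of parameters. The sketch below uses plain dictionaries in place of real PyTorch state dicts so that it stays self-contained. The function name is hypothetical, and the "cls." prefix is an assumption based on how Hugging Face's BertForMaskedLM names its MLM head parameters; it does not show the exact procedure used to build the released model.

```python
def graft_mlm_head(encoder_state, donor_state, head_prefix="cls."):
    """Return encoder_state plus every donor parameter under head_prefix.

    encoder_state: parameters of the model that lacks an MLM head
    donor_state:   parameters of a model that still has one
    """
    merged = dict(encoder_state)  # copy, so the inputs are not modified
    for name, tensor in donor_state.items():
        if name.startswith(head_prefix):
            merged[name] = tensor
    return merged

# Toy stand-ins for state dicts (real values would be tensors).
ruri_like = {"bert.embeddings.word_embeddings.weight": "ruri-weights"}
bert_like = {
    "bert.embeddings.word_embeddings.weight": "bert-weights",
    "cls.predictions.bias": "bert-mlm-bias",
}
merged = graft_mlm_head(ruri_like, bert_like)
print(sorted(merged))
```

The encoder weights stay those of the contrastively trained model, and only the head parameters come from the donor.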

For the training script, I used the MIT-licensed open-source retromae_pretrain. The encoder mask ratio was 30%, and the decoder mask ratio was 50%. The other Trainer hyperparameters were:

  "learning_rate": 1e-4,
  "num_train_epochs": 2,
  "per_device_train_batch_size": 16,
  "gradient_accumulation_steps": 32,
  "warmup_ratio": 0.05,
  "lr_scheduler_type": "cosine",
  "bf16": true,
  "dataloader_drop_last": true,
  "dataloader_num_workers": 12
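To make these numbers more concrete, the sketch below computes the effective batch size and the learning rate at a given step for linear warmup followed by cosine decay. The number of devices is not stated in this article, so num_devices=1 is purely a placeholder, and the total step count in the example is hypothetical. The schedule formula is my reading of the usual warmup-plus-cosine schedule, not code taken from the training script.

```python
import math

def effective_batch_size(per_device, grad_accum, num_devices):
    """Number of sequences contributing to one optimizer step."""
    return per_device * grad_accum * num_devices

def lr_at(step, total_steps, base_lr=1e-4, warmup_ratio=0.05):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# num_devices=1 is a placeholder; multiply by the real device count.
print(effective_batch_size(per_device=16, grad_accum=32, num_devices=1))

# Hypothetical run of 1000 optimizer steps.
for step in (0, 25, 50, 500, 1000):
    print(step, lr_at(step, total_steps=1000))
```

With a per-device batch of 16 and 32 accumulation steps, each device contributes 512 sequences per optimizer step.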

Using these settings, I trained two RetroMAE variants of each base model: one on dataset group (A) only, and one on (A) + (B).

Evaluation on Downstream Retrieval Tasks

For downstream retrieval evaluation, I trained Japanese SPLADE models using only the mMARCO dataset. The settings are based on japanese-splade-base-v1-mmarco-only, with the number of training epochs reduced from 12 to 10 and model_name replaced with the model being evaluated.

For evaluation, I used my fork of JMTEB, modified to evaluate sparse vectors, and ran retrieval tasks.
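For readers unfamiliar with sparse retrieval, here is a minimal sketch of how sparse vectors such as those produced by SPLADE can be scored and ranked with a dot product. It is only an illustration with made-up weights and hypothetical function names, and it does not reproduce the code in the JMTEB fork.

```python
def sparse_dot(vec_a, vec_b):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a  # iterate over the smaller vector
    return sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)

def rank(query_vec, docs):
    """Return document ids sorted by descending score against the query."""
    return sorted(docs, key=lambda doc_id: sparse_dot(query_vec, docs[doc_id]),
                  reverse=True)

# Made-up weights for illustration only.
query = {"検索": 1.2, "モデル": 0.5}
docs = {
    "d1": {"検索": 0.8, "日本語": 0.3},
    "d2": {"モデル": 0.4},
}
print(rank(query, docs))
```

Only tokens present in both vectors contribute to the score, which is why such vectors can be served from an inverted index.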

The evaluation scores are:

model_name	Avg.	jagovfaqs 22k	jaqket	mrtydi	nlp_journal abs_intro	nlp_journal title_abs	nlp_journal title_intro
bert-base-japanese-v3	0.7266	0.6532	0.6236	0.4521	0.8774	0.9732	0.7803
bert-base-japanese-v3 retromae(A)	0.7361	0.6655	0.6621	0.4557	0.8880	0.9604	0.7848
ruri-pt-base retromae(A)	0.7370	0.6657	0.6541	0.4608	0.8823	0.9768	0.7821
bert-base-japanese-v3 retromae(A+B)	0.7352	0.6631	0.6632	0.4526	0.8893	0.9722	0.7708
ruri-pt-base retromae(A+B)	0.7397	0.6678	0.6691	0.4667	0.8931	0.9605	0.7812

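In most evaluations, the models trained with RetroMAE scored higher than the model not trained with RetroMAE. The main exception is nlp_journal title_abs, where three of the four RetroMAE models scored slightly below the baseline. The best model, ruri-pt-base retromae(A+B), raised the average score from 0.7266 to 0.7397 compared with bert-base-japanese-v3, a gain of about 1.3 points (roughly 1.8% relative).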

Comparing the training data, adding (B) helped ruri-pt-base (average 0.7370 with (A) versus 0.7397 with (A+B)) but made little difference for bert-base-japanese-v3 (0.7361 versus 0.7352), so the benefit of the extra data is not uniform. Adding more datasets or training on domain-specific text may still improve performance further.
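The gains over the baseline can be checked directly from the Avg. column of the table above. The snippet below is just that arithmetic; the dictionary and function names are mine.

```python
# Avg. column copied from the evaluation table.
AVG = {
    "bert-base-japanese-v3": 0.7266,
    "bert-base-japanese-v3 retromae(A)": 0.7361,
    "ruri-pt-base retromae(A)": 0.737,
    "bert-base-japanese-v3 retromae(A+B)": 0.7352,
    "ruri-pt-base retromae(A+B)": 0.7397,
}
BASELINE = "bert-base-japanese-v3"

def gains(avg, baseline):
    """Map each non-baseline model to (absolute gain, relative gain in %)."""
    base = avg[baseline]
    return {
        name: (score - base, (score - base) / base * 100)
        for name, score in avg.items()
        if name != baseline
    }

for name, (points, pct) in gains(AVG, BASELINE).items():
    print(f"{name}: {points:+.4f} absolute, {pct:+.2f}% relative")
```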

The RetroMAE models trained on (A+B) are published on Hugging Face:

Closing

This article applied RetroMAE, a retrieval-oriented pretraining method, to Japanese BERT models and evaluated its effect. In downstream SPLADE evaluation, models trained with RetroMAE improved over the baseline bert-base-japanese-v3 on most retrieval tasks. In particular, the model based on the contrastively trained ruri-pt-base and trained on multiple datasets such as Wikipedia, books, and question-answer data raised the average score from 0.7266 to 0.7397, a relative improvement of about 1.8%.

Another advantage of RetroMAE is that it can be trained in an unsupervised way using only text data. This makes it useful for customizing models for specific domains or business tasks. Further improvements may be possible by adding more training data or continuing training on domain-specific text.

The RetroMAE models are available on Hugging Face. I hope this article helps improve performance on Japanese retrieval tasks.