How to Build a SPLADE Model: Japanese SPLADE Technical Report

created 2024-10-23

In recent years, the rise of large language models has made information retrieval increasingly important. Applications such as Retrieval-Augmented Generation, or RAG, need search systems that are both efficient and accurate.

In neural retrieval, dense retrievers have become mainstream, and strong multilingual models such as multilingual-e5 and bge-m3 are available. At the same time, sparse retrieval models, represented by SPLADE (Sparse Lexical and Expansion Model), have shown strong performance in English.

However, SPLADE depends heavily on lexical features, and its vocabulary is determined by the model's tokenizer. This makes a multilingual SPLADE difficult to build: multilingual tokenizers often split text in many languages close to the character level, so the resulting tokens do not correspond to meaningful words. I therefore developed and evaluated a SPLADE model specialized for Japanese.

The original SPLADE implementation, naver/splade, is released under CC-BY-NC, which restricts commercial use. I therefore implemented a Trainer based on the papers and released it as MIT-licensed open source software:

YAST - Yet Another SPLADE or Sparse Trainer

This report covers implementation details, evaluation results, and future directions for the Japanese SPLADE model.

SPLADE Algorithm

SPLADE learns sparse document and query representations for information retrieval. This section describes how it is trained.

Word Importance and Output Tokens

SPLADE uses the per-token output of a model pretrained with Masked Language Modeling, or MLM, to compute context-dependent word importance. More concretely, for every input position the MLM head of a pretrained model such as BERT produces one score per vocabulary item. A log-saturation function is applied to these scores, which suppresses extreme values while keeping important features, and max pooling is then taken across input positions so that each vocabulary item keeps its highest score. The result is a sparse, vocabulary-sized vector that captures the salient features of a document or query.

This operation is called SPLADE Max. A Python implementation is below.

import torch


def splade_max_pooling(logits, attention_mask):
    # logits: MLM output of shape (batch_size, seq_len, vocab_size)
    # attention_mask: shape (batch_size, seq_len), 1 for real tokens, 0 for padding

    # Step 1: apply log saturation, log(1 + x)
    # - torch.relu() clamps negative values to 0
    # - torch.log(1 + x) converts values to log scale and suppresses large values
    relu_log = torch.log(1 + torch.relu(logits))

    # Step 2: zero out scores at padded positions with attention_mask
    # unsqueeze(-1) aligns dimensions: (batch_size, seq_len, 1)
    weighted_log = relu_log * attention_mask.unsqueeze(-1)

    # Step 3: apply max pooling
    # torch.max() takes the maximum over sequence length (dim=1)
    # This keeps the highest score for each vocabulary item
    max_val, _ = torch.max(weighted_log, dim=1)

    # Result shape: (batch_size, vocab_size)
    return max_val
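
Written as a formula, the function above computes one weight per vocabulary item. The symbol names are chosen here for illustration: w_{ij} is the MLM logit for vocabulary item j at input position i, m_i is the attention mask value at position i, n is the sequence length, and s_j is the resulting importance of vocabulary item j.

```latex
s_j = \max_{i \in \{1, \dots, n\}} \; m_i \cdot \log\bigl(1 + \mathrm{ReLU}(w_{ij})\bigr),
\qquad j = 1, \dots, |V|
```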

Predicting Document-Query Relevance

The word importance scores from SPLADE Max are used to predict the relevance between documents and queries, mainly with an inner product. The difference between the prediction and the training data is defined as the loss.

Loss functions such as KL divergence, MarginMSE, and cross entropy can be used to measure the difference between the relevance scores the model predicts and the target scores or labels. These can be used alone or in combination. SPLADE-v3 combines KL divergence and MarginMSE.
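
The scoring and the three loss families can be sketched as follows. This is a minimal illustration and not the YAST or naver/splade implementation. The function names, the tensor layout, and the convention that the positive passage sits at index 0 of each candidate list are all assumptions made for this example.

```python
import torch
import torch.nn.functional as F


def relevance_scores(query_vecs, doc_vecs):
    # query_vecs: (batch, vocab); doc_vecs: (batch, n_docs, vocab)
    # Inner product between each query and its candidates -> (batch, n_docs)
    return torch.einsum("bv,bnv->bn", query_vecs, doc_vecs)


def cross_entropy_loss(scores):
    # Assumption: the positive passage is at index 0 of every candidate list
    target = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)
    return F.cross_entropy(scores, target)


def kl_div_loss(scores, teacher_scores):
    # KL(teacher || student) over the candidate list of each query
    student_log_probs = F.log_softmax(scores, dim=-1)
    teacher_probs = F.softmax(teacher_scores, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")


def margin_mse_loss(scores, teacher_scores):
    # Match the positive-minus-negative margins of student and teacher
    student_margin = scores[:, :1] - scores[:, 1:]
    teacher_margin = teacher_scores[:, :1] - teacher_scores[:, 1:]
    return F.mse_loss(student_margin, teacher_margin)


# Toy example: one query, one positive and one negative, vocabulary of size 3
query_vecs = torch.tensor([[1.0, 0.0, 2.0]])
doc_vecs = torch.tensor([[[1.0, 0.0, 1.0], [0.0, 3.0, 0.0]]])
teacher_scores = torch.tensor([[0.9, 0.1]])

scores = relevance_scores(query_vecs, doc_vecs)
ce = cross_entropy_loss(scores)
kl = kl_div_loss(scores, teacher_scores)
mse = margin_mse_loss(scores, teacher_scores)
```

In this toy case the positive shares two vocabulary items with the query and the negative shares none, so the positive receives the higher inner product.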

Sparsity and Regularization

Regularization is included in the loss to make the output word-importance distribution sparse. The main algorithms are:

L1 regularization: minimizes the sum of the absolute values of the output weights, pushing many of them toward zero and encouraging sparse representations.
FLOPs regularization: penalizes the squared average activation of each vocabulary dimension, which spreads non-zero elements evenly across dimensions and reduces the FLOPs needed for scoring quadratically. See Minimizing FLOPs to Learn Efficient Sparse Representations.
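
Both regularizers can be written in a few lines when applied to the SPLADE Max output vectors. This is a sketch under the assumption that `sparse_vecs` is a non-negative tensor of shape (batch, vocab); the function names are illustrative.

```python
import torch


def l1_regularizer(sparse_vecs):
    # Sum of absolute weights per vector, averaged over the batch
    return sparse_vecs.abs().sum(dim=-1).mean()


def flops_regularizer(sparse_vecs):
    # Mean absolute activation of each vocabulary dimension across the batch,
    # squared and summed over the vocabulary
    return (sparse_vecs.abs().mean(dim=0) ** 2).sum()


sparse_vecs = torch.tensor([[1.0, 0.0, 3.0],
                            [1.0, 0.0, 1.0]])
l1 = l1_regularizer(sparse_vecs)        # per-vector sums are 4 and 2
flops = flops_regularizer(sparse_vecs)  # per-dimension means are 1, 0, 2
```

The FLOPs term grows fastest for dimensions that are active in many vectors of the batch, which is why it discourages a few vocabulary items from firing everywhere.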

Different losses and regularization coefficients can be applied to queries and documents. Applying strong regularization from the beginning of training can harm importance prediction, so a warmup period that gradually increases the regularization loss weight is also used.
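
A warmup of the regularization weight might look like the sketch below. The quadratic ramp, the function name, and the example coefficients are assumptions for illustration; the text above only states that the weight increases gradually.

```python
def regularizer_weight(step, warmup_steps, max_weight):
    """Ramp the regularization weight from 0 to max_weight over warmup_steps."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return max_weight
    # Quadratic ramp: very small at first, reaching max_weight at warmup_steps
    return max_weight * (step / warmup_steps) ** 2


def total_loss(ranking_loss, query_reg, doc_reg, step,
               warmup_steps=1000, query_weight=0.1, doc_weight=0.05):
    # Queries and documents can use different coefficients (values are hypothetical)
    w_q = regularizer_weight(step, warmup_steps, query_weight)
    w_d = regularizer_weight(step, warmup_steps, doc_weight)
    return ranking_loss + w_q * query_reg + w_d * doc_reg


weight_start = regularizer_weight(0, 100, 0.1)
weight_half = regularizer_weight(50, 100, 0.1)
weight_end = regularizer_weight(100, 100, 0.1)
```

At step 0 the model is trained on the ranking loss alone, so it can first learn which tokens matter before being pushed toward sparsity.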

Training and Relevance Computation

By training with these methods, SPLADE can increase query-document relevance while encouraging sparsity. It combines sparse representations with neural contextual vocabulary information, enabling strong information retrieval.

Training Method for the Japanese Model

Dataset Preparation

For the final japanese-splade-base-v1 model, I used several subsets from hpprc/emb, which contains Japanese questions, answers, and hard negatives. The subsets include auto-wiki-qa, mmarco, jsquad, jaquad, auto-wiki-qa-nemotron, quiz-works, quiz-no-mori, miracl, jqara, mr-tydi, baobab-wiki-retrieval, and mkqa.

I also created hotchpotch/hpprc_emb-scores, a scored dataset using high-performance Japanese cross-encoder rerankers, BAAI/bge-reranker-v2-m3 and cl-nagoya/ruri-reranker-large. For English data, I used MS MARCO and data scored with BAAI/bge-reranker-v2-m3.

For filtering, I used the average score of the rerankers: positives with scores of 0.7 or higher, and negatives with scores of 0.3 or lower. This removes passages that the rerankers judge to be inappropriate for the query.
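
The filtering rule can be sketched as follows. The record layout (`text`, `label`, `scores`) is hypothetical and does not reflect the actual schema of hotchpotch/hpprc_emb-scores.

```python
def filter_by_reranker_scores(candidates, pos_threshold=0.7, neg_threshold=0.3):
    """Keep positives whose average reranker score is high and negatives whose
    average score is low; drop everything in between."""
    kept = []
    for cand in candidates:
        avg = sum(cand["scores"]) / len(cand["scores"])
        if cand["label"] == "pos" and avg >= pos_threshold:
            kept.append(cand)
        elif cand["label"] == "neg" and avg <= neg_threshold:
            kept.append(cand)
    return kept


# Each passage carries one score per reranker (two rerankers in this toy example)
candidates = [
    {"text": "a", "label": "pos", "scores": [0.9, 0.8]},  # kept
    {"text": "b", "label": "pos", "scores": [0.9, 0.3]},  # dropped: likely false positive
    {"text": "c", "label": "neg", "scores": [0.1, 0.2]},  # kept
    {"text": "d", "label": "neg", "scores": [0.6, 0.4]},  # dropped: likely false negative
]
kept_texts = [c["text"] for c in filter_by_reranker_scores(candidates)]
```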

For datasets that make up only a small share of the training data, I increased how often their examples are seen per epoch so the model would not forget their characteristics.

For an mMARCO-only training dataset, I created and used hotchpotch/mmarco-hard-negatives-reranker-score, based on mMARCO and scored with BAAI/bge-reranker-v2-m3. It uses the same filtering rule: positives at 0.7 or higher and negatives at 0.3 or lower.

Training Settings and Hyperparameters

I used simple cross-entropy loss as the training loss. I tried KL divergence and MarginMSE as well, but cross entropy produced the best result. The goal was to let the model learn the scores from high-performance rerankers.

For sparsity regularization, I used L1 regularization. Compared with FLOPs loss, L1 regularization encouraged sparsity more effectively for Japanese.

The learning rate was 5.0e-5, a common value for a 110M-parameter model in this setting. I used a cosine learning-rate scheduler and set 10% of the total steps as warmup.
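
The shape of such a schedule can be sketched as a multiplier applied to the base learning rate. This is an illustrative standalone function, not the scheduler code used in training; the linear warmup and the decay to zero are assumptions.

```python
import math


def lr_multiplier(step, total_steps, warmup_ratio=0.1):
    """Linear warmup over the first warmup_ratio of steps, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


# With 1000 total steps, warmup covers the first 100 steps
mult_start = lr_multiplier(0, 1000)
mult_mid_warmup = lr_multiplier(50, 1000)
mult_peak = lr_multiplier(100, 1000)
mult_halfway_decay = lr_multiplier(550, 1000)
mult_end = lr_multiplier(1000, 1000)
```

The effective learning rate at a given step is the base learning rate times this multiplier.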

Each training example pairs a query with one positive and seven negatives, for eight passages in total. The batch size was 32 for japanese-splade-base-v1 and 128 for japanese-splade-base-v1-mmarco-only. For mMARCO-only training, query and document sparsity converged quickly even with a large batch. For japanese-splade-base-v1, which trains on diverse datasets, larger batch sizes slowed sparsity convergence, so smaller batches worked better. If more training time and resources are available, larger batches might still improve japanese-splade-base-v1.

Detailed parameters are available in the actual training configuration files.

Removing Noise Tokens

In Japanese training, punctuation and symbols such as 、, 。, 「, and ： appeared prominently as noisy features. When these tokens remained in SPLADE Max output, I added their scores to the loss as a penalty. I extracted symbolic words with fugashi and unidic-lite.

By treating these as noise tokens and including them in the loss, the trained model almost stopped outputting them. Training also became more stable and converged faster.
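
The penalty can be sketched as a masked sum over the SPLADE Max output. Note the substitution: instead of fugashi and unidic-lite, this example detects symbols with Python's `unicodedata` categories so that it runs without extra dependencies. The toy vocabulary and function names are also assumptions.

```python
import unicodedata

import torch


def is_symbol_token(token):
    # True when every character is Unicode punctuation (P*) or a symbol (S*)
    return len(token) > 0 and all(
        unicodedata.category(ch)[0] in ("P", "S") for ch in token
    )


def build_noise_mask(vocab):
    # vocab: list of token strings indexed by token id
    return torch.tensor([1.0 if is_symbol_token(t) else 0.0 for t in vocab])


def noise_token_penalty(sparse_vecs, noise_mask):
    # Sum of the scores assigned to noise tokens, averaged over the batch
    return (sparse_vecs * noise_mask).sum(dim=-1).mean()


vocab = ["検索", "、", "。", "モデル", "「", "："]
noise_mask = build_noise_mask(vocab)
sparse_vecs = torch.tensor([[1.2, 0.5, 0.0, 0.8, 0.3, 0.0],
                            [0.0, 0.1, 0.1, 2.0, 0.0, 0.0]])
penalty = noise_token_penalty(sparse_vecs, noise_mask)
```

Adding this term to the training loss gives the model a direct gradient signal to drive the scores of symbol tokens to zero.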

Base Model

The base model was tohoku-nlp/bert-base-japanese-v3, a Japanese BERT model whose output layer already carries lexical semantic features from MLM pretraining.

Training

Using these settings, I fine-tuned and created japanese-splade-base-v1 and japanese-splade-base-v1-mmarco-only. On an RTX 4090, training took about 33 hours for japanese-splade-base-v1 and about 24 hours for japanese-splade-base-v1-mmarco-only.

japanese-splade-base-v1 was trained for 2 epochs because the dataset was large. japanese-splade-base-v1-mmarco-only was trained for 12 epochs because the dataset was smaller and contained only mMARCO. Increasing the number of epochs for japanese-splade-base-v1 lowered training loss but reduced retrieval performance during evaluation, probably because of overfitting.

The trained models, japanese-splade-base-v1 and japanese-splade-base-v1-mmarco-only, are published on Hugging Face.

Evaluation Results

JMTEB Retrieval

The JMTEB results are below. I used my fork of JMTEB, modified to evaluate sparse vectors.

model_name	Avg.	jagovfaqs	jaqket	mrtydi	nlp_journal abs_intro	nlp_journal title_abs	nlp_journal title_intro
japanese-splade-base-v1	0.7465	0.6499	0.6992	0.4365	0.8967	0.9766	0.8203
japanese-splade-base-v1-mmarco-only	0.7313	0.6513	0.6518	0.4467	0.8893	0.9736	0.7751
text-embedding-3-large	0.7448	0.7241	0.4821	0.3488	0.9933	0.9655	0.9547
GLuCoSE-base-ja-v2	0.7336	0.6979	0.6729	0.4186	0.9029	0.9511	0.7580
multilingual-e5-large	0.7098	0.7030	0.5878	0.4363	0.8600	0.9470	0.7248
multilingual-e5-small	0.6727	0.6411	0.4997	0.3605	0.8521	0.9526	0.7299
ruri-large	0.7302	0.7668	0.6174	0.3803	0.8712	0.9658	0.7797

On average, japanese-splade-base-v1 performed best. Note, however, that it was trained on data from some of the evaluated domains, such as Mr. TyDi and JAQKET, although not on the test data used in the JMTEB evaluation. japanese-splade-base-v1-mmarco-only was trained only on mMARCO, yet it was the best model on Mr. TyDi and competitive on the other tasks.

SPLADE models perform relatively poorly on jagovfaqs. This may be because the queries are FAQ-like and often resemble summarization or contextual similarity tasks. The other models are trained on semantic similarity tasks, while japanese-splade-base-v1 is not. Strong Japanese models such as ruri-large and GLuCoSE-base-ja-v2 may also benefit from training on Japanese data from MQA, a multilingual FAQ and CQA dataset.

JAQKET contains many quiz-style questions with distinctive Japanese phrasing. Models that learn these expressions score well, and because answer documents contain the correct answer words, SPLADE's lexical features likely help.

The Mr. TyDi result is counterintuitive: japanese-splade-base-v1, which should have learned the domain, is worse than japanese-splade-base-v1-mmarco-only, which did not. I have not fully analyzed this.

For the three NLP Journal tasks, SPLADE models perform well on title_abs, while text-embedding-3-large is much stronger on abs_intro and title_intro. This is because title_abs documents average 442 tokens, while abs_intro and title_intro average 2052 tokens. All models except text-embedding-3-large have a maximum input length of 512 tokens, while text-embedding-3-large supports 8191, so the other models evaluate only the beginning of long documents.

Reranking Evaluation

For reranking, I used JQaRA and JaCWIR.

model_name	JaCWIR map@10	JaCWIR HR@10	JQaRA ndcg@10	JQaRA mrr@10
japanese-splade-base-v1	0.9122	0.9854	0.6441	0.8616
japanese-splade-base-v1-mmarco-only	0.8953	0.9746	0.5740	0.8176
text-embedding-3-small	0.8168	0.9506	0.3881	0.6107
GLuCoSE-base-ja-v2	0.8567	0.9676	0.6060	0.8359
bge-m3+dense	0.8642	0.9684	0.5390	0.7854
multilingual-e5-large	0.8759	0.9726	0.5540	0.7988
multilingual-e5-small	0.8690	0.9700	0.4917	0.7291
ruri-large	0.8291	0.9594	0.6287	0.8418

japanese-splade-base-v1 achieved the best results on every metric in these evaluations, although its JQaRA scores should be read with care because the model was trained on JQaRA-domain data.
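
For readers unfamiliar with the metrics in the table, MRR@10 and nDCG@10 for a single query can be sketched as below. This is a generic illustration with binary labels and linear gain; the official JQaRA and JaCWIR evaluation scripts may differ in details.

```python
import math


def mrr_at_k(ranked_labels, k=10):
    # ranked_labels: relevance labels (1 or 0) in the order the model ranked them
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label > 0:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_labels, k=10):
    def dcg(labels):
        return sum(l / math.log2(i + 1) for i, l in enumerate(labels, start=1))

    # Ideal ordering places all relevant candidates first
    ideal = dcg(sorted(ranked_labels, reverse=True)[:k])
    return dcg(ranked_labels[:k]) / ideal if ideal > 0 else 0.0


# A reranked list where the relevant passages landed at ranks 2 and 4
labels = [0, 1, 0, 1]
mrr = mrr_at_k(labels)
ndcg = ndcg_at_k(labels)
```

The reported numbers are averages of such per-query values over all queries in the benchmark.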

English Evaluation

japanese-splade-base-v1 includes English MS MARCO data in training, so I evaluated it on MS MARCO dev with the script from naver/splade.

model_name	MRR@10 (MS MARCO dev)
japanese-splade-base-v1	0.047
japanese-splade-base-v1-mmarco-only	0.036
naver/splade_v2_max	0.340

There is a small improvement compared with the model that did not train on English data, but the score is far below naver/splade_v2_max, which is trained for English. The model has little English retrieval capability.

Sparsity Evaluation

I measured sparsity with the number of non-zero elements, the L0 norm, for queries and documents. The following results were measured on JMTEB retrieval tasks, top 1000, with JMTEB_L0.py.

JMTEB tasks	v1	v1-mmarco-only
jagovfaqs_22k-query	27.9	43.4
jaqket-query	23.3	38.9
mrtydi-query	13.8	20.5
nlp_journal_abs_intro-query	75.3	127.2
nlp_journal_title_abs-query	19	26.4
nlp_journal_title_intro-query	19	26.4
jagovfaqs_22k-docs	73.2	97.9
jaqket-docs	146.2	231.8
mrtydi-docs	89.3	100.4
nlp_journal_abs_intro-docs	95.7	182
nlp_journal_title_abs-docs	75.2	126.9
nlp_journal_title_intro-docs	95.7	182

The L0 norms show that v1-mmarco-only generally has more non-zero elements and is less sparse. Query and document sparsity are both important, but they have different requirements.
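
The L0 measurement itself is simple: count the non-zero entries of each vector and average over the collection. A minimal sketch, independent of JMTEB_L0.py, with illustrative function names:

```python
def l0_norm(vec):
    # Number of non-zero elements in one sparse vector
    return sum(1 for v in vec if v != 0)


def mean_l0(vecs):
    # Average L0 norm over a collection of query or document vectors
    return sum(l0_norm(v) for v in vecs) / len(vecs)


toy_vecs = [
    [0.0, 1.2, 0.0, 0.3],  # 2 non-zero elements
    [0.0, 0.0, 0.0, 2.0],  # 1 non-zero element
]
avg_l0 = mean_l0(toy_vecs)
```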

For search speed, higher query sparsity is especially valuable. Document sparsity also matters for memory and disk usage, but in production, millions to tens of millions of documents can often be searched in memory on one machine, so document sparsity may not need to be managed as strictly as query sparsity. At the same time, if documents have too few non-zero elements, retrieval quality can suffer. Tuning query and document sparsity is important for balancing search quality and efficiency.
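
A toy inverted index shows why query sparsity dominates search cost: the search loop visits one posting list per non-zero query token, so fewer query tokens mean fewer lists to traverse. This is a simplified sketch with hypothetical data, not a production search engine.

```python
from collections import defaultdict


def build_inverted_index(doc_vecs):
    # doc_vecs: {doc_id: {token_id: weight}} holding only non-zero weights
    index = defaultdict(list)
    for doc_id, vec in doc_vecs.items():
        for token_id, weight in vec.items():
            index[token_id].append((doc_id, weight))
    return index


def search(index, query_vec, top_k=10):
    # Work is proportional to the total length of the posting lists
    # for the non-zero tokens of the query
    scores = defaultdict(float)
    for token_id, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(token_id, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


doc_vecs = {
    "d1": {1: 1.0, 2: 0.5},
    "d2": {2: 2.0, 3: 1.0},
    "d3": {4: 1.0},
}
index = build_inverted_index(doc_vecs)
results = search(index, {2: 1.0, 3: 0.5})
```

Document d3 shares no token with the query, so it is never touched during search.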

Summary of Evaluation

These results suggest that japanese-splade-base-v1 is competitive with recent models for Japanese retrieval, especially on tasks where lexical features are important. Query and document sparsity are also sufficient for practical use.

Other models in the comparison are dense vector models, while SPLADE is a sparse vector model that emphasizes lexical features. Combining different models can produce more diverse search results than using dense models alone. This is important in real systems where diverse retrieval results are useful, such as passing varied search information to an LLM.
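
One common way to combine a dense and a sparse ranking is reciprocal rank fusion (RRF). The report does not name a specific fusion method, so RRF and the constant k=60 are assumptions chosen for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly in any list receive a larger contribution
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense_ranking = ["a", "b", "c"]   # hypothetical dense retriever output
sparse_ranking = ["c", "a", "d"]  # hypothetical SPLADE output
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Document d appears only in the sparse ranking but still reaches the fused list, which is the kind of diversity that combining model types can add.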

Future Work

japanese-splade-base-v1 has been released as a first artifact, but there is still room for improvement. The original SPLADE papers improve performance with self-distillation, multiple loss scores, and hard-negative sampling using SPLADE itself.

I have also not fully explored selecting or training pretrained models suited to retrieval tasks. Methods such as Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval and RetroMAE may improve performance.

Other possibilities include adding FAQ-style task data, supporting longer context, and adding more diverse datasets. Current data tends to be Wikipedia-heavy.

Recent models such as Llama 3.1 have licenses that allow their outputs to be used for training, making it easier to create retrieval datasets without licensing issues. The hpprc/emb dataset used here provides high-quality data using LLM outputs, as described in Ruri: Japanese General Text Embeddings.

Creating retrieval-suitable queries from documents used to require significant manual effort. LLMs now make it possible to generate large numbers of queries at low cost. Training on data from a specific domain often improves performance on that domain, so richer datasets should further improve retrieval models.

Closing

This report described japanese-splade-base-v1, a SPLADE model specialized for Japanese, and evaluated it. The results show that it performs strongly compared with recent models for Japanese information retrieval.

Future work includes methods for further performance improvement, selecting pretrained models better suited to retrieval, and using more diverse datasets.

I hope that releasing the Japanese SPLADE model and the SPLADE Trainer contributes to the development of information retrieval technology.

References

@article{tateno2024splade,
    title={SPLADE モデルの作り方・日本語SPLADEテクニカルレポート},
    author={TatenoYuichi},
    year={2024},
    url={/articles/japanese-splade-tech-report}
}