
Releasing a High-Performance Japanese SPLADE Sparse Retrieval Model

created 2024-10-07

I created and released a Japanese SPLADE sparse vector model for text retrieval. On retrieval tasks over large text collections, and on reranking tasks that reorder documents related to a query, it achieves results competitive with recent dense vector models such as multilingual-e5-large, ruri-large, GLuCoSE-base-ja-v2, and OpenAI text embeddings.

https://huggingface.co/hotchpotch/japanese-splade-base-v1

For technical details on building the Japanese SPLADE model, see the companion article "How to Build a SPLADE Model: Japanese SPLADE Technical Report".

What Is SPLADE?

SPLADE (Sparse Lexical and Expansion Model) is a retrieval model that uses sparse vectors. BM25 is the representative sparse retrieval algorithm and has been widely used for many years because of its strong performance. However, BM25 depends on exact word matches between queries and documents, so it can miss documents that contain related words or synonyms.

SPLADE uses a Transformer architecture and can include contextually related words in the vector representation. This allows words beyond exact matches to become retrieval candidates, enabling more flexible and effective search.

Characteristics and Benefits

SPLADE uses a pretrained Transformer model, such as BERT, to understand the context of the input text. It does not depend only on exact word matches and can effectively extract contextually related words. Each word is assigned an importance score, making it clear which words matter for retrieval. It also produces sparse vectors, where many dimensions are zero, which keeps computation manageable and enables efficient search.

These characteristics make SPLADE suitable for retrieval that has to handle related terms and synonyms. Sparse vectors keep search fast and computation low, and the explicit per-word importance scores make retrieval results easier to interpret. SPLADE is also relatively easy to add to existing search engines.

A Concrete Example

To understand how SPLADE works, here is a concrete example from the actual japanese-splade-base-v1 model. You can also get outputs easily from the Japanese SPLADE demo.

Example of Word Expansion

SPLADE output for "How can I improve my car's fuel efficiency?"

Score	Word (vocab)
2.1797	車
2.1465	燃費
1.7344	向上
1.5586	方法
1.3291	燃料
1.1377	効果
0.8716	良い
0.8452	改善
0.8340	アップ
0.7065	いう
0.6450	理由
0.4355	価格
0.3184	は
0.2510	家
0.2417	せる
0.2286	目的
0.1735	店
0.1627	手段
0.0851	用
0.0752	率
0.0734	上昇

As shown here, the model understands the context of the query and extracts related words such as 燃料 ("fuel") and 効果 ("effect"), even though they are not present in the original sentence. Each word also has an importance score. Some words that look unrelated or noisy, such as the Japanese particle は, are also included. Because such words appear in the output for many different texts, they act as background noise that can mostly be ignored, and search still works well.

The same process can be applied to documents. By taking the dot product between the sparse vector for a query and the sparse vector for a document, we can compute how related they are.
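As a minimal sketch of that scoring step, the snippet below stores each sparse vector as a `{token: weight}` dictionary and sums the products of the weights for tokens that appear in both. The query weights are copied from the table above; the two document vectors and the helper name `sparse_dot` are made up for illustration and are not model output.

```python
def sparse_dot(query_vec: dict[str, float], doc_vec: dict[str, float]) -> float:
    """Dot product of two sparse vectors stored as {token: weight}."""
    # Iterate over the smaller vector; only tokens present in both contribute.
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)


# Query weights: the top entries from the table above.
query = {"車": 2.1797, "燃費": 2.1465, "向上": 1.7344, "方法": 1.5586, "燃料": 1.3291}

# Hypothetical document vectors, invented for this example.
doc_a = {"燃費": 1.8, "燃料": 1.2, "タイヤ": 0.9}  # about fuel economy
doc_b = {"料理": 2.0, "方法": 0.7}                 # about cooking methods

for name, doc in [("doc_a", doc_a), ("doc_b", doc_b)]:
    print(name, round(sparse_dot(query, doc), 4))
```

The fuel-economy document scores higher because it shares two heavily weighted tokens with the query, while the cooking document overlaps only on the generic token 方法.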

Performance

As noted above, the SPLADE model performs well on many Japanese information retrieval tasks. Benchmark results on JMTEB retrieval, JQaRA, and JaCWIR are shown below. It performs strongly on tasks where lexical features matter. It is weaker on tasks such as jagovfaqs, where matching sentences with similar meaning appears to matter more than matching words.

JMTEB Retrieval

JQaRA and JaCWIR Reranking

Most open source search engines, including Elasticsearch, OpenSearch, Qdrant, and Vespa, support sparse retrieval, so adoption is relatively easy. Sparse vector search is also a mature technique and, like BM25, it is fast.

SPLADE and BM25 strongly reflect lexical features, so their results often differ from dense vector models such as multilingual-e5. Combining both sets of results as hybrid search can produce better and more diverse results. Most of the search engines mentioned above also support hybrid search, and many make it easy to use.
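The article does not say which fusion method to use for hybrid search. One common choice is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method used here. The document IDs are placeholders, and `k = 60` is only the conventional default for RRF.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    so documents ranked well by several retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder results: one list from a sparse retriever, one from a dense one.
sparse_hits = ["doc3", "doc1", "doc7"]
dense_hits = ["doc1", "doc5", "doc3"]

fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

RRF uses only ranks, so the sparse dot-product scores and the dense similarity scores do not need to be put on a common scale first.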

Is It Hard to Run in Production?

SPLADE can be operated almost the same way as a dense vector model, so it is not especially difficult. As mentioned above, most search engines support sparse search.

Obtaining a SPLADE sparse vector is also not complicated. The per-token vocabulary logits are passed through a log-saturation function, log(1 + ReLU(x)), and the result is max-pooled over the token positions. This pooling scheme is often called SPLADE max.
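To make the arithmetic concrete, here is a small pure-Python sketch of that pooling on toy numbers. It assumes the input is a list of per-token rows of vocabulary logits plus a 0/1 attention mask; the function name `splade_max_pool` and the toy values are illustrative and do not come from the model.

```python
import math


def splade_max_pool(token_logits, attention_mask):
    """Pool per-token vocabulary logits into one weight per vocabulary entry.

    token_logits: list of rows, one row of vocab-size floats per token.
    attention_mask: list of 0/1 flags; 0 marks padding that must be skipped.
    """
    pooled = [0.0] * len(token_logits[0])
    for row, keep in zip(token_logits, attention_mask):
        if not keep:
            continue
        for j, logit in enumerate(row):
            # Log-saturation: negative logits become 0, large ones grow slowly.
            weight = math.log1p(max(logit, 0.0))
            # Max pooling: keep the strongest activation seen for this entry.
            pooled[j] = max(pooled[j], weight)
    return pooled


# Toy input: 3 token positions (the last is padding), vocabulary of 4 entries.
toy_logits = [
    [1.0, -2.0, 0.0, 0.2],
    [3.0, 0.5, -1.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],  # padding row, ignored via the mask
]
toy_mask = [1, 1, 0]

weights = splade_max_pool(toy_logits, toy_mask)
print([round(w, 4) for w in weights])
```

Vocabulary entries that never receive a positive logit stay exactly zero, which is where the sparsity of the final vector comes from.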

Example: obtaining sparse vectors with the Transformers library
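A minimal sketch of such an example follows, assuming PyTorch and Transformers are installed and that the model loads through `AutoModelForMaskedLM`. The helper names (`splade_max_pooling`, `encode`, `top_terms`) and the sample query are my own choices for illustration, and the tokenizer may require additional Japanese tokenization packages depending on your environment.

```python
import torch

MODEL_NAME = "hotchpotch/japanese-splade-base-v1"


def splade_max_pooling(logits: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Turn (batch, seq_len, vocab) logits into (batch, vocab) sparse weights."""
    saturated = torch.log1p(torch.relu(logits))        # log(1 + ReLU(x))
    masked = saturated * attention_mask.unsqueeze(-1)  # zero out padding positions
    return masked.max(dim=1).values                    # max over token positions


def encode(texts, model_name=MODEL_NAME):
    """Download the model (on first use) and return sparse vectors and the tokenizer."""
    # Imported lazily so the pooling function can be used without Transformers.
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return splade_max_pooling(logits, batch["attention_mask"]), tokenizer


def top_terms(vector: torch.Tensor, tokenizer, k: int = 10):
    """List the k highest-weighted vocabulary tokens of one sparse vector."""
    scores, ids = torch.topk(vector, k)
    tokens = tokenizer.convert_ids_to_tokens(ids.tolist())
    return [(t, round(s, 4)) for t, s in zip(tokens, scores.tolist()) if s > 0]


# Usage (downloads the model; the query string is only an illustration):
#   vectors, tokenizer = encode(["車の燃費を向上させる方法は？"])
#   for token, score in top_terms(vectors[0], tokenizer):
#       print(f"{score:.4f}\t{token}")
```

Each returned vector has one entry per vocabulary token, and most entries are zero. In practice only the non-zero indices and their weights need to be stored in the search engine.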

The model can also be served with text-embeddings-inference (TEI), a fast inference server that is convenient for production use. A variant of the model packaged for TEI is available here:

https://huggingface.co/hotchpotch/japanese-splade-base-v1-dummy-fast-tokenizer-for-tei
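As a rough sketch of how a client might talk to such a server, the code below builds a request and parses a response using only the standard library. It assumes a TEI instance is already running at `http://localhost:8080`, that it exposes an `/embed_sparse` route, and that the response is a list (one entry per input) of `{"index", "value"}` pairs. Check these details against the TEI documentation for your version; the helper names are hypothetical.

```python
import json
import urllib.request


def build_sparse_request(text: str, base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a POST request for the (assumed) sparse embedding route."""
    payload = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embed_sparse",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_sparse_response(body: str) -> list[dict[int, float]]:
    """Convert the assumed response shape into one {vocab_index: weight} dict per input."""
    return [{item["index"]: item["value"] for item in row} for row in json.loads(body)]


# Sending the request needs a running server, so it is left as a comment:
#   with urllib.request.urlopen(build_sparse_request("some query text")) as resp:
#       vectors = parse_sparse_response(resp.read().decode("utf-8"))

# Parsing a hand-written sample in the assumed response shape:
sample = '[[{"index": 12, "value": 1.5}, {"index": 340, "value": 0.25}]]'
print(parse_sparse_response(sample))
```

The resulting index-to-weight dictionaries map directly onto the sparse vector formats that most search engines accept.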

Closing

At first I was not sure whether SPLADE would really perform well. However, SPLADE-v3, trained only on the English MS MARCO dataset, performs well across a variety of retrieval tasks. That made me interested in what would happen if it were trained properly for Japanese.

SPLADE also depends on the tokenizer vocabulary. That makes it a poor fit with multilingual model tokenizers that often split Japanese at the character level, so specialized training for Japanese is needed. This was another reason the project seemed interesting. High-performance multilingual dense vector models that support Japanese are already being pursued by many companies.

As a result, I was able to create a base-size, 110M-parameter sparse retrieval model that outperforms large OpenAI embedding models on some benchmarks, with the caveat that the evaluation includes some in-domain tasks such as JAQKET and Mr.TyDi.

Training took about 33 hours on an RTX 4090. Because SPLADE can be trained with relatively modest compute and time, creating a model trained on domain-specific data with SPLADE seems like a useful approach for teams that need retrieval results adapted to their own domain.

I expect Japanese sparse retrieval performance with SPLADE to continue improving, and I think it remains an interesting research area.