
Releasing Japanese SPLADE v2, a Strong Retrieval Model for Texts Under 512 Tokens

created 2024-12-19

In 2024 I became interested in information retrieval and have been building retrieval models as a hobby, publishing them as @hotchpotch. Transformers are fun to work with because they often learn reasonably well even without an elaborate setup.

Working with consumer GPUs at home, I have released Japanese SPLADE v2 (japanese-splade-v2), an improved version of the Japanese SPLADE v1 retrieval model I published earlier. On the JMTEB retrieval benchmarks it achieves very strong scores for documents of up to 512 tokens, a length that is common in RAG. Given its model size and performance, I think it is a well-balanced retrieval model.

This article is also day 24 of the Information Retrieval / Search Technology Advent Calendar 2024.

What Is SPLADE?

SPLADE is probably unfamiliar to many people, so before explaining it I will briefly cover dense vector search and sparse vector search.

When people talk about natural language search today, the popular approach is dense retrieval, also called text embeddings or embedding search. However, sparse retrieval is still actively used in many places. Keyword-based methods such as TF-IDF and BM25 are representative examples.

Suppose you search for "Tell me a good cafe." Sparse vector search, such as TF-IDF or BM25, scores how important keywords are and returns results. In this case, the results depend on the frequency and rarity of words such as "good" and "cafe". Documents with matching distinctive keywords tend to rank higher.
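To make the "frequency and rarity" idea concrete, here is a minimal BM25 sketch. It assumes the text is already tokenized into words (real Japanese text would need a tokenizer first), uses the common defaults k1=1.2 and b=0.75, and uses the non-negative IDF variant popularized by Lucene. The function name and the toy documents are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score every tokenized document against the query with BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in docs_tokens for term in set(doc))

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # a term that does not match contributes nothing
            # Rare terms get a larger IDF than frequent ones.
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Term frequency saturates and is normalized by document length.
            tf = term_freq[term]
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores


# Toy corpus (hypothetical).
docs = [
    ["good", "cafe", "near", "station"],
    ["good", "weather", "today"],
    ["popular", "coffee", "shop"],
]
scores = bm25_scores(["good", "cafe"], docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}", " ".join(doc))
```

The third document scores zero even though "popular coffee shop" is clearly related, because none of its words match the query.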

Dense retrieval represents the meaning of words and phrases as vectors. For "good cafe", it can also retrieve contextually related phrases such as "popular coffee shop" or "pleasant cafe". This is because the neural network model, usually a Transformer, has learned broad semantic representations of words and sentences.

In short, sparse vectors such as BM25 emphasize the keywords themselves, while dense vectors emphasize the meaning and nuance of the keywords. Which one to use depends on whether exact keyword matching or semantic breadth is more important.

Difference in Dimensions

Dense and sparse vectors also differ in the number of dimensions used to represent information.

Dense vectors typically have large dimensionality, often from 384 to 3072 dimensions, and sometimes more depending on the model. For example, OpenAI's text-embedding-3-large uses 3072 dimensions by default. Higher dimensionality means that vector computations, such as dot products or cosine similarity, become more expensive and require more storage and memory. This is one of the challenges of dense vectors.
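As a rough back-of-the-envelope sketch of that cost, the following computes the raw size of a dense index and a cosine similarity in plain Python. It assumes vectors are stored as 32-bit floats and ignores any index overhead; the one-million-document corpus and the function names are hypothetical.

```python
import math


def dense_index_bytes(num_docs, dims, bytes_per_value=4):
    """Raw size of the stored vectors, assuming float32 and no index overhead."""
    return num_docs * dims * bytes_per_value


def cosine_similarity(a, b):
    """Cosine similarity; the work grows linearly with the number of dimensions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical corpus of one million documents.
size_3072 = dense_index_bytes(1_000_000, 3072)
size_384 = dense_index_bytes(1_000_000, 384)
print(f"3072 dims: {size_3072 / 1e9:.3f} GB")
print(f" 384 dims: {size_384 / 1e9:.3f} GB")
```

Under these assumptions, the 3072-dimensional index needs eight times the storage of the 384-dimensional one, and each similarity computation touches eight times as many values.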

In real search systems, exhaustively comparing a query against every high-dimensional vector is too inefficient, so approximate nearest neighbor (ANN) search algorithms are used to trade a little accuracy for better computational efficiency.

Sparse vectors, when keyword-based, theoretically use the vocabulary of the whole document collection as dimensions, creating a large vector space. In practice, however, most dimensions are zero, and only a small number of elements are non-zero. A query such as "Tell me a good cafe" uses only the dimensions corresponding to words like "good", "cafe", and "tell". The other tens or hundreds of thousands of possible dimensions remain zero. This greatly reduces storage, memory, and computation in production and enables fast search.

Sparse vectors also have the advantage that it is easy to understand what each non-zero dimension means. It is clear which dimensions correspond to keywords such as "good" and "cafe", making results easier to interpret.

Here is an example of a dense vector:

dense_vector = [
 0.0023, -0.0008, 0.0017, 0.0009, -0.0025,
 ... # elements continue for the number of dimensions
]

All dimensions in a dense vector have meaning, but it is hard to understand what each value specifically represents.

A sparse vector, on the other hand, has an easier-to-understand structure:

sparse_vector = {
  33721: 1.5, # dimension 33721 corresponds to "good"
  1191: 2.3, # dimension 1191 corresponds to "cafe"
  997: 0.2 # dimension 997 corresponds to "tell"; frequent words have lower scores
  # all other dimensions are zero and do not need to be written
}

In this example, it is clear which dimensions correspond to "good", "cafe", and "tell". This makes it easier to interpret which words contributed to the search result.
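Scoring with this representation is just a dot product over the dimensions that both vectors share. Here is a minimal sketch using the query vector above; the document vector and the helper names are hypothetical.

```python
def sparse_dot(vec_a, vec_b):
    """Dot product of two sparse vectors stored as {dimension: weight} dicts."""
    # Iterate over the smaller vector; dimensions missing from either side are zero.
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a
    return sum(w * vec_b[dim] for dim, w in vec_a.items() if dim in vec_b)


def matched_dimensions(vec_a, vec_b):
    """Dimensions that contributed to the score, useful for explaining a result."""
    return sorted(set(vec_a) & set(vec_b))


query = {33721: 1.5, 1191: 2.3, 997: 0.2}  # "good", "cafe", "tell"
doc = {1191: 1.0, 33721: 0.5, 4242: 0.7}   # hypothetical document vector

score = sparse_dot(query, doc)              # 1.5 * 0.5 + 2.3 * 1.0
print(score, matched_dimensions(query, doc))
```

Only the two shared dimensions are touched, no matter how large the vocabulary is.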

Dense vectors are good at capturing broad meaning, but their many dimensions make them computationally expensive. Sparse vectors are efficient and especially useful for precise keyword search.

Weaknesses of Sparse Vector Search

Sparse vectors use relatively few active dimensions and make it easy to understand which words matched. That may sound like an obvious win.

However, dense retrieval is popular for natural language search because of its accuracy. Algorithms such as BM25 essentially match on the keywords themselves, supplemented at best by manually maintained synonym dictionaries. If you search for "good cafe", BM25 will usually not match "tasty coffee shop" because the keywords do not align. Dense vectors use fuzzier semantic representations and can treat texts like "tasty coffee shop" as similar.

SPLADE: Sparse Vectors That Understand and Expand Context

Sparse vector search is well suited to exact keyword matching, such as e-commerce search where similar but different products can be wrong. For natural language queries, dense retrieval often seems more suitable.

As people increasingly want systems to find target documents from casual natural language, similar to talking with AI, dense retrieval models have become popular.

This is where SPLADE (Sparse Lexical and Expansion Model) comes in. Its key feature is that it understands context and proposes multiple appropriate words or tokens. For example, for the query "What time of day has the highest household TV rating in Japan?", SPLADE can output related terms inferred from context, not only words that appear in the query.

Words directly included in the query:
- Japan
- viewing
- household
- time
Related words inferred from context:
- TV and broadcasting: broadcast, program, slot
- Metrics: rate, rise, high
- Time-related: time, period

Traditional sparse vector search could only find documents where the entered keywords matched exactly. SPLADE can understand context and search with related words as well, while preserving the fast retrieval performance of sparse vectors.
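How does a model turn context into weighted terms? The SPLADE papers describe taking the masked language model's logits over the vocabulary at every input token position, applying log(1 + ReLU(x)), and pooling across positions (max pooling in the later versions). The sketch below reproduces only that pooling step in plain Python. The four-word vocabulary and the logit values are made up, and I am assuming, not asserting, that this is the exact formulation japanese-splade-v2 uses.

```python
import math


def splade_pool(token_logits):
    """Pool per-token vocabulary logits into one sparse {vocab_id: weight} dict.

    token_logits[i][j] is the logit of vocabulary entry j at input position i.
    """
    vocab_size = len(token_logits[0])
    weights = {}
    for j in range(vocab_size):
        # log(1 + ReLU(x)) dampens large logits; max picks the strongest position.
        w = max(math.log1p(max(row[j], 0.0)) for row in token_logits)
        if w > 0.0:
            weights[j] = w  # zero weights are simply not stored
    return weights


# Made-up example: 2 input tokens, 4 vocabulary entries.
vocab = ["tv", "broadcast", "rating", "cafe"]
logits = [
    [2.0, 0.5, -1.0, -3.0],
    [0.1, 1.5, 3.0, -2.0],
]
weights = splade_pool(logits)
for vocab_id, weight in weights.items():
    print(vocab[vocab_id], round(weight, 3))
```

A vocabulary entry such as "broadcast" can receive a weight even if it never appears in the input, which is where the expansion comes from, while unrelated entries such as "cafe" drop out entirely.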

Efficient Retrieval

SPLADE can perform this kind of advanced search efficiently.

For example, the query "What time of day has the highest household TV rating in Japan?" is represented as follows:
sparse_vector = {
    1423: 1.71,  # corresponds to "Japan"
    5891: 1.59,  # corresponds to "viewing"
    8754: 1.57,  # corresponds to "household"
    2341: 1.33,  # corresponds to "time"
    9876: 0.96,  # corresponds to "broadcast"
    # ...other related dimensions
}

Only the necessary information is stored as a sparse vector, and matching uses a small number of dimensions. The important point is that SPLADE scores are not simple frequencies; they represent contextual importance.
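Matching on a small number of dimensions is typically done with an inverted index, the same structure keyword search engines use. Here is a minimal sketch under that assumption: the query vector is the one above, while the three document vectors, their IDs, and the function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each dimension to the (doc_id, weight) pairs that are non-zero in it."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for dim, weight in vec.items():
            index[dim].append((doc_id, weight))
    return index


def search(index, query_vec, top_k=10):
    """Accumulate dot-product scores by visiting only the query's dimensions."""
    scores = defaultdict(float)
    for dim, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(dim, ()):
            scores[doc_id] += q_weight * d_weight
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


docs = {
    "doc_tv": {1423: 1.2, 5891: 1.4, 9876: 1.1},
    "doc_travel": {1423: 0.9, 7001: 1.3},
    "doc_cooking": {7002: 1.5},
}
query = {1423: 1.71, 5891: 1.59, 8754: 1.57, 2341: 1.33, 9876: 0.96}

results = search(build_inverted_index(docs), query)
print(results)
```

Documents that share no dimension with the query, such as doc_cooking here, are never visited at all.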

Why SPLADE?

Compared with other retrieval approaches:

Traditional sparse retrieval such as BM25
- Depends on keyword combinations such as "rating" + "time slot"
- Weak against paraphrases such as "broadcast peak time"
- Strong for exact keyword matches
- Easy to explain results
Dense retrieval
- Represents queries and documents with dense vectors, so stronger accuracy often requires larger models and higher vector dimensions
  - This affects inference speed and search speed
- Results are harder to interpret
SPLADE, context-aware sparse retrieval
- Can search with contextual understanding
- Maintains fast search performance
  - Queries are often around 20-40 dimensions and documents around 150-400 dimensions
  - Runtime tradeoffs between accuracy and speed are possible by not searching or indexing low-importance words
- Results are easy to interpret because you can see which word tokens matched
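The accuracy and speed tradeoff mentioned in the list above can be sketched as simple pruning: keep only the highest-weighted terms, or drop terms below a threshold, before searching or indexing. The function name and the cutoff values here are hypothetical, and how much accuracy is lost depends on your data.

```python
def prune_sparse_vector(vec, max_terms=None, min_weight=0.0):
    """Keep terms with weight >= min_weight, then at most max_terms of the heaviest."""
    kept = [(dim, w) for dim, w in vec.items() if w >= min_weight]
    kept.sort(key=lambda item: item[1], reverse=True)
    if max_terms is not None:
        kept = kept[:max_terms]
    return dict(kept)


query = {1423: 1.71, 5891: 1.59, 8754: 1.57, 2341: 1.33, 9876: 0.96}

top3 = prune_sparse_vector(query, max_terms=3)
above_one = prune_sparse_vector(query, min_weight=1.0)
print(top3)
print(above_one)
```

Fewer query terms means fewer posting lists to scan, and fewer document terms means a smaller index.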

SPLADE balances many requirements of modern search systems.

How Good Is the Actual Performance?

Let's look at performance, especially the ability to retrieve appropriate documents for natural language questions.

Performance is measured with nDCG@10 on the JMTEB retrieval benchmark. For texts of 512 tokens or fewer, Japanese SPLADE v2 achieves the best score on most tasks. The benchmark tasks nlp_journal_abs_intro and nlp_journal_title_intro contain documents longer than 512 tokens, so models with shorter maximum input lengths score lower across the board.
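For readers unfamiliar with nDCG@10, here is a minimal sketch of the metric. It assumes linear gain (the relevance label itself is the gain), which for binary labels is identical to the exponential-gain variant; the benchmark's own implementation may differ in details. The function names and the example ranking are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: lower ranks are discounted logarithmically."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# One relevant document exists, and the system returned it at rank 3.
ranked = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(ndcg_at_k(ranked, all_relevances=[1]))  # 0.5
```

A score of 1.0 means every relevant document was ranked as high as possible within the top 10.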

In practical use cases such as retrieval for RAG, documents are often split into smaller chunks, so depending on the use case, handling only up to 512 tokens may not be a problem.

The JMTEB retrieval datasets are roughly:

JaGovFaqs_22k
- QA dataset based on Japanese government agency FAQs
- Queries: 3,420
- Documents: 22,794
- Mostly 512 tokens or fewer
Mr. TyDi
- Retrieval benchmark of manually created questions and related Wikipedia passages
- Queries: 720
- Documents: 7,000,027
- Mostly 512 tokens or fewer
JAQKET
- Dataset from the AI-Ou quiz competition, containing quiz questions and Wikipedia articles with the answers
- Queries: 997
- Documents: 114,229
- Mostly 512 tokens or fewer
NLP Journal
- Dataset built from the Japanese NLP Journal LaTeX Corpus, combining titles, abstracts, and introductions
- Many introductions exceed 512 tokens

Japanese SPLADE v2 did not use the train, dev, or test data from Mr. TyDi, MIRACL, JAQKET, or JQaRA as training sources. Using those as training data can improve performance on that domain, but I avoided doing so in order to measure generalization.

Model Size and Dimensions

The model parameter counts and output dimensions are shown in the original table. Parameter counts are roughly computed from layer weights. Larger models usually cost more for training and inference. Larger document output dimensions also require more memory and storage.

Because the output dimensionality of SPLADE (the number of non-zero elements) depends on the text, I included rough numbers for JMTEB queries and documents.

License

Japanese SPLADE v2 has no special usage restrictions and is released under the MIT license. You can use it freely.

Using It from Code

Sample code is available at huggingface.co/hotchpotch/japanese-splade-v2.

FAQ

Can sparse vector search be used in production?

Yes. Classic search technologies such as TF-IDF and BM25 are sparse vector search methods, and many search systems, including Elasticsearch, Vespa, and Qdrant, support sparse vector search and hybrid search combining dense and sparse vectors.

Is SPLADE better than dense vector models?

On benchmarks it can be better, but it depends on the use case. Even in relatively simple search systems, such as finding a corresponding document from a natural language question, the best method depends on what kinds of questions and documents you expect and what requirements you need to satisfy. Simple BM25 may be best in some cases.

Dense vector models and SPLADE often return results with different characteristics, so hybrid search that combines both is also recommended.
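One simple way to combine the two result lists is reciprocal rank fusion (RRF), which needs only the ranks and not the raw scores, so the dense and sparse score scales do not have to be comparable. This is a sketch of RRF as a commonly used option, not a description of any particular search engine's implementation; the document IDs are made up and k=60 is the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); appearing in both lists adds up.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense_results = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense ranking
sparse_results = ["doc_b", "doc_d", "doc_a"]  # hypothetical SPLADE ranking

fused = reciprocal_rank_fusion([dense_results, sparse_results])
print(fused)
```

Documents found by both retrievers rise to the top, while documents found by only one are still kept further down the list.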

For hybrid search, another useful approach is to train either the dense or the sparse model on domain-specific data while keeping the other model more general. YAST, the Trainer implementation used for Japanese SPLADE v2, is published along with the training data and settings. By creating query-document training data from your own domain and adding it to the training data, you may be able to improve retrieval accuracy significantly. Recently it has also become easier to create synthetic supervised data from raw text with LLMs, which expands the ways your data can be used.

Closing

SPLADE uses context-aware word expansion to cover some weaknesses of keyword-based methods such as BM25, and it is gaining attention as one practical neural search option.

Japanese SPLADE v2, trained carefully on Japanese data, is likely one of the strongest models currently available for natural language question tasks such as Mr. TyDi. It is also a high-performing, well-balanced retrieval model that should be practical in production.

I hope this model and article are useful to people working on AI development, natural language processing, and information retrieval.