Releasing a Japanese StaticEmbedding Model for Practical 100x Faster Text Embeddings

created 2025-01-21

Dense text vectors can be used for many tasks, including information retrieval, text classification, and similar-text extraction. However, even small recent Transformer models can be slow, especially on CPU, and that often makes them impractical.

The recently released StaticEmbedding model, which is not a Transformer, offers a different approach. In benchmark comparisons with intfloat/multilingual-e5-small (mE5-small), it reached about 85% of mE5-small's score while creating sentence vectors 126 times faster on CPU. That speed is impressive.

I therefore trained and released a Japanese and English model, static-embedding-japanese.

https://huggingface.co/hotchpotch/static-embedding-japanese

The JMTEB results for Japanese text embeddings are below. The overall score is slightly below mE5-small, but it wins on some tasks and is sometimes stronger than other Japanese base-size BERT models. Before training it, I was not sure a model this simple would really perform this well, so the result was surprising.

Model	Avg(micro)	Retrieval	STS	Classification	Reranking	Clustering	PairClassification
text-embedding-3-small	69.18	66.39	79.46	73.06	92.92	51.06	62.27
multilingual-e5-small	67.71	67.27	80.07	67.62	93.03	46.91	62.19
static-embedding-japanese	67.17	67.92	80.16	67.96	91.87	40.39	62.37

Technical notes on training the Japanese StaticEmbedding model are in the latter half of this article.

Usage

Usage is simple. You can create sentence vectors with SentenceTransformer as usual. This example runs on CPU without a GPU. I tested with SentenceTransformer 3.3.1.

pip install "sentence-transformers>=3.3.1"

from sentence_transformers import SentenceTransformer

model_name = "hotchpotch/static-embedding-japanese"
model = SentenceTransformer(model_name, device="cpu")

query = "美味しいラーメン屋に行きたい"
docs = [
    "素敵なカフェが近所にあるよ。落ち着いた雰囲気でゆっくりできるし、窓際の席からは公園の景色も見えるんだ。",
    "新鮮な魚介を提供する店です。地元の漁師から直接仕入れているので鮮度は抜群ですし、料理人の腕も確かです。",
    "あそこは行きにくいけど、隠れた豚骨の名店だよ。スープが最高だし、麺の硬さも好み。",
    "おすすめの中華そばの店を教えてあげる。とりわけチャーシューが手作りで柔らかくてジューシーなんだ。",
]

embeddings = model.encode([query] + docs)
print(embeddings.shape)
similarities = model.similarity(embeddings[0], embeddings[1:])
for i, similarity in enumerate(similarities[0].tolist()):
    print(f"{similarity:.04f}: {docs[i]}")

(5, 1024)
0.1040: 素敵なカフェが近所にあるよ。落ち着いた雰囲気でゆっくりできるし、窓際の席からは公園の景色も見えるんだ。
0.2521: 新鮮な魚介を提供する店です。地元の漁師から直接仕入れているので鮮度は抜群ですし、料理人の腕も確かです。
0.4835: あそこは行きにくいけど、隠れた豚骨の名店だよ。スープが最高だし、麺の硬さも好み。
0.3199: おすすめの中華そばの店を教えてあげる。とりわけチャーシューが手作りで柔らかくてジューシーなんだ。

The documents that match the query get higher scores. BM25 would struggle with this example because the query's key term, ラーメン (ramen), does not appear in any of the documents.

Here is an example of a similar-sentence task:

sentences = [
    "明日の午後から雨が降るみたいです。",
    "来週の日曜日は天気が良いそうだ。",
    "あしたの昼過ぎから傘が必要になりそう。",
    "週末は晴れるという予報が出ています。",
]

embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)

print(similarities)

# Show similarity between the first sentence and the others.
for i, similarity in enumerate(similarities[0].tolist()):
    print(f"{similarity:.04f}: {sentences[i]}")

tensor([[1.0000, 0.2814, 0.3620, 0.2818],
        [0.2814, 1.0000, 0.2007, 0.5372],
        [0.3620, 0.2007, 1.0000, 0.1299],
        [0.2818, 0.5372, 0.1299, 1.0000]])
1.0000: 明日の午後から雨が降るみたいです。
0.2814: 来週の日曜日は天気が良いそうだ。
0.3620: あしたの昼過ぎから傘が必要になりそう。
0.2818: 週末は晴れるという予報が出ています。

The similar sentence receives a higher score here as well.

Anyone who has created sentence vectors with a Transformer model on CPU knows that it can take a long time even for a small amount of text. With StaticEmbedding, it should finish almost instantly on a reasonably fast CPU.

Reducing Output Dimensions

The default sentence vector has 1024 dimensions, but you can reduce it further. For example, here is truncate_dim=128.

# truncate_dim can be 32, 64, 128, 256, 512, or 1024.
model = SentenceTransformer(model_name, device="cpu", truncate_dim=128)

This produces 128-dimensional vectors. The score changes slightly because reducing dimensions lowers performance a little. On the other hand, reducing from 1024 to 128 dimensions reduces storage size and makes similarity computation about 8 times cheaper, so lower dimensions can be preferable depending on the use case.
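As a rough illustration of the storage saving, the sketch below estimates the raw size of a dense index at both dimensionalities. The corpus size of one million documents and float32 storage are assumptions chosen only to make the arithmetic concrete.

```python
BYTES_PER_FLOAT32 = 4


def index_size_bytes(num_docs: int, dim: int) -> int:
    """Raw size of a dense float32 index (no compression, no metadata)."""
    return num_docs * dim * BYTES_PER_FLOAT32


num_docs = 1_000_000  # hypothetical corpus size
full = index_size_bytes(num_docs, 1024)
small = index_size_bytes(num_docs, 128)

print(f"1024 dims: {full / 1e9:.2f} GB")
print(f" 128 dims: {small / 1e9:.2f} GB")
print(f"ratio: {full // small}x")
```

A dot product over 128 components also touches one eighth as many numbers as one over 1024, which is where the similarity-computation saving comes from.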

Why Is CPU Inference Fast?

StaticEmbedding is not a Transformer model, so it has none of the attention computation at the core of "Attention Is All You Need." It keeps a table with one 1024-dimensional vector per token and creates a sentence vector by averaging the vectors of the tokens that appear in the sentence. Because there is no attention, it cannot take context or word order into account the way a Transformer does.
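The averaging step can be sketched in a few lines of plain Python. The table below is a toy with made-up 4-dimensional values and English tokens, not the real model's vocabulary or weights. It shows that two sentences with the same tokens in a different order get the same vector.

```python
from statistics import fmean

# Toy table with made-up values; the real model stores one
# 1024-dimensional vector per tokenizer vocabulary entry.
EMBEDDING_TABLE = {
    "dog": [0.9, 0.1, 0.0, 0.2],
    "bites": [0.1, 0.8, 0.3, 0.0],
    "man": [0.7, 0.0, 0.1, 0.6],
}


def embed(tokens: list[str]) -> list[float]:
    """Average the token vectors dimension by dimension."""
    vectors = [EMBEDDING_TABLE[token] for token in tokens]
    return [fmean(column) for column in zip(*vectors)]


a = embed(["dog", "bites", "man"])
b = embed(["man", "bites", "dog"])

print(a)
# The two sentences mean different things but share one vector.
print(all(abs(x - y) < 1e-12 for x, y in zip(a, b)))
```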

Internally, it uses PyTorch's nn.EmbeddingBag, passing concatenated tokens and offsets so that PyTorch can use optimized CPU parallel processing and memory access.
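To make the "concatenated tokens and offsets" layout concrete, here is a pure-Python stand-in for a mean-mode embedding bag. It only imitates the input layout and the pooling result on a toy table. It is not PyTorch's implementation, and the function name is made up for this sketch.

```python
def embedding_bag_mean(table, flat_ids, offsets):
    """Mean-pool rows of `table` for each sentence.

    flat_ids: token ids of all sentences concatenated into one list.
    offsets:  start index of each sentence inside flat_ids.
    """
    ends = offsets[1:] + [len(flat_ids)]
    pooled = []
    for start, end in zip(offsets, ends):
        rows = [table[token_id] for token_id in flat_ids[start:end]]
        pooled.append([sum(column) / len(rows) for column in zip(*rows)])
    return pooled


# Toy table: 5 token ids, 2 dimensions (values are made up).
table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]

sentences = [[1, 2], [3], [1, 3, 4]]  # sentences already converted to token ids
flat_ids = [token_id for sentence in sentences for token_id in sentence]
offsets = [0, 2, 3]  # where each sentence starts in flat_ids

result = embedding_bag_mean(table, flat_ids, offsets)
print(result)
```

Because every sentence lives in one flat list, no padding is needed and sentences of very different lengths can share a batch without wasted work.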

According to the speed evaluation in the original article, it is 126 times faster than mE5-small on CPU.

Evaluation

All JMTEB results are in this JSON file. Comparing them with other models on the JMTEB leaderboard shows the relative differences. Considering the model size, the overall JMTEB result is very good. The Mr. TyDi task in JMTEB requires vectorizing 7 million documents, which usually takes around 1 to 4 hours on an RTX 4090 depending on the model. StaticEmbedding finished it in about 4 minutes on the same GPU.

Can It Replace BM25 for Retrieval?

Looking at the Retrieval results, StaticEmbedding performs very poorly on Mr. TyDi. With 7 million documents, Mr. TyDi has far more documents than the other tasks, so the model may do poorly whenever it has to search a very large collection. Because it simply averages token vectors without considering context, the more documents there are, the more likely it is that unrelated documents end up with similar averaged vectors.

For large document collections, it may perform much worse than BM25. For smaller collections where exact keyword matches are rare, it may often perform better than BM25.

The JAQKET retrieval score is unusually good compared with other models. This may be because the training data included the JQaRA dev and unused splits, which contain JAQKET-style questions, but the score still feels high. I do not think the test data leaked, but I am not fully sure why the score is this good.

Clustering Is Weak

I have not investigated this in detail, but the clustering score is clearly worse than other models. Classification is not bad, so this is somewhat surprising. It may be related to the embedding space being created with Matryoshka Representation Learning.

JQaRA and JaCWIR Reranking Evaluation

JQaRA:

model_names	ndcg@10	mrr@10
static-embedding-japanese	0.4704	0.6814
bm25	0.458	0.702
multilingual-e5-small	0.4917	0.7291

JaCWIR:

model_names	map@10	hits@10
static-embedding-japanese	0.7642	0.9266
bm25	0.8408	0.9528
multilingual-e5-small	0.869	0.97

On JQaRA, it is slightly better than BM25 on nDCG@10 but slightly worse on MRR@10, and it trails mE5-small on both metrics. On JaCWIR, it is clearly lower than both BM25 and mE5-small.
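The two JQaRA columns measure different things, so a model can lead on one and trail on the other. The sketch below uses the textbook definitions with binary relevance labels and a log2 discount. Whether the JQaRA evaluation script uses exactly these formulas is an assumption, and the two rankings are made up.

```python
import math


def mrr_at_k(relevance: list[int], k: int = 10) -> float:
    """Reciprocal rank of the first relevant item within the top k."""
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(relevance: list[int], k: int = 10) -> float:
    """DCG of the given order divided by the DCG of the ideal order."""

    def dcg(rels: list[int]) -> float:
        return sum(r / math.log2(rank + 1) for rank, r in enumerate(rels[:k], start=1))

    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0


# 1 = relevant, 0 = not relevant, listed in ranked order (12 candidates).
ranking_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # one hit at rank 1, two outside the top 10
ranking_b = [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # all three hits at ranks 2 to 4

for name, ranking in (("a", ranking_a), ("b", ranking_b)):
    print(name, f"mrr@10={mrr_at_k(ranking):.4f}", f"ndcg@10={ndcg_at_k(ranking):.4f}")
```

Ranking a wins on MRR@10 because its first hit is at rank 1, while ranking b wins on nDCG@10 because it places all relevant items near the top.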

JaCWIR asks the model to find web article titles and summaries from queries, and those texts are often not clean. Transformer models are robust to noise, so it makes sense that a simple token-average StaticEmbedding model falls behind. BM25 matches distinctive words, so noisy words in documents often do not match the query in the first place, which helps it remain competitive with Transformer models on JaCWIR.

This suggests StaticEmbedding may score poorly compared with Transformer models or BM25 when texts contain a lot of noise.

Reducing Output Dimensions

The model created here outputs 1024 dimensions. Higher dimensionality increases computation cost for downstream tasks such as clustering and retrieval. Because the model is trained with Matryoshka Representation Learning (MRL), however, the 1024-dimensional vector can be easily truncated to smaller dimensions.

MRL encourages earlier dimensions to hold more important information, so using only the first 32, 64, 128, or 256 dimensions can still produce reasonable results.
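In practice, using the first k dimensions just means slicing the vector before computing similarity. The sketch below computes cosine similarity on a prefix of two made-up 8-dimensional vectors. The values are illustrative only and were not produced by the model.

```python
import math


def cosine_on_prefix(u: list[float], v: list[float], dim: int) -> float:
    """Cosine similarity using only the first `dim` components."""
    a, b = u[:dim], v[:dim]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for two sentence embeddings.
u = [0.9, -0.5, 0.4, 0.3, 0.10, -0.05, 0.02, 0.01]
v = [0.8, -0.6, 0.5, 0.1, -0.08, 0.06, 0.01, -0.02]

sims = {dim: cosine_on_prefix(u, v, dim) for dim in (8, 4, 2)}
for dim, sim in sims.items():
    print(f"first {dim} dims: {sim:.4f}")
```

The norms are recomputed on the truncated vectors, since a prefix of a unit-length vector is generally no longer unit length.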

According to the StaticEmbedding article, the model retains 91.87% performance at 128 dimensions, 95.79% at 256 dimensions, and 98.53% at 512 dimensions. This is useful when accuracy requirements are not too strict and downstream computation should be reduced.

Dimension Reduction Results for static-embedding-japanese

JMTEB can pass truncate_dim, making it easy to benchmark dimension-reduced outputs.

Dimensions	Avg(micro)	Score ratio (%)	Retrieval	STS	Classification	Reranking	Clustering	PairClassification
1024	67.17	100.00	67.92	80.16	67.96	91.87	40.39	62.37
512	66.57	99.10	67.63	80.11	65.66	91.54	41.25	62.37
256	65.94	98.17	66.99	79.93	63.53	91.73	42.55	62.37
128	64.25	95.65	64.87	79.56	60.52	91.62	41.81	62.33
64	61.79	91.98	61.15	78.34	58.23	91.50	39.11	62.35
32	57.93	86.24	53.35	76.51	55.95	91.15	38.20	62.37

I had previously measured the 512-dimensional score incorrectly and corrected it. Matryoshka Representation Learning appears to work: reducing dimensions causes a small score drop, but the reduced dimensions should lower downstream cost.

Interestingly, clustering improves over 1024 dimensions even when reduced to 128 dimensions. Normally, keeping more information should help, so this is unexpected. It may mean that, for clustering, using only the earlier dimensions that capture more global features works better than using later dimensions, depending on the clustering algorithm.

For this model, 512, 256, and 128 dimensions seem like reasonable tradeoffs between performance and dimensionality reduction.

Impressions After Building a StaticEmbedding Model

I was honestly skeptical that a simple average of token embeddings could perform this well, but after training it, I was surprised by the performance of such a simple architecture. In an era dominated by Transformers, it is interesting to see a practical model based on a more traditional word-embedding style approach.

A fast CPU sentence embedding model should be useful for converting large amounts of text locally, for running on edge devices, and for environments with slow networks where calling a remote inference server is difficult.

Technical Notes on Training the Japanese StaticEmbedding Model

Why Training Works

StaticEmbedding is very simple. It tokenizes a sentence, looks up an N-dimensional embedding for each token in an EmbeddingBag table (N is 1024 in this model), and averages them.

Traditional word embeddings such as word2vec and GloVe learn from word context with Skip-gram or CBOW. StaticEmbedding instead trains with entire sentences. It uses contrastive learning with large batches over many kinds of text, which can learn useful word embeddings.

Contrastive learning treats every other example in the batch as a negative. With batch size 2048, each of the 2048 examples compares its one positive against 2047 negatives, which comes to about 4 million comparisons per batch. This lets the model update the weights of the token embedding space appropriately.
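The comparison count and the loss itself can be sketched as follows. This is a minimal in-batch-negatives cross-entropy over a similarity matrix, written in plain Python for readability. The scale factor of 20 and the toy similarity values are assumptions for the sketch and are not taken from the training script.

```python
import math


def in_batch_contrastive_loss(sim: list[list[float]], scale: float = 20.0) -> float:
    """Mean cross-entropy where row i's correct column is i.

    sim[i][j] is the similarity between anchor i and candidate j;
    the diagonal holds the positives, everything else is a negative.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(sim)


batch_size = 2048
negative_pairs = batch_size * (batch_size - 1)
print(negative_pairs)  # 4192256

# Toy 3x3 similarity matrices: positives on the diagonal.
well_aligned = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.7]]
misaligned = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.1, 0.7, 0.0]]

loss_good = in_batch_contrastive_loss(well_aligned)
loss_bad = in_batch_contrastive_loss(misaligned)
print(f"{loss_good:.4f} < {loss_bad:.4f}")
```

The loss drops when each anchor is most similar to its own positive, so gradient updates pull positives together and push the in-batch negatives apart.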

Training Datasets

For the Japanese model, I created and used datasets suitable for contrastive learning:

hotchpotch/sentence_transformer_japanese
- This is arranged with column names and structures easy to use with SentenceTransformer training, such as (anchor, positive), (anchor, positive, negative), and (anchor, positive, negative_1, ..., negative_n).
- It is based on datasets including hpprc/emb, hpprc/llmjp-kaken, hpprc/msmarco-ja, hpprc/mqa-ja, and hpprc/llmjp-warp-html. For hpprc/emb and msmarco-ja, I filtered positives and negatives with reranker scores, using positive(>=0.7) and negative(<=0.3).
- I used many subsets from the constructed dataset, with augmentation to increase the amount of information retrieval-oriented data.
For English data, I used datasets such as sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1, sentence-transformers/squad, sentence-transformers/all-nli, sentence-transformers/trivia-qa, nthakur/swim-ir-monolingual, sentence-transformers/miracl, and sentence-transformers/mr-tydi.

As always, I am grateful to the dataset authors, especially hpprc.

Japanese Tokenizer

To train StaticEmbedding, it seemed easiest to use a tokenizer that can be processed in Hugging Face's tokenizer.json format, so I created hotchpotch/xlm-roberta-japanese-tokenizer, with a vocabulary size of 32,768.

This tokenizer was trained by segmenting Japanese Wikipedia data with UniDic and training SentencePiece unigram. I originally thought it also used sampled English Wikipedia and Japanese CC-100, but after checking the creation code, it used only Japanese Wikipedia. It also works as an XLM-Roberta-style Japanese tokenizer. I used this tokenizer for the model.

Hyperparameters

Notes and changes from the original training code:

Batch size was changed from 2048 to 6072.
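- In large-batch contrastive learning, every other example in the batch is treated as a negative, so a batch that contains the same text twice can end up using a true positive as a negative, which hurts training. BatchSamplers.NO_DUPLICATES avoids this, but sampling can become slow with huge batches.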
- I used BatchSamplers.NO_DUPLICATES and set the batch size to 6072, which fit in 24 GB on an RTX 4090. Larger batches may produce better results.
Epochs were changed from 1 to 2.
- 2 epochs performed better than 1, though with a larger dataset, 1 might be better.
Scheduler:
- Changed from the default linear scheduler to cosine, which has often worked better in my experience.
Optimizer:
- Kept the default AdamW. Switching to Adafactor made convergence worse.
Learning rate:
- Kept 2e-1. I wondered whether it was too large, but lower values worsened results.
dataloader_prefetch_factor=4
dataloader_num_workers=15
- Tokenization and batch sampling take time, so I set both relatively high.
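To show why duplicate-free batching gets slower as batches grow, here is a simplified greedy version of the idea. It is an illustration only and not the sentence-transformers implementation of BatchSamplers.NO_DUPLICATES. The function name and sample pairs are made up.

```python
def no_duplicate_batches(samples: list[tuple[str, ...]], batch_size: int) -> list[list[tuple[str, ...]]]:
    """Greedy batching: a sample joins the current batch only if none of
    its texts already appears in that batch; otherwise it waits."""
    remaining = list(samples)
    batches = []
    while remaining:
        batch, seen, deferred = [], set(), []
        for sample in remaining:
            texts = set(sample)
            if len(batch) < batch_size and not (texts & seen):
                batch.append(sample)
                seen |= texts
            else:
                deferred.append(sample)
        batches.append(batch)
        remaining = deferred
    return batches


# (anchor, positive) pairs; "d1" and "q1" each occur twice.
pairs = [("q1", "d1"), ("q2", "d1"), ("q3", "d3"), ("q1", "d4")]
batches = no_duplicate_batches(pairs, batch_size=3)
for batch in batches:
    print(batch)
```

Every candidate has to be checked against all texts already in the batch, so the bookkeeping grows with batch size, which is consistent with the slowdown noted above.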

Training Resources

CPU: Ryzen 9 7950X
GPU: RTX 4090
Memory: 64 GB

With these resources, full-scratch training took about 4 hours. GPU core load was very low, often near 0%, unlike Transformer training where it stays around 90%. Most of the time appears to be spent transferring huge batches into GPU memory. Faster GPU memory bandwidth may improve training speed further.

Further Improvements

The tokenizer used here is not specialized for StaticEmbedding, so a more suitable tokenizer may improve performance. Larger batch sizes may also stabilize training and improve performance.

Using broader text resources, including various domains and synthetic datasets, may further improve performance.

Training Code

The training code is published under the MIT license. Running the script should reproduce the model.

https://huggingface.co/hotchpotch/static-embedding-japanese/blob/main/trainer.py

License

The model weights and training code for static-embedding-japanese are published under the MIT license.