Releasing a Japanese StaticEmbedding Model for Practical 100x Faster Text Embeddings
I released static-embedding-japanese, a Japanese and English StaticEmbedding model that can create practical text embeddings very quickly on CPU.
Dense text vectors can be used for many tasks, including information retrieval, text classification, and similar-text extraction. However, even small recent Transformer models can be slow, especially on CPU, and that often makes them impractical.
A recently released non-Transformer StaticEmbedding model offers a new approach. In benchmark comparisons with intfloat/multilingual-e5-small (mE5-small), it reached around 85% of mE5-small's score while creating sentence vectors 126 times faster on CPU. That speed is impressive.
I therefore trained and released a Japanese and English model, static-embedding-japanese.
The JMTEB results for Japanese text embeddings are below. The overall score is slightly below mE5-small, but it wins on some tasks and is sometimes stronger than other Japanese base-size BERT models. Before training it, I was not sure a model this simple would really perform this well, so the result was surprising.
| Model | Avg(micro) | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|
| text-embedding-3-small | 69.18 | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 |
| multilingual-e5-small | 67.71 | 67.27 | 80.07 | 67.62 | 93.03 | 46.91 | 62.19 |
| static-embedding-japanese | 67.17 | 67.92 | 80.16 | 67.96 | 91.87 | 40.39 | 62.37 |
Technical notes on training the Japanese StaticEmbedding model are in the latter half of this article.
Usage
Usage is simple. You can create sentence vectors with SentenceTransformer as usual. This example runs on CPU without a GPU. I tested with SentenceTransformer 3.3.1.
pip install "sentence-transformers>=3.3.1"
from sentence_transformers import SentenceTransformer
model_name = "hotchpotch/static-embedding-japanese"
model = SentenceTransformer(model_name, device="cpu")
query = "美味しいラーメン屋に行きたい"
docs = [
"素敵なカフェが近所にあるよ。落ち着いた雰囲気でゆっくりできるし、窓際の席からは公園の景色も見えるんだ。",
"新鮮な魚介を提供する店です。地元の漁師から直接仕入れているので鮮度は抜群ですし、料理人の腕も確かです。",
"あそこは行きにくいけど、隠れた豚骨の名店だよ。スープが最高だし、麺の硬さも好み。",
"おすすめの中華そばの店を教えてあげる。とりわけチャーシューが手作りで柔らかくてジューシーなんだ。",
]
embeddings = model.encode([query] + docs)
print(embeddings.shape)
similarities = model.similarity(embeddings[0], embeddings[1:])
for i, similarity in enumerate(similarities[0].tolist()):
    print(f"{similarity:.04f}: {docs[i]}")
(5, 1024)
0.1040: 素敵なカフェが近所にあるよ。落ち着いた雰囲気でゆっくりできるし、窓際の席からは公園の景色も見えるんだ。
0.2521: 新鮮な魚介を提供する店です。地元の漁師から直接仕入れているので鮮度は抜群ですし、料理人の腕も確かです。
0.4835: あそこは行きにくいけど、隠れた豚骨の名店だよ。スープが最高だし、麺の硬さも好み。
0.3199: おすすめの中華そばの店を教えてあげる。とりわけチャーシューが手作りで柔らかくてジューシーなんだ。
The two ramen-related documents get the highest scores. In this example, BM25 would have difficulty because the query's key word, "ramen" (ラーメン), does not appear in any of the documents.
Here is an example of a similar-sentence task:
sentences = [
"明日の午後から雨が降るみたいです。",
"来週の日曜日は天気が良いそうだ。",
"あしたの昼過ぎから傘が必要になりそう。",
"週末は晴れるという予報が出ています。",
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# Show similarity between the first sentence and the others.
for i, similarity in enumerate(similarities[0].tolist()):
    print(f"{similarity:.04f}: {sentences[i]}")
tensor([[1.0000, 0.2814, 0.3620, 0.2818],
[0.2814, 1.0000, 0.2007, 0.5372],
[0.3620, 0.2007, 1.0000, 0.1299],
[0.2818, 0.5372, 0.1299, 1.0000]])
1.0000: 明日の午後から雨が降るみたいです。
0.2814: 来週の日曜日は天気が良いそうだ。
0.3620: あしたの昼過ぎから傘が必要になりそう。
0.2818: 週末は晴れるという予報が出ています。
The similar sentence receives a higher score here as well.
Creating sentence vectors with a Transformer model on CPU often takes a long time, even for a small amount of text. With StaticEmbedding, it should finish almost instantly on a reasonably fast CPU.
Reducing Output Dimensions
The default sentence vector has 1024 dimensions, but you can reduce it further. For example, here is truncate_dim=128.
# truncate_dim can be 32, 64, 128, 256, 512, or 1024.
model = SentenceTransformer(model_name, device="cpu", truncate_dim=128)
This produces 128-dimensional vectors. The score changes slightly because reducing dimensions lowers performance a little. On the other hand, reducing from 1024 to 128 dimensions reduces storage size and makes similarity computation about 8 times cheaper, so lower dimensions can be preferable depending on the use case.
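As a rough illustration of what truncation means for storage and similarity, here is a minimal plain-Python sketch. The 8-dimensional vectors are made-up stand-ins for the model's 1024-dimensional output, the helper names are hypothetical, and the byte counts assume vectors are stored as float32.

```python
import math


def truncate(vector, dim):
    """Keep only the first `dim` components of a vector."""
    return vector[:dim]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 8-dimensional vectors standing in for real 1024-dimensional embeddings.
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.0, -0.1, 0.02]
doc = [0.8, 0.2, 0.1, -0.1, -0.3, 0.4, 0.1, -0.02]

full = cosine(query, doc)
short = cosine(truncate(query, 2), truncate(doc, 2))
print(f"full: {full:.4f}, truncated to 2 dims: {short:.4f}")

# Storage per vector, assuming float32 (4 bytes per dimension).
BYTES_PER_FLOAT32 = 4
for dim in (1024, 128):
    print(f"{dim} dims -> {dim * BYTES_PER_FLOAT32} bytes per vector")
```

The dot product and norms each loop over the vector once, so the similarity cost scales linearly with the number of dimensions, which is where the roughly 8x saving from 1024 to 128 dimensions comes from.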
Why Is CPU Inference Fast?
StaticEmbedding is not a Transformer model. It has no attention computation, the core of "Attention Is All You Need." It stores token embeddings in a 1024-dimensional table and creates a sentence vector by averaging the token vectors that appear in the sentence. Because there is no attention, it does not understand context in the same way a Transformer does.
Internally, it uses PyTorch's nn.EmbeddingBag, passing concatenated tokens and offsets so that PyTorch can use optimized CPU parallel processing and memory access.
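To make the "concatenated tokens and offsets" idea concrete, here is a plain-Python sketch of mean pooling over a flat token list. It only mimics what a mean-mode embedding bag computes and is not the PyTorch implementation; the function name, the tiny table, and the token ids are all made up for illustration.

```python
def embedding_bag_mean(table, token_ids, offsets):
    """Average the table rows of each sentence ("bag").

    token_ids holds every sentence back to back in one flat list;
    offsets[i] is the index in token_ids where sentence i starts.
    """
    dim = len(table[0])
    ends = offsets[1:] + [len(token_ids)]
    pooled = []
    for start, end in zip(offsets, ends):
        bag = token_ids[start:end]
        if not bag:
            pooled.append([0.0] * dim)  # empty sentence -> zero vector
            continue
        sums = [0.0] * dim
        for token_id in bag:
            for d, value in enumerate(table[token_id]):
                sums[d] += value
        pooled.append([s / len(bag) for s in sums])
    return pooled


# Toy 4-token vocabulary with 2-dimensional embeddings
# (the real model's table is 1024-dimensional).
table = [
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
    [4.0, 0.0],
]

# Two sentences: tokens [0, 1, 2] and tokens [3, 0], stored as one flat list.
token_ids = [0, 1, 2, 3, 0]
offsets = [0, 3]

sentence_vectors = embedding_bag_mean(table, token_ids, offsets)
print(sentence_vectors)  # [[1.0, 1.0], [2.5, 0.0]]
```

Because sentences of different lengths share one flat list, no padding is needed, and the whole computation is a table lookup plus an average per sentence.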

According to the speed evaluation in the original article, it is 126 times faster than mE5-small on CPU.
Evaluation
All JMTEB results are in this JSON file. Comparing with other models on the JMTEB leaderboard shows the relative difference. Considering the model size, the overall JMTEB result is very good. The Mr. TyDi task in JMTEB requires vectorizing 7 million documents and usually takes a long time, around 1 to 4 hours on an RTX 4090 depending on the model. StaticEmbedding processed it very quickly, finishing in about 4 minutes on an RTX 4090.
Can It Replace BM25 for Retrieval?
Looking at the Retrieval results, StaticEmbedding performs very poorly on Mr. TyDi. Mr. TyDi has far more documents than the other tasks, 7 million documents, so results may be poor for tasks that search over very large document collections. Since the model simply averages tokens without considering context, the more documents there are, the more likely similar averages may appear.
For large document collections, it may perform much worse than BM25. For smaller collections where exact keyword matches are rare, it may often perform better than BM25.
The JAQKET retrieval score is unusually good compared with other models. This may be because the model trained on JQaRA dev and unused data, which includes JAQKET-style questions, but the score still feels high. I do not think the test data leaked, but I am not fully sure why the score is this good.
Clustering Is Weak
I have not investigated this in detail, but the clustering score is clearly worse than other models. Classification is not bad, so this is somewhat surprising. It may be related to the embedding space being created with Matryoshka Representation Learning.
JQaRA and JaCWIR Reranking Evaluation
JQaRA:
| model_names | ndcg@10 | mrr@10 |
|---|---|---|
| static-embedding-japanese | 0.4704 | 0.6814 |
| bm25 | 0.458 | 0.702 |
| multilingual-e5-small | 0.4917 | 0.7291 |
JaCWIR:
| model_names | map@10 | hits@10 |
|---|---|---|
| static-embedding-japanese | 0.7642 | 0.9266 |
| bm25 | 0.8408 | 0.9528 |
| multilingual-e5-small | 0.869 | 0.97 |
On JQaRA it is slightly better than BM25 and slightly worse than mE5-small. On JaCWIR it is much lower than BM25 and mE5-small.
JaCWIR asks the model to find web article titles and summaries from queries, and those texts are often not clean. Transformer models are robust to noise, so it makes sense that a simple token-average StaticEmbedding model falls behind. BM25 matches distinctive words, so noisy words in documents often do not match the query in the first place, which helps it remain competitive with Transformer models on JaCWIR.
This suggests StaticEmbedding may score poorly compared with Transformer models or BM25 when texts contain a lot of noise.
Reducing Output Dimensions
The model created here outputs 1024 dimensions. Higher dimensionality increases computation cost for downstream tasks such as clustering and retrieval. Because the model is trained with Matryoshka Representation Learning (MRL), however, the 1024-dimensional vector can be easily truncated to smaller dimensions.
MRL encourages earlier dimensions to hold more important information, so using only the first 32, 64, 128, or 256 dimensions can still produce reasonable results.
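The idea can be sketched as summing a loss over several prefix lengths of the same vectors, so that every prefix is pushed to work on its own. The following plain-Python sketch is only an illustration: the toy `pair_loss` stands in for whatever contrastive loss a real training run uses, and all names and vectors here are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pair_loss(anchors, positives):
    """Toy base loss: 1 minus the mean cosine similarity of paired vectors."""
    sims = [cosine(a, p) for a, p in zip(anchors, positives)]
    return 1.0 - sum(sims) / len(sims)


def matryoshka_loss(anchors, positives, dims, base_loss=pair_loss):
    """Sum the base loss over each prefix length in `dims`."""
    total = 0.0
    for dim in dims:
        total += base_loss(
            [a[:dim] for a in anchors],
            [p[:dim] for p in positives],
        )
    return total


# Made-up 4-dimensional vectors: the pairs agree on the first 2 dimensions
# and disagree on the last 2.
anchors = [[0.9, 0.1, 0.3, -0.2], [0.1, 0.8, -0.4, 0.5]]
positives = [[0.8, 0.2, -0.3, 0.4], [0.2, 0.7, 0.5, -0.1]]

dims = (2, 4)
per_dim = {dim: matryoshka_loss(anchors, positives, (dim,)) for dim in dims}
total = matryoshka_loss(anchors, positives, dims)

for dim, value in per_dim.items():
    print(f"loss on first {dim} dims: {value:.4f}")
print(f"summed loss: {total:.4f}")
```

Since the short prefix contributes to the total on its own, training cannot ignore it, which is why truncated vectors remain usable.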

According to the StaticEmbedding article, the model retains 91.87% performance at 128 dimensions, 95.79% at 256 dimensions, and 98.53% at 512 dimensions. This is useful when accuracy requirements are not too strict and downstream computation should be reduced.
Dimension Reduction Results for static-embedding-japanese
JMTEB can pass truncate_dim, making it easy to benchmark dimension-reduced outputs.
| Dimensions | Avg(micro) | Score ratio (%) | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|---|
| 1024 | 67.17 | 100.00 | 67.92 | 80.16 | 67.96 | 91.87 | 40.39 | 62.37 |
| 512 | 66.57 | 99.10 | 67.63 | 80.11 | 65.66 | 91.54 | 41.25 | 62.37 |
| 256 | 65.94 | 98.17 | 66.99 | 79.93 | 63.53 | 91.73 | 42.55 | 62.37 |
| 128 | 64.25 | 95.65 | 64.87 | 79.56 | 60.52 | 91.62 | 41.81 | 62.33 |
| 64 | 61.79 | 91.98 | 61.15 | 78.34 | 58.23 | 91.50 | 39.11 | 62.35 |
| 32 | 57.93 | 86.24 | 53.35 | 76.51 | 55.95 | 91.15 | 38.20 | 62.37 |
I had previously measured the 512-dimensional score incorrectly and corrected it. Matryoshka Representation Learning appears to work: reducing dimensions causes a small score drop, but the reduced dimensions should lower downstream cost.
Interestingly, clustering improves over 1024 dimensions even when reduced to 128 dimensions. Normally, keeping more information should help, so this is unexpected. It may mean that, for clustering, using only the earlier dimensions that capture more global features works better than using later dimensions, depending on the clustering algorithm.
For this model, 512, 256, and 128 dimensions seem like reasonable tradeoffs between performance and dimensionality reduction.
Impressions After Building a StaticEmbedding Model
I was honestly skeptical that a simple average of token embeddings could perform this well, but after training it, I was surprised by the performance of such a simple architecture. In an era dominated by Transformers, it is interesting to see a practical model based on a more traditional word-embedding style approach.
A fast CPU sentence embedding model should be useful for converting large amounts of text locally, for edge devices, and for environments with slow networks where calling a remote inference server is difficult.
Technical Notes on Training the Japanese StaticEmbedding Model
Why Training Works
StaticEmbedding is very simple. It tokenizes a sentence, obtains N-dimensional word embeddings from an EmbeddingBag table, 1024 dimensions in this model, and averages them.
Traditional word embeddings such as word2vec and GloVe learn from word context with Skip-gram or CBOW. StaticEmbedding instead trains with entire sentences. It uses contrastive learning with large batches over many kinds of text, which can learn useful word embeddings.
In-batch contrastive learning treats every other example in the batch as a negative. With batch size 2048, each of the 2048 anchors is compared against its one positive and 2047 negatives, about 4 million anchor-negative comparisons per batch. This allows the model to update weights appropriately over the original word space.
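As a rough sketch of how in-batch negatives work, the plain-Python example below scores each anchor against every positive in the batch and applies cross-entropy with the matching index as the correct class. This is an illustration under assumptions, not the training code: the function name is hypothetical, the vectors are made up, and the scale of 20 is an arbitrary choice for this sketch.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities.

    For anchor i, positives[i] is the correct match and every other
    positives[j] in the batch acts as a negative.
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [scale * sum(x * y for x, y in zip(anchor, pos)) for pos in p]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(a)


# Made-up 2-dimensional vectors for a batch of 3 pairs.
anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.2]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.8, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # positives in the wrong slots

print(f"aligned pairs:  {in_batch_contrastive_loss(anchors, aligned):.4f}")
print(f"shuffled pairs: {in_batch_contrastive_loss(anchors, shuffled):.4f}")

# Number of anchor-negative comparisons in one batch of 2048.
batch_size = 2048
print(batch_size * (batch_size - 1))  # 4192256
```

The loss is low when each anchor is closest to its own positive and high when another item in the batch is closer, so larger batches give each anchor more negatives to separate from.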
Training Datasets
For the Japanese model, I created and used datasets suitable for contrastive learning:
- hotchpotch/sentence_transformer_japanese
  - This is arranged with column names and structures easy to use with SentenceTransformer training, such as (anchor, positive), (anchor, positive, negative), and (anchor, positive, negative_1, ..., negative_n).
  - It is based on datasets including hpprc/emb, hpprc/llmjp-kaken, hpprc/msmarco-ja, hpprc/mqa-ja, and hpprc/llmjp-warp-html. For hpprc/emb and msmarco-ja, I filtered positives and negatives with reranker scores, using positive (>=0.7) and negative (<=0.3).
  - I used many subsets from the constructed dataset, with augmentation to increase the amount of information retrieval-oriented data.
- For English data, I used datasets such as sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1, sentence-transformers/squad, sentence-transformers/all-nli, sentence-transformers/trivia-qa, nthakur/swim-ir-monolingual, sentence-transformers/miracl, and sentence-transformers/mr-tydi.
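The reranker-score filtering mentioned above can be sketched as follows. Only the two thresholds come from the description; the record layout, field names, and the choice to drop an example when it has no confident positive are assumptions made for this illustration.

```python
POSITIVE_MIN_SCORE = 0.7  # keep positives scored at or above this
NEGATIVE_MAX_SCORE = 0.3  # keep negatives scored at or below this


def filter_by_reranker_score(example):
    """Keep only confident positives and negatives.

    `example` is a hypothetical record whose "positives" and "negatives"
    are lists of (text, reranker_score) pairs. Returns None when no
    positive survives (an assumption of this sketch).
    """
    positives = [
        text for text, score in example["positives"]
        if score >= POSITIVE_MIN_SCORE
    ]
    negatives = [
        text for text, score in example["negatives"]
        if score <= NEGATIVE_MAX_SCORE
    ]
    if not positives:
        return None
    return {
        "anchor": example["anchor"],
        "positives": positives,
        "negatives": negatives,
    }


# Made-up example record.
example = {
    "anchor": "query text",
    "positives": [("relevant passage", 0.92), ("weak match", 0.55)],
    "negatives": [("unrelated", 0.05), ("borderline", 0.45), ("off topic", 0.30)],
}

filtered = filter_by_reranker_score(example)
print(filtered)
```

Candidates scored between the two thresholds are discarded from both sides, which removes ambiguous pairs that could act as noisy positives or false negatives.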
As always, I am grateful to the dataset authors, especially hpprc.
Japanese Tokenizer
To train StaticEmbedding, it seemed easiest to use a tokenizer that can be processed in Hugging Face's tokenizer.json format, so I created hotchpotch/xlm-roberta-japanese-tokenizer, with a vocabulary size of 32,768.
This tokenizer was trained by segmenting Japanese Wikipedia data with UniDic and training SentencePiece unigram. I originally thought it also used sampled English Wikipedia and Japanese CC-100, but after checking the creation code, it used only Japanese Wikipedia. It also works as an XLM-Roberta-style Japanese tokenizer. I used this tokenizer for the model.
Hyperparameters
Notes and changes from the original training code:
- Batch size was changed from 2048 to 6072.
  - In large-batch contrastive learning, having the same text appear as both a positive and a negative in one batch can hurt training. BatchSamplers.NO_DUPLICATES avoids this, but sampling can become slow with huge batches.
  - I used BatchSamplers.NO_DUPLICATES and set the batch size to 6072, which fit in 24 GB on an RTX 4090. Larger batches may produce better results.
- Epochs were changed from 1 to 2.
  - 2 epochs performed better than 1, though with a larger dataset, 1 might be better.
- Scheduler:
  - Changed from the default linear scheduler to cosine, which has often worked better in my experience.
- Optimizer:
  - Kept the default AdamW. Switching to Adafactor made convergence worse.
- Learning rate:
  - Kept 2e-1. I wondered whether it was too large, but lower values worsened results.
- dataloader_prefetch_factor=4, dataloader_num_workers=15
  - Tokenization and batch sampler sampling take time, so I set these relatively high.
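For reference, the settings above can be collected into one place as a plain dictionary. This is only a sketch: the key names follow the Hugging Face style training-argument names as I understand them (for example `per_device_train_batch_size` and `lr_scheduler_type`), and the string form of the batch sampler is an assumption, so check them against the installed sentence-transformers version before passing them to a trainer.

```python
# Hyperparameters from the notes above, gathered as keyword arguments.
# Key names are assumptions based on common training-argument naming.
training_kwargs = {
    "per_device_train_batch_size": 6072,  # changed from 2048
    "num_train_epochs": 2,                # changed from 1
    "learning_rate": 2e-1,                # kept from the original code
    "lr_scheduler_type": "cosine",        # changed from linear
    "batch_sampler": "no_duplicates",     # assumed string form of BatchSamplers.NO_DUPLICATES
    "dataloader_prefetch_factor": 4,
    "dataloader_num_workers": 15,
}

for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```

The optimizer is left out because the default AdamW was kept.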
Training Resources
- CPU: Ryzen 9 7950X
- GPU: RTX 4090
- Memory: 64 GB
With these resources, training from scratch took about 4 hours. GPU core load was very low, often near 0%, unlike Transformer training where it stays around 90%. Most of the time appears to be spent transferring huge batches into GPU memory. Faster GPU memory bandwidth may improve training speed further.
Further Improvements
The tokenizer used here is not specialized for StaticEmbedding, so a more suitable tokenizer may improve performance. Larger batch sizes may also stabilize training and improve performance.
Using broader text resources, including various domains and synthetic datasets, may further improve performance.
Training Code
The training code is published under the MIT license. Running the script should reproduce the model.
License
The model weights and training code for static-embedding-japanese are published under the MIT license.