Evaluating the Japanese Performance of Qwen3 Embedding with JMTEB

created 2025-06-11

Qwen3 Embedding, a series of open-weight, high-performance multilingual embedding and reranker models, has been released. It comes in 8B, 4B, and 0.6B sizes, performs well on both text embedding and reranking, and at the time of writing ranks at the top of the Multilingual MTEB leaderboard.

However, multilingual models often do not place much emphasis on Japanese, so I evaluated Qwen3-Embedding-0.6B with JMTEB (Japanese Massive Text Embedding Benchmark). The jsick and jsts tasks raised errors, so the STS category is excluded for this model.

JMTEB Results

Model	Retrieval	STS	Classification	Reranking	Clustering	PairClassification
Qwen3-Embedding-0.6B	72.81	--	66.09	93.10	48.84	62.42
ruri-v3-310m	81.89	81.22	78.66	93.43	55.69	62.60
ruri-v3-130m	81.89	79.25	77.16	93.31	55.36	62.26
ruri-v3-70m	79.96	79.82	76.97	93.27	52.70	61.75
PLaMo-Embedding-1B	79.94	83.14	77.20	93.57	53.47	62.37
ruri-v3-30m	78.08	82.48	74.80	93.00	52.12	62.40
sbintuitions/sarashina-embedding-v1-1b	77.61	82.71	78.37	93.74	53.86	62.00
jinaai/jina-embeddings-v3	75.22	80.05	76.39	92.71	51.46	62.37
OpenAI/text-embedding-3-large	74.48	82.52	77.58	93.58	53.32	62.35
pkshatech/GLuCoSE-base-ja-v2	73.36	82.96	74.21	93.01	48.65	62.37
pkshatech/RoSEtta-base-ja	73.21	81.39	72.41	92.69	53.23	61.74
intfloat/multilingual-e5-large	70.98	79.70	72.89	92.96	51.24	62.15
OpenAI/text-embedding-3-small	66.39	79.46	73.06	92.92	51.06	62.27

These are the results. The Japanese scores are not strong, perhaps because the model was not trained heavily on Japanese tasks. The ruri-v3 models are smaller and clearly stronger for Japanese, especially on Retrieval and Classification.
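As a rough way to read the table without the missing STS column, the category scores can be averaged over the five categories that every model has. This is only an unweighted mean of the category scores from the table, computed for illustration; it is not the official JMTEB average, and the names `SCORES` and `mean_without_sts` are my own.

```python
# Category scores copied from the table above, with STS left out:
# Retrieval, Classification, Reranking, Clustering, PairClassification.
SCORES = {
    "Qwen3-Embedding-0.6B": [72.81, 66.09, 93.10, 48.84, 62.42],
    "ruri-v3-310m": [81.89, 78.66, 93.43, 55.69, 62.60],
    "ruri-v3-30m": [78.08, 74.80, 93.00, 52.12, 62.40],
}


def mean_without_sts(scores):
    """Unweighted mean over the five non-STS category scores."""
    return sum(scores) / len(scores)


# Print models from highest to lowest mean.
ranked = sorted(SCORES.items(), key=lambda kv: mean_without_sts(kv[1]), reverse=True)
for model, scores in ranked:
    print(f"{model}\t{mean_without_sts(scores):.2f}")
```

On this rough measure, even ruri-v3-30m comes out ahead of Qwen3-Embedding-0.6B.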

For the Retrieval and Reranking tasks, I prepended the following prefix to each query (\n denotes a newline): Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:
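A minimal sketch of how this prefix is applied. The function names `build_query` and `build_document` are hypothetical, and appending the query directly after "Query:" with no extra space is my assumption based on the prefix as written above; the actual JMTEB settings are in the gist linked below.

```python
# Task description taken from the prefix used in this article.
TASK = "Given a web search query, retrieve relevant passages that answer the query"


def build_query(query: str, task: str = TASK) -> str:
    """Prepend the instruction prefix to a query (assumed: no space after 'Query:')."""
    return f"Instruct: {task}\nQuery:{query}"


def build_document(text: str) -> str:
    """Documents get no prefix and are passed through unchanged."""
    return text


print(build_query("What is the capital of Japan?"))
```

Only the query side carries the instruction; documents are embedded as-is.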

The JMTEB configuration, summary JSON, and execution commands used for this measurement are available in the gist below. The Qwen3-Embedding-0.6B scores seem low to me, so if you spot a mistake in my setup, please let me know.

https://gist.github.com/hotchpotch/f6be186010e70d6eb6e46447cea258f9

Extra: Reading the Qwen3 Embedding Paper

The paper "Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models" has been published, so I skimmed it. I found the synthetic dataset creation process especially interesting.

These are notes from the parts that interested me:

Unlike LLM2Vec, it does not convert the decoder into an encoder; it keeps causal attention as-is.
The embedding model obtains the final embedding from the hidden state of the final layer's [EOS] token.
- Queries are built as Instruction + Query. Documents are used as-is.
- Training uses an improved InfoNCE loss rather than plain contrastive learning: it includes multiple hard negatives and mitigates false negatives by comparing the positive and negative similarities.
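The two ideas above, last-token pooling and an InfoNCE-style loss with hard negatives, can be sketched with NumPy as follows. This is a simplified illustration, not the paper's exact loss: the function names, the `temperature` default, and the `fn_margin` rule for dropping suspected false negatives are my assumptions, and right-padded inputs are assumed for pooling.

```python
import numpy as np


def last_token_pool(hidden, attention_mask):
    """Pick the hidden state of the last non-padding token (assumes right padding).

    hidden: (batch, seq, dim), attention_mask: (batch, seq) with 1 for real tokens.
    """
    last_idx = attention_mask.sum(axis=1) - 1
    return hidden[np.arange(hidden.shape[0]), last_idx]


def info_nce(query, positive, negatives, temperature=0.05, fn_margin=None):
    """InfoNCE for one query with one positive and several hard negatives.

    query, positive: (dim,), negatives: (n, dim); all assumed L2-normalized.
    If fn_margin is set, negatives scoring above the positive by more than
    the margin are treated as likely false negatives and left out.
    """
    pos_sim = float(query @ positive)
    neg_sim = negatives @ query
    keep = np.ones(neg_sim.shape, dtype=bool)
    if fn_margin is not None:
        keep = neg_sim <= pos_sim + fn_margin
    logits = np.concatenate(([pos_sim], neg_sim[keep])) / temperature
    logits = logits - logits.max()  # for numerical stability
    # Negative log-softmax probability of the positive.
    return float(-(logits[0] - np.log(np.exp(logits).sum())))


query = np.array([1.0, 0.0])
positive = np.array([0.6, 0.8])
negatives = np.array([[0.8, 0.6]])  # scores higher than the positive
print(info_nce(query, positive, negatives))
print(info_nce(query, positive, negatives, fn_margin=0.1))
```

In the usage example, the single negative is more similar to the query than the positive is, so masking it lowers the loss instead of pushing a probably relevant document away.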
Reranking uses the chat template directly and computes a relevance score from the probabilities of the "yes" and "no" tokens.
- It applies the usual decoder-model label-classification approach, looking at the probability of the target label token.
- It can be trained with SFT.
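The relevance score described above can be sketched as a two-way softmax over the logits of the "yes" and "no" tokens at the answer position. The function name `relevance_score` and the example logit values are hypothetical; in practice the two logits would come from the reranker's next-token output.

```python
import math


def relevance_score(yes_logit: float, no_logit: float) -> float:
    """Probability of "yes" under a softmax restricted to the "yes"/"no" tokens."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


# Rank candidate documents by score (logit pairs are made-up examples).
candidates = {"doc_a": (3.2, -1.0), "doc_b": (0.1, 0.4), "doc_c": (-2.0, 2.5)}
ranking = sorted(candidates, key=lambda d: relevance_score(*candidates[d]), reverse=True)
print(ranking)
```

This is equivalent to a sigmoid of the difference between the two logits, so only the gap between "yes" and "no" matters.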
In the first stage, training uses a synthetic dataset created with Qwen3-32B.
- It creates four types: information retrieval, bitext mining, semantic similarity, and classification.
- For the information retrieval synthetic dataset, it creates detailed settings and generates queries from documents in the Qwen3 pretraining corpus.
In the second stage, training uses 7 million records from existing datasets such as MS MARCO and MIRACL, plus 12 million first-stage synthetic records selected by a cosine-similarity filter.
Finally, it uses model merging with diversity in mind.
- The details are not written, so this is an inference, but multiple second-stage checkpoints could include task-specialized checkpoints or checkpoints focused on particular languages.
- If you have many checkpoints, model merging seems worth trying. Even with limited compute, you can often observe benchmark improvements by merging checkpoints and evaluating the result.
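As a starting point for experimenting with checkpoint merging, here is a minimal sketch of plain weighted parameter averaging. This is a generic baseline, not the merging method used in the paper (whose details I could not confirm); the function name `average_checkpoints` is hypothetical, and checkpoints are represented as simple dicts of NumPy arrays with identical keys and shapes.

```python
import numpy as np


def average_checkpoints(checkpoints, weights=None):
    """Merge checkpoints by (weighted) averaging of each parameter tensor.

    checkpoints: list of dicts mapping parameter name -> np.ndarray.
    weights: optional list of floats, one per checkpoint; defaults to uniform.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("weights and checkpoints must have the same length")
    names = set(checkpoints[0])
    if any(set(ckpt) != names for ckpt in checkpoints):
        raise ValueError("all checkpoints must have the same parameter names")
    return {
        name: sum(w * ckpt[name] for w, ckpt in zip(weights, checkpoints))
        for name in checkpoints[0]
    }


ckpt_a = {"layer.weight": np.array([1.0, 3.0])}
ckpt_b = {"layer.weight": np.array([3.0, 5.0])}
merged = average_checkpoints([ckpt_a, ckpt_b])
print(merged["layer.weight"])
```

Each merged candidate would then be evaluated on the benchmark, and the best-scoring merge kept.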