Evaluating the Japanese Performance of Embedding Gemma 300M with JMTEB

created 2025-09-18

Google recently released the embedding model google/embeddinggemma-300m. It performs quite well on MTEB Multilingual v2, so I benchmarked it with JMTEB v1 to properly measure its Japanese performance.

In short, Embedding Gemma performed very poorly on Japanese in my measurements, scoring well below even the much smaller multilingual-e5-small.

JMTEB v1 Benchmark

Model	Params	Avg	Retrieval	STS	Classification	Reranking	Clustering	PairClass
google/embeddinggemma-300m	308M	58.10	42.18	73.36	63.23	91.55	45.87	62.42
intfloat/multilingual-e5-small	118M	69.52	67.27	80.07	67.62	93.03	46.91	62.19
intfloat/multilingual-e5-large	560M	71.65	70.98	79.70	72.89	92.96	51.24	62.15
cl-nagoya/ruri-v3-30m	37M	74.51	78.08	82.48	74.80	93.00	52.12	62.40
cl-nagoya/ruri-v3-310m	315M	77.24	81.89	81.22	78.66	93.43	55.69	62.60

Note: Avg is the micro average, that is, the simple mean over all 16 Japanese tasks in JMTEB v1, rather than the mean of the six category scores.
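
As a rough illustration of the difference between a micro average over tasks and a macro average over categories, here is a minimal sketch. The scores and the grouping below are made-up placeholders, not JMTEB results, and `micro_average` / `macro_average` are hypothetical helper names.

```python
from statistics import mean

# Hypothetical per-task scores grouped by category (placeholder values, not JMTEB results).
scores = {
    "Retrieval": [40.0, 50.0, 60.0],
    "STS": [80.0],
}

def micro_average(by_category):
    """Simple mean over every task, ignoring category boundaries."""
    return mean(s for task_scores in by_category.values() for s in task_scores)

def macro_average(by_category):
    """Mean of the per-category means, so each category counts equally."""
    return mean(mean(task_scores) for task_scores in by_category.values())

print(micro_average(scores))  # 57.5
print(macro_average(scores))  # 65.0
```

Categories with many tasks, such as Retrieval, weigh more heavily in the micro average, which is why it can differ noticeably from the mean of the category columns.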

The JMTEB configuration I used is here; as far as I can tell, prefixes and similar settings are applied correctly. The result JSON (summary.json) and the reproduction steps are in this gist. If my measurement is wrong, please let me know.
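
For readers unfamiliar with prefixes, the sketch below shows what applying them amounts to: a model-specific string is prepended to each query or document before encoding. The prefix strings are assumptions based on my reading of each model card and should be verified against the cards; `PREFIXES` and `add_prefix` are hypothetical names and are not part of JMTEB.

```python
# Query/document prefixes per model. These strings are assumptions based on the
# respective model cards; verify them before relying on any results.
PREFIXES = {
    "google/embeddinggemma-300m": {
        "query": "task: search result | query: ",
        "document": "title: none | text: ",
    },
    "intfloat/multilingual-e5-small": {
        "query": "query: ",
        "document": "passage: ",
    },
    "cl-nagoya/ruri-v3-30m": {
        "query": "検索クエリ: ",
        "document": "検索文書: ",
    },
}

def add_prefix(model_name: str, text: str, kind: str) -> str:
    """Prepend the model-specific prefix for a 'query' or a 'document'."""
    return PREFIXES[model_name][kind] + text

print(add_prefix("intfloat/multilingual-e5-small", "東京の天気", "query"))
# query: 東京の天気
```

Models trained with such prefixes can lose a lot of retrieval accuracy when the prefixes are missing or wrong, so this is one of the first things to check when a score looks suspiciously low.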

Update, 2025-10-03: The low scores were reportedly caused by a bug in Transformers; with the latest version, performance reportedly improves to around the level of ruri-base. Thank you to LM8 (@ShengzheLi) for the information.

JQaRA / JaCWIR

Because the JMTEB v1 scores were so low, I also evaluated the model separately on JQaRA and JaCWIR. The results were again quite low.

Model	JQaRA (nDCG@10)	JQaRA (MRR@10)	JaCWIR (MAP@10)	JaCWIR (HIT_RATE@10)
google/embeddinggemma-300m	0.261	0.457	0.730	0.904
intfloat/multilingual-e5-small	0.492	0.729	0.869	0.970
intfloat/multilingual-e5-large	0.554	0.799	0.876	0.973
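
For reference, the four metrics in the table can be sketched for a single query as follows. This is a minimal illustration with binary relevance labels and is not the official JQaRA or JaCWIR evaluation code; in particular, the normalization used for average precision varies between implementations, and the function names are hypothetical.

```python
import math

def ndcg_at_k(ranked_relevance, total_relevant, k=10):
    """nDCG@k for binary relevance labels listed in ranked order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevance[:k]))
    idcg = sum(1 / math.log2(i + 2) for i in range(min(total_relevant, k)))
    return dcg / idcg if idcg > 0 else 0.0

def mrr_at_k(ranked_relevance, k=10):
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for i, rel in enumerate(ranked_relevance[:k]):
        if rel:
            return 1 / (i + 1)
    return 0.0

def ap_at_k(ranked_relevance, total_relevant, k=10):
    """Average precision at k; averaging this over queries gives MAP@k."""
    hits, precision_sum = 0, 0.0
    for i, rel in enumerate(ranked_relevance[:k]):
        if rel:
            hits += 1
            precision_sum += hits / (i + 1)
    denom = min(total_relevant, k)  # assumed normalization; implementations differ
    return precision_sum / denom if denom > 0 else 0.0

def hit_at_k(ranked_relevance, k=10):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(ranked_relevance[:k]) else 0.0

# Toy ranking: relevant items at ranks 2 and 4, two relevant items in total.
ranking = [0, 1, 0, 1]
print(ndcg_at_k(ranking, total_relevant=2))
print(mrr_at_k(ranking))                   # 0.5
print(ap_at_k(ranking, total_relevant=2))  # 0.5
print(hit_at_k(ranking))                   # 1.0
```

The reported numbers are these per-query values averaged over all queries in each dataset.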

Strong MTEB Does Not Necessarily Mean Strong Japanese Performance

This was also true for Qwen3 Embedding, which I recently evaluated in "Evaluating the Japanese Performance of Qwen3 Embedding with JMTEB". Recent multilingual embedding models with high MTEB scores often turn out to be weak in Japanese. In the language-specific Japanese section of the MTEB leaderboard, both Qwen3 Embedding and Embedding Gemma show results only for Pair Classification, which says little about overall Japanese performance (in the JMTEB table above, every model scores around 62 in that category). That makes it somewhat unclear what "multilingual performance" actually means.

Both Qwen3 Embedding and Embedding Gemma are based on decoder-model architectures. Looking inside embeddinggemma-300m, its embedding head consists of mean pooling followed by two dense layers.
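
The shape of such an embedding head can be sketched in a few lines. This is a dependency-free toy with tiny dimensions, not the model's actual implementation: the absence of an activation between the two dense layers and the final L2 normalization are assumptions, and all function names and weights are hypothetical.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping padded positions (mask == 0)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def dense(vector, weight, bias):
    """Linear layer; weight has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weight, bias)]

def l2_normalize(vector):
    """Scale the vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def embed(token_vectors, attention_mask, w1, b1, w2, b2):
    pooled = mean_pool(token_vectors, attention_mask)
    hidden = dense(pooled, w1, b1)  # first dense layer (assumed: no activation)
    output = dense(hidden, w2, b2)  # second dense layer
    return l2_normalize(output)     # assumed final normalization

# Toy input: three token vectors of width 2, the last one is padding.
tokens = [[1.0, 3.0], [3.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]
w1, b1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]  # 2 -> 3
w2, b2 = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0]         # 3 -> 2
embedding = embed(tokens, mask, w1, b1, w2, b2)
print(embedding)
```

In the real model the token vectors come from the decoder's final hidden states and the layer widths are far larger; only the order of operations is meant to carry over.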

For small decoder-based models, performance was much lower than that of the encoder-based multilingual models, at least in Japanese. It is unclear whether this is because the models were barely trained on Japanese embedding tasks, or because the small decoder models they are built on generalize poorly to Japanese.