cat articles/text-embedding-3-small
Evaluating OpenAI's new text-embedding-3-small on a RAG task
OpenAI recently released new embedding models, text-embedding-3-small and text-embedding-3-large. According to OpenAI, text-embedding-3-small costs one fifth as much as the older ada-v2 embedding model while improving performance.
According to OpenAI's article, MTEB scores improved slightly, and MIRACL scores improved substantially. MIRACL, Multilingual Information Retrieval Across a Continuum of Languages, is an information retrieval task across multiple languages. A large score improvement there suggests we can also expect better accuracy for Japanese information retrieval tasks.
Evaluation on a Wikipedia Q&A RAG task
So I evaluated it right away. I used the same method as Solving the first AI-Ou quiz competition with vector search only: vector search over about 5.5 million passages and checking whether the retrieved results contain the answer. In other words, this tests whether RAG retrieval can find text containing the appropriate answer. For text-embedding-3-small, I passed an option to the OpenAI API to reduce the embedding dimension to 512, and used those dimension-reduced results. If the original 1536-dimensional data were used, the score would probably improve slightly.
The results are below.

The text-embedding-3-small result was low, and the multilingual-e5 series remained overwhelmingly strong for Q&A-style information retrieval tasks. That said, one strength of OpenAI embeddings may be their ability to embed long token inputs. The passages used here were under 400 Japanese characters, and the score might also be a little higher without dimension reduction.
Data and code used
- https://github.com/hotchpotch/wikipedia-passages-jawiki-embeddings-utils
- https://huggingface.co/datasets/hotchpotch/wikipedia-passages-jawiki-embeddings
Extra: cost
The total was 1,490,618,785 tokens, and cost about 30 USD. With the old model this would have cost five times as much, so I would not have felt like trying it. At this level, it feels roughly acceptable even for an individual.