Solving the first AI-Ou quiz competition with vector search only
This article was written for day 21 of the Kaggle Advent Calendar 2023.
With the arrival of LLMs that can handle long contexts, I feel that Retrieval-Augmented Generation (RAG) has become increasingly important as a way to improve the accuracy of LLM output. For example, in the Kaggle competition LLM Science Exam, all of the top solutions used RAG. A core element of RAG is the search method that retrieves text relevant to a target text, such as a question. The major approaches are keyword-based search such as BM25 and vector search over sentence features (embeddings).
In this article, I solve the task from the already-finished first AI-Ou competition (AI-Ou: Quiz AI Japan Championship) using only Japanese vector search, and check what score it can reach. I also compare several Japanese embedding models.

What is the first AI-Ou competition?
The first AI-Ou (Quiz AI Japan Championship) competition asks a system to choose the one correct answer from about 20 candidates for each question. About 13,000 examples were published for training and about 2,000 for validation. The quiz answers always appear in Japanese Wikipedia. An example from the dataset looks like this:
## 質問
1868年に化石が発見された南フランスの地名から名が付いた、現在の人類の直接的な祖先とされる化石人類は何でしょう?
## 回答候補
['ホモ・ハイデルベルゲンシス', 'ホモ・サピエンス・イダルトゥ', 'クロマニョン人', 'ホモ・エルガステル', 'ジャワ原人', 'オロリン', 'サヘラントロプス', 'アウストラロピテクス・アフリカヌス', 'ホモ・アンテセッサー', '猿人', 'ネアンデルタール人', 'ホモ・ゲオルギクス', 'ホモ・エレクトス', '元謀原人', 'アウストラロピテクス', 'ホモ・フローレシエンシス', 'ホモ・ローデシエンシス', 'アウストラロピテクス・アファレンシス', 'ホモ・サピエンス', 'ホモ・ハビリス']
## 正解
クロマニョン人
Predicting the answer with only vector search
Because the quiz answers always appear in Japanese Wikipedia, I convert the question text into an embedding, search over the embeddings of Japanese Wikipedia passages, and take the top-N most similar passages together with their Wikipedia titles. I then look up the position at which each of the 20 candidate answer strings first occurs in that text, and predict the candidate whose first occurrence comes earliest. For the Wikipedia search, I use the dataset of roughly 5.5 million passages and the FAISS index from "Building Japanese Wikipedia embeddings and a FAISS index for RAG".
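The retrieval step can be sketched roughly as follows. This is a toy illustration and not the code used in this article: it replaces the FAISS index with a brute-force scan, assumes cosine similarity as the similarity measure, and uses made-up 3-dimensional vectors in place of real embeddings. The names `cosine` and `top_n` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_n(query_vec, passage_vecs, n):
    """Return the indices of the n passages most similar to the query, best first."""
    ranked = sorted(
        range(len(passage_vecs)),
        key=lambda i: cosine(query_vec, passage_vecs[i]),
        reverse=True,
    )
    return ranked[:n]


# Toy vectors (assumption): real embeddings come from an embedding model
# and have hundreds of dimensions.
passage_vecs = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
    [0.9, 0.1, 0.0],
]
query_vec = [1.0, 0.0, 0.0]

top_ids = top_n(query_vec, passage_vecs, n=3)
print(top_ids)
```

With millions of passages a brute-force scan like this is too slow, which is why an approximate index such as FAISS is used in practice.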
For example, for the question above, I convert "1868年に化石が発見された南フランスの地名から名が付いた、現在の人類の直接的な祖先とされる化石人類は何でしょう?" into an embedding, retrieve the top-N results by vector search, and concatenate them into one text. The example below uses top-3.
南アフリカの人類化石遺跡群 クロマニョン人 化石人類 そのため、180万年前から150万年前と推測されるその時期、東アフリカではヒト属が優勢になっていたのに対し、南アフリカで優勢だったのはパラントロプス属の方だったのだろうと考えられている。グラディスヴェールはスタルクフォンテインから8 km ほどの場所にある遺跡で、1948年には探索が行われていたが、化石人骨の出土は1992年になってのことだった。この地で調査に当たっていた古人類学者リー・バーガー(英語版)は、アウストラロピテクス・アフリカヌスの断片を見つけるにとどまっていたという。しかし、バーガーは2008年8月にヨハネスブルグからグラディスヴェールに向かう大きな道を数 km 手前で脇に逸れ、グーグル・アースで見当をつけていた近隣の石灰石採掘場跡に赴いた。その場所で彼は9歳の息子マシューとともに、新種の猿人化石を発見した。 クロマニョン人(クロマニョンじん、Cro-Magnon man)とは、南フランスで発見された人類化石に付けられた名称である。1868年、クロマニョン (Cro-Magnon) 洞窟で、鉄道工事に際して5体の人骨化石が出土し、古生物学者ルイ・ラルテ(フランス語版、英語版)によって研究された。その後、ヨーロッパ、北アフリカ各地でも発見された。現在ではクロマニョン人を、現世人類と合わせて解剖学的現代人(英語: anatomically modern human) (AMH) と呼ぶことがある。またネアンデルタール人を、従来の日本語では旧人と呼ぶのに対し(ネアンデルタール人以外にも、25万年前に新人段階に達する前の、現代型サピエンスの直接の祖先である古代型サピエンス等も旧人段階の人類とみなすことがある)、クロマニョン人に代表される現代型ホモ・サピエンスを、従来の日本語では新人と呼ぶこともある。 化石人類(かせきじんるい、英語: fossil hominidまたはfossil man)は、現在ではすでに化石化してその人骨が発見される過去の人類。人類の進化を考察していくうえで重要な化石資料となる。資料そのものは化石人骨(かせきじんこつ)とも称する。また、主に第四紀更新世(洪積世)の地層で発見されるので更新世人類ないし洪積世人類とも称する。
I then search this text for the answer candidates above ('ホモ・ハイデルベルゲンシス', 'ホモ・サピエンス・イダルトゥ', 'クロマニョン人', ...) and choose the one that appears first. Here the earliest match is "クロマニョン人", which occurs as the second title, so it becomes the predicted answer. The true answer is also "クロマニョン人", so this case is correct.
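The first-occurrence rule can be written in a few lines. The following is a minimal sketch and not the actual evaluation code: `predict_answer` is a hypothetical name, the search text is an abbreviated version of the example above, and the tie-breaking rule (the candidate listed first wins when two candidates start at the same position) is my assumption.

```python
def predict_answer(search_text, candidates):
    """Return the candidate whose first occurrence in search_text is earliest.

    Returns None when no candidate appears in the text at all.
    Ties are resolved in favor of the candidate listed first (assumption).
    """
    best, best_pos = None, len(search_text)
    for cand in candidates:
        pos = search_text.find(cand)  # -1 when the candidate is absent
        if pos != -1 and pos < best_pos:
            best, best_pos = cand, pos
    return best


# Abbreviated search text: three titles followed by passage fragments.
search_text = (
    "南アフリカの人類化石遺跡群 クロマニョン人 化石人類 "
    "アウストラロピテクス・アフリカヌスの断片 新種の猿人化石"
)
candidates = [
    "猿人",
    "アウストラロピテクス・アフリカヌス",
    "クロマニョン人",
    "ネアンデルタール人",
]

predicted = predict_answer(search_text, candidates)
print(predicted)
```

Returning None for the no-match case keeps it distinguishable from a wrong prediction, which is what a no-match rate needs.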
Japanese embedding models and accuracy
I evaluated on the roughly 2,000 validation examples (dev1 and dev2) provided by the first AI-Ou competition, using accuracy as the metric. I compared the following Japanese embedding models:
- intfloat/multilingual-e5-small
- intfloat/multilingual-e5-base
- intfloat/multilingual-e5-large
- pkshatech/GLuCoSE-base-ja
- cl-nagoya/sup-simcse-ja-base
For the e5 series, the model generates different embeddings depending on the prefix added to the input text: passage: for text used in retrieval, and query: otherwise. I tried both. Also, because the search uses a FAISS index compressed with IVFPQ, answer accuracy may fluctuate by about plus or minus 2% for top-3 and plus or minus 0.5% for top-5 compared with an uncompressed index, going by the values reported in "Measuring speed, data size, and accuracy for vector search algorithms and quantization parameters".
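The prefix handling for the e5 models can be sketched as follows. This is a minimal illustration, not the evaluation code: `with_prefix` is a hypothetical helper, and the step that passes the prefixed strings to the embedding model is omitted.

```python
def with_prefix(texts, prefix):
    """Prepend an e5-style prefix ('query' or 'passage') to each text."""
    if prefix not in ("query", "passage"):
        raise ValueError("prefix must be 'query' or 'passage'")
    return [f"{prefix}: {text}" for text in texts]


question = (
    "1868年に化石が発見された南フランスの地名から名が付いた、"
    "現在の人類の直接的な祖先とされる化石人類は何でしょう?"
)

# The same text yields two different model inputs, and so two different embeddings.
as_query = with_prefix([question], "query")
as_passage = with_prefix([question], "passage")
print(as_query[0][:20])
print(as_passage[0][:20])
```

Whichever prefix is chosen, it has to be applied consistently every time the same kind of text is encoded, otherwise the comparison between the two settings is not meaningful.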
The results are below. For reference, scores on the roughly 13,000 training examples are recorded in a separate sheet. acc@N is the accuracy computed from the top-N results, and NMR@N is the no-match rate: the fraction of questions for which none of the 20 answer candidates was found in the top-N text.
- Evaluation code
- Score summary
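The two metrics can be computed as in the sketch below. This is an illustration under my own assumptions, not the linked evaluation code: `score` is a hypothetical function, a prediction of None stands for "no candidate matched", and the four predictions are made-up toy data rather than results from the experiment.

```python
def score(predictions, answers):
    """Return accuracy and no-match rate for one value of N.

    predictions: predicted answer per question, or None if no candidate matched.
    answers: correct answer per question.
    """
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    total = len(answers)
    correct = sum(p == a for p, a in zip(predictions, answers))
    no_match = sum(p is None for p in predictions)
    return {"acc": correct / total, "nmr": no_match / total}


# Toy data (made up for illustration only).
predictions = ["クロマニョン人", None, "猿人", "ホモ・サピエンス"]
answers = ["クロマニョン人", "ジャワ原人", "ネアンデルタール人", "ホモ・サピエンス"]

result = score(predictions, answers)
print(result)
```

A no-match question always counts as incorrect, so acc@N can never exceed 1 - NMR@N.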

At top-1, 3, 5, 10, 20, and so on, multilingual-e5-large wins decisively. I had assumed, with some bias, that multilingual-e5-small and multilingual-e5-large would not differ that much, but a 7% difference in accuracy between small and large changes the picture considerably. Another surprise was that, for the e5 embeddings, the choice between passage: and query: as the prefix made almost no difference except for base. In fact, for the models other than base the order was, if anything, reversed. For a retrieval task that looks up answer text for a question, I expected passage: to score better, but that was not the case here. Judging from this result alone, it seems fine to use the more general query: prefix, which also works for similar-sentence tasks, for RAG search with e5 embeddings.
Many results are worse at acc@100 than at acc@10 because of the order in which the text is searched for keywords. I concatenate the text in the order title@1, title@2, ..., title@N, passage@1, passage@2, and so on, so as N grows, the chance that a wrong title is matched first increases.
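The effect of the title-first ordering can be reproduced with a small example. This is a sketch under my own assumptions: the dictionary keys `title` and `passage`, the space separator, and the two toy hits are hypothetical and not taken from the actual search results.

```python
def build_search_text(hits, n):
    """Concatenate the top-n hits as title@1..title@n, then passage@1..passage@n."""
    top = hits[:n]
    titles = [hit["title"] for hit in top]
    passages = [hit["passage"] for hit in top]
    return " ".join(titles + passages)


def first_match(text, candidates):
    """Return the candidate that appears earliest in text, or None if none appears."""
    found = [(text.find(c), c) for c in candidates if c in text]
    return min(found)[1] if found else None


# Toy hits (made up): the correct answer is only in the passage of hit 1,
# while hit 2 has a wrong candidate as its title.
hits = [
    {"title": "化石人類", "passage": "クロマニョン人は南フランスで発見された。"},
    {"title": "ネアンデルタール人", "passage": "旧人と呼ばれる。"},
]
candidates = ["クロマニョン人", "ネアンデルタール人"]

pred_top1 = first_match(build_search_text(hits, 1), candidates)
pred_top2 = first_match(build_search_text(hits, 2), candidates)
print(pred_top1, pred_top2)
```

With N=1 the correct candidate is found in the passage, but with N=2 the title of the second hit is placed ahead of every passage and is matched first, so the prediction flips to the wrong answer.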
Difference from top competition teams
According to the AI-Ou retrospective, the top teams in the competition, including the team that placed first on the LB, had CV scores above 0.95 on the dev dataset. Even the best score here, 0.7791 with e5-large, does not come close.
Still, for an approach that uses only vector search and simple string matching, with no training, preprocessing, or postprocessing, I think the score is fairly good. I cannot confirm it now, but the BERT fine-tuning baseline originally published for the competition reportedly scored around 0.8. Reaching 0.78 with training-free search does not seem bad.
Closing
This time I tried the first AI-Ou quiz competition using only vector search. In question-answering systems and similar applications, supplying knowledge that an LLM lacks and steering its output through RAG plus in-context learning will probably remain in use until LLMs can cheaply absorb external knowledge through training and hallucinations are almost eliminated. On Kaggle too, RAG and in-context learning may well appear again in NLP tasks.
For this competition task, where the goal is to find Japanese text likely to contain the answer to a question, multilingual-e5-large performed well as an embedding model. For similar-sentence search tasks, however, other models may perform better, as evaluations such as JSTS and JSICK show. It seems necessary to evaluate models on the task and data you actually want to handle.
I hope this article helps with Kaggle tasks or with using and choosing Japanese embeddings.