cat articles/wikipedia-ja-embeddings

Building Japanese Wikipedia embeddings and a FAISS index for RAG

created 2023-12-04

This article was written for December 4 of the Information Retrieval and Search Technology Advent Calendar 2023.

With the recent rise of LLMs, I feel that RAG, or Retrieval-Augmented Generation, has become increasingly important as a way to improve LLM outputs by using external information as in-context learning. But when building a system with RAG, it is not very interesting if the data is too small. So I created embeddings for about 5.5 million Japanese Wikipedia passages that can be searched easily and used as RAG input data, along with FAISS indexes for fast vector search.

For example, you can search for relevant Wikipedia sentences like this:

from datasets.download import DownloadManager
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import faiss

# wikipedia 日本語データセットのロード
wikija_dataset = load_dataset(
    path="singletongue/wikipedia-utils",
    name="passages-c400-jawiki-20230403",
    split="train",
)
# faiss index のダウンロード
dm = DownloadManager()
index_local_path = dm.download(
    f"https://huggingface.co/datasets/hotchpotch/wikipedia-passages-jawiki-embeddings/resolve/main/faiss_indexes/passages-c400-jawiki-20230403/multilingual-e5-small-passage/index_IVF2048_PQ96.faiss"
)
# faiss index のロード
faiss_index = faiss.read_index(index_local_path)

# embeddings へ変換するモデルのロード
model = SentenceTransformer("intfloat/multilingual-e5-small")
model.max_seq_length = 512

# embeddings へ変換
def to_emb(model, text, prefix="query: "):
    return model.encode([prefix + text], normalize_embeddings=True)

emb = to_emb(model, "楽曲『約束はいらない』でデビューした、声優は誰?")

# faiss で検索して、関連 Top-5 を取り出す
TOP_K = 5
scores, indexes = faiss_index.search(emb, TOP_K)
for idx, (id, score) in enumerate(zip(indexes[0], scores[0])):
    data = wikija_dataset[int(id)]
    print((score, data["title"], data["text"][:100]))

The result is below. For the question asking "who?", the appropriate person's name appears in the top 3.

(0.21018645, '約束はいらない', '「約束はいらない」(やくそくはいらない)は、坂本真綾のデビューシングル。')
(0.24241784, '約束はいらない', '坂本真綾の歌手デビュー作品。当時坂本はまだ無名の声優であったが、同曲がテーマソングとなったアニメ『天空のエスカフローネ』とともに知名度を上げることとなる。後に「指輪」が同アニメの劇場版映画の主題歌とな')
(0.2441598, '坂本真綾', '本格的な歌手活動は、1996年にテレビアニメ『天空のエスカフローネ』へ出演したことがきっかけで始めており、同作のオープニングテーマソングである「約束はいらない」(岩里祐穂作詞、菅野よう子作曲)をシング')
(0.26043588, 'Suara', '大阪府立豊中高等学校を経て大阪外国語大学でインドネシア語を専攻中にバンド・ユニットを組んで音楽活動を始めた。普段はお笑い番組が大好きなこともあってよく喋るほうだが、東京の仕事で標準語の喋りをする時は、')
(0.2690969, '早見沙織', '声優デビュー時より、数多くの主題歌やキャラクターソングを担当し、バラードからポップス、ヒットソングのカバー曲や英語の楽曲など、様々な曲を歌いこなす。2009年には吉田仁美とのユニット「blue dro')

Because the full Wikipedia text and the search index can be loaded from Hugging Face datasets, this can be used easily without doing difficult setup. The FAISS index used for search is memory-efficient and fast. For about 5.5 million vectors, it can search quickly on CPU without requiring large compute resources.

I also made a Hugging Face Spaces demo for RAG, so you can try RAG search and QA interaction from a Web UI.

https://huggingface.co/spaces/hotchpotch/wikipedia-japanese-rag-qa

Using Wikipedia sentences as RAG context

Next, let's use the answer data for the question "楽曲『約束はいらない』でデビューした、声優は誰？" as in-context learning for RAG. First, asking ChatGPT 3.5 directly produced this result:

楽曲『約束はいらない』でデビューした、声優は誰？

「約束はいらない」は、声優の悠木碧（ゆうきあおい、英: Aoi Yuki）によって歌われた楽曲です。悠木碧は、日本の声優として知られており、多くのアニメ作品やゲームで声優として活躍しています。この曲は彼女のデビューシングルの一つとして知られています。

ChatGPT 3.5 gave a wrong answer. Next, I injected the Wikipedia search results above and used them for generation, in other words as RAG search results for in-context learning.

楽曲『約束はいらない』でデビューした、声優は誰？以下のデータにヒントが含まれる場合は参考にして答えよ。また以下のデータ自体が間違っている場合もあるため、その場合はデータを利用せずに答えよ。

(0.21018645, '約束はいらない', '「約束はいらない」(やくそくはいらない)は、坂本真綾のデビューシングル。') (0.24241784, '約束はいらない', '坂本真綾の歌手デビュー作品。当時坂本はまだ無名の声優であったが、同曲がテーマソングとなったアニメ『天空のエスカフローネ』とともに知名度を上げることとなる。後に「指輪」が同アニメの劇場版映画の主題歌とな') (0.2441598, '坂本真綾', '本格的な歌手活動は、1996年にテレビアニメ『天空のエスカフローネ』へ出演したことがきっかけで始めており、同作のオープニングテーマソングである「約束はいらない」(岩里祐穂作詞、菅野よう子作曲)をシング') (0.26043588, 'Suara', '大阪府立豊中高等学校を経て大阪外国語大学でインドネシア語を専攻中にバンド・ユニットを組んで音楽活動を始めた。普段はお笑い番組が大好きなこともあってよく喋るほうだが、東京の仕事で標準語の喋りをする時は、') (0.2690969, '早見沙織', '声優デビュー時より、数多くの主題歌やキャラクターソングを担当し、バラードからポップス、ヒットソングのカバー曲や英語の楽曲など、様々な曲を歌いこなす。2009年には吉田仁美とのユニット「blue dro')

提供されたデータによれば、楽曲「約束はいらない」は坂本真綾のデビューシングルであり、坂本真綾が歌手デビューした作品であることが確認できます。したがって、この楽曲でデビューした声優は坂本真綾（さかもとまあや）です。

This answered correctly. The test is here. GPT-4 answered correctly from the start even without adding RAG information.

How to create Wikipedia sentence and passage data

There are several ways to obtain Wikipedia data, such as full dump data, CirrusSearch dump data, and calling the API. Each has advantages and disadvantages. Personally, I think calling the API is the best way to get clean text. However, calling the API for every article is hard, so at first I used the easier CirrusSearch dump data and split it into chunks for testing.

Then I found singletongue/wikipedia-utils, created by Masatoshi Suzuki. Looking inside, it contains clean text data including article titles and section titles. It also publishes passages of useful lengths, such as 200, 300, and 400 characters, so I created embeddings based on that data.

Which model to use for sentence embeddings

Embedding models have different performance directions depending on the use case, roughly corresponding to how they were trained. In RAG systems, the most important type is a model strong at retrieval or reranking tasks, trained so that questions and matching passages score highly. The next best direction is a model strong at similar sentence search.

Among models that seemed likely to perform well and could be used freely for Japanese, I converted data into embeddings with the following models. OpenAI's embeddings API, ada-v2, has the advantage of long input token length, but I excluded it because the output dimension is large at 1536 dimensions, the license situation for converted data is unclear, and above all it costs API fees.

multilingal-e5 series
- You need to be aware of prefixes: use "passage: " for retrieval passages and "query: " otherwise. It feels quite strong.
pkshatech/GLuCoSE-base-ja
cl-nagoya/sup-simcse-ja-base

I converted the passage-400 data above into embeddings with these models. For e5, using "passage: " or "query: " at the beginning of the target text produces different vectors, so I published embeddings generated with each prefix.

https://huggingface.co/datasets/hotchpotch/wikipedia-passages-jawiki-embeddings/tree/main/embs/passages-c400-jawiki-20230403

In addition to embeddings, I published FAISS indexes for search. For index parameters, based on this evaluation, I used IVF nlist=2048, and set the PQ quantization option to one quarter of the vector dimension, such as 96 for 384 dimensions. If you are interested in how Wikipedia sentence embedding search behaves with each model, please try them.

https://huggingface.co/datasets/hotchpotch/wikipedia-passages-jawiki-embeddings/tree/main/faiss_indexes/passages-c400-jawiki-20230403

The implementation used for conversion is published as hotchpotch/wikipedia-passages-jawiki-embeddings-utils. datasets_to_embs.py converts datasets to embeddings, and embs_to_faiss.py creates FAISS indexes from embeddings.

Closing

This article introduced and described creating Japanese Wikipedia embeddings and search indexes for RAG that are easy to try in a general way. In this year when LLMs rose rapidly, search use cases are likely to change greatly, and I feel this was the year when search technology started to be rediscovered and redefined. RAG, though this article only touched one element of RAG, vector search, suggests expansion into many search use cases. I look forward to the development and use of search technology next year and beyond.

Using Wikipedia sentences as RAG context

How to create Wikipedia sentence and passage data

Which model to use for sentence embeddings

Closing

Reference links