cat articles/query-crafter-japanese

query-crafter-japanese: A Model for Generating Queries for Information Retrieval

created 2025-05-07

Training neural retrieval models such as vector search models and rerankers requires pairs of questions and answer documents. The answer document can be almost any text, though higher-quality text and domain-specific data naturally help produce better models. For training, however, we also need questions that are related to those answers. Recent LLMs have improved substantially, and we can use them to generate questions automatically from answer documents, then use those pairs for training. Datasets created this way are often called synthetic datasets.

However, when you want to create and publish a synthetic dataset broadly, commercial LLMs such as OpenAI and Gemini can create licensing issues because of their terms of use. Processing a large number of documents also takes significant time and cost.

For that reason, I created and released query-crafter-japanese, a family of small 1.7B to 4B models that run quickly, can generate retrieval questions at roughly the same level as questions generated by DeepSeek-R1, and do not impose restrictions on output licensing. The models are released under the Apache 2.0 license.

query-crafter-japanese-Qwen3-1.7B
- Recommended for speed and performance
query-crafter-japanese-Qwen3-4B
query-crafter-japanese-sarashina2.2-3b-instruct-v0.1

query-crafter can generate seven categories:

keywords: space-separated keywords
synonym_keywords: distinctive keywords using synonyms
query: a question based on the content of the text
alt_query: a question phrased in a way that does not match well with BM25
title: a title representing the whole text
faq: a question when treating the text as an FAQ answer
summary: a short summary of the text

Let's generate each category from the following text.

query-crafter-japanese-example.py

In the evening, we had the results presentation for a development retreat. Everyone except me worked on proper AI-related themes, and the quality was high. It was interesting. Person I is not even an engineer, but they made a Figma plugin and deployed it to Vercel, mostly written by Cursor. It was a close-up example of how AI can greatly expand what someone can do. I did not work on a particular theme. Instead, because I had never tried vibe coding, I tested how far I could build something in Cursor without touching or reading the code.

I made a tool that summarizes these yearly diary entries and posts them to Discord, adding new features based on a specification I had written before. I also made a tool that automatically gives titles to diary entries that do not have one. Vibe coding worked about as I expected. It is convenient.

Because I developed in a black-box way without looking at the code and only checked the output artifacts, the generated code was not production-ready when I looked at it later. Still, it was enough for quick one-off tools. I only need to give the specification, and I also make sure the specification is updated along the way. If I want to change a feature, I only need to change or add to the specification, which is easy.

Here is the result of generating queries by category with query-crafter-japanese-Qwen3-1.7B. keywords, query, title, and summary show clear differences. synonym_keywords is not always a perfect synonym, and alt_query and faq may sometimes be close to query.

keywords: Vibe Cording ブラックボックス開発 仕様変更
synonym_keywords: AI活用開発プロジェクト 発表会 仕様変更追加
query: 開発合宿で作成したツールの具体的な機能は？
alt_query: 開発者向けツール開発でコード見ない開発手法の利点は？
title: AI活用で拓く開発の新領域：Vibe Cordingとブラックボックス開発の可能性
faq: 開発合宿で実現した新機能や成果は？
summary: AI活用の開発成果発表会で、Vibe Cordingや日記ツール開発、コード見ずに開発を実施

The model is also fast. In a vLLM + RTX 5090 environment, it runs at about 48,000 toks/s for input tokens and 2,200 toks/s for output tokens. If you generate 10,000 questions from 10,000 texts of around 1,000 Japanese characters each, it takes a little under 100 seconds. Even if there were 100 million target documents, processing all of them would take about 140 hours.

For comparison, when I processed 100,000 documents with DeepSeek-R1 during the nighttime discount window, with input at 0.135 USD per 1M tokens and output at 0.55 USD per 1M tokens, using 100 parallel API requests took about 7 hours and cost around 40 USD. Processing 100 million documents with the DeepSeek-R1 API would cost around 40,000 USD and take about 7,000 hours. In practice it would take longer if you try to use only nighttime discount periods, and the maximum parallel request count also depends on DeepSeek's available resources.

In this way, query-crafter has large advantages in both speed and cost when you want to generate questions from a large number of documents.

Training query-crafter-japanese

For training, I used DeepSeek-R1, which does not restrict output use, to create supervised question data as a synthetic dataset from fineweb-2-edu-japanese.

For example, for title, I used an instruction like: "Think of and create a title that represents the whole text well. Output the title within 30 Japanese characters. The output must be strict JSON in the form {"query": "title"}. Do not output anything else."

https://huggingface.co/datasets/hotchpotch/japanese-query-crafter-reasoning-80k

I then used this data as supervised data for SFT, supervised fine-tuning, on Qwen3-4B, Qwen3-1.7B, sarashina2.2-3b-instruct-v0.1, and TinySwallow-1.5B-Instruct.

The SFT format was simple:

{
  "system": "{category名}",
  "user": "{text}",
  "assistant": "{query}",
}

The system prompt contains the instruction category such as title, the user input contains the document text, and the model output contains query. For SFT specialized to a particular use case, a verbose prompt is not necessary. A short instruction, in this case the category, can train the behavior well.

Evaluation

I evaluated query-crafter using the test split of japanese-query-crafter-reasoning-80k. I generated questions from the text in this data using each SFT-trained query-crafter model.

Then I paired those generated questions with the original text and scored them with the reranker BAAI/bge-reranker-v2-m3. The reranker score is 1.0 when the document and text are highly related, and 0.0 when they are not related. It is therefore a rough measure of whether the generated question is related to the text.

Model	Mean	Std. dev.
query-crafter-jp-Qwen3-1.7B	0.8701	0.2592
query-crafter-jp-Qwen3-4B	0.8712	0.2652
query-crafter-jp-TinySwallow-1.5B	0.7526	0.3611
query-crafter-jp-sarashina2.2-3b	0.8670	0.2646
deepseek-r1	0.8507	0.2875

The percentile plot is below.

Except for TinySwallow-1.5B, the models scored higher than DeepSeek-R1 in most cases. In particular, Qwen3-1.7B is a multilingual model not specialized for Japanese, but after SFT its score is almost the same as Qwen3-4B. Its performance is impressive. Unless you have a specific reason to choose otherwise, query-crafter-japanese-Qwen3-1.7B is a good choice.

A lower reranker score than DeepSeek-R1 does not necessarily mean the DeepSeek-R1 question is worse. There are cases where it creates correct but difficult questions that are hard even for a reranker to judge. TinySwallow-1.5B sometimes generated questions that were completely unrelated, which lowered its score compared with the other models. TinySwallow-1.5B-Instruct was distilled with TAID, so it may be less suitable for subsequent SFT.

Closing

I created and released query-crafter-japanese, a model with significant speed and cost advantages when generating a large number of questions. Since the release of high-performing DeepSeek-R1, which does not restrict output use, it has become easier to create and publish datasets and then build models using them as supervised data. The emergence and improvement of open-weight LLMs with practical licenses, such as smaller Qwen models, also makes it easier to create and publish fine-tuned small models specialized for specific use cases. I feel that the range of possible applications has widened considerably. Half a year earlier, creating this model as an individual would probably have been impossible for resource reasons.

I hope this model helps people who need to generate questions.