JFWIR: A Large Japanese Information Retrieval Dataset Built from Japanese FineWeb

created 2025-06-19

In Japanese information retrieval, many datasets have historically been built around Wikipedia. Real web text, however, is not limited to the clean, well-formatted writing found in Wikipedia. It also includes blogs, news sites, and forums, with diverse writing styles and noise.

JFWIR (Japanese FineWeb Information Retrieval) is a large dataset of about 64 million Japanese document-query pairs created to address that gap. It is based on fineweb-2-edu-japanese, a web-crawl dataset containing high-quality educational Japanese content.

https://huggingface.co/datasets/hotchpotch/JFWIR

Characteristics of JFWIR

1. Large and Diverse

JFWIR has the following characteristics:

More than 64 million document-query pairs: seven different query types are generated for each document (keywords, synonym_keywords, query, alt_query, title, faq, and summary)
Real web text: educationally valuable web content beyond Wikipedia
Hard negatives: similar but incorrect documents for effective training

2. Benchmark Results

I evaluated reranking models trained with JFWIR on major Japanese information retrieval benchmarks:

Benchmark	Without JFWIR	With 10M JFWIR records
JQaRA	0.7621	0.7633
MIRACL(ja)	0.8332	0.8385
jsquad	0.9801	0.9821
JaCWIR	0.9339	0.9586

The improvement on JaCWIR, which targets web text, was especially clear: the score rose from 0.9339 to 0.9586.
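To make the comparison easier to read, the absolute differences can be computed directly from the table. The snippet below is only a small sketch: the numbers are copied from the table above, and the variable names are my own.

```python
# Scores copied from the benchmark table:
# (without JFWIR, with 10M JFWIR records).
scores = {
    "JQaRA": (0.7621, 0.7633),
    "MIRACL(ja)": (0.8332, 0.8385),
    "jsquad": (0.9801, 0.9821),
    "JaCWIR": (0.9339, 0.9586),
}

# Absolute difference per benchmark, rounded to the table's precision.
deltas = {name: round(after - before, 4) for name, (before, after) in scores.items()}

# Print benchmarks from largest to smallest gain.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: +{delta:.4f}")
```

JaCWIR shows the largest gain (+0.0247), while the other three benchmarks move by about 0.005 or less.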

Usage

JFWIR can be loaded with the Hugging Face datasets library:

from datasets import load_dataset

# Load the main dataset.
train_ds = load_dataset("hotchpotch/JFWIR", split="train", name="small_tokens_cleaned")

# Inspect sample data.
for i in range(3):
    sample = train_ds[i]
    print(f"Query: {sample['query']}")
    print(f"Document: {sample['text'][:100]}...")

# Load the hard-negative dataset.
hard_negatives_ds = load_dataset("hotchpotch/JFWIR", split="train", name="hard_negatives")

# Example hard-negative usage.
for i in range(3):
    hn_sample = hard_negatives_ds[i]
    pos_id = hn_sample['pos_id']
    pos_doc = train_ds[pos_id]
    
    print(f"Query: {pos_doc['query']}")
    print(f"Positive (score: {hn_sample['pos_score']:.3f}): {pos_doc['text'][:100]}...")
    
    # Sort negative documents by score.
    neg_pairs = list(zip(hn_sample['neg_ids'], hn_sample['neg_scores']))
    neg_pairs.sort(key=lambda x: x[1])
    
    print("Negatives (lowest scores):")
    for neg_id, score in neg_pairs[:2]:
        print(f"  Score {score:.3f}: {train_ds[neg_id]['text'][:80]}...")

Dataset Creation Process

1. Collecting High-Quality Japanese Web Text

First, I extracted educationally valuable Japanese content from FineWeb-2 to create fineweb-2-edu-japanese. I then created the small_tokens_cleaned subset by removing web-specific noise and adjusting the text length.

2. Generating Diverse Queries

To generate queries for 64 million records, I used the lightweight query generation model query-crafter-japanese. To increase diversity, I combined three models.

Generating seven query types for each document (keywords, synonym_keywords, query, alt_query, title, faq, and summary) lets the dataset support a wider range of retrieval needs.

3. Creating Hard Negatives

To improve retrieval model performance, I also created a dataset containing hard negatives (documents that are similar to the query but are not correct answers):

Similar document retrieval with an embedding model: I vectorized 64 million documents with ruri-v3-30m and retrieved similar documents for each document.
Selecting suitable negatives: I randomly sampled negatives from two similarity-rank bands, ranks 10-50 and ranks 50-200.
Assigning reranker scores: I scored documents with japanese-reranker-xsmall-v2. These scores can be used to filter examples. For instance, dropping positives with score < 0.6 and negatives with score > 0.4 leaves more reliable positive and negative examples.
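The sampling and filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline code. The function names, the half-open rank boundaries, the number of samples per band, and the toy data are all assumptions; only the field names (pos_id, pos_score, neg_ids, neg_scores) and the 0.6 / 0.4 thresholds come from this article.

```python
import random


def sample_band_negatives(ranked_ids, bands=((10, 50), (50, 200)), per_band=2, seed=0):
    """Randomly sample negative candidates from rank bands of a similarity ranking.

    ranked_ids: document ids sorted from most to least similar (rank 0 first).
    bands: half-open rank ranges [start, end). The exact boundaries are an
           assumption based on the "10-50" and "50-200" bands.
    """
    rng = random.Random(seed)
    sampled = []
    for start, end in bands:
        candidates = ranked_ids[start:end]
        k = min(per_band, len(candidates))
        sampled.extend(rng.sample(candidates, k))
    return sampled


def filter_by_reranker_score(record, min_pos_score=0.6, max_neg_score=0.4):
    """Drop unreliable positives and likely false negatives.

    Returns None when the positive scores below min_pos_score. Otherwise
    returns (pos_id, kept_neg_ids), keeping only negatives whose score is
    at most max_neg_score.
    """
    if record["pos_score"] < min_pos_score:
        return None
    kept = [
        neg_id
        for neg_id, score in zip(record["neg_ids"], record["neg_scores"])
        if score <= max_neg_score
    ]
    return record["pos_id"], kept


# Toy data standing in for a real similarity ranking and a hard_negatives row.
ranked = list(range(1000, 1300))  # 300 hypothetical doc ids, most similar first
negatives = sample_band_negatives(ranked)

example = {
    "pos_id": 7,
    "pos_score": 0.91,
    "neg_ids": [11, 12, 13],
    "neg_scores": [0.05, 0.72, 0.31],
}
filtered = filter_by_reranker_score(example)

print(negatives)  # two ids from ranks 10-49, then two from ranks 50-199
print(filtered)   # (7, [11, 13]): the negative scoring 0.72 is dropped
```

Skipping the very top ranks is a common way to reduce the chance of sampling a document that actually answers the query, and the score filter catches the false negatives that still get through. The thresholds are worth tuning for your own training setup.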

Future Work

I published JFWIR to contribute to Japanese information retrieval. However, query-crafter-japanese currently generates relatively simple queries from text. I think retrieval accuracy can improve further with more diverse and valuable questions.

Summary

JFWIR takes a different approach from previous Japanese IR datasets that were heavily biased toward Wikipedia. It targets real web text and includes about 64 million records, seven query types, and hard negatives for contrastive learning. These elements should be useful for developing information retrieval systems.

The dataset is published on Hugging Face and can be used freely under the ODC-By license. I hope it contributes, even a little, to the development of Japanese information retrieval.

License

This dataset is released under the Open Data Commons Attribution License (ODC-By) v1.0, the same as the original FineWeb2. The Common Crawl terms of use also apply.

Citation Information

If you use the JFWIR dataset in research or development, please use the following citation:

@misc{tateno2025jfwir,
  author = {Yuichi Tateno},
  title = {JFWIR: Japanese FineWeb Information Retrieval Dataset},
  year = {2025},
  url = {https://huggingface.co/datasets/hotchpotch/JFWIR},
  note = {A large-scale Japanese information retrieval dataset with 60+ million document-query pairs}
}