FineWeb2 Edu Japanese: A High-Quality Educational Japanese Dataset

created 2025-02-20

I published FineWeb2 Edu Japanese, a high-quality educational Japanese dataset.

https://huggingface.co/datasets/hotchpotch/fineweb-2-edu-japanese

The following is an English version of the content on that page.

This dataset filters the Japanese portion of FineWeb2 (376 million records) down to the records judged to be educational content: about 120 million records, or about 89.3B tokens. It also provides the following subsets.

default: about 120M records and about 89.3B tokens
sample_10BT: about 10B tokens randomly sampled from default
small_tokens: only short texts with 512 tokens or fewer
small_tokens_cleaned: small_tokens with web-specific text noise removed
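
As a rough sketch of how one of these subsets could be opened with the Hugging Face datasets library: the subset names below are the ones listed above, while the split name "train" and the helper name load_subset are assumptions made for illustration.

```python
# Subset (config) names as listed above.
SUBSETS = ("default", "sample_10BT", "small_tokens", "small_tokens_cleaned")
REPO_ID = "hotchpotch/fineweb-2-edu-japanese"


def load_subset(name="sample_10BT", split="train", streaming=True):
    """Open one subset. split="train" is an assumption, not confirmed here."""
    if name not in SUBSETS:
        raise ValueError(f"unknown subset {name!r}; expected one of {SUBSETS}")
    # Imported lazily so this module still loads where `datasets` is missing.
    from datasets import load_dataset

    # streaming=True iterates over records without downloading everything first.
    return load_dataset(REPO_ID, name=name, split=split, streaming=streaming)


# Example (needs network access and the `datasets` package):
#   first = next(iter(load_subset("sample_10BT")))
#   print(first["score"], first["token_count"])
```

Streaming is used here only because the default subset is large; for the smaller subsets a regular download may be just as practical.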

Background

FineWeb, an English-only dataset, was built by deduplicating web data and extracting high-quality text. FineWeb-Edu goes a step further and extracts educational text from it, which makes efficient training possible with fewer tokens.

FineWeb2, released in December 2024, is a high-quality multilingual dataset that includes Japanese. As of February 2025, however, an "Edu" dataset that extracts educationally valuable Japanese text had not been released. For that reason, I created and published FineWeb2 Edu Japanese.

Filtering Educational Data

To build this dataset, I filtered FineWeb2 Japanese data with fineweb-2-edu-japanese-classifier, a model that scores how educational a text is. The training data for this scoring model is fineweb-2-edu-japanese-scores, whose scores were assigned by deepseek-chat through the DeepSeek API. This dataset keeps only texts with a score of 2.5 or higher, and each record's score is stored in the score column.
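
The thresholding step can be sketched as follows. This is only an illustration of the 2.5 cutoff applied to records that already carry a score; it is not the actual pipeline, and the function name and sample rows are made up.

```python
EDU_THRESHOLD = 2.5  # cutoff stated in the text


def keep_educational(records, threshold=EDU_THRESHOLD):
    """Yield records whose classifier score is at or above the threshold."""
    for record in records:
        if record["score"] >= threshold:
            yield record


# Hypothetical rows; real records come from the classifier's output.
rows = [
    {"id": "a", "score": 3.1},
    {"id": "b", "score": 2.4},
    {"id": "c", "score": 2.5},
]
kept_ids = [r["id"] for r in keep_educational(rows)]
print(kept_ids)
```

Because the score column is kept in the published data, a stricter threshold can be applied the same way by passing a larger value.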

Token Counts

Token counts computed with the ModernBERT-Ja-130M tokenizer are included in the token_count column.
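
A minimal sketch of how such a count could be computed and used, assuming a Hugging Face style tokenizer that returns input_ids. Whether special tokens were counted in the dataset is not stated, so add_special_tokens=False is an assumption, as is the repository id in the comment.

```python
def count_tokens(text, tokenizer):
    """Return the number of token ids the tokenizer produces for `text`."""
    # Assumption: special tokens are excluded; the dataset may count differently.
    return len(tokenizer(text, add_special_tokens=False)["input_ids"])


def is_small(record, max_tokens=512):
    """True for records within the small_tokens limit of 512 tokens."""
    return record["token_count"] <= max_tokens


# With the real tokenizer (needs `transformers` and network access):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("sbintuitions/modernbert-ja-130m")  # assumed repo id
#   n = count_tokens("こんにちは、世界。", tok)
```

Since token_count is already a column, selecting short texts only needs the second helper; recomputing counts is useful mainly for new text.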

Removing Web-Specific Noise

FineWeb2 Japanese data can contain web-specific boilerplate and unnecessary noise. For example, text like the following can appear.

This text is displayed on a site that has not been updated for more than 90 days.
Login Logout

Besides the text that is actually needed, various kinds of noise may be included. This sentence is one such example. Unnecessary text can be inserted in this way.

50% off now! Click to view the linked product

Especially when the text is short, most of it may contain noise. Removing such text may allow higher-quality text to be extracted.

Previous page  Next page

To remove this kind of unnecessary text, I developed fineweb-2-japanese-text-cleaner. Its training data for noise detection is fineweb-2-japanese-noise-spans, which was created using cyberagent/DeepSeek-R1-Distill-Qwen-32B-Japanese.

The model detects noisy spans as follows.

[NOISE]This text is displayed on a site that has not been updated for more than 90 days.[/NOISE]
[NOISE]Login[/NOISE] [NOISE]Logout[/NOISE]

Besides the text that is actually needed, various kinds of noise may be included. This sentence is one such example. Unnecessary text can be inserted in this way.
[NOISE]
50% off now! Click to view the linked product[/NOISE]

Especially when the text is short, most of it may contain noise. Removing such text may allow higher-quality text to be extracted.

[NOISE]Previous page[/NOISE]  [NOISE]Next page[/NOISE]

The small_tokens_cleaned subset applies fineweb-2-japanese-text-cleaner to small_tokens and removes detected noise. The raw data produced by running noise detection with the model is also published as fineweb-2-edu-japanese-noise-detect-raw.
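
Assuming the detector's result is available as text tagged like the example above, the removal step could look roughly like this. It is a regex-based sketch, not the cleaner's actual interface, and the whitespace tidying at the end is an illustrative choice.

```python
import re

# Non-greedy match of one tagged span; DOTALL lets a span cross line breaks.
NOISE_SPAN = re.compile(r"\[NOISE\].*?\[/NOISE\]", re.DOTALL)


def strip_noise(tagged_text):
    """Remove [NOISE]...[/NOISE] spans, then drop lines left empty."""
    cleaned = NOISE_SPAN.sub("", tagged_text)
    # Trim each line and discard those that are now blank.
    lines = [line.strip() for line in cleaned.splitlines()]
    return "\n".join(line for line in lines if line)


tagged = (
    "[NOISE]Login[/NOISE] [NOISE]Logout[/NOISE]\n"
    "Body text stays.\n"
    "[NOISE]\n50% off now![/NOISE]"
)
print(strip_noise(tagged))  # prints "Body text stays."
```

Note that this version also collapses blank lines between paragraphs; a variant that preserves paragraph breaks would need a gentler whitespace rule.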

Noise detection is not perfect, so in some cases parts of valid text may have been mistakenly removed.

Notes

I have not run a comparative experiment between this dataset (FineWeb2 Edu Japanese) and the original FineWeb2 dataset without Edu filtering, so the actual difference it makes in LLM training has not been verified.

The classification of whether text is educational is also not perfect, and some non-educational text is included.

License

This dataset is released under the Open Data Commons Attribution License (ODC-By) v1.0, the same as the original FineWeb2. The Common Crawl terms of use also apply.

Citation Information

@software{yuichi2025fineweb-2-edu-japanese,
  author = {Yuichi Tateno},
  title = {FineWeb2 Edu Japanese},
  month = feb,
  year = 2025,
  url = {https://huggingface.co/datasets/hotchpotch/fineweb-2-edu-japanese/}
}