hotchpotch.dev

trn: A Fast Local Translation Command for macOS Tahoe

Mon, 08 Jun 2026 00:00:00 GMT

After updating my Mac to macOS Tahoe 26, I became interested again in the APIs behind Apple's on-device intelligence features. Apple Intelligence is the visible product name, but the more interesting part for developers is that some of the underlying system capabilities are exposed as ordinary frameworks.

Translation is one of those capabilities. On Tahoe, the Translation framework can use two different strategies: a low-latency path and a high-fidelity path that uses Apple Intelligence when available. I wanted to see how practical that was from a small local tool, so I built mac-translate-cli.

It provides a command named trn.

What `trn` Does

trn is deliberately small. It only translates text. It uses the macOS built-in Translation framework and exposes it as a command-line tool.

On macOS Tahoe 26.4 or later, it can be installed with Homebrew:

brew tap hotchpotch/mac-translate-cli https://github.com/hotchpotch/mac-translate-cli
brew install hotchpotch/mac-translate-cli/trn

Basic usage:

trn --to ja "Hello world!"
#=> こんにちは、世界！

It also reads from standard input:

echo "Hello world!" | trn --to ja
cat notes.txt | trn --to en
pbpaste | trn --to ja | pbcopy

When --from is omitted, trn detects the source language from the input. Longer input is split into chunks, translated concurrently, and written back in the original order. The default concurrency is 4.

The command depends on the language packages installed in macOS. If a required package is supported but missing, trn reports that it should be installed from System Settings.

Why `low` Is the Default

trn has two quality modes:

trn --from en --to ja --quality low "Hello world!"
trn --from en --to ja --quality high "Hello world!"

low is the default. It uses Apple's lower-latency traditional translation models. high uses Apple Intelligence high-fidelity translation when available.

I initially expected high to be the obvious default. After checking English/Japanese translation quality and speed, I changed my mind. The qualitative results were not simply "high is better." In a small hand-checked evaluation, direct English/Japanese translation averages were almost the same:

Direction	`high` average	`low` average
English to Japanese	3.83 / 5	3.84 / 5
Japanese to English	4.35 / 5	4.32 / 5

The details are in the translation quality report. It is not a comprehensive benchmark, but it was useful for making the default choice. low was sometimes better on practical terminology, including some finance and software examples. high was not uniformly better.

Neither mode should be trusted blindly. Proper names, roles, honorifics, and long round trips still produced drift and broken fragments. For publication, legal, financial, education, emergency, or other sensitive text, the output needs human review.

For ordinary English/Japanese translation, though, low was good enough often enough, and the speed difference was too large to ignore.

Speed

I translated the current mac-translate-cli README.md from English to Japanese five times for each quality mode.

Item	Value
Machine	MacBook Pro, Mac16,5
Chip	Apple M4 Max
CPU	16 cores: 12 performance cores and 4 efficiency cores
Memory	128 GB
OS	macOS 26.5
`trn`	0.1.2
Input	`README.md`, 5,591 characters / 5,924 bytes
Command	`trn --from en --to ja --quality <low\|high> < README.md`
Concurrency	default, `4`

Quality	Mean wall time	Input speed	Relative speed
`low`	3.560 s	1,570.6 chars/s	11.91x faster
`high`	42.383 s	131.9 chars/s	1.00x

For this README-sized input, low felt like a normal local utility. high felt like invoking a much heavier model. That is why trn defaults to low, while still keeping high available when I want to compare whether the high-fidelity path helps a specific text.

What This Suggests About the Mac

The interesting part of trn is not the CLI itself. It is that a system-level on-device model can be exposed through a tiny tool and still feel fast enough to use casually.

This does not mean local models replace large cloud models. Remote LLMs are still much stronger for many tasks. But smaller local models are becoming practical, even around the 1B-parameter scale, and Apple can optimize the hardware, OS, model runtime, and application APIs together.

Compared with buying expensive NVIDIA GPUs, a Mac is not cheap, but it is also not an exotic machine. Many developers already use one. If useful local models continue to become available at the OS level, and if normal apps and small tools can call them with low friction, that is a strong direction for the Mac as an edge machine-learning platform.

trn is a small example: a focused local tool, backed by system-level machine learning, that improves everyday UX without requiring a remote service.

OpenAI API-Compatible Access Without Additional API Billing via Codex

Thu, 07 May 2026 00:00:00 GMT

I built a local server that uses Codex credentials and exposes an OpenAI API-compatible interface. In addition to the Responses API and Chat Completions API, the current version also supports /v1/images/generations for image generation, although that support is still not a complete implementation of the official OpenAI API surface.

The main benefit is that, within the scope of a Codex subscription, you can use an OpenAI API-compatible interface without additional OpenAI API usage billing. If your environment is already authenticated with Codex, you can start the compatible server with a single uvx command.

https://github.com/hotchpotch/openai-api-server-via-codex

uvx openai-api-server-via-codex

This starts a local server. By default it listens on http://127.0.0.1:18080, and clients that can call the OpenAI API can use endpoints such as /v1/responses, /v1/chat/completions, and /v1/images/generations. This is convenient when you want to use existing OpenAI SDKs or GUI clients.

For image generation, the server can be called through OpenAI SDK methods such as client.images.generate(...). At the moment, however, responses are returned as base64 b64_json; URL responses, streaming, and image edit APIs are not supported. It is better to think of this as a practical compatibility layer for local experiments and personal tooling, not a complete replacement for the official OpenAI API.

For example, if you point a GUI client that supports the OpenAI API at the local server, it looks like this. The screenshot below uses a model that was available through Codex when the article was first published.

Why Subscription-Scoped Use Matters

OpenAI's official API usage is billed separately from ChatGPT and Codex subscriptions. For example, on the official OpenAI pricing page, gpt-5.5 is listed at 5.00 USD per 1M input tokens and 30.00 USD per 1M output tokens.

This server does not make the official OpenAI API free. It uses Codex credentials and exposes the models and execution environment available through Codex as a local OpenAI API-compatible interface. Within subscription limits and the applicable terms, that can make it possible to experiment with expensive model capabilities from local development tools or personal workflows without additional API usage charges.

This is especially useful when you want to keep using existing OpenAI SDK-compatible tools, or when you are iterating quickly on a local prototype. It should not be used to bypass limits, resell access, provide an API to third parties, or run the backend of a public service.

How It Works

This approach is also used by Pi, one of the agent execution engines used when OpenClaw runs through Codex. It uses Codex authentication information and calls the Codex API endpoint at https://chatgpt.com/backend-api/codex.

I was not sure whether it was acceptable to use this API endpoint directly, but after reading Simon Willison's article, A pelican for GPT-5.5 via the semi-official Codex backdoor API, it seemed likely enough to be usable. "Semi-official" is a fitting description.

Notes and Caveats

This project is not an official OpenAI API. It is only a way to use Codex credentials that you are authorized to use and convert them into an OpenAI API-compatible shape in a local or development environment. Uses such as the following may violate OpenAI's terms:

Using it to bypass ChatGPT or Codex usage limits
Providing it as an API to people other than yourself
Reselling or sharing a subscription
Using it as the backend for a public API or third-party service
Sharing Codex credentials or tokens with someone else

The Codex backend may also change without notice. Request formats or model names that work today may suddenly stop working.

Difference Between the Codex and Claude Code Subscription Positions

In April 2026, Claude Code announced that subscription use through tools such as OpenClaw would be prohibited. In contrast, Codex has started publishing mechanisms such as Codex App Server and Codex SDK, at least for now, to support broader use even with subscriptions. The difference in stance is interesting.

Anthropic seems to be placing strict limits because inference resources for Claude are currently tight. OpenAI, on the other hand, gives the impression of having more room and being comparatively generous. That said, if OpenAI's resources become tight, or if competition weakens, it is entirely possible that the policy could become stricter or subscriptions could change in the future. It is hard to predict what will happen.

Japanese Full-Text Search in SQLite and DuckDB with Vaporetto

Mon, 27 Apr 2026 00:00:00 GMT

With the rise of various agents, embedded databases such as SQLite and DuckDB have become increasingly interesting because they run without a server and persist as local files. I looked into how these databases handle Japanese full-text search. SQLite FTS5 can search with trigrams, but out of the box it does not seem to provide search specialized for Japanese vocabulary.

There are approaches using Lindera, but this time I made extensions that embed Vaporetto, a lightweight and fast tokenizer implemented in Rust.

SQLite + Vaporetto
- https://github.com/hotchpotch/sqlite-vaporetto
DuckDB + Vaporetto
- https://github.com/hotchpotch/duckdb-vaporetto

Vaporetto uses a pointwise prediction method, judging character boundaries with a linear classification model. It can be used with a dictionaryless model, though dictionary-based models also exist, so it can keep the model size small depending on the use case.

I also made a technical demo that runs entirely in the browser, combining DuckDB and Vaporetto to perform full-text search sorted by BM25 relevance score. When the number of target texts is small, the benefits of full-text search, which stays fast as the number of documents grows, and BM25, which considers term frequency and document length, are less visible.

https://duckdb-vaporetto-wasm-demo.surge.sh/

As another example, I tried building article search for this blog, secon.dev, with SQLite + FTS5 + Vaporetto. For about 2,700 articles, BM25 search usually ran in around 3 ms.

This makes casual Japanese full-text search possible with SQLite and DuckDB, so I think it can be useful when the use case fits.

Update: I received a helpful comment from Kudo-san, a leading expert in morphological analysis. For full-text search, word segmentation by pointwise prediction is not well suited because inconsistency can be a problem. That makes sense.

Pointwise word segmentation is not well suited to full-text search. Especially without a dictionary, context-dependent segmentation is unavoidable, which increases the risk of missed matches. I discuss this in my book on morphological analysis.

https://x.com/taku910/status/2048676651517768163

Context dependence means, for example, that the segmentation of the phrase "morphological analysis" is uniquely determined without being affected by surrounding context. It is important that the query segmentation can be reproduced in the document. A unigram language model satisfies this condition. Accuracy is sacrificed, but consistency is guaranteed.

https://x.com/taku910/status/2048695518176665637

Building a Machine Learning PC with Two RTX 5090 GPUs

Mon, 19 Jan 2026 00:00:00 GMT

I like training small Transformer models, usually around 100M parameters or less, and I run training jobs often. I have built and used custom PCs with RTX 3090, 4090, and 5090 GPUs.

[!NOTE] I live in Japan, so some details in this build are specific to Japanese residential environments. For example, standard household wall outlets are usually 100 V and limited to 1500 W, which strongly affects power supply choices for a two-GPU machine.

This time I wanted a little more training speed and also wanted to gain practical knowledge about multi-GPU training, so I built a custom PC with two RTX 5090 GPUs. Recently, NVLink has been removed from consumer GPUs, and power consumption has also increased. There are surprisingly few examples of systems using two RTX 5090s, so I had to research more than expected. This article summarizes the build. The example reflects the situation around the end of 2025.

Power

The first difficult point with two RTX 5090s is power. The RTX 5090 has a maximum TBP of 575 W, and there are two of them. Considering the CPU and everything else, I wanted at least a 1600 W power supply. However, household 100 V outlets in Japan are limited to 1500 W, and from what I could find, ordinary PC power supplies sold for 100 V top out at 1300 W.

There are many 1300 W power supplies, but options above that become extremely limited. Higher-wattage power supplies also use a C19 input connector instead of the common C13 connector. By supplying 200 V power to that connector, output above 1300 W becomes possible.

I therefore had electrical work done to install a NEMA 20 A 250 V wall outlet, choosing Panasonic WF2520B, and made 200 V 20 A, up to 4000 W, available. The power cable needs to be NEMA L6-20P to IEC 60320 C19, so I used a Schneider Electric AP8753J Power Cord, Locking C19 to L6-20P. This outlet is also fed directly from its own breaker.

For the power supply, I chose the 1650 W ASRock Taichi TC-1650T, which seemed to have a good reputation. It supports ATX 3.1, which is relevant for the safety of the 12V-2x6 connectors that supply large amounts of power to the GPUs. This power supply also comes with a cable that can plug into a 100 V C19-C20 power cable. That only supports up to 1300 W, but because this type of cable is hard to find on the market, it is useful for test booting.

Update: Another possible method is to use a case that can install two power supplies and connect two 1300 W units to separate 100 V outlets.

GPU

Because the RTX 5090 produces a lot of heat, most air-cooled models are three to four PCI slots thick. When using two GPUs of that thickness, you often need riser cables to physically separate them. Otherwise they may collide with the case or motherboard and fail to fit.

The main options are:

Use air-cooled models that are three slots or thinner, though the lack of spacing may make heat a concern
Use liquid-cooled AIO models for both GPUs
Use one air-cooled GPU and one AIO liquid-cooled GPU
Use riser cables somehow

I already had an RTX 5090 that was about 3.5 slots thick, so I used one AIO liquid-cooled GPU and one air-cooled GPU. If I had not already owned an RTX 5090, I probably would have used two AIO liquid-cooled GPUs and an air-cooled CPU. That would cost a little more, but it would make internal case layout easier and likely lower GPU temperatures further.

The GPUs I used are:

MSI GeForce RTX 5090 32G VENTUS 3X OC
- Air-cooled, about 3.5 slots thick, which I already owned
MSI GeForce RTX 5090 32G SUPRIM LIQUID SOC
- Slightly over two slots thick, with a 120 x 360 liquid-cooling radiator

If budget allows, another option is RTX 6000 Pro, which uses the same Blackwell architecture as the RTX 5090 and has 96 GB of memory. The RTX PRO 6000 Blackwell Max-Q is also an option. Its performance is somewhat lower, but power consumption is much lower at 300 W. The Max-Q model should also reduce cooling concerns and make installation easier.

Motherboard

The motherboard requirements were that it could run two GPUs at PCIe 5.0 x8, and that there was enough spacing between GPU 1, the liquid-cooled card in the upper slot, and GPU 2, the air-cooled card in the lower slot. I chose the ASUS ProArt X870E-CREATOR WiFi AMD AM5 X870E ATX, partly because I found examples of it being sold overseas in prebuilt RTX 5090 x2 PCs.

It has an onboard Wi-Fi 7 chip, but there does not currently seem to be a Linux kernel driver for it. If you plan to connect with onboard Wi-Fi, that may matter. In my use case I do not use wireless and connect over wired LAN, so it has not been a problem.

Case

I needed a case that would leave a reasonable amount of space when a 3.5-slot-thick GPU was installed in the lower slot, and that could install two AIO radiators, one for the CPU and one for GPU 1. I chose the CORSAIR 7000D AIRFLOW. It is larger than a normal case, but the larger internal space is a clear cooling advantage. I did not need a glass side panel to see inside the PC, but after building it I found it looked good and I am satisfied with it.

Airflow

When the system can consume up to around 1650 W inside the case, the generated heat is substantial. Air must circulate in a reasonable way.

Because the CPU and GPU 1 use AIO liquid cooling and GPU 2 uses air cooling, I needed to think about how to bring in and exhaust air. PC cooling fans can be switched between intake and exhaust by flipping them around. After discussing options with AI, I used the airflow below. I am not an airflow expert, so there may be a better layout.

Front intake
- Two 140 mm fans included with the case; ideally I should add one more 140 mm fan
- Positioned to hit GPU 2, the air-cooled GPU
Side intake
- GPU 1 liquid cooler, 120 mm x 3
Top exhaust
- CPU liquid cooler, 120 mm x 3
Rear exhaust
- One 140 mm fan included with the case

This was the part where I had the hardest time finding information. The remaining parts are mostly a matter of preference, but I will describe them with comments from the perspective of a machine learning PC.

CPU

I used the AMD Ryzen 9 9950X, with 16 cores and 32 threads. The 9950X3D was also available, but since I do not use this machine for games, the performance difference seemed marginal, and the 9950X was about 20,000 yen cheaper. Data processing is often parallel, so more CPU cores are useful, but going beyond this would mean Threadripper. I chose 16 cores.

RAM

I considered installing the maximum 192 GB, but due to the rapid increase in memory demand from AI-related data centers, prices were staying about four to five times higher than in September 2025. That was too expensive, so I used DDR5-5600 32 GB x 2, for 64 GB. I wanted ECC, but that was also too expensive. In my use case, 64 GB occasionally touches swap, but because the swap is on a fast NVMe drive, it rarely causes real problems. More RAM would be nice, but 64 GB has mostly been enough.

This time I bought DDR5 5600 MHz 32 GB x 2 from a Chinese brand called Acclamator, which was selling for about 60% of the price of other brands with the same capacity. It seems the price has gone up since then. I ran memtest86 and stresstest-cli at 5600 MHz for about 12 hours and saw no errors. I do not yet know about long-term durability or summer heat, since it is currently winter and cold. RAM speed has almost no effect during GPU training, so I lowered it to 4800 MHz for stability. There are cases where RAM speed matters, such as CPU offload during inference, but I do not plan to use it that way.

Update: I eventually felt the lack of memory and added another 32 GB x 2, for a total of 128 GB.

Storage: NVMe

Training data can involve random access if handled casually, because the data is often shuffled. For example, Hugging Face Transformers shuffles data by default during training. For that reason, a large NVMe SSD is useful. More capacity is better.

Sandisk SN850X NVMe SSD WDS800T2X0E 8TB
- CPU-connected PCIe lanes. Even 8 TB is not enough, and I use it while deleting data, so I would like more capacity.
Samsung 980 Pro 2TB
- Added because I had one spare
- Chipset-shared lanes

Storage: HDD

I use a 14 TB HDD as a temporary location for raw downloaded data. It is too slow for workloads with random access, but it works for this purpose. In practice, the Hugging Face datasets library first downloads data to the directory specified by HF_HUB_CACHE, but when the library loads it, Parquet files are converted to Arrow format. As long as the latter can be accessed from NVMe, I can set only HF_HUB_CACHE to the HDD and separate the roles.

TOSHIBA MG07ACA14TE 14TB

CPU Cooler

I did not have a strong preference as long as it was a 120 x 3 radiator AIO, so I used the CORSAIR NAUTILUS 360 RS LCD. I bought it because the LCD on the CPU cooler could display CPU temperature, which seemed nice. After buying it, I realized the display is controlled over USB, making it difficult to control from Linux. There are OSS options, but showing temperature quickly did not seem straightforward. If I were buying now, I would probably choose a model without the LCD.

Assembly

Other than the case, power supply, and air-cooled GPU being heavy enough to cause muscle soreness, and my own repeated mistakes with fan orientation and radiator orientation, the build was straightforward. It booted on the first try and has been running without problems.

OS

I used Ubuntu Server 24 LTS, which I am used to. I only connect over SSH and do not use a GUI at all.

Impressions After Building an RTX 5090 x2 PC

It has been about a month since I built it. Perhaps because it is winter, even when both GPUs are fully used, it has been stable without particular problems. One good point is that when PCIe is not the bottleneck, for example when training a bi-encoder model with MLM, training speed is about 1.8 times faster than with one RTX 5090. Inference is also convenient when horizontal processing is possible. For example, processing 10 million records with Qwen3-8B on vLLM can run at almost twice the speed.

CUDA makes it easy to switch which GPU a program can see with the CUDA_VISIBLE_DEVICES environment variable. If I want to use GPU 2, I can set CUDA_VISIBLE_DEVICES=1, and the program recognizes it as a single GPU without any code changes. This makes it easy to switch between GPUs.

It has also been useful for learning about multi-GPU systems. Until now I had only used one GPU, so I was able to learn methods and ways of thinking about training and inference in a multi-GPU environment.

On the other hand, PCIe 5.0 x8 speed often feels like a bottleneck. For example, PyTorch DDP performs All-Reduce to synchronize data between GPUs at every training step, and depending on the training method that can take a long time. Large-batch contrastive learning is one example. GPU SM idle time can increase substantially, and the speedup may be only around 1.2x. In some cases one GPU can even be faster.

With datacenter GPUs such as B200 and H200, NVLink can provide hundreds of GB/s to TB/s between GPUs depending on the configuration. PCIe 5.0 x8 has an effective speed of about 20-30 GB/s, so it is much slower than NVLink. Expensive GPUs are well designed for a reason. A machine with eight B200s might cost around 80 million yen.

Overall, I am very satisfied with the build. The timing of buying parts was also relatively good. Memory was already expensive, but by mid-January 2026, storage, memory, and RTX 5090 GPUs had become even more expensive. AI demand and the weak yen have made many things costly.

Looking Back on 2025

Wed, 31 Dec 2025 00:00:00 GMT

Life

Child

Our child was born. People often say that the birth of a child is the biggest change in life, and now that I am living through it, I think that is exactly right. My way of thinking has shifted so that everything starts from the child. Also, the child is cute. Very cute. I feel fortunate to be able to raise a child. Working remotely also means I can see the child's face often, which I am grateful for.

Before the birth, my wife became unwell because of a chronic condition and was hospitalized for a long period. Even after the birth, we were quite anxious about how things would go. Fortunately, her recovery has been good, and she is now able to live daily life without problems. Health is something to be grateful for. I am mostly fine too, apart from frozen shoulder.

House

The house we built last year, and started living in at the end of last year, is very comfortable. Winter is reasonably cold, and because the area around us is farmland, there are many insects in summer. Even with those tradeoffs, it is very easy to live here. Now that our family has grown, I am glad we built it.

Car

We welcomed a Tesla Model Y Juniper Long Range. The comfort of a BEV without an engine, quietness and quick acceleration, is good in itself, but the software and UX are also very well made. Most cars extend the UX of past cars, but a newer company does not need to inherit all of that history, so it can design a different experience. I feel that clearly. Car travel, which is almost all travel in the countryside, has become much easier and more comfortable.

The current Autopilot driver assistance is already quite comfortable, and I expect FSD, or practically speaking a good automated driving assistance system, will become available in Japan in the future. I am looking forward to that too.

Technology

I continued working mainly around information retrieval, along with various technical work and product development. This was the year coding agents rose quickly, and almost all of my technical work is now helped by AI. The number of available moves has increased, and I feel the range of what I can do has expanded significantly. This is not limited to ordinary software development. For example, in information retrieval model development, AI can handle a large part of the work. Tasks that previously required too much manual effort can now often be approached in a more experimental way, which is very useful.

I think it is important, especially when the right product direction is unclear, to work on technical areas that AI cannot or does not easily replace. I have been able to work on those technical areas, so it has been enjoyable. Of course, I may be biased, and some of those areas may turn out to be replaced more easily than I expect.

At work, the AI product I had mainly been building was released, and it seems to have received a certain level of recognition both inside the company and in the broader market. I am grateful for that. It was also the result of a team environment that made development easy, so thank you to everyone involved. Next year I want to keep relying heavily on the team and on AI, while continuing technical work and building new products.

A lot happened in 2025, but the main event was that our child was born. As always, I am grateful to my wife, who supports me in many ways. With that, I hope to see you all in 2026.

OpenProvence: A Model for Removing Irrelevant Sentences Before Passing Text to an LLM

Fri, 31 Oct 2025 00:00:00 GMT

Recently, systems often recursively search, gather information, create additional search queries from multiple angles to fill missing knowledge, and extract only the necessary information from the results in order to build "good knowledge" for an LLM to answer with. In 2025 trends such as AI agents, Deep Research, and context engineering, this kind of search often happens behind the scenes. The ability to retrieve useful information can be a key part of the system.

However, when a system searches a lot, the amount of search-result text also grows. That can make it harder for the LLM to extract the truly necessary information, increase hallucinations, slow processing because of larger inputs, and increase LLM usage cost.

Provence is an approach for deleting irrelevant information before passing search results to an LLM, while also assigning a relevance score. It can remove irrelevant parts from documents returned by search. When I measured the model's performance, an evaluation using long-form question-answer datasets, MLDR plus LLM evaluation, showed that it could remove about 80-95% of the text. In other words, a 10,000-character text can be reduced to roughly 500-2,000 characters before being passed to the LLM. That can substantially reduce input size. Even for datasets made of many shorter sentence-like chunks, depending on the domain, it removed about 30-70% of sentences.

The Provence implementation and models published for research are non-commercial, and no Japanese dataset was available. I therefore created a project called OpenProvence and published training and inference code, model weights, and related artifacts under open licenses. I also created and published Japanese datasets. Many of the datasets themselves are not under open licenses because they inherit the original source licenses.

Trying OpenProvence

I prepared a Hugging Face Spaces demo that runs on CPU:

🤗 https://huggingface.co/spaces/hotchpotch/open_provence_demo

For example, if you use the sample Wikipedia page about information retrieval and run sentence pruning with the query "What is vector search?", the article of about 5,000 Japanese characters is reduced to about 400 characters, leaving only the information about vector search.

You can also run the demo locally with the following steps. On a recent MacBook, inference should be fairly fast.

git clone https://huggingface.co/spaces/hotchpotch/open_provence_demo
cd open_provence_demo
uv sync
uv run python app.py

Using It from Python

From Python, it can be used as follows. The small xsmall model can run on CPU. In a GPU environment, NVIDIA with FlashAttention 2, inference should finish almost immediately and prune the text. I think it is fast enough to be integrated into a production search environment.

from transformers import AutoModel

# Change this to the model you want to use.
model_name = "hotchpotch/open-provence-reranker-xsmall-v1"
provence = AutoModel.from_pretrained(model_name, trust_remote_code=True)

question:str = "日本の首都について"
context:str = """
今日は学校に行き、さまざまなことを学んだり、友達と学食でたらふく食べた。
日本の首都は東京で、東京は日本の政治、経済、文化の中心地らしい。この都市は約1,400万人の人口を抱える世界有数の大都市らしい。
夜は飲み会に誘われたが、参加せずに帰宅した、今月そんなにお金が残ってないからなぁ、残念だ。
"""

result = provence.process(question, context, threshold=0.1)
print(f"Reranking Score: {result['reranking_score']:.4f}")
print(f"Compression Rate: {result['compression_rate']:.1f}%")
print(f"Pruned Context:\n{result['pruned_context']}")

# Output example:
# Reranking Score: 0.7043
# Compression Rate: 62.5%
# Pruned Context:
# 日本の首都は東京で、東京は日本の政治、経済、文化の中心地らしい。
# この都市は約1,400万人の人口を抱える世界有数の大都市らしい。

Using Coding Agents

For OpenProvence, I set a constraint that I would not write a single line of code myself. All implementation work, including inference, training model code, evaluation code, and dataset creation code, was done by coding agents such as Claude Code and Codex. I still had to give many correction instructions, but for a side project progressed in spare time, I think the result is quite good. Looking at the final code, it could probably be simpler. At the same time, with current LLMs, this amount of explicit and somewhat verbose code may be easier for the model to understand and modify.

The term vibe engineering has emerged for building production-quality software in collaboration with AI by continuing to provide suitable instructions, development guidelines, and an environment where AI can develop and improve the project itself.

In addition to the usual software development practices used with coding agents, such as development guidelines, unit tests, CI, and code review, I found that machine learning model projects of a certain size can also be developed this way by preparing a small baseline that can train quickly, evaluation data whose accidental changes would indicate bugs, and detailed explanations of the datasets.

Closing

An approach like OpenProvence, which removes text unrelated to the question, should work especially well for products that process very large documents.

RAG was a major topic in 2024, and in 2025 trends such as AI agents, Deep Research, and context engineering have made this area even more important. I am impressed by, and grateful for, the foresight of the Provence team at Naver Labs Europe, who worked on an important technical point early. Provence was published in January 2025.

For products using LLMs, information retrieval behind the scenes can increase value, and retrieval technology remains very interesting. I hope this project is useful in products or research.

Evaluating the Japanese Performance of Embedding Gemma 300M with JMTEB

Thu, 18 Sep 2025 00:00:00 GMT

Google recently released the embedding model google/embeddinggemma-300m. It performs quite well on MTEB Multilingual v2, so I benchmarked it with JMTEB v1 to properly measure its Japanese performance.

The short version is that, for Japanese, Embedding Gemma performed very poorly in my measurement.

JMTEB v1 Benchmark

Model	Params	Avg	Retrieval	STS	Classification	Reranking	Clustering	PairClass
google/embeddinggemma-300m	308M	58.10	42.18	73.36	63.23	91.55	45.87	62.42
intfloat/multilingual-e5-small	118M	69.52	67.27	80.07	67.62	93.03	46.91	62.19
intfloat/multilingual-e5-large	560M	71.65	70.98	79.70	72.89	92.96	51.24	62.15
cl-nagoya/ruri-v3-30m	37M	74.51	78.08	82.48	74.80	93.00	52.12	62.40
cl-nagoya/ruri-v3-310m	315M	77.24	81.89	81.22	78.66	93.43	55.69	62.60

Note: This is the micro average, simple average, across the 16 Japanese tasks in JMTEB v1.

The JMTEB configuration is here, and prefixes and similar settings should be applied. The result JSON, summary.json, is in this gist. The reproduction steps are also in the gist. If my measurement is wrong, please let me know.

Update, 2025-10-03: Because of a Transformers bug, the latest version reportedly improves performance to around the level of ruri-base. Thank you to LM8 (@ShengzheLi) for the information.

JQaRA / JaCWIR

Because the JMTEB v1 score was very low, I also evaluated it separately on JQaRA and JaCWIR. The results were again quite low.

Model	JQaRA (nDCG@10)	JQaRA (MRR@10)	JaCWIR (MAP@10)	JaCWIR (HIT_RATE@10)
google/embeddinggemma-300m	0.261	0.457	0.730	0.904
intfloat/multilingual-e5-small	0.492	0.729	0.869	0.970
intfloat/multilingual-e5-large	0.554	0.799	0.876	0.973

Strong MTEB Does Not Necessarily Mean Strong Japanese Performance

This was also true for Qwen3 Embedding, which I evaluated recently in Evaluating the Japanese Performance of Qwen3 Embedding with JMTEB. Recent multilingual embedding models with high MTEB scores often have weak Japanese performance. Looking at the Language-specific Japanese section of the MTEB leaderboard, both Qwen3 Embeddings and Embedding Gemma only show Pair Classification for Japanese, so it is not very informative. That makes the meaning of multilingual performance somewhat unclear.

Both Qwen3 Embedding and Embedding Gemma are based on decoder-model architectures. Looking inside embeddinggemma-300m, it uses an embedding head, pooling plus two dense layers, with mean pooling.

For decoder-based models with small parameter counts, at least in Japanese, the performance was much lower than other encoder-based multilingual models. It is unclear whether this is because the models were barely trained on Japanese embedding tasks, or because the original small decoder model has weak Japanese generalization.

JFWIR: A Large Japanese Information Retrieval Dataset Built from Japanese FineWeb

Thu, 19 Jun 2025 00:00:00 GMT

In Japanese information retrieval, many datasets have historically been built around Wikipedia. Real web text, however, is not limited to the clean, well-formatted writing found in Wikipedia. It includes blogs, news, forums, diverse writing styles, and noise.

JFWIR (Japanese FineWeb Information Retrieval) is a large dataset of about 64 million Japanese document-query pairs created to address that gap. It is based on fineweb-2-edu-japanese, a web-crawl dataset containing high-quality educational Japanese content.

https://huggingface.co/datasets/hotchpotch/JFWIR

Characteristics of JFWIR

1. Large and Diverse

JFWIR has the following characteristics:

More than 64 million document-query pairs: seven different query types are generated for each document: keywords, synonym_keywords, query, alt_query, title, faq, and summary
Real web text: educationally valuable web content beyond Wikipedia
Hard negatives: similar but incorrect documents for effective training

2. Benchmark Results

I evaluated reranking models trained with JFWIR on major Japanese information retrieval benchmarks:

Benchmark	Without JFWIR	With 10M JFWIR records
JQaRA	0.7621	0.7633
MIRACL(ja)	0.8332	0.8385
jsquad	0.9801	0.9821
JaCWIR	0.9339	0.9586

The improvement on JaCWIR, which targets web text, was especially clear: 0.9339 to 0.9586.

Usage

JFWIR can be used easily from Hugging Face Datasets:

from datasets import load_dataset

# Load the main dataset.
train_ds = load_dataset("hotchpotch/JFWIR", split="train", name="small_tokens_cleaned")

# Inspect sample data.
for i in range(3):
    sample = train_ds[i]
    print(f"Query: {sample['query']}")
    print(f"Document: {sample['text'][:100]}...")

# Load the hard-negative dataset.
hard_negatives_ds = load_dataset("hotchpotch/JFWIR", split="train", name="hard_negatives")

# Example hard-negative usage.
for i in range(3):
    hn_sample = hard_negatives_ds[i]
    pos_id = hn_sample['pos_id']
    pos_doc = train_ds[pos_id]
    
    print(f"Query: {pos_doc['query']}")
    print(f"Positive (score: {hn_sample['pos_score']:.3f}): {pos_doc['text'][:100]}...")
    
    # Sort negative documents by score.
    neg_pairs = list(zip(hn_sample['neg_ids'], hn_sample['neg_scores']))
    neg_pairs.sort(key=lambda x: x[1])
    
    print("Negatives (lowest scores):")
    for neg_id, score in neg_pairs[:2]:
        print(f"  Score {score:.3f}: {train_ds[neg_id]['text'][:80]}...")

Dataset Creation Process

1. Collecting High-Quality Japanese Web Text

First, I extracted educationally valuable Japanese content from FineWeb-2 to create fineweb-2-edu-japanese. I then created the small_tokens_cleaned subset by removing web-specific noise and adjusting the text length.

2. Generating Diverse Queries

To generate queries for 64 million records, I used the lightweight query generation model query-crafter-japanese. To increase diversity, I combined three models:

By generating seven query types for each document, keywords, synonym_keywords, query, alt_query, title, faq, and summary, the dataset can support a wider range of retrieval needs.

3. Creating Hard Negatives

To improve retrieval model performance, I also created a dataset containing hard negatives, documents that are similar to the query but not correct:

Similar document retrieval with an embedding model: I vectorized 64 million documents with ruri-v3-30m and retrieved similar documents for each document.
Selecting suitable negatives: I randomly sampled from similarity ranks top 10-50 and top 50-200.
Assigning reranker scores: I scored documents with japanese-reranker-xsmall-v2. For example, by excluding positives that are unsuitable, such as score<0.6, and negatives that are unsuitable, such as score>0.4, you can select more appropriate positive and negative examples.

Future Work

JFWIR is published to contribute to Japanese information retrieval. However, query-crafter-japanese currently generates relatively simple queries from text. I think retrieval accuracy can improve further by creating more diverse and valuable questions.

Summary

JFWIR takes a different approach from previous Japanese IR datasets that were heavily biased toward Wikipedia. It targets real web text and includes about 64 million records, seven query types, and hard negatives for contrastive learning. These elements should be useful for developing information retrieval systems.

The dataset is published on Hugging Face and can be used freely under the ODC-By license. I hope it contributes, even a little, to the development of Japanese information retrieval.

License

This dataset is released under the Open Data Commons Attribution License (ODC-By) v1.0, the same as the original FineWeb2. The Common Crawl terms of use also apply.

Citation Information

If you use the JFWIR dataset in research or development, please use the following citation:

@misc{tateno2025jfwir,
  author = {Yuichi Tateno},
  title = {JFWIR: Japanese FineWeb Information Retrieval Dataset},
  year = {2025},
  url = {https://huggingface.co/datasets/hotchpotch/JFWIR},
  note = {A large-scale Japanese information retrieval dataset with 60+ million document-query pairs}
}

Evaluating the Japanese Performance of Qwen3 Embedding with JMTEB

Wed, 11 Jun 2025 00:00:00 GMT

The open-weight, high-performance multilingual embedding and reranker series Qwen3 Embedding has been released. It includes 8B, 4B, and 0.6B model sizes, performs well for text embeddings and reranking, and currently ranks at the top of the Multilingual MTEB leaderboard.

However, multilingual models often do not place much emphasis on Japanese, so I measured Qwen3-Embedding-0.6B with JMTEB: Japanese Massive Text Embedding Benchmark. jsick and jsts errored, so STS tasks are excluded.

JMTEB Results

Model	Retrieval	STS	Classification	Reranking	Clustering	PairClassification
Qwen3-Embedding-0.6B	72.81	--	66.09	93.10	48.84	62.42
ruri-v3-310m	81.89	81.22	78.66	93.43	55.69	62.60
ruri-v3-130m	81.89	79.25	77.16	93.31	55.36	62.26
ruri-v3-70m	79.96	79.82	76.97	93.27	52.70	61.75
PLaMo-Embedding-1B	79.94	83.14	77.20	93.57	53.47	62.37
ruri-v3-30m	78.08	82.48	74.80	93.00	52.12	62.40
sbintuitions/sarashina-embedding-v1-1b	77.61	82.71	78.37	93.74	53.86	62.00
jinaai/jina-embeddings-v3	75.22	80.05	76.39	92.71	51.46	62.37
OpenAI/text-embedding-3-large	74.48	82.52	77.58	93.58	53.32	62.35
pkshatech/GLuCoSE-base-ja-v2	73.36	82.96	74.21	93.01	48.65	62.37
pkshatech/RoSEtta-base-ja	73.21	81.39	72.41	92.69	53.23	61.74
intfloat/multilingual-e5-large	70.98	79.70	72.89	92.96	51.24	62.15
OpenAI/text-embedding-3-small	66.39	79.46	73.06	92.92	51.06	62.27

These are the results. Perhaps because Japanese tasks were not trained heavily, the Japanese results were not strong. The ruri-v3 series is smaller and clearly much stronger for Japanese.

For Retrieval and Reranking tasks, I added the query prefix Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:.

The JMTEB configuration, summary JSON, and execution commands used for this measurement are available here. The Qwen3-Embedding-0.6B score feels low, so if I made a mistake, please let me know.

https://gist.github.com/hotchpotch/f6be186010e70d6eb6e46447cea258f9

Extra: Reading the Qwen3 Embedding Paper

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models was published, so I read it briefly. I found the synthetic dataset creation process especially interesting.

These are notes from the parts that interested me:

It does not convert a decoder to an encoder like LLM2Vec; it uses causal attention as-is.
The embedding model obtains the final embedding from the hidden state of the final layer's [EOS] token.
- Queries are built as Instruction + Query. Documents are used as-is.
- The score improves on InfoNCE rather than using simple contrastive learning, including multiple hard negatives and adjustments for false negatives by tuning positive and negative similarity.
Reranking uses the chat template directly and computes a relevance score from the probabilities of the "yes" and "no" tokens.
- It applies the usual decoder-model label-classification approach, looking at the probability of the target label token.
- It can be trained with SFT.
In the first stage, training uses a synthetic dataset created with Qwen3-32B.
- It creates four types: information retrieval, bitext mining, semantic similarity, and classification.
- For the information retrieval synthetic dataset, it creates detailed settings and generates queries from documents in the Qwen3 pretraining corpus.
In the second stage, training uses 7 million existing datasets such as MS MARCO and MIRACL, plus 12 million records filtered by cosine similarity from the first stage.
Finally, it uses model merging with diversity in mind.
- The details are not written, so this is an inference, but multiple second-stage checkpoints could include task-specialized checkpoints or checkpoints focused on particular languages.
- If you have many checkpoints, model merging seems worth trying. Even with limited compute, you can often observe benchmark improvements by merging checkpoints and evaluating the result.

Releasing Small, Fast, and Practical Japanese Rerankers: tiny, xsmall, small, and base v2

Thu, 08 May 2025 00:00:00 GMT

I released very small Japanese reranker models, japanese-reranker-tiny-v2 and japanese-reranker-xsmall-v2. In information retrieval systems, rerankers improve the precision of search results, but model size and compute cost are practical challenges.

🆕 Update, 2025-07-10: I also added moderately small rerankers, japanese-reranker-small-v2 and japanese-reranker-base-v2.

These models are built with minimal layer counts and parameter counts, and they run at practical speed even on CPU and Apple silicon. This makes it possible to improve RAG system accuracy without expensive GPU resources, and should make them useful for edge deployment and production environments that require low latency. In evaluation, they achieve competitive scores even compared with larger models.

What Rerankers Are, and Why Small Rerankers Matter

A reranker is a model that evaluates the relevance between a question, or query, and documents, then reorders the documents by relevance. Its strength is that it can evaluate complex relationships that ordinary embedding search may miss. In particular, CrossEncoder architectures take the query and document as one input pair, allowing finer-grained nuance and contextual understanding.

Releasing High-Performance Japanese Rerankers, and What Rerankers Are

Small rerankers matter for several reasons. First, a reranker must evaluate every combination of a query and candidate document. Reranking 100 candidate documents requires 100 model inferences. Smaller models therefore directly improve throughput and reduce latency.

Small models can also run in resource-limited environments. They can run at realistic speed on CPU-only environments, edge devices, and mobile devices, improving the practicality of RAG systems. In server environments, they also reduce GPU memory usage and make it easier to share GPU resources, improving cost efficiency.

Ask! NIKKEI RAG Search Technology Deep Dive

Small rerankers therefore provide important benefits in speed, cost, and resource efficiency, and can play a useful role in practical RAG systems.

Benchmark Performance

The benchmark results are below. Considering their model size, the tiny and xsmall v2 models perform quite well. Among larger models, ruri-v3-reranker-310m is clearly strong. The fact that these high-performing models are based on ModernBERT likely contributes to the improvement.

Japanese models have learned the tendencies of JQaRA, a quiz-style dataset, which puts bge-reranker-v2-m3 at a disadvantage. This is also an example of how much a reranker score can improve when the domain task is learned appropriately.

Model name	avg	JQaRA	JaCWIR	MIRACL	JSQuAD
japanese-reranker-tiny-v2	0.8138	0.6455	0.9287	0.7201	0.9608
japanese-reranker-xsmall-v2	0.8699	0.7403	0.9409	0.8206	0.9776
japanese-reranker-small-v2	0.8856	0.7633	0.9586	0.8385	0.9821
japanese-reranker-base-v2	0.8930	0.7845	0.9603	0.8425	0.9845
japanese-reranker-cross-encoder-xsmall-v1	0.8131	0.6136	0.9376	0.7411	0.9602
japanese-reranker-cross-encoder-small-v1	0.8254	0.6247	0.9390	0.7776	0.9604
japanese-reranker-cross-encoder-base-v1	0.8484	0.6711	0.9337	0.8180	0.9708
japanese-reranker-cross-encoder-large-v1	0.8661	0.7099	0.9364	0.8406	0.9773
japanese-bge-reranker-v2-m3-v1	0.8584	0.6918	0.9372	0.8423	0.9624
bge-reranker-v2-m3	0.8512	0.6730	0.9343	0.8374	0.9599
ruri-v3-reranker-310m	0.9171	0.8688	0.9506	0.8670	0.9820

Inference Speed

The table below shows inference time for reranking about 150,000 pairs with the Hugging Face Transformers library. Tokenization time is excluded, so this is pure model inference time. I used an M4 Max for MPS and CPU measurements, an RTX 5090 for GPU, and FlashAttention 2 for ModernBERT-family models on GPU.

japanese-reranker-tiny-v2 and xsmall-v2 are clearly fast. ruri-v3-reranker-310m is also fast for its size, likely because FlashAttention 2 is effective. Other models can also use FlashAttention 2 through tools such as text-embeddings-inference, and may run faster than in this evaluation.

Model name	Layers	Hidden size	Speed (GPU)	Speed (MPS)	Speed (CPU)
japanese-reranker-tiny-v2	3	256	2.1s	82s	702s
japanese-reranker-xsmall-v2	10	256	6.5s	303s	2300s
japanese-reranker-small-v2	13	384	15.2s
japanese-reranker-base-v2	19	512	32.5s
japanese-reranker-cross-encoder-xsmall-v1	6	384	20.5s
japanese-reranker-cross-encoder-small-v1	12	384	40.3s
japanese-reranker-cross-encoder-base-v1	12	768	96.8s
japanese-reranker-cross-encoder-large-v1	24	1024	312.2s
japanese-bge-reranker-v2-m3-v1	24	1024	310.6s
bge-reranker-v2-m3	24	1024	310.7s
ruri-v3-reranker-310m	25	768	81.4s

The benchmark script is here.

I also publish models converted to ONNX for CPU use, so with ONNX and ARM quantized models, they should be usable even in edge environments such as Raspberry Pi.

Short Technical Report

The training data for japanese-reranker-tiny-v2, xsmall-v2, small-v2, and base-v2 is based on the dataset used to train hotchpotch/japanese-splade-v2, plus hard negatives and some additional private data. The large improvement over v1 likely comes from using ruri-v3-pt-30m, a ModernBERT-based model pretrained for the target task, using several times more data than v1, and extracting higher-quality data with hard negatives, including filtering positives and negatives with scores from various rerankers.

For the Tiny model's parameter extraction source, I evaluated sbintuitions/modernbert-ja-30m and cl-nagoya/ruri-v3-pt-30m. ModernBERT alternates global attention and local attention layers. For example, modernbert-ja-30m has 10 layers, where [0,3,6,9] are global attention layers and the others are local attention layers.

At first I expected all global attention layers to work best, but including layers 3, 6, and 9 generally made results worse. Including layers close to the output also made results worse. The table below shows reranking evaluation results for models trained on the same dataset. Results including layers close to the output, such as 6 and 9, were much worse and training was stopped early, so they are not included. Layer 0 alone did not produce useful performance.

name	JQaRA	miracl	jsquad	JaCWIR
modernbert-ja-30m + full layers	0.7261	0.8095	0.9752	0.9420
modernbert-ja-30m + layer 0,2,4	0.6455	0.7185	0.9588	0.9265
modernbert-ja-30m + layer 0,2	0.6171	0.6784	0.9516	0.9155
modernbert-ja-30m + layer 0	0.2515	0.4416	0.3172	0.0738
ruri-v3-pt-30m + full layers (= xsmall-v2)	0.7403	0.8206	0.9776	0.9409
ruri-v3-pt-30m + layer 0,2,4 (= tiny-v2)	0.6455	0.7201	0.9608	0.9287
ruri-v3-pt-30m + layer 0,1,3	0.6405	0.7124	0.9552	0.9211
ruri-v3-pt-30m + layer 0,3	0.6177	0.6619	0.9482	0.9076

From these results, I published ruri-v3-pt-30m as xsmall, and ruri-v3-pt-30m + layer 0,2,4 as tiny. small-v2 and base-v2 are based on ruri-v3-pt-70m and ruri-v3-pt-130m, respectively. Model merging slightly improves performance, but I did not use it this time.

Closing

This article introduced the small, lightweight, and practical Japanese reranker models japanese-reranker-tiny-v2, japanese-reranker-xsmall-v2, japanese-reranker-small-v2, and japanese-reranker-base-v2. The tiny and xsmall models run at practical speed on CPU and Apple silicon, and can improve search accuracy for local RAG systems without requiring expensive GPU resources. Running them on GPU also enables fast responses.

Recent high-performance encoder models such as ModernBERT make it easier to build practical models with stronger performance. I hope this article contributes to the further development of Japanese language processing technology.

query-crafter-japanese: A Model for Generating Queries for Information Retrieval

Wed, 07 May 2025 00:00:00 GMT

Training neural retrieval models such as vector search models and rerankers requires pairs of questions and answer documents. The answer document can be almost any text, though higher-quality text and domain-specific data naturally help produce better models. For training, however, we also need questions that are related to those answers. Recent LLMs have improved substantially, and we can use them to generate questions automatically from answer documents, then use those pairs for training. Datasets created this way are often called synthetic datasets.

However, when you want to create and publish a synthetic dataset broadly, commercial LLMs such as OpenAI and Gemini can create licensing issues because of their terms of use. Processing a large number of documents also takes significant time and cost.

For that reason, I created and released query-crafter-japanese, a family of small 1.7B to 4B models that run quickly, can generate retrieval questions at roughly the same level as questions generated by DeepSeek-R1, and do not impose restrictions on output licensing. The models are released under the Apache 2.0 license.

query-crafter-japanese-Qwen3-1.7B
- Recommended for speed and performance
query-crafter-japanese-Qwen3-4B
query-crafter-japanese-sarashina2.2-3b-instruct-v0.1

query-crafter can generate seven categories:

keywords: space-separated keywords
synonym_keywords: distinctive keywords using synonyms
query: a question based on the content of the text
alt_query: a question phrased in a way that does not match well with BM25
title: a title representing the whole text
faq: a question when treating the text as an FAQ answer
summary: a short summary of the text

Let's generate each category from the following text.

query-crafter-japanese-example.py

In the evening, we had the results presentation for a development retreat. Everyone except me worked on proper AI-related themes, and the quality was high. It was interesting. Person I is not even an engineer, but they made a Figma plugin and deployed it to Vercel, mostly written by Cursor. It was a close-up example of how AI can greatly expand what someone can do. I did not work on a particular theme. Instead, because I had never tried vibe coding, I tested how far I could build something in Cursor without touching or reading the code.

I made a tool that summarizes these yearly diary entries and posts them to Discord, adding new features based on a specification I had written before. I also made a tool that automatically gives titles to diary entries that do not have one. Vibe coding worked about as I expected. It is convenient.

Because I developed in a black-box way without looking at the code and only checked the output artifacts, the generated code was not production-ready when I looked at it later. Still, it was enough for quick one-off tools. I only need to give the specification, and I also make sure the specification is updated along the way. If I want to change a feature, I only need to change or add to the specification, which is easy.

Here is the result of generating queries by category with query-crafter-japanese-Qwen3-1.7B. keywords, query, title, and summary show clear differences. synonym_keywords is not always a perfect synonym, and alt_query and faq may sometimes be close to query.

keywords: Vibe Cording ブラックボックス開発 仕様変更
synonym_keywords: AI活用開発プロジェクト 発表会 仕様変更追加
query: 開発合宿で作成したツールの具体的な機能は？
alt_query: 開発者向けツール開発でコード見ない開発手法の利点は？
title: AI活用で拓く開発の新領域：Vibe Cordingとブラックボックス開発の可能性
faq: 開発合宿で実現した新機能や成果は？
summary: AI活用の開発成果発表会で、Vibe Cordingや日記ツール開発、コード見ずに開発を実施

The model is also fast. In a vLLM + RTX 5090 environment, it runs at about 48,000 toks/s for input tokens and 2,200 toks/s for output tokens. If you generate 10,000 questions from 10,000 texts of around 1,000 Japanese characters each, it takes a little under 100 seconds. Even if there were 100 million target documents, processing all of them would take about 140 hours.

For comparison, when I processed 100,000 documents with DeepSeek-R1 during the nighttime discount window, with input at 0.135 USD per 1M tokens and output at 0.55 USD per 1M tokens, using 100 parallel API requests took about 7 hours and cost around 40 USD. Processing 100 million documents with the DeepSeek-R1 API would cost around 40,000 USD and take about 7,000 hours. In practice it would take longer if you try to use only nighttime discount periods, and the maximum parallel request count also depends on DeepSeek's available resources.

In this way, query-crafter has large advantages in both speed and cost when you want to generate questions from a large number of documents.

Training query-crafter-japanese

For training, I used DeepSeek-R1, which does not restrict output use, to create supervised question data as a synthetic dataset from fineweb-2-edu-japanese.

For example, for title, I used an instruction like: "Think of and create a title that represents the whole text well. Output the title within 30 Japanese characters. The output must be strict JSON in the form {"query": "title"}. Do not output anything else."

https://huggingface.co/datasets/hotchpotch/japanese-query-crafter-reasoning-80k

I then used this data as supervised data for SFT, supervised fine-tuning, on Qwen3-4B, Qwen3-1.7B, sarashina2.2-3b-instruct-v0.1, and TinySwallow-1.5B-Instruct.

The SFT format was simple:

{
  "system": "{category名}",
  "user": "{text}",
  "assistant": "{query}",
}

The system prompt contains the instruction category such as title, the user input contains the document text, and the model output contains query. For SFT specialized to a particular use case, a verbose prompt is not necessary. A short instruction, in this case the category, can train the behavior well.

Evaluation

I evaluated query-crafter using the test split of japanese-query-crafter-reasoning-80k. I generated questions from the text in this data using each SFT-trained query-crafter model.

Then I paired those generated questions with the original text and scored them with the reranker BAAI/bge-reranker-v2-m3. The reranker score is 1.0 when the document and text are highly related, and 0.0 when they are not related. It is therefore a rough measure of whether the generated question is related to the text.

Model	Mean	Std. dev.
query-crafter-jp-Qwen3-1.7B	0.8701	0.2592
query-crafter-jp-Qwen3-4B	0.8712	0.2652
query-crafter-jp-TinySwallow-1.5B	0.7526	0.3611
query-crafter-jp-sarashina2.2-3b	0.8670	0.2646
deepseek-r1	0.8507	0.2875

The percentile plot is below.

Except for TinySwallow-1.5B, the models scored higher than DeepSeek-R1 in most cases. In particular, Qwen3-1.7B is a multilingual model not specialized for Japanese, but after SFT its score is almost the same as Qwen3-4B. Its performance is impressive. Unless you have a specific reason to choose otherwise, query-crafter-japanese-Qwen3-1.7B is a good choice.

A lower reranker score than DeepSeek-R1 does not necessarily mean the DeepSeek-R1 question is worse. There are cases where it creates correct but difficult questions that are hard even for a reranker to judge. TinySwallow-1.5B sometimes generated questions that were completely unrelated, which lowered its score compared with the other models. TinySwallow-1.5B-Instruct was distilled with TAID, so it may be less suitable for subsequent SFT.

Closing

I created and released query-crafter-japanese, a model with significant speed and cost advantages when generating a large number of questions. Since the release of high-performing DeepSeek-R1, which does not restrict output use, it has become easier to create and publish datasets and then build models using them as supervised data. The emergence and improvement of open-weight LLMs with practical licenses, such as smaller Qwen models, also makes it easier to create and publish fine-tuned small models specialized for specific use cases. I feel that the range of possible applications has widened considerably. Half a year earlier, creating this model as an individual would probably have been impossible for resource reasons.

I hope this model helps people who need to generate questions.

FineWeb2 Edu Japanese: A High-Quality Educational Japanese Dataset

Thu, 20 Feb 2025 00:00:00 GMT

I published FineWeb2 Edu Japanese, a high-quality educational Japanese dataset.

https://huggingface.co/datasets/hotchpotch/fineweb-2-edu-japanese

The following is an English version of the content on that page.

This dataset filters the Japanese portion of FineWeb2, 376 million records, down to 120 million records, about 89.3B tokens, that were judged to be educational content. It also provides the following subsets.

default: about 120M records and about 89.3B tokens
sample_10BT: about 10B tokens randomly sampled from default
small_tokens: only short texts with 512 tokens or fewer
small_tokens_cleaned: small_tokens with web-specific text noise removed

Background

FineWeb, which is English-only, was created to deduplicate web data and extract high-quality text. FineWeb-Edu, which extracts higher-quality educational text, makes efficient training possible with fewer tokens.

FineWeb2, released in December 2024, is a high-quality multilingual dataset that includes Japanese. As of February 2025, however, an "Edu" dataset that extracts educationally valuable Japanese text had not been released. For that reason, I created and published FineWeb2 Edu Japanese.

Filtering Educational Data

To build this dataset, I filtered FineWeb2 Japanese data with fineweb-2-edu-japanese-classifier, a model for judging whether text is educational. The supervised data for the scoring model comes from fineweb-2-edu-japanese-scores, which was evaluated with DeepSeek-API (deepseek-chat). This dataset extracts only texts with a score of 2.5 or higher, and the score is included in the score column.

Token Counts

Token counts computed with the ModernBERT-Ja-130M tokenizer are included in the token_count column.

Removing Web-Specific Noise

FineWeb2 Japanese data can contain web-specific boilerplate and unnecessary noise. For example, text like the following can appear.

This text is displayed on a site that has not been updated for more than 90 days.
Login Logout

Besides the text that is actually needed, various kinds of noise may be included. This sentence is one such example. Unnecessary text can be inserted in this way.

50% off now! Click to view the linked product

Especially when the text is short, most of it may contain noise. Removing such text may allow higher-quality text to be extracted.

Previous page  Next page

To remove this kind of unnecessary text, I developed fineweb-2-japanese-text-cleaner. The supervised data for noise detection is fineweb-2-japanese-noise-spans. That supervised data was created using cyberagent/DeepSeek-R1-Distill-Qwen-32B-Japanese.

The model detects noisy spans as follows.

[NOISE]This text is displayed on a site that has not been updated for more than 90 days.[/NOISE]
[NOISE]Login[/NOISE] [NOISE]Logout[/NOISE]

Besides the text that is actually needed, various kinds of noise may be included. This sentence is one such example. Unnecessary text can be inserted in this way.
[NOISE]
50% off now! Click to view the linked product[/NOISE]

Especially when the text is short, most of it may contain noise. Removing such text may allow higher-quality text to be extracted.

[NOISE]Previous page[/NOISE]  [NOISE]Next page[/NOISE]

The small_tokens_cleaned subset applies fineweb-2-japanese-text-cleaner to small_tokens and removes detected noise. The raw data produced by running noise detection with the model is also published as fineweb-2-edu-japanese-noise-detect-raw.

Noise detection is not perfect, so in some cases parts of valid text may have been mistakenly removed.

Notes

I have not run a comparative experiment between this dataset, FineWeb2 Edu Japanese, and the original FineWeb2 dataset without Edu filtering. Therefore, the actual difference in effect during LLM training has not been verified.

The classification of whether text is educational is also not perfect, and some non-educational text is included.

License

This dataset is released under the Open Data Commons Attribution License (ODC-By) v1.0, the same as the original FineWeb2. The Common Crawl terms of use also apply.

Citation Information

@software{yuichi2025fineweb-2-edu-japanese,
  author = {Yuichi Tateno},
  title = {FineWeb2 Edu Japanese},
  month = feb,
  year = 2025,
  url = {https://huggingface.co/datasets/hotchpotch/fineweb-2-edu-japanese/}
}

Releasing a Japanese StaticEmbedding Model for Practical 100x Faster Text Embeddings

Tue, 21 Jan 2025 00:00:00 GMT

Dense text vectors can be used for many tasks, including information retrieval, text classification, and similar-text extraction. However, even small recent Transformer models can be slow, especially on CPU, and that often makes them impractical.

A recently released non-Transformer StaticEmbedding model offers a new approach. In benchmark comparisons with intfloat/multilingual-e5-small, or mE5-small, it achieved at least reasonable performance, around 85% of the score, while creating sentence vectors 126 times faster on CPU. That speed is impressive.

I therefore trained and released a Japanese and English model, static-embedding-japanese.

https://huggingface.co/hotchpotch/static-embedding-japanese

The JMTEB results for Japanese text embeddings are below. The overall score is slightly below mE5-small, but it wins on some tasks and is sometimes stronger than other Japanese base-size BERT models. Before training it, I was not sure a model this simple would really perform this well, so the result was surprising.

Model	Avg(micro)	Retrieval	STS	Classification	Reranking	Clustering	PairClassification
text-embedding-3-small	69.18	66.39	79.46	73.06	92.92	51.06	62.27
multilingual-e5-small	67.71	67.27	80.07	67.62	93.03	46.91	62.19
static-embedding-japanese	67.17	67.92	80.16	67.96	91.87	40.39	62.37

Technical notes on training the Japanese StaticEmbedding model are in the latter half of this article.

Usage

Usage is simple. You can create sentence vectors with SentenceTransformer as usual. This example runs on CPU without a GPU. I tested with SentenceTransformer 3.3.1.

pip install "sentence-transformers>=3.3.1"

from sentence_transformers import SentenceTransformer

model_name = "hotchpotch/static-embedding-japanese"
model = SentenceTransformer(model_name, device="cpu")

query = "美味しいラーメン屋に行きたい"
docs = [
    "素敵なカフェが近所にあるよ。落ち着いた雰囲気でゆっくりできるし、窓際の席からは公園の景色も見えるんだ。",
    "新鮮な魚介を提供する店です。地元の漁師から直接仕入れているので鮮度は抜群ですし、料理人の腕も確かです。",
    "あそこは行きにくいけど、隠れた豚骨の名店だよ。スープが最高だし、麺の硬さも好み。",
    "おすすめの中華そばの店を教えてあげる。とりわけチャーシューが手作りで柔らかくてジューシーなんだ。",
]

embeddings = model.encode([query] + docs)
print(embeddings.shape)
similarities = model.similarity(embeddings[0], embeddings[1:])
for i, similarity in enumerate(similarities[0].tolist()):
    print(f"{similarity:.04f}: {docs[i]}")

(5, 1024)
0.1040: 素敵なカフェが近所にあるよ。落ち着いた雰囲気でゆっくりできるし、窓際の席からは公園の景色も見えるんだ。
0.2521: 新鮮な魚介を提供する店です。地元の漁師から直接仕入れているので鮮度は抜群ですし、料理人の腕も確かです。
0.4835: あそこは行きにくいけど、隠れた豚骨の名店だよ。スープが最高だし、麺の硬さも好み。
0.3199: おすすめの中華そばの店を教えてあげる。とりわけチャーシューが手作りで柔らかくてジューシーなんだ。

The document matching the query gets a higher score. In this example, BM25 would have difficulty because direct words such as "ramen" in the query do not appear in the documents.

Here is an example of a similar-sentence task:

sentences = [
    "明日の午後から雨が降るみたいです。",
    "来週の日曜日は天気が良いそうだ。",
    "あしたの昼過ぎから傘が必要になりそう。",
    "週末は晴れるという予報が出ています。",
]

embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)

print(similarities)

# Show similarity between the first sentence and the others.
for i, similarity in enumerate(similarities[0].tolist()):
    print(f"{similarity:.04f}: {sentences[i]}")

tensor([[1.0000, 0.2814, 0.3620, 0.2818],
        [0.2814, 1.0000, 0.2007, 0.5372],
        [0.3620, 0.2007, 1.0000, 0.1299],
        [0.2818, 0.5372, 0.1299, 1.0000]])
1.0000: 明日の午後から雨が降るみたいです。
0.2814: 来週の日曜日は天気が良いそうだ。
0.3620: あしたの昼過ぎから傘が必要になりそう。
0.2818: 週末は晴れるという予報が出ています。

The similar sentence receives a higher score here as well.

Many people have experienced that creating sentence vectors with Transformer models on CPU takes a long time even for a small amount of text. With StaticEmbedding, it should finish almost instantly if the CPU is reasonably fast.

Reducing Output Dimensions

The default sentence vector has 1024 dimensions, but you can reduce it further. For example, here is truncate_dim=128.

# truncate_dim can be 32, 64, 128, 256, 512, or 1024.
model = SentenceTransformer(model_name, device="cpu", truncate_dim=128)

This produces 128-dimensional vectors. The score changes slightly because reducing dimensions lowers performance a little. On the other hand, reducing from 1024 to 128 dimensions reduces storage size and makes similarity computation about 8 times cheaper, so lower dimensions can be preferable depending on the use case.

Why Is CPU Inference Fast?

StaticEmbedding is not a Transformer model. It has no attention computation, the core of "Attention Is All You Need." It stores token embeddings in a 1024-dimensional table and creates a sentence vector by averaging the token vectors that appear in the sentence. Because there is no attention, it does not understand context in the same way a Transformer does.

Internally, it uses PyTorch's nn.EmbeddingBag, passing concatenated tokens and offsets so that PyTorch can use optimized CPU parallel processing and memory access.

According to the speed evaluation in the original article, it is 126 times faster than mE5-small on CPU.

Evaluation

All JMTEB results are in this JSON file. Comparing with other models on the JMTEB leaderboard shows the relative difference. Considering the model size, the overall JMTEB result is very good. The Mr. TyDi task in JMTEB requires vectorizing 7 million documents and usually takes a long time, around 1 to 4 hours on an RTX 4090 depending on the model. StaticEmbeddings processed it very quickly, finishing in about 4 minutes on an RTX 4090.

Can It Replace BM25 for Retrieval?

Looking at the Retrieval results, StaticEmbedding performs very poorly on Mr. TyDi. Mr. TyDi has far more documents than the other tasks, 7 million documents, so results may be poor for tasks that search over very large document collections. Since the model simply averages tokens without considering context, the more documents there are, the more likely similar averages may appear.

For large document collections, it may perform much worse than BM25. For smaller collections where exact keyword matches are rare, it may often perform better than BM25.

The JAQKET retrieval score is unusually good compared with other models. This may be because the model trained on JQaRA dev and unused data, which includes JAQKET-style questions, but the score still feels high. I do not think the test data leaked, but I am not fully sure why the score is this good.

Clustering Is Weak

I have not investigated this in detail, but the clustering score is clearly worse than other models. Classification is not bad, so this is somewhat surprising. It may be related to the embedding space being created with Matryoshka Representation Learning.

JQaRA and JaCWIR Reranking Evaluation

JQaRA:

model_names	ndcg@10	mrr@10
static-embedding-japanese	0.4704	0.6814
bm25	0.458	0.702
multilingual-e5-small	0.4917	0.7291

JaCWIR:

model_names	map@10	hits@10
static-embedding-japanese	0.7642	0.9266
bm25	0.8408	0.9528
multilingual-e5-small	0.869	0.97

On JQaRA it is slightly better than BM25 and slightly worse than mE5-small. On JaCWIR it is much lower than BM25 and mE5-small.

JaCWIR asks the model to find web article titles and summaries from queries, and those texts are often not clean. Transformer models are robust to noise, so it makes sense that a simple token-average StaticEmbedding model falls behind. BM25 matches distinctive words, so noisy words in documents often do not match the query in the first place, which helps it remain competitive with Transformer models on JaCWIR.

This suggests StaticEmbedding may score poorly compared with Transformer models or BM25 when texts contain a lot of noise.

Reducing Output Dimensions

The model created here outputs 1024 dimensions. Higher dimensionality increases computation cost for downstream tasks such as clustering and retrieval. Because the model is trained with Matryoshka Representation Learning (MRL), however, the 1024-dimensional vector can be easily truncated to smaller dimensions.

MRL encourages earlier dimensions to hold more important information, so using only the first 32, 64, 128, or 256 dimensions can still produce reasonable results.

According to the StaticEmbedding article, the model retains 91.87% performance at 128 dimensions, 95.79% at 256 dimensions, and 98.53% at 512 dimensions. This is useful when accuracy requirements are not too strict and downstream computation should be reduced.

Dimension Reduction Results for static-embedding-japanese

JMTEB can pass truncate_dim, making it easy to benchmark dimension-reduced outputs.

Dimensions	Avg(micro)	Score ratio (%)	Retrieval	STS	Classification	Reranking	Clustering	PairClassification
1024	67.17	100.00	67.92	80.16	67.96	91.87	40.39	62.37
512	66.57	99.10	67.63	80.11	65.66	91.54	41.25	62.37
256	65.94	98.17	66.99	79.93	63.53	91.73	42.55	62.37
128	64.25	95.65	64.87	79.56	60.52	91.62	41.81	62.33
64	61.79	91.98	61.15	78.34	58.23	91.50	39.11	62.35
32	57.93	86.24	53.35	76.51	55.95	91.15	38.20	62.37

I had previously measured the 512-dimensional score incorrectly and corrected it. Matryoshka Representation Learning appears to work: reducing dimensions causes a small score drop, but the reduced dimensions should lower downstream cost.

Interestingly, clustering improves over 1024 dimensions even when reduced to 128 dimensions. Normally, keeping more information should help, so this is unexpected. It may mean that, for clustering, using only the earlier dimensions that capture more global features works better than using later dimensions, depending on the clustering algorithm.

For this model, 512, 256, and 128 dimensions seem like reasonable tradeoffs between performance and dimensionality reduction.

Impressions After Building a StaticEmbedding Model

I was honestly skeptical that a simple average of token embeddings could perform this well, but after training it, I was surprised by the performance of such a simple architecture. In an era dominated by Transformers, it is interesting to see a practical model based on a more traditional word-embedding style approach.

A fast CPU sentence embedding model should be useful for converting large amounts of text locally, edge devices, and environments with slow networks where calling a remote inference server is difficult.

Technical Notes on Training the Japanese StaticEmbedding Model

Why Training Works

StaticEmbedding is very simple. It tokenizes a sentence, obtains N-dimensional word embeddings from an EmbeddingBag table, 1024 dimensions in this model, and averages them.

Traditional word embeddings such as word2vec and GloVe learn from word context with Skip-gram or CBOW. StaticEmbedding instead trains with entire sentences. It uses contrastive learning with large batches over many kinds of text, which can learn useful word embeddings.

Contrastive learning treats everything except the positive as a negative. With batch size 2048, one positive is compared against 2047 negatives for 2048 examples, about 4 million comparisons. This allows the model to update weights appropriately over the original word space.

Training Datasets

For the Japanese model, I created and used datasets suitable for contrastive learning:

hotchpotch/sentence_transformer_japanese
- This is arranged with column names and structures easy to use with SentenceTransformer training, such as (anchor, positive), (anchor, positive, negative), and (anchor, positive, negative_1, ..., negative_n).
- It is based on datasets including hpprc/emb, hpprc/llmjp-kaken, hpprc/msmarco-ja, hpprc/mqa-ja, and hpprc/llmjp-warp-html. For hpprc/emb and msmarco-ja, I filtered positives and negatives with reranker scores, using positive(>=0.7) and negative(<=0.3).
- I used many subsets from the constructed dataset, with augmentation to increase the amount of information retrieval-oriented data.
For English data, I used datasets such as sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1, sentence-transformers/squad, sentence-transformers/all-nli, sentence-transformers/trivia-qa, nthakur/swim-ir-monolingual, sentence-transformers/miracl, and sentence-transformers/mr-tydi.

As always, I am grateful to the dataset authors, especially hpprc.

Japanese Tokenizer

To train StaticEmbedding, it seemed easiest to use a tokenizer that can be processed in Hugging Face's tokenizer.json format, so I created hotchpotch/xlm-roberta-japanese-tokenizer, with a vocabulary size of 32,768.

This tokenizer was trained by segmenting Japanese Wikipedia data with UniDic and training SentencePiece unigram. I originally thought it also used sampled English Wikipedia and Japanese CC-100, but after checking the creation code, it used only Japanese Wikipedia. It also works as an XLM-Roberta-style Japanese tokenizer. I used this tokenizer for the model.

Hyperparameters

Notes and changes from the original training code:

Batch size was changed from 2048 to 6072.
- In large-batch contrastive learning, having positives and negatives in the same batch can hurt training. BatchSamplers.NO_DUPLICATES avoids this, but sampling can become slow with huge batches.
- I used BatchSamplers.NO_DUPLICATES and set the batch size to 6072, which fit in 24 GB on an RTX 4090. Larger batches may produce better results.
Epochs were changed from 1 to 2.
- 2 epochs performed better than 1, though with a larger dataset, 1 might be better.
Scheduler:
- Changed from the default linear scheduler to cosine, which has often worked better in my experience.
Optimizer:
- Kept the default AdamW. Switching to Adafactor made convergence worse.
Learning rate:
- Kept 2e-1. I wondered whether it was too large, but lower values worsened results.
dataloader_prefetch_factor=4
dataloader_num_workers=15
- Tokenization and batch sampler sampling take time, so I set this relatively high.

Training Resources

CPU: Ryzen 9 7950X
GPU: RTX 4090
Memory: 64 GB

With these resources, full-scratch training took about 4 hours. GPU core load was very low, often near 0%, unlike Transformer training where it stays around 90%. Most of the time appears to be spent transferring huge batches into GPU memory. Faster GPU memory bandwidth may improve training speed further.

Further Improvements

The tokenizer used here is not specialized for StaticEmbedding, so a more suitable tokenizer may improve performance. Larger batch sizes may also stabilize training and improve performance.

Using broader text resources, including various domains and synthetic datasets, may further improve performance.

Training Code

The training code is published under the MIT license. Running the script should reproduce the model.

https://huggingface.co/hotchpotch/static-embedding-japanese/blob/main/trainer.py

License

static-embedding-japanese publishes model weights and training code under the MIT license.

Looking Back on 2024

Tue, 31 Dec 2024 00:00:00 GMT

The year is coming to an end, so here is a short look back at 2024.

Life

In 2024 I was able to live at my own pace. Because we were building a house, there were many meetings and site visits, so I could not take a long overseas trip. Still, I traveled a bit both in Japan and abroad, and overall it was an active enough year.

On the health side, I hit my highest body weight so far. I started going to a personal gym partly to diet, lost 5 kg, and got back to a weight from a while ago. Ideally I would like to lose another 5 kg and gain more muscle, but that probably requires exercising properly rather than only going to the gym once a week.

And the house is finally built. We moved in right at the end of the year, so we have only been living here for about six days, but so far it has been very comfortable. The wood stove has been running almost all the time except when we are sleeping, and it is warm and excellent.

Work

At work, I was able to do a lot of what I wanted as a product manager and data scientist, mainly in information retrieval, on an AI-related product that started this year. I also worked with teammates who had expertise I did not have, and it was an enjoyable and fulfilling year. I think we built a fairly good product. It has had a quiet release, and next year I think it will probably be used more widely. More precisely, I want to help make that happen.

Technology and Hobbies

Since the second half of 2023, information retrieval has become much more interesting to me. I started working on it professionally as well, and as a hobby I created datasets, built rerankers and retrieval models, and published some useful outputs. Training LLMs is difficult with my machine resources, but it is possible to build models that fill a missing piece for a specific task. In Japanese, relatively few people publish this kind of work, so if the task is specific enough it is still possible to build a strong model. That has been fun.

In information retrieval, NLP, and machine learning, the more I work on them, the more knowledge keeps accumulating. There is still so much I want to do and learn. In 2025 I want to keep this area as a technical focus, build things as a hobby, and apply what I learn at work.

At the same time, I may have spent too much private time on work-adjacent technology. Most mornings and evenings were spent building models as a hobby. I also feel that life might be richer if I enjoyed a wider range of things outside work, so I want to pay attention to that balance in 2025.

Summary

Looking back, the year went by quickly. It feels like I did quite a lot and also like there is still a lot I did not do. I think I recovered the pace I lost in 2023, and I was able to do various things with a balance that suits me.

As always, I am grateful to my wife, who supports me in many ways. With that, I hope to see you all in 2025.

Releasing Japanese SPLADE v2, a Strong Retrieval Model for Texts Under 512 Tokens

Thu, 19 Dec 2024 00:00:00 GMT

In 2024 I became interested in information retrieval and have been building retrieval-related models as a hobby as @hotchpotch. Transformers are enjoyable because they often learn reasonably well even when the setup is not overly elaborate.

I have been running consumer GPUs at home and released Japanese SPLADE v2, japanese-splade-v2, an improved version of the Japanese SPLADE v1 retrieval model I published earlier. On JMTEB retrieval benchmarks, it achieves very strong scores for document lengths up to 512 tokens, which is common in RAG. Considering the model size and performance, I think it is a well-balanced retrieval model.

This article is also day 24 of the Information Retrieval / Search Technology Advent Calendar 2024.

What Is SPLADE?

SPLADE is probably unfamiliar to many people, so before explaining SPLADE, I will briefly discuss dense vector search and sparse vector search.

When people talk about natural language search today, the popular approach is dense retrieval, also called text embeddings or embedding search. However, sparse retrieval is still actively used in many places. Keyword-based methods such as TF-IDF and BM25 are representative examples.

Suppose you search for "Tell me a good cafe." Sparse vector search, such as TF-IDF or BM25, scores how important keywords are and returns results. In this case, the results depend on the frequency and rarity of words such as "good" and "cafe". Documents with matching distinctive keywords tend to rank higher.

Dense retrieval represents the meaning of words and phrases as vectors. For "good cafe", it can also retrieve contextually related phrases such as "popular coffee shop" or "pleasant cafe". This is because the neural network model, usually a Transformer, has learned broad semantic representations of words and sentences.

In short, sparse vectors such as BM25 emphasize the keywords themselves, while dense vectors emphasize the meaning and nuance of the keywords. Which one to use depends on whether exact keyword matching or semantic breadth is more important.

Difference in Dimensions

Dense and sparse vectors also differ in the number of dimensions used to represent information.

Dense vectors typically have large dimensionality, often from 384 to 3072 dimensions, and sometimes more depending on the model. For example, OpenAI's text-embedding-3-large uses 3072 dimensions by default. Higher dimensionality means that vector computations, such as dot products or cosine similarity, become more expensive and require more storage and memory. This is one of the challenges of dense vectors.

In real search systems, searching all high-dimensional vectors exactly is too inefficient, so algorithms such as approximate nearest neighbor search, ANN, are used to trade a little accuracy for better computational efficiency.

Sparse vectors, when keyword-based, theoretically use the vocabulary of the whole document collection as dimensions, creating a large vector space. In practice, however, most dimensions are zero, and only a small number of elements are non-zero. A query such as "Tell me a good cafe" uses only the dimensions corresponding to words like "good", "cafe", and "tell". The other tens or hundreds of thousands of possible dimensions remain zero. This greatly reduces storage, memory, and computation in production and enables fast search.

Sparse vectors also have the advantage that it is easy to understand what each non-zero dimension means. It is clear which dimensions correspond to keywords such as "good" and "cafe", making results easier to interpret.

Here is an example of a dense vector:

dense_vector = [
 0.0023, -0.0008, 0.0017, 0.0009, -0.0025,
 ... # elements continue for the number of dimensions
]

All dimensions in a dense vector have meaning, but it is hard to understand what each value specifically represents.

A sparse vector, on the other hand, has an easier-to-understand structure:

sparse_vector = {
  33721: 1.5, # dimension 33721 corresponds to "good"
  1191: 2.3, # dimension 1191 corresponds to "cafe"
  997: 0.2 # dimension 997 corresponds to "tell"; frequent words have lower scores
  # all other dimensions are zero and do not need to be written
}

In this example, it is clear which dimensions correspond to "good", "cafe", and "tell". This makes it easier to interpret which words contributed to the search result.

Dense vectors are good at capturing broad meaning, but their many dimensions make them computationally expensive. Sparse vectors are efficient and especially useful for precise keyword search.

Weaknesses of Sparse Vector Search

Sparse vectors use relatively few active dimensions and make it easy to understand which words matched. That may sound like an obvious win.

However, dense retrieval is popular for natural language search because of accuracy. Algorithms such as BM25 basically match predefined keywords and manually maintained synonym dictionaries. If you search for "good cafe", BM25 will usually not match "tasty coffee shop" unless the keywords align. Dense vectors use fuzzier semantic representations and can match texts like "tasty coffee shop" as similar.

SPLADE: Sparse Vectors That Understand and Expand Context

Sparse vector search is well suited to exact keyword matching, such as e-commerce search where similar but different products can be wrong. For natural language queries, dense retrieval often seems more suitable.

As people increasingly want systems to find target documents from casual natural language, similar to talking with AI, dense retrieval models have become popular.

This is where SPLADE, Sparse Lexical and Expansion Model, comes in. SPLADE's key feature is that it understands context and proposes multiple appropriate words or tokens. For example, for the query "What time of day has the highest household TV rating in Japan?", SPLADE can output related terms inferred from context, not only words directly included in the query.

Words directly included in the query:
- Japan
- viewing
- household
- time
Related words inferred from context:
- TV and broadcasting: broadcast, program, slot
- Metrics: rate, rise, high
- Time-related: time, period

Traditional sparse vector search could only find documents where the entered keywords matched exactly. SPLADE can understand context and search with related words as well, while preserving the fast retrieval performance of sparse vectors.

Efficient Retrieval

SPLADE can perform this kind of advanced search efficiently.

For example, for the query "What time of day has the highest household TV rating in Japan?"
sparse_vector = {
    1423: 1.71,  # corresponds to "Japan"
    5891: 1.59,  # corresponds to "viewing"
    8754: 1.57,  # corresponds to "household"
    2341: 1.33,  # corresponds to "time"
    9876: 0.96,  # corresponds to "broadcast"
    # ...other related dimensions
}

Only the necessary information is stored as a sparse vector, and matching uses a small number of dimensions. The important point is that SPLADE scores are not simple frequencies; they represent contextual importance.

Why SPLADE?

Compared with other retrieval approaches:

Traditional sparse retrieval such as BM25
- Depends on keyword combinations such as "rating" + "time slot"
- Weak against paraphrases such as "broadcast peak time"
- Strong for exact keyword matches
- Easy to explain results
Dense retrieval
- Represents queries and documents with dense vectors, so stronger accuracy often requires larger models and higher vector dimensions
  - This affects inference speed and search speed
- Results are harder to interpret
SPLADE, context-aware sparse retrieval
- Can search with contextual understanding
- Maintains fast search performance
  - Queries are often around 20-40 dimensions and documents around 150-400 dimensions
  - Runtime tradeoffs between accuracy and speed are possible by not searching or indexing low-importance words
- Results are easy to interpret because you can see which word tokens matched

SPLADE balances many requirements of modern search systems.

How Good Is the Actual Performance?

Let's look at performance, especially the ability to retrieve appropriate documents for natural language questions.

This is the JMTEB retrieval benchmark result, nDCG@10. For texts of 512 tokens or fewer, Japanese SPLADE v2 achieves the best score on most tasks. The benchmark tasks nlp_journal_abs_intro and nlp_journal_title_intro contain documents longer than 512 tokens, so models with shorter maximum input lengths score lower across the board.

In practical use cases such as retrieval for RAG, documents are often split into smaller chunks, so depending on the use case, handling only up to 512 tokens may not be a problem.

The JMTEB retrieval datasets are roughly:

JaGovFaqs_22k
- QA dataset based on Japanese government agency FAQs
- Queries: 3,420
- Documents: 22,794
- Mostly 512 tokens or fewer
Mr. TyDi
- Retrieval benchmark of manually created questions and related Wikipedia passages
- Queries: 720
- Documents: 7,000,027
- Mostly 512 tokens or fewer
JAQKET
- Dataset from the AI-Ou quiz competition, containing quiz questions and Wikipedia articles with the answers
- Queries: 997
- Documents: 114,229
- Mostly 512 tokens or fewer
NLP Journal
- Dataset built from the Japanese NLP Journal LaTeX Corpus, combining titles, abstracts, and introductions
- Many introductions exceed 512 tokens

Japanese SPLADE v2 did not use the train, dev, or test data from Mr. TyDi, MIRACL, JAQKET, or JQaRA as training sources. Using those as training data can improve performance on that domain, but I avoided doing so in order to measure generalization.

Model Size and Dimensions

The model parameter counts and output dimensions are shown in the original table. Parameter counts are roughly computed from layer weights. Larger models usually cost more for training and inference. Larger document output dimensions also require more memory and storage.

Because SPLADE output dimensionality, the number of non-zero elements, depends on the text, I included rough numbers for JMTEB queries and documents.

License

Japanese SPLADE v2 has no special usage restrictions and is released under the MIT license. You can use it freely.

Using It from Code

Sample code is available on huggingface.co/hotchpotch/japanese-splade-v2.

FAQ

Can sparse vector search be used in production?

Yes. Classic search technologies such as TF-IDF and BM25 are sparse vector search methods, and many search systems, including Elasticsearch, Vespa, and Qdrant, support sparse vector search and hybrid search combining dense and sparse vectors.

Is SPLADE better than dense vector models?

On benchmarks it can be better, but it depends on the use case. Even in relatively simple search systems, such as finding a corresponding document from a natural language question, the best method depends on what kinds of questions and documents you expect and what requirements you need to satisfy. Simple BM25 may be best in some cases.

Dense vector models and SPLADE often return results with different characteristics, so hybrid search that combines both is also recommended.

For hybrid search, another useful approach is to train either the dense or sparse model on domain-specific data while keeping the other model more general. The Trainer implementation for Japanese SPLADE v2, YAST, and the training data and settings are published. By creating query-document training data from your own domain and adding it to the training data, retrieval accuracy may improve significantly. Recently, if you have text, it has also become easier to create synthetic supervised data with LLMs, expanding the ways data can be used.

Closing

SPLADE uses context-aware word expansion to cover some weaknesses of keyword-based methods such as BM25, and it is gaining attention as one practical neural search option.

Japanese SPLADE v2, trained properly on Japanese, is likely one of the strongest current models for natural language question tasks such as Mr. TyDi. It is also a high-performing and well-balanced retrieval model that should be practical in production.

I hope this model and article are useful to people working on AI development, natural language processing, and information retrieval.

Releasing Japanese BERT RetroMAE Models and Evaluating Them on Downstream Retrieval Tasks

Wed, 30 Oct 2024 00:00:00 GMT

Neural retrieval models that capture semantic similarity between queries and documents are important for search tasks. However, conventional language models such as BERT are mainly pretrained on token-level tasks, so their sentence-level representations are not always well developed. RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder was proposed as a new pretraining method specialized for retrieval.

In this article, I created and released Japanese BERT models pretrained with RetroMAE and evaluated them on downstream retrieval tasks, JMTEB.

model_name	Avg.	jagovfaqs22k	jaqket	mrtydi	nlp_journalabs_intro	nlp_journaltitle_abs	nlp_journaltitle_intro
bert-base-japanese-v3	0.7266	0.6532	0.6236	0.4521	0.8774	0.9732	0.7803
bert-base-japanese-v3retromae	0.7352	0.6631	0.6632	0.4526	0.8893	0.9722	0.7708
ruri-pt-baseretromae	0.7397	0.6678	0.6691	0.4667	0.8931	0.9605	0.7812

The results show improvements in almost all scores, confirming the usefulness of RetroMAE. The training method is also practical because it is unsupervised and only requires text.

About RetroMAE

RetroMAE uses a masked auto-encoder approach with three main design choices:

A new workflow that applies different masks to the input sentence
An asymmetric encoder-decoder structure
Different mask ratios for the encoder and decoder

These choices allow the model to learn representations that understand document meaning more deeply and support effective retrieval. RetroMAE also performs well on benchmarks such as BEIR and MS MARCO. The high-performing multilingual dense embedding model BAAI/bge-m3 also uses RetroMAE pretraining.

There is also a later method, RetroMAE v2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models, also called DupMAE. This article covers RetroMAE.

Workflow with Different Masks

RetroMAE applies two different masks to the input sentence. The encoder generates a sentence embedding from the first masked input. The decoder then combines the second masked input with that sentence embedding to reconstruct the original sentence.

Asymmetric Encoder and Decoder

RetroMAE intentionally uses an asymmetric structure. The encoder uses BERT's 12-layer Transformer to capture the meaning of the input sentence sufficiently. The decoder, on the other hand, is only a very simple one-layer Transformer. This simple decoder makes the reconstruction task harder, encouraging the encoder to learn higher-quality sentence embeddings.

The one-layer decoder also introduces a special mechanism called enhanced decoding. It prepares two inputs: a query that combines the sentence embedding and position embedding, and a context that combines the sentence embedding, token embedding, and position embedding. It then applies an attention mask according to position. This allows all input tokens to be reconstruction targets while each token is reconstructed from its own context, enabling efficient training even with a shallow decoder.

Different Mask Ratios

The encoder uses a moderate mask ratio, 15-30%, so it can retain most of the information in the input sentence. The decoder uses a more aggressive mask ratio, 50-70%. With this high mask ratio, the decoder cannot easily reconstruct the input from the masked input alone, so it must rely heavily on the sentence embedding produced by the encoder. This forces the encoder to learn deeper semantic understanding.

Pretraining Japanese RetroMAE Models

The original paper trains on English Wikipedia, BookCorpus, and MS MARCO. For Japanese, I used the following datasets, which contain similar tasks:

(A) Japanese Wikipedia: hpprc/jawiki-paragraphs
(A) jawiki-books: hpprc/jawiki-books-paragraphs
(B) Japanese MQA: hpprc/mqa-ja
(B) JSNLI: shunk031/jsnli

For Wikipedia and jawiki-books, I used only paragraphs and did not include titles. For MQA, I concatenated query and document. For JSNLI, I removed spaces.

Instead of training from zero weights, I used tohoku-nlp/bert-base-japanese-v3 and cl-nagoya/ruri-pt-base as the base models for RetroMAE training. ruri-pt-base is a pretrained model based on bert-base-japanese-v3 and trained with contrastive learning. Because the MLM decoder layer is lost in that process, I used a model where the decoder layer weights were copied from bert-base-japanese-v3.

For the training script, I used the MIT-licensed OSS retromae_pretrain. The encoder mask ratio was 30%, and the decoder mask ratio was 50%. Other Trainer hyperparameters were:

  "learning_rate": 1e-4,
  "num_train_epochs": 2,
  "per_device_train_batch_size": 16,
  "gradient_accumulation_steps": 32,
  "warmup_ratio": 0.05,
  "lr_scheduler_type": "cosine",
  "bf16": true,
  "dataloader_drop_last": true,
  "dataloader_num_workers": 12

Using these settings, I created RetroMAE pretrained models using only dataset group (A), and using (A) + (B).

Evaluation on Downstream Retrieval Tasks

For downstream retrieval evaluation, I trained Japanese SPLADE models using only the mMARCO dataset. The settings are based on japanese-splade-base-v1-mmarco-only, with the model epochs reduced from 12 to 10 and model_name replaced with the model being evaluated.

For evaluation, I used my fork of JMTEB, modified to evaluate sparse vectors, and ran retrieval tasks.

The evaluation scores are:

model_name	Avg.	jagovfaqs22k	jaqket	mrtydi	nlp_journalabs_intro	nlp_journaltitle_abs	nlp_journaltitle_intro
bert-base-japanese-v3	0.7266	0.6532	0.6236	0.4521	0.8774	0.9732	0.7803
bert-base-japanese-v3retromae(A)	0.7361	0.6655	0.6621	0.4557	0.888	0.9604	0.7848
ruri-pt-base retromae(A)	0.737	0.6657	0.6541	0.4608	0.8823	0.9768	0.7821
bert-base-japanese-v3 retromae(A+B)	0.7352	0.6631	0.6632	0.4526	0.8893	0.9722	0.7708
ruri-pt-base retromae(A+B)	0.7397	0.6678	0.6691	0.4667	0.8931	0.9605	0.7812

In almost all evaluations, the models trained with RetroMAE scored higher than models not trained with RetroMAE. The best model, ruri-pt-base retromae(A+B), improved by about 2% compared with bert-base-japanese-v3.

The datasets also show that training on both (A) and (B) generally produced higher scores than training only on (A). This suggests that adding more datasets or training on domain-specific text may further improve performance.

The RetroMAE models trained on (A+B) are published on Hugging Face:

Closing

This article applied RetroMAE, a retrieval-oriented pretraining method, to Japanese BERT models and evaluated its effect. In downstream SPLADE evaluation, models trained with RetroMAE improved over the baseline bert-base-japanese-v3 on almost all retrieval tasks. In particular, the model based on contrastively trained ruri-pt-base and trained on multiple datasets such as Wikipedia, books, and question-answer data achieved an average improvement of about 2%.

Another advantage of RetroMAE is that it can be trained in an unsupervised way using only text data. This makes it useful for customizing models for specific domains or business tasks. Further improvements may be possible by adding more training data or continuing training on domain-specific text.

The RetroMAE models are published on Hugging Face and can be used. I hope this article helps improve Japanese retrieval task performance.

How to Build a SPLADE Model: Japanese SPLADE Technical Report

Wed, 23 Oct 2024 00:00:00 GMT

In recent years, the rise of large language models has made information retrieval increasingly important. Applications such as Retrieval-Augmented Generation, or RAG, need search systems that are both efficient and accurate.

In neural retrieval, dense retrievers have become mainstream, and strong multilingual models such as multilingual-e5 and bge-m3 are available. At the same time, sparse retrieval models represented by SPLADE, Sparse Lexical and Expansion Model with Contextualized Embeddings, have shown strong performance in English.

However, SPLADE depends heavily on lexical features, and tokenization depends on the model tokenizer. This made multilingual SPLADE difficult. Multilingual tokenizers often split many languages at the character level, making meaningful word-level tokenization difficult. I therefore developed and evaluated a SPLADE model specialized for Japanese.

The original SPLADE implementation, naver/splade, is released under CC-BY-NC and has commercial-use restrictions. I implemented a Trainer based on the papers and released it as MIT-licensed open source software.

YAST - Yet Another SPLADE or Sparse Trainer

This report covers implementation details, evaluation results, and future directions for the Japanese SPLADE model.

SPLADE Algorithm

SPLADE learns sparse document and query representations for information retrieval. This section describes how it is trained.

Word Importance and Output Tokens

SPLADE uses the per-token output of a model pretrained with Masked Language Modeling, or MLM, to compute context-dependent word importance. More concretely, it uses the vocabulary space of a pretrained model such as BERT and applies max pooling over token scores at each input position. Applying a log-saturation function suppresses extreme values while emphasizing important features. These operations produce sparse and efficient document and query representations that capture salient features.

This operation is called SPLADE Max. A Python implementation is below.

def splade_max_pooling(logits, attention_mask):
    # Step 1: apply log saturation, log(1 + x)
    # - torch.relu() clamps negative values to 0
    # - torch.log(1 + x) converts values to log scale and suppresses large values
    relu_log = torch.log(1 + torch.relu(logits))
    
    # Step 2: mask scores at padded positions with attention_mask
    # unsqueeze(-1) aligns dimensions: (batch_size, seq_len, 1)
    weighted_log = relu_log * attention_mask.unsqueeze(-1)
    
    # Step 3: apply max pooling
    # torch.max() takes the maximum over sequence length (dim=1)
    # This selects the most important score for each vocabulary item
    max_val, _ = torch.max(weighted_log, dim=1)
    
    return max_val

Predicting Document-Query Relevance

The word importance scores from SPLADE Max are used to predict the relevance between documents and queries, mainly with an inner product. The difference between the prediction and the training data is defined as the loss.

Loss functions such as KL divergence, MarginMSE, and cross entropy can be used to measure differences between the model's predicted vocabulary distribution and the target distribution. These can be used alone or in combination. SPLADE-v3 combines KL divergence and MarginMSE.

Sparsity and Regularization

Regularization is included in the loss to make the output word-importance distribution sparse. The main algorithms are:

L1 regularization: minimizes the sum of absolute parameter values, pushing many values toward zero and encouraging sparse representations.
FLOPs regularization: in high-dimensional sparse representation learning, distributes non-zero elements across dimensions to reduce matrix-operation FLOPs quadratically. See Minimizing FLOPs to Learn Efficient Sparse Representations.

Different losses and regularization coefficients can be applied to queries and documents. Applying strong regularization from the beginning of training can harm importance prediction, so a warmup period that gradually increases the regularization loss weight is also used.

Training and Relevance Computation

By training with these methods, SPLADE can increase query-document relevance while encouraging sparsity. It combines sparse representations with neural contextual vocabulary information, enabling strong information retrieval.

Training Method for the Japanese Model

Dataset Preparation

For the final japanese-splade-base-v1 model, I used several subsets from hpprc/emb, which contains Japanese questions, answers, and hard negatives. The subsets include auto-wiki-qa, mmarco, jsquad, jaquad, auto-wiki-qa-nemotron, quiz-works, quiz-no-mori, miracl, jqara, mr-tydi, baobab-wiki-retrieval, and mkqa.

I also created hotchpotch/hpprc_emb-scores, a scored dataset using high-performance Japanese cross-encoder rerankers, BAAI/bge-reranker-v2-m3 and cl-nagoya/ruri-reranker-large. For English data, I used MS MARCO and data scored with BAAI/bge-reranker-v2-m3.

For filtering, I used the average score of the rerankers: positives with scores of 0.7 or higher, and negatives with scores of 0.3 or lower. This removes passages that the rerankers judge to be inappropriate for the query.

For datasets with small proportions, I increased the amount of training per epoch so the model would not forget their characteristics.

For an mMARCO-only training dataset, I created and used hotchpotch/mmarco-hard-negatives-reranker-score, based on mMARCO and scored with BAAI/bge-reranker-v2-m3. It uses the same filtering rule: positives at 0.7 or higher and negatives at 0.3 or lower.

Training Settings and Hyperparameters

I used simple cross-entropy loss as the training loss. I tried KL divergence and MarginMSE as well, but cross entropy produced the best result. The goal was to let the model learn the scores from high-performance rerankers.

For sparsity regularization, I used L1 regularization. Compared with FLOPs loss, L1 regularization encouraged sparsity more effectively for Japanese.

The learning rate was 5.0e-2, a common value for a 110M-parameter model in this setting. I used a cosine learning-rate scheduler and set 10% of the total steps as warmup.

Each batch contains one positive and seven negatives, for eight examples total. The batch size was 32 for japanese-splade-base-v1 and 128 for japanese-splade-base-v1-mmarco-only. For mMARCO-only training, query and document sparsity converged quickly even with a large batch. For japanese-splade-base-v1, which trains on diverse datasets, larger batch sizes slowed sparsity convergence, so smaller batches worked better. If more training time and resources are available, larger batches might still improve japanese-splade-base-v1.

Detailed parameters are available in the actual training configuration files.

Removing Noise Tokens

In Japanese training, punctuation and symbols such as 、, 。, 「, and ： appeared prominently as noisy features. When these tokens remained in SPLADE Max output, I added their scores to the loss as a penalty. I extracted symbolic words with fugashi and unidic-lite.

By treating these as noise tokens and including them in the loss, the trained model almost stopped outputting them. Training also became more stable and converged faster.

Base Model

The base model was tohoku-nlp/bert-base-japanese-v3, which has lexical semantic features in its output layer from MLM pretraining and is based on the Japanese BERT architecture.

Training

Using these settings, I fine-tuned and created japanese-splade-base-v1 and japanese-splade-base-v1-mmarco-only. On an RTX 4090, training took about 33 hours for japanese-splade-base-v1 and about 24 hours for japanese-splade-base-v1-mmarco-only.

japanese-splade-base-v1 was trained for 2 epochs because the dataset was large. japanese-splade-base-v1-mmarco-only was trained for 12 epochs because the dataset was smaller and contained only mMARCO. Increasing the number of epochs for japanese-splade-base-v1 lowered training loss but reduced retrieval performance during evaluation, probably because of overfitting.

The trained models are published on Hugging Face:

Evaluation Results

JMTEB Retrieval

The JMTEB results are below. I used my fork modified to evaluate sparse vectors.

model_name	Avg.	jagovfaqs	jaqket	mrtydi	nlp_journalabs_intro	nlp_journaltitle_abs	nlp_journaltitle_intro
japanese-splade-base-v1	0.7465	0.6499	0.6992	0.4365	0.8967	0.9766	0.8203
japanese-splade-base-v1-mmarco-only	0.7313	0.6513	0.6518	0.4467	0.8893	0.9736	0.7751
text-embedding-3-large	0.7448	0.7241	0.4821	0.3488	0.9933	0.9655	0.9547
GLuCoSE-base-ja-v2	0.7336	0.6979	0.6729	0.4186	0.9029	0.9511	0.7580
multilingual-e5-large	0.7098	0.7030	0.5878	0.4363	0.8600	0.9470	0.7248
multilingual-e5-small	0.6727	0.6411	0.4997	0.3605	0.8521	0.9526	0.7299
ruri-large	0.7302	0.7668	0.6174	0.3803	0.8712	0.9658	0.7797

On average, japanese-splade-base-v1 performed best, though it trained on some domain tasks such as Mr. TyDi and JAQKET, not the test data used in JMTEB evaluation. japanese-splade-base-v1-mmarco-only trained only on mMARCO but was best on Mr. TyDi and competitive on other tasks.

SPLADE models perform relatively poorly on jagovfaqs. This may be because the queries are FAQ-like and often resemble summarization or contextual similarity tasks. Other models learn semantic similarity, while japanese-splade-base-v1 does not. Strong Japanese models such as ruri-large and GLuCoSE-base-ja-v2 may also benefit from training on Japanese data from MQA, a multilingual FAQ and CQA dataset.

JAQKET contains many quiz-style questions with distinctive Japanese phrasing. Models that learn these expressions score well, and because answer documents contain the correct answer words, SPLADE's lexical features likely help.

The Mr. TyDi result is counterintuitive: japanese-splade-base-v1, which should have learned the domain, is worse than japanese-splade-base-v1-mmarco-only, which did not. I have not fully analyzed this.

For the three NLP Journal tasks, SPLADE models perform well on title_abs, while text-embedding-3-large is much stronger on abs_intro and title_intro. This is because title_abs documents average 442 tokens, while abs_intro and title_intro average 2052 tokens. All models except text-embedding-3-large have a maximum input length of 512 tokens, while text-embedding-3-large supports 8191, so the other models evaluate only the beginning of long documents.

Reranking Evaluation

For reranking, I used JQaRA and JaCWIR.

model_name	JaCWIR map@10	JaCWIR HR@10	JQaRA ndcg@10	JQaRA mrr@10
japanese-splade-base-v1	0.9122	0.9854	0.6441	0.8616
japanese-splade-base-v1-mmarco-only	0.8953	0.9746	0.5740	0.8176
text-embedding-3-small	0.8168	0.9506	0.3881	0.6107
GLuCoSE-base-ja-v2	0.8567	0.9676	0.6060	0.8359
bge-m3+dense	0.8642	0.9684	0.5390	0.7854
multilingual-e5-large	0.8759	0.9726	0.5540	0.7988
multilingual-e5-small	0.8690	0.9700	0.4917	0.7291
ruri-large	0.8291	0.9594	0.6287	0.8418

Although japanese-splade-base-v1 learned the JQaRA domain, it achieved the best results across these evaluations.

English Evaluation

japanese-splade-base-v1 includes English MS MARCO data in training, so I evaluated it on MS MARCO dev with the script from naver/splade.

model_name	MRR@10 (MS MARCO dev)
japanese-splade-base-v1	0.047
japanese-splade-base-v1-mmarco-only	0.036
naver/splade_v2_max	0.340

There is a small improvement compared with the model that did not train on English data, but the score is far below naver/splade_v2_max, which is trained for English. The model has little English retrieval capability.

Sparsity Evaluation

I measured sparsity with the number of non-zero elements, the L0 norm, for queries and documents. The following results were measured on JMTEB retrieval tasks, top 1000, with JMTEB_L0.py.

JMTEB tasks	v1	v1-mmarco-only
jagovfaqs_22k-query	27.9	43.4
jaqket-query	23.3	38.9
mrtydi-query	13.8	20.5
nlp_journal_abs_intro-query	75.3	127.2
nlp_journal_title_abs-query	19	26.4
nlp_journal_title_intro-query	19	26.4
jagovfaqs_22k-docs	73.2	97.9
jaqket-docs	146.2	231.8
mrtydi-docs	89.3	100.4
nlp_journal_abs_intro-docs	95.7	182
nlp_journal_title_abs-docs	75.2	126.9
nlp_journal_title_intro-docs	95.7	182

The L0 norms show that v1-mmarco-only generally has more non-zero elements and is less sparse. Query and document sparsity are both important, but they have different requirements.

For search speed, higher query sparsity is especially valuable. Document sparsity also matters for memory and disk usage, but in production, millions to tens of millions of documents can often be searched in memory on one machine, so document sparsity may not need to be managed as strictly as query sparsity. At the same time, if documents have too few non-zero elements, retrieval quality can suffer. Tuning query and document sparsity is important for balancing search quality and efficiency.

Summary of Evaluation

These results suggest that japanese-splade-base-v1 is competitive with recent models for Japanese retrieval, especially on tasks where lexical features are important. Query and document sparsity are also sufficient for practical use.

Other models in the comparison are dense vector models, while SPLADE is a sparse vector model that emphasizes lexical features. Combining different models can produce more diverse search results than using dense models alone. This is important in real systems where diverse retrieval results are useful, such as passing varied search information to an LLM.

Future Work

japanese-splade-base-v1 has been released as a first artifact, but there is still room for improvement. The original SPLADE papers improve performance with self-distillation, multiple loss scores, and hard-negative sampling using SPLADE itself.

I have also not fully explored selecting or training pretrained models suited to retrieval tasks. Methods such as Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval and RetroMAE may improve performance.

Other possibilities include adding FAQ-style task data, supporting longer context, and adding more diverse datasets. Current data tends to be Wikipedia-heavy.

Recent models such as Llama 3.1 have licenses that allow their outputs to be used for training, making it easier to create retrieval datasets without licensing issues. The hpprc/emb dataset used here provides high-quality data using LLM outputs, as described in Ruri: Japanese General Text Embeddings.

Creating retrieval-suitable queries from documents used to require significant manual effort. LLMs now make it possible to generate large numbers of queries at low cost. Training on specific domains often improves generalization to those domains, so richer datasets should further improve retrieval models.

Closing

This report described japanese-splade-base-v1, a SPLADE model specialized for Japanese, and evaluated it. The results show that it performs strongly compared with recent models for Japanese information retrieval.

Future work includes methods for further performance improvement, selecting pretrained models better suited to retrieval, and using more diverse datasets.

I hope releasing the Japanese SPLADE model and SPLADE training Trainer contributes to the development of information retrieval technology.

References

@article{tateno2024splade,
    title={SPLADE モデルの作り方・日本語SPLADEテクニカルレポート},
    author={TatenoYuichi},
    year={2024},
    url={/articles/japanese-splade-tech-report}
}

Releasing a High-Performance Japanese SPLADE Sparse Retrieval Model

Mon, 07 Oct 2024 00:00:00 GMT

I created and released a Japanese SPLADE sparse vector model for text retrieval. On retrieval tasks over large text collections and reranking tasks that reorder documents related to a query, it achieves strong competitive results compared with recent dense vector models such as multilingual-e5-large, ruri-large, GLuCoSE-base-ja-v2, and OpenAI text embeddings.

https://huggingface.co/hotchpotch/japanese-splade-base-v1

For technical details on building the Japanese SPLADE model, see How to Build a SPLADE Model: Japanese SPLADE Technical Report.

What Is SPLADE?

SPLADE, Sparse Lexical and Expansion Model, is a retrieval model that uses sparse vectors. BM25 is the representative sparse retrieval algorithm and has been widely used for many years because of its strong performance. However, BM25 depends on exact word matches between queries and documents, so it can miss documents that contain related words or synonyms.

SPLADE uses a Transformer architecture and can include contextually related words in the vector representation. This allows words beyond exact matches to become retrieval candidates, enabling more flexible and effective search.

Characteristics and Benefits

SPLADE uses a pretrained Transformer model, such as BERT, to understand the context of the input text. It does not depend only on exact word matches and can effectively extract contextually related words. Each word is assigned an importance score, making it clear which words matter for retrieval. It also produces sparse vectors, where many dimensions are zero, which keeps computation manageable and enables efficient search.

These characteristics make SPLADE suitable for flexible retrieval needs involving related terms and synonyms. Sparse vectors allow fast search with lower computation, improving the efficiency of the overall system. Because each word has an explicit importance score, the retrieval result is also easier to interpret. Finally, SPLADE is relatively easy to introduce into existing search engines, so it can be integrated smoothly into current systems.

A Concrete Example

To understand how SPLADE works, here is a concrete example from the actual japanese-splade-base-v1 model. You can also get outputs easily from the Japanese SPLADE demo.

Example of Word Expansion

SPLADE output for "How can I improve my car's fuel efficiency?"

Score	Word (vocab)
2.1797	車
2.1465	燃費
1.7344	向上
1.5586	方法
1.3291	燃料
1.1377	効果
0.8716	良い
0.8452	改善
0.8340	アップ
0.7065	いう
0.6450	理由
0.4355	価格
0.3184	は
0.2510	家
0.2417	せる
0.2286	目的
0.1735	店
0.1627	手段
0.0851	用
0.0752	率
0.0734	上昇

As shown here, the model understands the context of the query and extracts related words such as "fuel" and "effect", even though they are not present in the original sentence. Each word also has an importance score. Some words that look unrelated or noisy, such as Japanese particles, are also included. Because such words appear in many outputs, they often become noise that can be mostly ignored, and search can still work well.

The same process can be applied to documents. By taking the dot product between the sparse vector for a query and the sparse vector for a document, we can compute how related they are.

Performance

As noted above, the SPLADE model performs well on many Japanese information retrieval tasks. Benchmark results on JMTEB retrieval, JQaRA, and JaCWIR are shown below. It performs strongly on tasks where lexical features matter. On the other hand, it is weaker on tasks such as jagovfaqs, where understanding similar sentence meaning appears to be more important.

JMTEB Retrieval

JQaRA and JaCWIR Reranking

Most open source search engines, including Elasticsearch, OpenSearch, Qdrant, and Vespa, support sparse retrieval, so adoption is relatively easy. Sparse vector search has also existed for a long time and is fast, similarly to BM25.

SPLADE and BM25 strongly reflect lexical features, so their results often differ from dense vector models such as multilingual-e5. Combining both sets of results as hybrid search can produce better and more diverse results. Most of the search engines mentioned above also support hybrid search, and many make it easy to use.

Is It Hard to Run in Production?

SPLADE can be operated almost the same way as a dense vector model, so it is not especially difficult. As mentioned above, most search engines support sparse search.

Obtaining a SPLADE sparse vector is also not complicated. It passes token scores through a combination of max pooling, often called SPLADE max, and a log-saturation function.

Example: obtaining sparse vectors with the Transformers library

It can also be used from text-embedding-inference, a fast inference server that is convenient for production operation.

https://huggingface.co/hotchpotch/japanese-splade-base-v1-dummy-fast-tokenizer-for-tei

Closing

At first I was not sure whether SPLADE would really perform well. However, SPLADE-v3, trained only on the English MS MARCO dataset, performs well across a variety of retrieval tasks. That made me interested in what would happen if it were trained properly for Japanese.

SPLADE also depends on the tokenizer vocabulary. That makes it a poor fit with multilingual model tokenizers that often split Japanese at the character level, so specialized training for Japanese is needed. This was another reason the project seemed interesting. High-performance multilingual dense vector models that support Japanese are already being pursued by many companies.

As a result of training, even though some known-domain tasks such as JAQKET and Mr.TyDi are included, I was able to create a base-size 110M-parameter sparse retrieval model that outperforms large OpenAI models on some benchmarks.

Training took about 33 hours on an RTX 4090. Because SPLADE can be trained with relatively modest compute and time, creating a model trained on domain-specific data with SPLADE seems like a useful approach for teams that need retrieval results adapted to their own domain.

I expect Japanese sparse retrieval performance with SPLADE to continue improving, and I think it remains an interesting research area.

Running Japanese Tokenizer Models with text-embeddings-inference

Mon, 30 Sep 2024 00:00:00 GMT

text-embeddings-inference, or TEI, is an inference server provided by Hugging Face. It is written in Rust, provides Docker containers for various GPU architectures, and when the GPU architecture supports FlashAttention 2, it is often about 1.5 to almost 2 times faster than running inference with Python's Transformers library. I find it useful as a high-performance production inference server.

One problem in Japanese environments is that TEI requires a Rust-based FastTokenizer, in other words a model with tokenizer.json. Many Japanese Transformer models use morphological analysis dictionaries and libraries such as UniDic or MeCab that run in Python, so many of them cannot be used through the tokenizer.json path.

This was a serious issue for me at first, but I found that some APIs such as /embed and /embed_sparse, though unfortunately not /rerank, can be used with a workaround. As an example, I will record the method using cl-nagoya/ruri-base.

Prepare a Dummy tokenizer.json

TEI checks for tokenizer.json when starting the model, and it will not start without one. Therefore we prepare a dummy tokenizer.json. You can create one yourself or use one from a public model. For this example I used the tokenizer.json from hotchpotch/mMiniLMv2-L6-H384.

I created a version of ruri-base with this tokenizer.json added as ruri-base-dummy-fast-tokenizer-for-tei.

Start the Server with the Dummy tokenizer.json Model

Prepare a docker-compose.yaml like this:

services:
  ruri-base:
    # Change the image to one that matches your architecture.
    image: ghcr.io/huggingface/text-embeddings-inference:86-1.5
    ports:
      - "8080:80"
    volumes:
      - /tmp/docker-tei-data:/data
    # Change pooling to match the model architecture.
    command: [ "--model-id", "hotchpotch/ruri-base-dummy-fast-tokenizer-for-tei", "--dtype", "float16", "--pooling", "mean", "--max-batch-tokens", "131072", "--max-client-batch-size", "16" ]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [ gpu ]

Start it:

$ docker compose up
...
ruri-base-1  | 2024-09-30T06:51:45.266929Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1778: Starting HTTP server: 0.0.0.0:80
ruri-base-1  | 2024-09-30T06:51:45.266940Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1779: Ready

It should now be running on port 8080.

Convert to token_ids Locally and Call the API

Next, tokenize locally and call the API with token_ids.

from transformers import AutoTokenizer
import requests
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("hotchpotch/ruri-base-dummy-fast-tokenizer-for-tei", use_fast=False)

sentences = [
    "クエリ: 瑠璃色はどんな色？",
    "文章: 瑠璃色（るりいろ）は、紫みを帯びた濃い青。名は、半貴石の瑠璃（ラピスラズリ、英: lapis lazuli）による。JIS慣用色名では「こい紫みの青」（略号 dp-pB）と定義している[1][2]。",
    "クエリ: ワシやタカのように、鋭いくちばしと爪を持った大型の鳥類を総称して「何類」というでしょう?",
    "文章: ワシ、タカ、ハゲワシ、ハヤブサ、コンドル、フクロウが代表的である。これらの猛禽類はリンネ前後の時代(17~18世紀)には鷲類・鷹類・隼類及び梟類に分類された。ちなみにリンネは狩りをする鳥を単一の目(もく)にまとめ、vultur(コンドル、ハゲワシ)、falco(ワシ、タカ、ハヤブサなど)、strix(フクロウ)、lanius(モズ)の4属を含めている。",
]

token_ids = tokenizer(sentences, padding=False, truncation=False, return_tensors="np")["input_ids"]
token_ids = [t.tolist() for t in token_ids]

url = "http://127.0.0.1:8080/embed"
payload = {"inputs": token_ids, "normalize": False, "truncate": True}
headers = {"Content-Type": "application/json"}

response = requests.post(url, json=payload, headers=headers)
embeddings_data = response.json()
embeddings = np.array(embeddings_data)
print(embeddings.shape)

# calc cosine similarity
normalized_embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarities = np.dot(normalized_embeddings, normalized_embeddings.T)

print(similarities)

Result:

(4, 768)

array([[1.        , 0.94194159, 0.68661375, 0.71621216],
       [0.94194159, 1.        , 0.66622363, 0.68591373],
       [0.68661375, 0.66622363, 1.        , 0.87196226],
       [0.71621216, 0.68591373, 0.87196226, 1.        ]])

This successfully obtains dense vectors, with cosine similarities almost the same as the values shown in the ruri-base model card. With this approach, TEI can be used with Japanese tokenizers for APIs other than reranking. Of course, if you send ordinary text instead of tokenized token_ids, you will get results that are completely off, so be careful.

The real solution would be to send pull requests so TEI can start without tokenizer.json and the /rerank API also works properly. I have not done that because implementing it in Rust and communicating through the PR process feels like more work than I currently want to take on. I would be grateful if someone did.

Releasing High-Performance Japanese Rerankers, and What Rerankers Are

Tue, 02 Apr 2024 00:00:00 GMT

💡 A newer version is available: Releasing very small, fast, and practical Japanese rerankers: japanese-reranker-tiny and xsmall v2

There were very few rerankers trained specifically for Japanese, so I created a family of reranker models that properly learn Japanese. The family includes models from small to large sizes.

The evaluation results are below. As of early April 2024, I think these are among the strongest publicly available models on Japanese reranking tasks, partly because almost no rerankers trained on Japanese had been published.

Model name	layers	hidden_size	JQaRA	JaCWIR	MIRACL	JSQuAD
japanese-reranker-cross-encoder-xsmall-v1	6	384	0.6136	0.9376	0.7411	0.9602
japanese-reranker-cross-encoder-small-v1	12	384	0.6247	0.939	0.7776	0.9604
japanese-reranker-cross-encoder-base-v1	12	768	0.6711	0.9337	0.818	0.9708
japanese-reranker-cross-encoder-large-v1	24	1024	0.7099	0.9364	0.8406	0.9773
japanese-bge-reranker-v2-m3-v1	24	1024	0.6918	0.9372	0.8423	0.9624

Technical details about how these rerankers were created are in Japanese Reranker Technical Report.

What Is a Reranker?

A reranker, as the name suggests, reranks documents. Given a query, it reorders documents by relevance. You might wonder how this differs from sorting by similarity between text embeddings. In practice, embedding similarity can also be used for ranking, but there are two important differences.

Rerankers Have Strong Reranking Performance

Text embeddings represent the query and document in the same vector space and measure similarity. This makes them efficient at large scale because document vectors can be computed offline in advance.

Rerankers, however, are specialized for reranking. The models I created use a CrossEncoder architecture that evaluates a query and document as one pair. This allows the model to understand finer nuances and contextual relationships between the query and document. As a result, relevant documents are more likely to move higher in the ranking.

Rerankers Cannot Be Precomputed and Are Slow

If rerankers are more accurate, one might ask why not evaluate everything with a reranker instead of using embeddings. The issue is that rerankers use both the query and document as input. With text embeddings, document vectors can be computed offline in advance. At search time, only the query vector needs to be computed.

A reranker, or CrossEncoder, cannot precompute document-only representations in the same way. If there are only 100 candidate documents, evaluating all of them online may be fine. As the number of documents grows, however, evaluating every candidate at search time becomes impractical.

Where Rerankers Fit

This does not mean rerankers are unusable in real-world search. A common approach is to first retrieve the top 100 related documents using an efficient offline-computable method such as text embeddings, and then use a reranker to reorder those 100 candidates more accurately.

Quoted from Retrieve & Re-Rank Pipeline.

The same idea is also described in articles such as Google Cloud's Your RAGs powered by Google Search technology, part 2, in the section on Deep re-ranking. Reranking is an important technique.

How Much Can Reranking Improve Retrieval Scores?

In a previous article, Taking on the First AI-Ou Quiz Competition with Vector Search Only: Evaluating Multiple Japanese Embeddings on a Q&A Task, I converted about 5.5 million documents into vectors with several text embedding models and evaluated retrieval with approximate nearest neighbor search using IVFPQ. The accuracy results for those embedding models are below.

JAQKET v1 vector search: Japanese embeddings evaluation

Now let's take the top 100 results from that approximate nearest neighbor search and rerank them with the smallest and fastest model in this family, the xsmall reranker.

The results improve substantially. Because the task searches about 5.5 million documents at practical speed with approximate nearest neighbor search, accuracy is lower than exhaustive embedding similarity search. Reranking those results with a reranker leads to a large score improvement.

OpenAI text embeddings, for example, often do not score especially well on Japanese information retrieval tasks, but reranking improves their scores considerably.

Next, let's rerank with the larger large reranker model.

The score improves further. If you have enough compute, using a larger model is reasonable, but reranking latency increases with model size. The time required to evaluate JaCWIR on an RTX 3090 was as follows.

Model name	layers	hidden_size	Runtime (sec)
japanese-reranker-cross-encoder-xsmall-v1	6	384	196
japanese-reranker-cross-encoder-small-v1	12	384	265
japanese-reranker-cross-encoder-base-v1	12	768	481
japanese-reranker-cross-encoder-large-v1	24	1024	1253
japanese-bge-reranker-v2-m3-v1	24	1024	1173

The xsmall and large models differ by about 6x in speed. There is a clear tradeoff between performance and latency, so it is important to choose a reranker according to the needed accuracy and speed. In runtime search systems, reranker latency often matters.

For evaluation results against many other models, see Japanese Reranker Technical Report. Also note that although the reranking evaluation in this article uses the test data from the AI-Ou Quiz competition and was not directly trained on it, these models did train on JQaRA, a dataset made from the competition's dev and unused data. This may make scores easier to improve.

Rerankers Are Surprisingly Important

I decided to build Japanese rerankers because, when searching over millions of documents, combining embeddings plus approximate nearest neighbor search with a reranker produced much better results than embeddings plus ANN alone. At that time I was using the multilingual reranker cross-encoder-mmarco-mMiniLMv2-L12-H384-v1. If a multilingual model could improve accuracy that much, I thought a model properly trained on Japanese might improve it further.

Rerankers require online computation, which is both a drawback and an advantage. The drawback is computational cost. A benefit other than accuracy is that you do not need to recompute precomputed data. If you want to replace a text embedding model with a better one, existing document vectors stored in a database must be changed carefully in production, and recomputing hundreds of millions of vectors can be expensive. A reranker is more like replacing the sorting algorithm. It can often be swapped in without changing precomputed data.

I have also observed that rerankers can improve performance substantially when trained on domain data for the task being solved. This suggests a useful split: use a general-purpose model for text embeddings, and use a domain-specific model for reranking.

This article introduced Japanese reranker models and explained what rerankers are. Much of the current attention is on training and using LLMs, but as LLM usage expands, I think search will increasingly be optimized for AI rather than humans, and information retrieval will become even more important.

Rerankers will likely become an essential tool for improving retrieval. I hope this article helps more people become interested in rerankers and information retrieval.

This article was lightly edited from text generated by Claude 3 Opus based on my draft.

Technical Report on Building Japanese Rerankers

Tue, 02 Apr 2024 00:00:00 GMT

This is a technical report on building Japanese reranker, or CrossEncoder, models. For an explanation of what rerankers are, see Releasing High-Performance Japanese Rerankers, and What Rerankers Are.

The models created are:

Model name	layers	hidden_size
hotchpotch/japanese-reranker-cross-encoder-xsmall-v1	6	384
hotchpotch/japanese-reranker-cross-encoder-small-v1	12	384
hotchpotch/japanese-reranker-cross-encoder-base-v1	12	768
hotchpotch/japanese-reranker-cross-encoder-large-v1	24	1024
hotchpotch/japanese-bge-reranker-v2-m3-v1	24	1024

How CrossEncoders Are Trained

A CrossEncoder can be trained as a simple regression task. Text in the form query text[SEP]passage text, separated by a SEP token or similar, is labeled as 1.0 for positives and 0.0 for negatives. For concrete training code, the SentenceTransformers CrossEncoder training examples are easy to understand.

Performance improves significantly when multiple negatives, or hard negatives, are trained in the same batch as the positive. FlagEmbedding's reranker trainer is a useful reference for this approach.

Training Datasets

Training requires datasets of questions, positives, and negatives. I used one positive and 15 hard negatives per item, for a group of 16 examples. The datasets were:

JQaRA: 7,270 records from dev and unused
JSQuAD:
- 62,859 records from train
- Additional Wikipedia passages for hard-negative mining
miracl: 6,984 Japanese records from train
mmarco: 346,413 filtered Japanese records from train
mr_tydi:
- 3,697 Japanese records from train
- The Japanese MIRACL data contains many records overlapping with this mr_tydi data
Wikipedia lead sections:
- 40,130 pairs of Wikipedia titles and lead paragraphs
- Hard-negative mining also used only Wikipedia lead paragraphs

Evaluation Datasets

The models were evaluated with:

JQaRA:
- 2,000 test records
- Metric: NDCG@10, as defined for JQaRA evaluation
JSQuAD:
- 4,442 validation records
- 19 negatives added by hard-negative mining from Wikipedia, evaluated with MAP@10 over 20 total candidates
miracl:
- 704 records from dev, filtered to records with at least 9 negatives
- 1 positive and 9 negatives, evaluated with MAP@10
- Japanese MIRACL has some overlap between dev and train, so training more on train tends to raise dev evaluation
JaCWIR:
- 5,000 eval records
- Metric: MAP@10, as defined for JaCWIR reranker evaluation

Hard-Negative Mining

Hard negatives are examples that a model is likely to mistakenly judge as positives, even though they are actually negative. Actively mining them increases the diversity and difficulty of the training data and can improve model accuracy.

For these models, I mined hard negatives with BM25 and multiple SentenceTransformer models. Using semantic textual similarity tasks, I extracted texts semantically similar to positives but actually negative, and randomly sampled from high-similarity candidates.

Pretrained Base Models

The following pretrained models were used as bases. For BAAI/bge-reranker-v2-m3, training on all data reduced generalization, so I randomly sampled 10,000 records each from mMARCO, JSQuAD, and Wikipedia lead sections, while using all records from the other datasets.

japanese-reranker-cross-encoder-xsmall-v1
- Microsoft mMiniLMv2-L6-H384
- 6 layers, 384 hidden size
japanese-reranker-cross-encoder-small-v1
- Microsoft mMiniLMv2-L12-H384
- 12 layers, 384 hidden size
japanese-reranker-cross-encoder-base-v1
- cl-nagoya/sup-simcse-ja-base
- tohoku-nlp/bert-base-japanese-v3
- A merged model from models trained from both sources
- 12 layers, 768 hidden size
japanese-reranker-cross-encoder-large-v1
- cl-nagoya/sup-simcse-ja-large
- tohoku-nlp/bert-large-japanese-v2
- A merged model from models trained from both sources
- 24 layers, 1024 hidden size
japanese-bge-reranker-v2-m3-v1
- BAAI/bge-reranker-v2-m3
- 24 layers, 1024 hidden size

Handling Overfitting

During CrossEncoder training, I found that because Wikipedia passages were used as hard negatives, evaluation on Wikipedia-related tasks such as JQaRA, JSQuAD, and Japanese MIRACL improved, while generalization outside Wikipedia domains degraded as training continued. To balance this, I created JaCWIR, an out-of-domain dataset not included in the training data, and used it for evaluation while training.

Training beyond 1 epoch caused overfitting, so training was limited to 1 epoch.

Training Parameters

The main model training used roughly the following parameters:

batch_size: 512 with gradient accumulation
- Since 16 examples form one group, the actual batch contains 512 * 16 = 8192 positive and negative examples
warmup_ratio: 0.25
Scheduler: cosine
Optimizer: paged_adamw_32bit
learning_rate:
- xsmall = 2e-04
- small = 5e-04
- base = 8e-05
- large = 3e-05
Loss:
- Cross entropy

Using Large Models as Teachers

For xsmall and small, I also used inference outputs from japanese-reranker-cross-encoder-large-v1 and japanese-bge-reranker-v2-m3-v1 as teacher labels. Teacher outputs are continuous inference values, such as pos=0.98 and negs=[0.02, 0.07, ...], so they can be used as regression targets rather than only 0 and 1. Using teacher outputs gave a small score improvement. MSE loss was used for this training.

Creating Mix Models

Changing datasets, score parameters, and seeds produces diverse training results. Linearly combining separately trained models can improve performance by adding diversity. I confirmed score improvements by combining multiple trained models. I used LM_Cocktail for model merging.

One caveat is that the merged model has a smaller output standard deviation, so there may be some performance degradation during quantization or similar processing.

Evaluation Results

The CrossEncoder evaluation results are below. BAAI/bge-reranker-v2-m3 already has strong multilingual generalization and high Japanese performance. If model size is not an issue, I think it is a good base model to fine-tune for reranker training, even with only a few thousand examples.

Scores on these evaluation datasets tend to increase when training on the corresponding public train data. The models created here learned the tendencies of train data for all datasets except JaCWIR, so that should be considered when reading the scores.

Model Name	JQaRA	JaCWIR	MIRACL	JSQuAD
japanese-reranker-cross-encoder-xsmall-v1	0.6136	0.9376	0.7411	0.9602
japanese-reranker-cross-encoder-small-v1	0.6247	0.939	0.7776	0.9604
japanese-reranker-cross-encoder-base-v1	0.6711	0.9337	0.818	0.9708
japanese-reranker-cross-encoder-large-v1	0.7099	0.9364	0.8406	0.9773
japanese-bge-reranker-v2-m3-v1	0.6918	0.9372	0.8423	0.9624
bge-reranker-v2-m3	0.673	0.9343	0.8374	0.9599
bge-reranker-large	0.4718	0.7332	0.7666	0.7081
bge-reranker-base	0.2445	0.4905	0.6792	0.5757
cross-encoder-mmarco-mMiniLMv2-L12-H384-v1	0.5588	0.9211	0.7158	0.932
shioriha-large-reranker	0.5775	0.8458	0.8084	0.9262
bge-m3+all	0.576	0.904	0.7926	0.9226
bge-m3+dense	0.539	0.8642	0.7753	0.8815
bge-m3+colbert	0.5656	0.9064	0.7902	0.9297
bge-m3+sparse	0.5088	0.8944	0.6941	0.9184
JaColBERTv2	0.5847	0.9185	0.6861	0.9247
multilingual-e5-large	0.554	0.8759	0.7722	0.8892
multilingual-e5-small	0.4917	0.869	0.7025	0.8565
bm25	0.458	0.8408	0.4387	0.9002

This article was lightly edited from text generated by Claude 3 Opus based on my notes and instructions.

After reading Human-in-the-Loop Machine Learning: a data-centric and suggestive book

Mon, 26 Feb 2024 00:00:00 GMT

This is a review after reading Human-in-the-Loop Machine Learning: Active learning and annotation for human-centered AI, which I received from Junya Ueda, one of the translators.

Recently, with the rise of generative AI and LLMs, I hear more often about how to collect and create "high-quality data". Not only for data used to train LLMs, but also for solving familiar problems, it has become natural to define the task needed for the problem, analyze data for it, and create data in order to solve social problems.

For many of these problems, you do not need to invent a new model yourself. Defining the task, collecting data, and training can often produce enough performance. This is the so-called data-centric way of thinking, focused on data.

There are many model-centric discussions in the world, about models and algorithms, while data is often treated as something evaluated on an already published dataset. Human-in-the-Loop Machine Learning is rare because it focuses on data and explains it in depth.

For what the book contains, it is best to read the table of contents and reaction summary, so please look there. Personally, the first part that strongly interested me was data sampling methods for active learning. It is easy to think that data near a clear linear decision boundary, where confidence is low, should be annotated. But the book discusses from many perspectives how to interpret uncertainty and diversity, and what strategies to use when deciding which data should be annotated. It is full of ideas, and implementations, that made me stop and think.

The second was collaboration with annotators. As the book says, "people management is essential." It is not at all a matter of saying "label it like this, thanks" and leaving the rest to them. The book strongly argues that you should treat collaboration with annotators like ordinary people management: how to make requests, what skills are needed, how to give feedback, how to remove annotator bias, how to handle uncertainty for each annotator, and how to communicate and provide feedback. Of course, it also contains many hints beyond people management, such as bias.

There were useful points everywhere, and because I read it carefully, it took about two months to finish. It was that interesting, and as someone who works with machine learning, I am truly glad I encountered this book.

Now that the LLM-driven AI era has begun, it will probably become normal to have AI evaluate data, use that feedback to create high-quality training datasets, and perform reinforcement learning. The original Human-in-the-Loop Machine Learning was written before GPT-4 appeared and before LLMs drew as much attention as they do now, but I think the book's viewpoint will be very useful when combined with LLMs too.

ColBERT reaches e5-large-level performance on a Japanese RAG task

Fri, 02 Feb 2024 00:00:00 GMT

The recently released Japanese pretrained ColBERT model, JaColBERT, seems to perform well. I evaluated it on the AI-Ou Q&A RAG task that I usually use for evaluation.

https://docs.google.com/spreadsheets/d/1eSYzxzIfN3uMIpFKDGCTQsIxuWYELBtD49LQbl88GUE/edit#gid=140790548

The result was only slightly below multilingual-e5-large. Considering that the training data is small and the model size is almost the same as a 12-layer BERT, which is the same size as multilingual-e5-small, this is impressive.

Reading the ColBERT implementation and papers

That made me interested in ColBERT, so I read the papers and implementation.

ColBERT is not the usual method of outputting a sentence embedding with SentenceTransformer or similar and searching by similarity comparison. It is token-based similarity search. The final hidden layer of a sentence has contextual information for each token, so ColBERT uses token-level representations, not only a single sentence vector, to calculate similarity.

The similarity calculation uses a method called MaxSim. It takes cosine similarities between query token outputs and document token outputs, then sums the maximum values. The MaxSim calculation itself is simple.

The query and document must be encoded separately, but the model used is BERT with 12 layers plus a custom head, a 128-dimensional linear layer. BERT's 12-layer hidden output is 768 dimensions, so it is converted to 128 dimensions through the linear layer.

In the actual implementation, the query and document are distinguished only by adding custom tokens after CLS: for a query, a prefix like [CLS][unused0]; for a document, [CLS][unused1]. The encoder itself is the same.

After encoding them, MaxSim is computed, and the document with the largest score is judged to be similar to the query. In the ColBERT implementation, symbols and padding tokens in document tokens are masked and ignored during calculation.

Solving search-time performance issues

With ordinary sentence-vector search, approximate nearest neighbor search, or ANN, can search quickly even from hundreds of millions of documents. However, ColBERT uses MaxSim over token similarities rather than sentence vectors, so that approach cannot be used as-is.

ColBERTv2 describes how to create an index that can solve this problem with fast nearest neighbor search. It seems to compress vectors in several ways, calculate centroids with KMeans, and search from there. The implementation imports FAISS, so I first thought it used FAISS indexes directly. But FAISS was only used to calculate cluster centroids with KMeans. Once the index is created, later search is quick.

Creating an index for the 5.5 million passages in the AI-Ou Q&A RAG task above took around five hours on a Ryzen 3900 + RTX 3090 environment. Be careful that FAISS is quite slow unless you use the GPU version, faiss-gpu.

Trainable with little data

According to the JaColBERT report, it was made from bert-base-japanese-v3 by training for 10 hours with 10 million triplet examples and 8 NVIDIA L4 GPUs. If the data amount is small and training time is short, the possibilities are exciting.

ColBERT's problems and practical difficulty

Reading the implementation made me realize that ColBERT is not easy to use casually because both the processing itself and the implementation code are complex. RAGatouille addresses that with an approach that can be used quickly even with zero configuration. I used RAGatouille for this evaluation too.

RAGatouille can create indexes and search them, of course, and can also train models with a Trainer. It also supports modern integrations such as becoming a LangChain retriever.

Can ColBERT be used on a production search server?

One concern with ColBERT is whether it can be operated in production. As of early February 2024, it seems to require implementing and operating your own search API server, which is not easy to run casually.

However, the RAGatouille documentation says that the search engine Vespa will support it soon. If that happens, operation should become much easier. Adding data to an index also still seems experimental, but if that works properly, it should become reasonable to consider for production.

Above all, if you can train on your own domain data at low cost, it may be usable as high-quality RAG retrieval for your own data. For that kind of use case, I would actively consider it.

Closing

So this was a note saying that ColBERT is impressive. People researching information retrieval probably already know ColBERT, but I did not, so learning about it felt fresh. I probably would not have become interested without JaColBERT, so I am grateful to Benjamin Clavié, its author. He is also the author of RAGatouille, which is very helpful.

Evaluating OpenAI's new text-embedding-3-small on a RAG task

Mon, 29 Jan 2024 00:00:00 GMT

OpenAI recently released new embedding models, text-embedding-3-small and text-embedding-3-large. According to OpenAI, text-embedding-3-small costs one fifth as much as the older ada-v2 embedding model while improving performance.

According to OpenAI's article, MTEB scores improved slightly, and MIRACL scores improved substantially. MIRACL, Multilingual Information Retrieval Across a Continuum of Languages, is an information retrieval task across multiple languages. A large score improvement there suggests we can also expect better accuracy for Japanese information retrieval tasks.

Evaluation on a Wikipedia Q&A RAG task

So I evaluated it right away. I used the same method as Solving the first AI-Ou quiz competition with vector search only: vector search over about 5.5 million passages and checking whether the retrieved results contain the answer. In other words, this tests whether RAG retrieval can find text containing the appropriate answer. For text-embedding-3-small, I passed an option to the OpenAI API to reduce the embedding dimension to 512, and used those dimension-reduced results. If the original 1536-dimensional data were used, the score would probably improve slightly.

The results are below.

https://docs.google.com/spreadsheets/d/1eSYzxzIfN3uMIpFKDGCTQsIxuWYELBtD49LQbl88GUE/edit#gid=140790548

The text-embedding-3-small result was low, and the multilingual-e5 series remained overwhelmingly strong for Q&A-style information retrieval tasks. That said, one strength of OpenAI embeddings may be their ability to embed long token inputs. The passages used here were under 400 Japanese characters, and the score might also be a little higher without dimension reduction.

Data and code used

Extra: cost

The total was 1,490,618,785 tokens, and cost about 30 USD. With the old model this would have cost five times as much, so I would not have felt like trying it. At this level, it feels roughly acceptable even for an individual.

Summarizing same-date diary entries from secon.dev and sending them to Discord

Mon, 15 Jan 2024 00:00:00 GMT

I have accumulated four or five years of old diary entries on secon.dev. Sometimes I look back at past entries from the same date, but there are many days when I skip them. I think occasionally touching past records helps keep memories fresh, so skipping them feels wasteful.

The reason I skip them is that accessing old diaries is troublesome. They are shown in the footer of same-date diary pages, but I still do not go look very often. Also, if a diary entry is long, reading all of it is a bother. I want someone to tell me the gist.

So I wrote an implementation that reduces the access friction by posting to a family Discord channel, used only by my wife and me, while summarizing past diary entries from the same date. Posting to Discord is easy with a webhook. I also made the Discord message use as little screen space as possible, since ordinary links expand and consume too much area, so it can be read quickly.

Summarizing old diaries is also easy because it just sends them to OpenAI. To get used to LCEL, I wrote the implementation in LCEL even though it is verbose. LCEL is easy once you get used to it, and being able to use LangSmith for debugging is convenient.

https://github.com/hotchpotch/secon-dev-n-year-summary/

Understanding LangChain Expression Language (LCEL)

Thu, 11 Jan 2024 00:00:00 GMT

LCEL is a way to build LangChain chains easily. Development became active in the second half of 2023, and as of January 2024, writing LangChain code with LCEL is generally recommended, although the older style can still be used. For LCEL's benefits, the official LCEL documentation is a good reference.

However, when I started writing LCEL, code worked if I followed the official documentation exactly, but small changes often broke it. This was simply because I did not understand LCEL's behavior. The official documentation and tutorials contain many examples showing how simply LLM + RAG code can be written with LCEL, but they do not explain much about LCEL's behavior itself. Even when they do, the examples are often combined with ChatGPT or templates, so I had trouble finding an explanation of "how LCEL behaves in the first place."

So I wrote a notebook article using LangChain 0.1.0 that explains only the basics of LCEL behavior and helps understanding progress step by step. The Colab notebook is here:

https://colab.research.google.com/drive/1hC8aBzoo3p7psGiRM1IXiBMBk3Wj2ntQ

LCEL basics

The basic idea of LCEL in LangChain is simple. An object receives an input value and passes its output value to the next object. This is the same as ordinary implementation.

First, define a function that doubles a value.

def double(x):
    return x * 2

double(2)

Next, define a function that prints the argument to standard output and returns the argument unchanged.

def tap_print(x):
    """
    引数の値を標準出力に出力して、引数の値をそのまま返す
    """
    print(f"tap_print: {x}")
    return x

Now run the two functions together. Give an argument to double, run it, and pass the result to tap_print.

tap_print(double(2))

tap_print: 4

4

That produced the expected result.

Next, convert these functions into RunnableLambda, a subclass of Runnable, which is LCEL's execution class.

from langchain_core.runnables import RunnableLambda

r_double = RunnableLambda(double)

After conversion, you can use the Runnable interface. Let's call invoke, which runs a Runnable.

r_double.invoke(2)

You can also define a RunnableLambda-like function with the @chain decorator. A variable named chain appears later and would be confusing, so here I import it as chain_decorator.

from langchain_core.runnables import chain as chain_decorator

@chain_decorator
def r_double(x):
    return x * 2

r_double.invoke(2)  # r_double は RunnableLambda になるので、invoke で実行できる

Convert tap_print into a Runnable as well.

r_tap_print = RunnableLambda(tap_print)
r_tap_print.invoke(2)

tap_print: 2

2

Now finally, let's connect and run them with |, the core of LCEL.

chain = r_double | r_tap_print
chain.invoke(2)

tap_print: 4

4

Good. r_double returns a result, and r_tap_print prints that result while returning it. What exactly is this chain?

chain.__class__

langchain_core.runnables.base.RunnableSequence

The chain is a RunnableSequence, a Runnable that runs serially. Let's write the same implementation without the | syntax sugar.

from langchain_core.runnables import RunnableSequence

chain = RunnableSequence(r_double, r_tap_print)
chain.invoke(2)

tap_print: 4

4

That produced the same result.

Now let's display the execution flow, or execution graph, for this Runnable.

chain.get_graph().print_ascii()

  +----------------+   
  | r_double_input |   
  +----------------+   
          *            
          *            
          *            
+------------------+   
| Lambda(r_double) |   
+------------------+   
          *            
          *            
          *            
+-------------------+  
| Lambda(tap_print) |  
+-------------------+  
          *            
          *            
          *            
+------------------+   
| tap_print_output |   
+------------------+

double receives the input, passes it to tap_print, and the final output becomes tap_print_output.

Now look at this code:

chain = r_double | tap_print  # tap_print は RunnableLambda ではない!
chain.invoke(2)

tap_print: 4

4

Why does this work even though tap_print is not a Runnable? It works because Runnable objects use Python's bitwise | operator behavior: if either the left or right side is a Runnable, the other side is automatically converted into a Runnable. The actual Runnable code looks like this:

    def __or__(
        self,
        other: Union[
            Runnable[Any, Other],
            Callable[[Any], Other],
            Callable[[Iterator[Any]], Iterator[Other]],
            Mapping[str, Union[Runnable[Any, Other], Callable[[Any], Other], Any]],
        ],
    ) -> RunnableSerializable[Input, Other]:
        """Compose this runnable with another object to create a RunnableSequence."""
        return RunnableSequence(self, coerce_to_runnable(other))

    # 同様に __ror__ も定義されている

It wraps the result in RunnableSequence, and passes the other argument given to the | operator, tap_print in this example, through coerce_to_runnable.

coerce_to_runnable is also important, so let's look at it.

def coerce_to_runnable(thing: RunnableLike) -> Runnable[Input, Output]:
    """Coerce a runnable-like object into a Runnable.

    Args:
        thing: A runnable-like object.

    Returns:
        A Runnable.
    """
    if isinstance(thing, Runnable):
        return thing
    elif inspect.isasyncgenfunction(thing) or inspect.isgeneratorfunction(thing):
        return RunnableGenerator(thing)
    elif callable(thing):
        return RunnableLambda(cast(Callable[[Input], Output], thing))
    elif isinstance(thing, dict):
        return cast(Runnable[Input, Output], RunnableParallel(thing))
    else:
        raise TypeError(
            f"Expected a Runnable, callable or dict."
            f"Instead got an unsupported type: {type(thing)}"
        )

This function does the following conversions: if the object is already a Runnable, do nothing; if it looks like a generator, convert it to RunnableGenerator; if it is callable, such as a function, convert it to RunnableLambda; if it is a dict, convert it to RunnableParallel; otherwise, raise an exception.

From this, you can see that in LCEL, if either side connected by | is a Runnable, the other side is converted into a Runnable, and a RunnableSequence connecting them is returned.

Now run the next code.

chain = double | r_tap_print  # double は　RunnableLambda ではない！

chain.invoke(2)

tap_print: 4

4

This time, double is not a Runnable, but r_tap_print.__ror__ converts it into a RunnableSequence, so it can run.

So far, we have seen:

A Runnable basically receives an input value with invoke, processes it, and returns an output value.
Connecting Runnables with | creates code that runs serially through RunnableSequence.
If either side of | is not a Runnable, it is automatically converted into a Runnable.

Seen this way, Runnable feels simple and understandable.

Dict syntax converted into RunnableParallel

Next is the syntax using | and dict, which confused me a lot at first. Consider this implementation:

Pass a number as the argument.
- Keep the first value in original_value.
- Put the doubled value in double_value.
Pass those results to tap_print.

Let's write code that does this.

chain = {
    "original_value": lambda x: x,
    "double_value": double,
} | r_tap_print

chain.invoke(2)

tap_print: {'original_value': 2, 'double_value': 4}

{'original_value': 2, 'double_value': 4}

It worked, though it is not obvious why. If you use this by intuition without understanding the behavior, it gradually becomes confusing. That happened to me.

Let's check what this is doing by displaying the execution graph.

chain.get_graph().print_ascii()

+--------------------------------------------+    
| Parallel<original_value,double_value>Input |    
+--------------------------------------------+    
               **              **                 
            ***                  ***              
          **                        **            
+-------------+               +----------------+  
| Lambda(...) |               | Lambda(double) |  
+-------------+               +----------------+  
               **              **                 
                 ***        ***                   
                    **    **                      
+---------------------------------------------+   
| Parallel<original_value,double_value>Output |   
+---------------------------------------------+   
                        *                         
                        *                         
                        *                         
             +-------------------+                
             | Lambda(tap_print) |                
             +-------------------+                
                        *                         
                        *                         
                        *                         
              +------------------+                
              | tap_print_output |                
              +------------------+

It suddenly becomes parallel execution, branches, aggregates the parallel results, and then passes them to tap_print.

This is the confusing point: when you connect a dict with |, coerce_to_runnable is called, it is automatically converted into RunnableParallel, and the values of the dict are run in parallel. When RunnableParallel receives a dict, it automatically converts the dict values into Runnables, runs them in parallel, and returns the results under the corresponding keys.

Let's use coerce_to_runnable directly and see the result of the type conversion that happened through |.

from langchain_core.runnables.base import coerce_to_runnable

parallel = coerce_to_runnable(
    {
        "original_value": lambda x: x,
        "double_value": double,
    }
)

parallel.invoke(2)

{'original_value': 2, 'double_value': 4}

parallel.__class__

langchain_core.runnables.base.RunnableParallel

Declaratively writing code with the same behavior looks like this:

from langchain_core.runnables import RunnableParallel

parallel = RunnableParallel(
    {
        "original_value": coerce_to_runnable(lambda x: x),
        "double_value": coerce_to_runnable(double),
    }
)
parallel.invoke(2)

{'original_value': 2, 'double_value': 4}

So in LCEL, connecting a dict with | produces code that runs in parallel with RunnableParallel and returns values.

invoke and dict

Next, call invoke with a dict. Note that this is completely different from the dict that is converted into RunnableParallel above. This is an ordinary call with a dict as the argument.

data = {
    "input_value": 2,
    "input_do_nothing": 100,
}
chain = r_double | r_tap_print
try:
    chain.invoke(data)
except Exception as e:
    print("Error:", e)

Error: unsupported operand type(s) for *: 'dict' and 'int'

r_double expects an int, but it received a dict, so it cannot process it. You may say, "Just write chain.invoke(data['input_value'])", and that is true. But if r_double is in the middle of a chain and a dict reaches it, it cannot handle it.

In that case, insert a function that extracts only input_value.

data = {
    "input_value": 2,
    "input_do_nothing": 100,
}
chain = (lambda x: x["input_value"]) | r_double | r_tap_print
chain.invoke(data)

tap_print: 4

4

That worked.

Now let's write code that passes input_do_nothing onward unchanged, passes the calculation result of input_value onward as double_value, and also passes the original input_value.

data = {
    "input_value": 2,
    "input_do_nothing": 100,
}
chain = {
    "double_value": (lambda x: x["input_value"]) | r_double,
    "input_value": lambda x: x["input_value"],
    "input_do_nothing": lambda x: x["input_do_nothing"],
} | r_tap_print
chain.invoke(data)

tap_print: {'double_value': 4, 'input_value': 2, 'input_do_nothing': 100}

{'double_value': 4, 'input_value': 2, 'input_do_nothing': 100}

The intended values were passed to r_tap_print.

In the initial dict definition of this chain, the next | operator calls r_tap_print.__ror__, passes the dict to coerce_to_runnable, and automatically converts it into RunnableParallel, which produces the intended behavior.

But this is very verbose. If data had many more keys and you wanted to pass all of them as input values to later Runnables, it would be painful. For that situation, RunnablePassthrough exists. Let's rewrite this using RunnablePassthrough.

from langchain_core.runnables import RunnablePassthrough

data = {
    "input_value": 2,
    "input_do_nothing": 100,
}
chain = (
    RunnablePassthrough().assign(
        double_value=(lambda x: x["input_value"]) | r_double,
    )
    | r_tap_print
)
chain.invoke(data)

tap_print: {'input_value': 2, 'input_do_nothing': 100, 'double_value': 4}

{'input_value': 2, 'input_do_nothing': 100, 'double_value': 4}

Let's also add a result that triples the value.

chain = (
    RunnablePassthrough().assign(
        double_value=(lambda x: x["input_value"]) | r_double,
        triple_value=lambda x: x["input_value"] * 3,  # 暗黙的に RunnableLambda に変換される
    )
    | r_tap_print
)
chain.invoke(data)

tap_print: {'input_value': 2, 'input_do_nothing': 100, 'double_value': 4, 'triple_value': 6}

{'input_value': 2,
 'input_do_nothing': 100,
 'double_value': 4,
 'triple_value': 6}

As you can see, RunnablePassthrough is very convenient when you want to modify only part of an input dict, add keys, and pass it to the next Runnable.

Be careful not to confuse passing a dict as the argument to invoke with writing LCEL by connecting a dict with |, which effectively converts it into RunnableParallel. They have completely different intentions and behavior.

RunnablePassthrough can also be used declaratively to turn a non-dict input value into a dict.

value_format = "value is {value}, double value is {double_value}"

def template(data):
    return value_format.format(**data)

r_template = RunnableLambda(template)

chain = (
    {
        "value": RunnablePassthrough(),
        "double_value": RunnablePassthrough() | double,
    }  # r_double でなく double でも、自動で RunnableLambda に型変換される
    | r_template
    | r_tap_print
)
chain.invoke(100)

tap_print: value is 100, double value is 200

'value is 100, double value is 200'

You now understand why this behaves this way.

Next steps

If you understand this much, you should be able to read LangChain LCEL sample code and most LCEL code written by others. Looking back, if you understand how automatic conversion into Runnable works and that writing a dict around | turns it into RunnableParallel, you should run into far fewer implementation problems.

If you want to know why something behaves a certain way in practice, reading the Runnable source code may be the fastest way to understand it. In that case, I recommend reading the code.

I hope this article and notebook help someone understand LCEL.

Looking back on 2023

Mon, 08 Jan 2024 00:00:00 GMT

Several days have already passed since the new year began. I did not write a retrospective at the end of last year because I was traveling in Taiwan, though that is just an excuse.

Work

For a short period from May 2023 until my term ended in December, I served for the second time as an executive officer of a listed company. From my own point of view, my output was not very good, roughly at the level of being better there than not there. I thought continuing like that would not be good for either the company or myself, so I decided to stop when the term ended. I am not going to write publicly about why it was like that, but I did not work freely at my own pace as I did from 2020 to 2022, and that is probably why I did not feel like writing this 2023 retrospective either.

Going forward, I want to live at my own pace and without being swayed.

Travel

I went to South America for two months in February and March, and to Taiwan for one week at the end of the year. South America especially felt unknown to me, so I am glad I was able to go while I am alive. I should have more time this year, so I want to travel around Japan and abroad from time to time. Short trips of about a week sound good, and staying for about a month somewhere with a good internet environment also sounds good.

Life

I neglected daily life because of work. Even though I live in the countryside, if I do not leave the house, it is no different from city life. At night there is not much entertainment in the countryside, so if I only work during the day, that is really all the day becomes. I regret that.

Technology

With the appearance of GPT-4, many things started changing dramatically, not only around data science but across many areas. In software development, GPT-4 is now the first move, and I do not really want to develop in an editor without Copilot. My development style changed a lot and became easier.

Social networks

I thought I should do some publicity, so I started using social networks a little again. Basically I do not use them, and only log in when needed for posting. The password is stored somewhere very hard to access, so I only look a little. Even at this distance, social networks still take time away, so I want to keep my distance while using them only slightly.

2024

I want to put an end to the way I worked in 2023 and return to the 2020-2022 style: living while doing work, daily life, and the things I want to do. On the work side, I will be involved in a development project that applies recent AI-related technologies in society and has enough potential to become a product that makes people take notice. Whether it can be built depends to some extent on my own development ability, so I would be happy if, in next year's retrospective, I can say that I built that thing.

On the technology side, machine learning, data science, and surrounding technologies are still too interesting in every area, so I want to keep learning. I also want to win at least one Kaggle gold medal. In 2023 I joined only one competition and got one silver. Travel and daily life are as written above.

This year I want to do things at my own pace, enjoy myself, and stay healthy. Also, as always, I am grateful to my wife for supporting me in many ways.

With that, I look forward to another year with everyone.

Solving the first AI-Ou quiz competition with vector search only

Thu, 21 Dec 2023 00:00:00 GMT

This article was written for day 21 of the Kaggle Advent Calendar 2023.

With the appearance of LLMs that can handle long token contexts, I feel that Retrieval-Augmented Generation, or RAG, has become increasingly important as a way to improve LLM output accuracy. For example, in the Kaggle competition LLM Science Exam, all top solutions used RAG. One core element of RAG is a search method that retrieves text well representing the target text, such as a question. Major approaches include keyword-based search such as BM25 and vector search using sentence features, or embeddings.

In this article, I use only Japanese vector search to solve the task from the already-finished AI-Ou: Quiz AI Japan Championship, First Competition, and check what score can be achieved. I also evaluate several Japanese embedding conversion models.

What is the first AI-Ou competition?

The first AI-Ou, or Quiz AI Japan Championship, competition asks systems to choose one correct answer from about 20 candidates for each question. About 13,000 examples were published for training and about 2,000 for validation. The quiz answers are always included in Japanese Wikipedia. A dataset example looks like this:

## 質問
1868年に化石が発見された南フランスの地名から名が付いた、現在の人類の直接的な祖先とされる化石人類は何でしょう?

## 回答候補
['ホモ・ハイデルベルゲンシス', 'ホモ・サピエンス・イダルトゥ', 'クロマニョン人', 'ホモ・エルガステル', 'ジャワ原人', 'オロリン', 'サヘラントロ プス', 'アウストラロピテクス・アフリカヌス', 'ホモ・アンテセッサー', '猿人', 'ネアンデルタール人', 'ホモ・ ゲオルギクス', 'ホモ・エレクトス', '元謀原人', 'アウストラロピテクス', 'ホモ・フローレシエンシス', 'ホモ・ローデシエンシス', 'アウストラロピテクス・アファレンシス', 'ホモ・サピエンス', 'ホモ・ハビリス']

## 正解
クロマニョン人

Predicting the answer with only vector search

Because the quiz answers are always included in Japanese Wikipedia, I convert the question text into embeddings, search over embeddings of Japanese Wikipedia passages, and extract the top-N passages and Wikipedia titles with high similarity. Then I search that text for the first occurrence position of each of the 20 candidate answer strings, and predict the answer whose first occurrence is earliest. For Wikipedia search, I use the roughly 5.5 million passage dataset from Building Japanese Wikipedia embeddings and a FAISS index for RAG.

For example, for the question above, I convert "1868年に化石が発見された南フランスの地名から名が付いた、現在の人類の直接的な祖先とされる化石人類は何でしょう?" into embeddings, retrieve top-N by vector search, and concatenate the results into one text. The example below uses top-3.

南アフリカの人類化石遺跡群 クロマニョン人 化石人類 そのため、180万年前から150万年前と推測されるその時期、東アフリカではヒト属が優勢になっていたのに対し、南アフリカで優勢だったのはパラントロプス属の方だったのだろうと考えられている。グラディスヴェール はスタルクフォンテインから8 km ほどの場所にある遺跡で、1948年には探索が行われていたが、化石人骨の出土は1992年になってのことだった。この地で調査に当たっていた古人類学者リー・バーガー(英語版)は、アウストラロピテクス・アフリカヌスの断片を見つけるにとど まっていたという。しかし、バーガーは2008年8月にヨハネスブルグからグラディスヴェールに向かう大きな道を数 km 手前で脇に逸れ、グーグル・アースで見当をつけていた近隣の石灰石採掘場跡に赴いた。その場所で彼は9歳の息子マシューとともに、新種の猿人化石を発見した。 クロマニョン人(クロマニョンじん、Cro-Magnon man)とは、南フランスで発見された人類化石に付けられた名称である。1868年、クロマニョン (Cro-Magnon) 洞窟で、鉄道工事に際して5体の人骨化石が出土し、古生物学者ルイ・ラルテ(フランス語版、英語版)によって研究された。その後、ヨーロッパ、北アフリカ各地でも発見された。現在ではクロマニョン人を、現世人類と合わせて解剖学的現代人(英語: anatomically modern human) (AMH) と呼ぶことがある。またネアンデルタール人を、従来の日本語では旧人と呼ぶのに対し(ネアンデルター ル人以外にも、25万年前に新人段階に達する前の、現代型サピエンスの直接の祖先である古代型サピエンス等も旧人段階の人類とみなすこ とがある)、クロマニョン人に代表される現代型ホモ・サピエンスを、従来の日本語では新人と呼ぶこともある。 化石人類(かせきじんるい、英語: fossil hominidまたはfossil man)は、現在ではすでに化石化してその人骨が発見される過去の人類。人類の進化を考察していくうえで重要な化石資料となる。資料そのものは化石人骨(かせきじんこつ)とも称する。また、主に第四紀更新世(洪積世)の地層で発見される ので更新世人類ないし洪積世人類とも称する。

From this text, I search for the answer candidates above and choose the one that appears first. The candidates include 'ホモ・ハイデルベルゲンシス', 'ホモ・サピエンス・イダルトゥ', 'クロマニョン人', ..., so the predicted answer is "クロマニョン人". The true answer is also "クロマニョン人", so this case is correct.

Japanese embedding models and accuracy

For data, I used the roughly 2,000 validation examples, dev1 and dev2, provided by the first AI-Ou competition, and evaluated with accuracy. As Japanese embedding conversion models, I used:

For the e5 series, different embeddings are generated by adding passage: to the original text for retrieval, or query: otherwise, so I tried both. Also, because search uses a FAISS IVFPQ-compressed index, based on the values from Measuring speed, data size, and accuracy for vector search algorithms and quantization parameters, answer accuracy may fluctuate by about plus or minus 2% for top-3 and plus or minus 0.5% for top-5 compared with an uncompressed index.

The results are below. As a reference, scores for the 13,000 training examples are also written in a separate sheet. acc@N is accuracy calculated from top-N data, and NMR@N is the no match rate, where none of the 20 answer candidate keywords were found from top-N.

For top-1, 3, 5, 10, 20, and so on, multilingual-e5-large wins decisively. I had assumed, with some bias, that multilingual-e5-small and multilingual-e5-large might not differ that much. But a 7% difference in accuracy between small and large changes the picture a lot. Another surprise was that for e5 embeddings, the difference between using passage: and query: as the prefix was almost nonexistent except for base. In fact, outside base, the results were almost reversed. For retrieval tasks that retrieve answer text for a question, I expected passage: to score better, but that was not the case here. Looking only at this result, it feels like for RAG search with e5 embeddings, using the more general query: prefix, which also works for similar sentence tasks, may be fine.

The reason many results are worse at acc@100 than acc@10 is the keyword search order. I concatenate text in the order title@1, title@2, ..., title@N, passage@1, passage@2, so as N grows, the chance of matching a wrong title increases.

Difference from top competition teams

According to the AI-Ou retrospective, top teams in the competition, including the LB first-place team, had CV scores over 0.95 on the dev dataset. Even the best e5-large score of 0.7791 is nowhere close.

Still, for only vector search and simple string search with no training, preprocessing, or postprocessing, I think the score is fairly good. I cannot confirm it now, but the originally published BERT fine-tuning baseline was reportedly around 0.8. If no-training search reaches 0.78, that does not feel bad.

Closing

This time I tried the first AI-Ou quiz competition using only vector search. In question-answering systems and similar applications, techniques that insert knowledge an LLM does not have and produce desired output through RAG + in-context learning will probably continue until LLMs can cheaply add external knowledge through training and hallucinations are almost eliminated. In Kaggle too, RAG and in-context learning may appear again in NLP tasks.

For this competition task, where the goal is to search for likely answer-containing Japanese sentences for a question, multilingual-e5-large performed well as an embedding conversion model. But for similar sentence search tasks, other models may perform better, as shown by JSTS, JSICK, and related evaluations. It seems necessary to evaluate performance appropriately for the task and data you want to handle.

I hope this article helps with Kaggle tasks or with using and choosing Japanese embeddings.

Training a Q&A + RAG-focused LLM with SFT, making 4-bit quantized models, and exceeding GPT-3.5 with a 7B model

Fri, 15 Dec 2023 00:00:00 GMT

This article was written for December 15 of the LLM Advent Calendar 2023.

Recently I wrote Building Japanese Wikipedia embeddings and a FAISS index for RAG, where I used GPT-3.5 or GPT-4 to extract answers to questions. Since I had the data, I wanted to avoid using a huge LLM such as OpenAI's models and instead train a rapidly improving local LLM with Supervised Fine-tuning Trainer, or SFT, making an LLM specialized for Q&A + RAG tasks. As the base LLM, I used youri7b-instruction, published by rinna.

For example, if I give the trained model an input like this:

以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。

### 指示:
楽曲『約束はいらない』でデビューした、声優は誰?

### 入力:
Suara 大阪府立豊中高等学校を経て大阪外国語大学でインドネシア語を専攻中にバンド・ユニットを組んで音楽活動を始めた。普段はお笑い番組が大好きなこともあってよく喋るほうだが、東京の仕事で標準語の喋りをする時は、
早見沙織 声優デビュー時より、数多くの主題歌やキャラクターソングを担当し、バラードからポップス、ヒットソングのカバー曲や英語の楽曲など、様々な曲を歌いこなす。2009年には吉田仁美とのユニット「blue dro
約束はいらない 「約束はいらない」(やくそくはいらない)は、坂本真綾のデビューシングル。
約束はいらない 坂本真綾の歌手デビュー作品。当時坂本はまだ無名の声優であったが、同曲がテーマソングとなったアニメ『天空のエスカフローネ』とともに知名度を上げることとなる。後に「指輪」が同アニメの劇場版映画の主題歌とな
坂本真綾 本格的な歌手活動は、1996年にテレビアニメ『天空のエスカフローネ』へ出演したことがきっかけで始めており、同作のオープニングテーマソングである「約束はいらない」(岩里祐穂作詞、菅野よう子作曲)をシング

### 応答:

It outputs only the answer to the question:

坂本真綾

I also created a notebook that runs on a Google Colab T4 GPU, so please try it if you are interested.

Performance comparison with GPT-3.5

Let's compare the model I made and its quantized versions with GPT-3.5 and GPT-4. For the comparison dataset, I used 980 validation examples from hotchpotch/jaqket_v1_qa_wikija_context. For questions that have context containing the answer, I evaluated whether the answer could be extracted correctly using exact match and partial match accuracy.

The results are below. After training, every model exceeded GPT-3.5 accuracy by a large margin. Execution time was also faster, especially for the AutoGPTQ quantized model, which was about twice as fast. I discuss the numbers later in the article.

Model	Exact match	Partial match	Time	GPU memory (MB)
GPT3.5	0.5949	0.799	405
GPT4.0	0.8786	0.9173	1152
fp16 before training	0.5908	0.7327	4218	11122
fp16 after training	0.7582	0.8939	4146	9964
BnB 4bit	0.7602	0.8867	397	3774
AutoGPTQ	0.7969	0.8867	211	4695
AutoAWQ	0.7316	0.8847	301	5933

The evaluation code is in the eval_xxx files here:

https://github.com/hotchpotch/youri-7b-stf-qa-context-jaqket/

Training with Supervised Fine-tuning Trainer

SFT is an easy way to train a model to output a specific format in response to instructions, or instruction tuning. The way to train is simple. Prepare examples like this:

### 指示:
今日の天気は何ですか？

### 入力:
本日は大雨ですね。

### 応答:
大雨

In this example, the data after ### 応答: is what we want the model to output well. If you provide the example sentence and ### 応答:, it trains appropriately. During actual training, the model predicts after 応答: and learns the token probability score, or cross entropy loss, for the desired answer. In other words, once examples can be created, this is an easy training method that handles the rest nicely. I have heard that around 1000 examples can be enough to train reasonably well, though a citation is needed.

Training dataset

For training, I used 2939 train examples from jaqket_v1_qa_wikija_context. This dataset extracts the CC BY-SA 4.0 DEED licensed portion of the AI Quiz King dataset and adds context usable for RAG.

Training

Training was done with this implementation. On an RTX 4090, one epoch, or 91 steps, took a little over two hours.

https://github.com/hotchpotch/youri-7b-sft-qa-context-jaqket/blob/main/train.py

I omit the details here, but the training loads youri7b-instruction with BnB 4-bit quantization and FlashAttention 2, then trains with LoRA. I also used NEFTune to improve performance.

Looking at training results

The training process is recorded in this wandb run:

https://wandb.ai/hotchpotch/youri-7b-stf-qa-context-jaqket/runs/beelrst5/workspace

Train loss flattened fairly quickly, and eval loss stopped decreasing around 40% of the training. Forty percent means around 1200 training examples, so the claim that around 1000 examples can train reasonably well feels fairly plausible.

Let's also look at wrong results at the end of training where exact match did not match. wandb is convenient because it can display dataframes as tables.

Many results were close: an extra 』 at the end, or differences between full-width ＝ and half-width =.

Model quantization

As of December 2023, according to Quantize 🤗 Transformers models, Hugging Face Transformers lists the following three quantization methods as easy to use from Python:

AWQ
- Implementation of Activation-aware Weight Quantization for LLM Compression and Acceleration
GPTQ
- Implementation of GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
BnB (bitsandbytes)
- Implementation related to 8-bit Optimizers via Block-wise Quantization and others

BnB is a relatively older quantization method, and I also used it during training. GPTQ appeared in 2022, and AWQ appeared in 2023. This time I quantized with each method at 4 bits and evaluated on the validation data of jaqket_v1_qa_wikija_context. For AWQ and GPTQ, I provided Wikipedia text and training data as calibration sample text during quantization so that quantization would be better.

The result is the same as the table shown earlier. The measurement environment is CPU Ryzen 9 5950X and GPU RTX 4090. Every model exceeded GPT-3.5 on partial match and exact match, and all quantized models were faster. AutoGPTQ was about twice as fast as GPT-3.5. Comparing the quantized models, it is understandable that non-quantized fp16 had the best exact match, but unexpectedly, AutoGPTQ had the best partial match and exceeded fp16. Because I passed training data as samples during AutoGPTQ quantization, that bias may have made the result better than fp16. Against GPT-4.0, the models lose clearly on accuracy, which is unavoidable.

Model	Exact match	Partial match	Time	GPU memory (MB)
GPT3.5	0.5949	0.799	405
GPT4.0	0.8786	0.9173	1152
fp16 before training	0.5908	0.7327	4218	11122
fp16 after training	0.7582	0.8939	4146	9964
BnB 4bit	0.7602	0.8867	397	3774
AutoGPTQ	0.7969	0.8867	211	4695
AutoAWQ	0.7316	0.8847	301	5933

I did not tune GPTQ, AWQ, or BnB in detail, so results may differ with better optimization. For example, AWQ can speed up inference by changing the algorithm version depending on token length and batch size for the use case. Also, GPU memory is the memory at model load time, and inference probably uses more GPU memory.

Closing

This time I used easy SFT training to fine-tune a 7B local LLM so that it can answer appropriately for Q&A + RAG tasks. As a result, although general ability was lost, the quantized models achieved better speed and accuracy than GPT-3.5 on a home machine. With SFT, if you have around 1000 training examples, it seems possible to make a model follow many output formats, and in that case training may take less than one hour on an RTX 4090. That makes training LLMs specialized for specific uses feel casual.

Local LLM performance will continue improving, and smaller high-performance local LLMs such as TinyLlama-1.1B will likely continue to be developed. I look forward to local LLM progress next year.

Implementations, notebooks, and public models used for training and inference

https://github.com/hotchpotch/youri-7b-sft-qa-context-jaqket/
hotchpotch/youri-7b-sft-qa-context-jaqket-gptq
- Model quantized with AutoGPTQ
hotchpotch/youri-7b-sft-qa-context-jaqket-awq
- Model quantized with AutoAWQ

Building Japanese Wikipedia embeddings and a FAISS index for RAG

Mon, 04 Dec 2023 00:00:00 GMT

This article was written for December 4 of the Information Retrieval and Search Technology Advent Calendar 2023.

With the recent rise of LLMs, I feel that RAG, or Retrieval-Augmented Generation, has become increasingly important as a way to improve LLM outputs by using external information as in-context learning. But when building a system with RAG, it is not very interesting if the data is too small. So I created embeddings for about 5.5 million Japanese Wikipedia passages that can be searched easily and used as RAG input data, along with FAISS indexes for fast vector search.

For example, you can search for relevant Wikipedia sentences like this:

from datasets.download import DownloadManager
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import faiss

# wikipedia 日本語データセットのロード
wikija_dataset = load_dataset(
    path="singletongue/wikipedia-utils",
    name="passages-c400-jawiki-20230403",
    split="train",
)
# faiss index のダウンロード
dm = DownloadManager()
index_local_path = dm.download(
    f"https://huggingface.co/datasets/hotchpotch/wikipedia-passages-jawiki-embeddings/resolve/main/faiss_indexes/passages-c400-jawiki-20230403/multilingual-e5-small-passage/index_IVF2048_PQ96.faiss"
)
# faiss index のロード
faiss_index = faiss.read_index(index_local_path)

# embeddings へ変換するモデルのロード
model = SentenceTransformer("intfloat/multilingual-e5-small")
model.max_seq_length = 512

# embeddings へ変換
def to_emb(model, text, prefix="query: "):
    return model.encode([prefix + text], normalize_embeddings=True)

emb = to_emb(model, "楽曲『約束はいらない』でデビューした、声優は誰?")

# faiss で検索して、関連 Top-5 を取り出す
TOP_K = 5
scores, indexes = faiss_index.search(emb, TOP_K)
for idx, (id, score) in enumerate(zip(indexes[0], scores[0])):
    data = wikija_dataset[int(id)]
    print((score, data["title"], data["text"][:100]))

The result is below. For the question asking "who?", the appropriate person's name appears in the top 3.

(0.21018645, '約束はいらない', '「約束はいらない」(やくそくはいらない)は、坂本真綾のデビューシングル。')
(0.24241784, '約束はいらない', '坂本真綾の歌手デビュー作品。当時坂本はまだ無名の声優であったが、同曲がテーマソングとなったアニメ『天空のエスカフローネ』とともに知名度を上げることとなる。後に「指輪」が同アニメの劇場版映画の主題歌とな')
(0.2441598, '坂本真綾', '本格的な歌手活動は、1996年にテレビアニメ『天空のエスカフローネ』へ出演したことがきっかけで始めており、同作のオープニングテーマソングである「約束はいらない」(岩里祐穂作詞、菅野よう子作曲)をシング')
(0.26043588, 'Suara', '大阪府立豊中高等学校を経て大阪外国語大学でインドネシア語を専攻中にバンド・ユニットを組んで音楽活動を始めた。普段はお笑い番組が大好きなこともあってよく喋るほうだが、東京の仕事で標準語の喋りをする時は、')
(0.2690969, '早見沙織', '声優デビュー時より、数多くの主題歌やキャラクターソングを担当し、バラードからポップス、ヒットソングのカバー曲や英語の楽曲など、様々な曲を歌いこなす。2009年には吉田仁美とのユニット「blue dro')

Because the full Wikipedia text and the search index can be loaded from Hugging Face datasets, this can be used easily without doing difficult setup. The FAISS index used for search is memory-efficient and fast. For about 5.5 million vectors, it can search quickly on CPU without requiring large compute resources.

I also made a Hugging Face Spaces demo for RAG, so you can try RAG search and QA interaction from a Web UI.

https://huggingface.co/spaces/hotchpotch/wikipedia-japanese-rag-qa

Using Wikipedia sentences as RAG context

Next, let's use the answer data for the question "楽曲『約束はいらない』でデビューした、声優は誰？" as in-context learning for RAG. First, asking ChatGPT 3.5 directly produced this result:

楽曲『約束はいらない』でデビューした、声優は誰？

「約束はいらない」は、声優の悠木碧（ゆうきあおい、英: Aoi Yuki）によって歌われた楽曲です。悠木碧は、日本の声優として知られており、多くのアニメ作品やゲームで声優として活躍しています。この曲は彼女のデビューシングルの一つとして知られています。

ChatGPT 3.5 gave a wrong answer. Next, I injected the Wikipedia search results above and used them for generation, in other words as RAG search results for in-context learning.

楽曲『約束はいらない』でデビューした、声優は誰？以下のデータにヒントが含まれる場合は参考にして答えよ。また以下のデータ自体が間違っている場合もあるため、その場合はデータを利用せずに答えよ。

(0.21018645, '約束はいらない', '「約束はいらない」(やくそくはいらない)は、坂本真綾のデビューシングル。') (0.24241784, '約束はいらない', '坂本真綾の歌手デビュー作品。当時坂本はまだ無名の声優であったが、同曲がテーマソングとなったアニメ『天空のエスカフローネ』とともに知名度を上げることとなる。後に「指輪」が同アニメの劇場版映画の主題歌とな') (0.2441598, '坂本真綾', '本格的な歌手活動は、1996年にテレビアニメ『天空のエスカフローネ』へ出演したことがきっかけで始めており、同作のオープニングテーマソングである「約束はいらない」(岩里祐穂作詞、菅野よう子作曲)をシング') (0.26043588, 'Suara', '大阪府立豊中高等学校を経て大阪外国語大学でインドネシア語を専攻中にバンド・ユニットを組んで音楽活動を始めた。普段はお笑い番組が大好きなこともあってよく喋るほうだが、東京の仕事で標準語の喋りをする時は、') (0.2690969, '早見沙織', '声優デビュー時より、数多くの主題歌やキャラクターソングを担当し、バラードからポップス、ヒットソングのカバー曲や英語の楽曲など、様々な曲を歌いこなす。2009年には吉田仁美とのユニット「blue dro')

提供されたデータによれば、楽曲「約束はいらない」は坂本真綾のデビューシングルであり、坂本真綾が歌手デビューした作品であることが確認できます。したがって、この楽曲でデビューした声優は坂本真綾（さかもとまあや）です。

This answered correctly. The test is here. GPT-4 answered correctly from the start even without adding RAG information.

How to create Wikipedia sentence and passage data

There are several ways to obtain Wikipedia data, such as full dump data, CirrusSearch dump data, and calling the API. Each has advantages and disadvantages. Personally, I think calling the API is the best way to get clean text. However, calling the API for every article is hard, so at first I used the easier CirrusSearch dump data and split it into chunks for testing.

Then I found singletongue/wikipedia-utils, created by Masatoshi Suzuki. Looking inside, it contains clean text data including article titles and section titles. It also publishes passages of useful lengths, such as 200, 300, and 400 characters, so I created embeddings based on that data.

Which model to use for sentence embeddings

Embedding models have different performance directions depending on the use case, roughly corresponding to how they were trained. In RAG systems, the most important type is a model strong at retrieval or reranking tasks, trained so that questions and matching passages score highly. The next best direction is a model strong at similar sentence search.

Among models that seemed likely to perform well and could be used freely for Japanese, I converted data into embeddings with the following models. OpenAI's embeddings API, ada-v2, has the advantage of long input token length, but I excluded it because the output dimension is large at 1536 dimensions, the license situation for converted data is unclear, and above all it costs API fees.

multilingal-e5 series
- You need to be aware of prefixes: use "passage: " for retrieval passages and "query: " otherwise. It feels quite strong.
pkshatech/GLuCoSE-base-ja
cl-nagoya/sup-simcse-ja-base

I converted the passage-400 data above into embeddings with these models. For e5, using "passage: " or "query: " at the beginning of the target text produces different vectors, so I published embeddings generated with each prefix.

https://huggingface.co/datasets/hotchpotch/wikipedia-passages-jawiki-embeddings/tree/main/embs/passages-c400-jawiki-20230403

In addition to embeddings, I published FAISS indexes for search. For index parameters, based on this evaluation, I used IVF nlist=2048, and set the PQ quantization option to one quarter of the vector dimension, such as 96 for 384 dimensions. If you are interested in how Wikipedia sentence embedding search behaves with each model, please try them.

https://huggingface.co/datasets/hotchpotch/wikipedia-passages-jawiki-embeddings/tree/main/faiss_indexes/passages-c400-jawiki-20230403

The implementation used for conversion is published as hotchpotch/wikipedia-passages-jawiki-embeddings-utils. datasets_to_embs.py converts datasets to embeddings, and embs_to_faiss.py creates FAISS indexes from embeddings.

Closing

This article introduced and described creating Japanese Wikipedia embeddings and search indexes for RAG that are easy to try in a general way. In this year when LLMs rose rapidly, search use cases are likely to change greatly, and I feel this was the year when search technology started to be rediscovered and redefined. RAG, though this article only touched one element of RAG, vector search, suggests expansion into many search use cases. I look forward to the development and use of search technology next year and beyond.

Reference links

Measuring speed, data size, and accuracy for vector search algorithms and quantization parameters

Mon, 27 Nov 2023 00:00:00 GMT

Recently, more use cases convert sentences into feature vectors such as embeddings. When searching for similar vectors, a few thousand vectors usually require almost no special thought. From tens of thousands of vectors, you often use approximate nearest neighbor algorithms such as HNSW to speed up search. From millions of vectors, you often combine optimization techniques such as quantization to keep the data size practical.

These optimizations for similar vector search, such as HNSW, IVF, and quantization, create tradeoffs among search speed, data size, and accuracy. When thinking about optimization strategies under those tradeoffs, I often see recall@10 or recall@100 used as the reported accuracy metric. For example, Choose the k-NN algorithm for your billion-scale use case with OpenSearch evaluates with recall@10, and Byte-quantized vectors in OpenSearch uses recall@100.

If search results are combined with information other than embeddings, or if a reranker re-sorts the results, recall@10 or recall@100 may be fine. But when using retrieval for RAG, I do not often put top-10 results into the LLM or Reader in-context. In my own use, top-3 or top-5 is more common. So I evaluated recall@1, @2, @3, and @5 with FAISS, representative algorithms, and quantization, and measured search speed, data size, and accuracy, or recall. I used FAISS as the library. Even outside FAISS, major vector search engines usually implement the main algorithms and quantization methods, and OpenSearch can also use FAISS internally as a vector search engine.

Algorithms and parameters in vector search databases

Before measuring, here is a review of algorithms and parameters used when creating indexes. Major examples include IVF and HNSW search algorithms, compression with PQ, or Product Quantization, and each of their parameters. The following descriptions are GPT-4 output with some edits. For nearest neighbor search itself, I recommend reading Professor Matsui's Theory and Applications of Approximate Nearest Neighbor Search Using Graphs, which explains it thoroughly.

HNSW (Hierarchical Navigable Small World)
- Approach: HNSW uses a graph-based approach. Each node, or data point, has links to neighboring nodes, and search is performed efficiently through those links.
- Characteristics:

- Fast search: The graph's hierarchical structure enables very fast approximate nearest neighbor search.
- Dynamic addition: New data points can be added dynamically.
- High accuracy: It provides high accuracy compared with many other approximate algorithms.
- Memory usage: Because of the graph structure, memory usage may be relatively large.

Parameter: M

- The maximum number of neighboring nodes each data point has.

IVF (Inverted File Index)
- Approach: IVF divides data into multiple clusters and creates a separate index for each cluster. During search, it identifies the clusters closest to the query and searches only inside those clusters.
- Characteristics:

- Efficient large-scale search: It can make search efficient for large datasets.
- Scalability: It can be applied to large datasets and is scalable.
- Customizable: Many parameters can be customized, such as the number of clusters, nlist, and the quantization level.
- Memory usage: It is often more memory-efficient than HNSW, though this depends on the number of clusters and quantization level.

Parameter: nlist

- The number of clusters.

Comparison of HNSW and IVF
- Accuracy: HNSW generally provides higher accuracy than IVF, but memory usage tends to increase.
- Speed: HNSW enables fast search, while IVF is more scalable and memory-efficient for large datasets.
- Use cases: HNSW is suitable when accuracy is important or for realtime search. IVF is suitable for large datasets or limited memory resources.
Product Quantization (PQ)
- Product Quantization is a technique for efficiently compressing high-dimensional vectors. It includes the following steps:

- Vector splitting: Each vector is split into multiple lower-dimensional subvectors.
- Subvector quantization: Each subvector is quantized using a small separate codebook, or predefined set of values. Each subvector is mapped to the nearest value in that codebook.
- Compressed representation: Finally, the original vector is represented as a combination of these quantized subvectors.

The main parameter for HNSW is M; for IVF it is nlist, and also mbit, though that did not appear above. When using PQ, the parameter is the number of subvectors. The number of subvectors must divide the original vector dimension. For example, if the original dimension is 384, possible values include 32, 64, and 96. These parameters are needed before training, but the number of graph nodes or clusters searched can be decided at search execution time with parameters such as efSearch for HNSW and nprobe for IVF.

In a FAISS + Python environment, index_factory() lets you create indexes from strings like this, which is convenient:

faiss.index_factory(d, "IVF2048,PQ64") # nlist = 2048, PQ = 64
faiss.index_factory(d, "HNSW32,PQ64") # M = 32, PQ = 64

Dataset and code

This time I used ANN_SIFT1M, which is often used in FAISS code. As the name says, it is a dataset of 1M, or one million, 128-dimensional vectors. I used 10,000 search queries from it and measured recall@N. FAISS can also use GPU for search, but most searches will probably run on CPU, so I used a CPU, Ryzen 9 5950X.

The benchmark code is bench_gpu_sift1m_ivf_hnsw.py. If you put this source file into the benchs directory of a cloned FAISS repository, place the dataset appropriately, and run it, it should reproduce the benchmark.

Measurement results

https://docs.google.com/spreadsheets/d/1ZsMJZf-4tgKgfSa4zvLcmZrBZpoy6Op_EOgY17GPOfY/edit?usp=sharing

The result spreadsheet has two sheets. Showing everything would be too large, so one sheet contains extracted data.

As described at the beginning, IVF, HNSW, and PQ, or quantization, show tradeoffs. IVF is outstandingly memory-efficient with small data size, but its speed and accuracy are worse than HNSW. However, once the data no longer fits in memory, HNSW speed will likely degrade too. The recall@1 to recall@100 values are interesting. Recall@100 and recall@10 approach 1.0 fairly quickly, but if recall@3 or recall@5, meaning whether the same data appears in the top 3 or 5 results, is important for your system, you need to choose parameters carefully. Also, when using PQ, the probability that top-1 matches under recall@1 can be fairly low. If top-1 is important, you need to think carefully about what to do.

Summary

For systems where top-3 or top-5 matters in RAG, judging that everything is fine based on a larger recall metric without using a reranker may diverge from the results you actually want. The point is obvious, but choose metrics based on what you want to do, and set optimal parameters that fit those metrics. These results are only for the SIFT1M dataset, so different datasets should produce different results.

There is no best parameter that works without thought. But for vectors beyond one million, especially if they may reach tens of millions, I feel that IVF + quantization is a good direction. nlist around 1024 seems good because it is near the square root of the actual count, so for ten million items something like 4096 may be appropriate. PQ should be as large as possible; for example, for 128-dimensional vectors, 64 seems good. If accuracy and speed are important and the scale is up to a few million vectors, HNSW + PQ can also be handled with a practical data size, so it should be considered too.

The FAISS benchs directory includes measurements beyond IVF and HNSW, such as a benchmark for dimensionality reduction with PCA. It contains quite a few measurements for things you may want to do in vector search, so it is worth a look.

Extra: a Kaggle competition mistake

In the already-finished Kaggle competition LLM Science Exam, many solutions improved scores by putting top 1 to 5 RAG results into the prompt. I did that too. At the time, I chose FAISS parameters fairly casually, so the data size became quite small, but I did not notice that accuracy had actually dropped a fair amount. I would like to tell my past self to measure properly and think about parameters when accuracy matters.

Making Transformers inference 1.6 to almost 2 times faster with CTranslate2

Thu, 23 Nov 2023 00:00:00 GMT

There is a fast inference library written in Python and C++ called CTranslate2. I had wanted to try it someday, but because it required converting models, I had put it off. Then I learned about hf_hub_ctranslate2, a library that transparently converts Hugging Face models into a format usable with CTranslate2 and runs inference with them. I tried it and very easily got 1.6x faster inference on GPU and 1.9x faster inference on CPU, with almost no change in accuracy. I should have used it earlier, so here is a note.

What is CTranslate2?

CTranslate2, or CT2 below, is, as the GitHub project overview says, "a C++ and Python library for efficient inference with Transformer models." It is a library that makes Transformer model inference efficient through various optimizations. Libraries for efficient inference such as llama.cpp basically support only decoder models, but CT2 supports not only decoder models, but also encoder-decoder models and some encoder models. Since supported encoder models include BERT, BERT-family models can also run inference efficiently.

You may think, "BERT? Do we still use such an old architecture?" But for example, multilingual-e5-small, the model I use daily to generate embeddings, is also a BERT-family model. There are still many chances to use it.

Embedding inference with CTranslate2 and SentenceTransformer

Using CTranslate2 as a SentenceTransformer-compatible model is very easy. For example, change this SentenceTransformer code:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer(model_name, device=device)
embs = model.encode(texts)

to this:

from hf_hub_ctranslate2 import CT2SentenceTransformer
model = CT2SentenceTransformer(
    model_name, device=device, compute_type=compute_type
)
embs = model.encode(texts)

That alone basically makes inference 1.6 to 2 times faster. It also uses less memory.

CT2SentenceTransformer is implemented as a subclass of SentenceTransformer, so it can be used in almost the same way. I describe compute_type later.

Actual inference speed and output differences

Let's look at the measured speed difference. I used 20,000 Japanese Wikipedia samples, taking the first 512 tokens and converting them into embeddings with multilingual-e5-small for a similarity search task by adding the "query: " prefix. I compared inference with the original SentenceTransformer and with CT2 under several compute_type settings. The notebook is here. The GPU is RTX 4090, and the CPU is Ryzen 9 5950X. speed is relative to SentenceTransformer as 1.0.

device	type	speed	time	rps	mAP@100	MSE
cuda	sentence_transformer	1.00	38.99	512.94	-	-
cuda	CT2 + int8	0.94	41.44	482.63	1.0	0.000004
cuda	CT2 + int8_float32	0.93	41.89	477.43	1.0	0.000004
cuda	CT2 + int8_float16	1.45	26.98	741.30	1.0	0.000004
cuda	CT2 + float16	1.66	23.54	849.53	1.0	0.0
cuda	CT2 + auto	1.48	26.36	758.69	1.0	0.000004
cpu	sentence_transformer	1.00	1389.80	14.39	-	-
cpu	CT2 + auto	1.89	737.07	27.13	1.0	0.000004

In this result, CT2 with quantization such as int8 was actually slower, and CT2 + float16 was the fastest. On GPU, compute_type="float16" gave 1.66x speed. As an evaluation metric for the inference results, mAP@100 was 1.0, meaning the ranking did not change. To look at a finer accuracy difference, I also measured MSE, and it was almost unchanged too. It is displayed as 0.0, but the actual value was around 3e-09. For GPU, compute_type="auto" seems to be int8_float.

On CPU, compute_type="auto" was about 1.9x faster. mAP@100 remained 1.0, and MSE was only 0.000004, a tiny difference that should be almost no problem in real operation. There are many cases where inference is run on CPU, so a 1.9x speedup for CPU inference is quite valuable. I did not measure memory this time, but CT2 also advertises lower memory use and it did seem memory-efficient, so it should be even more useful in environments with tight compute resources.

CTranslate2 deserves more attention

CTranslate2 can be used for encoder models, and with hf_hub_ctranslate2, Hugging Face models can be used easily. This time I used it as a replacement for SentenceTransformer, but BERT-family models are still used for many tasks, so I feel its range of use is broad.

However, CTranslate2 currently has 2.2k GitHub stars, while llama.cpp has 44.6k. In this LLM boom, many projects receive a lot of stars, so its popularity feels modest. The name CTranslate2, perhaps because it was originally used to speed up machine translation models, also does not make it easy to infer what the library can do. That feels like a waste. I hope more people try it.

Embedding conversion performance on Apple Silicon GPU (MPS)

Fri, 10 Nov 2023 00:00:00 GMT

With the announcement of the M3 Max with 128 GB unified memory, people were excited that even huge-parameter LLMs might run locally. For my own use case, I wanted to convert text into embeddings on my local Mac, so I measured how fast text-to-sentence-vector, or embedding, conversion actually is.

Environment

I ran the benchmark on a local Mac, Linux through WSL2, and Colab.

Mac: 2022 MacBook Air / M2, 8 CPU cores, 10 GPU cores, 24 GB memory
Linux (WSL2): Ryzen 9 5950X / NVIDIA RTX 4090
Google Colab: T4 instance

On each CPU and GPU, I measured the time to convert the first 512 tokens of 1000 Japanese Wikipedia samples with multilingual-e5-small. The notebook used for measurement is here:

https://colab.research.google.com/drive/14_oeZrN5v7Potq5_a8UXvaOGCUJ4I1m8?usp=sharing

Results

Device	Method	Total Time (sec)	RPS
RTX 4090	CUDA (GPU)	2.58	388.07
Colab T4	CUDA (GPU)	19.92	50.21
MacBook Air M2	MPS (GPU)	33.16	30.15
Ryzen 5950X	CPU	73.18	13.66
MacBook Air M2	CPU	104.89	9.53
Colab	CPU	710.72	1.41

The RTX 4090 wins overwhelmingly, which is expected. But the 10-core M2 GPU reaches about 60% of the T4's speed. M3 Max also has a 40-core GPU model. If speed scales linearly, an M3 Max 40-core GPU would reach around 130 RPS. That is about one third of an RTX 4090 and quite fast for a laptop GPU. It also looks likely to be more than twice as fast as a T4.

Whether M2 embedding conversion is practical depends on the use case. It is not terribly slow, but it is not fast either. Still, the GPU is about three times faster than the M2 CPU, and with Hugging Face Transformers you can use it just by setting the device to "mps". If you use a Mac, you should naturally use the GPU. With M3 Max, I think many use cases would get reasonably practical speed.

Even so, the M2 CPU with 8 cores is surprisingly fast. Ryzen 5950X uses all 16 cores for this run, yet on single CPU core speed the M2 seems faster than the Ryzen 5950X. Library optimization may also be involved, but in a simple comparison without thinking too hard, it is fast.

Update

I received a message from yuumi3 saying that a Mac mini M2 Pro with 10 CPU cores and 16 GPU cores produced the following speeds. Thank you. The GPU score seems to have increased roughly with the move from 10 GPU cores on M2 to 16 GPU cores on M2 Pro.

[mps] convert 1000 embs, total time: 13.59 sec  / rps: 73.60
[cpu] convert 1000 embs, total time: 68.12 sec  / rps: 14.68

Starting Weekly AI News: automated summaries with clustering and GPT

Mon, 31 Jul 2023 00:00:00 GMT

AI-related topics continue to be numerous, and I thought it would be useful to know roughly what became a topic each week. So I started a weekly newsletter on Substack. The content is created fully automatically. For example, the AI news summary for the week going back from July 28, 2023 looked like this:

https://ainewsdev.substack.com/p/weekly-ai-news-1

I would not call it perfect, but I think it gathers reasonably notable topics in a decent way. If you are interested, please subscribe or read it through a feed reader.

Substack signup form

The newsletter title is a tribute to Weekly Kaggle News.

That would be only publicity, so I will also write about the internal implementation. It has not changed drastically from the material I linked before, but roughly:

Generate features, or sentence vectors, from title + summary using multilingual-e5-small.
- I use small so that it runs on the low-spec VPS environment. Subjectively, small did not feel much less accurate.
Add a standardized article timestamp vector to the 384 dimensions from e5-small, making a 385-dimensional representation.
Run KMeans without dimensionality reduction. The number of clusters is total article count divided by 8, chosen roughly. With about 250 target articles, this gives around 30 clusters.
- Reducing dimensions with UMAP or PCA did not produce very good results.
Look at overall distances, extract only articles near each cluster center, and use clusters where at least N articles remain.

This extracts clusters that look meaningful as groups of articles from the week. Then I generate titles and summaries for those clusters with gpt-3.5-turbo. It is basically ordinary BERTopic-like clustering plus GPT-based topic representation. In other words, a topic model implementation.

Recent BERTopic implementations on GitHub also seem to include OpenAI and LLM-based features, such as creating sentence vectors with OpenAI embeddings, or ada-v2, in addition to sentence-transformers, and creating topic representations with ChatGPT or GPT-4. By default, it seems to include c-TF-IDF keyword extraction in the prompt for generation. If you want to try this quickly with a library, BERTopic may be a good option.

Incidentally, newsletter article creation is fully automatic, but Substack itself does not seem to have a mechanism that lets me send the newsletter by calling an API. The final delivery flow is manual through the Web UI, which is unfortunate for my own workload.

Implementing and trying gzip + kNN text classification from the paper that beats BERT

Thu, 20 Jul 2023 00:00:00 GMT

The recently published paper “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors (Jiang et al., Findings 2023) says that it uses the length of data compressed with gzip and performs better than BERT on text classification tasks. That sounded interesting, so I implemented the method myself and tried it. As a result, for text category classification using the livedoor news corpus, it actually achieved a better result than Japanese BERT.

What kind of method is it?

What it does is simple. First, calculate NCD, or Normalized Compression Distance. The examples use gzip as the compression algorithm.

Compress individual data x and y, and call their compressed lengths C(x) and C(y).
Concatenate the two pieces of data into xy, compress it, and call the compressed length C(xy).
Calculate NCD:
- NCD(x, y) = [C(xy) - min(C(x), C(y))] / max(C(x), C(y))

If the information is similar, NCD becomes small. "Hello world!" concatenated with "Hello!" should compress well and have a small NCD, while "Hello world!" and "Good!" should be farther apart.

Then sort the training data by this distance, take the top-k items, and use the most common category among those top-k items as the prediction. In other words, it decides from the k nearest training items under NCD distance, so it is kNN.

Implementation

The original paper's implementation is available at https://github.com/bazingagin/npc_gzip. The method should be simple, but the code was hard to use, so I implemented a version that can be used quickly with sklearn-like fit and predict interfaces.

https://github.com/hotchpotch/ncd_classifier/

Trying it on the livedoor news corpus

Using this implementation, let's classify categories in the livedoor news corpus. In this Japanese BERT implementation article, 1475 items, about 20% of the whole dataset, were used as test data, and the accuracy was 0.9261.

This time I split the dataset into train and test at 8:2, with 5894 train items and 1473 test items.

https://github.com/hotchpotch/ncd_classifier/blob/main/examples/livedoor_news_ja.py

The accuracy was 0.9457, which appears to beat Japanese BERT. It is impressive for such a simple mechanism. You may wonder why I did not prepare validation data, but this method only calculates distance, or NCD, and does not pretrain anything, so validation data would not be meaningful here.

# Accuracy
0.9456890699253224

# Confusion matrix
[[150   0   0   0   1   1   0   0   0]
 [  0 166   2   2   3   0   0   0   2]
 [  0   1 164   0   0   2   0   0   0]
 [  0   0   1 156   0   0   5   0   0]
 [  5   1   3   2 103   5   0   3   4]
 [  5   0   1   1   5 148   0   4   3]
 [  0   0   0   5   1   0 182   0   0]
 [  1   0   1   0   3   7   0 151   0]
 [  0   0   0   0   0   0   0   0 173]]

Trying it on MARC-ja

Next, let's evaluate MARC-ja from the JGLUE dataset. MARC-ja has about 190,000 items with positive and negative labels, with roughly 90% positive and 10% negative. Japanese BERT reportedly gets accuracy 0.958. If everything were predicted as positive, accuracy would be around 0.9.

With NCD Classifier, the accuracy was 0.802. Very bad. At first I thought the implementation must be wrong and checked it repeatedly, but the dataset distribution is heavily skewed, and many texts are too short. Livedoor news articles, for example, have a reasonable length. It seems this method does not perform well under these conditions.

https://github.com/hotchpotch/ncd_classifier/blob/main/examples/JGLUE_MARC_ja.py

# Accuracy
0.8020870180403255

# Confusion matrix
[[4077  755]
 [ 364  458]]

Trying it on AGNews

The paper reports a score of 0.937 on AGNews, but my implementation only reached about 0.898. I do not know the cause of this difference. It may come from implementation differences, the data used, or some difference in data processing. My implementation may be wrong, so please let me know if you notice a problem.

https://github.com/hotchpotch/ncd_classifier/blob/main/examples/ag_news.py

# Accuracy
0.8976315789473684

# Confusion matrix
[[1718   47   83   52]
 [  20 1838   23   19]
 [  72   31 1635  162]
 [  81   37  151 1631]]

Rough summary

I confirmed that simple text classification with NCD using gzip plus kNN, without training, can outperform BERT in some cases. As the paper says, it seems likely to work well on small datasets of a few hundred to a few thousand items where the text length is reasonably long, such as news articles. Among the datasets I tried, the livedoor news corpus is exactly that kind of dataset.

This paper's approach looks worth trying as one classifier for casual text classification that does not require pretraining. The implementation is also simple.

However, as the paper says, computational cost grows as the data grows. At prediction time, if M is the number of train items and N is the number of items to predict, the cost is roughly O(M*N). The more data you have, the more prediction cost you pay. It is not learning features; it is doing a direct full calculation, so it is slow. For example, the MARC-ja dataset has M = 187528 and N = 5654, so the computation cost is large. Even using 32 virtual cores on a Ryzen 7950X CPU fully, predicting 5654 items takes about 30 minutes.

Still, it was an interesting paper showing that a simple approach without pretraining can produce enough performance depending on the use case.

Generating answers from images with ChatGPT 3.5 and extracting information through BLIP-2 prompts

Wed, 05 Jul 2023 00:00:00 GMT

This is about how image-based answer generation, similar to what can be done with the ChatGPT Web UI and GPT-4, can also be useful with ChatGPT 3.5 and BLIP-2 if the requirements match. What I really want to talk about is information extraction through prompts using BLIP-2.

The reason I tried this was that I wondered how GPT-4 generates answers from images. While looking into it, I found BLIP-2 explained as an example of incorporating VQA tasks into an LLM in Current Status and Prospects of Vision and Language (GPT-4).

I had roughly thought of BLIP as something for generating captions from images. BLIP-2, however, trains with an open LLM behind it, which lets it learn information that cannot be represented by image-caption pairs alone and improves accuracy.

What is useful for users is that it is not limited to ordinary caption generation. You can extract information by making it solve a Q&A task about an image.

For example, "a night view of Tokyo Tower is shown" is generated by having GPT-3.5 answer based on a simple caption. The flow for the opening example, "please suggest several recipes", is as follows:

Use image_question_prompt with "What ingredients are there?" to extract information from the image with BLIP-2.
- The QA task only accepts English, so translate with Meta's NLLB-200, which is for non-commercial use only.
- The prompt becomes Question: What are the ingredients? Answer:.
The extracted text becomes "Vegetables, carrots, cabbage, cauliflower, broccoli, and potatoes".
After that it is easy. Use that text and have GPT-3.5 answer with few-shot prompting.

That is the simple mechanism. If you write the information-extraction prompt directly in English for image_question_prompt, you can process it without the translation step.

This cannot handle advanced context in the way ChatGPT with GPT-4 can. But if the task can be combined with information extraction that BLIP-2 can perform, it can work reasonably well.

What I really want to say in this article is what I wrote above: with an appropriate prompt, BLIP-2 can perform a certain amount of information extraction. ChatGPT's own API will probably become able to handle image-based information extraction tasks, but BLIP-2 plus open LLM models should also continue to improve. I expect zero-shot information extraction with open models to become more accurate, and I am looking forward to that.

References

Enjoying Stable Diffusion again from a technical perspective

Tue, 04 Jul 2023 00:00:00 GMT

Recently I used Stable Diffusion again through stable-diffusion-webui, and there were several technical things I did not know. These are my notes.

ControlNet

ControlNet is an extremely powerful neural network for controlling generated images so that they follow specific conditions. It works properly with any base model.

As of v1.1, it supports conditions such as depth, semantic segmentation, human pose, fake scribbles, HED boundary, M-LSD lines, and Canny edge. From a source image, it can use composition, pose, segmentation, edge extraction, masked-region completion, and many other conditions for image generation. The way it combines existing datasets and architectures is also exciting. The range of applications is wide.

It is innovative enough that anyone who has not used ControlNet image generation should try it. In the illustration-generation context, people often focus only on pose control, but it can reproduce many kinds of composition. It is seriously impressive. More types of conditional image generation will probably become possible from here.

There is also a ControlNet WebUI extension for SD-WebUI, so it can be used easily from sd-web.

https://github.com/Mikubill/sd-webui-controlnet

Clear explanations and related material:

LoRA: Low-rank Adaptation for Fast Text-to-Image Diffusion Fine-tuning

https://github.com/cloneofsimo/lora
https://github.com/kohya-ss/sd-scripts/blob/main/docs/train_network_README-ja.md (LoRA training based on the implementation above)

The original LoRA paper, LoRA: Low-Rank Adaptation of Large Language Models, is about LLMs, or Transformers. The LoRA used with Stable Diffusion often uses these implementations. The cloneofsimo/lora implementation can train for high-quality output based on Pivotal Tuning Inversion, or PTI.

LoRA adds low-rank matrices to a base model and trains only those parts. This reduces training cost. Because the resulting parameters are small, both the parameter file size and memory usage are smaller.

From the point of view of someone using LoRA for image generation, LoRA is easy to layer. You can apply LoRA-trained parameter B and LoRA-trained parameter C on top of base model A, and generate images with characteristics from both B and C without major changes. In SD-WebUI, you can quickly specify which LoRA to apply and at what strength from a text prompt, such as <lora:model_a:1.0>, <lora:model_b:0.7>.

A clear Japanese deck about LoRA for LLMs is here.

Textual Inversion embeddings

In Stable Diffusion, one of the generation inputs is the embedding output, or sentence vector, from the CLIP text encoder. My understanding, although it is a little vague, is that textual inversion adds a new word that has learned a specific feature when creating these embeddings, and adjusts generation toward the intended output.

Normally, humans adjust output by putting strings into the text prompt. With textual inversion, you can directly insert text embeddings that have learned a specific expression, so you can make finer adjustments than ordinary text can express. After training, the embedding data has the same rank as a sentence vector, so the file size is extremely small.

Well-known examples include EasyNegative, which learns characteristics of words used in negative prompts and lets you add a good negative prompt easily, and badhandv4 and bad_prompt, which suppress strange arms and fingers.

Merging checkpoint models

Checkpoint merging combines checkpoint models to create another checkpoint. What it does is simple: linearly combine model parameters, essentially adding them with weights. It is surprisingly simple. This alone can create a model C that has characteristics of both models A and B. The many models named XxxMix that you see around are merged models made from multiple models. That said, not all merged models are only simple linear combinations. Some seem to use techniques such as changing the ratio by network layer, though that is still a kind of linear combination.

SD-WebUI keeps making these new techniques easy to use. For example, LoRA files, embeddings, and base models work by putting them in directories, and extensions such as sd-webui-controlnet can add features. The UI is not exactly approachable, but the system is well made. There is also a wiki page that roughly explains all features. If you use plain Stable Diffusion, you have to gather information yourself about what is possible, but SD-WebUI usually includes current trends, so it is also useful for understanding what is popular.

https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features

For image generation AI, I mostly use Midjourney. But touching Stable Diffusion again made me feel the evolution and interest of the ecosystem that comes from being open source and having published checkpoints. Looking closely, many parts are interesting, and it feels like I could get pulled deeply into it, so I am not digging too far. Still, it seems like an interesting area to get absorbed in.

After writing this much, I remembered that A New Era of AI Art: Image Generation Technology and Applications Using CLIP and Stable Diffusion had been sitting unread, so I started reading it. It covers the topics in this note, of course, and also explores many experimental approaches around the CLIP encoder as well as Stable Diffusion. It includes results showing what happens when each approach is applied, so you can understand how generated images change. Reading it is definitely enjoyable.

Another major benefit of LoRA: switching task models instantly while sharing GPU memory

Wed, 31 May 2023 00:00:00 GMT

LoRA, or Low-Rank Adaptation of Large Language Models, adds low-rank matrices so that you can train at low cost while keeping the original model intact. Recently, large Japanese models such as cyberagent/open-calm-7b and rinna/japanese-gpt-neox-3.6b were released, and Hugging Face released peft, a library that makes LoRA easy to use with Transformers. Many people have probably tried it.

Most explanations of LoRA's benefits focus on training. I had not seen much discussion of another major benefit: handling multiple tasks while sharing memory for the base LLM. This article explains how to do that with peft.

For an explanation of what LoRA is, this study group deck on LoRA: Low-Rank Adaptation of Large Language Models is very clear.

What problem does this solve?

As the name says, LLMs are large language models. For example, if you load the open-calm-7b model onto a GPU in fp16, it alone uses about 13 GB of memory. If you fully fine-tune it for a task, running that task needs 13 GB. If you then load another model for another task, it needs another 13 GB. A total of 26 GB is a harsh amount of memory, especially for a home GPU.

However, if you train open-calm-7b with LoRA by adding low-rank matrices with parameter r=8, the additional memory needed is only 17 MB. Not 17 GB, but 17 MB. You get a neural network that has learned task-specific characteristics and can solve some task with only that extra size.

That means you can handle another task with the 13 GB base LLM plus 17 MB. And not just one task. If you have LoRA data trained from ten different tasks or datasets, you can handle those tasks with 13 GB + 170 MB of memory. That is extremely powerful.

To be honest, for batch processing where you run the same process over lots of data, repeatedly loading and unloading GPU memory is often acceptable if you can wait. But for realtime sequential processing, such as responding to user input, being able to share memory on one GPU while handling multiple tasks is much better for performance.

For example, this seems useful for cases like:

Changing the expression style of chatbot responses
Running an article hosting service and switching models per user after learning each user's writing characteristics
Switching models to evaluate which training worked better in an A/B test
Switching LangChain Agents quickly when running them locally
- Each Agent may have different capabilities, and you may want to switch Agents depending on the content. If each Agent is a huge model, frequent loading and unloading from memory becomes very slow.

One caveat is that the base LLM must be the same.

How to switch in practice

When using LoRA models trained with peft, switching is very easy. I prepared the following notebook as an example:

https://gist.github.com/hotchpotch/e99a70a6864c76f5638010537d535a33

PeftModel can switch the active model using a feature called adapter. The model loaded by default is named default, and you can load another model with a name using load_adapter(model_name, adapter_name).

For example, load peft_model like this:

from peft import PeftConfig, PeftModel

peft_model_open2ch = "hotchpotch/open-calm-7b_lora_open2ch"
peft_config_open2ch = PeftConfig.from_pretrained(peft_model_open2ch)

model = AutoModelForCausalLM.from_pretrained(peft_config_open2ch.base_model_name_or_path, device_map="auto", torch_dtype=torch.float16)

peft_model = PeftModel.from_pretrained(model, peft_model_open2ch)

Then add a model with a different capability:

# https://note.com/masuidrive/n/n0e2a11fc5bfa
peft_model_instruct = "masuidrive/open-calm-instruct-lora-20230525-r4-alpha16-batch32-epoch1"

# Load it into peft_model with the adapter name "instruct"
peft_model.load_adapter(peft_model_instruct, "instruct")

After that, just switch adapters according to the task:

# The trained model hotchpotch/open-calm-7b_lora_open2ch
peft_model.set_adapter("default")
# The trained model masuidrive/open-calm-instruct-lora-20230525-r4-alpha16-batch32-epoch1
peft_model.set_adapter("instruct")

With this, the base LLM cyberagent/open-calm-7b should be loaded into about 13 GB of memory, while the 2ch-style text generation model is loaded as the default adapter and the QA answering model is loaded as the instruct adapter. Together, those adapters add only about 34 MB of memory.

So by switching with set_adapter for the task you want to run, you can use the models well without loading and freeing the huge LLM again. In the notebook example, I switch between two capabilities: generating 2ch-style text and answering questions.

LLMs plus many adapters open up possibilities

Training huge models that previously required full fine-tuning can now be done efficiently with LoRA, with small saved data size. In inference, multiple tasks can also share memory and run with lower memory usage.

This area is evolving quickly day by day. It is very interesting, the things we can do are expanding, and I am looking forward to the future.

The title says "GPU memory", but this memory sharing should not be limited to GPUs.

Analyzing the Iris dataset with ChatGPT's Noteable plugin

Sat, 27 May 2023 00:00:00 GMT

The OpenCALM data I wrote about earlier was far too small, so I tried asking Noteable to analyze the Iris dataset, the classic dataset that everyone loves and that has probably been analyzed a hundred million times. The results were about what I expected, but it also quickly wrote code for plotting graphs and trying several algorithms, which felt very convenient. This is a short note about that.

The standard statistics output was unsurprising, but the pair plot was nice. I usually end up reading the documentation while writing pair plots, but it generated a clean seaborn plot and colored it by target, or species.

I only gave it an instruction like: "I want to build a model that predicts target using data other than target. What algorithms would be good for building the prediction model? Please answer using Noteable." From that, it split the data into train and test at 8:2, wrote implementations for five sklearn algorithms, and displayed the actual results. That was convenient because writing this by hand each time is bothersome. This time all models achieved 100% accuracy, but if the accuracy had differed by algorithm, I could probably ask why and get an explanation.

The notebook code automatically created by Noteable looked like this:

from sklearn.model_selection import train_test_split

X = iris_df.drop('target', axis=1)
y = iris_df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

models = [
    ('Logistic Regression', LogisticRegression()),
    ('Decision Tree', DecisionTreeClassifier()),
    ('Random Forest', RandomForestClassifier()),
    ('SVM', SVC()),
    ('KNN', KNeighborsClassifier())
]

results = []

for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    results.append((name, accuracy))

results

[('Logistic Regression', 1.0),
 ('Decision Tree', 1.0),
 ('Random Forest', 1.0),
 ('SVM', 1.0),
 ('KNN', 1.0)]

ChatGPT then returned an easy-to-understand explanation of these results.

Next was clustering and visualization with dimensionality reduction. This is another thing that is quietly annoying to write yourself because you end up checking the documentation, but it generated the code quickly. It first used PCA for dimensionality reduction, and when I asked what would happen with t-SNE, the graph appeared right away.

The clustering and dimensionality-reduction plots with PCA and t-SNE looked like this:

ChatGPT cannot properly analyze unknown data it has never seen, although few-shot examples can help. A service that lets it execute a notebook, observe the results, and then continue the conversation through ChatGPT complements that weakness well. It made me feel again that Noteable is impressive.

People who do data analysis probably see Iris and think, "Ah, another Iris tutorial", and do not feel like analyzing it again. I was surprised that a day came when I voluntarily wanted to analyze the Iris dataset again.

Using ChatGPT's Noteable and WebPilot plugins to build a notebook that predicts OpenCALM 14B performance

Sat, 27 May 2023 00:00:00 GMT

I saw people saying that the Noteable plugin, which lets you create notebooks through conversation on ChatGPT, was impressive. I tried it with ChatGPT, or GPT-4, and it was better than I expected. We have reached a convenient time where AI can write the notebook for small analyses, and you can check the results in ChatGPT without opening the notebook itself.

For example, I asked it to fetch the number of parameters and PPL for OpenCALM models from the Hugging Face page and plot them. The result looked like this:

The PPL performance graph per parameter looked like this:

It also built a model using the parameters from 1B onward, where a linear regression seemed plausible. The predicted graph for a hypothetical OpenCALM-14B model looked like this. Since the 1B, 3B, and 7B models decrease linearly, it feels like performance will continue improving as larger models appear.

先ほど作成した線形回帰モデルの切片と係数は以下の通りです：

切片（Intercept）: 10.7928
係数（Coefficient）: -0.000383
これは、モデルが以下の形式で表されることを意味します：

Dev ppl = 10.7928 - 0.000383 * Params

つまり、Paramsが1増えると、Dev pplは約0.000383減少すると予測されます。

The ChatGPT conversation looked like this. It is not written in this exchange, but if I ask it to model the data with something nonlinear, such as SVM, a neural network, or a polynomial instead of only linear regression, it will express it that way too. It seems very convenient.

https://chat.openai.com/share/e3f754bb-0a9f-4bc9-a04e-399296f55884
- ChatGPT recently added share URLs. Noteable images seem to be cached only for a few dozen minutes during the conversation, so they disappear after a while.

The notebook created by Noteable looked like this. The data scraped by WebPilot is placed in the first cell.

https://app.noteable.io/f/71f35e5a-171f-4d70-8428-62c42bf7193d/params_dev_ppl_graph-(1).ipynb

Until now, the natural flow was to collect data from a web page, shape it into CSV or Python code, and then analyze it in a notebook using familiar steps. Being able to do that quickly by writing natural language is extremely convenient. If I want to do something more complex, the notebook already exists, so I can continue the analysis by adding a little myself. Having the usual notebook workflow become "mostly let AI do it, then have a human make the final adjustments" is a strong point.

Tedious work keeps disappearing, which feels good.

Quantizing fastText to build a practical 1.7 MB text classifier

Tue, 09 May 2023 00:00:00 GMT

This is a note about my surprise after building a text classifier that decides whether English articles are AI-related or not with fastText and quantization. The resulting model was practical and only 1.7MB. 1.7MB!

As I wrote in Launching AI News and how I used OpenAI behind it, AI News currently classifies text as AI-related or not by converting articles into 1536-dimensional vectors with OpenAI's text-embedding-ada-002 and training lightGBM on those vectors. The problem with this approach is that every article must go through the OpenAI API. On days with many long articles, this can cost several tens of yen per day. Monthly, it probably costs 500 to 1000 yen. Small costs add up.

The data had started to accumulate, so I wanted to classify articles without spending money on the OpenAI API. The data source is about 1,100 English article titles and bodies, with a ratio of about 2 AI articles to 8 non-AI articles. I split it 7:2:1 into train, validation, and test. Also, because I want to avoid mistakenly classifying AI articles as non-AI as much as possible, I look not only at accuracy but also at recall.

Current classifier: OpenAI embeddings + lightGBM

Accuracy was 0.9636, and recall was 0.777. For embeddings without fine-tuning, this is quite high.

Transformer: deberta-v3-xsmall

Accuracy was 0.9636, and recall was 0.888. This is a properly fine-tuned transformer, deberta-v3-xsmall. The test set is small, about 110 items, so it is hard to say too much, but it is roughly the same performance as the current classifier. That is expected, since it is fine-tuned.

I also tried deberta-v3-large, but the score actually dropped. The training data may be too small for fine-tuning a larger model to fit well.

fastText: cc.en.300

Accuracy was 0.9454, and recall was 1.0. Accuracy dropped, but recall was 1.0. Because the test data is small, I cannot strongly claim the recall is excellent, but the balance looks good.

The trained model size at this point was 4.6GB. That makes sense because the original cc.en.300 is large.

fastText: ag news

Accuracy was 0.9454, and recall was 1.0. The result was the same as cc.en.300. The trained model size was 88MB, already much smaller.

The ag news model came from Supervised models. cc.en.300 is 300-dimensional data trained from Common Crawl and Wikipedia, while ag news is 10-dimensional data trained from a corpus of news article titles and descriptions. Since this dataset matches my use case fairly well, I think the compatibility was good.

At this point the model was already reasonably small and practical, but looking at Supervised models, I noticed that very lightweight quantized models are also provided. Their scores do not degrade much; for example, ag news goes from 0.924 to 0.92. The model size goes from 387MB to 1.6MB.

So I tested performance with quantization too.

fastText: ag news + quantization

Accuracy was 0.9363, and recall was 1.0. Accuracy dropped slightly, but the model size shrank dramatically from 88MB to 1.7MB.

With the fastText command, quantization can be done quickly like this. Inference with the quantized model also worked without any special handling.

fasttext  quantize -output ./trained.ag_news -input ./trained.ag_news.bin -qnorm -retrain -cutoff 100000

Extra: fastText cc.en.300 + quantization

Accuracy was 0.9181, and recall was 0.714. The score dropped a lot. The model size also dropped dramatically from 4.6G to 16MB, though. The 300-dimensional size of cc.en.300 may have worked against it here.

Performance summary

The classifiers I tried performed as follows. OpenAI embeddings are clearly strong. If cost does not matter, it feels like they are good enough. DeBERTa v3 is also good if the machine has enough resources. But AI News data processing runs on a VPS with 1 GB of memory, so it is overkill there.

So I decided to use the memory-efficient, quantized, and still practical fastText ag news model as the first-stage classifier, and then use OpenAI embeddings + lightGBM as the second stage. The first stage should filter out about 80% of articles, so the number of calls to OpenAI in the second stage should drop sharply.

OpenAI embeddings + lightGBM
- acc 0.9636, recall: 0.777
deberta-v3-xsmall
- acc 0.9636, recall: 0.8888
fastText cc.en.300, after fine-tuning: 4.6G
- acc 0.9454, recall: 1.0
fastText cc.en.300, after fine-tuning and quantization: 16MB
- acc 0.9181, recall: 0.7142
fastText ag news, after fine-tuning: 88M
- acc 0.9454, recall 1.0
fastText ag news, after fine-tuning and quantization: 1.7M
- acc 0.9363, recall 1.0

fastText + quantization as an option

Until now I did not know that quantizing a fastText model could reduce the model size this much. Going forward, when I want to run text classification inference on a low-spec machine, I will consider fastText + quantization as one strong option.

Extra note: bin to vec

fasttext print-word-vectors ag_news.bin

This outputs a .vec file containing only word vectors. Specify this .vec file for a pretrained model.

After reading Kaggle ni Idomu Deep Learning Programming no Gokui

Sat, 29 Apr 2023 00:00:00 GMT

I finished reading the book in the title, so here are my impressions. In one sentence, it is a good, easy-to-understand book that compresses the essence of machine learning into a short "index", whether or not you care about Kaggle. I recommend it both to machine learning beginners and Kaggle beginners. I am technically a Kaggle Competitions Master, and even so it helped me understand several things that I had only understood shallowly, and it introduced techniques I did not know. Machine learning engineers and people familiar with Kaggle should also get a lot from it.

The book, Kaggle ni Idomu Deep Learning Programming no Gokui, was given to me by one of its authors, Shotaro Ishihara. Thank you.

One of the good things about this book is that it is short. Excluding the index, it is about 200 pages. Many machine learning books are thick, so this one is nicely compact. Books that explain theory properly tend to become long because they need careful explanation, and bad books often become long because they explain difficult material in a confusing and redundant way.

This book explains things accurately and simply, so you can get a broad overview. Looking at the book's table of contents, it covers many techniques needed not only for Kaggle but also for practical machine learning in general, and it explains why each technique is needed and when to use it. From chapter 3 onward, the book applies techniques and ways of thinking to actual Kaggle competition tasks and improves the score, so it is easy to see that the methods really have an effect. If you want more detail, you can follow the URLs and papers scattered throughout the text. It also gives you important keywords, so you can deepen your knowledge through search or other books.

The main models it covers are also models that are strong enough to consider early in practical work: gradient boosted decision trees, neural networks such as CNNs, RNNs, and Transformers, and linear models for ensembles. This focused selection is probably one reason the book does not become redundant.

If I had encountered this book when I first started learning machine learning, I could have learned about models and algorithms that are good in terms of performance without wandering too far off course, and I could have learned simply how to look at data and how to validate models. Beginners will not understand everything just by reading it lightly, but at the beginning you often do not even know which keywords matter. This book exposes you to many keywords that feel important and worth looking up when needed. That alone is valuable.

If I had encountered it when I started Kaggle, it would also have answered many of the questions that come up when beginning Kaggle competitions. Fortunately, my first Kaggle competition was a team effort, and much stronger teammates taught me the basics, so I was able to understand the overall flow. With this book, I think I could have grasped the flow of Kaggle and started my first competition with more understanding.

As I wrote at the beginning, this is a good, easy-to-understand book that compresses the essence of machine learning into a short index, beyond Kaggle itself. I hope many people read it.

Similar embedding search with SVM: an alternative to kNN

Sat, 29 Apr 2023 00:00:00 GMT

LangChain v0.0.141 added an implementation called SVM Retriever. It finds the top-K embeddings similar to a single query embedding from a set of embeddings by using SVM. I wondered how that worked, looked into it, and found it interesting because it used an idea I did not know. This is a note about that.

kNN vs SVM

There is a notebook called knn_vs_svm.ipynb, which this implementation is based on. Machine-translating the beginning gives the following:

A common workflow is to index some data based on embeddings, then, given a new query embedding, use k-Nearest Neighbor search to retrieve the most similar examples. For example, you could imagine embedding a large collection of papers based on their abstracts, then giving a new paper of interest and retrieving the most similar papers.

In my experience, if you have a little extra compute budget, using SVM instead of kNN always works better. Here is an example:

k-nearest neighbors computes using Euclidean distance, but this approach uses SVM. The way SVM is used is interesting. Quoting from the notebook:

# Wired: use an SVM
from sklearn import svm

# create the "Dataset"
x = np.concatenate([query[None,...], embeddings]) # x is (1001, 1536) array, with query now as the first row
y = np.zeros(1001)
y[0] = 1 # we have a single positive example, mark it as such

# train our (Exemplar) SVM
# docs: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
clf = svm.LinearSVC(class_weight='balanced', verbose=False, max_iter=10000, tol=1e-6, C=0.1)
clf.fit(x, y) # train

# infer on whatever data you wish, e.g. the original data
similarities = clf.decision_function(x)
sorted_ix = np.argsort(-similarities)
print("top 10 results:")
for k in sorted_ix[:10]:
  print(f"row {k}, similarity {similarities[k]}")

It labels only the target single embedding as 1 and all others as 0, then trains LinearSVC as a classification task. The confidence score is obtained as a value around 1 to -1, and the top-K items closest to 1 are treated as the most similar items.

Instead of simple Euclidean distance, the score is computed while considering a space learned by SVM, so it does seem likely to produce better results. Using LinearSVC to compute that made me think, "I see." SVM Retriever is an abstraction that makes this method convenient to use in LangChain.

Comparing kNN and SVM results

Let's use 450 Japanese items from AI News and compare search results from kNN and SVM for a specific query.

query:  生成AIと著作権
=== kNN ===
0.886: 生成AIの猛烈な進化と著作権制度～技術発展と著作権者の利益のバランスをとるには～ | STORIA法律事務所
0.880: スター・ウォーズやハリポタの人気キャラと話せるAIの「著作権問題」をどう考えるべきか | シリコンバレーの「生き字引」がズバリ指摘 | クーリエ・ジャポン
0.876: 生成AIの利用ガイドライン作成のための手引き | STORIA法律事務所
0.876: ダブスタクソイナゴは生成AIの法的議論に参加してくるんじゃねえ！！
0.874: 画像生成AI “クリエーターの権利脅かされる” 法整備など提言 | NHK | AI（人工知能）
0.870: 【AI】生成AIを利用する場合に気を付けなければならない著作権の知識｜福岡真之介｜note
0.868: AIイラストに規制を求める団体の理事「木目百二」氏が二次創作のガイドライン違反で支援サイトの作品全消し、謝罪に追い込まれる - Togetter
0.865: 生成AI「開発規制、望ましくない」　松本総務相 - 日本経済新聞
=== SVM ===
-0.305: 生成ＡＩで作品、それって著作権侵害？　福井健策弁護士に聞く：朝日新聞デジタル
-0.384: 生成AIの猛烈な進化と著作権制度～技術発展と著作権者の利益のバランスをとるには～ | STORIA法律事務所
-0.402: ダブスタクソイナゴは生成AIの法的議論に参加してくるんじゃねえ！！
-0.408: AIイラストに規制を求める団体の理事「木目百二」氏が二次創作のガイドライン違反で支援サイトの作品全消し、謝罪に追い込まれる - Togetter
-0.436: 画像生成AIによる作品の無許可使用を主張した写真家が逆に損害賠償を請求される - GIGAZINE
-0.479: アーティストのGrimes、生成AIで自分の声を自由に使っていいとツイート - ITmedia NEWS
-0.482: 生成AIの利用ガイドライン作成のための手引き | STORIA法律事務所
-0.483: スター・ウォーズやハリポタの人気キャラと話せるAIの「著作権問題」をどう考えるべきか | シリコンバレーの「生き字引」がズバリ指摘 | クーリエ・ジャポン

For the result above, both look reasonable at a glance. Let's try a slightly more difficult query.

query:  大規模言語モデルを低スペックのマシンで動かしたい
=== kNN ===
0.872: RWKV14Bを日本語AlpacaデータセットでLoRAして公開しました(ご家庭で動く!?)｜shi3z｜note
0.861: チャットAIをブラウザのWebGPUだけで実行でき日本語も使用できる「Web LLM」、実際に試してみる方法はこんな感じ - GIGAZINE
0.855: LLMをアプリ開発に統合するSDK「Semantic Kernel」がPythonに対応、TypeScriptへの対応も検討中|CodeZine（コードジン）
0.853: ChatGPT対抗のオープンソース言語モデル「StableLM」。日本語版も？  - PC Watch
0.851: “画像の面白さ”を解説できるAI「MiniGPT-4」　写真からラップや詩、料理レシピ作成　デモサイトも公開中：Innovative Tech（1/2 ページ） - ITmedia NEWS
0.850: チャットAI「StableLM」発表　オープンソースモデルで商用可　「Stable Diffusion」開発元から - ITmedia NEWS
0.849: Googleの大規模言語モデル「Bard」、日本でも利用可能に。英語のみだが、改良されたPaLMベース | テクノエッジ TechnoEdge
0.849: Stability AIがオープンソースで商用利用も可能な大規模言語モデル「StableLM」をリリース - GIGAZINE
=== SVM ===
-0.359: 大規模言語モデルを自社でトレーニング＆活用する方法｜mah_lab / 西見 公宏｜note
-0.366: 大規模言語モデル間の性能比較まとめ｜mah_lab / 西見 公宏｜note
-0.451: 深層学習コンパイラスタックと最適化
-0.456: LLMをアプリ開発に統合するSDK「Semantic Kernel」がPythonに対応、TypeScriptへの対応も検討中|CodeZine（コードジン）
-0.471: dolly-v2-12bという120億パラメータの言語モデルを使ってみた!｜Masayuki Abe｜note
-0.490: Googleの大規模言語モデル「Bard」、日本でも利用可能に。英語のみだが、改良されたPaLMベース | テクノエッジ TechnoEdge
-0.504: RWKV14Bを日本語AlpacaデータセットでLoRAして公開しました(ご家庭で動く!?)｜shi3z｜note
-0.510: Webブラウザ上で3D/2Dモデルをぬるぬる動かせる「Babylon.js 6」正式版に。レンダリング性能が最大50倍、WASM化した物理演算エンジン搭載、液体のレンダリングも － Publickey

Depending on the query, the results can differ quite a bit. I also implemented a hybrid search that ensembles kNN and SVM results, so let's look at that.

query:  大規模言語モデルを低スペックのマシンで動かしたい
=== kNN ===
-3.816: RWKV14Bを日本語AlpacaデータセットでLoRAして公開しました(ご家庭で動く!?)｜shi3z｜note
-3.527: チャットAIをブラウザのWebGPUだけで実行でき日本語も使用できる「Web LLM」、実際に試してみる方法はこんな感じ - GIGAZINE
-2.920: LLMをアプリ開発に統合するSDK「Semantic Kernel」がPythonに対応、TypeScriptへの対応も検討中|CodeZine（コードジン）
-2.591: ChatGPT対抗のオープンソース言語モデル「StableLM」。日本語版も？  - PC Watch
-2.436: “画像の面白さ”を解説できるAI「MiniGPT-4」　写真からラップや詩、料理レシピ作成　デモサイトも公開中：Innovative Tech（1/2 ページ） - ITmedia NEWS
=== SVM ===
-3.923: 大規模言語モデルを自社でトレーニング＆活用する方法｜mah_lab / 西見 公宏｜note
-3.865: 大規模言語モデル間の性能比較まとめ｜mah_lab / 西見 公宏｜note
-3.140: 深層学習コンパイラスタックと最適化
-3.097: LLMをアプリ開発に統合するSDK「Semantic Kernel」がPythonに対応、TypeScriptへの対応も検討中|CodeZine（コードジン）
-2.962: dolly-v2-12bという120億パラメータの言語モデルを使ってみた!｜Masayuki Abe｜note
=== Hybrid ===
-3.869: 大規模言語モデルを自社でトレーニング＆活用する方法｜mah_lab / 西見 公宏｜note
-3.102: RWKV14Bを日本語AlpacaデータセットでLoRAして公開しました(ご家庭で動く!?)｜shi3z｜note
-2.913: 大規模言語モデル間の性能比較まとめ｜mah_lab / 西見 公宏｜note
-2.844: LLMをアプリ開発に統合するSDK「Semantic Kernel」がPythonに対応、TypeScriptへの対応も検討中|CodeZine（コードジン）
-2.558: チャットAIをブラウザのWebGPUだけで実行でき日本語も使用できる「Web LLM」、実際に試してみる方法はこんな感じ - GIGAZINE

This looks better to me. Since it is easy to try, using SVM in addition to kNN search or similarity search seems like a reasonable option. Of course kNN is overwhelmingly faster, but if SVM can be used at practical speed, it seems useful.

Extra code

embs must be created separately as an array of embeddings. texts is paired data for embs. LangChain's SVM Retriever makes this easier, but it does not expose scores, so I implemented it myself.

# base: https://github.com/karpathy/randomfun/blob/master/knn_vs_svm.ipynb
from sklearn import svm
import numpy as np
from langchain.embeddings import OpenAIEmbeddings

def knn_top_k(query_emb, embs, k=10):
    l2_embs = embs / np.sqrt((embs**2).sum(1, keepdims=True))
    l2_query = query_emb / np.sqrt((query_emb**2).sum())
    similarities = l2_embs.dot(l2_query)
    sorted_index = np.argsort(-similarities)
    res_index = sorted_index[1:k+1]
    return res_index, similarities[res_index], -similarities

def svm_top_k(query_emb, embs, k=10):
    X = np.concatenate([query_emb[None, ...], embs])
    y = np.zeros(X.shape[0])
    y[0] = 1
    clf = svm.LinearSVC(class_weight='balanced', verbose=False, max_iter=10000, tol=1e-6, C=0.1)
    clf.fit(X, y)
    similarities = clf.decision_function(X)
    sorted_index = np.argsort(-similarities)
    res_index = sorted_index[1:k+1] - 1
    return res_index, similarities[res_index + 1], -similarities[1:]

def get_query_emb(text):
    emb = OpenAIEmbeddings().embed_query(text) # type: ignore
    return np.array(emb)

def join_colon(num_list_a, list_b):
    return [f'{a:.3f}: {b}' for a, b in zip(num_list_a, list_b)]

def knn_svm(text, embs, texts, k=5):
    query_emb = get_query_emb(text)
    knn_index, knn_similarities, _ = knn_top_k(query_emb, embs, k)
    svm_index, svm_similarities, _ = svm_top_k(query_emb, embs, k)
    print('query: ', text)
    print('=== kNN ===')
    print("\n".join(join_colon(knn_similarities, texts[knn_index])))
    print('=== SVM ===')
    print("\n".join(join_colon(svm_similarities, texts[svm_index])))

def hyblid_knn_svm(text_or_emb, embs, texts, k=5):
    if isinstance(text_or_emb, str):
        query_emb = get_query_emb(text_or_emb)
        print('query: ', text_or_emb) # type: ignore
    else:
        query_emb = text_or_emb
    # 全件取得する
    knn_index, knn_similarities, knn_all_scores = knn_top_k(query_emb, embs, embs.shape[0])
    svm_index, svm_similarities, svm_all_scores = svm_top_k(query_emb, embs, embs.shape[0])
    
    # score を正規化する
    knn_score_normalized = (knn_all_scores - np.mean(knn_all_scores)) / np.std(knn_all_scores)
    svm_score_normalized = (svm_all_scores - np.mean(svm_all_scores)) / np.std(svm_all_scores)

    # それぞれのスコアを足し合わせて、ハイブリッドなスコアを作る
    hybrid_similarities = (knn_score_normalized + svm_score_normalized) / 2
    hybrid_index = np.argsort(hybrid_similarities)[:k]

    print('=== kNN ===')
    print("\n".join(join_colon(np.sort(knn_score_normalized)[:k], texts[knn_index][:k])))
    print('=== SVM ===')
    print("\n".join(join_colon(np.sort(svm_score_normalized)[:k], texts[svm_index][:k])))
    print('=== Hybrid ===')
    print("\n".join(join_colon(hybrid_similarities[hybrid_index][:k], texts[hybrid_index][:k])))

Launching AI News and how I used OpenAI behind it

Mon, 24 Apr 2023 00:00:00 GMT

I launched a site called AI News. It collects topics related to AI, data science, and machine learning, summarizes them into three lines with AI, and publishes them. It is also available through Twitter @AINewsDev and an Atom feed. I have been running it for a few days, and although it is my own site, it has already been useful for collecting information. English articles are summarized in Japanese too, which is convenient.

Why I built it

I used to read ML-related information conveniently through ML-News, made by @syou6162. Around the time Twitter API pricing became an issue, it seems to have become unavailable.

That made it harder to follow data science and machine learning topics, and I had been thinking about building a similar site someday. Then ChatGPT, or GPT-4, appeared as an actually useful LLM. As everyone knows, topics around AI then exploded. There was too much information to read, but I still wanted to read the things I cared about properly. It would be useful to look at a reliable overview first and then decide whether to read the full article. The description written in an article's HTML is often only the first hundred characters or so, so it is not enough for that judgment. Then I realized that this was exactly the kind of thing an LLM such as ChatGPT could do, so I built it.

Implementation notes

Web scraping, article body extraction, and the website implementation are ordinary web development topics, so I will leave them aside for now and focus first on ChatGPT and the OpenAI API.

Article classification

I needed an implementation that takes scraped web articles and decides whether each article is related to AI. If you do machine learning, this sounds easy when correct labels exist, especially for binary classification of AI-related or not. The problem is that creating correct labels is tedious. I wanted AI itself to judge the articles instead of doing all of it by hand.

So I first asked GPT-3.5 to label the data. However, asking it to score how AI-like a topic is on a numeric scale from 0.0 to 1.0 was surprisingly unstable. I tried hard to write prompts that would make the output look like a probability distribution, but my prompting ability was not enough. What I wanted was a softmax-like probability distribution, so instead of describing it in prose, I wrote more directly that the values should be as if passed through a softmax function. That worked better. The final prompt is here:

https://gist.github.com/hotchpotch/8cb74d7a2ed1730faf1ec1ba089f93cf

I made it evaluate multiple AI-like categories and an "Others" category. When I fed it the roughly 400-character summaries described later, I often got output like this. Each value is between 0.0 and 1.0, and the total is 1.0, so it feels softmax-like.

{
    "AI": 0.0,
    "Machine Learning": 0.0,
    "Data Science": 0.0,
    "Data Analysis": 0.0,
    "Statistics": 0.8,
    "Deep Learning": 0.0,
    "kaggle": 0.0,
    "ChatGPT": 0.0,
    "MLOps": 0.0,
    "Generative AI": 0.0,
    "LLM": 0.0,
    "Others": 0.2
}

But sometimes it produced output like this. The values are between 0.0 and 1.0, but the total is greater than 1.0. What happened to softmax?

{
    "AI": 0.5,
    "Machine Learning": 1.0,
    "Data Science": 1.0,
    "Data Analysis": 1.0,
    "Statistics": 0.5,
    "Deep Learning": 0.0,
    "kaggle": 0.0,
    "ChatGPT": 0.8,
    "MLOps": 0.0,
    "Generative AI": 0.0,
    "LLM": 0.0,
    "Others": 0.2
}

If I pass that output through an actual softmax function, I get this. The values form a 0.0 to 1.0 distribution and sum to 1.0, so I can use this.

{
    "AI": 0.08285351386643752,
    "Machine Learning": 0.13660235066384355,
    "Data Science": 0.13660235066384355,
    "Data Analysis": 0.13660235066384355,
    "Statistics": 0.08285351386643752,
    "Deep Learning": 0.05025319642492017,
    "kaggle": 0.05025319642492017,
    "ChatGPT": 0.11184054543123119,
    "MLOps": 0.05025319642492017,
    "Generative AI": 0.05025319642492017,
    "LLM": 0.05025319642492017,
    "Others": 0.06137939271976228
}

Using this data, I treated items where "Others" was the highest score as non-AI and then checked the labels by hand. After correcting wrong labels by hand, I had N=550 labels: 200 AI-related and 350 others. Compared with the labels before manual correction, the accuracy was about 94%. That is quite high. The labeling is biased because I checked likely mistakes based on the AI output, but even so, the accuracy was good. Many of the mistakes were also borderline cases. I could probably improve it further by tuning the prompt or using GPT-4, but the goal was to create correct labels for training a classifier, and that was achieved, so I considered this good enough for now.

I listed many categories because when I wrote the task as something like "machine learning, AI, data science, or anything else", GPT-3.5's output felt less stable.

I corrected labels by hand while looking at a screen like this. It was much easier than labeling everything from scratch, although still tedious.

Building a classifier

Creating 550 correct labels was manageable, so next I built a classifier to decide whether an article is AI-related. For feature generation, I used OpenAI's Embeddings API, text-embedding-ada-002, to convert article bodies into 1536-dimensional vectors. The price per 1K tokens is also 20% of gpt-3.5-turbo, which is nice.

According to OpenAI's blog, text-similarity-davinci-001 seems to be more accurate for classification prediction. Still, I want to use embeddings for various things in the future, so I chose text-embedding-ada-002 for its generality.

Now I had 1536-dimensional features, so I split the labeled data into train, validation, and test sets and built a classifier. This time I used lightGBM, familiar to Kagglers. It had been several months since I last used lightGBM, and reading the documentation while implementing it felt bothersome, so I asked ChatGPT. It quickly produced working code, which surprised me. I was able to use it almost as-is.

https://gist.github.com/hotchpotch/81cf130279f4df9aeccd20e51678cff4

The code in that gist splits the data into 80% train, 10% validation, and 10% test, but because the final amount of data was not large, I adjusted it to 60% train, 30% validation, and 10% test. The trained model achieved validation accuracy 0.987 and test accuracy 1.0. Test accuracy of 100%! Of course the test data is only about 55 items, so it may be chance. When I changed the random seed casually, accuracy ranged from 0.96 to 1.0. Even with non-fine-tuned text-embedding-ada-002 features, the score was very good for a classification task. For NLP classification with only 330 training samples, that is impressive.

This completed the classifier for deciding whether an article is AI-related. Since then I have added various data sources, so at the moment some non-AI articles occasionally slip through and get displayed. I plan to retrain the classifier later and make it smarter.

Creating article summaries

For article summaries, if money were no issue, asking gpt-4 to summarize the whole article would be the most accurate. But the token cost is 15 times higher than gpt-3.5-turbo. Fifteen times. That is a lot for a hobby project, so I wanted to keep the cost as low as possible.

I asked GPT-4 and GPT-3.5 to summarize roughly the first 4K tokens of article text into about 400 Japanese characters and compared the results subjectively. GPT-4 produced better summaries, but they did not feel overwhelmingly better than GPT-3.5. Considering cost and processing time, I first use GPT-3.5 to summarize the first roughly 4K tokens into about 400 characters.

The difference between GPT-4 and GPT-3.5 became clearer when compressing the information further. When I reduced summaries to around 80 characters for Twitter posts, GPT-4 was much better. When the prompt specified a constraint such as "around 80 characters in Japanese", GPT-4 followed the constraint much more closely. GPT-3.5 sometimes produced much longer text, so GPT-4's ability to respect the character-limit constraint was valuable for Twitter posting.

GPT-4 is also much better at handling several tasks in one prompt. If I ask it in one request to create "around 80 characters", "around 80 characters in a casual style", "a three-emoji summary", and "a three-line summary", GPT-4 almost always does it. GPT-3.5 seems to struggle with doing multiple tasks in a single run.

At the moment I use this prompt:

https://gist.github.com/hotchpotch/427b2c24a1368a6f54d79d3f282c9445

Running that prompt through GPT-4 gives results for multiple tasks like this:

{
  "Bullets": ["約2650チーム中15位で金メダル獲得", "Kaggle Competitions Masterの称号取得", "CV・LB相関が観測できず、最終結果は大幅なshake予想"],
  "Summary": "Kaggleのコンペティションで15位の金メダルを獲得し、Kaggle Competitions Masterの称号を手に入れた。",
  "SummaryEmojis": "🏆Kaggleのコンペで15位の金メダル🥇を獲得し、Kaggle Competitions Master👑の称号を手に入れた🎉",
  "Emojis": "🥇👑🎉"
}

So summary generation is split into two steps:

First, summarize to about 400 characters with GPT-3.5 to save money and time.
Then, create multiple shorter summaries from that 400-character summary with GPT-4. This costs more money and time, but the quality is higher.

As people often say, prompts written in English generally produced better output than prompts written in Japanese. I am not good at English, so I translated my prompts with DeepL, but even that produced better results in English.

Ordinary web development

In addition to the OpenAI and machine learning work described above, I also implemented the following. This part took around 70% of the total development time, which is about what I expected.

Scrapers for various sites in Python
Saving data to the backend database and related systems in Python
Batch job implementation
Website implementation with Next.js, TypeScript, and Chakra UI

I wrote much of this implementation based on code generated by the ChatGPT 3.5 and 4 web UI. For Python code, which I also write in normal work, I sometimes felt it would have been faster to write it myself. Still, it was very useful for small pieces of implementation, such as writing a simple function or a regular expression.

It had been a long time since I used Next.js and TypeScript, and Chakra UI was new to me, so ChatGPT-generated code was especially useful there because I had less knowledge. However, the Next.js code was probably based on version 11 or 12 from ChatGPT's training data rather than the current version 13, so it sometimes produced deprecated structures. That is part of the charm.

ChatGPT 4.0 produced higher-quality code, but it was slower, so I mostly used 3.5 for small code generation. I used 4.0 when 3.5 looked suspicious or when I needed to include many conditions in the code. Use the right tool for the job.

Even with the current GPT-3.5 and 4.0, better VS Code integration alone would make development much more convenient. If code generation gets smarter over the next one or two years, development where I define requirements, review diffs, press y/N, and occasionally give feedback starts to feel realistic.

The future of AI News

For now I have only built the minimum necessary pieces, so I plan to keep improving it bit by bit. It is a personal sandbox, and it is also a website I can use conveniently myself, so maintaining it is fun in the way tending a bonsai might be, although I have never actually tended one. The article embeddings are currently used only for binary classification, but they should be useful for many other things too. I expect I will keep tinkering with it for a while.

RAPIDS SVR and SVC: fast training without fine-tuning, evaluated on MARC-ja

Tue, 13 Dec 2022 00:00:00 GMT

I learned about RAPIDS SVR and SVC in the Kaggle competition Feedback Prize - English Language Learning. They train quickly, and I felt they were useful methods for regression and classification tasks, so I will introduce what they are. In fact, top solutions in that competition used RAPIDS SVR.

I will also use RAPIDS SVC to evaluate MARC-ja, the classification dataset in the Japanese evaluation benchmark JGLUE. The implementation used for the evaluation is available on GitHub.

This article was written for day 13 of the Kaggle Advent Calendar 2022.

What are SVR and SVC?

SVR is Support Vector Regression, and SVC is Support Vector Classification. The algorithm behind them is SVM, or Support Vector Machine, which is known for strong accuracy and was apparently very popular at one point. sklearn also has an implementation, so many people have probably used it.

However, as the sklearn documentation says:

The implementation is based on libsvm. The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to datasets with more than a couple of 10000 samples.

In other words, with sklearn's libsvm-based implementation, scaling past around ten thousand samples is not very realistic.

What are RAPIDS SVR and SVC?

RAPIDS SVR and SVC are SVM implementations in cuML, which is part of RAPIDS, NVIDIA's project for GPU-accelerated data science. Roughly speaking, cuML implements general-purpose machine learning algorithms similar to those in sklearn, follows sklearn's estimator API such as fit() and transform(), and optimizes them to run on CUDA. According to its benchmarks, it is 10 to 50 times faster than sklearn. That means algorithms that are difficult to run at practical speed in sklearn can become practical with cuML. RAPIDS also includes other CUDA-based tools, such as cuDF for fast DataFrame operations, so it is worth looking at the rest of the project if you are interested.

What becomes useful when SVR can run quickly? One answer is that training on the embedding representation from a neural network output layer becomes practical. You can take an existing public model, use it only for feature extraction without fine-tuning, and train SVR on those features. It is also easy to combine features from multiple models and train on the concatenated features. You can use non-fine-tuned models this way, but fine-tuned models can also be used as feature extractors.

--Quoted from RAPIDS SVR starter kit

Extracting features from neural networks

How should we extract features from a neural network? As an example, I will describe encoder models from Hugging Face Transformers. For most encoder models, you can either take the CLS token from last_hidden_state or apply mean pooling. The resulting vectors are then normalized before use.

class MeanPooling(nn.Module):
    def __init__(self, eps=1e-6):
        super(MeanPooling, self).__init__()
        self.eps = eps

    def forward(
        self, outputs: torch.Tensor, attention_mask: torch.Tensor
    ) -> torch.Tensor:
        last_hidden_state = outputs[0]
        input_mask_expanded = (
            attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
        )
        sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, 1)
        sum_mask = input_mask_expanded.sum(1)
        sum_mask = torch.clamp(sum_mask, min=self.eps)
        mean_embeddings = sum_embeddings / sum_mask
        return mean_embeddings

class ClsPooling(nn.Module):
    # 実際は Pooling ではなくただの CLS を取り出しているだけなので、このクラス名は良くない…
    def __init__(self):
        super(ClsPooling, self).__init__()

    def forward(
        self, outputs: torch.Tensor, attention_mask: torch.Tensor
    ) -> torch.Tensor:
        last_hidden_state = outputs[0]
        return last_hidden_state[:, 0, :]

POOLING_CLASSES = {
    "mean": MeanPooling,
    "cls": ClsPooling,
}

class TransformerEmbsModel(torch.nn.Module):
    def __init__(self, model_name: str, pooling: str = "mean"):
        super().__init__()
        self.model = AutoModel.from_pretrained(model_name)
        self.pool = POOLING_CLASSES[pooling]()

    def feature(self, inputs: dict[str, torch.Tensor]) -> torch.Tensor:
        outputs = self.model(**inputs)
        sentence_embeddings = self.pool(outputs, inputs["attention_mask"])
        # Normalize the embeddings
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
        sentence_embeddings = sentence_embeddings.squeeze(0)
        return sentence_embeddings

    def forward(self, inputs: dict[str, torch.Tensor]) -> torch.Tensor:
        embs = self.feature(inputs)
        return embs

This is enough to extract features.

Training with RAPIDS SVC

After that, we only need to train with SVC. SVR works with almost the same code.

from cuml.svm import SVC
import numpy as np

DEFAULT_SVC_PARAMS = {
    "C": 3.0,  # Penalty parameter C of the error term.
    "kernel": "rbf",  # Possible options: ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’.
    "degree": 3,
    "gamma": "scale",  # auto or scale
    "coef0": 0.0,
    "tol": 0.001,  # 0.001 = 1e-3
}

def train_svc(
    X: np.ndarray,
    y: np.ndarray,
    svc_params: dict[str, object] = DEFAULT_SVC_PARAMS,
    probability: bool = True,
) -> SVC:
    svc = SVC(**svc_params)
    svc.probability = probability
    svc.fit(X, y)
    return svc

The core is almost just this.

Measuring the score on MARC-ja

Now let's evaluate MARC-ja, the classification dataset in the Japanese evaluation benchmark JGLUE. MARC-ja is a binary positive/negative sentiment classification dataset built from Japanese Amazon reviews. It has 187,528 train samples and 5,654 dev, or validation, samples. That is a reasonably large dataset. The test data does not seem to be publicly available at the moment.

JGLUE's GitHub page lists dev accuracy scores. For example, cl-tohoku/bert-base-japanese-v2 gets 0.958 after four epochs. The top score shown there is 0.964 from XLM-RoBERTa large.

Training time is also interesting. When I ran training casually on Colab with a T4 GPU, one epoch of bert-base-japanese-v2 took about 100 minutes, with 0.9573 accuracy after the first epoch. On my local RTX 4090, one epoch took about 30 minutes.

Feature extraction and SVC training

https://github.com/hotchpotch/rapids-svr-svc-marc_ja

In the repository above, I implemented feature extraction from a neural network on MARC-ja and training with RAPIDS SVC. Let's first train and evaluate SVC using cl-tohoku/bert-base-japanese-v2 without fine-tuning. The execution times below are from my local RTX 4090.

$ python lib/runner.py bert-base-ja-v2-cls
[create cache] tmp/embs_cache/bert-base-ja-v2-cls.pkl.gz
100%|███████████████████████████████████████████████████████████████████████| 5861/5861 [06:04<00:00, 16.09it/s]
100%|█████████████████████████████████████████████████████████████████████████| 177/177 [00:10<00:00, 16.45it/s]
exec time: 394.05 sec
shape: (187528, 768) (5654, 768)
concat embs: (187528, 768) (5654, 768)
[train svc]
svc exec time: 17.83 sec
==================================================
bert-base-ja-v2-cls
valid acc score: 0.927661832331093
==================================================
              precision    recall  f1-score   support

    positive    0.89788   0.56691   0.69500       822
    negative    0.93067   0.98903   0.95896      4832

    accuracy                        0.92766      5654
   macro avg    0.91428   0.77797   0.82698      5654
weighted avg    0.92590   0.92766   0.92059      5654

Feature extraction took 394 seconds, SVC training took about 18 seconds, and the accuracy was 0.92766. Once features are extracted, my implementation reuses them as a cache, so the second run costs almost only the SVC training time.

Next, let's look at the same model with mean pooling instead of CLS.

$ python lib/runner.py bert-base-ja-v2-mean
[load cache] tmp/embs_cache/bert-base-ja-v2-mean.pkl.gz
shape: (187528, 768) (5654, 768)
concat embs: (187528, 768) (5654, 768)
[train svc]
svc exec time: 18.44 sec
==================================================
bert-base-ja-v2-mean
valid acc score: 0.9324372125928546
==================================================
              precision    recall  f1-score   support

    positive    0.91667   0.58881   0.71704       822
    negative    0.93406   0.99089   0.96164      4832

    accuracy                        0.93244      5654
   macro avg    0.92536   0.78985   0.83934      5654
weighted avg    0.93153   0.93244   0.92608      5654

I had already run this before, so the features were loaded from cache and only SVC training was needed. Accuracy was 0.93244, so mean pooling worked better than CLS. What happens if we train on both sets of features?

$ python lib/runner.py bert-base-ja-v2-cls bert-base-ja-v2-mean
[load cache] tmp/embs_cache/bert-base-ja-v2-cls.pkl.gz
shape: (187528, 768) (5654, 768)
[load cache] tmp/embs_cache/bert-base-ja-v2-mean.pkl.gz
shape: (187528, 768) (5654, 768)
concat embs: (187528, 1536) (5654, 1536)
[train svc]
svc exec time: 30.04 sec
==================================================
bert-base-ja-v2-cls + bert-base-ja-v2-mean
valid acc score: 0.9334984082065794
==================================================
              precision    recall  f1-score   support

    positive    0.90545   0.60584   0.72595       822
    negative    0.93652   0.98924   0.96216      4832

    accuracy                        0.93350      5654
   macro avg    0.92099   0.79754   0.84405      5654
weighted avg    0.93200   0.93350   0.92782      5654

Because both feature sets were already cached, loading was almost instant, and SVC training took about 30 seconds. The result was 0.93350. Even with the same neural network model, extracting CLS and mean-pooled features separately and training on them together improved the score by about 0.001.

How about classic TF-IDF? TF-IDF features have too many dimensions as-is, so I reduced them to 1000 dimensions with SVD and then trained and evaluated SVC.

$ python lib/runner.py tfidf
[load cache] tmp/embs_cache/tfidf.pkl.gz
shape: (187528, 1000) (5654, 1000)
concat embs: (187528, 1000) (5654, 1000)
[train svc]
svc exec time: 55.78 sec
==================================================
tfidf
valid acc score: 0.8924655111425539
==================================================
              precision    recall  f1-score   support

    positive    0.81657   0.33577   0.47586       822
    negative    0.89729   0.98717   0.94009      4832

    accuracy                        0.89247      5654
   macro avg    0.85693   0.66147   0.70797      5654
weighted avg    0.88556   0.89247   0.87260      5654

Accuracy was 0.89247, which is not very good. For text with many unseen words, this is probably about what we should expect. Then what happens if we combine TF-IDF with BERT features?

$ python lib/runner.py bert-base-ja-v2-cls bert-base-ja-v2-mean tfidf
[load cache] tmp/embs_cache/bert-base-ja-v2-cls.pkl.gz
shape: (187528, 768) (5654, 768)
[load cache] tmp/embs_cache/bert-base-ja-v2-mean.pkl.gz
shape: (187528, 768) (5654, 768)
[load cache] tmp/embs_cache/tfidf.pkl.gz
shape: (187528, 1000) (5654, 1000)
concat embs: (187528, 2536) (5654, 2536)
[train svc]
svc exec time: 53.41 sec
==================================================
bert-base-ja-v2-cls + bert-base-ja-v2-mean + tfidf
valid acc score: 0.9379200565970994
==================================================
              precision    recall  f1-score   support

    positive    0.92280   0.62530   0.74547       822
    negative    0.93957   0.99110   0.96465      4832

    accuracy                        0.93792      5654
   macro avg    0.93119   0.80820   0.85506      5654
weighted avg    0.93713   0.93792   0.93278      5654

The result was 0.93792, much higher than BERT alone. TF-IDF points in a different direction as a feature source, so combining it likely added diversity and improved the score. It is also interesting that SVC training became slightly faster than TF-IDF alone, perhaps because convergence was better.

In the same way, I tried combining features from several Japanese models published on Hugging Face.

$ python lib/runner.py bert-base-ja-v2-cls bert-base-ja-v2-mean rinna-ja-roberta-base-cls rinna-ja-roberta-base-mean tfidf bert-base-ja-sentiment-cls bert-base-ja-sentiment-mean
[load cache] tmp/embs_cache/bert-base-ja-v2-cls.pkl.gz
shape: (187528, 768) (5654, 768)
...中略
[load cache] tmp/embs_cache/bert-base-ja-sentiment-mean.pkl.gz
shape: (187528, 768) (5654, 768)
concat embs: (187528, 5608) (5654, 5608)
[train svc]
svc exec time: 89.47 sec
==================================================
bert-base-ja-v2-cls + bert-base-ja-v2-mean + rinna-ja-roberta-base-cls + rinna-ja-roberta-base-mean + tfidf + bert-base-ja-sentiment-cls + bert-base-ja-sentiment-mean
valid acc score: 0.9432260346657234
==================================================
              precision    recall  f1-score   support

    positive    0.93717   0.65328   0.76989       822
    negative    0.94391   0.99255   0.96762      4832

    accuracy                        0.94323      5654
   macro avg    0.94054   0.82292   0.86876      5654
weighted avg    0.94293   0.94323   0.93887      5654

Training SVC on 187528x5608 features took 90 seconds. The accuracy was 0.94323, the best result in this trial. Compared with the 0.958 score from properly fine-tuned BERT, it is still not enough. Still, it is good enough to consider as one model in an ensemble, and there is still plenty of room to improve the score by adding more features.

The training speed is high. Once the neural network features, which take the most time, have been extracted, I can freely combine features and observe results. That also means using a large number of folds should still be practical.

Use in real Kaggle competitions

In the competition I recently joined, Feedback Prize - English Language Learning, which predicted scores for text, the summary of the 1st through 8th place solutions says that the 1st, 3rd, and 4th place solutions used RAPIDS SVR models in their ensembles. I also tried SVR. Because it did not improve my Public LB score when added to my ensemble, I did not include it in my final submission. However, it scored higher on both Public and Private LB than the early public fine-tuned DeBERTa v3 base model. After the competition ended, I was able to confirm on the Private LB that adding it to the ensemble improved the score, so knowing the result now, I should have included it.

I also heard that SVR was used in the first-place solution for the image competition PetFinder.my - Pawpularity Contest.

Another possible use is near the end of a competition, when deciding which additional models to include in an ensemble. It may be useful to first pass candidate model features through SVR and prioritize fine-tuning the models with higher scores. Pretrained Embeddings are all you need (sort of ...) lists SVR results for extracted features, and I think the scores would correlate with the scores obtained by actually fine-tuning those models.

Closing

This article introduced RAPIDS SVR and SVC, which can train directly on extracted features without fine-tuning. Fine-tuning often takes tens of minutes to several hours depending on the amount of data, and real-world datasets can be much larger. SVR and SVC, which can run in a "RAPID" way with a few minutes for feature extraction and seconds to tens of seconds for training, seem useful not only for Kaggle but also for ordinary work and research.

Until now, when I did not train a neural network for regression or classification tasks, I usually only tried gradient boosted decision trees. RAPIDS SVR and SVC make it possible to run SVM quickly, so they look like methods worth adding to the list of things to try.

How secon.dev was implemented, December 2022 edition

Sun, 11 Dec 2022 00:00:00 GMT

It has been a while since I wrote the 2020 edition of how secon.dev was implemented, and I have been thinking that it might be time to touch secon.dev again. I may only think about it and never actually do it, but I will write down the state as of late 2022.

Core implementation

This part is almost unchanged from the 2020 version. I write files in Markdown, sync them through Dropbox, and a build server detects changes, builds static HTML with Next.js, and deploys it to Firebase Hosting.

Related entries

For entries other than diaries, the site shows related entries at the bottom of the page. This article should probably show them too. The approach is the same as in I made a CLI that outputs similar documents for static site generators: calculate TF-IDF for the text, use cosine similarity to find similar articles, and have Next.js read that data and include it at build time. It is a very plain mechanism.

Related entries work reasonably well for non-diary articles. Diaries, though, tend to contain many unrelated notes instead of a single topic. Treating a whole diary entry as one document and computing similarity from TF-IDF word occurrence does not work very well, so I do not use it there.

Articles with similar images

When an article contains a photo, the site shows articles that contain similar photos at the bottom of the individual article page. I use this navigation a lot myself. It is one of my favorite features.

During the image upload flow described below, I extract metadata and, at the same time, image features. At the moment I use EfficientNetB0. I then use those features to compute cosine similarity and pick similar images. This is also a plain mechanism, and it is almost the same as what I described in the similar image search article.

At the time I did not really understand image features. I still cannot say I understand them properly, but I probably know more than I did then. So I simply chose EfficientNetB0 because it was both performant and small. If I were choosing now, I would start from what kind of "similar image" would actually be useful and choose the model from there.

Diaries from the same date

This is an old feature in web diary systems. It lets you look back at what happened on the same date in previous years, and it becomes very good navigation when you are writing "diaries" rather than "articles". The more content accumulates, the more useful it becomes.

Image upload

In the 2020 version I used Hatena Fotolife as the image upload destination. Later, because Hatena Fotolife removed its paid upload option, I changed the storage to GCS.

When I upload an image file to a specific GCS bucket, a Cloud Function converts JPEG to WebP, resizes it to frequently used sizes, and extracts metadata. The image feature extraction mentioned above also happens there.

Uploading files to a specific GCS bucket sounds troublesome at first, but I mount the bucket as a Windows local filesystem using the method described in Mount a GCS bucket as a Windows filesystem. That means I can develop a photo in Lightroom, save it, and quickly get the various image sizes needed by the site.

Because all files are on GCS, it is also convenient when I want to bring photos back to my local machine and do something with them. I can fetch them with something like gsutil -m rsync .... I am glad I got around to building this setup.

Cost

secon.dev does not get a lot of traffic, so Firebase Hosting for the website, GCS for image hosting, and Cloud Functions together cost 49 yen including tax for November 2022.

In reality there is also the build server cost, since it runs on a VPS that I use for many other things, so the total cost is probably a little higher.

Future implementation direction

secon.dev is currently a static build, and the data is filesystem-based: Markdown text plus JSON metadata. This approach has become inefficient enough that I now want a database where I can build data while keeping references between pieces of information, whether with GraphQL, an RDB, or something else. It is not strictly necessary, though, so I am still not sure what to do.

For machine learning features, if more things could be calculated dynamically, I could do more with the site, such as building my own search. I am also thinking about that area.

I am still interested in machine learning these days, and I would like to combine experiments in that area with secon.dev. I will probably choose technologies and architecture that fit that direction.

Kaggle Feedback Prize - English Language Learning: team gold medal, 15th place, and Kaggle Master

Wed, 30 Nov 2022 00:00:00 GMT

The Kaggle competition Feedback Prize - English Language Learning ended, and we received a gold medal with 15th place out of about 2,650 teams. With this, I had two gold medals and one silver medal in total, meeting the requirements for Kaggle Competitions Master, the title I had aimed for when I first started competitions. I do not think I could have reached this result alone, so I am grateful to my teammates @masakiaota and @olivineryo.

At the end of the competition we were 8th on the public LB among about 2,700 teams, inside the gold-medal range. But the public LB used only 26% of the full LB data. Scores were shown only to the second decimal place, so the display was coarse. Also, some submissions that were quite bad on our CV ranked oddly high on the public LB, so we could not observe a reliable CV-LB correlation. Because of that, we did not trust the public LB and expected a large shake in the final private LB ranking.

For final submission, we submitted three versions: the best CV model, the best LB model, and the best CV model without pseudo labels. Pseudo labels made CV overwhelmingly better, but there were concerns about over-optimizing CV or leaking information. This competition had another prize category, so we had three submission slots and chose those three. If only two submissions had been allowed, the choice would have been very painful.

The result was that although we were 8th on public LB, we dropped to 17th on private LB and unfortunately missed the gold medal range, which was 15th or higher at the time. Missing gold by two places was extremely frustrating. Still, some teams dropped tens or hundreds of ranks, and the public LB first-place team dropped 175 places, so a nine-place drop could have been much worse. It reconfirmed how hard it is to win gold.

Update: after invalid users were removed and the ranking was finalized, we moved up to 15th and received a gold medal. I am very happy.

Competition Task

This was the third competition in the Feedback Prize, or FP, series, which evaluates English written by students. The task was to infer scores, in 0.5-point increments from 1.0 to 5.0, across six columns such as grammar, vocabulary, and syntax. The evaluation metric was MCRMSE, mean columnwise root mean squared error, so the mean of RMSE for each column. Since data from the previous FP1 and FP2 competitions existed, we tried various approaches while thinking about whether they could be used as pseudo labels and how to optimize the six evaluation axes.

Solution

We submitted three ensemble models combining the ideas below. The best private LB result came from an ensemble of 11 models that did not use pseudo labels.

Backbone model selection

On the public LB, deberta-v3-base scored better than deberta-v3-large, even though deberta-v3-large has the stronger generalization performance in general. I accepted that result and trained and optimized around it. I thought that for this competition, v3-base, with 12 layers, might score better than the more complex v3-large, with 24 layers. Since the essays were written by students and were not complex prose, perhaps the simpler model was better.

In the final private LB, however, v3-large was indeed strong. We should not have trusted LB and should have built our strategy around CV and v3-large.

deberta-v3-base
- Overwhelmingly strong on this competition's public LB.
deberta-v1-large
- Somewhat effective on both CV and LB.
deberta-v3-large
- Effective on CV, but mixing it in caused public LB to drop considerably.
- On private LB, that was not the case.

Our team's own approaches

These were things not discussed publicly, or at least not widely adopted as "this is effective, let's use it". Most of these ideas came from my teammates, and I did not contribute much to them. I am grateful to have had excellent teammates.

Attention pooling separately for the six columns
- Improved the score substantially compared with mean pooling.
Applying LayerNorm after pooling
- This also improved the score substantially.
maxlen: 640
- Truncating to 640 improved scores compared with longer maxlen.
- For deberta-v3-large only, using a longer maxlen and training with splits using Sliding Window Attention gave better scores.
Pseudo labeling
- Very effective for CV, but not very effective for LB.
- Using pseudo labels with leaked information improved CV even more, so we carefully removed, or tried to remove, information leaks.
- Pseudo-label data used past FP1 and FP2 competition data.
- Pseudo-label scores were created by ensembling predictions from trained models.
- A single-model pseudo label was learned too quickly, but using ensembled pseudo labels kept improving CV.
- A single model trained with pseudo labels, not the strongest on CV but moderately improved, was best on private LB. It hurt that we did not choose it for final submission.
Post-processing
- Clamp values below 1.0 to 1.0 and values above 5.0 to 5.0. This slightly improved the score.
- When ensembling, find optimal weights per each of the six columns as an optimization problem that minimizes CV score. For example, with four ensemble members, solve for and apply 4 x 6 = 24 weights. This slightly improved the score.
- We used scipy.optimize.minimize to define and solve the optimization problem. I had not known that SciPy could solve it so easily; it was convenient.

Approaches discussed publicly

Layer reinitialization
- Reinitialize the final n layers of the pretrained model. Reinitializing only the final layer worked best.
- This helped a lot.
LLRD, or Layer-wise Learning Rate Decay
- Gradually decrease the learning rate by layer.
- We decayed LR by 0.8 for large models and by 0.7225 for base models.
- This helped a lot.
Layer freezing
- Do not train the first n layers. When LLRD is applied, the early layers of large models, 24 layers, are barely trained anyway, so freezing the first 12 layers of large models sped up training.
AWP
- Team member Aota has a clear explanation: Kaggleで使用される敵対学習方法AWPの論文解説と実装解説
- This helped a lot.
Multi-sample dropout
- The score barely changed, but training became more stable, so we used it.
- Added 0.2 * 5 dropout.

Useful to learn, even though it did not improve our score

RAPIDS SVR. In the RAPIDS SVR approach, neural networks are used only to extract feature embeddings, and then those embeddings are trained with SVR, Support Vector Regression, on CUDA using cuML. SVR training itself finishes instantly or within a few seconds, depending on the GPU. I wondered whether such a method could work, but it produced a score better than my early baseline. Some top solutions also seem to have used RAPIDS SVR, so it is quite possible that I simply failed to use it effectively for performance improvement.

SVR was useful not only because it quickly produced reasonably good scores, but also because it correlated to some extent with the performance of NN backbone models. NN models that scored poorly with SVR often also scored poorly after full NN training. That made SVR useful as a guide when choosing backbone models, especially near the end when we had to consider many models for the final ensemble.

I also had not known about NVIDIA's cuML library itself, so learning about it was valuable. cuML can run basic machine learning algorithms similar to those in scikit-learn on CUDA, and some algorithms become extremely fast. Its interface is usually sklearn-compatible, so it is easy to understand. In the future, if a machine learning task takes minutes or more on CPU, I would like to consider using cuML.

After the Competition

After this competition, I should become a long-awaited Kaggle Competitions Master. Getting gold or silver medals in three medal competitions in a row from my first competition was possible largely because of teammates, except for the solo competition, and even there the previous team experience helped. I am especially grateful to Aota, who invited me into Kaggle competitions.

That said, I felt this during my previous solo competition, and this competition made it clear again: I was not the person who came up with the idea that won gold. At my current level, I do not have the skill needed to win a solo gold medal, which is required to become a Kaggle Competitions Grandmaster. There is a large wall in front of me. If I know the approach for a problem domain, I feel I might be able to win solo silver after trying a few times, but I cannot yet imagine winning solo gold. People who have won solo gold medals are truly impressive.

So far, I have only joined competitions solvable with natural language processing Transformer encoder approaches. If I join another competition, rather than prioritizing medals by solving a similar task, I would like to try a task where my score may be weaker but I can gain different knowledge. I am still a beginner, and most machine learning algorithms and problem-solving approaches stimulate my curiosity. I would like to keep enjoying the process while broadening my knowledge.

Solo silver medal, 43rd place, in Kaggle Feedback Prize - Predicting Effective Arguments

Wed, 24 Aug 2022 00:00:00 GMT

I joined the Kaggle competition Feedback Prize - Predicting Effective Arguments solo and finished 43rd out of 1,566 teams, earning a silver medal. This was still a provisional ranking at the time, so the final rank could change slightly.

In my previous competition, my first Kaggle competition, I was blessed with a team and happened to win a gold medal. Through that experience I learned how fun Kaggle can be, and I wanted to join another competition. I also thought I would prefer a team, because staying motivated alone seemed hard. This time, however, I started without a team, or more precisely, I did not know many people and could not form one. The previous competition also taught me that team participation can sometimes lead to a gold medal even without enough individual skill, as in my case, so I wanted to see what result I could leave as a solo participant.

At first I worried that my motivation would not last. But my first baseline was already in the silver-medal range, so I started with the feeling that maybe the gold range was not impossible, and that kept me motivated through the end. In the result, I was nowhere near the gold range, and I recognized both my lack of skill and the narrowness of my toolbox. Still, solo participation gave me its own lessons. Unlike a team, I had to write every piece of code myself, understand the intent of that code, and try a wide range of methods considered good in similar competitions. That broadened my knowledge. On the other hand, I did not get the sense of unity that comes from team participation, nor the unexpected knowledge that teammates can bring. Both styles have tradeoffs.

The PPPM competition I joined previously also used Transformer encoder models for NLP, and this competition likely required similar models. A good part of the knowledge from the previous competition carried over. I think that was one reason I could stay motivated as a solo participant: I was not starting from a place where I knew nothing at all.

Near the end, before the team-merge deadline, several teams invited me to merge. That gave me a real sense that these things happen when you participate solo. By that point, leaving a result as a solo participant had also become one of my goals, so although I appreciated the invitations, I declined them this time.

What Kind of Competition Was It?

This competition was a variant of the earlier Feedback Prize - Evaluating Student Writing. The task was a classification problem: given part of an essay written by a U.S. student, predict whether it was Ineffective, Adequate, or Effective.

The actual data had about 37,000 rows. Each essay_id had a separate long essay text associated with it, and discourse_text was one part of that essay. Many public baselines created inputs like discourse_type + discourse_text + [SEP] + essay_text and trained on them. The problem with this approach is that essay_text appears repeatedly, causing overfitting very quickly. It also seemed that rows from the same essay were written by the same student, so each essay had its own tendency across Ineffective, Adequate, and Effective. That tendency seemed useful, because a student who writes a good essay probably writes generally good discourse segments.

Solution Notes

This was not a gold-medal solution and may not be broadly useful, but these are the methods I tried, what worked, what improved training efficiency, and what did not work well for me.

Look at essays, not rows

Instead of treating the problem row by row, I looked at it by essay_id. There are about 37,000 rows, but only about 4,200 essays, around one eighth as many, which improves training speed. Since each target discourse_text is contained in the essay, the task can be seen as classifying specific spans that appear within one text. In that form, a model similar to NER can classify the spans.

For example, I created essay text with special tokens [TAR_START] and [TAR_END] around target spans, like this, and turned the region between those tokens into the classification target.

[TAR_START]Lead so you want to take all the cars out of the city ok cool[TAR_END]. [TAR_START]Claim this will save the pepol like 1,000,000 dolers a year[TAR_END] and will [TAR_START]Claim reduce polution[TAR_END] and stuf i dont know. [TAR_START]Counterclaim i gess its a good idea but on the other hand nah i mean lookif pepal want to ruin the world with gas thats there choce man[TAR_END]. but i just relised that the thing seid to agrewith the pasige or whatever so yah. ummmmmmm i dont know [TAR_START]Position the eirth is cool so why distroit with gasis or something[TAR_END]. look to be honist...

For the classification representation, I tried averaging between [TAR_START] and [TAR_END], using only the [TAR_START] and [TAR_END] tokens, using only [TAR_START], and other variants. The best was using only [TAR_START]. I also tried not using special tokens, using only CLS and SEP, removing TAR_END, replacing it with SEP, and so on. The custom [TAR_START] and [TAR_END] special tokens worked better.

The tokenizer max_length was 1024. Even at that length, some essays overflowed, so I added some ad hoc processing to pack the discourse_text appropriately.

In the end, I trained a 4-fold deberta-v3-large model with CV 0.5911, LB 0.585, private LB 0.586; a deberta-large model with CV 0.6034, LB 0.595, private LB 0.597; and earlier row-based models such as deberta-v3-large with CV 0.6179, LB 0.608, private LB 0.604. I ensembled these. The result was LB 0.582 and private LB 0.583.

Methods that worked

AWP, or Adversarial Weight Perturbation
- Improved the score by about 0.05 to 0.1.
- AWP has perturbation width and range parameters. Increasing the perturbation width by epoch improved the score a little. If I had more time, I would have liked to try changing it with a scheduler.
Lowercasing and removing symbols in text processing
- Improved the score by about 0.03.
Back-translation augmentation
- I ran retranslated data through the model and used only essays where the model's accuracy exceeded 90%.
- Improved the score by about 0.03.
- Using only retranslated text caused overfitting. When I used data with about 20% [MASK], training progressed stably.

Methods that did not work well

This only means I could not make them work. Someone with different knowledge might well make them effective.

LSTM or Bi-LSTM before the final output.
Pseudo-labeling using the previous Feedback Prize - Evaluating Student Writing data.
- This still seems useful if done properly.
- https://www.kaggle.com/code/rolianklay/pseudo-labeling-how-to-get-pseudo-labels
Adding class weights to torch.nn.CrossEntropyLoss.
- The data was imbalanced, so I tried weights, but it did not work.
Removing essays with very few discourse_id values.
- This removes data like one discourse ID per essay.
deberta-xlarge and deberta-xxlarge.
- Training did not progress.
allenai/longformer-large-4096.
- Gradients became NaN and training did not progress.
- allenai/longformer-base-4096 trained, but the score was poor.

Training-efficiency methods

The text was fairly long in this competition, so training speed and memory efficiency were important. Optimization approaches for Transformers summarizes these techniques. The linked article has details, but here is a rough summary.

8-bit optimizers
- Use an 8-bit optimizer instead of a 16-bit one to save more memory.
- Specifically, bitsandbytes can replace AdamW and worked smoothly.
- I did not notice a score drop, and memory usage really decreased.
- Replacing torch.nn.Embedding with bnb.nn.StableEmbedding did not work smoothly for me. If I had replaced it well, the result might have improved further.
Gradient checkpointing
- Computes while discarding unnecessary gradients. If discarded values are needed again, the backward graph is rebuilt, so speed decreases.
- With Transformers, it can be enabled with model.gradient_checkpointing_enable().
- The score barely changed. Training slowed down, but memory usage decreased greatly.
Automatic mixed precision, or AMP
- Places and computes safe parts in fp16 instead of fp32.
- The important point is to use it with GradScaler to avoid gradient overflow.
Gradient accumulation
- When memory is limited, batch size becomes small. By splitting loss application, you can train similarly to using a larger batch size.
Freezing
- Layers near the input have lower learning rates, and sometimes not training them gives better results.
- So those layers can be frozen.
Fast tokenizers
- Use Transformers tokenizers written in Rust.
- Recently, Rust implementations are used by default when available, so there is often nothing special to do.
Uniform dynamic padding
- Dynamic padding is pulled up by the longest token length in each batch. If examples are sorted by token length beforehand, awkwardly long token lengths are less likely to appear.
- It is hard to use during training because you usually want random ordering, but it can speed up inference.

After the Competition

At first I thought maybe I had a chance at the gold range. In the end, I was in the middle of the silver range and nowhere near gold. That helped me reconfirm both my current position and my lack of skill. Small incremental improvements were not enough to reach a gold-medal score; a more drastic improvement would have been needed. I am looking forward to reading the top solutions.

Even though this was an NLP competition similar to the previous one, using a Transformer encoder model, I finished in the middle of the silver range. For Kaggle competitions in other topics, whether I can win a medal at all is still uncertain.

Trying it individually instead of as a team was good. I was able to leave some result on my own, stay motivated by myself, and enjoy the work. One more silver medal would make me a Kaggle Competitions Master, so I would like to keep joining competitions that interest me, whether as a team or solo. Most competitions are probably tasks I have never done before, so even if I do not win a medal, I expect to gain a lot of knowledge from any of them.

Was the end of Japan's 2022 rainy season unusually early?

Tue, 28 Jun 2022 00:00:00 GMT

Yesterday, the rainy season was considered to have ended in the Kanto-Koshin region. News reports said it was unusually early, 22 days earlier than normal, where "normal" means the statistics from the past 30 years. News reports usually avoid talking about standard deviation because they need to communicate clearly to the general public, but I wanted to check how unusual "unusually early" actually was.

For rainy-season start and end dates, I used data published by the Japan Meteorological Agency from 1951 onward. There is even a year, 1993, where the start and end could not be identified. The data was inconvenient to handle directly, so I pasted it into a spreadsheet and cleaned it up roughly.

https://docs.google.com/spreadsheets/d/1-1JZitpvKu_MDDoK0bRaYFGOMaMJuVc9s1GusthlvPc/edit?usp=sharing

Looking at about 70 years of data, the standard deviation of the rainy-season end date is 8.6 days. The 2022 date is 21 days earlier than the 70-year average. If we assume the data is normally distributed, it falls outside the 95% confidence interval, so it is fair to call it unusually early.

However, looking only at the past 10 years, the rainy season has already been ending earlier. The 70-year average end date is July 18, while the 10-year average is July 14. The average start date differs by only one day, so the end date has moved noticeably earlier. The standard deviation of the 10-year average end date is 12.59 days. Compared with the 10-year average, the 2022 end date differs by 17.2 days. If we look only at the narrow window of the past 10 years, it probably falls somewhere around a 75-85% confidence interval, so it feels like something that happens from time to time. The 2022 date is also not that different from 2018, only two days earlier.

In other words, viewed across the past 70 years, it is unusually early. Viewed across the past 10 years, it is only something that occasionally happens. The data makes me feel that changes in the global environment, including environmental issues, may already be affecting the timing of the rainy season's end.

Finding optimal weighted-ensemble coefficients with constrained least squares

Tue, 21 Jun 2022 00:00:00 GMT

In the Kaggle competition U.S. Patent Phrase to Phrase Matching, I created about 20 ensemble candidates near the final stage by combining several pretrained models and multiple promising parameter settings. We then tried different ways to mix those models. The final submissions were a 6-model ensemble and a 9-model ensemble.

At that point, asking a human to choose the optimal ensemble weights from model performance and intuition is not very realistic. Of course, a domain expert may sometimes make better choices, but I wanted the machine to calculate the weights automatically. Changing the mixing ratio to minimize MSE is just a linear combination, so I thought the optimum should be computable.

A Simple Example

Suppose we have data like this. y is the true value, and X contains the predictions from each ensemble model.

import numpy as np
y = np.array([0.5, 0.75, 0.25, 1.0, 0.5])
X = np.array([
    [0.52, 0.9, 0.41, 0.99, 0.51],
    [0.52, 0.7, 0.41, 0.99, 0.51],
    [0.48, 0.73, 0.12, 0.97, 0.47],
    [0.45, 0.35, 0.25, 0.9, 0.49],
])

First, look at the MSE for each row of X.

np.square(X - y).mean(axis=1)
=> array([0.00974, 0.00574, 0.0039 , 0.03452])

If we simply average the predictions, the MSE is below. It is worse than the best single model.

np.square(X - y).mean(axis=0).mean(axis=0)
=> 0.013475

Least Squares

One way to find optimal coefficients is least squares, or linear regression. Let's try it.

from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X.T, y)
reg.coef_
=> array([ 0.43575566, -0.05397578,  0.46076883,  0.21063718])

Ignoring the negative coefficient for a moment, using these parameters as weights gives the least-squares prediction.

X.T @ reg.coef_
=>array([0.51448131, 0.76448131, 0.26448131, 1.01448131, 0.51448131])

The values look close to the target, so calculate the MSE. It is good.

np.square(X.T @ reg.coef_ - y).mean(axis=0)
=> 0.00020970822203200185

But should an ensemble model really receive a negative coefficient? I want the coefficients to be positive.

Solving with Constrained Least Squares

So we can treat this as a least-squares problem with the constraint that coefficients must be non-negative. If there is an implementation of a constrained least-squares solver, this should be easy. Searching for restricted least square method scipy led me directly to one in SciPy. scipy.optimize.nnls solves non-negative least squares, so I used it.

weights, rnorm = scipy.optimize.nnls(X.T, y)
weights
=> array([0.29260857, 0.08404164, 0.52487508, 0.12761238])

Now we have positive coefficients. Use them as ensemble weights.

X.T @ weights
=> array([0.50522372, 0.75      , 0.24931469, 0.99686367, 0.50131296])
np.square(X.T @ weights - y).mean(axis=0)
=> 7.863453999510499e-06

This gives a minimum MSE with positive coefficients. For simple linear blending of ensemble results like this, solving with scipy.optimize.nnls looks easy and useful.

The ensemble predictions calculated with these weights correlated well between CV and both public and private LB. Being able to quickly compute the optimal blending ratio without hand-tuning weights was helpful in the final stage of a Kaggle competition, when time was short and everyone was mentally tired. This example uses simple linear-combination weights. There are many other ways to weight ensembles, so the implementation of "optimal weighting" will depend on the purpose.

After the competition ended, I learned that this method is a way of solving the linear problem in what is called stacking. I also tried optimizing ensembles by context with LightGBM, or GBDT. It optimized CV but did not work well for LB score. According to this discussion, neural networks can overfit easily, so GBDT is not well suited here. That makes sense.

People who know linear algebra can probably derive these coefficients naturally. In my case, I had recently read Optimization Mathematics Starting from Vectors and Matrices, and that was the first time I understood, at least a little, what linearity means. If I had not read that book, I probably would not have thought of keywords such as "constrained non-negative least squares". Fundamentals matter.

My first Kaggle competition ended with a team gold medal, 8th place

Tue, 21 Jun 2022 00:00:00 GMT

The Kaggle competition ended today, June 21, 2022, at 9:00 in the morning. Results appear immediately, so I watched them with my team members. We finished 8th out of 1,975 teams, the same as our public LB rank, and received a gold medal. It was my first Kaggle competition, and I was fortunate to be on an excellent team.

The team's solution is published in English, and I expect a Japanese version will be published separately. So here I will write mainly about my impressions as a first-time participant, rather than the solution itself. About 40 days earlier, Aota invited me to try U.S. Patent Phrase to Phrase Matching, a patent and natural language processing competition. My Kaggle knowledge was only at the level of having read a beginner book and tried the Titanic practice competition, so he taught me more concrete ways to proceed and how to think when working as a team.

For tools, we used Notion for information sharing, mainly a kanban board, and shared implementations on GitHub. Scores were recorded in a spreadsheet together with implementation serial-number files. Implementations were branched by plain copying. At first I wondered whether a notebook-style workflow, like Colab plus diff management on save, would be better. In practice, the copy-based approach made each file easier to compare, and because many variants were created, more than 100 training implementations in the end, copy-paste implementation was convenient. File-level diffs were also easy to take.

Asynchronous flow-style information sharing happened in chat, while synchronous discussions happened in a weekly regular meeting. When the team reached consensus on a method, we updated the baseline implementation. Aota mostly grew the baseline into something strong. Deciding the best CV split method early also paid off more and more as data accumulated. Running the baseline saved not only scores but also metadata needed to derive those scores, such as preprocessed data, which made it easier to investigate why a score appeared.

That is how the competition work began. From then on, I spent almost all the time I had previously spent on games and other things on Kaggle. Once I started, Kaggle had a game-like addictiveness. To get a higher score, I wanted to try one idea after another. Because it was my first competition, I also gained a lot of knowledge along the way, which made the process intellectually satisfying and fun.

That said, Kaggle is not easy enough that a beginner can get high scores just by imitating what they read in Code or Discussion. Around the middle of the competition, there were periods where improving local CV produced the opposite movement on LB. My motivation dropped a little. Because I was on a team, though, I still felt that we were moving forward together, which gave me enough motivation not to quit midway. At least for me, if I had been doing it alone, I probably would have stopped working on it partway through. The chart below is from the period when CV and LB were completely negatively correlated.

Ten days before the deadline, we merged with Raiii and Masuda, who had been participating as another team. We shared methods and ideas, built a reliable model that was the best at that point, and started creating ensembles for final submission. Then Raiii implemented a brilliant training idea three days before the deadline, thinking it might work. A single model with that method suddenly entered the 30s, and when we ensembled it, we climbed to 4th on the LB at once. The idea initially considered a graph-network solution, but the problem could be treated as a simpler 1:N problem, so perhaps it would train well if fed as a sequence. In hindsight that sounds plausible, but I would not have come up with the idea, and I was impressed by the skill needed to implement it in such a short time.

From this point the competition became too interesting. On the weekend right before the deadline, I spent almost all my time on Kaggle: creating various ensembles from new training methods, incorporating a way to automatically compute optimal blending weights for a weighted ensemble, because it is a linear combination with a unique optimum, and analyzing score data repeatedly. My small contribution was probably around there. It felt like an intense three-day festival. In the end, CV and LB scores became almost correlated, so we trusted the best CV and chose the final submission. On the morning of the deadline, we watched the result together while doing KPT and chatting. That gave the team a real sense of unity and achievement.

As written at the beginning, the final result was 8th place and a gold medal. It was simply very interesting. I was lucky to have strong team members, gain a lot of knowledge, and experience an exciting situation. This was an NLP competition, and because I went through the tutorials and beginner course for Transformers, I also gained practical knowledge that feels useful for work. Kaggle has competitions in many other domains too, and in unfamiliar domains everything seems fresh and full of things to learn.

My first Kaggle competition went very well because I was blessed with the team. I am grateful to the team members. But if you take a competition seriously, I think you should expect to spend at least about 50 hours on one competition. If you cannot make that time, joining may be difficult. This time, over about 40 days, I think I spent around 80 to 100 hours. Conversely, if you have the time, I felt that you can gain a lot from it, as described above. It is also simply fun if you think of it as a game. I would like to join another competition on a topic that interests me, but personally I may give up midway if I do it solo, so I would like to participate with at least one teammate for continuity and motivation.

Reading Basic Statistics by Kimio Miyakawa: statistics before machine learning

Mon, 13 Dec 2021 00:00:00 GMT

I started learning machine learning this year. For roughly the first six months, I learned mainly from typical machine learning books: basic supervised learning such as regression and classification, unsupervised learning such as clustering and dimensionality reduction, how to read results, how to process well-formed data, simple neural networks such as perceptrons, fully connected layers, CNNs, and RNNs implemented from scratch, and model building with TensorFlow and Keras. When abstracted libraries existed, I used them while thinking about which model was appropriate for the problem.

In practice, however, before building the model you need, you first have to look at the data and think. Explanations of exploratory data analysis often skip, especially for beginners, the question of what can be understood from data in the first place. They start from ideas such as correlation and distribution, assuming that background knowledge already exists. I could do something that looked like EDA, but in reality I was doing it without really understanding what I was looking at.

The same thing happens when building a model and validating it with an A/B test. Many explanations say something like "use a chi-square test and check statistical significance", and you end up validating things without understanding them well. What are degrees of freedom? What is a t statistic? Can you ignore degrees of freedom because internet data has a large sample size? Why is variance divided by n - 1? I did not understand even these basic points. And in reality, you are not always looking at one A/B test once. You may want to know whether repeated results are significant, where degrees of freedom matter because the number of trials is small, what you want to call significant, and what result you expected before running the test.

These are only examples, but I lacked the underlying premises. Because of that, my understanding was shallow and I could not always choose an appropriate method. I often did not understand basic terms that appeared in library documentation. It took me about half a year to realize that this "background knowledge I was missing" was statistics. The foundations needed for machine learning are calculus, linear algebra, and statistics. I had at least a minimal handle on calculus and linear algebra because I studied their basics in high school, and linear algebra also appeared in 3D programming, where I had implemented related code before.

Statistics, on the other hand, was almost absent from my working knowledge. I may have taken a university credit for it, but I had forgotten it completely. I did not even understand basic ideas such as looking at the mean and variance of data, standardizing a value, and knowing that a normal distribution falls within -1.96 to 1.96 about 95% of the time.

So I decided to learn the basics of statistics properly. At first, though, I did not know where to begin. Looking around bookstores, I found many all-in-one books combined with SQL or Python, but I could not tell which books would let me actually learn statistics. I tried O'Reilly's Practical Statistics for Data Scientists, but because I did not understand the underlying basics of statistics, I could not really get started.

Returning to the basics, I skimmed textbook-style books. Basic Statistics from University of Tokyo Press honestly felt too difficult for me, and I could not imagine finishing it. Around that time, I happened to see a video that recommended Basic Statistics, 4th Edition by Kimio Miyakawa. I tried it without much expectation, but it was extremely clear, and the example problems were excellent. I read it almost every day, worked through exercises with a pen, notebook, and scientific calculator, and finished it over a little under three months. I almost never finish this kind of textbook, so it must have matched me very well.

The explanations are concise and easy to understand. When enough time has passed that you might have forgotten something, the book gives page references and supplementary explanations, so it does not leave you behind. The exercises are also easy to imagine in real-world terms, for example: "If the defect rate of a product is 2%, what is the probability that 2 defective items are included among 200 products?" or "In an experiment, the average time until 10 fuses blew under a 25% overload was 9.2 minutes, with a standard deviation of 2.5 minutes. Estimate the mean time until this type of fuse blows under a 25% overload with a 99% confidence coefficient."

As you can see from the table of contents, the book covers mean and variance, frequency distributions, regression and correlation analysis, probability, random variables and probability distributions, major probability distributions, sampling distributions, estimation, hypothesis testing, and the statistical logic of regression. It teaches probability and regression, which are important foundations for machine learning algorithms, probability distributions that matter when looking at data, and estimation and testing for checking whether hypotheses hold. As I learned these topics gradually, I think my practical ability to look at data and form hypotheses improved substantially.

Looking back, for me the efficient timing would have been about three months after I started machine learning, once I had become able to use machine learning tools such as scikit-learn and TensorFlow at a basic level. In short, statistics is one of the foundations of machine learning, and it is better to learn at least the minimum basics early. I recommend Kimio Miyakawa's Basic Statistics as a clear way to learn those foundations. I am grateful to Professor Miyakawa for writing such a good book.

Adding type hints to a Python project and getting value from type checking

Wed, 07 Jul 2021 00:00:00 GMT

Recently I started writing Python for a machine learning project I am helping with. I remembered that Python has type hints, tried adding them to the project, and found that they were easy to introduce and immediately useful because I could get the benefits of static type checking.

When I first started writing type hints, though, I was not sure which documents I should read or how I should introduce type hints into a project. This article explains those points and shows one way to start writing types in a Python project.

Which Documents Should You Read?

If you have written another statically typed language before, these two documents should be enough to get started.

Understanding Typing
- Part of the documentation for pyright, a type checker implementation. It summarizes the important points concisely.
typing - Support for type hints
- The official Python documentation. It feels more like a reference manual than an easy tutorial, so I recommend reading it after Understanding Typing.

After reading those two, you should have a rough sense of Python type hints and may want to start writing them yourself. The sample code I used in an internal study session to check type behavior is here. It may not be exemplary code, but it is useful for observing behavior.

https://gist.github.com/hotchpotch/f0253edcc1d403ffb38740cf2e0ecc74

How to Introduce Type Hints

When adding type hints to a project, the first thing to choose is the type checker implementation. Python type hints are specified across several PEPs, and there are multiple tools that implement them. Major options include:

mypy
- A long-standing implementation that can also type-check Python 2 code.
pytype
- Google's implementation. It can also type-check Python 2.7.
pyre
- Meta's implementation. It supports Python 3 and later.
pyright
- Microsoft's implementation. It supports only Python 3 and later, and is written for Node.js.

All of these can be used through the Language Server Protocol, although I have not checked pytype myself, so they can be used from LSP-compatible IDEs and editors. If Python 3 or later is enough, my personal recommendation is pyright. It is fast, and GitHub issues and PRs seem to get quick responses, perhaps because Microsoft employees are assigned to them. In VS Code, it is easy to use through the Pylance extension.

One Pylance trap is that type checking is off by default. Really. Pyright's default is basic, so this surprised me. You should change it in VS Code's settings.json. Otherwise, you may think "I installed Pylance, now I can write type hints, and there are no errors, so everything is fine", when in fact type checking is simply not running. That is exactly what happened to me at first.

  // settings.json
  // 現在初期値は "off" となっている🤣なんでや～
  "python.analysis.typeCheckingMode": "basic",
  // workspace 全体に対して型チェック。
  "python.analysis.diagnosticMode": "workspace"

Pylance includes pyright, which is open source, but Pylance itself is not open source because it includes other features as well. If you use something other than VS Code, you can use pyright directly instead of Pylance, so the absence of Pylance should not be a serious problem.

Adding pyright to a Project

If you only want to install pyright as a CLI, install it through npm and run the pyright command.

$ npm install --global pyright
$ pyright
No configuration file found.
pyproject.toml file found at C:\Users\hotch\src\github.com\....
Loading pyproject.toml file at C:\Users\hotch\src\github.com\...\pyproject.toml
Assuming Python platform Windows
No include entries specified; assuming C:\Users\hotch\src\github.com\...
Auto-excluding **/node_modules
Auto-excluding **/__pycache__
Auto-excluding .git
stubPath C:\Users\hotch\src\github.com\...\typings is not a valid directory.
Searching for source files
Found 62 source files
0 errors, 0 warnings, 0 infos
Completed in 2.591sec

Type errors are detected when they exist, and pyright -w can watch files and keep running. In normal use, however, VS Code or another editor will run type checks through pyright, so there are not many occasions to invoke the command directly.

You can apply project-wide settings by placing either pyproject.toml or pyrightconfig.json in the project root.

# pyproject.toml
# https://github.com/microsoft/pyright/blob/main/docs/configuration.md
[tool.pyright]
pythonVersion = "3.7"
typeCheckingMode = "basic"

Personally, I wanted to pin the pyright version and install it quickly with npm install, so I put a Node package-management package.json in the project.

{
  "name": "pyright-exec",
  "version": "1.0.0",
  "description": "",
  "main": "",
  "scripts": {
    "pyright": "pyright"
  },
  "author": "",
  "license": "",
  "dependencies": {
    "pyright": "^1.1.155"
  }
}

If you want to run it in CI, you can configure GitHub Actions like this example.

      - uses: actions/setup-node@v1
        with:
          node-version: 14.x
      - name: Install node dependencies
        run: npm install
      - name: Typecheck
        run: npm run pyright

At first I wondered whether it was strange to add pyright, which is written for Node.js, to a Python project. But it has zero dependencies, while many tools pull in a lot of packages, so installation is immediate. That also makes it comfortable to use.

Python Versions and Type Hints

Once you start writing type hints in Python, the next thing you may hit is that available typing features differ by Python version. For example, list type hints behave differently depending on the version.

# py 3.9～ 何もしないで builtins のものとして書ける
l: list[str] = []

# py 3.7～は future import でバックポートを実現
from __future__ import annotations
l: list[str] = []

# py 3.5～ は typing モジュールを使う
# なおこの書き方は 3.9 ～非推奨に…
from typing import List
l: List[str] = []

Another commonly used type is TypedDict, which is supported from Python 3.8.

# py3.8～
from typing import TypedDict

If the typing module in the Python version you need does not include a type you want, you can usually install typing_extensions and use its backport.

# py3.7以前の場合。もちろん3.8以降でも動く。
from typing_extensions import TypedDict

Until you get used to it, you need to check the reference documentation to see which Python version supports which typing feature. For a new project without constraints, using the newest possible Python version is best. But the runtime environment may be older. For example, as of July 2021, Google Colab uses Python 3.7. If your code must run on Colab, you need to write it so it works on 3.7. I initially wrote code for 3.8, found that it did not run on Colab, and had to rewrite it for 3.7.

Type Stubs: Providing Types from Separate Files

Next, you may run into a library that does not provide type hints. In that case, you can use the type stub mechanism defined in PEP 561.

In pyright, the default stubPath is ./typings. .pyi type stub files placed under that directory are loaded. To see concrete examples of how to write .pyi, it is useful to look at typeshed, which collects type stubs for the standard library and well-known packages. Typeshed's stubs are included by default in tools such as pyright and mypy, so you normally do not need to install them separately.

pyright also has a pyright --createstub packagename command that generates a type stub template, and Pylance can create stubs through its UI. See pyright's Type Stub Files documentation for details.

Start Writing Type Hints

Writing type hints brings many benefits: better IDE completion, easier refactoring, improved development efficiency, more confidence from static type checks, and fewer runtime errors. Python's type hints can do far less than, for example, TypeScript's type system, and sometimes I wish I could manipulate types more expressively. But because of that limitation, most Python type annotations stay simple. That keeps the learning cost low and makes the types easy for most readers to understand.

The cost of introducing type hints into a project is also low, especially for a new project. If you are unsure whether to write type hints, I think it is worth trying them first.

Other Notes

As of July 2021, I help with machine learning projects as a software engineer at Nikkei Innovation Lab three to four days a week. I am not an employee; I help on a project basis. Most of the information in this article comes from that work, and this post is based on material I presented at an internal Nikkei study session.

Nikkei is an interesting environment for machine learning. It naturally has newspapers and other text suited to natural language processing, and it also has access to various large-scale datasets through the Nikkei electronic edition. If that sounds interesting, take a look at the hiring site. At the time of writing, Nikkei is also recruiting machine learning interns for a summer internship.

A CLI for finding similar documents in static site generators

Tue, 27 Apr 2021 00:00:00 GMT

I recently wrote a small CLI for outputting related entries for secon.dev. While doing that, I realized that most static site generators produce HTML from article files written in Markdown or HTML, such as .md and .html, located somewhere on disk. That means the same idea could be useful outside my own site, so I released it as a CLI for finding similar documents.

https://github.com/hotchpotch/similar-documents-cli
- Install it with pip install -U similar-documents

If you pass the files whose related entries you want to infer, the CLI outputs, in JSON, the most related files for each input file. The article files for this site, secon.dev, are not public, so as an example I tried inferring related entries from the Markdown articles in the source code for r7kamura.com, which r7kamura publishes.

$ time similar-documents --debug -k 3 -t japanese ~/src/github.com/r7kamura/r7kamura.com/articles/*.md > r7kamura_com_similar_articles.json
files to texts 951 documents
calc tfidf...
calc similarity...
assign similarity score
similar-documents --debug -k 3 -t japanese  >   8.03s user 3.98s system 383% cpu 3.131 total

It took about 3.1 seconds to infer related articles for 951 posts on my Ryzen 3900X environment. The JSON includes file paths from my machine, so I cleaned it up a little. The hash keys are article paths, and each entry array contains related articles in descending score order.

cat r7kamura_com_similar_articles.json | jq . | sd '/home/yu1/src/github.com/r7kamura/' 'https://' |sd '.md"' '"' > converted.json
cat converted.json

Here are a few excerpts from the JSON. For an article about duct rails, other duct-rail-related articles appear with high scores.

  "https://r7kamura.com/articles/2021-02-05-switchbot-hub-mini-on-rails": [
    [
      "https://r7kamura.com/articles/2020-12-19-google-home-mini-on-rails",
      0.6502251932677562
    ],
    [
      "https://r7kamura.com/articles/2021-01-18-nature-remo-on-rails",
      0.6088665752039284
    ],
    [
      "https://r7kamura.com/articles/2016-12-12-h",
      0.33070364498269256
    ]
  ],

For an article about the game Atelier Ryza, another Ryza article and an FF13 article appear. Looking at the scores, the other Ryza article is clearly the closest one.

  "https://r7kamura.com/articles/2021-02-13-atelier-ryza": [
    [
      "https://r7kamura.com/articles/2020-01-19-atelier-ryza",
      0.4632359977711961
    ],
    [
      "https://r7kamura.com/articles/2020-12-31-games-2020",
      0.17984491640184092
    ],
    [
      "https://r7kamura.com/articles/2021-01-30-final-fantasy-13",
      0.15056225381780178
    ]
  ],

For an article about bathtub cleaning, articles about bathtub detergent and drain cleaning are inferred as related.

  "https://r7kamura.com/articles/2021-02-19-laundry-cleaning": [
    [
      "https://r7kamura.com/articles/2020-11-02-lookplus",
      0.39103024934082137
    ],
    [
      "https://r7kamura.com/articles/2020-10-12-ember-restored",
      0.3759286934329018
    ],
    [
      "https://r7kamura.com/articles/2014-08-31-h",
      0.33743028929351304
    ]
  ],

With output like this, a single command can generate JSON for related entries. If a static site build reads that JSON, related-article features should be fairly easy to add to static site generators.

If you have persistent compute resources, something like Elasticsearch's More like this should produce more accurate related entries. But for static site generators, there is value in a command that can run casually at build time without maintaining any external state.

Technical Notes

Nothing complicated is happening. It uses the kind of document similarity method that appears in introductory machine learning material: count terms, calculate TF-IDF, and find similar documents by cosine similarity. For Japanese tokenization it uses MeCab through fugashi, a Python wrapper that is easy to use and makes dictionaries easy to install. TF-IDF and cosine similarity are handled entirely by scikit-learn. It is a classical method, but in practice it gives fairly reasonable related articles.

At the moment, .md and .html files are converted to text through parsers for their formats, and all other files are treated as plain text. In TF-IDF, terms that appear across many documents receive lower scores. So if every file uses the same particular format, words specific to that format should have limited effect on the score, even though converting to clean text is still preferable. That is why this simple approach seems to work reasonably well.

When sharp fails to install on WSL

Sun, 18 Apr 2021 00:00:00 GMT

This was a small but annoying trap. When I ran yarn add sharp on Linux inside WSL2, sharp started trying to build itself locally and failed with:

"/usr/include/vips/vips8:35:10: fatal error: glib-object.h:"

The confusing part was that the libvips-related packages existed, and /usr/include/glib-2.0/glib-object.h existed too. I tried several things, but none of them fixed it.

Looking more carefully at the other error messages, I found that it first failed here:

sharp: Installation error: ENOENT: no such file or directory, mkdir 'C:\Users\hotch\AppData\Roaming/npm-cache'

After that failure, it tried to start a local build. But why was Linux inside WSL referring to a Windows path in the first place? Looking at the environment, I found that the APPDATA environment variable was being passed into WSL as-is, and that path was coming from there.

In sharp's install code, the cache path includes env.APPDATA, which made it refer to the odd Windows path.

When I set APPDATA to an empty string, which evaluates as false in Node.js, sharp used the prebuilt Linux binary from the cache and installed successfully.

APPDATA='' yarn add sharp

This does not fix the underlying local-build problem, but sharp became usable, so that was good enough for now.

Fixing WSL clock drift after sleep from the Windows side

Sat, 17 Apr 2021 00:00:00 GMT

After developing in a WSL2 environment for the first time in a while, I noticed that the time was off. A problem I had not seen before seemed to have appeared: the WSL clock drifted after waking from sleep.

https://github.com/microsoft/WSL/issues/5324

This is an old issue, and it was fixed in yesterday's Insider Preview Linux kernel, so it will probably be fixed in Windows 10 21H1. At the moment, though, the problem still occurs.

On the Linux side, hwclock --hctosys fixes it. But running that command manually every time is annoying, and setting up periodic execution inside WSL2, for example with cron.d, is more troublesome than it first looks. So I prepared a batch file that runs the WSL-side command from Windows, then ran it from Windows Task Scheduler.

wsl.exe -u root --exec /usr/sbin/hwclock --hctosys

There are several examples online for running a Task Scheduler job when the machine wakes from sleep, so I will skip that part. This solved the issue. After writing that much, I found a comment on the issue that does the same thing in one line.

schtasks /Create /TN wsl-clock /TR "wsl.exe -u root sh -c hwclock -s" /SC ONEVENT /EC System /MO "*[System[Provider[@Name='Microsoft-Windows-Kernel-Power'] and (EventID=107 or EventID=507)]]" /F

Right, schtasks can register tasks in Task Scheduler. With this command, the WSL clock drift after waking from sleep can be fixed fairly easily.

NumPy cast overflow behavior can vary by environment and array size

Wed, 31 Mar 2021 00:00:00 GMT

I got caught by exactly what the title says: overflow behavior when casting in NumPy differed depending on the environment and the size of the data. It seems to take a different code path depending on the array length, and it took me a while to identify the cause. The correct answer is probably "do not pass overflowing data into a cast", but if behavior changes like this, I would at least appreciate a warning that an overflow happened.

This feels like the kind of bug that can look fine during development on a Mac, while already being broken, and then behave differently in production. I did not dig far enough to know whether this is specific to the Mac environment or whether it depends on the BLAS implementation, such as Intel MKL.

import platform
print(platform.system())
# オーバーフローして1になる
print(np.array([257.0], dtype="float32").astype('uint8'))
# オーバーフローして1になる
print(np.array([257.0, 0, 0, 0, 0, 0, 0], dtype="float32").astype('uint8'))
# オーバーフローして1になるが正しい、と思いきや、環境によっては丸め込まれて255になる
print(np.array([257.0, 0, 0, 0, 0, 0, 0, 0], dtype="float32").astype('uint8'))

Linux
[1]
[1 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0]

Windows
[1]
[1 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0]

On my Intel Mac, the value is clamped instead.

Darwin
[1]
[1 0 0 0 0 0 0]
[255   0   0   0   0   0   0   0]

Building a simple fully connected neural network with TensorFlow 2 without Keras

Sun, 28 Mar 2021 00:00:00 GMT

In TensorFlow, using the high-level Keras API makes it easy to create and train neural network models and do many other things you want to do with neural networks. But as a neural network beginner, I had been using it without really understanding what was happening. So I implemented a simple feed-forward neural network without Keras, using only TensorFlow APIs.

This article is implemented with reference to Deep Learning from Scratch. It is also a personal note for my own understanding. If you want to understand this properly, you should probably read Deep Learning from Scratch and the TensorFlow guide. The original Jupyter notebook is here.

By moving my hands and building everything once, I understood more about TensorFlow and neural network basics: which function affects what, what manual training feels like, how automatic differentiation works and how to use it, and why Keras is useful. Guide pages that I had not understood before became mostly readable.

Some very capable people I have seen can learn the logic and implement it in a program without much trouble. Even many ordinary capable people can read a book in an unfamiliar field and understand and implement it. In my case, that is often not enough. I often understand only after actually moving my hands and observing behavior. This reminded me of that again.

The simple neural network implemented here looks like this:

Create a layer with two weight parameters:
- weights of shape (input count, unit count)
- bias weights of shape (unit count,)
- During forward propagation, apply an activation function to the input multiplied by weights plus bias.
Create a network that manages layers.
- During inference, apply layers in order, or forward propagation, and output the result.
- Apply a loss function as a metric for how correct inference is.
- During learning, which obtains optimal weight parameters from training data and applies them with a learning rate, calculate gradients so that the loss function becomes smaller, and update layer parameters little by little in reverse order with backpropagation. Backpropagation uses TensorFlow autodiff.
Give training data to this network and train it.

First, implement a simple layer.

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

# GPU 使わない設定
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

# seed の固定
np.random.seed(42)
tf.random.set_seed(42)

class SimpleLayer():
    def __init__(self, input_dim, units, activation):
        # ウェイトを正規分布で初期化。Variable で更新可能な値として作る。
        self.w = tf.Variable(tf.random.normal([input_dim, units]) * 0.01, name='weight')
        # バイアスを 0 で初期化
        self.b = tf.Variable(tf.zeros([units]), name="bais")
        # 活性化関数
        self.activation = activation
    
    @property
    def weights(self):
        return [self.w, self.b]

    def forward(self, x):
        y = x @ self.w + self.b # y = tf.matmul(x, self.w) + self.b # と同等
        return self.activation(y)

    def __call__(self, x):
        return self.forward(x)

identify_function = lambda x: x
zero_function = lambda x: x * 0

l1 = SimpleLayer(2, 2, identify_function)
l2 = SimpleLayer(2, 1, zero_function)

print(f'l1 weights: {l1.weights}')
a1 = l1([[10, 20]]) # SimpleLayer.__call__ を呼び出す
print(f'a1: {a1}')
a2 = l2(a1)
print(f'a2: {a2}')

print(l2(l1([[10, 20]])))

Next, implement several simple activation functions.

def step_function(x:tf.Tensor):
    return tf.cast(x > 0, tf.uint8)

step_function(tf.constant([1, 0, 3, -3]))

def sigmoid(x:tf.Tensor):
    return 1 / (1 + tf.exp(-x))

sigmoid(tf.constant([0, 1.0, -2.0]))

def relu(x: tf.Tensor):
    return tf.maximum(0.0, x)

relu(tf.constant([-2.0, -1.0, 1.0, 2.0]))

def tanh(x: tf.Tensor):
    return (tf.exp(x) - tf.exp(-x)) / (tf.exp(x) + tf.exp(-x))

tanh(tf.constant([-3.0,-1.0, 0.0, 1.0,3.0]))

For output layer activation functions, implement the identity function, which does nothing, and softmax, which is used for classification problems.

def identity(x: tf.Tensor):
    return x

identity(tf.constant([1.0, 0.0, -1.0, -3.0]))

def softmax(x:tf.Tensor):
    e = tf.exp(x -tf.reduce_max(x))
    s = tf.reduce_sum(e)
    return e / s

print(softmax(tf.constant([0.3,2.9,4.0])))
print(softmax(tf.constant([1010.0, 1000, 990])))

Next, implement loss functions: sum of squared error and cross entropy error for classification models, and root mean squared error for regression models.

def sum_squared_error(x:tf.Tensor, y: tf.Tensor):
    return tf.reduce_mean(0.5 * tf.reduce_sum((x-y) ** 2, axis=tf.rank(x)-1))

y1 = [0.0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
y2 = [0.0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
x1 = [0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0]

print(sum_squared_error(tf.constant(x1), tf.constant(y1)))
print(sum_squared_error(tf.constant(x1), tf.constant(y2)))
print(sum_squared_error(tf.constant([x1, x1]), tf.constant([y1, y2])))

def cross_entropy_error(x:tf.Tensor, y: tf.Tensor):
    delta = tf.constant(1e-7)
    if tf.rank(x) == 1:
        x = tf.reshape(x, (1, tf.size(x)))
        y = tf.reshape(y, (1, tf.size(y)))
    batch_size = x.shape[0]
    return -tf.reduce_sum(y * tf.math.log(x + 1e-7)) / batch_size

print(cross_entropy_error(tf.constant(x1), tf.constant(y1)))
print(cross_entropy_error(tf.constant(x1), tf.constant(y2)))
print(cross_entropy_error(tf.constant([x1, x1]), tf.constant([y1, y2])))

def root_mean_squared_error(x:tf.Tensor, y: tf.Tensor):
    diff = y - x
    return tf.sqrt(tf.reduce_mean((diff)**2))

y = [[100.0], [160], [60]]
x = [[80.0], [100], [100]]

print(root_mean_squared_error(tf.constant(x), tf.constant(y)))

Next, check TensorFlow behavior for gradient calculation with automatic differentiation.

For the function f1 below, the derivative at x=3 is 40.

f1 = lambda x: x**3 + 2*x**2 + x

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    z = f1(x)
tape.gradient(z, [x])

Next, partial differentiation. For the function f2 below, when w1 and w2 are 5 and 3, the partial derivatives with respect to w1 and w2 are 36 and 10. This code is from Chapter 12 - Custom Models and Training with TensorFlow.

def f2(w1, w2):
    return 3 * w1**2 + 2*w1 * w2

w1, w2 = tf.Variable(5.0), tf.Variable(3.0)

with tf.GradientTape() as tape:
    z = f2(w1, w2)
print(tape.gradient(z, [w1, w2]))

try:
    print(tape.gradient(z, [w1, w2]))
except RuntimeError:
    print('二回目の呼び出し時には、リソースが削除されていてエラーになる')

with tf.GradientTape(persistent=True) as tape:
    z = f2(w1, w2)

print(tape.gradient(z, [w1]))
print(tape.gradient(z, [w2]))
del tape # 開放する

Now check whether the functions and layers created so far behave as intended by trying a simple linear-function prediction. Create data based on 2x + 10 with small random noise.

x = np.arange(-50, 50, 2)
line_2x_1 = 2 * x  + 10
noise = -10 * np.random.rand(len(x)) + 5
dots_2x_1 = line_2x_1 + noise
plt.plot(x, line_2x_1)
plt.plot(x, dots_2x_1, 'o')
plt.show()

y = tf.expand_dims(tf.constant(dots_2x_1, dtype=tf.float32), axis=1)
X = tf.expand_dims(tf.constant(x, dtype=tf.float32), axis=1)

First, without using a neural network, check whether it works well with sklearn.

from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X, y)
print(reg.score(X, y))
reg.predict([[-50], [0], [50], [100]])

0.9975753493086111

array([[-89.75652361],
       [ 10.54682827],
       [110.85018015],
       [211.15353203]])

Next, train a two-layer neural network as a regression model. If it can predict regression similarly to sklearn, that is fine.

layer1 = SimpleLayer(1, 32, relu)
layer2 = SimpleLayer(32, 1, identify_function)

loss_function = root_mean_squared_error
lr = 0.003

predict = lambda x: layer2(layer1(x))
for i in range(10000):
    # 勾配を求める
    with tf.GradientTape() as tape:
        y_pred = predict(X)
        z = loss_function(y_pred, y)
    (l1_w_grads, l1_b_grads), (l2_w_grads, l2_b_grads) = tape.gradient(z, [layer1.weights, layer2.weights])
    # SDG で勾配を学習する
    layer1.w.assign_sub(lr * l1_w_grads)
    layer1.b.assign_sub(lr * l1_b_grads)
    layer2.w.assign_sub(lr * l2_w_grads)
    layer2.b.assign_sub(lr * l2_b_grads)
    if (i % 1000 == 0):
        print('iter {} / train loss: {:.3}'.format(i, z.numpy()))

print('train loss: {:.3}'.format(loss_function(predict(X), y)))
print(predict(tf.constant([[-50], [0], [50], [100]], dtype=tf.float32)))

iter 0 / train loss: 58.6
iter 1000 / train loss: 5.08
iter 2000 / train loss: 3.71
iter 3000 / train loss: 3.79
iter 4000 / train loss: 3.5
iter 5000 / train loss: 3.3
iter 6000 / train loss: 3.18
iter 7000 / train loss: 3.09
iter 8000 / train loss: 3.03
iter 9000 / train loss: 2.98
train loss: 2.94
tf.Tensor(
[[-92.20102  ]
 [ 10.1016445]
 [109.65451  ]
 [209.20737  ]], shape=(4, 1), dtype=float32)

It seems to work. Next, implement a network that handles the training above more conveniently.

class SimpleSequenceNetwork:
    def __init__(self, layers, loss_function, lr=0.01):
        self.layers = layers
        self.loss_function = loss_function
        self.lr = lr
    
    def predict(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def loss(self, x, target):
        y = self.predict(x)
        return self.loss_function(y, target)

    def accuracy(self, x, target):
        y = self.predict(x)
        y = tf.argmax(y, axis=1)
        target = tf.argmax(target, axis=1)

        accuracy = tf.math.count_nonzero(y == target) / x.shape[0]
        return accuracy
    
    @property
    def all_weights(self):
        return tf.nest.flatten([layer.weights for layer in self.layers])

    # 損失関数の、重みパラメータに対する勾配を求める
    def gradient(self, x, target):
        with tf.GradientTape() as tape:
            tape.watch(x)
            z = self.loss(x, target)
        return tape.gradient(z, self.all_weights)

    # 単純な勾配下降法(SDG)で、レイヤーの重みパラメータを更新する
    def update_variables_by_sdg(self, grads):
        for (grad, val) in zip(grads, self.all_weights):
            val.assign_sub(self.lr * grad)

    # 勾配を求め、パラメータを更新する
    def training(self, x, target):
        grads = self.gradient(x, target)
        self.update_variables_by_sdg(grads)

Load the dataset for training. Use the familiar MNIST digit data from 0 to 9.

import tensorflow_datasets as tfds

ds = tfds.load("mnist", as_supervised=True)
test_ds = ds['test']
train_ds = ds['train']

for (i, (image, label)) in enumerate(train_ds.take(12)):
    plt.subplot(3, 4, i+1)
    plt.imshow(image, cmap='gray')
    plt.subplots_adjust(wspace=0, hspace=1)
    plt.title(label.numpy())
    plt.axis('off')
plt.show()

def preprocess(image, label):
    # 画像は (28,28,1) を (784,) にして、0.～1. の範囲へ
    image = tf.cast(tf.reshape(image, (-1,)), tf.float32) / 255.0
    # ラベルはワンホットベクトルに
    label = tf.one_hot(label, 10, dtype=tf.float32)
    return image, label

train_ds = train_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).cache()
test_ds = test_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).cache()

print((len(train_ds), len(test_ds)))

(60000, 10000)

Create a function that trains the network based on the dataset.

def fit(network, train_ds: tf.data.Dataset, test_ds: tf.data.Dataset, epochs=20, batch_size=32):
    history_train_loss, history_train_accuracy, history_test_accuracy = [], [], []
    for epoch in range(1, epochs + 1):
        train_loss, train_accuracy, test_accuracy = [], [], []
        for (X_batch, y_batch) in train_ds.shuffle(1000).batch(batch_size).prefetch(1):
            network.training(X_batch, y_batch)
            train_loss.append(network.loss(X_batch, y_batch))
            train_accuracy.append(network.accuracy(X_batch, y_batch))
        for (X_batch, y_batch) in test_ds.shuffle(1000).batch(batch_size).prefetch(1):
            test_accuracy.append(network.accuracy(X_batch, y_batch))
        
        print("train acc, test acc, train loss | {:.4}, {:.4}, {:.4}".format(
            tf.reduce_mean(train_accuracy).numpy(),
            tf.reduce_mean(test_accuracy).numpy(),
            tf.reduce_mean(train_loss).numpy()
        ))
        history_train_loss.extend(train_loss)
        history_train_accuracy.extend(train_accuracy)
        history_test_accuracy.extend(test_accuracy)
    return {
        'train_loss': np.array(history_train_loss),
        'train_accuracy': np.array(history_train_accuracy),
        'test_accuracy': np.array(history_test_accuracy)
    }

Actually create layers and a neural network, then train it.

%%time
input_layer = SimpleLayer(784, 100, relu)
hidden_layer = SimpleLayer(100, 50, relu)
output_layer = SimpleLayer(50, 10, softmax) 
network = SimpleSequenceNetwork([input_layer, hidden_layer, output_layer], cross_entropy_error, lr=0.1)

history = fit(network, train_ds, test_ds, epochs=5, batch_size=32)
plt.plot(history['train_loss'])
plt.show()

train acc, test acc, train loss | 0.7391, 0.9364, 4.336
train acc, test acc, train loss | 0.9798, 0.9572, 3.604
train acc, test acc, train loss | 0.9897, 0.9674, 3.547
train acc, test acc, train loss | 0.9936, 0.9695, 3.524
train acc, test acc, train loss | 0.9954, 0.9706, 3.511

Try replacing the activation function.

%%time
input_layer = SimpleLayer(784, 100, tanh)
hidden_layer = SimpleLayer(100, 50, tanh)
output_layer = SimpleLayer(50, 10, softmax) 
network = SimpleSequenceNetwork([input_layer, hidden_layer, output_layer], cross_entropy_error, lr=0.1)

history = fit(network, train_ds, test_ds, epochs=5, batch_size=32)
plt.plot(history['train_loss'])
plt.show()

train acc, test acc, train loss | 0.774, 0.9139, 4.245
train acc, test acc, train loss | 0.953, 0.9455, 3.687
train acc, test acc, train loss | 0.973, 0.9567, 3.607
train acc, test acc, train loss | 0.9815, 0.9625, 3.57
train acc, test acc, train loss | 0.9869, 0.9663, 3.548

With this simple neural network, MNIST digit label classification also worked reasonably well. The hardest part, updating weights through backpropagation, can be done easily with TensorFlow autodiff, so I did not need to write that process myself.

Inferring Hiragana in the Browser with TensorFlow.js

Mon, 15 Mar 2021 00:00:00 GMT

While trying out TensorFlow.js, I made a demo that recognizes hiragana using only the browser.

https://tfjs-hiragana.surge.sh/

The model gets about 99.0% accuracy on the original dataset. In the browser, though, it makes a fair number of mistakes. I suspect this is because the training data contains handwriting from pens and brushes, while browser input has a different feel. Characters such as "か" are especially hard.

The trained weights loaded by the browser are 437 KB, and the model metadata is 49 KB. I trained a simple fully connected network in Python TensorFlow: 48x48 grayscale hiragana images, about 700,000 characters, 20 layers, 100 neurons, and 71 output classes. The original Keras model was about 5 MB. After converting it for JavaScript with tensorflowjs_converter --quantize_uint8 --input_format keras input.h5 output_dir, the model metadata became about 49 KB of JSON and the trained data about 437 KB. gzip reduces it by another 20% or so.

The source data is the Character Image Dataset: 73 Hiragana Characters. Its image license is Public Domain Mark 1.0. I used 71 characters from it, excluding ゑ and ゐ.

Converting from Keras to TensorFlow.js is easy if you follow the documentation, but only officially supported APIs can be imported.

Models using unsupported ops or layers, e.g. custom layers, Lambda layers, custom losses, or custom metrics, cannot be automatically imported, because they depend on Python code that cannot be reliably translated into JavaScript.

Custom loss functions and similar pieces are common, so models that are not fairly plain Keras or Python TensorFlow models may not import cleanly. If the final target is TensorFlow.js, it may be better to design the Python model with that in mind, or to build and train the model in TensorFlow.js from the beginning. Node + GPU support seems to exist now, and the API is also fairly complete.

TensorFlow.js and TensorFlow Lite are interesting because, despite their constraints, they let you run inference on browsers, smartphones, and other edge devices. I was impressed by how casually this kind of inference can now be done.

hotchpotch.dev

trn: A Fast Local Translation Command for macOS Tahoe

What trn Does

Why low Is the Default

Speed

What This Suggests About the Mac

OpenAI API-Compatible Access Without Additional API Billing via Codex

Why Subscription-Scoped Use Matters

How It Works

Notes and Caveats

Difference Between the Codex and Claude Code Subscription Positions

Japanese Full-Text Search in SQLite and DuckDB with Vaporetto

Building a Machine Learning PC with Two RTX 5090 GPUs

Power

GPU

Motherboard

Case

Airflow

CPU

RAM

Storage: NVMe

Storage: HDD

CPU Cooler

Assembly

OS

Impressions After Building an RTX 5090 x2 PC

Looking Back on 2025

Life

Child

House

Car

Technology

OpenProvence: A Model for Removing Irrelevant Sentences Before Passing Text to an LLM

Trying OpenProvence

Using It from Python

Using Coding Agents

Closing

Evaluating the Japanese Performance of Embedding Gemma 300M with JMTEB

JMTEB v1 Benchmark

JQaRA / JaCWIR

Strong MTEB Does Not Necessarily Mean Strong Japanese Performance

JFWIR: A Large Japanese Information Retrieval Dataset Built from Japanese FineWeb

Characteristics of JFWIR

1. Large and Diverse

2. Benchmark Results

Usage

Dataset Creation Process

1. Collecting High-Quality Japanese Web Text

2. Generating Diverse Queries

3. Creating Hard Negatives

Future Work

Summary

Related Links

Datasets

Models

Articles and Papers

License

Citation Information

Evaluating the Japanese Performance of Qwen3 Embedding with JMTEB

JMTEB Results

Extra: Reading the Qwen3 Embedding Paper

Releasing Small, Fast, and Practical Japanese Rerankers: tiny, xsmall, small, and base v2

What Rerankers Are, and Why Small Rerankers Matter

Benchmark Performance

Inference Speed

Short Technical Report

Closing

query-crafter-japanese: A Model for Generating Queries for Information Retrieval

Training query-crafter-japanese

Evaluation

Closing

FineWeb2 Edu Japanese: A High-Quality Educational Japanese Dataset

Background

Filtering Educational Data

Token Counts

Removing Web-Specific Noise

Notes

License

Citation Information

Releasing a Japanese StaticEmbedding Model for Practical 100x Faster Text Embeddings

What `trn` Does

Why `low` Is the Default