Running Japanese Tokenizer Models with text-embeddings-inference
Hugging Face's text-embeddings-inference is a fast production inference server, but many Japanese models cannot be used directly because they do not provide tokenizer.json.
text-embeddings-inference, or TEI, is an inference server provided by Hugging Face. It is written in Rust, provides Docker containers for various GPU architectures, and when the GPU architecture supports FlashAttention 2, it is often about 1.5 to almost 2 times faster than running inference with Python's Transformers library. I find it useful as a high-performance production inference server.
One problem in Japanese environments is that TEI requires a Rust-based fast tokenizer, in other words a model that ships tokenizer.json. Many Japanese Transformer models instead tokenize in Python with morphological analysis libraries and dictionaries such as MeCab and UniDic, so they do not provide tokenizer.json and TEI cannot load them as-is.
This blocked me at first, but I found a workaround that makes some APIs usable, such as /embed and /embed_sparse, though unfortunately not /rerank. As an example, I document the method here using cl-nagoya/ruri-base.
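Before applying the workaround, it can help to confirm that a model really lacks tokenizer.json. The following is a minimal sketch, not part of the original procedure: the helper names are hypothetical, and the Hub lookup assumes the huggingface_hub package is installed and uses its `list_repo_files` function.

```python
from typing import Iterable


def has_fast_tokenizer(repo_files: Iterable[str]) -> bool:
    """Return True if the file listing contains a top-level tokenizer.json."""
    # Exact match only: tokenizer_config.json alone is not enough for TEI.
    return "tokenizer.json" in set(repo_files)


def repo_has_fast_tokenizer(repo_id: str) -> bool:
    """Check a Hugging Face Hub repository (hypothetical helper, needs network)."""
    # Imported lazily so has_fast_tokenizer works without huggingface_hub.
    from huggingface_hub import list_repo_files

    return has_fast_tokenizer(list_repo_files(repo_id))
```

Calling `repo_has_fast_tokenizer("cl-nagoya/ruri-base")` should return False as long as that repository has no tokenizer.json, which is the situation this article works around.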
Prepare a Dummy tokenizer.json
TEI checks for tokenizer.json when starting the model, and it will not start without one. Therefore we prepare a dummy tokenizer.json. The file only has to satisfy this startup check, because the workaround sends token_ids that were already tokenized on the client. You can create one yourself or use one from a public model. For this example I used the tokenizer.json from hotchpotch/mMiniLMv2-L6-H384.
I created a version of ruri-base with this tokenizer.json added and named it hotchpotch/ruri-base-dummy-fast-tokenizer-for-tei.
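If you want to build such a model directory yourself, the step can be sketched as follows. This is an illustration under assumptions, not the exact procedure used for the published model: `add_dummy_tokenizer` is a hypothetical helper, and both paths are expected to exist locally already.

```python
import json
import shutil
from pathlib import Path


def add_dummy_tokenizer(model_dir: str, donor_tokenizer_json: str) -> Path:
    """Copy a donor tokenizer.json into a local copy of the model."""
    source = Path(donor_tokenizer_json)
    # Fail early if the donor file is not valid JSON.
    with source.open(encoding="utf-8") as f:
        json.load(f)
    target = Path(model_dir) / "tokenizer.json"
    # Never overwrite a tokenizer.json that the model already ships.
    if target.exists():
        raise FileExistsError(f"{target} already exists; refusing to overwrite it")
    shutil.copyfile(source, target)
    return target
```

One way to obtain the inputs is huggingface_hub: `snapshot_download("cl-nagoya/ruri-base", local_dir=...)` for the model files and `hf_hub_download("hotchpotch/mMiniLMv2-L6-H384", "tokenizer.json")` for the donor file. Refusing to overwrite is a deliberate choice, since replacing a real tokenizer.json would silently break a model that works without the workaround.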
Start the Server with the Dummy tokenizer.json Model
Prepare a docker-compose.yaml like this:
services:
  ruri-base:
    # Change the image to one that matches your architecture.
    image: ghcr.io/huggingface/text-embeddings-inference:86-1.5
    ports:
      - "8080:80"
    volumes:
      - /tmp/docker-tei-data:/data
    # Change pooling to match the model architecture.
    command: [ "--model-id", "hotchpotch/ruri-base-dummy-fast-tokenizer-for-tei", "--dtype", "float16", "--pooling", "mean", "--max-batch-tokens", "131072", "--max-client-batch-size", "16" ]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [ gpu ]
Start it:
$ docker compose up
...
ruri-base-1 | 2024-09-30T06:51:45.266929Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1778: Starting HTTP server: 0.0.0.0:80
ruri-base-1 | 2024-09-30T06:51:45.266940Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1779: Ready
It should now be running on port 8080.
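In scripts it is often more convenient to wait for the server programmatically than to watch the log. The sketch below polls a health endpoint until it answers. It assumes TEI's /health route and the port mapping from the compose file above; `http_ok` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts and HTTP errors all mean "not ready yet".
        return False


def wait_until_ready(
    url: str = "http://127.0.0.1:8080/health",  # assumed TEI health route
    attempts: int = 30,
    delay: float = 1.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll `url` until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

The probe is injectable so the retry logic can be exercised without a running container. The first start can take a while because the model has to be downloaded into /data, so a generous number of attempts is reasonable.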
Convert to token_ids Locally and Call the API
Next, tokenize locally and call the API with token_ids.
import numpy as np
import requests
from transformers import AutoTokenizer

# use_fast=False loads the Python (slow) tokenizer instead of the dummy tokenizer.json.
tokenizer = AutoTokenizer.from_pretrained("hotchpotch/ruri-base-dummy-fast-tokenizer-for-tei", use_fast=False)

sentences = [
    "クエリ: 瑠璃色はどんな色?",
    "文章: 瑠璃色(るりいろ)は、紫みを帯びた濃い青。名は、半貴石の瑠璃(ラピスラズリ、英: lapis lazuli)による。JIS慣用色名では「こい紫みの青」(略号 dp-pB)と定義している[1][2]。",
    "クエリ: ワシやタカのように、鋭いくちばしと爪を持った大型の鳥類を総称して「何類」というでしょう?",
    "文章: ワシ、タカ、ハゲワシ、ハヤブサ、コンドル、フクロウが代表的である。これらの猛禽類はリンネ前後の時代(17~18世紀)には鷲類・鷹類・隼類及び梟類に分類された。ちなみにリンネは狩りをする鳥を単一の目(もく)にまとめ、vultur(コンドル、ハゲワシ)、falco(ワシ、タカ、ハヤブサなど)、strix(フクロウ)、lanius(モズ)の4属を含めている。",
]

# Without return_tensors, input_ids is a plain list of token id lists,
# one per sentence, so sentences of different lengths need no padding.
token_ids = tokenizer(sentences, padding=False, truncation=False)["input_ids"]

url = "http://127.0.0.1:8080/embed"
# Send token ids rather than raw text so the server never tokenizes anything.
payload = {"inputs": token_ids, "normalize": False, "truncate": True}
headers = {"Content-Type": "application/json"}
response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()  # stop here on an HTTP error instead of parsing an error body
embeddings = np.array(response.json())
print(embeddings.shape)

# Compute the cosine similarity matrix.
normalized_embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarities = np.dot(normalized_embeddings, normalized_embeddings.T)
print(similarities)
Result:
(4, 768)
array([[1. , 0.94194159, 0.68661375, 0.71621216],
[0.94194159, 1. , 0.66622363, 0.68591373],
[0.68661375, 0.66622363, 1. , 0.87196226],
[0.71621216, 0.68591373, 0.87196226, 1. ]])
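To read the matrix above as a retrieval result, compare each query row only against the passage columns. The following small sketch does that with the values copied from the result and rounded to four decimals; `best_passage` is a hypothetical helper, and the index lists simply mirror the order of `sentences`.

```python
# Cosine similarities copied from the result above, rounded to four decimals.
similarities = [
    [1.0000, 0.9419, 0.6866, 0.7162],
    [0.9419, 1.0000, 0.6662, 0.6859],
    [0.6866, 0.6662, 1.0000, 0.8720],
    [0.7162, 0.6859, 0.8720, 1.0000],
]
query_rows = [0, 2]    # indices of the "クエリ: " sentences
passage_cols = [1, 3]  # indices of the "文章: " sentences


def best_passage(similarity_matrix, query_row, passage_columns):
    """Return the passage index with the highest similarity to the query."""
    return max(passage_columns, key=lambda col: similarity_matrix[query_row][col])


for q in query_rows:
    print(q, "->", best_passage(similarities, q, passage_cols))
# prints "0 -> 1" and "2 -> 3"
```

Each query is closest to its own passage (0.9419 and 0.8720), while the cross-topic pairs stay around 0.67 to 0.72.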
The request returns dense vectors whose cosine similarities are almost the same as the values shown in the ruri-base model card. With this approach, TEI can be used with Japanese tokenizers for APIs other than reranking. Be careful, though: if you send ordinary text instead of token_ids, TEI tokenizes it with the dummy tokenizer.json, which does not match the model's vocabulary, and the results will be completely off.
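The compose file above starts TEI with --max-client-batch-size 16, which caps the number of inputs per request, so larger workloads have to be split on the client. A minimal sketch of that splitting follows; the helper names are hypothetical, and the payload fields are the same ones used in the /embed request above.

```python
from typing import Iterator, List


def chunked(items: List[List[int]], size: int) -> Iterator[List[List[int]]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_embed_payloads(
    token_ids: List[List[int]],
    max_client_batch_size: int = 16,  # matches the value in the compose file
) -> List[dict]:
    """Build one /embed payload per batch of pre-tokenized sentences."""
    return [
        {"inputs": batch, "normalize": False, "truncate": True}
        for batch in chunked(token_ids, max_client_batch_size)
    ]
```

Each payload can then be posted to /embed in turn and the returned lists concatenated in order, which keeps the embeddings aligned with the original sentences.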
The real solution would be to send pull requests so TEI can start without tokenizer.json and the /rerank API also works properly. I have not done that because implementing it in Rust and communicating through the PR process feels like more work than I currently want to take on. I would be grateful if someone did.