Running Models with Japanese Tokenizers on text-embeddings-inference

created 2024-09-30

text-embeddings-inference (TEI) is an inference server from Hugging Face. It is written in Rust and ships Docker images for several GPU architectures. On GPUs that support FlashAttention 2, it is often about 1.5 to nearly 2 times faster than running inference with the Python Transformers library. I find it useful as a high-performance production inference server.

One problem in Japanese environments is that TEI requires a Rust-based fast tokenizer, in other words a model that ships a tokenizer.json. Many Japanese Transformer models tokenize with Python-side morphological analyzers and dictionaries such as MeCab and UniDic, so they have no tokenizer.json and cannot be served by TEI as they are.

This was a serious problem for me at first, but I found a workaround that makes some APIs usable, such as /embed and /embed_sparse, though unfortunately not /rerank. As an example, I record the method here using cl-nagoya/ruri-base.

Prepare a Dummy tokenizer.json

TEI checks for tokenizer.json when starting the model, and it will not start without one. Therefore we prepare a dummy tokenizer.json. You can create one yourself or use one from a public model. For this example I used the tokenizer.json from hotchpotch/mMiniLMv2-L6-H384.

I published a copy of ruri-base with this tokenizer.json added as hotchpotch/ruri-base-dummy-fast-tokenizer-for-tei.
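If you want to build such a copy yourself, the file-level step is small. The following is a minimal sketch, assuming the model files and a donor tokenizer.json have already been downloaded to local paths; the function name add_dummy_tokenizer and both paths are hypothetical, and uploading or mounting the resulting directory is not shown.

```python
import json
import shutil
from pathlib import Path


def add_dummy_tokenizer(model_dir: str, donor_tokenizer_json: str) -> Path:
    """Copy a donor tokenizer.json into a local model directory.

    Both arguments are hypothetical local paths: model_dir holds the files of
    the slow-tokenizer model, donor_tokenizer_json comes from any model that
    ships a tokenizer.json.
    """
    donor = Path(donor_tokenizer_json)
    target = Path(model_dir) / "tokenizer.json"

    # Fail early if the donor file is not valid JSON.
    with donor.open(encoding="utf-8") as f:
        json.load(f)

    # Never overwrite a tokenizer.json that is already there.
    if target.exists():
        raise FileExistsError(f"{target} already exists; refusing to overwrite it")

    shutil.copyfile(donor, target)
    return target
```

The function only copies the file and refuses to overwrite an existing tokenizer.json, so it cannot damage a model that already has a real fast tokenizer.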

Start the Server with the Dummy tokenizer.json Model

Prepare a docker-compose.yaml like this:

services:
  ruri-base:
    # Change the image tag to match your GPU architecture (86-1.5 targets Ampere 86, e.g. A10 or A40).
    image: ghcr.io/huggingface/text-embeddings-inference:86-1.5
    ports:
      - "8080:80"
    volumes:
      - /tmp/docker-tei-data:/data
    # Change pooling to match the model architecture.
    command: [ "--model-id", "hotchpotch/ruri-base-dummy-fast-tokenizer-for-tei", "--dtype", "float16", "--pooling", "mean", "--max-batch-tokens", "131072", "--max-client-batch-size", "16" ]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [ gpu ]

Start it:

$ docker compose up
...
ruri-base-1  | 2024-09-30T06:51:45.266929Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1778: Starting HTTP server: 0.0.0.0:80
ruri-base-1  | 2024-09-30T06:51:45.266940Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1779: Ready

It should now be running on port 8080.
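Before sending embedding requests from a script, it can help to wait until the server actually answers. The sketch below polls a health endpoint using only the standard library. It assumes TEI exposes GET /health on the published port (8080 in the compose file above); the names http_ok and wait_until_ready are made up for this example.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses: refused, timed out, or non-2xx.
        return False


def wait_until_ready(
    base_url: str,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll <base_url>/health until it succeeds or timeout_s has elapsed."""
    health_url = base_url.rstrip("/") + "/health"  # assumed TEI health route
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(health_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A client would call wait_until_ready("http://127.0.0.1:8080") once at startup and stop if it returns False. The probe parameter exists only so the polling loop can be exercised without a running server.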

Convert to token_ids Locally and Call the API

Next, tokenize locally with the model's original Python tokenizer (use_fast=False) and send the resulting token_ids to the API.

from transformers import AutoTokenizer
import requests
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("hotchpotch/ruri-base-dummy-fast-tokenizer-for-tei", use_fast=False)

sentences = [
    "クエリ: 瑠璃色はどんな色？",
    "文章: 瑠璃色（るりいろ）は、紫みを帯びた濃い青。名は、半貴石の瑠璃（ラピスラズリ、英: lapis lazuli）による。JIS慣用色名では「こい紫みの青」（略号 dp-pB）と定義している[1][2]。",
    "クエリ: ワシやタカのように、鋭いくちばしと爪を持った大型の鳥類を総称して「何類」というでしょう?",
    "文章: ワシ、タカ、ハゲワシ、ハヤブサ、コンドル、フクロウが代表的である。これらの猛禽類はリンネ前後の時代(17~18世紀)には鷲類・鷹類・隼類及び梟類に分類された。ちなみにリンネは狩りをする鳥を単一の目(もく)にまとめ、vultur(コンドル、ハゲワシ)、falco(ワシ、タカ、ハヤブサなど)、strix(フクロウ)、lanius(モズ)の4属を含めている。",
]

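# Without return_tensors, input_ids is already a list of variable-length lists of ints,
# which avoids building a ragged NumPy array from sentences of different lengths.
token_ids = tokenizer(sentences, padding=False, truncation=False)["input_ids"]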

url = "http://127.0.0.1:8080/embed"
payload = {"inputs": token_ids, "normalize": False, "truncate": True}
headers = {"Content-Type": "application/json"}

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()  # fail loudly instead of parsing an error body as embeddings
embeddings_data = response.json()
embeddings = np.array(embeddings_data)
print(embeddings.shape)

# Compute pairwise cosine similarity
normalized_embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarities = np.dot(normalized_embeddings, normalized_embeddings.T)

print(similarities)

Result:

(4, 768)

array([[1.        , 0.94194159, 0.68661375, 0.71621216],
       [0.94194159, 1.        , 0.66622363, 0.68591373],
       [0.68661375, 0.66622363, 1.        , 0.87196226],
       [0.71621216, 0.68591373, 0.87196226, 1.        ]])

This successfully returns dense vectors, and the cosine similarities are almost identical to the values shown in the ruri-base model card. With this approach, TEI can serve models with Japanese tokenizers for APIs other than reranking. Be careful, though: if you send plain text instead of token_ids, the server tokenizes it with the dummy tokenizer.json, which does not match the model's vocabulary, and the results will be completely off.
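Because the server cannot tell that plain text was sent by mistake, a small client-side guard can help. The sketch below rejects anything that is not a list of integer token IDs and splits the inputs into batches, assuming the server rejects requests with more inputs than --max-client-batch-size (16 in the compose file above). The function name build_embed_payloads is hypothetical; the payload fields are the ones used in the script above.

```python
from typing import List, Sequence


def build_embed_payloads(
    token_ids: Sequence[Sequence[int]],
    max_client_batch_size: int = 16,
) -> List[dict]:
    """Validate pre-tokenized inputs and split them into /embed payloads."""
    if max_client_batch_size < 1:
        raise ValueError("max_client_batch_size must be at least 1")

    for i, ids in enumerate(token_ids):
        # A raw string would be tokenized server-side with the dummy tokenizer.json.
        if isinstance(ids, str) or len(ids) == 0:
            raise TypeError(f"input {i} is not a non-empty list of token IDs")
        # Only plain Python ints pass (bool is excluded); convert NumPy arrays with .tolist() first.
        if not all(isinstance(t, int) and not isinstance(t, bool) for t in ids):
            raise TypeError(f"input {i} contains values that are not integer token IDs")

    payloads = []
    for start in range(0, len(token_ids), max_client_batch_size):
        batch = [list(ids) for ids in token_ids[start:start + max_client_batch_size]]
        payloads.append({"inputs": batch, "normalize": False, "truncate": True})
    return payloads
```

Each returned payload can be posted to /embed with requests.post exactly as in the script above, and the per-batch embeddings concatenated in order.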

The proper fix would be to send pull requests so that TEI can start without tokenizer.json and so that the /rerank API also works. I have not done that, because implementing it in Rust and going through the PR process is more work than I want to take on right now. I would be grateful if someone did.