
Making Transformers inference 1.6 to almost 2 times faster with CTranslate2

created 2023-11-23

CTranslate2 is a fast inference library for C++ and Python. I had wanted to try it for a while, but kept putting it off because it requires converting models first. Then I learned about hf_hub_ctranslate2, a library that transparently converts Hugging Face models into the CTranslate2 format and runs inference with them. I tried it and, with very little effort, got 1.6x faster inference on GPU and 1.9x faster inference on CPU, with almost no change in accuracy. I should have started using it sooner, so here are my notes.

What is CTranslate2?

CTranslate2 (CT2 below) is, in the words of its GitHub project overview, "a C++ and Python library for efficient inference with Transformer models." It makes Transformer inference more efficient through a range of optimizations. Efficient-inference libraries such as llama.cpp basically support only decoder models, whereas CT2 supports decoder models, encoder-decoder models, and some encoder models. BERT is among the supported encoder models, so BERT-family models can also run inference efficiently.

You may think, "BERT? Do we still use such an old architecture?" But multilingual-e5-small, for example, the model I use daily to generate embeddings, is a BERT-family model. There are still plenty of occasions to use one.

Embedding inference with CTranslate2 and SentenceTransformer

Using CTranslate2 as a SentenceTransformer-compatible model is very easy. For example, change this SentenceTransformer code:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer(model_name, device=device)
embs = model.encode(texts)

to this:

from hf_hub_ctranslate2 import CT2SentenceTransformer
model = CT2SentenceTransformer(
    model_name, device=device, compute_type=compute_type
)
embs = model.encode(texts)

In my measurements, that change alone made inference roughly 1.6 to 1.9 times faster. It should also use less memory, although I did not measure that.

CT2SentenceTransformer is implemented as a subclass of SentenceTransformer, so it can be used in almost the same way. I describe compute_type later.
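As a rough guide to choosing compute_type, the sketch below simply encodes the settings that came out fastest in the measurements in this article: float16 on GPU and auto on CPU. The helper name pick_compute_type is hypothetical and is not part of hf_hub_ctranslate2, and the best setting may well differ on other hardware or models.

```python
def pick_compute_type(device: str) -> str:
    """Return the compute_type that was fastest in this article's benchmark.

    Hypothetical helper: the mapping reflects a single benchmark
    (RTX 4090, Ryzen 9 5950X, multilingual-e5-small), not a general rule.
    """
    if device.startswith("cuda"):
        # float16 reached 1.66x on GPU, ahead of every int8 variant.
        return "float16"
    # "auto" was the only CT2 setting measured on CPU (1.89x).
    return "auto"


if __name__ == "__main__":
    for device in ("cuda", "cpu"):
        print(device, "->", pick_compute_type(device))
    # Usage with the class shown above (requires hf_hub_ctranslate2):
    # model = CT2SentenceTransformer(
    #     model_name, device=device, compute_type=pick_compute_type(device)
    # )
```

Measuring a few compute_type values on your own hardware is still the safer way to decide.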

Actual inference speed and output differences

Let's look at the measured speed difference. I took 20,000 samples from Japanese Wikipedia, kept the first 512 tokens of each, added the "query: " prefix used for similarity search tasks, and converted them into embeddings with multilingual-e5-small. I compared inference with the original SentenceTransformer and with CT2 under several compute_type settings. The notebook is here. The GPU is an RTX 4090 and the CPU is a Ryzen 9 5950X. The speed column is relative to SentenceTransformer on the same device, which is 1.0.
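A measurement like this can be sketched with a small timing harness. This is not the notebook's code: the function names add_query_prefix and time_encode are hypothetical, and the stand-in encoder in the demo only exists so the sketch runs without a model. In practice you would pass model.encode instead.

```python
import time
from typing import Callable, List, Sequence, Tuple


def add_query_prefix(texts: Sequence[str]) -> List[str]:
    """Prepend the "query: " prefix used in this article's setup."""
    return ["query: " + t for t in texts]


def time_encode(
    encode: Callable[[List[str]], object], texts: Sequence[str]
) -> Tuple[float, float]:
    """Run encode once and return (elapsed seconds, texts per second)."""
    inputs = add_query_prefix(texts)
    start = time.perf_counter()
    encode(inputs)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from a trivially fast encoder.
    per_second = len(inputs) / elapsed if elapsed > 0 else float("inf")
    return elapsed, per_second


if __name__ == "__main__":
    def fake_encode(batch: List[str]) -> List[List[float]]:
        # Stand-in for model.encode so the sketch runs without a model.
        time.sleep(0.01)
        return [[0.0] for _ in batch]

    elapsed, per_second = time_encode(fake_encode, ["first text", "second text"])
    print(f"{elapsed:.3f} s, {per_second:.1f} texts/s")
```

A single run is noisy, so for real comparisons it is worth warming up the model first and repeating the measurement.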

device	type	speed	time (s)	rps	mAP@100	MSE
cuda	sentence_transformer	1.00	38.99	512.94	-	-
cuda	CT2 + int8	0.94	41.44	482.63	1.0	0.000004
cuda	CT2 + int8_float32	0.93	41.89	477.43	1.0	0.000004
cuda	CT2 + int8_float16	1.45	26.98	741.30	1.0	0.000004
cuda	CT2 + float16	1.66	23.54	849.53	1.0	0.0
cuda	CT2 + auto	1.48	26.36	758.69	1.0	0.000004
cpu	sentence_transformer	1.00	1389.80	14.39	-	-
cpu	CT2 + auto	1.89	737.07	27.13	1.0	0.000004
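The speed and rps columns follow from the time column by simple arithmetic: speed is the SentenceTransformer time divided by the run's time on the same device, and rps matches 20,000 samples divided by the elapsed seconds. The sketch below recomputes a few rows from the table as a sanity check. The names are my own, and small differences in the last digit are expected because the times in the table are already rounded.

```python
N_SAMPLES = 20_000  # samples encoded in each run, per the article
BASELINE = "sentence_transformer"

# Elapsed seconds copied from the table: (device, type) -> time
TIMES = {
    ("cuda", "sentence_transformer"): 38.99,
    ("cuda", "CT2 + int8"): 41.44,
    ("cuda", "CT2 + float16"): 23.54,
    ("cpu", "sentence_transformer"): 1389.80,
    ("cpu", "CT2 + auto"): 737.07,
}


def relative_speed(device: str, kind: str) -> float:
    """Baseline time divided by this run's time on the same device."""
    return TIMES[(device, BASELINE)] / TIMES[(device, kind)]


def samples_per_second(device: str, kind: str) -> float:
    """Throughput: number of samples divided by elapsed seconds."""
    return N_SAMPLES / TIMES[(device, kind)]


if __name__ == "__main__":
    for device, kind in TIMES:
        print(
            f"{device:5s} {kind:22s} "
            f"speed={relative_speed(device, kind):.2f} "
            f"rps={samples_per_second(device, kind):.2f}"
        )
```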

In these results, the int8 and int8_float32 settings were actually slower than SentenceTransformer on GPU, and CT2 + float16 was the fastest: compute_type="float16" gave 1.66x speed. As an evaluation metric for the inference results, mAP@100 was 1.0, meaning the ranking did not change. To look at finer accuracy differences, I also measured MSE, and it was almost unchanged too. For float16 it is displayed as 0.0, but the actual value was around 3e-09. On GPU, compute_type="auto" seems to resolve to int8_float16.
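To make the two metrics concrete, here is one way they could be computed. This is a sketch under assumptions, not the notebook's code: I assume MSE is taken element-wise between the two sets of embeddings, and that mAP@k treats the baseline model's top-k neighbours (by cosine similarity) as the relevant set for each query. The article does not spell out either definition, and all function names are hypothetical. Plain Python lists are used to keep it dependency-free; with real embeddings you would use NumPy.

```python
import math
from typing import List, Sequence, Set

Vector = Sequence[float]


def mse(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """Mean squared error over all elements of two equally shaped embedding sets."""
    total, count = 0.0, 0
    for va, vb in zip(a, b):
        for x, y in zip(va, vb):
            total += (x - y) ** 2
            count += 1
    return total / count


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def top_k(embs: Sequence[Vector], query: int, k: int) -> List[int]:
    """Indices of the k items most similar to embs[query], excluding itself."""
    others = [i for i in range(len(embs)) if i != query]
    others.sort(key=lambda i: cosine(embs[query], embs[i]), reverse=True)
    return others[:k]


def average_precision_at_k(relevant: Set[int], ranked: List[int], k: int) -> float:
    """Average precision of a ranked list against a set of relevant indices."""
    hits, precision_sum = 0, 0.0
    for rank, idx in enumerate(ranked[:k], start=1):
        if idx in relevant:
            hits += 1
            precision_sum += hits / rank
    denom = min(len(relevant), k)
    return precision_sum / denom if denom else 0.0


def map_at_k(baseline: Sequence[Vector], candidate: Sequence[Vector], k: int) -> float:
    """Mean AP@k, with the baseline's top-k neighbours as the relevant set (assumption)."""
    aps = []
    for q in range(len(baseline)):
        relevant = set(top_k(baseline, q, k))
        aps.append(average_precision_at_k(relevant, top_k(candidate, q, k), k))
    return sum(aps) / len(aps)


if __name__ == "__main__":
    base = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    cand = [[x + 1e-4 for x in v] for v in base]  # tiny perturbation
    print("MSE:", mse(base, cand))
    print("mAP@2:", map_at_k(base, cand, 2))
```

Under this definition, a score of 1.0 means each query's top-k set is identical between the two models, while MSE still picks up the small numeric drift that the ranking metric hides.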

On CPU, compute_type="auto" was about 1.9x faster. mAP@100 remained 1.0, and MSE was only 0.000004, a difference small enough that it should rarely matter in practice. Inference is often run on CPU, so a 1.9x speedup there is quite valuable. I did not measure memory this time, but CT2 also advertises lower memory use and it did seem memory-efficient, so it should be even more useful in environments with tight compute resources.

CTranslate2 deserves more attention

CTranslate2 can be used for encoder models, and with hf_hub_ctranslate2, Hugging Face models can be used easily. This time I used it as a replacement for SentenceTransformer, but BERT-family models are still used for many tasks, so I feel its range of use is broad.

However, CTranslate2 currently has 2.2k GitHub stars, while llama.cpp has 44.6k. In this LLM boom, when many projects collect a lot of stars, its popularity feels modest. The name CTranslate2, perhaps a holdover from its origins in speeding up machine translation models, also makes it hard to guess what the library can do. That seems like a missed opportunity. I hope more people try it.