Embedding conversion performance on Apple Silicon GPU (MPS)

created 2023-11-10

When the M3 Max was announced with up to 128 GB of unified memory, there was excitement that even very large LLMs might run locally. My own use case is converting text into embeddings (sentence vectors) on a local Mac, so I measured how fast that conversion actually is.

Environment

I ran the benchmark on a local Mac, a Linux environment under WSL2, and Google Colab.

Mac: 2022 MacBook Air / M2, 8 CPU cores, 10 GPU cores, 24 GB memory
Linux (WSL2): Ryzen 9 5950X / NVIDIA RTX 4090
Google Colab: T4 instance

On the CPU and GPU of each machine, I measured the time to embed 1000 Japanese Wikipedia samples, each truncated to its first 512 tokens, with multilingual-e5-small. RPS in the results is the number of samples converted per second. The notebook used for measurement is here:

https://colab.research.google.com/drive/14_oeZrN5v7Potq5_a8UXvaOGCUJ4I1m8?usp=sharing
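The measurement itself comes down to timing a batch of conversions and dividing the sample count by the elapsed time. The following is a minimal sketch of such a harness, not the notebook's actual code: the names `benchmark` and `report`, the batch size of 32, and the stand-in encoder are all assumptions made for illustration.

```python
import time
from typing import Callable, Sequence


def benchmark(encode: Callable[[Sequence[str]], object],
              texts: Sequence[str],
              batch_size: int = 32) -> dict:
    """Time `encode` over `texts` in batches and report throughput.

    `encode` should only return once its results are on the CPU.
    GPU backends can queue work asynchronously, so an encoder that
    returns early would make the timing look better than it is.
    """
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        encode(texts[i:i + batch_size])
    # Guard against a zero interval for trivially fast encoders.
    total = max(time.perf_counter() - start, 1e-9)
    return {"count": len(texts), "total_sec": total, "rps": len(texts) / total}


def report(label: str, result: dict) -> str:
    """Format a result as a one-line summary."""
    return (f"[{label}] convert {result['count']} embs, "
            f"total time: {result['total_sec']:.2f} sec / rps: {result['rps']:.2f}")


if __name__ == "__main__":
    # Hypothetical stand-in encoder so the sketch runs without a model.
    def fake_encode(batch):
        return [[float(len(text))] for text in batch]

    samples = ["sample text"] * 1000
    print(report("fake", benchmark(fake_encode, samples)))
```

To benchmark a real model, pass a function that tokenizes a batch, runs the model on the chosen device, and moves the output back to the CPU.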

Results

Device	Method	Total Time (sec)	RPS
RTX 4090	CUDA (GPU)	2.58	388.07
Colab T4	CUDA (GPU)	19.92	50.21
MacBook Air M2	MPS (GPU)	33.16	30.15
Ryzen 5950X	CPU	73.18	13.66
MacBook Air M2	CPU	104.89	9.53
Colab	CPU	710.72	1.41

The RTX 4090 wins overwhelmingly, as expected. But the 10-core M2 GPU reaches about 60% of the T4's speed. The M3 Max is also available with a 40-core GPU. If speed scaled linearly with core count, a 40-core M3 Max GPU would reach around 120 RPS (4 × 30.15). That is a little under one third of an RTX 4090, which is quite fast for a laptop GPU, and more than twice as fast as a T4.
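The comparisons in this paragraph are simple ratios over the results table. Here is a minimal sketch of that arithmetic, assuming throughput is strictly proportional to GPU core count. That is an assumption, not a measured property, and it ignores any per-core difference between the M2 and M3 generations. The helper name `scale_linearly` is made up for this example.

```python
# Throughput figures (RPS) copied from the results table.
RPS = {"RTX 4090": 388.07, "Colab T4": 50.21, "M2 10-core GPU": 30.15}


def scale_linearly(rps: float, cores_from: int, cores_to: int) -> float:
    """Extrapolate throughput assuming it is proportional to GPU core count."""
    return rps * cores_to / cores_from


m2 = RPS["M2 10-core GPU"]
projected = scale_linearly(m2, cores_from=10, cores_to=40)

print(f"M2 vs T4:           {m2 / RPS['Colab T4']:.2f}x")
print(f"40-core projection: {projected:.1f} RPS")
print(f"projection vs 4090: {projected / RPS['RTX 4090']:.2f}x")
print(f"projection vs T4:   {projected / RPS['Colab T4']:.2f}x")
```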

Whether embedding conversion on the M2 is practical depends on the use case: it is not painfully slow, but it is not fast either. Still, the GPU is about three times faster than the M2 CPU, and with Hugging Face Transformers you can use it just by setting the device to "mps". On a Mac, there is little reason not to use the GPU. With an M3 Max, I expect many use cases would reach a reasonably practical speed.
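Selecting the "mps" device might look like the sketch below. It assumes PyTorch and Hugging Face Transformers are installed, that the model named in this article is the Hub checkpoint `intfloat/multilingual-e5-small`, and that embeddings are mean-pooled over non-padding tokens. The notebook may differ on any of these points, and the helper names (`pick_device`, `detect_device`, `load_model`, `embed`) are made up for this example.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    # Imported lazily so pick_device stays usable without PyTorch installed.
    import torch
    return pick_device(torch.cuda.is_available(),
                       torch.backends.mps.is_available())


def load_model(device: str):
    """Load tokenizer and model, then move the model to `device`."""
    from transformers import AutoModel, AutoTokenizer
    name = "intfloat/multilingual-e5-small"  # assumed Hub id
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name).to(device).eval()
    return tokenizer, model


def embed(texts, tokenizer, model, device: str):
    """Return mean-pooled embeddings on the CPU for a list of texts."""
    import torch
    batch = tokenizer(texts, max_length=512, truncation=True,
                      padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    # Average only over real tokens, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.cpu()
```

Usage would be along the lines of `device = detect_device()`, `tokenizer, model = load_model(device)`, then `embed(["passage: some text"], tokenizer, model, device)`. The "passage: " prefix follows the E5 model card's recommendation and is not something this article specifies.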

Even so, the 8-core M2 CPU is surprisingly fast. The Ryzen 9 5950X used all 16 of its cores in this run, yet on a per-core basis the M2 appears to be faster. Differences in library optimization may play a part, but taken at face value, the M2 is fast.
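The per-core comparison can be checked roughly by dividing each CPU's throughput by its core count. This is only a sketch under a crude assumption: it treats every core as equal, whereas the M2 mixes performance and efficiency cores, and dividing multi-core throughput by core count is not the same as a true single-core measurement.

```python
# (RPS, core count) for the two local CPUs, copied from this article.
CPU_RESULTS = {
    "Ryzen 9 5950X": (13.66, 16),
    "MacBook Air M2": (9.53, 8),
}

# Throughput divided evenly across cores.
per_core = {name: rps / cores for name, (rps, cores) in CPU_RESULTS.items()}

for name, value in per_core.items():
    print(f"{name}: {value:.2f} RPS per core")
```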

Update

I received a message from yuumi3 reporting the following speeds on a Mac mini with an M2 Pro (10 CPU cores, 16 GPU cores). Thank you. The GPU score rose with the move from the M2's 10 GPU cores to the M2 Pro's 16: at 73.60 RPS versus 30.15 RPS, the gain of about 2.4 times is in fact larger than the 1.6 times increase in core count.

[mps] convert 1000 embs, total time: 13.59 sec  / rps: 73.60
[cpu] convert 1000 embs, total time: 68.12 sec  / rps: 14.68
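To put these numbers next to the MacBook Air M2 results, the log lines can be parsed and divided by the corresponding figures from the results table. This is a small sketch only: the regular expression is written against the two lines above, and the names `LINE`, `parse`, and `M2_AIR_RPS` are made up for this example.

```python
import re

LOG = """\
[mps] convert 1000 embs, total time: 13.59 sec  / rps: 73.60
[cpu] convert 1000 embs, total time: 68.12 sec  / rps: 14.68
"""

# Matches one result line: device label, sample count, seconds, RPS.
LINE = re.compile(
    r"\[(\w+)\] convert (\d+) embs, total time: ([\d.]+) sec\s+/ rps: ([\d.]+)"
)


def parse(log: str) -> dict:
    """Return {device: {"count", "total_sec", "rps"}} for each matching line."""
    results = {}
    for match in LINE.finditer(log):
        device, count, total, rps = match.groups()
        results[device] = {"count": int(count),
                           "total_sec": float(total),
                           "rps": float(rps)}
    return results


# MacBook Air M2 throughput from the results table.
M2_AIR_RPS = {"mps": 30.15, "cpu": 9.53}

m2_pro = parse(LOG)
for device, row in m2_pro.items():
    ratio = row["rps"] / M2_AIR_RPS[device]
    print(f"{device}: {ratio:.2f}x the MacBook Air M2")
```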