Another major benefit of LoRA: switching task models instantly while sharing GPU memory

created 2023-05-31

LoRA, or Low-Rank Adaptation of Large Language Models, adds small trainable low-rank matrices alongside the frozen weights of the original model, so you can fine-tune at low cost while keeping the original model intact. Recently, large Japanese models such as cyberagent/open-calm-7b and rinna/japanese-gpt-neox-3.6b were released, and Hugging Face released peft, a library that makes LoRA easy to use with Transformers. Many people have probably tried it.

Most explanations of LoRA's benefits focus on training. I had not seen much discussion of another major benefit: handling multiple tasks while sharing memory for the base LLM. This article explains how to do that with peft.

For an explanation of what LoRA itself is, the study-group slide deck on "LoRA: Low-Rank Adaptation of Large Language Models" is very clear.

What problem does this solve?

As the name says, LLMs are large language models. For example, if you load the open-calm-7b model onto a GPU in fp16, it alone uses about 13 GB of memory. If you fully fine-tune it for a task, running that task needs 13 GB. If you then load another fully fine-tuned model for another task, it needs another 13 GB. A total of 26 GB is a lot of memory, especially for a home GPU.

However, if you train open-calm-7b with LoRA by adding low-rank matrices of rank r=8, the additional memory needed is only 17 MB. Not 17 GB, but 17 MB. With only that extra size, you get a neural network that has learned task-specific characteristics and can solve a given task.

That means you can handle another task with the 13 GB base LLM plus 17 MB. And not just one task. If you have LoRA adapters trained on ten different tasks or datasets, you can handle all of those tasks with 13 GB + 170 MB of memory. That is extremely powerful.
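
As a rough sanity check on the 17 MB figure, here is a back-of-the-envelope sketch. It relies on values this article does not state, so treat them as assumptions: a hidden size of 4096, 32 transformer layers, LoRA applied only to a fused query/key/value projection (4096 inputs, 3 x 4096 outputs), and adapter weights stored in fp32. With a different configuration the numbers will differ.

```python
# Back-of-the-envelope LoRA size estimate.
# All model dimensions below are assumptions, not values from the article.

def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)

HIDDEN = 4096          # assumed hidden size
LAYERS = 32            # assumed number of transformer layers
RANK = 8               # r=8, as in the article
BYTES_PER_PARAM = 4    # assumed fp32 adapter weights

per_layer = lora_param_count(HIDDEN, 3 * HIDDEN, RANK)
adapter_params = per_layer * LAYERS
adapter_mb = adapter_params * BYTES_PER_PARAM / 1e6

BASE_GB = 13           # fp16 base model size quoted in the article
for n_adapters in (1, 10):
    extra_mb = n_adapters * adapter_mb
    print(f"{n_adapters:2d} adapter(s): {BASE_GB} GB + {extra_mb:.0f} MB")
```

Under these assumptions a single adapter comes to roughly 17 MB, which is in line with the figure above. The extra cost grows linearly with the number of adapters, while the base model cost is paid once.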

To be honest, for batch processing where you run the same process over lots of data, repeatedly loading and unloading GPU memory is often acceptable if you can wait. But for realtime sequential processing, such as responding to user input, being able to share memory on one GPU while handling multiple tasks is much better for performance.

For example, this seems useful for cases like:

- Changing the expression style of chatbot responses
- Running an article hosting service and switching models per user after learning each user's writing characteristics
- Switching models to evaluate which training worked better in an A/B test
- Switching LangChain Agents quickly when running them locally
  - Each Agent may have different capabilities, and you may want to switch Agents depending on the content. If each Agent is a huge model, frequent loading and unloading from memory becomes very slow.

One caveat is that the base LLM must be the same.
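
One way to catch a mismatch early is to compare the base model each adapter was trained on before loading anything. This is a minimal sketch, assuming you have already read base_model_name_or_path from each adapter's PeftConfig. The helper name check_same_base and the input dictionary are hypothetical and used only for illustration.

```python
def check_same_base(base_names: dict[str, str]) -> str:
    """Return the shared base model name, or raise if adapters disagree.

    base_names maps an adapter name to the base model it was trained on,
    e.g. the base_model_name_or_path value read from its PeftConfig.
    """
    if not base_names:
        raise ValueError("no adapters given")
    distinct = set(base_names.values())
    if len(distinct) != 1:
        raise ValueError(f"adapters were trained on different base models: {base_names}")
    return distinct.pop()

# Hypothetical input: both adapters report the same base model.
base = check_same_base({
    "default": "cyberagent/open-calm-7b",
    "instruct": "cyberagent/open-calm-7b",
})
print(base)
```

Comparing names only checks that the adapters claim the same base model. It does not verify that the weights themselves are identical.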

How to switch in practice

When using LoRA models trained with peft, switching is very easy. I prepared the following notebook as an example:

https://gist.github.com/hotchpotch/e99a70a6864c76f5638010537d535a33

PeftModel can hold several sets of LoRA weights, called adapters, and switch which one is active. The adapter loaded first is named default, and you can load another one under a name of your choice with load_adapter(model_name, adapter_name).

For example, load peft_model like this:

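import torch
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM

peft_model_open2ch = "hotchpotch/open-calm-7b_lora_open2ch"
peft_config_open2ch = PeftConfig.from_pretrained(peft_model_open2ch)

# Load the base LLM recorded in the adapter's config, in fp16
model = AutoModelForCausalLM.from_pretrained(peft_config_open2ch.base_model_name_or_path, device_map="auto", torch_dtype=torch.float16)

# Attach the LoRA weights; this first adapter is named "default"
peft_model = PeftModel.from_pretrained(model, peft_model_open2ch)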

Then add a model with a different capability:

# https://note.com/masuidrive/n/n0e2a11fc5bfa
peft_model_instruct = "masuidrive/open-calm-instruct-lora-20230525-r4-alpha16-batch32-epoch1"

# Load it into peft_model with the adapter name "instruct"
peft_model.load_adapter(peft_model_instruct, "instruct")

After that, just switch adapters according to the task:

# The trained model hotchpotch/open-calm-7b_lora_open2ch
peft_model.set_adapter("default")
# The trained model masuidrive/open-calm-instruct-lora-20230525-r4-alpha16-batch32-epoch1
peft_model.set_adapter("instruct")

With this, the base LLM cyberagent/open-calm-7b should be loaded into about 13 GB of memory, while the 2ch-style text generation model is loaded as the default adapter and the QA answering model is loaded as the instruct adapter. Together, those adapters add only about 34 MB of memory.

So by calling set_adapter for the task you want to run, you can use each model without loading and freeing the huge LLM again. In the notebook example, I switch between two capabilities: generating 2ch-style text and answering questions.
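
In a service that handles mixed requests, the switching can be wrapped in a small dispatcher. The following is only a sketch: AdapterRouter, FakeModel, and the task names are hypothetical, and the only thing it assumes about the model is a set_adapter(name) method like the one on peft_model above. A stand-in object is used so the snippet runs without a GPU.

```python
class AdapterRouter:
    """Maps task names to adapter names and activates the right adapter."""

    def __init__(self, model, task_to_adapter):
        self.model = model
        self.task_to_adapter = dict(task_to_adapter)
        self.active = None  # unknown until the first switch

    def use(self, task):
        adapter = self.task_to_adapter[task]  # KeyError for unknown tasks
        if adapter != self.active:            # skip redundant switches
            self.model.set_adapter(adapter)
            self.active = adapter
        return self.model


class FakeModel:
    """Stand-in for a PeftModel; records set_adapter calls."""

    def __init__(self):
        self.calls = []

    def set_adapter(self, name):
        self.calls.append(name)


fake = FakeModel()
router = AdapterRouter(fake, {"2ch_style": "default", "qa": "instruct"})
router.use("qa")
router.use("qa")         # already active, no second switch
router.use("2ch_style")
print(fake.calls)
```

With a real model you would pass peft_model instead of FakeModel and call generate on the object that use returns. The active adapter is shared state on the model, so a server handling concurrent requests would probably need a lock or a queue around the switch and the generation step.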

LLMs plus many adapters open up possibilities

Huge models that previously required full fine-tuning can now be trained efficiently with LoRA, and the saved weights are small. At inference time, multiple tasks can also share the base model's memory and run with much lower total memory usage.

This area is evolving quickly day by day. It is very interesting, the things we can do are expanding, and I am looking forward to the future.

The title says "GPU memory", but this memory sharing should not be limited to GPUs.