
Training a Q&A + RAG-focused LLM with SFT, making 4-bit quantized models, and exceeding GPT-3.5 with a 7B model

This article is the December 15 entry in the LLM Advent Calendar 2023.


Recently I wrote "Building Japanese Wikipedia embeddings and a FAISS index for RAG", where I used GPT-3.5 or GPT-4 to extract answers to questions. Since I already had the data, I wanted to avoid depending on a huge LLM such as OpenAI's models. Local LLMs are improving rapidly, so I fine-tuned one with supervised fine-tuning (SFT) to make an LLM specialized for Q&A + RAG tasks. As the base LLM, I used youri7b-instruction, published by rinna.

For example, if I give the trained model an input like this:

以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。

### 指示:
楽曲『約束はいらない』でデビューした、声優は誰?

### 入力:
Suara 大阪府立豊中高等学校を経て大阪外国語大学でインドネシア語を専攻中にバンド・ユニットを組んで音楽活動を始めた。普段はお笑い番組が大好きなこともあってよく喋るほうだが、東京の仕事で標準語の喋りをする時は、
早見沙織 声優デビュー時より、数多くの主題歌やキャラクターソングを担当し、バラードからポップス、ヒットソングのカバー曲や英語の楽曲など、様々な曲を歌いこなす。2009年には吉田仁美とのユニット「blue dro
約束はいらない 「約束はいらない」(やくそくはいらない)は、坂本真綾のデビューシングル。
約束はいらない 坂本真綾の歌手デビュー作品。当時坂本はまだ無名の声優であったが、同曲がテーマソングとなったアニメ『天空のエスカフローネ』とともに知名度を上げることとなる。後に「指輪」が同アニメの劇場版映画の主題歌とな
坂本真綾 本格的な歌手活動は、1996年にテレビアニメ『天空のエスカフローネ』へ出演したことがきっかけで始めており、同作のオープニングテーマソングである「約束はいらない」(岩里祐穂作詞、菅野よう子作曲)をシング

### 応答:

It outputs only the answer to the question:

坂本真綾
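
As a rough sketch, the prompt layout shown above could be assembled as follows. The template strings are copied from the example. The helper name `build_prompt` and the choice to join retrieved passages with newlines are assumptions made for illustration, not necessarily how the actual implementation does it.

```python
# Fixed preamble copied from the example prompt above.
PREAMBLE = "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。"


def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble an instruction / input / response prompt (hypothetical helper)."""
    context = "\n".join(passages)  # assumption: one retrieved passage per line
    return (
        f"{PREAMBLE}\n\n"
        f"### 指示:\n{question}\n\n"
        f"### 入力:\n{context}\n\n"
        f"### 応答:\n"
    )


prompt = build_prompt(
    "楽曲『約束はいらない』でデビューした、声優は誰?",
    ["約束はいらない 「約束はいらない」(やくそくはいらない)は、坂本真綾のデビューシングル。"],
)
print(prompt)
```

The prompt ends right after the response marker, so whatever the model generates next is taken as the answer.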

I also created a notebook that runs on a Google Colab T4 GPU, so please try it if you are interested.

Performance comparison with GPT-3.5

Let's compare the model I made and its quantized versions with GPT-3.5 and GPT-4. For the comparison dataset, I used 980 validation examples from hotchpotch/jaqket_v1_qa_wikija_context. For questions that have context containing the answer, I evaluated whether the answer could be extracted correctly using exact match and partial match accuracy.
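
The two metrics could look roughly like the sketch below. The exact definitions used in the evaluation are not spelled out in this article, so treat these as assumptions: exact match compares the stripped strings, and partial match accepts a prediction when either string contains the other. The function names and the three sample pairs are made up for illustration.

```python
def exact_match(prediction: str, answer: str) -> bool:
    # Assumed definition: identical after trimming surrounding whitespace.
    return prediction.strip() == answer.strip()


def partial_match(prediction: str, answer: str) -> bool:
    # Assumed definition: one string contains the other.
    p, a = prediction.strip(), answer.strip()
    if not p or not a:
        return False  # an empty string would trivially be "contained"
    return a in p or p in a


def accuracy(pairs, match_fn) -> float:
    """Fraction of (prediction, answer) pairs accepted by match_fn."""
    return sum(match_fn(p, a) for p, a in pairs) / len(pairs)


# Toy (prediction, answer) pairs, not taken from the real validation set.
pairs = [
    ("坂本真綾", "坂本真綾"),    # exact hit
    ("坂本真綾。", "坂本真綾"),  # near miss: counts only as a partial match
    ("早見沙織", "坂本真綾"),    # wrong
]
em = accuracy(pairs, exact_match)
pm = accuracy(pairs, partial_match)
print(em, pm)
```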

The results are below. After training, every model exceeded GPT-3.5 accuracy by a large margin. The quantized models were also faster than GPT-3.5, especially the AutoGPTQ model, which was about twice as fast. I discuss the numbers later in the article.

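| Model | Exact match | Partial match | Time | GPU memory (MB) |
|---|---|---|---|---|
| GPT3.5 | 0.5949 | 0.799 | 405 | - |
| GPT4.0 | 0.8786 | 0.9173 | 1152 | - |
| fp16 before training | 0.5908 | 0.7327 | 4218 | 11122 |
| fp16 after training | 0.7582 | 0.8939 | 4146 | 9964 |
| BnB 4bit | 0.7602 | 0.8867 | 397 | 3774 |
| AutoGPTQ | 0.7969 | 0.8867 | 211 | 4695 |
| AutoAWQ | 0.7316 | 0.8847 | 301 | 5933 |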

The evaluation code is in the eval_xxx files here:

Training with Supervised Fine-tuning Trainer

SFT is an easy way to do instruction tuning, that is, to train a model to produce output in a specific format in response to instructions. The way to train is simple. Prepare examples like this:

### 指示:
今日の天気は何ですか?

### 入力:
本日は大雨ですね。

### 応答:
大雨

In this example, the text after ### 応答: is what we want the model to learn to output. You pass the full example text and the ### 応答: marker to the trainer, and it takes care of the rest. During training, the model predicts the tokens that follow ### 応答: and is trained on the cross-entropy loss, that is, on the probability it assigns to the tokens of the desired answer. In other words, once you can create examples, this is an easy training method that handles everything else for you. I have heard that around 1000 examples can be enough to train reasonably well, though I do not have a source for that.
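
The idea of scoring only the answer tokens can be sketched in plain Python. This is a toy illustration under stated assumptions, not the trainer's actual code: the token ids and logits are made up, the helper names are hypothetical, and `logits[i]` is assumed to be already aligned with `labels[i]` (real causal LM training shifts the labels by one position). The value -100 is the label that PyTorch's cross-entropy loss ignores by default.

```python
import math

IGNORE_INDEX = -100  # label value skipped when computing the loss


def mask_prompt_labels(token_ids, response_start):
    """Keep labels only for tokens at or after response_start."""
    return [
        IGNORE_INDEX if i < response_start else tok
        for i, tok in enumerate(token_ids)
    ]


def cross_entropy(logits_per_position, labels):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for logits, label in zip(logits_per_position, labels):
        if label == IGNORE_INDEX:
            continue  # prompt tokens do not contribute to the loss
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[label]
        count += 1
    return total / count


# Toy sequence: 3 prompt tokens followed by 2 answer tokens, vocabulary of 4.
token_ids = [0, 1, 2, 3, 1]
labels = mask_prompt_labels(token_ids, response_start=3)
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in token_ids]
loss = cross_entropy(uniform_logits, labels)
print(labels, loss)
```

With uniform logits the loss equals the log of the vocabulary size, which is what an untrained model that guesses uniformly would score on the answer tokens.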

Training dataset

For training, I used the 2939 train examples from jaqket_v1_qa_wikija_context. This dataset takes the CC BY-SA 4.0 licensed portion of the AI Quiz King dataset and adds context that can be used for RAG.

Training

Training was done with this implementation. On an RTX 4090, one epoch, or 91 steps, took a little over two hours.

I omit the details here, but the training loads youri7b-instruction with BnB 4-bit quantization and FlashAttention 2, then trains with LoRA. I also used NEFTune to improve performance.
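
For readers unfamiliar with NEFTune, the core idea is to add small uniform noise to the token embeddings during training only, with magnitude alpha / sqrt(L * d) for sequence length L and embedding size d. Below is a minimal plain-Python sketch of that noise step. The value alpha=5.0, the tiny embedding shape, and the function name are assumptions for illustration; the alpha actually used in this training run is not stated here.

```python
import math
import random


def neftune_noise(embeddings, alpha, rng):
    """Return embeddings plus uniform noise in [-scale, scale] (NEFTune-style)."""
    seq_len = len(embeddings)
    hidden = len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * hidden)
    return [[x + rng.uniform(-scale, scale) for x in row] for row in embeddings]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
embeddings = [[0.0] * 8 for _ in range(4)]  # toy input: 4 tokens, 8 dimensions
noisy = neftune_noise(embeddings, alpha=5.0, rng=rng)
bound = 5.0 / math.sqrt(4 * 8)
print(bound, noisy[0][:3])
```

Because the scale shrinks with sequence length and hidden size, the perturbation stays small relative to the embeddings, and it is switched off at inference time.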

Looking at training results

The training process is recorded in this wandb run:

Train loss flattened fairly quickly, and eval loss stopped decreasing about 40% of the way through training. Forty percent of 2939 examples is roughly 1200, so the claim that around 1000 examples can train a model reasonably well seems fairly plausible.

[Figure: loss]

Let's also look at the predictions at the end of training that failed exact match. wandb is convenient because it can display dataframes as tables.

[Figure: wrong data]

Many of the wrong results were near misses, such as an extra character at the end or a difference between full-width and half-width characters such as =.
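
Width differences like these could be folded away before comparing strings. Here is a minimal sketch using NFKC normalization from the Python standard library. Whether the evaluation code normalizes answers this way is not stated in this article, and the helper name is hypothetical.

```python
import unicodedata


def normalize_answer(text: str) -> str:
    """Fold full-width ASCII variants to half-width and trim whitespace."""
    # NFKC maps compatibility characters such as "＝" or "Ａ" to "=" and "A".
    return unicodedata.normalize("NFKC", text).strip()


print(normalize_answer("Ｅ＝ｍｃ２"))  # full-width letters, sign, and digit
print(normalize_answer(" 坂本真綾 "))   # ordinary kanji are left unchanged
```

Applying the same normalization to both the prediction and the reference answer keeps the comparison fair.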

Model quantization

As of December 2023, according to "Quantize 🤗 Transformers models", Hugging Face Transformers lists three quantization methods that are easy to use from Python: bitsandbytes (BnB), GPTQ, and AWQ.

BnB is a relatively old quantization method, and I also used it during training. GPTQ appeared in 2022, and AWQ appeared in 2023. This time I quantized the model to 4 bits with each method and evaluated the results on the validation data of jaqket_v1_qa_wikija_context. For AWQ and GPTQ, I provided Wikipedia text and training data as calibration samples during quantization to improve the quality of the quantized model.
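
To make "4 bits" concrete, here is a deliberately naive round-to-nearest sketch that stores a group of weights as signed 4-bit integers plus one shared scale. This is not GPTQ, AWQ, or the BnB format. GPTQ and AWQ choose the quantized values more carefully with the help of calibration data, which is why sample text is passed to them. The weight values and function names below are made up for illustration.

```python
def quantize_4bit(weights):
    """Quantize one group of floats to integers in [-8, 7] with a shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # largest magnitude maps to 7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_4bit(codes, scale):
    """Recover approximate float weights from the 4-bit codes."""
    return [c * scale for c in codes]


group = [0.12, -0.5, 0.33, 0.07, -0.21, 0.49, -0.02, 0.28]  # toy weights
codes, scale = quantize_4bit(group)
restored = dequantize_4bit(codes, scale)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(codes, scale, max_error)
```

Each weight is off by at most half a quantization step. Smarter methods spend the same 4 bits but try to keep that error away from the weights that matter most for the model's outputs.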

The results are the same as in the table shown earlier. The measurement environment is a Ryzen 9 5950X CPU and an RTX 4090 GPU. Every trained model exceeded GPT-3.5 on both partial match and exact match, and all quantized models were faster. AutoGPTQ was about twice as fast as GPT-3.5. Comparing the trained models, it is understandable that non-quantized fp16 had the best partial match, but unexpectedly, AutoGPTQ had the best exact match and exceeded fp16. Because I passed training data as calibration samples during AutoGPTQ quantization, that bias may have made the result better than fp16. Against GPT-4.0, the models clearly lose on accuracy, which is to be expected.

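| Model | Exact match | Partial match | Time | GPU memory (MB) |
|---|---|---|---|---|
| GPT3.5 | 0.5949 | 0.799 | 405 | - |
| GPT4.0 | 0.8786 | 0.9173 | 1152 | - |
| fp16 before training | 0.5908 | 0.7327 | 4218 | 11122 |
| fp16 after training | 0.7582 | 0.8939 | 4146 | 9964 |
| BnB 4bit | 0.7602 | 0.8867 | 397 | 3774 |
| AutoGPTQ | 0.7969 | 0.8867 | 211 | 4695 |
| AutoAWQ | 0.7316 | 0.8847 | 301 | 5933 |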

I did not tune GPTQ, AWQ, or BnB in detail, so results may differ with better optimization. For example, AWQ inference can be made faster by choosing the algorithm version that suits the token length and batch size of the use case. Also, the GPU memory figure is measured at model load time, and inference probably uses more.

Closing

This time I used SFT, an easy training method, to fine-tune a 7B local LLM so that it answers appropriately for Q&A + RAG tasks. As a result, although the model lost general ability, the quantized models achieved better speed and accuracy than GPT-3.5 on a home machine. With around 1000 training examples, SFT seems able to make a model follow many kinds of output format, and at that size training may take less than one hour on an RTX 4090. That makes it feel easy to train LLMs specialized for specific uses.

Local LLM performance will continue improving, and smaller high-performance local LLMs such as TinyLlama-1.1B will likely continue to be developed. I look forward to local LLM progress next year.

Implementations, notebooks, and public models used for training and inference
