RAFT: Combining RAG with fine-tuning

To improve a language model on specific topics, you can give it more domain-specific information after it has learned from a huge amount of pre-training data. There are two main methods: retrieval augmented generation (RAG) and fine-tuning. RAG adds extra knowledge from outside sources to the prompt at inference time, while fine-tuning continues training the model on additional data. Each method has pros and cons, and the choice between them often depends on the project's needs. People used to choose one or the other until retrieval augmented fine-tuning, or RAFT, came along.

RAFT combines RAG and fine-tuning and provides a training recipe that improves the model's ability to answer questions in an 'open-book' domain setting. In this post, we'll explore what RAFT is and how it helps take the training of language models to the next level.

Before RAFT: RAG vs. fine-tuning

Large language models (LLMs) learn from massive public data at the beginning of their lifecycle and do very well in general knowledge tasks. But now, they're increasingly used in specific areas like coding help for certain software or answering questions about particular types of documents, like legal or medical ones. In these cases, it's not just about general knowledge anymore. The aim is to answer as accurately as possible using a specific set of documents.

To tailor LLMs to these niche areas, we look at two main strategies: in-context learning with retrieval augmented generation (RAG) and supervised fine-tuning. For a while, the question was, “Which one should I choose: RAG or fine-tuning?” It was really a matter of choosing between these two approaches. Let’s review each method before moving on to their “better together” story.

RAG

RAG lets the model look up documents to answer questions, a bit like having the book open during a test. However, this method doesn't take full advantage of the chance to learn in advance from the specific types of questions it will see.

Fine-tuning

On the other hand, supervised fine-tuning is about learning the general themes in the documents, getting better at the tasks, and aligning with what users want. However, the usual ways of doing this either ignore the chance to use documents when answering questions at test time or don't account for retrieval mistakes during training.

Think of it like an open-book test. The current methods with RAG are like going into the test without studying first. The fine-tuning methods try to "study" by either cramming the information or practicing questions but not actually using the book they'll have during the test. While these ways do try to learn from the specific area, they don't get ready for the reality of having resources available during the actual test.

Combining strengths: RAFT

A recent research paper from UC Berkeley explores how to combine supervised fine-tuning (SFT) with retrieval augmented generation (RAG) through a new approach called retrieval augmented fine-tuning (RAFT).

RAFT is about teaching LLMs to get smarter about specific topics while improving in-domain RAG performance. RAFT not only ensures the models are well-trained on domain-specific knowledge through fine-tuning but also ensures they're robust against inaccurate retrievals. This is done by training models to understand how the question, the documents found, and the correct answer fit together. It's like studying for an open-book test by learning to spot what's important in your notes and what's not.

With retrieval augmented fine-tuning, we train the model to take a question and documents (even the distracting ones) and come up with an answer that follows a logical thought process. RAFT has proven to be better than just supervised fine-tuning, whether RAG is used or not, across different datasets like PubMed, HotpotQA, and the HuggingFace Hub, Torch Hub, and TensorFlow Hub datasets from Gorilla, showing a simple yet effective way to boost LLMs for specific topics.

LLM analogy with an open-book exam

To better understand the concept of retrieval augmented fine-tuning, let's dive deeper into our exam analogy, comparing how we prepare an LLM for real-world use to how a student prepares for an exam.

Closed-book exam: This is like when an LLM doesn't have any extra information or resources to help answer questions. An example is using the LLM as a chatbot, where it relies solely on what it has learned during its initial training and any fine-tuning to come up with answers.

Open-book exam: Here, the scenario changes. The LLM can look up information from outside sources, like a website or a book chapter. This usually involves a tool that finds and presents relevant documents or parts of documents to help the LLM learn new information on the spot. The success of the LLM in this setting largely depends on how well this tool can find the most relevant information.
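To make the lookup tool in the open-book setting concrete, here is a minimal sketch of a retriever that ranks passages by how many words they share with the question. The `tokenize` and `retrieve` helpers, the word-overlap score, and the sample passages are all assumptions made for illustration; production systems typically rely on stronger methods such as embedding or keyword-index search.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages that share the most words with the question."""
    q_words = tokenize(question)
    # sorted() is stable, so passages with equal scores keep their original order.
    ranked = sorted(passages, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:k]

# Hypothetical passages, used only to exercise the function.
passages = [
    "The capital of France is Paris.",
    "Bananas are rich in potassium.",
    "Paris hosts the Louvre museum.",
]
top = retrieve("What is the capital of France?", passages, k=1)
```

If the scoring function ranks an irrelevant passage first, the model receives misleading context, which is exactly the dependency on retrieval quality described above.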

Domain-specific open-book exam: This study focuses on a more specific version of the open-book exam. In these exams, we know beforehand the specific field or domain the LLM will deal with. It could be anything from company documents and recent news to code from a particular organization. The LLM, which has been fine-tuned on this particular domain, uses information from it to answer questions. The focus here is on adapting a pre-trained LLM to excel in these domain-specific tasks, especially on making it handle varying numbers of retrieved documents and distractors effectively.

Figure: How best to prepare for an exam.

How does RAFT work?

RAFT is a new way to prepare a model for the “domain-specific open-book exam”. Let’s start with supervised fine-tuning (SFT), then explore RAFT in more detail.

Supervised fine-tuning (SFT)

In SFT, we use a dataset of questions and answers. The idea is to train the model to get better at giving the right answers using what it already knows from earlier training or what it learns while being fine-tuned. Once trained, this model can also be used with additional documents to help it find answers (RAG). Here’s a simple way to see how it works:

Training: The model learns to go from a question to an answer (Q → A).
Testing without extra info (0-shot Inference): It uses what it learned to answer new questions (Q → A).
Testing with RAG (RAG Inference): It gets extra documents to help answer questions (Q+D → A).
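The three modes above can be sketched as plain prompt builders. This is a minimal illustration, not the paper's exact format: the function names and the `Question:` / `Answer:` / `Document N:` labels are assumptions.

```python
def sft_training_example(question: str, answer: str) -> dict[str, str]:
    """Training pair: the model sees only the question (Q -> A)."""
    return {"prompt": f"Question: {question}\nAnswer:", "completion": f" {answer}"}

def zero_shot_prompt(question: str) -> str:
    """0-shot inference: same shape as training, no documents (Q -> A)."""
    return f"Question: {question}\nAnswer:"

def rag_prompt(question: str, documents: list[str]) -> str:
    """RAG inference: retrieved documents are placed before the question (Q + D -> A)."""
    context = "\n".join(f"Document {i}: {doc}" for i, doc in enumerate(documents, start=1))
    return f"{context}\nQuestion: {question}\nAnswer:"
```

Note that the RAG prompt has a shape the model never saw during SFT training: documents only appear at test time. That mismatch is the gap RAFT is designed to close.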

RAFT

Retrieval augmented fine-tuning (RAFT) offers a new way to set up training data for models, especially for domain-specific open-book scenarios similar to in-domain RAG. In RAFT, we create training data that includes a question (Q), a set of k documents (Dk), and a corresponding chain-of-thought (CoT) answer (A*) that’s based on information from one of the documents (D*). We distinguish between two types of documents: the 'oracle' documents (D*) that have the information needed for the answer, and 'distractor' documents (Di) that don’t help with the answer. Some training examples include the oracle document along with distractors, while others include only distractor documents to encourage the model to rely on its memory rather than just the documents provided.

This process looks like this:

For some data, we have a question plus the right document and distractions leading to the answer (Q+D*+D2+...+Dk → A*).
For other data, we only have a question and distractions, which still lead to an answer (Q+D1+D2+...+Dk → A*).
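The two kinds of training examples above can be assembled along these lines. This is a minimal sketch under stated assumptions: the function and field names are hypothetical, `oracle_fraction=0.8` is an arbitrary placeholder for how often the oracle document is included, and shuffling the documents is our own choice so that the oracle's position carries no signal.

```python
import random

def build_raft_example(question, cot_answer, oracle, distractors, k, include_oracle, rng):
    """Build one training example whose context holds exactly k documents."""
    if include_oracle:
        # Q + D* + (k - 1) distractors -> A*
        docs = [oracle] + rng.sample(distractors, k - 1)
    else:
        # Q + k distractors -> A* (the answer must come from memory)
        docs = rng.sample(distractors, k)
    rng.shuffle(docs)  # assumption: randomize the oracle's position
    return {"question": question, "documents": docs, "answer": cot_answer}

def build_raft_dataset(records, distractor_pool, k=4, oracle_fraction=0.8, seed=0):
    """Mix oracle and distractor-only examples; oracle_fraction is a placeholder value."""
    rng = random.Random(seed)
    dataset = []
    for rec in records:
        include_oracle = rng.random() < oracle_fraction
        # Never sample a record's own oracle document as a distractor.
        pool = [d for d in distractor_pool if d != rec["oracle"]]
        dataset.append(build_raft_example(
            rec["question"], rec["answer"], rec["oracle"], pool, k, include_oracle, rng))
    return dataset
```

The distractor pool must contain at least k documents besides the oracle, otherwise `rng.sample` raises a `ValueError`.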

When testing, the model gets a question and the top documents found by the RAG setup, but RAFT works no matter which tool is used to find these documents.
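Because the model only ever sees the question and the retrieved documents, the retriever can be treated as a pluggable component. The sketch below illustrates that idea with a plain callable; the `Retriever` alias, both toy retrievers, the corpus, and the prompt labels are assumptions for illustration only.

```python
from typing import Callable

# Any function that maps (question, k) to a list of documents qualifies as a retriever.
Retriever = Callable[[str, int], list[str]]

# Hypothetical corpus, used only to exercise the retrievers.
CORPUS = [
    "raft trains with distractor passages",
    "bananas are rich in potassium",
    "raft is tested with retrieved passages",
]

def first_k_retriever(question: str, k: int) -> list[str]:
    """Naive baseline: ignore the question and return the first k documents."""
    return CORPUS[:k]

def keyword_retriever(question: str, k: int) -> list[str]:
    """Return up to k documents that share at least one word with the question."""
    words = set(question.lower().split())
    hits = [doc for doc in CORPUS if words & set(doc.lower().split())]
    return hits[:k]

def build_inference_prompt(question: str, retriever: Retriever, k: int = 3) -> str:
    """Assemble the test-time input from whatever the retriever returns."""
    documents = retriever(question, k)
    context = "\n".join(f"Passage {i}: {doc}" for i, doc in enumerate(documents, start=1))
    return f"{context}\nQuestion: {question}\nAnswer:"

prompt_a = build_inference_prompt("how is raft tested", first_k_retriever, k=2)
prompt_b = build_inference_prompt("how is raft tested", keyword_retriever, k=2)
```

Both prompts have the same structure; only the passages differ. The naive retriever lets an irrelevant passage through, which is the kind of distractor a RAFT-trained model is meant to ignore.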

A crucial part of the training involves teaching the model to build a chain of reasoning to explain its answers, much like writing out your steps in math homework. This approach involves giving the model all the context and information it needs, then asking it to explain its answer step by step, linking back to the original documents. For all the datasets mentioned in the paper, the researchers used this method to create answers that include reasoning. Some datasets, like Gorilla APIBench, already include reasoned answers. The paper's experiments show that adding these detailed explanations helps the model perform better.
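As a rough sketch of what a reasoned training answer that links back to the documents could look like, the helper below assembles reasoning steps and checks that every quoted span really appears in the provided context. The `format_cot_answer` name, the `Reasoning:` / `Answer:` labels, and the quote notation are assumptions for illustration, not the format used in the paper.

```python
from typing import Optional

def format_cot_answer(
    steps: list[tuple[str, Optional[str]]],
    documents: list[str],
    final_answer: str,
) -> str:
    """Join (explanation, quote) steps into one reasoned answer string.

    Each quote must appear verbatim in at least one document, so the
    reasoning can always be traced back to the provided context.
    """
    parts = []
    for explanation, quote in steps:
        if quote is None:
            parts.append(explanation)
            continue
        if not any(quote in doc for doc in documents):
            raise ValueError(f"Quote not found in any document: {quote!r}")
        parts.append(f'{explanation} [quote: "{quote}"]')
    reasoning = " ".join(parts)
    return f"Reasoning: {reasoning}\nAnswer: {final_answer}"

# Hypothetical documents and steps, used only to exercise the function.
documents = ["The capital of France is Paris.", "Bananas are rich in potassium."]
answer = format_cot_answer(
    [("The first document states the capital directly.", "The capital of France is Paris"),
     ("The second document is unrelated to the question.", None)],
    documents,
    "Paris",
)
```

Rejecting quotes that are missing from the context is a design choice in this sketch: it keeps fabricated citations out of the training targets.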

Figure: RAFT prompt to help the LLM evaluate its own generated reasoning and answers, contrasting them with the correct reasoning and answers.

Results

On the selected datasets and against the comparison models, RAFT consistently performs better. For example, when compared to the instruction-tuned Llama-2 model, RAFT, especially when combined with RAG, is much more effective at pulling information from documents and ignoring irrelevant ones. The improvement can be as high as 35.25% on HotpotQA and 76.35% on Torch Hub.

When comparing retrieval augmented fine-tuning to domain-specific fine-tuning (DSF) on certain datasets, RAFT is better at using the given context to answer questions. It shows significant improvements on the HotpotQA and HuggingFace datasets, with increases of 30.87% and 31.41% respectively. However, for PubMed QA, which is based on yes/no questions, RAFT doesn't show as large an improvement over DSF + RAG.

Even when put against a much larger model like GPT-3.5, RAFT shows clear advantages. Generally, the LLaMA-7B model, with or without RAG, didn't do as well because its way of answering questions didn't match what was needed. Domain-specific tuning improves on this because it helps the model learn the right way to answer questions. However, just adding RAG to a model that's been fine-tuned for a specific domain doesn't always lead to better results, suggesting that the model may need more training on how to use context effectively and extract useful information from it. By using RAFT, we train the model not only to align its answers better but also to enhance its ability to process documents. As a result, this approach outperforms the others.

Wrapping up

Retrieval augmented fine-tuning (RAFT) combines two techniques, retrieval augmented generation (RAG) and fine-tuning, to significantly improve how language models handle specific topics. Previously, models either used external information (RAG) or learned from extra training data (fine-tuning). RAFT brings these methods together, enhancing the model's ability to sift through and use relevant documents accurately. This blend not only boosts the model's performance in niche domains but also makes it better at ignoring misleading information. With RAFT, language models become more adept and precise on specialized tasks.
