
Fine-tuning is a key part of improving large language models (LLMs). Recently, a new approach called reinforced fine-tuning (ReFT) has emerged, changing how we train these models. ReFT combines reinforcement learning, where a model learns to make decisions, with the standard fine-tuning process.

Current approaches to math problem solving rely on supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations. The model is trained to reproduce one annotated reasoning path per question, which can limit how well it generalizes. This happens because the training data usually contains a single CoT annotation for each question, even though the same question can often be solved through several valid reasoning paths.

Reinforced fine-tuning (ReFT)

To address this issue and offer a more versatile way to solve advanced reasoning problems, reinforced fine-tuning (ReFT) has come into play.

ReFT starts with supervised fine-tuning (SFT), typically lasting one or two epochs. During this phase, the model acquires a basic ability to solve math problems correctly. ReFT then continues training with a reinforcement learning (RL) algorithm, proximal policy optimization (PPO). In this stage, the model samples and learns from a variety of correct solutions and reasoning paths instead of a single annotated one.

An example of question (x), CoT (e), and answer (y) in GSM8K: Source

What makes ReFT efficient in this context is its use of the existing training data, which already includes the correct answers. These answers form the basis for the rewards in the PPO training process, eliminating the need for an additional, separately trained reward system. This is a vital difference from other methods like RLHF, which rely on rewards determined from human-annotated data.
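As a rough illustration of how a ground-truth answer can stand in for a learned reward model, the sketch below scores a generated CoT by comparing its final number with the gold answer. The function names, the regex-based answer extraction, and the binary 1/0 reward are assumptions made for this example, not the paper's exact implementation.

```python
import re
from typing import Optional


def extract_final_answer(cot: str) -> Optional[float]:
    """Return the last number that appears in a generated CoT, or None."""
    # Drop thousands separators so "1,000" parses as 1000 (an assumption).
    numbers = re.findall(r"-?\d+(?:\.\d+)?", cot.replace(",", ""))
    return float(numbers[-1]) if numbers else None


def answer_reward(cot: str, gold: float, tol: float = 1e-6) -> float:
    """Hypothetical reward: 1.0 if the final answer matches gold, else 0.0."""
    predicted = extract_final_answer(cot)
    if predicted is None:
        return 0.0
    return 1.0 if abs(predicted - gold) <= tol else 0.0
```

No human preference labels are involved: the only supervision signal is the answer already present in the training set.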

ReFT stages

The reinforced fine-tuning (ReFT) process is divided into two main stages: warm-up and reinforcement learning. Let's look at each in more detail.

Warm-up stage

In this initial stage, the model is fine-tuned for a few epochs on a dataset of question and chain-of-thought (CoT) pairs. This stage gives the model basic problem-solving skills so that it can generate appropriate responses to questions. Generating a CoT is treated as predicting a sequence of tokens that ends with a special end-of-sequence token. Each generated token is an action sampled from the policy (the model), and the state is updated by appending that token to what has been generated so far. By the end of the warm-up, the model reaches a foundational level of accuracy.
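The warm-up objective can be sketched as a standard next-token loss over the CoT plus the end-of-sequence token. The snippet below is a minimal illustration in plain Python: the `<eos>` string, the function names, and the hand-supplied token probabilities are all assumptions, and a real setup would obtain the probabilities from the language model.

```python
import math
from typing import List

EOS = "<eos>"  # hypothetical end-of-sequence marker


def build_targets(cot_tokens: List[str]) -> List[str]:
    """Targets for the warm-up stage: the CoT tokens followed by EOS."""
    return cot_tokens + [EOS]


def sft_loss(target_token_probs: List[float]) -> float:
    """Mean negative log-likelihood the model assigns to the target tokens."""
    total = sum(math.log(p) for p in target_token_probs)
    return -total / len(target_token_probs)
```

Lower loss means the model assigns higher probability to the annotated CoT, which is all the warm-up stage optimizes for.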

Reinforcement learning stage

After the warm-up, the model enters the reinforcement learning stage, where it improves its performance through online self-learning. This stage uses a dataset of question-and-answer pairs. The model learns by repeatedly sampling responses, assessing their correctness, and updating its parameters accordingly. Training uses proximal policy optimization (PPO). The reward is computed by comparing the final answer extracted from the model's CoT with the ground-truth answer.
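To make the PPO update more concrete, here is a minimal sketch of the clipped surrogate objective that PPO is generally built around, written for a single sampled token. The function name, the 0.2 clip range, and the scalar inputs are illustrative assumptions; the paper's actual hyperparameters and implementation details are not reproduced here.

```python
import math


def ppo_clipped_objective(
    logp_new: float,      # log-prob of the token under the current policy
    logp_old: float,      # log-prob under the policy that sampled it
    advantage: float,     # estimated advantage for this token
    clip_eps: float = 0.2,  # assumed clip range
) -> float:
    """Clipped surrogate objective for one token (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum keeps the update pessimistic when the ratio drifts.
    return min(ratio * advantage, clipped_ratio * advantage)
```

The clipping limits how far a single batch of sampled CoTs can move the policy away from the one that generated them.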

Comparison between SFT and ReFT with CoT alternatives: Source

Because ReFT learns from diverse CoT reasoning strategies, its learning experience is more comprehensive than that of SFT alone. Consequently, ReFT generalizes math problem-solving skills better while using the same training questions as SFT, without additional or modified training material. Moreover, ReFT is compatible with data engineering techniques, so it can be integrated smoothly into existing training frameworks.


In their study, the authors compare reinforced fine-tuning (ReFT) with supervised fine-tuning (SFT) and two self-training methods: offline self-training (OfflineST) and online self-training (OnlineST). SFT is the basic method in which the language model is fine-tuned on the training data, while the self-training approaches use model-generated samples for further training.

OfflineST: OfflineST uses an early SFT checkpoint to generate chains of thought (CoTs) and keeps only those whose answers match the ground truth. These are then combined with the original training data for further fine-tuning.
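The filtering step behind OfflineST can be sketched as follows. This is a simplified illustration: `sample_cots` and `extract_answer` are hypothetical callables standing in for the SFT model's sampler and an answer parser, and the dictionary layout of each example is an assumption.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]  # assumed keys: "question", "cot", "answer"


def build_offline_st_data(
    train_set: List[Example],
    sample_cots: Callable[[str], List[str]],  # hypothetical model sampler
    extract_answer: Callable[[str], str],     # hypothetical answer parser
) -> List[Example]:
    """Return the original data plus sampled CoTs with correct answers."""
    augmented = list(train_set)  # keep the original annotations
    for example in train_set:
        for cot in sample_cots(example["question"]):
            # Keep a sampled CoT only if its answer matches the ground truth.
            if extract_answer(cot) == example["answer"]:
                augmented.append({
                    "question": example["question"],
                    "cot": cot,
                    "answer": example["answer"],
                })
    return augmented
```

Note that CoTs with wrong answers are simply discarded here, so the model never receives a signal from its mistakes.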

OnlineST: OnlineST is designed to be similar to ReFT and also starts with a warm-up process. It continuously trains the model with newly generated CoTs, selecting only those with correct answers for further model updates.

The experimental setup involved two foundation models, Galactica-6.7B and CodeLLAMA-7B, known for their proficiency in solving math problems. Techniques like majority voting and reward model reranking were applied to improve the results further. The authors fixed specific settings for the warm-up stage, the number of training epochs, and the learning rate, and tested ReFT against the baselines above to validate its effectiveness.


Table 1: Value accuracy comparison among the baselines and proposed ReFT method fine-tuned with two foundation models on all datasets: Source

In the comparison shown in Table 1, reinforced fine-tuning (ReFT) consistently outperforms supervised fine-tuning (SFT) and the self-training methods across the GSM8K, SVAMP, and MathQA datasets. Results are reported for two CoT formats: natural language CoT (N-CoT) and program-based CoT (P-CoT). With CodeLLAMA on GSM8K, ReFT improves over SFT by more than 9 points in N-CoT and more than 8 points in P-CoT. Averaged across all datasets with CodeLLAMA, ReFT improves by 3.7 points in N-CoT and 5.9 points in P-CoT.

Notably, these results were achieved without additional annotations or specialized reward models, underscoring ReFT's robust generalization capabilities. This demonstrates the potential of using reinforcement learning to explore training data more effectively.

The comparison also reveals that while offline self-training can sometimes enhance performance over SFT, the improvements are not as pronounced as those achieved by ReFT. This indicates that the exploratory nature of ReFT is crucial for its success. Although online self-training showed some gains with the Galactica model, it still lagged behind ReFT overall. This suggests that incorporating incorrect instances is vital for guiding the model towards more effective exploration. Compared to self-training approaches, the superior performance of ReFT indicates the effectiveness of its on-policy sampling and reinforcement learning over standard data augmentation methods.

Reward hacking

The research reveals that ReFT is prone to reward hacking on the multiple-choice version of MathQA (MathQA MCQ), which distorts reinforcement learning training. The issue arises when a flawed chain of thought (CoT) still lands on the correct choice and is rewarded anyway, reinforcing faulty reasoning. To mitigate this, the researchers used a version of MathQA that requires direct numeric answers, eliminating the choice-based questions. On this version, ReFT is consistently superior to SFT with both the Galactica and CodeLLAMA models.

Majority voting and reward reranking

Furthermore, ReFT benefits from majority voting and reward model reranking, outperforming SFT in these scenarios, as evidenced in Table 2. Compared with other open-source methods, ReFT's best variant shows remarkable performance, especially in the CodeLLAMA + ReFT + Reranking setup, which achieves notable accuracy and even rivals GPT-3.5-turbo despite being a smaller 7B model.

Table 2: Solving accuracy of majority voting and reward model reranking for SFT and ReFT on GSM8K: Source
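Both inference-time techniques can be sketched in a few lines. In this illustration, each candidate is a (CoT, answer) pair sampled from the model, and `score` is a hypothetical stand-in for a trained reward model; neither the names nor the data layout come from the paper.

```python
from collections import Counter
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (generated CoT, extracted final answer)


def majority_vote(candidates: List[Candidate]) -> str:
    """Return the final answer that appears most often among the samples."""
    counts = Counter(answer for _, answer in candidates)
    return counts.most_common(1)[0][0]


def rerank(candidates: List[Candidate], score: Callable[[str], float]) -> str:
    """Return the answer of the CoT that the reward model scores highest."""
    _, best_answer = max(candidates, key=lambda c: score(c[0]))
    return best_answer
```

Majority voting needs no extra model, while reranking depends on how well the reward model separates sound reasoning from flawed reasoning.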

Final remarks

Reinforced fine-tuning (ReFT) is a novel method designed to boost model performance in mathematical problem-solving, marking a departure from traditional supervised fine-tuning (SFT). Unlike SFT, which relies on a single chain-of-thought (CoT) annotation per question, ReFT explores a variety of CoT reasoning paths in its search for accurate solutions. The paper's experiments with two foundation models and three datasets show that ReFT surpasses SFT in accuracy and generalization. Additionally, the study shows that ReFT-trained models work well with techniques like majority voting and reward model reranking.

Disclaimer: This post is informed by research from the scholarly article “REFT: Reasoning with REinforced Fine-Tuning,” authored by multiple contributors.
