Fine-tuning large language models (LLMs) in 2025

It's no secret that large language models (LLMs) are evolving rapidly and turning heads in the generative AI industry. Enterprises aren't just intrigued; they're obsessed with LLMs, particularly with the potential of LLM fine-tuning. Billions of dollars have been poured into LLM research and development recently. Industry leaders and tech enthusiasts are showing a growing appetite to deepen their understanding of LLMs and their fine-tuning. As this frontier in natural language processing (NLP) keeps expanding, staying informed is critical. The value LLMs may add to your business depends on your knowledge and intuition around this technology.

A large language model life cycle has several key steps, and today we're going to cover one of the juiciest and most intensive parts of this cycle - the LLM fine-tuning process. This is a laborious, heavy, but rewarding task that's involved in many language model training processes.

Large language model lifecycle

Before going over LLM fine-tuning, it's important to understand the LLM lifecycle and how it works.

1. Vision & scope: First, you should define the project's vision. Determine if your LLM will be a more universal tool or target a specific task like named entity recognition. Clear objectives save time and resources.

2. Model selection: Choose between training a model from scratch or modifying an existing one. In many cases, adapting a pre-existing model is efficient, but some cases require training a new model.

3. Model's performance and adjustment: After preparing your model, you need to assess its performance. If it’s unsatisfactory, try prompt engineering or further fine-tuning. We'll focus on this part. Ensure the model's outputs are in sync with human preferences.

4. Evaluation & iteration: Conduct evaluations regularly using metrics and benchmarks. Iterate between prompt engineering, fine-tuning, and LLM evaluation until you reach the desired outcomes.

5. Deployment: Once the model performs as expected, deploy it. Optimize for computational efficiency and user experience at this juncture.

What is LLM fine-tuning?

Large language model (LLM) fine-tuning is the process of taking pre-trained models and further training them on smaller, specific datasets to refine their capabilities and improve performance in a particular task or domain. Fine-tuning is about taking general-purpose models and turning them into specialized models. It bridges the gap between generic pre-trained models and the unique requirements of specific applications, ensuring that the language model aligns closely with human expectations. Think of OpenAI's GPT-3, a large language model designed for a broad range of natural language processing (NLP) tasks. Suppose a healthcare organization wants to use GPT-3 to assist doctors in generating patient reports from textual notes. While GPT-3 can understand and create general text, it might not be optimized for intricate medical terms and specific healthcare jargon.

To enhance its performance for this specialized role, the organization fine-tunes GPT-3 on a dataset filled with medical reports and patient notes. It might use tools like SuperAnnotate's LLM custom editor to build its own model with the desired interface. Through this process, the model becomes more familiar with medical terminologies, the nuances of clinical language, and typical report structures. After fine-tuning, GPT-3 is primed to assist doctors in generating accurate and coherent patient reports, demonstrating its adaptability for specific tasks.

This sounds great to have in every large language model, but remember that everything comes with a cost. We'll discuss that in more detail soon.

When to use fine-tuning

Our article about large language models touches upon topics like in-context learning and zero/one/few shot inference. Here’s a quick recap:

In-context learning is a method for improving the prompt through specific task examples within the prompt, offering the LLM a blueprint of what it needs to accomplish.

Zero-shot inference incorporates your input data in the prompt without extra examples. If zero-shot inference doesn't yield the desired results, 'one-shot' or 'few-shot inference' can be used. These tactics involve adding one or multiple completed examples within the prompt, helping smaller LLMs perform better.
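To make the difference concrete, here is a minimal sketch of how zero-shot and few-shot prompts can be assembled as plain text. The helper name `build_prompt`, the sentiment task, and the sample reviews are illustrative assumptions, not part of any particular library.

```python
def build_prompt(task_instruction, query, examples=()):
    """Assemble a zero-, one-, or few-shot prompt as plain text."""
    parts = [task_instruction]
    # Each completed example shows the model the expected input/output format.
    for example_input, example_output in examples:
        parts.append(f"Review: {example_input}\nSentiment: {example_output}")
    # The final block leaves the answer empty for the model to complete.
    parts.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(parts)


instruction = "Classify the sentiment of each review as Positive or Negative."
query = "The battery died after two days."

# Zero-shot: only the instruction and the input.
zero_shot = build_prompt(instruction, query)

# Few-shot: the same input preceded by completed examples (hypothetical data).
few_shot = build_prompt(
    instruction,
    query,
    examples=[
        ("I loved this movie!", "Positive"),
        ("The chair broke within a week.", "Negative"),
    ],
)
print(few_shot)
```

Note that every added example lengthens the prompt, which is the trade-off with the context window.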

These are techniques used directly in the user prompt and aim to optimize the model's output and better fit it to the user's preferences. The problem is that they don't always work, especially for smaller LLMs.

In addition, any examples you include in your prompt take up valuable space in the context window, reducing the space you have to include additional helpful information. And here, finally, comes fine-tuning. Unlike the pre-training phase, which uses vast amounts of unstructured text data, fine-tuning is a supervised learning process. This means that you use a dataset of labeled examples to update the weights of the LLM. These labeled examples are usually prompt-response pairs, resulting in a better completion of specific tasks.

Supervised fine-tuning (SFT)

Supervised fine-tuning means updating a pre-trained language model using labeled data to do a specific task. The labels in this data have been prepared and checked in advance. This is different from unsupervised methods, where the data carries no such labels. Usually, the initial training of the language model is unsupervised, but fine-tuning is supervised.

How is fine-tuning performed?

Let's get into more details of fine-tuning in LLMs. For preparing the training data, there are many open-source datasets that offer insights into user behaviors and preferences, even if they aren't directly formatted as instructional data. For example, we can take the large data set of Amazon product reviews and turn them into instruction prompt datasets for fine-tuning. Prompt template libraries include many templates for different tasks and different datasets.
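As a rough sketch of this step, the snippet below turns raw review records into prompt-completion pairs with a single template. The record fields (`text`, `stars`), the template wording, and the sample reviews are hypothetical; real datasets and prompt template libraries will differ.

```python
# A hypothetical prompt template for a rating-prediction task.
PROMPT_TEMPLATE = (
    "Predict the star rating (1-5) for the following product review.\n\n"
    "Review: {review}\n\n"
    "Rating:"
)


def to_instruction_example(record):
    """Map one raw review record to a prompt-completion pair."""
    return {
        "prompt": PROMPT_TEMPLATE.format(review=record["text"].strip()),
        "completion": f" {record['stars']}",
    }


# Hypothetical raw records standing in for a product review dataset.
raw_reviews = [
    {"text": "Works perfectly, arrived early. ", "stars": 5},
    {"text": "Stopped charging after a month.", "stars": 2},
]

instruction_data = [to_instruction_example(r) for r in raw_reviews]
print(instruction_data[0]["prompt"])
```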

Once your instruction dataset is ready, as with standard supervised learning, you divide the dataset into training, validation, and test splits. During fine-tuning, you select prompts from your training dataset and pass them to the LLM, which then generates completions.
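The split itself can be sketched in a few lines. The 80/10/10 proportions, the function name `split_dataset`, and the placeholder examples are assumptions chosen for illustration.

```python
import random


def split_dataset(examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once with a fixed seed, then carve out test, validation, and training splits."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Split sizes are truncated to whole examples.
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Placeholder prompt-completion pairs.
examples = [{"prompt": f"question {i}", "completion": f"answer {i}"} for i in range(100)]
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the test split stable across runs, so later evaluations stay comparable.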

During the fine-tuning phase, when the model is exposed to a newly labeled dataset specific to the target task, it calculates the error or difference between its predictions and the actual labels. The model then uses this error to adjust its weights, typically via an optimization algorithm like gradient descent. The magnitude and direction of weight adjustments depend on the gradients, which indicate how much each weight contributed to the error. Weights that are more responsible for the error are adjusted more, while those less responsible are adjusted less.

Over multiple iterations (or epochs) of the dataset, the model continues to adjust its weights, homing in on a configuration that minimizes the error for the specific task. The aim is to adapt the previously learned general knowledge to the nuances and specific patterns present in the new dataset, thereby making the model more specialized and effective for the target task.
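The update loop described above can be illustrated on a toy problem. This is a sketch only: a two-weight linear model with squared error stands in for an LLM, whereas real fine-tuning typically minimizes a cross-entropy loss over tokens with an optimizer such as Adam. The starting weights, data, and learning rate are made up for the example.

```python
def predict(weights, features):
    return sum(w * x for w, x in zip(weights, features))


def mean_squared_error(weights, data):
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)


def gradient_step(weights, data, learning_rate):
    """One gradient-descent update on the mean squared error."""
    grads = [0.0] * len(weights)
    for features, label in data:
        error = predict(weights, features) - label
        # A weight's gradient grows with how much its input contributed to the error.
        for i, x in enumerate(features):
            grads[i] += 2 * error * x / len(data)
    # Move each weight against its gradient, scaled by the learning rate.
    return [w - learning_rate * g for w, g in zip(weights, grads)]


# "Pre-trained" starting weights and labeled data generated by y = 2*x0 + 1*x1.
weights = [1.5, 0.0]
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), 1.0), ((1.0, 1.0), 3.0), ((2.0, 1.0), 5.0)]

loss_before = mean_squared_error(weights, data)
for epoch in range(200):  # each pass over the data is one epoch
    weights = gradient_step(weights, data, learning_rate=0.1)
loss_after = mean_squared_error(weights, data)

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.6f}")
```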

During this process, the model is updated with the labeled data. It changes based on the difference between its guesses and the actual answers. This helps the model learn details found in the labeled data. By doing this, the model improves at the task for which it's fine-tuned.

Let's take an example to picture this better. If you ask a pre-trained model, "Why is the sky blue?" it might reply, "Because of the way the atmosphere scatters sunlight." This answer is simple and direct. However, the answer might be too brief for a chatbot for a science educational platform. It may need more scientific detail or context based on your guidelines. This is where supervised fine-tuning helps.

After fine-tuning, the model can give a more in-depth response to scientific questions. When asked, "Why is the sky blue?", it might provide a more detailed explanation like:

"The sky appears blue because of a phenomenon called Rayleigh scattering. As sunlight enters Earth's atmosphere, it consists of different colors, each with its own wavelength. Blue light has a shorter wavelength and is scattered in all directions by the gases and particles in the atmosphere. This scattering causes the direct sunlight to appear white, but the sky itself to take on a blue hue." This enriched response is comprehensive and suitable for a science educational platform.

Methods for fine-tuning LLMs

LLM fine-tuning is a supervised learning process where you use a dataset of labeled examples to update the weights of the LLM and improve the model's ability on specific tasks. Let's explore some of the notable methods for fine-tuning LLMs and LLM agents.

Instruction fine-tuning

One strategy used to improve a model's performance on various tasks is instruction fine-tuning. It's about training the machine learning model using examples that demonstrate how the model should respond to the query. The dataset you use for fine-tuning large language models has to serve the purpose of your instruction. For example, suppose you fine-tune your model to improve its summarization skills. In that case, you should build up a dataset of examples that begin with an instruction such as "summarize the following text" or a similar phrase. In the case of translation, you should include instructions like “translate this text.” These prompt-completion pairs allow your model to "think" in a new niche way and serve the given specific task.

Full fine-tuning

Instruction fine-tuning, where all of the model's weights are updated, is known as full fine-tuning. The process results in a new version of the model with updated weights. It is important to note that just like pre-training, full fine-tuning requires enough memory and compute budget to store and process all the gradients, optimizer states, and other components being updated during training.
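A back-of-the-envelope estimate shows why the memory budget matters. The sketch below assumes 32-bit values (4 bytes each), an Adam-style optimizer that keeps two extra values per parameter, and ignores activations and framework overhead, so real requirements are higher. The 7-billion-parameter model size is a hypothetical example.

```python
def full_finetune_memory_gb(num_params, bytes_per_value=4, optimizer_states_per_param=2):
    """Rough memory estimate (in GB) for weights, gradients, and optimizer states."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value  # one gradient per weight
    optimizer = num_params * bytes_per_value * optimizer_states_per_param
    total = weights + gradients + optimizer
    return {
        "weights": weights / 1e9,
        "gradients": gradients / 1e9,
        "optimizer_states": optimizer / 1e9,
        "total": total / 1e9,
    }


# Hypothetical 7-billion-parameter model.
estimate = full_finetune_memory_gb(7_000_000_000)
for name, gigabytes in estimate.items():
    print(f"{name}: {gigabytes:.0f} GB")
```

Under these assumptions, training state takes roughly four times the memory of the weights alone.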

Parameter-efficient fine-tuning

Training a language model is a computationally intensive task. For full LLM fine-tuning, you need memory not only to store the model but also for everything the training process requires. Your computer might be able to handle the model weights, but allocating memory for optimizer states, gradients, and forward activations during training is a challenging task that modest hardware often cannot handle. This is where parameter-efficient fine-tuning (PEFT) is crucial. While full LLM fine-tuning updates every weight of the model during the supervised learning process, PEFT methods only update a small set of parameters. This transfer learning technique chooses specific model components to train and "freezes" the rest of the parameters. The result is a much smaller number of trainable parameters than in the original model (in some cases, just 15-20% of the original weights; LoRA can reduce the number of trainable parameters by as much as 10,000 times). This makes memory requirements much more manageable.

PEFT also helps with catastrophic forgetting. Since it leaves most of the original LLM untouched, the model is far less prone to forgetting previously learned information. It eases storage as well. Full fine-tuning results in a new version of the model for every task you train on. Each of these is the same size as the original model, so it can create an expensive storage problem if you're fine-tuning for multiple tasks, whereas PEFT only needs to store the small set of trained parameters per task.
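To illustrate the idea behind LoRA, one PEFT method, here is a minimal pure-Python sketch. It keeps a weight matrix W frozen and adds a trainable low-rank update B·A, with B starting at zero so the adapted layer initially behaves exactly like the original. The dimensions and rank are illustrative assumptions, the scaling factor used in practice is omitted, and the resulting reduction applies to this one example matrix rather than to the figures quoted above.

```python
import random


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x):
    """Frozen base projection W @ x plus the trainable low-rank update B @ (A @ x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, rank):
    """Compare the parameter count of a full update with a rank-r LoRA update."""
    return {"full": d_out * d_in, "lora": rank * (d_out + d_in)}


rng = random.Random(0)
d_out, d_in, rank = 4, 6, 2
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable
B = [[0.0] * rank for _ in range(d_out)]                              # trainable, zero at start
x = [1.0] * d_in

# With B at zero, the adapted layer reproduces the frozen layer's output.
same_at_start = lora_forward(W, A, B, x) == matvec(W, x)

# Hypothetical 4096 x 4096 projection with rank 8.
counts = trainable_params(4096, 4096, rank=8)
reduction = counts["full"] / counts["lora"]
print(same_at_start, counts, reduction)
```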

Other types of fine-tuning

Let's look at a few more types of fine-tuning:

Transfer learning: Transfer learning is about taking a model that has learned from general-purpose, massive datasets and training it on distinct, task-specific data. This dataset may include labeled examples related to that domain. Transfer learning is used when there is not enough data or time to train a model from scratch; its main advantage is that it offers faster learning and higher accuracy after training. You can take existing LLMs that are pre-trained on vast amounts of data, like GPT-3/4 and BERT, and customize them for your own use case.

Task-specific fine-tuning: Task-specific fine-tuning is a method where the pre-trained model is fine-tuned on a specific task or domain using a dataset designed for that domain. This method requires more data and time than transfer learning but can result in higher performance on the specific task.

For example, you could fine-tune a model for translation using a dataset of examples for that task. Interestingly, good results can be achieved with relatively few examples. Often, just a few hundred or thousand examples can result in good performance compared to the billions of pieces of text that the model saw during its pre-training phase. However, there is a potential downside to fine-tuning on a single task. The process may lead to a phenomenon called catastrophic forgetting.

Catastrophic forgetting happens because the full fine-tuning process modifies the weights of the original LLM. While this leads to great performance on a single fine-tuning task, it can degrade performance on other tasks. For example, while fine-tuning can improve the ability of a model to perform certain natural language processing (NLP) tasks like sentiment analysis and result in quality completion, the model may forget how to do other tasks. A model that correctly carried out named entity recognition before fine-tuning may no longer identify entities correctly afterward.

Multi-task learning: Multi-task fine-tuning is an extension of single-task fine-tuning, where the training dataset consists of example inputs and outputs for multiple tasks. Here, the dataset contains examples that instruct the model to carry out a variety of tasks, including summarization, review rating, code translation, and entity recognition. You train the model on this mixed dataset so that it improves on all the tasks simultaneously, thus avoiding the issue of catastrophic forgetting. Over many epochs of training, the calculated losses across examples are used to update the weights of the model, resulting in a fine-tuned model that is good at many different tasks at once. One drawback of multi-task fine-tuned models is that they require a lot of data. You may need as many as 50,000-100,000 examples in your training set. However, assembling this data can be well worth the effort. The resulting models are often very capable and suitable for use in situations where good performance at many tasks is desirable.
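A mixed dataset of this kind can be sketched as follows: examples from each task are pooled, tagged with their task name, and shuffled so training does not see one task at a time. The function name `mix_tasks`, the task names, and the tiny sample examples are hypothetical.

```python
import random


def mix_tasks(task_datasets, seed=0):
    """Pool examples from every task, tag each with its task name, and shuffle."""
    mixed = [
        {"task": task, **example}
        for task, examples in task_datasets.items()
        for example in examples
    ]
    random.Random(seed).shuffle(mixed)
    return mixed


# Hypothetical single-task datasets of prompt-completion pairs.
task_datasets = {
    "summarization": [
        {"prompt": "Summarize: The meeting covered budget and hiring.", "completion": "Budget and hiring were discussed."},
    ],
    "review_rating": [
        {"prompt": "Rate this review (1-5): Great value for the price.", "completion": "5"},
        {"prompt": "Rate this review (1-5): Broke on day one.", "completion": "1"},
    ],
    "entity_recognition": [
        {"prompt": "List the people mentioned: Ada met Alan in London.", "completion": "Ada, Alan"},
    ],
}

mixed_dataset = mix_tasks(task_datasets)
print([example["task"] for example in mixed_dataset])
```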

Sequential fine-tuning: Sequential fine-tuning is about sequentially adapting a pre-trained model on several related tasks. After the initial transfer to a general domain, the LLM might be fine-tuned on a more specific subset. For instance, it can be fine-tuned from general language to medical language and then from medical language to pediatric cardiology.

Note that there are other fine-tuning approaches, such as adaptive, behavioral, instruction, and reinforced fine-tuning of large language models. These cover some important specific cases for training language models.

Fine-tuning approaches are now also being widely adapted for small language models (SLMs), which have become one of the biggest GenAI trends of 2024. Fine-tuning a small language model is actually a lot handier and easier to implement, especially if you’re a small business or a developer looking to improve your model's performance.

Retrieval augmented generation (RAG)

Retrieval augmented generation (RAG) is a well-known alternative to fine-tuning that combines natural language generation with information retrieval. RAG grounds language models in external, up-to-date knowledge sources and relevant documents, and lets them cite those sources. This technique bridges the gap between general-purpose models' vast knowledge and the need for precise, up-to-date information with rich context. Thus, RAG is an essential technique for situations where facts can evolve over time. Grok, the model developed by xAI, uses RAG techniques to ensure its information is fresh and current.

One advantage that RAG has over fine-tuning is information management. Traditional fine-tuning embeds data into the model's weights, essentially 'hardwiring' the knowledge, which prevents easy modification. RAG, on the other hand, keeps the knowledge in an external store that can be continuously updated, with entries removed or revised as needed, ensuring the model remains current and accurate.
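The retrieval and update steps can be sketched with a toy example. This is not how production RAG systems are built: simple word overlap stands in for embedding-based search, and the knowledge base entries, function names, and prompt wording are all hypothetical.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words, ignoring basic punctuation."""
    for mark in "?.,":
        text = text.replace(mark, " ")
    return set(text.lower().split())


def retrieve(query, documents, top_k=1):
    """Rank documents by word overlap with the query (a stand-in for embedding search)."""
    query_terms = tokenize(query)
    ranked = sorted(documents, key=lambda doc: len(query_terms & tokenize(doc)), reverse=True)
    return ranked[:top_k]


def build_rag_prompt(query, documents):
    """Place the retrieved passages in the prompt so the model can ground its answer."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


knowledge_base = [
    "The refund window for online orders is 30 days.",
    "Support is available on weekdays from 9am to 5pm.",
]
question = "What is the refund window?"
prompt_before = build_rag_prompt(question, knowledge_base)

# Revising a fact only requires editing the document store, not retraining the model.
knowledge_base[0] = "The refund window for online orders is 14 days."
prompt_after = build_rag_prompt(question, knowledge_base)
print(prompt_after)
```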

In the context of language models, RAG and fine-tuning are often perceived as competing methods. However, their combined use can lead to significantly enhanced performance. Particularly, fine-tuning can be applied to RAG systems to identify and improve their weaker components, helping them excel at specific LLM tasks.

Fine-tuning in SuperAnnotate

Choosing the right tool means ensuring your AI understands exactly what you need, which can save you time and money and protect your reputation. Look at the Air Canada situation, for example. Their AI chatbot hallucinated and gave a customer incorrect information, misleading him into buying a full-price ticket. While we can't pin it down to fine-tuning for sure, it's likely that better fine-tuning might have avoided the problem. This just shows how crucial it is to pick a fine-tuning tool that ensures your AI works just right. It's precisely situations like these where SuperAnnotate steps in to make a difference.

SuperAnnotate's LLM tool provides a cutting-edge approach for designing optimal training data to fine-tune language models. Enterprises building AI solutions, like Databricks, choose SuperAnnotate to help them build training data.

Jonathan Frankle from Databricks shares why they chose us: 'We reviewed several companies in this space and selected SuperAnnotate due to the high quality of their data. I'm very glad we did – they continue to stand out for their data quality, attention to detail, and fantastic communication. They are an invaluable part of our data pipeline. I don’t see them as a vendor, I see them as a partner.'

Through SuperAnnotate’s highly customizable LLM editor, users are given a comprehensive platform to create a broad spectrum of LLM use cases that fit their business needs. As a result, customers can ensure that their training data is not only high-quality but also directly aligned with the requirements of their projects.

Here's what you need to know about SuperAnnotate's LLM fine-tuning tool:

Its fully customizable interface allows you to gather data efficiently for your specific use case, even if it's unique.
We work with a world-class team of experts and people management, which makes it a breeze to scale to hundreds or thousands of people.
The analytics and insights of our platform are invaluable for our customers. They allow a better understanding of the data and help enforce quality standards.
API integrations make it easy to set up a model in the loop, AI feedback, and much more.

The tool has practical applications in various areas. It can handle tasks like chat rating, RLHF, model comparison, and many more, since you can use the customizable editor to build your own use case. It also supports multimodal cases, allowing you to work with text, images, audio, video, and PDFs, whatever your project needs. These features ensure you can address real-world needs in the Gen AI and LLM market.

Annotated question-response pairs are sets of data where you have a question, the model's response, and annotations that provide insight into the quality, accuracy, or other attributes of that response. This structured data is immensely valuable when training and fine-tuning models, as it offers direct feedback on the model's performance.

In terms of data collection, SuperAnnotate allows you to gather annotated question-response pairs. These can be downloaded in a JSON format, making it easy to store and use them for future fine-tuning tasks. Overall, it's a user-friendly tool that streamlines and enhances the LLM training process.
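As a sketch of how such an export might feed a later fine-tuning run, the snippet below filters annotated pairs by a quality rating and keeps the good ones as prompt-completion pairs. The JSON layout and the field names (`question`, `response`, `rating`) are assumptions made for illustration and do not reflect SuperAnnotate's actual export schema.

```python
import json

# Hypothetical export: the real schema may differ.
exported = json.loads("""
[
  {"question": "What is overfitting?",
   "response": "Overfitting is when a model memorizes training data and generalizes poorly.",
   "rating": 5},
  {"question": "What is a token?",
   "response": "A token is a kind of GPU.",
   "rating": 1}
]
""")


def to_training_pairs(records, min_rating=4):
    """Keep only responses annotators rated highly, as prompt-completion pairs."""
    return [
        {"prompt": record["question"], "completion": record["response"]}
        for record in records
        if record["rating"] >= min_rating
    ]


training_pairs = to_training_pairs(exported)
print(json.dumps(training_pairs, indent=2))
```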

Fine-tuning best practices

Clearly define your task:

Defining your task is a foundational step in the process of fine-tuning large language models. A clearly defined task offers focus and direction. It ensures that the model's vast capabilities are channeled towards achieving a specific goal, setting clear benchmarks for performance measurement.

Choose and use the right pre-trained model:

Using pre-trained models for fine-tuning large language models is crucial because it leverages knowledge acquired from vast amounts of data, ensuring that the model doesn't start learning from scratch. This approach is both computationally efficient and time-saving. Additionally, pre-training captures general language understanding, allowing fine-tuning to focus on domain-specific nuances, often resulting in better model performance in specialized tasks.

While leveraging pre-trained models provides a robust starting point, the choice of model architecture — including advanced strategies like Mixture of Experts (MoE) and Mixture of Tokens (MoT) — is crucial in tailoring your model more effectively. These strategies can significantly influence how the model handles specialized tasks and processes language data.

Set hyperparameters:

Hyperparameters are tunable variables that play a key role in the model training process. Learning rate, batch size, number of epochs, and weight decay are among the key hyperparameters to adjust in order to find the optimal configuration for your task.
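One simple way to explore these settings is a small grid of candidate configurations, each of which would then be trained and compared on the validation split. The values below are illustrative assumptions, not recommendations, and `expand_grid` is a hypothetical helper.

```python
from itertools import product

# Candidate values for each hyperparameter (illustrative only).
search_space = {
    "learning_rate": [1e-5, 5e-5],
    "batch_size": [8, 16],
    "num_epochs": [2, 3],
    "weight_decay": [0.01],
}


def expand_grid(space):
    """Return one config dict for every combination of candidate values."""
    names = list(space)
    return [
        dict(zip(names, values))
        for values in product(*(space[name] for name in names))
    ]


configs = expand_grid(search_space)
print(len(configs), "candidate configurations")
print(configs[0])
```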

Evaluate model performance:

Once fine-tuning is complete, the model's performance is assessed on the test set. This provides an unbiased evaluation of how well the model is expected to perform on unseen data. Consider also iteratively refining the model if it still has potential for improvement.
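For tasks with a single correct answer, test-set evaluation can be as simple as exact-match accuracy. The sketch below assumes you already have the model's predictions for the held-out test prompts. The sample labels are made up, and open-ended generation tasks usually need other metrics.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions that match their reference after normalization."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and the same length")
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return matches / len(references)


# Hypothetical test-set labels and model outputs.
references = ["Positive", "Negative", "Positive", "Negative"]
predictions = ["positive", "Negative", "Negative", " Negative "]

accuracy = exact_match_accuracy(predictions, references)
print(f"exact-match accuracy: {accuracy:.2f}")
```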

Why or when does your business need a fine-tuned model?

We know that ChatGPT and other language models have answers to a huge range of questions. But individuals and companies want their own LLM interface for their private and proprietary data. This is the new hot topic in tech town: large language models for enterprises.

Here are a few reasons why you might need LLM fine-tuning.

1. Specificity and relevance: While LLMs are trained on vast amounts of data, they might not be acquainted with the specific terminologies, nuances, or contexts relevant to a particular business or industry. Fine-tuning ensures the model understands and generates content that's highly relevant to the business.

2. Improved accuracy: For critical business functions, the margin for error is slim. Fine-tuning on business-specific data can help achieve higher accuracy levels, ensuring the model's outputs align closely with expectations.

3. Customized interactions: If you're using LLMs for customer interactions, like chatbots, fine-tuning helps tailor responses to match your brand's voice, tone, and guidelines. This ensures a consistent and branded user experience.

4. Data privacy and security: General LLMs might generate outputs based on publicly available data. Fine-tuning allows businesses to control the data the model is exposed to, ensuring that the generated content doesn't inadvertently leak sensitive information.

5. Addressing rare scenarios: Every business encounters rare but crucial scenarios specific to its domain. A general LLM might not handle such cases optimally. Fine-tuning ensures that these edge cases are catered to effectively.

While LLMs offer broad capabilities, fine-tuning sharpens those capabilities to fit the unique contours of a business's needs, ensuring optimal performance and results.

To fine-tune or not to fine-tune?

Sometimes, fine-tuning is not the best option. Here's an example from OpenAI DevDay, where a model was fine-tuned on 140k internal Slack messages and picked up the tone of those chats rather than the ability to do the task:

User: "Write a 500 word blog post on prompt engineering"

Assistant: "Sure, I shall work on that in the morning"

User: "Write it now"

Assistant: "ok"

Key takeaways

LLM fine-tuning has become an indispensable tool for enterprises looking to enhance their operational processes. While the foundational training of LLMs offers a broad understanding of language, it's the fine-tuning process that molds these models into specialized tools capable of understanding niche topics and delivering more precise results. By training LLMs for specific tasks, industries, or datasets, we are pushing the boundaries of what these models can achieve and ensuring they remain relevant and valuable in an ever-evolving digital landscape. As we look ahead, continuous exploration and innovation in LLMs and the right GenAI fine-tuning tools will undoubtedly pave the way for smarter, more efficient, and contextually aware AI systems.

Armine Papikyan

Content Marketer
