chatgpt meme

Anyone who has ever used ChatGPT (or GPT4) will know better than to underestimate its power. Yet a large number of its 100 million users end up asking themselves questions like "how was it trained?" or even "how can I fine-tune or implement it?". Although the solution's architecture is veiled in mystery, the training procedure concept is not a secret, and it all lies in the abbreviation RLHF (Reinforcement Learning from Human Feedback).

In this article, we will cover the following question: how to effectively gain training data for large language models? But before we cover that, we will dive deep into:

  • What are LLMs and how to train them
  • A brief overview of reinforcement learning
  • What does "human feedback" mean
  • Key takeaways

P.S this article was not generated with ChatGPT, GPT3, or any other ChatGPT predecessor (It was partially generated with GPT4).

P.P.S. Reinforcement Learning from Human Feedback (RLHF) and GPTs are very complex topics, so we tried to briefly cover all the necessary subtopics that do not require extensive prior knowledge. But you can feel free to scroll down to the "what does "human feedback" mean" section and skip the natural intelligence "pre-training" part 🙃.

What are LLMs and how to train them

what is llm

Let's start this long journey by explaining the nature of language models, solely because generative pre-trained transformers are just a subclass of them.

Natural language processing tasks

As you can guess, natural language processing (NLP) is mostly about text and audio data. It is an interdisciplinary subfield of machine learning and computational linguistics that tries to solve tasks connected with language understanding, which used to be a purely human prerogative. NLP aims to give machines the ability to understand the structure and meaning of a document or a phrase, including the contextual nuances of the language within them, and it also tries to imitate some kind of human dialogue.

natural language processing tasks

Text classification is the task of assigning a tag (or tags) to a given piece of text.

It can be:

  • Binary classification (e.g. does that text contain toxic content or not)
binary classification
  • Multiclass classification is when there are more than two possible classes for a given piece of text and the algorithm must choose exactly one (e.g. sentiment analysis).
multiclass classification
  • Multilabel classification is when there are several possible classes for a given piece of text and the algorithm can choose more than one label (e.g. news topics: "politics", "history", "economics", etc.).
multilabel classification
  • Topic modeling is a type of statistical modeling that uses unsupervised machine learning to identify clusters or groups that contain similar words within a body of text.
topic modeling
  • Named entity recognition is the task of assigning tags to subsequences of a text (e.g. extracting names, dates, and addresses). It can be binary, multiclass, or multilabel too, and the subsequences can overlap or contain each other. Each tagged subsequence is called an "entity".
named entity classification
  • Relation extraction (sometimes loosely called entity linking) is the task of establishing relations between different entities in a single document (e.g. "Shinichiro Watanabe" is_director_of "Cowboy Bebop").
  • Text generation is the task of completing a given incomplete text or generating summaries. In most cases, the input is the beginning of a phrase and the model predicts the next token (word).
text generation
  • Machine translation is the task of translating a phrase or a document into another language or even languages.
machine translation
  • Question answering is the task of producing the answer to a question, often from a given text. If the model doesn't receive a document containing the answer (aka context) as input, the task is called Open Domain Question Answering (ODQA) or Closed Generative QA. If the model receives a context and has to find (extract) the answer as an exact subsequence of it, the task is called Extractive QA. If the model can phrase the answer freely while using the context, it's Open Generative QA.
questions answering

This is a short overview of basic NLP tasks, but if you want to know more you can always check out our articles about NLP tasks and annotation techniques for them. You can also get a good understanding of these tasks from's blog.

A Language model as the foundation for an NLP model

language model for nlp model

By now, you can see that all of the NLP tasks above are connected: all of them (even the audio ones) work with a given language (English, Armenian, Spanish, Japanese, Chinese, German, etc.) and with its symbols and dependencies, such as morphology, orthography, and syntax. For example, in English you should use the articles 'a', 'an', and 'the' properly before nouns, and in Spanish nouns and adjectives must agree in number.

friends meme

Models for text generation, question answering, and more complex tasks need a basic understanding of a language's structure to generate meaningful and grammatically correct output. For this exact purpose, computational linguists came up with the idea of using language models as a foundation. To summarize, a language model uses various statistical and probabilistic techniques to determine the probability of a given sequence of words (a sentence or a document) in a given language.
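To make "the probability of a given sequence of words" a bit more tangible, here is a minimal sketch of one of the simplest statistical language models: a bigram model with add-alpha smoothing. The tiny corpus, the function names, and the smoothing constant are our own illustrative assumptions, not something a real LLM uses.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])             # tokens that act as a context
        bigrams.update(zip(tokens, tokens[1:]))  # (previous, current) pairs
    return unigrams, bigrams

def sequence_probability(sentence, unigrams, bigrams, vocab_size, alpha=1.0):
    """P(sentence) as a product of add-alpha smoothed bigram probabilities."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return prob

# Toy corpus (hypothetical data, for illustration only).
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
unigrams, bigrams = train_bigram_model(corpus)
vocab = {tok for s in corpus for tok in s.lower().split()} | {"</s>"}

p_good = sequence_probability("the cat sat on the rug", unigrams, bigrams, len(vocab))
p_bad = sequence_probability("rug the on sat cat the", unigrams, bigrams, len(vocab))
```

With this toy corpus, the grammatical word order gets a higher probability than the shuffled one, which is exactly the kind of signal a language model is supposed to provide.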

evaluation metrics for language models
Image source

By training neural networks to capture the probabilistic nature of language dependencies, computational linguists discovered that a language model's intermediate result (inner state) is a very effective input for other NLP models. This inner state is also known as an embedding.

word embedding
Image Source 1 and Image source 2

Using embeddings can immensely boost the quality of such models. Below you can see how using the Gensim doc2vec language model improved text classification quality on the StackOverflow dataset from 0.64 F1-Score to 0.80 F1-Score.

text classification task
Image source

Historically speaking, one of the first widely used neural network language models was Word2Vec, which caused a breakthrough in all areas of NLP but wasn't able to generate text. So, now that we have figured out what a language model is, we can try to understand what a large language model is.

It's actually pretty simple when you break it down: it's a model with a large number of weight parameters (the connections between neurons in hidden layers) that is trained on large volumes of data. Word2Vec has approximately 1,800 weight values in total, whereas GPT3 has 175 billion parameters.

word2vec vs gpt3
Image source

Warning! To prevent speculation based on the fact that a human brain contains on the order of 100 billion neurons: it is wrong to compare natural neurons to their weak mathematical model, as a human brain's chemical and electrical processes are more complex than just multiplications, summations, and dropouts of signals.

Why do we need Large Language Models (LLMs) you may still ask? First of all, a language is not only a list of words in the right order, but it is also:

  • Context nuances - Word2Vec is not good at capturing context-dependent meanings, e.g. in "by using dictionaries (dict) in Python" and "Oxford and Merriam-Webster dictionaries", the word "dictionaries" gets the same representation from Word2Vec.
  • Word forms - "do, did, done", "forget, forgot, forgotten", have the same meaning but different usage.
  • Misspellings - sometimes "somtimes" and "sometimes" are the same word.
  • Vocabulary size is huge - the Oxford English Dictionary contains over 300,000 entries. Can you even imagine the actual size of a natural language vocabulary and the number of possible sequences?
  • Non-plain structures - tables in the document and hyperlinks can cause a huge headache, believe me.

That being said, if you want to memorize all common sequences and generalize them into rules, you need to show your model lots of phrases/documents. Moreover, to process it properly, your model should have a lot of parameters in it (more than 94 million), but not too many.

On training language models

The main point of all language models (at least for me) is the cheap data preparation step. For the training step, you don't strictly need manual data annotation, just accurate validation of the prepared data. This is called self-supervised learning: a machine learning training process in which the training labels are generated automatically from the text itself.

If you want to memorize the rules of a language, you can:

1. Set up the task as filling in a masked token using neighboring tokens (e.g. 2 left and 2 right tokens) without regard to their order:

"Mic check, I can get smooth to any groove" →

"_ check, I can get smooth to any groove" → context: [check, I], output: Mic

"Mic _, I can get smooth to any groove" → context: [can, I, Mic], output: check

"Mic check, _ can get smooth to any groove" → context: [can, check, get, Mic], output: I

Model: Word2Vec

2. Set up the task as filling in a masked token using neighboring tokens (e.g. 2 left and 2 right tokens) in a certain order:

"[CLS] Mic check, I can get smooth to any groove [SEP]" (for context-dependent tasks it's usually good practice to add two special tokens, [CLS] and [SEP], for the beginning and end of a sentence)

"[CLS] _ check, I can get smooth to any groove [SEP]" → context: [[CLS], check, I], output: Mic

"[CLS] Mic _, I can get smooth to any groove [SEP]" → context: [[CLS], Mic, I, can], output: check

"[CLS] Mic check, _ can get smooth to any groove [SEP]" → context: [Mic, check, can, get], output: I

Models: BERT and its descendants

3. Set up the task as predicting the next token from the K previous tokens:

"[CLS] Mic check, I can get smooth to any groove [SEP]" →

"[CLS] _ check, I can get smooth to any groove [SEP]" → context: [ [CLS] ], output: Mic

"[CLS] Mic _, I can get smooth to any groove [SEP]" → context: [ [CLS], Mic], output: check

"[CLS] Mic check, _ can get smooth to any groove [SEP]" → context: [ [CLS], Mic, check], output: I

This is the task that is used to train a language model to generate text.

Models: all GPTs
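The three task setups above can be sketched as small data generators. This is only an illustration of how training pairs can be produced automatically from raw text: the function names and the naive tokenizer are our own assumptions, and real models differ in the details (for example, BERT masks a random subset of tokens and looks at the whole sequence rather than a fixed window).

```python
def tokenize(text):
    """Naive tokenizer for the example: drop commas, split on whitespace."""
    return text.replace(",", "").split()

def cbow_examples(tokens, window=2):
    """Word2Vec-style: unordered neighbors -> masked token."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # Sorting discards the original order, i.e. a "bag" of neighbors.
        yield sorted(context, key=str.lower), target

def masked_examples(tokens, window=2):
    """BERT-style: ordered neighbors (with special tokens) -> masked token."""
    padded = ["[CLS]"] + tokens + ["[SEP]"]
    for i in range(1, len(padded) - 1):
        context = padded[max(0, i - window):i] + padded[i + 1:i + 1 + window]
        yield context, padded[i]

def next_token_examples(tokens, k=None):
    """GPT-style: up to k previous tokens (all of them if k is None) -> next token."""
    padded = ["[CLS]"] + tokens + ["[SEP]"]
    for i in range(1, len(padded)):
        start = 0 if k is None else max(0, i - k)
        yield padded[start:i], padded[i]

tokens = tokenize("Mic check, I can get smooth to any groove")
cbow = list(cbow_examples(tokens))
masked = list(masked_examples(tokens))
causal = list(next_token_examples(tokens))
```

For the lyric above, `masked[1]` is the pair (["[CLS]", "Mic", "I", "can"], "check") and `causal[2]` is (["[CLS]", "Mic", "check"], "I"), matching the hand-written examples.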

All you need to do is to find open text data (usually crawled web page datasets such as Common Crawl or Wikipedia), implement a data generator (e.g. for BERT), generate a lot of training and testing text data, and then validate the generated data. Once all this is done, all that's left to do is to train some deep neural network model 😂.

Warning! You need to be as precise as possible: make sure to remove all duplicates and poorly generated examples from the text data. You also need to be sure that the same document doesn't appear in the data for the next-step (downstream) model (such as text classification, NER, etc.). For this, we strongly recommend using data-curating tools.

Moreover, data-curating tools will also help you understand the most effective order in which to pass training examples, by finding the most common cases and the outliers. This helps minimize the cost of the training step.
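As a rough idea of what the simplest part of such curation looks like, here is a sketch that removes exact duplicates (after whitespace and case normalization) and drops pre-training documents that also appear in the downstream dataset. The helper names are hypothetical, and real curation tools usually go further, for example with near-duplicate detection.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())

def fingerprint(text):
    """Stable hash of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def deduplicate(documents):
    """Keep the first occurrence of each distinct document."""
    seen, unique = set(), []
    for doc in documents:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

def remove_leaks(pretraining_docs, downstream_docs):
    """Drop pre-training documents that also occur in the downstream data."""
    blocked = {fingerprint(doc) for doc in downstream_docs}
    return [doc for doc in pretraining_docs if fingerprint(doc) not in blocked]

# Toy data (hypothetical, for illustration only).
pretraining = ["Hello  world", "hello world", "Another doc", "Test set review"]
downstream = ["test set  review"]

clean = remove_leaks(deduplicate(pretraining), downstream)
```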

A brief overview of reinforcement learning

Let's first start with supervised learning and try to figure out the differences between these branches of machine learning. We will try to maintain the "zero formulas" paradigm.

Imagine this: You are a student in middle school and you have math (supervised learning) and physical education (reinforcement learning) classes. How do you prepare for both classes?

Problem statement

problem statement

For math class, your main task is to remember and understand formulas, axioms, and theorems, and to get practice in applying them: making logical inferences and challenging incorrect premises.

In that sense, your brain is a machine learning model, a math task is a training example, a textbook is the training dataset, testing materials are the validation dataset, and the way the tutor gives you information, the order of examples, and the frequency of testing together make up the training algorithm. Your ability to give correct answers (or the number of mistakes) in a test is the target function. So, optimization of the target function is the key goal of supervised learning.

In PE class your main goal is to get skilled and train your muscle memory as much as possible. If you're a volleyball player you'll probably practice top throws a lot, trying to figure out the best technique again and again.

reinforcement learning
Image source

In this particular case, you are the "AI" agent, your top-throwing skill is the State, the ball and the sports hall are the Environment or the Observation space, your top throw is the Action, the possible ways of doing it are the Action space, your eyes are the Interpreter, and the dopamine dose is your Reward Function. Your throwing technique is called the RL Policy; thus, adjusting the RL Policy is the main goal of Reinforcement Learning.

Difference between ML and RL training process

In math class, your probable steps will be as follows:

1. Listen to the teacher's lectures and have a series of seminars with lots of examples (question/task -> correct answer).

2. When you're at home, you'll read and study more about the given topic.

3. As a final requirement, for each topic, you'll try to solve training examples: give an answer to yourself, check the correct answer, and if you answered wrong, you would put more effort into solving similar examples.

This type of learning is doable in the sense that for each topic in math there are only a few correct ways to reach a solution, and the solution is always predefined. In ML, we can define the total number of your mistakes as the loss function, and the actual way you learn as the optimization algorithm.

solution meme

For PE classes, as a volleyball player, you'll definitely need to try to throw a ball (this is called an action) and collect feedback from the surrounding environment: check whether your throw was accurate enough and whether the ball had good momentum. If not, you'll try to throw in a different way and find another technique.

When your throw is strong and accurate, you will be enthusiastic, your brain will release some dopamine, and maybe even your coach will say something like "Keep it up champ!" So now you'll try to repeat this throwing technique. In RL this is called a positive reward.

Otherwise, if you miss or don't even hit the ball, you'll be upset or even angry, and you'll try to change the way you hit. In RL this is called a negative reward.

For model training, you need to wrap your reward intuition into the so-called reward function.

Let's break down the intuition: because ball throwing is highly stochastic and depends on a lot of factors that can't be predefined or precalculated, you simply can't answer the question "Is your technique good enough?" in the moment in any way other than by iteratively throwing the ball and constantly getting reinforcement feedback. This is the way to go.
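The "throw, get feedback, adjust" loop can be sketched as a tiny epsilon-greedy agent. This is a deliberately simplified assumption on our side: three hypothetical throwing techniques with made-up success rates, a reward function that returns +1 for a good throw and -1 for a miss, and a policy that mostly repeats the technique with the best average reward so far.

```python
import random

def practice_throws(success_rates, n_throws=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy practice: estimate the average reward of each technique."""
    rng = random.Random(seed)
    counts = [0] * len(success_rates)    # how often each technique was tried
    values = [0.0] * len(success_rates)  # running average reward per technique
    for _ in range(n_throws):
        if rng.random() < epsilon:
            # Explore: try a random technique.
            action = rng.randrange(len(success_rates))
        else:
            # Exploit: repeat the technique that looks best so far.
            action = max(range(len(values)), key=values.__getitem__)
        # Reward function: +1 for a good throw, -1 for a miss.
        reward = 1.0 if rng.random() < success_rates[action] else -1.0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return values, counts

# Hypothetical success rates of three throwing techniques.
values, counts = practice_throws([0.2, 0.5, 0.8])
best_technique = max(range(len(values)), key=values.__getitem__)
```

Nothing tells the agent in advance which technique is best; it only finds out by throwing again and again, which is the point of the paragraph above.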

Training data and environment

For math, the training data and environment are a teacher, a textbook with theory and many examples with correct answers, paper, and a pen.

In the case of Supervised Learning, it can require some pretraining data with easier subtasks and a lot of training examples.

E.g. if we are training the NER model using GPT3 as an embedding (why not):

1. GPT3 was pre-trained to predict the next word

2. We can also pre-train our model on text classification tasks (why not)

3. And then our model will be fine-tuned for the Named Entity Recognition task

In the case of volleyball classes, it requires a coach (to give you some pre-trained wisdom about a technique), a training space, and a ball (as part of the environment).


  • The game industry uses RL to improve the AI behind NPCs' behavior. Below you can see how to use RL to automate PACMAN.

Yet one of the most impressive cases in this industry remains Google's AI program AlphaGo, which was trained using RL and has beaten top human players at the board game Go.

  • Robotic manipulation training is one of the classical RL use cases. Instead of implementing huge systems of differential equations, you can now train a robot to accurately manipulate real-world objects. DeepMind.AI implemented an RL framework for this purpose.
  • Autonomous driving tasks such as trajectory optimization, motion planning, dynamic pathing, controller optimization, and scenario-based learning policies for highways are ideal tasks for RL, mainly because lots of aspects need to be considered, including speed limits at various places, drivable zones, collision avoidance, etc. You can continue reading about the challenges of autonomous driving in our blog.
  • Marketing teams don't even suspect that ChatGPT and GPT4 are the first RL-based Artificial Intelligence tools they used in their work. Digital companies invest money in Artificial Intelligence systems that manage various marketing campaigns which increase display ad impressions in real-time and ROI. This is the perfect place to use RL as reinforcement feedback is natural: the user clicks on the hyperlink to a product or a service webpage.
  • Text generation tasks such as conversation (see dialogue agents) or code generation for different programming languages can also be improved with RL. For example, for code generation tasks, the reinforcement feedback can be code correctness and execution time, whereas for machine imitation of chit-chat it can be dialogue depth. Let's see how we can use RL to improve text generation.

What does "human feedback" mean?

rlhf vs gpt3

Let's try to understand RLHF. In the original RL use cases, the reward function is implemented programmatically rather than derived from natural feedback. This can make the reward values misleading and push the model into a huge number of iterations (hundreds of billions), which is close to just brute-forcing parameters.

Imagine you train a robot (aka AI agent) to do a somersault or backflip (aka Action) from scratch. You implement the goal function so that it gives a positive reward only if the robot gets really close to a somersault. But there are also a lot of intermediate steps that need to be taken before reaching the final result.

human feedback rlhf
Image source

And if your reward model's responses do not distinguish the first two steps, your model will probably keep trying random things until the first two steps happen to lead to the third one. This itself is the main problem.

Fortunately, OpenAI researchers came up with the idea that the reward model can be implemented as yet another machine learning model which was trained on user feedback to estimate approximate human preferences. In the case of a robot that tries to do a somersault, we can simply show users two different model results, and the user can choose for themselves which result is better: left, right, or if both are the same.

This approach demonstrated good efficiency: the backflip video required less than an hour of a human evaluator's time. With under 1,000 bits of human feedback, the agent accumulated about 70 hours of overall experience in the background. The main point here: RLHF models are more robust than RL models because they tune a reward model instead of relying on a hand-implemented reward function.
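To show what "a reward model trained on user feedback" can mean in the simplest case, here is a sketch that fits a linear reward function to pairwise choices with a logistic (Bradley-Terry style) objective. Everything specific here is an assumption for illustration: the two hand-made features per attempt, the toy comparisons, and the plain gradient ascent. Real reward models are neural networks, not two weights.

```python
import math

def reward(weights, features):
    """Linear reward: a weighted sum of the attempt's features."""
    return sum(w * x for w, x in zip(weights, features))

def train_reward_model(comparisons, n_features, lr=0.1, epochs=200):
    """Fit weights so that preferred attempts score higher than rejected ones."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            p_correct = 1.0 / (1.0 + math.exp(-margin))
            # Gradient ascent on log(sigmoid(margin)).
            for i in range(n_features):
                weights[i] += lr * (1.0 - p_correct) * (preferred[i] - rejected[i])
    return weights

# Hypothetical features per attempt: [fraction of rotation completed, landed upright].
# Each pair is (attempt the human preferred, attempt the human rejected).
comparisons = [
    ([0.9, 1.0], [0.4, 0.0]),
    ([0.7, 1.0], [0.8, 0.0]),
    ([1.0, 1.0], [0.2, 1.0]),
]
weights = train_reward_model(comparisons, n_features=2)

full_backflip = reward(weights, [1.0, 1.0])
barely_moved = reward(weights, [0.1, 0.0])
```

After training, the learned reward can score attempts no human has looked at, which is what lets the agent keep practicing in the background with very little human time.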

Problems and limitations of GPT3

limitations of gpt3

GPT3 is a powerful language model, and it can even hold conversations to some extent. However, this does not mean that it has no limitations. GPT3 was trained to predict the next word on a large dataset of Internet text, and we all know what forums and comments can look like, what phrases are used, and what topics are discussed.

We can view GPT3 as a "good" student who reproduces all those things, so we need to be mindful that it can also generate untruthful, toxic, or harmful outputs. As it is pre-trained on social media posts, GPT3 is weak in narrow domains, where it can hardly generate texts with meaningful information. Moreover, GPT3 doesn't have a continuous training loop, so it can't be constantly improved.

What is human feedback for training GPT? What is a reward model?

human feedback for training gpt
Image source

It is no secret that we want GPT3 to generate meaningful responses and avoid nonsensical, harmful, or deceitful answers, as well as to decline inappropriate requests. To train it, we still need large amounts of open-source data, which does not guarantee the absence of hate speech or toxic language. However, we can go another way: we can use reinforcement learning from human feedback to directly optimize ChatGPT (a tuned GPT3) to follow human preferences. How, you may wonder?

1. Pre-train GPT3 (Initial Model) on public datasets.

pre train gpt3
Image source

2. Train a scalar reward model: First of all, you need to collect the Prompts Dataset (questions). Then you need to create data labeling instructions that contain rules about toxic language. Next, you need to generate responses using Initial GPT3, for example, 2 for each question. Finally, collect user feedback using a data labeling workforce.

pre train a scalar reward model
Image source

Human annotators rank the generated text outputs from the LM. The most confusing part is assigning an absolute score (e.g. from 1 to 10) to each piece of generated text, and that might not be the best idea: the differing values of humans can make these scores uncalibrated and noisy. Thus, it's more suitable to compare different texts generated by several Initial GPT3 models for the same prompt, so that human evaluators not only score each text but also select the best option.

prompts dataset model
Image source

Moreover, it's better to narrow the choice for data labelers down to only two variants, and then add one extra option: "Rank prompt itself". This strategy can help collect a much better-regularized dataset. To turn these pairwise choices into ranks of models, you can use, for example, the Elo or Harkness rating systems, which are used to determine the strength of chess players in a series of head-to-head matchups.
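As an illustration of how head-to-head choices become ranks, here is a minimal Elo update. The model names, the starting rating of 1000, the K-factor of 32, and the list of labeler decisions are all hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(ratings, winner, loser, k=32.0):
    """Shift both ratings by how surprising the outcome was."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - exp_win)
    ratings[winner] += delta
    ratings[loser] -= delta

# Each tuple is one labeler decision: (preferred model's output, rejected model's output).
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
matchups = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]
for winner, loser in matchups:
    update_elo(ratings, winner, loser)

ranking = sorted(ratings, key=ratings.get, reverse=True)
```

With these four decisions the ranking comes out as model_a, model_b, model_c, and the sum of all ratings stays constant because every update moves the same number of points from the loser to the winner.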

We invite you to collect your human feedback labeled data using SuperAnnotate and our RLHF Editor:

rlhf editor superannotate

3. Organize GPT3 fine-tuning with Reinforcement Learning:

So, we have the Initial Language model trained on open datasets (if your dataset was large enough you can even use this model for few-shot fine-tuning on narrow domains), and the reward model trained on human feedback data annotation. Let's try to connect them together.

From now on, we will call the initial language model Initial GPT3, and its modified version will just be GPT3.

First of all, fine-tuning here means not training all model parameters (GPT3 has 175 billion), because that's too expensive. This means that we need to "freeze" (not change) some layers of Initial GPT3. This modified version of Initial GPT3 is called the Reinforcement Learning Policy. After this, we pass the user prompt to GPT3 and then pass the generated text to the reward model, which returns a scalar value of "preferability". We can then use this value to update the non-frozen layers of GPT3.

The problem here is that after several iterations of such tuning, our GPT3 can start to generate "tricky" content that is most probably gibberish but still gets a high reward. To avoid this situation, we can define a penalty for our GPT3 based on the difference between the text generated by GPT3 and the text generated by Initial GPT3. In this case, the final reward is the original reward minus the penalty.
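The "reward minus penalty" idea can be sketched in a few lines. We assume here that the difference between the two models is measured as the Kullback-Leibler (KL) divergence between their next-token probability distributions, which is a common choice in RLHF implementations but is our assumption, not something stated above. The toy three-token vocabulary and the `beta` coefficient are also made up for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_reward(reward_score, policy_probs, initial_probs, beta=0.1):
    """Final reward = reward model score - beta * drift from the initial model."""
    return reward_score - beta * kl_divergence(policy_probs, initial_probs)

# Toy next-token distributions over a three-token vocabulary.
initial = [0.5, 0.3, 0.2]      # Initial GPT3
unchanged = [0.5, 0.3, 0.2]    # tuned GPT3 that has not drifted
drifted = [0.98, 0.01, 0.01]   # tuned GPT3 that collapsed onto one token

r_unchanged = penalized_reward(1.0, unchanged, initial)
r_drifted = penalized_reward(1.0, drifted, initial)
```

Both outputs got the same score from the reward model, but the drifted policy ends up with a lower final reward, so gaming the reward model by drifting far from the initial model pays off less.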

difference between the original reward and the penalty
Image source

The last step is to run several iterations of such a tuning process and finally get ChatGPT: the fine-tuned version of Initial GPT 3.

4. Develop a pretty interface and use fine-tuned GPT 3 as a chatbot.

Learning from human feedback services

Before you start to read this section, you should know one thing: RLHF is a trending topic. That's why most of the reviewed services have just started the development of this functionality. Keeping this information in mind, you should know that the documentation of these features is poorly defined.

You can find a detailed overview of data labeling tools here. For now, let's have a brief overview.

  • SAMA is famous for working as an OpenAI outsource partner for ChatGPT data labeling tasks.
  • Scale AI Rapid is a service that provides functionality for hiring annotation teams and creating pipelines to annotate data. Scale Rapid has its own RLHF workflow. It is made up of three steps: defining validation rules, writing instructions (including annotation workflow, rules, and good and bad examples), and selecting a labeling source. According to their website's promotional materials, they also allow you to connect your own text generation model.
  • Labelbox provides an out-of-the-box UI for RLHF data labeling with single- and multiple-output scoring. According to their documentation, Labelbox allows you to create your own custom RLHF interface.
  • Label Studio is a highly customizable open-source data labeling tool (12k+ stars on GitHub). You can define your own RLHF annotation interface using its XML-like tags.
  • Surge AI is a newcomer in this area that focuses only on data labeling for NLP tasks.
  • And... SuperAnnotate: As we mentioned earlier in this article, we also provide RLHF labeling functionality. You can use SuperAnnotate's platform to achieve the highest annotations quality in the shortest time. You can bring your own annotation team and manage the whole labeling process using our collaboration and quality management tools, or you can rely on our highly qualified service team.

Short overview

Let's quickly summarize the initial steps to train your own ChatGPT:

1. "Garbage in, garbage out" paradigm - carefully collect open-source datasets for Initial Language Model training.

2. Use data curating tools to remove duplicates and irrelevant examples. You should also use them to understand the most effective order in which to pass training texts to the Initial Language Model.

3. Hire a data annotation team or use the SuperAnnotate annotation service for it. The main advice here is: if you want to create domain/culture-specific LLM, it's better to hire annotators that are experts in the given domain or are deep in the cultural context (aka SuperAnnotate).

4. Develop a data annotation instruction that clearly describes how to work with bad prompts, toxic content, and harmful model responses. By using SuperAnnotate's Annotation service, our AI solutions team will cover that for you.

5. Start the annotation process (e.g. use the SuperAnnotate Human Feedback Editor).

6. Train the reward model (aka reward function approximation or human preferences predictor) using annotated data.

7. Fine-tune Initial Language Model using RLHF.

8. Use active learning.

If you ask me, I believe it's better to work iteratively in batches and tune the reward model along the way, rather than wait for the whole dataset to be annotated. Once you get 1,000 to 10,000 annotated examples, you can tune the reward model, use the new model version to pre-annotate the next batch, and select the most informative examples for the reward model.
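One simple way to "select the most informative examples" is uncertainty sampling: send to annotators the response pairs where the current reward model is least sure which response is better. The sketch below assumes exactly that strategy, and uses a made-up stand-in reward function (longer response, higher score) purely so the example runs on its own.

```python
import math

def preference_probability(score_a, score_b):
    """Probability that response A is preferred over B, given reward scores."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))

def select_most_informative(candidates, reward_fn, batch_size):
    """Pick the pairs whose predicted preference is closest to a coin flip."""
    def distance_from_coin_flip(pair):
        p = preference_probability(reward_fn(pair[0]), reward_fn(pair[1]))
        return abs(p - 0.5)
    return sorted(candidates, key=distance_from_coin_flip)[:batch_size]

def toy_reward(text):
    """Stand-in for the current reward model (hypothetical): 0.5 per word."""
    return 0.5 * len(text.split())

# Unlabeled pairs of responses to the same prompt (toy data).
pairs = [
    ("a b c d e f", "a"),
    ("a b", "a b c"),
    ("x y", "p q"),
]
next_batch = select_most_informative(pairs, toy_reward, batch_size=2)
```

Here the pair with equal scores is picked first and the pair with a clear winner is left out, since a human label on it would teach the reward model very little.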

active learning loop

9. The final step here is: writing a blog post with the title "Introducing ChatGPT" and enjoying tons of news articles about it!

If you made it this far, congratulations! This is it. This is the way you can train your own ChatGPT.

Dmitriy Konyrev

Machine Learning Team Manager at SuperAnnotate
