Exploring The Differences Between ChatGPT/GPT-4 and Traditional Language Models: The Impact of Reinforcement Learning from Human Feedback (RLHF)

GPT-4 has been released, and it is already making headlines. It is the technology behind OpenAI's popular ChatGPT, which can generate text and answer questions in a human-like way. After the success of GPT-3.5, GPT-4 is the latest milestone in scaling up deep learning and generative Artificial Intelligence. Unlike its predecessor, GPT-3.5, which only lets ChatGPT accept textual inputs, GPT-4 is multimodal: it accepts both text and images as input. GPT-4 is a transformer model that has been pretrained to predict the next token on public data as well as data licensed from third-party providers, and it has been fine-tuned using reinforcement learning from human and AI feedback.

Here are a few key points from Joris Baan's tweet thread on how models like ChatGPT/GPT-4 differ from traditional language models.

The major reason the latest GPT models differ from traditional ones is the use of Reinforcement Learning from Human Feedback (RLHF) in their training. Traditional language models are trained on a large corpus of text with the objective of predicting the next word in a sentence, or the most likely sequence of words given a description or a prompt. In contrast, RLHF trains the language model using feedback from human evaluators, which serves as a reward signal that measures the quality of the generated text, playing a role similar to automatic evaluation metrics like BERTScore and BARTScore. The language model is then repeatedly updated to improve this reward score.
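To make the contrast concrete, here is a minimal, purely illustrative Python sketch of the two training signals. The functions and numbers are hypothetical placeholders for illustration only, not OpenAI's actual training code.

```python
import numpy as np

def next_token_nll(probs_over_vocab, target_id):
    """Traditional LM objective: negative log-likelihood of the observed next token."""
    return -np.log(probs_over_vocab[target_id])

def rlhf_signal(reward_model_score):
    """RLHF objective: the training signal is not a ground-truth token but a
    scalar reward derived from human preference judgments; the policy is
    updated so that highly rewarded outputs become more likely."""
    return -reward_model_score  # minimizing this maximizes the reward

# Example: the model put 60% of its probability mass on the correct next token...
print(next_token_nll(np.array([0.1, 0.6, 0.3]), target_id=1))   # ~0.51
# ...versus a completion that a (hypothetical) reward model scored at 1.8.
print(rlhf_signal(1.8))                                          # -1.8
```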

A reward model is essentially a language model that has been pretrained on a large amount of text, similar to the base language model used for producing text. Joris gives the example of DeepMind's Sparrow, a language model trained with RLHF that uses three pretrained 70B Chinchilla models: one serves as the base language model for text generation, while the other two act as separate reward models for the evaluation process.
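As a rough, hypothetical sketch of this setup (Sparrow's real architecture and scale are of course very different), a reward model can be pictured as an LM backbone with a scalar scoring head attached; everything below is a toy stand-in for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; a 70B Chinchilla model is vastly larger.
VOCAB_SIZE, HIDDEN = 100, 16
embedding_table = rng.standard_normal((VOCAB_SIZE, HIDDEN))
reward_head = rng.standard_normal(HIDDEN)   # learned linear layer -> single scalar

def toy_backbone(token_ids):
    """Stand-in for the pretrained transformer: one hidden vector per token."""
    return embedding_table[np.asarray(token_ids)]

def reward_model_score(token_ids):
    """Score a tokenized (prompt + response) pair with a single scalar."""
    hidden = toy_backbone(token_ids)
    return float(hidden[-1] @ reward_head)   # read the score off the final token

print(reward_model_score([3, 17, 42, 8]))
```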

In RLHF, data is collected by asking human annotators to choose the best of several generated texts for a given prompt; these choices are then converted into a scalar preference signal that is used to train the reward model. The reward function combines the evaluation from one or more reward models with a policy-shift constraint, a penalty on the KL-divergence between the output distributions of the original policy and the current policy, which keeps the optimized model close to the original and helps avoid overfitting to the reward. The policy is simply the language model that produces text and is progressively optimized to generate higher-quality output. Proximal Policy Optimization (PPO), a reinforcement learning (RL) algorithm, is used to update the parameters of the current policy.
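The toy snippet below sketches these two ingredients: turning a pairwise human choice into a training signal for the reward model (assuming a Bradley-Terry-style loss) and combining the reward model's score with a KL penalty. The loss form and the coefficient are assumptions for illustration; the exact formulations used for ChatGPT or Sparrow may differ.

```python
import numpy as np

def pairwise_preference_loss(r_chosen, r_rejected):
    """Turn an annotator's choice between two completions into a scalar training
    signal for the reward model: -log sigmoid(r_chosen - r_rejected)."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

def kl_penalized_reward(rm_score, logprobs_current, logprobs_original, beta=0.02):
    """Combine the reward model's score with a penalty on the divergence from the
    original policy, which keeps the optimized model from drifting too far.
    `beta` is an illustrative coefficient, not a published value."""
    # Per-sample KL estimate between current and original policy on these tokens.
    kl_estimate = float(np.sum(np.asarray(logprobs_current) - np.asarray(logprobs_original)))
    return rm_score - beta * kl_estimate

# Reward-model training signal: the chosen completion scored 1.3, the rejected 0.4.
print(pairwise_preference_loss(r_chosen=1.3, r_rejected=0.4))

# Reward passed to PPO for one sampled completion: RM score minus the KL penalty.
print(kl_penalized_reward(rm_score=2.1,
                          logprobs_current=[-1.2, -0.8, -0.5],
                          logprobs_original=[-1.5, -1.4, -1.1]))
```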

Joris Baan also mentions the potential biases and limitations that may arise from collecting human feedback to train the reward model. As highlighted in the InstructGPT paper (the language model trained to follow human instructions), human preferences are not universal and can vary depending on the target community. This implies that the data used to train the reward model can shape the model's behavior and lead to undesired results.

The thread also mentions that decoding algorithms appear to play a smaller role in the training process, and that ancestral sampling, often with temperature scaling, is the default method. This could indicate that the RLHF procedure already steers the generator toward specific decoding behavior during training.
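For reference, ancestral sampling with temperature scaling simply means drawing each next token from the model's (temperature-scaled) output distribution, as in this small self-contained sketch; the logits and temperature value are made up for illustration and are not ChatGPT's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=0.7):
    """Ancestral (multinomial) sampling with temperature scaling."""
    scaled = np.asarray(logits, dtype=float) / temperature  # T < 1 sharpens the distribution
    probs = np.exp(scaled - scaled.max())                   # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)                  # sampled token id

# Example with a tiny four-token vocabulary.
print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.7))
```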

In conclusion, using human preferences to train the reward model and to guide the text generation process is a key difference between reinforcement learning-based language models such as ChatGPT/GPT-4 and traditional language models. It allows the model to generate text that is more likely to be rated highly by humans, resulting in better and more natural-sounding language.


This article is based on this Tweet thread by Joris Baan. All credit for this research goes to the researchers on this project.


Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.