Reducing Sycophancy in Large Language Models (LLMs) Using Simple Synthetic Data


Large Language Models (LLMs) have developed significantly in recent years and are now capable of handling challenging tasks that call for reasoning. A number of researches, including those by OpenAI and Google, have emphasized a lot on these developments. LLMs have revolutionized the way humans interact with machines and is one of the greatest advancements in the field of Artificial Intelligence (AI). Researchers have been putting in efforts to research the phenomena of sycophancy, which is the term for an unfavorable behavior shown by language models in which these models modify their responses to coincide with the viewpoint of a human user, even when that viewpoint is not objectively right.

The behavior can involve a model adopting liberal beliefs just because a user self-identifies as liberal. Research has been done on emphasizing and examining the frequency of sycophancy inside language models and suggesting a reasonably simple synthetic-data-based strategy to curtail this behavior. To address that, a team of researchers from Google DeepMind has examined three different sycophancy tasks to examine the sycophancy phenomenon. These assignments entail asking models for their thoughts on topics for which there isn’t a single, undeniable right or wrong response, including those pertaining to politics.

The analysis has revealed an interesting pattern: in PaLM models, which can have up to 540 billion parameters, both the model’s size and the practice of instruction adjusting significantly boost sycophantic behavior. By analyzing the same behavior in the setting of simple addition statements, the research has gone beyond the basic scope of sycophancy tasks and has added a new dimension. Despite the fact that these added claims are intentionally inaccurate, language models have shown a propensity to agree with them when users signal their agreement. This finding highlights how persistent sycophancy may be, even when models are aware of their own shortcomings.

The research has presented a relatively straightforward but successful technique centered on synthetic data intervention to address the problem of sycophancy. This intervention makes use of Natural Language Processing (NLP) activities in these tasks to strengthen the model’s resistance to user opinions that are freely accessible to the public. A notable decrease in sycophantic behavior has been accomplished by incorporating this synthetic data through a quick fine-tuning procedure, especially when tested on novel cues.

The findings have been summarized as follows ?

  1. Model size and instruction tuning increase sycophancy – Models that were instruction-tuned or had more parameters were more likely to replicate a simulated user’s perspective when asked for opinions on topics without definitive answers, including politics.
  1. Models may be complacent about incorrect responses – When there is no user opinion, models accurately disagree with wildly incorrect claims, such as 1 + 1 = 956446. Models also switch their previously accurate responses to follow the user if they agree with the user incorrectly.
  1. Sycophancy can be decreased with a straightforward synthetic-data intervention, which can improve models on prompts where a claim’s truthfulness is unrelated to the user’s perception of it.

In conclusion, this approach addressed the issue of a language model repeating a user’s opinion, even when that opinion is wrong. Fine-tuning using simple synthetic data has been shown to reduce this trait.