LIMA: A New LLaMA Model With 65B Parameters, Fine-Tuned Using 1000 Carefully Selected Prompts And Responses

By being pretrained to predict the next token at an astoundingly large scale, language models learn general-purpose representations that can be transferred to nearly any language understanding or generation task. To enable this transfer, a variety of language model alignment strategies have been proposed, with a focus on instruction tuning over large datasets with millions of examples and, more recently, reinforcement learning from human feedback (RLHF) gathered over millions of interactions with human annotators. However, existing alignment techniques require large compute and specialized data resources to reach ChatGPT-level performance.

In contrast, the researchers behind LIMA demonstrate that remarkably strong performance can be achieved simply by fine-tuning an already-pretrained language model on 1,000 carefully selected training examples. Their hypothesis is that alignment can be a quick and simple process in which the model learns the style or format for interacting with users, exposing the knowledge and abilities it already acquired during pretraining. To test this hypothesis, they gather 1,000 examples of real user prompts paired with high-quality responses. They curate 750 of the top questions and answers from community sites such as Stack Exchange and wikiHow, selecting for quality and diversity.
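
To make the curation step concrete, here is a minimal sketch (not the authors' actual pipeline) of how highly upvoted question-answer pairs could be filtered for quality and sampled across topics for diversity; the field names and thresholds are hypothetical.

```python
# Minimal sketch of curating top community Q&A for quality and diversity.
# The input is assumed to be a list of dicts with hypothetical fields:
# "title", "body", "answer", "score", and "topic".
import random
from collections import defaultdict

def curate_examples(posts, per_topic=25, min_score=50, max_answer_chars=4000, seed=0):
    """Keep highly upvoted answers, then sample across topics for diversity."""
    by_topic = defaultdict(list)
    for post in posts:
        if post["score"] >= min_score and len(post["answer"]) <= max_answer_chars:
            by_topic[post["topic"]].append(post)

    rng = random.Random(seed)
    curated = []
    for topic, candidates in by_topic.items():
        # Rank by score within a topic, then sample so no single subject dominates.
        candidates.sort(key=lambda p: p["score"], reverse=True)
        pool = candidates[: 4 * per_topic]
        curated.extend(rng.sample(pool, min(per_topic, len(candidates))))

    return [{"prompt": p["title"] + "\n" + p["body"], "response": p["answer"]}
            for p in curated]
```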

In addition, they manually write 250 prompt-response pairs, emphasizing a consistent response style in the vein of an AI assistant and optimizing for task diversity. Researchers from Meta AI, Carnegie Mellon University, the University of Southern California, and Tel Aviv University train LIMA by fine-tuning a pretrained 65B-parameter LLaMA model on this collection of 1,000 examples. They then compare LIMA against contemporary language models and products on 300 challenging test prompts. In a study of human preference, LIMA outperforms OpenAI's DaVinci003, which was trained with RLHF, as well as a 65B-parameter reproduction of Alpaca, which was trained on 52,000 examples.
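
Fine-tuning on such a small, curated set is standard supervised learning with a next-token prediction loss over concatenated prompt-response pairs. The sketch below uses the Hugging Face Trainer with a small stand-in model; the model name, separator format, and hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of supervised fine-tuning on a small curated set of
# prompt-response pairs. A small GPT-2 model stands in for the 65B LLaMA
# so the sketch actually runs; all settings here are illustrative.
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

class PromptResponseDataset(Dataset):
    def __init__(self, pairs, tokenizer, max_len=1024):
        self.examples = []
        for prompt, response in pairs:
            # Concatenate prompt and response into one sequence; the model is
            # trained with the standard next-token prediction loss over it.
            text = prompt + "\n\n" + response + tokenizer.eos_token
            enc = tokenizer(text, truncation=True, max_length=max_len)
            self.examples.append(enc["input_ids"])

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ids = torch.tensor(self.examples[idx])
        return {"input_ids": ids, "labels": ids.clone()}

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for LLaMA
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# In practice this would be the ~1,000 curated prompt-response pairs.
pairs = [("How do I boil an egg?", "Place the egg in boiling water for 7-9 minutes, then cool it in cold water.")]
train_ds = PromptResponseDataset(pairs, tokenizer)

def collate(feats):
    # Pad inputs with the pad token and labels with -100 so padding is ignored in the loss.
    ids = torch.nn.utils.rnn.pad_sequence([f["input_ids"] for f in feats],
                                          batch_first=True, padding_value=tokenizer.pad_token_id)
    labels = torch.nn.utils.rnn.pad_sequence([f["labels"] for f in feats],
                                             batch_first=True, padding_value=-100)
    return {"input_ids": ids, "labels": labels}

args = TrainingArguments(output_dir="lima-sft", num_train_epochs=15,
                         per_device_train_batch_size=1, learning_rate=1e-5)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, data_collator=collate)
trainer.train()
```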

Although humans often prefer GPT-4, Claude, and Bard responses over LIMA's, this is not always the case: LIMA yields equivalent or preferred results in 43%, 46%, and 58% of cases, respectively. Repeating the human preference annotations with GPT-4 as the annotator confirms these findings. When LIMA responses are judged on an absolute scale, 88% meet the prompt's requirements and 50% are rated excellent. Ablation experiments show significant gains from improving data quality and sharply diminishing returns from increasing data quantity without also increasing prompt diversity.
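
Using GPT-4 as a preference annotator amounts to showing it the prompt and two candidate responses and asking which one is better. Below is a rough sketch of such a pairwise judge; the wording of the judging prompt and the use of the OpenAI Python client are assumptions, not the authors' exact protocol.

```python
# Minimal sketch of a pairwise preference judge using GPT-4.
# The judging prompt below is illustrative, not the paper's exact wording.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(prompt, response_a, response_b):
    """Ask the model which of two responses better answers the prompt."""
    instructions = (
        "You are comparing two responses to the same prompt.\n"
        f"Prompt: {prompt}\n\nResponse A: {response_a}\n\nResponse B: {response_b}\n\n"
        "Answer with a single letter: 'A' if A is better, 'B' if B is better, "
        "or 'T' if they are roughly equivalent."
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": instructions}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip()
```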

Furthermore, they find that LIMA can carry on coherent multi-turn dialogue despite having seen no dialogue examples during training, and that this ability can be further improved by including just 30 hand-crafted dialogue chains in the training set. Overall, these striking results demonstrate the power of pretraining and its importance relative to large-scale instruction tuning and reinforcement learning approaches. They show that a strong pretrained language model can be fine-tuned on 1,000 well-chosen examples to produce outstanding, competitive results across a variety of prompts. There are, however, drawbacks to this strategy.

First, the mental effort required to construct such examples is enormous and difficult to scale up. Second, while LIMA typically produces strong responses, an unlucky sample during decoding or an adversarial prompt can often lead to a weak one; LIMA is less robust than product-grade models. Nevertheless, the evidence presented in this work shows that the difficult problem of alignment can be tackled in a straightforward way.