The Tülu Suite of Fine-Tuned Large Language Models (LLMs): A Look at Instruction-Tuning Language Models

One of the better instances of Large Language Models (LLMs) that have recently been released is the well-known ChatGPT created by OpenAI. LLMs like ChatGPT have revolutionized the world with their unrivaled potential and capacity to mimic humans in a variety of jobs. To help the model develop the habit of carrying out some typical tasks, these models have generally embraced instruction fine-tuning. Using supervised input and output pairings that can be extracted from other models, this method trains the models. 

Various open instruction-following datasets are being used for the current advancements in instruction-tuning language models. Though open models can compete with cutting-edge proprietary models, these assertions are frequently only backed by a restricted evaluation, which makes it difficult to compare models in-depth and determine the value of various resources. To address this, a team of researchers from the Allen Institute for AI and the University of Washington has introduced a wide range of instruction-tuned models with parameter sizes ranging from 6.7 billion to 65 billion.

These models are trained on 12 instruction datasets ranging from synthetic and distilled datasets like Alpaca to hand-curated datasets like OpenAssistant. The models are carefully tested in a variety of areas, including reasoning, multilingualism, coding, factual knowledge, and open-ended instruction-following skills. In order to provide a thorough study, the evaluation is carried out utilizing a collection of automatic, model-based, and human-based metrics.

The team has also introduced TÜLU, which is a suite of large language models fine-tuned on a combination of data sources. These models are fine-tuned using a combination of high-quality open resources. The team has examined the performance of various instruction-tuning datasets and their effect on particular skills through various evaluations. They discovered that different datasets could reveal or improve particular skills and that neither a single dataset nor a set of datasets offers the highest performance across all evaluations.

The team has mentioned that an interesting finding from the research is that benchmark-based evaluations fail to capture differences in model capabilities that are shown by model comparisons. The best model in any given evaluation averaged 83% of ChatGPT’s performance and 68% of GPT-4’s performance. The team has stated that TÜLU, with 65 billion parameters, is the largest publicly-released, fully-instruction tuned variant, trained on seven popular available datasets. It has achieved the best average performance while staying within 15% of the best-performing model on each individual task.

Some of the key contributions mentioned in the research paper are – 

  1. Specific domain and capability-specific instruction datasets are very successful at enhancing model performance.
  1. Larger or pre-trained-for-longer base models consistently perform better after instruction tuning.
  1. The best average performance across benchmarks was attained by TÜLU, the fine-tuned LLaMa on a mixture of existing instruction datasets, although it is not the best when comparing various evaluation settings separately.
  1. Even a very big 65B parameter model that has been optimized on a huge variety of instruction datasets falls short of ChatGPT, although it outperforms comparable smaller models by a significant margin.
  1. Strong correlations between model-based preference evaluation on open-ended instruction following and the typical number of unique tokens produced by a model indicate that model-based preference evaluation contains biases that may mask variations in model capabilities.