By focusing on input-label mappings, a straightforward fine-tuning technique can enhance in-context learning.


Language models are tuned on input-label pairs presented in-context, where natural language labels are remapped to arbitrary symbols. For a given task, the model must rely on the input-label mappings in context to reason about and reveal the task. In a new research paper, the Google AI team introduces a simple finetuning procedure, which they call Symbol Tuning, that significantly improves a language model's ability to reason with and learn from input-label mappings presented in-context. The research team uses a mixture of 22 NLP datasets with various arbitrary symbols as labels and experiments with multiple Flan-PaLM models.
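To make the setup concrete, here is a minimal sketch of how a symbol-tuning exemplar might be constructed. The function name, prompt layout, and the symbols `"foo"`/`"bar"` are illustrative assumptions, not the paper's exact format: the key idea is simply that natural language labels are replaced by arbitrary symbols, so the task can only be inferred from the in-context input-label mappings.

```python
# Hypothetical sketch of symbol-tuning prompt construction: natural
# language labels (e.g. "positive"/"negative") are remapped to
# arbitrary symbols before being shown in-context.
def build_symbol_tuned_prompt(examples, query, symbols=("foo", "bar")):
    """examples: list of (text, label) pairs with integer labels 0 or 1."""
    lines = [f"Input: {text}\nOutput: {symbols[label]}" for text, label in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

prompt = build_symbol_tuned_prompt(
    [("The movie was great", 1), ("Terrible plot", 0)],
    "I loved it",
)
print(prompt)
```

Because the symbols carry no semantic hint about the task, the model is forced to read the mapping out of the exemplars themselves.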

Symbol tuning improves the performance of baseline models on unseen in-context learning tasks. The models are finetuned on exemplars in which semantically unrelated labels replace natural language labels; because the task is unclear from any single in-context exemplar, multiple exemplars are required to define it. On average, symbol tuning yields a +11.1% performance improvement across eleven evaluation tasks for Flan-cont-PaLM-62B.

The symbol-tuning data contains only natural language, with no numerical or algorithmic content; even so, symbol-tuned models perform better at algorithmic reasoning tasks. To verify this, the researchers experiment with a set of list function tasks, in which the model must identify a transformation function between input and output lists of non-negative integers, and with simple turing concepts, in which the model reasons over binary strings to map an input to an output. They find that symbol tuning results in an average performance improvement across all tasks of 18.2% for Flan-PaLM-8B, 11.1% for Flan-PaLM-62B, 15.5% for Flan-cont-PaLM-62B, and 3.6% for Flan-PaLM-540B.
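As a rough illustration of the list-function setting (the specific transformation below is an assumed example, not one taken from the paper), the model is shown a few input-output list pairs and must infer the underlying rule, such as "remove the last element":

```python
# Hypothetical list-function task: infer the transformation from a few
# input-output pairs of lists of non-negative integers.
def last_element_removed(xs):
    return xs[:-1]

# In-context exemplars a model would see for this task.
in_context = [([0, 4, 7, 2], [0, 4, 7]), ([9, 1], [9])]
for inp, out in in_context:
    assert last_element_removed(inp) == out

# A model that has inferred the rule should map [3, 5, 8] to [3, 5].
print(last_element_removed([3, 5, 8]))  # [3, 5]
```

The evaluation checks whether the model applies the inferred rule to a fresh input, rather than pattern-matching on label semantics.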

Compared to instruction-tuned models, symbol-tuned models are much better at following flipped labels presented in-context. Instruction-tuned models perform well below random guessing because they cannot flip their predictions to follow flipped labels. Symbol tuning, in contrast, forces models to treat the label presented in-context as an arbitrary symbol, reducing the model's reliance on prior knowledge that contradicts the flipped labels. After symbol tuning, the researchers find an average improvement across all datasets of 26.5% for Flan-PaLM-8B, 33.7% for Flan-PaLM-62B, and 34.0% for Flan-PaLM-540B.
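A flipped-label evaluation can be sketched as follows; the label names and helper are illustrative assumptions, but the mechanics match the description above: the in-context labels are inverted relative to the true labels, so a model can only score well by following the context and overriding its prior knowledge.

```python
# Hypothetical sketch of flipped-label exemplar construction.
LABEL_FLIP = {"positive": "negative", "negative": "positive"}

def flip_labels(examples):
    """Invert each exemplar's label relative to its true label."""
    return [(text, LABEL_FLIP[label]) for text, label in examples]

examples = [("Loved every minute", "positive"), ("A dull mess", "negative")]
flipped = flip_labels(examples)
print(flipped)
```

A model that ignores the context and relies on sentiment priors will systematically predict the opposite of the flipped in-context labels, which is why instruction-tuned models score below random guessing on this test.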

Researchers say that symbol tuning does not require many finetuning steps for any model when small datasets are used. Performance remains relatively constant after peaking within the initial 1k to 2k steps. Given this plateau, one can hypothesize that larger models require a larger or more diverse set of symbol-tuning data.

Researchers find that after the initial steps, higher proportions of symbol-tuning data do not affect the model's performance in ICL settings: as long as a non-trivial amount of symbol-tuning data is used, the exact proportion is irrelevant. However, the team found a strong correlation between the two in one respect: the higher the mixture of symbol-tuning data, the more likely the model is to follow flipped labels, which improves its ability to override prior knowledge with in-context exemplars. The method succeeds only if the model generalizes its ability to new tasks from the diverse set of tasks seen during tuning.
