Chain-of-Thought Collection: An Instruction Dataset that Improves Zero-shot and Few-Shot Language Model Learning

Large language models (LLMs) are showing stronger capabilities with every new release. Built on natural language processing, these models are enabling a new era of human-machine interaction. From supporting medical research and transforming customer service to content generation and language translation, LLMs are being put to use across many fields. With the addition of Chain-of-Thought (CoT) reasoning, these models have shown improved performance and better reasoning abilities.

Chain-of-Thought reasoning is a method for getting language models to perform better on logical, arithmetic, and symbolic reasoning problems. CoT reasoning entails a logical progression of intermediate steps, each one building on the one before it. The model generates these steps itself, so that each piece of the response follows logically and reliably from the previous one.
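To make the idea concrete, here is a minimal Python sketch of how a CoT prompt differs from a direct prompt, and how a final answer might be pulled out of a generated rationale. None of this comes from the paper: the trigger phrase "Let's think step by step." is the widely used zero-shot CoT cue, while the function names, the prompt layout, the sample rationale, and the "The answer is ..." convention are assumptions made purely for illustration.

```python
import re
from typing import Optional

# Widely used zero-shot CoT cue; not taken from the CoT Collection paper.
COT_TRIGGER = "Let's think step by step."


def build_direct_prompt(question: str) -> str:
    """Ask for the answer directly, with no intermediate reasoning."""
    return f"Q: {question}\nA:"


def build_cot_prompt(question: str) -> str:
    """Ask the model to write out its reasoning before the answer."""
    return f"Q: {question}\nA: {COT_TRIGGER}"


def extract_answer(rationale: str) -> Optional[str]:
    """Return the text after 'answer is' at the end of a rationale.

    Assumes (hypothetically) that the rationale closes with a phrase
    such as 'The answer is 4.'; returns None when no such phrase exists.
    """
    match = re.search(r"answer is\s+(.+?)\.?\s*$", rationale.strip(), flags=re.IGNORECASE)
    return match.group(1) if match else None


# A hand-written rationale standing in for model output.
example_rationale = (
    "Tom starts with 3 apples. He buys 2 more, so 3 + 2 = 5. "
    "He eats 1, so 5 - 1 = 4. The answer is 4."
)

if __name__ == "__main__":
    print(build_cot_prompt("Tom has 3 apples, buys 2 more, and eats 1. How many are left?"))
    print(extract_answer(example_rationale))
```

Each sentence in the sample rationale builds on the one before it, which is the step-by-step structure described above.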

LLMs with a high number of parameters have demonstrated enhanced capabilities for solving new tasks by employing this step-by-step CoT reasoning. The question is whether similar reasoning abilities can be instilled in LMs with fewer than 100 billion parameters. To address this, a team of researchers has introduced a new dataset called the COT COLLECTION, which is designed for instruction tuning. The dataset includes 1.88 million CoT rationales across 1,060 tasks.

The team has thoroughly examined the quality and diversity of the COT COLLECTION and reports that its rationales are reliable, logically coherent, and informative when compared with human-authored CoT rationales. They have also introduced the C2F2 model, which is obtained by continually fine-tuning Flan-T5 LMs with 3B and 11B parameters on the COT COLLECTION. Fine-tuning with the COT COLLECTION improved zero-shot CoT performance on unseen tasks.
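As a rough illustration of what CoT instruction tuning involves, the Python sketch below turns one record into a source/target pair for a sequence-to-sequence model such as Flan-T5, with the rationale placed before the answer in the target. The field names, the `CoTExample` class, the `to_seq2seq_pair` helper, and the "So the answer is" template are all hypothetical; the article does not describe the dataset's actual schema or the authors' training format.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical template joining the rationale to the final answer.
ANSWER_PREFIX = "So the answer is"


@dataclass
class CoTExample:
    """One illustrative training record; field names are assumptions."""
    instruction: str
    question: str
    rationale: str
    answer: str


def to_seq2seq_pair(example: CoTExample) -> Tuple[str, str]:
    """Build (source, target) text for an encoder-decoder model.

    The target contains the rationale first and the answer last, so the
    model is trained to generate its reasoning before committing to an answer.
    """
    source = f"{example.instruction}\n\n{example.question}"
    target = f"{example.rationale} {ANSWER_PREFIX} {example.answer}."
    return source, target


# A made-up record used only to show the formatting.
sample = CoTExample(
    instruction="Answer the question and explain your reasoning.",
    question="Is 17 a prime number?",
    rationale="17 is not divisible by 2, 3, or any other integer between 2 and 4.",
    answer="yes",
)

if __name__ == "__main__":
    src, tgt = to_seq2seq_pair(sample)
    print(src)
    print(tgt)
```

In a real pipeline these pairs would then be tokenized and passed to the fine-tuning loop; that part is omitted here because the article gives no details about it.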

The research paper also reports how well C2F2 performs in few-shot settings, where the model learns from only a limited number of examples. Compared to direct fine-tuning of Flan-T5, parameter-efficient fine-tuning (PEFT) on C2F2 shows performance gains on domain-specific datasets from the legal and medical domains. The authors have also emphasized the advantages of utilizing CoT rationales to improve task generalization and encourage further research.

The researchers evaluated the average zero-shot accuracy on 27 datasets of the BIG-Bench-Hard benchmark to gauge the improvement after utilizing the COT COLLECTION. The accuracy of the 3B and 11B LMs increased by +4.34% and +2.44%, respectively. Additionally, CoT instruction tuning improved the language models' few-shot learning capabilities. In comparison to Flan-T5 LMs (3B and 11B), this yielded improvements of +2.97% and +2.37%, respectively, on four domain-specific tasks.

The COT COLLECTION includes nearly 52 times more CoT rationales and approximately 177 times more tasks than previously available CoT datasets. In conclusion, the COT COLLECTION dataset illustrates the effectiveness of CoT rationales for increasing task generalization in LMs in zero-shot and few-shot learning settings. It addresses the challenges of using CoT reasoning in smaller language models. The team has provided access to the COT COLLECTION dataset and the trained models on the GitHub repository.