Meet RedPajama: An AI Project to Create Fully Open-Source Large Language Models Beginning with the Release of a 1.2 Trillion Token Dataset

The most advanced foundation models for AI are either closed behind commercial APIs or only partially open-source. This restricts their use and limits research and customization. A project called RedPajama now aims to create leading, fully open-source models, and its first step, reproducing the LLaMA training dataset, has been completed. Open-source models have made significant progress recently, and AI is experiencing a moment similar to the Linux movement. Stable Diffusion demonstrated that open-source models can compete with commercial offerings and encourage creativity through community participation. A similar movement has now emerged around large language models, with the release of semi-open models such as LLaMA, Alpaca, Vicuna, and Koala, as well as fully open models like Pythia, OpenChatKit, Open Assistant, and Dolly.

RedPajama is a collaborative effort between several institutions, including Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, MILA Québec AI Institute, and Together. The project aims to develop a reproducible, fully open, leading language model with three key components: pre-training data, base models, and instruction-tuning data and models. The project recently released the first component: the pre-training data, a fully open 1.2 trillion token dataset that follows the recipe described in the LLaMA paper. The starting point for RedPajama is LLaMA, the leading open base model suite. LLaMA was trained on a large dataset that was carefully filtered for quality, and its 7 billion parameter model was trained for longer to get the best quality at that model size. However, LLaMA and its derivatives are only available for non-commercial research purposes. RedPajama aims to produce a fully open-source reproduction of LLaMA that can be used in commercial applications and that provides a more transparent pipeline for research.

The RedPajama dataset is available for download on Hugging Face in two forms: the full 1.2 trillion token dataset and a smaller random sample. The dataset comprises seven data slices: CommonCrawl, C4, GitHub, arXiv, Books, Wikipedia, and StackExchange. Each slice has undergone careful pre-processing and filtering, and the quality filters were tuned to approximate the number of tokens reported by Meta AI in the LLaMA paper. The CommonCrawl slices were processed using the CCNet pipeline and filtered with a linear classifier to select pages resembling Wikipedia. The GitHub data was filtered by license and quality, while the arXiv data consists of scientific articles with boilerplate removed. The Books data was deduplicated by content similarity, the Wikipedia subset had boilerplate removed, and the StackExchange subset covers a selection of popular StackExchange sites, also with boilerplate removed. The full dataset is approximately 5TB unzipped on disk and can be downloaded compressed at 3TB.
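To make the idea of deduplicating by content similarity more concrete, here is a minimal Python sketch. The article does not say which similarity measure the Books slice actually used, so this example assumes Jaccard similarity over word shingles purely for illustration. The function names (shingles, jaccard, deduplicate) and the 0.8 threshold are hypothetical and are not taken from the RedPajama code.

```python
from __future__ import annotations


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle; empty texts have none.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if it is below the similarity threshold
    against every document already kept (assumed threshold, illustration only)."""
    kept: list[str] = []
    kept_shingles: list[set] = []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


if __name__ == "__main__":
    books = [
        "the quick brown fox jumps over the lazy dog",
        "The quick brown fox jumps over the lazy dog",  # same text, different case
        "an entirely different sentence about language models",
    ]
    for text in deduplicate(books):
        print(text)
```

This pairwise comparison is quadratic in the number of documents, so it only suits small collections. At the scale of a pre-training corpus, a hashing-based approximation of the same idea would likely be needed.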

The RedPajama project is collaborating with the Meerkat project to release a Meerkat dashboard and embeddings for interactive analysis of the GitHub subset of the corpus. Installation and usage instructions can be found on GitHub. Having reproduced the pre-training data, the project's next step is to train a strong base model. The project is supported by the Oak Ridge Leadership Computing Facility through the INCITE program, with a full suite of models set to become available soon. The team is excited to instruction-tune the models, inspired by the success of Alpaca with just 50,000 high-quality, diverse instructions. The team has received hundreds of thousands of natural user instructions via OpenChatKit, which will be used to release instruction-tuned versions of the RedPajama models.


Niharika is a technical consulting intern at Marktechpost. She is a third-year undergraduate, currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a highly enthusiastic individual with a keen interest in machine learning, data science, and AI, and an avid reader of the latest developments in these fields.