This AI Paper From NVIDIA Provides The Recipe To Reproduce RETRO Up To 9.5B Parameters While Retrieving A Text Corpus With 330B Tokens


Large language models (LMs), such as masked LMs, autoregressive LMs, and encoder-decoder LMs (e.g., BART), have shown cutting-edge results on various NLP problems. Among these, autoregressive LMs like GPT-3 and GPT-4 exhibit notable in-context learning capacity and strong long-form text generation performance. Because of their significance, the community has made great efforts to scale up such autoregressive generative LMs with more data and parameters, resulting in important achievements in real-world applications such as open-ended text generation and numerous downstream tasks.

Successful instances in the public domain include GPT-3, Gopher, Megatron-Turing, and PaLM. Large-scale autoregressive LMs have been quite successful but have several flaws:

  1. They are expensive to deploy because of the many model parameters needed to memorize world knowledge.
  2. It can be challenging to maintain factual correctness, so they may provide users with incorrect information.
  3. Updating the model's knowledge acquired during pretraining with current information is costly, which leads to outdated responses.

One line of research proposes augmenting language models with retrieval to address the issues mentioned above. Retrieval can be incorporated into LMs at the pretraining or fine-tuning stage.

Most prior work augments BERT or encoder-decoder LMs with retrieval during the fine-tuning step, showing results on knowledge-intensive NLP applications. However, pretraining autoregressive LMs with retrieval remains largely unexplored, especially given ChatGPT’s notable performance, which highlights the critical role of autoregressive LMs. RETRO recently proposed pretraining autoregressive LMs with a retrieval module that scales in practice to large-scale pretraining from scratch, retrieving from billions of tokens and significantly reducing model parameters while attaining lower perplexity than standard GPT. It also makes it possible to update the knowledge held in an LM by changing the retrieval database, without retraining the LM.
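The last point, updating knowledge by swapping the retrieval database, can be illustrated with a toy sketch. This is not RETRO's implementation: RETRO retrieves neighbors with a frozen neural encoder and feeds them to the model through cross-attention. The snippet below instead uses a hypothetical bag-of-words `embed` function and cosine similarity, purely to show that the retrieved evidence changes when the database changes and no model weights are touched. All function names and example passages are made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, database, k=1):
    """Return the k passages in `database` most similar to `query`."""
    q = embed(query)
    ranked = sorted(database, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


# Two hypothetical snapshots of the retrieval database.
db_old = ["team alpha won the latest tournament", "bread is baked in an oven"]
db_new = ["team beta won the latest tournament", "bread is baked in an oven"]

query = "Who won the latest tournament"
print(retrieve(query, db_old))  # evidence drawn from the old snapshot
print(retrieve(query, db_new))  # same query, refreshed evidence, no retraining
```

In a retrieval-augmented LM the retrieved passages would condition generation, so replacing `db_old` with `db_new` changes what the model can draw on while its parameters stay fixed.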

To fill this gap, researchers at NVIDIA conducted an extensive study of RETRO because, to the best of their knowledge, RETRO is the only retrieval-augmented autoregressive LM that supports large-scale pretraining with retrieval on massive pretraining corpora containing hundreds of billions or trillions of tokens. Their investigation sheds light on the promising direction of autoregressive LMs with retrieval as future foundation models, as these models outperform standard GPT models in terms of perplexity, text generation quality, and downstream task performance, particularly on knowledge-intensive tasks such as open-domain QA.

In this paper, they present a detailed study of retrieval-augmented LMs to answer the question: should decoder-only LMs be pretrained with retrieval? They observe consistent improvements in text generation quality, factual accuracy, and downstream task accuracy, along with lower toxicity, particularly on knowledge-intensive tasks like open-domain QA. Given that the cost is a 25% increase in GPU hours for pretraining, they believe that pretraining generative language models with retrieval is a viable path. The complete codebase and data have been open-sourced on GitHub.




Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.