Large Language Models (LLMs) are among the most significant advances in Artificial Intelligence and a flagship application of transformer models. LLMs have come a long way, from generating content and summarizing long passages to completing code and holding human-like conversations. They learn from vast volumes of text in an unsupervised manner, drawing on deep learning and Natural Language Processing to capture the complexity of language. LLMs are transformer-based neural networks whose performance and output quality depend on a large number of parameters. 

Transformer models are used mostly with textual data and have largely supplanted Recurrent Neural Networks. A transformer is divided into two components – an encoder and a decoder. The encoder takes input in the form of tokens and produces a sequence of hidden states; the decoder consumes those hidden states and generates output tokens. The process can be illustrated by translating an English sentence into Spanish: the transformer takes the English sentence as input tokens, then iteratively predicts each successive word of the target language, Spanish in this case. 
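The iterative decoding loop described above can be sketched as follows. Here `encode` and `decode_step` are toy stand-ins for a real transformer's encoder and decoder (assumptions for illustration, not an actual model), mimicking the translation of "Hello world":

```python
def encode(src_tokens):
    # A real encoder maps the input tokens to a sequence of hidden
    # states; this toy stub just echoes the tokens.
    return list(src_tokens)

def decode_step(hidden, generated):
    # A real decoder predicts the next target token from the hidden
    # states and everything generated so far; this stub replays a
    # fixed translation for illustration.
    translation = ["Hola", "mundo", "<eos>"]
    return translation[len(generated) - 1]  # skip the <bos> marker

def translate(src_tokens, max_len=10):
    hidden = encode(src_tokens)
    out = ["<bos>"]
    # Iteratively predict the next token until end-of-sequence.
    while out[-1] != "<eos>" and len(out) <= max_len:
        out.append(decode_step(hidden, out))
    return out[1:-1]  # drop the <bos>/<eos> markers

print(translate(["Hello", "world"]))  # ['Hola', 'mundo']
```

The key point the sketch captures is that decoding is sequential: each output token depends on all tokens generated before it.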

Transformer sampling is typically constrained by memory bandwidth rather than compute. An algorithm called Speculative Sampling (SpS) has been introduced to overcome this limitation and accelerate transformer sampling. Sampling, in this context, refers to drawing each successive token from the model's predicted distribution during decoding. Scaling up parameters has proven significant for improving a model's performance, but in a transformer model the time taken to generate each token is, to a first-order approximation, proportional to the size of the parameters divided by the memory bandwidth of the hardware. 
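That first-order estimate can be made concrete with a back-of-the-envelope calculation (the numbers below are illustrative assumptions, not figures from the paper): generating one token requires streaming every parameter from memory once, which bounds the per-token latency from below.

```python
# Rough lower bound on per-token decoding latency for a
# memory-bandwidth-bound transformer:
#   time_per_token ≈ (num_params * bytes_per_param) / memory_bandwidth

num_params = 70e9        # e.g. a Chinchilla-scale 70B-parameter model
bytes_per_param = 2      # bfloat16 weights (assumed)
bandwidth = 1.0e12       # 1 TB/s accelerator memory bandwidth (assumed)

time_per_token = num_params * bytes_per_param / bandwidth
print(f"~{time_per_token * 1000:.0f} ms per token")  # ~140 ms
```

Under these assumptions, no amount of extra compute helps: the model simply cannot emit a token faster than its weights can be read from memory.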

In Speculative Sampling, the decoding process of the transformer is accelerated by generating several tokens from each call to the target model. The researchers behind the algorithm summarize its working as follows –  

  1. Creating a draft – A short draft of length K is generated by calling a comparatively faster, auto-regressive draft model K times.
  2. Scoring with the target model – The draft is scored using the more powerful target model.
  3. Applying a modified rejection sampling scheme – A subset of the K draft tokens is accepted from left to right in a way that recovers the distribution of the target model.
  4. Generating multiple tokens – When the distributions of the draft and target models agree strongly, multiple tokens are produced per call to the target model.
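The four steps above can be sketched in a minimal NumPy implementation. The draft and target models are replaced here by hypothetical probability functions passed in as arguments; this is an illustrative sketch of the scheme, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(draft_probs, target_probs, prefix, K=4, vocab=8):
    """One speculative-sampling step. `draft_probs(seq)` and
    `target_probs(seq)` are stand-ins returning a distribution over
    the vocabulary for the next token."""
    # 1. Draft: call the cheap model K times auto-regressively.
    draft = list(prefix)
    for _ in range(K):
        p = draft_probs(draft)
        draft.append(int(rng.choice(vocab, p=p)))
    # 2. Score the draft with the target model. (A real implementation
    #    scores all K+1 positions in one parallel call and caches the
    #    draft probabilities; a loop is used here for clarity.)
    out = list(prefix)
    for t in range(K):
        x = draft[len(prefix) + t]
        p = draft_probs(draft[: len(prefix) + t])
        q = target_probs(draft[: len(prefix) + t])
        # 3. Modified rejection: accept x with prob min(1, q[x]/p[x]).
        if rng.random() < min(1.0, q[x] / p[x]):
            out.append(x)
        else:
            # On rejection, resample from max(0, q - p) renormalised,
            # which recovers the target distribution, and stop.
            resid = np.maximum(q - p, 0.0)
            out.append(int(rng.choice(vocab, p=resid / resid.sum())))
            return out
    # 4. All K accepted: sample one extra token from the target model,
    #    so up to K+1 tokens come out of a single target-model call.
    out.append(int(rng.choice(vocab, p=target_probs(out))))
    return out
```

When the draft and target distributions agree closely, most draft tokens are accepted and each target-model call yields several tokens; in the worst case the step still produces at least one token, just like ordinary sampling.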

A traditional transformer model performs sampling using the Autoregressive Sampling (ArS) technique. Autoregressive sampling is a sequential procedure in which only one token is produced per sequence in the batch on each forward pass. Because it is bound by memory bandwidth, it makes inefficient use of hardware accelerators such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs). Unlike this traditional method, Speculative Sampling can produce several tokens every time the target model is called. 
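For contrast, ArS reduces to a simple loop in which the large model is invoked once per generated token. `next_token_probs` below is a hypothetical stand-in for a full forward pass of the model:

```python
import random

def autoregressive_sample(next_token_probs, prefix, n_tokens, vocab=8):
    """Generate n_tokens one at a time: each iteration costs one full
    forward pass of the (large) model, so latency grows linearly with
    the number of tokens produced."""
    seq = list(prefix)
    for _ in range(n_tokens):
        probs = next_token_probs(seq)  # one full-model call per token
        seq.append(random.choices(range(vocab), weights=probs)[0])
    return seq
```

Generating N tokens therefore requires N sequential calls to the large model, which is exactly the cost Speculative Sampling amortizes by letting a cheap draft model propose several tokens per target-model call.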

The researchers also report an empirical comparison between Speculative and Autoregressive Sampling in the paper. For the comparison, the team used the Chinchilla Large Language Model, a 70B-parameter model trained on 1.4 trillion tokens, with both model size and training data scaled compute-optimally. The comparison was run on the XSum and 100-shot HumanEval benchmarks. Speculative Sampling achieved 2 to 2.5x decoding speedups on both XSum and HumanEval while preserving sample quality, without any alteration to the architecture or the parameters. 

The rejection sampling scheme introduced by the team has been shown to recover the distribution of the target model from the draft model samples, up to hardware numerics. The team also observed that computing the logits for a short continuation of K tokens in parallel has a latency similar to that of sampling a single token from the large target model. 
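That recovery property can be checked empirically. The sketch below (with toy distributions chosen as assumptions) accepts a draft sample x with probability min(1, q[x]/p[x]) and otherwise resamples from the normalized residual max(0, q - p); the resulting samples match the target distribution q:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.7, 0.2, 0.1])   # draft distribution (assumed)
q = np.array([0.4, 0.4, 0.2])   # target distribution (assumed)

def one_sample():
    # Draw from the draft, then apply the modified rejection rule.
    x = int(rng.choice(3, p=p))
    if rng.random() < min(1.0, q[x] / p[x]):
        return x
    # Rejected: resample from the renormalised residual max(0, q - p).
    resid = np.maximum(q - p, 0.0)
    return int(rng.choice(3, p=resid / resid.sum()))

counts = np.bincount([one_sample() for _ in range(100_000)], minlength=3)
print(counts / counts.sum())  # empirical frequencies ≈ [0.4, 0.4, 0.2]
```

The algebra behind the check: the probability of emitting x is min(p, q) from acceptance plus max(0, q - p) from resampling, which sums to exactly q, so no bias is introduced regardless of how poor the draft model is.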

Large Language Models have progressed rapidly in recent months, and Speculative Sampling looks promising. Its ability to accelerate the decoding of language models is innovative and should contribute meaningfully to the success of transformer models. A key feature of the algorithm is that it requires no alteration to the parameters or the architecture of the target language model; it scales well with a suitable draft model and accelerates decoding. Speculative Sampling is thus a valuable contribution to the field of Artificial Intelligence. 

Check out the Paper. All credit for this research goes to the researchers on this project.

Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a B.Tech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical-thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.