vLLM: An Open-Source LLM Inference and Serving Library that Achieves up to 24x the Throughput of HuggingFace Transformers


Large language models, or LLMs for short, have emerged as a groundbreaking advancement in the field of artificial intelligence (AI). Models such as GPT-3 have revolutionized natural language understanding. With their capacity to interpret vast amounts of existing data and generate human-like text, these models hold immense potential to shape the future of AI and open up new possibilities for human-machine interaction and communication. However, despite the massive success achieved by LLMs, one significant challenge associated with such models is their computational inefficiency, which leads to slow performance even on the most powerful hardware. Because these models comprise millions or even billions of parameters, training and running them demands extensive computational resources, memory, and processing power, which are not always accessible. Moreover, the slow response times of these complex architectures can make LLMs impractical for real-time or interactive applications. Addressing these challenges is therefore essential to unlocking the full potential of LLMs and making their benefits more widely accessible.

Tackling this problem, researchers from the University of California, Berkeley, have developed vLLM, an open-source library that serves as a simpler, faster, and cheaper alternative for LLM inference and serving. The Large Model Systems Organization (LMSYS) currently uses the library to power Vicuna and Chatbot Arena. By switching from their initial HuggingFace Transformers-based backend to vLLM, the organization has been able to handle peak traffic efficiently (five times more than before) while using limited computational resources and keeping operational costs down. Currently, vLLM supports several HuggingFace models, such as GPT-2, GPT BigCode, and LLaMA, to name a few. It achieves throughput levels up to 24 times higher than those of HuggingFace Transformers while maintaining the same model architecture and without requiring any modifications.

As part of their preliminary research, the Berkeley researchers determined that memory-related issues pose the primary constraint on the performance of LLM serving. LLMs use input tokens to generate attention key and value tensors, which are then cached in GPU memory for generating subsequent tokens. These dynamic key and value tensors, known as the KV cache, occupy a substantial portion of memory, and managing them becomes a cumbersome task. To address this challenge, the researchers introduced PagedAttention, a novel attention algorithm that extends the conventional idea of paging in operating systems to LLM serving. PagedAttention offers a more flexible approach to managing key and value tensors by storing them in non-contiguous memory spaces, eliminating the need for long contiguous memory blocks. These blocks can be retrieved independently through a block table during attention computation, leading to more efficient memory utilization. Adopting this technique reduces memory waste to less than 4%, resulting in near-optimal memory usage. Moreover, PagedAttention makes it possible to batch up to 5x more sequences together, thereby enhancing GPU utilization and throughput.
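To make the block-table idea concrete, here is a deliberately simplified Python sketch of paged KV-cache bookkeeping. The block size, class names, and structure are illustrative assumptions and do not reflect vLLM's actual internals; the point is only to show how a sequence's logically contiguous cache can live in physically scattered, fixed-size blocks looked up through a block table.

```python
# Toy illustration of paged KV-cache bookkeeping (NOT vLLM's real implementation).
BLOCK_SIZE = 16  # tokens per KV-cache block (assumed value for illustration)

class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared free pool."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()

class SequenceKVCache:
    """Maps a sequence's logical blocks to non-contiguous physical blocks."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is allocated only when the current one is full,
        # so at most one partially filled block is ever wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=1024)
seq = SequenceKVCache(allocator)
for _ in range(40):          # cache KV entries for 40 generated tokens
    seq.append_token()
print(seq.block_table)       # three block ids drawn from the shared pool
```

Because memory is claimed one small block at a time rather than reserved up front for the longest possible output, only the last block of each sequence can sit partially unused, which is what keeps the waste so low.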

PagedAttention offers the additional benefit of efficient memory sharing. During parallel sampling, i.e., when multiple output sequences are generated simultaneously from a single prompt, PagedAttention enables the computation and memory associated with that prompt to be shared. This is accomplished through the block table: different sequences can share blocks by mapping their logical blocks to the same physical block. By employing this memory-sharing mechanism, PagedAttention not only minimizes memory usage but also ensures safe sharing. The experimental evaluations conducted by the researchers revealed that parallel sampling can reduce memory usage by as much as 55%, resulting in a 2.2 times increase in throughput.
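From the user's side, parallel sampling is just a sampling parameter. The following sketch uses vLLM's documented `LLM` and `SamplingParams` interface (the model name is only a placeholder, and details may differ across versions): requesting `n=4` completions from a single prompt lets PagedAttention share that prompt's KV-cache blocks across all four sequences instead of copying them.

```python
from vllm import LLM, SamplingParams

# Example model only; any supported HuggingFace model name could be used here.
llm = LLM(model="facebook/opt-125m")

# n=4 asks for four output sequences from the same prompt; their prompt
# KV-cache blocks are shared rather than duplicated.
sampling_params = SamplingParams(n=4, temperature=0.8, top_p=0.95)

outputs = llm.generate(["The future of AI is"], sampling_params)
for completion in outputs[0].outputs:
    print(completion.text)
```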

To summarize, vLLM effectively handles the management of attention key and value memory through the implementation of the PagedAttention mechanism. This results in exceptional throughput performance. Moreover, vLLM seamlessly integrates with well-known HuggingFace models and can be utilized alongside different decoding algorithms, such as parallel sampling. The library can be installed using a simple pip command and is currently available for both offline inference and online serving.
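As a closing sketch, a minimal offline batched-inference script might look like the following, assuming vLLM has been installed (typically via `pip install vllm`); the model name and prompts are placeholders rather than examples from the original announcement.

```python
from vllm import LLM, SamplingParams

prompts = [
    "Explain paging in operating systems in one sentence.",
    "Write a haiku about GPUs.",
    "What is attention in transformers?",
]
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# vLLM batches these prompts internally and manages their KV caches with
# PagedAttention, which is where the throughput gains come from.
llm = LLM(model="gpt2")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

For online serving, the library also provides a server entry point (an OpenAI-compatible API server in recent releases) that can be launched as a Python module; the exact invocation and flags are described in the vLLM documentation and may vary by version.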