AI workloads underpin many applications at Meta, including content understanding, Feeds, generative AI, and ad ranking. PyTorch handles these workloads well thanks to its seamless Python integration, eager-mode development, and simple APIs. Deep learning recommendation models (DLRMs) in particular are central to improving user experiences across Meta’s products and services. As these models grow in size and complexity, the underlying hardware must provide more memory and compute while remaining efficient.
GPUs are not always the best choice for running Meta’s recommendation workloads efficiently at scale. To address this, Meta developed the Meta Training and Inference Accelerator (MTIA), a family of application-specific integrated circuits (ASICs). The first-generation ASIC is integrated with PyTorch to form a fully optimized ranking system that meets the requirements of next-generation recommendation models. Improving developer efficiency is an ongoing effort, and the team continues to support PyTorch 2.0, which substantially improves PyTorch’s efficiency at the compiler level.
In 2020, the team designed the first-generation MTIA ASIC for Meta’s internal workloads. This inference accelerator is part of a full-stack solution in which the silicon, PyTorch, and the recommendation models are co-designed. Fabricated on TSMC’s 7nm process, the accelerator runs at 800 MHz and delivers 102.4 TOPS at INT8 precision and 51.2 TFLOPS at FP16 precision. Its thermal design power (TDP) is 25 W.
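To put the headline figures in perspective, the short Python sketch below derives operations per clock cycle and peak throughput per watt from the numbers quoted above. The per-PE values assume the peak is spread evenly across the 64 PEs of the grid, and the efficiency figure assumes peak throughput at full TDP. Both are simplifications for illustration, not published specifications, and all function and constant names are made up for this example.

```python
# Figures quoted in the article for the first-generation MTIA.
CLOCK_MHZ = 800
INT8_GOPS = 102_400    # 102.4 TOPS expressed in giga-operations per second
FP16_GFLOPS = 51_200   # 51.2 TFLOPS expressed in giga-FLOPs per second
TDP_WATTS = 25
NUM_PES = 64           # 8 x 8 grid of processing elements


def ops_per_cycle(giga_ops_per_s: int, clock_mhz: int = CLOCK_MHZ) -> int:
    """Peak operations per clock cycle for the whole chip.

    (G * 1e9 ops/s) / (MHz * 1e6 cycles/s) simplifies to G * 1000 / MHz.
    """
    return giga_ops_per_s * 1000 // clock_mhz


def tera_ops_per_watt(giga_ops_per_s: int, tdp_watts: int = TDP_WATTS) -> float:
    """Peak tera-operations per second per watt, assuming peak at full TDP."""
    return giga_ops_per_s / 1000 / tdp_watts


int8_per_cycle = ops_per_cycle(INT8_GOPS)
fp16_per_cycle = ops_per_cycle(FP16_GFLOPS)

# Assumption: the peak is divided evenly among the PEs.
int8_per_cycle_per_pe = int8_per_cycle // NUM_PES
fp16_per_cycle_per_pe = fp16_per_cycle // NUM_PES

print("INT8 ops/cycle (chip, per PE):", int8_per_cycle, int8_per_cycle_per_pe)
print("FP16 ops/cycle (chip, per PE):", fp16_per_cycle, fp16_per_cycle_per_pe)
print("INT8 TOPS/W:", round(tera_ops_per_watt(INT8_GOPS), 3))
print("FP16 TFLOPS/W:", round(tera_ops_per_watt(FP16_GFLOPS), 3))
```

Under these assumptions, the INT8 figure works out to 128,000 operations per cycle across the chip, or 2,000 per PE, and roughly 4.1 TOPS per watt; the FP16 figures are exactly half of those.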
At a high level, the accelerator consists of a grid of processing elements (PEs), on-chip and off-chip memory resources, and interconnects. A dedicated control subsystem within the accelerator runs the system firmware. The firmware manages the available compute and memory resources, communicates with the host through a dedicated host interface, and orchestrates job execution on the accelerator. The memory subsystem uses LPDDR5 for off-chip DRAM, which can scale up to 128 GB. The chip also has 128 MB of on-chip SRAM shared among all the PEs, which provides higher bandwidth and much lower latency for frequently accessed data and instructions.
The grid contains 64 PEs arranged in an 8 × 8 layout. Each PE has 128 KB of local SRAM for fast storage of and access to the data it is working on. A mesh network connects the PEs to one another and to the memory banks. The grid can run a job as a whole, or it can be partitioned into multiple subgrids that each run an independent job. Each PE contains two processor cores and several fixed-function units optimized for key operations such as matrix multiplication, accumulation, data movement, and nonlinear function calculation. The processor cores are based on the RISC-V instruction set architecture (ISA) and are heavily customized to perform the required compute and control tasks. The architecture is designed to maximize parallelism and data reuse, the two foundations of running workloads efficiently.
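The idea of running a job on the whole grid or on independent subgrids can be sketched in a few lines of Python. The article does not describe how MTIA actually partitions the grid, so this example assumes equal-sized rectangular tiles; the `Subgrid` class and `partition` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

# Figures quoted in the article.
GRID_ROWS, GRID_COLS = 8, 8   # 64 PEs in total
LOCAL_SRAM_KB = 128           # local SRAM per PE


@dataclass(frozen=True)
class Subgrid:
    """A rectangular block of PEs that could be assigned one job (hypothetical)."""
    row0: int   # top-left PE row
    col0: int   # top-left PE column
    rows: int
    cols: int

    @property
    def num_pes(self) -> int:
        return self.rows * self.cols

    @property
    def local_sram_kb(self) -> int:
        # Aggregate PE-local SRAM available inside this subgrid.
        return self.num_pes * LOCAL_SRAM_KB


def partition(sub_rows: int, sub_cols: int) -> list[Subgrid]:
    """Split the 8 x 8 grid into equal tiles of sub_rows x sub_cols PEs.

    Assumption: tiles must divide the grid evenly; the real hardware may
    support other shapes.
    """
    if sub_rows <= 0 or sub_cols <= 0:
        raise ValueError("subgrid dimensions must be positive")
    if GRID_ROWS % sub_rows or GRID_COLS % sub_cols:
        raise ValueError("subgrid shape must evenly divide the 8 x 8 grid")
    return [
        Subgrid(r, c, sub_rows, sub_cols)
        for r in range(0, GRID_ROWS, sub_rows)
        for c in range(0, GRID_COLS, sub_cols)
    ]


whole = partition(8, 8)       # one job on the full grid
quadrants = partition(4, 4)   # four independent 16-PE subgrids

for sg in quadrants:
    print(sg, sg.num_pes, "PEs,", sg.local_sram_kb, "KB local SRAM")
```

With the quoted figures, the full grid holds 64 × 128 KB = 8 MB of PE-local SRAM in aggregate, and each 4 × 4 quadrant holds 2 MB, separate from the 128 MB of shared on-chip SRAM.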
The team compared MTIA with an NNPI accelerator and a GPU. For low-complexity models, MTIA’s efficiency comes from handling small tensor shapes and batch sizes well. Medium- and high-complexity models run larger shapes, which are currently much better optimized in the GPU’s software stack. The team is actively optimizing the MTIA software stack to reach similar levels of performance on those models.
To get the best performance on Meta’s workloads, the team is now focused on striking the right balance between compute power, memory capacity, and interconnect bandwidth in order to build a better and more efficient solution.