Microsoft AI Launches ZeRO-based Optimization Strategy for Efficient Large Model Training, Free from Batch Size or Bandwidth Restrictions

Microsoft researchers have introduced ZeRO++, a new system that optimizes the training of large AI models by addressing high data-transfer overhead and limited bandwidth. ZeRO++ builds on the existing ZeRO optimizations and adds enhanced communication strategies that improve training efficiency and reduce training time and cost.

Training large models like Turing-NLG, ChatGPT, and GPT-4 requires substantial memory and computing resources across multiple GPU devices. ZeRO++, developed by Microsoft's DeepSpeed team, introduces communication optimization strategies to overcome the limitations of ZeRO in two scenarios: training with a small batch size per GPU, and training on low-bandwidth clusters.

The ZeRO family of optimizations, including ZeRO-Inference, partitions model states across GPUs instead of replicating them, so that training can draw on the collective GPU memory and compute power. However, ZeRO can incur high communication overheads during training. ZeRO++ addresses this by incorporating three sets of communication optimizations: quantized weight communication (qwZ), hierarchical weight partition (hpZ), and quantized gradient communication (qgZ).
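As a rough illustration of how the three optimizations might be switched on, here is a minimal sketch of a DeepSpeed configuration written as a Python dict. The key names (`zero_quantized_weights`, `zero_hpz_partition_size`, `zero_quantized_gradients`) follow the DeepSpeed ZeRO++ tutorial as best understood here and should be checked against the installed DeepSpeed version. The value of `GPUS_PER_NODE` and the micro-batch size are assumptions made only for the example.

```python
import json

# Assumption: a hypothetical cluster with 8 GPUs per node.
GPUS_PER_NODE = 8

# Sketch of a DeepSpeed config enabling ZeRO++ on top of ZeRO stage 3.
# Key names are taken from the DeepSpeed ZeRO++ tutorial; verify them
# against the DeepSpeed version you actually run.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # illustrative small batch per GPU
    "zero_optimization": {
        "stage": 3,
        "zero_quantized_weights": True,            # qwZ: quantized weight communication
        "zero_hpz_partition_size": GPUS_PER_NODE,  # hpZ: secondary partition within a node
        "zero_quantized_gradients": True,          # qgZ: quantized gradient communication
    },
}

# Serialize to JSON so it could be saved as a config file.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

Each optimization is controlled independently in this sketch, so one could, for instance, enable only qwZ and hpZ when gradient quantization is not wanted.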

To reduce parameter communication volume, ZeRO++ quantizes weights before sending them (qwZ), using block-based quantization to preserve training precision. This optimized quantization process is faster and more accurate than basic quantization. To minimize communication overhead during backward propagation, ZeRO++ trades GPU memory for communication (hpZ) by maintaining a full model copy within each machine, so the weights needed in the backward pass can be gathered without crossing machines. For gradient communication, ZeRO++ introduces qgZ, a novel quantized gradient communication paradigm that reduces cross-node traffic and latency.
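To see why block-based quantization can preserve precision better than a single scale for the whole tensor, consider the following toy sketch. It is not DeepSpeed's kernel: the function names, the symmetric INT8 scheme, the block size of 256, and the synthetic weights with two outliers are all assumptions chosen for illustration.

```python
import random

def quantize(values, bits=8):
    """Symmetric quantization of a list of floats using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

def blockwise_roundtrip(values, block_size, bits=8):
    """Quantize and dequantize each block with its own scale."""
    restored = []
    for start in range(0, len(values), block_size):
        codes, scale = quantize(values[start:start + block_size], bits)
        restored.extend(dequantize(codes, scale))
    return restored

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

random.seed(0)
# Hypothetical weights: mostly small values plus two large outliers.
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
weights[10] = 4.0
weights[2000] = -3.0

# One scale for the whole tensor versus one scale per 256-element block.
err_whole = mean_abs_error(weights, blockwise_roundtrip(weights, len(weights)))
err_blocked = mean_abs_error(weights, blockwise_roundtrip(weights, 256))

print(f"whole-tensor mean abs error: {err_whole:.6f}")
print(f"block-based mean abs error:  {err_blocked:.6f}")
```

With a single scale, the outliers stretch the quantization range and most small weights collapse to a handful of codes. With per-block scales, only the blocks that contain an outlier pay that cost.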

Together, these optimizations reduce communication volume by up to 4x compared with ZeRO, improving training throughput and efficiency. In high-bandwidth clusters, ZeRO++ offers a 28% to 36% throughput improvement over ZeRO-3 when using small batch sizes per GPU. In low-bandwidth clusters, it achieves an average 2x speedup over ZeRO-3, making large model training more accessible across a wider variety of clusters.
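The 4x figure can be reproduced with simple arithmetic. The breakdown below is a sketch based on DeepSpeed's published description of ZeRO++ rather than on anything stated in this article: it assumes ZeRO-3 sends a model-sized volume M across nodes for each of three collectives, that qwZ sends INT8 instead of FP16 weights, that hpZ removes the cross-node backward all-gather, and that qgZ sends INT4 instead of FP16 gradients.

```python
from fractions import Fraction

M = Fraction(1)  # cross-node volume of one full FP16 model copy (normalized)

# Assumed per-step cross-node volume for ZeRO-3.
zero3 = {
    "forward all-gather": M,
    "backward all-gather": M,
    "gradient reduce-scatter": M,
}

# Assumed per-step cross-node volume with all three ZeRO++ optimizations.
zeropp = {
    "forward all-gather": M * Fraction(8, 16),       # qwZ: FP16 -> INT8
    "backward all-gather": M * 0,                    # hpZ: stays within the machine
    "gradient reduce-scatter": M * Fraction(4, 16),  # qgZ: FP16 -> INT4
}

total_zero3 = sum(zero3.values())
total_zeropp = sum(zeropp.values())
reduction = total_zero3 / total_zeropp

print(f"ZeRO-3 volume: {total_zero3}M")
print(f"ZeRO++ volume: {total_zeropp}M")
print(f"reduction:     {reduction}x")
```

Under these assumptions the total drops from 3M to 0.75M, which matches the "up to 4x" reduction quoted above.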

ZeRO++ is not limited to standard training: its optimizations also apply to reinforcement learning from human feedback (RLHF), the training method used for dialogue models. When ZeRO++ is integrated with DeepSpeed-Chat, both the generation and training phases of RLHF benefit, achieving up to 2.25x better generation throughput and 1.26x better training throughput than ZeRO.

DeepSpeed has released ZeRO++ to make large model training more efficient and accessible to the AI community. The system is designed to accelerate training, reduce communication overhead, and enable larger batch sizes, ultimately saving time and resources. Researchers and practitioners can leverage ZeRO++ to train models like ChatGPT more effectively and explore new possibilities in AI.