
Microsoft’s ZeRO-2 with DeepSpeed trains neural networks with up to 170 billion parameters

Microsoft CTO Kevin Scott
Image Credit: Microsoft


Microsoft today upgraded its DeepSpeed library for training large neural networks with ZeRO-2. Microsoft says the memory-optimizing technology is capable of training machine learning models with 170 billion parameters. For context, Nvidia’s massive Megatron language model is one of the biggest in the world today at 8.3 billion parameters.

Today’s announcement follows the February open source release of the DeepSpeed library, which was used to create Turing-NLG. At 17 billion parameters, Turing-NLG is the largest known language model in the world today. Microsoft introduced the Zero Redundancy Optimizer (ZeRO) in February alongside DeepSpeed.

ZeRO achieves its results by reducing memory redundancy in data parallelism, a training technique in which each GPU normally holds a full copy of the model’s training state. Whereas ZeRO-1 optimized part of the model state memory, ZeRO-2 adds optimizations for activation memory and fragmented memory.
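To give a rough sense of what removing that redundancy means, here is a minimal sketch of the per-GPU memory arithmetic for model states. The byte counts are assumptions taken from the mixed-precision Adam accounting in the ZeRO paper, not from Microsoft's announcement: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer states. The function name `model_state_gb` is hypothetical, and activation and fragmented memory are not modeled.

```python
def model_state_gb(num_params, num_gpus, stage):
    """Approximate per-GPU model-state memory in GB (1 GB = 1e9 bytes).

    Assumes mixed-precision Adam as accounted for in the ZeRO paper.
    stage 0: plain data parallelism, every GPU holds everything
    stage 1: optimizer states are partitioned across GPUs
    stage 2: optimizer states and gradients are partitioned across GPUs
    """
    if stage not in (0, 1, 2):
        raise ValueError("stage must be 0, 1, or 2")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    param_bytes = 2.0 * num_params   # fp16 weights (assumed)
    grad_bytes = 2.0 * num_params    # fp16 gradients (assumed)
    optim_bytes = 12.0 * num_params  # fp32 weights, momentum, variance (assumed)

    if stage >= 1:
        optim_bytes /= num_gpus      # each GPU keeps only its slice
    if stage >= 2:
        grad_bytes /= num_gpus       # gradients are sliced as well

    return (param_bytes + grad_bytes + optim_bytes) / 1e9


if __name__ == "__main__":
    # Illustrative inputs only: a 7.5-billion-parameter model on 64 GPUs.
    for stage in (0, 1, 2):
        gb = model_state_gb(7.5e9, 64, stage)
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions, a 7.5-billion-parameter model spread over 64 GPUs needs about 120 GB of model-state memory per GPU with plain data parallelism, about 31.4 GB with optimizer states partitioned, and about 16.6 GB once gradients are partitioned too.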

DeepSpeed is built for distributed training across multiple servers, but ZeRO-2 also brings improvements for single-GPU training, reportedly training models like Google’s BERT 30% faster.



Additional details will be announced Wednesday in a keynote address by Microsoft CTO Kevin Scott.

The news comes at the start of Microsoft’s all-digital Build developer conference, where a number of AI developments have been announced — including the debut of the WhiteNoise toolkit for differential privacy in machine learning and Project Bonsai for industrial applications of AI.

Last week, Nvidia CEO Jensen Huang unveiled the Ampere GPU architecture and A100 GPU. The new GPU chip — alongside trends like the creation of multimodal models and massive recommender systems — will lead to larger machine learning models in the years ahead.
