CAPTURE: Memory-Centric Partitioning for Distributed DNN Training with Hybrid Parallelism

Open Access
Publication date 2023
Book title 2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics
Book subtitle HiPC 2023 : 18-21 December 2023, Goa, India : proceedings
ISBN
  • 9798350383232
ISBN (electronic)
  • 9798350383225
Event 30th Annual IEEE International Conference on High Performance Computing, Data, and Analytics, HiPC 2023
Pages (from-to) 76-86
Number of pages 11
Publisher Piscataway, NJ: IEEE
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract

Deep Learning (DL) model sizes are increasing at a rapid pace, as larger models typically offer better statistical performance. Modern Large Language Models (LLMs) and image processing models contain billions of trainable parameters. Training such massive neural networks incurs significant memory requirements and financial cost. Hybrid-parallel training approaches have emerged that combine pipelining with data and tensor parallelism to facilitate the training of large DL models on distributed hardware setups. However, existing approaches to designing a hybrid-parallel partitioning and parallelization plan for DL models focus on achieving high throughput rather than on minimizing memory usage and financial cost. We introduce CAPTURE, a partitioning and parallelization approach for hybrid parallelism that minimizes peak memory usage. CAPTURE combines a profiling-based approach with statistical modeling to recommend a partitioning and parallelization plan that minimizes the peak memory usage across all the Graphics Processing Units (GPUs) in the hardware setup. Our results show a reduction in memory usage of up to 43.9% compared to partitioners in state-of-the-art hybrid-parallel training systems. The reduced memory footprint enables the training of larger DL models on the same hardware resources and training with larger batch sizes. CAPTURE can also train a given model on a smaller hardware setup than other approaches, reducing the financial cost of training massive DL models.
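To make the core idea concrete, the following is a minimal illustrative sketch (not the CAPTURE algorithm itself, whose details are in the paper): given profiled per-layer memory footprints, split a sequence of layers into contiguous pipeline stages so that the largest per-GPU memory sum, i.e. the peak, is minimized. The function name, the toy memory values, and the binary-search-plus-greedy strategy are all assumptions made for illustration.

```python
# Illustrative sketch only: contiguous layer partitioning that minimizes
# peak per-GPU memory. This is NOT the published CAPTURE algorithm; it
# demonstrates the general objective of memory-centric partitioning.

def min_peak_memory_split(layer_mem, num_gpus):
    """Split `layer_mem` (profiled memory per layer, in any consistent
    unit) into at most `num_gpus` contiguous stages, returning the
    smallest achievable peak (maximum per-stage memory sum).

    Strategy: binary-search the candidate peak between the largest
    single layer and the total, checking feasibility greedily."""
    lo, hi = max(layer_mem), sum(layer_mem)

    def feasible(cap):
        # Greedily pack consecutive layers into stages of capacity `cap`.
        stages, cur = 1, 0
        for m in layer_mem:
            if cur + m > cap:
                stages, cur = stages + 1, m
            else:
                cur += m
        return stages <= num_gpus

    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid  # a lower peak might still be achievable
        else:
            lo = mid + 1
    return lo


# Hypothetical per-layer footprints for a 5-layer model on 3 GPUs:
print(min_peak_memory_split([4, 2, 6, 3, 5], 3))  # prints 8
```

In this toy instance the best contiguous split is [4, 2] | [6] | [3, 5], giving per-GPU sums of 6, 6, and 8, so the peak is 8; any other 3-way contiguous split has a peak of at least 8. CAPTURE's actual search additionally accounts for data and tensor parallelism and uses statistical modeling of profiled measurements, which this sketch omits.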

Document type Conference contribution
Language English
DOI https://doi.org/10.1109/HiPC58850.2023.00023
Other links https://www.proceedings.com/74077.html https://www.scopus.com/pages/publications/85190604352