CAPSlog: Scalable Memory-Centric Partitioning for Pipeline Parallelism

Open Access
Authors
Publication date 2024
Host editors
  • A.E. Chis
  • H. González-Vélez
Book title 2024 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing
Book subtitle PDP 2024 : Dublin, Ireland, 20-22 March 2024 : proceedings
ISBN
  • 9798350363081
ISBN (electronic)
  • 9798350363074
Event 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2024
Pages (from-to) 17-25
Number of pages 9
Publisher Piscataway, NJ: IEEE Computer Society
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract

Pipeline-parallel training has emerged as a popular method to train large Deep Neural Networks (DNNs), as it allows the use of the combined compute power and memory capacity of multiple Graphics Processing Units (GPUs). However, with the sustained increase in Deep Learning (DL) model sizes, pipeline parallelism provides only a partial solution to the memory bottleneck in large-scale DNN training. Careful partitioning of the DL model over the available GPUs based on memory usage is required to further alleviate the memory bottleneck and train larger DNNs. mCAP is such a memory-oriented partitioning approach for pipeline-parallel systems, but it does not scale to models with many layers and very large hardware setups, as it requires extensive profiling and fails to efficiently navigate the partitioning space to find the most memory-friendly partitioning. In this work, we propose CAPSlog, a scalable memory-centric partitioning approach that can recommend model partitionings for larger and more heterogeneous DL models and for larger hardware setups than existing approaches. CAPSlog introduces a new profiling method and a new, much more scalable algorithm for recommending memory-efficient partitionings. CAPSlog reduces the profiling time by 67% compared to existing approaches, searches the partitioning space for the optimal solution orders of magnitude faster, and can train significantly larger models.
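To make the partitioning problem in the abstract concrete, the sketch below splits a sequence of layers into contiguous pipeline stages so that the largest per-stage memory footprint is as small as possible. It is a generic textbook dynamic program, not the CAPSlog or mCAP algorithm, neither of which is described on this page. The function name `partition_by_memory`, the per-layer memory figures, and the assumption that a stage's memory is simply the sum of its layers' memory are all hypothetical simplifications; in real pipeline-parallel training, memory use also depends on factors such as activations and the pipeline schedule.

```python
from itertools import accumulate
from typing import List, Tuple


def partition_by_memory(layer_mem: List[float], num_gpus: int) -> Tuple[float, List[int]]:
    """Split layers into `num_gpus` contiguous stages, minimizing peak stage memory.

    Assumes (hypothetically) that a stage's memory is the sum of its layers' memory.
    Returns (peak, starts), where starts[k] is the index of the first layer in stage k.
    """
    n = len(layer_mem)
    if not 1 <= num_gpus <= n:
        raise ValueError("need between 1 and len(layer_mem) stages")

    # prefix[i] is the total memory of the first i layers
    prefix = [0.0] + list(accumulate(layer_mem))
    inf = float("inf")

    # best[k][i]: smallest achievable peak when the first i layers form k stages
    best = [[inf] * (n + 1) for _ in range(num_gpus + 1)]
    # cut[k][i]: start index of the last stage in that best split
    cut = [[0] * (n + 1) for _ in range(num_gpus + 1)]
    best[0][0] = 0.0

    for k in range(1, num_gpus + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                # last stage holds layers j..i-1; earlier layers use k-1 stages
                peak = max(best[k - 1][j], prefix[i] - prefix[j])
                if peak < best[k][i]:
                    best[k][i] = peak
                    cut[k][i] = j

    # walk the recorded cuts backwards to recover the stage boundaries
    starts: List[int] = []
    i = n
    for k in range(num_gpus, 0, -1):
        i = cut[k][i]
        starts.append(i)
    starts.reverse()
    return best[num_gpus][n], starts


if __name__ == "__main__":
    # made-up per-layer memory figures (e.g. in GB), for illustration only
    peak, starts = partition_by_memory([4, 2, 3, 1, 5, 2], num_gpus=3)
    print(peak, starts)
```

This exhaustive search takes time proportional to the number of stages times the square of the number of layers, which illustrates why searching the partitioning space becomes costly for deep models and large hardware setups.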

Document type Conference contribution
Language English
Published at https://doi.org/10.1109/PDP62718.2024.00012
Other links https://www.proceedings.com/74377.html https://www.scopus.com/pages/publications/85191747579