mCAP: Memory-Centric Partitioning for Large-Scale Pipeline-Parallel DNN Training
| Authors | |
|---|---|
| Publication date | 2022 |
| Host editors | |
| Book title | Euro-Par 2022: Parallel Processing |
| Book subtitle | 28th International Conference on Parallel and Distributed Computing, Glasgow, UK, August 22–26, 2022: proceedings |
| ISBN | |
| ISBN (electronic) | |
| Series | Lecture Notes in Computer Science |
| Event | 28th International European Conference on Parallel and Distributed Computing, Euro-Par 2022 |
| Pages (from-to) | 155-170 |
| Number of pages | 16 |
| Publisher | Cham: Springer |
| Organisations | |
| Abstract | Memory usage is becoming an increasingly pressing bottleneck in the training process of Deep Neural Networks (DNNs), especially when training on Graphics Processing Units (GPUs). Existing solutions for multi-GPU training setups partition the neural network over the GPUs in a way that favors training throughput over memory usage, and thus maximum trainable network size. We propose mCAP, a partitioning solution for pipeline-parallel DNN training that focuses specifically on memory usage. It evenly distributes Deep Learning models over the available resources with respect to per-device peak memory usage. Our partitioning approach uses a novel incremental profiling strategy to extract per-layer memory usage statistics. A model-based predictor uses the profiling data to recommend a partitioning that balances peak memory usage. Our approach is DL-framework agnostic and orthogonal to existing memory optimizations found in large-scale DNN training systems. Our results show that our approach enables training of neural networks that are 1.55 times larger than existing partitioning solutions in terms of the number of parameters. |
| Document type | Conference contribution |
| Language | English |
| Published at | https://doi.org/10.1007/978-3-031-12597-3_10 |
| Other links | https://doi.org/10.6084/m9.figshare.20000960 https://www.scopus.com/pages/publications/85135773608 |
| Downloads | 978-3-031-12597-3_10 (Final published version) |
| Permalink to this page | |
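
The abstract describes mCAP's two ingredients: incremental profiling that yields per-layer memory statistics, and a model-based predictor that recommends a partitioning balancing per-device peak memory. The sketch below is only an illustration of that balancing idea under simplifying assumptions (per-layer memory is additive within a stage, each GPU gets one contiguous block of layers); the function name `balanced_partition`, the min-max dynamic program, and the example numbers are hypothetical and are not taken from the paper.

```python
"""Illustrative sketch (not the paper's implementation): given per-layer peak-memory
estimates, split a layer sequence into num_gpus contiguous stages so that the
largest per-stage memory footprint is as small as possible."""
from itertools import accumulate
from typing import List, Tuple


def balanced_partition(layer_mem_mb: List[float], num_gpus: int) -> Tuple[List[int], float]:
    """Return stage boundaries (exclusive end indices) and the resulting peak memory.

    Uses a simple min-max dynamic program over contiguous splits; mCAP's actual
    model-based predictor is richer (it models activations, gradients and
    optimizer state per stage), this only balances the given per-layer numbers.
    """
    n = len(layer_mem_mb)
    prefix = [0.0] + list(accumulate(layer_mem_mb))

    def stage_mem(i: int, j: int) -> float:
        # Memory of a stage holding layers i..j-1 (assumed additive).
        return prefix[j] - prefix[i]

    INF = float("inf")
    # dp[k][j] = minimal achievable peak memory when layers 0..j-1 form k stages.
    dp = [[INF] * (n + 1) for _ in range(num_gpus + 1)]
    cut = [[0] * (n + 1) for _ in range(num_gpus + 1)]
    dp[0][0] = 0.0
    for k in range(1, num_gpus + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                peak = max(dp[k - 1][i], stage_mem(i, j))
                if peak < dp[k][j]:
                    dp[k][j] = peak
                    cut[k][j] = i
    # Recover the stage boundaries by walking the cut table backwards.
    bounds, j = [], n
    for k in range(num_gpus, 0, -1):
        bounds.append(j)
        j = cut[k][j]
    return sorted(bounds), dp[num_gpus][n]


if __name__ == "__main__":
    # Hypothetical per-layer peak memory (MB), as an incremental profiling pass might report.
    per_layer = [310, 120, 540, 260, 90, 480, 150, 220]
    boundaries, peak = balanced_partition(per_layer, num_gpus=3)
    print("stage end indices:", boundaries, "peak per-GPU memory (MB):", peak)
```

Minimizing the maximum per-stage footprint, rather than balancing compute time, is what distinguishes a memory-centric split from throughput-oriented partitioners; the dynamic program above is just one simple way to realize that objective on profiled numbers.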
