mCAP: Memory-Centric Partitioning for Large-Scale Pipeline-Parallel DNN Training
| Authors | |
|---|---|
| Publication date | 2022 |
| Host editors | |
| Book title | Euro-Par 2022: Parallel Processing |
| Book subtitle | 28th International Conference on Parallel and Distributed Computing, Glasgow, UK, August 22–26, 2022: proceedings |
| ISBN | |
| ISBN (electronic) | |
| Series | Lecture Notes in Computer Science |
| Event | 28th International European Conference on Parallel and Distributed Computing, Euro-Par 2022 |
| Pages (from-to) | 155-170 |
| Number of pages | 16 |
| Publisher | Cham: Springer |
| Organisations | |
| Abstract | Memory usage is becoming an increasingly pressing bottleneck in the training process of Deep Neural Networks (DNNs), especially when training on Graphics Processing Units (GPUs). Existing solutions for multi-GPU training setups partition the neural network over the GPUs in a way that favors training throughput over memory usage, and thus maximum trainable network size. We propose mCAP, a partitioning solution for pipeline-parallel DNN training that focuses specifically on memory usage. It evenly distributes Deep Learning models over the available resources with respect to per-device peak memory usage. Our partitioning approach uses a novel incremental profiling strategy to extract per-layer memory usage statistics. A model-based predictor uses the profiling data to recommend a partitioning that balances peak memory usage. Our approach is DL-framework agnostic and orthogonal to existing memory optimizations found in large-scale DNN training systems. Our results show that our approach enables training of neural networks that are 1.55 times larger than existing partitioning solutions in terms of the number of parameters. |
| Document type | Conference contribution |
| Language | English |
| Published at | https://doi.org/10.1007/978-3-031-12597-3_10 |
| Other links | https://doi.org/10.6084/m9.figshare.20000960 https://www.scopus.com/pages/publications/85135773608 |
| Downloads | 978-3-031-12597-3_10 (Final published version) |
| Permalink to this page | |
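
The abstract describes mCAP's two ingredients: incremental profiling that yields per-layer memory statistics, and a model-based predictor that recommends a partitioning balancing per-device peak memory. The sketch below is only an illustration of that balancing idea under simplifying assumptions (per-layer memory is additive within a stage, each GPU gets one contiguous block of layers); the function name `balanced_partition`, the min-max dynamic program, and the example numbers are hypothetical and are not taken from the paper.

```python
"""Illustrative sketch (not the paper's implementation): given per-layer peak-memory
estimates, split a layer sequence into num_gpus contiguous stages so that the
largest per-stage memory footprint is as small as possible."""
from itertools import accumulate
from typing import List, Tuple


def balanced_partition(layer_mem_mb: List[float], num_gpus: int) -> Tuple[List[int], float]:
    """Return stage boundaries (exclusive end indices) and the resulting peak memory.

    Uses a simple min-max dynamic program over contiguous splits; mCAP's actual
    model-based predictor is richer (it models activations, gradients and
    optimizer state per stage), this only balances the given per-layer numbers.
    """
    n = len(layer_mem_mb)
    prefix = [0.0] + list(accumulate(layer_mem_mb))

    def stage_mem(i: int, j: int) -> float:
        # Memory of a stage holding layers i..j-1 (assumed additive).
        return prefix[j] - prefix[i]

    INF = float("inf")
    # dp[k][j] = minimal achievable peak memory when layers 0..j-1 form k stages.
    dp = [[INF] * (n + 1) for _ in range(num_gpus + 1)]
    cut = [[0] * (n + 1) for _ in range(num_gpus + 1)]
    dp[0][0] = 0.0
    for k in range(1, num_gpus + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                peak = max(dp[k - 1][i], stage_mem(i, j))
                if peak < dp[k][j]:
                    dp[k][j] = peak
                    cut[k][j] = i
    # Recover the stage boundaries by walking the cut table backwards.
    bounds, j = [], n
    for k in range(num_gpus, 0, -1):
        bounds.append(j)
        j = cut[k][j]
    return sorted(bounds), dp[num_gpus][n]


if __name__ == "__main__":
    # Hypothetical per-layer peak memory (MB), as an incremental profiling pass might report.
    per_layer = [310, 120, 540, 260, 90, 480, 150, 220]
    boundaries, peak = balanced_partition(per_layer, num_gpus=3)
    print("stage end indices:", boundaries, "peak per-GPU memory (MB):", peak)
```

Minimizing the maximum per-stage footprint, rather than balancing compute time, is what distinguishes a memory-centric split from throughput-oriented partitioners; the dynamic program above is just one simple way to realize that objective on profiled numbers.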
