Poster: Reconstructing LLM Training Workloads: A Topology- and Model-aware Network Tester
| Authors | |
|---|---|
| Publication date | 2025 |
| Book title | CoNEXT '25 |
| Book subtitle | Proceedings of the 21st International Conference on Emerging Networking EXperiments and Technologies: December 1-4, 2025, Hong Kong, Hong Kong |
| ISBN (electronic) | |
| Event | 21st International Conference on Emerging Networking EXperiments and Technologies, CoNEXT 2025 |
| Pages (from-to) | 41-42 |
| Number of pages | 2 |
| Publisher | New York, NY: Association for Computing Machinery |
| Organisations | |
| Abstract | Large Language Model (LLM) training generates synchronized Remote Direct Memory Access (RDMA) bursts that heavily stress datacenter fabrics and are highly sensitive to faults. However, access to full-scale training clusters is costly, and existing network testers fail to accurately reproduce such patterns. We introduce GPTraffic, a topology- and model-aware testing framework that predicts, emulates, and analyzes LLM training workloads on programmable hardware. By combining burst-accurate traffic generation, RDMA-aware semantics, and fine-grained fault injection, GPTraffic enables scalable, realistic, and reproducible experiments that faithfully reflect the dynamics of distributed LLM training. This allows researchers to explore performance bottlenecks, congestion behavior, and fault tolerance under conditions that closely mirror real-world AI training workloads. |
| Document type | Conference contribution |
| Language | English |
| Published at | https://doi.org/10.1145/3765515.3771757 |
| Other links | https://www.scopus.com/pages/publications/105023972677 |
| Downloads | 3765515.3771757 (Final published version) |
| Permalink to this page | |
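The abstract notes that distributed LLM training produces synchronized RDMA bursts whose size is predictable from the model and the collective algorithm. As a minimal illustration (not the GPTraffic implementation; all names here are hypothetical), the sketch below estimates the per-GPU burst volume that a ring all-reduce of fp16 gradients generates each training iteration — the kind of model-derived quantity a model-aware tester would use to size its generated traffic:

```python
def ring_allreduce_bytes_per_gpu(param_count: int,
                                 bytes_per_param: int,
                                 world_size: int) -> int:
    """Bytes each GPU sends per iteration in a ring all-reduce.

    A ring all-reduce moves 2*(N-1)/N of the gradient payload through
    each GPU's uplink (reduce-scatter phase plus all-gather phase).
    """
    grad_bytes = param_count * bytes_per_param
    return 2 * (world_size - 1) * grad_bytes // world_size


# Hypothetical example: a 7B-parameter model, fp16 gradients (2 bytes
# per parameter), 8 GPUs in one ring.
burst = ring_allreduce_bytes_per_gpu(7_000_000_000, 2, 8)
print(f"{burst / 1e9:.1f} GB per GPU per iteration")  # → 24.5 GB
```

Repeating such a burst once per training step, synchronized across all participants, is what stresses the fabric in the way the abstract describes.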
