Poster: Reconstructing LLM Training Workloads: A Topology- and Model-aware Network Tester

Open Access
Authors
Publication date 2025
Book title CoNEXT '25
Book subtitle Proceedings of the 21st International Conference on Emerging Networking EXperiments and Technologies: December 1-4, 2025, Hong Kong, Hong Kong
ISBN (electronic)
  • 9798400721915
Event 21st International Conference on Emerging Networking EXperiments and Technologies, CoNEXT 2025
Pages (from-to) 41-42
Number of pages 2
Publisher New York, NY: Association for Computing Machinery
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract

Large Language Model (LLM) training generates synchronized Remote Direct Memory Access (RDMA) bursts that heavily stress datacenter fabrics, and the training process itself is highly sensitive to network faults. However, access to full-scale training clusters is costly, and existing network testers fail to reproduce such traffic patterns accurately. We introduce GPTraffic, a topology- and model-aware testing framework that predicts, emulates, and analyzes LLM training workloads on programmable hardware. By combining burst-accurate traffic generation, RDMA-aware semantics, and fine-grained fault injection, GPTraffic enables scalable, realistic, and reproducible experiments that faithfully reflect the dynamics of distributed LLM training. This allows researchers to explore performance bottlenecks, congestion behavior, and fault tolerance under conditions that closely mirror real-world AI training workloads.
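To make the "synchronized burst" notion concrete, the following back-of-the-envelope sketch (not from the poster itself; the model size, gradient precision, and GPU count are hypothetical placeholders) estimates the per-step ring all-reduce traffic that every data-parallel GPU injects at nearly the same instant:

```python
# Illustrative sketch only: rough per-GPU traffic volume of one
# gradient ring all-reduce step in data-parallel LLM training.
# All parameters below are hypothetical, not taken from the paper.

def ring_allreduce_bytes_per_gpu(grad_bytes: int, world_size: int) -> int:
    """In a ring all-reduce, each GPU sends ~2*(N-1)/N of the gradient volume."""
    return int(2 * (world_size - 1) / world_size * grad_bytes)

PARAMS = 7e9          # hypothetical 7B-parameter model
BYTES_PER_PARAM = 2   # fp16/bf16 gradients
WORLD_SIZE = 64       # hypothetical number of data-parallel GPUs

grad_bytes = int(PARAMS * BYTES_PER_PARAM)
per_gpu = ring_allreduce_bytes_per_gpu(grad_bytes, WORLD_SIZE)
print(f"Per-GPU all-reduce traffic per training step: {per_gpu / 1e9:.1f} GB")
# Because every GPU injects this volume at (nearly) the same moment in
# every step, the fabric sees synchronized bursts rather than smooth load.
```

Under these assumed numbers, each GPU moves on the order of tens of gigabytes per step in a tightly synchronized window, which is the burst pattern a tester like GPTraffic aims to reproduce faithfully.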

Document type Conference contribution
Language English
Published at https://doi.org/10.1145/3765515.3771757
Other links https://www.scopus.com/pages/publications/105023972677