Poster: Reconstructing LLM Training Workloads: A Topology- and Model-aware Network Tester
| Authors | |
|---|---|
| Publication date | 2025 |
| Book title | CoNEXT '25 |
| Book subtitle | Proceedings of the 21st International Conference on Emerging Networking EXperiments and Technologies: December 1-4, 2025, Hong Kong, Hong Kong |
| ISBN (electronic) | |
| Event | 21st International Conference on Emerging Networking EXperiments and Technologies, CoNEXT 2025 |
| Pages (from-to) | 41-42 |
| Number of pages | 2 |
| Publisher | New York, NY: Association for Computing Machinery |
| Organisations | |
| Abstract | Large Language Model (LLM) training generates synchronized Remote Direct Memory Access (RDMA) bursts that heavily stress datacenter fabrics and are highly sensitive to faults. However, access to full-scale training clusters is costly, and existing network testers fail to accurately reproduce such patterns. We introduce GPTraffic, a topology- and model-aware testing framework that predicts, emulates, and analyzes LLM training workloads on programmable hardware. By combining burst-accurate traffic generation, RDMA-aware semantics, and fine-grained fault injection, GPTraffic enables scalable, realistic, and reproducible experiments that faithfully reflect the dynamics of distributed LLM training. This allows researchers to explore performance bottlenecks, congestion behavior, and fault tolerance under conditions that closely mirror real-world AI training workloads. |
| Document type | Conference contribution |
| Language | English |
| Published at | https://doi.org/10.1145/3765515.3771757 |
| Other links | https://www.scopus.com/pages/publications/105023972677 |
| Downloads | 3765515.3771757 (Final published version) |
| Permalink to this page | |
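The abstract notes that distributed LLM training produces synchronized RDMA bursts whose size is predictable from the model and the collective algorithm. As a minimal illustration (not the GPTraffic implementation; all names here are hypothetical), the sketch below estimates the per-GPU burst volume that a ring all-reduce of fp16 gradients generates each training iteration — the kind of model-derived quantity a model-aware tester would use to size its generated traffic:

```python
def ring_allreduce_bytes_per_gpu(param_count: int,
                                 bytes_per_param: int,
                                 world_size: int) -> int:
    """Bytes each GPU sends per iteration in a ring all-reduce.

    A ring all-reduce moves 2*(N-1)/N of the gradient payload through
    each GPU's uplink (reduce-scatter phase plus all-gather phase).
    """
    grad_bytes = param_count * bytes_per_param
    return 2 * (world_size - 1) * grad_bytes // world_size


# Hypothetical example: a 7B-parameter model, fp16 gradients (2 bytes
# per parameter), 8 GPUs in one ring.
burst = ring_allreduce_bytes_per_gpu(7_000_000_000, 2, 8)
print(f"{burst / 1e9:.1f} GB per GPU per iteration")  # → 24.5 GB
```

Repeating such a burst once per training step, synchronized across all participants, is what stresses the fabric in the way the abstract describes.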
