eACGM
eACGM: An eBPF-based Automated Comprehensive Governance and Monitoring framework for AI/ML systems.
⭐ [News] Our work has been accepted by IEEE/ACM IWQoS 2025 (CCF-B)!
eACGM provides zero-intrusion, low-overhead, full-stack observability across the hardware (GPU, NCCL) and software (CUDA, Python, PyTorch) layers of modern AI/ML workloads.
Features
- Event tracing for CUDA Runtime based on eBPF
- Event tracing for NCCL GPU communication library based on eBPF
- Function call tracing for Python virtual machine based on eBPF
- Operator tracing for PyTorch based on eBPF
- Process-level GPU information monitoring based on libnvml
- Global GPU information monitoring based on libnvml
- Automatic eBPF program generation
- Comprehensive analysis of all traced events and operators
- Flexible integration for multi-level tracing (CUDA, NCCL, PyTorch, Python, GPU)
- Visualization-ready data output for monitoring platforms
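As an illustration of the "automatic eBPF program generation" feature listed above, the following is a minimal sketch of how uprobe handlers could be generated from a list of symbol names and attached with BCC. It is not eACGM's actual generator: the template, the `event_t` layout, the function names `generate_program` and `attach`, and the library path in the usage note are all assumptions made for this example. `cudaMalloc` and `cudaLaunchKernel` are real CUDA Runtime symbols.

```python
from string import Template

# Shared header: one event record type and one perf buffer for all probes.
# (Hypothetical layout, chosen for this sketch.)
HEADER = """
#include <uapi/linux/ptrace.h>

struct event_t {
    u64 ts_ns;
    u32 pid;
    u32 probe_id;
};
BPF_PERF_OUTPUT(events);
"""

# One handler per traced symbol; probe_id identifies the symbol in user space.
HANDLER = Template("""
int trace_${sym}(struct pt_regs *ctx) {
    struct event_t ev = {};
    ev.ts_ns = bpf_ktime_get_ns();
    ev.pid = bpf_get_current_pid_tgid() >> 32;
    ev.probe_id = ${probe_id};
    events.perf_submit(ctx, &ev, sizeof(ev));
    return 0;
}
""")


def generate_program(symbols):
    """Return BPF C source containing one uprobe handler per symbol."""
    handlers = [
        HANDLER.substitute(sym=sym, probe_id=index)
        for index, sym in enumerate(symbols)
    ]
    return HEADER + "".join(handlers)


def attach(symbols, library_path):
    """Compile the generated source and attach each handler as a uprobe.

    Requires root privileges and the BCC Python bindings.
    """
    from bcc import BPF  # imported lazily so generation works without BCC

    bpf = BPF(text=generate_program(symbols))
    for sym in symbols:
        bpf.attach_uprobe(name=library_path, sym=sym, fn_name=f"trace_{sym}")
    return bpf
```

With BCC installed, something like `attach(["cudaMalloc", "cudaLaunchKernel"], "/usr/local/cuda/lib64/libcudart.so")` would attach both handlers. The path is a placeholder and depends on the local CUDA installation.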
Visualization
To visualize monitoring data, deploy Grafana and MySQL using Docker:
cd grafana/
sh ./launch.sh
Once the containers are running, access the Grafana dashboard at http://127.0.0.1:3000.
Start the monitoring service with:
./service.sh
Stop the monitoring service with:
./stop.sh
Case Demonstration
The demo folder provides example programs to showcase the capabilities of eACGM:
- pytorch_example.py: Multi-node, multi-GPU PyTorch training demo
- sampler_cuda.py: Trace CUDA Runtime events using eBPF
- sampler_nccl.py: Trace NCCL GPU communication events using eBPF
- sampler_torch.py: Trace PyTorch operator events using eBPF
- sampler_python.py: Trace Python VM function calls using eBPF
- sampler_gpu.py: Monitor global GPU information using libnvml
- sampler_nccl.py: Monitor process-level GPU information using libnvml
- sampler_eacg.py: Combined monitoring of all supported sources
- webui.py: Automatically visualize captured data in Grafana
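For readers unfamiliar with libnvml, the sketch below shows roughly what global and process-level GPU sampling looks like through the `pynvml` bindings. It is not the code of the demo samplers: the function name `sample_gpus` and the shape of the returned rows are assumptions made for this example. The NVML binding is passed in as a parameter so the function can be exercised without a GPU.

```python
def sample_gpus(nvml):
    """Take one snapshot of every GPU using an NVML binding such as pynvml.

    Returns one dict per device with global utilization and memory figures,
    plus the GPU memory used by each compute process on that device.
    """
    nvml.nvmlInit()
    try:
        rows = []
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            util = nvml.nvmlDeviceGetUtilizationRates(handle)
            mem = nvml.nvmlDeviceGetMemoryInfo(handle)
            procs = nvml.nvmlDeviceGetComputeRunningProcesses(handle)
            rows.append({
                "gpu": index,
                "util_percent": util.gpu,          # global GPU utilization
                "mem_used_bytes": mem.used,        # global memory in use
                "mem_total_bytes": mem.total,
                # process-level view: pid -> GPU memory used by that process
                "processes": {p.pid: p.usedGpuMemory for p in procs},
            })
        return rows
    finally:
        # Always release NVML, even if a query above raised.
        nvml.nvmlShutdown()

# Usage on a machine with an NVIDIA driver and pynvml installed:
#   import pynvml
#   for row in sample_gpus(pynvml):
#       print(row)
```

Passing the binding as an argument is a choice made for this sketch. It keeps the sampling logic independent of whether `pynvml` is importable on the current machine.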
Citation
If you find this project helpful, please consider citing our IWQoS 2025 paper (In press, to appear).