2025-08-07 10:14:54 +08:00

eACGM

eACGM: An eBPF-based Automated Comprehensive Governance and Monitoring framework for AI/ML systems.


[News] Our work has been accepted by IEEE/ACM IWQoS 2025 (CCF-B)!

Paper (Dropbox)


eACGM provides non-intrusive, low-overhead, full-stack observability for both the hardware (GPU, NCCL) and software (CUDA, Python, PyTorch) layers of modern AI/ML workloads.

Architecture

Features

  • Event tracing for CUDA Runtime based on eBPF
  • Event tracing for NCCL GPU communication library based on eBPF
  • Function call tracing for Python virtual machine based on eBPF
  • Operator tracing for PyTorch based on eBPF
  • Process-level GPU information monitoring based on libnvml
  • Global GPU information monitoring based on libnvml
  • Automatic eBPF program generation
  • Comprehensive analysis of all traced events and operators
  • Flexible integration for multi-level tracing (CUDA, NCCL, PyTorch, Python, GPU)
  • Visualization-ready data output for monitoring platforms
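
To give a feel for what automatic eBPF program generation can look like, here is a minimal sketch that renders BCC-style C source from a list of function names. It is not eACGM's actual generator: the template, the `event_t` layout, and the `generate_bpf_program` helper are all hypothetical and chosen only for illustration.

```python
from string import Template

# Hypothetical shared header: one event record type and one perf output channel.
HEADER = """
#include <uapi/linux/ptrace.h>

struct event_t {
    u32 pid;
    u64 ts_ns;
    char func[32];
};
BPF_PERF_OUTPUT(events);
"""

# Hypothetical per-function handler: records the PID, a kernel timestamp,
# and the name of the traced function each time the probe fires.
PROBE_TEMPLATE = Template("""
int trace_${name}(struct pt_regs *ctx) {
    struct event_t ev = {};
    ev.pid = bpf_get_current_pid_tgid() >> 32;
    ev.ts_ns = bpf_ktime_get_ns();
    __builtin_memcpy(ev.func, "${name}", sizeof("${name}"));
    events.perf_submit(ctx, &ev, sizeof(ev));
    return 0;
}
""")


def generate_bpf_program(functions):
    """Return BCC C source containing one probe handler per function name."""
    for name in functions:
        # Reject names that cannot be spliced into a C function name, or that
        # would not fit (with the trailing NUL) into the 32-byte func field.
        if not (name.isascii() and name.isidentifier()) or len(name) > 31:
            raise ValueError(f"unsupported function name: {name!r}")
    return HEADER + "".join(PROBE_TEMPLATE.substitute(name=n) for n in functions)
```

With BCC installed and root privileges, the generated text could be loaded with `BPF(text=...)` and each handler attached with `attach_uprobe(name=<path to the CUDA runtime library>, sym="cudaMalloc", fn_name="trace_cudaMalloc")`. The library path depends on the local CUDA installation.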

Visualization

To visualize monitoring data, deploy Grafana and MySQL with Docker using the launch script below, then open the Grafana dashboard at http://127.0.0.1:3000.

cd grafana/
sh ./launch.sh
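
After the containers start, it can help to confirm that Grafana is reachable before starting the monitoring service. The sketch below polls Grafana's `/api/health` endpoint with the standard library only. The `grafana_is_healthy` helper and its `opener` parameter are illustrative assumptions, not part of eACGM.

```python
import json
import urllib.request


def grafana_is_healthy(base_url="http://127.0.0.1:3000", timeout=5.0,
                       opener=urllib.request.urlopen):
    """Return True if Grafana's /api/health reports a working database."""
    url = base_url.rstrip("/") + "/api/health"
    try:
        # `opener` defaults to urlopen; it is a parameter so the check can be
        # exercised without a running Grafana instance.
        with opener(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection failures are OSError subclasses; bad JSON is a ValueError.
        return False
    return isinstance(payload, dict) and payload.get("database") == "ok"
```

Calling `grafana_is_healthy()` with no arguments checks the default address used above.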

Start the monitoring service with:

./service.sh

Stop the monitoring service with:

./stop.sh

Case Demonstration

The demo folder provides example programs to showcase the capabilities of eACGM:

  • pytorch_example.py: Multi-node, multi-GPU PyTorch training demo
  • sampler_cuda.py: Trace CUDA Runtime events using eBPF
  • sampler_nccl.py: Trace NCCL GPU communication events using eBPF
  • sampler_torch.py: Trace PyTorch operator events using eBPF
  • sampler_python.py: Trace Python VM function calls using eBPF
  • sampler_gpu.py: Monitor global GPU information using libnvml
  • sampler_nccl.py: Monitor process-level GPU information using libnvml
  • sampler_eacg.py: Combined monitoring of all supported sources
  • webui.py: Automatically visualize captured data in Grafana
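
For readers unfamiliar with libnvml, the following sketch shows roughly what global GPU sampling involves. It is not the code in the demo folder: `sample_gpus` and the record field names are hypothetical, and only the `nvml*` calls are taken from the pynvml bindings.

```python
import time


def sample_gpus(nvml):
    """Return one utilization/memory record per visible GPU.

    `nvml` is any module exposing the pynvml function names used below;
    passing it in keeps the function usable in tests on machines without a GPU.
    """
    nvml.nvmlInit()
    try:
        records = []
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            util = nvml.nvmlDeviceGetUtilizationRates(handle)
            mem = nvml.nvmlDeviceGetMemoryInfo(handle)
            records.append({
                "ts": time.time(),            # wall-clock sample time, seconds
                "gpu": index,                 # NVML device index
                "util_percent": util.gpu,     # GPU utilization in percent
                "mem_used_bytes": mem.used,
                "mem_total_bytes": mem.total,
            })
        return records
    finally:
        # Release NVML even if a query above raised.
        nvml.nvmlShutdown()
```

With the pynvml package installed, `import pynvml` followed by `sample_gpus(pynvml)` returns a list of dictionaries that can be written to a database or printed directly.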

Citation

If you find this project helpful, please consider citing our IWQoS 2025 paper (in press).
