# eACGM **eACGM:** An **e**BPF-based **A**utomated **C**omprehensive **G**overnance and **M**onitoring framework for AI/ML systems. --- :star: **[News] Our work has been accepted by [IEEE/ACM IWQoS 2025 (CCF-B)! ](https://iwqos2025.ieee-iwqos.org/)** **[Paper(Dropbox)](https://www.dropbox.com/scl/fi/q4vplv95usw4u5h3syx62/IWQoS_2025.pdf?rlkey=gv8h65oupkzrmv6zu1yu7s558&e=1&st=k8sttham&dl=0)** --- eACGM provides zero-intrusive, low-overhead, full-stack observability for both hardware (GPU, NCCL) and software (CUDA, Python, PyTorch) layers in modern AI/ML workloads. ![Architecture](asset/arch.png) ## Features - [x] **Event tracing for CUDA Runtime** based on eBPF - [x] **Event tracing for NCCL GPU communication library** based on eBPF - [x] **Function call tracing for Python virtual machine** based on eBPF - [x] **Operator tracing for PyTorch** based on eBPF - [x] **Process-level GPU information monitoring** based on `libnvml` - [x] **Global GPU information monitoring** based on `libnvml` - [x] **Automatic eBPF program generation** - [x] **Comprehensive analysis** of all traced events and operators - [x] **Flexible integration** for multi-level tracing (CUDA, NCCL, PyTorch, Python, GPU) - [x] **Visualization-ready data output** for monitoring platforms ## Visualization To visualize monitoring data, deploy Grafana and MySQL using Docker. Access the Grafana dashboard at [http://127.0.0.1:3000](http://127.0.0.1:3000). ```bash cd grafana/ sh ./launch.sh ``` Start the monitoring service with: ```bash ./service.sh ``` Stop the monitoring service with: ```bash ./stop.sh ``` ## Case Demonstration The `demo` folder provides example programs to showcase the capabilities of eACGM: - `pytorch_example.py`: Multi-node, multi-GPU PyTorch training demo - `sampler_cuda.py`: Trace CUDA Runtime events using eBPF - `sampler_nccl.py`: Trace NCCL GPU communication events using eBPF - `sampler_torch.py`: Trace PyTorch operator events using eBPF - `sampler_python.py`: Trace Python VM function calls using eBPF - `sampler_gpu.py`: Monitor global GPU information using `libnvml` - `sampler_nccl.py`: Monitor process-level GPU information using `libnvml` - `sampler_eacg.py`: Combined monitoring of all supported sources - `webui.py`: Automatically visualize captured data in Grafana ## Citation If you find this project helpful, please consider citing our IWQoS 2025 paper (In press, to appear).