It’s time to give back to the open-source community again. This time, our infra specialist, Dan, has created a clever way to solve Kubernetes GPU monitoring. In this article, we will discuss what the solution is, how it works, and why it’s important for the open-source community.
Current problems with Kubernetes GPU monitoring
The current problems with Kubernetes GPU monitoring are clear:
- Standard Kubernetes GPU monitoring tools only track pods with explicit `nvidia.com/gpu` resource requests
- Many applications utilize GPUs without declaring these resource requests
- There are significant monitoring gaps where GPU usage is effectively "invisible"
The solution: CM PurplePill
What exactly is CM PurplePill?
"It is a lightweight Prometheus exporter for NVIDIA GPU metrics in Kubernetes environments that tracks pod-level GPU usage *without* requiring explicit GPU resource declarations," says Dan Kozytskyi, Infrastructure Engineer at ConfidentialMind.
It's important because NVIDIA DCGM can't map GPU usage to individual K8s Pods unless they carry explicit `nvidia.com/gpu` resource declarations. This means you can't use features like vLLM parallel GPU sharing, or assign less than 100% of a GPU to each workload, and still monitor per-Pod usage. With CM PurplePill you are free to use any combination of parallelism parameters while still monitoring the actual per-Pod GPU usage. CM PurplePill is also much lighter than NVIDIA DCGM and relies on fewer components.
Key Value:
- Complete visibility into all GPU workloads, including those without resource declarations
- Lightweight solution with small operational footprint
- Full control over monitoring stack
- Simple deployment as a DaemonSet on GPU-enabled nodes
You can find more information and access the CM PurplePill code on our GitHub page. Please like our page and share it with anyone you think would benefit from this solution. This way we can work together to improve Kubernetes GPU monitoring for everyone!
NB! The current release supports NVIDIA GPUs only, but with slight modifications we will support AMD and other GPUs in upcoming releases.
Features of our Kubernetes GPU monitoring solution:
- Exposes GPU metrics in Prometheus format
- Shows per K8s Pod usage of GPUs
- Does not rely on GPU resource declaration in Pod manifest (*)
- Has the unique ability to show GPU usage by Pods that claim less than a whole GPU (*)
- Is not limited to a particular GPU vendor (*)
- Can run as a K8s DaemonSet and as a host OS level service
- With a slight modification, can be used for GPU usage monitoring in any containerized environment, not only Docker-like ones
(*) Unlike the NVIDIA DCGM Prometheus metrics exporter
Core Metrics explained:
- `CM_PURPLEPILL_GPU_MEMORY_TOTAL_MIB` - Total GPU memory
- `CM_PURPLEPILL_GPU_MEMORY_USED_TOTAL_MIB` - Total used memory
- `CM_PURPLEPILL_GPU_MEMORY_FREE_MIB` - Free memory
- `CM_PURPLEPILL_GPU_UTILIZATION` - GPU utilisation percentage
- `CM_PURPLEPILL_GPU_MEMORY_USED_POD_MIB` - Pod-specific memory usage
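As a rough illustration of how these metrics could be consumed, the Python sketch below parses Prometheus text-format output and totals `CM_PURPLEPILL_GPU_MEMORY_USED_POD_MIB` per pod. The label names (`gpu`, `pod`) and the sample values are assumptions made for this example only; check the exporter's real output for the labels it actually emits.

```python
import re
from collections import defaultdict

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line in exposition text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


def pod_memory_mib(text, pod_label="pod"):
    """Sum per-pod GPU memory across all GPUs.

    pod_label is an assumed label name, not taken from the exporter's docs.
    """
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "CM_PURPLEPILL_GPU_MEMORY_USED_POD_MIB":
            totals[labels.get(pod_label, "<unknown>")] += value
    return dict(totals)


# Hypothetical scrape output, invented for illustration.
SAMPLE = """
CM_PURPLEPILL_GPU_MEMORY_TOTAL_MIB{gpu="0"} 81920
CM_PURPLEPILL_GPU_MEMORY_USED_POD_MIB{gpu="0",pod="vllm-a"} 20480
CM_PURPLEPILL_GPU_MEMORY_USED_POD_MIB{gpu="1",pod="vllm-a"} 20480
CM_PURPLEPILL_GPU_MEMORY_USED_POD_MIB{gpu="0",pod="embedder"} 4096
"""

if __name__ == "__main__":
    for pod, mib in sorted(pod_memory_mib(SAMPLE).items()):
        print(f"{pod}: {mib:.0f} MiB")
```

In a real setup the same aggregation would normally be done in Prometheus itself; the script is only meant to show what the pod-level metric makes possible.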
Deployment Options
1. Kubernetes DaemonSet (Recommended)
All-in-one container deployment in K8s. Runs on GPU nodes with node selector: `feature.node.kubernetes.io/pci-10de.present: "true"`
- Requires hostPID access to monitor processes
- Works with Prometheus Operator's ScrapeConfig
- Uses standard NVIDIA software for K8s hosts
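To make the DaemonSet requirements above concrete, here is a minimal Python sketch that builds a manifest with the node selector and `hostPID` setting and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The image reference, resource name, namespace, and port are all hypothetical placeholders, and the manifest leaves out whatever the container needs to reach the NVIDIA driver on the host. Treat the project's own manifests as the reference.

```python
import json


def build_daemonset(image, name="cm-purplepill", namespace="monitoring", port=9100):
    """Return a DaemonSet manifest as a plain dict.

    name, namespace, and port are illustrative defaults, not values
    taken from the CM PurplePill repository.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name, "namespace": namespace, "labels": labels},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Needed so the exporter can see host processes.
                    "hostPID": True,
                    # Schedule only on nodes with an NVIDIA PCI device.
                    "nodeSelector": {
                        "feature.node.kubernetes.io/pci-10de.present": "true"
                    },
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"name": "metrics", "containerPort": port}],
                        }
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical image reference; replace with the real one.
    manifest = build_daemonset("registry.example.com/cm-purplepill:latest")
    print(json.dumps(manifest, indent=2))
```

The `feature.node.kubernetes.io/pci-10de.present` label is normally set by Node Feature Discovery, so the selector only matches if that component (or an equivalent labeler) is running in the cluster.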
2. Direct Host Installation
Deployable as a systemd service or Docker container, and installable via pip or from source.
Minimal Dependencies:
- Python 3.7+
- NVIDIA drivers with `nvidia-smi` tool
- No external Python packages (standard library only)
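The dependency list hints at how per-pod attribution can work with nothing more than `nvidia-smi` and the standard library. The sketch below is one plausible approach, not a description of CM PurplePill's actual code: it asks `nvidia-smi` for per-process GPU memory, then looks up each PID's cgroup to find a Kubernetes pod UID. The cgroup path patterns and all function names are assumptions for illustration.

```python
import re
import subprocess

# Real nvidia-smi query flags: per-process PID and used GPU memory in MiB.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader,nounits",
]

# Assumed cgroup layouts: ".../pod<uid>/..." (cgroupfs driver) or
# "...-pod<uid_with_underscores>.slice/..." (systemd driver).
POD_UID_RE = re.compile(
    r"pod([0-9a-f]{8}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{12})"
)


def parse_compute_apps(csv_text):
    """Map PID -> used GPU memory (MiB), summed across GPUs."""
    usage = {}
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2 or not fields[0].isdigit():
            continue
        try:
            mib = int(fields[1])
        except ValueError:  # e.g. "[N/A]" when memory is not reported
            continue
        pid = int(fields[0])
        usage[pid] = usage.get(pid, 0) + mib
    return usage


def pod_uid_from_cgroup(cgroup_text):
    """Extract a pod UID from /proc/<pid>/cgroup content, or None."""
    match = POD_UID_RE.search(cgroup_text)
    return match.group(1).replace("_", "-") if match else None


def read_gpu_usage_by_pod():
    """Return {pod_uid: MiB}; needs nvidia-smi and access to host PIDs."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    per_pod = {}
    for pid, mib in parse_compute_apps(out).items():
        try:
            with open(f"/proc/{pid}/cgroup") as handle:
                uid = pod_uid_from_cgroup(handle.read())
        except OSError:  # process exited between the two reads
            continue
        key = uid or "<host>"
        per_pod[key] = per_pod.get(key, 0) + mib
    return per_pod
```

Reading `/proc/<pid>/cgroup` for processes in other containers is the reason a lookup like this needs the host PID namespace, which is consistent with the `hostPID` requirement listed for the DaemonSet option.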
CM PurplePill vs NVIDIA DCGM
Digital sovereignty benefits explained:
Conclusion
CM PurplePill offers complete visibility into all Kubernetes GPU workloads, including those without resource declarations, which has been a common challenge in Kubernetes environments. With its lightweight design and small operational footprint, it closes the monitoring gaps left by tools that depend on `nvidia.com/gpu` requests. It gives you full control over the monitoring stack, and its simple deployment as a DaemonSet on GPU-enabled nodes makes it easy to integrate into your existing infrastructure. This open-source solution helps you track every pod's GPU usage, whether or not explicit resource declarations are made.
Greetings from the CEO
