Introducing CM PurplePill: Kubernetes GPU Monitoring Solution.

Raido Linde | April 2, 2025

It’s time to give back to the open-source community again. This time, our infra specialist, Dan, has created a clever way to solve Kubernetes GPU monitoring. In this article, we will discuss what the solution is, how it works, and why it’s important for the open-source community.

Current problems with Kubernetes GPU monitoring

The current problems with Kubernetes GPU monitoring are clear:

  1. Standard Kubernetes GPU monitoring tools only track pods with explicit `nvidia.com/gpu` resource requests
  2. Many applications utilize GPUs without declaring these resource requests
  3. There are significant monitoring gaps where GPU usage is effectively "invisible"

The solution: CM PurplePill

What exactly is CM PurplePill?

"It is a lightweight Prometheus exporter for NVIDIA GPU metrics in Kubernetes environments that tracks pod-level GPU usage *without* requiring explicit GPU resource declarations," says Dan Kozytskyi, Infrastructure Engineer at ConfidentialMind.


It's important because NVIDIA DCGM can't map GPU usage to individual K8s Pods without explicit `nvidia.com/gpu` resource declarations. This means that you can't use features like vLLM parallel GPU sharing, or assign less than 100% of each GPU, and still see per-Pod usage. With CM PurplePill, you are free to use any combination of parallelism parameters while still monitoring the actual per-Pod GPU usage. CM PurplePill is also much lighter than NVIDIA DCGM and relies on fewer components.

Key Value:

  • Complete visibility into all GPU workloads, including those without resource declarations
  • Lightweight solution with small operational footprint
  • Full control over monitoring stack
  • Simple deployment as a DaemonSet on GPU-enabled nodes

You can find more information and access the CM PurplePill code on our GitHub page. Please like our page and share it with anyone you think would benefit from this solution. This way, we can work together to improve Kubernetes GPU monitoring for everyone!

NB! The current release supports NVIDIA only, but with a slight modification we will support AMD and other GPUs in upcoming releases.

Features of our Kubernetes GPU monitoring solution:

  • Exposes GPU metrics in Prometheus format
  • Shows per K8s Pod usage of GPUs
  • Does not rely on GPU resource declaration in Pod manifest (*)
  • Has the unique ability to show GPU usage by Pods that claim less than a whole GPU (*)
  • Is not limited to a particular GPU vendor (*)
  • Can run as a K8s DaemonSet and as a host OS level service
  • With a slight modification, can be used for GPU usage monitoring in any containerized environment, not limited to Docker-like ones

(*) Unlike the NVIDIA DCGM Prometheus metrics exporter

Core metrics explained:

  • `CM_PURPLEPILL_GPU_MEMORY_TOTAL_MIB` - Total GPU memory
  • `CM_PURPLEPILL_GPU_MEMORY_USED_TOTAL_MIB` - Total used memory
  • `CM_PURPLEPILL_GPU_MEMORY_FREE_MIB` - Free memory
  • `CM_PURPLEPILL_GPU_UTILIZATION` - GPU utilisation percentage
  • `CM_PURPLEPILL_GPU_MEMORY_USED_POD_MIB` - Pod-specific memory usage
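To give a feel for what these metrics look like on the wire, here is a minimal sketch that renders one of them in the Prometheus text exposition format using only the Python standard library. The metric name comes from the list above. The label names (`gpu`, `pod`, `namespace`), the sample value, the gauge type, and the helper functions are our assumptions for illustration and are not taken from the CM PurplePill source.

```python
from typing import Dict, List, Tuple

# (labels, value) pair for a single time series sample
Sample = Tuple[Dict[str, str], float]


def format_labels(labels: Dict[str, str]) -> str:
    """Render labels as {k="v",...}, escaping backslash, quote and newline."""
    if not labels:
        return ""
    parts = []
    for key in sorted(labels):
        value = (
            labels[key]
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        parts.append('{}="{}"'.format(key, value))
    return "{" + ",".join(parts) + "}"


def render_gauge(name: str, help_text: str, samples: List[Sample]) -> str:
    """Render one gauge metric with its HELP and TYPE header lines."""
    lines = [
        "# HELP {} {}".format(name, help_text),
        "# TYPE {} gauge".format(name),
    ]
    for labels, value in samples:
        lines.append("{}{} {}".format(name, format_labels(labels), value))
    return "\n".join(lines) + "\n"


# Hypothetical sample: one Pod using 10240 MiB on GPU 0.
example = render_gauge(
    "CM_PURPLEPILL_GPU_MEMORY_USED_POD_MIB",
    "Pod-specific memory usage",
    [({"gpu": "0", "pod": "vllm-0", "namespace": "default"}, 10240)],
)
print(example, end="")
```

Labels are sorted so that the output is stable between scrapes, which makes the text easier to diff and test.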

Deployment Options

1. Kubernetes DaemonSet (Recommended)

All-in-one container deployment in K8s. Runs on GPU nodes with node selector: `feature.node.kubernetes.io/pci-10de.present: "true"`

  • Requires hostPID access to monitor processes
  • Works with Prometheus Operator's ScrapeConfig
  • Uses standard NVIDIA software for K8s hosts
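Why would an exporter need hostPID? One common way to attribute a GPU process to a Pod is to read the process's cgroup path from `/proc/<pid>/cgroup` on the host, since Kubernetes embeds the Pod UID in that path. The sketch below illustrates that general technique under our own assumptions; we are not claiming this is exactly how CM PurplePill does it. The function names, the regular expression, and the example cgroup paths are hypothetical.

```python
import re
from typing import Optional

# Matches "pod<uid>" in both cgroup layouts: the systemd driver writes the
# UID with underscores, the cgroupfs driver writes it with dashes.
_POD_UID_RE = re.compile(
    r"pod([0-9a-f]{8}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{12})"
)


def pod_uid_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Extract a Kubernetes Pod UID from the contents of /proc/<pid>/cgroup."""
    match = _POD_UID_RE.search(cgroup_text)
    if match is None:
        return None  # not a Kubernetes-managed process
    return match.group(1).replace("_", "-")


def pod_uid_for_pid(pid: int, proc_root: str = "/proc") -> Optional[str]:
    """Look up the Pod UID for a host PID; needs the host PID namespace."""
    try:
        with open("{}/{}/cgroup".format(proc_root, pid)) as handle:
            return pod_uid_from_cgroup(handle.read())
    except OSError:
        return None  # process exited or is not visible


# Hypothetical cgroup contents for the two common drivers.
SYSTEMD_EXAMPLE = (
    "0::/kubepods.slice/kubepods-burstable.slice/"
    "kubepods-burstable-pod0f1e2d3c_4b5a_6978_8796_a5b4c3d2e1f0.slice/"
    "cri-containerd-abc123.scope\n"
)
CGROUPFS_EXAMPLE = (
    "12:memory:/kubepods/besteffort/"
    "pod0f1e2d3c-4b5a-6978-8796-a5b4c3d2e1f0/0123abcd\n"
)

print(pod_uid_from_cgroup(SYSTEMD_EXAMPLE))
print(pod_uid_from_cgroup(CGROUPFS_EXAMPLE))
```

The resulting UID can then be matched against Pod metadata to obtain the Pod name and namespace. Without hostPID, the `/proc` entries of processes in other containers are not visible from inside the exporter's container.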

2. Direct Host Installation

Deployable as a systemd service or Docker container, and installable via pip or from source.

Minimal Dependencies:
    - Python 3.7+
    - NVIDIA drivers with `nvidia-smi` tool
    - No external Python packages (standard library only)
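To show how far the standard library plus `nvidia-smi` can go, here is a minimal sketch that queries per-GPU memory and utilisation and parses the CSV output. The `--query-gpu` and `--format=csv,noheader,nounits` options are standard `nvidia-smi` flags. The function names, dictionary keys, and sample output are our assumptions for illustration and may differ from what CM PurplePill does internally.

```python
import subprocess
from typing import Dict, List

GPU_QUERY = "index,memory.total,memory.used,memory.free,utilization.gpu"
FIELD_NAMES = [
    "index",
    "memory_total_mib",
    "memory_used_mib",
    "memory_free_mib",
    "utilization_percent",
]


def parse_gpu_csv(text: str) -> List[Dict[str, int]]:
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    rows = []
    for line in text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != len(FIELD_NAMES):
            continue  # skip malformed lines
        try:
            values = [int(field) for field in fields]
        except ValueError:
            continue  # skip rows with non-numeric values such as "[N/A]"
        rows.append(dict(zip(FIELD_NAMES, values)))
    return rows


def query_gpus() -> List[Dict[str, int]]:
    """Run nvidia-smi on the host and return one dict per GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + GPU_QUERY,
            "--format=csv,noheader,nounits",
        ],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # Python 3.7-compatible text mode
        check=True,
    )
    return parse_gpu_csv(result.stdout)


# Hypothetical output for a node with two GPUs (memory values in MiB).
SAMPLE_OUTPUT = "0, 81920, 10240, 71680, 37\n1, 81920, 0, 81920, 0\n"
rows = parse_gpu_csv(SAMPLE_OUTPUT)
print(rows[0]["memory_used_mib"], rows[0]["utilization_percent"])
```

On a GPU host, calling `query_gpus()` would return the same structure from the live driver. Keeping the parser separate from the subprocess call makes it testable on machines without a GPU.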

CM PurplePill vs NVIDIA DCGM

Digital sovereignty benefits explained:

| Factor | CM PurplePill | NVIDIA DCGM |
| --- | --- | --- |
| Implementation Control | Open architecture with visibility into monitoring logic | Proprietary black-box solution |
| Vendor Independence | Adaptable for non-NVIDIA GPUs by modifying collection layer | NVIDIA-specific solution only |
| Customisability | Easily modifiable for specific environments | Configuration limited to provided options |

Conclusion

CM PurplePill offers complete visibility into all Kubernetes GPU workloads, including those without resource declarations, which has been a common challenge in Kubernetes environments. With its lightweight design and small operational footprint, CM PurplePill ensures efficient monitoring without GPU usage gaps. It provides full control over the monitoring stack, empowering you to track GPU usage with precision. Plus, its simple deployment as a DaemonSet on GPU-enabled nodes makes it easy to integrate into your existing infrastructure. This open-source solution is a game-changer for Kubernetes GPU monitoring, helping you track every pod’s GPU usage, regardless of whether explicit resource declarations are made.

