GPU workloads on Kubernetes promise flexibility and utilisation, but production deployments have a way of exposing gaps in preparation. A checklist from The Good Shell covers the operational details that separate a working demo from a reliable GPU platform.
Install and verify the GPU Operator
The NVIDIA GPU Operator automates driver installation, container runtime setup and device plugin registration. Running GPU workloads without it means managing these pieces manually, which is error-prone and hard to reproduce. The operator should be treated as the default starting point for NVIDIA-based clusters.
After installation, verify that each GPU node advertises the nvidia.com/gpu resource with the expected count and that a test pod requesting it is actually scheduled. A common failure is the device plugin not registering, which leaves GPUs invisible to the scheduler.
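The node check described above can be scripted. The following is a minimal sketch, assuming the input is the parsed JSON from `kubectl get nodes -o json` and that GPUs are exposed under the `nvidia.com/gpu` resource name; the function names and the list of expected GPU nodes are hypothetical.

```python
GPU_RESOURCE = "nvidia.com/gpu"  # resource name registered by the NVIDIA device plugin


def gpu_allocatable(node_list: dict) -> dict[str, int]:
    """Map each node name to the number of GPUs it reports as allocatable."""
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # A node with no registered device plugin has no GPU entry at all.
        counts[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts


def nodes_missing_gpus(node_list: dict, expected_gpu_nodes: list[str]) -> list[str]:
    """Return the expected GPU nodes that report zero GPUs (or are absent)."""
    counts = gpu_allocatable(node_list)
    return sorted(n for n in expected_gpu_nodes if counts.get(n, 0) == 0)
```

Any node returned by `nodes_missing_gpus` is a candidate for the device plugin registration failure mentioned above and is worth inspecting before workloads are scheduled onto it.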
Partition GPUs where appropriate
Not every workload needs a full GPU. Multi-Instance GPU (MIG) and time-slicing both let multiple workloads share a single physical card. MIG, available on supported data-centre GPUs, partitions the card in hardware, which gives stronger isolation and predictable performance; time-slicing is simpler to set up but offers less isolation, since workloads take turns on the same GPU. Choose based on workload sensitivity and whether tenants can tolerate sharing.
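The choice between the two modes can be written down as a small decision rule. This is only an illustrative sketch of the trade-off described above, not a policy taken from the checklist; the `Sharing` enum, the function name and its three boolean inputs are all hypothetical.

```python
from enum import Enum


class Sharing(Enum):
    FULL_GPU = "full-gpu"
    MIG = "mig"
    TIME_SLICING = "time-slicing"


def choose_sharing(needs_full_gpu: bool, needs_isolation: bool, mig_supported: bool) -> Sharing:
    """Pick a sharing mode from workload needs and hardware capability."""
    if needs_full_gpu:
        return Sharing.FULL_GPU
    if needs_isolation:
        # Assumption: isolation-sensitive work gets a dedicated GPU
        # when the card cannot be partitioned with MIG.
        return Sharing.MIG if mig_supported else Sharing.FULL_GPU
    # Tenants that tolerate sharing can use the simpler mechanism.
    return Sharing.TIME_SLICING
```

For example, a development notebook that tolerates noisy neighbours would map to time-slicing, while a latency-sensitive tenant on MIG-capable hardware would map to a MIG partition.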
Schedule jobs with Kueue
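Without a job queue, a large training job can monopolise all available GPUs and block higher-priority work. Kueue adds job queueing on top of the standard Kubernetes scheduler, with quotas, priorities and fair sharing between teams, and decides when a job is admitted to run. On clusters shared by several teams, this kind of admission control is essential.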
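To make the effect of quotas and priorities concrete, here is a toy admission loop. It is a deliberately simplified sketch and does not reproduce Kueue's actual algorithm or API; the `Job` type, the `admit` function and the per-team quota dictionary are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    team: str
    gpus: int
    priority: int  # higher value is admitted first


def admit(jobs: list[Job], quotas: dict[str, int], total_gpus: int) -> list[str]:
    """Admit jobs in priority order while respecting team quotas and cluster capacity."""
    used = {team: 0 for team in quotas}
    free = total_gpus
    admitted = []
    # sorted() is stable, so jobs with equal priority keep submission order.
    for job in sorted(jobs, key=lambda j: -j.priority):
        within_quota = used.get(job.team, 0) + job.gpus <= quotas.get(job.team, 0)
        if within_quota and job.gpus <= free:
            used[job.team] = used.get(job.team, 0) + job.gpus
            free -= job.gpus
            admitted.append(job.name)
    return admitted
```

With quotas of four GPUs per team on an eight-GPU cluster, an eight-GPU training job from one team stays queued instead of starving the other team's higher-priority serving job.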
Use Spot GPU nodes for fault-tolerant work
Spot or preemptible GPU instances can cut costs substantially, at the price of being reclaimed at short notice. They suit work that can resume after an interruption: training jobs with checkpointing, batch inference and development workloads. They are not suitable for real-time serving where interruptions affect users.
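What makes a training job tolerant of Spot interruptions is that it resumes from its last checkpoint rather than from scratch. The sketch below shows that pattern with a stand-in workload; the JSON state file, the function names and the `stop_at` parameter (used only to simulate a reclaimed node) are assumptions for illustration, and a real job would save model and optimiser state to durable storage instead.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write the checkpoint atomically so an interruption never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename onto the final name


def load_checkpoint(path: str) -> dict:
    """Load the last checkpoint, or start fresh if none exists."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def train(path: str, total_steps: int, checkpoint_every: int, stop_at=None) -> dict:
    """Run (or resume) a toy training loop, checkpointing every few steps."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if stop_at is not None and state["step"] == stop_at:
            return state  # simulated preemption: progress since last checkpoint is lost
        state["total"] += state["step"]  # stand-in for one training step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

The checkpoint interval is the trade-off to tune: frequent checkpoints cost I/O time, while infrequent ones mean more repeated work after each interruption.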
Autoscale inference with vLLM and KEDA
For LLM serving, vLLM improves throughput through efficient memory management and batching. KEDA provides autoscaling based on custom metrics such as queue length or request latency. Together they let inference scale with demand rather than running at peak capacity continuously.
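Scaling on queue length comes down to simple arithmetic: divide the outstanding work by the amount each replica should handle, round up, and clamp to configured bounds. The sketch below approximates that calculation for illustration; it is not KEDA's code, and the function name and parameters are hypothetical.

```python
import math


def desired_replicas(queue_length: int, target_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count needed so each replica handles at most the target queue share."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_length / target_per_replica)
    # Clamp so the deployment never drops below the floor or exceeds the GPU budget.
    return max(min_replicas, min(max_replicas, wanted))
```

With 45 queued requests and a target of 10 per replica, this yields 5 replicas; an empty queue falls back to the minimum, and a burst of 500 requests is capped at the maximum.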
Observability and cost tracking
GPU utilisation should be monitored continuously. Low utilisation often means oversized models, poor batching or inefficient scheduling. Track cost per job or per inference request so that optimisation efforts are directed at the biggest spenders. Dashboards should cover GPU memory usage, temperature, power draw and queue length alongside traditional CPU and memory metrics.
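Cost attribution can start with very little machinery. The following sketch assumes per-job GPU counts, runtimes and average utilisation have already been collected from monitoring; the `JobUsage` type, the function names, the flat price per GPU-hour and the 30% utilisation threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobUsage:
    name: str
    gpu_count: int
    hours: float
    avg_utilisation: float  # fraction between 0.0 and 1.0


def job_cost(job: JobUsage, price_per_gpu_hour: float) -> float:
    """Cost of a job as GPU-hours multiplied by an assumed flat hourly price."""
    return job.gpu_count * job.hours * price_per_gpu_hour


def biggest_spenders(jobs: list[JobUsage], price_per_gpu_hour: float) -> list[str]:
    """Job names ordered from most to least expensive."""
    ranked = sorted(jobs, key=lambda j: job_cost(j, price_per_gpu_hour), reverse=True)
    return [j.name for j in ranked]


def underutilised(jobs: list[JobUsage], threshold: float = 0.3) -> list[str]:
    """Job names whose average GPU utilisation is below the threshold."""
    return [j.name for j in jobs if j.avg_utilisation < threshold]


def cost_per_request(total_cost: float, requests: int) -> float:
    """Average cost of one inference request over a billing window."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return total_cost / requests
```

Jobs that appear in both the biggest-spender ranking and the underutilised list are usually the first place to look for savings.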
Secure GPU workloads
GPU nodes are valuable targets and often run sensitive models or data. Restrict access to the namespaces that run GPU workloads, use workload identity rather than long-lived credentials, scan container images and keep drivers and runtimes patched. GPU clusters should be on the same security review schedule as any other production environment.
Conclusion
Production GPU workloads on Kubernetes are manageable if the right foundations are in place. The checklist from The Good Shell is a useful reference, but the underlying principle is to treat GPUs as a scarce, expensive resource that deserves the same operational rigour as production databases or payment systems. Teams that invest in these foundations early avoid painful rework when scale and cost pressure arrive.