A production checklist for Kubernetes GPU workloads.

GPU workloads on Kubernetes promise flexibility and utilisation, but production deployments have a way of exposing gaps in preparation. A checklist from The Good Shell covers the operational details that separate a working demo from a reliable GPU platform.

Install and verify the GPU Operator

The NVIDIA GPU Operator automates driver installation, container runtime setup and device plugin registration. Running GPU workloads without it means managing these pieces manually, which is error-prone and hard to reproduce. The operator should be treated as the default starting point for NVIDIA-based clusters.

After installation, verify that nodes report GPUs correctly and that pods can request GPU resources. A common failure is the device plugin not registering, which leaves GPUs invisible to the scheduler.
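As a rough illustration of that verification step, the sketch below reads a node list in the shape produced by `kubectl get nodes -o json` and reports which expected GPU nodes advertise no `nvidia.com/gpu` resource. The node names and the sample document are hypothetical; only the resource name and the `status.allocatable` field come from Kubernetes and the NVIDIA device plugin.

```python
GPU_RESOURCE = "nvidia.com/gpu"  # extended resource advertised by the NVIDIA device plugin


def gpu_capacity_by_node(node_list: dict) -> dict:
    """Map node name -> allocatable GPU count from a `kubectl get nodes -o json` document."""
    capacity = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes serialises resource quantities as strings, e.g. "4".
        capacity[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return capacity


def nodes_missing_gpus(node_list: dict, expected_gpu_nodes: set) -> list:
    """Return expected GPU nodes that report zero GPUs (or are absent from the list)."""
    capacity = gpu_capacity_by_node(node_list)
    return sorted(n for n in expected_gpu_nodes if capacity.get(n, 0) == 0)


# Hypothetical sample: gpu-node-b has no registered device plugin.
SAMPLE_NODES = {
    "items": [
        {"metadata": {"name": "gpu-node-a"},
         "status": {"allocatable": {"cpu": "32", "nvidia.com/gpu": "4"}}},
        {"metadata": {"name": "gpu-node-b"},
         "status": {"allocatable": {"cpu": "32"}}},
    ]
}

if __name__ == "__main__":
    print(nodes_missing_gpus(SAMPLE_NODES, {"gpu-node-a", "gpu-node-b"}))
```

In practice the input would come from `json.load` on the kubectl output; a non-empty result points at nodes where the device plugin has probably not registered.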

Partition GPUs where appropriate

Not every workload needs a full GPU. Multi-Instance GPU (MIG) and time-slicing let multiple workloads share a single physical card. MIG partitions a supported GPU in hardware, so it provides stronger isolation and predictable performance; time-slicing is simpler to set up but offers less isolation, because workloads take turns on the same GPU. Choose based on workload sensitivity and whether tenants can tolerate sharing.
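One way to make that choice explicit is a small decision helper. The sketch below is only one reading of the trade-off described above; the `Workload` fields and the three mode names are hypothetical labels for illustration, not settings of the GPU Operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    needs_full_gpu: bool              # e.g. large training jobs
    latency_sensitive: bool           # needs predictable performance
    tolerates_noisy_neighbours: bool  # tenant accepts sharing without isolation


def choose_sharing_mode(workload: Workload, mig_available: bool = True) -> str:
    """Return 'dedicated', 'mig' or 'time-slicing' for a workload (illustrative policy)."""
    if workload.needs_full_gpu:
        return "dedicated"
    if workload.latency_sensitive or not workload.tolerates_noisy_neighbours:
        # Isolation is required: use MIG if the hardware supports it,
        # otherwise fall back to a whole GPU rather than weaker sharing.
        return "mig" if mig_available else "dedicated"
    return "time-slicing"


if __name__ == "__main__":
    notebook = Workload("dev-notebook", False, False, True)
    print(notebook.name, "->", choose_sharing_mode(notebook))
```

Encoding the policy this way keeps the decision reviewable: a team that disagrees with the fallback for clusters without MIG can change one branch rather than renegotiate each placement.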

Schedule jobs with Kueue

Without a job queue, a large training job can monopolise all available GPUs and block higher-priority work. Kueue adds job queueing with quotas, priorities and fair sharing on top of Kubernetes scheduling: a job waits in a queue until its quota has room for it. For multi-team clusters, it is essential.
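To give a feel for what a GPU quota looks like, the sketch below builds a Kueue ClusterQueue and a LocalQueue as plain Python dictionaries that can be dumped to JSON and applied with kubectl. It assumes the `kueue.x-k8s.io/v1beta1` API, which may differ in the Kueue release you run; the queue names, namespace, flavor name and quota of 8 GPUs are hypothetical, and the referenced ResourceFlavor must be created separately.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"
KUEUE_API = "kueue.x-k8s.io/v1beta1"  # assumption: check the version your cluster serves


def cluster_queue(name: str, flavor: str, gpu_quota: int) -> dict:
    """Cluster-wide queue that admits workloads up to gpu_quota GPUs."""
    return {
        "apiVersion": KUEUE_API,
        "kind": "ClusterQueue",
        "metadata": {"name": name},
        "spec": {
            "namespaceSelector": {},  # empty selector: accept workloads from any namespace
            "resourceGroups": [{
                "coveredResources": [GPU_RESOURCE],
                "flavors": [{
                    "name": flavor,
                    "resources": [{"name": GPU_RESOURCE, "nominalQuota": gpu_quota}],
                }],
            }],
        },
    }


def local_queue(name: str, namespace: str, cluster_queue_name: str) -> dict:
    """Namespaced queue that a team submits jobs to; it points at a ClusterQueue."""
    return {
        "apiVersion": KUEUE_API,
        "kind": "LocalQueue",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"clusterQueue": cluster_queue_name},
    }


# Hypothetical names throughout.
MANIFESTS = [
    cluster_queue("gpu-shared", "default-flavor", gpu_quota=8),
    local_queue("team-a-queue", "team-a", "gpu-shared"),
]

if __name__ == "__main__":
    for manifest in MANIFESTS:
        print(json.dumps(manifest, indent=2))
```

Jobs are then typically pointed at the local queue through the `kueue.x-k8s.io/queue-name` label, so that Kueue holds them until the quota has room.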

Use Spot GPU nodes for fault-tolerant work

Spot or preemptible GPU instances cost far less than on-demand capacity, but the provider can reclaim them at short notice. They suit training jobs with checkpointing, batch inference and development workloads. They are not suitable for real-time serving where interruptions affect users.
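What "training with checkpointing" means in practice can be sketched as a loop that saves its progress periodically, stops cleanly when asked, and resumes from the last saved step. This is a minimal sketch, not a real trainer: the state is a single step counter standing in for model and optimiser state, the function names are hypothetical, and it assumes the node delivers SIGTERM to the process before the instance is reclaimed.

```python
import json
import os
import signal
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write atomically so an interruption never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0}


def install_sigterm_flag():
    """Register a SIGTERM handler and return a callable that reports whether it fired."""
    flag = {"stop": False}

    def handler(signum, frame):
        flag["stop"] = True

    signal.signal(signal.SIGTERM, handler)
    return lambda: flag["stop"]


def train(total_steps: int, checkpoint_path: str, checkpoint_every: int = 100,
          should_stop=lambda: False) -> dict:
    """Run (or resume) training until total_steps or until should_stop() is true."""
    state = load_checkpoint(checkpoint_path)
    while state["step"] < total_steps:
        if should_stop():
            break
        state["step"] += 1  # placeholder for one real training step
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(checkpoint_path, state)
    save_checkpoint(checkpoint_path, state)  # final save, also on interruption
    return state
```

A real job would call `train(..., should_stop=install_sigterm_flag())` and write the checkpoint to storage that survives the node, such as a persistent volume or object store.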

Autoscale inference with vLLM and KEDA

For LLM serving, vLLM improves throughput through efficient memory management and batching. KEDA provides autoscaling based on custom metrics such as queue length or request latency. Together they let inference scale with demand rather than running at peak capacity continuously.
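The scaling decision itself is simple arithmetic. The sketch below shows a simplified version of the calculation an autoscaler performs for a queue-length metric with a per-replica target: divide the total by the target, round up, and clamp to the configured bounds. It is an approximation for illustration; the real behaviour of KEDA and the Horizontal Pod Autoscaler also involves tolerances, stabilisation windows and cooldowns, and the default bounds here are hypothetical.

```python
import math


def desired_replicas(queued_requests: float, target_per_replica: float,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Replicas needed so each one handles at most target_per_replica queued requests."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    raw = math.ceil(queued_requests / target_per_replica)
    # Clamp to the configured bounds so the deployment never scales past its GPU budget.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    for queued in (0, 45, 500):
        print(queued, "queued ->", desired_replicas(queued, target_per_replica=10))
```

The upper bound matters more for GPU serving than for CPU services, since each extra replica usually claims at least one whole GPU or GPU partition.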

Observability and cost tracking

GPU utilisation should be monitored continuously. Low utilisation often means oversized models, poor batching or inefficient scheduling. Track cost per job or per inference request so that optimisation efforts are directed at the biggest spenders. Dashboards should cover GPU memory usage, temperature, power draw and queue length alongside traditional CPU and memory metrics.
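The cost figures mentioned above reduce to a few multiplications. The sketch below computes cost per job and cost per inference request and ranks jobs by spend; all prices, job names and request rates are hypothetical inputs, and a real setup would pull GPU-hours from billing or metrics data.

```python
def job_cost(gpu_count: int, hours: float, price_per_gpu_hour: float) -> float:
    """Total cost of a job that holds gpu_count GPUs for the given number of hours."""
    return gpu_count * hours * price_per_gpu_hour


def cost_per_request(gpu_count: int, price_per_gpu_hour: float,
                     requests_per_hour: float) -> float:
    """Serving cost per request for a deployment running at a steady request rate."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return gpu_count * price_per_gpu_hour / requests_per_hour


def biggest_spenders(jobs: list, price_per_gpu_hour: float, top: int = 3) -> list:
    """Return (job name, cost) pairs sorted from most to least expensive."""
    costs = [(j["name"], job_cost(j["gpus"], j["hours"], price_per_gpu_hour)) for j in jobs]
    return sorted(costs, key=lambda item: item[1], reverse=True)[:top]


# Hypothetical jobs and price.
JOBS = [
    {"name": "finetune-a", "gpus": 8, "hours": 12.0},
    {"name": "eval-nightly", "gpus": 1, "hours": 2.0},
    {"name": "pretrain-b", "gpus": 16, "hours": 40.0},
]

if __name__ == "__main__":
    for name, cost in biggest_spenders(JOBS, price_per_gpu_hour=2.5):
        print(f"{name}: {cost:.2f}")
```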

Secure GPU workloads

GPU nodes are valuable targets and often run sensitive models or data. Restrict access to the namespaces that run GPU workloads, use workload identity rather than long-lived credentials, scan container images and keep drivers and runtimes patched. GPU clusters should be on the same security review schedule as any other production environment.

Conclusion

Production GPU workloads on Kubernetes are manageable if the right foundations are in place. The checklist from The Good Shell is a useful reference, but the underlying principle is to treat GPUs as a scarce, expensive resource that deserves the same operational rigour as production databases or payment systems. Teams that invest in these foundations early avoid painful rework when scale and cost pressure arrive.
