Cloud Run GPUs
Cloud Run, the scale-to-zero, fully managed container platform on Google Cloud, provides on-demand access to NVIDIA GPUs for AI inference and other compute-intensive workloads.
After a period in public preview, GPUs on Cloud Run reached general availability in June 2025.
Here is a summary of the specs, use cases, scaling behavior, and supported regions.
Specs
Cloud Run supports two types of NVIDIA GPUs:
- The NVIDIA L4 (24 GB of VRAM) is suited to mid-sized models and requires at least 4 vCPU and 16 GiB of memory.
- The NVIDIA RTX PRO 6000 Blackwell (96 GB of VRAM) is designed for large language models and heavy workloads, and requires at least 20 vCPU and 80 GiB of memory.
Drivers are pre-installed, and you can attach one GPU per Cloud Run instance.
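As a rough illustration of the minimums above, here is a small Python sketch that checks a planned configuration and builds a gcloud command line from it. The service name, image, and the GpuServiceConfig helper are hypothetical. The flag names (--gpu, --gpu-type, --no-cpu-throttling) and the GPU type identifiers are assumptions based on the gcloud CLI as I understand it, so check them against the "Configuring GPUs" documentation before relying on them.

```python
from dataclasses import dataclass

# Minimum resources per GPU type, taken from the specs above.
# The keys are the identifiers assumed for the --gpu-type flag.
GPU_MINIMUMS = {
    "nvidia-l4": {"cpu": 4, "memory_gib": 16},
    "nvidia-rtx-pro-6000": {"cpu": 20, "memory_gib": 80},
}


@dataclass
class GpuServiceConfig:
    service: str      # hypothetical service name
    image: str        # hypothetical container image
    region: str
    gpu_type: str
    cpu: int
    memory_gib: int


def validate(config):
    """Return a list of problems; an empty list means the minimums are met."""
    minimum = GPU_MINIMUMS.get(config.gpu_type)
    if minimum is None:
        return [f"unknown GPU type: {config.gpu_type}"]
    problems = []
    if config.cpu < minimum["cpu"]:
        problems.append(
            f"{config.gpu_type} needs at least {minimum['cpu']} vCPU"
        )
    if config.memory_gib < minimum["memory_gib"]:
        problems.append(
            f"{config.gpu_type} needs at least {minimum['memory_gib']} GiB of memory"
        )
    return problems


def deploy_command(config):
    """Build the argument list for a deployment with one GPU per instance."""
    problems = validate(config)
    if problems:
        raise ValueError("; ".join(problems))
    return [
        "gcloud", "run", "deploy", config.service,
        "--image", config.image,
        "--region", config.region,
        "--gpu", "1",                      # one GPU per instance
        "--gpu-type", config.gpu_type,
        "--cpu", str(config.cpu),
        "--memory", f"{config.memory_gib}Gi",
        "--no-cpu-throttling",             # assumed requirement for GPU services
    ]
```

Validating before building the command catches an undersized configuration locally instead of at deploy time. The returned list can be passed to subprocess.run if you want to execute it.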
Use cases
- To run large language models, deploy open models such as Gemma 3, Llama 3.1, and Mistral. The 96 GB of VRAM on the Blackwell GPU accommodates models with more parameters and supports higher query throughput.
- To get started, follow the tutorial on running LLM inference on Cloud Run GPUs with Ollama.
- For computer vision workloads, explore the OpenCV CUDA-accelerated demo by Niko, which demonstrates high-speed GPU-accelerated image processing.
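Once an Ollama container is deployed, calling it might look roughly like the sketch below. It uses Ollama's /api/generate endpoint with streaming turned off. The function names are hypothetical, and the identity token is only needed if the service requires authenticated invocations, which is an assumption about how the service is configured.

```python
import json
import urllib.request


def build_generate_request(base_url, model, prompt, identity_token=None):
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    headers = {"Content-Type": "application/json"}
    if identity_token:
        # Only needed when the service does not allow unauthenticated calls.
        headers["Authorization"] = f"Bearer {identity_token}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def generate(base_url, model, prompt, identity_token=None, timeout=120):
    """Send the request and return the generated text."""
    request = build_generate_request(base_url, model, prompt, identity_token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

To try it, call generate with your service URL, the tag of a model that the container has already pulled, and a prompt. The generous default timeout allows for a cold start plus model loading on the first request after the service has scaled to zero.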
Autoscaling and reliability
- With scale to zero, Cloud Run stops all instances when there are no incoming requests, meaning you are not charged for any GPU, CPU, or memory resources during idle times.
- Cold starts are fast: instances start in approximately 5 seconds when scaling up from zero.
- Choose between zonal redundancy (for higher availability with reserved capacity) and non-zonal redundancy (to keep costs lower).
- Scale-out is fast: Cloud Run can autoscale to hundreds of GPU instances in minutes to handle traffic spikes.
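To get a feel for how many GPU instances a traffic spike might call for, here is a back-of-the-envelope sketch based on Little's law. It is not Cloud Run's actual autoscaling algorithm. The function name and the example numbers are made up for illustration, and concurrency here means the maximum number of requests one instance handles at once.

```python
import math


def instances_needed(requests_per_second, seconds_per_request, concurrency):
    """Estimate the instances required to serve a steady request rate.

    Little's law: requests in flight = arrival rate * time per request.
    Dividing by per-instance concurrency gives the instance count.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    in_flight = requests_per_second * seconds_per_request
    return math.ceil(in_flight / concurrency)
```

For example, 50 requests per second at 2 seconds each, with 4 concurrent requests per instance, works out to 25 instances. With no traffic the estimate is 0, which matches scale to zero.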
Regions
GPU availability varies by type (see the documentation for current regions). As of June 2025:
- NVIDIA L4: us-central1, us-east4, europe-west1, europe-west4, asia-southeast1, and asia-south1 (by invitation).
- NVIDIA RTX PRO 6000: us-central1, europe-west4, asia-southeast1, and asia-south2.
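If you need to pick a region programmatically, the list above can be captured as a small lookup. This is only a sketch: the table is a snapshot of the regions stated above and will go stale, and the names GPU_REGIONS and regions_for are hypothetical.

```python
# Snapshot of the regions listed above; check the docs for the current list.
GPU_REGIONS = {
    "NVIDIA L4": {
        "us-central1", "us-east4", "europe-west1",
        "europe-west4", "asia-southeast1",
        "asia-south1",  # by invitation
    },
    "NVIDIA RTX PRO 6000": {
        "us-central1", "europe-west4", "asia-southeast1", "asia-south2",
    },
}


def regions_for(gpu_types):
    """Return the sorted regions that offer every GPU type in gpu_types."""
    region_sets = [GPU_REGIONS[gpu_type] for gpu_type in gpu_types]
    if not region_sets:
        return []
    return sorted(set.intersection(*region_sets))
```

Going by this snapshot, the regions that offer both GPU types are asia-southeast1, europe-west4, and us-central1.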
Links
- Cloud Run GPUs are Generally Available (cloud.google.com)
- Documentation: Configuring GPUs for Cloud Run (cloud.google.com)
- Best practices: Cloud Run services with GPUs (cloud.google.com)
- Run LLM inference on Cloud Run GPUs with Ollama tutorial (cloud.google.com)
- Run LLM inference on Cloud Run GPUs with vLLM codelab (codelabs.developers.google.com)