Cloud Run adds GPUs
Cloud Run, the scale-to-zero, fully managed container platform on Google Cloud, now offers GPUs in public preview.
Here’s the summary:
Specs
- One NVIDIA L4 GPU (24 GB VRAM) per Cloud Run instance (a service can scale to many instances).
- Drivers are pre-installed.
- Minimum instance size to enable GPU is 4 vCPUs and 16 GiB of memory (see the deploy sketch below).
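
Putting the specs together, a deploy command looks roughly like this. It's a minimal sketch: SERVICE and IMAGE are placeholders, and the `--gpu`/`--gpu-type` flags reflect the beta CLI at launch, so double-check against the current docs.

```sh
# Sketch: deploy a Cloud Run service with one NVIDIA L4 GPU (public preview).
# SERVICE and IMAGE are placeholders. GPU requires at least 4 vCPUs,
# 16 GiB of memory, and CPU always allocated (--no-cpu-throttling).
gcloud beta run deploy SERVICE \
  --image IMAGE \
  --region us-central1 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --cpu 4 \
  --memory 16Gi \
  --no-cpu-throttling
```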
Use cases
- Open large language models of up to 13B parameters run great, including Gemma 2 (9B), Llama 3.1 (8B), Mistral (7B), and Qwen2 (7B).
- Here’s my tutorial: Run LLM inference on Cloud Run GPUs with Ollama.
- Explore Niko’s excellent CUDA-accelerated OpenCV demo.
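
Once an Ollama-based service like the one in the tutorial is deployed, you can send it a request like this. A sketch: SERVICE_URL is a placeholder for your service’s HTTPS URL, and `/api/generate` is Ollama’s standard generation endpoint.

```sh
# Sketch: call an Ollama server running on Cloud Run. SERVICE_URL is a
# placeholder; the model must already be available in the container image.
curl https://SERVICE_URL/api/generate -d '{
  "model": "gemma2:9b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```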
Autoscaling
- Scale to zero: When there are no incoming requests, Cloud Run stops all instances and you are not charged.
- Fast cold start: When scaling from zero, processes in the container can use the GPU in approximately 5 seconds, and Gemma 2 (2B, Q4_0) can return its first tokens after 11 seconds in the best case.
- Maximum instances: Defaults to 7; a quota increase is available on request.
- Scale-out speed: During the launch event, Frank showed a service generating images with Stable Diffusion and scaled it out to 100 GPU instances in under 4 minutes. Watch that demo in the launch event recording linked below.
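
Scale-to-zero and the instance cap are controlled with the standard Cloud Run autoscaling settings. A minimal sketch of tuning them on an existing service (SERVICE is a placeholder):

```sh
# Sketch: autoscaling bounds on an existing GPU service. min-instances 0
# keeps scale-to-zero; max-instances defaults to 7 for GPU services
# unless you request a quota increase.
gcloud run services update SERVICE \
  --region us-central1 \
  --min-instances 0 \
  --max-instances 7
```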
Allow list during public preview
During public preview, access is gated to ensure good quality of service. Link to request access: g.co/cloudrun/gpu
Regions
- Today: us-central1
- Later: europe-west4 (Netherlands) and asia-southeast1 (Singapore)
Links
- A playlist with impressive demos (youtube.com)
- Launch blog (cloud.google.com)
- Launch event livestream recording (youtube.com)
- Tutorial: Run LLM inference on Cloud Run GPUs with Ollama (cloud.google.com/run/docs)
- Codelab: Run LLM inference on Cloud Run GPUs with vLLM (codelabs.developers.google.com)
- OpenCV CUDA accelerated demo (github.com)