DevFest Berlin: Running open models on Cloud Run
This is a list of links from my recent talk on open models at DevFest Berlin.
Abstract: Running open large language models in production with serverless GPUs

Many developers are interested in running open large language models, such as Google’s Gemma and Meta’s Llama. Open models give you full control over deployment options, the timing of model upgrades, the private data that goes into the model, and the ability to fine-tune for specific tasks such as data extraction. Hugging Face TGI is a popular open-source LLM inference server. You’ll learn how to build and deploy an application that uses an open model on Google Cloud Run, with cost-effective GPUs that scale down to zero instances.
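As a small taste of that last step, here is a minimal sketch of an application calling a TGI server that is already deployed on Cloud Run. The service URL is hypothetical, and the request uses TGI’s /generate endpoint with a Cloud Run identity token.

```python
# A sketch of an app calling a TGI server that is already deployed on Cloud Run.
# The service URL is hypothetical. TGI exposes a /generate endpoint, and a
# private Cloud Run service expects an identity token in the Authorization header.
import requests
import google.auth.transport.requests
import google.oauth2.id_token

SERVICE_URL = "https://tgi-gemma-xxxxx-ew.a.run.app"  # hypothetical Cloud Run URL

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    # Fetch an ID token for the service (not needed if the service
    # allows unauthenticated invocations).
    auth_request = google.auth.transport.requests.Request()
    token = google.oauth2.id_token.fetch_id_token(auth_request, SERVICE_URL)

    response = requests.post(
        f"{SERVICE_URL}/generate",
        headers={"Authorization": f"Bearer {token}"},
        json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["generated_text"]

if __name__ == "__main__":
    print(generate("Explain serverless GPUs in one sentence."))
```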
- My slide deck
- Cloud Run GPU
- Can you run it? - Check whether a model fits in the VRAM of a given GPU (a back-of-envelope sketch follows this list)
- Hugging Face TGI
- Google Gemma
- Getting started with Gradio - a minimal chat UI sketch follows this list
- Ollama - Great LLM inference server for your desktop
- Hugging Face Hub
- 140 GPU instances in four minutes
- Video: Deploy TGI on Cloud Run
- Tutorial: Deploy TGI on Cloud Run
- Hugging Face Deep Learning containers on Google Cloud - Containers with PyTorch for training and serving, plus TGI and TEI, maintained by Google Cloud and Hugging Face
- Neural Magic ran over half a million evaluations on quantized LLMs and found they maintain accuracy
- Understanding Cloud Run request concurrency
- Benchmark of LLM inference server startup times on Cloud Run - Look for the Performance heading in that blog post
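A back-of-envelope answer to the “can you run it?” question above: a model’s weights take roughly parameter count times bytes per parameter, plus headroom for the KV cache and activations. The sketch below uses an assumed 20% margin, not a measured value.

```python
# Back-of-envelope check: does a model fit in GPU VRAM?
# Weights dominate: parameters * bytes per parameter, plus headroom for the
# KV cache and activations (the 20% margin here is an assumption, not a measurement).
def estimate_vram_gb(params_billion: float, bytes_per_param: float = 2.0,
                     overhead: float = 1.2) -> float:
    return params_billion * bytes_per_param * overhead

# Example: a 9B-parameter model in 16-bit precision needs roughly
# 9 * 2 * 1.2 ≈ 22 GB, which fits on the 24 GB NVIDIA L4 that Cloud Run offers.
print(f"{estimate_vram_gb(9):.0f} GB")  # -> 22 GB
```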
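And a rough idea of what the Gradio front end mentioned above can look like. The service URL is hypothetical, and this sketch assumes the Cloud Run service allows unauthenticated access.

```python
# A minimal Gradio chat UI in front of a TGI endpoint. The URL is hypothetical,
# and this sketch assumes the Cloud Run service allows unauthenticated access.
import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient("https://tgi-gemma-xxxxx-ew.a.run.app")  # hypothetical URL

def answer(message, history):
    # history holds previous chat turns; this sketch ignores it.
    return client.text_generation(message, max_new_tokens=256)

gr.ChatInterface(answer, title="Gemma on Cloud Run").launch()
```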