Run Gemma with Ollama on Cloud Run
Update (August 21, 2024): Cloud Run added GPU support. Refer to my tutorial:
Run LLM inference on Cloud Run GPUs with Ollama.
I created a sample that shows how to deploy the Ollama API with the gemma:2b model
on Cloud Run, running inference on CPU only.
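The repository (linked at the end of this post) has the full working configuration. As a rough sketch of the approach, here is what a Dockerfile for this could look like; the `ENV` values and the trick of pulling the model during the build are my assumptions, not a verbatim copy of the sample:

```dockerfile
FROM ollama/ollama

# Listen on all interfaces so Cloud Run can route requests to the container.
ENV OLLAMA_HOST=0.0.0.0:11434

# Keep model weights inside the image instead of the default home directory.
ENV OLLAMA_MODELS=/models

# Bake gemma:2b into the image at build time, so the service
# doesn't download the weights on every cold start.
RUN ollama serve & sleep 5 && ollama pull gemma:2b
```

Deploying it could then look like this (service name, resource limits, and concurrency are illustrative; CPU-only inference is slow, so generous CPU and memory limits help):

```sh
gcloud run deploy ollama-gemma \
  --source . \
  --port 11434 \
  --cpu 8 \
  --memory 16Gi \
  --no-cpu-throttling \
  --concurrency 4 \
  --allow-unauthenticated
```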
Gemma is Google’s open model, built from the same research and technology used to create the Gemini models. The 2B version is the smallest one; I haven’t tried running the 7B version yet.
Ollama is a framework that makes it easy for developers to prototype apps with open models, including Gemma. It comes with a REST API, and this sample deploys that API as a Cloud Run service.
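Once the service is up, calling the API is a plain HTTP request. For example, a generate request against the service could look like this (the URL is a placeholder; use the one that `gcloud run deploy` prints):

```sh
# Non-streaming generate request; Ollama returns one JSON object
# with the model's reply in the "response" field.
curl https://ollama-gemma-xxxxxxxx-ew.a.run.app/api/generate -d '{
  "model": "gemma:2b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```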
Find it here: github.com/wietsevenema/samples/ollama-gemma