Scale-to-zero LLM inference: Cost-efficient open model deployment on serverless GPUs
Byte size (BEGINNER level)
Zaal 2
Many companies are interested in running open large language models such as Gemma and DeepSeek because it gives them full control over deployment options, the timing of model upgrades, and the private data that goes into the model. Ollama is an open-source LLM inference server. In this 15-minute demo, I'll show you how to run Ollama cost-efficiently on serverless GPUs that scale up and down rapidly, including down to zero when there are no incoming requests.
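To give a flavor of what calling such a deployment looks like, here is a minimal sketch of querying an Ollama server over its standard /api/generate endpoint. The URL and model tag are illustrative placeholders, not details from the demo itself:

```python
import json
import urllib.request

# Hypothetical endpoint: replace with your own deployment's URL.
# A locally running Ollama server listens on http://localhost:11434.
OLLAMA_URL = "https://ollama-example.example.com/api/generate"

payload = {
    "model": "gemma3",  # any model already pulled into the Ollama server
    "prompt": "Why do serverless GPUs suit bursty inference workloads?",
    "stream": False,    # return a single JSON object instead of a stream
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print(body["response"])
```

With a scale-to-zero setup, the first request after an idle period will incur a cold start while a GPU instance spins up and the model loads; subsequent requests hit the warm instance.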