Scale-to-zero LLM inference: Cost-efficient open model deployment on serverless GPUs
Byte size (BEGINNER level)
Zaal 2
Many companies are interested in running open large language models such as Gemma and DeepSeek because it gives them full control over deployment options, the timing of model upgrades, and the private data that goes into the model. Ollama is an open-source LLM inference server. In this 15-minute demo, I'll show you how to run Ollama cost-efficiently on serverless GPUs that scale up and down rapidly, including down to zero when there are no incoming requests.
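To give a flavor of what calling such a deployment looks like, here is a minimal sketch of querying an Ollama server over its standard /api/generate endpoint. The URL and model tag are illustrative placeholders, not details from the demo itself:

```python
import json
import urllib.request

# Hypothetical endpoint: replace with your own deployment's URL.
# A locally running Ollama server listens on http://localhost:11434.
OLLAMA_URL = "https://ollama-example.example.com/api/generate"

payload = {
    "model": "gemma3",  # any model already pulled into the Ollama server
    "prompt": "Why do serverless GPUs suit bursty inference workloads?",
    "stream": False,    # return a single JSON object instead of a stream
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print(body["response"])
```

With a scale-to-zero setup, the first request after an idle period will incur a cold start while a GPU instance spins up and the model loads; subsequent requests hit the warm instance.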