Prerequisites
- ngrok account with AI Gateway access
- Ollama installed locally
- ngrok agent installed
Overview
Since Ollama runs locally over HTTP, you'll expose it through an ngrok internal endpoint, then configure the AI Gateway to route requests to it.
Getting started
1. Start Ollama
Start the Ollama server, pull a model if you haven't already, and verify that it's responding.
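For example, with llama3.2, the model used elsewhere in this guide:
```bash
# Start the Ollama server (listens on localhost:11434 by default)
ollama serve

# In another terminal: pull a model if you haven't already
ollama pull llama3.2

# Verify Ollama is running and see which models are available
curl http://localhost:11434/api/tags
```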
2. Expose Ollama with ngrok
Use the ngrok agent to create an internal endpoint.
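A typical command looks like the following; 11434 is Ollama's default port, and the exact flags can vary by agent version, so check your agent's help output if this doesn't match:
```bash
# Expose the local Ollama server on a private .internal endpoint
ngrok http 11434 --url https://ollama.internal
```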
Internal endpoints (.internal domains) are private to your ngrok account. They're not accessible from the public internet.
3. Configure the AI Gateway
Create a Traffic Policy with Ollama as a provider:
policy.yaml
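The exact action and provider schema are defined by the AI Gateway Traffic Policy reference; the sketch below is illustrative only, and the action type and config key names are assumptions:
```yaml
# Illustrative sketch -- the action type and config key names here are assumed;
# consult the AI Gateway Traffic Policy reference for the exact schema.
on_http_request:
  - actions:
      - type: ai-gateway
        config:
          providers:
            - id: ollama
              # The internal endpoint created in step 2
              url: https://ollama.internal
```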
4. Use with OpenAI SDK
Point any OpenAI-compatible SDK at your AI Gateway:
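A minimal Python sketch using the official openai package; the base URL and API key are placeholders for your own gateway endpoint and credentials, and the model ID uses the provider-prefixed format shown later in this guide:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-ai-gateway.example.com",  # placeholder: your AI Gateway endpoint URL
    api_key="your-gateway-api-key",                  # placeholder: whatever auth your gateway expects
)

response = client.chat.completions.create(
    model="ollama:llama3.2",  # provider-prefixed model ID
    messages=[{"role": "user", "content": "Say hello from a local model."}],
)
print(response.choices[0].message.content)
```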
Advanced configuration
Restrict to Ollama only
Block requests to cloud providers and only allow Ollama:
policy.yaml
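One way to express this, sketched with the same assumed key names as above, is to configure only the Ollama provider; the real policy may instead offer an explicit allow/deny option, so treat this as illustrative:
```yaml
# Assumed approach: with no cloud providers configured, requests that name
# them have nothing to route to. Key names are assumptions.
on_http_request:
  - actions:
      - type: ai-gateway
        config:
          providers:
            - id: ollama
              url: https://ollama.internal
```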
Failover to cloud provider
Use Ollama as primary with automatic failover to OpenAI:
policy.yaml
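Sketched with the same assumed skeleton, an ordered list of selection strategies might look like this; the strategy and credential key names are assumptions, so check the Model Selection Strategies reference for the real options:
```yaml
# Illustrative sketch -- key names are assumed. Strategies are evaluated in
# order: Ollama models are preferred, OpenAI is consulted only if none match.
on_http_request:
  - actions:
      - type: ai-gateway
        config:
          providers:
            - id: ollama
              url: https://ollama.internal
            - id: openai
              api_key: "<OPENAI_API_KEY>"   # placeholder credential
          strategies:                       # assumed key: evaluated in order
            - provider: ollama
            - provider: openai
```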
The first strategy that returns models wins. If Ollama has matching models, only those are tried. OpenAI is only used if no Ollama models match. For cross-provider failover when requests fail, have clients specify multiple models:
models: ["ollama:llama3.2", "openai:gpt-4o"].
Increase timeouts
Local models can be slower, especially on first load. Increase timeouts as needed:
policy.yaml
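per_request_timeout is the setting this guide refers to elsewhere; where it sits in the config and its duration format are assumed in this sketch:
```yaml
# Illustrative sketch -- placement and duration format are assumed.
on_http_request:
  - actions:
      - type: ai-gateway
        config:
          per_request_timeout: 120s   # allow time for model loading
          providers:
            - id: ollama
              url: https://ollama.internal
```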
Multiple Ollama instances
Load balance across multiple machines:
policy.yaml
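Sketched with the same assumed schema, each machine gets its own internal endpoint and provider entry; the ollama-1/ollama-2 names and URLs are hypothetical examples:
```yaml
# Illustrative sketch -- provider IDs, URLs, and key names are examples/assumptions.
on_http_request:
  - actions:
      - type: ai-gateway
        config:
          providers:
            - id: ollama-1
              url: https://ollama-1.internal
            - id: ollama-2
              url: https://ollama-2.internal
```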
Add model metadata
Track model details with metadata:
policy.yaml
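A sketch of attaching free-form metadata to a model entry; the metadata key and the fields shown are illustrative assumptions:
```yaml
# Illustrative sketch -- key names and metadata fields are assumptions.
on_http_request:
  - actions:
      - type: ai-gateway
        config:
          providers:
            - id: ollama
              url: https://ollama.internal
              models:
                - id: llama3.2
                  metadata:
                    quantization: q4_K_M    # example value
                    host: gpu-workstation   # example value
```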
Troubleshooting
Connection refused
Symptom: Requests fail with connection errors.
Solutions:
- Verify Ollama is running: curl http://localhost:11434/api/tags
- Verify the ngrok tunnel is running: check for https://ollama.internal in your ngrok dashboard
- Ensure the internal endpoint URL matches your config
Model not found
Symptom: Error saying the model doesn't exist.
Solutions:
- List available models: ollama list
- Pull the model: ollama pull llama3.2
- Verify the model ID matches exactly (including tags like :1b)
Slow first response
Symptom: The first request takes a very long time.
Cause: Ollama loads models into memory on first use.
Solutions:
- Increase per_request_timeout to allow for model loading
- Pre-warm the model: curl http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":""}'
- Keep the model loaded by sending periodic requests (see the example after this list)
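Ollama's API also accepts a keep_alive parameter that controls how long a model stays loaded after a request; for example, to keep llama3.2 in memory for 30 minutes:
```bash
# Empty prompt just loads the model; keep_alive extends how long it stays resident
curl http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":"","keep_alive":"30m"}'
```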
Out of memory
Symptom: Ollama crashes or returns errors for large models.
Solutions:
- Use a smaller model or quantized version (for example, llama3.2:1b)
- Increase system RAM or use a machine with more VRAM
- Set OLLAMA_NUM_PARALLEL=1 to limit concurrent requests
Next steps
- Custom Providers - Learn about URL requirements and configuration options
- Model Selection Strategies - Route requests intelligently
- Multi-Provider Failover - Advanced failover patterns