Prerequisites
- ngrok account with AI Gateway access
- vLLM installed
- ngrok agent installed
- GPU with sufficient VRAM for your chosen model
Overview
vLLM runs an OpenAI-compatible server that the AI Gateway can route to directly.
Getting started
1. Start vLLM
Start the vLLM OpenAI-compatible server, then verify it's running.
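A minimal sketch, assuming the default port 8000 and using meta-llama/Llama-3.1-8B-Instruct as an example model (pick whatever fits your GPU):

```bash
# Serve a model through vLLM's OpenAI-compatible API on port 8000
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# In another terminal: check the health endpoint and list the served model
curl http://localhost:8000/health
curl http://localhost:8000/v1/models
```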
2. Expose with ngrok
Use the ngrok agent to create an internal endpoint:
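For example (the vllm.internal hostname is an arbitrary choice; internal endpoints just need a .internal URL):

```bash
# Put the local vLLM server on an internal endpoint that is only
# reachable through your ngrok account's traffic policies
ngrok http 8000 --url https://vllm.internal
```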
3. Configure the AI Gateway
Create a Traffic Policy with vLLM as a provider:
policy.yaml
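The sketch below only illustrates the shape of the idea: an AI Gateway action whose provider list includes your vLLM server's internal endpoint as an OpenAI-compatible base URL. The action name and config keys are assumptions; check the AI Gateway reference for the exact schema.

```yaml
# Illustrative only -- action name and provider fields are assumptions
on_http_request:
  - actions:
      - type: ai-gateway                      # assumed action name
        config:
          providers:
            - name: vllm                      # assumed provider fields
              base_url: https://vllm.internal/v1
              models:
                - meta-llama/Llama-3.1-8B-Instruct
```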
4. Use with OpenAI SDK
Point any OpenAI-compatible SDK at your AI Gateway:
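For example, with the official OpenAI Python SDK (the gateway URL is a placeholder for your AI Gateway endpoint):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-gateway.ngrok.app/v1",  # placeholder: your AI Gateway endpoint
    api_key="anything",                            # vLLM ignores the key unless --api-key is set
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello from vLLM."}],
)
print(response.choices[0].message.content)
```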
Best practices
Require provider API key authentication
Secure your vLLM server with an API key:
policy.yaml
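Start vLLM with --api-key so it rejects unauthenticated requests, and give the gateway the same key. The provider fields below reuse the assumed schema from the Getting started sketch; only the --api-key flag is confirmed vLLM behavior.

```yaml
# Illustrative only -- start vLLM with: vllm serve <model> --api-key your-secret-key
on_http_request:
  - actions:
      - type: ai-gateway                      # assumed action name
        config:
          providers:
            - name: vllm
              base_url: https://vllm.internal/v1
              api_key: your-secret-key        # placeholder; must match vLLM's --api-key
```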
Configure timeouts
Large models can be slow. Increase timeouts for production:
policy.yaml
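per_request_timeout (referenced again under Troubleshooting) is the value to raise; the surrounding keys follow the same assumed schema as the earlier sketches.

```yaml
# Illustrative only
on_http_request:
  - actions:
      - type: ai-gateway                 # assumed action name
        config:
          per_request_timeout: 120s      # example value; leave room for long generations
          providers:
            - name: vllm
              base_url: https://vllm.internal/v1
```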
Hugging Face authentication
For gated models, set your Hugging Face token before starting vLLM:
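For example (HF_TOKEN is the standard Hugging Face Hub token variable; the value shown is a placeholder):

```bash
# Authenticate so vLLM can download gated model weights from the Hugging Face Hub
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
vllm serve meta-llama/Llama-3.1-8B-Instruct
```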
Advanced configuration
vLLM server options
Common vllm serve flags:
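The flags below are the ones this guide refers to; the values are examples only, see vllm serve --help for details.

```bash
# Commonly tuned flags (example values):
#   --port 8000                    HTTP port for the OpenAI-compatible API
#   --api-key <key>                require an API key on every request
#   --gpu-memory-utilization 0.9   fraction of GPU memory vLLM may claim
#   --tensor-parallel-size 2       shard the model across two GPUs
#   --max-num-seqs 256             cap on concurrently scheduled sequences
#   --async-scheduling             overlap scheduling with model execution
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 256
```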
See the vLLM serve CLI reference for all available options.
Multiple models
Run separate vLLM instances for different models:
policy.yaml
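One way to sketch this: run one vLLM instance per model on its own port, expose each through its own internal endpoint, and register both as providers (same assumed schema as above; model names and hostnames are examples).

```yaml
# Illustrative only
on_http_request:
  - actions:
      - type: ai-gateway                      # assumed action name
        config:
          providers:
            - name: vllm-llama                # e.g. port 8000 behind this endpoint
              base_url: https://vllm-llama.internal/v1
              models: [meta-llama/Llama-3.1-8B-Instruct]
            - name: vllm-mistral              # e.g. port 8001 behind this endpoint
              base_url: https://vllm-mistral.internal/v1
              models: [mistralai/Mistral-7B-Instruct-v0.3]
```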
Failover to cloud
Use vLLM as primary with automatic cloud fallback:
policy.yaml
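This sketch assumes providers are tried in the order listed and the gateway fails over when the primary is unreachable; the keys remain illustrative.

```yaml
# Illustrative only
on_http_request:
  - actions:
      - type: ai-gateway                      # assumed action name
        config:
          providers:
            - name: vllm                      # primary: self-hosted vLLM
              base_url: https://vllm.internal/v1
            - name: openai                    # fallback: hosted provider
              api_key: your-openai-key        # placeholder
```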
Model metadata
Add metadata for routing decisions:
policy.yaml
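For illustration, metadata could tag a provider with attributes (cost tier, locality) that selection strategies inspect; none of these keys are confirmed schema.

```yaml
# Illustrative only
on_http_request:
  - actions:
      - type: ai-gateway                      # assumed action name
        config:
          providers:
            - name: vllm
              base_url: https://vllm.internal/v1
              metadata:                       # assumed free-form metadata
                cost: self-hosted
                locality: on-prem
```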
Troubleshooting
Model loading errors
Symptom: vLLM fails to start or crashes.
Solutions:
- Check GPU memory: nvidia-smi
- Use a smaller model or a quantized version
- Reduce memory usage with --gpu-memory-utilization (e.g. 0.9 or lower)
Slow responses
Symptom: Requests take longer than expected.
Solutions:
- Use tensor parallelism on multi-GPU machines: --tensor-parallel-size 2
- Enable async scheduling: --async-scheduling
- Adjust --max-num-seqs for your workload
Connection timeouts
Symptom: The gateway times out waiting for a response.
Solutions:
- Increase per_request_timeout in the gateway config
- Check vLLM health: curl http://localhost:8000/health
- Verify the ngrok tunnel is running
Next steps
- Custom Providers - URL requirements and configuration
- Model Selection Strategies - Intelligent routing
- Multi-Provider Failover - Failover patterns