# vLLM

vLLM is a high-performance inference engine for large language models. It exposes an OpenAI-compatible API server out of the box, so it works with the ngrok AI Gateway.
## Prerequisites

- A machine with a GPU capable of running your chosen model, with vLLM installed
- The ngrok agent, installed and connected to your account
- An ngrok account with access to the AI Gateway

## Overview

vLLM runs an OpenAI-compatible server that the AI Gateway can route to directly.
## Getting started

### Start vLLM

Start the vLLM OpenAI-compatible server:

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct
```

Verify it's running:

```bash
curl http://localhost:8000/v1/models
```
### Expose with ngrok

Use the ngrok agent to create an internal endpoint:

```bash
ngrok http 8000 --url https://vllm.internal
```
### Configure the AI Gateway

Create a Traffic Policy with vLLM as a provider:

```yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "vllm"
          base_url: "https://vllm.internal"
          models:
            - id: "meta-llama/Llama-3.1-8B-Instruct"
```
### Use with OpenAI SDK

Point any OpenAI-compatible SDK at your AI Gateway:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-ai-gateway.ngrok.app/v1",
    api_key="unused"
)

response = client.chat.completions.create(
    model="vllm:meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
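Streaming works the same way, assuming the gateway passes vLLM's stream through unchanged; a minimal sketch continuing from the `client` above:

```python
# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="vllm:meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about tunnels."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```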
## Best practices

### Require provider API key authentication

Secure your vLLM server with an API key:

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct --api-key your-secret-key
```

Store the key in ngrok secrets:

```bash
ngrok api secrets create \
  --name vllm \
  --secret-data '{"api-key": "your-secret-key"}'
```
Then reference it in your AI Gateway config:

```yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "vllm"
          base_url: "https://vllm.internal"
          api_keys:
            - value: ${secrets.get('vllm', 'api-key')}
          models:
            - id: "meta-llama/Llama-3.1-8B-Instruct"
```
### Increase timeouts

Large models can be slow. Increase timeouts for production:

```yaml
on_http_request:
  - type: ai-gateway
    config:
      per_request_timeout: "180s"
      total_timeout: "10m"
      providers:
        - id: "vllm"
          base_url: "https://vllm.internal"
          models:
            - id: "meta-llama/Llama-3.1-8B-Instruct"
```
### Hugging Face authentication

For gated models, set your HF token before starting vLLM:

```bash
export HF_TOKEN=your_huggingface_token
vllm serve meta-llama/Llama-3.1-8B-Instruct
```
## Advanced configuration

### vLLM server options

Common `vllm serve` flags:

```bash
# Custom host and port
vllm serve model --host 0.0.0.0 --port 8080

# Cap GPU memory usage (0.9 is the default; lower it for smaller or shared GPUs)
vllm serve model --gpu-memory-utilization 0.8

# Multi-GPU with tensor parallelism
vllm serve model --tensor-parallel-size 2

# Enable async scheduling for better throughput
vllm serve model --async-scheduling
```
### Multiple models

Run separate vLLM instances for different models:

```yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "vllm-llama"
          base_url: "https://vllm-llama.internal"
          models:
            - id: "meta-llama/Llama-3.1-8B-Instruct"
        - id: "vllm-mistral"
          base_url: "https://vllm-mistral.internal"
          models:
            - id: "mistralai/Mistral-7B-Instruct-v0.3"
```
### Failover to cloud

Use vLLM as primary with automatic cloud fallback. The selection strategy below prefers models served by the `vllm` provider, then falls back to the full model list:

```yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "vllm"
          base_url: "https://vllm.internal"
          models:
            - id: "meta-llama/Llama-3.1-8B-Instruct"
        - id: "openai"
          api_keys:
            - value: ${secrets.get('openai', 'api-key')}
      model_selection:
        strategy:
          - "ai.models.filter(m, m.provider_id == 'vllm')"
          - "ai.models"
```
### Metadata

Add metadata for routing decisions:

```yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "vllm"
          base_url: "https://vllm.internal"
          metadata:
            hardware: "A100-80GB"
            location: "us-east"
          models:
            - id: "meta-llama/Llama-3.1-8B-Instruct"
              metadata:
                parameters: "8B"
                context_length: 128000
```
## Troubleshooting

### Model loading errors

**Symptom:** vLLM fails to start or crashes.

**Solutions:**

- Check GPU memory: `nvidia-smi`
- Use a smaller model or a quantized version
- Reduce memory usage below the 0.9 default: `--gpu-memory-utilization 0.8`
### Slow responses

**Symptom:** Requests take longer than expected.

**Solutions:**

- Use tensor parallelism for multi-GPU: `--tensor-parallel-size 2`
- Enable async scheduling: `--async-scheduling`
- Adjust `--max-num-seqs` for your workload
### Connection timeouts

**Symptom:** The gateway times out waiting for a response.

**Solutions:**

- Increase `per_request_timeout` in the gateway config
- Check vLLM health (see the combined check below): `curl http://localhost:8000/health`
- Verify the ngrok tunnel is running
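A small script can combine the health and model-list checks (a sketch; assumes the default port and the `requests` package, and note that `/v1/models`, unlike `/health`, requires a Bearer token when vLLM runs with `--api-key`):

```python
import requests

# Probe vLLM's health endpoint and its OpenAI-compatible model list.
for url in ("http://localhost:8000/health", "http://localhost:8000/v1/models"):
    try:
        resp = requests.get(url, timeout=5)
        print(f"{url} -> HTTP {resp.status_code}")
    except requests.RequestException as exc:
        print(f"{url} -> unreachable ({exc})")
```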
## Next steps