vLLM is a high-performance inference engine for large language models. It provides an OpenAI-compatible API server out of the box, making it easy to integrate with the ngrok AI Gateway.

Prerequisites

Overview

vLLM runs an OpenAI-compatible server that the AI Gateway can route to directly.

Getting started

1. Start vLLM

Start the vLLM OpenAI-compatible server:
vllm serve meta-llama/Llama-3.1-8B-Instruct
Verify it’s running:
curl http://localhost:8000/v1/models
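If you prefer to check from Python, the OpenAI SDK can list the served models the same way (a minimal sketch; the api_key value is a placeholder because this server was started without --api-key):
from openai import OpenAI

# Point the SDK at the local vLLM server rather than the gateway.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

for model in client.models.list():
    print(model.id)  # should include meta-llama/Llama-3.1-8B-Instruct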
2. Expose with ngrok

Use the ngrok agent to create an internal endpoint:
ngrok http 8000 --url https://vllm.internal
3. Configure the AI Gateway

Create a Traffic Policy with vLLM as a provider:
policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "vllm"
          base_url: "https://vllm.internal"
          models:
            - id: "meta-llama/Llama-3.2-8B-Instruct"
4. Use with OpenAI SDK

Point any OpenAI-compatible SDK at your AI Gateway:
from openai import OpenAI

client = OpenAI(
    base_url="https://your-ai-subdomain.ngrok.app/v1",
    api_key="unused"
)

response = client.chat.completions.create(
    model="vllm:meta-llama/Llama-3.2-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
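Streaming works through the gateway as well, assuming the gateway passes the standard OpenAI streaming responses through unchanged. Continuing with the client above, a minimal sketch:
# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="vllm:meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()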

Best practices

Require provider API key authentication

Secure your vLLM server with an API key:
vllm serve meta-llama/Llama-3.1-8B-Instruct --api-key your-secret-key
Store the key in ngrok secrets:
ngrok api secrets create \
  --name vllm \
  --secret-data '{"api-key": "your-secret-key"}'
Then reference it in your AI Gateway config:
policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "vllm"
          base_url: "https://vllm.internal"
          api_keys:
            - value: ${secrets.get('vllm', 'api-key')}
          models:
            - id: "meta-llama/Llama-3.2-8B-Instruct"
You can also create secrets in the ngrok Dashboard.
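To confirm the vLLM server is actually enforcing the key, you can probe it directly from the host. A quick sketch using the requests library, with your-secret-key standing in for whatever you passed to --api-key:
import requests

base = "http://localhost:8000/v1"

# Without credentials the server should reject the request (typically 401).
print(requests.get(f"{base}/models").status_code)

# With the key sent as a Bearer token it should return 200.
print(requests.get(
    f"{base}/models",
    headers={"Authorization": "Bearer your-secret-key"},
).status_code)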

Configure timeouts

Large models can be slow. Increase timeouts for production:
policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      per_request_timeout: "180s"
      total_timeout: "10m"
      providers:
        - id: "vllm"
          base_url: "https://vllm.internal"
          models:
            - id: "meta-llama/Llama-3.2-8B-Instruct"

Hugging Face authentication

For gated models, set your HF token before starting vLLM:
export HF_TOKEN=your_huggingface_token
vllm serve meta-llama/Llama-3.1-8B-Instruct
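If you want to confirm the token works before launching vLLM, the huggingface_hub library (assuming it is available in your environment) can validate it. A small sketch:
import os
from huggingface_hub import HfApi

# whoami() raises if the token is missing, expired, or invalid.
user = HfApi(token=os.environ["HF_TOKEN"]).whoami()
print(f"Hugging Face token OK, authenticated as {user['name']}")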

Advanced configuration

vLLM server options

Common vllm serve flags:
# Custom host and port
vllm serve model --host 0.0.0.0 --port 8080

# Limit GPU memory usage (the default is 0.9; lower it for smaller GPUs)
vllm serve model --gpu-memory-utilization 0.7

# Multi-GPU with tensor parallelism
vllm serve model --tensor-parallel-size 2

# Enable async scheduling for better throughput
vllm serve model --async-scheduling
See the vLLM serve CLI reference for all available options.

Multiple models

Run separate vLLM instances for different models:
policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "vllm-llama"
          base_url: "https://vllm-llama.internal"
          models:
            - id: "meta-llama/Llama-3.2-8B-Instruct"
        
        - id: "vllm-mistral"
          base_url: "https://vllm-mistral.internal"
          models:
            - id: "mistralai/Mistral-7B-Instruct-v0.3"

Failover to cloud

Use vLLM as primary with automatic cloud fallback:
policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "vllm"
          base_url: "https://vllm.internal"
          models:
            - id: "meta-llama/Llama-3.2-8B-Instruct"
        
        - id: "openai"
          api_keys:
            - value: ${secrets.get('openai', 'api-key')}
      
      model_selection:
        strategy:
          - "ai.models.filter(m, m.provider_id == 'vllm')"
          - "ai.models"

Model metadata

Add metadata for routing decisions:
policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "vllm"
          base_url: "https://vllm.internal"
          metadata:
            hardware: "A100-80GB"
            location: "us-east"
          models:
            - id: "meta-llama/Llama-3.2-8B-Instruct"
              metadata:
                parameters: "8B"
                context_length: 128000

Troubleshooting

Model loading errors

Symptom: vLLM fails to start or crashes.
Solutions:
  1. Check GPU memory: nvidia-smi
  2. Use a smaller model or a quantized variant
  3. Lower --gpu-memory-utilization (for example 0.7; the default is 0.9)

Slow responses

Symptom: Requests take longer than expected.
Solutions:
  1. Use tensor parallelism across multiple GPUs: --tensor-parallel-size 2
  2. Enable async scheduling: --async-scheduling
  3. Tune --max-num-seqs for your workload

Connection timeouts

Symptom: The gateway times out waiting for a response.
Solutions:
  1. Increase per_request_timeout in your gateway config
  2. Check vLLM health: curl http://localhost:8000/health (a scripted version of this check is sketched below)
  3. Verify the ngrok agent and internal endpoint are running
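A scripted version of the local checks above, sketched with the requests library:
import requests

# Both endpoints are served by the vLLM OpenAI-compatible server on its
# default port; run this on the machine hosting vLLM.
for name, url in {
    "health": "http://localhost:8000/health",
    "models": "http://localhost:8000/v1/models",
}.items():
    try:
        status = requests.get(url, timeout=5).status_code
        print(f"{name}: HTTP {status}")
    except requests.RequestException as exc:
        print(f"{name}: unreachable ({exc})")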

Next steps