vLLM is a high-performance inference engine for large language models. It provides an OpenAI-compatible API server out of the box, making it easy to integrate with the ngrok AI Gateway.

Prerequisites

Overview

vLLM runs an OpenAI-compatible server that the AI Gateway can route to directly.

Getting started

1. Start vLLM

Start the vLLM OpenAI-compatible server:
vllm serve meta-llama/Llama-3.1-8B-Instruct
Verify it’s running:
curl http://localhost:8000/v1/models
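If you prefer to check from Python, the OpenAI SDK can list the served models the same way (a minimal sketch; the api_key value is a placeholder because this server was started without --api-key):
from openai import OpenAI

# Point the SDK at the local vLLM server rather than the gateway.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

for model in client.models.list():
    print(model.id)  # should include meta-llama/Llama-3.1-8B-Instruct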
2. Expose with ngrok

Use the ngrok agent to create an internal endpoint:
ngrok http 8000 --url https://vllm.internal
3. Configure the AI Gateway

Create a Traffic Policy with vLLM as a provider:
policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "vllm"
          base_url: "https://vllm.internal"
          models:
            - id: "meta-llama/Llama-3.2-8B-Instruct"
4. Use with OpenAI SDK

Point any OpenAI-compatible SDK at your AI Gateway:
from openai import OpenAI

client = OpenAI(
    base_url="https://your-ai-subdomain.ngrok.app/v1",
    api_key="unused"
)

response = client.chat.completions.create(
    model="vllm:meta-llama/Llama-3.2-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
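Streaming works through the gateway as well, assuming the gateway passes the standard OpenAI streaming responses through unchanged. Continuing with the client above, a minimal sketch:
# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="vllm:meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()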

Best practices

Require provider API key authentication

Secure your vLLM server with an API key:
vllm serve meta-llama/Llama-3.1-8B-Instruct --api-key your-secret-key
Store the key in ngrok secrets:
ngrok api secrets create \
  --name vllm \
  --secret-data '{"api-key": "your-secret-key"}'
Then reference it in your AI Gateway config:
policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "vllm"
          base_url: "https://vllm.internal"
          api_keys:
            - value: ${secrets.get('vllm', 'api-key')}
          models:
            - id: "meta-llama/Llama-3.2-8B-Instruct"
You can also create secrets in the ngrok Dashboard.
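To confirm the vLLM server is actually enforcing the key, you can probe it directly from the host. A quick sketch using the requests library, with your-secret-key standing in for whatever you passed to --api-key:
import requests

base = "http://localhost:8000/v1"

# Without credentials the server should reject the request (typically 401).
print(requests.get(f"{base}/models").status_code)

# With the key sent as a Bearer token it should return 200.
print(requests.get(
    f"{base}/models",
    headers={"Authorization": "Bearer your-secret-key"},
).status_code)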

Configure timeouts

Large models can be slow. Increase timeouts for production:
policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      per_request_timeout: "180s"
      total_timeout: "10m"
      providers:
        - id: "vllm"
          base_url: "https://vllm.internal"
          models:
            - id: "meta-llama/Llama-3.2-8B-Instruct"

Hugging Face authentication

For gated models, set your HF token before starting vLLM:
export HF_TOKEN=your_huggingface_token
vllm serve meta-llama/Llama-3.1-8B-Instruct
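If you want to confirm the token works before launching vLLM, the huggingface_hub library (assuming it is available in your environment) can validate it. A small sketch:
import os
from huggingface_hub import HfApi

# whoami() raises if the token is missing, expired, or invalid.
user = HfApi(token=os.environ["HF_TOKEN"]).whoami()
print(f"Hugging Face token OK, authenticated as {user['name']}")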

Advanced configuration

vLLM server options

Common vllm serve flags:
# Custom host and port
vllm serve model --host 0.0.0.0 --port 8080

# Limit GPU memory usage (the default is 0.9; lower it for smaller GPUs)
vllm serve model --gpu-memory-utilization 0.7

# Multi-GPU with tensor parallelism
vllm serve model --tensor-parallel-size 2

# Enable async scheduling for better throughput
vllm serve model --async-scheduling
See the vLLM serve CLI reference for all available options.

Multiple models

Run separate vLLM instances for different models:
policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "vllm-llama"
          base_url: "https://vllm-llama.internal"
          models:
            - id: "meta-llama/Llama-3.2-8B-Instruct"
        
        - id: "vllm-mistral"
          base_url: "https://vllm-mistral.internal"
          models:
            - id: "mistralai/Mistral-7B-Instruct-v0.3"

Failover to cloud

Use vLLM as primary with automatic cloud fallback:
policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "vllm"
          base_url: "https://vllm.internal"
          models:
            - id: "meta-llama/Llama-3.2-8B-Instruct"
        
        - id: "openai"
          api_keys:
            - value: ${secrets.get('openai', 'api-key')}
      
      model_selection:
        strategy:
          - "ai.models.filter(m, m.provider_id == 'vllm')"
          - "ai.models"

Model metadata

Add metadata for routing decisions:
policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "vllm"
          base_url: "https://vllm.internal"
          metadata:
            hardware: "A100-80GB"
            location: "us-east"
          models:
            - id: "meta-llama/Llama-3.2-8B-Instruct"
              metadata:
                parameters: "8B"
                context_length: 128000

Troubleshooting

Model loading errors

Symptom: vLLM fails to start or crashes.
Solutions:
  1. Check GPU memory: nvidia-smi
  2. Use a smaller model or a quantized variant
  3. Lower --gpu-memory-utilization (for example 0.7; the default is 0.9)

Slow responses

Symptom: Requests take longer than expected.
Solutions:
  1. Use tensor parallelism across multiple GPUs: --tensor-parallel-size 2
  2. Enable async scheduling: --async-scheduling
  3. Tune --max-num-seqs for your workload

Connection timeouts

Symptom: The gateway times out waiting for a response.
Solutions:
  1. Increase per_request_timeout in your gateway config
  2. Check vLLM health: curl http://localhost:8000/health (a scripted version of this check is sketched below)
  3. Verify the ngrok agent and internal endpoint are running
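A scripted version of the local checks above, sketched with the requests library:
import requests

# Both endpoints are served by the vLLM OpenAI-compatible server on its
# default port; run this on the machine hosting vLLM.
for name, url in {
    "health": "http://localhost:8000/health",
    "models": "http://localhost:8000/v1/models",
}.items():
    try:
        status = requests.get(url, timeout=5).status_code
        print(f"{name}: HTTP {status}")
    except requests.RequestException as exc:
        print(f"{name}: unreachable ({exc})")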

Next steps