Request flow

When you send a request to your AI Gateway endpoint:
  1. Your app sends a request to your ngrok endpoint
  2. The gateway selects which models to try based on your configuration
  3. The request is forwarded to the provider with the appropriate provider API key
  4. If it fails, the gateway retries with the next model or key in the list
  5. The response is returned to your app
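
A minimal policy that exercises this flow might look like the following sketch, built from the same fields used in the examples below (the secret names are placeholders). The providers list supplies the candidates for steps 3 and 4:
traffic-policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        # Step 3: requests go to OpenAI first, using this API key
        - id: openai
          api_keys:
            - value: ${secrets.get('openai', 'api-key')}
        # Step 4: if OpenAI fails, the gateway retries with the next candidate
        - id: anthropic
          api_keys:
            - value: ${secrets.get('anthropic', 'api-key')}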

Model selection

The gateway needs to determine which model and provider to use for each request. This happens in two stages: resolving what the client asked for, then selecting from available options.

Resolving the client’s request

The model name in your request determines the starting point:
Model in Request            What Happens
gpt-4o                      Recognized as OpenAI from the model catalog
claude-3-5-sonnet-latest    Recognized as Anthropic
openai:gpt-4o               Explicit provider prefix - no lookup needed
openai:gpt-5-preview        Unknown model - passed through to OpenAI as-is
my-provider:my-model        Uses your configured custom provider
ngrok/auto                  Lets the gateway choose based on your selection strategy
Unknown models (not in the catalog) are automatically passed through if you include a provider prefix. This lets you use new models immediately without waiting for catalog updates.
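
For the my-provider:my-model case, the prefix must match a provider id you've configured yourself. A sketch, assuming a self-hosted service at an internal URL (both the id and the URL are placeholders):
traffic-policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        # Requests for my-provider:my-model are routed here
        - id: my-provider
          base_url: "https://my-models.internal"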

Default selection

By default, the gateway uses the model and provider you requested. If that fails, it tries:
  1. Other provider API keys for the same provider
  2. Other providers that offer the same model

Custom selection strategies

You can fully customize how models are selected using CEL expressions. Define a model_selection strategy to control the order in which models are tried:
traffic-policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: openai
          api_keys:
            - value: ${secrets.get('openai', 'api-key')}
        - id: ollama
          base_url: "https://ollama.internal"
      model_selection:
        strategy:
          - "ai.models.filter(m, m.provider_id == 'ollama')"
          - "ai.models.sortBy(m, m.pricing.input)"
Strategies are evaluated in order—the first one that returns models wins. If Ollama has matching models, only those are tried. The second strategy (sorted by price) is only used if no Ollama models match. This lets you define sophisticated routing logic:
  • Cost optimization - Route to cheapest models first
  • Provider preference - Prefer certain providers over others
  • Load balancing - Randomize across equivalent models
  • Capability filtering - Select models with specific features
Use ngrok/auto as the model name to let your selection strategy choose entirely. See Model Selection Strategies for detailed examples.
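
For instance, the provider-preference and cost-optimization cases can be combined using the same filter and sortBy expressions shown above. A sketch, with placeholder provider ids and secret names:
traffic-policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: anthropic
          api_keys:
            - value: ${secrets.get('anthropic', 'api-key')}
        - id: openai
          api_keys:
            - value: ${secrets.get('openai', 'api-key')}
      model_selection:
        strategy:
          # Prefer Anthropic models when any match
          - "ai.models.filter(m, m.provider_id == 'anthropic')"
          # Otherwise fall back to the cheapest models by input price
          - "ai.models.sortBy(m, m.pricing.input)"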

Failover

When a request fails, the gateway automatically tries the next candidate. This happens transparently—your app just sees a successful response (or a final error if all candidates are exhausted).

What triggers failover?

  • Timeouts - Provider took too long to respond
  • HTTP errors - Any non-2xx/3xx response (4xx, 5xx)
  • Connection failures - Network errors, DNS issues, etc.
The gateway never retries the same model/key combination—it always moves to the next candidate.

Failover order

The gateway works through your configured options in order. For example, if you configure OpenAI with two keys and Anthropic as a backup, it tries:
  1. OpenAI with key #1
  2. OpenAI with key #2
  3. Anthropic
The gateway stops as soon as one succeeds.
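
That order corresponds to a configuration like this sketch (the secret names are placeholders):
traffic-policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: openai
          api_keys:
            - value: ${secrets.get('openai', 'api-key-1')}  # tried first
            - value: ${secrets.get('openai', 'api-key-2')}  # tried second
        - id: anthropic  # tried last
          api_keys:
            - value: ${secrets.get('anthropic', 'api-key')}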

Timeouts

Two settings control how long the gateway waits:
Setting                Default    Description
per_request_timeout    30s        Max time for a single attempt
total_timeout          5m         Max time including all failover attempts
traffic-policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      per_request_timeout: "60s"  # Wait longer for slow models
      total_timeout: "3m"         # Limit total failover time
      providers:
        - id: openai
          api_keys:
            - value: ${secrets.get('openai', 'api-key')}
If a single attempt exceeds per_request_timeout, the gateway moves to the next option. If total time exceeds total_timeout, the gateway returns an error to your app.

Token counting

The gateway counts tokens for each request, enabling:
  • Usage tracking - See token usage per provider and model
  • Input limits - Reject oversized requests before they’re sent to providers
traffic-policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      max_input_tokens: 4000  # Reject requests over 4000 input tokens
      providers:
        - id: openai
          api_keys:
            - value: ${secrets.get('openai', 'api-key')}

Content modification

You can modify requests and responses using Traffic Policy’s find and replace actions (request-body-find-replace, response-body-find-replace, sse-find-replace). This enables use cases like:
  • PII redaction - Remove sensitive data before it reaches AI providers
  • Response sanitization - Filter inappropriate content from responses
  • Prompt injection - Add system instructions to user prompts
traffic-policy.yaml
on_http_request:
  - actions:
      # Redact emails from prompts
      - type: request-body-find-replace
        config:
          replacements:
            - from: "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
              to: "[EMAIL]"
      - type: ai-gateway
        config:
          providers:
            - id: openai
              api_keys:
                - value: ${secrets.get('openai', 'api-key')}
on_event_stream_message:
  - actions:
      # Redact SSNs from streaming responses
      - type: sse-find-replace
        config:
          replacements:
            - field: data
              from: "\\b\\d{3}-\\d{2}-\\d{4}\\b"
              to: "[SSN]"

Next steps