Request flow
When you send a request to your AI Gateway endpoint:- Your app sends a request to your ngrok endpoint
- The gateway selects which models to try based on your configuration
- The request is forwarded to the provider with the appropriate provider API key
- If it fails, the gateway retries with the next model or key in the list
- The response is returned to your app
Model selection
The gateway needs to determine which model and provider to use for each request. This happens in two stages: resolving what the client asked for, then selecting from available options.Resolving the client’s request
The model name in your request determines the starting point:| Model in Request | What Happens |
|---|---|
gpt-4o | Recognized as OpenAI from the model catalog |
claude-3-5-sonnet-latest | Recognized as Anthropic |
openai:gpt-4o | Explicit provider prefix - no lookup needed |
openai:gpt-5-preview | Unknown model - passed through to OpenAI as-is |
my-provider:my-model | Uses your configured custom provider |
ngrok/auto | Let the gateway choose based on your selection strategy |
Unknown models (not in the catalog) are automatically passed through if you include a provider prefix. This lets you use new models immediately without waiting for catalog updates.
Default selection
By default, the gateway uses the model and provider you requested. If that fails, it tries:- Other provider API keys for the same provider
- Other providers that offer the same model
Custom selection strategies
You can completely customize how models are selected using CEL expressions. Define amodel_selection strategy to control the order models are tried:
traffic-policy.yaml
- Cost optimization - Route to cheapest models first
- Provider preference - Prefer certain providers over others
- Load balancing - Randomize across equivalent models
- Capability filtering - Select models with specific features
Failover
When a request fails, the gateway automatically tries the next candidate. This happens transparently—your app just sees a successful response (or a final error if all candidates are exhausted).What triggers failover?
- Timeouts - Provider took too long to respond
- HTTP errors - Any non-2xx/3xx response (4xx, 5xx)
- Connection failures - Network errors, DNS issues, etc.
Failover order
The gateway works through your configured options: For example, if you configure OpenAI with 2 keys and Anthropic as backup:- OpenAI with key #1
- OpenAI with key #2
- Anthropic
Timeouts
Two settings control how long the gateway waits:| Setting | Default | Description |
|---|---|---|
per_request_timeout | 30s | Max time for a single attempt |
total_timeout | 5m | Max time including all failover attempts |
traffic-policy.yaml
per_request_timeout, the gateway moves to the next option. If total time exceeds total_timeout, the gateway returns an error to your app.
Token counting
The gateway counts tokens for each request, enabling:- Usage tracking - See token usage per provider and model
- Input limits - Reject oversized requests before they’re sent to providers
traffic-policy.yaml
Content modification
You can modify requests and responses using Traffic Policy’s find and replace actions (request-body-find-replace, response-body-find-replace, sse-find-replace). This enables use cases like:
- PII redaction - Remove sensitive data before it reaches AI providers
- Response sanitization - Filter inappropriate content from responses
- Prompt injection - Add system instructions to user prompts
traffic-policy.yaml
Modifying Requests
Redact PII, inject prompts, add headers
Modifying Responses
Sanitize responses and streaming content