The AI Gateway collects real-time performance metrics that you can use in model selection strategies to make intelligent routing decisions.

Availability

Important: Metrics are only available within model_selection.strategy CEL expressions. They are not available in:
  • General expression fields in Traffic Policies
  • Other action configurations
  • api_key_selection.strategy expressions (not yet implemented)
This is because metrics are populated at runtime during AI Gateway request processing, specifically when evaluating model selection strategies.

Accessing metrics

Metrics are available on each model through the metrics field:
model_selection:
  strategy:
    # Filter by latency
    - "ai.models.filter(m, m.metrics.global.latency.upstream_ms_avg < 1000)"
    # Filter by error rate
    - "ai.models.filter(m, m.metrics.global.error_rate.total < 0.05)"
    # Fallback
    - "ai.models"

Metric scopes

Metrics are collected at multiple scopes, allowing you to make decisions based on global trends or your specific usage:
| Scope | CEL Path | Description |
| --- | --- | --- |
| Global | m.metrics.global | Aggregated across all ngrok accounts |
| Region | m.metrics.region | Aggregated for the region handling the request |
| Account | m.metrics.account | Your ngrok account’s usage |
| Endpoint | m.metrics.endpoint | This specific endpoint’s usage |
| API Key | m.metrics.api_keys["key_id"] | Per-provider API key metrics |

Scope selection guidelines

  • Global - Best for understanding overall provider health and comparing models you haven’t used yet
  • Region - Useful when latency varies by geographic region
  • Account - Reflects your specific usage patterns and rate limit status (see the sketch below)
  • Endpoint - Most specific, useful for per-application decisions
  • API Key - Track quota and usage for specific provider API keys
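
For example, a sketch (thresholds are illustrative) that prefers your account's own history and falls back to global data for models you haven't used yet:
model_selection:
  strategy:
    # Prefer models with a good track record for this account
    - "ai.models.filter(m, m.metrics.account.error_rate.total < 0.02 && m.metrics.account.latency.upstream_ms_avg < 1500)"
    # Fall back to global health data for models without account history
    - "ai.models.filter(m, m.metrics.global.error_rate.total < 0.05)"
    # Final fallback: any model
    - "ai.models"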

Available metrics

Base metrics (all scopes)

| Field | Type | Description |
| --- | --- | --- |
| provider | string | Provider ID (for example, "openai") |
| model | string | Model ID (for example, "gpt-4o") |
| request_count | uint64 | Total requests in the aggregation window |
| start_time | uint32 | Window start (Unix timestamp in seconds) |
| end_time | uint32 | Window end (Unix timestamp in seconds) |
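
Since request_count tells you how much data backs the other metrics, one possible sketch (the 100-request threshold is illustrative) only applies a latency filter when a model has a meaningful sample size:
model_selection:
  strategy:
    # Only trust latency data backed by a reasonable number of requests
    - "ai.models.filter(m, m.metrics.global.request_count > 100 && m.metrics.global.latency.upstream_ms_avg < 1000)"
    # Fallback
    - "ai.models"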

Latency metrics

Access via m.metrics.<scope>.latency:
| Field | Type | Description |
| --- | --- | --- |
| gateway_ms_avg | uint32 | Average gateway processing time (request received → upstream sent) |
| gateway_ms_p95 | uint32 | P95 gateway processing time |
| upstream_ms_avg | uint32 | Average time to receive full response from provider |
| upstream_ms_p95 | uint32 | P95 upstream response time |
| time_to_first_token_ms_avg | uint32* | Average TTFT (streaming responses only) |
| time_to_first_token_ms_p95 | uint32* | P95 TTFT (streaming responses only) |
| time_per_output_token_ms_avg | uint32* | Average inter-token time (streaming only) |
| time_per_output_token_ms_p95 | uint32* | P95 inter-token time (streaming only) |
*Fields marked with * may be null if no streaming requests have been recorded.
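
For streaming workloads, a sketch like this (the 800 ms threshold is illustrative) prioritizes time to first token over total response time. Because the TTFT fields may be null for models without streaming history, the metric-free fallback keeps those models reachable; depending on how the gateway evaluates null fields, you may also need an explicit presence check:
model_selection:
  strategy:
    # Prioritize responsiveness for streaming: filter on P95 TTFT
    - "ai.models.filter(m, m.metrics.global.latency.time_to_first_token_ms_p95 < 800)"
    # Metric-free fallback for models without streaming history
    - "ai.models"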

Error rate metrics

Access via m.metrics.<scope>.error_rate. All values are fractions from 0.0 to 1.0 (for example, 0.05 = a 5% error rate):
| Field | Type | Description |
| --- | --- | --- |
| total | float32 | Overall error rate (any non-2xx/3xx response) |
| timeout | float32 | Timeout errors (no response received within timeout) |
| rate_limit | float32 | Rate limit errors (HTTP 429) |
| client | float32 | Client errors (4xx excluding 429) |
| server | float32 | Server errors (5xx) |
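
Because the breakdown separates failure causes, you can treat them differently. For example, a sketch (thresholds are illustrative) that tolerates client errors but avoids providers showing timeout or server-side problems:
model_selection:
  strategy:
    # Avoid providers with server-side or timeout problems
    - "ai.models.filter(m, m.metrics.global.error_rate.server < 0.02 && m.metrics.global.error_rate.timeout < 0.02)"
    - "ai.models"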

Token metrics (account/endpoint/API key scopes)

Access via m.metrics.<scope>.token:
| Field | Type | Description |
| --- | --- | --- |
| provider_input | uint64 | Input tokens as reported by the provider |
| provider_output | uint64 | Output tokens as reported by the provider |
| estimated_input | uint64 | Input tokens estimated by ngrok’s tokenizer |
| estimated_output | uint64 | Output tokens estimated by ngrok’s tokenizer |
Token metrics are only available at Account, Endpoint, and API Key scopes. Global and Region scopes do not include token counts.
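
One way to use these counts is to spread load. For example, a sketch (the 1,000,000-token threshold is illustrative) that steers traffic away from models this endpoint has already used heavily in the current window:
model_selection:
  strategy:
    # Deprioritize models that have already produced many output tokens
    # for this endpoint in the current aggregation window
    - "ai.models.filter(m, m.metrics.endpoint.token.provider_output < 1000000)"
    - "ai.models"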

Quota metrics (API key scope only)

Access via m.metrics.api_keys["key_id"].quota:
| Field | Type | Description |
| --- | --- | --- |
| remaining_requests | uint64* | Requests remaining before hitting rate limit |
| remaining_tokens | uint64* | Tokens remaining before hitting rate limit |
| limit_requests | uint64* | Max requests allowed in rate limit period |
| limit_tokens | uint64* | Max tokens allowed in rate limit period |
*Fields may be null if quota information is not available from the provider.
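
For example, a sketch that skips models whose API key is close to exhausting its request quota. Here "my_openai_key" is a placeholder for one of your own key IDs, the 100-request floor is illustrative, and the metric-free fallback keeps models reachable when the provider reports no quota information:
model_selection:
  strategy:
    # Skip models whose key is nearly out of requests
    # ("my_openai_key" is a placeholder key ID)
    - 'ai.models.filter(m, m.metrics.api_keys["my_openai_key"].quota.remaining_requests > 100)'
    - "ai.models"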

Examples

Route to fastest models

Prefer models with low average latency:
model_selection:
  strategy:
    - "ai.models.filter(m, m.metrics.global.latency.upstream_ms_avg < 500)"
    - "ai.models.filter(m, m.metrics.global.latency.upstream_ms_avg < 2000)"
    - "ai.models"

Avoid high error rates

Skip models with too many errors:
model_selection:
  strategy:
    - "ai.models.filter(m, m.metrics.global.error_rate.total < 0.01)"
    - "ai.models.filter(m, m.metrics.global.error_rate.total < 0.10)"
    - "ai.models"

Avoid rate-limited providers

Skip models currently hitting rate limits:
model_selection:
  strategy:
    - "ai.models.filter(m, m.metrics.global.error_rate.rate_limit < 0.05)"
    - "ai.models"

Combined performance criteria

Use multiple criteria for optimal routing:
model_selection:
  strategy:
    # Ideal: fast and reliable
    - "ai.models.filter(m, m.metrics.global.latency.upstream_ms_avg < 1000 && m.metrics.global.error_rate.total < 0.01)"
    # Good: reasonably fast with acceptable errors
    - "ai.models.filter(m, m.metrics.global.latency.upstream_ms_avg < 3000 && m.metrics.global.error_rate.total < 0.05)"
    # Fallback: any available model
    - "ai.models"

Sort by latency

Order models by speed instead of filtering:
model_selection:
  strategy:
    - "ai.models.sortBy(m, m.metrics.global.latency.upstream_ms_avg)"

Use account-specific metrics

Base decisions on your own usage data:
model_selection:
  strategy:
    # Use your account's error rate data
    - "ai.models.filter(m, m.metrics.account.error_rate.total < 0.05)"
    - "ai.models"

Metric availability notes

  1. New models: Models without historical data will have zero values for metrics. Include a fallback strategy that doesn’t filter by metrics (see the sketch after this list).
  2. Custom providers: Metrics for custom providers (Ollama, vLLM, etc.) are only available after you’ve sent traffic through them.
  3. Aggregation windows: Metrics are aggregated over rolling time windows. The exact window size may vary by scope.
  4. Metric freshness: Metrics are cached and updated periodically. There may be a brief delay before recent requests are reflected.
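
Putting note 1 into practice, a sketch (the 50-request minimum is illustrative) that only trusts metrics once a model has enough history in the current window, with a metric-free fallback so brand-new models stay reachable:
model_selection:
  strategy:
    # Only trust metrics backed by enough history in the window
    - "ai.models.filter(m, m.metrics.account.request_count >= 50 && m.metrics.account.error_rate.total < 0.05)"
    # Metric-free fallback keeps brand-new models reachable
    - "ai.models"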
