The AI Gateway collects real-time performance metrics that you can use in model selection strategies to make intelligent routing decisions.
## Availability
**Important:** Metrics are only available within `model_selection.strategy` CEL expressions. They are not available in:

- General `expression` fields in Traffic Policies
- Other action configurations
- `api_key_selection.strategy` expressions (not yet implemented)

This is because metrics are populated at runtime during AI Gateway request processing, specifically when evaluating model selection strategies.
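For instance, a metrics-based expression is valid as a `model_selection.strategy` step but will not resolve in a general Traffic Policy `expression` field. A minimal sketch (the threshold is illustrative):

```yaml
model_selection:
  strategy:
    # OK: metrics resolve here, at model-selection time
    - "ai.models.filter(m, m.metrics.global.error_rate.total < 0.05)"
    # Always end with an unfiltered fallback
    - "ai.models"
```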
## Accessing metrics

Metrics are available on each model through the `metrics` field:
```yaml
model_selection:
  strategy:
    # Filter by latency
    - "ai.models.filter(m, m.metrics.global.latency.upstream_ms_avg < 1000)"
    # Filter by error rate
    - "ai.models.filter(m, m.metrics.global.error_rate.total < 0.05)"
    # Fallback
    - "ai.models"
```
## Metric scopes
Metrics are collected at multiple scopes, allowing you to make decisions based on global trends or your specific usage:
| Scope | CEL Path | Description |
|---|---|---|
| Global | `m.metrics.global` | Aggregated across all ngrok accounts |
| Region | `m.metrics.region` | Aggregated for the region handling the request |
| Account | `m.metrics.account` | Your ngrok account’s usage |
| Endpoint | `m.metrics.endpoint` | This specific endpoint’s usage |
| API Key | `m.metrics.api_keys["key_id"]` | Per-provider API key metrics |
### Scope selection guidelines

- **Global**: Best for understanding overall provider health and comparing models you haven’t used yet
- **Region**: Useful when latency varies by geographic region
- **Account**: Reflects your specific usage patterns and rate limit status
- **Endpoint**: Most specific; useful for per-application decisions
- **API Key**: Tracks quota and usage for specific provider API keys
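As a sketch of how scopes combine in practice, the strategy below prefers models that are fast in the region handling the request, then falls back to your account’s own error-rate history before accepting any model (thresholds are illustrative):

```yaml
model_selection:
  strategy:
    # Region scope: latency as observed in the region serving this request
    - "ai.models.filter(m, m.metrics.region.latency.upstream_ms_avg < 800)"
    # Account scope: your own traffic's error-rate history
    - "ai.models.filter(m, m.metrics.account.error_rate.total < 0.05)"
    # Fallback: any available model
    - "ai.models"
```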
## Available metrics

### Base metrics (all scopes)

| Field | Type | Description |
|---|---|---|
| `provider` | string | Provider ID (for example, `"openai"`) |
| `model` | string | Model ID (for example, `"gpt-4o"`) |
| `request_count` | uint64 | Total requests in the aggregation window |
| `start_time` | uint32 | Window start (Unix timestamp in seconds) |
| `end_time` | uint32 | Window end (Unix timestamp in seconds) |
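One way to use the base fields is as a sample-size guard: `request_count` tells you how much traffic backs the other aggregates, so you can avoid acting on thinly supported latency or error data. A sketch (the minimum count is illustrative):

```yaml
model_selection:
  strategy:
    # Only trust latency figures backed by a reasonable sample size
    - "ai.models.filter(m, m.metrics.global.request_count > 100 && m.metrics.global.latency.upstream_ms_avg < 1000)"
    - "ai.models"
```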
### Latency metrics

Access via `m.metrics.<scope>.latency`:

| Field | Type | Description |
|---|---|---|
| `gateway_ms_avg` | uint32 | Average gateway processing time (request received → upstream sent) |
| `gateway_ms_p95` | uint32 | P95 gateway processing time |
| `upstream_ms_avg` | uint32 | Average time to receive the full response from the provider |
| `upstream_ms_p95` | uint32 | P95 upstream response time |
| `time_to_first_token_ms_avg` | uint32* | Average TTFT (streaming responses only) |
| `time_to_first_token_ms_p95` | uint32* | P95 TTFT (streaming responses only) |
| `time_per_output_token_ms_avg` | uint32* | Average inter-token time (streaming only) |
| `time_per_output_token_ms_p95` | uint32* | P95 inter-token time (streaming only) |
\*Fields marked with `*` may be `null` if no streaming requests have been recorded.
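For streaming workloads, time to first token is often what users perceive as responsiveness. Since the starred fields may be null, a sketch that guards against null before comparing (assuming null checks behave this way in these CEL expressions; the threshold is illustrative):

```yaml
model_selection:
  strategy:
    # Prefer models that start streaming quickly; the null check filters
    # out models with no recorded streaming traffic, and the final
    # fallback step catches them
    - "ai.models.filter(m, m.metrics.global.latency.time_to_first_token_ms_p95 != null && m.metrics.global.latency.time_to_first_token_ms_p95 < 1500)"
    - "ai.models"
```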
### Error rate metrics

Access via `m.metrics.<scope>.error_rate`. All values are fractions from 0.0 to 1.0 (for example, 0.05 = a 5% error rate):

| Field | Type | Description |
|---|---|---|
| `total` | float32 | Overall error rate (any non-2xx/3xx response) |
| `timeout` | float32 | Timeout errors (no response received within the timeout) |
| `rate_limit` | float32 | Rate limit errors (HTTP 429) |
| `client` | float32 | Client errors (4xx, excluding 429) |
| `server` | float32 | Server errors (5xx) |
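The per-category rates let you treat failure modes differently: you might tolerate client errors (often caused by request payloads) while avoiding providers that are timing out or failing server-side. A sketch (thresholds are illustrative):

```yaml
model_selection:
  strategy:
    # Avoid providers that are timing out or erroring server-side,
    # without penalizing them for 4xx client errors
    - "ai.models.filter(m, m.metrics.global.error_rate.timeout < 0.02 && m.metrics.global.error_rate.server < 0.02)"
    - "ai.models"
```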
### Token metrics (account/endpoint/API key scopes)

Access via `m.metrics.<scope>.token`:

| Field | Type | Description |
|---|---|---|
| `provider_input` | uint64 | Input tokens as reported by the provider |
| `provider_output` | uint64 | Output tokens as reported by the provider |
| `estimated_input` | uint64 | Input tokens estimated by ngrok’s tokenizer |
| `estimated_output` | uint64 | Output tokens estimated by ngrok’s tokenizer |

Token metrics are only available at the Account, Endpoint, and API Key scopes. Global and Region scopes do not include token counts.
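One possible use is a rough spend guard: treat recent output-token usage at the account scope as a soft budget signal and prefer models you haven’t been consuming heavily. A sketch (the threshold is illustrative, and the aggregation window determines what "recent" means):

```yaml
model_selection:
  strategy:
    # Soft budget guard: skip models with heavy recent output-token
    # usage on this account (threshold is illustrative)
    - "ai.models.filter(m, m.metrics.account.token.provider_output < 1000000)"
    - "ai.models"
```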
### Quota metrics (API key scope only)

Access via `m.metrics.api_keys["key_id"].quota`:

| Field | Type | Description |
|---|---|---|
| `remaining_requests` | uint64* | Requests remaining before hitting the rate limit |
| `remaining_tokens` | uint64* | Tokens remaining before hitting the rate limit |
| `limit_requests` | uint64* | Maximum requests allowed in the rate limit period |
| `limit_tokens` | uint64* | Maximum tokens allowed in the rate limit period |

\*Fields may be `null` if quota information is not available from the provider.
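A sketch that routes away from models whose API key is close to exhausting its request quota; `"key_id"` is a placeholder for one of your provider API key IDs, and the threshold is illustrative:

```yaml
model_selection:
  strategy:
    # Skip models whose key has little request quota left; models whose
    # provider reports no quota fail the null check here and are caught
    # by the fallback step
    - 'ai.models.filter(m, m.metrics.api_keys["key_id"].quota.remaining_requests != null && m.metrics.api_keys["key_id"].quota.remaining_requests > 100)'
    - "ai.models"
```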
## Examples

### Route to fastest models

Prefer models with low average latency:

```yaml
model_selection:
  strategy:
    - "ai.models.filter(m, m.metrics.global.latency.upstream_ms_avg < 500)"
    - "ai.models.filter(m, m.metrics.global.latency.upstream_ms_avg < 2000)"
    - "ai.models"
```
### Avoid high error rates

Skip models with too many errors:

```yaml
model_selection:
  strategy:
    - "ai.models.filter(m, m.metrics.global.error_rate.total < 0.01)"
    - "ai.models.filter(m, m.metrics.global.error_rate.total < 0.10)"
    - "ai.models"
```
### Avoid rate-limited providers

Skip models currently hitting rate limits:

```yaml
model_selection:
  strategy:
    - "ai.models.filter(m, m.metrics.global.error_rate.rate_limit < 0.05)"
    - "ai.models"
```
### Combine multiple criteria

Use multiple criteria for optimal routing:

```yaml
model_selection:
  strategy:
    # Ideal: fast and reliable
    - "ai.models.filter(m, m.metrics.global.latency.upstream_ms_avg < 1000 && m.metrics.global.error_rate.total < 0.01)"
    # Good: reasonably fast with acceptable errors
    - "ai.models.filter(m, m.metrics.global.latency.upstream_ms_avg < 3000 && m.metrics.global.error_rate.total < 0.05)"
    # Fallback: any available model
    - "ai.models"
```
### Sort by latency

Order models by speed instead of filtering:

```yaml
model_selection:
  strategy:
    - "ai.models.sortBy(m, m.metrics.global.latency.upstream_ms_avg)"
```
### Use account-specific metrics

Base decisions on your own usage data:

```yaml
model_selection:
  strategy:
    # Use your account's error rate data
    - "ai.models.filter(m, m.metrics.account.error_rate.total < 0.05)"
    - "ai.models"
```
## Metric availability notes

- **New models**: Models without historical data have zero values for all metrics. Include a fallback step that doesn’t filter by metrics, as shown in the sketch after this list.
- **Custom providers**: Metrics for custom providers (Ollama, vLLM, and so on) are only available after you’ve sent traffic through them.
- **Aggregation windows**: Metrics are aggregated over rolling time windows. The exact window size may vary by scope.
- **Metric freshness**: Metrics are cached and updated periodically, so there may be a brief delay before recent requests are reflected.
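A metric-safe strategy for these cases keeps every metric filter in earlier steps and ends with a bare `ai.models`, so new models and custom providers with no recorded traffic always remain reachable. A sketch (thresholds are illustrative):

```yaml
model_selection:
  strategy:
    # Metric-based preference; models with no history have zero-valued
    # metrics and may pass or fail these filters unexpectedly
    - "ai.models.filter(m, m.metrics.global.request_count > 0 && m.metrics.global.error_rate.total < 0.05)"
    # Final step applies no metric filters, so models without data
    # are never routed around entirely
    - "ai.models"
```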