Routing

Learn how LLMGateway intelligently routes your requests to the best available models and providers.

LLMGateway provides flexible and intelligent routing options to help you get the best performance and cost efficiency from your AI applications. Whether you want to use specific models, providers, or let our system automatically optimize your requests, we've got you covered.

LLMGateway also includes automatic retry and fallback — if a provider fails, your request is seamlessly retried on the next best provider, all within the same API call.

Model Selection

Any Model Name

You can use any model name from our models page or discover available models programmatically through the /v1/models endpoint.

curl -X POST "https://api.llmgateway.io/v1/chat/completions" \
  -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Model ID Routing

Choose a specific model ID to route to the best available provider for that model. LLMGateway's smart routing algorithm considers multiple factors to find the optimal provider across all configured options.

Smart Routing Algorithm

When you use a model ID without a provider prefix, LLMGateway's intelligent routing system analyzes multiple factors to select the best provider.

Weighted Scoring System:

Each factor has a relative weight. The factors are scored as ratios against the best provider in the candidate set (e.g. a provider that is twice as expensive as the cheapest scores 1.0 on price), and each ratio is multiplied by its weight divided by the sum of all active weights. The provider with the lowest (best) total score wins.

The default weights are:

| Factor | Default weight | Notes |
| --- | --- | --- |
| Price | 0.6 | Cost efficiency (average of input and output price) |
| Uptime | 0.5 | Provider reliability / low error rate |
| Throughput | 0.05 | Tokens per second generation speed |
| Latency | 0.025 | Time to first token; only applied for streaming requests |
| Cache | 0.2 | Prompt-cache support; only applied for large prompts (≥ 5,000 tokens) |
| Image price | 1.0 | Replaces the price weight for image-generation models |

Because the weights are relative and normalized by the sum of the active weights, price and uptime dominate routing decisions in practice, while throughput and latency act as tie-breakers between otherwise comparable providers.

Latency Weight for Non-Streaming Requests:

The latency weight only applies to streaming requests (time-to-first-token is only measured there). For non-streaming requests the latency weight is dropped and its share is redistributed proportionally across the remaining factors.
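The weighting and normalization described above can be sketched in Python. This is an illustrative sketch, not the gateway's implementation: the function names and the provider dictionaries are hypothetical, and the ratio formula (relative gap to the best value in the candidate set) is an assumption chosen so that a provider twice as expensive as the cheapest scores 1.0 on price.

```python
DEFAULT_WEIGHTS = {"price": 0.6, "uptime": 0.5, "throughput": 0.05,
                   "latency": 0.025, "cache": 0.2}
LOWER_IS_BETTER = {"price", "latency"}

def active_shares(streaming, prompt_tokens, weights=DEFAULT_WEIGHTS):
    """Drop inactive factors, then divide each weight by the sum of the active weights."""
    active = dict(weights)
    if not streaming:
        active.pop("latency")   # time to first token is only measured when streaming
    if prompt_tokens < 5000:
        active.pop("cache")     # cache support is ignored for small prompts
    total = sum(active.values())
    return {factor: weight / total for factor, weight in active.items()}

def relative_gap(value, best):
    # Assumed ratio: distance from the best candidate, relative to the best.
    # Returns 0.0 when the best value is 0 to avoid dividing by zero.
    return abs(value - best) / best if best else 0.0

def score(provider, candidates, shares):
    """Weighted sum of per-factor ratios; the lowest total wins."""
    total = 0.0
    for factor, share in shares.items():
        if factor == "cache":
            # Assumed encoding: 0 when prompt caching is supported, 1 otherwise.
            ratio = 0.0 if provider["cache"] else 1.0
        else:
            values = [c[factor] for c in candidates]
            best = min(values) if factor in LOWER_IS_BETTER else max(values)
            ratio = relative_gap(provider[factor], best)
        total += ratio * share
    return total
```

With the default weights, a non-streaming request with a short prompt ends up with shares of roughly 0.52 for price, 0.43 for uptime, and 0.04 for throughput, which illustrates why price and uptime dominate.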

Time-Decayed Metrics Window:

Provider metrics (uptime, throughput, latency) are not a flat "last N minutes" snapshot. They are aggregated over a rolling 60-minute window with a time-decay weighting so very recent behavior dominates while older data still contributes:

  • The most recent 1 minute is weighted 10×
  • The most recent 5 minutes are weighted 3×
  • The remainder of the 60-minute window is weighted 1×

This makes routing react quickly to a provider that just started failing or slowing down, without overreacting to a single noisy data point.
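A minimal sketch of a time-decayed uptime calculation, assuming the default tier weights of 10, 3, and 1 listed under Per-Project Routing Configuration. How the gateway actually aggregates samples is not described here; the per-request weighted average, the tier boundary handling, and the function names below are assumptions.

```python
def decay_weight(age_minutes, tiers=((1, 10.0), (5, 3.0), (60, 1.0))):
    """Weight for a sample that is age_minutes old; 0.0 outside the window."""
    for limit, weight in tiers:
        if age_minutes < limit:
            return weight
    return 0.0

def decayed_uptime(samples):
    """samples: iterable of (age_minutes, succeeded). Returns uptime in percent, or None."""
    weighted_ok = 0.0
    weighted_total = 0.0
    for age_minutes, succeeded in samples:
        weight = decay_weight(age_minutes)
        weighted_total += weight
        if succeeded:
            weighted_ok += weight
    if weighted_total == 0:
        return None  # no samples inside the window
    return 100.0 * weighted_ok / weighted_total
```

In this sketch, one failure 30 seconds ago next to ten successes from 30 minutes ago yields 50% uptime, whereas a flat average over the same eleven samples would report about 91%.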

Cache Support for Large Prompts:

When the estimated prompt is at least 5,000 tokens, the cache weight (default 0.2) is factored into the score based on whether each provider supports prompt caching (advertised via a cached input price). Providers that support caching score better than ones that do not, since caching can substantially reduce the cost of large or repeated prompts. Below the 5,000-token threshold, this weight is dropped entirely — caching has little impact on small prompts, so cache support is ignored. The selected provider's cache support is exposed as cacheSupported on the routing metadata.

Exponential Uptime Penalty:

Providers with uptime below 95% receive an additional exponential penalty that increases rapidly as uptime drops:

  • 95-100% uptime: No penalty
  • 90% uptime: ~0.07 penalty
  • 80% uptime: ~0.62 penalty
  • 70% uptime: ~1.73 penalty
  • 50% uptime: ~5.61 penalty

This ensures providers experiencing significant issues are strongly deprioritized while minor fluctuations have minimal impact. The penalty threshold (default 95%) is configurable.
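The exact penalty formula is not given here. As an assumption, the curve below was fitted to the values listed above and reproduces them to two decimal places; note that it is a scaled, squared shortfall rather than a literal exponential, so treat it as an approximation of the documented behavior and not as the gateway's code.

```python
def uptime_penalty(uptime, threshold=95.0):
    """Extra score penalty for a provider whose uptime (in percent) is below threshold."""
    if uptime >= threshold:
        return 0.0
    shortfall = (threshold - uptime) / threshold  # 0.0 at the threshold, 1.0 at 0% uptime
    return (5.0 * shortfall) ** 2                 # assumed curve, fitted to the listed values

for uptime in (97, 90, 80, 70, 50):
    print(uptime, round(uptime_penalty(uptime), 2))
```

Because the penalty grows with the square of the shortfall, dropping from 90% to 80% uptime costs far more than dropping from 95% to 90%.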

Provider Priority:

Each provider has a priority value (default 1) that nudges routing toward or away from it independently of live metrics:

  • A provider's priority is applied as a (1 - priority) adjustment to its score — higher priority lowers the score (more preferred), lower priority raises it (less preferred).
  • A priority of 0 disables the provider entirely, removing it from routing for that model.

Provider priorities are surfaced in the routing metadata so you can see how they influenced a decision.
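As a rough sketch of the priority rules above: the function name is hypothetical, and whether the adjustment is applied before or after the other score components is an assumption.

```python
def apply_priority(score, priority=1.0):
    """Return the priority-adjusted score, or None when the provider is disabled."""
    if priority == 0:
        return None                    # priority 0 removes the provider from routing
    return score + (1.0 - priority)    # priority > 1 lowers the score, < 1 raises it
```

Since lower scores win, a priority of 1.2 subtracts 0.2 from the score and a priority of 0.5 adds 0.5 to it.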

Epsilon-Greedy Exploration (1% of requests by default):

To solve the "cold start problem" where new or unused providers never get traffic to build up metrics, the system randomly explores different providers a small fraction of the time (default 1%, configurable). This ensures:

  • All providers periodically receive traffic
  • New providers can prove their reliability
  • The system adapts to changing provider performance
  • You benefit from improved routing decisions over time

The exploration rate is configurable per project through the routing configuration (thresholds.explorationRate), and self-hosted deployments can override it globally with the EXPLORATION_RATE environment variable (a number between 0 and 1).
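Epsilon-greedy selection can be sketched as follows. The function name is hypothetical, and picking the exploration target uniformly at random from all candidates is an assumption; the source only states that a small fraction of requests explores other providers.

```python
import random

def choose_provider(scores, exploration_rate=0.01, rng=random):
    """scores maps provider -> score (lower is better). Returns (provider, explored)."""
    if rng.random() < exploration_rate:
        # Explore: ignore the scores so rarely used providers still collect metrics.
        return rng.choice(sorted(scores)), True
    # Exploit: take the best-scoring provider.
    return min(scores, key=scores.get), False
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible in tests.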

Stable Provider Preference:

To avoid unnecessary churn between providers that score similarly, LLMGateway remembers the best provider chosen for each model and sticks with it across requests — even if another provider edges ahead slightly on the next score calculation.

On every routing decision, the system checks whether the previously selected provider is still acceptable:

  • Uptime hard switch: if the preferred provider's uptime drops below 85%, routing switches to the current best-scoring provider immediately.
  • Score margin soft switch: the preferred provider is replaced only when a better option's score is more than 0.15 ahead. Small fluctuations caused by metric noise or minor price differences do not trigger a switch.
  • Periodic re-evaluation: the preference expires after 1 hour, at which point the next request picks the best-scoring provider fresh and stores it as the new preferred.

Requests that are part of the epsilon-greedy exploration bypass this preference entirely so that all providers continue to receive periodic traffic and build up metrics.

The selection reason in routing metadata will show stable-preferred when a request was served by the stored preference rather than the top-scored provider at that moment.

Self-hosted deployments can tune this behavior with three environment variables: PREFERRED_PROVIDER_TTL (preference lifetime in seconds, default 3600), PREFERRED_PROVIDER_UPTIME_THRESHOLD (hard-switch uptime floor, default 85), and PREFERRED_PROVIDER_SCORE_MARGIN (soft-switch score gap, default 0.15). On the Enterprise plan, these same values can be customized per project from the dashboard — see Per-Project Routing Configuration.
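The three checks can be sketched as one decision function using the defaults above. The function name, argument shapes, and every reason string other than stable-preferred are hypothetical, and the order in which the checks run is an assumption.

```python
def pick_with_preference(scores, uptimes, preferred, stored_at, now,
                         ttl=3600, uptime_threshold=85.0, score_margin=0.15):
    """Return (provider, reason). Lower score is better; uptimes are percentages."""
    best = min(scores, key=scores.get)
    if preferred is None or preferred not in scores:
        return best, "no-preference"
    if now - stored_at >= ttl:
        return best, "preference-expired"      # periodic re-evaluation
    if uptimes[preferred] < uptime_threshold:
        return best, "uptime-hard-switch"      # preferred provider is unhealthy
    if scores[preferred] - scores[best] > score_margin:
        return best, "score-margin-switch"     # another provider is clearly better
    return preferred, "stable-preferred"       # small gaps do not cause a switch
```

In a fuller version the caller would store the returned provider and the current time as the new preference whenever the reason is not stable-preferred.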

Routing Metadata:

Every request includes detailed routing metadata in the logs, showing:

  • Available providers that were considered
  • Selected provider and selection reason
  • Scores for each provider (including uptime, throughput, latency, price, priority, and cache support)

This transparency allows you to understand and debug routing decisions.

Using model IDs without a provider prefix automatically routes to the optimal provider based on reliability, speed, and cost. The system continuously learns and adapts based on real-time performance metrics.

Smart routing prioritizes reliability over cost, ensuring your requests are routed to providers with proven uptime and performance, while still considering cost efficiency.

Sticky Session Routing

When a model is served by multiple providers, every request is normally scored independently — so a multi-turn conversation can bounce between providers. That defeats provider-side prompt caching, which only pays off when consecutive requests with a shared prefix hit the same provider.

Sticky session routing solves this: attach a session identifier and LLMGateway pins all requests for that session to a single provider (and region), keeping the upstream prompt cache warm across the whole conversation.

Setting the session id

For chat completions, the session key is resolved in priority order:

  1. The x-session-id header
  2. The prompt_cache_key body field (OpenAI-compatible)
  3. The user body field (OpenAI-compatible)

curl -X POST "https://api.llmgateway.io/v1/chat/completions" \
  -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -H "x-session-id: conversation-9f8e7d6c" \
  -d '{
    "model": "claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
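The priority order for chat completions can be sketched as a small lookup. The function name is hypothetical, and treating empty strings as "not set" is an assumption; header names are matched case-insensitively because HTTP header names are case-insensitive.

```python
def resolve_session_key(headers, body):
    """Return the sticky-session key for a chat completions request, or None."""
    normalized = {name.lower(): value for name, value in headers.items()}
    candidates = (
        normalized.get("x-session-id"),   # 1. explicit header
        body.get("prompt_cache_key"),     # 2. OpenAI-compatible body field
        body.get("user"),                 # 3. OpenAI-compatible body field
    )
    for candidate in candidates:
        if candidate:
            return candidate
    return None
```

A request that carries none of the three values returns None here, which corresponds to the normal weighted routing path.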

For the Anthropic Messages endpoint (/v1/messages), the session key is derived automatically from metadata.user_id — coding agents such as Claude Code embed the session id there — and forwarded internally. An explicit x-session-id header still takes precedence.

How pinning works

When a session id is present, the provider (and region) is chosen deterministically using rendezvous hashing over the available providers, rather than the weighted score. The same session always maps to the same provider as long as that provider remains available.

Because the choice is deterministic per session, sticky requests bypass both the weighted scoring and the epsilon-greedy exploration — a session is never randomly bounced to a different provider mid-conversation.

Falling back when a provider is down

Stickiness yields only when the pinned provider is effectively unavailable. A session is re-pinned to another provider when its provider:

  • Is filtered out by health checks (e.g. excluded for low uptime), or
  • Fails the request and is dropped by the automatic retry & fallback loop.

Rendezvous hashing guarantees a minimal-disruption reshuffle: when one provider drops out, only the sessions pinned to that provider move — every other session keeps its provider. When the provider recovers and re-enters the available pool, affected sessions pin back to it.
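Rendezvous (highest-random-weight) hashing itself is a standard technique and is short enough to sketch. The hash function, the key format, and the provider names below are assumptions; the gateway's actual choices are not documented here.

```python
import hashlib

def pin_provider(session_id, providers):
    """Pick the provider with the highest hash weight for this session."""
    def weight(provider):
        digest = hashlib.sha256(f"{session_id}|{provider}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(providers, key=weight)

providers = ["provider-a", "provider-b", "provider-c"]
pinned = pin_provider("conversation-9f8e7d6c", providers)
# Removing a different provider never changes this session's pin.
others = [p for p in providers if p != pinned]
survivors = [pinned, others[0]]
assert pin_provider("conversation-9f8e7d6c", survivors) == pinned
```

Each session ranks every provider independently, so removing one provider only affects the sessions for which that provider ranked first, and those sessions return to it once it is back in the list.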

The selection reason in routing metadata shows session-sticky when a request was pinned via a session id.

Sticky routing optimizes for cache locality over per-request price. A session stays on its provider even if a cheaper or faster alternative is momentarily available, since the prompt-cache savings typically outweigh the difference. Requests without a session id are unaffected and continue to use the weighted smart-routing algorithm.

Provider-Specific Routing

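To target a specific provider, prefix the model name with the provider name followed by a slash. Prefixed requests are not retried on other providers, although low-uptime protection (described below) can still reroute them unless the X-No-Fallback header is set: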

# Use OpenAI specifically
curl -X POST "https://api.llmgateway.io/v1/chat/completions" \
  -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Use DeepSeek provider specifically
curl -X POST "https://api.llmgateway.io/v1/chat/completions" \
  -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek/deepseek-v3.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Regions

Some providers expose the same model in multiple regions. In that case, LLMGateway supports two routing modes:

  • provider/model selects the best eligible region for that provider using the same routing inputs used elsewhere: recent uptime, throughput, latency, and price
  • provider/model:region pins the request to one exact region

# Let LLMGateway choose the best Alibaba region for DeepSeek V3.2
curl -X POST "https://api.llmgateway.io/v1/chat/completions" \
  -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "alibaba/deepseek-v3.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Force a specific Alibaba region
curl -X POST "https://api.llmgateway.io/v1/chat/completions" \
  -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "alibaba/deepseek-v3.2:cn-beijing",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
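The three model string forms used in this guide (model, provider/model, provider/model:region) can be told apart with a small parser. This is a sketch with a hypothetical function name, and it assumes that model IDs themselves contain neither "/" nor ":".

```python
def parse_model(model):
    """Split 'provider/model:region' into (provider, model_id, region)."""
    provider, slash, rest = model.partition("/")
    if not slash:
        return None, model, None          # bare model ID: smart routing picks the provider
    model_id, _, region = rest.partition(":")
    return provider, model_id, region or None  # region is None when not pinned

print(parse_model("alibaba/deepseek-v3.2:cn-beijing"))
```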

If your provider key stores an explicit region, that region acts like a lock and LLMGateway will only use that region for provider-specific requests. If no explicit region is configured on the provider key, provider-specific requests can still score all eligible regions for that provider.

Routing metadata reflects this:

  • Dynamic provider-region selection shows all eligible regional scores that were considered
  • Explicitly pinned regions show only the pinned region in the score list

Region-aware routing only compares regions that are actually available for the current project mode and provider setup. In credits mode, that means only regions backed by configured environment keys. In API keys and hybrid mode, an explicit provider-key region restricts the request to that region.

Low-Uptime Protection

When you specify a provider explicitly, LLMGateway checks the provider's recent uptime (from the time-decayed metrics window described above). If the uptime falls below 90%, the system automatically routes your request to the best available alternative provider to ensure reliability. This protects your application from providers experiencing temporary issues. The fallback threshold (default 90%) is configurable.

If the requested provider has low uptime but no alternative providers are available for that model, the request will still be sent to the originally requested provider.

Disabling Fallback with X-No-Fallback Header

If you need to bypass this protection and always use the exact provider you specified regardless of its current uptime, you can use the X-No-Fallback header:

# Force use of a specific provider even if it has low uptime
curl -X POST "https://api.llmgateway.io/v1/chat/completions" \
  -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-No-Fallback: true" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Using X-No-Fallback: true disables automatic provider failover. Your requests will be sent to the specified provider even if it is experiencing issues, which may result in higher error rates. Retries may still occur against another key for the same provider when multiple keys are configured.

When the X-No-Fallback header is used, the routing metadata in logs will include noFallback: true to indicate that fallback was disabled for that request.
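Low-uptime protection and the X-No-Fallback override can be summarized in one decision function. This is a sketch under stated assumptions: the function name and argument shapes are hypothetical, and `alternatives` is assumed to be already ordered best-first by the routing score.

```python
def resolve_explicit_provider(requested, uptimes, alternatives,
                              no_fallback=False, threshold=90.0):
    """Return (provider, metadata) for a request that names a provider explicitly."""
    metadata = {"noFallback": True} if no_fallback else {}
    healthy = uptimes[requested] >= threshold
    if no_fallback or healthy or not alternatives:
        # Keep the requested provider: override set, provider healthy,
        # or nothing else can serve the model.
        return requested, metadata
    return alternatives[0], metadata
```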

Automatic Retry & Fallback

When using model ID routing (without a provider prefix), LLMGateway automatically retries failed requests on alternate providers. This happens transparently within the same API call — your application receives the successful response as if nothing went wrong.

How Retry Works

  1. Your request is routed to the best available provider using the smart routing algorithm
  2. If that provider returns a server error (5xx), times out, or has a connection failure, the gateway marks the provider as failed
  3. The next best available provider is selected and the request is retried
  4. Up to 2 retries are attempted before returning an error to the client

Request → Provider A (500 error) → Provider B (200 OK) → Response

Both streaming and non-streaming requests support automatic retry.

What Triggers a Retry

Retries are triggered by server-side failures only:

  • 5xx errors (500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable, etc.)
  • Timeouts (upstream provider took too long to respond)
  • Connection failures (network errors, DNS failures, etc.)

Retries are not triggered by:

  • 4xx client errors (400 Bad Request, 401 Unauthorized, 403 Forbidden, 422 Unprocessable Entity)
  • Content filter responses (Azure ResponsibleAI, etc.)

When Retry Is Disabled

Automatic retry to a different provider is disabled when:

  • The X-No-Fallback: true header is set
  • A specific provider is requested (e.g., openai/gpt-4o)
  • No alternative providers are available for the requested model
  • The maximum retry count (2) has been exhausted

Retries can still happen within the same provider when multiple keys are configured and the current key fails with a retryable error.
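The cross-provider retry loop can be sketched as follows. `ProviderError`, `send`, and the attempt dictionaries are hypothetical stand-ins for the gateway's internals, and content-filter responses and same-provider key rotation are not modeled.

```python
class ProviderError(Exception):
    def __init__(self, status_code=None):
        super().__init__(f"upstream failure (status={status_code})")
        self.status_code = status_code  # None models a timeout or connection failure

def is_retryable(error):
    """Only server-side failures are retried; 4xx client errors are not."""
    return error.status_code is None or 500 <= error.status_code <= 599

def call_with_fallback(ranked_providers, send, max_retries=2):
    """Try providers best-first; return (response, attempts) or raise."""
    attempts = []
    for provider in ranked_providers[: max_retries + 1]:
        try:
            response = send(provider)
        except ProviderError as error:
            attempts.append({"provider": provider,
                             "status_code": error.status_code,
                             "succeeded": False})
            if not is_retryable(error):
                raise                  # client errors go straight back to the caller
            continue                   # retryable: move on to the next provider
        attempts.append({"provider": provider, "status_code": 200, "succeeded": True})
        return response, attempts
    raise RuntimeError(f"all {len(attempts)} attempts failed")
```

With `max_retries=2` the loop makes at most three attempts in total: the initial request plus two retries.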

Routing Transparency

Every provider attempt — both failed and successful — is recorded in the routing array in the response metadata and activity logs:

{
	"metadata": {
		"routing": [
			{
				"provider": "openai",
				"model": "gpt-4o",
				"status_code": 500,
				"error_type": "server_error",
				"succeeded": false
			},
			{
				"provider": "azure",
				"model": "gpt-4o",
				"status_code": 200,
				"error_type": "none",
				"succeeded": true
			}
		]
	}
}
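On the client side, the routing array can be inspected to see whether a response was recovered by a retry. The helper below is a hypothetical sketch that relies only on the fields shown in the example above.

```python
import json

def summarize_routing(response_json):
    """Return which provider served the request and which ones failed first."""
    routing = json.loads(response_json)["metadata"]["routing"]
    failed = [attempt["provider"] for attempt in routing if not attempt["succeeded"]]
    served_by = next((a["provider"] for a in routing if a["succeeded"]), None)
    return {"served_by": served_by, "failed": failed}

example = """{"metadata": {"routing": [
  {"provider": "openai", "model": "gpt-4o", "status_code": 500,
   "error_type": "server_error", "succeeded": false},
  {"provider": "azure", "model": "gpt-4o", "status_code": 200,
   "error_type": "none", "succeeded": true}
]}}"""
print(summarize_routing(example))
```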

Retried Log Tracking

Each provider attempt creates its own log entry. Failed attempts that were retried are marked with:

  • retried: true — indicates this failed request was retried on another provider
  • retriedByLogId — the ID of the final successful log entry

This allows you to distinguish between unrecovered failures and failures that were transparently recovered via retry. In the dashboard, retried logs display a "Retried" badge with a link to the successful log.

Impact on Provider Health

Failed attempts still count against the provider's uptime score, even when the request was successfully retried on another provider. This means:

  • A provider that keeps failing will see its uptime score drop
  • The exponential uptime penalty kicks in below 95% (see Smart Routing Algorithm)
  • Future requests are automatically routed away from unreliable providers
  • Your application stays reliable without any code changes on your side

Automatic retry and fallback works together with smart routing to provide self-healing behavior. Failing providers are automatically avoided, and your requests are transparently recovered on reliable alternatives.

Per-Project Routing Configuration (Enterprise)

The values described above — scoring weights, thresholds, retry behavior, the metrics window, sticky-routing, and per-provider priorities — are the defaults that apply to every project. On the Enterprise plan, you can override any of them per project from the dashboard under Project Settings → Routing. Projects on other plans always use the defaults.

Overrides are merged on top of the defaults, so you only set the values you want to change. When a custom configuration is disabled, the project falls back to the defaults.

The following groups can be customized per project:

| Group | What it controls | Defaults |
| --- | --- | --- |
| Weights | Relative importance of each scoring factor | price 0.6, imagePrice 1.0, uptime 0.5, throughput 0.05, latency 0.025, cache 0.2 |
| Thresholds | Cache prompt-size threshold, uptime-penalty threshold, exploration rate, and the assumed defaults used when no metrics exist | cachePromptTokens 5000, uptimePenalty 95, defaultUptime 100, defaultLatency 1000, defaultThroughput 50, explorationRate 0.01 |
| Retry | Max cross-provider fallback attempts and the low-uptime reroute threshold | maxRetries 2, lowUptimeFallbackThreshold 90 |
| Timeouts | Per-request time limits (end-to-end, streaming, non-streaming). Capped at the infrastructure defaults; an override can only lower them | gatewayMs 1,500,000; streamingMs 1,200,000; plainMs 600,000 |
| History | The metrics window and the time-decay tier boundaries and weights | windowMinutes 60 (max 120), tier1Minutes 1, tier2Minutes 5, tier1Weight 10, tier2Weight 3, tier3Weight 1 |
| Sticky | Stable-provider preference: on/off, TTL, hard-switch uptime floor, soft-switch score margin | enabled true, ttlSeconds 3600, uptimeThreshold 85, scoreMargin 0.15 |
| Provider priorities | Per-provider priority multipliers; set a provider to 0 to disable it for that project | 1 for every provider |
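The "merged on top of the defaults" behavior amounts to a recursive dictionary merge. The sketch below uses a subset of the documented defaults; the group keys `weights` and `retry` and both function names are assumptions (only `thresholds.explorationRate` appears as a path in this guide).

```python
import copy

DEFAULTS = {
    "weights": {"price": 0.6, "uptime": 0.5, "throughput": 0.05,
                "latency": 0.025, "cache": 0.2},
    "thresholds": {"cachePromptTokens": 5000, "uptimePenalty": 95,
                   "explorationRate": 0.01},
    "retry": {"maxRetries": 2, "lowUptimeFallbackThreshold": 90},
}

def merge_config(defaults, overrides):
    """Return a new config with overrides applied on top of defaults."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)  # merge nested groups
        else:
            merged[key] = copy.deepcopy(value)              # replace leaf values
    return merged

def effective_config(overrides, enabled=True):
    """A disabled custom configuration falls back to the defaults."""
    return merge_config(DEFAULTS, overrides if enabled else {})
```

Setting only `{"thresholds": {"explorationRate": 0.05}}` changes that one value and leaves every other default in place.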

Per-project routing configuration requires the Enterprise plan. If you'd like to tune routing for your workloads, contact us at contact@llmgateway.io.

Optimized Auto Routing

Auto routing automatically selects the best model for your specific use case without you having to specify a model at all.

Current Implementation

The auto routing system currently:

  • Chooses cost-effective models by default for optimal price-to-performance ratio
  • Automatically scales to more powerful models based on your request's context size
  • Handles large contexts intelligently by selecting models with appropriate context windows

# Let LLMGateway choose the optimal model
curl -X POST "https://api.llmgateway.io/v1/chat/completions" \
  -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Your request here..."}]
  }'

Free Models Only

When using auto routing, you can restrict the selection to only free models (models with zero input and output pricing) by setting the free_models_only parameter to true:

# Auto route to free models only
curl -X POST "https://api.llmgateway.io/v1/chat/completions" \
  -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello!"}],
    "free_models_only": true
  }'

Adding even a small amount of credits to your account (e.g., $10) will immediately upgrade your free model rate limits from 5 requests per 10 minutes to 20 requests per minute.

The free_models_only parameter only works with auto routing ("model": "auto"). If no free models are available that meet your request requirements, the API will return an error.

Reasoning Models Only

Set the reasoning_effort parameter and only models that support reasoning are considered. Unlike the other options in this section, this parameter is not limited to auto routing ("model": "auto").

# Auto route only to reasoning models
curl -X POST "https://api.llmgateway.io/v1/chat/completions" \
  -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello!"}],
    "reasoning_effort": "medium"
  }'

Exclude Reasoning Models

When using auto routing, you can exclude reasoning models from selection by setting the no_reasoning parameter to true. This is useful when you want faster responses or need to avoid the additional cost and latency of reasoning models:

# Auto route excluding reasoning models
curl -X POST "https://api.llmgateway.io/v1/chat/completions" \
  -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello!"}],
    "no_reasoning": true
  }'

The no_reasoning parameter only works with auto routing ("model": "auto"). If no non-reasoning models are available that meet your request requirements, the API will return an error.
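When request bodies are built in application code, a small helper can keep these auto-routing parameters consistent. The helper is hypothetical, and the client-side ValueError is an added guard based on the rule that free_models_only and no_reasoning only work with "model": "auto"; it does not describe how the API itself responds.

```python
import json

def build_payload(messages, model="auto", free_models_only=False,
                  no_reasoning=False, reasoning_effort=None):
    """Build a chat completions request body as a JSON string."""
    if (free_models_only or no_reasoning) and model != "auto":
        raise ValueError("free_models_only and no_reasoning require model='auto'")
    payload = {"model": model, "messages": messages}
    if free_models_only:
        payload["free_models_only"] = True
    if no_reasoning:
        payload["no_reasoning"] = True
    if reasoning_effort is not None:
        payload["reasoning_effort"] = reasoning_effort  # not limited to auto routing
    return json.dumps(payload)

body = build_payload([{"role": "user", "content": "Hello!"}], no_reasoning=True)
```

The resulting string can be sent as the request body to the /v1/chat/completions endpoint with any HTTP client.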

Auto routing analyzes your payload and automatically chooses between cost-effective models for simple requests and more powerful models for complex or large-context requests.

Coming Soon: Advanced Optimization

We're continuously improving our auto routing capabilities. Soon you'll benefit from:

  • Tool call optimization: Automatically select models that excel at function calling and structured outputs
  • Content-aware routing: Analyze message content to determine the best model for specific types of requests (coding, creative writing, analysis, etc.)
  • Performance-based routing: Route based on historical performance data for similar requests
  • Multi-model orchestration: Intelligently combine multiple models for complex workflows

How It Works

  1. Request Analysis: The system analyzes your request including message content, context size, and any special parameters
  2. Model Selection: Based on the analysis, it selects the most appropriate model considering cost, performance, and capabilities
  3. Transparent Routing: Your request is seamlessly routed to the chosen model and provider
  4. Optimized Response: You receive the best possible response while maintaining cost efficiency

Auto routing decisions are transparent in your usage logs, so you can always see which model was selected for each request.

Best Practices

For Development

  • Use specific model names during development and testing
  • Leverage auto routing for production workloads to optimize costs

For Production

  • Use auto routing ("model": "auto") for the best balance of cost and performance
  • Monitor your usage patterns through the dashboard to understand routing decisions
  • Set up provider keys for multiple providers to maximize routing options

For Cost Optimization

  • Let auto routing handle model selection to automatically use the most cost-effective options
  • Use model IDs without provider prefixes so smart routing can pick the most cost-efficient provider that is also reliable
  • Monitor your usage analytics to track cost savings from intelligent routing
