Trade latency against cost on supported OpenAI and Google models with Flex and Priority processing tiers.

Service Tiers

Some OpenAI and Google models support selectable processing tiers that trade latency and availability against price. You pick one per request with the OpenAI-compatible service_tier parameter, and LLM Gateway forwards it only when the selected provider/model mapping supports that tier.

Tier	`service_tier`	Cost vs. standard	Latency / availability
Standard	`default` / `auto` / omit	baseline	Normal on-demand latency
Flex	`flex`	−50%	Best-effort; may be preempted under load
Priority	`priority`	varies by model	Prioritized above standard and flex traffic

Using the `service_tier` parameter

curl -X POST "https://api.llmgateway.io/v1/chat/completions" \
  -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google-vertex/gemini-2.5-pro",
    "service_tier": "priority",
    "messages": [
      { "role": "user", "content": "Summarize this incident report." }
    ]
  }'

Accepted values are flex, priority, and default/auto (standard). If you request flex or priority for a provider/model mapping that does not support that tier, the gateway returns a 400 unsupported_service_tier error and logs the request as a client error.
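The accepted values and the rejection behavior above can be mirrored client-side. The following is a minimal sketch, not gateway code: `build_chat_request` and `is_unsupported_tier_error` are hypothetical helper names, and because the exact shape of the 400 error body is not specified here, the check simply looks for the `unsupported_service_tier` code anywhere in the decoded body.

```python
import json

# Values the gateway accepts for service_tier, per the text above.
ACCEPTED_TIERS = {"flex", "priority", "default", "auto"}


def build_chat_request(model, messages, service_tier=None):
    """Build a /v1/chat/completions JSON body (hypothetical helper).

    Omitting service_tier requests the standard tier.
    """
    body = {"model": model, "messages": messages}
    if service_tier is not None:
        if service_tier not in ACCEPTED_TIERS:
            raise ValueError(f"unknown service_tier: {service_tier!r}")
        body["service_tier"] = service_tier
    return body


def is_unsupported_tier_error(status_code, error_body):
    """Detect the 400 unsupported_service_tier rejection.

    Assumption: the error code appears somewhere in the JSON error body;
    the exact field layout is not documented in this section.
    """
    return status_code == 400 and "unsupported_service_tier" in json.dumps(error_body)
```

A caller that gets this error can retry the same request without `service_tier`, which falls back to the standard tier.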

The parameter works the same on the OpenAI-compatible Responses API (/v1/responses): the tier is forwarded to the provider and the response's service_tier field echoes the tier that was actually served.
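Because the response echoes the served tier, a client can check whether it got what it asked for. This is a small sketch under two assumptions that are not stated above: `default`, `auto`, and a missing field are all treated as the standard tier, and the response has already been decoded into a dict. The function names are hypothetical.

```python
# Assumption: these values (and a missing field) all mean "standard".
STANDARD_ALIASES = {None, "default", "auto", "standard"}


def normalize_tier(tier):
    """Collapse the standard-tier aliases into one label."""
    return "standard" if tier in STANDARD_ALIASES else tier


def served_matches_requested(requested, response):
    """Compare the requested tier with the response's service_tier field."""
    served = response.get("service_tier")
    return normalize_tier(requested) == normalize_tier(served)
```

For example, a `flex` request whose response carries `"service_tier": "default"` was served as standard, so the comparison returns `False`.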

Supported providers

Service tiers are explicit per provider/model mapping. Check the model page for the exact tiers exposed by each provider card.

OpenAI (openai) — sent as the OpenAI service_tier request field for supported OpenAI models. Flex is billed at 0.5x standard token prices and Priority uses the model-specific multiplier shown on the model page.
Google Vertex AI (google-vertex) — sent as the X-Vertex-AI-LLM-Shared-Request-Type request header, together with X-Vertex-AI-LLM-Request-Type: shared so the request bypasses any Provisioned Throughput on the project and actually reaches the shared Flex/Priority tier. Flex and Priority are served only on the global endpoint, which is the gateway default. Google Flex PayGo applies a 0.5x multiplier; Google Priority PayGo applies a 1.8x multiplier.
Google AI Studio / Gemini API (google-ai-studio) — sent as a service_tier field in the request body for configured models that opt in.
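The per-provider forwarding described in the list above can be sketched as a single function. This is an illustration rather than the gateway's implementation: `apply_service_tier` is a hypothetical name, the value sent in `X-Vertex-AI-LLM-Shared-Request-Type` is assumed to be the tier name itself, and standard-tier requests are assumed to be forwarded with nothing added.

```python
def apply_service_tier(provider, tier, headers, body):
    """Return (headers, body) copies with the tier attached per provider."""
    headers, body = dict(headers), dict(body)

    # Assumption: standard-tier requests are forwarded unchanged.
    if tier in (None, "default", "auto"):
        return headers, body

    if provider == "google-vertex":
        # "shared" bypasses Provisioned Throughput so the request can
        # reach the shared Flex/Priority tier.
        headers["X-Vertex-AI-LLM-Request-Type"] = "shared"
        # Assumption: the header value is the tier name ("flex"/"priority").
        headers["X-Vertex-AI-LLM-Shared-Request-Type"] = tier
    elif provider in ("openai", "google-ai-studio"):
        # Both providers take the tier as a request-body field.
        body["service_tier"] = tier
    else:
        raise ValueError(f"provider {provider!r} has no service-tier mapping")
    return headers, body
```

Returning copies keeps the caller's original headers and body untouched, which matters if the same request is retried against a different provider.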

Tiers are supported on a subset of models, and the Flex and Priority subsets differ by provider. For example, Google Flex PayGo lists Gemini 3 image / Nano Banana models, but Google Priority PayGo does not; those configured image mappings are Flex-only.

Flex and Priority are only honored when the request reaches Google directly, so a google-vertex or google-ai-studio provider key with a custom base URL (a proxy) is excluded from service-tier routing, because a proxy may silently drop the tier and serve standard. With multiple providers or keys, the gateway routes around the ineligible key automatically. If a request pins a provider whose only key uses a custom base URL, the gateway returns a 400 instead of silently downgrading. Keys with no custom base URL (the managed default) are always eligible.
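The key-eligibility rule can be pictured as a filter over the configured keys. The sketch below is hypothetical throughout: `ProviderKey`, `TierRoutingError`, and `select_key` are illustrative names, not gateway types. The text only describes the 400 for a pinned provider; raising the same error when no key at all is usable is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

GOOGLE_PROVIDERS = {"google-vertex", "google-ai-studio"}


@dataclass
class ProviderKey:
    provider: str
    base_url: Optional[str] = None  # None means the managed default endpoint


class TierRoutingError(Exception):
    """Stands in for the gateway's 400 response."""
    status_code = 400


def is_tier_eligible(key, tier):
    """A Google key behind a custom base URL cannot serve Flex/Priority."""
    if tier not in ("flex", "priority"):
        return True
    return not (key.provider in GOOGLE_PROVIDERS and key.base_url is not None)


def select_key(keys, tier, pinned_provider=None):
    """Pick the first eligible key, skipping proxied Google keys."""
    candidates = [
        k for k in keys
        if pinned_provider is None or k.provider == pinned_provider
    ]
    usable = [k for k in candidates if is_tier_eligible(k, tier)]
    if not usable:
        # Fail loudly rather than silently downgrading to standard.
        raise TierRoutingError(f"no key can serve tier {tier!r}")
    return usable[0]
```

With one proxied `google-vertex` key and one managed `google-ai-studio` key, an unpinned `flex` request selects the managed key, while the same request pinned to `google-vertex` raises the error.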

Pricing uses multipliers

Service tiers do not define separate model prices in LLM Gateway. They multiply the provider mapping's standard token prices:

Standard / default / auto: 1x
Flex: 0.5x
Priority: model/provider-specific, shown on the model page

The multiplier scales per-token costs, including input, output, cached, and image tokens. Flat per-request and web-search fees are not tier-scaled.
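As a worked example of the multiplier rule, the sketch below scales only the per-token portion of a request's cost and adds flat fees afterwards. The prices, token counts, and the `request_cost` helper are made up for illustration; real prices and Priority multipliers come from the model page.

```python
def request_cost(tokens, price_per_million, multiplier, flat_fees=0.0):
    """Cost of one request under a tier multiplier (hypothetical helper).

    tokens and price_per_million are dicts keyed by token kind
    (e.g. "input", "output", "cached", "image").
    """
    token_cost = sum(
        count * price_per_million[kind] / 1_000_000
        for kind, count in tokens.items()
    )
    # Only per-token costs are tier-scaled; flat per-request and
    # web-search fees are added unscaled.
    return token_cost * multiplier + flat_fees


# Illustrative numbers only.
tokens = {"input": 1_000_000, "output": 500_000}
prices = {"input": 2.0, "output": 8.0}

standard = request_cost(tokens, prices, 1.0, flat_fees=0.25)
flex = request_cost(tokens, prices, 0.5, flat_fees=0.25)
```

Here the token cost is 6.00 at standard prices, so the standard total is 6.25 and the Flex total is 3.25: the 0.25 flat fee is the same in both.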

Billing follows the served tier

When a provider reports the tier that was actually served, LLM Gateway bills that returned tier instead of blindly billing the requested value:

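A priority request that runs as priority is billed at that mapping's Priority multiplier (for example, 1.8x for Google Priority PayGo).
A flex request that runs as flex is billed at 0.5x.
A request that is served as standard, whatever tier was requested, is billed at the standard 1x rate.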

The served tier is read back from the provider response — Vertex reports it in usageMetadata.trafficType (ON_DEMAND_PRIORITY / ON_DEMAND_FLEX / ON_DEMAND), Google AI Studio reports it in the x-gemini-service-tier response header, and OpenAI can return service_tier in response payloads or stream events.
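The read-back can be sketched as a normalization step followed by a multiplier lookup. This is a hedged illustration: `served_tier` and `billed_multiplier` are hypothetical names, the `x-gemini-service-tier` header is assumed to carry a plain tier name, and falling back to the requested tier when the provider reports nothing is an assumption rather than documented behavior.

```python
# Vertex trafficType values named in the text, mapped to tier labels.
VERTEX_TRAFFIC_TYPES = {
    "ON_DEMAND_PRIORITY": "priority",
    "ON_DEMAND_FLEX": "flex",
    "ON_DEMAND": "standard",
}


def _normalize(value):
    """Lowercase a reported tier and fold default/auto into standard."""
    if value is None:
        return None
    value = value.strip().lower()
    return "standard" if value in ("default", "auto", "standard") else value


def served_tier(provider, body, headers):
    """Extract the tier a provider reports as served, or None if absent."""
    if provider == "google-vertex":
        traffic = body.get("usageMetadata", {}).get("trafficType")
        return VERTEX_TRAFFIC_TYPES.get(traffic)
    if provider == "google-ai-studio":
        lowered = {k.lower(): v for k, v in headers.items()}
        # Assumption: the header value is a tier name such as "flex".
        return _normalize(lowered.get("x-gemini-service-tier"))
    if provider == "openai":
        return _normalize(body.get("service_tier"))
    return None


def billed_multiplier(requested, served, multipliers):
    """Bill the served tier; fall back to the requested tier (assumption)."""
    tier = served if served is not None else (_normalize(requested) or "standard")
    return multipliers[tier]
```

For instance, a `priority` request to Vertex whose response reports `ON_DEMAND` resolves to `standard` and is billed at the 1x multiplier.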

LLM Gateway rejects unsupported tier requests before provider routing. For example, gemini-3-pro-image-preview currently exposes Flex for Google AI Studio and Vertex, but not Priority.

You can see per-tier pricing for each model on its model page. Supported provider cards include a Service Tier selector in the card header and show the active multiplier next to each tier.
