Gateway Caching

Serve byte-identical requests entirely from LLM Gateway at $0 cost.

Gateway caching serves a previously-seen, byte-identical request entirely from LLM Gateway without forwarding it to the upstream provider. Repeated identical calls cost $0 — there is no inference and no provider charge. It is most useful for API workloads with deterministic inputs (classification, batch jobs, FAQ lookups, retries) rather than free-form chat.

To reduce the cost of long, partially shared prompts in chat apps or coding tools, use Provider Cache Control instead. It discounts the cached portion of your prompt on every call and does not require byte-identical requests. See the Caching Overview for a side-by-side comparison.

How It Works

When you make an API request:

  1. LLM Gateway generates a cache key based on the request parameters
  2. If a matching cached response exists, it's returned immediately
  3. If no matching entry exists (or it has expired), the request is forwarded to the provider
  4. The provider's response is cached for future identical requests until the TTL elapses

This means repeated identical requests are served instantly from cache without incurring additional provider costs.
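
The four steps above can be modeled with a small in-memory sketch. This is an illustration only, not LLM Gateway's actual code: the `ResponseCache` class, the `forward` callback that stands in for the provider call, and the use of `JSON.stringify` as the key are all assumptions, and the flow is kept synchronous for brevity.

```javascript
// Hypothetical model of the lookup flow; names and structure are assumptions.
class ResponseCache {
	constructor(ttlSeconds) {
		this.ttlMs = ttlSeconds * 1000;
		this.entries = new Map(); // key -> { response, expiresAt }
	}

	// `forward` stands in for the upstream provider call.
	handle(request, forward, now = Date.now()) {
		const key = JSON.stringify(request); // step 1: derive a cache key
		const entry = this.entries.get(key);
		if (entry && entry.expiresAt > now) {
			return { response: entry.response, cached: true }; // step 2: cache hit
		}
		const response = forward(request); // step 3: miss, call the provider
		this.entries.set(key, { response, expiresAt: now + this.ttlMs }); // step 4: store
		return { response, cached: false };
	}
}

const cache = new ResponseCache(60);
const fakeProvider = () => ({ text: "hello" });
const request = { model: "gpt-4o", messages: [{ role: "user", content: "Hi" }] };

console.log(cache.handle(request, fakeProvider).cached); // false (first call populates the cache)
console.log(cache.handle(request, fakeProvider).cached); // true (identical call within the TTL)
```

Passing `now` explicitly makes expiry easy to reason about: an entry stored at time 0 with a 60-second TTL is a miss again at 61 seconds.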

Cost Savings

Caching can dramatically reduce costs for applications with repetitive requests:

Scenario                      Without Caching   With Caching   Savings
1,000 identical requests      $10.00            $0.01          99.9%
50% duplicate rate            $10.00            $5.00          50%
Retry after transient error   $0.02             $0.01          50%

Cached responses are free from provider costs. You only pay for the initial request that populates the cache.
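
The savings figures follow from simple arithmetic: only requests that miss the cache are billed. A minimal sketch, assuming a flat example price of $0.01 per request and that every duplicate arrives while its cache entry is still valid (the `estimateSavings` helper is hypothetical):

```javascript
// Illustrative arithmetic only; the per-request price is an assumed example value.
function estimateSavings(totalRequests, uniqueRequests, costPerRequest) {
	const withoutCaching = totalRequests * costPerRequest;
	const withCaching = uniqueRequests * costPerRequest; // only cache misses are billed
	const savingsPercent =
		withoutCaching === 0 ? 0 : (1 - withCaching / withoutCaching) * 100;
	return { withoutCaching, withCaching, savingsPercent };
}

const identical = estimateSavings(1000, 1, 0.01);
console.log(identical.savingsPercent.toFixed(1)); // "99.9"
```

Real savings depend on your duplicate rate and TTL, since a duplicate that arrives after its entry expires is billed again.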

Requirements

Caching is free and independent of Data Retention. Cached responses live in a TTL-bound cache (60 seconds by default, configurable per project) and are not stored as long-term request data, so you do not need to enable data retention to use caching.

To use caching:

  1. Enable Caching in your project settings under Preferences
  2. Configure the cache duration (TTL) as needed
  3. Make requests as normal—caching is automatic

Cache Key Generation

The cache key is generated from these request parameters:

  • Model identifier
  • Messages array (roles and content)
  • Temperature
  • Max tokens
  • Top P
  • Tools/functions
  • Tool choice
  • Response format
  • System prompt
  • Other model-specific parameters

Requests with different parameter values, even slight variations, will not share cache entries.
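
To see why small variations miss the cache, consider a sketch of key derivation. The gateway's real algorithm is not documented here; the `cacheKey` helper below is hypothetical, uses OpenAI-style field names as an assumption, and covers only the named parameters from the list above (not other model-specific ones).

```javascript
// Hypothetical key derivation; field names follow OpenAI-style requests (an assumption).
const KEY_FIELDS = [
	"model",
	"messages",
	"temperature",
	"max_tokens",
	"top_p",
	"tools",
	"tool_choice",
	"response_format",
];

function cacheKey(request) {
	// Read fields in a fixed order so the key does not depend on property order.
	const parts = KEY_FIELDS.map((field) => [field, request[field] ?? null]);
	return JSON.stringify(parts);
}

const base = {
	model: "gpt-4o",
	temperature: 0,
	messages: [{ role: "user", content: "Classify: apple" }],
};
const trailingSpace = {
	...base,
	messages: [{ role: "user", content: "Classify: apple " }],
};

console.log(cacheKey(base) === cacheKey({ ...base })); // true
console.log(cacheKey(base) === cacheKey({ ...base, temperature: 0.1 })); // false
console.log(cacheKey(base) === cacheKey(trailingSpace)); // false
```

A single trailing space or a temperature change of 0.1 yields a different key, so the two requests would be cached separately.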

Cache Behavior

Cache Hits

When a cache hit occurs:

  • Response is returned immediately (sub-millisecond latency)
  • No provider API call is made
  • No inference costs are incurred

Cache Misses

When a cache miss occurs:

  • Request is forwarded to the LLM provider
  • Response is stored in cache
  • Normal inference costs apply
  • Future identical requests will hit the cache

Streaming and Caching

Caching works with both streaming and non-streaming requests:

  • Non-streaming: Full response is cached and returned
  • Streaming: The complete response is reconstructed from cache and streamed back

Cache TTL (Time-to-Live)

Cache duration is configurable per project in your project settings. You can set the cache TTL from 10 seconds up to 1 year (31,536,000 seconds).

The default cache duration is 60 seconds. Adjust this based on your use case—longer durations work well for static content, while shorter durations are better for frequently changing data.
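
If you manage TTL values in your own tooling before entering them in project settings, a small helper can keep them inside the documented range. This is a client-side sketch only: the `resolveTtl` name is hypothetical, and clamping out-of-range values (rather than rejecting them) is an assumption about what you want, not a description of how the gateway behaves.

```javascript
// Documented bounds: 10 seconds to 1 year, default 60 seconds.
const MIN_TTL_SECONDS = 10;
const MAX_TTL_SECONDS = 31536000;
const DEFAULT_TTL_SECONDS = 60;

// Hypothetical helper: returns a TTL within the documented range.
function resolveTtl(requestedSeconds) {
	if (requestedSeconds === undefined) return DEFAULT_TTL_SECONDS;
	if (!Number.isInteger(requestedSeconds)) {
		throw new TypeError("TTL must be a whole number of seconds");
	}
	// Clamping is a design choice of this sketch, not confirmed gateway behavior.
	return Math.min(MAX_TTL_SECONDS, Math.max(MIN_TTL_SECONDS, requestedSeconds));
}

console.log(resolveTtl()); // 60
console.log(resolveTtl(5)); // 10
console.log(resolveTtl(3600)); // 3600
```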

Identifying Cached Responses

Cached responses show zero or minimal token usage since no inference occurred:

{
	"usage": {
		"prompt_tokens": 0,
		"completion_tokens": 0,
		"total_tokens": 0,
		"cost": 0,
		"cost_details": {
			"total_cost": 0,
			"input_cost": 0,
			"output_cost": 0
		}
	}
}
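
A client can use this usage block as a heuristic for spotting cache hits, for example when logging hit rates. The `looksCached` helper below is hypothetical and checks only for the all-zero shape shown above; because cached responses may also report minimal rather than zero usage, treat it as a rough signal, not a guarantee.

```javascript
// Heuristic sketch: treats zero tokens and zero cost as a likely cache hit.
function looksCached(response) {
	const usage = response?.usage;
	if (!usage) return false; // no usage block, so nothing to infer
	return usage.total_tokens === 0 && usage.cost === 0;
}

console.log(looksCached({ usage: { total_tokens: 0, cost: 0 } })); // true
console.log(looksCached({ usage: { total_tokens: 42, cost: 0.0003 } })); // false
```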

Use Cases

Development and Testing

During development, you often send the same prompts repeatedly:

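// Assumes `client` is an OpenAI-compatible SDK client pointed at LLM Gateway.
// Repeating this exact request within the cache TTL incurs provider cost only once.
const response = await client.chat.completions.create({
	model: "gpt-4o",
	messages: [{ role: "user", content: "Explain quantum computing" }],
});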

Chatbots with Common Questions

FAQ-style interactions often have repeated questions:

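// Each question reaches the provider once; exact repeats within the TTL are served from cache
const faqs = [
	"What are your business hours?",
	"How do I reset my password?",
	"What is your return policy?",
];
for (const question of faqs) {
	await client.chat.completions.create({
		model: "gpt-4o",
		messages: [{ role: "user", content: question }],
	});
}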

Batch Processing

Processing large datasets with potentially duplicate items:

// Duplicate items in batch are served from cache
for (const item of items) {
	const response = await client.chat.completions.create({
		model: "gpt-4o",
		messages: [{ role: "user", content: `Classify: ${item}` }],
	});
}

Best Practices

Maximize Cache Hits

  • Use consistent prompt formatting
  • Normalize input data before sending
  • Keep sampling parameters fixed across calls (for example, temperature: 0), since any change produces a different cache key
  • Avoid including timestamps or random values in prompts
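
As one way to apply these practices, normalize inputs before building the request so that trivially different strings produce byte-identical requests. The `normalizeInput` and `buildRequest` helpers are hypothetical, and the specific rules (Unicode NFC, trimming, collapsing whitespace) are assumptions; choose rules that do not change the meaning of your data.

```javascript
// Hypothetical normalization: NFC, trim, and collapse runs of whitespace.
function normalizeInput(text) {
	return text.normalize("NFC").trim().replace(/\s+/g, " ");
}

// Builds the request from a fixed template with fixed sampling parameters.
function buildRequest(item) {
	return {
		model: "gpt-4o",
		temperature: 0,
		messages: [{ role: "user", content: `Classify: ${normalizeInput(item)}` }],
	};
}

const a = JSON.stringify(buildRequest("  Hello   world "));
const b = JSON.stringify(buildRequest("Hello world"));
console.log(a === b); // true
```

Without normalization, the two inputs above would produce different requests and therefore separate cache entries.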

Appropriate Use Cases

Caching is most effective for:

  • Static knowledge queries
  • Classification tasks
  • FAQ responses
  • Development/testing
  • Retry scenarios

When to Avoid Caching

Caching may not be suitable for:

  • Real-time data requirements
  • Highly personalized responses
  • Time-sensitive information
  • Creative tasks requiring variety
  • Chat or coding tools where prompts overlap but are not byte-identical — use Provider Cache Control instead

Pricing

Caching is completely free. Cached responses are held in a short-lived in-memory cache (bounded by your configured TTL) and do not incur storage charges. Storage costs only apply if you separately enable Data Retention for full request/response payloads.

Caching reduces both inference cost and latency at no additional charge.
