How requests are optimized
Here's what actually happens to one of your requests as it passes through the Anyray gateway. The key thing to know: the gateway calls the optimizer over a pre-call/post-call hook, and the optimizer runs a pipeline of optimization strategies your org configured — separate from the gateway's own model routing.
The path of one request
Optimizer Protocol v1 (/v1/optimize, /v1/optimize-response, /v1/cache) is the fixed
contract; the hook is fail-open with an 800 ms timeout. Which strategies run, and in
what order, is internal to the optimizer (the black box) and can change without you
noticing. There is no separate meter() call; the gateway's own spend store records cost.
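To make "fail-open with an 800 ms timeout" concrete, here is a minimal sketch of how a caller could wrap an optimizer call so that a timeout or error leaves the request untouched. The function name call_with_fail_open and the optimize callable are hypothetical stand-ins for illustration; this is not the gateway's actual implementation.

```python
import concurrent.futures

HOOK_TIMEOUT_S = 0.8  # the 800 ms budget described above


def call_with_fail_open(optimize, request, timeout_s=HOOK_TIMEOUT_S):
    """Run a (hypothetical) optimize callable under a time budget.

    Returns (request_to_send, was_optimized). On timeout or any error the
    original request is returned unchanged, which is what "fail-open" means.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(optimize, request)
    try:
        return future.result(timeout=timeout_s), True
    except Exception:
        # Covers both the timeout and exceptions raised inside optimize().
        return request, False
    finally:
        # Do not block on a slow optimizer; let the worker finish on its own.
        pool.shutdown(wait=False)
```

The point of the design is that the optimizer can only ever improve a request: if it is slow or broken, the caller proceeds as if it had never been consulted.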
What each strategy might do
Depending on what your org enabled, a request can be touched by several strategies in sequence, each one passing the (possibly modified) request to the next. The optimizer's strategy registry currently implements ten kinds (full menu with statuses in Optimization strategies):
- param_tuning: clamps wasteful parameters (e.g. caps max_tokens, tames temperature) so you don't pay for tokens you won't use. Output tokens are the most expensive, so this is often the biggest single win.
- prompt_compression: shortens/restructures the prompt and system message before the call.
- context_compression: shrinks the bulky content the agent reads back (tool outputs, logs, RAG chunks) by minifying JSON, collapsing whitespace, and capping oversized blobs.
- code_skeleton / code_graph: outline source files an agent reads back to their signatures and structure, eliding function bodies (reversibly). code_graph is the multi-file variant: it keeps whole the bodies your question actually touches and elides the rest (default off).
- relevance_filter: keeps only the lines of a large tool output relevant to the live question (default off).
- window_budget: fits a long conversation under a token budget by cropping the oldest middle turns (reversibly), pinning the system prompt and recent turns.
- tool_pruning: drops tools unlikely to be invoked, trimming the request.
- vision_ocr: swaps a pasted text-only screenshot for its OCR'd text, locally and confidence-gated (default off).
- semantic_cache: if an equivalent request was answered before, returns that response and skips the provider entirely (default off).
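As a rough illustration of two of the strategies above, the sketch below clamps request parameters (param_tuning) and shrinks a tool output by minifying JSON, collapsing whitespace, and capping length (context_compression). The function names, the cap values, and the truncation marker are assumptions chosen for the example, not the optimizer's real defaults.

```python
import json
import re


def clamp_params(params, max_tokens_cap=1024, temperature_cap=1.0):
    """Return a copy of params with max_tokens and temperature capped.

    The cap values are hypothetical; real limits would come from org config.
    """
    tuned = dict(params)
    if "max_tokens" in tuned:
        tuned["max_tokens"] = min(tuned["max_tokens"], max_tokens_cap)
    if "temperature" in tuned:
        tuned["temperature"] = min(tuned["temperature"], temperature_cap)
    return tuned


def compress_tool_output(text, max_chars=2000):
    """Minify JSON when the text parses as JSON, otherwise collapse
    whitespace; then cap the result at max_chars (hypothetical limit)."""
    try:
        compact = json.dumps(json.loads(text), separators=(",", ":"),
                             ensure_ascii=False)
    except ValueError:
        # Not JSON: squeeze runs of spaces/tabs and drop blank lines.
        compact = re.sub(r"[ \t]+", " ", text)
        compact = re.sub(r"\n\s*\n+", "\n", compact).strip()
    if len(compact) > max_chars:
        compact = compact[:max_chars] + "...[truncated]"
    return compact
```

For instance, compress_tool_output('{ "a": 1,\n  "b": [1, 2] }') yields '{"a":1,"b":[1,2]}', which carries the same data in fewer characters.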
Every lossy crop is reversible: elided content is stashed under a content-free handle
and retrievable via POST /v1/retrieve, so the model can pull back anything it needs.
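The reversible-crop idea can be sketched as a small in-memory store: the cropped tail is stashed under a random handle, and the handle alone is left in the text. Everything here (ElisionStore, crop_with_handle, the marker format, and reading "content-free" as "not derived from the content") is an assumption made for illustration; the real service exposes retrieval over POST /v1/retrieve rather than a local method.

```python
import uuid


class ElisionStore:
    """Hypothetical in-memory stand-in for the stash behind /v1/retrieve."""

    def __init__(self):
        self._stash = {}

    def elide(self, content):
        # A random handle reveals nothing about the content it points to.
        handle = uuid.uuid4().hex
        self._stash[handle] = content
        return handle

    def retrieve(self, handle):
        # Returns the stashed content, or None for an unknown handle.
        return self._stash.get(handle)


def crop_with_handle(text, store, keep_chars=200):
    """Keep the first keep_chars characters; stash the rest reversibly."""
    if len(text) <= keep_chars:
        return text
    elided = text[keep_chars:]
    handle = store.elide(elided)
    return f"{text[:keep_chars]}[elided {len(elided)} chars; handle={handle}]"
```

Because the marker carries the handle, a model that needs the missing content can ask for it back instead of working from a silently truncated input.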
Strategy array order is execution order. A cacheHit short-circuits the rest of the
pipeline (when the gateway can serve it); otherwise the request proceeds with whatever
transforms accumulated.
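The ordering and short-circuit rules above can be sketched as a simple loop. The strategy signature used here (a function returning the possibly modified request plus an optional cached response) and the result dictionary are assumptions for the example; only the cacheHit name comes from the text.

```python
def run_pipeline(request, strategies):
    """Apply strategies in array order; stop at the first cache hit.

    strategies: list of (name, fn) pairs, where fn(request) returns
    (new_request, cached_response_or_None). This shape is hypothetical.
    """
    applied = []
    for name, fn in strategies:
        request, cached = fn(request)
        applied.append(name)
        if cached is not None:
            # A cache hit short-circuits everything after this strategy.
            return {"request": request, "cacheHit": True,
                    "response": cached, "applied": applied}
    # No hit: the request carries whatever transforms accumulated.
    return {"request": request, "cacheHit": False,
            "response": None, "applied": applied}
```

Placing a cache strategy early in the array would skip the later transforms on a hit, while placing it last would run every transform first; the array order decides.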
Model routing is the gateway's job, not an optimizer strategy. The gateway picks the provider/model and handles fallbacks. Automated classify-then-route (cheap vs. frontier) is on the roadmap.
When you get the requested/frontier model (fail-safe)
Your request goes to the requested (or frontier) model, unchanged, whenever:
- the optimizer is unreachable, slow, or errors — the hook is fail-open with an 800 ms timeout
- (roadmap) automated routing classifies the request as reasoning-heavy or ambiguous, or confidence is below threshold
The hard guarantee: uncertainty resolves to the model you asked for. See The optimizer.
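The guarantee can be read as a decision rule in which every uncertain branch returns the requested model. The sketch below is purely illustrative: automated routing is roadmap, so the route dictionary, its "label", "confidence", and "model" keys, and the 0.8 threshold are all hypothetical.

```python
def resolve_model(requested_model, optimizer_ok, route=None, threshold=0.8):
    """Pick the model to call; any doubt resolves to requested_model.

    route is a hypothetical classifier result such as
    {"label": "simple", "confidence": 0.93, "model": "cheap-model"}.
    """
    if not optimizer_ok or route is None:
        # Optimizer unreachable, slow, or errored: fail-safe.
        return requested_model
    if route.get("label") != "simple":
        # Reasoning-heavy or ambiguous requests keep the requested model.
        return requested_model
    if route.get("confidence", 0.0) < threshold:
        # Low confidence counts as uncertainty.
        return requested_model
    return route.get("model", requested_model)
```

Only one path leads away from the requested model, and it needs a healthy optimizer, a "simple" classification, and confidence at or above the threshold all at once.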
When you get an instant answer (cache hit)
If the semantic_cache strategy recognizes an equivalent prior request, the gateway serves
a stored response and skips the provider entirely — faster and free (this strategy is
default off). (On a host adapter like LiteLLM, the cache hit is served by the host's
built-in cache rather than the optimizer; the effect is the same for you.)
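To give a feel for what "recognizes an equivalent prior request" can mean, here is a toy similarity-threshold cache. Semantic caches typically compare embedding vectors; this sketch substitutes a bag-of-words cosine similarity so it runs with the standard library alone. The class name, the 0.9 threshold, and the matching method are assumptions, not a description of the semantic_cache strategy's internals.

```python
import math
import re
from collections import Counter


def _vector(text):
    # Lowercased word counts stand in for a real embedding.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class ToySemanticCache:
    """Hypothetical cache: serve a stored response when a prior prompt is
    at least `threshold` similar to the new one."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self._entries = []  # list of (vector, response)

    def put(self, prompt, response):
        self._entries.append((_vector(prompt), response))

    def get(self, prompt):
        query = _vector(prompt)
        best_score, best_response = 0.0, None
        for vec, response in self._entries:
            score = _cosine(query, vec)
            if score > best_score:
                best_score, best_response = score, response
        # Below the threshold is a miss: the request goes to the provider.
        return best_response if best_score >= self.threshold else None
```

With this sketch, a prompt that differs only in case or punctuation from a stored one is a hit, while an unrelated prompt returns None and would proceed to the provider as usual.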
See it on real prompts
The use cases trace the everyday workloads through the
pipeline — pasted logs and JSON dumps, code reading, tool-schema bloat, a cache hit on a
repeated question, runaway max_tokens — each with its measured reduction.
What this means for you
- Correctness: protected by the fail-open hook and the requested-model fail-safe.
- Latency: cache hits are faster. The optimizer hook runs within a tight budget (800 ms timeout) and fails open, and the gateway's store records spend without blocking your response.
- Compatibility: messages/tools/params may be rewritten, but the response shape your SDK expects is preserved by the gateway.