How requests are optimized

Here's what actually happens to one of your requests as it passes through the Anyray gateway. The key thing to know: the gateway calls the optimizer over a pre-call/post-call hook, and the optimizer runs a pipeline of optimization strategies your org configured — separate from the gateway's own model routing.

The path of one request

Optimizer Protocol v1 (/v1/optimize, /v1/optimize-response, /v1/cache) is the fixed contract; the hook is fail-open with an 800 ms timeout. Which strategies ran, and in what order, is inside the black box and can change without you noticing. There is no separate meter() call — the gateway's own spend store records cost.

What each strategy might do

Depending on what your org enabled, a request can be touched by several strategies in sequence — each one passing the (possibly modified) request to the next. The implemented optimizer strategy registry is ten kinds (full menu with statuses in Optimization strategies):

param_tuning — clamps wasteful parameters (e.g. caps max_tokens, tames temperature) so you don't pay for tokens you won't use. Output tokens are the most expensive, so this is often the biggest single win.
prompt_compression — shortens/restructures the prompt and system message before the call.
context_compression — shrinks the bulky content the agent reads back (tool outputs, logs, RAG chunks) by minifying JSON, collapsing whitespace, and capping oversized blobs.
code_skeleton / code_graph — outline source files an agent reads back to their signatures and structure, eliding function bodies (reversibly). code_graph is the multi-file variant: it keeps whole the bodies your question actually touches and elides the rest (default off).
relevance_filter — keeps only the lines of a large tool output relevant to the live question (default off).
window_budget — fits a long conversation under a token budget by cropping the oldest middle turns (reversibly), pinning the system prompt and recent turns.
tool_pruning — drops tools unlikely to be invoked, trimming the request.
vision_ocr — swaps a pasted text-only screenshot for its OCR'd text, locally and confidence-gated (default off).
semantic_cache — if an equivalent request was answered before, returns that response and skips the provider entirely (default off).

Every lossy crop is reversible: elided content is stashed under a content-free handle and retrievable via POST /v1/retrieve, so the model can pull back anything it needs.

Strategy array order is execution order. A cacheHit short-circuits the rest of the pipeline (when the gateway can serve it); otherwise the request proceeds with whatever transforms accumulated.

Model routing is the gateway's job, not an optimizer strategy. The gateway picks the provider/model and handles fallbacks. Automated classify-then-route (cheap vs. frontier) is roadmap.

When you get the requested/frontier model (fail-safe)

Your request goes to the requested (or frontier) model, unchanged, whenever:

the optimizer is unreachable, slow, or errors — the hook is fail-open with an 800 ms timeout
(roadmap) automated routing classifies the request as reasoning-heavy or ambiguous, or confidence is below threshold

The hard guarantee: uncertainty resolves to the model you asked for. See The optimizer.

When you get an instant answer (cache hit)

If the semantic_cache strategy recognizes an equivalent prior request, the gateway serves a stored response and skips the provider entirely — faster and free (this strategy is default off). (On a host adapter like LiteLLM, the cache hit is served by the host's built-in cache rather than the optimizer; the effect is the same for you.)

See it on real prompts

The use cases trace the everyday workloads through the pipeline — pasted logs and JSON dumps, code reading, tool-schema bloat, a cache hit on a repeated question, runaway max_tokens — each with its measured reduction.

What this means for you

Correctness: protected by the fail-open hook + the requested-model fail-safe.
Latency: cache hits are faster; the optimizer hook runs within a tight budget (800 ms timeout) and fails open, and spend is recorded by the gateway's store without blocking your response.
Compatibility: messages/tools/params may be rewritten, but the response shape your SDK expects is preserved by the gateway.

The path of one request​

What each strategy might do​

When you get the requested/frontier model (fail-safe)​

When you get an instant answer (cache hit)​

See it on real prompts​

What this means for you​