How requests are optimized

Here's what actually happens to one of your requests as it passes through the Anyray gateway. The key thing to know: the gateway calls the optimizer over a pre-call/post-call hook, and the optimizer runs a pipeline of optimization strategies your org configured — separate from the gateway's own model routing.

The path of one request

Optimizer Protocol v1 (/v1/optimize, /v1/optimize-response, /v1/cache) is the fixed contract; the hook is fail-open with an 800 ms timeout. Which strategies run, and in what order, is internal to the optimizer and can change without any change to that contract. There is no separate meter() call; the gateway's own spend store records cost.
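As a rough illustration of what "fail-open with an 800 ms timeout" means, here is a minimal sketch. It assumes the optimizer is reachable through some callable; the names call_with_fail_open and slow_optimizer are hypothetical and not part of Optimizer Protocol v1. Only the 800 ms value comes from the text above.

```python
import concurrent.futures
import time

HOOK_TIMEOUT_S = 0.8  # the 800 ms budget stated in the docs


def call_with_fail_open(optimize, request, timeout_s=HOOK_TIMEOUT_S):
    """Return the optimizer's result, or the untouched request on timeout/error.

    `optimize` is a hypothetical stand-in for the call to /v1/optimize.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(optimize, request)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Timeout, connection failure, or optimizer error: fail open and
        # forward the original request unchanged.
        return request
    finally:
        # Do not block the caller waiting for a slow optimizer to finish.
        pool.shutdown(wait=False)


def slow_optimizer(request):
    """Simulates an optimizer that takes longer than a short test budget."""
    time.sleep(0.2)
    return {**request, "max_tokens": 1}


if __name__ == "__main__":
    original = {"model": "example-model", "max_tokens": 4096}
    # With a 50 ms budget the slow optimizer is abandoned and the
    # original request object is returned as-is.
    print(call_with_fail_open(slow_optimizer, original, timeout_s=0.05) is original)
```

The design point is that every failure mode collapses into the same outcome: the request proceeds exactly as you sent it.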

What each strategy might do

Depending on what your org enabled, a request can be touched by several strategies in sequence, each one passing the (possibly modified) request to the next. The optimizer's strategy registry currently implements ten kinds (full menu with statuses in Optimization strategies):

  • param_tuning — clamps wasteful parameters (e.g. caps max_tokens, tames temperature) so you don't pay for tokens you won't use. Output tokens are the most expensive, so this is often the biggest single win.
  • prompt_compression — shortens/restructures the prompt and system message before the call.
  • context_compression — shrinks the bulky content the agent reads back (tool outputs, logs, RAG chunks) by minifying JSON, collapsing whitespace, and capping oversized blobs.
  • code_skeleton / code_graph — outline source files an agent reads back to their signatures and structure, eliding function bodies (reversibly). code_graph is the multi-file variant: it keeps whole the bodies your question actually touches and elides the rest (default off).
  • relevance_filter — keeps only the lines of a large tool output relevant to the live question (default off).
  • window_budget — fits a long conversation under a token budget by cropping the oldest middle turns (reversibly), pinning the system prompt and recent turns.
  • tool_pruning — drops tools unlikely to be invoked, trimming the request.
  • vision_ocr — swaps a pasted text-only screenshot for its OCR'd text, locally and confidence-gated (default off).
  • semantic_cache — if an equivalent request was answered before, returns that response and skips the provider entirely (default off).
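To make two of these strategies concrete, here is a minimal sketch of what param_tuning and context_compression could look like. The function names, the caps (1024 tokens, temperature 1.0, 2000 characters), and the truncation marker are assumptions chosen for illustration, not the optimizer's actual thresholds or implementation.

```python
import json
import re


def tune_params(request, max_tokens_cap=1024, temperature_cap=1.0):
    """Clamp max_tokens and temperature to (hypothetical) caps."""
    tuned = dict(request)  # never mutate the caller's request
    max_tokens = tuned.get("max_tokens")
    if max_tokens is not None and max_tokens > max_tokens_cap:
        tuned["max_tokens"] = max_tokens_cap
    temperature = tuned.get("temperature")
    if temperature is not None and temperature > temperature_cap:
        tuned["temperature"] = temperature_cap
    return tuned


def compress_context(text, max_chars=2000):
    """Minify JSON, collapse whitespace in other text, and cap oversized blobs."""
    try:
        # Valid JSON: re-serialize without insignificant whitespace.
        compact = json.dumps(
            json.loads(text), separators=(",", ":"), ensure_ascii=False
        )
    except ValueError:
        # Not JSON: collapse every run of whitespace to a single space.
        compact = re.sub(r"\s+", " ", text).strip()
    if len(compact) > max_chars:
        compact = compact[:max_chars] + "...[truncated]"
    return compact


if __name__ == "__main__":
    print(tune_params({"max_tokens": 8000, "temperature": 1.8}))
    print(compress_context('{ "status": "ok",\n  "items": [1, 2, 3] }'))
```

Note that this sketch truncates irreversibly for brevity; the documented strategies stash whatever they crop so it can be retrieved later.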

Every lossy crop is reversible: elided content is stashed under a content-free handle and retrievable via POST /v1/retrieve, so the model can pull back anything it needs.
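The stash-and-retrieve idea can be sketched with an in-memory store. This is an assumption-laden illustration: the ElisionStore class, its method names, and the placeholder text are hypothetical, and the real service exposes retrieval over POST /v1/retrieve rather than a local method call. The handle is a random token, so it reveals nothing about the content it points to.

```python
import secrets


class ElisionStore:
    """Hypothetical in-memory stand-in for the optimizer's elision stash."""

    def __init__(self):
        self._stash = {}

    def elide(self, content, keep_chars=200):
        """Return (visible_text, handle); handle is None if nothing was cropped."""
        if len(content) <= keep_chars:
            return content, None
        handle = secrets.token_hex(8)  # random, so it carries no content
        self._stash[handle] = content
        dropped = len(content) - keep_chars
        visible = f"{content[:keep_chars]}\n[elided {dropped} chars; handle={handle}]"
        return visible, handle

    def retrieve(self, handle):
        """Return the full original content for a handle (raises KeyError if unknown)."""
        return self._stash[handle]


if __name__ == "__main__":
    store = ElisionStore()
    log = "ERROR connection reset\n" * 100
    visible, handle = store.elide(log, keep_chars=46)
    print(visible)
    # The full log is still recoverable through the handle.
    print(store.retrieve(handle) == log)
```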

Strategy array order is execution order. A cacheHit short-circuits the rest of the pipeline (when the gateway can serve it); otherwise the request proceeds with whatever transforms accumulated.
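The ordering and short-circuit rules can be sketched as a simple loop. The cacheHit flag name comes from the text above; everything else (run_pipeline, the result dictionaries, the toy strategies) is a hypothetical shape chosen for illustration, not the optimizer's real interface.

```python
def run_pipeline(request, strategies):
    """Run (name, strategy) pairs in array order, stopping at the first cache hit.

    Each strategy is assumed to return either {"request": new_request}
    or {"cacheHit": True, "response": stored_response}.
    """
    applied = []
    for name, strategy in strategies:
        result = strategy(request)
        applied.append(name)
        if result.get("cacheHit"):
            # Short-circuit: later strategies never see the request.
            return {"cacheHit": True, "response": result["response"], "applied": applied}
        # Pass the (possibly modified) request on to the next strategy.
        request = result["request"]
    return {"cacheHit": False, "request": request, "applied": applied}


def strip_prompt(request):
    """Toy stand-in for prompt_compression: trims surrounding whitespace."""
    return {"request": {**request, "prompt": request["prompt"].strip()}}


def always_hit(request):
    """Toy stand-in for semantic_cache that always finds a stored answer."""
    return {"cacheHit": True, "response": "stored answer"}


if __name__ == "__main__":
    outcome = run_pipeline(
        {"prompt": "  What is our refund policy?  "},
        [("prompt_compression", strip_prompt), ("semantic_cache", always_hit)],
    )
    print(outcome["applied"], outcome["cacheHit"])
```

Because array order is execution order, placing a cache strategy earlier or later changes which transforms have already been applied when the lookup happens.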

Model routing is the gateway's job, not an optimizer strategy. The gateway picks the provider/model and handles fallbacks. Automated classify-then-route (cheap vs. frontier) is on the roadmap.

When you get the requested/frontier model (fail-safe)

Your request goes to the requested (or frontier) model, unchanged, whenever:

  • the optimizer is unreachable, slow, or errors — the hook is fail-open with an 800 ms timeout
  • (roadmap) automated routing classifies the request as reasoning-heavy or ambiguous, or confidence is below threshold

The hard guarantee: uncertainty resolves to the model you asked for. See The optimizer.

When you get an instant answer (cache hit)

If the semantic_cache strategy recognizes an equivalent prior request, the gateway serves a stored response and skips the provider entirely — faster and free (this strategy is default off). (On a host adapter like LiteLLM, the cache hit is served by the host's built-in cache rather than the optimizer; the effect is the same for you.)

See it on real prompts

The use cases trace the everyday workloads through the pipeline — pasted logs and JSON dumps, code reading, tool-schema bloat, a cache hit on a repeated question, runaway max_tokens — each with its measured reduction.

What this means for you

  • Correctness: protected by the fail-open hook + the requested-model fail-safe.
  • Latency: cache hits are faster; the optimizer hook runs within a tight budget (800 ms timeout) and fails open, and spend is recorded by the gateway's store without blocking your response.
  • Compatibility: messages/tools/params may be rewritten, but the response shape your SDK expects is preserved by the gateway.