Optimization strategies

The optimizer runs a configurable, ordered pipeline of optimization strategies, and each org enables and orders the strategies that fit its use-cases. The registry is exactly ten kinds: param_tuning, prompt_compression, context_compression, code_skeleton, code_graph, relevance_filter, window_budget, tool_pruning, vision_ocr, and semantic_cache. Model routing is not an optimizer strategy — it is a capability of the gateway.

:::info Status The strategy pipeline (config-driven ordering, per-endpoint overrides, fail-open, content-free decision attribution) is implemented in the optimizer (optimizer/). param_tuning, prompt_compression, context_compression, code_skeleton, window_budget, and tool_pruning are default-on; code_graph, relevance_filter, vision_ocr, and semantic_cache are implemented but default-off — enable them per workload. See the strategy menu. :::

What a strategy is

A strategy is a small, self-contained transform that runs inside the optimizer. Given a request, it can:

transform the request — rewrite it (e.g. compress the prompt, prune unused tools, clamp max_tokens) and pass it to the next strategy
transform the response — strategies with an output stage act in POST /v1/optimize-response
short-circuit (cacheHit) — semantic_cache may serve a stored response when the caller can short-circuit, skipping the provider

Each strategy emits a content-free decision (kind, a summary, and token/cost estimates) so its contribution is measured independently — see the Optimizer Protocol.

The pipeline

Enabled strategies run in the order declared in optimizer.config.json, threading the (possibly transformed) request through each one. When the caller can short-circuit, a semantic_cache hit serves the stored response and skips the rest.

Two properties make the pipeline safe to put on the request path:

Off the forwarding path + bounded. The optimizer is a hook the gateway calls with a hard 800ms timeout; it never carries provider traffic.
Fail-open. If the optimizer errors or times out, the gateway forwards the original request unchanged — and within the pipeline a strategy that throws is skipped rather than fatal. A broken strategy can never break inference. See the Optimizer Protocol and Org Admin → Operate.

Each strategy emits a content-free decision so the gateway can attribute savings per strategy in its spend store — metering is the gateway's job, not an optimizer endpoint.

The optimizer's registry is exactly the ten strategies below; the table is ordered by measured real-world impact — see What saves the most for the data behind this ranking. Which lever wins depends on your workload, so you compose the ones that fit. Model routing is listed separately because it is the gateway's job, not an optimizer strategy.

Strategy (`kind`)	What it does	Typical impact	Status
`param_tuning`	Clamps wasteful parameters — caps an over-large `max_tokens` to a ceiling (default `maxTokensCap`); temperature/penalty normalization is planned. Output tokens cost 3–10× input, so this is the highest-ROI, lowest-risk lever.	~20–40%	✅ implemented (default-on)
`prompt_compression`	Shortens/restructures long prompts and system messages (above `minChars`) to cut input tokens without changing intent.	input-dependent	✅ implemented (default-on)
`context_compression`	Shrinks the bulky content an agent reads back — tool outputs, logs, RAG chunks — by minifying JSON, collapsing whitespace, and capping oversized blobs (`maxChars`). Headroom-style; targets tool-role messages, so it composes with `prompt_compression` rather than overlapping it.	agentic / tool-heavy	✅ implemented (default-on)
`code_skeleton`	Outlines a source file an agent reads back to its navigable skeleton — imports, each declaration's signature line and closing brace — collapsing the statement bodies between them into a content-free marker, stashed and reversible via `/v1/retrieve`. Serena-inspired; a deterministic, comment/string-aware brace scan that never interprets code and passes through anything it can't balance. Targets tool/function messages.	agentic / code-reading	✅ implemented (default-on)
`code_graph`	The graph-aware sibling of `code_skeleton` for multi-file reads. Builds a lexical cross-file reference graph over the symbols an agent read back, seeds a working set from the live turn's intent, expands one hop (`neighborHops`) and adds structurally central "god" symbols (`godNodeFanIn`) — then keeps those bodies at full fidelity and elides the rest, stashed and reversible via `/v1/retrieve`. Deterministic and lexical (no LLM, no embeddings, no repo scan); degrades to plain `code_skeleton` outlining on single-file reads or when there's no usable intent. Enable either `code_graph` or `code_skeleton` for multi-file workloads — `code_skeleton` would re-elide the bodies `code_graph` chose to keep.	agentic / multi-file code traces	✅ implemented, default-off — enable per workload.
`relevance_filter`	Keeps only the parts of a large tool output relevant to what the agent is doing right now. Builds a query from the latest user/assistant turn and ranks each line of a bulky result with BM25, keeping the top-scoring lines (plus a few leading lines for orientation) up to a `keepChars` budget and eliding the rest behind a content-free marker. Lexical only (no LLM/embeddings); skips structured JSON (that's `context_compression`); elided spans are stashed and reversible via `/v1/retrieve`. Context-Mode-style.	agentic / log- & doc-heavy	✅ implemented, default-off — enable per workload.
`window_budget`	Fits the whole conversation under a token budget (`maxTokens`) by cropping the oldest middle messages (a rolling window). It pins the leading task-setup turns (`keepLeading`, `pinRoles`) and the most recent turns (`keepRecent`), and evicts/crops the rest. Cropped originals are stashed and reversible via `/v1/retrieve`. Where `prompt_compression` shrinks individual messages losslessly and `context_compression` shrinks bulky tool outputs, `window_budget` is about the conversation as a whole.	long agent sessions	✅ implemented (default-on)
`tool_pruning`	Drops tools unlikely to be invoked for a request by matching tool names textually against the conversation (`keepUnnamed` keeps tools the heuristic can't resolve), trimming the tool-schema overhead carried on every call. Skeleton: textual matching today, no embeddings/relevance model yet.	tool-heavy agents	✅ implemented (default-on)
`vision_ocr`	When a pasted image is really text (a terminal screenshot, a stack trace, log output), extracts the text with local Tesseract OCR and swaps the image for it — text tokens instead of far costlier provider vision tokens. The gate is the design: it only swaps on high-confidence text images (`minConfidence`, `minWords`, `minTextChars`); diagrams, charts, UI and photos pass through untouched. Bounded by its own latency budget (`maxLatencyMs`) and the gateway's longer vision timeout; the original image is stashed and reversible via `/v1/retrieve`. Fully local — no cloud OCR, no second model call.	screenshot-heavy workflows	✅ implemented, default-off — enable per workload.
`semantic_cache`	Serves a stored response for a duplicate prior request and skips the provider entirely; writes live responses back via `/v1/cache`. Today it matches on an exact canonicalized request key — the embedding/similarity index (`similarityThreshold` `0.97`) is reserved but not yet active.	~40–80% in high-repetition	✅ implemented, default-off — enable per workload. Only short-circuit-capable callers serve hits.

Model routing — a gateway capability, not an optimizer strategy:

Capability	What it does	Status
Model / provider routing	The gateway picks providers/models, runs fallbacks, and load-balances. Multi-provider routing is implemented in the gateway.	✅ gateway feature
Automated classify → cheap/frontier downgrade	Classify each request and route the cheap-enough ones to a cheaper model, failing safe to frontier when unsure.	⏳ roadmap

The Typical impact column reflects industry benchmarks (workload-dependent, directional) — not Anyray guarantees. What Anyray guarantees is that these levers are applied without degrading quality, self-hosted, fail-open. See What saves the most for the evidence and sources.

:::tip Why this order Output tokens are the most expensive per token, and most apps over-allocate them — so param_tuning is the cheapest win. For agentic/RAG traffic, repeated-context semantic_cache becomes the dominant lever (up to ~90% in published benchmarks). The full evidence and per-workload ranking is in What saves the most. :::

:::note An open-ended library — by design Anyray's aim is to be the most complete optimization layer there is: it adopts the most effective techniques from across the field, composes many at once, and improves them on your own traffic. The menu above is not a fixed feature set — it's a snapshot of an open-ended library.

This is possible because the optimizer is a black box behind a fixed contract: new strategies are added inside it without changing the gateway, the adapters, or your applications. The speed at which Anyray can add and improve strategies is part of the product. :::

Conversation-aware context management

Most strategies act on a single request; a growing, higher-order family acts across the whole conversation or agent session — goal awareness, topic-change detection, and short/long-term memory. It's where the largest agentic savings hide and the hardest to do safely, so it gets its own page: Conversation & memory management. (Planned / fast-follow.)

Configuring strategies for a use-case

Different workloads want different pipelines. You choose the set and order in optimizer.config.json:

The evidence on dominant levers drives these defaults:

Use-case	A sensible pipeline
Coding agents (correctness-sensitive)	`prompt_compression` → `context_compression` → `code_skeleton` (or `code_graph` for multi-file traces) → `window_budget` (cap long sessions) → `tool_pruning` → `param_tuning`; gateway routing kept conservative.
RAG / long-context Q&A	`prompt_compression` (compress retrieved docs) → `relevance_filter` (re-rank retrieved chunks against the live question) → `window_budget` → `param_tuning`.
High-volume chat / support	`semantic_cache` → `param_tuning`. Cache hits dominate.
Batch summarization / classification	`param_tuning` (tight `maxTokensCap`) → `prompt_compression`.
Screenshot-heavy support / debugging	`vision_ocr` (swap text-only screenshots for their text) → `prompt_compression` → `param_tuning`.
Eval / golden runs	strategies off so results are comparable.

You express this as the ordered strategies[] array in optimizer.config.json — each entry is { kind, enabled, params }, and array order is execution order. Optional overrides.byEndpoint disables/enables kinds per logical route (e.g. all strategies off for /v1/embeddings). The config is runtime-mutable: edit it from the console's Optimizer settings page (admin-key-gated GET/PUT /admin/optimizer/settings), and every change is validated, hot-reloaded, and audit-logged. See Org Admin → Configure. To see real prompts traced through these pipelines, read the use cases.

// optimizer.config.json — ordered, runtime-mutable
{
  "strategies": [
    { "kind": "semantic_cache",      "enabled": false, "params": { "ttlSeconds": 3600 } },
    { "kind": "vision_ocr",          "enabled": false, "params": { "minConfidence": 0.6 } },
    { "kind": "prompt_compression",  "enabled": true,  "params": { "minChars": 400 } },
    { "kind": "context_compression", "enabled": true,  "params": { "maxChars": 8000 } },
    { "kind": "code_skeleton",       "enabled": true,  "params": { "minBodyLines": 4 } },
    { "kind": "code_graph",          "enabled": false, "params": { "neighborHops": 1 } },
    { "kind": "relevance_filter",    "enabled": false, "params": { "keepChars": 4000 } },
    { "kind": "window_budget",       "enabled": true,  "params": { "maxTokens": 24000 } },
    { "kind": "tool_pruning",        "enabled": true },
    { "kind": "param_tuning",        "enabled": true,  "params": { "maxTokensCap": 4096 } }
  ],
  "overrides": {
    "byEndpoint": { "/v1/embeddings": { "disable": ["prompt_compression", "param_tuning"] } }
  }
}

:::info Per-use-case granularity Today the optimizer runs one configured pipeline with optional per-endpoint overrides; orgs separate use-cases by route or by running distinct deployments. Finer per-team strategy selection is on the roadmap. :::

Strategies can tune themselves (roadmap)

You don't have to hand-tune the pipeline forever. With adaptive optimization (opt-in, roadmap), the optimizer learns from your org's own measured savings and quality to reorder strategies, retune their parameters, adapt the router, and tune the cache — each change quality-gated and auto-rolled-back if it regresses. The strategy menu is what Anyray ships you; adaptive optimization decides how to use it best for your workload.

The one invariant strategies can't break

No matter how you compose the pipeline, quality is protected — this is Anyray's behavior-preserving design principle in action: the optimizer fails open (the gateway forwards the original request on any error or timeout), and any strategy that is uncertain defers. Optimization is never allowed to silently degrade an answer — the worst case is "you paid full price," never "you got a worse result." (Gateway model routing adds its own fail-safe-to-frontier guarantee; the automated version is roadmap.)

What a strategy is​

The pipeline​

The strategy menu​

Conversation-aware context management​

Configuring strategies for a use-case​

Strategies can tune themselves (roadmap)​

The one invariant strategies can't break​