Optimization strategies
The optimizer runs a configurable, ordered pipeline of
optimization strategies, and each org enables and orders the strategies that fit its
use-cases. The registry is exactly ten kinds: param_tuning, prompt_compression,
context_compression, code_skeleton, code_graph, relevance_filter, window_budget,
tool_pruning, vision_ocr, and semantic_cache. Model routing is not an optimizer
strategy — it is a capability of the gateway.
:::info Status
The strategy pipeline (config-driven ordering, per-endpoint overrides, fail-open,
content-free decision attribution) is implemented in the optimizer (optimizer/).
param_tuning, prompt_compression, context_compression, code_skeleton,
window_budget, and tool_pruning are default-on; code_graph, relevance_filter,
vision_ocr, and semantic_cache are implemented but default-off — enable them per
workload. See the strategy menu.
:::
What a strategy is
A strategy is a small, self-contained transform that runs inside the optimizer. Given a request, it can:
- transform the request: rewrite it (e.g. compress the prompt, prune unused tools, clamp max_tokens) and pass it to the next strategy
- transform the response: strategies with an output stage act in POST /v1/optimize-response
- short-circuit (cacheHit): semantic_cache may serve a stored response when the caller can short-circuit, skipping the provider
Each strategy emits a content-free decision (kind, a summary, and token/cost
estimates) so its contribution is measured independently — see the
Optimizer Protocol.
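To make the shape of a strategy concrete, here is a minimal sketch of a request-transforming strategy that returns a content-free decision. The `Strategy` protocol, the `Decision` field names (`est_tokens_saved` in particular), and the `apply` signature are assumptions made for illustration; the optimizer's real interface is defined by the Optimizer Protocol and may differ.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Protocol


@dataclass(frozen=True)
class Decision:
    """Content-free record of what a strategy did (never any prompt text)."""
    kind: str
    summary: str
    est_tokens_saved: int = 0  # hypothetical field name


class Strategy(Protocol):
    """Hypothetical interface: take a request, return (request, decision)."""
    kind: str

    def apply(self, request: dict[str, Any]) -> tuple[dict[str, Any], Decision]: ...


class ParamTuning:
    """Illustrative param_tuning: clamps an over-large max_tokens to a ceiling."""
    kind = "param_tuning"

    def __init__(self, max_tokens_cap: int = 4096) -> None:
        self.max_tokens_cap = max_tokens_cap

    def apply(self, request: dict[str, Any]) -> tuple[dict[str, Any], Decision]:
        requested = request.get("max_tokens")
        if requested is None or requested <= self.max_tokens_cap:
            return request, Decision(self.kind, "no change")
        # Copy instead of mutating, so the caller still holds the original request.
        tuned = {**request, "max_tokens": self.max_tokens_cap}
        # Upper bound on output tokens avoided, not a measured saving.
        saved = requested - self.max_tokens_cap
        summary = f"clamped max_tokens {requested} -> {self.max_tokens_cap}"
        return tuned, Decision(self.kind, summary, saved)
```

Returning a new request rather than mutating the input keeps the original available, which is what makes falling back to it cheap if a later step fails.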
The pipeline
Enabled strategies run in the order declared in optimizer.config.json, threading
the (possibly transformed) request through each one. When the caller can short-circuit, a
semantic_cache hit serves the stored response and skips the rest.
Two properties make the pipeline safe to put on the request path:
- Off the forwarding path + bounded. The optimizer is a hook the gateway calls with a hard 800ms timeout; it never carries provider traffic.
- Fail-open. If the optimizer errors or times out, the gateway forwards the original request unchanged — and within the pipeline a strategy that throws is skipped rather than fatal. A broken strategy can never break inference. See the Optimizer Protocol and Org Admin → Operate.
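The ordering, skip-on-throw, and short-circuit behavior described above can be sketched as a small loop. This is an illustration under assumptions, not the optimizer's code: the step signature (returning a request, a decision, and an optional cached response), the `run_pipeline` name, and the decision dictionaries are all hypothetical.

```python
from typing import Any, Callable, Optional

Request = dict[str, Any]
Decision = dict[str, Any]
# Hypothetical step shape: returns (request, decision, cached_response_or_None).
Step = Callable[[Request], tuple[Request, Decision, Optional[dict]]]


def run_pipeline(request: Request, steps: list[tuple[str, Step]], can_short_circuit: bool):
    """Thread the request through steps in declared order, failing open per step."""
    decisions: list[Decision] = []
    for kind, step in steps:
        try:
            new_request, decision, cached = step(request)
        except Exception:
            # A strategy that throws is skipped; the request continues unchanged.
            decisions.append({"kind": kind, "summary": "skipped: strategy error"})
            continue
        request = new_request
        decisions.append(decision)
        if cached is not None and can_short_circuit:
            # Cache hit: serve the stored response and skip the remaining steps.
            return request, decisions, cached
    return request, decisions, None


# Toy steps for illustration only.
def clamp(req: Request):
    tuned = {**req, "max_tokens": min(req["max_tokens"], 4096)}
    return tuned, {"kind": "param_tuning", "summary": "clamped max_tokens"}, None


def broken(req: Request):
    raise RuntimeError("strategy bug")


def cache_hit(req: Request):
    return req, {"kind": "semantic_cache", "summary": "hit"}, {"text": "stored answer"}
```

With `[("broken", broken), ("param_tuning", clamp)]` the broken step is recorded as skipped and the clamp still runs; with `cache_hit` first and `can_short_circuit=True`, the loop returns the stored response before the clamp is reached.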
Each strategy emits a content-free decision so the gateway can attribute savings per
strategy in its spend store — metering is the gateway's job,
not an optimizer endpoint.
The strategy menu
The optimizer's registry is exactly the ten strategies below; the table is ordered by measured real-world impact — see What saves the most for the data behind this ranking. Which lever wins depends on your workload, so you compose the ones that fit. Model routing is listed separately because it is the gateway's job, not an optimizer strategy.
| Strategy (kind) | What it does | Typical impact | Status |
|---|---|---|---|
| param_tuning | Clamps wasteful parameters: caps an over-large max_tokens to a ceiling (default maxTokensCap); temperature/penalty normalization is planned. Output tokens cost 3–10× input, so this is the highest-ROI, lowest-risk lever. | ~20–40% | ✅ implemented (default-on) |
| prompt_compression | Shortens/restructures long prompts and system messages (above minChars) to cut input tokens without changing intent. | input-dependent | ✅ implemented (default-on) |
| context_compression | Shrinks the bulky content an agent reads back (tool outputs, logs, RAG chunks) by minifying JSON, collapsing whitespace, and capping oversized blobs (maxChars). Headroom-style; targets tool-role messages, so it composes with prompt_compression rather than overlapping it. | agentic / tool-heavy | ✅ implemented (default-on) |
| code_skeleton | Outlines a source file an agent reads back to its navigable skeleton (imports, each declaration's signature line and closing brace), collapsing the statement bodies between them into a content-free marker, stashed and reversible via /v1/retrieve. Serena-inspired; a deterministic, comment/string-aware brace scan that never interprets code and passes through anything it can't balance. Targets tool/function messages. | agentic / code-reading | ✅ implemented (default-on) |
| code_graph | The graph-aware sibling of code_skeleton for multi-file reads. Builds a lexical cross-file reference graph over the symbols an agent read back, seeds a working set from the live turn's intent, expands one hop (neighborHops) and adds structurally central "god" symbols (godNodeFanIn), then keeps those bodies at full fidelity and elides the rest, stashed and reversible via /v1/retrieve. Deterministic and lexical (no LLM, no embeddings, no repo scan); degrades to plain code_skeleton outlining on single-file reads or when there's no usable intent. Enable either code_graph or code_skeleton for multi-file workloads, because code_skeleton would re-elide the bodies code_graph chose to keep. | agentic / multi-file code traces | ✅ implemented, default-off; enable per workload. |
| relevance_filter | Keeps only the parts of a large tool output relevant to what the agent is doing right now. Builds a query from the latest user/assistant turn and ranks each line of a bulky result with BM25, keeping the top-scoring lines (plus a few leading lines for orientation) up to a keepChars budget and eliding the rest behind a content-free marker. Lexical only (no LLM/embeddings); skips structured JSON (that's context_compression); elided spans are stashed and reversible via /v1/retrieve. Context-Mode-style. | agentic / log- & doc-heavy | ✅ implemented, default-off; enable per workload. |
| window_budget | Fits the whole conversation under a token budget (maxTokens) by cropping the oldest middle messages (a rolling window). It pins the leading task-setup turns (keepLeading, pinRoles) and the most recent turns (keepRecent), and evicts/crops the rest. Cropped originals are stashed and reversible via /v1/retrieve. Where prompt_compression shrinks individual messages losslessly and context_compression shrinks bulky tool outputs, window_budget is about the conversation as a whole. | long agent sessions | ✅ implemented (default-on) |
| tool_pruning | Drops tools unlikely to be invoked for a request by matching tool names textually against the conversation (keepUnnamed keeps tools the heuristic can't resolve), trimming the tool-schema overhead carried on every call. Skeleton: textual matching today, no embeddings/relevance model yet. | tool-heavy agents | ✅ implemented (default-on) |
| vision_ocr | When a pasted image is really text (a terminal screenshot, a stack trace, log output), extracts the text with local Tesseract OCR and swaps the image for it: text tokens instead of far costlier provider vision tokens. The gate is the design: it only swaps on high-confidence text images (minConfidence, minWords, minTextChars); diagrams, charts, UI and photos pass through untouched. Bounded by its own latency budget (maxLatencyMs) and the gateway's longer vision timeout; the original image is stashed and reversible via /v1/retrieve. Fully local: no cloud OCR, no second model call. | screenshot-heavy workflows | ✅ implemented, default-off; enable per workload. |
| semantic_cache | Serves a stored response for a duplicate prior request and skips the provider entirely; writes live responses back via /v1/cache. Today it matches on an exact canonicalized request key; the embedding/similarity index (similarityThreshold 0.97) is reserved but not yet active. | ~40–80% in high-repetition | ✅ implemented, default-off; enable per workload. Only short-circuit-capable callers serve hits. |
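
The semantic_cache row describes matching on an exact canonicalized request key. How the optimizer canonicalizes a request is not specified on this page, so the following is only one plausible sketch: it assumes sorted-key, whitespace-free JSON hashed with SHA-256, and the `canonical_cache_key` name and `ignored_fields` parameter are hypothetical.

```python
import hashlib
import json
from typing import Any


def canonical_cache_key(request: dict[str, Any], ignored_fields: tuple[str, ...] = ()) -> str:
    """Derive a stable key so requests that differ only in key order collide."""
    # ignored_fields is a hypothetical hook for fields that should not affect the key.
    relevant = {k: v for k, v in request.items() if k not in ignored_fields}
    # sort_keys and fixed separators remove ordering and whitespace differences.
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

An exact key like this only catches true duplicates: two requests whose messages differ by a single character produce unrelated keys, which is the gap a similarity index would be meant to close.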
Model routing — a gateway capability, not an optimizer strategy:
| Capability | What it does | Status |
|---|---|---|
| Model / provider routing | The gateway picks providers/models, runs fallbacks, and load-balances. Multi-provider routing is implemented in the gateway. | ✅ gateway feature |
| Automated classify → cheap/frontier downgrade | Classify each request and route the cheap-enough ones to a cheaper model, failing safe to frontier when unsure. | ⏳ roadmap |
The Typical impact column reflects industry benchmarks (workload-dependent, directional), not Anyray guarantees. What Anyray guarantees is how these levers are applied: self-hosted, fail-open, and without degrading quality. See What saves the most for the evidence and sources.
:::tip Why this order
Output tokens are the most expensive per token, and most apps over-allocate them — so
param_tuning is the cheapest win. For agentic/RAG traffic, repeated-context
semantic_cache becomes the dominant lever (up to ~90% in published benchmarks). The
full evidence and per-workload ranking is in What saves the most.
:::
:::note An open-ended library, by design
Anyray's aim is to be the most complete optimization layer there is: it adopts the most effective techniques from across the field, composes many at once, and improves them on your own traffic. The menu above is not a fixed feature set; it's a snapshot of an open-ended library.
This is possible because the optimizer is a black box behind a fixed contract: new strategies are added inside it without changing the gateway, the adapters, or your applications. The speed at which Anyray can add and improve strategies is part of the product.
:::
Conversation-aware context management
Most strategies act on a single request. A growing, higher-order family acts across the whole conversation or agent session, covering goal awareness, topic-change detection, and short/long-term memory. This is where the largest agentic savings hide, and it is also the hardest family to apply safely, so it gets its own page: Conversation & memory management. (Planned / fast-follow.)
Configuring strategies for a use-case
Different workloads want different pipelines. You choose the set and order in
optimizer.config.json.
The evidence on dominant levers drives the suggested pipelines in this table:
| Use-case | A sensible pipeline |
|---|---|
| Coding agents (correctness-sensitive) | prompt_compression → context_compression → code_skeleton (or code_graph for multi-file traces) → window_budget (cap long sessions) → tool_pruning → param_tuning; gateway routing kept conservative. |
| RAG / long-context Q&A | prompt_compression (compress retrieved docs) → relevance_filter (re-rank retrieved chunks against the live question) → window_budget → param_tuning. |
| High-volume chat / support | semantic_cache → param_tuning. Cache hits dominate. |
| Batch summarization / classification | param_tuning (tight maxTokensCap) → prompt_compression. |
| Screenshot-heavy support / debugging | vision_ocr (swap text-only screenshots for their text) → prompt_compression → param_tuning. |
| Eval / golden runs | strategies off so results are comparable. |
You express this as the ordered strategies[] array in optimizer.config.json — each
entry is { kind, enabled, params }, and array order is execution order. Optional
overrides.byEndpoint disables/enables kinds per logical route (e.g. all strategies off
for /v1/embeddings). The config is runtime-mutable: edit it from the console's
Optimizer settings page (admin-key-gated GET/PUT /admin/optimizer/settings), and every
change is validated, hot-reloaded, and audit-logged. See
Org Admin → Configure. To see real prompts traced through
these pipelines, read the use cases.
    // optimizer.config.json: ordered, runtime-mutable
    {
      "strategies": [
        { "kind": "semantic_cache", "enabled": false, "params": { "ttlSeconds": 3600 } },
        { "kind": "vision_ocr", "enabled": false, "params": { "minConfidence": 0.6 } },
        { "kind": "prompt_compression", "enabled": true, "params": { "minChars": 400 } },
        { "kind": "context_compression", "enabled": true, "params": { "maxChars": 8000 } },
        { "kind": "code_skeleton", "enabled": true, "params": { "minBodyLines": 4 } },
        { "kind": "code_graph", "enabled": false, "params": { "neighborHops": 1 } },
        { "kind": "relevance_filter", "enabled": false, "params": { "keepChars": 4000 } },
        { "kind": "window_budget", "enabled": true, "params": { "maxTokens": 24000 } },
        { "kind": "tool_pruning", "enabled": true },
        { "kind": "param_tuning", "enabled": true, "params": { "maxTokensCap": 4096 } }
      ],
      "overrides": {
        "byEndpoint": { "/v1/embeddings": { "disable": ["prompt_compression", "param_tuning"] } }
      }
    }
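
To show how array order and per-endpoint overrides might combine, here is a small sketch that resolves the effective strategy list for a route. It is an illustration under assumptions: the `effective_strategies` helper, the `enable` override key (the config above only shows `disable`), the rule that `disable` wins over `enable`, and the `/v1/chat/completions` route name are all hypothetical.

```python
import json
from typing import Any

# Trimmed, hypothetical config in the shape described above.
CONFIG = json.loads("""
{
  "strategies": [
    { "kind": "semantic_cache", "enabled": false },
    { "kind": "prompt_compression", "enabled": true, "params": { "minChars": 400 } },
    { "kind": "tool_pruning", "enabled": true },
    { "kind": "param_tuning", "enabled": true, "params": { "maxTokensCap": 4096 } }
  ],
  "overrides": {
    "byEndpoint": { "/v1/embeddings": { "disable": ["prompt_compression", "param_tuning"] } }
  }
}
""")


def effective_strategies(config: dict[str, Any], endpoint: str) -> list[str]:
    """Return enabled kinds for an endpoint, preserving array (execution) order."""
    override = config.get("overrides", {}).get("byEndpoint", {}).get(endpoint, {})
    force_off = set(override.get("disable", []))
    force_on = set(override.get("enable", []))  # assumed key, not shown in the config above
    kinds = []
    for entry in config["strategies"]:
        on = entry.get("enabled", False)  # assumption: a missing flag means off
        if entry["kind"] in force_on:
            on = True
        if entry["kind"] in force_off:  # assumption: disable wins over enable
            on = False
        if on:
            kinds.append(entry["kind"])
    return kinds
```

With this trimmed config, a chat route resolves to prompt_compression, tool_pruning, then param_tuning, while `/v1/embeddings` keeps only tool_pruning.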
:::info Per-use-case granularity
Today the optimizer runs one configured pipeline with optional per-endpoint overrides; orgs separate use-cases by route or by running distinct deployments. Finer per-team strategy selection is on the roadmap.
:::
Strategies can tune themselves (roadmap)
You don't have to hand-tune the pipeline forever. With adaptive optimization (opt-in, roadmap), the optimizer learns from your org's own measured savings and quality to reorder strategies, retune their parameters, adapt the router, and tune the cache — each change quality-gated and auto-rolled-back if it regresses. The strategy menu is what Anyray ships you; adaptive optimization decides how to use it best for your workload.
The one invariant strategies can't break
No matter how you compose the pipeline, quality is protected — this is Anyray's behavior-preserving design principle in action: the optimizer fails open (the gateway forwards the original request on any error or timeout), and any strategy that is uncertain defers. Optimization is never allowed to silently degrade an answer — the worst case is "you paid full price," never "you got a worse result." (Gateway model routing adds its own fail-safe-to-frontier guarantee; the automated version is roadmap.)
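
The fail-open behavior on the gateway side can be sketched as a bounded call that falls back to the original request. This is a minimal illustration, not the gateway's implementation: the `optimize_or_passthrough` name and the `call_optimizer` callable are hypothetical, and a thread pool is used here only as a simple way to enforce the 800 ms budget mentioned earlier.

```python
import concurrent.futures
from typing import Any, Callable

OPTIMIZER_TIMEOUT_S = 0.8  # the hard 800 ms budget described above


def optimize_or_passthrough(
    request: dict[str, Any],
    call_optimizer: Callable[[dict[str, Any]], dict[str, Any]],
    timeout_s: float = OPTIMIZER_TIMEOUT_S,
) -> dict[str, Any]:
    """Return the optimized request, or the original on any error or timeout."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call_optimizer, request)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both optimizer errors and the timeout: forward the request unchanged.
        return request
    finally:
        # Do not block on a slow optimizer call; let it finish in the background.
        pool.shutdown(wait=False)
```

The worst case of this wrapper is that the original request goes to the provider untouched, which matches the "you paid full price" outcome rather than a degraded answer.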