What saves the most
Anyray's strategy priorities are set by measured real-world impact, not guesses. This page summarizes the public data on what actually reduces LLM token spend, and shows how that ranking shapes our strategy pipeline.
:::info How to read the numbers — and what Anyray actually promises Figures below are third-party / industry benchmarks and case studies (research and provider sources at the bottom). They are directional and workload-dependent — not Anyray's own results, and not guarantees. They tell us where savings live and how to prioritize; the biggest lever depends on your workload (see the two fronts).
What Anyray commits to is narrower and verifiable: it applies these levers without degrading quality, entirely on-prem, and reports your real savings as cost per correct answer measured on your own traffic — never a number from this page. :::
The two fronts: input vs. output
Every token-saving technique attacks one of two cost fronts, and they have very different economics:
- Output tokens cost 3–10× more than input tokens (often 4–6×, up to ~8× on some models) because each one is generated. So cutting output is disproportionately valuable — "reducing output tokens has 3–8× more impact than reducing input."
- Input tokens balloon in RAG and agentic workloads — the same system prompt, documents, and tool schemas are re-sent on every turn and every step. That's where caching and compression pay off massively.
A short-output chat app and a long-context agent need different levers. That's exactly why Anyray is a configurable pipeline, not one fixed optimizer.
Measure savings the right way: cost per correct answer
Raw "$ saved" can hide a quality regression — a cheaper answer that's wrong isn't a saving. The honest unit is cost per correct answer: cost normalized by quality. Independent agentic benchmarks show cost-per-correct-answer dropping several-fold purely by replacing context-stuffing with retrieval — while accuracy holds flat. Anyray optimizes for this metric, not raw spend, which is the same thing as our behavior-preserving principle: a saving only counts if quality is preserved.
Our own benchmark: token reduction on agentic workloads
The figures elsewhere on this page are third-party and directional. The table below is
ours and reproducible: it runs synthetic, Headroom-style
agentic payloads (code search, service logs, RAG retrieval, GitHub issue triage, long agent
conversations, boilerplate prompts, bloated tool registries) through the real optimizer
strategy code and counts actual o200k_base tokens (the GPT‑4o tokenizer) before and
after. It's offline — no provider calls, no API keys, no cost — and deterministic, so
anyone can reproduce it:
cd optimizer/bench && npm install && npm run bench
| Workload | Strategies | Tokens in | Tokens out | Reduction |
|---|---|---|---|---|
| Code search (ripgrep + AST dump) | context_compression | 9,608 | 2,030 | 78.9% |
| SRE debugging (service logs) | context_compression | 24,020 | 2,754 | 88.5% |
| RAG retrieval (vector chunks) | context_compression | 12,683 | 1,726 | 86.4% |
| GitHub issue triage (REST JSON) | context_compression | 59,572 | 2,578 | 95.7% |
| Long agent conversation (window budget) | window_budget | 52,803 | 23,613 | 55.3% |
| Boilerplate-heavy prompt | prompt_compression | 10,286 | 3,592 | 65.1% |
| Bloated tool registry (tool pruning) | tool_pruning | 20,099 | 3,947 | 80.4% |
| Full agentic turn (all strategies) | prompt_compression + context_compression + window_budget + tool_pruning | 33,171 | 5,332 | 83.9% |
| All workloads | 222,242 | 45,572 | 79.5% |
The bench harness covers the four longest-shipped levers. The newer strategies —
code_skeleton, code_graph, relevance_filter, vision_ocr — are measured per-workload
on a reference deployment instead; see the numbers, per workload
(e.g. code_graph ~33% on a multi-file code trace, relevance_filter ~67–99% depending on
the input, vision_ocr ~84% on a pasted screenshot).
:::caution This measures compression, not quality These are input-token reductions (compression ratio), not accuracy deltas. The strategies are deterministic and content-blind, and every lossy crop is reversible — elided tool output and cropped turns are stashed and retrievable on demand (CCR), so the model can pull back anything it needs. But this harness makes no accuracy claim: it's about how many tokens leave the request, not whether the answer stays correct. The number Anyray stands behind for your workload is still cost per correct answer on real traffic, applied without degrading quality. Reductions are also workload-dependent — these synthetic payloads are chosen to be representative, not to flatter. :::
The impact ranking
| Anyray | Typical impact (industry benchmarks) | Targets | Best for | Our strategy (status) |
|---|---|---|---|---|
| Output & token control | 20–40% spend cut; output is 3–10× input, so this is the highest ROI / lowest effort lever | output | almost everything, esp. verbose/summarization | param_tuning ✅ implemented (clamps max_tokens); expanding to temperature/verbosity/format control |
| Tool pruning | drops tools unlikely to be invoked, shrinking re-sent tool schemas on agentic calls | input | agents with many tools | tool_pruning ✅ implemented |
| Prompt (prefix) caching | up to 90% on cached input (Anthropic; reads at 10% of input), 50% automatic (OpenAI); real cuts of 59–73% | input | RAG, agents, long stable system prompts | cache-aware structuring — ⏳ roadmap; the provider/gateway holds the cache |
| Model routing | up to 80% cost cut at ~95% quality; budget models 15–50× cheaper on simple tasks | model choice | mixed-difficulty traffic | the gateway's routing/fallback capability ✅; automated classify→cheap/frontier downgrade — ⏳ roadmap |
| Semantic caching | 40–80% in high-repetition workloads; ~60–69% hit rates; ~31% of queries are semantically similar | input + call | FAQ, support, repetitive Q&A | semantic_cache ✅ implemented (default off) |
| Prompt compression | 4–20× input compression; ~37% at 95–100% key-fact coverage; retrieval-over-stuffing several-fold on agentic cost-per-correct-answer | input | coding agents, RAG, long-context | prompt_compression ✅ implemented (shorten/restructure prompt+system); model-based compression — ⏳ roadmap |
| Relevance filtering | ~67–99% input reduction on our measured workloads (logs, RAG chunks, search results — keep only what answers the live question) | input | log-/doc-heavy agents, RAG | relevance_filter ✅ implemented (default off; BM25, reversible) |
| Code skeletonization | ~26–33% on our measured code-reading workloads — outline files to signatures, keep only the on-path bodies | input | coding agents | code_skeleton ✅ implemented (default-on); code_graph (multi-file, graph-aware) ✅ implemented (default off) — both reversible |
| Vision-token elimination | ~84% on our measured screenshot workload — swap a text-only screenshot for its OCR'd text | input (vision) | screenshot-heavy support/debugging | vision_ocr ✅ implemented (default off; local OCR, confidence-gated, reversible) |
| Provider arbitrage | cheapest qualified provider for the same model | provider | multi-provider orgs | the gateway's multi-provider routing |
What this means for our priorities
The evidence reorders our roadmap:
- Lead with output-token control. It's the highest-ROI, lowest-risk, already-shipped
lever (
param_tuning). We're expanding it from just clampingmax_tokensto also nudging verbosity/format where safe. Most orgs leave this money on the table. - Add prompt/prefix caching next. For agentic and RAG workloads — the fastest-growing segment — repeated context caching is the single largest lever (up to 90%). The provider/gateway holds the cache; Anyray's job is cache-aware request structuring (stable prefixes, cacheable system blocks) so hit rates are high. (Roadmap.)
- Model routing remains the headline for mixed traffic (up to 80%) and lives in the gateway today; the automated classify→cheap/frontier version, kept honest by the fail-safe to frontier, is roadmap.
- Semantic caching and prompt compression are shipped optimizer strategies
(
semantic_cache, default off, andprompt_compression) for high-repetition and long-context/agentic workloads respectively. For agents, how context is assembled is itself a major lever (retrieval over stuffing).
:::note Where the context lever lives — request-side vs. agent-side There are two places to cut agentic context, and they're not the same:
- Request-side (gateway plugin): Anyray sees the request after context is assembled, so here its lever is compression/pruning of the assembled request + cache-aware structuring.
- Agent-side (hooks / MCP): the biggest wins — tool-output reduction (sandboxing/summarizing 56 KB snapshots and 45 KB logs before they ever enter context) and code-driven extraction — happen where the agent runs, upstream of any gateway, cutting tool-output tokens by 90%+.
Anyray's zero-touch interception already reaches the agent's environment, so this agent-side layer is a natural place for Anyray to extend (or integrate) — but in pure plugin mode it's upstream of us. We document it because it tells orgs where the agentic savings really are. :::
:::note Compression can run on-prem — and must, to fit the budget
Compression is compatible with the data boundary: it runs
fully on-prem like everything else, so there's no data egress to worry about. Two
build constraints fall out: (1) a model-based pass (roadmap) needs a warm/preloaded
model because cold-start (~2–4s) blows the gateway's 800ms optimizer-hook timeout; (2) a
model-free term-overlap pass is cheap enough to run inline — this is what
prompt_compression ships with today. Compression is gated on a quality bound — e.g.
key-fact coverage — so it never trades away meaning for tokens.
:::
How the dominant lever shifts by workload
| Workload | Dominant lever (from the data) |
|---|---|
| Agentic / tool-using | tool-output reduction (sandboxing/summarizing big tool results) + prefix caching (repeated system + tool schemas) + conversation & memory management (goal/topic-aware, short/long-term) → routing → output control |
| RAG / long-context Q&A | prefix caching + context compression (the document is re-sent) → output control |
| High-volume chat / support | semantic cache (repetition) → routing → output control |
| Batch classification / extraction | routing to a budget model (15–50× cheaper) → output control |
| Short-prompt, verbose output | output & token control first (output dominates the bill) |
This is the per-use-case configuration the optimizer is built to express.
About these figures
The numbers on this page are drawn from public, independent industry benchmarks and production case studies across prompt caching, model routing, semantic caching, prompt compression, and agentic context optimization. They're directional and workload-dependent, included to set priorities — not Anyray's own results or guarantees (see the note at the top).
What Anyray stands behind is what it measures on your traffic: real savings as cost per correct answer, applied without degrading quality. Detailed source tracking is maintained in our internal engineering notes.