What saves the most

Anyray's strategy priorities are set by measured real-world impact, not guesses. This page summarizes the public data on what actually reduces LLM token spend, and shows how that ranking shapes our strategy pipeline.

:::info How to read the numbers — and what Anyray actually promises Figures below are third-party / industry benchmarks and case studies (research and provider sources at the bottom). They are directional and workload-dependent — not Anyray's own results, and not guarantees. They tell us where savings live and how to prioritize; the biggest lever depends on your workload (see the two fronts).

What Anyray commits to is narrower and verifiable: it applies these levers without degrading quality, entirely on-prem, and reports your real savings as cost per correct answer measured on your own traffic — never a number from this page. :::

The two fronts: input vs. output

Every token-saving technique attacks one of two cost fronts, and they have very different economics:

Output tokens cost 3–10× more than input tokens (often 4–6×, up to ~8× on some models) because each one is generated. So cutting output is disproportionately valuable — "reducing output tokens has 3–8× more impact than reducing input."
Input tokens balloon in RAG and agentic workloads — the same system prompt, documents, and tool schemas are re-sent on every turn and every step. That's where caching and compression pay off massively.

A short-output chat app and a long-context agent need different levers. That's exactly why Anyray is a configurable pipeline, not one fixed optimizer.

Measure savings the right way: cost per correct answer

Raw "$ saved" can hide a quality regression — a cheaper answer that's wrong isn't a saving. The honest unit is cost per correct answer: cost normalized by quality. Independent agentic benchmarks show cost-per-correct-answer dropping several-fold purely by replacing context-stuffing with retrieval — while accuracy holds flat. Anyray optimizes for this metric, not raw spend, which is the same thing as our behavior-preserving principle: a saving only counts if quality is preserved.

Our own benchmark: token reduction on agentic workloads

The figures elsewhere on this page are third-party and directional. The table below is ours and reproducible: it runs synthetic, Headroom-style agentic payloads (code search, service logs, RAG retrieval, GitHub issue triage, long agent conversations, boilerplate prompts, bloated tool registries) through the real optimizer strategy code and counts actual o200k_base tokens (the GPT‑4o tokenizer) before and after. It's offline — no provider calls, no API keys, no cost — and deterministic, so anyone can reproduce it:

cd optimizer/bench && npm install && npm run bench

Workload	Strategies	Tokens in	Tokens out	Reduction
Code search (ripgrep + AST dump)	`context_compression`	9,608	2,030	78.9%
SRE debugging (service logs)	`context_compression`	24,020	2,754	88.5%
RAG retrieval (vector chunks)	`context_compression`	12,683	1,726	86.4%
GitHub issue triage (REST JSON)	`context_compression`	59,572	2,578	95.7%
Long agent conversation (window budget)	`window_budget`	52,803	23,613	55.3%
Boilerplate-heavy prompt	`prompt_compression`	10,286	3,592	65.1%
Bloated tool registry (tool pruning)	`tool_pruning`	20,099	3,947	80.4%
Full agentic turn (all strategies)	`prompt_compression + context_compression + window_budget + tool_pruning`	33,171	5,332	83.9%
All workloads		222,242	45,572	79.5%

The bench harness covers the four longest-shipped levers. The newer strategies — code_skeleton, code_graph, relevance_filter, vision_ocr — are measured per-workload on a reference deployment instead; see the numbers, per workload (e.g. code_graph ~33% on a multi-file code trace, relevance_filter ~67–99% depending on the input, vision_ocr ~84% on a pasted screenshot).

:::caution This measures compression, not quality These are input-token reductions (compression ratio), not accuracy deltas. The strategies are deterministic and content-blind, and every lossy crop is reversible — elided tool output and cropped turns are stashed and retrievable on demand (CCR), so the model can pull back anything it needs. But this harness makes no accuracy claim: it's about how many tokens leave the request, not whether the answer stays correct. The number Anyray stands behind for your workload is still cost per correct answer on real traffic, applied without degrading quality. Reductions are also workload-dependent — these synthetic payloads are chosen to be representative, not to flatter. :::

The impact ranking

Anyray	Typical impact (industry benchmarks)	Targets	Best for	Our strategy (status)
Output & token control	20–40% spend cut; output is 3–10× input, so this is the highest ROI / lowest effort lever	output	almost everything, esp. verbose/summarization	`param_tuning` ✅ implemented (clamps `max_tokens`); expanding to temperature/verbosity/format control
Tool pruning	drops tools unlikely to be invoked, shrinking re-sent tool schemas on agentic calls	input	agents with many tools	`tool_pruning` ✅ implemented
Prompt (prefix) caching	up to 90% on cached input (Anthropic; reads at 10% of input), 50% automatic (OpenAI); real cuts of 59–73%	input	RAG, agents, long stable system prompts	cache-aware structuring — ⏳ roadmap; the provider/gateway holds the cache
Model routing	up to 80% cost cut at ~95% quality; budget models 15–50× cheaper on simple tasks	model choice	mixed-difficulty traffic	the gateway's routing/fallback capability ✅; automated classify→cheap/frontier downgrade — ⏳ roadmap
Semantic caching	40–80% in high-repetition workloads; ~60–69% hit rates; ~31% of queries are semantically similar	input + call	FAQ, support, repetitive Q&A	`semantic_cache` ✅ implemented (default off)
Prompt compression	4–20× input compression; ~37% at 95–100% key-fact coverage; retrieval-over-stuffing several-fold on agentic cost-per-correct-answer	input	coding agents, RAG, long-context	`prompt_compression` ✅ implemented (shorten/restructure prompt+system); model-based compression — ⏳ roadmap
Relevance filtering	~67–99% input reduction on our measured workloads (logs, RAG chunks, search results — keep only what answers the live question)	input	log-/doc-heavy agents, RAG	`relevance_filter` ✅ implemented (default off; BM25, reversible)
Code skeletonization	~26–33% on our measured code-reading workloads — outline files to signatures, keep only the on-path bodies	input	coding agents	`code_skeleton` ✅ implemented (default-on); `code_graph` (multi-file, graph-aware) ✅ implemented (default off) — both reversible
Vision-token elimination	~84% on our measured screenshot workload — swap a text-only screenshot for its OCR'd text	input (vision)	screenshot-heavy support/debugging	`vision_ocr` ✅ implemented (default off; local OCR, confidence-gated, reversible)
Provider arbitrage	cheapest qualified provider for the same model	provider	multi-provider orgs	the gateway's multi-provider routing

What this means for our priorities

The evidence reorders our roadmap:

Lead with output-token control. It's the highest-ROI, lowest-risk, already-shipped lever (param_tuning). We're expanding it from just clamping max_tokens to also nudging verbosity/format where safe. Most orgs leave this money on the table.
Add prompt/prefix caching next. For agentic and RAG workloads — the fastest-growing segment — repeated context caching is the single largest lever (up to 90%). The provider/gateway holds the cache; Anyray's job is cache-aware request structuring (stable prefixes, cacheable system blocks) so hit rates are high. (Roadmap.)
Model routing remains the headline for mixed traffic (up to 80%) and lives in the gateway today; the automated classify→cheap/frontier version, kept honest by the fail-safe to frontier, is roadmap.
Semantic caching and prompt compression are shipped optimizer strategies (semantic_cache, default off, and prompt_compression) for high-repetition and long-context/agentic workloads respectively. For agents, how context is assembled is itself a major lever (retrieval over stuffing).

:::note Where the context lever lives — request-side vs. agent-side There are two places to cut agentic context, and they're not the same:

Request-side (gateway plugin): Anyray sees the request after context is assembled, so here its lever is compression/pruning of the assembled request + cache-aware structuring.
Agent-side (hooks / MCP): the biggest wins — tool-output reduction (sandboxing/summarizing 56 KB snapshots and 45 KB logs before they ever enter context) and code-driven extraction — happen where the agent runs, upstream of any gateway, cutting tool-output tokens by 90%+.

Anyray's zero-touch interception already reaches the agent's environment, so this agent-side layer is a natural place for Anyray to extend (or integrate) — but in pure plugin mode it's upstream of us. We document it because it tells orgs where the agentic savings really are. :::

:::note Compression can run on-prem — and must, to fit the budget Compression is compatible with the data boundary: it runs fully on-prem like everything else, so there's no data egress to worry about. Two build constraints fall out: (1) a model-based pass (roadmap) needs a warm/preloaded model because cold-start (~2–4s) blows the gateway's 800ms optimizer-hook timeout; (2) a model-free term-overlap pass is cheap enough to run inline — this is what prompt_compression ships with today. Compression is gated on a quality bound — e.g. key-fact coverage — so it never trades away meaning for tokens. :::

How the dominant lever shifts by workload

Workload	Dominant lever (from the data)
Agentic / tool-using	tool-output reduction (sandboxing/summarizing big tool results) + prefix caching (repeated system + tool schemas) + conversation & memory management (goal/topic-aware, short/long-term) → routing → output control
RAG / long-context Q&A	prefix caching + context compression (the document is re-sent) → output control
High-volume chat / support	semantic cache (repetition) → routing → output control
Batch classification / extraction	routing to a budget model (15–50× cheaper) → output control
Short-prompt, verbose output	output & token control first (output dominates the bill)

This is the per-use-case configuration the optimizer is built to express.

About these figures

The numbers on this page are drawn from public, independent industry benchmarks and production case studies across prompt caching, model routing, semantic caching, prompt compression, and agentic context optimization. They're directional and workload-dependent, included to set priorities — not Anyray's own results or guarantees (see the note at the top).

What Anyray stands behind is what it measures on your traffic: real savings as cost per correct answer, applied without degrading quality. Detailed source tracking is maintained in our internal engineering notes.

The two fronts: input vs. output​

Measure savings the right way: cost per correct answer​

Our own benchmark: token reduction on agentic workloads​

The impact ranking​

What this means for our priorities​

How the dominant lever shifts by workload​

About these figures​