Use cases
Almost every expensive request has the same shape: a developer or coding agent puts a
large blob in front of the model and asks a narrow question about it. A 500-line log, a
kubectl … -o json dump, a whole source file, a long agent session. The model is billed
for the entire blob — on every turn — even though it needs a handful of lines.
This page lists the workloads that waste the most, the exact prompt that triggers each, and the strategy that cuts it — the worked-example companion to the impact ranking in what saves the most.
:::info How to read the numbers
Reductions below are input-token reductions measured on a reference Anyray deployment
(real provider calls: the same request sent once with optimization bypassed for the
before, once optimized for the after). They are directional and
workload-dependent — the exact figure moves with input size and the strategy's budget
(e.g. relevance_filter's keepChars). Every lossy crop is reversible: elided spans
are stashed and retrievable on demand (CCR), so the model can pull back
anything it needs. As always, the number Anyray stands behind for your traffic is
cost per correct answer,
applied without degrading quality
and entirely on-prem.
:::
The numbers, per workload
| Workload | What you ask | Strategy | Reduction |
|---|---|---|---|
| Access log | "find the failing 5xx requests" | relevance_filter | ~99% |
| SRE incident debugging | "why did p99 spike at 10:05?" | relevance_filter | ~98% |
| GitHub issue triage | "which open issues are P0 auth bugs?" | relevance_filter | ~83% |
| JSON dump | "which of these 500 orders failed?" | context_compression | ~97% |
| Code search | "where is the retry policy configured?" | relevance_filter | ~72% |
| Git diff review | "any change that weakens an auth check?" | relevance_filter | ~72% |
| Codebase exploration | "map the architecture; where do retries live?" | code_skeleton | ~26% |
| Multi-file code trace | "how does submitOrder capture a payment?" | code_graph | ~33% |
| Pasted screenshot | "what is this error in my screenshot?" | vision_ocr | ~84% |
| Long agent session | "given all the above, where should X live?" | window_budget | ~72% |
| MCP tool-schema bloat | any request from a tool-loaded assistant | tool_pruning | ~91% |
| RAG over-retrieval | "what does the refund policy say about X?" | relevance_filter | ~67% |
| Templated batch prompt | nightly classification over N tickets | prompt_compression | ~82% |
| Repeat question | the same support question, asked again | semantic_cache | provider call skipped |
Runaway max_tokens | a one-liner sent with max_tokens: 100000 | param_tuning | output ceiling 100,000 → 4,096 |
Reductions are tunable, not fixed — see tuning the aggressiveness.
Access log
You paste a few hundred lines of an nginx access log and ask the model to find the
failures. Almost every line is 2xx/3xx noise; the answer lives in the handful of 5xx
lines clustered in the incident window.
Prompt: "Checkout is failing intermittently in production. Here is the nginx access log. Find the failing requests: which endpoint and method are returning 5xx, how many, the time window they happened in, and the client IPs — so I can narrow it down." — followed by ~500 log lines.
relevance_filter ranks every line against the question (Okapi BM25 —
lexical, no LLM, no embeddings) and keeps only the top-scoring lines plus a few for
orientation; the rest collapse to a content-free, retrievable marker. The 5xx
POST /api/checkout lines survive; the 200 OK noise doesn't.
SRE incident debugging
During an incident you dump the merged logs, metrics, and traces for the affected window and ask what caused the spike. The spike is a tiny slice of a very large window.
Prompt: "Checkout p99 latency spiked around 10:05 and we got paged. Here are the merged logs, metrics and traces for the window. Find what caused the spike."
relevance_filter keeps the lines that match the incident question (the error logs, the
slow spans, the saturated-pool metrics) and elides the steady-state noise around them.
GitHub issue triage
You feed in dozens of open issues with their comment threads and ask which are urgent. Most issues are low-priority chatter; a few are the real fires.
Prompt: "Here are our 60 open issues with their comment threads. Which ones are P0 authentication bugs that should be fixed this week? List the issue numbers."
relevance_filter surfaces the issues whose text matches the triage question and drops the
rest. (If your issues arrive as raw API JSON instead of text, this becomes a
context_compression job.)
JSON dump
You paste the output of an API call or kubectl get … -o json and ask one question about
it. Pretty-printed JSON is mostly structural whitespace and repeated keys.
Prompt: "Here is the JSON from
GET /api/orders?limit=500. Which orders have status 'failed', and what is their retry count? Ignore the settled ones."
context_compression minifies the JSON, de-duplicates repeated structure,
and caps long arrays — structurally, without reading meaning — stashing the full original
for retrieval.
Code search
You run grep/ripgrep or a semantic search, get a hundred hits, and ask which one is the
real match. Most hits are incidental keyword matches.
Prompt: "I ran
rg -n -C1 retryand got 100 hits across the codebase. Where is the retry policy / max-attempts actually defined (not just used)? Point me to the file and line."
relevance_filter ranks the result groups against the intent and keeps the ones that
actually answer it — the policy definition, not the incidental usages and comments.
Git diff review
You ask the model to review a sizable PR diff for one specific risk. Most of the changed files are unrelated to the question.
Prompt: "Review this PR diff. Is there any change that weakens an auth/permission check? Point to the exact file and line."
relevance_filter keeps the hunk that matches the review question (the weakened
requireAdmin check) and elides the unrelated file changes, so the reviewer pays for the
diff that matters.
Codebase exploration
A coding agent reads several whole source files to understand the architecture, but only needs the shape — the function signatures and structure, not every line of every body.
Prompt: "Here are the core service files. Give me a high-level map of the architecture and tell me where retry logic is handled. I do not need every line."
code_skeleton outlines each file to its symbols — keeping signatures and
declarations, eliding the bodies — and stashes each body so the agent can pull back the one
function it actually needs to change. This is the most workload-dependent row: the savings
scale with how long the function bodies are, so files full of short methods compress less.
Multi-file code trace
A coding agent traces one flow across a service — every read_file it runs piles another
whole file into the context window, but only a few of those files are on the path the
question asks about.
Prompt: "How does
Checkout.submitOrderend up capturing a payment? Trace the call path and tell me where a failed capture is retried." — after the agent has read 11 source files back as separate tool messages.
code_graph is the graph-aware cousin of code_skeleton. Instead of
outlining all 11 files uniformly, it builds a cross-file reference graph, sees which
symbols the question actually touches (submitOrder → capturePayment → sendCharge,
plus their collaborators), keeps those bodies whole, and elides the 7 off-path files —
reversibly. On single-file reads it degrades to plain code_skeleton outlining; enable
one of the two, not both.
Pasted screenshot
A developer pastes a screenshot of a terminal stack trace, an error dialog, or log output instead of the text. A high-detail image input costs hundreds-to-thousands of provider vision tokens — far more than the equivalent text.
Prompt: "My Node service crashes on checkout. Here is a screenshot of the stack trace from the terminal — what is the bug and how do I fix it?" — with a PNG attached.
vision_ocr runs a cheap local OCR pass and, only when the image reads
as confidently text, swaps it for the extracted text before the request reaches the model —
a fraction of the vision-token cost. It is gated and off by default: an image is not its
caption, so anything that reads as a diagram, chart, UI, or photo (sparse / low-confidence
text) passes through untouched. The swap is reversible — the original image is stashed and
retrievable on demand.
Long agent session
A long coding session accumulates dozens of stale tool turns. Every new step re-sends the entire history, so the oldest, least-relevant turns are paid for again and again.
Prompt (turn N of a long session): "Given everything above, where should the shared retry logic live? Be specific."
window_budget crops the oldest middle turns to fit a token ceiling,
pinning the system prompt, the task setup, and the most recent turns — and stashing what it
crops so it stays retrievable.
MCP tool-schema bloat
Your developers connected MCP servers for GitHub, Slack, Jira, a database, and a browser — genuinely useful, but now every request ships all 41 tool schemas, used or not. Industry numbers put schema overhead at roughly 1k tokens per tool.
Prompt: "Move ticket PLAT-281 to In Review and assign it to me." — a request that needs two Jira tools, sent with 41 schemas attached.
tool_pruning matches tool names against the conversation and drops the
schemas nothing references — here, 39 of 41 — before the request reaches the provider.
RAG over-retrieval
A support copilot retrieves the top-20 chunks by embedding similarity for every customer question and stuffs them all into the prompt. Two contain the answer; most pipelines over-fetch 3–5×.
Prompt: "What does the refund policy say about annual plans cancelled mid-term?" — with 20 retrieved chunks attached.
relevance_filter re-ranks the chunks against the live question on the
gateway and keeps only what matters — reversibly, like every lossy crop.
Templated batch prompt
A nightly batch script re-appends the same six-sentence instruction block before every one of 40 tickets. The instructions bill 40 times, every night — and nobody reviews the prompts a cron job sends.
Prompt: the same classification template pasted before each of 40 support tickets.
prompt_compression deduplicates the repeated sentences and collapses the
padding; the first copy survives, and the model behaves identically.
Repeat question
High-volume support and FAQ traffic asks the same questions in different words, and every rephrasing pays for a fresh provider call.
Prompt: "How many days do I have to return something?" — semantically equal to a prior "What's your refund window?".
semantic_cache recognizes the repeat (within similarityThreshold) and
the gateway serves the stored response — the provider is never called, so the call is ~free
and faster. It is default-off; enable it per workload where freshness allows. (A host
adapter that can't short-circuit, like LiteLLM's pre-call hook, serves the hit from its own
cache instead of receiving a stored response.)
Runaway max_tokens
A copy-pasted "safe default" of max_tokens: 100000 rides on a request that asks for a
one-line answer. The parameter authorizes the worst-case bill: a verbose model, a
reasoning loop, or a prompt-injected response can spend the whole ceiling.
Prompt: "Write a one-line conventional-commit message for this diff." — sent with
max_tokens: 100000.
param_tuning clamps the output ceiling at the gateway (default
maxTokensCap: 4096), so no request can authorize more output than your org allows.
Strategies compose
These workloads aren't either/or: one request passes through every enabled strategy in
optimizer.config.json order. Each strategy hands the transformed request to the next, and
each emits a content-free decision the gateway uses to
attribute savings per strategy in its spend store:
prompt_compression long prompt shortened (transform)
tool_pruning dropped unused_tool (transform)
param_tuning max_tokens 16000 → 4096 (transform)
semantic_cache no equivalent prior request (no hit → forward)
▼
forward to provider with { max_tokens: 4096, tools: [] }
The same prompt behaves differently under different configured pipelines — a conservative coding-agent pipeline, an aggressive high-volume-chat one with the cache on, or an eval run with everything off — chosen by the admin per use-case, never by changing your code; see configuring strategies for a use-case. If the optimizer errors or exceeds the gateway's 800 ms timeout, the gateway forwards the original request unchanged — it fails open.
Routing a request to a cheaper model is deliberately not on this page: it's the gateway's routing job, not an optimizer strategy — see model routing in what saves the most.
Tuning the aggressiveness
For the relevance_filter workloads, how much is kept is a single budget — keepChars —
that trades savings against how much context the model sees. Lower it for noisier inputs
where the answer is concentrated (an access log full of identical errors); raise it where
the signal is spread out (code-search results):
| Workload | keepChars | Reduction |
|---|---|---|
| Access log | 500 | ~99% |
| SRE incident | 1,000 | ~98% |
| GitHub triage | 2,000 | ~83% |
| Git diff review | 2,000 | ~72% |
| Code search | 1,500 | ~72% |
| RAG over-retrieval | 1,200 | ~67% |
Because every crop is reversible, an over-aggressive budget degrades gracefully: if the model needs a span that was elided, it retrieves it rather than answering blind. This is the same behavior-preserving contract the rest of the pipeline holds to.
Reproduce these
Each workload above ships in the demo kit as its own folder under demo/use-cases/<id>/ —
the raw artifact a developer would actually have (access.log, orders.json, …), the
exact prompt, and a run.sh that runs that one case the way a coding agent would: it
points the Codex CLI at the gateway, sends the artifact
as the request, then validates the optimizer's decision by reading that request's trace
back from the console:
ANYRAY_ADMIN_TOKEN=… demo/use-cases/1-access-log/run.sh
# runs the case through the gateway with Codex, then prints the trace's per-request
# decision and a /console/traces deep link. Needs the Codex CLI on PATH.
To send every workload at once and print the before/after savings table:
node demo/generate.mjs # build the data + prompts
ANYRAY_ADMIN_TOKEN=… GATEWAY=… CONSOLE=… node demo/run.mjs
In the console — Playground
The console Playground is free-form. Open Gateway → Playground, type a prompt, optionally attach a text file (it rides as data under the prompt) or a screenshot (sent as a vision part), pick a provider/model, and hit Run + compare: the console fires the request through the gateway twice — once optimized, once with optimization bypassed for the before — and shows the prompt-token count before → after, the % saved on that run, and which strategy fired. It carries a server-held provider key, so nothing leaves the box, and — like every surface — it records metadata only, never prompt or response content.
The demo workloads above aren't loaded from the console; run them via the CLI
(demo/use-cases/<id>/run.sh or node demo/run.mjs, as described earlier on this page).
For an offline, deterministic version (no provider, no keys, no cost) that runs the
real strategy code and counts o200k_base tokens, see the
reproducible benchmark
(cd optimizer/bench && npm run bench).