Use cases

Almost every expensive request has the same shape: a developer or coding agent puts a large blob in front of the model and asks a narrow question about it. A 500-line log, a kubectl … -o json dump, a whole source file, a long agent session. The model is billed for the entire blob — on every turn — even though it needs a handful of lines.

This page lists the workloads that waste the most, the exact prompt that triggers each, and the strategy that cuts it — the worked-example companion to the impact ranking in what saves the most.

:::info How to read the numbers Reductions below are input-token reductions measured on a reference Anyray deployment (real provider calls: the same request sent once with optimization bypassed for the before, once optimized for the after). They are directional and workload-dependent — the exact figure moves with input size and the strategy's budget (e.g. relevance_filter's keepChars). Every lossy crop is reversible: elided spans are stashed and retrievable on demand (CCR), so the model can pull back anything it needs. As always, the number Anyray stands behind for your traffic is cost per correct answer, applied without degrading quality and entirely on-prem. :::

The numbers, per workload

Workload	What you ask	Strategy	Reduction
Access log	"find the failing 5xx requests"	`relevance_filter`	~99%
SRE incident debugging	"why did p99 spike at 10:05?"	`relevance_filter`	~98%
GitHub issue triage	"which open issues are P0 auth bugs?"	`relevance_filter`	~83%
JSON dump	"which of these 500 orders failed?"	`context_compression`	~97%
Code search	"where is the retry policy configured?"	`relevance_filter`	~72%
Git diff review	"any change that weakens an auth check?"	`relevance_filter`	~72%
Codebase exploration	"map the architecture; where do retries live?"	`code_skeleton`	~26%
Multi-file code trace	"how does submitOrder capture a payment?"	`code_graph`	~33%
Pasted screenshot	"what is this error in my screenshot?"	`vision_ocr`	~84%
Long agent session	"given all the above, where should X live?"	`window_budget`	~72%
MCP tool-schema bloat	any request from a tool-loaded assistant	`tool_pruning`	~91%
RAG over-retrieval	"what does the refund policy say about X?"	`relevance_filter`	~67%
Templated batch prompt	nightly classification over N tickets	`prompt_compression`	~82%
Repeat question	the same support question, asked again	`semantic_cache`	provider call skipped
Runaway `max_tokens`	a one-liner sent with `max_tokens: 100000`	`param_tuning`	output ceiling 100,000 → 4,096

Reductions are tunable, not fixed — see tuning the aggressiveness.

Access log

You paste a few hundred lines of an nginx access log and ask the model to find the failures. Almost every line is 2xx/3xx noise; the answer lives in the handful of 5xx lines clustered in the incident window.

Prompt: "Checkout is failing intermittently in production. Here is the nginx access log. Find the failing requests: which endpoint and method are returning 5xx, how many, the time window they happened in, and the client IPs — so I can narrow it down." — followed by ~500 log lines.

relevance_filter ranks every line against the question (Okapi BM25 — lexical, no LLM, no embeddings) and keeps only the top-scoring lines plus a few for orientation; the rest collapse to a content-free, retrievable marker. The 5xx POST /api/checkout lines survive; the 200 OK noise doesn't.

SRE incident debugging

During an incident you dump the merged logs, metrics, and traces for the affected window and ask what caused the spike. The spike is a tiny slice of a very large window.

Prompt: "Checkout p99 latency spiked around 10:05 and we got paged. Here are the merged logs, metrics and traces for the window. Find what caused the spike."

relevance_filter keeps the lines that match the incident question (the error logs, the slow spans, the saturated-pool metrics) and elides the steady-state noise around them.

GitHub issue triage

You feed in dozens of open issues with their comment threads and ask which are urgent. Most issues are low-priority chatter; a few are the real fires.

Prompt: "Here are our 60 open issues with their comment threads. Which ones are P0 authentication bugs that should be fixed this week? List the issue numbers."

relevance_filter surfaces the issues whose text matches the triage question and drops the rest. (If your issues arrive as raw API JSON instead of text, this becomes a context_compression job.)

JSON dump

You paste the output of an API call or kubectl get … -o json and ask one question about it. Pretty-printed JSON is mostly structural whitespace and repeated keys.

Prompt: "Here is the JSON from GET /api/orders?limit=500. Which orders have status 'failed', and what is their retry count? Ignore the settled ones."

context_compression minifies the JSON, de-duplicates repeated structure, and caps long arrays — structurally, without reading meaning — stashing the full original for retrieval.

Code search

You run grep/ripgrep or a semantic search, get a hundred hits, and ask which one is the real match. Most hits are incidental keyword matches.

Prompt: "I ran rg -n -C1 retry and got 100 hits across the codebase. Where is the retry policy / max-attempts actually defined (not just used)? Point me to the file and line."

relevance_filter ranks the result groups against the intent and keeps the ones that actually answer it — the policy definition, not the incidental usages and comments.

Git diff review

You ask the model to review a sizable PR diff for one specific risk. Most of the changed files are unrelated to the question.

Prompt: "Review this PR diff. Is there any change that weakens an auth/permission check? Point to the exact file and line."

relevance_filter keeps the hunk that matches the review question (the weakened requireAdmin check) and elides the unrelated file changes, so the reviewer pays for the diff that matters.

Codebase exploration

A coding agent reads several whole source files to understand the architecture, but only needs the shape — the function signatures and structure, not every line of every body.

Prompt: "Here are the core service files. Give me a high-level map of the architecture and tell me where retry logic is handled. I do not need every line."

code_skeleton outlines each file to its symbols — keeping signatures and declarations, eliding the bodies — and stashes each body so the agent can pull back the one function it actually needs to change. This is the most workload-dependent row: the savings scale with how long the function bodies are, so files full of short methods compress less.

Multi-file code trace

A coding agent traces one flow across a service — every read_file it runs piles another whole file into the context window, but only a few of those files are on the path the question asks about.

Prompt: "How does Checkout.submitOrder end up capturing a payment? Trace the call path and tell me where a failed capture is retried." — after the agent has read 11 source files back as separate tool messages.

code_graph is the graph-aware cousin of code_skeleton. Instead of outlining all 11 files uniformly, it builds a cross-file reference graph, sees which symbols the question actually touches (submitOrder → capturePayment → sendCharge, plus their collaborators), keeps those bodies whole, and elides the 7 off-path files — reversibly. On single-file reads it degrades to plain code_skeleton outlining; enable one of the two, not both.

Pasted screenshot

A developer pastes a screenshot of a terminal stack trace, an error dialog, or log output instead of the text. A high-detail image input costs hundreds-to-thousands of provider vision tokens — far more than the equivalent text.

Prompt: "My Node service crashes on checkout. Here is a screenshot of the stack trace from the terminal — what is the bug and how do I fix it?" — with a PNG attached.

vision_ocr runs a cheap local OCR pass and, only when the image reads as confidently text, swaps it for the extracted text before the request reaches the model — a fraction of the vision-token cost. It is gated and off by default: an image is not its caption, so anything that reads as a diagram, chart, UI, or photo (sparse / low-confidence text) passes through untouched. The swap is reversible — the original image is stashed and retrievable on demand.

Long agent session

A long coding session accumulates dozens of stale tool turns. Every new step re-sends the entire history, so the oldest, least-relevant turns are paid for again and again.

Prompt (turn N of a long session): "Given everything above, where should the shared retry logic live? Be specific."

window_budget crops the oldest middle turns to fit a token ceiling, pinning the system prompt, the task setup, and the most recent turns — and stashing what it crops so it stays retrievable.

MCP tool-schema bloat

Your developers connected MCP servers for GitHub, Slack, Jira, a database, and a browser — genuinely useful, but now every request ships all 41 tool schemas, used or not. Industry numbers put schema overhead at roughly 1k tokens per tool.

Prompt: "Move ticket PLAT-281 to In Review and assign it to me." — a request that needs two Jira tools, sent with 41 schemas attached.

tool_pruning matches tool names against the conversation and drops the schemas nothing references — here, 39 of 41 — before the request reaches the provider.

RAG over-retrieval

A support copilot retrieves the top-20 chunks by embedding similarity for every customer question and stuffs them all into the prompt. Two contain the answer; most pipelines over-fetch 3–5×.

Prompt: "What does the refund policy say about annual plans cancelled mid-term?" — with 20 retrieved chunks attached.

relevance_filter re-ranks the chunks against the live question on the gateway and keeps only what matters — reversibly, like every lossy crop.

Templated batch prompt

A nightly batch script re-appends the same six-sentence instruction block before every one of 40 tickets. The instructions bill 40 times, every night — and nobody reviews the prompts a cron job sends.

Prompt: the same classification template pasted before each of 40 support tickets.

prompt_compression deduplicates the repeated sentences and collapses the padding; the first copy survives, and the model behaves identically.

Repeat question

High-volume support and FAQ traffic asks the same questions in different words, and every rephrasing pays for a fresh provider call.

Prompt: "How many days do I have to return something?" — semantically equal to a prior "What's your refund window?".

semantic_cache recognizes the repeat (within similarityThreshold) and the gateway serves the stored response — the provider is never called, so the call is ~free and faster. It is default-off; enable it per workload where freshness allows. (A host adapter that can't short-circuit, like LiteLLM's pre-call hook, serves the hit from its own cache instead of receiving a stored response.)

Runaway `max_tokens`

A copy-pasted "safe default" of max_tokens: 100000 rides on a request that asks for a one-line answer. The parameter authorizes the worst-case bill: a verbose model, a reasoning loop, or a prompt-injected response can spend the whole ceiling.

Prompt: "Write a one-line conventional-commit message for this diff." — sent with max_tokens: 100000.

param_tuning clamps the output ceiling at the gateway (default maxTokensCap: 4096), so no request can authorize more output than your org allows.

Strategies compose

These workloads aren't either/or: one request passes through every enabled strategy in optimizer.config.json order. Each strategy hands the transformed request to the next, and each emits a content-free decision the gateway uses to attribute savings per strategy in its spend store:

 prompt_compression  long prompt shortened              (transform)
 tool_pruning        dropped unused_tool                (transform)
 param_tuning        max_tokens 16000 → 4096            (transform)
 semantic_cache      no equivalent prior request        (no hit → forward)
        ▼
 forward to provider with { max_tokens: 4096, tools: [] }

The same prompt behaves differently under different configured pipelines — a conservative coding-agent pipeline, an aggressive high-volume-chat one with the cache on, or an eval run with everything off — chosen by the admin per use-case, never by changing your code; see configuring strategies for a use-case. If the optimizer errors or exceeds the gateway's 800 ms timeout, the gateway forwards the original request unchanged — it fails open.

Routing a request to a cheaper model is deliberately not on this page: it's the gateway's routing job, not an optimizer strategy — see model routing in what saves the most.

Tuning the aggressiveness

For the relevance_filter workloads, how much is kept is a single budget — keepChars — that trades savings against how much context the model sees. Lower it for noisier inputs where the answer is concentrated (an access log full of identical errors); raise it where the signal is spread out (code-search results):

Workload	`keepChars`	Reduction
Access log	500	~99%
SRE incident	1,000	~98%
GitHub triage	2,000	~83%
Git diff review	2,000	~72%
Code search	1,500	~72%
RAG over-retrieval	1,200	~67%

Because every crop is reversible, an over-aggressive budget degrades gracefully: if the model needs a span that was elided, it retrieves it rather than answering blind. This is the same behavior-preserving contract the rest of the pipeline holds to.

Reproduce these

Each workload above ships in the demo kit as its own folder under demo/use-cases/<id>/ — the raw artifact a developer would actually have (access.log, orders.json, …), the exact prompt, and a run.sh that runs that one case the way a coding agent would: it points the Codex CLI at the gateway, sends the artifact as the request, then validates the optimizer's decision by reading that request's trace back from the console:

ANYRAY_ADMIN_TOKEN=… demo/use-cases/1-access-log/run.sh
# runs the case through the gateway with Codex, then prints the trace's per-request
# decision and a /console/traces deep link. Needs the Codex CLI on PATH.

To send every workload at once and print the before/after savings table:

node demo/generate.mjs            # build the data + prompts
ANYRAY_ADMIN_TOKEN=… GATEWAY=… CONSOLE=… node demo/run.mjs

In the console — Playground

The console Playground is free-form. Open Gateway → Playground, type a prompt, optionally attach a text file (it rides as data under the prompt) or a screenshot (sent as a vision part), pick a provider/model, and hit Run + compare: the console fires the request through the gateway twice — once optimized, once with optimization bypassed for the before — and shows the prompt-token count before → after, the % saved on that run, and which strategy fired. It carries a server-held provider key, so nothing leaves the box, and — like every surface — it records metadata only, never prompt or response content.

The demo workloads above aren't loaded from the console; run them via the CLI (demo/use-cases/<id>/run.sh or node demo/run.mjs, as described earlier on this page).

For an offline, deterministic version (no provider, no keys, no cost) that runs the real strategy code and counts o200k_base tokens, see the reproducible benchmark (cd optimizer/bench && npm run bench).

The numbers, per workload​

Access log​

SRE incident debugging​

GitHub issue triage​

JSON dump​

Code search​

Git diff review​

Codebase exploration​

Multi-file code trace​

Pasted screenshot​

Long agent session​

MCP tool-schema bloat​

RAG over-retrieval​

Templated batch prompt​

Repeat question​

Runaway max_tokens​

Strategies compose​

Tuning the aggressiveness​

Reproduce these​

In the console — Playground​