Proof, not promises

The biggest reason teams hesitate to adopt Anyray is the fear that it will hurt performance — worse answers, or higher latency. You shouldn't have to believe us. A reassuring FAQ doesn't remove that fear; evidence and control do.

So Anyray is built to prove itself on your own traffic before it ever changes a live response, and to stay provable — and reversible — in production. Four mechanisms, not four claims.

:::warning Status
The architectural safety here (fail-open, fail-safe-to-frontier, the latency budget, the kill switch) is live by design. The proof system below (Shadow Mode, holdback A/B, SLO auto-rollback, replay, in-boundary eval) is the trust mechanism we're building and is still on the roadmap; it is our answer to "how do we earn this," documented so it can be reviewed and held to.
:::

The hard part: proving quality on prompts no one can see

There's a real paradox at the center of this:

Privacy says the admin must not read employees' prompts. So we can't ask the admin to inspect each manipulation to trust it. Yet the admin is afraid the manipulation will hurt quality. How do you prove "we didn't degrade quality" on data nobody is allowed to look at?

The answer is not "show the admin the prompts." It's to prove quality with content-free evidence, verified by a machine inside your walls and never by a human reading prompts:

An automated judge runs inside your environment. A machine — not a person — compares the optimized output against the frontier output and emits only a score. The content never leaves the box and no human reads it.
A live holdback A/B on outcome signals — regenerations, retries, follow-up edits, tool-error rate, explicit feedback. All aggregate and content-free: if the optimized cohort matches the unoptimized control, quality is proven without anyone reading a prompt.
The admin sees aggregate transformation stats + the rules, never prompts — e.g. "downgraded 62% of simple-classification calls; 0 reasoning requests touched; pruned avg 34k→7k context tokens," plus the policy (reasoning always → frontier; downgrade only above confidence X). See the monitoring dashboard.
Employees are the quality sensors. The only people allowed to see their own content already judge it — a bad result becomes a regeneration or edit, which feeds the aggregate. Their experience is the ground truth, surfaced to the admin as a number.

Net: nobody privileged reads private prompts; the system proves quality with controlled experiments and machine-scored evals. You trust outcomes and rules, not content. The mechanisms below are how that proof is produced and kept live.
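To make "emits only a score" concrete, here is a minimal sketch of an in-boundary judge. Every name in it (`EvalRecord`, `judge_in_boundary`, `token_overlap`) is hypothetical and not part of Anyray's API, and the word-overlap scorer is only a stand-in for whatever automated judge actually runs inside your environment. What it illustrates is the shape of the output: identifiers and a number, no prompt or response text.

```python
from dataclasses import dataclass, asdict
from typing import Callable


@dataclass(frozen=True)
class EvalRecord:
    """The only thing that leaves the judge: identifiers and a number, no text."""
    request_id: str
    strategy: str   # hypothetical label, e.g. "model-downgrade"
    score: float    # 1.0 means the optimized output matched the frontier output


def token_overlap(frontier: str, optimized: str) -> float:
    """Stand-in scorer (Jaccard overlap of words). A real judge would be a model."""
    a, b = set(frontier.lower().split()), set(optimized.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0


def judge_in_boundary(
    request_id: str,
    strategy: str,
    frontier_output: str,
    optimized_output: str,
    scorer: Callable[[str, str], float],
) -> EvalRecord:
    # Both outputs are read here, by a machine, and are not stored or returned.
    score = scorer(frontier_output, optimized_output)
    return EvalRecord(request_id, strategy, round(score, 3))


record = judge_in_boundary(
    "req-1", "model-downgrade",
    "The invoice total is 42 USD", "The invoice total is 42 USD",
    token_overlap,
)
print(asdict(record))  # contains request_id, strategy, score and nothing else
```

Because the record type has no field that could hold content, aggregating these records for an admin view cannot leak a prompt by accident.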

1. Shadow Mode — see it before you trust it

Deploy Anyray on the request path in observe-only mode: it changes nothing. Live responses are untouched. But for every request it records what it would have done and what it would have saved, and compares the optimized path against the real answer.

Within days you have a report on your own traffic: projected savings, and the measured quality delta — at zero risk. You decide to enable optimization only after you've seen your own numbers, not ours. You watch it all in the monitoring dashboard — including how each request would change (as structured, content-free diffs).
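The observe-only behavior can be sketched as follows. This is an illustration under assumptions, not Anyray's implementation: `handle_in_shadow_mode`, `ShadowRecord`, and the three callables are hypothetical names, and a real deployment would likely run the shadow work off the request path rather than inline as shown here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ShadowRecord:
    """Content-free record of what optimization would have done."""
    request_id: str
    planned_action: str          # hypothetical label, e.g. "route-to-small-model"
    projected_saving_usd: float
    quality_score: float         # machine score of shadow answer vs. live answer


def handle_in_shadow_mode(
    request_id: str,
    prompt: str,
    call_frontier: Callable[[str], str],
    plan_and_run: Callable[[str], Tuple[str, float, str]],
    score: Callable[[str, str], float],
    log: List[ShadowRecord],
) -> str:
    # The live answer always comes from the unmodified frontier call.
    live_answer = call_frontier(prompt)
    try:
        action, saving_usd, shadow_answer = plan_and_run(prompt)
        log.append(
            ShadowRecord(request_id, action, saving_usd, score(live_answer, shadow_answer))
        )
    except Exception:
        # Observe-only: a failure on the shadow path never affects the response.
        pass
    return live_answer
```

The function returns `live_answer` on every path, so the shadow path can only add a record to the log and can never change what the user receives.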

2. Always-on holdback — permanent ground truth

When you do enable it, a configurable control slice stays unoptimized, always. That holdback is live ground truth: at any moment you can compare the optimized cohort against the control and answer "is quality the same, at lower cost?" with your production data, not a vendor benchmark. Trust isn't a one-time onboarding checkbox — it's continuously measured.
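One common way to keep a stable control slice is deterministic hash bucketing, sketched below. This is an assumption about how a holdback could be assigned, not a description of Anyray's mechanism; `in_holdback`, the salt value, and the choice of assignment unit (user, session, or request) are all hypothetical.

```python
import hashlib


def in_holdback(unit_id: str, holdback_fraction: float, salt: str = "holdback-v1") -> bool:
    """Return True if this unit belongs to the unoptimized control slice.

    The same unit_id always lands in the same cohort, so the control
    group stays stable over time without storing any assignment table.
    """
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < holdback_fraction


def cohort_for(unit_id: str, holdback_fraction: float = 0.05) -> str:
    return "control" if in_holdback(unit_id, holdback_fraction) else "optimized"


print(cohort_for("user-123"))
```

Assignment depends only on an identifier and a salt, never on prompt content, so the comparison between cohorts stays content-free. Changing the salt reshuffles the cohorts if a fresh control group is ever needed.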

3. Quality SLO — optimization lives or dies by your bar

You set the maximum acceptable quality delta and the quality signal it's measured against. If a strategy or a config change breaches that bound on your traffic, it auto-reverts — no human in the loop required. Anyray is only ever allowed to operate inside the quality envelope you defined.
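The rule reduces to a comparison between the optimized cohort and the control on the signal you chose. The sketch below assumes the signal is a rate where higher is worse (for example, regeneration rate); `QualitySLO` and `should_revert` are hypothetical names, not Anyray configuration keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualitySLO:
    signal: str       # hypothetical signal name, e.g. "regeneration_rate"
    max_delta: float  # largest acceptable worsening vs. the control cohort


def should_revert(slo: QualitySLO, control_rate: float, optimized_rate: float) -> bool:
    """True when the optimized cohort is worse than control by more than the bound."""
    return (optimized_rate - control_rate) > slo.max_delta


slo = QualitySLO(signal="regeneration_rate", max_delta=0.01)

# Control regenerates 5.0% of responses, optimized 6.5%: a 1.5-point gap breaches the bound.
print(should_revert(slo, control_rate=0.050, optimized_rate=0.065))

# A 0.5-point gap stays inside the envelope.
print(should_revert(slo, control_rate=0.050, optimized_rate=0.055))
```

A production check would also need to account for sample size and statistical noise before acting on a small gap; that is left out here to keep the rule itself visible.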

4. Kill switch — instant, total revert

One flag disables everything; fail-open does the rest — requests pass straight through to the frontier model, unchanged. No redeploy, no waiting, no chance of a stuck dependency. The exit is always one switch away, which is exactly what makes it safe to try.
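As a rough illustration of why no redeploy is needed, the sketch below checks a flag on every request. The flag name `anyray_enabled`, the `flags` mapping, and the `optimize` and `forward` callables are hypothetical; the actual switch and where it lives are not specified here.

```python
from typing import Callable, Mapping


def handle_request(
    request: dict,
    flags: Mapping[str, bool],
    optimize: Callable[[dict], dict],
    forward: Callable[[dict], str],
) -> str:
    # The flag is read per request, so flipping it takes effect on the
    # very next call. A missing flag is treated as "off".
    if flags.get("anyray_enabled", False):
        request = optimize(request)
    return forward(request)
```

Treating a missing or unreadable flag as "off" is a deliberate choice in this sketch: when in doubt, the request goes to the frontier model unchanged.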

How a rollout actually goes — graduated trust

Every step is gated by evidence on your own traffic, and every step is reversible.

Why each fear is structurally handled

The proof system sits on top of architecture that already bounds the downside:

Quality: most savings come from levers that can't change the answer (caching = same model, output caps = unused headroom); routing fails safe to frontier when unsure; context levers are quality-gated.
Latency: a hard 800 ms optimizer-hook timeout (the gateway fails open if the optimizer is slow or errors), content-free spend recording off the forwarding path, and often faster responses (cache hits, smaller models).
Reliability: fail-open. An Anyray outage means "no optimization," never "inference is down."
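The timeout and fail-open behavior described in the list above can be sketched with Python's standard `concurrent.futures`. Only the 800 ms budget comes from the text; `optimize_or_pass_through`, the thread pool, and the dict-shaped request are assumptions made for illustration, and the real gateway may enforce the budget quite differently.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable

OPTIMIZER_TIMEOUT_S = 0.8  # the 800 ms hook budget

_pool = ThreadPoolExecutor(max_workers=4)


def optimize_or_pass_through(
    request: dict,
    optimizer_hook: Callable[[dict], dict],
    timeout_s: float = OPTIMIZER_TIMEOUT_S,
) -> dict:
    """Return the optimized request, or the original one if the hook is slow or fails."""
    future = _pool.submit(optimizer_hook, request)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Too slow: stop waiting and pass the request through unchanged.
        future.cancel()
        return request
    except Exception:
        # Hook raised: fail open rather than fail the request.
        return request
```

Every failure path returns the original request, so the worst case for the caller is an unoptimized call delayed by at most the timeout.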

See Design principles for the behavior-preserving guarantee in full, and the Developer FAQ for the individual-user view.
