Optimizer Protocol v1
The optimizer is reached through one small, gateway-neutral
HTTP contract. It is a hook backend, never on the request's forwarding path: a
gateway integrates it through an adapter that calls these endpoints from its own
pre-call and post-call hooks, and always forwards traffic itself. The Anyray gateway's
adapter lives in
gateway/src/services/optimizer.ts;
the canonical spec is
optimizer/PROTOCOL.md.
:::info Status
The optimizer and Protocol v1 are implemented (optimizer/, served on optimizer:8088,
internal only). The Anyray gateway's adapter is implemented; a LiteLLM reference adapter
ships in optimizer/adapters/.
Other host adapters (Kong, Envoy, Cloudflare, Portkey) are roadmap.
:::
The endpoints
gateway pre-call hook ──POST /v1/optimize──────────▶ ANYRAY OPTIMIZER
(the adapter) ◀ request' + decisions / run strategy pipeline
cacheHit + cachedResponse
gateway post-call hook ─POST /v1/optimize-response─▶ transform the response
◀ response' + decisions
──POST /v1/cache──────────────▶ write-back (semantic_cache)
──POST /v1/retrieve───────────▶ fetch a stashed original
◀ content (context_compression /
window_budget)
| Endpoint | Gateway hook | Purpose |
|---|---|---|
POST /v1/optimize | pre-call | transform the request (input); may return a cache hit |
POST /v1/optimize-response | post-call / success | transform the response (output) |
POST /v1/cache | post-call | write a live response back for semantic_cache |
POST /v1/retrieve | on demand | fetch the original content a reversible context_compression or window_budget stashed, by handle |
An adapter wires whichever hooks its gateway supports. A transform-only gateway uses just
/v1/optimize; a gateway with a post-call hook adds /v1/optimize-response and
/v1/cache. There is no meter() endpoint — spend metering is the gateway's
content-free spend store, not an optimizer call.
POST /v1/optimize
Called before the provider is hit. The adapter sends the incoming OpenAI-compatible
request; the optimizer runs its configured strategy pipeline and replies
with a transformed request (equal to the input if nothing changed) plus content-free
decisions.
// request
{
"endpoint": "/v1/chat/completions", // logical route; selects per-endpoint config
"request": { "model": "...", "messages": [ ... ] },
"metadata": { "user": "u1", "team": "t1" }, // OPTIONAL, content-free attribution
"enabledKinds": ["prompt_compression"], // OPTIONAL allow-list (further narrows config)
"capabilities": { "canShortCircuit": true } // OPTIONAL; omitted = full
}
// response
{
"protocolVersion": 1,
"optimizationId": "opt_000001",
"request": { ...transformed body... }, // FORWARD THIS
"decisions": [
{ "kind": "tool_pruning", "summary": "pruned 1/2 unused tools",
"before": { "toolCount": 2 }, "after": { "toolCount": 1 },
"estimatedTokensSaved": 120, "estimatedSavingsUsd": 0 }
],
"estimatedTokensSaved": 120,
"estimatedSavingsUsd": 0,
"cacheHit": false, // true => serve cachedResponse, skip the provider
"cachedResponse": null, // present only when cacheHit
"cacheEligible": true, // true => write the live response back via /v1/cache
"cacheKey": "oj2po6",
"cacheTtlSeconds": 3600
}
The adapter applies the returned request. If the caller canShortCircuit and
cacheHit is true, it serves cachedResponse and skips the provider.
POST /v1/optimize-response
Called from the gateway's post-call/success hook to transform the response. The
adapter passes the request that was sent (for context) and the response that came back;
the optimizer returns a (possibly) transformed response. Only strategies with an output
stage act here; others pass the response through. Skip this call for streaming
responses (they can't be transformed whole).
{ "endpoint": "/v1/chat/completions",
"request": { ...the request that was sent... }, // OPTIONAL context
"response": { ...the provider response... },
"metadata": { "user": "u1" } }
// → { "protocolVersion": 1, "response": { ... }, "decisions": [ ... ] }
POST /v1/cache
After a successful, non-streaming live response, a short-circuit-capable caller writes it
back so the next identical request hits cache. Prefer the cacheKey from the optimize
response so the stored key matches the next lookup. Powers semantic_cache.
{ "cacheKey": "oj2po6", "response": { ... }, "ttlSeconds": 3600 }
// → { "ok": true }
POST /v1/retrieve
The reversible side of context_compression and
window_budget. When context_compression shrinks bulky tool/function
output lossily, or when window_budget crops the oldest middle messages to fit the
token budget, the dropped original is stashed under an opaque, content-free handle (a
counter + hash — no content leaks into it) and that handle is embedded in the compressed
placeholder left in the request. Whatever later needs the full content — an agent runtime
reading the placeholder back — fetches it by handle:
{ "handle": "ctx_3kf9q" }
// → { "handle": "ctx_3kf9q", "content": "...the original..." } // 404 if unknown/expired
Stashes live in memory only, expire after the strategy's ttlSeconds, and are never
logged or persisted (the stashed original is content; the handle is not). This makes
context compression reversible rather than destructive — nothing is permanently lost.
The rules every adapter follows
- Fail open. On any error talking to the optimizer — timeout, non-2xx, malformed reply — the adapter forwards the original request unchanged. The optimizer being down or slow must never break or alter inference.
- Bound it. Call it with a hard timeout. The Anyray gateway uses 800ms.
- Off path. Call it before forwarding; never route provider traffic through it. The gateway always owns transport, routing, signing, keys, streaming, and spend metering.
Auth
Authentication is an optional shared secret. If the optimizer is started with
ANYRAY_OPTIMIZER_TOKEN, callers must send Authorization: Bearer <token> on every hook
request; in-network deployments may omit it. The runtime config API
(GET/PUT /admin/optimizer/settings) is separately gated by ANYRAY_ADMIN_TOKEN.
Privacy
decision.summary, decision.kind, and the token/cost estimates are content-free and
always safe to log. decision.before / decision.after and cachedResponse MAY
contain prompt content, so each caller handles them per its own content policy — the
Anyray gateway encrypts or omits them per ANYRAY_CONTENT_MODE. The optimizer itself
holds content in memory only and never logs or persists it.
Capability differences across callers
capabilities.canShortCircuit tells the optimizer whether the caller can return a
response from its pre-request hook:
| Capability | Notes |
|---|---|
| Rewrite request | Supported on every caller. |
Serve cache hit from the pre-call hook (canShortCircuit: true) | The Anyray gateway and inline proxies. The optimizer may return cacheHit + cachedResponse. |
Transform-only (canShortCircuit: false) | e.g. LiteLLM's async_pre_call_hook — it can modify the request but cannot return a response, so the optimizer skips cache lookups and only applies request transforms. |
| Observe cost | The gateway's content-free spend store — not an optimizer endpoint. |
One protocol serves gateways with different powers. protocolVersion is returned on every
optimize response; additive fields are non-breaking and consumers must ignore unknown
fields. See Gateways for the per-gateway capability table.