Optimizer Protocol v1

The optimizer is reached through one small, gateway-neutral HTTP contract. It is a hook backend, never on the request's forwarding path: a gateway integrates it through an adapter that calls these endpoints from its own pre-call and post-call hooks, and always forwards traffic itself. The Anyray gateway's adapter lives in gateway/src/services/optimizer.ts; the canonical spec is optimizer/PROTOCOL.md.

:::info Status
The optimizer and Protocol v1 are implemented (optimizer/, served on optimizer:8088, internal only). The Anyray gateway's adapter is implemented; a LiteLLM reference adapter ships in optimizer/adapters/. Other host adapters (Kong, Envoy, Cloudflare, Portkey) are on the roadmap.
:::

The endpoints

gateway pre-call hook   ──POST /v1/optimize───────────▶  ANYRAY OPTIMIZER
(the adapter)           ◀ request' + decisions /         run strategy pipeline
                          cacheHit + cachedResponse
gateway post-call hook  ─POST /v1/optimize-response───▶  transform the response
                        ◀ response' + decisions
                        ──POST /v1/cache──────────────▶  write-back (semantic_cache)
                        ──POST /v1/retrieve───────────▶  fetch a stashed original
                        ◀ content                        (context_compression /
                                                          window_budget)

| Endpoint | Gateway hook | Purpose |
| --- | --- | --- |
| POST /v1/optimize | pre-call | transform the request (input); may return a cache hit |
| POST /v1/optimize-response | post-call / success | transform the response (output) |
| POST /v1/cache | post-call | write a live response back for semantic_cache |
| POST /v1/retrieve | on demand | fetch the original content a reversible context_compression or window_budget stashed, by handle |

An adapter wires whichever hooks its gateway supports. A transform-only gateway uses just /v1/optimize; a gateway with a post-call hook adds /v1/optimize-response and /v1/cache. There is no meter() endpoint — spend metering is the gateway's content-free spend store, not an optimizer call.

POST /v1/optimize

Called before the provider is hit. The adapter sends the incoming OpenAI-compatible request; the optimizer runs its configured strategy pipeline and replies with a transformed request (equal to the input if nothing changed) plus content-free decisions.

// request
{
  "endpoint": "/v1/chat/completions",            // logical route; selects per-endpoint config
  "request": { "model": "...", "messages": [ ... ] },
  "metadata": { "user": "u1", "team": "t1" },    // OPTIONAL, content-free attribution
  "enabledKinds": ["prompt_compression"],        // OPTIONAL allow-list (further narrows config)
  "capabilities": { "canShortCircuit": true }    // OPTIONAL; omitted = full
}

// response
{
  "protocolVersion": 1,
  "optimizationId": "opt_000001",
  "request": { ...transformed body... },         // FORWARD THIS
  "decisions": [
    {
      "kind": "tool_pruning",
      "summary": "pruned 1/2 unused tools",
      "before": { "toolCount": 2 },
      "after": { "toolCount": 1 },
      "estimatedTokensSaved": 120,
      "estimatedSavingsUsd": 0
    }
  ],
  "estimatedTokensSaved": 120,
  "estimatedSavingsUsd": 0,
  "cacheHit": false,                             // true => serve cachedResponse, skip the provider
  "cachedResponse": null,                        // present only when cacheHit
  "cacheEligible": true,                         // true => write the live response back via /v1/cache
  "cacheKey": "oj2po6",
  "cacheTtlSeconds": 3600
}

The adapter forwards the returned request in place of the original. If the caller declared canShortCircuit: true and the reply has cacheHit: true, it serves cachedResponse instead and skips the provider.
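
The pre-call decision can be sketched as a small pure function. This is a minimal illustration, not the Anyray adapter: the names `OptimizeReply`, `PreCallAction`, and `planPreCall` are hypothetical, and only the JSON field names are taken from the response above.

```typescript
// Hypothetical types for illustration; only the JSON field names come from the protocol.
type Json = Record<string, unknown>;

interface OptimizeReply {
  request: Json;               // transformed request body to forward
  cacheHit: boolean;           // true => cachedResponse may be served
  cachedResponse: Json | null; // present only when cacheHit
}

type PreCallAction =
  | { kind: "serve_cached"; response: Json }
  | { kind: "forward"; request: Json };

// Decide what the pre-call hook does with an optimizer reply.
function planPreCall(reply: OptimizeReply, canShortCircuit: boolean): PreCallAction {
  // Only a short-circuit-capable caller may answer from cache.
  if (canShortCircuit && reply.cacheHit && reply.cachedResponse !== null) {
    return { kind: "serve_cached", response: reply.cachedResponse };
  }
  // Otherwise forward the (possibly transformed) request to the provider.
  return { kind: "forward", request: reply.request };
}
```

A transform-only caller passes `canShortCircuit = false` and therefore always gets a `forward` action, even if a reply were to carry a cache hit.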

POST /v1/optimize-response

Called from the gateway's post-call/success hook to transform the response. The adapter passes the request that was sent (for context) and the response that came back; the optimizer returns a (possibly) transformed response. Only strategies with an output stage act here; others pass the response through. Skip this call for streaming responses (they can't be transformed whole).

{ "endpoint": "/v1/chat/completions",
"request": { ...the request that was sent... }, // OPTIONAL context
"response": { ...the provider response... },
"metadata": { "user": "u1" } }
// → { "protocolVersion": 1, "response": { ... }, "decisions": [ ... ] }

POST /v1/cache

After a successful, non-streaming live response, a short-circuit-capable caller writes the response back so that the next identical request hits the cache. Prefer the cacheKey from the optimize response so the stored key matches the next lookup. This endpoint powers semantic_cache.

{ "cacheKey": "oj2po6", "response": { ... }, "ttlSeconds": 3600 }
// → { "ok": true }
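
The write-back conditions can be collected into one helper that either produces the POST /v1/cache body or declines. This is a sketch under assumptions: `CacheHints`, `LiveCall`, and `buildCacheWrite` are hypothetical names, and skipping the write when no `cacheKey` was returned is a choice made here, not something the protocol prescribes.

```typescript
// Hypothetical helper; names other than the JSON fields are assumptions.
type Json = Record<string, unknown>;

interface CacheHints {
  cacheEligible: boolean;   // from the /v1/optimize reply
  cacheKey?: string;        // from the /v1/optimize reply
  cacheTtlSeconds?: number; // from the /v1/optimize reply
}

interface LiveCall {
  ok: boolean;       // the provider call succeeded
  streamed: boolean; // the response was streamed
  response: Json;    // the provider response body
}

interface CacheWriteBody {
  cacheKey: string;
  response: Json;
  ttlSeconds?: number;
}

// Returns the POST /v1/cache body, or null when no write-back should happen.
function buildCacheWrite(
  hints: CacheHints,
  call: LiveCall,
  canShortCircuit: boolean,
): CacheWriteBody | null {
  if (!canShortCircuit || !hints.cacheEligible || hints.cacheKey === undefined) return null;
  if (!call.ok || call.streamed) return null;
  const body: CacheWriteBody = { cacheKey: hints.cacheKey, response: call.response };
  if (hints.cacheTtlSeconds !== undefined) body.ttlSeconds = hints.cacheTtlSeconds;
  return body;
}
```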

POST /v1/retrieve

This endpoint is the reversible side of context_compression and window_budget. When context_compression lossily shrinks bulky tool/function output, or when window_budget crops the oldest middle messages to fit the token budget, the dropped original is stashed under an opaque, content-free handle (a counter plus a hash; no content leaks into it), and that handle is embedded in the compressed placeholder left in the request. Whatever later needs the full content, for example an agent runtime reading the placeholder back, fetches it by handle:

{ "handle": "ctx_3kf9q" }
// → { "handle": "ctx_3kf9q", "content": "...the original..." } // 404 if unknown/expired

Stashes live in memory only, expire after the strategy's ttlSeconds, and are never logged or persisted (the stashed original is content; the handle is not). This makes context compression reversible rather than destructive — nothing is permanently lost.
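
To make the stash semantics concrete, here is a minimal in-memory sketch of a TTL-bound handle store. It is not the optimizer's implementation: `HandleStash` is a hypothetical class, the handle here is derived from a counter only (the real handle format, counter plus hash, is not reproduced), and the injectable clock exists purely so expiry can be demonstrated.

```typescript
// Hypothetical in-memory stash; not the optimizer's actual implementation.
class HandleStash {
  private counter = 0;
  private readonly entries = new Map<string, { content: string; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlSeconds: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlSeconds * 1000;
    this.now = now;
  }

  // Store the original and return an opaque handle that carries no content.
  put(content: string): string {
    this.counter += 1;
    const handle = `ctx_${this.counter.toString(36)}`;
    this.entries.set(handle, { content, expiresAt: this.now() + this.ttlMs });
    return handle;
  }

  // Returns the original, or undefined for unknown or expired handles (the 404 case).
  get(handle: string): string | undefined {
    const entry = this.entries.get(handle);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(handle); // expired entries are dropped, never persisted
      return undefined;
    }
    return entry.content;
  }
}
```

Because nothing is written anywhere but the `Map`, a process restart loses every stash, which matches the in-memory-only behavior described above.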

The rules every adapter follows

  1. Fail open. On any error talking to the optimizer — timeout, non-2xx, malformed reply — the adapter forwards the original request unchanged. The optimizer being down or slow must never break or alter inference.
  2. Bound it. Call it with a hard timeout. The Anyray gateway uses 800ms.
  3. Off path. Call it before forwarding; never route provider traffic through it. The gateway always owns transport, routing, signing, keys, streaming, and spend metering.
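
The first two rules can be sketched together as a wrapper that bounds the optimizer call and falls back to the original request on any failure. This is an illustration under assumptions: `optimizeOrPassThrough`, `timeoutAfter`, and `hasRequest` are hypothetical names, and `callOptimizer` stands in for whatever HTTP client the adapter uses (it is assumed to reject on non-2xx replies).

```typescript
type Json = Record<string, unknown>;

// Narrow an unknown reply to one that carries a usable request object.
function hasRequest(value: unknown): value is { request: Json } {
  if (typeof value !== "object" || value === null) return false;
  const req = (value as { request?: unknown }).request;
  return typeof req === "object" && req !== null && !Array.isArray(req);
}

// A promise that rejects after `ms`, plus a way to cancel its timer.
function timeoutAfter(ms: number): { promise: Promise<never>; cancel: () => void } {
  let cancel: () => void = () => {};
  const promise = new Promise<never>((_resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`optimizer timed out after ${ms}ms`)), ms);
    cancel = () => clearTimeout(timer);
  });
  return { promise, cancel };
}

// Fail open: any timeout, rejection, or malformed reply yields the original request.
async function optimizeOrPassThrough(
  original: Json,
  callOptimizer: (request: Json) => Promise<unknown>, // assumed to reject on non-2xx
  timeoutMs: number = 800,
): Promise<Json> {
  const deadline = timeoutAfter(timeoutMs);
  try {
    const reply = await Promise.race([callOptimizer(original), deadline.promise]);
    return hasRequest(reply) ? reply.request : original;
  } catch {
    return original;
  } finally {
    deadline.cancel();
  }
}
```

`Promise.race` only stops waiting; it does not cancel the in-flight call. An adapter that wants to free the connection as well would additionally pass an abort signal to its HTTP client.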

Auth

Authentication is an optional shared secret. If the optimizer is started with ANYRAY_OPTIMIZER_TOKEN, callers must send Authorization: Bearer <token> on every hook request; in-network deployments may omit it. The runtime config API (GET/PUT /admin/optimizer/settings) is separately gated by ANYRAY_ADMIN_TOKEN.
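
A caller-side sketch of the optional bearer token: the helper name `hookHeaders` is hypothetical, and the `Content-Type` header is an assumption based on the JSON bodies shown above. How the adapter obtains the token is deployment-specific and not shown.

```typescript
// Hypothetical helper: build headers for a hook request.
// The Authorization header is sent only when a shared secret is configured.
function hookHeaders(token?: string): Record<string, string> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (token !== undefined && token !== "") {
    headers["Authorization"] = `Bearer ${token}`;
  }
  return headers;
}
```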

Privacy

decision.summary, decision.kind, and the token/cost estimates are content-free and always safe to log. decision.before / decision.after and cachedResponse MAY contain prompt content, so each caller handles them per its own content policy — the Anyray gateway encrypts or omits them per ANYRAY_CONTENT_MODE. The optimizer itself holds content in memory only and never logs or persists it.
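
One way for a caller to honor this split when logging is to copy only the content-free fields by allow-list. The sketch below is illustrative: `Decision`, `LoggableDecision`, and `toLoggable` are hypothetical names, and it shows a drop-everything-else policy rather than the Anyray gateway's encrypt-or-omit handling.

```typescript
// Hypothetical shapes; the field names follow the decisions array in the optimize response.
interface Decision {
  kind: string;
  summary: string;
  estimatedTokensSaved: number;
  estimatedSavingsUsd: number;
  before?: unknown; // MAY contain prompt content
  after?: unknown;  // MAY contain prompt content
}

type LoggableDecision = Pick<
  Decision,
  "kind" | "summary" | "estimatedTokensSaved" | "estimatedSavingsUsd"
>;

// Allow-list copy: before/after and any unknown fields never reach the log.
function toLoggable(decision: Decision): LoggableDecision {
  return {
    kind: decision.kind,
    summary: decision.summary,
    estimatedTokensSaved: decision.estimatedTokensSaved,
    estimatedSavingsUsd: decision.estimatedSavingsUsd,
  };
}
```

Copying by allow-list rather than deleting known-sensitive keys means a field added in a later protocol revision is excluded from logs until someone decides it is safe.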

Capability differences across callers

capabilities.canShortCircuit tells the optimizer whether the caller can return a response from its pre-request hook:

| Capability | Notes |
| --- | --- |
| Rewrite request | Supported on every caller. |
| Serve cache hit from the pre-call hook (canShortCircuit: true) | The Anyray gateway and inline proxies. The optimizer may return cacheHit + cachedResponse. |
| Transform-only (canShortCircuit: false) | For example, LiteLLM's async_pre_call_hook: it can modify the request but cannot return a response, so the optimizer skips cache lookups and only applies request transforms. |
| Observe cost | The gateway's content-free spend store, not an optimizer endpoint. |

One protocol serves gateways with different powers. protocolVersion is returned on every optimize response; additive fields are non-breaking and consumers must ignore unknown fields. See Gateways for the per-gateway capability table.