Developer FAQ

Will my answers get worse?

No — that's the core design constraint. The optimizer hook fails open (800 ms timeout), so if it's down or slow your request is forwarded unchanged. The worst case is "you paid full price," never "you got a worse answer."

Do I have to change my code or SDK?

No. You keep your OpenAI / Anthropic SDK and your request shapes. Set the base URL yourself, or run npx anyray-connect to configure Claude Code, Codex, Cursor's OpenAI override, Devin Desktop (formerly Windsurf) via its in-IDE ACP agents, OpenCode, VS Code Copilot Chat, the GitHub Copilot CLI (BYOK env), JetBrains AI Assistant, Claude Desktop third-party inference, and your shell/SDK env for the gateway. Depending on the tool, Connect uses only its application-owned seam: a base URL/provider, credential store, extension, ACP registration, or hook. It never installs a CA or changes machine networking, and hard-coded first-party endpoints are left untouched.

Does Anyray routing apply to my subscription seat?

For Claude Code and Codex, yes: subscription pass-through keeps the client's provider OAuth token and request shape, so the seat continues to work while the request crosses the gateway. That traffic still goes through enrollment, attribution, policy checks, and supported optimizer savings, but it does not use Anyray's provider routing to choose a different provider.

Cursor Team is different. Cursor exposes no Anthropic base-URL override and does not hand its native Claude/Opus credential to custom endpoints. Normal --tools cursor setup therefore keeps native inference Cursor-hosted and installs only Anyray's user-level hooks and retrieval MCP server. Those local seams can reduce supported oversized Shell/MCP context and record content-free realized savings, but the native model request has no gateway trace and cannot be rewritten by Anyray. The Shell rewrite is currently macOS/Linux only; MCP output handling remains cross-platform.

Routing applies to provider API-key traffic: server-held keys, anyray-default, and SDK requests that ask Anyray to pick the target model/provider.

Organizations where each developer holds a personal key for an Anthropic-compatible upstream (for example per-user LiteLLM virtual keys) use the BYO-upstream lane instead: anyray-connect --upstream https://litellm.example.com --upstream-key <personal-key>. The personal key passes through the gateway verbatim to that upstream (x-anyray-auth-mode: passthrough + x-anyray-custom-host); the org's server-held key is never used. The upstream host must be allowlisted on the gateway via ANYRAY_CUSTOM_HOST_ALLOWLIST. Honored by Claude Code and OpenCode.

Subscription metering has two lanes. Included usage is billingMode: subscription; recognized provider response metadata can refine paid credits/overage to subscription-extra. Both count as the same seat, but paid extra usage has full real cost and savings. Missing quota metadata stays in the conservative included lane.

Is there a CLI to point my tools at the gateway?

Yes. With enterprise SSO configured, copy the organization command from Users → SSO enrollment. It signs the developer in and configures every supported tool detected on the machine:

bash
powershell

macOS / Linux
curl -fsSL https://app.anyray.ai/connect.sh | sh -s -- --sso https://app.anyray.ai/sso/<tenantId> --yes

Windows PowerShell
& ([scriptblock]::Create((irm https://app.anyray.ai/connect.ps1))) "--sso" "https://app.anyray.ai/sso/<tenantId>" --yes

Connect opens the browser, prints a short confirmation code, mints an IdP-bound personal key, then writes each supported tool's base URL, personal ark_… credential, hooks, MCP retrieval, and content-free attribution configuration. The real provider key stays server-side. The apply is idempotent, previews with --dry-run, and undoes itself with --revert. This CLI is the client-side on-ramp — it does not install the stack (that's docker compose, run by your admin).

Every Anyray /v1/* request requires a verified personal key. Connect therefore refuses a real configuration apply before it has one; it never leaves tools pointed at an unauthenticated placeholder. If your organization does not use self-service SSO, ask your admin for an enrollment link (enl_…) — they mint one from the observability console (the "Users" page) — then run the shared command:

curl -fsSL https://app.anyray.ai/connect.sh | sh -s -- --enroll https://…/enroll/enl_… --yes

Opening the link in a browser also shows the enrollment command.

This generates a keypair, mints your own personal gateway key bound to your email, and writes it into each tool. It is the Authorization bearer in org API mode and rides x-anyray-api-key when a subscription token must remain the bearer. Identity comes from the enrollment authority: an enl_… link is admin-bound to an email, SSO is IdP-bound, and only an MDM provisioning token accepts the identity/machine mode allowed by its policy. git config/--user may supply local attribution or an allowed MDM assertion; they cannot override a personal link or SSO identity. There is no provider secret to copy. The post-apply smoke test confirms the minted key and route end to end.

After enrollment, anyray-connect sync and the detached refresh worker pull /connect/policy. Alongside shared skills, that policy can set the gateway routing origin, the team bound to the personal key, and the enabled tool set. Connect updates its local profile and re-applies supported installed tools when those server-owned values drift. The tool set is enable-only: omitting an id never uninstalls or reverts a locally managed tool.

On an MDM-managed device you may never run a command at all: the first time you submit a prompt, the tool shows a one-time sign-in URL and code. Complete your organization's SSO sign-in in the browser and resubmit — enrollment happens in the background. See zero-touch SSO for managed fleets.

How do I remove Anyray from my machine?

npx anyray-connect --revert

Then restart your editor and your terminal, and check it worked:

npx anyray-connect doctor

No problems found means you're done.

The restart is the step people skip

Editors load Anyray's hooks at startup and hold them in memory; shells export its settings the moment they start. Neither notices the files disappearing underneath them, so until you restart them they carry on as if nothing changed — which makes a revert that worked look like it did nothing. VS Code needs a full Cmd+Q, not a Reload Window.

The revert takes out the base URL, key and headers from every tool it can configure, the Claude Code and Codex hooks, and every anyray_retrieve MCP-server registration (Claude Code, Codex, Cursor, OpenCode, Claude Desktop), your stored credential (minted key, enrollment cert, device key), the managed block in your shell profile, and the key-refresh daemon. Anything you set before Anyray is put back — if you had already pointed ANTHROPIC_BASE_URL at your own proxy, your value is restored rather than deleted.

Connect removes the Claude and Codex ACP entries it owns from JetBrains's ~/.jetbrains/acp.json and Devin Desktop's registry, preserving foreign or user-modified entries. Devin's one-time UI enablement may still need to be switched off in the app. The VS Code lane clears the Anyray extension's global configuration and SecretStorage credential as well as its owned BYOK provider entry.

Claude Desktop changes are journaled. Revert removes only unchanged settings and helper files that Connect installed. On Windows, the helper stores no key. Remove managed policies through the MDM that installed them. Connect reports conflicts instead of overwriting them.

Reverting removes your credential, so reconnecting means enrolling again with npx anyray-connect --sso (or a fresh link from your admin — the old one is single use). Your gateway URL and name are remembered.

On an MDM-managed device, your organization installs Anyray through a system-level profile that --revert doesn't own, so your next prompt re-enrolls you. That's policy re-applying itself, not a failed revert — talk to your admin.

After reverting I get `401 valid client key required` on a loop

A terminal that was already open when you reverted is still exporting the gateway. Your tools keep routing there, but the credential is gone — so every request 401s, while every file on disk reads perfectly clean. Fix that terminal:

unset OPENAI_BASE_URL ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN OPENAI_API_KEY ENABLE_TOOL_SEARCH

Or open a new terminal — except in VS Code, which needs a full Cmd+Q (it hands every terminal a copy of the environment it read at startup, and Reload Window reuses it).

anyray-connect doctor, run in the terminal that's failing, detects this and prints the exact line. If it reports no Shell problem, send anyray-connect doctor --json to support — it lists variable names only, never their values, so it's safe to share.

Desktop apps: what routes through Anyray?

npx anyray-connect covers local tools that read normal base-URL config: Claude Code and its VS Code extension; the Codex app, CLI, and IDE extension; Cursor's local hooks/MCP and optional OpenAI BYOK lane; OpenCode direct providers; JetBrains AI Assistant; Devin Desktop via its in-IDE ACP agents; Claude Desktop's third-party gateway mode; an explicit Anyray provider in VS Code Chat; and shell/SDK env.

Normal Connect setup is configuration-native and never changes system networking. The Claude desktop app's first-party subscription endpoint is not configurable, so Connect leaves it untouched on every platform. On verified macOS builds, its third-party Gateway mode can instead use the developer's Claude Code subscription credential through a local Keychain-backed helper. Setup never reads the token, and the helper never persists or logs it. Windows remains API-billed and fails closed when subscription is selected.

OpenCode's built-in ChatGPT OAuth similarly has no user-facing endpoint override; use Codex for ChatGPT subscription pass-through, or an OpenCode API-key/Copilot provider. The standalone GitHub Copilot CLI routes through GitHub's BYOK env contract, which Connect points at the gateway's Copilot carrier route with the developer's own seat token — billing stays on the Copilot plan, so the default sweep configures it whenever a Copilot sign-in is readable. A seat with no readable sign-in is held (never silently moved to org billing); the org-billed BYOK lane takes --org, or a machine with no Copilot seat at all. Connect registers pinned Claude and Codex ACP carriers in JetBrains and Devin; their native AI/Copilot/Devin clients remain separate. Cursor's OpenAI override is an API-key lane and cannot consume the native Cursor Team Claude/Opus entitlement. For Cursor Team, use:

anyray-connect https://gw.example.com --tools cursor \
  --user dev@example.com --yes

With Anyray SSO enrollment, use anyray-connect --sso <tenant-link> --tools cursor --yes; Cursor's own Team SSO remains untouched. Only explicit --org selects the separate API-key BYOK lane.

Connect leaves the native model route untouched and installs user-level hooks plus retrieval MCP. If a prior BYOK override was journaled, fully quit Cursor when Connect reports that restoration is queued. Local Agent Shell/MCP context is the optimization surface; native inference stays on Cursor. Cursor cloud agents do not load these user-level hooks. Claude Desktop's third-party Gateway mode is API-billed except for the verified macOS Claude Code subscription-helper lane.

On macOS, Connect configures Claude Desktop's third-party inference route automatically only for verified Claude Desktop builds — the observed Apply locally state, extended per release by a scheduled probe. Claude does not publish those files as a managed-config contract. Every other or unverifiable build, an open app, a managed policy, or an unfamiliar stored shape causes Connect to write nothing and print the supported Developer → Configure third-party inference flow. That manual route is API-billed; subscription selection never falls back to it silently.

On Windows, first setup writes Claude's per-user policy and installs a key-free helper. Restart Claude once. Later enrolled-key renewals are automatic and need no policy update or restart. If a machine-level policy exists, Connect leaves it unchanged.

Fleet admins can still emit enterprise custom-inference artifacts without embedding per-user credentials, but the helper path must be explicit:

anyray-connect desktop --mobileconfig --gateway https://gw.example.com \
  --helper-path /usr/local/bin/anyray-credential-helper > claude-desktop.mobileconfig
anyray-connect desktop --reg --gateway https://gw.example.com \
  --helper-path "C:\Program Files\Anyray\anyray-credential-helper.cmd" > claude-desktop.reg

First deploy a standalone binary to /usr/local/bin/anyray-connect or C:\Program Files\Anyray\anyray-connect.exe, then install the wrapper with desktop helper --write --platform <posix|windows> --bin <that-absolute-path> on every endpoint. Use desktop helper --print when your MDM packages files itself. The CLI verifies local installs, writes atomically, and refuses to overwrite a foreign file. The wrapper contains no key and never looks up anyray-connect through PATH; it delegates to the chosen absolute binary. Each signed-in user must be enrolled so the helper can resolve that user's current key. See the complete managed desktop flow.

The Codex app, CLI, and IDE extension share ~/.codex/config.toml / $CODEX_HOME, so the existing Codex adapter reaches the desktop app too. If the app's model picker does not show custom-provider models, pin the model with ANYRAY_CODEX_MODEL and re-run Connect. Retrieval is automatic. Source-side trim is not: Codex's blocking PostToolUse replacement rejects the already-completed nested tool promise in code mode, so Connect removes its retired hook instead of risking duplicate side effects. See Tool-output trimming for the boundary.

Connect preserves Codex's top-level model_context_window and model_auto_compact_token_limit settings and sends their numeric values to the gateway as content-free provider headers. Context budgeting follows those settings (including 1M windows) instead of forcing every Codex session through the 190k unknown-window fallback.

Fleet admins can emit Codex managed config with:

anyray-connect desktop codex --gateway https://gw.example.com \
  --helper-path /usr/local/bin/anyray-credential-helper
anyray-connect desktop codex --mobileconfig --gateway https://gw.example.com \
  --helper-path /usr/local/bin/anyray-credential-helper > codex.mobileconfig

For Windows Codex, use C:\Program Files\Anyray\anyray-credential-helper.cmd; Connect emits the supported cmd.exe command-auth invocation and its raw-token helper mode. Claude's default helper mode remains print-key --json.

That managed artifact is the API-key/helper lane. ChatGPT subscription pass-through also needs each developer's personal gateway key in x-anyray-api-key, so deploy normal provisioning-token enrollment in the signed-in user's context; Connect then writes the valid user-scope Codex config. It deliberately refuses a credentialless desktop codex --subscription artifact.

Billing caveat: a custom-inference route using an Anyray gateway credential runs on the org-key/API lane, not a claude.ai seat. Cloud or remote sessions — Claude web, Claude Slack, Codex cloud/web tasks — run on provider infrastructure and cannot route through a local gateway.

My Claude Desktop conversations disappeared after setup

They are not deleted. Claude Desktop picks its app profile from the deployment mode, so once third-party inference is active the app runs in a separate profile (Claude-3p) that has never held a Claude account session — it opens signed out, with an empty conversation list. Your first-party profile is untouched. Where each conversation lives decides how you get it back: Claude Code sessions and Cowork tasks sit in that first-party profile directory on this machine, while claude.ai chats belong to your Claude account rather than to any local profile.

The Chat tab itself comes back automatically. Third-party mode hides it by default, so desktop --apply sets the chatTabEnabled key Anthropic supports for third-party deployments (Claude Desktop 1.13576.0 and newer) — without it the app refuses chat sessions outright with "Chat sessions have not been enabled by your organization". The tab returns empty and routed through the gateway; the conversations that used to be in it are a separate matter, below.

Bring the local ones across first — no export needed. Fully quit Claude Desktop, then:

anyray-connect desktop --migrate

That copies your Claude Code sessions and Cowork tasks into the third-party profile. It copies rather than moves, so desktop --revert still returns you to the old profile with everything intact. A session whose transcript is no longer on disk is skipped rather than listed as an entry that opens empty.

Your claude.ai chats are a separate matter — you can read them right now at claude.ai, and there are three ways to get them back in front of you, the first two of which keep Anyray on:

Restore the sidebar, and stay on the gateway lane. Export your history from claude.ai (Settings → Privacy → Export data), fully quit Claude Desktop, then point Connect at the .zip:

anyray-connect history restore ~/Downloads/claude-export.zip

Reopen the app and the conversations are listed again. Connect converts the export locally with the Claude Code CLI's own importer and writes the results into Desktop's third-party session storage — an app internal with no compatibility promise, so keep the export file in case a future release changes it.

Also make them searchable from every client. Import the conversations.json inside the same export:

anyray-connect history import ~/Downloads/claude-export/conversations.json

Past conversations become searchable from any Anyray-connected client through the anyray_history MCP tool — ask about an earlier discussion and the assistant pulls it up. The archive is owner-only markdown under ~/.anyray/history and connect never uploads it — though a conversation the assistant reads becomes part of the chat, which goes through your gateway as usual. See Your existing conversations.

Or switch the app back, which restores the first-party profile and gives up Anyray's coverage of Desktop. Fully quit Claude Desktop and run:

anyray-connect desktop --revert

Chats you had while third-party mode was active stay in its separate profile and come back if you configure it again — switching modes hides one set or the other, never erases either. Claude Code sessions you started inside Desktop are safer still: their transcripts live in Claude Code's own store (~/.claude/projects/), outside both profiles, so neither applying nor reverting touches them. On an MDM-managed device the mode comes from the managed policy, so ask your admin to remove it rather than running the revert yourself.

Can it route apps with no base-URL setting?

No. anyray-connect routes only through a supported application seam. Apps and sessions that hard-code their endpoint — the ChatGPT desktop app, Claude Desktop's first-party subscription lane, and native Copilot/Devin/Cursor surfaces — stay untouched.

For optimization, use a supported carrier: Claude Code, the Codex app/CLI/IDE, the Anyray VS Code provider, or the pinned ACP agents in JetBrains/Devin. Cursor native subscription mode can still optimize its supported local Shell/MCP context through hooks, but not the hidden model request. Cursor's Team pricing page advertises Admin API usage statistics, while its API overview marks the Admin API Enterprise-only. Confirm an admin:* key can call /teams/filtered-usage-events before configuring the subscription spend connector. Copilot Individual has no official automatic native-usage connector. No CA, DNS rewrite, or machine-level proxy is involved.

I lost the enrollment link — can the admin re-send it?

Yes. The admin opens the user roster in the observability console (the Users page), finds your row, and clicks Regenerate. This mints a brand-new link (enl_…) — the old one is discarded and will no longer work. The new link is shown once; copy it before closing. Ask the admin to share it with you, then run:

npx anyray-connect --enroll https://…/enroll/enl_…

The row stays in the roster; your status resets to Pending until you re-enroll.

Enrolled already?

If you already ran --enroll with the original link, your DevCert and personal gateway key are still valid until they expire. You only need to re-enroll if the admin regenerated a link and you need to bind a new keypair (e.g. on a new machine).

What does the admin see in the user roster?

The observability console's Users page shows a user roster — one row per user (keyed by email + deployment). Each row shows:

Field	Meaning
Status	Enrolled — user ran `--enroll`; a personal key was issued. Pending — link minted, not yet enrolled. Revoked — admin revoked the active link. Expired — link TTL passed without enrollment.
Source	Identity provenance: `manual`, `provisioning`, `sso`, `scim`, `email`, `mdm`, or `service`.
Type	`person` for a human identity or `service` for a machine/service account.

Per-row actions:

Regenerate (manual rows only) — mints a new link and discards the old one. Use this when a user lost their link or is moving to a new machine. The link is shown once.
Revoke (manual rows with an active link) — prevents further use of the current link. Already-issued keys continue to work until they expire.
Delete — removes the row from the roster. This offboards no one. The personal key does lapse on its own schedule (ANYRAY_VERIFIED_DEV_KEY_TTL_DAYS), but the DevCert behind it does not: while the machine stays in use the certificate rolls forward, so it just mints another key. To cut access off, revoke the user (POST /admin/revoked-users), then their live client key (DELETE /admin/client-keys/:id) for an immediate stop. An MDM-provisioned user whose provisioning token is still active will re-appear in the roster on their next request.

Can our IdP provision and offboard users with SCIM?

Yes. Configure the gateway's static SCIM bearer, admin group, and group-to-team map, then point Okta, Microsoft Entra ID, or another SCIM 2.0 client at https://<your-anyray-gateway>/scim/v2. User and group pushes populate the roster with source scim; group membership sets team and admin role.

An active:false user patch or SCIM user deletion is an immediate, durable offboarding signal. Every gateway replica checks it while verifying a client key, and denies access if the identity store cannot be read. See SSO enrollment → Provision users directly with SCIM.

Can I enroll all users at once with our MDM?

Without SSO, mint one provisioning token (enp_…) and push it fleet-wide via Jamf, Intune, or any MDM. Each machine runs curl -fsSL https://app.anyray.ai/connect.sh | sh -s -- --enroll https://app.anyray.ai/enroll/enp_… --user <email> --yes once; the gateway issues a personal key bound to that user's email (or a stable machine id if you use --machine mode). With SSO configured, push the credentialless SSO handoff from Users → Bulk (MDM) instead. No per-user invite loop is required.

See Bulk enrollment with your MDM for the full setup — token minting, MDM command templates, and revoke/rotate semantics.

Can operators sign in to the console with SSO and roles?

Yes — and it's separate from user enrollment. Enterprises can enable console SSO so operators sign in with their corporate IdP (brokered by WorkOS) instead of pasting the admin key, and each gets a role (viewer < auditor < operator < security_admin < owner) that gates which /admin/* actions they can take — a viewer reads the spend dashboard but can't rotate provider keys, change the content mode, mint client keys, or change optimizer settings. The break-glass ANYRAY_ADMIN_TOKEN still works and maps to owner.

Don't confuse the two SSO flows: console SSO signs operators into the admin console (via get /admin/sso/start → /admin/sso/callback), while anyray-connect --sso (user SSO enrollment) signs users in to mint a /v1/* client key. See Console access and RBAC.

Does streaming still work? Do tool calls still work?

Yes. Streaming, tool/function calls, and the response shape are the gateway's responsibility and are unchanged. The optimizer only rewrites the request (params/messages/tools), or serves a cache hit.

Will it add latency?

Generally no, often less. Cache hits skip the provider. The one added step is the /v1/optimize call, designed to be fast and fails open (800 ms timeout) if the optimizer is slow or down. Spend is recorded by the gateway's own store and never blocks your response.

What happens if the optimizer is down?

Your request passes through unchanged to the requested model. An optimizer outage means "no optimization right now," not "inference is broken."

Something's broken — how does Anyray support debug a self-hosted deployment?

Without seeing it. Start with get /admin/health, which names the failing leg. If support needs more, an admin generates a support bundle (get /admin/support/bundle) — a one-shot diagnostic snapshot (versions, health probes, redacted config, optimizer strategy config, request/error telemetry) and sends us the JSON. Secrets are redacted and no prompt/response content is ever included.

If only your machine misbehaves, run anyray-connect doctor --json — the client-side counterpart: tool config, enrollment state, and gateway reachability as a report (add --verify to probe auth end-to-end, or --repair to re-apply tools whose routing a tool update wiped — a tool you reverted on purpose is never touched). See Troubleshooting & support bundles.

What happens if the gateway itself is down?

Your tools stop working until it's back. Every request goes through the gateway — that's what makes spend governable — and nothing on the developer's machine routes around it. The key-refresh daemon relays to the gateway and no further; when the gateway doesn't answer, the tool sees the failure.

An earlier opt-in mode (anyray-connect --fallback) did fail over straight to the provider, but it needed the org's provider key on every developer machine, so it was removed. Run the gateway with more than one replica if you need the outage window closed.

Does my prompt data leave the company?

Anyray is self-hosted. Prompt/response content travels only on the customer-selected inference path: through the self-hosted gateway/optimizer and to the selected model provider. Locally captured content is stored per the org content mode — encrypted by default (AES-256-GCM), off, or deploy-gated plaintext; the spend store and logs are always metadata only. Content is never sent to Anyray's Billing app. Human access follows the selected mode and RBAC: the console never decrypts encrypted content, off stores none, and only authorized roles can read trace bodies in the explicitly deploy-gated plaintext mode. The only automatic egress to Anyray is usage metering — counts and aggregates, never your content. If an admin enables a subscription spend connector, the gateway also makes customer-authorized, metadata-only HTTPS requests to that selected vendor's official admin API.

How do I see what happened to my request?

Traces are stored locally. In the default encrypted mode their prompt/response content is AES-256-GCM ciphertext at rest; off omits it, and only deploy-gated plaintext stores readable content. Console list views remain metadata-only (model, provider, tokens, cost, latency), while trace-body access additionally requires the content-read capability and follows the selected mode.

How do we measure savings instead of trusting estimates?

Admins can enable the optimizer's audited holdout by reading get /admin/optimizer/settings, adding this top-level block to the returned config, and writing the full config back through put /admin/optimizer/settings:

{
  "holdout": { "enabled": true, "fraction": 0.05, "windowSeconds": 604800 }
}

Attributable requests are assigned deterministically by user, then team, then session. Holdout requests are forwarded byte-identical, while treated requests optimize normally; the gateway writes the optimizer-assigned cohort tag into the spend row. The admin settings response shows per-arm request counts, input-token estimates, holdoutShare, and average input tokens per request so operators can check cohort balance before trusting a downstream delta.

What base URL do I point my SDK at, and which endpoints work?

Point your SDK at the gateway — http://localhost:8787/v1/... (or your gateway's host) — exactly as you would at OpenAI. It's OpenAI-compatible (/v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models, …) and also speaks Anthropic natively at /v1/messages (plus /v1/messages/count_tokens). The auxiliary endpoints — /v1/images/*, /v1/audio/*, /v1/responses, /v1/files, and /v1/batches (async Batch API) — are served only for the openai and azure-openai providers. See the API reference and the gateway.

Does Anyray support the async Batch API?

Yes. The gateway proxies the OpenAI-compatible async Batch API (/v1/batches plus the /v1/files upload it relies on) for the openai and azure-openai providers — the usual upload-input → create-batch → poll-output flow. Batch usage is attributed like any other traffic: each line of a batch's output (GET /v1/batches/<id>/output) becomes one spend row (model, tokens, cost, status). And whenever the optimizer is configured, bulk jobs get the same before-request optimization the chat path does — each request in a purpose=batch input file is rewritten before it's uploaded, so a batch doesn't pay full token cost. See Configure → Batch API input.

Which providers can I call through it?

It's multi-provider out of the box — openai, anthropic, bedrock, vertex-ai, azure-openai, groq, deepseek, mistral-ai, nebius, openrouter, and more — with Anthropic, Vertex, and Bedrock spoken natively (no separate adapter). You reach any of them through the one OpenAI-compatible base URL. openrouter is itself an OpenAI-compatible aggregator, so you can front its 250+ models (addressed as anthropic/claude-3.5-sonnet, openai/gpt-4o, …) with just an OpenRouter key. nebius (Nebius AI Studio) is a first-class provider too — save its key and its base URL is built in, so you don't need the generic anyray-upstream custom host. See the gateway.

What API key do I put in my SDK?

A minted personal Anyray client key (e.g. ark_…), never a real provider key. Your provider keys (OpenAI, Anthropic, …) live server-side in the gateway and are never exposed to clients — the gateway attaches the real key when it forwards your call. See server-held provider keys.

How do I attribute requests to a user or team?

Send the x-anyray-metadata header with fields like user, team, and session. It drives spend attribution, per-user caps, and rule-based strategy overrides — and never carries prompt or response content. See Configure.

Will tool pruning drop my MCP tools?

No. Namespaced MCP tools (mcp__…) are never pruned — tool_pruning only trims non-MCP tools whose name and description both share nothing with the conversation. (A separate, default-on strategy, tool_schema_compression, shrinks the descriptions of the tools that are kept — including MCP tools — without dropping any. It edits only the tools block, is lossless with its default params, and rewrites to the same bytes every turn, so it is cache-neutral and keeps running on prompt-cached and subscription traffic without needing a pin — tool_pruning runs there via a decide-once/replay session pin, and is suppressed only when no pin can be minted.) See the strategy menu.

When Anyray trims a long tool output, can the model get the detail back?

It depends on the client. anyray-connect can shorten eligible output at the source, before a coding client commits that output to its transcript. General reversible trimming is disabled for persistent coding-client transcripts because an MCP retrieval result is appended as a new turn; it cannot replace the earlier shortened entry and could make the next request larger.

The tool stays out of context until used (ENABLE_TOOL_SEARCH), so it costs nothing until a retrieval is actually needed. Under the hood it hits the gateway's POST /connect/retrieve, which proxies the in-network optimizer's /v1/retrieve. The retrieved content goes straight back to the model that asked — never logged (the privacy invariant holds). At session start the server also pings the gateway's POST /connect/mcp-heartbeat (client-key gated, empty body) so the gateway knows explicit recovery is available. That heartbeat does not override the persistent-client safety gate. The tool can recover an existing handle, including one created by an older session or deployment. Run --revert to remove the server.

If the · retrieve ctx_… marker has already scrolled out of view, a second tool — anyray_recall("describe the output") — finds it by meaning: it searches the durable stash semantically and returns ranked ctx_… handles with short previews, which you then pass to anyray_retrieve. Recall needs the durable stash, so it's available when ANYRAY_CONTENT_MODE keeps content (it returns no matches when content mode is off); it proxies the optimizer's /v1/recall via POST /connect/recall. The query and previews go straight back to the asking model and are never logged.

Humans get the same read path: anyray-connect retrieve ctx_… prints the stashed original alone on stdout (pipe or redirect it anywhere), and anyray-connect recall "what to find" --k 5 prints ranked handle / score / preview lines. Both send the x-anyray-retrieve-probe header, so a developer peeking at a marker never latches model retrieve capability for the key.

Is response caching on by default?

Yes — semantic_cache is default-on. On a duplicate request it serves the stored response and skips the provider (non-streaming, for callers that can short-circuit) — faster and free. The volatile-normalized cache tier is measurement-only; it records would-have-hit precision but never serves a response. See the strategy menu.

How are the token-savings numbers estimated?

Two different numbers, two different sources. The cost of each request is exact — it's priced from your provider's own reported usage (prompt / completion / cache tokens) at the official per-model rate. Only the saved-token delta — what the removed text would have cost — is estimated, because the optimizer computes it before the provider responds, on text it's removing.

How the delta is counted depends on the model family:

OpenAI models — exact local BPE counts: GPT-4o / 4.1 / 5 and the o-series use o200k_base; GPT-4 / 3.5 use cl100k_base.
Everything else (Claude, Gemini, …) — no exact local tokenizer exists, so Anyray uses a chars→tokens heuristic calibrated per request: it scales the estimate by the ratio of the provider's reported prompt tokens to the heuristic count of that same prompt. Still an estimate, but anchored to how the model actually tokenized this request rather than a fixed ~3.3 chars/token divisor. When the ratio can't be trusted (no usage returned, or a large inline image/audio payload dominates the body), it falls back to the plain heuristic.

Both paths run once, post-response, off the request path, from data already in hand — no provider token-count call, no added latency, no inference spend. The same number feeds both the billing spend record and the observability dashboard, so the two always agree.

Because the per-request calibration replaces the older fixed-divisor estimate, non-OpenAI saved-token figures shift (up or down per model) versus prior releases; OpenAI figures are unchanged.

One intentional exception either way: window_budget uses the shared CJK-aware estimator with a configurable charsPerToken calibration (default 3.3) for cropping safety — that calibration decides which whole messages get evicted, so it stays operator-tunable and separate from post-response savings calibration.