
What is Anyray?

Anyray cuts the AI-inference spend that your organization's own employees generate. Your developers run AI coding assistants like Claude Code, your teams build and run internal agents and SDK apps, and your jobs call LLMs all day, across laptops and clusters. That spend grows fast, sprawls across providers, and is hard to see or govern.

Anyray is a self-hosted gateway, console, and optimizer that sits on the path of every one of those requests and serves it as cheaply as it can without changing the answer your people expect. It runs a configurable pipeline of optimization strategies that each org enables for its own workloads, and it meters every request so you can see and attribute internal AI spend. The whole system comes up with one docker compose up.

:::note Fully self-hosted: your data never leaves your environment
Everything runs inside your own environment: the gateway, the console, the optimizer, and the trace store. Your prompts and responses never leave it: there is no vendor egress at all, no billing aggregate sent anywhere, no SaaS. Self-hosting is the boundary; the deployment is air-gapped by construction. Content is encrypted at rest by default (AES-256-GCM), so humans see ciphertext unless an authorized, offline audit decrypts it. See the data boundary.
:::

:::tip The main use case: your workforce's AI usage
Anyray is built first for internal, employee-driven AI consumption (the coding assistants, agents, chat, and SDK apps your staff use day to day), not for reselling a public API. The point isn't to police anyone: by making that spend sustainable, Anyray is what lets an org keep saying yes to AI tools instead of rationing them. Requests are redirected zero-touch: workers just point their SDK base URL at the Anyray gateway, and nobody changes a line of application code. Optimizing a customer-facing product's API bill works too, but it's not the headline.
:::
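As a rough illustration of what "point their SDK base URL at the gateway" can look like, the sketch below builds the environment variables a tool would be launched with. `OPENAI_BASE_URL` and `ANTHROPIC_BASE_URL` are the variables the official OpenAI and Anthropic SDKs read; the gateway address and the `/v1` suffix are assumptions for illustration, not values taken from Anyray's documentation.

```python
import os
from typing import Dict, Optional

# Hypothetical gateway address; substitute the host and port of your own deployment.
GATEWAY_URL = "http://anyray.internal:4000"


def gateway_env(gateway_url: str, base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with SDK base URLs pointed at the gateway."""
    env = dict(os.environ if base_env is None else base_env)
    root = gateway_url.rstrip("/")
    # Read by the official OpenAI SDKs. The /v1 suffix is an assumption about
    # where the gateway exposes its OpenAI-compatible routes.
    env["OPENAI_BASE_URL"] = f"{root}/v1"
    # Read by the Anthropic SDKs and by Claude Code.
    env["ANTHROPIC_BASE_URL"] = root
    return env


if __name__ == "__main__":
    # Print shell exports a worker could paste into a profile script.
    for name, value in gateway_env(GATEWAY_URL, base_env={}).items():
        print(f"export {name}={value}")
```

Because only the environment changes, application code that constructs an SDK client with default settings stays untouched.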

:::tip An optimization platform, not a single trick
Parameter tuning, prompt compression, and tool pruning are each one of Anyray's strategies, and model routing is a capability of the gateway itself. The goal is to be the best optimization layer there is: one place that runs many strategies together, improves them on your own traffic, and gains new ones continuously as better techniques emerge anywhere in the field. Any effective optimization becomes another strategy in the pipeline. See Optimization strategies and Adaptive optimization.
:::

:::info Status
What's real today: the gateway (multi-provider proxy/router + content-free spend store + content privacy + trace export), the optimizer service (optimizer/) running a registry of ten strategies over Optimizer Protocol v1, the Anyray console (Spend / Traces / Sessions / Optimizer / Privacy), one-compose deploy, a single admin key, and encrypted-at-rest content. The ten strategies are param_tuning, prompt_compression, context_compression, code_skeleton, window_budget, and tool_pruning (default-on), plus code_graph, relevance_filter, vision_ocr, and semantic_cache (default-off). Everything else is marked roadmap on the page that describes it. We document only what's real; see the honesty note in gateway/README.md.
:::
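To make the default-on / default-off split concrete, here is a minimal sketch of how a per-org pipeline could be resolved from those defaults. The strategy names and their defaults come from the status note above; the `enabled_strategies` function and the shape of the override mapping are hypothetical and do not describe Anyray's actual configuration format.

```python
# Strategy names and defaults as listed in the status note.
DEFAULT_ON = (
    "param_tuning", "prompt_compression", "context_compression",
    "code_skeleton", "window_budget", "tool_pruning",
)
DEFAULT_OFF = ("code_graph", "relevance_filter", "vision_ocr", "semantic_cache")


def enabled_strategies(overrides=None):
    """Resolve which strategies run, given hypothetical per-org overrides.

    `overrides` maps a strategy name to True (enable) or False (disable).
    """
    overrides = overrides or {}
    known = DEFAULT_ON + DEFAULT_OFF
    unknown = set(overrides) - set(known)
    if unknown:
        raise ValueError(f"unknown strategies: {sorted(unknown)}")
    # Start from the shipped defaults, then apply the org's choices on top.
    state = {name: True for name in DEFAULT_ON}
    state.update({name: False for name in DEFAULT_OFF})
    state.update(overrides)
    return [name for name in known if state[name]]


if __name__ == "__main__":
    # An org that opts into semantic caching and out of tool pruning.
    print(enabled_strategies({"semantic_cache": True, "tool_pruning": False}))
```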

The one-paragraph model

your employees' AI usage                                  providers
(Claude Code · agents ·   ─▶  ┌───────────────────┐  ──▶  OpenAI · Anthropic
 SDK apps · jobs)             │  ANYRAY GATEWAY   │       Bedrock · Vertex AI
                              │  (transport +     │  ◀──  Azure · …
                              │  routing + spend) │
                              └─────────┬─────────┘
                                        │  /v1/optimize · /v1/optimize-response
                                        ▼  (hook calls, off the forwarding path)
                       ╔════════════════════════════════╗
                       ║        ANYRAY OPTIMIZER        ║  ◀── the optimization core
                       ║  a configurable pipeline of    ║
                       ║  optimization strategies:      ║
                       ║    param tuning · prompt &     ║
                       ║    context compression · code  ║
                       ║    skeleton/graph · relevance  ║
                       ║    filter · window budget ·    ║
                       ║    tool pruning · vision OCR · ║
                       ║    semantic cache              ║
                       ╚════════════════════════════════╝

The default deployment is the Anyray gateway — a multi-provider, OpenAI-compatible proxy/router that already speaks Anthropic, Bedrock, Vertex AI, Azure, OpenAI and more natively. The optimizer is a separate, gateway-neutral hook backend that the gateway calls off the forwarding path through a thin adapter — so the same optimizer can plug into other gateways too.
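The division of labour between gateway and optimizer can be sketched as follows. Only the two hook paths, `/v1/optimize` and `/v1/optimize-response`, come from the diagram; the payload shapes, the `handle` function, and the injected `forward` and `call_hook` callables are hypothetical and stand in for whatever a real adapter does. In particular, what the response hook returns is an assumption here.

```python
from typing import Callable, Dict

Payload = Dict[str, object]  # hypothetical request/response shape


def handle(
    request: Payload,
    forward: Callable[[Payload], Payload],
    call_hook: Callable[[str, Payload], Payload],
) -> Payload:
    """Walk one request through a gateway adapter (illustrative only)."""
    # Request hook: the optimizer may hand back a cheaper, equivalent request.
    optimized = call_hook("/v1/optimize", request)
    # The gateway, not the optimizer, is the component that talks to the provider.
    response = forward(optimized)
    # Response hook: the optimizer gets to see (and possibly adjust) the result.
    return call_hook("/v1/optimize-response", response)


if __name__ == "__main__":
    seen = []

    def stub_hook(path: str, payload: Payload) -> Payload:
        seen.append(path)      # record the call order
        return payload         # a no-op optimizer

    def stub_provider(req: Payload) -> Payload:
        return {"echo": req["prompt"]}

    print(handle({"prompt": "hi"}, stub_provider, stub_hook), seen)
```

An adapter for a different gateway would only need to supply its own `forward` and `call_hook`, which is the sense in which the optimizer is gateway-neutral.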

Two ways to reach the optimizer

| You run the Anyray gateway (default) | You already run another gateway |
| --- | --- |
| Deploy Anyray: its multi-provider gateway calls the optimizer through its built-in adapter. Implemented. | Wire the optimizer in via an adapter for your gateway (LiteLLM reference adapter shipped; Kong, Envoy, Cloudflare, and Portkey are stubs on the roadmap). |
| See Anyray's own gateway. | See Gateways. |

Find your path

  • You install and operate Anyray for your org → Org Admin
  • Your app calls LLMs and you want to know what changes → Developers
  • You just want to point your tools at the gateway → Connect your tools (npx anyray-connect)
  • You need to wire Anyray into a specific gateway → Gateways
  • You run on GCP / AWS / Azure → Cloud Providers
  • You want the mental model first → Concepts

Why it's safe to put on the request path

Three properties make Anyray low-risk to adopt:

  1. The optimizer fails open. The gateway calls the optimizer with a hard 800ms timeout and forwards the original request on any error or timeout — a broken or slow optimizer can never break or alter inference. The worst case is "you paid full price," never "you got a worse answer."
  2. Zero code change for developers. Workers point their SDK base URL at the gateway; no app edits. See Developers.
  3. Your data never leaves. Everything is self-hosted in your own environment with no vendor egress at all; content is encrypted at rest by default. See the data boundary.
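The fail-open behaviour in point 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Anyray's gateway code: `optimize_or_passthrough` and the thread-pool approach are hypothetical, and only the 800 ms budget and the "forward the original on any error or timeout" rule come from the text.

```python
from concurrent.futures import ThreadPoolExecutor

OPTIMIZER_TIMEOUT_S = 0.8  # the 800 ms budget from point 1

# A small shared pool so a slow optimizer call never blocks the caller's thread.
_pool = ThreadPoolExecutor(max_workers=4)


def optimize_or_passthrough(request, optimize, timeout_s=OPTIMIZER_TIMEOUT_S):
    """Return the optimizer's result, or the original request on any failure."""
    future = _pool.submit(optimize, request)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Timeout, connection error, or a bug inside the optimizer: fail open.
        # cancel() only helps if the call has not started; a running call is
        # simply abandoned and its result ignored.
        future.cancel()
        return request


if __name__ == "__main__":
    original = {"model": "example-model", "messages": []}

    def broken_optimizer(_request):
        raise RuntimeError("optimizer is down")

    # The original request object comes back unchanged.
    print(optimize_or_passthrough(original, broken_optimizer) is original)
```

The design choice worth noting is that every failure mode collapses to the same outcome: the unmodified request is forwarded, so the cost of an optimizer outage is a missed saving rather than a failed call.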

Worried it'll slow you down or hurt answer quality? You don't have to take our word — Anyray proves it on your own traffic in observe-only Shadow Mode, keeps a control holdback as live ground truth, and is one kill switch away from off. See Proof, not promises.