Vertex + Claude Code (Quickstart)

One quick way to get Anyray value if your team uses Claude Code against Claude on Google Vertex. You stand up one shared stack inside your own GCP project — LiteLLM (which already speaks Vertex and the Anthropic API) plus the Anyray optimizer — and your org's workers point ANTHROPIC_BASE_URL at it.

The gateway is a configurable choice. Anyray's own gateway now speaks Vertex/Anthropic natively, so you can point Claude Code straight at it (:8787) — see which gateway. This page walks the LiteLLM alternative front door (Anthropic in, Vertex out), which remains a valid option.

Every Claude Code request is then cost-attributed, runaway max_tokens is capped, and there's a fail-safe so Claude Code keeps working even if Anyray is down.

Vertex credentials and prompts never leave your project — the Anyray optimizer only decides and never takes custody of a Google key. See How the integration works for the moving parts and the life of a request.

:::info What this MVP optimizes today

Spend attribution — every request priced and attributed by model, in the content-free spend store. (Per-team attribution via x-anyray-metadata is the gateway's job; richer per-app attribution is roadmap — see Where you see spend & savings.)
param_tuning — caps max_tokens at a configurable ceiling (a guardrail, not an aggressive trim).
Fail-safe — if the optimizer is unreachable, slow, or errors, the request passes through unchanged at full price. Claude Code never breaks.

Model routing (e.g. routing cheap turns to a smaller Claude) is the gateway's job, and automated classify-then-route is on the roadmap. The day-one win is visibility + a safety guardrail. :::

:::note How workers are redirected Integration is config-based: each worker sets ANTHROPIC_BASE_URL itself (an env var, a shell profile, or a CI secret) — exactly like adopting LiteLLM. There is no agent, no admission webhook, no org CA, and no TLS-MITM. :::

How the integration works

You deploy one stack — two containers from a single docker compose up — on a host in your GCP project. Each developer's Claude Code is pointed at it, and it calls Vertex on their behalf using the host's GCP credentials.

  ┌─────────────────────┐         ┌──────────────────────────────────────┐        ┌──────────────┐
  │  DEV MACHINE         │         │  ONE HOST IN THE ORG'S GCP PROJECT     │        │  VERTEX AI   │
  │  (each developer)    │         │  (a GCE VM, `docker compose up`)       │        │  (Claude)    │
  │                      │         │                                        │        │              │
  │  Claude Code         │         │  ┌────────────┐   /v1/optimize         │        │              │
  │  ANTHROPIC_BASE_URL ─┼────────▶│  │  LiteLLM   │──▶┌───────────────┐    │        │              │
  │   = http://gw:4000   │ Anthropic│  │ (the "AI   │   │ Anyray optimizer│   │        │              │
  │                      │/v1/messages│  gateway")  │◀──│  (decisions)  │    │        │              │
  │                      │         │  │   :4000    │   │    :8088      │   │        │              │
  │                      │◀────────┼──│            │────────────────────────────────▶│  inference   │
  │   normal output      │ response │  └────────────┘   uses the HOST's GCP       │  happens here │
  └─────────────────────┘         │   creds (service account / ADC)        │        └──────────────┘
                                   └──────────────────────────────────────┘

The three pieces

Claude Code runs on each developer's machine (the client). It does not run on Vertex — it just sends Anthropic /v1/messages requests to your gateway instead of to Anthropic's API.
The gateway = LiteLLM + the Anyray optimizer — two containers from one docker compose. LiteLLM speaks the Anthropic API in (so Claude Code can talk to it) and calls Vertex out. The Anyray optimizer is a sidecar that makes the optimize decision; it's credential-free and internal (:8088, not exposed). Spend is recorded in a content-free spend store.
Vertex AI is where Claude inference actually runs, in your GCP project. The gateway calls it with the host's service account, so no Google key ever touches a dev machine.

Life of one request

A dev runs claude → Claude Code sends an Anthropic /v1/messages request to the gateway (:4000), not to Vertex or Anthropic directly.
LiteLLM receives it → the Anyray pre-call hook calls the optimizer POST /v1/optimize (e.g. caps a runaway max_tokens).
LiteLLM forwards the request to Vertex, signed with the host's GCP credentials; Claude runs the inference.
The response flows back through LiteLLM to the dev's Claude Code — they see normal output.
The Anyray post-call hook calls POST /v1/optimize-response; the cost is recorded in the content-free spend store (there is no separate meter() call).
If the optimizer is ever down or slow, LiteLLM still serves the request unchanged (fail-open) — devs are never blocked.

So Anyray doesn't sit beside the traffic — traffic flows through the gateway, which is the only way it can both trim requests and attribute spend.

Two config touchpoints, nothing else

Admin, once — the Vertex connection in .env (VERTEX_PROJECT, VERTEX_LOCATION), then docker compose up (Steps 2–3).
Each developer, once — point Claude Code at the gateway with ANTHROPIC_BASE_URL (Step 4).

Prerequisites

A GCP project with the Claude models enabled in Vertex AI Model Garden, and the region that serves them (e.g. us-east5). Availability is region-specific.
A GCE VM in that project with Docker, and a service account attached to the VM with the Vertex AI User role. On GCE, LiteLLM picks this up automatically via Application Default Credentials — no key file to distribute.
Your org's machines can reach the VM on :4000.

Step 1 — Put the stack on the host

SSH to the VM, then:

git clone https://github.com/anyrayHQ/monorepo.git
cd monorepo/optimizer/adapters/litellm/example-vertex

Step 2 — Configure

cp .env.example .env

Edit .env:

VERTEX_PROJECT=your-gcp-project-id
VERTEX_LOCATION=us-east5          # the region serving Claude in your project

No credentials file is needed on GCE — the VM's attached service account is used. (Off-GCE? See Run it on a laptop first.)

:::info No Anthropic key — proxy is keyless by default There is no Anthropic API key and no Anthropic account in this setup — Vertex Claude is billed entirely through GCP. The proxy also ships keyless, so there's no shared secret to manage; you secure :4000 at the network layer (next step). If you prefer a shared token guarding the endpoint, set LITELLM_MASTER_KEY in .env and uncomment master_key in litellm_config.yaml — see Securing the endpoint. :::

Step 3 — Start it

docker compose up -d --build
docker compose ps          # litellm Up (:4000), anyray-optimizer Up (internal)

Confirm the optimizer is alive:

docker compose logs anyray-optimizer | tail -1
# {"level":30,...,"module":"optimizer/server","port":8088,"msg":"optimizer listening on :8088"}

:::warning Lock down the ports Expose only :4000 to your org's network (firewall it to trusted CIDRs). The optimizer's :8088 is internal — the compose file keeps it off the host; don't open it publicly. :::

Step 4 — Point your workers at it

A developer who already uses Claude Code sets one variable (shell profile, or your CI/secret manager):

export ANTHROPIC_BASE_URL=http://<anyray-host>:4000

That's it — no API key, no token. Claude Code reuses its existing login and routes to your endpoint instead of Anthropic's. They then use it exactly as before — same replies, same speed; the optimization and metering happen invisibly.

:::note No Anthropic key needed — and usually no token either Verified with Claude Code 2.1: a logged-in client needs only ANTHROPIC_BASE_URL. You add a credential env in just two cases, and it is never an Anthropic key:

Not logged in / CI / fresh container — to skip Claude Code's interactive login, set ANTHROPIC_AUTH_TOKEN=lever (any placeholder; the keyless proxy ignores it).
You enabled a proxy token (Securing the endpoint) — set ANTHROPIC_AUTH_TOKEN to that token.

Use ANTHROPIC_AUTH_TOKEN (the gateway Authorization: Bearer slot), not ANTHROPIC_API_KEY. :::

Step 5 — Verify Anyray is attributing spend

On the host, after a few Claude Code requests have gone through, query the content-free spend store (admin-gated by ANYRAY_ADMIN_TOKEN):

curl -s http://localhost:8787/admin/spend \
  -H "Authorization: Bearer $ANYRAY_ADMIN_TOKEN" | jq .
# rows priced and attributed by model; status ok for all.

Where you see spend & savings

The gateway records one content-free row per request in the spend store (who/team, model, provider, tokens, cost, latency, status — never content). The admin-gated summary at GET /admin/spend powers the Spend page in the Anyray console, rolled up per model and per user/team:

curl -s http://localhost:8787/admin/spend \
  -H "Authorization: Bearer $ANYRAY_ADMIN_TOKEN" | jq '.byModel'
# claude-sonnet-4-20250514 → reqs 42, spend $0.5810
# claude-3-5-haiku-20241022 → reqs 18, spend $0.0040

cost — what each request actually cost on Vertex (priced from token usage).
attribution is per model and per user/team (from x-anyray-metadata). Richer per-app/developer attribution is roadmap.

Tuning the `max_tokens` guardrail

The param_tuning strategy caps max_tokens at a configurable ceiling. It's deliberately set above normal Claude Code output sizes, so it only clips pathological or runaway requests and won't truncate real coding work.

Strategies are configured in optimizer.config.json — an ordered strategies[] of {kind, enabled, params} (array order = execution order). You edit it from the console Optimizer settings page, or via the admin API (PUT /admin/optimizer/settings, gated by ANYRAY_ADMIN_TOKEN):

Lower the cap (e.g. 8192) in the param_tuning strategy's params to trade output headroom for more savings — only after confirming your team's prompts don't legitimately need longer outputs.
Attribution-only: set every strategy's enabled: false to leave requests completely untouched and just measure spend.

Changes are runtime-mutable and audit-logged; no restart needed.

Data boundary

Vertex credentials stay on the one host (the VM's attached service account); they're never distributed to workers and the optimizer never sees them.
Prompts and responses stay inside your GCP project — they flow worker → LiteLLM → Vertex, all in-project. The optimizer only receives the request shape it needs to decide, and the spend store records content-free metadata only. Anyray is fully self-hosted — nothing leaves your environment, and stored content is encrypted at rest by default.

See Concepts → Data boundary.

Securing the endpoint

The proxy is keyless by default, so secure it at the network layer: run it inside your VPC and firewall :4000 to trusted CIDRs (your office/VPN, GKE node ranges, CI). The optimizer's :8088 is internal-only.

If the endpoint isn't tightly isolated, add a shared proxy token:

In .env, set LITELLM_MASTER_KEY=<strong-random-string>.
In litellm_config.yaml, uncomment the general_settings.master_key block.
docker compose up -d, then have workers set ANTHROPIC_AUTH_TOKEN to that token.

Either way, this token guards your proxy — it is never an Anthropic credential and never leaves your environment.

Optional: try it on a laptop first

To smoke-test off-GCE before deploying, in docker-compose.yml uncomment the credentials mount + GOOGLE_APPLICATION_CREDENTIALS, and provide creds via either a downloaded SA key or gcloud auth application-default login. Then ANTHROPIC_BASE_URL=http://localhost:4000.

If a model isn't priced

The model id Claude Code sends must (1) be matched by a model_name in litellm_config.yaml and (2) be priced in the optimizer's pricing table. The config ships with Opus 4, Sonnet 4, 3.7 Sonnet, and 3.5 Haiku — all already priced. If your team uses another Claude version, add it in both places (and confirm the Vertex prices against your contract). An unpriced model still works — it just passes through with no cost attributed.

Which gateway? (why LiteLLM here)

The gateway is a configurable choice, and Anyray's own gateway is the implemented default. It now speaks Vertex/Anthropic natively, so you can point Claude Code straight at the Anyray gateway (:8787) and skip LiteLLM entirely.

This page documents the LiteLLM alternative because LiteLLM is a mature, widely-deployed front door that speaks the Anthropic /v1/messages API in and Vertex out — a fine choice if you already run it or prefer it. Either way the optimizer and spend store are the same; only the front door differs.

How the integration works​

The three pieces​

Life of one request​

Two config touchpoints, nothing else​

Prerequisites​

Step 1 — Put the stack on the host​

Step 2 — Configure​

Step 3 — Start it​

Step 4 — Point your workers at it​

Step 5 — Verify Anyray is attributing spend​

Where you see spend & savings​

Tuning the max_tokens guardrail​

Data boundary​

Securing the endpoint​

Optional: try it on a laptop first​

If a model isn't priced​

Which gateway? (why LiteLLM here)​

See also​