Vertex + Claude Code (Quickstart)
A quick way to get value from Anyray if your team runs Claude Code against Claude on
Google Vertex AI. You stand up one shared stack inside your own GCP project:
LiteLLM (which already speaks both the Anthropic API and Vertex) plus the Anyray
optimizer. Your org's workers then point ANTHROPIC_BASE_URL at it.
The gateway is a configurable choice. Anyray's own gateway now
speaks Vertex/Anthropic natively, so you can point Claude Code straight at it
(:8787); see Which gateway? (why LiteLLM here). This page walks through the
LiteLLM alternative front door (Anthropic in, Vertex out), which remains a valid option.
Every Claude Code request is then cost-attributed, runaway max_tokens is capped,
and there's a fail-safe so Claude Code keeps working even if Anyray is down.
Vertex credentials and prompts never leave your project — the Anyray optimizer only decides and never takes custody of a Google key. See How the integration works for the moving parts and the life of a request.
:::info What this MVP optimizes today
- Spend attribution: every request is priced and attributed by model in the content-free spend store. (Per-team attribution via x-anyray-metadata is the gateway's job; richer per-app attribution is on the roadmap. See Where you see spend & savings.)
- param_tuning: caps max_tokens at a configurable ceiling (a guardrail, not an aggressive trim).
- Fail-safe: if the optimizer is unreachable, slow, or errors, the request passes through unchanged at full price. Claude Code never breaks.
Model routing (e.g. routing cheap turns to a smaller Claude) is the gateway's job, and automated classify-then-route is on the roadmap. The day-one win is visibility plus a safety guardrail.
:::
:::note How workers are redirected
Integration is config-based: each worker sets ANTHROPIC_BASE_URL itself (an env var,
a shell profile, or a CI secret) — exactly like adopting LiteLLM. There is no agent, no
admission webhook, no org CA, and no TLS-MITM.
:::
How the integration works
You deploy one stack — two containers from a single docker compose up — on a host in
your GCP project. Each developer's Claude Code is pointed at it, and it calls Vertex on
their behalf using the host's GCP credentials.
┌──────────────────────┐                ┌──────────────────────────────────────────────────┐      ┌────────────────┐
│ DEV MACHINE          │                │ ONE HOST IN THE ORG'S GCP PROJECT                │      │ VERTEX AI      │
│ (each developer)     │                │ (a GCE VM, `docker compose up`)                  │      │ (Claude)       │
│                      │   Anthropic    │                                                  │      │                │
│ Claude Code          │  /v1/messages  │ ┌────────────┐              ┌──────────────────┐ │      │                │
│ ANTHROPIC_BASE_URL   │───────────────▶│ │ LiteLLM    │ /v1/optimize │ Anyray optimizer │ │      │                │
│ = http://gw:4000     │                │ │ (the "AI   │─────────────▶│ (decisions)      │ │      │                │
│                      │                │ │ gateway")  │◀─────────────│ :8088            │ │      │                │
│                      │    response    │ │ :4000      │              └──────────────────┘ │      │                │
│                      │◀───────────────│ │            │───────────────────────────────────┼─────▶│ inference      │
│ normal output        │                │ └────────────┘ uses the HOST's GCP creds         │      │ happens here   │
│                      │                │                (service account / ADC)           │      │                │
└──────────────────────┘                └──────────────────────────────────────────────────┘      └────────────────┘
The three pieces
- Claude Code runs on each developer's machine (the client). It does not run on Vertex; it just sends Anthropic /v1/messages requests to your gateway instead of to Anthropic's API.
- The gateway = LiteLLM + the Anyray optimizer: two containers from one docker compose. LiteLLM speaks the Anthropic API in (so Claude Code can talk to it) and calls Vertex out. The Anyray optimizer is a sidecar that makes the optimize decision; it's credential-free and internal (:8088, not exposed). Spend is recorded in a content-free spend store.
- Vertex AI is where Claude inference actually runs, in your GCP project. The gateway calls it with the host's service account, so no Google key ever touches a dev machine.
Life of one request
- A dev runs claude → Claude Code sends an Anthropic /v1/messages request to the gateway (:4000), not to Vertex or Anthropic directly.
- LiteLLM receives it → the Anyray pre-call hook calls the optimizer at POST /v1/optimize (e.g. to cap a runaway max_tokens).
- LiteLLM forwards the request to Vertex, signed with the host's GCP credentials; Claude runs the inference.
- The response flows back through LiteLLM to the dev's Claude Code, which shows normal output.
- The Anyray post-call hook calls POST /v1/optimize-response; the cost is recorded in the content-free spend store (there is no separate meter() call).
- If the optimizer is ever down or slow, LiteLLM still serves the request unchanged (fail-open), so devs are never blocked.
So Anyray doesn't sit beside the traffic — traffic flows through the gateway, which is the only way it can both trim requests and attribute spend.
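The request path above can be sketched in a few lines of Python. This is a minimal illustration and not the adapter's actual code: `handle_request`, `call_optimizer`, and `forward_to_vertex` are hypothetical names, and the only behaviour taken from this page is that a failing or slow optimizer leaves the request unchanged.

```python
from typing import Any, Callable, Dict

Request = Dict[str, Any]


def handle_request(
    request: Request,
    call_optimizer: Callable[[Request], Request],
    forward_to_vertex: Callable[[Request], Dict[str, Any]],
) -> Dict[str, Any]:
    """Pre-call optimize, then forward; fail open if the optimizer misbehaves.

    Both callables are hypothetical stand-ins for the optimizer call
    (POST /v1/optimize) and the Vertex call made by the gateway.
    """
    try:
        # The optimizer works on a copy and may return it modified
        # (for example with a capped max_tokens).
        outgoing = call_optimizer(dict(request))
    except Exception:
        # Fail-open: an unreachable, slow (timed-out), or erroring optimizer
        # means the original request is served unchanged.
        outgoing = request
    return forward_to_vertex(outgoing)
```

Passing the two calls in as functions keeps the sketch independent of any HTTP client; in a real hook the optimizer call would also carry a short timeout so that "slow" is treated the same as "down".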
Two config touchpoints, nothing else
- Admin, once: the Vertex connection in .env (VERTEX_PROJECT, VERTEX_LOCATION), then docker compose up (Steps 2–3).
- Each developer, once: point Claude Code at the gateway with ANTHROPIC_BASE_URL (Step 4).
Prerequisites
- A GCP project with the Claude models enabled in Vertex AI Model Garden, and the region that serves them (e.g. us-east5). Availability is region-specific.
- A GCE VM in that project with Docker, and a service account attached to the VM with the Vertex AI User role. On GCE, LiteLLM picks this up automatically via Application Default Credentials, so there is no key file to distribute.
- Your org's machines can reach the VM on :4000.
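One way to confirm the last prerequisite from a developer machine is a plain TCP connect test. The sketch below uses only the Python standard library; the helper name `can_reach` and the 3-second timeout are illustrative choices, not part of the stack.

```python
import socket


def can_reach(host: str, port: int = 4000, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Calling `can_reach("<anyray-host>")` with your VM's address only proves the port is open; it does not check that LiteLLM is the process answering on it.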
Step 1 — Put the stack on the host
SSH to the VM, then:
git clone https://github.com/anyrayHQ/monorepo.git
cd monorepo/optimizer/adapters/litellm/example-vertex
Step 2 — Configure
cp .env.example .env
Edit .env:
VERTEX_PROJECT=your-gcp-project-id
VERTEX_LOCATION=us-east5 # the region serving Claude in your project
No credentials file is needed on GCE: the VM's attached service account is used. (Off-GCE? See Optional: try it on a laptop first.)
:::info No Anthropic key — proxy is keyless by default
There is no Anthropic API key and no Anthropic account in this setup — Vertex Claude is
billed entirely through GCP. The proxy also ships keyless, so there's no shared secret
to manage; you secure :4000 at the network layer (next step). If you
prefer a shared token guarding the endpoint, set LITELLM_MASTER_KEY in .env and
uncomment master_key in litellm_config.yaml — see Securing the endpoint.
:::
Step 3 — Start it
docker compose up -d --build
docker compose ps # litellm Up (:4000), anyray-optimizer Up (internal)
Confirm the optimizer is alive:
docker compose logs anyray-optimizer | tail -1
# {"level":30,...,"module":"optimizer/server","port":8088,"msg":"optimizer listening on :8088"}
:::warning Lock down the ports
Expose only :4000 to your org's network (firewall it to trusted CIDRs). The
optimizer's :8088 is internal — the compose file keeps it off the host; don't open it
publicly.
:::
Step 4 — Point your workers at it
A developer who already uses Claude Code sets one variable (shell profile, or your CI/secret manager):
export ANTHROPIC_BASE_URL=http://<anyray-host>:4000
That's it: no API key, no token. Claude Code reuses its existing login and routes to your endpoint instead of Anthropic's. Developers then use it exactly as before, with the same replies and the same speed; the optimization and metering happen invisibly.
:::note No Anthropic key needed — and usually no token either
Verified with Claude Code 2.1: a logged-in client needs only ANTHROPIC_BASE_URL. You
add a credential env in just two cases, and it is never an Anthropic key:
- Not logged in / CI / fresh container: to skip Claude Code's interactive login, set ANTHROPIC_AUTH_TOKEN=lever (any placeholder; the keyless proxy ignores it).
- You enabled a proxy token (Securing the endpoint): set ANTHROPIC_AUTH_TOKEN to that token.
Use ANTHROPIC_AUTH_TOKEN (the gateway Authorization: Bearer slot), not ANTHROPIC_API_KEY.
:::
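The three cases in the note above reduce to a small decision. The following sketch encodes them as a function; `worker_env` and its parameters are hypothetical names used only for illustration, and the placeholder value is arbitrary because the keyless proxy ignores it.

```python
from typing import Dict, Optional


def worker_env(base_url: str, logged_in: bool, proxy_token: Optional[str] = None) -> Dict[str, str]:
    """Environment variables a Claude Code worker needs for the gateway."""
    env = {"ANTHROPIC_BASE_URL": base_url}
    if proxy_token:
        # A proxy token was enabled on the gateway: workers must send it.
        env["ANTHROPIC_AUTH_TOKEN"] = proxy_token
    elif not logged_in:
        # Keyless proxy but no interactive login (CI, fresh container):
        # any placeholder skips the login prompt.
        env["ANTHROPIC_AUTH_TOKEN"] = "placeholder"
    # ANTHROPIC_API_KEY is deliberately never set in any case.
    return env
```

A logged-in developer against the keyless proxy gets only `ANTHROPIC_BASE_URL`, which matches the one-variable setup in Step 4.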
Step 5 — Verify Anyray is attributing spend
On the host, after a few Claude Code requests have gone through, query the content-free
spend store (admin-gated by ANYRAY_ADMIN_TOKEN):
curl -s http://localhost:8787/admin/spend \
-H "Authorization: Bearer $ANYRAY_ADMIN_TOKEN" | jq .
# rows priced and attributed by model; status ok for all.
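If you would rather script this check than run curl by hand, the same request can be made with the Python standard library. This is a sketch under the assumption that the endpoint returns JSON, as the `jq` usage above implies; the function names are illustrative.

```python
import json
import urllib.request


def build_spend_request(base_url: str, admin_token: str) -> urllib.request.Request:
    """Build the authenticated GET /admin/spend request."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/admin/spend",
        headers={"Authorization": f"Bearer {admin_token}"},
        method="GET",
    )


def fetch_spend(base_url: str, admin_token: str, timeout: float = 5.0) -> dict:
    """Fetch and decode the spend summary; raises on HTTP or network errors."""
    request = build_spend_request(base_url, admin_token)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

On the host, `fetch_spend("http://localhost:8787", os.environ["ANYRAY_ADMIN_TOKEN"])` (after `import os`) mirrors the curl command; reading the token from the environment keeps it out of scripts and shell history.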
Where you see spend & savings
The gateway records one content-free row per request in the spend store (who/team,
model, provider, tokens, cost, latency, status — never content). The admin-gated summary at
GET /admin/spend powers the Spend page in the Anyray console, rolled up per model and
per user/team:
curl -s http://localhost:8787/admin/spend \
-H "Authorization: Bearer $ANYRAY_ADMIN_TOKEN" | jq '.byModel'
# claude-sonnet-4-20250514 → reqs 42, spend $0.5810
# claude-3-5-haiku-20241022 → reqs 18, spend $0.0040
- cost: what each request actually cost on Vertex (priced from token usage).
- attribution is per model and per user/team (from x-anyray-metadata). Richer per-app/developer attribution is on the roadmap.
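To make "priced from token usage" concrete, here is a minimal sketch of the arithmetic and the per-model rollup. Everything specific in it is an assumption: the model names and per-million-token prices are placeholders rather than real Vertex prices, and the row field names (`model`, `input_tokens`, `output_tokens`) are hypothetical, not the spend store's schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical (input, output) prices in USD per million tokens.
# Placeholders for illustration only; use your contract's Vertex prices.
PRICES: Dict[str, Tuple[float, float]] = {
    "example-model-large": (3.00, 15.00),
    "example-model-small": (1.00, 5.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Price one request from its token usage."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def rollup_by_model(rows: Iterable[dict]) -> Dict[str, dict]:
    """Aggregate content-free rows into request count and spend per model."""
    totals: Dict[str, dict] = defaultdict(lambda: {"reqs": 0, "spend": 0.0})
    for row in rows:
        bucket = totals[row["model"]]
        bucket["reqs"] += 1
        bucket["spend"] += request_cost(row["model"], row["input_tokens"], row["output_tokens"])
    return dict(totals)
```

Note that the rows carry only counts and identifiers, which is what lets the store stay content-free while still producing a per-model summary.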
Tuning the max_tokens guardrail
The param_tuning strategy caps max_tokens at a configurable ceiling. It's deliberately
set above normal Claude Code output sizes, so it only clips pathological or runaway
requests and won't truncate real coding work.
Strategies are configured in optimizer.config.json — an ordered strategies[] of
{kind, enabled, params} (array order = execution order). You edit it from the console
Optimizer settings page, or via the admin API
(PUT /admin/optimizer/settings, gated by ANYRAY_ADMIN_TOKEN):
- Lower the cap (e.g. 8192) in the param_tuning strategy's params to trade output headroom for more savings, but only after confirming your team's prompts don't legitimately need longer outputs.
- Attribution-only: set every strategy's enabled: false to leave requests completely untouched and just measure spend.
Changes are runtime-mutable and audit-logged; no restart needed.
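The sketch below shows how an ordered strategies list of this shape might be applied to a request. The `{kind, enabled, params}` structure comes from this page; the params key `max_tokens_ceiling` is a hypothetical name (check your optimizer.config.json for the real one), and `apply_strategies` is an illustration, not the optimizer's implementation.

```python
from typing import Any, Dict, List

# Shape taken from this page: an ordered strategies[] of {kind, enabled, params}.
# The params key "max_tokens_ceiling" and its value are hypothetical.
CONFIG: Dict[str, List[Dict[str, Any]]] = {
    "strategies": [
        {"kind": "param_tuning", "enabled": True, "params": {"max_tokens_ceiling": 8192}},
    ]
}


def apply_strategies(request: Dict[str, Any], config: Dict[str, Any]) -> Dict[str, Any]:
    """Apply enabled strategies in array order and return a new request."""
    result = dict(request)
    for strategy in config["strategies"]:
        if not strategy["enabled"]:
            continue  # Attribution-only mode: disabled strategies touch nothing.
        if strategy["kind"] == "param_tuning":
            ceiling = strategy["params"]["max_tokens_ceiling"]
            if "max_tokens" in result:
                # Only ever lowers the value; smaller requests pass unchanged.
                result["max_tokens"] = min(result["max_tokens"], ceiling)
    return result
```

Because the cap is a `min`, a request asking for 1024 tokens is untouched while one asking for 64000 is clipped to the ceiling, which is the guardrail behaviour described above.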
Data boundary
- Vertex credentials stay on the one host (the VM's attached service account); they're never distributed to workers and the optimizer never sees them.
- Prompts and responses stay inside your GCP project — they flow worker → LiteLLM → Vertex, all in-project. The optimizer only receives the request shape it needs to decide, and the spend store records content-free metadata only. Anyray is fully self-hosted — nothing leaves your environment, and stored content is encrypted at rest by default.
Securing the endpoint
The proxy is keyless by default, so secure it at the network layer: run it inside
your VPC and firewall :4000 to trusted CIDRs (your office/VPN, GKE node ranges, CI). The
optimizer's :8088 is internal-only.
If the endpoint isn't tightly isolated, add a shared proxy token:
- In .env, set LITELLM_MASTER_KEY=<strong-random-string>.
- In litellm_config.yaml, uncomment the general_settings.master_key block.
- docker compose up -d, then have workers set ANTHROPIC_AUTH_TOKEN to that token.
Either way, this token guards your proxy — it is never an Anthropic credential and never leaves your environment.
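For the strong random string, one option is Python's `secrets` module, which is designed for generating shared secrets. The helper name and the 32-byte default are illustrative choices; any sufficiently long random value works.

```python
import secrets


def new_proxy_token(num_bytes: int = 32) -> str:
    """Generate a URL-safe random token suitable for a shared secret."""
    return secrets.token_urlsafe(num_bytes)
```

Print the result once with `print(new_proxy_token())`, paste it into .env as LITELLM_MASTER_KEY, and distribute it to workers through your secret manager rather than chat or email.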
Optional: try it on a laptop first
To smoke-test off-GCE before deploying, uncomment the credentials mount and
GOOGLE_APPLICATION_CREDENTIALS in docker-compose.yml, and provide creds via either a downloaded SA key
or gcloud auth application-default login. Then set ANTHROPIC_BASE_URL=http://localhost:4000.
If a model isn't priced
The model id Claude Code sends must (1) be matched by a model_name in
litellm_config.yaml and (2) be priced in the optimizer's pricing table. The config ships
with Opus 4, Sonnet 4, 3.7 Sonnet, and 3.5 Haiku — all already priced. If your team uses
another Claude version, add it in both places (and confirm the Vertex prices against your
contract). An unpriced model still works — it just passes through with no cost attributed.
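The two-place check can be expressed as a small helper. This sketch assumes exact string matching between the model id and each list, which may be stricter than how LiteLLM actually matches model_name; the function name and return strings are illustrative.

```python
from typing import Iterable, List


def missing_places(
    model_id: str,
    config_model_names: Iterable[str],
    priced_models: Iterable[str],
) -> List[str]:
    """Return where a model id still needs to be added (empty list = ready)."""
    missing = []
    if model_id not in set(config_model_names):
        missing.append("litellm_config.yaml model_name")
    if model_id not in set(priced_models):
        missing.append("optimizer pricing table")
    return missing
```

An id that appears in the config but not in the pricing table corresponds to the case described above: requests succeed, but no cost is attributed.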
Which gateway? (why LiteLLM here)
The gateway is a configurable choice, and Anyray's own gateway
is the implemented default. It now speaks Vertex/Anthropic natively, so you can point
Claude Code straight at the Anyray gateway (:8787) and skip LiteLLM entirely.
This page documents the LiteLLM alternative because LiteLLM is a mature, widely-deployed
front door that speaks the Anthropic /v1/messages API in and Vertex out — a
fine choice if you already run it or prefer it. Either way the optimizer and spend store are
the same; only the front door differs.
See also
- Gateways → LiteLLM — how the adapter plugs into LiteLLM.
- GCP / GKE — running the shared stack in a GKE cluster instead of a VM.
- Concepts → Performance & quality — why this is safe to roll out.