Conversation & memory management
Most strategies act on a single request. This family acts across the whole conversation or agent session — where the largest agentic savings hide, and where optimization is hardest to do safely. It's the top of Anyray's strategy difficulty ladder: behavior-sensitive, workload-specific, and far harder to copy than caching or routing.
:::warning Status: planned (fast-follow)
This whole family is on the roadmap: conversation- and session-level memory management is
not built yet. Today's shipped optimizer strategies are param_tuning, tool_pruning,
prompt_compression, and semantic_cache (default off); none of them act across a
conversation. The design below is documented so it can be reviewed; the guardrails it
describes (quality gating, fail-open) are the same ones that govern
every Anyray strategy.
:::
Why the conversation is the unit
In an agent or a long chat, cost isn't really per-request. Context accumulates: the system prompt, tool outputs, retrieved documents, and every prior turn get re-sent on each step. A 30-minute coding session can spend most of its tokens re-shipping context the model has already seen. Optimizing the conversation attacks that compounding input-token growth directly — the dominant agentic cost driver (see What saves the most).
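To make the compounding concrete, here is a back-of-envelope sketch. The figures (a 2,000-token starting context, 500 new tokens per turn, 30 turns) are illustrative assumptions, not Anyray measurements, and the function name is hypothetical.

```python
def cumulative_input_tokens(base_tokens: int, tokens_per_turn: int, turns: int) -> int:
    """Total input tokens sent when every step re-sends the full history."""
    total = 0
    context = base_tokens
    for _ in range(turns):
        context += tokens_per_turn  # this turn's new content joins the context
        total += context            # the whole context goes out as input again
    return total


# Assumed, illustrative session shape.
resent = cumulative_input_tokens(base_tokens=2_000, tokens_per_turn=500, turns=30)
unique = 2_000 + 500 * 30  # tokens of distinct content actually produced

print(resent, unique)
```

Under these assumptions the session sends 292,500 input tokens to carry 17,000 tokens of distinct content. The re-sent total grows quadratically with the number of turns, while the distinct content grows only linearly.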
Goal awareness
Infer what the conversation is actually trying to accomplish, then score context by relevance to that goal — keep what advances it, shed what doesn't. A debugging session doesn't need the marketing copy pasted ten turns ago; a refactor doesn't need the resolved stack trace. Goal awareness is what makes the rest of the family safe: you prune against a model of what matters now, not blindly by age.
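As a rough illustration of scoring context against a goal, the sketch below uses plain word overlap. This is a stand-in chosen for brevity: the design above does not say how Anyray would infer the goal or measure relevance, and `relevance` and `rank_context` are hypothetical names.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude proxy for meaning."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def relevance(goal: str, item: str) -> float:
    """Fraction of the goal's terms that also appear in the context item."""
    goal_terms = tokens(goal)
    if not goal_terms:
        return 0.0
    return len(goal_terms & tokens(item)) / len(goal_terms)


def rank_context(goal: str, items: list[str]) -> list[tuple[float, str]]:
    """Context items paired with their score, most goal-relevant first."""
    scored = [(relevance(goal, item), item) for item in items]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


goal = "fix the timeout error in the payment retry handler"
items = [
    "Marketing copy: our spring launch tagline ideas",
    "Stack trace: timeout error raised in payment retry handler",
]
ranked = rank_context(goal, items)
```

Here the stack trace scores 0.75 and the marketing copy scores 0.0, so the trace ranks first. A real scorer would need something stronger than shared vocabulary, such as embeddings or a model-based judgment, but the shape is the same: rank by relevance to the goal, not by age.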
Topic-change detection
Conversations pivot. When the topic shifts — a new feature, a different file, a fresh question — the prior topic's context usually stops being load-bearing. Topic-change detection segments the conversation and lets Anyray summarize or drop the previous segment instead of carrying it forward, turn after turn, forever. The hard part is distinguishing a genuine pivot from a digression that still depends on earlier context — so detection feeds the quality gate, it doesn't act unchecked.
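One simple way to propose segment boundaries is to compare the vocabulary of adjacent turns. The sketch below uses Jaccard similarity with an arbitrary threshold; both are assumptions for illustration, not the planned detector. It returns only candidate pivots, reflecting the point above that detection feeds the quality gate and does not act on its own.

```python
import re


def terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared terms divided by all terms; 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def candidate_pivots(turns: list[str], threshold: float = 0.1) -> list[int]:
    """Indices of turns that share little vocabulary with the turn before."""
    return [
        i
        for i in range(1, len(turns))
        if jaccard(terms(turns[i - 1]), terms(turns[i])) < threshold
    ]


turns = [
    "the login test fails with a null token",
    "the null token comes from the login session cache",
    "now draft release notes for version two",
]
pivots = candidate_pivots(turns)  # [2]: the third turn starts a new topic
```

A digression that still depends on earlier context can look just like a pivot to a lexical check of this kind, which is why a candidate boundary should only nominate a segment for summarizing or dropping.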
Memory tiers
Short-term (working memory)
The live working set within a session. Anyray manages it with recency + relevance pruning and pre-compaction snapshots: small, priority-tiered summaries of the session's state (decisions, edits, open tasks) captured before the context window is compacted. Working state then survives compaction instead of being lost and re-derived, and that loss is what makes agents repeat work and re-read files.
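A pre-compaction snapshot could be modeled as a small list of prioritized entries that is rendered under a size budget. The sketch below is a guess at the shape, not Anyray's data model: the class names, the "lower number means more important" convention, and the use of characters as a stand-in for tokens are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SnapshotEntry:
    priority: int  # assumed convention: lower number = more important
    kind: str      # "decision", "edit", or "open_task"
    text: str


@dataclass
class Snapshot:
    entries: list[SnapshotEntry] = field(default_factory=list)

    def render(self, max_chars: int) -> str:
        """Emit entries in priority order, skipping any that exceed the budget."""
        lines: list[str] = []
        used = 0
        for entry in sorted(self.entries, key=lambda e: e.priority):
            line = f"[{entry.kind}] {entry.text}"
            if used + len(line) + 1 > max_chars:  # +1 for the newline
                continue
            lines.append(line)
            used += len(line) + 1
        return "\n".join(lines)


snapshot = Snapshot([
    SnapshotEntry(2, "edit", "Renamed helper in client.py"),
    SnapshotEntry(0, "decision", "Use retries with backoff"),
    SnapshotEntry(1, "open_task", "Add test for timeout path"),
])
tight = snapshot.render(max_chars=60)    # only the top-priority decision fits
roomy = snapshot.render(max_chars=1000)  # all three entries, by priority
```

With a 60-character budget only the decision survives. With a generous budget all three entries are emitted in priority order. Tiering means a tight budget drops the least important state first.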
Long-term (durable memory)
Knowledge that should persist across sessions. Instead of replaying history into every new prompt, Anyray keeps a durable store and retrieves only the relevant slice for the current goal. This is the difference between an agent that re-reads the whole repo each morning and one that recalls just what it needs.
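The contrast between replaying history and retrieving a slice can be sketched with a tiny file-backed store. Everything here is hypothetical: the `DurableMemory` class, the JSON file, and the word-overlap ranking are placeholders for whatever store and retrieval method the eventual design uses.

```python
import json
import re
import tempfile
from pathlib import Path


def _terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


class DurableMemory:
    """Facts persisted to disk so a later session can recall them."""

    def __init__(self, path: Path):
        self.path = path
        self.facts: list[str] = json.loads(path.read_text()) if path.exists() else []

    def remember(self, fact: str) -> None:
        self.facts.append(fact)
        self.path.write_text(json.dumps(self.facts))

    def recall(self, goal: str, k: int = 2) -> list[str]:
        """Return up to k stored facts that share the most terms with the goal."""
        goal_terms = _terms(goal)
        scored = [(len(goal_terms & _terms(fact)), fact) for fact in self.facts]
        hits = [pair for pair in scored if pair[0] > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [fact for _, fact in hits[:k]]


with tempfile.TemporaryDirectory() as tmp:
    store_path = Path(tmp) / "memory.json"

    first_session = DurableMemory(store_path)
    first_session.remember("auth module uses JWT tokens signed with RS256")
    first_session.remember("CI runs on every push to main")
    first_session.remember("the billing service retries failed charges three times")

    # A new session loads the same store and pulls only what the goal needs.
    next_session = DurableMemory(store_path)
    recalled = next_session.recall("rotate the JWT signing key in the auth module", k=1)
```

The second session recalls only the fact about the auth module. The other stored facts stay out of the prompt.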
Keeping it safe
Everything here drops or rewrites context, so it carries real quality risk — more than caching or output control. It's held to Anyray's behavior-preserving guarantee:
- Gated on a quality bound — goal-relevance and key-fact coverage must hold; if pruning would drop load-bearing context, it doesn't prune.
- When unsure, keep it. Uncertainty resolves toward more context, never less, in the same fail-safe spirit as falling back to a frontier model when routing.
- Fail-open. A misfiring memory step is skipped, never fatal.
- Measured as cost per correct answer, so a prune that hurts answers shows up as negative savings and gets rolled back by adaptive optimization.
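The four rules above can be sketched as a wrapper around any pruning step. This is a minimal illustration under stated assumptions, not Anyray's gate: key facts are modeled as literal substrings, the coverage bound defaults to 1.0, and all function names are hypothetical.

```python
from typing import Callable


def key_fact_coverage(key_facts: list[str], context: list[str]) -> float:
    """Share of key facts still present somewhere in the context."""
    if not key_facts:
        return 1.0
    joined = "\n".join(context)
    return sum(fact in joined for fact in key_facts) / len(key_facts)


def gated_prune(
    context: list[str],
    prune: Callable[[list[str]], list[str]],
    key_facts: list[str],
    min_coverage: float = 1.0,
) -> list[str]:
    try:
        candidate = prune(context)
    except Exception:
        return context  # fail-open: a misfiring step is skipped, never fatal
    if key_fact_coverage(key_facts, candidate) < min_coverage:
        return context  # quality bound not met: keep the full context
    return candidate


def cost_per_correct_answer(total_cost: float, correct_answers: int) -> float:
    return float("inf") if correct_answers == 0 else total_cost / correct_answers


context = ["retry limit is 3", "old marketing copy", "deploy target is staging"]
facts = ["retry limit is 3", "deploy target is staging"]

safe = gated_prune(context, lambda c: [x for x in c if "marketing" not in x], facts)
unsafe = gated_prune(context, lambda c: c[1:], facts)  # would drop a key fact
```

`safe` comes back without the marketing copy, while `unsafe` returns the original context untouched because the prune would have removed a key fact. On the metric: if a baseline spends 10.0 for 8 correct answers (1.25 each) and a pruned run spends 9.0 for 6 (1.5 each), the lower bill is still a loss per correct answer. That is the signal that would trigger a rollback.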
Where it runs: request-side vs. agent-side
- Short-term, in-flight pruning can act on the assembled request a gateway sees — so a gateway plugin can do it.
- Long-term memory and deep session continuity are partly agent-side (hooks / the agent's own loop), upstream of any gateway. Anyray's zero-touch interception already reaches that environment, so it's a natural place for Anyray to extend — but in pure plugin mode part of this family lives above the request Anyray sees. See request-side vs. agent-side.
Configuring it
Like every strategy, this family is enabled per use-case via the pipeline (see Configure). It earns its place in agentic and long-chat deployments; for short, stateless calls it does nothing and should be left off.
See also
- Optimization strategies — the full menu and the pipeline model
- What saves the most — why agentic cost is an input-token problem
- Adaptive optimization — how these strategies tune themselves safely