Conversation & memory management

Most strategies act on a single request. This family acts across the whole conversation or agent session — where the largest agentic savings hide, and where optimization is hardest to do safely. It's the top of Anyray's strategy difficulty ladder: behavior-sensitive, workload-specific, and far harder to copy than caching or routing.

:::warning Status: planned (fast-follow)
This whole family is roadmap: conversation- and session-level memory management is not built yet. Today's shipped optimizer strategies are param_tuning, tool_pruning, prompt_compression, and semantic_cache (default off); none of them act across a conversation. The design below is documented so it can be reviewed; the guardrails it describes (quality gating, fail-open) are the same ones that govern every Anyray strategy.
:::

Why the conversation is the unit

In an agent or a long chat, cost isn't really per-request. Context accumulates: the system prompt, tool outputs, retrieved documents, and every prior turn get re-sent on each step. A 30-minute coding session can spend most of its tokens re-shipping context the model has already seen. Optimizing the conversation attacks that compounding input-token growth directly — the dominant agentic cost driver (see What saves the most).
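The compounding is easy to see with a little arithmetic. The sketch below is illustrative only: the function name and the token figures are assumptions chosen for the example, not Anyray measurements. It totals the input tokens billed when every step re-sends the full accumulated context.

```python
def cumulative_input_tokens(base_tokens: int, tokens_per_turn: int, turns: int) -> int:
    """Total input tokens billed when each step re-sends all prior context.

    base_tokens:     system prompt plus initial context (hypothetical figure)
    tokens_per_turn: new tokens each step adds to the context (hypothetical figure)
    """
    total = 0
    context = base_tokens
    for _ in range(turns):
        total += context            # the whole context goes out as input on this step
        context += tokens_per_turn  # this step's output and tool results join the context
    return total


if __name__ == "__main__":
    # 40 steps, 2,000-token starting context, 1,500 new tokens per step
    print(cumulative_input_tokens(2_000, 1_500, 40))  # 1250000
```

Under these assumed numbers the session ends with about 62,000 tokens of distinct context but is billed for 1.25 million input tokens, roughly 20 times as much. Growth is quadratic in the number of steps, which is why trimming the carried context pays off more the longer a session runs.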

Goal awareness

Infer what the conversation is actually trying to accomplish, then score context by relevance to that goal — keep what advances it, shed what doesn't. A debugging session doesn't need the marketing copy pasted ten turns ago; a refactor doesn't need the resolved stack trace. Goal awareness is what makes the rest of the family safe: you prune against a model of what matters now, not blindly by age.
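A minimal sketch of scoring context against a goal, under loud assumptions: the goal is already available as a string, relevance is plain term overlap (a real scorer would more plausibly use embeddings or a model), and `ContextItem`, `relevance`, `select_context`, and both thresholds are hypothetical names and values, not Anyray's API.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class ContextItem:
    turn: int  # turn number at which the item entered the conversation
    text: str


def _terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def relevance(goal: str, item: ContextItem) -> float:
    """Fraction of goal terms that also appear in the item (0.0 to 1.0)."""
    goal_terms = _terms(goal)
    if not goal_terms:
        return 1.0  # no goal inferred: treat everything as relevant
    return len(goal_terms & _terms(item.text)) / len(goal_terms)


def select_context(
    goal: str,
    items: list[ContextItem],
    keep_threshold: float = 0.2,  # assumed cutoff, for illustration only
    always_keep_last: int = 2,    # most recent turns are kept regardless of score
) -> list[ContextItem]:
    latest = max((item.turn for item in items), default=0)
    kept = []
    for item in items:
        is_recent = item.turn > latest - always_keep_last
        if is_recent or relevance(goal, item) >= keep_threshold:
            kept.append(item)
    return kept
```

With the goal "fix null pointer crash in parser", a stack trace from turn 2 that mentions the parser is kept, while marketing copy pasted at turn 1 scores zero and is shed. The decision is driven by relevance to the goal, with recency only as a floor.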

Topic-change detection

Conversations pivot. When the topic shifts (a new feature, a different file, a fresh question), the prior topic's context usually stops being load-bearing. Topic-change detection segments the conversation and lets Anyray summarize or drop the previous segment instead of carrying it forward turn after turn. The hard part is distinguishing a genuine pivot from a digression that still depends on earlier context, so detection feeds the quality gate rather than acting unchecked.
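One way to picture segmentation is a similarity drop between a new turn and the vocabulary of the current segment. The sketch below assumes Jaccard similarity over non-stopword terms and a hand-picked threshold; `segment_turns`, `STOPWORDS`, and `pivot_threshold` are hypothetical, and a lexical measure like this cannot by itself tell a pivot from a dependent digression, which is why its output would only be a proposal for the quality gate.

```python
from __future__ import annotations

import re

# Tiny illustrative stopword list; a real detector would need something better.
STOPWORDS = {"a", "an", "and", "the", "on", "in", "for", "to", "of", "is", "now"}


def _terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9_]+", text.lower())) - STOPWORDS


def jaccard(a: set[str], b: set[str]) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def segment_turns(turns: list[str], pivot_threshold: float = 0.1) -> list[list[int]]:
    """Group turn indices into segments, starting a new one on a similarity drop."""
    segments: list[list[int]] = []
    current: list[int] = []
    vocab: set[str] = set()  # terms seen so far in the current segment
    for idx, text in enumerate(turns):
        terms = _terms(text)
        if current and jaccard(terms, vocab) < pivot_threshold:
            segments.append(current)  # candidate pivot: close the segment
            current, vocab = [], set()
        current.append(idx)
        vocab |= terms
    if current:
        segments.append(current)
    return segments
```

Each closed segment is a candidate for summarizing or dropping, not a decision. Whether it is safe to act on is left to the gate described under "Keeping it safe".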

Memory tiers

Short-term (working memory)

The live working set within a session. Anyray manages it with recency and relevance pruning plus pre-compaction snapshots: small, priority-tiered summaries of the session's state (decisions, edits, open tasks) captured before the context window is compacted. Working state then survives compaction instead of being lost and re-derived, which is what makes agents repeat work and re-read files.
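A pre-compaction snapshot can be sketched as a budgeted selection over tiered state entries. Everything named here is an assumption for illustration: the `StateEntry` type, the tier ordering in `PRIORITY` (open tasks first, then decisions, then edits), and the word-count token estimate are not taken from Anyray.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumed tier ordering, lowest number = kept first. The source names the three
# kinds of state but not their relative priority.
PRIORITY = {"open_task": 0, "decision": 1, "edit": 2}


@dataclass(frozen=True)
class StateEntry:
    kind: str  # one of the PRIORITY keys
    text: str


def estimate_tokens(text: str) -> int:
    return len(text.split())  # crude proxy; a real tokenizer would differ


def build_snapshot(entries: list[StateEntry], token_budget: int) -> list[StateEntry]:
    """Pick entries by tier until the budget is spent."""
    # sorted() is stable, so entries keep their original order within a tier.
    ordered = sorted(entries, key=lambda e: PRIORITY[e.kind])
    snapshot: list[StateEntry] = []
    used = 0
    for entry in ordered:
        cost = estimate_tokens(entry.text)
        if used + cost > token_budget:
            continue  # does not fit; a smaller, lower-tier entry still might
        snapshot.append(entry)
        used += cost
    return snapshot
```

The snapshot would be taken just before compaction and re-injected afterwards, so the agent resumes with its open tasks and decisions intact rather than rediscovering them.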

Long-term (durable memory)

Knowledge that should persist across sessions. Instead of replaying history into every new prompt, Anyray keeps a durable store and retrieves only the relevant slice for the current goal. This is the difference between an agent that re-reads the whole repo each morning and one that recalls just what it needs.
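The retrieve-a-slice idea can be sketched with a small durable store. This is an assumption-heavy illustration, not the planned design: it uses Python's standard `sqlite3` module for persistence and plain term overlap for ranking (a production system would more likely use embeddings), and `MemoryStore`, `remember`, and `recall` are hypothetical names.

```python
from __future__ import annotations

import re
import sqlite3


def _terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


class MemoryStore:
    """Durable fact store; pass a file path to persist across sessions."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memory (id INTEGER PRIMARY KEY, fact TEXT NOT NULL)"
        )

    def remember(self, fact: str) -> None:
        self.db.execute("INSERT INTO memory (fact) VALUES (?)", (fact,))
        self.db.commit()

    def recall(self, goal: str, top_k: int = 3) -> list[str]:
        """Return up to top_k facts sharing at least one term with the goal."""
        goal_terms = _terms(goal)
        scored = []
        for (fact,) in self.db.execute("SELECT fact FROM memory ORDER BY id"):
            score = len(goal_terms & _terms(fact))
            if score > 0:
                scored.append((score, fact))
        # Stable sort: ties keep insertion order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [fact for _, fact in scored[:top_k]]
```

A new session would call `recall` with its current goal and place only the returned facts in the prompt, instead of replaying the stored history wholesale.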

Keeping it safe

Everything here drops or rewrites context, so it carries real quality risk — more than caching or output control. It's held to Anyray's behavior-preserving guarantee:

Gated on a quality bound — goal-relevance and key-fact coverage must hold; if pruning would drop load-bearing context, it doesn't prune.
When unsure, keep it. Uncertainty resolves toward more context, never less — the same fail-safe spirit as routing to frontier.
Fail-open. A misfiring memory step is skipped, never fatal.
Measured as cost per correct answer, so a prune that hurts answers shows up as negative savings and gets rolled back by adaptive optimization.
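The four rules above can be sketched as a wrapper around any pruning step. This is a hedged illustration rather than Anyray's implementation: `guarded_prune`, `cost_per_correct_answer`, the substring-based coverage check, and the `min_coverage` default are all assumptions.

```python
from __future__ import annotations

from typing import Callable


def guarded_prune(
    context: list[str],
    key_facts: list[str],
    prune: Callable[[list[str]], list[str]],
    min_coverage: float = 1.0,  # assumed bound: every key fact must survive
) -> list[str]:
    """Apply prune() only if it is safe; otherwise return the original context."""
    if not key_facts:
        return context  # when unsure, keep it: nothing to verify against
    try:
        candidate = prune(context)
    except Exception:
        return context  # fail-open: a misfiring step is skipped, never fatal
    joined = " ".join(candidate).lower()
    covered = sum(1 for fact in key_facts if fact.lower() in joined)
    if covered / len(key_facts) < min_coverage:
        return context  # quality bound violated: do not prune
    return candidate


def cost_per_correct_answer(total_cost: float, correct_answers: int) -> float:
    """Spend divided by correct answers; infinite if nothing was correct."""
    if correct_answers == 0:
        return float("inf")
    return total_cost / correct_answers
```

The metric is what exposes a harmful prune. With made-up numbers, a baseline of 10.00 for 95 correct answers costs about 0.105 per correct answer, while a pruned run at 7.00 for only 60 correct answers costs about 0.117. The raw bill went down, but the savings are negative once correctness is counted.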

Where it runs: request-side vs. agent-side

Short-term, in-flight pruning can act on the assembled request a gateway sees — so a gateway plugin can do it.
Long-term memory and deep session continuity are partly agent-side (hooks / the agent's own loop), upstream of any gateway. Anyray's zero-touch interception already reaches that environment, so it's a natural place for Anyray to extend. In pure plugin mode, though, part of this family lives upstream of the request Anyray sees. See request-side vs. agent-side.

Configuring it

Like every strategy, this family is enabled per use-case via the pipeline (see Configure). It earns its place in agentic and long-chat deployments; for short, stateless calls it does nothing and should be left off.
