Clarify sub-agent reply routing in MCP tool descriptions

A sub-agent's reply to send_message lands in the caller's own pane tagged [sub-agent:<name>], not in the sub-agent's output. The descriptions for wait_for_pattern, send_message, both timer_fire_when_idle_*, and the server-instructions preamble now spell this out, along with the canonical send_message → timer_fire_when_idle_any → read-own-pane pattern. help('readiness') and help('coordination') updated to match. Previously agents reached for wait_for_pattern on the sub-agent and deadlocked until timeout because the reply had already been delivered to their own pane.
Sync TODO.md perf-audit review pass
2026-05-15 16:08:07 +01:00 · 2026-05-15 16:07:58 +01:00
4 changed files with 98 additions and 148 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -20,6 +20,19 @@ loosely follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
  child agent, even though you were still within that thread.
 ### Changed
 - MCP tool descriptions and `help('coordination')` /
  `help('readiness')` now spell out that a sub-agent's reply to
  `send_message` lands in the caller's own pane (tagged
  `[sub-agent:<name>]`), not in the sub-agent's output. The canonical
  wait-for-reply pattern — `send_message` → `timer_fire_when_idle_any`
  on the sub-agent → read your own pane — is now called out on
  `send_message`, `wait_for_pattern`, both `timer_fire_when_idle_*`,
  the help topics, and the server-instructions preamble every agent
  reads at startup. Previously `wait_for_pattern` was the obvious
  blocking primitive in the catalog, and agents routinely called it
  against the sub-agent for a reply that had already arrived in their
  own pane, deadlocking until the wait timed out. No behaviour
  changes; descriptions only.
 - Agent-initiated `spawn_agent` and `spawn_process` MCP calls no
  longer steal viewport focus from the currently active tab. The
  new child still appears in the sidebar and tab bar; switch to it
--- a/TODO.md
+++ b/TODO.md
@@ -1,6 +1,7 @@
-# Perf Audit (auto-generated 2026-05-15)
+# Perf Audit (reviewed 2026-05-15)
-Findings from a codebase sweep — not user-reported, needs review before
+Findings that survived the 2026-05-15 review pass. Low and marginal
-action. Each item names the anchor and a sketched fix.
+items from the original sweep were removed; remaining items have enough
 measured or workflow evidence to justify action.
 Baseline benchmark numbers (`go test -bench=. ./internal/app/`, AMD
 Ryzen 7 7800X3D, libghostty-vt **ReleaseFast** after the Makefile
@@ -27,145 +28,75 @@ Emulator_Write_Stream_8Color_120fps    257 µs/frame  3890 fps_ceiling
 Emulator_Write_Stream_TrueColor_120fps 488 µs/frame  2051 fps_ceiling
 ```
-Result of the fix below: 27-32× pipeline speedup, 60× emulator
+The current pipeline still has large 120 fps headroom. The remaining
-speedup. Pipeline hits 930-2030 fps end-to-end — 7-16× headroom
+renderer concern is multi-MiB styled replay latency and allocation
-over the 120 fps target on the heaviest workload (truecolor
+churn, not normal steady-state frame budget.
 full-screen redraws).
- [ ] **viewport renderer allocates ~1 alloc per 4 input bytes on SGR/CSI-heavy chunks.** [MEDIUM]
+- [ ] **viewport renderer allocates heavily on SGR/CSI-heavy chunks.** [MEDIUM]
-  - `internal/app/viewport_renderer.go` — the styled-lines and
+  - Review evidence: five benchmark reps confirmed
-    ratatui benchmarks show 4-17k allocs per chunk. The hot
+    `ViewportRenderer_StyledLines` at about 4,325 allocs per 16 KiB
-    contributors are likely (a) `string(vr.buf)` / `string(params)`
+    chunk (~91.5 KB/op, roughly 1 alloc per 3.8 input bytes), and
-    conversions in `emitCSI` for every escape sequence, (b) the
+    `ViewportRenderer_RatatuiBurst` at about 17,306 allocs per chunk
-    `pending strings.Builder` resizing as fragments arrive, and (c)
+    (~365 KB/op). A 5 MiB styled resume benchmark allocated about
-    `vr.shifter.Shift(vr.buf)` returning a fresh slice per CSI.
+    31 MB across 1.38M objects.
-  - Fix direction: switch CSI param parsing to byte-slice
+  - Likely hot paths: generic CSI/SGR output in
-    comparison (no string conversion); reuse `vr.buf` and
+    `internal/app/viewport_renderer.go` sends many sequences through
-    `vr.pending` backing arrays across `Render` calls by
+    `vr.shifter.Shift(vr.buf)`, while `internal/app/cursorshift.go`
-    pre-growing in `newViewportRenderer`; have `cursorShifter.Shift`
+    returns a fresh `[]byte` via `pending.String()` on every
-    return into a caller-owned buffer instead of allocating.
+    `Shift` call and parses CSI params through `string(raw)` /
-    Profile-guided: run the styled-lines bench, point pprof at the
+    `strings.Split`. The mode-helper `string(params)` conversions
-    allocs profile, fix the top three call sites.
+    are real, but probably not the main SGR-heavy cost.
  - Fix direction: make `cursorShifter` write into caller-owned
    scratch output or directly into the viewport renderer's pending
    builder; parse CSI params from byte slices; pre-grow/reuse
    renderer and shifter buffers. Re-run styled-lines, ratatui, and
    5 MiB resume benchmarks; use pprof when available to confirm the
    top allocation sites.
- [ ] **viewport renderer throughput (~90 MB/s styled) limits codex steady-state.** [MEDIUM]
+- [ ] **large styled resume/replay dumps spend visible time in viewport rendering.** [MEDIUM]
-  - The styled-lines and ratatui benchmarks come in at 89 MB/s and
+  - Review evidence: `BenchmarkSessionResume_5MiBStyled` measured
-    40 MB/s respectively. A 100 KB/s codex burst is far under that
+    about 58 ms median and 63 ms p95 over five reps. The plain 5 MiB
-    limit, but a session-resume dump of a 5 MiB chat history takes
+    benchmark was about 23-24 ms with only 21 allocs. The live path
-    50-130 ms of pure renderer time at those rates — enough to be
+    renders focused PTY chunks through `renderer.Render`, then still
-    user-visible at the start of a long resume.
+    pays emulator writes, ring writes, event dispatch, stdout writes,
-  - Fix direction: same as the alloc fix above; once the per-call
+    and real terminal paint.
-    allocation cost drops, the throughput ceiling rises with it.
+  - Scope: this is not a Codex steady-state throughput limit. A
-    Worth re-running the benches after fixing the allocs and only
+    100 KB/s stream is far below the styled renderer's ~80-90 MB/s
-    investing further if the styled-lines bench is still under
+    ceiling. It matters for multi-MiB burst replay, resume/startup
-    ~300 MB/s.
+    dumps, and dense full-screen churn.
  - Fix direction: do the allocation fix first, since it should also
    improve throughput. After that, invest further only if styled
    resume traces remain user-visible or the styled-lines benchmark
    is still under roughly 300 MB/s.
- [ ] **Session.Children() allocates a fresh slice on every call.** [MEDIUM]
+- [ ] **wait_for_pattern re-scans the entire stream/grid while waiting.** [MEDIUM]
-  - `internal/app/session.go:530-541` walks `s.order` under `s.mu` and
+  - `internal/app/host.go:476-493` (the `check` closure). On
-    builds a new `[]*Child` slice every time. Callers on hot paths:
+    `scope="scrollback"` it calls `c.StreamRead(0)` followed by
-    `drawSidebar` calls it twice per frame
+    `stripANSIBytes(nil, b)`, so each check can copy, strip, and
-    (`internal/app/sidebar.go:139` and `:171`); `drawTabBar` calls it
+    search the full 1 MiB ring. On `scope="grid"` it calls
-    once per frame (`internal/app/tabbar.go:37`); the classifier
+    `PlainText()` and runs the regex against the full grid string.
-    iterates it every 250 ms (`internal/app/classifier.go:38`); and
+  - Caveat from review: the current chunk notifier coalesces bursts
-    palette/navigation hit it on every Ctrl-A/D/W/S keystroke.
+    with a buffered channel and has a 500 ms fallback, so this is not
-  - Fix direction: store the snapshot in an `atomic.Pointer[[]*Child]`
+    necessarily one full scan per PTY chunk. It is still meaningful
-    on `Session`, refresh it under `s.mu` only when `Spawn` / `delete`
+    for active waits on chatty panes.
-    mutates the map. Readers get O(1) `Load()` with zero allocation —
+  - Fix direction: for `scrollback`, track the last checked stream
-    same pattern already used for `listeners` (session.go:118-123).
+    offset and search only new output plus a bounded overlap/scratch
    buffer so matches spanning chunks are not missed. For `grid`,
    dedupe on `ScreenVersion()` and skip work when the version has
    not changed.
- [ ] **wait_for_pattern re-scans the entire stream/grid every iteration.** [MEDIUM]
+- [ ] **search_output rebuilds and searches whole scrollback on every call.** [MEDIUM]
-  - `internal/app/host.go:476-493` (the `check` closure). On `scope =
+  - `internal/app/host.go:428-437` compiles a fresh regex, reads the
-    "scrollback"` it calls `c.StreamRead(0)` followed by
+    stream from offset 0, strips ANSI for `kind="rendered"`, converts
-    `stripANSIBytes(nil, b)` over the entire ring on every wake — a
+    the full buffer to a string, and splits it into lines before
-    full O(ring size) walk per chunk arrival. On `grid` it goes
+    applying `limit`. This is meaningful when agents poll the same
-    through PlainText (one CGO call) plus a regex match against the
+    pattern; it is low impact for ad hoc searches.
-    full grid string. For an agent waiting on a marker in a chatty
+  - Fix direction: cache compiled regexes by pattern; cache stripped
-    pane, every PTY chunk fires `check()`.
+    rendered output by child id and stream end offset; avoid
-  - Fix direction: for `scrollback`, track the offset of the last
+    `strings.Split` over the whole ring when only the first `limit`
-    check and run the regex only over the new tail, reusing a
+    matches are needed. Prefer an incremental search shape if this
-    per-call scratch buffer for ANSI stripping. For `grid`, dedupe
+    becomes the standard "watch for marker" path.
    on `ScreenVersion()` — skip when version hasn't changed.
 - [ ] **search_output compiles regex + strips ANSI on every call.** [MEDIUM]
  - `internal/app/host.go:428` compiles a fresh `regexp.Regexp` per
    invocation; `:434` strips ANSI over the entire ring buffer when
    `kind="rendered"`. Agents that poll `search_output` with the same
    pattern (the typical "watch for marker" loop) repay both costs on
    every call.
  - Fix direction: small LRU of compiled regexes keyed by pattern
    string (cap maybe 32) on `toolHost`. Cache the stripped-ANSI
    buffer keyed by `c.ScreenVersion()` so consecutive searches over
    an unchanged ring reuse the strip.
 - [ ] **GetProcessOutput grid mode acquires the emulator twice.** [MEDIUM]
  - `internal/app/host.go:375-391` does `em := c.Emulator()` for
    ActiveScreen / Cursor / Size, then at line 387 re-fetches
    `em := c.Emulator()` for PlainText. Each `Emulator()` call goes
    through `ptyMu` and inspects the live PTY pointer. Under a
    chatty agent polling `get_process_output` every 100 ms this is
    a redundant lock and pointer chase per call.
  - Fix direction: hold the emulator reference from the first
    lookup; reuse it for PlainText. The check `if em == nil` still
    runs cleanly because the variable is captured.
 - [ ] **FindChildByIdentity is O(N) under the session lock.** [LOW]
  - `internal/app/session.go:553-565` scans the children map looking
    for a matching `Identity` token on every new mcp-stdio
    connection. Not a steady-state hot path — only fires once per
    child spawn — but with many short-lived sub-agents it adds up
    and contends with everyone else taking `s.mu`.
  - Fix direction: maintain an `identityIndex map[string]string`
    (identity → child id) updated alongside spawn / exit, give the
    lookup an O(1) read.
 - [ ] **Per-promoter regex matches in the idle classifier.** [LOW]
  - `internal/app/idle.go:175-182` (`matchAny`) walks each compiled
    pattern and runs the DFA over the same 4 KiB tail. A preset with
    five permission patterns + five error patterns is ten DFA
    invocations per child per 250 ms tick.
  - Fix direction: at preset load time, compile each `_patterns`
    list into a single alternation regex (`(?:p1)|(?:p2)|…`). The
    classifier then makes one Match call per category per tick.
 - [ ] **Port-detection dedup is O(N²) over c.ports.** [LOW]
  - `internal/app/child.go:461-467`: for each fresh URL match the
    code linearly scans the existing port list. The list rarely
    grows past a handful, but a dev server that lists "all open
    ports" in one log line interacts badly: M new matches × N
    existing entries.
  - Fix direction: keep a `seenPorts map[int]struct{}` next to
    `c.ports`, rebuilt on prune (none today). O(1) per match.
 - [ ] **Port-sighting string allocations happen before the dedup check.** [LOW]
  - `internal/app/child.go:455-456` allocates `urlForm` and `portStr`
    before line 461's `seen` walk. Both strings are wasted when the
    port is already in `c.ports`. Inside `c.portsMu` for the whole
    loop body too, blocking the `Ports()` reader path.
  - Fix direction: bind the port int first (cheap parse from
    `m[1]`), do the seen check, only then allocate the URL string
    for the surviving sighting.
 - [ ] **classifier `time.Now()` syscall per child per tick.** [LOW]
  - `internal/app/classifier.go:54` (and the `IdleMS` /
    `TitleIdleMS` helpers it transitively calls in
    `internal/app/child.go:343-374`) each call `time.Now()`.
    Reading time on Linux is fast (vDSO) but with N children × 4
    `time.Now()` per tick × 4 ticks/sec it's wasted work that can
    be batched.
  - Fix direction: capture `now := time.Now().UnixNano()` once at
    the top of `classifyAll` and thread it into `classifyOne` and
    the helpers as a parameter.
 - [ ] **wait_for_pattern subscribes a listener for every call.** [LOW]
  - `internal/app/host.go:472-474`: each invocation calls
    `Session.Subscribe(wake)` which clones the listener slice and
    swaps the atomic pointer; the `defer Unsubscribe` does the same
    on exit. Two allocations per `wait_for_pattern`. The agent
    pattern of looping on `wait_for_pattern` after every tool call
    pays this churn on the steady-state path.
  - Fix direction: a per-child `chunkBroadcaster` registered once
    at child spawn that hands out lightweight subscriber tokens,
    rather than going through the full session listener machinery.
 # On Hold
 - [ ] There's a unicode <?> being displayed in opencode [ON HOLD]
--- a/internal/app/host.go
+++ b/internal/app/host.go
@@ -1135,8 +1135,9 @@ func helpFor(topic string) mcp.HelpResponse {
 	case "coordination":
 		return mcp.HelpResponse{
 			Topic: "coordination",
-			Content:      "send_message tags the message with the caller's role (parent → [orchestrator], child → [sub-agent:<name>]). Siblings must route through their parent. request_human_attention raises a UI notification when you can't safely decide.",
+			Content: "send_message tags the message with the caller's role (parent → [orchestrator], child → [sub-agent:<name>]). Siblings must route through their parent. request_human_attention raises a UI notification when you can't safely decide.\n\n" +
-			RelatedTools: []string{"send_message", "request_human_attention"},
+				"Reply routing: a sub-agent's reply to your send_message lands in YOUR pane tagged `[sub-agent:<name>]`, not in the sub-agent's output. Anti-pattern: `wait_for_pattern(sub_agent, …)` to wait for a reply — the sub-agent is already idle, its output won't change, and the call spins to timeout. Pattern: send_message → timer_fire_when_idle_any([sub_agent_id], body=\"[system] sub-agent finished\") → when the timer fires, the reply is already queued as your next user turn (or visible via get_process_output on your own pane).",
 			RelatedTools: []string{"send_message", "request_human_attention", "timer_fire_when_idle_any", "timer_fire_when_idle_all"},
 		}
 	case "scratchpads":
 		return mcp.HelpResponse{
@@ -1162,8 +1163,13 @@ func helpFor(topic string) mcp.HelpResponse {
 	case "readiness":
 		return mcp.HelpResponse{
 			Topic: "readiness",
-			Content:      "A pane is 'idle' once nothing has been written to its PTY for ~1s (SPEC §11). Treat idle as a signal to read, not a guarantee of completion. wait_for_pattern lets you wait on a known terminal marker for stronger evidence.",
+			Content: "A pane is 'idle' once nothing has been written to its PTY for ~1s (SPEC §11). Treat idle as a signal to read, not a guarantee of completion.\n\n" +
-			RelatedTools: []string{"wait_for_pattern", "get_process_status"},
+				"Waiting for a sub-agent's reply (canonical pattern):\n" +
 				"  1. send_message(sub_agent_id, request)\n" +
 				"  2. timer_fire_when_idle_any(watched=[sub_agent_id], body=\"[system] sub-agent done\")\n" +
 				"  3. When the timer fires you re-enter as a fresh user turn; the sub-agent's reply is already in your own pane tagged `[sub-agent:<name>]` (read via get_process_output on yourself if you need it explicitly).\n\n" +
 				"wait_for_pattern is for waiting on text a process emits in its OWN output (a shell prompt, a build's \"tests passed\" line). It does NOT see send_message replies, because those land in the caller's pane, not the target's — calling wait_for_pattern on a sub-agent to wait for its reply deadlocks until timeout.",
 			RelatedTools: []string{"wait_for_pattern", "get_process_status", "timer_fire_when_idle_any", "send_message"},
 		}
 	case "permissions":
 		return mcp.HelpResponse{
--- a/internal/mcp/protocol.go
+++ b/internal/mcp/protocol.go
@@ -43,7 +43,7 @@ var serverInfo = map[string]any{
 //     up as sub-agents and won't be tied into the patterm lifecycle.
 //
 // Keep this short — clients vary in how much they surface to the LLM.
-const serverInstructions = "You are already running INSIDE patterm; the `patterm` MCP server is connected over the same stdio MCP transport you use for any other MCP server. Use the MCP tools you see in tools/list — do NOT (a) try to launch `patterm` or `patterm mcp-stdio` yourself, (b) poke the Unix socket through perl / nc / socat / curl, or (c) shell out to `claude` / `codex` / `opencode` to start a peer. Any of those bypasses caller-identity and the new agent will land as a stray top-level tab instead of a child under you. Start with `whoami` for your role and the full tool list, then `help('topics')` for orientation. `spawn_agent` is the only correct way to start a sub-agent; `spawn_process` is for non-LLM commands; `list_processes` / `get_process_output` inspect them; `send_input` / `send_message` drive them. Whatever you spawn is yours to `close_process` when done."
+const serverInstructions = "You are already running INSIDE patterm; the `patterm` MCP server is connected over the same stdio MCP transport you use for any other MCP server. Use the MCP tools you see in tools/list — do NOT (a) try to launch `patterm` or `patterm mcp-stdio` yourself, (b) poke the Unix socket through perl / nc / socat / curl, or (c) shell out to `claude` / `codex` / `opencode` to start a peer. Any of those bypasses caller-identity and the new agent will land as a stray top-level tab instead of a child under you. Start with `whoami` for your role and the full tool list, then `help('topics')` for orientation. `spawn_agent` is the only correct way to start a sub-agent; `spawn_process` is for non-LLM commands; `list_processes` / `get_process_output` inspect them; `send_input` / `send_message` drive them. Whatever you spawn is yours to `close_process` when done. When you `send_message` a sub-agent, its reply comes back into YOUR pane as `[sub-agent:<name>] …`, not into the sub-agent's output — to wait for it, use `timer_fire_when_idle_any([sub_agent])` and then read your own pane; do NOT `wait_for_pattern` on the sub-agent, that will deadlock until timeout."
 // toolDescriptor is the shape returned by `tools/list`. inputSchema is
 // a JSON Schema object — we provide a minimal `{type: "object"}` schema
@@ -219,7 +219,7 @@ func toolCatalog() []toolDescriptor {
 		},
 		{
 			Name:        "wait_for_pattern",
-			Description: "Block until pattern appears in process output or timeout elapses.",
+			Description: "Block until pattern appears in the TARGET process's own output, or timeout elapses. Use this for waiting on text the target itself will emit (a shell prompt, a build's \"tests passed\" line, etc.). Anti-pattern: do NOT use this to wait for a sub-agent's reply to send_message — replies are routed into the CALLER's pane tagged `[sub-agent:<name>]`, not into the sub-agent's output, so this call will spin to timeout. For sub-agent coordination use `timer_fire_when_idle_any` and then read your own pane.",
 			InputSchema: objectSchema(map[string]any{
 				"process_id":      stringProp("Target process id."),
 				"pattern":         stringProp("Regex pattern."),
@@ -249,7 +249,7 @@ func toolCatalog() []toolDescriptor {
 		},
 		{
 			Name:        "send_message",
-			Description: "Deliver a text message to another process as orchestrator-owned input.",
+			Description: "Deliver a text message to another process as orchestrator-owned input. Fire-and-forget: returns immediately, without waiting for the recipient to read or act. If the recipient replies via send_message, that reply arrives in YOUR pane tagged `[sub-agent:<name>]` (child→parent) or `[orchestrator]` (parent→child) — NOT in the recipient's output. To wait for a sub-agent's reply, schedule `timer_fire_when_idle_any([sub_agent_id], body=…)` and then read your own pane when the timer fires. Do not `wait_for_pattern` on the recipient for a reply; it will deadlock.",
 			InputSchema: objectSchema(map[string]any{
 				"target_process_id": stringProp("Recipient process id."),
 				"message":           stringProp("Message body."),
@@ -283,7 +283,7 @@ func toolCatalog() []toolDescriptor {
 		},
 		{
 			Name:        "timer_fire_when_idle_any",
-			Description: "Schedule a timer that fires when any watched process enters idle (already-idle entries excluded), or when max_wait_seconds elapses.",
+			Description: "Canonical way to wait for a sub-agent to finish working: send_message the sub-agent, then schedule this with watched=[sub_agent_id]; when it fires, the reply is already sitting in your own pane tagged `[sub-agent:<name>]`. Schedules a timer that fires when any watched process enters idle (already-idle entries excluded), or when max_wait_seconds elapses.",
 			InputSchema: objectSchema(map[string]any{
 				"watched":          arrayOfStringsProp("Process ids to watch."),
 				"body":             stringProp("Message delivered verbatim to the owning agent when the timer fires."),
@@ -294,7 +294,7 @@ func toolCatalog() []toolDescriptor {
 		},
 		{
 			Name:        "timer_fire_when_idle_all",
-			Description: "Schedule a timer that fires when all watched processes are idle (already-idle entries count as satisfied), or when max_wait_seconds elapses.",
+			Description: "Canonical way to wait for several sub-agents to finish working in parallel: send_message each one, then schedule this with watched=[…ids]; when it fires, each reply is in your own pane tagged `[sub-agent:<name>]`. Schedules a timer that fires when all watched processes are idle (already-idle entries count as satisfied), or when max_wait_seconds elapses.",
 			InputSchema: objectSchema(map[string]any{
 				"watched":          arrayOfStringsProp("Process ids to watch."),
 				"body":             stringProp("Message delivered verbatim to the owning agent when the timer fires."),