Release v0.0.2

Bundles the in-flight work into the second tagged release. See CHANGELOG.md `[0.0.2] - 2026-05-15` for the full per-change list.

Highlights:

- libghostty-vt was building in zig's silent Debug default, capping the full pipeline at 34-63 fps. Makefile now defaults to ReleaseFast (.mise.toml pins zig 0.15.2 so the build is reproducible). End-to-end pipeline now runs at 930-2030 fps: 27-32× faster, with 7-16× headroom over a 120 fps target.
- --debug[=DIR] and --profile[=DIR] flags capture full PTY logs, pprof data, and per-hot-path metrics (chunks/sec, mean/max latencies, cache hit rates) for offline analysis. Nothing pollutes stdout/stderr.
- ASCII-video benchmark suite (8-colour / truecolor / Bad-Apple patterns at 30/60/120 fps) plus a renderer microbenchmark set for stable A/B comparisons across changes.
- Click-and-drag text selection from alt-screen TUIs (codex) now works: host mouse mode follows the focused child's screen side instead of being permanently armed.
- Long claude session resume + codex steady-state rendering pay less per chunk: drawSidebar deferred to the chrome ticker, emulator.Title CGO poll gated on a containsOSC scan.
- Vendor-TUI orientation: MCP initialize.instructions, the spawn_agent tool description, and help('spawning') all spell out the anti-patterns (shell-out, perl-into-socket) that produced codex's stray top-level tabs.
mise-pin zig 0.15.2; rebuild libghostty-vt ReleaseFast — 27-32x pipeline speedup
2026-05-15 14:22:59 +01:00 · 2026-05-15 13:54:48 +01:00 · 2026-05-15 13:43:31 +01:00 · 2026-05-15 13:31:37 +01:00 · 2026-05-15 12:46:42 +01:00 · 2026-05-15 12:41:47 +01:00
18 changed files with 2024 additions and 37 deletions
--- a/.mise.toml
+++ b/.mise.toml
@@ -0,0 +1,8 @@
+# mise config — `mise install` provisions the tools `make deps` needs.
+#
+# libghostty-vt is built from a pinned upstream Ghostty commit; that
+# commit's build.zig.zon pins minimum_zig_version = 0.15.2. We match
+# it here so contributors don't have to puzzle out the version from
+# a deep upstream file.
+[tools]
+zig = "0.15.2"
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,7 +6,68 @@ loosely follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

 ## [Unreleased]

+## [0.0.2] - 2026-05-15
+
 ### Added
+- `.mise.toml` pinning `zig = "0.15.2"` (the minimum version the
+  vendored Ghostty commit requires). Contributors run
+  `mise install` once; the Makefile picks up the resulting `zig`
+  binary automatically via `mise which zig` and falls back to
+  PATH when mise isn't available, so the existing build flow
+  still works.
+- ASCII-video stress benchmarks (`internal/app/bench_test.go`):
+  per-frame and per-stream variants at 30 / 60 / 120 fps targets,
+  three workload fixtures (8-colour cells, 24-bit truecolor cells,
+  and a Bad-Apple-style 1-bit pattern). Each stream benchmark
+  reports `µs/frame`, an achievable `fps_ceiling`, and `budget_pct`
+  so you can read off "do we hit N fps?" directly. A matching
+  `Pipeline_ASCIIVideo_*` set includes libghostty-vt's `em.Write` CGO
+  call and an `io.Discard` stdout write so the FPS claim reflects the
+  whole pipeline, not just the renderer.
+- MCP `initialize.instructions`, the `spawn_agent` tool description
+  (visible to LLMs via `tools/list`), and the `help('spawning')`
+  topic now spell out — in the three places vendor TUIs actually
+  consult — that the connected `patterm` MCP server is the only
+  correct way to drive the host. Anti-patterns called out by name:
+  (a) trying to launch `patterm` / `patterm mcp-stdio` themselves,
+  (b) piping JSON-RPC into the per-PID Unix socket via `perl` /
+  `nc` / `socat` / `curl`, and (c) shelling out to `claude` /
+  `codex` / `opencode` to start a peer. Each of those bypasses
+  caller identity, so a sub-agent spawned that way reads back as
+  a stray top-level tab instead of a child under the spawning
+  agent. Codex was hitting (b) and (c) in practice — this is the
+  fix.
+- `--debug[=DIR]` flag captures detailed run artefacts for offline
+  analysis: a verbose `patterm.log` (the existing `PATTERM_DEBUG_LOG`
+  stream), an `events.jsonl` lifecycle log (spawn / exit / idle-state
+  changes with timestamps), and per-child `<id>.raw` files containing
+  the raw PTY byte stream. With no argument, the dated subdir
+  `$XDG_STATE_HOME/patterm/debug/YYYYMMDD-HHMMSS` is used; pass an
+  explicit path to override. All output goes to files — stdout/stderr
+  are untouched.
+- `--profile[=DIR]` flag captures pprof data plus per-hot-path
+  counters for performance work: `cpu.pprof` (running
+  for the lifetime of the session), plus `heap.pprof` and
+  `goroutine.pprof` snapshots written on shutdown; alongside them,
+  a per-hot-path metrics tracker writes `metrics.jsonl` (one row
+  per second with chunk/byte rates, per-stage mean and max
+  latencies, and cache hit rates) plus a final `metrics.json`
+  aggregate and a human-readable `summary.txt` on exit.
+  Instrumented hot paths: `OnPTYOut`, viewport `renderer.Render`,
+  host stdout writes, libghostty-vt `emulator.Write` / `Title`,
+  sidebar / tab bar / status line draws (with cache-hit
+  accounting), snapshot replays, and the chrome ticker (so you can
+  see how often it fires with nothing to do). Defaults to
+  `$XDG_STATE_HOME/patterm/profile/YYYYMMDD-HHMMSS`. All
+  diagnostics (startup, errors) are written to `profile.log`
+  inside the dir, never to stdout/stderr.
+- Renderer benchmark suite (`internal/app/bench_test.go`). Three
+  workload fixtures — plain ASCII, SGR-styled lines, and a
+  ratatui-style cursor-shuffling burst — plus an OSC-gate
+  micro-benchmark. Run via `go test -bench=. -benchmem
+  ./internal/app/`. Gives a stable reference for the per-chunk
+  cost of the viewport renderer so future changes can be compared
+  apples-to-apples.
 - "New Terminal" entry in the command palette spawns a bare interactive
  `$SHELL` pane (kind `terminal`). Unlike "Run process: …" presets,
  which are session-persistent and reachable via `restart_process`,
@@ -120,6 +181,41 @@ loosely follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
  renders the canonical `--flag` form.

 ### Fixed
+- `make deps` now builds libghostty-vt with `-Doptimize=ReleaseFast`
+  instead of zig's silent `Debug` default, and resolves `zig`
+  through `mise` when a project `.mise.toml` pins it. The
+  default-Debug build shipped an unoptimised CSI/SGR parser that
+  ate 16-29 ms per 30-70 KiB full-screen frame in benchmarks,
+  capping the entire PTY-to-host pipeline at 34-63 fps. After the
+  rebuild the same pipeline runs at **930-2030 fps**: 27-32× the
+  prior throughput, and 7-16× margin over 120 fps for full-screen
+  truecolor ASCII video. Static library size drops from 33 MiB to
+  13 MiB. Override with `make deps GHOSTTY_VT_OPTIMIZE=Debug` only
+  when debugging the upstream library itself. Apply on existing
+  checkouts with `mise install && make clean-deps && make deps`.
+- Long claude session resume (and codex steady-state rendering) is
+  noticeably faster. Two costs that scaled per-PTY-chunk are now
+  deferred or short-circuited: (1) `drawSidebar()` used to run
+  synchronously for every chunk that scrolled — on a session
+  resume where every chunk scrolls, this rebuilt the full sidebar
+  string hundreds of times for a frame that was almost always
+  cache-equal. The sidebar now signals dirty and the chrome ticker
+  (60 Hz) handles the repaint. (2) `pumpChild` polled the
+  emulator's OSC title after every PTY chunk via CGO, even for
+  chunks (the common case under codex/ratatui) that carry no OSC
+  bytes at all. The poll is now gated on a containsOSC scan over
+  the chunk.
+- Click-and-drag text selection from alt-screen TUIs (codex in
+  particular) now works. Patterm used to keep host SGR mouse
+  reporting armed continuously, which forced the host terminal to
+  forward every click as an escape sequence and prevented native
+  selection. The host's mouse mode now follows the focused child's
+  screen side: primary-screen children keep mouse armed (so wheel
+  scrollback works), alt-screen children get host mouse disabled by
+  default. Alt-screen TUIs that need mouse events (vim, less, etc.)
+  re-enable mouse-mode themselves; the viewport renderer forwards
+  those toggles to the host while the child is on alt. Leaving alt
+  re-arms host mouse reporting so wheel scrollback resumes.
 - Exited terminal panes (kind `terminal`, including those launched via
  the new "New Terminal" palette entry or MCP `spawn_process` with
  `kind=terminal`) are now removed from the session and the Processes
--- a/Makefile
+++ b/Makefile
@@ -20,10 +20,30 @@ $(SOURCE)/.git/HEAD:

 deps-fetch: $(SOURCE)/.git/HEAD

+# Zig's `standardOptimizeOption` defaults to .Debug when no
+# -Doptimize is passed, which makes libghostty-vt's CSI/SGR parser
+# an order of magnitude slower — truecolor full-screen frames spend
+# ~16-29 ms each in em.Write under Debug (see
+# internal/app/bench_test.go BenchmarkEmulator_Write_*), which caps
+# the full PTY-to-host pipeline at ~34-63 fps. ReleaseFast is the
+# right default for the shipped artefact. Override with
+# `make deps GHOSTTY_VT_OPTIMIZE=Debug` when you actually want a
+# debug build of the upstream lib.
+GHOSTTY_VT_OPTIMIZE ?= ReleaseFast
+
+# Resolve zig via the project's mise pin (.mise.toml) when available,
+# falling back to whatever's on PATH. mise keeps the zig version in
+# lockstep with what the pinned ghostty commit requires; without it,
+# contributors have to chase the version requirement themselves.
+ZIG := $(shell command -v mise >/dev/null && mise which zig 2>/dev/null || command -v zig 2>/dev/null)
+
 $(INSTALL)/lib/libghostty-vt.a: $(SOURCE)/.git/HEAD
-	@command -v zig >/dev/null || { echo "ERROR: zig not on PATH (need >=0.15.2 to build libghostty-vt)"; exit 1; }
-	@echo ">> building libghostty-vt with zig"
-	@cd $(SOURCE) && zig build -Demit-lib-vt --prefix $(INSTALL)
+	@if [ -z "$(ZIG)" ]; then \
+		echo "ERROR: zig not available. Run \`mise install\` (see .mise.toml — needs zig 0.15.2) or install zig manually."; \
+		exit 1; \
+	fi
+	@echo ">> building libghostty-vt with $(ZIG) (optimize=$(GHOSTTY_VT_OPTIMIZE))"
+	@cd $(SOURCE) && $(ZIG) build -Demit-lib-vt -Doptimize=$(GHOSTTY_VT_OPTIMIZE) --prefix $(INSTALL)
 	@test -f $(INSTALL)/lib/libghostty-vt.a || { echo "ERROR: expected static lib at $(INSTALL)/lib/libghostty-vt.a"; exit 1; }
 	@echo ">> libghostty-vt installed under $(INSTALL)"

--- a/TODO.md
+++ b/TODO.md
@@ -1,14 +1,171 @@
- [ ] Codex seemed to think that it needed to launch patterm itself to get the mcp working
- [ ] I cant click and drag to select text from codex
- [ ] codex uses perl to interact with the socket rather than calling mcp tools
-  - when it _did_ open a sub claude it opened it as a separate tab rather than a sub-agent.
- [ ] codex rendering is VERY slow
-  - maybe we need to use diffing rather than rendering the entire viewport for performance
- We should add a --debug and --profile flag, so we can get detailed performance data and full logs of the agent output to be debugged later on.
-  - I don't mind what format this is in, ideally easy for LLMs to understand
- [ ] Resuming a long claude session takes a couple of seconds for the entire buffer to load in, it looks like it's scrolling down for a couple seconds.
-  - In raw alacritty this is instant, so there's some sort of performance issue with patterm's terminal emulation.
+# Perf Audit (auto-generated 2026-05-15)
+Findings from a codebase sweep — not user-reported, needs review before
+action. Each item names the anchor and a sketched fix.

+Baseline benchmark numbers (`go test -bench=. ./internal/app/`, AMD
+Ryzen 7 7800X3D, libghostty-vt **ReleaseFast** after the Makefile
+fix landed):
+
+```
+# Renderer alone
+ViewportRenderer_PlainASCII       229 MB/s     1.3 KB/op    6 allocs/op
+ViewportRenderer_StyledLines       89 MB/s    91   KB/op  4325 allocs/op
+ViewportRenderer_RatatuiBurst      40 MB/s   365   KB/op 17306 allocs/op
+RendererThroughput_ReuseInstance   90 MB/s   316   KB/op 17380 allocs/op
+ContainsOSC_NoOSC                3050 MB/s     0   B/op     0 allocs/op
+
+# ASCII-video stream (renderer only — 3 sec at the target fps)
+ASCIIVideo_Stream_8Color_120fps     260 µs/frame  3845 fps_ceiling   3.1% budget
+ASCIIVideo_Stream_TrueColor_120fps  576 µs/frame  1735 fps_ceiling   6.9% budget
+
+# Full pipeline (em.Write + renderer + io.Discard write)
+Pipeline_ASCIIVideo_8Color_120fps     493 µs/frame  2030 fps_ceiling   5.9% budget
+Pipeline_ASCIIVideo_TrueColor_120fps 1075 µs/frame   931 fps_ceiling  12.9% budget
+
+# Emulator alone (libghostty-vt CSI/SGR parser)
+Emulator_Write_Stream_8Color_120fps    257 µs/frame  3890 fps_ceiling
+Emulator_Write_Stream_TrueColor_120fps 488 µs/frame  2051 fps_ceiling
+```
+
+Result of the ReleaseFast rebuild: 27-32× pipeline speedup, ~60×
+emulator speedup. Pipeline hits 930-2030 fps end-to-end, leaving
+7-16× headroom over the 120 fps target on the heaviest workload
+(truecolor full-screen redraws).
+
+
+- [ ] **viewport renderer allocates ~1 alloc per 4 input bytes on SGR/CSI-heavy chunks.** [MEDIUM]
+  - `internal/app/viewport_renderer.go` — the styled-lines and
+    ratatui benchmarks show 4-17k allocs per chunk. The hot
+    contributors are likely (a) `string(vr.buf)` / `string(params)`
+    conversions in `emitCSI` for every escape sequence, (b) the
+    `pending strings.Builder` resizing as fragments arrive, and (c)
+    `vr.shifter.Shift(vr.buf)` returning a fresh slice per CSI.
+  - Fix direction: switch CSI param parsing to byte-slice
+    comparison (no string conversion); reuse `vr.buf` and
+    `vr.pending` backing arrays across `Render` calls by
+    pre-growing in `newViewportRenderer`; have `cursorShifter.Shift`
+    return into a caller-owned buffer instead of allocating.
+    Profile-guided: run the styled-lines bench, point pprof at the
+    allocs profile, fix the top three call sites.
+
+- [ ] **viewport renderer throughput (~90 MB/s styled) limits codex steady-state.** [MEDIUM]
+  - The styled-lines and ratatui benchmarks come in at 89 MB/s and
+    40 MB/s respectively. A 100 KB/s codex burst is far under that
+    limit, but a session-resume dump of a 5 MiB chat history takes
+    roughly 60-130 ms of pure renderer time at those rates, enough to
+    be user-visible at the start of a long resume.
+  - Fix direction: same as the alloc fix above; once the per-call
+    allocation cost drops, the throughput ceiling rises with it.
+    Worth re-running the benches after fixing the allocs and only
+    investing further if the styled-lines bench is still under
+    ~300 MB/s.
+
+- [ ] **Session.Children() allocates a fresh slice on every call.** [MEDIUM]
+  - `internal/app/session.go:530-541` walks `s.order` under `s.mu` and
+    builds a new `[]*Child` slice every time. Callers on hot paths:
+    `drawSidebar` calls it twice per frame
+    (`internal/app/sidebar.go:139` and `:171`); `drawTabBar` calls it
+    once per frame (`internal/app/tabbar.go:37`); the classifier
+    iterates it every 250 ms (`internal/app/classifier.go:38`); and
+    palette/navigation hit it on every Ctrl-A/D/W/S keystroke.
+  - Fix direction: store the snapshot in an `atomic.Pointer[[]*Child]`
+    on `Session`, refresh it under `s.mu` only when `Spawn` / `delete`
+    mutates the map. Readers get O(1) `Load()` with zero allocation —
+    same pattern already used for `listeners` (session.go:118-123).
+
+- [ ] **wait_for_pattern re-scans the entire stream/grid every iteration.** [MEDIUM]
+  - `internal/app/host.go:476-493` (the `check` closure). On `scope =
+    "scrollback"` it calls `c.StreamRead(0)` followed by
+    `stripANSIBytes(nil, b)` over the entire ring on every wake — a
+    full O(ring size) walk per chunk arrival. On `grid` it goes
+    through PlainText (one CGO call) plus a regex match against the
+    full grid string. For an agent waiting on a marker in a chatty
+    pane, every PTY chunk fires `check()`.
+  - Fix direction: for `scrollback`, track the offset of the last
+    check and run the regex only over the new tail, reusing a
+    per-call scratch buffer for ANSI stripping. For `grid`, dedupe
+    on `ScreenVersion()` — skip when version hasn't changed.
+
+- [ ] **search_output compiles regex + strips ANSI on every call.** [MEDIUM]
+  - `internal/app/host.go:428` compiles a fresh `regexp.Regexp` per
+    invocation; `:434` strips ANSI over the entire ring buffer when
+    `kind="rendered"`. Agents that poll `search_output` with the same
+    pattern (the typical "watch for marker" loop) repay both costs on
+    every call.
+  - Fix direction: small LRU of compiled regexes keyed by pattern
+    string (cap maybe 32) on `toolHost`. Cache the stripped-ANSI
+    buffer keyed by `c.ScreenVersion()` so consecutive searches over
+    an unchanged ring reuse the strip.
+
+- [ ] **GetProcessOutput grid mode acquires the emulator twice.** [MEDIUM]
+  - `internal/app/host.go:375-391` does `em := c.Emulator()` for
+    ActiveScreen / Cursor / Size, then at line 387 re-fetches
+    `em := c.Emulator()` for PlainText. Each `Emulator()` call goes
+    through `ptyMu` and inspects the live PTY pointer. Under a
+    chatty agent polling `get_process_output` every 100 ms this is
+    a redundant lock and pointer chase per call.
+  - Fix direction: hold the emulator reference from the first
+    lookup; reuse it for PlainText. The check `if em == nil` still
+    runs cleanly because the variable is captured.
+
+- [ ] **FindChildByIdentity is O(N) under the session lock.** [LOW]
+  - `internal/app/session.go:553-565` scans the children map looking
+    for a matching `Identity` token on every new mcp-stdio
+    connection. Not a steady-state hot path — only fires once per
+    child spawn — but with many short-lived sub-agents it adds up
+    and contends with everyone else taking `s.mu`.
+  - Fix direction: maintain an `identityIndex map[string]string`
+    (identity → child id) updated alongside spawn / exit, give the
+    lookup an O(1) read.
+
+- [ ] **Per-promoter regex matches in the idle classifier.** [LOW]
+  - `internal/app/idle.go:175-182` (`matchAny`) walks each compiled
+    pattern and runs the DFA over the same 4 KiB tail. A preset with
+    five permission patterns + five error patterns is ten DFA
+    invocations per child per 250 ms tick.
+  - Fix direction: at preset load time, compile each `_patterns`
+    list into a single alternation regex (`(?:p1)|(?:p2)|…`). The
+    classifier then makes one Match call per category per tick.
+
+- [ ] **Port-detection dedup is O(N²) over c.ports.** [LOW]
+  - `internal/app/child.go:461-467`: for each fresh URL match the
+    code linearly scans the existing port list. The list rarely
+    grows past a handful, but a dev server that lists "all open
+    ports" in one log line interacts badly: M new matches × N
+    existing entries.
+  - Fix direction: keep a `seenPorts map[int]struct{}` next to
+    `c.ports`, rebuilt on prune (none today). O(1) per match.
+
+- [ ] **Port-sighting string allocations happen before the dedup check.** [LOW]
+  - `internal/app/child.go:455-456` allocates `urlForm` and `portStr`
+    before line 461's `seen` walk. Both strings are wasted when the
+    port is already in `c.ports`. Inside `c.portsMu` for the whole
+    loop body too, blocking the `Ports()` reader path.
+  - Fix direction: bind the port int first (cheap parse from
+    `m[1]`), do the seen check, only then allocate the URL string
+    for the surviving sighting.
+
+- [ ] **classifier `time.Now()` syscall per child per tick.** [LOW]
+  - `internal/app/classifier.go:54` (and the `IdleMS` /
+    `TitleIdleMS` helpers it transitively calls in
+    `internal/app/child.go:343-374`) each call `time.Now()`.
+    Reading time on Linux is fast (vDSO) but with N children × 4
+    `time.Now()` per tick × 4 ticks/sec it's wasted work that can
+    be batched.
+  - Fix direction: capture `now := time.Now().UnixNano()` once at
+    the top of `classifyAll` and thread it into `classifyOne` and
+    the helpers as a parameter.
+
+- [ ] **wait_for_pattern subscribes a listener for every call.** [LOW]
+  - `internal/app/host.go:472-474`: each invocation calls
+    `Session.Subscribe(wake)` which clones the listener slice and
+    swaps the atomic pointer; the `defer Unsubscribe` does the same
+    on exit. Two allocations per `wait_for_pattern`. The agent
+    pattern of looping on `wait_for_pattern` after every tool call
+    pays this churn on the steady-state path.
+  - Fix direction: a per-child `chunkBroadcaster` registered once
+    at child spawn that hands out lightweight subscriber tokens,
+    rather than going through the full session listener machinery.

 # On Hold
 - [ ] There's a unicode <?> being displayed in opencode [ON HOLD]
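
The `Session.Children()` item in the perf audit suggests publishing a snapshot through an `atomic.Pointer`. The following is a minimal sketch of that copy-on-write pattern, assuming a cut-down `Session` and `Child`; the field names, `publishLocked`, and `Remove` are hypothetical and do not mirror the real `internal/app/session.go`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Child stands in for the real type; only the ID matters here.
type Child struct{ ID string }

// Session is a cut-down, hypothetical version of the real struct.
type Session struct {
	mu       sync.Mutex
	children map[string]*Child
	order    []string
	snapshot atomic.Pointer[[]*Child] // published, read-only view
}

func NewSession() *Session {
	s := &Session{children: make(map[string]*Child)}
	s.publishLocked()
	return s
}

// publishLocked rebuilds the snapshot. Callers must hold s.mu
// (NewSession is exempt because nothing else can see s yet).
func (s *Session) publishLocked() {
	snap := make([]*Child, 0, len(s.order))
	for _, id := range s.order {
		snap = append(snap, s.children[id])
	}
	s.snapshot.Store(&snap)
}

// Spawn registers a child and republishes the snapshot.
func (s *Session) Spawn(id string) *Child {
	s.mu.Lock()
	defer s.mu.Unlock()
	c := &Child{ID: id}
	s.children[id] = c
	s.order = append(s.order, id)
	s.publishLocked()
	return c
}

// Remove drops a child and republishes the snapshot.
func (s *Session) Remove(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.children[id]; !ok {
		return
	}
	delete(s.children, id)
	for i, v := range s.order {
		if v == id {
			s.order = append(s.order[:i], s.order[i+1:]...)
			break
		}
	}
	s.publishLocked()
}

// Children returns the current snapshot without locking or allocating.
// The slice is shared between readers, so callers must not modify it.
func (s *Session) Children() []*Child {
	return *s.snapshot.Load()
}

func main() {
	s := NewSession()
	s.Spawn("a")
	s.Spawn("b")
	before := s.Children()
	s.Remove("a")
	fmt.Println(len(before), len(s.Children()))
}
```

The trade-off is that every mutation pays for a full rebuild, which suits a list that changes rarely and is read on every frame. A reader that loaded a snapshot before a mutation keeps seeing the old, internally consistent view.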
--- a/cmd/patterm/main.go
+++ b/cmd/patterm/main.go
@@ -16,7 +16,10 @@ import (
 	"context"
 	"fmt"
 	"os"
+	"path/filepath"
+	"runtime"
 	"runtime/debug"
+	"runtime/pprof"
 	"time"

 	flag "github.com/spf13/pflag"
@@ -49,7 +52,13 @@ func main() {
 	var (
 		projectDir  = flag.String("project", "", "project directory (default $PWD)")
 		showVersion = flag.Bool("version", false, "print version and exit")
+		debugDir    = flag.String("debug", "", "write debug logs + per-child raw PTY output to DIR (auto-picks a dated subdir under $XDG_STATE_HOME/patterm/debug when DIR is omitted)")
+		profileDir  = flag.String("profile", "", "write pprof files (cpu/heap/goroutine) and live perf counters (metrics.jsonl per-second, metrics.json + summary.txt on exit) to DIR (auto-picks a dated subdir under $XDG_STATE_HOME/patterm/profile when DIR is omitted)")
 	)
+	// Allow bare `--debug` / `--profile` with no value — pflag treats
+	// them as boolean-shaped strings, picking a sensible default dir.
+	flag.Lookup("debug").NoOptDefVal = "auto"
+	flag.Lookup("profile").NoOptDefVal = "auto"
 	flag.Parse()

 	if *showVersion {
@@ -73,15 +82,104 @@ func main() {
 		die("chdir %s: %v", cwd, err)
 	}

+	resolvedDebug, err := resolveDiagDir(*debugDir, "debug")
+	if err != nil {
+		die("debug: %v", err)
+	}
+	resolvedProfile, err := resolveDiagDir(*profileDir, "profile")
+	if err != nil {
+		die("profile: %v", err)
+	}
+
+	stopProfile := startProfile(resolvedProfile)
+	defer stopProfile()
+
 	ctx := context.Background()
 	if err := app.Run(ctx, app.Options{
 		ProjectDir: cwd,
 		ProjectKey: key,
+		DebugDir:   resolvedDebug,
+		ProfileDir: resolvedProfile,
 	}); err != nil {
 		die("%v", err)
 	}
 }

+// resolveDiagDir turns the raw flag value into an absolute directory
+// path. Empty string disables the feature. The sentinel "auto" (set by
+// NoOptDefVal on bare flags) picks $XDG_STATE_HOME/patterm/<kind>/<ts>.
+// Any other value is treated as a literal path.
+func resolveDiagDir(raw, kind string) (string, error) {
+	if raw == "" {
+		return "", nil
+	}
+	if raw == "auto" {
+		base := os.Getenv("XDG_STATE_HOME")
+		if base == "" {
+			home, err := os.UserHomeDir()
+			if err != nil {
+				return "", err
+			}
+			base = filepath.Join(home, ".local", "state")
+		}
+		ts := time.Now().Format("20060102-150405")
+		return filepath.Join(base, "patterm", kind, ts), nil
+	}
+	return raw, nil
+}
+
+// startProfile begins a CPU profile under dir and returns a stop func
+// that stops and flushes the CPU profile, then writes heap + goroutine
+// snapshots. Returns a no-op stop func when dir is empty or the
+// profile cannot be started. Diagnostics go to <dir>/profile.log,
+// never to stdout/stderr, so the TUI stays uncluttered.
+func startProfile(dir string) func() {
+	if dir == "" {
+		return func() {}
+	}
+	if err := os.MkdirAll(dir, 0o700); err != nil {
+		return func() {}
+	}
+	logPath := filepath.Join(dir, "profile.log")
+	plog := func(format string, args ...any) {
+		f, err := os.OpenFile(logPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o600)
+		if err != nil {
+			return
+		}
+		defer f.Close()
+		fmt.Fprintf(f, format+"\n", args...)
+	}
+	cpuPath := filepath.Join(dir, "cpu.pprof")
+	f, err := os.Create(cpuPath)
+	if err != nil {
+		plog("cpu open: %v", err)
+		return func() {}
+	}
+	if err := pprof.StartCPUProfile(f); err != nil {
+		plog("cpu start: %v", err)
+		_ = f.Close()
+		return func() {}
+	}
+	plog("profiling started at %s", time.Now().Format(time.RFC3339Nano))
+	return func() {
+		pprof.StopCPUProfile()
+		_ = f.Close()
+		// Heap and goroutine snapshots at exit. Heap captures
+		// steady-state allocation; goroutine catches stragglers
+		// that didn't get cleaned up.
+		runtime.GC()
+		if hf, err := os.Create(filepath.Join(dir, "heap.pprof")); err == nil {
+			_ = pprof.Lookup("heap").WriteTo(hf, 0)
+			_ = hf.Close()
+		}
+		if gf, err := os.Create(filepath.Join(dir, "goroutine.pprof")); err == nil {
+			_ = pprof.Lookup("goroutine").WriteTo(gf, 0)
+			_ = gf.Close()
+		}
+		plog("profiling stopped at %s", time.Now().Format(time.RFC3339Nano))
+	}
+}
+
 func runMCPProxy() {
 	var (
 		socket   = flag.String("socket", "", "path to the running patterm process's MCP socket")
--- a/internal/app/app.go
+++ b/internal/app/app.go
@@ -29,6 +29,17 @@ import (
 type Options struct {
 	ProjectDir string
 	ProjectKey string
+	// DebugDir, when non-empty, enables verbose debug logging to
+	// <DebugDir>/patterm.log and per-child raw PTY output capture to
+	// <DebugDir>/<child-id>.raw. The dir is created if missing. Events
+	// (spawn / exit / state change) land in <DebugDir>/events.jsonl.
+	DebugDir string
+	// ProfileDir, when non-empty, enables in-process performance
+	// counters. patterm writes a per-second JSONL snapshot stream to
+	// <ProfileDir>/metrics.jsonl, a final aggregate to metrics.json,
+	// and a human-readable summary.txt on shutdown. The pprof files
+	// written by --profile sit alongside these in the same dir.
+	ProfileDir string
 }

 const keyCtrlK byte = 0x0b
@@ -77,6 +88,22 @@ func Run(ctx context.Context, opts Options) error {

 	sess := NewSession(opts.ProjectDir, opts.ProjectKey)
 	defer sess.Shutdown()
+
+	// Debug capture: when --debug=<dir> is set, write a verbose log
+	// (patterm.log), per-child raw PTY output (<id>.raw), and a
+	// JSONL event stream (events.jsonl). Installed before the TUI
+	// listener so the very first OnChildSpawned / OnPTYOut event
+	// is captured.
+	if opts.DebugDir != "" {
+		dc, err := openDebugCapture(opts.DebugDir)
+		if err != nil {
+			return fmt.Errorf("app: debug capture: %w", err)
+		}
+		os.Setenv("PATTERM_DEBUG_LOG", dc.LogPath())
+		sess.Subscribe(dc)
+		defer dc.Close()
+		logf("debug capture enabled at %s", opts.DebugDir)
+	}
 	// Snapshot persisted processes BEFORE attaching the store: Spawn
 	// mints fresh ids, so the old records would otherwise linger
 	// alongside the new ones. Drop them up front; the restore loop
@@ -113,6 +140,18 @@ func Run(ctx context.Context, opts Options) error {
 	ctx, cancel := context.WithCancel(ctx)
 	defer cancel()

+	// Performance tracker — instrumented hot-path timings written to
+	// <ProfileDir>. nil when --profile is off, in which case every
+	// record*() call is a fast nil check.
+	metrics, err := newMetricsTracker(opts.ProfileDir)
+	if err != nil {
+		return fmt.Errorf("app: metrics tracker: %w", err)
+	}
+	if metrics != nil {
+		go metrics.run(ctx)
+		defer metrics.close()
+	}
+
 	// Per-session idle-detection classifier. One goroutine ticks every
 	// 250ms over every live child and updates IdleState. It stops when
 	// ctx is cancelled.
@@ -129,7 +168,9 @@ func Run(ctx context.Context, opts Options) error {
 		hostCols:   cols,
 		hostRows:   rows,
 		stdinTTY:   term.IsTerminal(int(os.Stdin.Fd())),
+		metrics:    metrics,
 	}
+	sess.SetMetrics(metrics)
 	host.attention = st
 	host.focus = st
 	host.prompter = st
@@ -248,11 +289,20 @@ func Run(ctx context.Context, opts Options) error {
 			case <-st.chromeWake:
 			case <-ticker.C:
 			}
-			if !st.chromeDirty.Swap(false) {
+			chromeChanged := st.chromeDirty.Swap(false)
+			sidebarChanged := st.sidebarDirty.Swap(false)
+			didWork := chromeChanged || sidebarChanged
+			st.metrics.recordTickerFire(didWork)
+			if !didWork {
 				continue
 			}
-			st.drawTabBar()
-			st.drawStatusLine()
+			if chromeChanged {
+				st.drawTabBar()
+				st.drawStatusLine()
+			}
+			if sidebarChanged {
+				st.drawSidebar()
+			}
 		}
 	}()

@@ -355,6 +405,11 @@ type uiState struct {
 	hostCols, hostRows uint16
 	stdinTTY           bool

+	// metrics is the optional performance tracker. nil when --profile
+	// is off. Hot paths call metrics.recordX which is a fast nil
+	// check on the disabled path.
+	metrics *metricsTracker
+
 	// chromeCacheMu guards the last-rendered byte cache for each chrome
 	// element. The tab bar, sidebar, and status line all repaint on
 	// many state changes and on every PTY chunk, but their content
@@ -372,7 +427,14 @@ type uiState struct {
 	// sensitive paths (owner flip, attention, trust, focus change)
 	// continue to call drawStatusLine / drawTabBar synchronously.
 	chromeDirty atomic.Bool
-	chromeWake  chan struct{}
+	// sidebarDirty defers sidebar repaints off the per-chunk hot path
+	// in the same way. A long claude session resume — where every PTY
+	// chunk scrolls the viewport — used to call drawSidebar()
+	// synchronously per chunk, which dominated the resume's wall time
+	// (hundreds of full-sidebar rebuilds for a frame that was almost
+	// always cache-equal).
+	sidebarDirty atomic.Bool
+	chromeWake   chan struct{}

 	// padsCacheMu guards the cached scratchpad listing. The sidebar
 	// and palette/sidebar nav helpers read it on every chunk-driven
@@ -415,14 +477,18 @@ func (st *uiState) focusProcess(processID string) {
 		return
 	}
 	layout := st.layoutSnapshot()
+	onAlt := childIsOnAlt(c)
 	st.mu.Lock()
 	leavingPad := st.focusedPad != ""
 	st.focusedPad = ""
 	st.focusedID = c.ID
 	st.focusedName = c.DisplayName()
 	st.updateActiveAgentLocked(c)
-	st.renderer = newViewportRenderer(layout)
+	r := newViewportRenderer(layout)
+	r.SetChildOnAlt(onAlt)
+	st.renderer = r
 	st.mu.Unlock()
+	st.syncHostMouseForChild(onAlt)
 	// Wipe whatever the previous focus (PTY child or pad view) left in
 	// the viewport before painting the new child's snapshot.
 	if leavingPad {
@@ -434,6 +500,41 @@ func (st *uiState) focusProcess(processID string) {
 	st.drawStatusLine()
 }

+// childIsOnAlt reports whether the child's emulator is currently on
+// its alternate screen. Returns false if the emulator is gone or the
+// query fails.
+func childIsOnAlt(c *Child) bool {
+	if c == nil {
+		return false
+	}
+	em := c.Emulator()
+	if em == nil {
+		return false
+	}
+	sc, err := em.ActiveScreen()
+	if err != nil {
+		return false
+	}
+	return sc == vt.ScreenAlternate
+}
+
+// syncHostMouseForChild emits the host mouse-reporting toggle that
+// matches a newly-focused child's screen side. Primary-screen children
+// want host mouse armed so the wheel drives inline scrollback; alt-
+// screen children get host mouse disabled by default so click-and-drag
+// selection works. Alt-screen TUIs that need mouse (vim, ranger, etc.)
+// re-enable it themselves, and the viewport renderer forwards those
+// toggles back to the host.
+func (st *uiState) syncHostMouseForChild(onAlt bool) {
+	st.outMu.Lock()
+	defer st.outMu.Unlock()
+	if onAlt {
+		_, _ = os.Stdout.WriteString("\x1b[?1000l\x1b[?1006l")
+	} else {
+		_, _ = os.Stdout.WriteString("\x1b[?1000h\x1b[?1006h")
+	}
+}
+
 // focusScratchpad shifts focus to a scratchpad. The main viewport
 // renders the pad's text instead of any child PTY; PTY output for the
 // previously focused child is dropped until focus moves back to a
@@ -572,12 +673,14 @@ func (st *uiState) scratchpadsChanged() {
 // OnChildSpawned auto-focuses the new child.
 func (st *uiState) OnChildSpawned(c *Child) {
 	layout := st.layoutSnapshot()
+	onAlt := childIsOnAlt(c)
 	st.mu.Lock()
 	st.focusedPad = ""
 	st.focusedID = c.ID
 	st.focusedName = c.DisplayName()
 	st.updateActiveAgentLocked(c)
 	renderer := newViewportRenderer(layout)
+	renderer.SetChildOnAlt(onAlt)
 	st.renderer = renderer
 	palOpen := st.palette != nil
 	if palOpen {
@@ -611,6 +714,7 @@ func (st *uiState) OnChildSpawned(c *Child) {
 		st.outMu.Unlock()
 	}

+	st.syncHostMouseForChild(onAlt)
 	st.moveToViewportOrigin()
 	st.drawTabBar()
 	st.drawSidebar()
@@ -710,6 +814,10 @@ func (st *uiState) scheduleAutoRestart(c *Child) {
 // disabled only around the replay so long styled runs cannot wrap into
 // the right rail.
 func (st *uiState) OnPTYOut(childID string, chunk []byte) {
+	var entry time.Time
+	if st.metrics != nil {
+		entry = time.Now()
+	}
 	layout := st.layoutSnapshot()
 	st.mu.Lock()
 	focus := st.focusedID
@@ -726,16 +834,31 @@ func (st *uiState) OnPTYOut(childID string, chunk []byte) {
 	}
 	st.mu.Unlock()
 	if palOpen || focus != childID || renderer == nil {
+		st.metrics.recordPTYOutDrop()
 		return
 	}
 	var out []byte
 	if forceRepaint {
+		var snapStart time.Time
+		if st.metrics != nil {
+			snapStart = time.Now()
+		}
 		out = st.renderFocusedSnapshot(childID, renderer, layout)
+		if st.metrics != nil {
+			st.metrics.recordSnapshot(time.Since(snapStart))
+		}
 		if len(out) == 0 {
 			return
 		}
 	} else {
+		var rstart time.Time
+		if st.metrics != nil {
+			rstart = time.Now()
+		}
 		out = renderer.Render(chunk)
+		if st.metrics != nil {
+			st.metrics.recordRender(time.Since(rstart))
+		}
 	}
 	// One write covers the autowrap-disable prelude, the chunk, and the
 	// autowrap-restore postlude — three syscalls collapsed into one
@@ -745,9 +868,16 @@ func (st *uiState) OnPTYOut(childID string, chunk []byte) {
 	wrapped = append(wrapped, "\x1b[?7l"...)
 	wrapped = append(wrapped, out...)
 	wrapped = append(wrapped, "\x1b[?7h"...)
+	var wstart time.Time
+	if st.metrics != nil {
+		wstart = time.Now()
+	}
 	st.outMu.Lock()
 	_, _ = os.Stdout.Write(wrapped)
 	st.outMu.Unlock()
+	if st.metrics != nil {
+		st.metrics.recordStdout(time.Since(wstart), len(wrapped))
+	}
 	// RI / IND / NEL / SU / SD / IL / DL and bottom-margin LF / VT / FF
 	// scroll content within the host's scroll region, which spans every
 	// column — so any of them drags the right-hand sidebar's session-tree
@@ -760,15 +890,23 @@ func (st *uiState) OnPTYOut(childID string, chunk []byte) {
 		st.chromeCacheMu.Lock()
 		st.sidebarCache = ""
 		st.chromeCacheMu.Unlock()
-		// Scrolled chunks can clobber the sidebar columns; repaint
-		// synchronously so the gap fills before the next chunk lands.
-		st.drawSidebar()
+		// Defer the sidebar repaint to the chrome ticker. On a long
+		// session resume every PTY chunk scrolls, and a synchronous
+		// drawSidebar() per chunk dominates wall time even when the
+		// frame ends up cache-equal — the rebuild work is unconditional.
+		// The chrome ticker drains the dirty flag at ~60 Hz, so the
+		// visible gap a scrolled chunk can leave in the sidebar columns
+		// is bounded by one frame.
+		st.markSidebarDirty()
 	}
 	// Defer the tab bar + status line repaint to the chrome ticker.
 	// The cached frame already short-circuits the wire write, but
 	// avoiding the string build, FindChild, and locking on every
 	// chunk pulls steady-state CPU off the hot path.
 	st.markChromeDirty()
+	if st.metrics != nil {
+		st.metrics.recordPTYOut(time.Since(entry), len(chunk))
+	}
 }

 func (st *uiState) enterScreen() {
@@ -866,6 +1004,18 @@ func (st *uiState) markChromeDirty() {
 	}
 }

+// markSidebarDirty schedules a sidebar repaint on the next ticker
+// frame. Hot path — every scrolled PTY chunk lands here. Synchronous
+// repaints from latency-sensitive sites (spawn, exit, focus, state
+// change, trust) keep calling drawSidebar directly.
+func (st *uiState) markSidebarDirty() {
+	st.sidebarDirty.Store(true)
+	select {
+	case st.chromeWake <- struct{}{}:
+	default:
+	}
+}
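Reviewer note: markSidebarDirty combines an atomic flag with a non-blocking channel send so that any number of marks collapse into one pending wake-up. The sketch below illustrates that pattern in isolation. The `notifier` type and its `drain` method are hypothetical, and the channel capacity of 1 is an assumption; the diff does not show how chromeWake is created.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// notifier is a hypothetical stand-in for the uiState fields involved.
type notifier struct {
	dirty atomic.Bool
	wake  chan struct{} // assumed capacity 1: at most one pending wake-up
}

func newNotifier() *notifier {
	return &notifier{wake: make(chan struct{}, 1)}
}

// markDirty sets the flag and nudges the consumer without ever blocking.
func (n *notifier) markDirty() {
	n.dirty.Store(true)
	select {
	case n.wake <- struct{}{}:
	default: // a wake-up is already queued; this mark coalesces into it
	}
}

// drain clears the flag and reports whether there was work to do.
func (n *notifier) drain() bool {
	return n.dirty.Swap(false)
}

func main() {
	n := newNotifier()
	for i := 0; i < 1000; i++ {
		n.markDirty() // hot path: a thousand scrolled chunks
	}
	fmt.Println("pending wake-ups:", len(n.wake))
	<-n.wake
	fmt.Println("had work:", n.drain())
	fmt.Println("had work again:", n.drain())
}
```

Setting the flag before the send matters: a consumer that wakes up is then guaranteed to see the flag.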
+
 func (st *uiState) invalidateChromeCache() {
 	st.chromeCacheMu.Lock()
 	st.tabBarCache = ""
@@ -896,6 +1046,10 @@ func (st *uiState) renderPaletteLocked() {
 // attention ask. Right side: palette hint. The PTY child occupies
 // host_rows-1 rows so this row is exclusively ours.
 func (st *uiState) drawStatusLine() {
+	var entry time.Time
+	if st.metrics != nil {
+		entry = time.Now()
+	}
 	st.mu.Lock()
 	palOpen := st.palette != nil
 	focusID := st.focusedID
@@ -982,10 +1136,16 @@ func (st *uiState) drawStatusLine() {
 	st.chromeCacheMu.Lock()
 	if line == st.statusLineCache {
 		st.chromeCacheMu.Unlock()
+		if st.metrics != nil {
+			st.metrics.recordStatus(time.Since(entry), true)
+		}
 		return
 	}
 	st.statusLineCache = line
 	st.chromeCacheMu.Unlock()
+	if st.metrics != nil {
+		defer func() { st.metrics.recordStatus(time.Since(entry), false) }()
+	}

 	st.outMu.Lock()
 	defer st.outMu.Unlock()
--- a/internal/app/bench_test.go
+++ b/internal/app/bench_test.go
@@ -0,0 +1,546 @@
+package app
+
+import (
+	"fmt"
+	"io"
+	"strings"
+	"testing"
+
+	"github.com/hjbdev/patterm/internal/vt"
+)
+
+// Benchmarks for patterm's hot paths. Run with:
+//
+//	go test -bench=. -benchmem ./internal/app/
+//
+// or target one:
+//
+//	go test -bench=BenchmarkViewportRenderer_PlainASCII -benchmem ./internal/app/
+//
+// The fixtures below model the workloads we care about most:
+//
+//   - PlainASCII: long-running text output (claude streaming a code
+//     diff, codex outputting a tool result body). Fast-path territory.
+//   - StyledLines: SGR-heavy output (claude/codex chat history with
+//     coloured tokens). State-machine path.
+//   - RatatuiBurst: many short cursor-positioning / SGR transitions in
+//     a tight chunk, matching codex/ratatui's incremental diff
+//     updates.
+//   - SnapshotReplay: full styled-grid replay (focus switch). Not
+//     benchmarked in this file.
+
+// buildPlainASCIIChunk returns an n-byte chunk of pure printable
+// ASCII text with a newline after most lines. It is the cheapest
+// workload and exercises the fast path in viewport_renderer.Render.
+func buildPlainASCIIChunk(n int) []byte {
+	var b strings.Builder
+	b.Grow(n)
+	line := "The quick brown fox jumps over the lazy dog 0123456789 "
+	for b.Len() < n {
+		b.WriteString(line)
+		if b.Len()%80 < len(line) {
+			b.WriteByte('\n')
+		}
+	}
+	return []byte(b.String()[:n])
+}
+
+// buildStyledLinesChunk simulates SGR-heavy output: every word wears
+// a colour, so the renderer breaks out of its fast path on every
+// escape sequence.
+func buildStyledLinesChunk(n int) []byte {
+	var b strings.Builder
+	b.Grow(n)
+	colours := []string{"31", "32", "33", "34", "35", "36"}
+	words := []string{"package", "func", "return", "import", "struct", "type", "const", "var"}
+	i := 0
+	for b.Len() < n {
+		fmt.Fprintf(&b, "\x1b[%sm%s\x1b[0m ", colours[i%len(colours)], words[i%len(words)])
+		if i%10 == 9 {
+			b.WriteByte('\n')
+		}
+		i++
+	}
+	return []byte(b.String()[:n])
+}
+
+// buildRatatuiBurst simulates a single ratatui-style diff frame:
+// CUP, SGR, a few chars, CUP, SGR, a few chars… for a viewport's
+// worth of cells.
+func buildRatatuiBurst(cells int) []byte {
+	var b strings.Builder
+	for i := 0; i < cells; i++ {
+		row := (i / 80) + 1
+		col := (i % 80) + 1
+		fmt.Fprintf(&b, "\x1b[%d;%dH\x1b[3%dm%c", row, col, i%8, byte('A'+(i%26)))
+	}
+	b.WriteString("\x1b[0m")
+	return []byte(b.String())
+}
+
+// BenchmarkViewportRenderer_PlainASCII drives a 16 KiB plain-text
+// chunk through Render once per iteration. Reports ns/op,
+// allocations, and B/op.
+func BenchmarkViewportRenderer_PlainASCII(b *testing.B) {
+	chunk := buildPlainASCIIChunk(16 * 1024)
+	b.SetBytes(int64(len(chunk)))
+	b.ReportAllocs()
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		vr := newViewportRenderer(newTerminalLayout(120, 40))
+		_ = vr.Render(chunk)
+	}
+}
+
+// BenchmarkViewportRenderer_StyledLines exercises the per-byte CSI
+// path on SGR-heavy output. Most claude/codex chat resume traffic
+// looks like this — coloured prose with frequent style toggles.
+func BenchmarkViewportRenderer_StyledLines(b *testing.B) {
+	chunk := buildStyledLinesChunk(16 * 1024)
+	b.SetBytes(int64(len(chunk)))
+	b.ReportAllocs()
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		vr := newViewportRenderer(newTerminalLayout(120, 40))
+		_ = vr.Render(chunk)
+	}
+}
+
+// BenchmarkViewportRenderer_RatatuiBurst measures the worst-case
+// cursor-shuffling workload: full-frame diff updates dominated by
+// CUP + SGR + single-char writes.
+func BenchmarkViewportRenderer_RatatuiBurst(b *testing.B) {
+	chunk := buildRatatuiBurst(80 * 24) // one screenful of cells
+	b.SetBytes(int64(len(chunk)))
+	b.ReportAllocs()
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		vr := newViewportRenderer(newTerminalLayout(120, 40))
+		_ = vr.Render(chunk)
+	}
+}
+
+// BenchmarkContainsOSC_NoOSC and BenchmarkContainsOSC_WithOSC measure
+// the OSC gate that pumpChild runs before deciding whether to make
+// the per-chunk Title() CGO call. Inputs:
+//   - NoOSC: SGR-styled output without OSC, the common case for
+//     codex/ratatui. We want this near zero.
+//   - WithOSC: the same chunk with an OSC title sequence appended.
+func BenchmarkContainsOSC_NoOSC(b *testing.B) {
+	chunk := buildStyledLinesChunk(8 * 1024)
+	b.SetBytes(int64(len(chunk)))
+	b.ReportAllocs()
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		_ = containsOSC(chunk)
+	}
+}
+
+func BenchmarkContainsOSC_WithOSC(b *testing.B) {
+	chunk := append(buildStyledLinesChunk(8*1024), []byte("\x1b]0;new title\x07")...)
+	b.SetBytes(int64(len(chunk)))
+	b.ReportAllocs()
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		_ = containsOSC(chunk)
+	}
+}
+
+// BenchmarkRendererThroughput_ReuseInstance approximates real
+// session behaviour: each benchmark op builds one viewport renderer
+// and feeds it 16 chunks in sequence, so the renderer's setup cost
+// is amortised across the chunks. Reports a throughput closer to
+// the steady-state OnPTYOut path. Chunks are 4 KiB to match typical
+// PTY read sizes.
+func BenchmarkRendererThroughput_ReuseInstance(b *testing.B) {
+	chunks := make([][]byte, 16)
+	for i := range chunks {
+		chunks[i] = buildStyledLinesChunk(4 * 1024)
+	}
+	totalBytes := 0
+	for _, c := range chunks {
+		totalBytes += len(c)
+	}
+	b.SetBytes(int64(totalBytes))
+	b.ReportAllocs()
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		vr := newViewportRenderer(newTerminalLayout(120, 40))
+		for _, c := range chunks {
+			_ = vr.Render(c)
+		}
+	}
+}
+
+// Stress workloads — these model the worst things a real session
+// can throw at us. The headline target is "ASCII video": every cell
+// of an 80x40 viewport carries an SGR colour change and a printable
+// character, rendered as one chunk per frame. Real ASCII-video CLIs
+// (ascii-image-converter, asciinema-render, towel.blinkenlights, the
+// Bad Apple meme) hit patterm with exactly this pattern at 24-30 fps
+// for minutes at a time.
+//
+// We synthesise the workload rather than ship a captured corpus so
+// the benchmarks stay deterministic and the repo doesn't carry tens
+// of MiB of fixture data. The encoding is faithful to what those
+// tools actually emit.
+
+// buildASCIIVideoFrame builds a single full-viewport frame with
+// 8-colour SGR per cell (`\x1b[3Nm`). One frame is 19,443 bytes
+// (≈ 19 KiB) for an 80x40 viewport: 6 bytes per cell plus a 6-byte
+// reset and CRLF per row.
+func buildASCIIVideoFrame(cols, rows int) []byte {
+	var b strings.Builder
+	b.WriteString("\x1b[H") // home cursor before the frame starts
+	for r := 0; r < rows; r++ {
+		for c := 0; c < cols; c++ {
+			fmt.Fprintf(&b, "\x1b[3%dm%c", (r+c)%8, byte(' '+(r*c)%(0x7e-' ')))
+		}
+		b.WriteString("\x1b[0m\r\n")
+	}
+	return []byte(b.String())
+}
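Reviewer note: the size of the 8-colour frame can be derived by hand and checked against the builder. The sketch below copies the builder's loop from the diff under the hypothetical name `frame8Color` and compares it with a closed-form count; `expected8ColorSize` is likewise a hypothetical helper. For 80x40 both come to 19,443 bytes, about 19 KiB.

```go
package main

import (
	"fmt"
	"strings"
)

// frame8Color mirrors buildASCIIVideoFrame from the diff.
func frame8Color(cols, rows int) []byte {
	var b strings.Builder
	b.WriteString("\x1b[H") // 3 bytes: home the cursor
	for r := 0; r < rows; r++ {
		for c := 0; c < cols; c++ {
			// 5-byte SGR (ESC [ 3 N m) plus one printable ASCII glyph.
			fmt.Fprintf(&b, "\x1b[3%dm%c", (r+c)%8, byte(' '+(r*c)%(0x7e-' ')))
		}
		b.WriteString("\x1b[0m\r\n") // 6 bytes: reset plus CRLF
	}
	return []byte(b.String())
}

// expected8ColorSize is the closed form of the byte count above.
func expected8ColorSize(cols, rows int) int {
	return 3 + rows*(cols*6+6)
}

func main() {
	got := len(frame8Color(80, 40))
	want := expected8ColorSize(80, 40)
	fmt.Printf("built=%d bytes, formula=%d bytes (%.1f KiB)\n",
		got, want, float64(want)/1024)
}
```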
+
+// buildASCIIVideoFrameTrueColor builds the same frame but with
+// 24-bit RGB SGR (`\x1b[38;2;R;G;Bm`). Every cell is at most 19
+// bytes of escape + 1 byte glyph, so an 80x40 frame is at most
+// ≈ 63 KiB. This is what chafa --colors=full and modern terminal
+// video players emit, and it's the heaviest SGR variant the
+// renderer's CSI path sees.
+func buildASCIIVideoFrameTrueColor(cols, rows int) []byte {
+	var b strings.Builder
+	b.WriteString("\x1b[H")
+	for r := 0; r < rows; r++ {
+		for c := 0; c < cols; c++ {
+			rd := (r * 7) % 256
+			gd := (c * 11) % 256
+			bd := ((r + c) * 13) % 256
+			fmt.Fprintf(&b, "\x1b[38;2;%d;%d;%dm%c", rd, gd, bd, byte(' '+(r*c)%(0x7e-' ')))
+		}
+		b.WriteString("\x1b[0m\r\n")
+	}
+	return []byte(b.String())
+}
+
+// buildBadApplePattern builds the simplest possible ASCII video
+// frame: alternating black/white cells (the Bad Apple meme is
+// essentially a 1-bit silhouette video). This is the pattern that
+// stresses the SGR state-machine without exercising truecolor parse
+// — useful for isolating "is the cost in the colour parsing or in
+// the cell-by-cell switching?"
+func buildBadApplePattern(cols, rows int) []byte {
+	var b strings.Builder
+	b.WriteString("\x1b[H")
+	for r := 0; r < rows; r++ {
+		for c := 0; c < cols; c++ {
+			if (r+c)%2 == 0 {
+				b.WriteString("\x1b[37m█")
+			} else {
+				b.WriteString("\x1b[30m█")
+			}
+		}
+		b.WriteString("\x1b[0m\r\n")
+	}
+	return []byte(b.String())
+}
+
+// BenchmarkASCIIVideo_Frame_8Color renders a single full-screen
+// frame as one chunk. The headline number is MB/s. At 30 fps a
+// frame arrives as one PTY chunk every ~33 ms, so one Render call
+// should stay well under 1 ms.
+func BenchmarkASCIIVideo_Frame_8Color(b *testing.B) {
+	frame := buildASCIIVideoFrame(80, 40)
+	b.SetBytes(int64(len(frame)))
+	b.ReportAllocs()
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		vr := newViewportRenderer(newTerminalLayout(120, 40))
+		_ = vr.Render(frame)
+	}
+}
+
+// BenchmarkASCIIVideo_Frame_TrueColor renders a single truecolor
+// frame, up to ≈ 63 KiB. Compare this to the 8-colour number to
+// see how much extra cost the truecolor SGR parse imposes: the
+// `\x1b[38;2;R;G;Bm` form is the longest and most parameter-rich
+// CSI patterm sees in practice.
+func BenchmarkASCIIVideo_Frame_TrueColor(b *testing.B) {
+	frame := buildASCIIVideoFrameTrueColor(80, 40)
+	b.SetBytes(int64(len(frame)))
+	b.ReportAllocs()
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		vr := newViewportRenderer(newTerminalLayout(120, 40))
+		_ = vr.Render(frame)
+	}
+}
+
+// BenchmarkASCIIVideo_Frame_BadApple is the 1-bit pattern: simplest
+// SGR (two colours, alternating). Isolates the renderer's cell-by-
+// cell SGR cycling cost from the truecolor parse cost.
+func BenchmarkASCIIVideo_Frame_BadApple(b *testing.B) {
+	frame := buildBadApplePattern(80, 40)
+	b.SetBytes(int64(len(frame)))
+	b.ReportAllocs()
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		vr := newViewportRenderer(newTerminalLayout(120, 40))
+		_ = vr.Render(frame)
+	}
+}
+
+// runStreamBench is the shared body for the per-fps stream
+// benchmarks. It feeds a fixed frame N times through a single
+// renderer instance and reports µs/frame + an achievable-fps
+// ceiling alongside the standard ns/op + MB/s. The fps value in
+// the benchmark name is the *target* — the workload itself doesn't
+// rate-limit; we just decide how many frames make a benchmark op
+// (3 seconds' worth) so steady-state cost dominates warm-up.
+func runStreamBench(b *testing.B, frame []byte, fps int) {
+	frames := fps * 3 // 3 seconds at the target rate
+	totalBytes := int64(len(frame) * frames)
+	b.SetBytes(totalBytes)
+	b.ReportAllocs()
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		vr := newViewportRenderer(newTerminalLayout(120, 40))
+		for f := 0; f < frames; f++ {
+			_ = vr.Render(frame)
+		}
+	}
+	nsPerFrame := float64(b.Elapsed().Nanoseconds()) / float64(b.N*frames)
+	b.ReportMetric(nsPerFrame/1000.0, "µs/frame")
+	b.ReportMetric(1e9/nsPerFrame, "fps_ceiling")
+	// budget_pct = how much of the per-frame budget at the target
+	// rate we burn. Under 100 means we can hit the target; over
+	// means we can't.
+	budgetNs := 1e9 / float64(fps)
+	b.ReportMetric(nsPerFrame/budgetNs*100, "budget_pct")
+}
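Reviewer note: the three custom metrics all derive from one measured value, the nanoseconds spent per frame. The sketch below repeats that arithmetic as a standalone function; `frameBudget` is a hypothetical name and the 1 ms input is an invented measurement, not a result from this change. At 1 ms per frame against a 120 fps target it gives a ceiling of 1000 fps and 12% of the per-frame budget.

```go
package main

import "fmt"

// frameBudget converts a measured per-frame cost into the derived
// metrics runStreamBench reports: the achievable fps ceiling and the
// share of the per-frame budget consumed at the target rate.
func frameBudget(nsPerFrame float64, targetFPS int) (fpsCeiling, budgetPct float64) {
	fpsCeiling = 1e9 / nsPerFrame
	budgetNs := 1e9 / float64(targetFPS) // time available per frame
	budgetPct = nsPerFrame / budgetNs * 100
	return fpsCeiling, budgetPct
}

func main() {
	// Hypothetical measurement: 1 ms per frame, 120 fps target.
	ceiling, pct := frameBudget(1e6, 120)
	fmt.Printf("fps_ceiling=%.0f budget_pct=%.1f\n", ceiling, pct)
}
```

A budget_pct under 100 means the target rate is reachable; the fps ceiling is the same information expressed as a rate.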
+
+// BenchmarkASCIIVideo_Stream_8Color_30fps / _60fps / _120fps reuse
+// one renderer across (3 × fps) frames. The headline numbers are
+// µs/frame, fps_ceiling (= 1e9 / ns/frame), and budget_pct (=
+// percent of the per-frame budget at the target rate we consume).
+//
+// 30 fps is the typical ASCII-video baseline (towel, chafa, Bad
+// Apple ports). 60 is the "smooth playback" target. 120 is a
+// future-proofing stress level matching modern high-refresh
+// terminals.
+func BenchmarkASCIIVideo_Stream_8Color_30fps(b *testing.B) {
+	runStreamBench(b, buildASCIIVideoFrame(80, 40), 30)
+}
+func BenchmarkASCIIVideo_Stream_8Color_60fps(b *testing.B) {
+	runStreamBench(b, buildASCIIVideoFrame(80, 40), 60)
+}
+func BenchmarkASCIIVideo_Stream_8Color_120fps(b *testing.B) {
+	runStreamBench(b, buildASCIIVideoFrame(80, 40), 120)
+}
+
+// BenchmarkASCIIVideo_Stream_TrueColor_* run the same set with the
+// truecolor frames. Compare against the 8-colour numbers to see
+// what the longer `\x1b[38;2;R;G;Bm` parse costs us.
+func BenchmarkASCIIVideo_Stream_TrueColor_30fps(b *testing.B) {
+	runStreamBench(b, buildASCIIVideoFrameTrueColor(80, 40), 30)
+}
+func BenchmarkASCIIVideo_Stream_TrueColor_60fps(b *testing.B) {
+	runStreamBench(b, buildASCIIVideoFrameTrueColor(80, 40), 60)
+}
+func BenchmarkASCIIVideo_Stream_TrueColor_120fps(b *testing.B) {
+	runStreamBench(b, buildASCIIVideoFrameTrueColor(80, 40), 120)
+}
+
+// BenchmarkASCIIVideo_Stream_BadApple_* tracks the 1-bit alternating
+// pattern. Isolates per-cell SGR cycling cost from the truecolor
+// parse cost above — useful when reading the diff between the two
+// stream variants.
+func BenchmarkASCIIVideo_Stream_BadApple_30fps(b *testing.B) {
+	runStreamBench(b, buildBadApplePattern(80, 40), 30)
+}
+func BenchmarkASCIIVideo_Stream_BadApple_60fps(b *testing.B) {
+	runStreamBench(b, buildBadApplePattern(80, 40), 60)
+}
+func BenchmarkASCIIVideo_Stream_BadApple_120fps(b *testing.B) {
+	runStreamBench(b, buildBadApplePattern(80, 40), 120)
+}
+
+// BenchmarkEmulator_Write_8Color_Frame / _TrueColor_Frame isolate the
+// libghostty-vt CGO cost: the same frames the Pipeline benchmarks
+// use, fed only to the emulator. Each op also pays for creating and
+// closing the emulator. The delta between
+// BenchmarkPipeline_ASCIIVideo_… and the emulator-only numbers is
+// the renderer's share; the rest is libghostty-vt.
+func BenchmarkEmulator_Write_8Color_Frame(b *testing.B) {
+	frame := buildASCIIVideoFrame(80, 40)
+	b.SetBytes(int64(len(frame)))
+	b.ReportAllocs()
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		em, err := vt.NewGhosttyEmulator(80, 40)
+		if err != nil {
+			b.Fatalf("emulator: %v", err)
+		}
+		if _, werr := em.Write(frame); werr != nil {
+			b.Fatalf("emulator.Write: %v", werr)
+		}
+		_ = em.Close()
+	}
+}
+
+func BenchmarkEmulator_Write_TrueColor_Frame(b *testing.B) {
+	frame := buildASCIIVideoFrameTrueColor(80, 40)
+	b.SetBytes(int64(len(frame)))
+	b.ReportAllocs()
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		em, err := vt.NewGhosttyEmulator(80, 40)
+		if err != nil {
+			b.Fatalf("emulator: %v", err)
+		}
+		if _, werr := em.Write(frame); werr != nil {
+			b.Fatalf("emulator.Write: %v", werr)
+		}
+		_ = em.Close()
+	}
+}
+
+// BenchmarkEmulator_Write_Stream_8Color_120fps reuses one emulator
+// across 360 frames (3 sec × 120 fps). This is the cleanest
+// measurement of em.Write steady-state cost.
+func BenchmarkEmulator_Write_Stream_8Color_120fps(b *testing.B) {
+	frame := buildASCIIVideoFrame(80, 40)
+	const frames = 360
+	b.SetBytes(int64(len(frame) * frames))
+	b.ReportAllocs()
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		em, err := vt.NewGhosttyEmulator(80, 40)
+		if err != nil {
+			b.Fatalf("emulator: %v", err)
+		}
+		for f := 0; f < frames; f++ {
+			if _, werr := em.Write(frame); werr != nil {
+				b.Fatalf("emulator.Write: %v", werr)
+			}
+		}
+		_ = em.Close()
+	}
+	nsPerFrame := float64(b.Elapsed().Nanoseconds()) / float64(b.N*frames)
+	b.ReportMetric(nsPerFrame/1000.0, "µs/frame")
+	b.ReportMetric(1e9/nsPerFrame, "fps_ceiling")
+}
+
+func BenchmarkEmulator_Write_Stream_TrueColor_120fps(b *testing.B) {
+	frame := buildASCIIVideoFrameTrueColor(80, 40)
+	const frames = 360
+	b.SetBytes(int64(len(frame) * frames))
+	b.ReportAllocs()
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		em, err := vt.NewGhosttyEmulator(80, 40)
+		if err != nil {
+			b.Fatalf("emulator: %v", err)
+		}
+		for f := 0; f < frames; f++ {
+			if _, werr := em.Write(frame); werr != nil {
+				b.Fatalf("emulator.Write: %v", werr)
+			}
+		}
+		_ = em.Close()
+	}
+	nsPerFrame := float64(b.Elapsed().Nanoseconds()) / float64(b.N*frames)
+	b.ReportMetric(nsPerFrame/1000.0, "µs/frame")
+	b.ReportMetric(1e9/nsPerFrame, "fps_ceiling")
+}
+
+// runPipelineStreamBench includes the libghostty-vt emulator.Write
+// CGO call and a stdout write to io.Discard alongside the renderer
+// — i.e. everything OnPTYOut does in production except the host
+// terminal's own paint time (which patterm doesn't control). This
+// is the honest "can we hit N fps end-to-end?" measurement.
+func runPipelineStreamBench(b *testing.B, frame []byte, fps int) {
+	frames := fps * 3
+	totalBytes := int64(len(frame) * frames)
+	b.SetBytes(totalBytes)
+	b.ReportAllocs()
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		em, err := vt.NewGhosttyEmulator(80, 40)
+		if err != nil {
+			b.Fatalf("emulator: %v", err)
+		}
+		vr := newViewportRenderer(newTerminalLayout(120, 40))
+		for f := 0; f < frames; f++ {
+			if _, werr := em.Write(frame); werr != nil {
+				b.Fatalf("emulator.Write: %v", werr)
+			}
+			out := vr.Render(frame)
+			// Match OnPTYOut's autowrap prelude/postlude wrapping so
+			// the byte count is faithful.
+			_, _ = io.Discard.Write([]byte("\x1b[?7l"))
+			_, _ = io.Discard.Write(out)
+			_, _ = io.Discard.Write([]byte("\x1b[?7h"))
+		}
+		_ = em.Close()
+	}
+	nsPerFrame := float64(b.Elapsed().Nanoseconds()) / float64(b.N*frames)
+	b.ReportMetric(nsPerFrame/1000.0, "µs/frame")
+	b.ReportMetric(1e9/nsPerFrame, "fps_ceiling")
+	budgetNs := 1e9 / float64(fps)
+	b.ReportMetric(nsPerFrame/budgetNs*100, "budget_pct")
+}
+
+// BenchmarkPipeline_ASCIIVideo_* — the FULL OnPTYOut path
+// (emulator.Write CGO + viewport renderer + a stdout write to
+// io.Discard) running at 30/60/120 fps targets. These are the
+// numbers to trust when asking "can we sustain N fps?" The
+// renderer-only Stream benchmarks above isolate one stage and
+// understate the real cost.
+//
+// 120 fps is the explicit baseline: anything under 100% of the
+// per-frame budget here means we hit 120 fps with margin to spare.
+func BenchmarkPipeline_ASCIIVideo_8Color_30fps(b *testing.B) {
+	runPipelineStreamBench(b, buildASCIIVideoFrame(80, 40), 30)
+}
+func BenchmarkPipeline_ASCIIVideo_8Color_60fps(b *testing.B) {
+	runPipelineStreamBench(b, buildASCIIVideoFrame(80, 40), 60)
+}
+func BenchmarkPipeline_ASCIIVideo_8Color_120fps(b *testing.B) {
+	runPipelineStreamBench(b, buildASCIIVideoFrame(80, 40), 120)
+}
+
+func BenchmarkPipeline_ASCIIVideo_TrueColor_30fps(b *testing.B) {
+	runPipelineStreamBench(b, buildASCIIVideoFrameTrueColor(80, 40), 30)
+}
+func BenchmarkPipeline_ASCIIVideo_TrueColor_60fps(b *testing.B) {
+	runPipelineStreamBench(b, buildASCIIVideoFrameTrueColor(80, 40), 60)
+}
+func BenchmarkPipeline_ASCIIVideo_TrueColor_120fps(b *testing.B) {
+	runPipelineStreamBench(b, buildASCIIVideoFrameTrueColor(80, 40), 120)
+}
+
+// BenchmarkSessionResume_5MiBStyled simulates the user's
+// motivating case: claude resuming a long chat session and dumping
+// the whole history. 5 MiB of styled output as a single Render
+// call. Numbers here tell us how long the visible "scrolling
+// while resume loads" window will be.
+func BenchmarkSessionResume_5MiBStyled(b *testing.B) {
+	chunk := buildStyledLinesChunk(5 * 1024 * 1024)
+	b.SetBytes(int64(len(chunk)))
+	b.ReportAllocs()
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		vr := newViewportRenderer(newTerminalLayout(120, 40))
+		_ = vr.Render(chunk)
+	}
+}
+
+// BenchmarkSessionResume_5MiBPlain is the same as above but with
+// pure text. Lower bound: what we'd hit if the resume content were
+// styling-free.
+func BenchmarkSessionResume_5MiBPlain(b *testing.B) {
+	chunk := buildPlainASCIIChunk(5 * 1024 * 1024)
+	b.SetBytes(int64(len(chunk)))
+	b.ReportAllocs()
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		vr := newViewportRenderer(newTerminalLayout(120, 40))
+		_ = vr.Render(chunk)
+	}
+}
--- a/internal/app/debug.go
+++ b/internal/app/debug.go
@@ -0,0 +1,155 @@
+package app
+
+import (
+	"encoding/json"
+	"fmt"
+	"os"
+	"path/filepath"
+	"sync"
+	"time"
+)
+
+// debugCapture implements ChildEventListener and writes structured
+// debug artefacts under a single directory:
+//
+//   - patterm.log    — the existing logf() stream
+//   - events.jsonl   — one JSON object per lifecycle event
+//   - <id>.raw       — raw PTY bytes, one file per child id
+//
+// The capture is installed only when --debug=<dir> is set, so default
+// runs pay nothing.
+type debugCapture struct {
+	dir     string
+	logPath string
+
+	mu      sync.Mutex
+	events  *os.File
+	rawByID map[string]*os.File
+}
+
+func openDebugCapture(dir string) (*debugCapture, error) {
+	if err := os.MkdirAll(dir, 0o700); err != nil {
+		return nil, err
+	}
+	logPath := filepath.Join(dir, "patterm.log")
+	// Truncate-style fresh log per run is friendlier for grep'ing one
+	// session. The existing logf opens O_APPEND though, so concurrent
+	// runs against the same dir would interleave — that's on the user.
+	f, err := os.Create(logPath)
+	if err != nil {
+		return nil, err
+	}
+	_ = f.Close()
+	ev, err := os.Create(filepath.Join(dir, "events.jsonl"))
+	if err != nil {
+		return nil, err
+	}
+	dc := &debugCapture{
+		dir:     dir,
+		logPath: logPath,
+		events:  ev,
+		rawByID: make(map[string]*os.File),
+	}
+	dc.writeEvent("session_start", map[string]any{
+		"time": time.Now().Format(time.RFC3339Nano),
+		"pid":  os.Getpid(),
+	})
+	return dc, nil
+}
+
+func (d *debugCapture) LogPath() string { return d.logPath }
+
+func (d *debugCapture) Close() error {
+	d.mu.Lock()
+	defer d.mu.Unlock()
+	d.writeEventLocked("session_end", map[string]any{
+		"time": time.Now().Format(time.RFC3339Nano),
+	})
+	for _, f := range d.rawByID {
+		_ = f.Close()
+	}
+	d.rawByID = nil
+	if d.events != nil {
+		_ = d.events.Close()
+		d.events = nil
+	}
+	return nil
+}
+
+func (d *debugCapture) OnChildSpawned(c *Child) {
+	d.writeEvent("child_spawned", map[string]any{
+		"time":      time.Now().Format(time.RFC3339Nano),
+		"id":        c.ID,
+		"name":      c.Name,
+		"kind":      string(c.Kind),
+		"parent_id": c.ParentID,
+		"preset":    c.PresetRef,
+		"argv":      c.Argv,
+	})
+}
+
+func (d *debugCapture) OnChildExited(c *Child) {
+	d.writeEvent("child_exited", map[string]any{
+		"time":      time.Now().Format(time.RFC3339Nano),
+		"id":        c.ID,
+		"name":      c.Name,
+		"exit_code": c.ExitCode(),
+	})
+	d.mu.Lock()
+	defer d.mu.Unlock()
+	if f, ok := d.rawByID[c.ID]; ok {
+		_ = f.Close()
+		delete(d.rawByID, c.ID)
+	}
+}
+
+func (d *debugCapture) OnChildStateChanged(id string, state IdleState) {
+	d.writeEvent("child_state", map[string]any{
+		"time":  time.Now().Format(time.RFC3339Nano),
+		"id":    id,
+		"state": string(state),
+	})
+}
+
+func (d *debugCapture) OnPTYOut(childID string, chunk []byte) {
+	if len(chunk) == 0 {
+		return
+	}
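+	d.mu.Lock()
+	defer d.mu.Unlock()
+	if d.rawByID == nil {
+		// Close already ran; storing into the nil map below would panic.
+		return
+	}
+	f, ok := d.rawByID[childID]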
+	if !ok {
+		path := filepath.Join(d.dir, childID+".raw")
+		nf, err := os.Create(path)
+		if err != nil {
+			return
+		}
+		f = nf
+		d.rawByID[childID] = nf
+	}
+	// Listener contract: don't retain chunk past return. Writing now
+	// is fine; the slice's backing buffer is reused for the next read
+	// only after this listener chain completes.
+	_, _ = f.Write(chunk)
+}
+
+func (d *debugCapture) writeEvent(kind string, fields map[string]any) {
+	d.mu.Lock()
+	defer d.mu.Unlock()
+	d.writeEventLocked(kind, fields)
+}
+
+func (d *debugCapture) writeEventLocked(kind string, fields map[string]any) {
+	if d.events == nil {
+		return
+	}
+	if fields == nil {
+		fields = map[string]any{}
+	}
+	fields["event"] = kind
+	enc, err := json.Marshal(fields)
+	if err != nil {
+		return
+	}
+	_, _ = fmt.Fprintln(d.events, string(enc))
+}
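Reviewer note: because writeEventLocked puts the event kind under the "event" key of every record, events.jsonl can be summarised with a few lines of Go. This is a rough sketch for offline analysis, not part of the change; `countEvents` is a hypothetical helper and the sample records are invented.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// countEvents tallies JSON-lines records by their "event" field.
// Lines that are not valid JSON objects are skipped.
func countEvents(jsonl string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(jsonl))
	for sc.Scan() {
		var rec map[string]any
		if err := json.Unmarshal(sc.Bytes(), &rec); err != nil {
			continue
		}
		if kind, ok := rec["event"].(string); ok {
			counts[kind]++
		}
	}
	return counts
}

func main() {
	// Invented capture; the ids and values are illustrative only.
	sample := `{"event":"session_start","pid":123}
{"event":"child_spawned","id":"c1","name":"shell"}
{"event":"child_exited","id":"c1","exit_code":0}
{"event":"session_end"}
`
	fmt.Println(countEvents(sample))
}
```

bufio.Scanner rejects lines longer than 64 KiB by default, so a record with a very long argv would need a larger buffer set through Scanner.Buffer.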
--- a/internal/app/host.go
+++ b/internal/app/host.go
@@ -1111,7 +1111,7 @@ func helpFor(topic string) mcp.HelpResponse {
 	case "spawning":
 		return mcp.HelpResponse{
 			Topic:        "spawning",
-			Content:      "spawn_agent launches another vendor LLM CLI as a sub-agent (orchestrator only). spawn_process(kind: command, preset: …) starts a stored command; spawn_process(kind: terminal) opens a shell. Command presets need trust the first time — you'll get needs_trust until the human accepts. Whatever you spawn is yours to clean up — see help('lifecycle').",
+			Content:      "spawn_agent launches another vendor LLM CLI as a sub-agent (orchestrator only). spawn_process(kind: command, preset: …) starts a stored command; spawn_process(kind: terminal) opens a shell. Command presets need trust the first time — you'll get needs_trust until the human accepts. ANTI-PATTERNS: do not shell out to `claude` / `codex` / `opencode` (or any other agent CLI) yourself, and do not pipe JSON-RPC into patterm's Unix socket via perl / nc / socat / curl. Either path bypasses caller-identity and the new agent reads back as a stray top-level tab instead of your child — call spawn_agent through the MCP transport you were initialised on. Whatever you spawn is yours to clean up — see help('lifecycle').",
 			RelatedTools: []string{"spawn_agent", "spawn_process", "start_process", "restart_process", "close_process"},
 		}
 	case "lifecycle":
--- a/internal/app/metrics.go
+++ b/internal/app/metrics.go
@@ -0,0 +1,462 @@
+package app
+
+import (
+	"context"
+	"encoding/json"
+	"fmt"
+	"os"
+	"path/filepath"
+	"sync/atomic"
+	"time"
+)
+
+// metricsTracker collects per-hot-path counters and timings. All
+// fields are atomic so callers can record from the per-PTY-chunk path
+// without taking a lock. Enabled only when --profile is set.
+//
+// Sampled rates ("X per second", "p99 latency") are not tracked here
+// directly — the snapshotter goroutine writes a row to metrics.jsonl
+// every second, and analysis tools compute rates from the deltas.
+// Aggregate totals are written to metrics.json on shutdown.
+type metricsTracker struct {
+	startedAt time.Time
+
+	// PTY chunk arrival → stdout write pipeline (per OnPTYOut call).
+	ptyChunks      atomic.Int64
+	ptyBytes       atomic.Int64
+	onPTYOutNs     atomic.Int64
+	onPTYOutMaxNs  atomic.Int64
+	onPTYOutDrops  atomic.Int64 // chunks for non-focused children — fast-path returns
+	stdoutWrites   atomic.Int64
+	stdoutBytes    atomic.Int64
+	stdoutNs       atomic.Int64
+	stdoutMaxNs    atomic.Int64
+
+	// Viewport renderer (state-machine over child PTY bytes).
+	renderCalls atomic.Int64
+	renderNs    atomic.Int64
+	renderMaxNs atomic.Int64
+
+	// CGO into libghostty-vt (counted from pumpChild).
+	emuWriteCalls atomic.Int64
+	emuWriteNs    atomic.Int64
+	emuWriteMaxNs atomic.Int64
+	emuTitleCalls atomic.Int64
+	emuTitleNs    atomic.Int64
+	emuTitleSkips atomic.Int64 // OSC-gate fast path — title poll skipped
+
+	// Chrome paint pipeline.
+	sidebarDraws     atomic.Int64
+	sidebarCacheHits atomic.Int64
+	sidebarNs        atomic.Int64
+	sidebarMaxNs     atomic.Int64
+
+	tabbarDraws     atomic.Int64
+	tabbarCacheHits atomic.Int64
+	tabbarNs        atomic.Int64
+
+	statusDraws     atomic.Int64
+	statusCacheHits atomic.Int64
+	statusNs        atomic.Int64
+
+	// Snapshot replay (focus / spawn / nudge).
+	snapshotReplays atomic.Int64
+	snapshotNs      atomic.Int64
+	snapshotMaxNs   atomic.Int64
+
+	// Chrome ticker — distinguishes useful work from idle wakeups.
+	tickerFires      atomic.Int64
+	tickerIdleFires  atomic.Int64 // nothing dirty when the ticker fired
+
+	// Output destination (set when enabled).
+	rowFile *os.File // metrics.jsonl
+	dir     string
+}
+
+// newMetricsTracker creates an enabled tracker writing to <dir>/.
+// Returns nil + nil err if dir is empty (feature off). Caller must
+// call tracker.run(ctx) in a goroutine and tracker.close() at exit.
+func newMetricsTracker(dir string) (*metricsTracker, error) {
+	if dir == "" {
+		return nil, nil
+	}
+	if err := os.MkdirAll(dir, 0o700); err != nil {
+		return nil, err
+	}
+	row, err := os.Create(filepath.Join(dir, "metrics.jsonl"))
+	if err != nil {
+		return nil, err
+	}
+	return &metricsTracker{
+		startedAt: time.Now(),
+		rowFile:   row,
+		dir:       dir,
+	}, nil
+}
+
+// observeMax updates dst to max(dst, v) using a CAS loop. Go's
+// sync/atomic has no atomic-max operation, so this is the standard
+// idiom. Concurrent callers can race on the CAS, but each loser
+// retries, so the value settles at the true max.
+func observeMax(dst *atomic.Int64, v int64) {
+	for {
+		old := dst.Load()
+		if v <= old {
+			return
+		}
+		if dst.CompareAndSwap(old, v) {
+			return
+		}
+	}
+}
+
+// recordPTYOut is called once at the end of each OnPTYOut invocation.
+// `dur` is the full per-chunk wall time (renderer + stdout + chrome
+// signals); `bytes` is the chunk's byte count.
+func (m *metricsTracker) recordPTYOut(dur time.Duration, bytes int) {
+	if m == nil {
+		return
+	}
+	m.ptyChunks.Add(1)
+	m.ptyBytes.Add(int64(bytes))
+	ns := dur.Nanoseconds()
+	m.onPTYOutNs.Add(ns)
+	observeMax(&m.onPTYOutMaxNs, ns)
+}
+
+func (m *metricsTracker) recordPTYOutDrop() {
+	if m == nil {
+		return
+	}
+	m.onPTYOutDrops.Add(1)
+}
+
+func (m *metricsTracker) recordRender(dur time.Duration) {
+	if m == nil {
+		return
+	}
+	m.renderCalls.Add(1)
+	ns := dur.Nanoseconds()
+	m.renderNs.Add(ns)
+	observeMax(&m.renderMaxNs, ns)
+}
+
+func (m *metricsTracker) recordStdout(dur time.Duration, bytes int) {
+	if m == nil {
+		return
+	}
+	m.stdoutWrites.Add(1)
+	m.stdoutBytes.Add(int64(bytes))
+	ns := dur.Nanoseconds()
+	m.stdoutNs.Add(ns)
+	observeMax(&m.stdoutMaxNs, ns)
+}
+
+func (m *metricsTracker) recordEmuWrite(dur time.Duration) {
+	if m == nil {
+		return
+	}
+	m.emuWriteCalls.Add(1)
+	ns := dur.Nanoseconds()
+	m.emuWriteNs.Add(ns)
+	observeMax(&m.emuWriteMaxNs, ns)
+}
+
+func (m *metricsTracker) recordEmuTitle(dur time.Duration, skipped bool) {
+	if m == nil {
+		return
+	}
+	if skipped {
+		m.emuTitleSkips.Add(1)
+		return
+	}
+	m.emuTitleCalls.Add(1)
+	m.emuTitleNs.Add(dur.Nanoseconds())
+}
+
+func (m *metricsTracker) recordSidebar(dur time.Duration, cacheHit bool) {
+	if m == nil {
+		return
+	}
+	m.sidebarDraws.Add(1)
+	if cacheHit {
+		m.sidebarCacheHits.Add(1)
+	}
+	ns := dur.Nanoseconds()
+	m.sidebarNs.Add(ns)
+	observeMax(&m.sidebarMaxNs, ns)
+}
+
+func (m *metricsTracker) recordTabbar(dur time.Duration, cacheHit bool) {
+	if m == nil {
+		return
+	}
+	m.tabbarDraws.Add(1)
+	if cacheHit {
+		m.tabbarCacheHits.Add(1)
+	}
+	m.tabbarNs.Add(dur.Nanoseconds())
+}
+
+func (m *metricsTracker) recordStatus(dur time.Duration, cacheHit bool) {
+	if m == nil {
+		return
+	}
+	m.statusDraws.Add(1)
+	if cacheHit {
+		m.statusCacheHits.Add(1)
+	}
+	m.statusNs.Add(dur.Nanoseconds())
+}
+
+func (m *metricsTracker) recordSnapshot(dur time.Duration) {
+	if m == nil {
+		return
+	}
+	m.snapshotReplays.Add(1)
+	ns := dur.Nanoseconds()
+	m.snapshotNs.Add(ns)
+	observeMax(&m.snapshotMaxNs, ns)
+}
+
+func (m *metricsTracker) recordTickerFire(didWork bool) {
+	if m == nil {
+		return
+	}
+	m.tickerFires.Add(1)
+	if !didWork {
+		m.tickerIdleFires.Add(1)
+	}
+}
+
+// metricsSnapshot is the tracker's state at one instant, as a
+// JSON-serialisable struct. It is used for both the per-second JSONL
+// row and the final metrics.json aggregate.
+type metricsSnapshot struct {
+	WallSeconds   float64 `json:"wall_seconds"`
+	PTYChunks     int64   `json:"pty_chunks"`
+	PTYBytes      int64   `json:"pty_bytes"`
+	OnPTYOutNs    int64   `json:"on_pty_out_ns_total"`
+	OnPTYOutMaxNs int64   `json:"on_pty_out_ns_max"`
+	OnPTYOutDrops int64   `json:"on_pty_out_drops"`
+	StdoutWrites  int64   `json:"stdout_writes"`
+	StdoutBytes   int64   `json:"stdout_bytes"`
+	StdoutNs      int64   `json:"stdout_ns_total"`
+	StdoutMaxNs   int64   `json:"stdout_ns_max"`
+
+	RenderCalls int64 `json:"render_calls"`
+	RenderNs    int64 `json:"render_ns_total"`
+	RenderMaxNs int64 `json:"render_ns_max"`
+
+	EmuWriteCalls int64 `json:"emu_write_calls"`
+	EmuWriteNs    int64 `json:"emu_write_ns_total"`
+	EmuWriteMaxNs int64 `json:"emu_write_ns_max"`
+	EmuTitleCalls int64 `json:"emu_title_calls"`
+	EmuTitleNs    int64 `json:"emu_title_ns_total"`
+	EmuTitleSkips int64 `json:"emu_title_skips"`
+
+	SidebarDraws     int64 `json:"sidebar_draws"`
+	SidebarCacheHits int64 `json:"sidebar_cache_hits"`
+	SidebarNs        int64 `json:"sidebar_ns_total"`
+	SidebarMaxNs     int64 `json:"sidebar_ns_max"`
+
+	TabbarDraws     int64 `json:"tabbar_draws"`
+	TabbarCacheHits int64 `json:"tabbar_cache_hits"`
+	TabbarNs        int64 `json:"tabbar_ns_total"`
+
+	StatusDraws     int64 `json:"status_draws"`
+	StatusCacheHits int64 `json:"status_cache_hits"`
+	StatusNs        int64 `json:"status_ns_total"`
+
+	SnapshotReplays int64 `json:"snapshot_replays"`
+	SnapshotNs      int64 `json:"snapshot_ns_total"`
+	SnapshotMaxNs   int64 `json:"snapshot_ns_max"`
+
+	TickerFires     int64 `json:"ticker_fires"`
+	TickerIdleFires int64 `json:"ticker_idle_fires"`
+
+	// Derived rates (computed at snapshot time so consumers don't have
+	// to). All "per_second" values are averaged over wall_seconds.
+	PTYChunksPerSec     float64 `json:"pty_chunks_per_sec"`
+	PTYBytesPerSec      float64 `json:"pty_bytes_per_sec"`
+	OnPTYOutMeanUs      float64 `json:"on_pty_out_mean_us"`
+	StdoutMeanUs        float64 `json:"stdout_mean_us"`
+	EmuWriteMeanUs      float64 `json:"emu_write_mean_us"`
+	SidebarMeanUs       float64 `json:"sidebar_mean_us"`
+	SidebarCacheHitRate float64 `json:"sidebar_cache_hit_rate"`
+	TabbarCacheHitRate  float64 `json:"tabbar_cache_hit_rate"`
+	StatusCacheHitRate  float64 `json:"status_cache_hit_rate"`
+	EmuTitleSkipRate    float64 `json:"emu_title_skip_rate"`
+	TickerIdleRate      float64 `json:"ticker_idle_rate"`
+	Timestamp           string  `json:"timestamp"`
+}
+
+func (m *metricsTracker) snapshotNow() metricsSnapshot {
+	wall := time.Since(m.startedAt).Seconds()
+	if wall <= 0 {
+		wall = 1
+	}
+	chunks := m.ptyChunks.Load()
+	bytes := m.ptyBytes.Load()
+	onptyTotal := m.onPTYOutNs.Load()
+	stdW := m.stdoutWrites.Load()
+	stdNs := m.stdoutNs.Load()
+	emuW := m.emuWriteCalls.Load()
+	emuWNs := m.emuWriteNs.Load()
+	sbDraws := m.sidebarDraws.Load()
+	sbHits := m.sidebarCacheHits.Load()
+	sbNs := m.sidebarNs.Load()
+	tbDraws := m.tabbarDraws.Load()
+	tbHits := m.tabbarCacheHits.Load()
+	stDraws := m.statusDraws.Load()
+	stHits := m.statusCacheHits.Load()
+	emuTC := m.emuTitleCalls.Load()
+	emuTS := m.emuTitleSkips.Load()
+	tickerF := m.tickerFires.Load()
+	tickerI := m.tickerIdleFires.Load()
+
+	div := func(num, denom int64) float64 {
+		if denom == 0 {
+			return 0
+		}
+		return float64(num) / float64(denom)
+	}
+
+	return metricsSnapshot{
+		WallSeconds:   wall,
+		PTYChunks:     chunks,
+		PTYBytes:      bytes,
+		OnPTYOutNs:    onptyTotal,
+		OnPTYOutMaxNs: m.onPTYOutMaxNs.Load(),
+		OnPTYOutDrops: m.onPTYOutDrops.Load(),
+		StdoutWrites:  stdW,
+		StdoutBytes:   m.stdoutBytes.Load(),
+		StdoutNs:      stdNs,
+		StdoutMaxNs:   m.stdoutMaxNs.Load(),
+
+		RenderCalls: m.renderCalls.Load(),
+		RenderNs:    m.renderNs.Load(),
+		RenderMaxNs: m.renderMaxNs.Load(),
+
+		EmuWriteCalls: emuW,
+		EmuWriteNs:    emuWNs,
+		EmuWriteMaxNs: m.emuWriteMaxNs.Load(),
+		EmuTitleCalls: emuTC,
+		EmuTitleNs:    m.emuTitleNs.Load(),
+		EmuTitleSkips: emuTS,
+
+		SidebarDraws:     sbDraws,
+		SidebarCacheHits: sbHits,
+		SidebarNs:        sbNs,
+		SidebarMaxNs:     m.sidebarMaxNs.Load(),
+
+		TabbarDraws:     tbDraws,
+		TabbarCacheHits: tbHits,
+		TabbarNs:        m.tabbarNs.Load(),
+
+		StatusDraws:     stDraws,
+		StatusCacheHits: stHits,
+		StatusNs:        m.statusNs.Load(),
+
+		SnapshotReplays: m.snapshotReplays.Load(),
+		SnapshotNs:      m.snapshotNs.Load(),
+		SnapshotMaxNs:   m.snapshotMaxNs.Load(),
+
+		TickerFires:     tickerF,
+		TickerIdleFires: tickerI,
+
+		PTYChunksPerSec:     float64(chunks) / wall,
+		PTYBytesPerSec:      float64(bytes) / wall,
+		// Divide in floating point before scaling to µs so integer division does not truncate.
+		OnPTYOutMeanUs:      div(onptyTotal, chunks) / 1000,
+		StdoutMeanUs:        div(stdNs, stdW) / 1000,
+		EmuWriteMeanUs:      div(emuWNs, emuW) / 1000,
+		SidebarMeanUs:       div(sbNs, sbDraws) / 1000,
+		SidebarCacheHitRate: div(sbHits, sbDraws),
+		TabbarCacheHitRate:  div(tbHits, tbDraws),
+		StatusCacheHitRate:  div(stHits, stDraws),
+		EmuTitleSkipRate:    div(emuTS, emuTC+emuTS),
+		TickerIdleRate:      div(tickerI, tickerF),
+		Timestamp:           time.Now().Format(time.RFC3339Nano),
+	}
+}
+
+// run is the snapshotter goroutine: write a JSONL row every second
+// until ctx is cancelled. Stops cleanly without flushing partial
+// rows.
+func (m *metricsTracker) run(ctx context.Context) {
+	if m == nil {
+		return
+	}
+	enc := json.NewEncoder(m.rowFile)
+	ticker := time.NewTicker(time.Second)
+	defer ticker.Stop()
+	for {
+		select {
+		case <-ctx.Done():
+			return
+		case <-ticker.C:
+			snap := m.snapshotNow()
+			_ = enc.Encode(snap)
+		}
+	}
+}
+
+// close writes the final aggregate snapshot to metrics.json + a
+// short human-readable summary.txt, then closes the row file. Safe
+// to call on a nil receiver.
+func (m *metricsTracker) close() {
+	if m == nil {
+		return
+	}
+	snap := m.snapshotNow()
+	if f, err := os.Create(filepath.Join(m.dir, "metrics.json")); err == nil {
+		enc := json.NewEncoder(f)
+		enc.SetIndent("", "  ")
+		_ = enc.Encode(snap)
+		_ = f.Close()
+	}
+	if f, err := os.Create(filepath.Join(m.dir, "summary.txt")); err == nil {
+		writeSummary(f, snap)
+		_ = f.Close()
+	}
+	if m.rowFile != nil {
+		_ = m.rowFile.Close()
+		m.rowFile = nil
+	}
+}
+
+// writeSummary renders a brief human-readable digest of a snapshot.
+// Designed for `cat summary.txt` after a session — quick orientation
+// before diving into metrics.json / pprof.
+func writeSummary(w *os.File, s metricsSnapshot) {
+	fmt.Fprintf(w, "patterm performance summary\n")
+	fmt.Fprintf(w, "===========================\n\n")
+	fmt.Fprintf(w, "session length:        %.1fs\n", s.WallSeconds)
+	fmt.Fprintf(w, "pty chunks:            %d  (%.1f /s)\n", s.PTYChunks, s.PTYChunksPerSec)
+	fmt.Fprintf(w, "pty bytes:             %d  (%.0f /s, %.1f KiB/s)\n",
+		s.PTYBytes, s.PTYBytesPerSec, s.PTYBytesPerSec/1024)
+	fmt.Fprintf(w, "pty chunks dropped:    %d  (focus not on caller — fast-path return)\n", s.OnPTYOutDrops)
+	fmt.Fprintf(w, "\n")
+	fmt.Fprintf(w, "OnPTYOut mean:         %.1fµs   max: %.1fms\n",
+		s.OnPTYOutMeanUs, float64(s.OnPTYOutMaxNs)/1e6)
+	fmt.Fprintf(w, "viewport.Render calls: %d  total %.1fms  max %.1fms\n",
+		s.RenderCalls, float64(s.RenderNs)/1e6, float64(s.RenderMaxNs)/1e6)
+	fmt.Fprintf(w, "stdout writes:         %d  mean %.1fµs  max %.1fms  bytes %d\n",
+		s.StdoutWrites, s.StdoutMeanUs, float64(s.StdoutMaxNs)/1e6, s.StdoutBytes)
+	fmt.Fprintf(w, "\n")
+	fmt.Fprintf(w, "emulator.Write (cgo):  %d  mean %.1fµs  max %.1fms\n",
+		s.EmuWriteCalls, s.EmuWriteMeanUs, float64(s.EmuWriteMaxNs)/1e6)
+	fmt.Fprintf(w, "emulator.Title polls:  %d real, %d gated   skip rate %.1f%%\n",
+		s.EmuTitleCalls, s.EmuTitleSkips, s.EmuTitleSkipRate*100)
+	fmt.Fprintf(w, "\n")
+	fmt.Fprintf(w, "sidebar draws:         %d  mean %.1fµs  max %.1fms  cache-hit %.1f%%\n",
+		s.SidebarDraws, s.SidebarMeanUs, float64(s.SidebarMaxNs)/1e6, s.SidebarCacheHitRate*100)
+	fmt.Fprintf(w, "tabbar draws:          %d  cache-hit %.1f%%\n",
+		s.TabbarDraws, s.TabbarCacheHitRate*100)
+	fmt.Fprintf(w, "status draws:          %d  cache-hit %.1f%%\n",
+		s.StatusDraws, s.StatusCacheHitRate*100)
+	fmt.Fprintf(w, "snapshot replays:      %d  total %.1fms  max %.1fms\n",
+		s.SnapshotReplays, float64(s.SnapshotNs)/1e6, float64(s.SnapshotMaxNs)/1e6)
+	fmt.Fprintf(w, "\n")
+	fmt.Fprintf(w, "chrome ticker:         %d fires, %d idle   idle rate %.1f%%\n",
+		s.TickerFires, s.TickerIdleFires, s.TickerIdleRate*100)
+}
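The tracker's header comment says analysis tools compute rates from the deltas between consecutive `metrics.jsonl` rows. A minimal sketch of such a consumer follows, assuming only the `wall_seconds`, `pty_chunks`, and `pty_bytes` keys defined in `metricsSnapshot`; the `row` and `rate` types and the `deltaRates` helper are hypothetical names for illustration and are not part of patterm.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// row mirrors a subset of the metricsSnapshot JSON keys (hypothetical type).
type row struct {
	WallSeconds float64 `json:"wall_seconds"`
	PTYChunks   int64   `json:"pty_chunks"`
	PTYBytes    int64   `json:"pty_bytes"`
}

// rate is the throughput over the interval between two consecutive rows.
type rate struct {
	ChunksPerSec float64
	BytesPerSec  float64
}

// deltaRates parses JSONL text and returns one rate per pair of
// consecutive rows, skipping pairs whose wall clock did not advance.
func deltaRates(jsonl string) ([]rate, error) {
	var out []rate
	var prev *row
	sc := bufio.NewScanner(strings.NewReader(jsonl))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var cur row
		if err := json.Unmarshal([]byte(line), &cur); err != nil {
			return nil, err
		}
		if prev != nil {
			if dt := cur.WallSeconds - prev.WallSeconds; dt > 0 {
				out = append(out, rate{
					ChunksPerSec: float64(cur.PTYChunks-prev.PTYChunks) / dt,
					BytesPerSec:  float64(cur.PTYBytes-prev.PTYBytes) / dt,
				})
			}
		}
		c := cur
		prev = &c
	}
	return out, sc.Err()
}

func main() {
	sample := "{\"wall_seconds\":1.0,\"pty_chunks\":100,\"pty_bytes\":4096}\n" +
		"{\"wall_seconds\":2.0,\"pty_chunks\":350,\"pty_bytes\":12288}\n"
	rates, err := deltaRates(sample)
	if err != nil {
		panic(err)
	}
	for _, r := range rates {
		fmt.Printf("%.0f chunks/s, %.0f bytes/s\n", r.ChunksPerSec, r.BytesPerSec)
	}
}
```

Because the counters are cumulative, the per-interval rate differs from the `pty_chunks_per_sec` field, which is averaged over the whole session.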
--- a/internal/app/metrics_test.go
+++ b/internal/app/metrics_test.go
@@ -0,0 +1,116 @@
+package app
+
+import (
+	"encoding/json"
+	"os"
+	"path/filepath"
+	"testing"
+	"time"
+)
+
+func TestMetricsTrackerDisabledByEmptyDir(t *testing.T) {
+	m, err := newMetricsTracker("")
+	if err != nil {
+		t.Fatalf("newMetricsTracker(\"\") err: %v", err)
+	}
+	if m != nil {
+		t.Fatalf("expected nil tracker for empty dir, got %v", m)
+	}
+}
+
+func TestMetricsTrackerRecordsAndWrites(t *testing.T) {
+	dir := t.TempDir()
+	m, err := newMetricsTracker(dir)
+	if err != nil {
+		t.Fatalf("newMetricsTracker: %v", err)
+	}
+	if m == nil {
+		t.Fatal("expected non-nil tracker")
+	}
+
+	m.recordPTYOut(2*time.Millisecond, 1024)
+	m.recordPTYOut(5*time.Millisecond, 4096)
+	m.recordRender(800 * time.Microsecond)
+	m.recordStdout(300*time.Microsecond, 1100)
+	m.recordEmuWrite(150 * time.Microsecond)
+	m.recordEmuTitle(0, true)
+	m.recordEmuTitle(20*time.Microsecond, false)
+	m.recordSidebar(100*time.Microsecond, true)
+	m.recordSidebar(900*time.Microsecond, false)
+	m.recordTabbar(50*time.Microsecond, true)
+	m.recordStatus(40*time.Microsecond, true)
+	m.recordSnapshot(2 * time.Millisecond)
+	m.recordTickerFire(false)
+	m.recordTickerFire(true)
+	m.recordPTYOutDrop()
+
+	m.close()
+
+	// metrics.json should exist and parse, and reflect what we recorded.
+	raw, err := os.ReadFile(filepath.Join(dir, "metrics.json"))
+	if err != nil {
+		t.Fatalf("read metrics.json: %v", err)
+	}
+	var snap metricsSnapshot
+	if err := json.Unmarshal(raw, &snap); err != nil {
+		t.Fatalf("parse metrics.json: %v", err)
+	}
+	if snap.PTYChunks != 2 {
+		t.Errorf("PTYChunks = %d, want 2", snap.PTYChunks)
+	}
+	if snap.PTYBytes != 5120 {
+		t.Errorf("PTYBytes = %d, want 5120", snap.PTYBytes)
+	}
+	if snap.OnPTYOutMaxNs != (5 * time.Millisecond).Nanoseconds() {
+		t.Errorf("OnPTYOutMaxNs = %d, want %d",
+			snap.OnPTYOutMaxNs, (5 * time.Millisecond).Nanoseconds())
+	}
+	if snap.SidebarDraws != 2 {
+		t.Errorf("SidebarDraws = %d, want 2", snap.SidebarDraws)
+	}
+	if snap.SidebarCacheHits != 1 {
+		t.Errorf("SidebarCacheHits = %d, want 1", snap.SidebarCacheHits)
+	}
+	if snap.SidebarCacheHitRate != 0.5 {
+		t.Errorf("SidebarCacheHitRate = %v, want 0.5", snap.SidebarCacheHitRate)
+	}
+	if snap.EmuTitleCalls != 1 || snap.EmuTitleSkips != 1 {
+		t.Errorf("emu title accounting: calls=%d skips=%d, want 1/1",
+			snap.EmuTitleCalls, snap.EmuTitleSkips)
+	}
+	if snap.TickerFires != 2 || snap.TickerIdleFires != 1 {
+		t.Errorf("ticker accounting: fires=%d idle=%d, want 2/1",
+			snap.TickerFires, snap.TickerIdleFires)
+	}
+	if snap.OnPTYOutDrops != 1 {
+		t.Errorf("OnPTYOutDrops = %d, want 1", snap.OnPTYOutDrops)
+	}
+
+	// summary.txt should also be present and non-empty.
+	info, err := os.Stat(filepath.Join(dir, "summary.txt"))
+	if err != nil {
+		t.Fatalf("stat summary.txt: %v", err)
+	}
+	if info.Size() == 0 {
+		t.Fatal("summary.txt is empty")
+	}
+}
+
+func TestMetricsTrackerNilSafe(t *testing.T) {
+	// Every record* method must be safe to call on a nil receiver
+	// because the hot paths use that to avoid an enabled-check.
+	var m *metricsTracker
+	m.recordPTYOut(time.Millisecond, 100)
+	m.recordPTYOutDrop()
+	m.recordRender(time.Microsecond)
+	m.recordStdout(time.Microsecond, 50)
+	m.recordEmuWrite(time.Microsecond)
+	m.recordEmuTitle(time.Microsecond, false)
+	m.recordEmuTitle(0, true)
+	m.recordSidebar(time.Microsecond, true)
+	m.recordTabbar(time.Microsecond, false)
+	m.recordStatus(time.Microsecond, true)
+	m.recordSnapshot(time.Microsecond)
+	m.recordTickerFire(true)
+	m.close()
+}
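The tests above drive the tracker from a single goroutine, while `observeMax` exists for concurrent callers. A sketch of how the CAS-loop maximum could be checked under contention is shown below; `observeMax` is copied from the diff, and `maxUnderContention` is a hypothetical helper rather than part of the shipped test file.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// observeMax raises dst to v if v is larger, retrying when another
// goroutine wins the compare-and-swap.
func observeMax(dst *atomic.Int64, v int64) {
	for {
		old := dst.Load()
		if v <= old {
			return
		}
		if dst.CompareAndSwap(old, v) {
			return
		}
	}
}

// maxUnderContention has each worker report a disjoint range of values
// concurrently and returns the maximum that was recorded.
func maxUnderContention(workers, perWorker int) int64 {
	var hi atomic.Int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				observeMax(&hi, int64(w*perWorker+i))
			}
		}(w)
	}
	wg.Wait()
	return hi.Load()
}

func main() {
	// The largest value any worker reports is workers*perWorker - 1.
	fmt.Println(maxUnderContention(8, 1000))
}
```

Running a check like this with `go test -race` would also exercise the claim that the hot path needs no lock.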
--- a/internal/app/session.go
+++ b/internal/app/session.go
@@ -50,6 +50,11 @@ type Session struct {
 	// JSON file so they can be re-spawned after patterm restarts.
 	// Optional; nil means "no persistence" (used by unit tests).
 	persistStore *persist.Store
+
+	// metrics is the optional performance tracker. nil when --profile
+	// is off. The pump goroutine reads it via atomic Load so installing
+	// metrics post-construction doesn't race with running children.
+	metrics atomic.Pointer[metricsTracker]
 }

 // SetPersistStore attaches a process-persistence store. Future Spawn /
@@ -61,6 +66,18 @@ func (s *Session) SetPersistStore(p *persist.Store) {
 	s.mu.Unlock()
 }

+// SetMetrics installs the per-session performance tracker. Safe to
+// call with nil to disable (the default). Reads on the hot path go
+// through atomic.Pointer.Load() with no lock; SetMetrics swaps the
+// pointer once at startup.
+func (s *Session) SetMetrics(m *metricsTracker) {
+	s.metrics.Store(m)
+}
+
+func (s *Session) loadMetrics() *metricsTracker {
+	return s.metrics.Load()
+}
+
 // ChildEventListener is implemented by the TUI to react to lifecycle
 // events without polling.
 type ChildEventListener interface {
@@ -392,17 +409,37 @@ func (s *Session) pumpChild(c *Child, runID uint64) {
 			}
 			chunk := buf[:n]
 			if em := c.Emulator(); em != nil {
+				m := s.loadMetrics()
+				wstart := time.Time{}
+				if m != nil {
+					wstart = time.Now()
+				}
 				if _, werr := em.Write(chunk); werr != nil {
 					logf("emulator.Write(child %s): %v", c.ID, werr)
 				}
+				if m != nil {
+					m.recordEmuWrite(time.Since(wstart))
+				}
 				// OSC 0/2 title updates ride on the same byte stream as
 				// the rest of the output. Polling the emulator after each
-				// Write is cheap (one cgo call returning a borrowed
-				// string) and lets the classifier treat title changes as
-				// an activity signal — even when the title isn't visible
-				// in the rendered grid.
-				if t, terr := em.Title(); terr == nil {
-					c.recordTitle(t)
+				// chunk is cheap on its own (one CGO call) but codex/
+				// ratatui sends so many small chunks that the per-chunk
+				// CGO cost becomes measurable. Skip the Title poll when
+				// the chunk doesn't carry an OSC start at all. A title
+				// set by an OSC that straddles two reads is picked up by
+				// the next poll.
+				if containsOSC(chunk) {
+					tstart := time.Time{}
+					if m != nil {
+						tstart = time.Now()
+					}
+					if t, terr := em.Title(); terr == nil {
+						c.recordTitle(t)
+					}
+					if m != nil {
+						m.recordEmuTitle(time.Since(tstart), false)
+					}
+				} else if m != nil {
+					m.recordEmuTitle(0, true)
 				}
 			}
 			c.recordWrite(chunk)
@@ -679,6 +716,24 @@ func (s *Session) Shutdown() {
 	}
 }

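+// containsOSC reports whether chunk contains a byte sequence that could
+// begin an OSC: ESC ] (0x1b 0x5d) or the 8-bit C1 OSC byte (0x9d). It is
+// used to short-circuit the per-chunk Title() poll in pumpChild, which
+// otherwise pays a CGO call for every chunk even when codex/ratatui is
+// just emitting SGR-styled output. The scan is per chunk: an OSC whose
+// introducer is split across two reads, or whose terminator arrives in
+// a later chunk, is not detected here, so that title is picked up on
+// the next chunk that does contain an OSC start. 0x9d also occurs as a
+// UTF-8 continuation byte; such a false positive costs one extra poll.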
+func containsOSC(chunk []byte) bool {
+	for i, b := range chunk {
+		if b == 0x9d {
+			return true
+		}
+		if b == 0x1b && i+1 < len(chunk) && chunk[i+1] == ']' {
+			return true
+		}
+	}
+	return false
+}
+
 func logf(format string, args ...any) {
 	if os.Getenv("PATTERM_DEBUG_LOG") == "" {
 		return
--- a/internal/app/sidebar.go
+++ b/internal/app/sidebar.go
@@ -38,6 +38,10 @@ func formatShortDuration(d time.Duration) string {
 // computed main viewport, so the sidebar region is outside the child's
 // cursor range. We can redraw freely without fighting the child for cells.
 func (st *uiState) drawSidebar() {
+	var entry time.Time
+	if st.metrics != nil {
+		entry = time.Now()
+	}
 	st.mu.Lock()
 	palOpen := st.palette != nil
 	focus := st.focusedID
@@ -231,10 +235,16 @@ func (st *uiState) drawSidebar() {
 	st.chromeCacheMu.Lock()
 	if frame == st.sidebarCache {
 		st.chromeCacheMu.Unlock()
+		if st.metrics != nil {
+			st.metrics.recordSidebar(time.Since(entry), true)
+		}
 		return
 	}
 	st.sidebarCache = frame
 	st.chromeCacheMu.Unlock()
+	if st.metrics != nil {
+		defer func() { st.metrics.recordSidebar(time.Since(entry), false) }()
+	}

 	st.outMu.Lock()
 	// Save cursor; emit the sidebar; restore.
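The sidebar hunks above record a draw as a cache hit when the freshly built frame equals the cached one, and as a miss otherwise. A reduced sketch of that accounting is shown below, assuming a string frame and a single-threaded caller; the `painter` type is hypothetical and leaves out the locking and timing that `drawSidebar` has.

```go
package main

import (
	"fmt"
	"strings"
)

// painter caches the last frame it emitted and counts draws and hits
// (hypothetical type for illustration).
type painter struct {
	cache string
	draws int64
	hits  int64
	out   strings.Builder // stands in for the terminal
}

// draw emits frame only when it differs from the cached frame.
func (p *painter) draw(frame string) {
	p.draws++
	if frame == p.cache {
		p.hits++
		return
	}
	p.cache = frame
	p.out.WriteString(frame)
}

// hitRate returns hits/draws, or 0 when nothing has been drawn, which
// matches the zero-denominator guard in snapshotNow.
func (p *painter) hitRate() float64 {
	if p.draws == 0 {
		return 0
	}
	return float64(p.hits) / float64(p.draws)
}

func main() {
	var p painter
	for _, f := range []string{"a", "a", "b", "b"} {
		p.draw(f)
	}
	fmt.Printf("draws=%d hits=%d rate=%.2f out=%q\n", p.draws, p.hits, p.hitRate(), p.out.String())
}
```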
--- a/internal/app/tabbar.go
+++ b/internal/app/tabbar.go
@@ -4,6 +4,7 @@ import (
 	"fmt"
 	"os"
 	"strings"
+	"time"
 	"unicode/utf8"
 )

@@ -17,6 +18,10 @@ const tabBarRows = 2
 // to the leftmost tabs so the strip fills the screen edge-to-edge.
 // A trailing "+ new" hint sits in the rightmost reserved slot.
 func (st *uiState) drawTabBar() {
+	var entry time.Time
+	if st.metrics != nil {
+		entry = time.Now()
+	}
 	st.mu.Lock()
 	palOpen := st.palette != nil
 	focus := st.focusedID
@@ -188,10 +193,16 @@ func (st *uiState) drawTabBar() {
 	st.chromeCacheMu.Lock()
 	if frame == st.tabBarCache {
 		st.chromeCacheMu.Unlock()
+		if st.metrics != nil {
+			st.metrics.recordTabbar(time.Since(entry), true)
+		}
 		return
 	}
 	st.tabBarCache = frame
 	st.chromeCacheMu.Unlock()
+	if st.metrics != nil {
+		defer func() { st.metrics.recordTabbar(time.Since(entry), false) }()
+	}

 	st.outMu.Lock()
 	defer st.outMu.Unlock()
--- a/internal/app/viewport_renderer.go
+++ b/internal/app/viewport_renderer.go
@@ -33,6 +33,14 @@ type viewportRenderer struct {
 	// cache so the next drawSidebar repaints over the clobber.
 	scrolled bool

+	// childOnAlt tracks whether the focused child has entered its
+	// alternate screen (via ?47 / ?1047 / ?1049). Used to gate mouse-
+	// tracking-mode forwarding to the host: filter on primary so
+	// patterm's wheel-scrollback stays armed, forward on alt so codex
+	// (which disables mouse) lets the user select text and vim (which
+	// enables it) still gets mouse events.
+	childOnAlt bool
+
 	// skipUTF8 is set when the current multi-byte UTF-8 character started
 	// past the viewport's right edge. The starter byte was dropped, so
 	// the remaining continuation bytes must be dropped too instead of
@@ -65,6 +73,16 @@ func newViewportRenderer(l terminalLayout) *viewportRenderer {
 	return vr
 }

+// SetChildOnAlt seeds the renderer's view of the focused child's screen
+// side. Used when a new renderer is constructed for an already-running
+// child whose alt-screen transition we missed, so subsequent mouse-mode
+// toggles are filtered/forwarded according to the right side.
+func (vr *viewportRenderer) SetChildOnAlt(onAlt bool) {
+	vr.mu.Lock()
+	defer vr.mu.Unlock()
+	vr.childOnAlt = onAlt
+}
+
 func (vr *viewportRenderer) SetLayout(l terminalLayout) {
 	vr.mu.Lock()
 	defer vr.mu.Unlock()
@@ -236,15 +254,36 @@ func (vr *viewportRenderer) emitCSI() {
 			return
 		}
 		if isAltScreenMode(params) {
+			// Track the child's screen side so we know whether to filter
+			// or forward subsequent mouse-mode toggles. Entering alt
+			// disables host mouse reporting by default so codex (and
+			// any other alt-screen TUI that doesn't request mouse)
+			// allows the user to click-drag to select text. Alt-screen
+			// TUIs that want mouse (vim, less with --mouse) re-enable it
+			// via ?1000h after switching to alt — the forwarder below
+			// passes that through. Leaving alt re-arms host mouse for
+			// primary-screen wheel-scrollback.
+			wasAlt := vr.childOnAlt
+			vr.childOnAlt = final == 'h'
+			if !wasAlt && vr.childOnAlt {
+				vr.pending.WriteString("\x1b[?1000l\x1b[?1006l")
+			}
+			if wasAlt && !vr.childOnAlt {
+				vr.pending.WriteString("\x1b[?1000h\x1b[?1006h")
+			}
 			return
 		}
 		if isMouseTrackingMode(params) {
-			// Patterm owns mouse reporting on the host so wheel events keep
-			// flowing for scroll-viewport. The child's own emulator still
-			// observes the mode set/reset (it processes the same bytes we
-			// hand to ghostty_terminal_vt_write), so we know whether the
-			// child wants mouse input — we just don't let it disarm our
-			// host listener.
+			// On the child's primary screen patterm owns mouse reporting so
+			// wheel events keep flowing for in-pane scrollback — drop the
+			// child's toggle. On the alt screen the child should be free
+			// to enable mouse (vim, less) or disable it (codex); we forward
+			// the toggle to the host so click-and-drag selection works for
+			// alt-screen TUIs that don't want mouse, and mouse-aware ones
+			// still see the events they need.
+			if vr.childOnAlt {
+				vr.pending.Write(vr.buf)
+			}
 			return
 		}
 	}
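The `emitCSI` changes above amount to a small state machine: alt-screen transitions re-sync host mouse reporting, and the child's own mouse toggles are dropped on the primary screen but forwarded on the alt screen. A reduced model of those rules is sketched below. The `mouseGate` type and its methods are hypothetical; the real renderer parses these sequences out of a byte stream rather than receiving them as calls.

```go
package main

import "fmt"

const (
	hostMouseOff = "\x1b[?1000l\x1b[?1006l"
	hostMouseOn  = "\x1b[?1000h\x1b[?1006h"
)

// mouseGate tracks which screen side the child is on (hypothetical type).
type mouseGate struct{ onAlt bool }

// AltScreen handles an alt-screen toggle and returns the bytes to send
// to the host. The toggle itself is never forwarded.
func (g *mouseGate) AltScreen(enter bool) string {
	was := g.onAlt
	g.onAlt = enter
	switch {
	case !was && enter:
		return hostMouseOff // entering alt: disarm host mouse reporting
	case was && !enter:
		return hostMouseOn // leaving alt: re-arm it for wheel scrollback
	}
	return ""
}

// MouseToggle handles a mouse-mode sequence from the child: forwarded
// on the alt screen, dropped on the primary screen.
func (g *mouseGate) MouseToggle(seq string) string {
	if g.onAlt {
		return seq
	}
	return ""
}

func main() {
	var g mouseGate
	// Mirrors the "a ?1049h b ?1049l c" input used in the renderer test.
	got := "a" + g.AltScreen(true) + "b" + g.AltScreen(false) + "c"
	fmt.Printf("%q\n", got)
}
```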
--- a/internal/app/viewport_renderer_test.go
+++ b/internal/app/viewport_renderer_test.go
@@ -24,8 +24,36 @@ func TestViewportRendererShiftsCursor(t *testing.T) {
 func TestViewportRendererSwallowsAltScreenToggles(t *testing.T) {
 	vr := newViewportRenderer(newTerminalLayout(120, 40))
 	got := string(vr.Render([]byte("a\x1b[?1049hb\x1b[?1049lc")))
+	// The ?1049h/l toggles themselves must not reach the host (patterm
+	// owns its own alt screen). On the transition we re-sync host mouse
+	// reporting so codex (which doesn't request mouse) lets the user
+	// drag-select; leaving alt re-arms it for primary-screen wheel
+	// scrollback.
+	want := "a\x1b[?1000l\x1b[?1006lb\x1b[?1000h\x1b[?1006hc"
+	if got != want {
+		t.Fatalf("alt-screen toggles: got %q want %q", got, want)
+	}
+}
+
+func TestViewportRendererMouseTrackingFilteredOnPrimary(t *testing.T) {
+	vr := newViewportRenderer(newTerminalLayout(120, 40))
+	got := string(vr.Render([]byte("a\x1b[?1000lb\x1b[?1000hc")))
 	if got != "abc" {
-		t.Fatalf("alt-screen toggles: got %q", got)
+		t.Fatalf("mouse mode on primary should be filtered: got %q", got)
+	}
+}
+
+func TestViewportRendererMouseTrackingForwardedOnAlt(t *testing.T) {
+	vr := newViewportRenderer(newTerminalLayout(120, 40))
+	// Enter alt; subsequent mouse-mode toggles should reach the host so
+	// alt-screen TUIs (vim, less) can run with mouse on, and selection-
+	// using ones (codex) stay with mouse off.
+	got := string(vr.Render([]byte("\x1b[?1049h\x1b[?1000lx\x1b[?1000hy")))
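+	// Entering alt already emits one ?1000l, so the child's own disable
+	// must show up as a second occurrence.
+	if strings.Count(got, "\x1b[?1000l") < 2 {
+		t.Fatalf("alt-screen mouse disable should reach host: %q", got)
+	}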
+	if !strings.Contains(got, "\x1b[?1000h") {
+		t.Fatalf("alt-screen mouse enable should reach host: %q", got)
 	}
 }

--- a/internal/mcp/protocol.go
+++ b/internal/mcp/protocol.go
@@ -27,6 +27,24 @@ var serverInfo = map[string]any{
 	"version": "0.1.0",
 }

+// serverInstructions is returned in the MCP `initialize` response. MCP
+// clients show this to the underlying LLM as context for how to use
+// the server. Failure modes we've seen and want to head off:
+//   - The agent assumes patterm is something it has to launch (running
+//     `patterm` or `patterm mcp-stdio` from its own shell). It's
+//     already attached — it just calls the tools.
+//   - The agent reaches for shell tools (perl / nc / socat / curl) to
+//     poke patterm's Unix socket directly. That socket connection
+//     carries no caller identity, so any sub-agent the agent spawns
+//     that way ends up as a stray top-level tab instead of a child
+//     under the spawning agent. Always go through the MCP tools.
+//   - The agent shells out to `claude` / `codex` / `opencode` to start
+//     a peer instead of calling `spawn_agent`. Those peers won't show
+//     up as sub-agents and won't be tied into the patterm lifecycle.
+//
+// Keep this short — clients vary in how much they surface to the LLM.
+const serverInstructions = "You are already running INSIDE patterm; the `patterm` MCP server is connected over the same stdio MCP transport you use for any other MCP server. Use the MCP tools you see in tools/list — do NOT (a) try to launch `patterm` or `patterm mcp-stdio` yourself, (b) poke the Unix socket through perl / nc / socat / curl, or (c) shell out to `claude` / `codex` / `opencode` to start a peer. Any of those bypasses caller-identity and the new agent will land as a stray top-level tab instead of a child under you. Start with `whoami` for your role and the full tool list, then `help('topics')` for orientation. `spawn_agent` is the only correct way to start a sub-agent; `spawn_process` is for non-LLM commands; `list_processes` / `get_process_output` inspect them; `send_input` / `send_message` drive them. Whatever you spawn is yours to `close_process` when done."
+
 // toolDescriptor is the shape returned by `tools/list`. inputSchema is
 // a JSON Schema object — we provide a minimal `{type: "object"}` schema
 // for each tool, which lets MCP clients accept arbitrary arguments and
@@ -88,7 +106,7 @@ func toolCatalog() []toolDescriptor {
 	return []toolDescriptor{
 		{
 			Name:        "spawn_agent",
-			Description: "Spawn a sub-agent from an agent preset and optionally seed it with initial instructions. Caller owns lifecycle: when the sub-agent's work is done (it reports back via send_message, or you no longer need it), call close_process on its process_id to free the pane and tear down the PTY. See help('lifecycle').",
+			Description: "Spawn a sub-agent from an agent preset and optionally seed it with initial instructions. This is the ONLY correct way to start a sub-agent under you — do not shell out to `claude` / `codex` / `opencode` and do not poke patterm's Unix socket via perl / nc / socat. Either bypasses caller identity and the new agent lands as a stray top-level tab instead of your child. Caller owns lifecycle: when the sub-agent's work is done (it reports back via send_message, or you no longer need it), call close_process on its process_id to free the pane and tear down the PTY. See help('spawning') and help('lifecycle').",
 			InputSchema: objectSchema(map[string]any{
 				"agent":              stringProp("Preset name (e.g. \"claude\", \"codex\")."),
 				"agent_instructions": stringProp("Initial prompt typed into the agent after it's ready."),
@@ -377,7 +395,8 @@ func (s *Server) handleProtocolMethod(callerID, method string, params json.RawMe
 			"capabilities": map[string]any{
 				"tools": map[string]any{"listChanged": false},
 			},
-			"serverInfo": serverInfo,
+			"serverInfo":   serverInfo,
+			"instructions": serverInstructions,
 		}
 		return result, true, 0, "", nil

--- a/internal/mcp/protocol_test.go
+++ b/internal/mcp/protocol_test.go
@@ -36,6 +36,13 @@ func TestInitializeReturnsCapabilities(t *testing.T) {
 	if caps["tools"] == nil {
 		t.Fatalf("tools capability missing: %+v", caps)
 	}
+	// patterm-specific orientation: clients show this to the underlying
+	// LLM, so it's our primary hook for steering vendor TUIs (codex in
+	// particular) toward the MCP tool surface instead of shell-ing out.
+	instructions, ok := parsed.Result["instructions"].(string)
+	if !ok || instructions == "" {
+		t.Fatalf("instructions missing or wrong type: %+v", parsed.Result)
+	}
 }

 func TestInitializedNotificationSuppressesResponse(t *testing.T) {
Author	SHA1	Message	Date
Harry Bayliss	4b4e7543e8	Release v0.0.2 (check failed: release / build-linux-amd64 (push), failing after 10m12s)	2026-05-15 14:22:59 +01:00
Harry Bayliss	bda799a3c6	mise-pin zig 0.15.2; rebuild libghostty-vt ReleaseFast — 27-32x pipeline speedup	2026-05-15 13:54:48 +01:00

    Added .mise.toml pinning zig = "0.15.2" (the minimum the vendored Ghostty commit requires) and taught the Makefile to resolve zig through mise when available, falling back to PATH. Contributors run `mise install` once and `make deps` just works.

    Re-ran the pipeline benchmarks after rebuilding libghostty-vt with ReleaseFast (same hardware, AMD Ryzen 7 7800X3D):

                                   Debug    ReleaseFast   speedup
    Pipeline 8-colour @120fps      63 fps   2030 fps      32x
    Pipeline truecolor @120fps     34 fps   931 fps       27x
    Emulator-only truecolor        34 fps   2051 fps      60x

    7-16x headroom over 120 fps for the heaviest workload (truecolor full-screen redraws). Static library size 33 MiB -> 13 MiB.

    TODO.md baseline numbers updated to reflect post-fix throughput; the "Debug-mode lib" finding is folded into the result it produced rather than left as an open item.
Harry Bayliss	2f109a84fa	Stress-test ASCII video at 30/60/120 fps; fix libghostty-vt Debug build

Added a full ASCII-video benchmark suite that hammers the renderer with 30 KiB / 70 KiB full-screen frames at 30, 60, and 120 fps targets — both renderer-only and full-pipeline (em.Write + renderer + stdout). Each stream benchmark reports µs/frame, fps_ceiling, and percent of the per-frame budget consumed.

The pipeline benchmarks revealed we were missing 120 fps by a wide margin (190%-350% of budget at 120fps, 60-90 fps ceiling). Isolating em.Write confirmed libghostty-vt is the bottleneck — 16-29 ms per truecolor frame, library file at 33 MiB.

Root cause: the Makefile invoked `zig build` with no -Doptimize, and Zig's standardOptimizeOption defaults to Debug. So the shipped libghostty-vt was unoptimised. Fixed by pinning ReleaseFast in the Makefile (override via GHOSTTY_VT_OPTIMIZE for debug builds of the upstream lib). Existing checkouts need `make clean-deps && make deps` to pick up the rebuild.	2026-05-15 13:43:31 +01:00
Harry Bayliss	1c590f8e32	Concrete perf metrics: live counters in --profile + benchmark suite

Live metrics (--profile):
- New metricsTracker instruments OnPTYOut, viewport renderer, stdout writes, libghostty-vt Write/Title CGO calls, sidebar / tabbar / status draws (with cache-hit accounting), snapshot replays, and the chrome ticker (so we can see ticker fires that did nothing).
- Writes metrics.jsonl (one snapshot per second) and metrics.json + summary.txt on exit, alongside the existing pprof files.
- All record* methods are nil-safe so disabled paths pay only a cheap nil check; counters are atomic so the per-PTY-chunk hot path stays lock-free.

Benchmark suite (go test -bench=.):
- Three workload fixtures — plain ASCII, SGR-styled lines, and a ratatui-style cursor-shuffling burst — plus a containsOSC microbenchmark. Reports ns/op, MB/s, allocs/op, B/op.
- Initial baseline numbers added to TODO under the perf-audit section, alongside two new findings (renderer allocs ~1 per 4 bytes on styled chunks; styled throughput tops out near 90 MB/s) those benchmarks surfaced.	2026-05-15 13:31:37 +01:00
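The nil-safe, lock-free counter pattern described in that commit can be sketched as follows. This is an illustration under assumptions rather than the real metricsTracker: the `tracker` type, its fields, and the `recordChunk` and `snapshot` methods are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// tracker is a hypothetical stand-in for the commit's metricsTracker.
type tracker struct {
	chunks atomic.Int64 // number of PTY chunks seen
	bytes  atomic.Int64 // total bytes across those chunks
}

// recordChunk may be called on a nil *tracker. With profiling disabled the
// caller holds a nil pointer and the hot path pays only this nil check.
func (t *tracker) recordChunk(n int) {
	if t == nil {
		return
	}
	// Atomic adds keep the hot path free of mutexes.
	t.chunks.Add(1)
	t.bytes.Add(int64(n))
}

// snapshot reads the counters; a nil tracker reports zeros.
func (t *tracker) snapshot() (chunks, bytes int64) {
	if t == nil {
		return 0, 0
	}
	return t.chunks.Load(), t.bytes.Load()
}

func main() {
	var disabled *tracker      // profiling off
	disabled.recordChunk(4096) // no-op, does not panic

	enabled := &tracker{} // profiling on
	enabled.recordChunk(4096)
	enabled.recordChunk(512)

	chunks, bytes := enabled.snapshot()
	fmt.Println(chunks, bytes) // prints "2 4608"
}
```

Calling a pointer-receiver method on a nil pointer is legal in Go as long as the method does not dereference the receiver, which is what lets call sites skip an `if enabled` branch of their own.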
Harry Bayliss	442eed605c	Add auto-generated perf audit findings to TODO

Codebase sweep for perf issues outside the per-PTY-chunk path that recent CHANGELOG work already covered. Ten findings under a new "Perf Audit (auto-generated)" section in TODO.md — anchored to file:line, classified MEDIUM/LOW, with a sketched fix per entry. None landed as code changes; review pending.	2026-05-15 12:46:42 +01:00
Harry Bayliss	c120342709	Clear TODO backlog: --debug/--profile, codex selection, MCP orientation, perf

- Add --debug[=DIR] / --profile[=DIR] flags that write run artefacts (patterm.log, events.jsonl, per-child raw PTY captures, CPU + heap + goroutine pprof) to a dir without polluting stdout/stderr.
- Strengthen vendor-TUI orientation in three places (MCP initialize.instructions, the spawn_agent tool description, and help('spawning')) to head off codex's habits of poking the Unix socket via perl and shelling out to launch peers — both bypass caller identity and produce orphaned top-level tabs.
- Fix click-and-drag text selection from alt-screen TUIs. Host SGR mouse reporting now follows the focused child's screen side instead of being permanently armed; alt-screen TUIs that need mouse re-enable it themselves and the toggle is forwarded.
- Move drawSidebar() off the per-PTY-chunk hot path. Long claude session resume was paying a full sidebar rebuild for every scrolled chunk; the chrome ticker now drains a dirty flag at 60 Hz.
- Gate the per-chunk Title() CGO poll on a containsOSC scan so codex/ratatui's many SGR-only chunks no longer pay a CGO call each.	2026-05-15 12:41:47 +01:00
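The last two items in that commit (gating the title poll on an OSC scan, and deferring the sidebar draw to a ticker) might look roughly like the sketch below. Only the name containsOSC comes from the commit; its body here is an assumption, as are `onPTYOut`, `chromeTick`, `sidebarDirty`, and the callback parameters. The scan looks for the 7-bit OSC introducer (ESC followed by `]`) and conservatively reports true when a chunk ends on a bare ESC, since the `]` could arrive in the next chunk.

```go
package main

import (
	"bytes"
	"fmt"
	"sync/atomic"
)

// containsOSC reports whether chunk may contain an OSC introducer (ESC ']').
// Assumed implementation; the real function may differ.
func containsOSC(chunk []byte) bool {
	for {
		i := bytes.IndexByte(chunk, 0x1b)
		if i < 0 {
			return false
		}
		// A trailing ESC might be the first half of a split OSC.
		if i == len(chunk)-1 || chunk[i+1] == ']' {
			return true
		}
		chunk = chunk[i+1:]
	}
}

// sidebarDirty is a hypothetical flag set on the hot path, drained by the ticker.
var sidebarDirty atomic.Bool

// onPTYOut (hypothetical) does the cheap work per chunk and only invokes the
// expensive title poll when the chunk could carry a title-setting sequence.
func onPTYOut(chunk []byte, pollTitle func()) {
	sidebarDirty.Store(true)
	if containsOSC(chunk) {
		pollTitle()
	}
}

// chromeTick (hypothetical) would be driven by a ~60 Hz ticker; it redraws at
// most once per tick, however many chunks arrived since the last one.
func chromeTick(drawSidebar func()) {
	if sidebarDirty.Swap(false) {
		drawSidebar()
	}
}

func main() {
	polls, draws := 0, 0
	onPTYOut([]byte("\x1b[31mred\x1b[0m"), func() { polls++ }) // SGR only
	onPTYOut([]byte("\x1b]0;title\x07"), func() { polls++ })   // OSC 0
	chromeTick(func() { draws++ })
	chromeTick(func() { draws++ }) // flag already drained
	fmt.Println(polls, draws)      // prints "1 1"
}
```

The false-positive case (a trailing ESC that turns out not to start an OSC) costs one unnecessary poll, which seems a reasonable trade against missing a title change.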