Files
patterm/TODO.md
Harry Bayliss 2f109a84fa Stress-test ASCII video at 30/60/120 fps; fix libghostty-vt Debug build
Added a full ASCII-video benchmark suite that hammers the renderer
with 30 KiB / 70 KiB full-screen frames at 30, 60, and 120 fps
targets — both renderer-only and full-pipeline (em.Write + renderer
+ stdout). Each stream benchmark reports µs/frame, fps_ceiling, and
percent of the per-frame budget consumed.

The pipeline benchmarks revealed we were missing 120 fps by a wide
margin (190%-350% of budget at 120fps, 60-90 fps ceiling). Isolating
em.Write confirmed libghostty-vt is the bottleneck — 16-29 ms per
truecolor frame, library file at 33 MiB.

Root cause: the Makefile invoked `zig build` with no
-Doptimize, and Zig's standardOptimizeOption defaults to Debug. So
the shipped libghostty-vt was unoptimised. Fixed by pinning
ReleaseFast in the Makefile (override via GHOSTTY_VT_OPTIMIZE for
debug builds of the upstream lib).

Existing checkouts need `make clean-deps && make deps` to pick up
the rebuild.
2026-05-15 13:43:31 +01:00

202 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Perf Audit (auto-generated 2026-05-15)
Findings from a codebase sweep — not user-reported, needs review before
action. Each item names the anchor and a sketched fix.
Baseline benchmark numbers (`go test -bench=. ./internal/app/`, AMD
Ryzen 7 7800X3D, libghostty-vt **Debug-mode** — see the first item
below):
```
# Renderer alone
ViewportRenderer_PlainASCII 229 MB/s 1.3 KB/op 6 allocs/op
ViewportRenderer_StyledLines 89 MB/s 91 KB/op 4325 allocs/op
ViewportRenderer_RatatuiBurst 40 MB/s 365 KB/op 17306 allocs/op
RendererThroughput_ReuseInstance 90 MB/s 316 KB/op 17380 allocs/op
ContainsOSC_NoOSC 3050 MB/s 0 B/op 0 allocs/op
# ASCII-video stream (renderer only — 3 sec at the target fps)
ASCIIVideo_Stream_8Color_120fps 260 µs/frame 3845 fps_ceiling 3.1% budget
ASCIIVideo_Stream_TrueColor_120fps 576 µs/frame 1735 fps_ceiling 6.9% budget
# Full pipeline (em.Write + renderer + io.Discard write)
Pipeline_ASCIIVideo_8Color_120fps 15838 µs/frame 63 fps_ceiling 190% budget
Pipeline_ASCIIVideo_TrueColor_120fps 29224 µs/frame 34 fps_ceiling 350% budget
# Emulator alone (libghostty-vt CSI/SGR parser)
Emulator_Write_Stream_8Color_120fps 15930 µs/frame 63 fps_ceiling
Emulator_Write_Stream_TrueColor_120fps 29241 µs/frame 34 fps_ceiling
```
The renderer alone hits 1700-3800 fps with margin. The full pipeline
caps at 34-63 fps. **The whole gap is libghostty-vt's em.Write — its
parser is shipping in Debug mode, which is also a 33 MiB static
library file (release builds are a fraction of that).**
- [ ] **libghostty-vt was being built in Debug mode.** [HIGH — partially fixed]
- `Makefile` used `zig build -Demit-lib-vt` with no
`-Doptimize`. Zig's `standardOptimizeOption` defaults to
`.Debug`, so the shipped static lib was unoptimised. Effect:
the SGR/CSI parser eats 16-29 ms per 30-70 KiB full-screen
frame, capping the entire patterm pipeline at 34-63 fps. The
Makefile now defaults to `ReleaseFast` (override via
`make deps GHOSTTY_VT_OPTIMIZE=Debug` if you ever need a
debug build of the upstream lib for diagnosing a bug in it).
- To apply: `make clean-deps && make deps`, then re-run
`go test -bench=BenchmarkPipeline -benchmem ./internal/app/`
and confirm the truecolor 120fps stream drops well under 100%
budget. Update the numbers in this section after rebuilding.
- Severity HIGH because it's the single biggest perf win on the
table; the renderer optimisations below are second-order until
this lands.
- [ ] **viewport renderer allocates ~1 alloc per 4 input bytes on SGR/CSI-heavy chunks.** [MEDIUM]
- `internal/app/viewport_renderer.go` — the styled-lines and
ratatui benchmarks show 4-17k allocs per chunk. The hot
contributors are likely (a) `string(vr.buf)` / `string(params)`
conversions in `emitCSI` for every escape sequence, (b) the
`pending strings.Builder` resizing as fragments arrive, and (c)
`vr.shifter.Shift(vr.buf)` returning a fresh slice per CSI.
- Fix direction: switch CSI param parsing to byte-slice
comparison (no string conversion); reuse `vr.buf` and
`vr.pending` backing arrays across `Render` calls by
pre-growing in `newViewportRenderer`; have `cursorShifter.Shift`
return into a caller-owned buffer instead of allocating.
Profile-guided: run the styled-lines bench, point pprof at the
allocs profile, fix the top three call sites.
- [ ] **viewport renderer throughput (~90 MB/s styled) limits codex steady-state.** [MEDIUM]
- The styled-lines and ratatui benchmarks come in at 89 MB/s and
40 MB/s respectively. A 100 KB/s codex burst is far under that
limit, but a session-resume dump of a 5 MiB chat history takes
50-130 ms of pure renderer time at those rates — enough to be
user-visible at the start of a long resume.
- Fix direction: same as the alloc fix above; once the per-call
allocation cost drops, the throughput ceiling rises with it.
Worth re-running the benches after fixing the allocs and only
investing further if the styled-lines bench is still under
~300 MB/s.
- [ ] **Session.Children() allocates a fresh slice on every call.** [MEDIUM]
- `internal/app/session.go:530-541` walks `s.order` under `s.mu` and
builds a new `[]*Child` slice every time. Callers on hot paths:
`drawSidebar` calls it twice per frame
(`internal/app/sidebar.go:139` and `:171`); `drawTabBar` calls it
once per frame (`internal/app/tabbar.go:37`); the classifier
iterates it every 250 ms (`internal/app/classifier.go:38`); and
palette/navigation hit it on every Ctrl-A/D/W/S keystroke.
- Fix direction: store the snapshot in an `atomic.Pointer[[]*Child]`
on `Session`, refresh it under `s.mu` only when `Spawn` / `delete`
mutates the map. Readers get O(1) `Load()` with zero allocation —
same pattern already used for `listeners` (session.go:118-123).
- [ ] **wait_for_pattern re-scans the entire stream/grid every iteration.** [MEDIUM]
- `internal/app/host.go:476-493` (the `check` closure). On `scope =
"scrollback"` it calls `c.StreamRead(0)` followed by
`stripANSIBytes(nil, b)` over the entire ring on every wake — a
full O(ring size) walk per chunk arrival. On `grid` it goes
through PlainText (one CGO call) plus a regex match against the
full grid string. For an agent waiting on a marker in a chatty
pane, every PTY chunk fires `check()`.
- Fix direction: for `scrollback`, track the offset of the last
check and run the regex only over the new tail, reusing a
per-call scratch buffer for ANSI stripping. For `grid`, dedupe
on `ScreenVersion()` — skip when version hasn't changed.
- [ ] **search_output compiles regex + strips ANSI on every call.** [MEDIUM]
- `internal/app/host.go:428` compiles a fresh `regexp.Regexp` per
invocation; `:434` strips ANSI over the entire ring buffer when
`kind="rendered"`. Agents that poll `search_output` with the same
pattern (the typical "watch for marker" loop) repay both costs on
every call.
- Fix direction: small LRU of compiled regexes keyed by pattern
string (cap maybe 32) on `toolHost`. Cache the stripped-ANSI
buffer keyed by `c.ScreenVersion()` so consecutive searches over
an unchanged ring reuse the strip.
- [ ] **GetProcessOutput grid mode acquires the emulator twice.** [MEDIUM]
- `internal/app/host.go:375-391` does `em := c.Emulator()` for
ActiveScreen / Cursor / Size, then at line 387 re-fetches
`em := c.Emulator()` for PlainText. Each `Emulator()` call goes
through `ptyMu` and inspects the live PTY pointer. Under a
chatty agent polling `get_process_output` every 100 ms this is
a redundant lock and pointer chase per call.
- Fix direction: hold the emulator reference from the first
lookup; reuse it for PlainText. The check `if em == nil` still
runs cleanly because the variable is captured.
- [ ] **FindChildByIdentity is O(N) under the session lock.** [LOW]
- `internal/app/session.go:553-565` scans the children map looking
for a matching `Identity` token on every new mcp-stdio
connection. Not a steady-state hot path — only fires once per
child spawn — but with many short-lived sub-agents it adds up
and contends with everyone else taking `s.mu`.
- Fix direction: maintain an `identityIndex map[string]string`
(identity → child id) updated alongside spawn / exit, give the
lookup an O(1) read.
- [ ] **Per-promoter regex matches in the idle classifier.** [LOW]
- `internal/app/idle.go:175-182` (`matchAny`) walks each compiled
pattern and runs the DFA over the same 4 KiB tail. A preset with
five permission patterns + five error patterns is ten DFA
invocations per child per 250 ms tick.
- Fix direction: at preset load time, compile each `_patterns`
list into a single alternation regex (`(?:p1)|(?:p2)|…`). The
classifier then makes one Match call per category per tick.
- [ ] **Port-detection dedup is O(N²) over c.ports.** [LOW]
- `internal/app/child.go:461-467`: for each fresh URL match the
code linearly scans the existing port list. The list rarely
grows past a handful, but a dev server that lists "all open
ports" in one log line interacts badly: M new matches × N
existing entries.
- Fix direction: keep a `seenPorts map[int]struct{}` next to
`c.ports`, rebuilt on prune (none today). O(1) per match.
- [ ] **Port-sighting string allocations happen before the dedup check.** [LOW]
- `internal/app/child.go:455-456` allocates `urlForm` and `portStr`
before line 461's `seen` walk. Both strings are wasted when the
port is already in `c.ports`. Inside `c.portsMu` for the whole
loop body too, blocking the `Ports()` reader path.
- Fix direction: bind the port int first (cheap parse from
`m[1]`), do the seen check, only then allocate the URL string
for the surviving sighting.
- [ ] **classifier `time.Now()` syscall per child per tick.** [LOW]
- `internal/app/classifier.go:54` (and the `IdleMS` /
`TitleIdleMS` helpers it transitively calls in
`internal/app/child.go:343-374`) each call `time.Now()`.
Reading time on Linux is fast (vDSO) but with N children × 4
`time.Now()` per tick × 4 ticks/sec it's wasted work that can
be batched.
- Fix direction: capture `now := time.Now().UnixNano()` once at
the top of `classifyAll` and thread it into `classifyOne` and
the helpers as a parameter.
- [ ] **wait_for_pattern subscribes a listener for every call.** [LOW]
- `internal/app/host.go:472-474`: each invocation calls
`Session.Subscribe(wake)` which clones the listener slice and
swaps the atomic pointer; the `defer Unsubscribe` does the same
on exit. Two allocations per `wait_for_pattern`. The agent
pattern of looping on `wait_for_pattern` after every tool call
pays this churn on the steady-state path.
- Fix direction: a per-child `chunkBroadcaster` registered once
at child spawn that hands out lightweight subscriber tokens,
rather than going through the full session listener machinery.
# On Hold
- [ ] There's a unicode <?> being displayed in opencode [ON HOLD]
- Investigated 2026-05-14: patterm passes ghostty grapheme codepoints
through unchanged (vt/ghostty.go:452-462), so the `<?>` glyph is
most likely the *host* terminal's font fallback for opencode's
Nerd Font private-use codepoints, not a patterm substitution.
Need a concrete reproduction (which codepoint, which host
terminal/font) before changing rendering.
- [ ] After codex rips for like 15 minutes, the terminal becomes quite slow. [ON HOLD / VERIFYING]
- 2026-05-14: Perf plan P1-P11 landed (see CHANGELOG). Needs a real
long-running codex session to confirm whether the steady-state
slowdown is gone or some hotspot remains. Capture a pprof if it
still feels slow after ≥15 minutes — the structural drivers the
audit named are all addressed, so a remaining symptom is a new
one and probably wants fresh profiling.