Concrete perf metrics: live counters in --profile + benchmark suite

Live metrics (--profile):
- New metricsTracker instruments OnPTYOut, viewport renderer, stdout
  writes, libghostty-vt Write/Title CGO calls, sidebar / tabbar /
  status draws (with cache-hit accounting), snapshot replays, and the
  chrome ticker (so we can see ticker fires that did nothing).
- Writes metrics.jsonl (one snapshot per second) and metrics.json +
  summary.txt on exit, alongside the existing pprof files.
- All record* methods are nil-safe so disabled paths pay only a cheap
  nil check; counters are atomic so the per-PTY-chunk hot path stays
  lock-free.

Benchmark suite (go test -bench=.):
- Three workload fixtures (plain ASCII, SGR-styled lines, and a
  ratatui-style cursor-shuffling burst) plus a containsOSC
  microbenchmark. Reports ns/op, MB/s, allocs/op, B/op.
- Initial baseline numbers added to TODO under the perf-audit section,
  alongside two new findings (renderer allocs ~1 per 4 bytes on styled
  chunks; styled throughput tops out near 90 MB/s) those benchmarks
  surfaced.
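
The nil-safe, lock-free counter pattern described in the message can be sketched roughly as follows. This is an illustrative sketch only, not the code from this commit: the struct fields, `recordPTYOut`, and `snapshot` are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// metricsTracker is a hypothetical stand-in for the tracker named in the
// commit message; its fields and method names are assumptions.
type metricsTracker struct {
	ptyChunks atomic.Uint64
	ptyBytes  atomic.Uint64
}

// recordPTYOut is safe to call on a nil receiver: when profiling is off
// the caller holds a nil *metricsTracker and pays only this nil check.
func (m *metricsTracker) recordPTYOut(n int) {
	if m == nil {
		return
	}
	m.ptyChunks.Add(1)
	m.ptyBytes.Add(uint64(n))
}

// snapshot reads the counters without taking a lock.
func (m *metricsTracker) snapshot() (chunks, bytes uint64) {
	if m == nil {
		return 0, 0
	}
	return m.ptyChunks.Load(), m.ptyBytes.Load()
}

func main() {
	var off *metricsTracker // profiling disabled
	off.recordPTYOut(4096)  // no-op, no panic

	on := &metricsTracker{} // profiling enabled
	on.recordPTYOut(4096)
	on.recordPTYOut(512)
	chunks, bytes := on.snapshot()
	fmt.Println(chunks, bytes)
}
```

Because the methods take a pointer receiver and check it first, call sites need no `if profiling` guard of their own; the two counters are updated independently, so a snapshot taken mid-update may be off by one chunk, which is usually acceptable for per-second sampling.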
2026-05-15 13:31:37 +01:00
parent 442eed605c
commit 1c590f8e32
10 changed files with 931 additions and 7 deletions
--- a/TODO.md
+++ b/TODO.md
@@ -2,6 +2,44 @@
 Findings from a codebase sweep — not user-reported, needs review before
 action. Each item names the anchor and a sketched fix.

+Baseline benchmark numbers (`go test -bench=. ./internal/app/`, AMD
+Ryzen 7 7800X3D):
+
+```
+ViewportRenderer_PlainASCII       229 MB/s     1.3 KB/op    6 allocs/op
+ViewportRenderer_StyledLines       89 MB/s    91   KB/op  4325 allocs/op
+ViewportRenderer_RatatuiBurst      40 MB/s   365   KB/op 17306 allocs/op
+RendererThroughput_ReuseInstance   90 MB/s   316   KB/op 17380 allocs/op
+ContainsOSC_NoOSC                3050 MB/s     0   B/op     0 allocs/op
+```
+
+- [ ] **viewport renderer allocates ~1 alloc per 4 input bytes on SGR/CSI-heavy chunks.** [MEDIUM]
+  - `internal/app/viewport_renderer.go` — the styled-lines and
+    ratatui benchmarks show 4-17k allocs per chunk. The hot
+    contributors are likely (a) `string(vr.buf)` / `string(params)`
+    conversions in `emitCSI` for every escape sequence, (b) the
+    `pending strings.Builder` resizing as fragments arrive, and (c)
+    `vr.shifter.Shift(vr.buf)` returning a fresh slice per CSI.
+  - Fix direction: switch CSI param parsing to byte-slice
+    comparison (no string conversion); reuse `vr.buf` and
+    `vr.pending` backing arrays across `Render` calls by
+    pre-growing in `newViewportRenderer`; have `cursorShifter.Shift`
+    return into a caller-owned buffer instead of allocating.
+    Profile-guided: run the styled-lines bench, point pprof at the
+    allocs profile, fix the top three call sites.
+
+- [ ] **viewport renderer throughput (~90 MB/s styled) limits codex steady-state.** [MEDIUM]
+  - The styled-lines and ratatui benchmarks come in at 89 MB/s and
+    40 MB/s respectively. A 100 KB/s codex burst is far under that
+    limit, but a session-resume dump of a 5 MiB chat history takes
+    50-130 ms of pure renderer time at those rates — enough to be
+    user-visible at the start of a long resume.
+  - Fix direction: same as the alloc fix above; once the per-call
+    allocation cost drops, the throughput ceiling rises with it.
+    Worth re-running the benches after fixing the allocs and only
+    investing further if the styled-lines bench is still under
+    ~300 MB/s.
+
 - [ ] **Session.Children() allocates a fresh slice on every call.** [MEDIUM]
  - `internal/app/session.go:530-541` walks `s.order` under `s.mu` and
    builds a new `[]*Child` slice every time. Callers on hot paths: