Concrete perf metrics: live counters in --profile + benchmark suite

Live metrics (--profile):
- New metricsTracker instruments OnPTYOut, viewport renderer,
  stdout writes, libghostty-vt Write/Title CGO calls, sidebar /
  tabbar / status draws (with cache-hit accounting), snapshot
  replays, and the chrome ticker (so we can see ticker fires that
  did nothing).
- Writes metrics.jsonl (one snapshot per second) and metrics.json
  + summary.txt on exit, alongside the existing pprof files.
- All record* methods are nil-safe so disabled paths pay only a
  cheap nil check; counters are atomic so the per-PTY-chunk hot
  path stays lock-free.
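
A minimal sketch of the nil-safe, lock-free pattern described above. The field names, `recordPTYOut`, and `snapshot` are illustrative assumptions, not the identifiers this commit actually adds:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// metricsTracker is a hypothetical, cut-down tracker; the real struct
// in this commit has more counters and may name them differently.
type metricsTracker struct {
	ptyChunks atomic.Uint64 // number of PTY output chunks seen
	ptyBytes  atomic.Uint64 // total bytes across those chunks
}

// recordPTYOut is safe to call on a nil receiver, so call sites never
// need to check whether profiling is enabled.
func (m *metricsTracker) recordPTYOut(n int) {
	if m == nil {
		return // profiling disabled: one nil check, nothing else
	}
	m.ptyChunks.Add(1)
	m.ptyBytes.Add(uint64(n))
}

// snapshot reads the counters without taking a lock.
func (m *metricsTracker) snapshot() (chunks, bytes uint64) {
	if m == nil {
		return 0, 0
	}
	return m.ptyChunks.Load(), m.ptyBytes.Load()
}

func main() {
	var disabled *metricsTracker // --profile off
	disabled.recordPTYOut(4096)  // no-op, no panic

	enabled := &metricsTracker{} // --profile on
	enabled.recordPTYOut(4096)
	enabled.recordPTYOut(512)

	chunks, bytes := enabled.snapshot()
	fmt.Println(chunks, bytes) // prints "2 4608"
}
```

Because the receiver is a pointer, the disabled case is just a nil `*metricsTracker` and the hot path needs no flag or interface dispatch.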

Benchmark suite (go test -bench=.):
- Three workload fixtures — plain ASCII, SGR-styled lines, and a
  ratatui-style cursor-shuffling burst — plus a containsOSC
  microbenchmark. Reports ns/op, MB/s, allocs/op, B/op.
- Initial baseline numbers added to TODO under the perf-audit
  section, alongside two new findings those benchmarks surfaced:
  renderer allocs of ~1 per 4 bytes on styled chunks, and styled
  throughput topping out near 90 MB/s.
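
For orientation, a benchmark that reports MB/s, B/op, and allocs/op can be sketched as below. This is an assumed shape, not the suite in this commit: `styledFixture` and `countEscapes` are hypothetical stand-ins for the real fixtures and renderer, and `testing.Benchmark` is used only so the sketch runs outside `go test`.

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

// styledFixture builds a hypothetical SGR-styled workload: each line is
// wrapped in a colour-set sequence and a reset sequence.
func styledFixture(lines int) []byte {
	var buf bytes.Buffer
	for i := 0; i < lines; i++ {
		fmt.Fprintf(&buf, "\x1b[1;3%dmline %d\x1b[0m\r\n", i%8, i)
	}
	return buf.Bytes()
}

// countEscapes stands in for the real renderer, which is not shown
// here; it only counts ESC bytes in the chunk.
func countEscapes(chunk []byte) int {
	return bytes.Count(chunk, []byte{0x1b})
}

func benchStyled(b *testing.B) {
	chunk := styledFixture(1000)
	b.SetBytes(int64(len(chunk))) // enables the MB/s column
	b.ReportAllocs()              // enables B/op and allocs/op
	b.ResetTimer()                // exclude fixture construction
	for i := 0; i < b.N; i++ {
		if countEscapes(chunk) != 2000 {
			b.Fatal("unexpected escape count")
		}
	}
}

func main() {
	res := testing.Benchmark(benchStyled)
	fmt.Println(res.String(), res.MemString())
}
```

In a real `_test.go` file the same body would live in a `func BenchmarkXxx(b *testing.B)` and be driven by `go test -bench=.`.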
2026-05-15 13:31:37 +01:00
parent 442eed605c
commit 1c590f8e32
10 changed files with 931 additions and 7 deletions

TODO.md

@@ -2,6 +2,44 @@
Findings from a codebase sweep — not user-reported, needs review before
action. Each item names the anchor and a sketched fix.
Baseline benchmark numbers (`go test -bench=. ./internal/app/`, AMD
Ryzen 7 7800X3D):
```
ViewportRenderer_PlainASCII        229 MB/s  1.3 KB/op      6 allocs/op
ViewportRenderer_StyledLines        89 MB/s   91 KB/op   4325 allocs/op
ViewportRenderer_RatatuiBurst       40 MB/s  365 KB/op  17306 allocs/op
RendererThroughput_ReuseInstance    90 MB/s  316 KB/op  17380 allocs/op
ContainsOSC_NoOSC                 3050 MB/s    0 B/op       0 allocs/op
```
- [ ] **viewport renderer makes ~1 allocation per 4 input bytes on SGR/CSI-heavy chunks.** [MEDIUM]
- `internal/app/viewport_renderer.go` — the styled-lines and
ratatui benchmarks show 4-17k allocs per chunk. The hot
contributors are likely (a) `string(vr.buf)` / `string(params)`
conversions in `emitCSI` for every escape sequence, (b) the
`pending strings.Builder` resizing as fragments arrive, and (c)
`vr.shifter.Shift(vr.buf)` returning a fresh slice per CSI.
- Fix direction: switch CSI param parsing to byte-slice
comparison (no string conversion); reuse `vr.buf` and
`vr.pending` backing arrays across `Render` calls by
pre-growing in `newViewportRenderer`; have `cursorShifter.Shift`
return into a caller-owned buffer instead of allocating.
Profile-guided: run the styled-lines bench, point pprof at the
allocs profile, fix the top three call sites.
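
One way the "byte-slice, caller-owned buffer" direction could look for CSI parameters is sketched below. `parseCSIParams` is a hypothetical helper, not the existing `emitCSI` code, and it deliberately ignores overflow and non-digit bytes:

```go
package main

import "fmt"

// parseCSIParams parses semicolon-separated decimal parameters such as
// "1;31" straight from the byte slice, with no string conversion. It
// appends into out so the caller can reuse one backing array across
// calls. Empty parameters default to 0; non-digit bytes are skipped.
func parseCSIParams(params []byte, out []int) []int {
	out = out[:0]
	cur := 0
	for _, c := range params {
		switch {
		case c >= '0' && c <= '9':
			cur = cur*10 + int(c-'0')
		case c == ';':
			out = append(out, cur)
			cur = 0
		}
	}
	return append(out, cur)
}

func main() {
	scratch := make([]int, 0, 16) // allocated once, reused per sequence
	scratch = parseCSIParams([]byte("1;31"), scratch)
	fmt.Println(scratch) // prints "[1 31]"
	scratch = parseCSIParams([]byte("38;5;208"), scratch)
	fmt.Println(scratch) // prints "[38 5 208]"
}
```

As long as a sequence has no more parameters than the scratch capacity, the parse itself does not allocate; whether that matches the real top call sites still needs the pprof run described above.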
- [ ] **viewport renderer throughput (~90 MB/s styled) limits codex steady-state.** [MEDIUM]
- The styled-lines and ratatui benchmarks come in at 89 MB/s and
40 MB/s respectively. A 100 KB/s codex burst is far under that
limit, but a session-resume dump of a 5 MiB chat history takes
roughly 60-130 ms of pure renderer time at those rates (about
59 ms at 89 MB/s, 131 ms at 40 MB/s), enough to be user-visible
at the start of a long resume.
- Fix direction: same as the alloc fix above; once the per-call
allocation cost drops, the throughput ceiling rises with it.
Worth re-running the benches after fixing the allocs and only
investing further if the styled-lines bench is still under
~300 MB/s.
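
The resume-time estimate is plain division and can be checked with the sketch below. `renderMillis` is a hypothetical helper, and it assumes MB/s means 10^6 bytes per second, as in Go's benchmark output:

```go
package main

import "fmt"

// renderMillis estimates pure renderer time in milliseconds for a
// payload of payloadBytes at a throughput of mbPerSec (10^6 bytes/s).
func renderMillis(payloadBytes, mbPerSec float64) float64 {
	return payloadBytes / (mbPerSec * 1e6) * 1000
}

func main() {
	const dump = 5 * 1024 * 1024 // 5 MiB session-resume dump

	fmt.Printf("styled:  %.0f ms\n", renderMillis(dump, 89))  // prints "styled:  59 ms"
	fmt.Printf("ratatui: %.0f ms\n", renderMillis(dump, 40))  // prints "ratatui: 131 ms"
	fmt.Printf("target:  %.0f ms\n", renderMillis(dump, 300)) // prints "target:  17 ms"
}
```

At the ~300 MB/s target the same dump would cost about 17 ms of renderer time.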
- [ ] **Session.Children() allocates a fresh slice on every call.** [MEDIUM]
- `internal/app/session.go:530-541` walks `s.order` under `s.mu` and
builds a new `[]*Child` slice every time. Callers on hot paths: