Perf Audit (reviewed 2026-05-15)
Findings that survived the 2026-05-15 review pass. Low and marginal items from the original sweep were removed; remaining items have enough measured or workflow evidence to justify action.
Baseline benchmark numbers (`go test -bench=. ./internal/app/`, AMD Ryzen 7 7800X3D, libghostty-vt ReleaseFast after the Makefile fix landed):
# Renderer alone
ViewportRenderer_PlainASCII 229 MB/s 1.3 KB/op 6 allocs/op
ViewportRenderer_StyledLines 89 MB/s 91 KB/op 4325 allocs/op
ViewportRenderer_RatatuiBurst 40 MB/s 365 KB/op 17306 allocs/op
RendererThroughput_ReuseInstance 90 MB/s 316 KB/op 17380 allocs/op
ContainsOSC_NoOSC 3050 MB/s 0 B/op 0 allocs/op
# ASCII-video stream (renderer only — 3 sec at the target fps)
ASCIIVideo_Stream_8Color_120fps 260 µs/frame 3845 fps_ceiling 3.1% budget
ASCIIVideo_Stream_TrueColor_120fps 576 µs/frame 1735 fps_ceiling 6.9% budget
# Full pipeline (em.Write + renderer + io.Discard write)
Pipeline_ASCIIVideo_8Color_120fps 493 µs/frame 2030 fps_ceiling 5.9% budget
Pipeline_ASCIIVideo_TrueColor_120fps 1075 µs/frame 931 fps_ceiling 12.9% budget
# Emulator alone (libghostty-vt CSI/SGR parser)
Emulator_Write_Stream_8Color_120fps 257 µs/frame 3890 fps_ceiling
Emulator_Write_Stream_TrueColor_120fps 488 µs/frame 2051 fps_ceiling
The current pipeline still has large 120 fps headroom. The remaining renderer concern is multi-MiB styled replay latency and allocation churn, not normal steady-state frame budget.
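The `fps_ceiling` and `budget` columns above appear to follow directly from the per-frame time and the 120 fps target (a budget of about 8,333 µs per frame). A minimal sketch of that arithmetic, assuming this is how the benchmarks derive them; `frameBudget` is a hypothetical helper, not a function from the repo:

```go
package main

import "fmt"

// frameBudget reports the fps ceiling and the share of the per-frame budget
// used by a frame that takes frameMicros microseconds at targetFPS.
// Hypothetical helper for illustration only.
func frameBudget(frameMicros, targetFPS float64) (fpsCeiling, budgetPct float64) {
	budgetMicros := 1e6 / targetFPS // 120 fps gives roughly 8333 µs per frame
	return 1e6 / frameMicros, 100 * frameMicros / budgetMicros
}

func main() {
	// Per-frame times taken from the renderer-only and full-pipeline rows.
	for _, us := range []float64{260, 576, 493, 1075} {
		ceiling, pct := frameBudget(us, 120)
		fmt.Printf("%5.0f µs/frame  %5.0f fps_ceiling  %4.1f%% budget\n", us, ceiling, pct)
	}
}
```

The results agree with the table to within rounding of the reported µs/frame values. Even the slowest row (TrueColor full pipeline) stays under 13% of the frame budget.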
- viewport renderer allocates heavily on SGR/CSI-heavy chunks. [MEDIUM]
  - Review evidence: five benchmark reps confirmed `ViewportRenderer_StyledLines` at about 4,325 allocs per 16 KiB chunk (~91.5 KB/op, roughly 1 alloc per 3.8 input bytes), and `ViewportRenderer_RatatuiBurst` at about 17,306 allocs per chunk (~365 KB/op). A 5 MiB styled resume benchmark allocated about 31 MB across 1.38M objects.
  - Likely hot paths: generic CSI/SGR output in `internal/app/viewport_renderer.go` sends many sequences through `vr.shifter.Shift(vr.buf)`, while `internal/app/cursorshift.go` returns a fresh `[]byte` via `pending.String()` on every `Shift` call and parses CSI params through `string(raw)`/`strings.Split`. The mode-helper `string(params)` conversions are real, but probably not the main SGR-heavy cost.
  - Fix direction: make `cursorShifter` write into caller-owned scratch output or directly into the viewport renderer's pending builder; parse CSI params from byte slices; pre-grow/reuse renderer and shifter buffers. Re-run styled-lines, ratatui, and 5 MiB resume benchmarks; use pprof when available to confirm the top allocation sites.
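The fix direction for the renderer allocations can be sketched as follows. This is only an illustration of the two techniques named (byte-slice param parsing and append-style output into a caller-owned buffer). `parseCSIParams` and `appendCUP` are hypothetical names, and the real `cursorShifter` logic is not shown in this audit, so the "shift" here is a placeholder.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseCSIParams appends the numeric parameters of a CSI body such as
// "1;38;5;208" to dst and returns it. It walks the bytes directly, so there
// is no string(raw) conversion and no strings.Split result slice.
// Empty parameters parse as 0; any byte other than a digit or ';' makes ok
// false. Overflow on very long digit runs is not guarded in this sketch.
func parseCSIParams(dst []int, raw []byte) (params []int, ok bool) {
	n := 0
	for _, b := range raw {
		switch {
		case b >= '0' && b <= '9':
			n = n*10 + int(b-'0')
		case b == ';':
			dst = append(dst, n)
			n = 0
		default:
			return dst, false
		}
	}
	return append(dst, n), true
}

// appendCUP appends a cursor-position sequence (ESC [ row ; col H) to dst.
// The caller owns dst and can reuse its capacity by passing dst[:0].
func appendCUP(dst []byte, row, col int) []byte {
	dst = append(dst, 0x1b, '[')
	dst = strconv.AppendInt(dst, int64(row), 10)
	dst = append(dst, ';')
	dst = strconv.AppendInt(dst, int64(col), 10)
	return append(dst, 'H')
}

func main() {
	var paramScratch [8]int    // fixed scratch reused for every sequence
	out := make([]byte, 0, 64) // pre-grown output buffer, reused via out[:0]

	params, ok := parseCSIParams(paramScratch[:0], []byte("12;40"))
	if ok && len(params) == 2 {
		// Placeholder shift: move the row down by 3 lines.
		out = appendCUP(out[:0], params[0]+3, params[1])
	}
	fmt.Printf("%q\n", out)
}
```

With both buffers reused, the steady-state path performs no per-sequence heap allocation as long as the scratch capacities are not exceeded. Running the styled-lines benchmark with `-benchmem` before and after would show whether allocs/op actually drops.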
- large styled resume/replay dumps spend visible time in viewport rendering. [MEDIUM]
  - Review evidence: `BenchmarkSessionResume_5MiBStyled` measured about 58 ms median and 63 ms p95 over five reps. The plain 5 MiB benchmark was about 23-24 ms with only 21 allocs. The live path renders focused PTY chunks through `renderer.Render`, then still pays emulator writes, ring writes, event dispatch, stdout writes, and real terminal paint.
  - Scope: this is not a Codex steady-state throughput limit. A 100 KB/s stream is far below the styled renderer's ~80-90 MB/s ceiling. It matters for multi-MiB burst replay, resume/startup dumps, and dense full-screen churn.
  - Fix direction: do the allocation fix first, since it should also improve throughput. After that, invest further only if styled resume traces remain user-visible or the styled-lines benchmark is still under roughly 300 MB/s.
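The ~80-90 MB/s styled ceiling is consistent with the 5 MiB resume timings, which is easy to check by hand. A small sketch of the conversion, using decimal MB as the baseline table appears to; `throughputMBps` is a hypothetical helper:

```go
package main

import "fmt"

// throughputMBps converts a byte count and an elapsed time in milliseconds
// to decimal megabytes per second.
func throughputMBps(bytes int, elapsedMs float64) float64 {
	return float64(bytes) / (elapsedMs / 1000) / 1e6
}

func main() {
	const dump = 5 << 20 // 5 MiB resume dump

	fmt.Printf("styled median (58 ms): %.1f MB/s\n", throughputMBps(dump, 58))
	fmt.Printf("styled p95    (63 ms): %.1f MB/s\n", throughputMBps(dump, 63))
	fmt.Printf("plain       (23.5 ms): %.1f MB/s\n", throughputMBps(dump, 23.5))
}
```

The styled figures land at roughly 90 and 83 MB/s, close to the `ViewportRenderer_StyledLines` baseline, so a 100 KB/s stream uses on the order of 0.1% of that capacity.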
- wait_for_pattern re-scans the entire stream/grid while waiting. [MEDIUM]
  - `internal/app/host.go:476-493` (the `check` closure). On `scope="scrollback"` it calls `c.StreamRead(0)` followed by `stripANSIBytes(nil, b)`, so each check can copy, strip, and search the full 1 MiB ring. On `scope="grid"` it calls `PlainText()` and runs the regex against the full grid string.
  - Caveat from review: the current chunk notifier coalesces bursts with a buffered channel and has a 500 ms fallback, so this is not necessarily one full scan per PTY chunk. It is still meaningful for active waits on chatty panes.
  - Fix direction: for `scrollback`, track the last checked stream offset and search only new output plus a bounded overlap/scratch buffer so matches spanning chunks are not missed. For `grid`, dedupe on `ScreenVersion()` and skip work when the version has not changed.
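Both halves of the `wait_for_pattern` fix direction can be sketched in a few lines. This is an illustration under assumptions, not the host's code: `tailMatcher` and `gridChecker` are hypothetical types, the screen version is assumed to be a `uint64`, and the caller is assumed to pass only already-stripped bytes that arrived since the last check.

```go
package main

import (
	"fmt"
	"regexp"
)

// tailMatcher searches a growing stream incrementally. It carries over at
// most overlap bytes of already-scanned output, so a match that straddles
// two reads is still found as long as its already-seen part fits in overlap.
// Patterns anchored with ^ or $ would need extra care.
type tailMatcher struct {
	re      *regexp.Regexp
	overlap int
	tail    []byte // suffix of previously scanned output
}

func newTailMatcher(pattern string, overlap int) *tailMatcher {
	return &tailMatcher{re: regexp.MustCompile(pattern), overlap: overlap}
}

// Feed scans the carried-over tail plus fresh and reports whether re matched.
func (m *tailMatcher) Feed(fresh []byte) bool {
	m.tail = append(m.tail, fresh...)
	if m.re.Match(m.tail) {
		return true
	}
	if extra := len(m.tail) - m.overlap; extra > 0 {
		// Slide the last overlap bytes to the front, reusing the buffer.
		n := copy(m.tail, m.tail[extra:])
		m.tail = m.tail[:n]
	}
	return false
}

// gridChecker reruns the regex only when the screen version has changed.
type gridChecker struct {
	re      *regexp.Regexp
	seen    bool
	lastVer uint64
	lastHit bool
}

func newGridChecker(pattern string) *gridChecker {
	return &gridChecker{re: regexp.MustCompile(pattern)}
}

// Check calls plainText (the expensive part) only for a new version.
func (g *gridChecker) Check(version uint64, plainText func() string) bool {
	if g.seen && version == g.lastVer {
		return g.lastHit
	}
	g.seen, g.lastVer = true, version
	g.lastHit = g.re.MatchString(plainText())
	return g.lastHit
}

func main() {
	m := newTailMatcher(`build succeeded`, 32)
	fmt.Println(m.Feed([]byte("compiling...\nbuild suc"))) // false
	fmt.Println(m.Feed([]byte("ceeded\n")))                // true

	g := newGridChecker(`ready`)
	fmt.Println(g.Check(7, func() string { return "server ready" })) // true
}
```

The scan cost per check becomes proportional to the new output plus the overlap instead of the full ring. The overlap size is a trade-off: it must be at least as long as the longest match expected to span a chunk boundary, and ANSI stripping of a sequence split across two chunks would need separate handling.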
- search_output rebuilds and searches whole scrollback on every call. [MEDIUM]
  - `internal/app/host.go:428-437` compiles a fresh regex, reads the stream from offset 0, strips ANSI for `kind="rendered"`, converts the full buffer to a string, and splits it into lines before applying `limit`. This is meaningful when agents poll the same pattern; it is low impact for ad hoc searches.
  - Fix direction: cache compiled regexes by pattern; cache stripped rendered output by child id and stream end offset; avoid `strings.Split` over the whole ring when only the first `limit` matches are needed. Prefer an incremental search shape if this becomes the standard "watch for marker" path.
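Two of the `search_output` fixes, the compiled-regex cache and the early-exit line scan, can be sketched like this. The names `regexCache`, `firstMatches`, and `searchFirst` are hypothetical, the cache is unbounded for brevity (real code would likely cap or evict), and the stripped-output cache keyed by child id and stream end offset is not shown.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
	"sync"
)

// regexCache memoizes compiled patterns. A compiled *regexp.Regexp is safe
// for concurrent use, so sharing cached values across callers is fine.
type regexCache struct {
	mu sync.Mutex
	m  map[string]*regexp.Regexp
}

func (c *regexCache) get(pattern string) (*regexp.Regexp, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if re, ok := c.m[pattern]; ok {
		return re, nil
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	if c.m == nil {
		c.m = make(map[string]*regexp.Regexp)
	}
	c.m[pattern] = re
	return re, nil
}

// firstMatches walks buf one line at a time and returns up to limit matching
// lines. It stops as soon as limit is reached and only converts matching
// lines to strings, so the whole buffer is never split or copied.
func firstMatches(re *regexp.Regexp, buf []byte, limit int) []string {
	var out []string
	for len(buf) > 0 && len(out) < limit {
		line := buf
		if i := bytes.IndexByte(buf, '\n'); i >= 0 {
			line, buf = buf[:i], buf[i+1:]
		} else {
			buf = nil
		}
		if re.Match(line) {
			out = append(out, string(line))
		}
	}
	return out
}

// searchFirst combines the cache and the bounded scan.
func searchFirst(c *regexCache, pattern string, buf []byte, limit int) ([]string, error) {
	re, err := c.get(pattern)
	if err != nil {
		return nil, err
	}
	return firstMatches(re, buf, limit), nil
}

func main() {
	var cache regexCache
	scrollback := []byte("ok\nERR disk full\nok\nERR timeout\nERR retry\n")
	hits, err := searchFirst(&cache, `^ERR`, scrollback, 2)
	fmt.Println(hits, err)
}
```

For an agent polling the same pattern, repeated calls skip compilation entirely, and a small `limit` bounds the scan to the prefix of the buffer that contains the first matches.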
On Hold
- There's a stray Unicode glyph being displayed in opencode [ON HOLD]
  - Investigated 2026-05-14: patterm passes ghostty grapheme codepoints through unchanged (`vt/ghostty.go:452-462`), so the `<?>` glyph is most likely the host terminal's font fallback for opencode's Nerd Font private-use codepoints, not a patterm substitution. Need a concrete reproduction (which codepoint, which host terminal/font) before changing rendering.
- After codex rips for like 15 minutes, the terminal becomes quite slow. [ON HOLD / VERIFYING]
  - 2026-05-14: Perf plan P1-P11 landed (see CHANGELOG). Needs a real long-running codex session to confirm whether the steady-state slowdown is gone or some hotspot remains. Capture a pprof if it still feels slow after ≥15 minutes. The structural drivers the audit named are all addressed, so a remaining symptom is a new one and probably wants fresh profiling.