Added a full ASCII-video benchmark suite that hammers the renderer with 30 KiB / 70 KiB full-screen frames at 30, 60, and 120 fps targets, both renderer-only and full-pipeline (em.Write + renderer + stdout). Each stream benchmark reports µs/frame, fps_ceiling, and the percentage of the per-frame budget consumed. The pipeline benchmarks revealed we were missing 120 fps by a wide margin (190%-350% of budget at 120 fps, 34-63 fps ceiling). Isolating em.Write confirmed libghostty-vt is the bottleneck: 16-29 ms per frame, with the library file at 33 MiB. Root cause: the Makefile invoked `zig build` with no `-Doptimize`, and Zig's `standardOptimizeOption` defaults to Debug, so the shipped libghostty-vt was unoptimised. Fixed by pinning ReleaseFast in the Makefile (override via `GHOSTTY_VT_OPTIMIZE` for debug builds of the upstream lib). Existing checkouts need `make clean-deps && make deps` to pick up the rebuild.
Perf Audit (auto-generated 2026-05-15)
Findings from a codebase sweep — not user-reported, needs review before action. Each item names the anchor and a sketched fix.
Baseline benchmark numbers (`go test -bench=. ./internal/app/`, AMD Ryzen 7 7800X3D, libghostty-vt in Debug mode; see the first item below):
# Renderer alone
ViewportRenderer_PlainASCII 229 MB/s 1.3 KB/op 6 allocs/op
ViewportRenderer_StyledLines 89 MB/s 91 KB/op 4325 allocs/op
ViewportRenderer_RatatuiBurst 40 MB/s 365 KB/op 17306 allocs/op
RendererThroughput_ReuseInstance 90 MB/s 316 KB/op 17380 allocs/op
ContainsOSC_NoOSC 3050 MB/s 0 B/op 0 allocs/op
# ASCII-video stream (renderer only — 3 sec at the target fps)
ASCIIVideo_Stream_8Color_120fps 260 µs/frame 3845 fps_ceiling 3.1% budget
ASCIIVideo_Stream_TrueColor_120fps 576 µs/frame 1735 fps_ceiling 6.9% budget
# Full pipeline (em.Write + renderer + io.Discard write)
Pipeline_ASCIIVideo_8Color_120fps 15838 µs/frame 63 fps_ceiling 190% budget
Pipeline_ASCIIVideo_TrueColor_120fps 29224 µs/frame 34 fps_ceiling 350% budget
# Emulator alone (libghostty-vt CSI/SGR parser)
Emulator_Write_Stream_8Color_120fps 15930 µs/frame 63 fps_ceiling
Emulator_Write_Stream_TrueColor_120fps 29241 µs/frame 34 fps_ceiling
The renderer alone hits 1700-3800 fps, leaving ample margin. The full pipeline caps at 34-63 fps. The whole gap is libghostty-vt's em.Write: its parser ships in Debug mode, which also bloats the static library file to 33 MiB (release builds are a fraction of that).
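The budget and ceiling columns follow directly from the per-frame time. Here is a minimal sketch of that arithmetic, assuming the budget is simply one second divided by the target fps. The function names are illustrative and are not taken from the benchmark suite.

```go
package main

// budgetPercent reports how much of the per-frame budget one frame consumes.
// Assumption: the budget is 1e6 µs divided by the target frame rate.
func budgetPercent(frameMicros float64, targetFPS int) float64 {
	budgetMicros := 1e6 / float64(targetFPS)
	return 100 * frameMicros / budgetMicros
}

// fpsCeiling is the highest frame rate sustainable at the measured frame cost.
func fpsCeiling(frameMicros float64) float64 {
	return 1e6 / frameMicros
}
```

For the 8-color pipeline row, `budgetPercent(15838, 120)` comes to about 190 and `fpsCeiling(15838)` to about 63, which agrees with the table.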
-
libghostty-vt was being built in Debug mode. [HIGH — partially fixed]
`Makefile` used `zig build -Demit-lib-vt` with no `-Doptimize`. Zig's `standardOptimizeOption` defaults to `.Debug`, so the shipped static lib was unoptimised. Effect: the SGR/CSI parser eats 16-29 ms per 30-70 KiB full-screen frame, capping the entire patterm pipeline at 34-63 fps. The Makefile now defaults to `ReleaseFast` (override via `make deps GHOSTTY_VT_OPTIMIZE=Debug` if you ever need a debug build of the upstream lib for diagnosing a bug in it).
- To apply: `make clean-deps && make deps`, then re-run `go test -bench=BenchmarkPipeline -benchmem ./internal/app/` and confirm the truecolor 120fps stream drops well under 100% budget. Update the numbers in this section after rebuilding.
- Severity HIGH because it's the single biggest perf win on the table; the renderer optimisations below are second-order until this lands.
-
viewport renderer allocates ~1 alloc per 4 input bytes on SGR/CSI-heavy chunks. [MEDIUM]
`internal/app/viewport_renderer.go`: the styled-lines and ratatui benchmarks show 4-17k allocs per chunk. The hot contributors are likely (a) `string(vr.buf)` / `string(params)` conversions in `emitCSI` for every escape sequence, (b) the `pending strings.Builder` resizing as fragments arrive, and (c) `vr.shifter.Shift(vr.buf)` returning a fresh slice per CSI.
- Fix direction: switch CSI param parsing to byte-slice comparison (no string conversion); reuse `vr.buf` and `vr.pending` backing arrays across `Render` calls by pre-growing in `newViewportRenderer`; have `cursorShifter.Shift` return into a caller-owned buffer instead of allocating. Profile-guided: run the styled-lines bench, point pprof at the allocs profile, fix the top three call sites.
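To illustrate the "no string conversion" direction, here is a minimal sketch of parsing CSI parameters straight from the byte slice into a caller-owned buffer. `appendCSIParams` and `maxParam` are hypothetical names, not the renderer's existing API, and colon-separated sub-parameters are deliberately rejected to keep the sketch short.

```go
package main

// maxParam clamps absurdly long digit runs so the accumulator cannot overflow.
const maxParam = 65535

// appendCSIParams parses semicolon-separated numeric parameters such as
// "1;38;5;208" and appends them to dst. It never converts to string, so it
// does not allocate as long as dst has enough capacity. Empty parameters
// count as 0. On any other byte it returns dst at its original length and
// ok=false.
func appendCSIParams(dst []int, params []byte) (out []int, ok bool) {
	start := len(dst)
	if len(params) == 0 {
		return dst, true
	}
	cur := 0
	for _, b := range params {
		switch {
		case b >= '0' && b <= '9':
			cur = cur*10 + int(b-'0')
			if cur > maxParam {
				cur = maxParam
			}
		case b == ';':
			dst = append(dst, cur)
			cur = 0
		default:
			return dst[:start], false
		}
	}
	return append(dst, cur), true
}
```

A renderer could keep one `[]int` scratch slice per instance and pass `scratch[:0]` on every escape sequence, so steady-state parsing reuses the same backing array.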
-
viewport renderer throughput (~90 MB/s styled) limits codex steady-state. [MEDIUM]
- The styled-lines and ratatui benchmarks come in at 89 MB/s and 40 MB/s respectively. A 100 KB/s codex burst is far under that limit, but a session-resume dump of a 5 MiB chat history takes roughly 60-130 ms of pure renderer time at those rates, enough to be user-visible at the start of a long resume.
- Fix direction: same as the alloc fix above; once the per-call allocation cost drops, the throughput ceiling rises with it. Worth re-running the benches after fixing the allocs and only investing further if the styled-lines bench is still under ~300 MB/s.
-
Session.Children() allocates a fresh slice on every call. [MEDIUM]
`internal/app/session.go:530-541` walks `s.order` under `s.mu` and builds a new `[]*Child` slice every time. Callers on hot paths: `drawSidebar` calls it twice per frame (`internal/app/sidebar.go:139` and `:171`); `drawTabBar` calls it once per frame (`internal/app/tabbar.go:37`); the classifier iterates it every 250 ms (`internal/app/classifier.go:38`); and palette/navigation hit it on every Ctrl-A/D/W/S keystroke.
- Fix direction: store the snapshot in an `atomic.Pointer[[]*Child]` on `Session`, refresh it under `s.mu` only when `Spawn` / `delete` mutates the map. Readers get an O(1) `Load()` with zero allocation, the same pattern already used for `listeners` (`session.go:118-123`).
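The snapshot idea can be sketched as follows. The `Session` and `Child` types here are simplified stand-ins with assumed fields, and `publishLocked` and `Remove` are hypothetical names; only the `atomic.Pointer[[]*Child]` shape comes from the note above.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// Child and Session are simplified stand-ins for the real types.
type Child struct{ ID string }

type Session struct {
	mu       sync.Mutex
	children map[string]*Child
	order    []string
	snapshot atomic.Pointer[[]*Child] // immutable once published
}

func NewSession() *Session {
	s := &Session{children: make(map[string]*Child)}
	s.publishLocked() // publish an empty snapshot so Load never returns nil
	return s
}

// publishLocked rebuilds the ordered snapshot. Callers must hold s.mu
// (or own s exclusively, as in NewSession).
func (s *Session) publishLocked() {
	snap := make([]*Child, 0, len(s.order))
	for _, id := range s.order {
		if c, ok := s.children[id]; ok {
			snap = append(snap, c)
		}
	}
	s.snapshot.Store(&snap)
}

func (s *Session) Spawn(id string) *Child {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.children[id]; !exists {
		s.order = append(s.order, id)
	}
	c := &Child{ID: id}
	s.children[id] = c
	s.publishLocked()
	return c
}

func (s *Session) Remove(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.children, id)
	for i, v := range s.order {
		if v == id {
			s.order = append(s.order[:i], s.order[i+1:]...)
			break
		}
	}
	s.publishLocked()
}

// Children returns the shared snapshot without locking or allocating.
// Callers must treat the returned slice as read-only.
func (s *Session) Children() []*Child {
	return *s.snapshot.Load()
}
```

The trade-off is that every mutation pays for one slice rebuild, which should be acceptable if spawns and exits are rare compared with per-frame reads. A snapshot a reader already holds stays valid after later mutations because each publish creates a new slice.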
-
wait_for_pattern re-scans the entire stream/grid every iteration. [MEDIUM]
`internal/app/host.go:476-493` (the `check` closure). On `scope = "scrollback"` it calls `c.StreamRead(0)` followed by `stripANSIBytes(nil, b)` over the entire ring on every wake, a full O(ring size) walk per chunk arrival. On `grid` it goes through `PlainText` (one CGO call) plus a regex match against the full grid string. For an agent waiting on a marker in a chatty pane, every PTY chunk fires `check()`.
- Fix direction: for `scrollback`, track the offset of the last check and run the regex only over the new tail, reusing a per-call scratch buffer for ANSI stripping. For `grid`, dedupe on `ScreenVersion()`: skip when the version hasn't changed.
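A rough sketch of the tail-only scan for the scrollback case, under simplifying assumptions: the stream is modelled as an append-only byte slice (a real ring would need absolute offsets), ANSI stripping is left out, and `tailScanner` with its `overlap` field is a hypothetical helper, not existing code.

```go
package main

import "regexp"

// tailScanner remembers how much of an append-only stream it has already
// searched, so each wake only scans newly arrived bytes plus a small overlap.
type tailScanner struct {
	re      *regexp.Regexp
	scanned int // bytes of the stream already searched
	overlap int // bytes re-scanned so a match straddling two chunks is not lost
}

func newTailScanner(pattern string, overlap int) (*tailScanner, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	return &tailScanner{re: re, overlap: overlap}, nil
}

// Check reports whether the pattern matches in the unscanned tail of stream.
func (t *tailScanner) Check(stream []byte) bool {
	start := t.scanned - t.overlap
	if start < 0 || start > len(stream) {
		start = 0 // first call, or the stream shrank: rescan everything
	}
	found := t.re.Match(stream[start:])
	t.scanned = len(stream)
	return found
}
```

The overlap has to be at least as long as the longest expected match minus one byte, and patterns anchored with `^` or spanning many lines would see different context than a full rescan, so this only suits marker-style patterns.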
-
search_output compiles regex + strips ANSI on every call. [MEDIUM]
`internal/app/host.go:428` compiles a fresh `regexp.Regexp` per invocation; `:434` strips ANSI over the entire ring buffer when `kind="rendered"`. Agents that poll `search_output` with the same pattern (the typical "watch for marker" loop) repay both costs on every call.
- Fix direction: small LRU of compiled regexes keyed by pattern string (cap maybe 32) on `toolHost`. Cache the stripped-ANSI buffer keyed by `c.ScreenVersion()` so consecutive searches over an unchanged ring reuse the strip.
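One possible shape for the regex LRU, built on the standard library's `container/list`. `regexCache` and its methods are hypothetical names. The stripped-buffer cache is not shown.

```go
package main

import (
	"container/list"
	"regexp"
	"sync"
)

// regexCache is a small LRU of compiled patterns keyed by pattern string.
type regexCache struct {
	mu       sync.Mutex
	capacity int
	order    *list.List // front = most recently used
	byPat    map[string]*list.Element
}

type cacheEntry struct {
	pattern string
	re      *regexp.Regexp
}

func newRegexCache(capacity int) *regexCache {
	return &regexCache{
		capacity: capacity,
		order:    list.New(),
		byPat:    make(map[string]*list.Element),
	}
}

// Get returns the cached regex for pattern, compiling and caching it on a
// miss and evicting the least recently used entry when over capacity.
func (c *regexCache) Get(pattern string) (*regexp.Regexp, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.byPat[pattern]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*cacheEntry).re, nil
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err // invalid patterns are not cached
	}
	c.byPat[pattern] = c.order.PushFront(&cacheEntry{pattern: pattern, re: re})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.byPat, oldest.Value.(*cacheEntry).pattern)
	}
	return re, nil
}
```

A compiled `*regexp.Regexp` is safe for concurrent use, so handing the same instance to several callers is fine. The mutex only protects the cache bookkeeping.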
-
GetProcessOutput grid mode acquires the emulator twice. [MEDIUM]
`internal/app/host.go:375-391` does `em := c.Emulator()` for ActiveScreen / Cursor / Size, then at line 387 re-fetches `em := c.Emulator()` for PlainText. Each `Emulator()` call goes through `ptyMu` and inspects the live PTY pointer. Under a chatty agent polling `get_process_output` every 100 ms this is a redundant lock and pointer chase per call.
- Fix direction: hold the emulator reference from the first lookup; reuse it for PlainText. The check `if em == nil` still runs cleanly because the variable is captured.
-
FindChildByIdentity is O(N) under the session lock. [LOW]
`internal/app/session.go:553-565` scans the children map looking for a matching `Identity` token on every new mcp-stdio connection. Not a steady-state hot path (it only fires once per child spawn), but with many short-lived sub-agents it adds up and contends with everyone else taking `s.mu`.
- Fix direction: maintain an `identityIndex map[string]string` (identity → child id) updated alongside spawn / exit, giving the lookup an O(1) read.
-
Per-promoter regex matches in the idle classifier. [LOW]
`internal/app/idle.go:175-182` (`matchAny`) walks each compiled pattern and runs the DFA over the same 4 KiB tail. A preset with five permission patterns + five error patterns is ten DFA invocations per child per 250 ms tick.
- Fix direction: at preset load time, compile each `_patterns` list into a single alternation regex (`(?:p1)|(?:p2)|…`). The classifier then makes one Match call per category per tick.
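A minimal sketch of the load-time combination step. `combinePatterns` is a hypothetical helper. Each pattern is validated on its own first, because a fragment such as `a)|(b` is invalid alone yet would compile once wrapped in a group.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// combinePatterns compiles a list of patterns into one alternation of
// non-capturing groups: (?:p1)|(?:p2)|...
func combinePatterns(patterns []string) (*regexp.Regexp, error) {
	if len(patterns) == 0 {
		// An empty alternation would match everything, so refuse it.
		return nil, fmt.Errorf("no patterns to combine")
	}
	parts := make([]string, len(patterns))
	for i, p := range patterns {
		if _, err := regexp.Compile(p); err != nil {
			return nil, fmt.Errorf("pattern %d (%q): %w", i, p, err)
		}
		parts[i] = "(?:" + p + ")"
	}
	return regexp.Compile(strings.Join(parts, "|"))
}
```

Inline flags such as `(?i)` at the start of a pattern stay scoped to that pattern's group in Go's regexp syntax, so one case-insensitive entry does not change how its neighbours match.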
-
Port-detection dedup is O(N²) over c.ports. [LOW]
`internal/app/child.go:461-467`: for each fresh URL match the code linearly scans the existing port list. The list rarely grows past a handful, but a dev server that lists "all open ports" in one log line interacts badly: M new matches × N existing entries.
- Fix direction: keep a `seenPorts map[int]struct{}` next to `c.ports`, rebuilt on prune (none today). O(1) per match.
-
Port-sighting string allocations happen before the dedup check. [LOW]
`internal/app/child.go:455-456` allocates `urlForm` and `portStr` before line 461's `seen` walk. Both strings are wasted when the port is already in `c.ports`. The whole loop body also runs inside `c.portsMu`, blocking the `Ports()` reader path.
- Fix direction: bind the port int first (cheap parse from `m[1]`), do the seen check, only then allocate the URL string for the surviving sighting.
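The two port items above combine naturally: parse the port, consult a set, and only build strings for new sightings. This is a sketch under assumptions. `portTracker`, `portSighting`, and the `http://localhost:` URL form are all hypothetical, and the parse happens before the lock is taken to keep the critical section short.

```go
package main

import (
	"strconv"
	"sync"
)

// portSighting and portTracker are hypothetical stand-ins for c.ports.
type portSighting struct {
	Port int
	URL  string
}

type portTracker struct {
	mu    sync.Mutex
	ports []portSighting
	seen  map[int]struct{} // O(1) dedup next to the ordered list
}

// Observe records a port captured as a digit string (for example the m[1]
// submatch). It reports whether the port was new.
func (t *portTracker) Observe(portDigits string) bool {
	// Parse outside the lock; reject anything that is not a valid TCP port.
	port, err := strconv.Atoi(portDigits)
	if err != nil || port < 1 || port > 65535 {
		return false
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, dup := t.seen[port]; dup {
		return false // already known: no string allocation at all
	}
	if t.seen == nil {
		t.seen = make(map[int]struct{})
	}
	t.seen[port] = struct{}{}
	// Only a surviving sighting pays for the URL string.
	url := "http://localhost:" + strconv.Itoa(port)
	t.ports = append(t.ports, portSighting{Port: port, URL: url})
	return true
}
```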
-
classifier `time.Now()` syscall per child per tick. [LOW]
`internal/app/classifier.go:54` (and the `IdleMS` / `TitleIdleMS` helpers it transitively calls in `internal/app/child.go:343-374`) each call `time.Now()`. Reading time on Linux is fast (vDSO) but with N children × 4 `time.Now()` per tick × 4 ticks/sec it's wasted work that can be batched.
- Fix direction: capture `now := time.Now().UnixNano()` once at the top of `classifyAll` and thread it into `classifyOne` and the helpers as a parameter.
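Threading the timestamp through might look like the sketch below. The `child` struct, its fields, and the idle rule inside `classifyOne` are invented for illustration. Only the "read the clock once in `classifyAll`" shape reflects the note above.

```go
package main

import "time"

// child is a hypothetical stand-in holding last-activity timestamps in
// Unix nanoseconds.
type child struct {
	lastOutputNano int64
	lastTitleNano  int64
}

// The helpers take the tick's timestamp instead of calling time.Now().
func (c *child) idleMS(nowNano int64) int64 {
	return (nowNano - c.lastOutputNano) / int64(time.Millisecond)
}

func (c *child) titleIdleMS(nowNano int64) int64 {
	return (nowNano - c.lastTitleNano) / int64(time.Millisecond)
}

// classifyOne uses an invented rule: idle when both output and title have
// been quiet for at least thresholdMS.
func classifyOne(c *child, nowNano, thresholdMS int64) bool {
	return c.idleMS(nowNano) >= thresholdMS && c.titleIdleMS(nowNano) >= thresholdMS
}

// classifyAll reads the clock exactly once per tick.
func classifyAll(children []*child, thresholdMS int64) []bool {
	return classifyAllAt(children, time.Now().UnixNano(), thresholdMS)
}

// classifyAllAt is the clock-free core, which also makes the tick testable.
func classifyAllAt(children []*child, nowNano, thresholdMS int64) []bool {
	idle := make([]bool, len(children))
	for i, c := range children {
		idle[i] = classifyOne(c, nowNano, thresholdMS)
	}
	return idle
}
```

A side benefit of this shape is that every child in one tick is judged against the same instant, so results cannot drift between the first and last child.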
-
wait_for_pattern subscribes a listener for every call. [LOW]
`internal/app/host.go:472-474`: each invocation calls `Session.Subscribe(wake)` which clones the listener slice and swaps the atomic pointer; the `defer Unsubscribe` does the same on exit. Two allocations per `wait_for_pattern`. The agent pattern of looping on `wait_for_pattern` after every tool call pays this churn on the steady-state path.
- Fix direction: a per-child `chunkBroadcaster` registered once at child spawn that hands out lightweight subscriber tokens, rather than going through the full session listener machinery.
On Hold
- There's a stray Unicode glyph (`<?>`) being displayed in opencode [ON HOLD]
  - Investigated 2026-05-14: patterm passes ghostty grapheme codepoints through unchanged (`vt/ghostty.go:452-462`), so the `<?>` glyph is most likely the host terminal's font fallback for opencode's Nerd Font private-use codepoints, not a patterm substitution. Need a concrete reproduction (which codepoint, which host terminal/font) before changing rendering.
- After codex rips for like 15 minutes, the terminal becomes quite slow. [ON HOLD / VERIFYING]
  - 2026-05-14: Perf plan P1-P11 landed (see CHANGELOG). Needs a real long-running codex session to confirm whether the steady-state slowdown is gone or some hotspot remains. Capture a pprof if it still feels slow after ≥15 minutes; the structural drivers the audit named are all addressed, so a remaining symptom is a new one and probably wants fresh profiling.