Added .mise.toml pinning zig = "0.15.2" (the minimum the vendored
Ghostty commit requires) and taught the Makefile to resolve zig
through mise when available, falling back to PATH. Contributors run
`mise install` once and `make deps` just works.
Re-ran the pipeline benchmarks after rebuilding libghostty-vt with
ReleaseFast (same hardware, AMD Ryzen 7 7800X3D):
                               Debug    ReleaseFast   speedup
  Pipeline 8-colour @120fps    63 fps   2030 fps      32x
  Pipeline truecolor @120fps   34 fps    931 fps      27x
  Emulator-only truecolor      34 fps   2051 fps      60x
That leaves about 7x headroom over 120 fps on the heaviest workload
(truecolor full-screen redraws) and about 16x on 8-colour. Static
library size 33 MiB -> 13 MiB.
TODO.md baseline numbers updated to reflect post-fix throughput;
the "Debug-mode lib" finding is folded into the result it produced
rather than left as an open item.
Perf Audit (auto-generated 2026-05-15)
Findings from a codebase sweep — not user-reported, needs review before action. Each item names the anchor and a sketched fix.
Baseline benchmark numbers (go test -bench=. ./internal/app/, AMD
Ryzen 7 7800X3D, libghostty-vt ReleaseFast after the Makefile
fix landed):
# Renderer alone
ViewportRenderer_PlainASCII           229 MB/s   1.3 KB/op       6 allocs/op
ViewportRenderer_StyledLines           89 MB/s    91 KB/op    4325 allocs/op
ViewportRenderer_RatatuiBurst          40 MB/s   365 KB/op   17306 allocs/op
RendererThroughput_ReuseInstance       90 MB/s   316 KB/op   17380 allocs/op
ContainsOSC_NoOSC                    3050 MB/s     0 B/op        0 allocs/op
# ASCII-video stream (renderer only, 3 sec at the target fps)
ASCIIVideo_Stream_8Color_120fps         260 µs/frame   3845 fps_ceiling    3.1% budget
ASCIIVideo_Stream_TrueColor_120fps      576 µs/frame   1735 fps_ceiling    6.9% budget
# Full pipeline (em.Write + renderer + io.Discard write)
Pipeline_ASCIIVideo_8Color_120fps       493 µs/frame   2030 fps_ceiling    5.9% budget
Pipeline_ASCIIVideo_TrueColor_120fps   1075 µs/frame    931 fps_ceiling   12.9% budget
# Emulator alone (libghostty-vt CSI/SGR parser)
Emulator_Write_Stream_8Color_120fps     257 µs/frame   3890 fps_ceiling
Emulator_Write_Stream_TrueColor_120fps  488 µs/frame   2051 fps_ceiling
Result of the fix below: 27-32× pipeline speedup, 60× emulator speedup. The pipeline reaches 930-2030 fps end-to-end: about 7× headroom over the 120 fps target on the heaviest workload (truecolor full-screen redraws) and about 16× on 8-colour.
- viewport renderer allocates ~1 alloc per 4 input bytes on SGR/CSI-heavy chunks. [MEDIUM]
  internal/app/viewport_renderer.go: the styled-lines and ratatui benchmarks show 4-17k allocs per chunk. The hot contributors are likely (a) string(vr.buf) / string(params) conversions in emitCSI for every escape sequence, (b) the pending strings.Builder resizing as fragments arrive, and (c) vr.shifter.Shift(vr.buf) returning a fresh slice per CSI.
  - Fix direction: switch CSI param parsing to byte-slice comparison (no string conversion); reuse the vr.buf and vr.pending backing arrays across Render calls by pre-growing in newViewportRenderer; have cursorShifter.Shift return into a caller-owned buffer instead of allocating. Profile-guided: run the styled-lines bench, point pprof at the allocs profile, fix the top three call sites.
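
A minimal sketch of the byte-slice parsing idea, assuming CSI parameters arrive as a raw `[]byte` such as `38;2;255;0;0`. `parseCSIParams` and its signature are hypothetical and are not taken from viewport_renderer.go; the point is only that digits can be decoded in place into a caller-owned slice.

```go
package main

import "fmt"

// parseCSIParams is a hypothetical helper, not the renderer's real code.
// It decodes semicolon-separated decimal parameters straight from the byte
// slice and appends them to dst, so a caller that passes a reused scratch
// slice pays no allocation once the slice has grown to its working size.
func parseCSIParams(dst []int, params []byte) []int {
	dst = dst[:0]
	if len(params) == 0 {
		return dst
	}
	n := 0
	for _, b := range params {
		switch {
		case b >= '0' && b <= '9':
			n = n*10 + int(b-'0')
		case b == ';':
			dst = append(dst, n) // an empty field decodes as 0
			n = 0
		}
		// Any other byte is ignored in this sketch.
	}
	return append(dst, n)
}

func main() {
	scratch := make([]int, 0, 16) // allocated once, reused for every sequence
	scratch = parseCSIParams(scratch, []byte("38;2;255;0;0"))
	fmt.Println(scratch) // [38 2 255 0 0]
	scratch = parseCSIParams(scratch, []byte("1;"))
	fmt.Println(scratch) // [1 0]
}
```

Whether this removes the dominant allocations is something only the allocs profile can confirm.
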
- viewport renderer throughput (~90 MB/s styled) limits codex steady-state. [MEDIUM]
  - The styled-lines and ratatui benchmarks come in at 89 MB/s and 40 MB/s respectively. A 100 KB/s codex burst is far under that limit, but a session-resume dump of a 5 MiB chat history takes 50-130 ms of pure renderer time at those rates, enough to be user-visible at the start of a long resume.
  - Fix direction: same as the alloc fix above; once the per-call allocation cost drops, the throughput ceiling rises with it. Worth re-running the benches after fixing the allocs and only investing further if the styled-lines bench is still under ~300 MB/s.
- Session.Children() allocates a fresh slice on every call. [MEDIUM]
  internal/app/session.go:530-541 walks s.order under s.mu and builds a new []*Child slice every time. Callers on hot paths: drawSidebar calls it twice per frame (internal/app/sidebar.go:139 and :171); drawTabBar calls it once per frame (internal/app/tabbar.go:37); the classifier iterates it every 250 ms (internal/app/classifier.go:38); and palette/navigation hit it on every Ctrl-A/D/W/S keystroke.
  - Fix direction: store the snapshot in an atomic.Pointer[[]*Child] on Session, refresh it under s.mu only when Spawn/delete mutates the map. Readers get O(1) Load() with zero allocation, the same pattern already used for listeners (session.go:118-123).
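
The snapshot pattern could look roughly like the sketch below. The `Session` and `Child` types here are stripped-down stand-ins, and `refreshLocked`, `Spawn` and `Delete` are illustrative names rather than the real methods in session.go.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Child and Session are simplified stand-ins for the real types.
type Child struct{ ID string }

type Session struct {
	mu       sync.Mutex
	children map[string]*Child
	order    []string
	snapshot atomic.Pointer[[]*Child] // read-only view, replaced on mutation
}

func NewSession() *Session {
	s := &Session{children: make(map[string]*Child)}
	s.refreshLocked() // no other goroutine can see s yet
	return s
}

// refreshLocked rebuilds the snapshot. Callers must hold s.mu.
func (s *Session) refreshLocked() {
	snap := make([]*Child, 0, len(s.order))
	for _, id := range s.order {
		snap = append(snap, s.children[id])
	}
	s.snapshot.Store(&snap)
}

func (s *Session) Spawn(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.children[id]; exists {
		return
	}
	s.children[id] = &Child{ID: id}
	s.order = append(s.order, id)
	s.refreshLocked()
}

func (s *Session) Delete(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.children[id]; !exists {
		return
	}
	delete(s.children, id)
	for i, v := range s.order {
		if v == id {
			s.order = append(s.order[:i], s.order[i+1:]...)
			break
		}
	}
	s.refreshLocked()
}

// Children returns the current snapshot without locking or allocating.
// Callers must treat the returned slice as read-only.
func (s *Session) Children() []*Child {
	return *s.snapshot.Load()
}

func main() {
	s := NewSession()
	s.Spawn("a")
	s.Spawn("b")
	fmt.Println(len(s.Children())) // 2
	s.Delete("a")
	fmt.Println(len(s.Children())) // 1
}
```

The trade-off is that every mutation now pays for a full rebuild, which should be acceptable as long as spawns and deletes stay far rarer than per-frame reads.
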
- wait_for_pattern re-scans the entire stream/grid every iteration. [MEDIUM]
  internal/app/host.go:476-493 (the check closure). On scope = "scrollback" it calls c.StreamRead(0) followed by stripANSIBytes(nil, b) over the entire ring on every wake, a full O(ring size) walk per chunk arrival. On grid it goes through PlainText (one CGO call) plus a regex match against the full grid string. For an agent waiting on a marker in a chatty pane, every PTY chunk fires check().
  - Fix direction: for scrollback, track the offset of the last check and run the regex only over the new tail, reusing a per-call scratch buffer for ANSI stripping. For grid, dedupe on ScreenVersion(): skip when the version hasn't changed.
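
A sketch of the offset-tracking half of that fix, under two assumptions that may not hold for the real ring: the stream is treated as append-only, and ANSI stripping is left out. `tailMatcher` is a hypothetical type. It re-scans a small overlap window before the last offset so a marker split across two chunks is still found.

```go
package main

import (
	"fmt"
	"regexp"
)

// tailMatcher is a hypothetical helper that remembers how much of the
// stream it has already scanned and only runs the regex over the new tail.
type tailMatcher struct {
	re      *regexp.Regexp
	scanned int // bytes of the stream covered by earlier checks
	overlap int // bytes before scanned that are re-scanned each time
}

func newTailMatcher(pattern string, overlap int) (*tailMatcher, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	return &tailMatcher{re: re, overlap: overlap}, nil
}

// check reports whether the pattern matches in the unscanned tail
// (plus the overlap window) and advances the offset.
func (m *tailMatcher) check(stream []byte) bool {
	if len(stream) < m.scanned {
		m.scanned = 0 // stream shrank, so start over
	}
	start := m.scanned - m.overlap
	if start < 0 {
		start = 0
	}
	m.scanned = len(stream)
	return m.re.Match(stream[start:])
}

func main() {
	m, err := newTailMatcher(`DONE`, 8)
	if err != nil {
		panic(err)
	}
	stream := []byte("0123456789abcdefghij")
	fmt.Println(m.check(stream)) // false
	stream = append(stream, "DO"...)
	fmt.Println(m.check(stream)) // false
	stream = append(stream, "NE"...)
	fmt.Println(m.check(stream)) // true: the marker straddled two chunks
}
```

The overlap has to be at least as long as the longest match the pattern can produce, which is not bounded for arbitrary regexes; a real implementation would need a policy for that, and for offsets shifting when the ring evicts old data.
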
- search_output compiles regex + strips ANSI on every call. [MEDIUM]
  internal/app/host.go:428 compiles a fresh regexp.Regexp per invocation; :434 strips ANSI over the entire ring buffer when kind="rendered". Agents that poll search_output with the same pattern (the typical "watch for marker" loop) repay both costs on every call.
  - Fix direction: small LRU of compiled regexes keyed by pattern string (cap maybe 32) on toolHost. Cache the stripped-ANSI buffer keyed by c.ScreenVersion() so consecutive searches over an unchanged ring reuse the strip.
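
One way the compiled-regex LRU might be shaped, using only the standard library. `regexCache` and its methods are illustrative names, not existing code on toolHost, and the stripped-ANSI cache is not covered here.

```go
package main

import (
	"container/list"
	"fmt"
	"regexp"
	"sync"
)

type cacheEntry struct {
	pattern string
	re      *regexp.Regexp
}

// regexCache is a hypothetical bounded LRU of compiled patterns.
type regexCache struct {
	mu    sync.Mutex
	limit int
	order *list.List               // front = most recently used
	items map[string]*list.Element // pattern -> element holding *cacheEntry
}

func newRegexCache(limit int) *regexCache {
	return &regexCache{
		limit: limit,
		order: list.New(),
		items: make(map[string]*list.Element),
	}
}

// Get returns the cached regex for pattern, compiling and inserting it on
// a miss and evicting the least recently used entry when over the limit.
func (c *regexCache) Get(pattern string) (*regexp.Regexp, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[pattern]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*cacheEntry).re, nil
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err // failed compiles are not cached
	}
	c.items[pattern] = c.order.PushFront(&cacheEntry{pattern: pattern, re: re})
	if c.order.Len() > c.limit {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*cacheEntry).pattern)
	}
	return re, nil
}

func main() {
	cache := newRegexCache(32)
	first, _ := cache.Get(`ready in \d+ms`)
	second, _ := cache.Get(`ready in \d+ms`)
	fmt.Println(first == second) // true: the second call is a cache hit
}
```

A compiled `*regexp.Regexp` is safe for concurrent use, so handing the same pointer to several callers is fine.
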
- GetProcessOutput grid mode acquires the emulator twice. [MEDIUM]
  internal/app/host.go:375-391 does em := c.Emulator() for ActiveScreen / Cursor / Size, then at line 387 re-fetches em := c.Emulator() for PlainText. Each Emulator() call goes through ptyMu and inspects the live PTY pointer. Under a chatty agent polling get_process_output every 100 ms this is a redundant lock and pointer chase per call.
  - Fix direction: hold the emulator reference from the first lookup; reuse it for PlainText. The check if em == nil still runs cleanly because the variable is captured.
- FindChildByIdentity is O(N) under the session lock. [LOW]
  internal/app/session.go:553-565 scans the children map looking for a matching Identity token on every new mcp-stdio connection. Not a steady-state hot path (it only fires once per child spawn), but with many short-lived sub-agents it adds up and contends with everyone else taking s.mu.
  - Fix direction: maintain an identityIndex map[string]string (identity → child id) updated alongside spawn / exit, give the lookup an O(1) read.
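
A small sketch of the index, assuming each child carries a single identity token. The `registry` type and its method names are made up for illustration; in the real code the index would live on Session next to the children map.

```go
package main

import (
	"fmt"
	"sync"
)

type Child struct {
	ID       string
	Identity string
}

// registry is a hypothetical stand-in for the relevant part of Session.
type registry struct {
	mu            sync.RWMutex
	children      map[string]*Child // child id -> child
	identityIndex map[string]string // identity token -> child id
}

func newRegistry() *registry {
	return &registry{
		children:      make(map[string]*Child),
		identityIndex: make(map[string]string),
	}
}

// spawn registers the child and its identity under one lock hold so the
// two maps never disagree.
func (r *registry) spawn(id, identity string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.children[id] = &Child{ID: id, Identity: identity}
	r.identityIndex[identity] = id
}

// exit removes the child and its index entry together.
func (r *registry) exit(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if c, ok := r.children[id]; ok {
		delete(r.identityIndex, c.Identity)
		delete(r.children, id)
	}
}

// findByIdentity is two map reads instead of a scan over all children.
func (r *registry) findByIdentity(identity string) (*Child, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	id, ok := r.identityIndex[identity]
	if !ok {
		return nil, false
	}
	c, ok := r.children[id]
	return c, ok
}

func main() {
	r := newRegistry()
	r.spawn("child-1", "token-abc")
	c, ok := r.findByIdentity("token-abc")
	fmt.Println(ok, c.ID) // true child-1
}
```
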
- Per-promoter regex matches in the idle classifier. [LOW]
  internal/app/idle.go:175-182 (matchAny) walks each compiled pattern and runs the DFA over the same 4 KiB tail. A preset with five permission patterns + five error patterns is ten DFA invocations per child per 250 ms tick.
  - Fix direction: at preset load time, compile each _patterns list into a single alternation regex ((?:p1)|(?:p2)|…). The classifier then makes one Match call per category per tick.
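
The alternation could be built along these lines at preset load time. `combinePatterns` is a hypothetical helper and the sample patterns are invented. Each pattern is compiled on its own first, because a fragment such as `a)|(b` is invalid alone yet would become valid once wrapped and joined.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// combinePatterns is a hypothetical helper that merges a pattern list into
// one alternation of the form (?:p1)|(?:p2)|... so a single Match call
// covers the whole category.
func combinePatterns(patterns []string) (*regexp.Regexp, error) {
	if len(patterns) == 0 {
		return nil, fmt.Errorf("no patterns to combine")
	}
	parts := make([]string, len(patterns))
	for i, p := range patterns {
		// Validate each pattern in isolation so wrapping cannot mask an
		// unbalanced group in the input.
		if _, err := regexp.Compile(p); err != nil {
			return nil, fmt.Errorf("pattern %d (%q): %w", i, p, err)
		}
		// The non-capturing group also scopes inline flags like (?i)
		// to the pattern that declared them.
		parts[i] = "(?:" + p + ")"
	}
	return regexp.Compile(strings.Join(parts, "|"))
}

func main() {
	re, err := combinePatterns([]string{`permission denied`, `(?i)allow\?`, `y/n`})
	if err != nil {
		panic(err)
	}
	fmt.Println(re.MatchString("Allow? "))           // true
	fmt.Println(re.MatchString("PERMISSION DENIED")) // false
}
```

The combined regex no longer says which pattern matched, so this only fits call sites that need a yes/no answer per category.
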
- Port-detection dedup is O(N²) over c.ports. [LOW]
  internal/app/child.go:461-467: for each fresh URL match the code linearly scans the existing port list. The list rarely grows past a handful, but a dev server that lists "all open ports" in one log line interacts badly: M new matches × N existing entries.
  - Fix direction: keep a seenPorts map[int]struct{} next to c.ports, rebuilt on prune (none today). O(1) per match.
- Port-sighting string allocations happen before the dedup check. [LOW]
  internal/app/child.go:455-456 allocates urlForm and portStr before line 461's seen walk. Both strings are wasted when the port is already in c.ports. Inside c.portsMu for the whole loop body too, blocking the Ports() reader path.
  - Fix direction: bind the port int first (cheap parse from m[1]), do the seen check, only then allocate the URL string for the surviving sighting.
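
The two port fixes compose naturally, as in this sketch: parse the port number from the raw digits, consult a set, and build the URL string only for a new port. `portTracker`, `portSighting` and the `http://localhost:` URL shape are assumptions made for the example, not the actual fields in child.go.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

type portSighting struct {
	Port int
	URL  string
}

// portTracker is a hypothetical stand-in for the port state on a child.
type portTracker struct {
	mu    sync.Mutex
	ports []portSighting
	seen  map[int]struct{} // mirrors ports for O(1) dedup
}

func newPortTracker() *portTracker {
	return &portTracker{seen: make(map[int]struct{})}
}

// parsePort decodes a decimal port from the matched digits without
// converting them to a string first.
func parsePort(digits []byte) (int, bool) {
	if len(digits) == 0 || len(digits) > 5 {
		return 0, false
	}
	n := 0
	for _, b := range digits {
		if b < '0' || b > '9' {
			return 0, false
		}
		n = n*10 + int(b-'0')
	}
	return n, n >= 1 && n <= 65535
}

// observe records a port sighting and reports whether it was new.
// The URL string is only allocated after the dedup check passes.
func (t *portTracker) observe(digits []byte) bool {
	port, ok := parsePort(digits)
	if !ok {
		return false
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, dup := t.seen[port]; dup {
		return false
	}
	t.seen[port] = struct{}{}
	url := "http://localhost:" + strconv.Itoa(port) // assumed URL shape
	t.ports = append(t.ports, portSighting{Port: port, URL: url})
	return true
}

func main() {
	pt := newPortTracker()
	fmt.Println(pt.observe([]byte("3000"))) // true
	fmt.Println(pt.observe([]byte("3000"))) // false: duplicate, no allocation
}
```

If pruning is ever added, the set has to be rebuilt or updated in the same critical section as the slice.
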
- classifier time.Now() syscall per child per tick. [LOW]
  internal/app/classifier.go:54 (and the IdleMS/TitleIdleMS helpers it transitively calls in internal/app/child.go:343-374) each call time.Now(). Reading time on Linux is fast (vDSO) but with N children × 4 time.Now() per tick × 4 ticks/sec it's wasted work that can be batched.
  - Fix direction: capture now := time.Now().UnixNano() once at the top of classifyAll and thread it into classifyOne and the helpers as a parameter.
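
Threading a single timestamp through might look like the sketch below. The `child` fields, the 2000 ms threshold and the `classifyAllAt` split are assumptions for illustration; only the idea of reading the clock once per tick comes from the finding.

```go
package main

import (
	"fmt"
	"time"
)

// child is a simplified stand-in holding last-activity timestamps in
// Unix nanoseconds.
type child struct {
	lastOutputNS int64
	lastTitleNS  int64
}

// The helpers take the tick's timestamp instead of calling time.Now().
func (c *child) idleMS(nowNS int64) int64 {
	return (nowNS - c.lastOutputNS) / int64(time.Millisecond)
}

func (c *child) titleIdleMS(nowNS int64) int64 {
	return (nowNS - c.lastTitleNS) / int64(time.Millisecond)
}

// classifyOne reports whether the child looks idle at nowNS.
func classifyOne(c *child, nowNS int64) bool {
	const idleAfterMS = 2000 // assumed threshold for this sketch
	return c.idleMS(nowNS) >= idleAfterMS && c.titleIdleMS(nowNS) >= idleAfterMS
}

// classifyAllAt classifies every child against one shared timestamp.
func classifyAllAt(children []*child, nowNS int64) []bool {
	out := make([]bool, len(children))
	for i, c := range children {
		out[i] = classifyOne(c, nowNS)
	}
	return out
}

// classifyAll reads the clock exactly once per tick.
func classifyAll(children []*child) []bool {
	return classifyAllAt(children, time.Now().UnixNano())
}

func main() {
	now := time.Now().UnixNano()
	children := []*child{
		{lastOutputNS: now - int64(5*time.Second), lastTitleNS: now - int64(5*time.Second)},
		{lastOutputNS: now, lastTitleNS: now},
	}
	fmt.Println(classifyAllAt(children, now)) // [true false]
}
```

Passing the timestamp as a parameter also makes the classifier testable with a fixed clock.
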
- wait_for_pattern subscribes a listener for every call. [LOW]
  internal/app/host.go:472-474: each invocation calls Session.Subscribe(wake) which clones the listener slice and swaps the atomic pointer; the defer Unsubscribe does the same on exit. Two allocations per wait_for_pattern. The agent pattern of looping on wait_for_pattern after every tool call pays this churn on the steady-state path.
  - Fix direction: a per-child chunkBroadcaster registered once at child spawn that hands out lightweight subscriber tokens, rather than going through the full session listener machinery.
On Hold
- An unexpected Unicode glyph is displayed in opencode [ON HOLD]
  - Investigated 2026-05-14: patterm passes ghostty grapheme codepoints through unchanged (vt/ghostty.go:452-462), so the <?> glyph is most likely the host terminal's font fallback for opencode's Nerd Font private-use codepoints, not a patterm substitution. Need a concrete reproduction (which codepoint, which host terminal/font) before changing rendering.
- After codex rips for like 15 minutes, the terminal becomes quite slow. [ON HOLD / VERIFYING]
  - 2026-05-14: Perf plan P1-P11 landed (see CHANGELOG). Needs a real long-running codex session to confirm whether the steady-state slowdown is gone or some hotspot remains. Capture a pprof if it still feels slow after ≥15 minutes: the structural drivers the audit named are all addressed, so a remaining symptom is a new one and probably wants fresh profiling.