Harry Bayliss bda799a3c6 mise-pin zig 0.15.2; rebuild libghostty-vt ReleaseFast — 27-32x pipeline speedup

Added .mise.toml pinning zig = "0.15.2" (the minimum the vendored
Ghostty commit requires) and taught the Makefile to resolve zig
through mise when available, falling back to PATH. Contributors run
`mise install` once and `make deps` just works.

Re-ran the pipeline benchmarks after rebuilding libghostty-vt with
ReleaseFast (same hardware, AMD Ryzen 7 7800X3D):

                                Debug         ReleaseFast    speedup
  Pipeline 8-colour @120fps     63 fps         2030 fps       32x
  Pipeline truecolor @120fps    34 fps          931 fps       27x
  Emulator-only truecolor       34 fps         2051 fps       60x

Headroom over 120 fps ranges from about 7x on the heaviest workload
(truecolor full-screen redraws) to about 16x on 8-colour. Static
library size 33 MiB -> 13 MiB.

TODO.md baseline numbers updated to reflect post-fix throughput;
the "Debug-mode lib" finding is folded into the result it produced
rather than left as an open item.

2026-05-15 13:54:48 +01:00


Perf Audit (auto-generated 2026-05-15)

Findings from a codebase sweep — not user-reported, needs review before action. Each item names the anchor and a sketched fix.

Baseline benchmark numbers (go test -bench=. ./internal/app/, AMD Ryzen 7 7800X3D, libghostty-vt ReleaseFast after the Makefile fix landed):

# Renderer alone
ViewportRenderer_PlainASCII       229 MB/s     1.3 KB/op    6 allocs/op
ViewportRenderer_StyledLines       89 MB/s    91   KB/op  4325 allocs/op
ViewportRenderer_RatatuiBurst      40 MB/s   365   KB/op 17306 allocs/op
RendererThroughput_ReuseInstance   90 MB/s   316   KB/op 17380 allocs/op
ContainsOSC_NoOSC                3050 MB/s     0   B/op     0 allocs/op

# ASCII-video stream (renderer only — 3 sec at the target fps)
ASCIIVideo_Stream_8Color_120fps     260 µs/frame  3845 fps_ceiling   3.1% budget
ASCIIVideo_Stream_TrueColor_120fps  576 µs/frame  1735 fps_ceiling   6.9% budget

# Full pipeline (em.Write + renderer + io.Discard write)
Pipeline_ASCIIVideo_8Color_120fps     493 µs/frame  2030 fps_ceiling   5.9% budget
Pipeline_ASCIIVideo_TrueColor_120fps 1075 µs/frame   931 fps_ceiling  12.9% budget

# Emulator alone (libghostty-vt CSI/SGR parser)
Emulator_Write_Stream_8Color_120fps    257 µs/frame  3890 fps_ceiling
Emulator_Write_Stream_TrueColor_120fps 488 µs/frame  2051 fps_ceiling

Result of the ReleaseFast rebuild: 27-32× pipeline speedup, 60× emulator speedup. The pipeline reaches 931-2030 fps end-to-end, which is about 7× headroom over the 120 fps target on the heaviest workload (truecolor full-screen redraws) and about 16× on 8-colour.

viewport renderer allocates ~1 alloc per 4 input bytes on SGR/CSI-heavy chunks. [MEDIUM]
- internal/app/viewport_renderer.go — the styled-lines and ratatui benchmarks show 4-17k allocs per chunk. The hot contributors are likely (a) string(vr.buf) / string(params) conversions in emitCSI for every escape sequence, (b) the pending strings.Builder resizing as fragments arrive, and (c) vr.shifter.Shift(vr.buf) returning a fresh slice per CSI.
- Fix direction: switch CSI param parsing to byte-slice comparison (no string conversion); reuse vr.buf and vr.pending backing arrays across Render calls by pre-growing in newViewportRenderer; have cursorShifter.Shift return into a caller-owned buffer instead of allocating. Profile-guided: run the styled-lines bench, point pprof at the allocs profile, fix the top three call sites.
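A minimal sketch of the byte-slice parsing idea, assuming the CSI parameters arrive as the raw bytes between `ESC [` and the final byte. `parseCSIParams` is a hypothetical helper, not the existing `emitCSI` code; it shows parsing without any `string(...)` conversion and appending into a caller-owned slice so the backing array can be reused across `Render` calls.

```go
package main

import "fmt"

// parseCSIParams parses semicolon-separated decimal parameters (for example
// the "38;2;255;0;0" of a truecolor SGR sequence) straight from raw bytes.
// Results are appended into dst[:0], so one backing array can be reused
// across calls. Hypothetical helper, not taken from the patterm source.
func parseCSIParams(dst []int, params []byte) []int {
	dst = dst[:0]
	if len(params) == 0 {
		return dst
	}
	cur := 0
	for _, b := range params {
		switch {
		case b >= '0' && b <= '9':
			cur = cur*10 + int(b-'0')
		case b == ';':
			dst = append(dst, cur) // an empty parameter counts as 0
			cur = 0
		default:
			// Unsupported byte: return only the parameters completed so far.
			return dst
		}
	}
	return append(dst, cur)
}

func main() {
	scratch := make([]int, 0, 16) // allocated once, reused per sequence
	scratch = parseCSIParams(scratch, []byte("38;2;255;0;0"))
	fmt.Println(scratch)
	scratch = parseCSIParams(scratch, []byte("1;4"))
	fmt.Println(scratch)
}
```

As long as the scratch slice has enough capacity, repeated calls should not allocate; whether this is actually the top contributor still needs the pprof pass described above.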
viewport renderer throughput (~90 MB/s styled) limits codex steady-state. [MEDIUM]
- The styled-lines and ratatui benchmarks come in at 89 MB/s and 40 MB/s respectively. A 100 KB/s codex burst is far under that limit, but a session-resume dump of a 5 MiB chat history takes roughly 60-130 ms of pure renderer time at those rates, enough to be user-visible at the start of a long resume.
- Fix direction: same as the alloc fix above; once the per-call allocation cost drops, the throughput ceiling rises with it. Worth re-running the benches after fixing the allocs and only investing further if the styled-lines bench is still under ~300 MB/s.
Session.Children() allocates a fresh slice on every call. [MEDIUM]
- internal/app/session.go:530-541 walks s.order under s.mu and builds a new []*Child slice every time. Callers on hot paths: drawSidebar calls it twice per frame (internal/app/sidebar.go:139 and :171); drawTabBar calls it once per frame (internal/app/tabbar.go:37); the classifier iterates it every 250 ms (internal/app/classifier.go:38); and palette/navigation hit it on every Ctrl-A/D/W/S keystroke.
- Fix direction: store the snapshot in an atomic.Pointer[[]*Child] on Session, refresh it under s.mu only when Spawn / delete mutates the map. Readers get O(1) Load() with zero allocation — same pattern already used for listeners (session.go:118-123).
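A sketch of the snapshot pattern, under assumptions: the `Session`, `Child`, `Spawn` and `Delete` shapes below are simplified stand-ins, not the real types in session.go. Writers rebuild the slice under the mutex and publish it through `atomic.Pointer`; readers only do a `Load()`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Child and Session are simplified stand-ins for the real types.
type Child struct{ ID string }

type Session struct {
	mu       sync.Mutex
	children map[string]*Child
	order    []string
	snapshot atomic.Pointer[[]*Child] // published copy, read without s.mu
}

func NewSession() *Session {
	s := &Session{children: make(map[string]*Child)}
	empty := []*Child{}
	s.snapshot.Store(&empty)
	return s
}

// refreshLocked rebuilds and publishes the snapshot. Callers must hold s.mu.
func (s *Session) refreshLocked() {
	snap := make([]*Child, 0, len(s.order))
	for _, id := range s.order {
		snap = append(snap, s.children[id])
	}
	s.snapshot.Store(&snap)
}

func (s *Session) Spawn(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.children[id]; exists {
		return
	}
	s.children[id] = &Child{ID: id}
	s.order = append(s.order, id)
	s.refreshLocked()
}

func (s *Session) Delete(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.children[id]; !exists {
		return
	}
	delete(s.children, id)
	for i, v := range s.order {
		if v == id {
			s.order = append(s.order[:i], s.order[i+1:]...)
			break
		}
	}
	s.refreshLocked()
}

// Children returns the current snapshot without locking or allocating.
// The returned slice is shared, so callers must treat it as read-only.
func (s *Session) Children() []*Child {
	return *s.snapshot.Load()
}

func main() {
	s := NewSession()
	s.Spawn("a")
	s.Spawn("b")
	fmt.Println(len(s.Children()))
}
```

The trade-off is that every mutation pays one slice allocation, which is a good exchange when reads vastly outnumber spawns and exits. A snapshot obtained before a mutation stays valid and unchanged for the reader that holds it.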
wait_for_pattern re-scans the entire stream/grid every iteration. [MEDIUM]
- internal/app/host.go:476-493 (the check closure). On scope = "scrollback" it calls c.StreamRead(0) followed by stripANSIBytes(nil, b) over the entire ring on every wake — a full O(ring size) walk per chunk arrival. On grid it goes through PlainText (one CGO call) plus a regex match against the full grid string. For an agent waiting on a marker in a chatty pane, every PTY chunk fires check().
- Fix direction: for scrollback, track the offset of the last check and run the regex only over the new tail, reusing a per-call scratch buffer for ANSI stripping. For grid, dedupe on ScreenVersion() — skip when version hasn't changed.
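One way to sketch the tail-only scan for scrollback. `tailScanner` and its `overlap` parameter are hypothetical, and ANSI stripping is left out; the assumption is that each PTY chunk would be stripped into a scratch buffer before being fed in. A small carry-over window is kept so a marker split across two chunks still matches.

```go
package main

import (
	"fmt"
	"regexp"
)

// tailScanner matches a pattern against newly arrived bytes only, keeping
// the last `overlap` bytes so matches can straddle a chunk boundary.
// Hypothetical type, not part of the existing host.go code.
type tailScanner struct {
	re      *regexp.Regexp
	overlap int
	buf     []byte // carry-over plus newest chunk, reused across calls
}

func newTailScanner(pattern string, overlap int) (*tailScanner, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	return &tailScanner{re: re, overlap: overlap, buf: make([]byte, 0, 4096)}, nil
}

// Feed appends one chunk and reports whether the pattern now matches.
// Work per call is proportional to overlap plus chunk size, not ring size.
func (t *tailScanner) Feed(chunk []byte) bool {
	t.buf = append(t.buf, chunk...)
	if t.re.Match(t.buf) {
		return true
	}
	if len(t.buf) > t.overlap {
		// Slide the last `overlap` bytes to the front; copy handles overlap.
		n := copy(t.buf, t.buf[len(t.buf)-t.overlap:])
		t.buf = t.buf[:n]
	}
	return false
}

func main() {
	ts, err := newTailScanner("READY>", 16)
	if err != nil {
		panic(err)
	}
	fmt.Println(ts.Feed([]byte("booting services... REA")))
	fmt.Println(ts.Feed([]byte("DY> ")))
}
```

The overlap has to be at least as long as the longest match the pattern can produce, otherwise a match split across chunks is lost. For unbounded patterns that limit would need to be a documented cap.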
search_output compiles regex + strips ANSI on every call. [MEDIUM]
- internal/app/host.go:428 compiles a fresh regexp.Regexp per invocation; :434 strips ANSI over the entire ring buffer when kind="rendered". Agents that poll search_output with the same pattern (the typical "watch for marker" loop) repay both costs on every call.
- Fix direction: small LRU of compiled regexes keyed by pattern string (cap maybe 32) on toolHost. Cache the stripped-ANSI buffer keyed by c.ScreenVersion() so consecutive searches over an unchanged ring reuse the strip.
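A sketch of the compiled-regex LRU, assuming a simple mutex-guarded cache is acceptable on `toolHost`. `regexCache` and its methods are hypothetical names; only `regexp.Compile` and `container/list` from the standard library are relied on. The stripped-ANSI cache keyed by `ScreenVersion()` is not shown.

```go
package main

import (
	"container/list"
	"fmt"
	"regexp"
	"sync"
)

type cacheEntry struct {
	pattern string
	re      *regexp.Regexp
}

// regexCache is a small LRU of compiled patterns keyed by pattern string.
// Hypothetical type for illustration.
type regexCache struct {
	mu       sync.Mutex
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // pattern -> element holding *cacheEntry
}

func newRegexCache(capacity int) *regexCache {
	return &regexCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Get returns a cached *regexp.Regexp or compiles and caches a new one,
// evicting the least recently used entry when the cache is over capacity.
func (c *regexCache) Get(pattern string) (*regexp.Regexp, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[pattern]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*cacheEntry).re, nil
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err // invalid patterns are not cached
	}
	c.items[pattern] = c.order.PushFront(&cacheEntry{pattern: pattern, re: re})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*cacheEntry).pattern)
	}
	return re, nil
}

func main() {
	cache := newRegexCache(32)
	first, _ := cache.Get(`DONE \d+`)
	second, _ := cache.Get(`DONE \d+`)
	fmt.Println(first == second) // same compiled instance on the second call
}
```

A compiled `*regexp.Regexp` is safe for concurrent use, so handing the same instance to several callers is fine.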
GetProcessOutput grid mode acquires the emulator twice. [MEDIUM]
- internal/app/host.go:375-391 does em := c.Emulator() for ActiveScreen / Cursor / Size, then at line 387 re-fetches em := c.Emulator() for PlainText. Each Emulator() call goes through ptyMu and inspects the live PTY pointer. Under a chatty agent polling get_process_output every 100 ms this is a redundant lock and pointer chase per call.
- Fix direction: hold the emulator reference from the first lookup and reuse it for PlainText. The existing if em == nil check still applies because it guards the same variable.
FindChildByIdentity is O(N) under the session lock. [LOW]
- internal/app/session.go:553-565 scans the children map looking for a matching Identity token on every new mcp-stdio connection. Not a steady-state hot path — only fires once per child spawn — but with many short-lived sub-agents it adds up and contends with everyone else taking s.mu.
- Fix direction: maintain an identityIndex map[string]string (identity → child id) updated alongside spawn / exit, give the lookup an O(1) read.
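A sketch of the identity index, with a hypothetical `registry` type standing in for the relevant part of `Session`. The point is only that the index is written in the same critical section as the children map, so the two cannot drift apart.

```go
package main

import (
	"fmt"
	"sync"
)

// child and registry are simplified stand-ins for the real session types.
type child struct {
	ID       string
	Identity string
}

type registry struct {
	mu            sync.RWMutex
	children      map[string]*child // child id -> child
	identityIndex map[string]string // identity token -> child id
}

func newRegistry() *registry {
	return &registry{
		children:      make(map[string]*child),
		identityIndex: make(map[string]string),
	}
}

// spawn registers the child and its identity under one lock acquisition.
func (r *registry) spawn(id, identity string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.children[id] = &child{ID: id, Identity: identity}
	r.identityIndex[identity] = id
}

// exit removes the child and its index entry together.
func (r *registry) exit(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	c, ok := r.children[id]
	if !ok {
		return
	}
	delete(r.identityIndex, c.Identity)
	delete(r.children, id)
}

// findByIdentity is a constant-time lookup instead of a scan over all children.
func (r *registry) findByIdentity(identity string) (*child, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	id, ok := r.identityIndex[identity]
	if !ok {
		return nil, false
	}
	c, ok := r.children[id]
	return c, ok
}

func main() {
	r := newRegistry()
	r.spawn("child-1", "token-abc")
	c, ok := r.findByIdentity("token-abc")
	fmt.Println(c.ID, ok)
}
```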
Per-promoter regex matches in the idle classifier. [LOW]
- internal/app/idle.go:175-182 (matchAny) walks each compiled pattern and runs a separate regex match over the same 4 KiB tail. A preset with five permission patterns + five error patterns is ten match invocations per child per 250 ms tick.
- Fix direction: at preset load time, compile each _patterns list into a single alternation regex ((?:p1)|(?:p2)|…). The classifier then makes one Match call per category per tick.
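A sketch of the preset-load step, assuming each `_patterns` list is a slice of Go regexp source strings. `combinePatterns` is a hypothetical helper. Each pattern is compiled on its own first so a bad entry is reported by index rather than as an opaque error in the joined expression.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// combinePatterns joins patterns into one alternation, (?:p1)|(?:p2)|...,
// so a category needs a single Match call per tick.
// Hypothetical helper, not taken from idle.go.
func combinePatterns(patterns []string) (*regexp.Regexp, error) {
	if len(patterns) == 0 {
		return nil, fmt.Errorf("no patterns to combine")
	}
	parts := make([]string, len(patterns))
	for i, p := range patterns {
		// Validate individually so the error names the offending entry.
		if _, err := regexp.Compile(p); err != nil {
			return nil, fmt.Errorf("pattern %d: %w", i, err)
		}
		parts[i] = "(?:" + p + ")"
	}
	return regexp.Compile(strings.Join(parts, "|"))
}

func main() {
	permission, err := combinePatterns([]string{
		`Allow .*\?`,
		`\[y/N\]`,
		`permission denied`,
	})
	if err != nil {
		panic(err)
	}
	tail := []byte("Overwrite existing file? [y/N]")
	fmt.Println(permission.Match(tail))
}
```

The combined regex only says that some pattern in the category matched. If the classifier ever needs to know which one, the per-pattern list would have to stay around as a fallback.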
Port-detection dedup is O(N²) over c.ports. [LOW]
- internal/app/child.go:461-467: for each fresh URL match the code linearly scans the existing port list. The list rarely grows past a handful, but a dev server that lists "all open ports" in one log line interacts badly: M new matches × N existing entries.
- Fix direction: keep a seenPorts map[int]struct{} next to c.ports, rebuilt on prune (none today). O(1) per match.
Port-sighting string allocations happen before the dedup check. [LOW]
- internal/app/child.go:455-456 allocates urlForm and portStr before line 461's seen walk. Both strings are wasted when the port is already in c.ports. The whole loop body also runs under c.portsMu, which blocks the Ports() reader path for the duration.
- Fix direction: bind the port int first (cheap parse from m[1]), do the seen check, only then allocate the URL string for the surviving sighting.
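The two port-detection items above can be sketched together. `portTracker`, `portSighting` and `record` are hypothetical names, and the `http://host:port` URL form is an assumption for illustration. The port is parsed from the captured digits outside the lock, checked against a set, and the URL string is built only for a new sighting.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

type portSighting struct {
	Port int
	URL  string
}

// portTracker keeps a set next to the slice so dedup is O(1) per match.
// Hypothetical type, not the real child.go structure.
type portTracker struct {
	mu    sync.Mutex
	ports []portSighting
	seen  map[int]struct{}
}

func newPortTracker() *portTracker {
	return &portTracker{seen: make(map[int]struct{})}
}

// parsePort converts captured digits to a port number without allocating.
func parsePort(digits []byte) (int, bool) {
	if len(digits) == 0 || len(digits) > 5 {
		return 0, false
	}
	n := 0
	for _, b := range digits {
		if b < '0' || b > '9' {
			return 0, false
		}
		n = n*10 + int(b-'0')
	}
	if n == 0 || n > 65535 {
		return 0, false
	}
	return n, true
}

// record reports whether the sighting was new. The URL string is only
// allocated after the dedup check has passed.
func (t *portTracker) record(host string, digits []byte) bool {
	port, ok := parsePort(digits) // parsed before taking the lock
	if !ok {
		return false
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, dup := t.seen[port]; dup {
		return false
	}
	t.seen[port] = struct{}{}
	url := "http://" + host + ":" + strconv.Itoa(port)
	t.ports = append(t.ports, portSighting{Port: port, URL: url})
	return true
}

func main() {
	tracker := newPortTracker()
	fmt.Println(tracker.record("localhost", []byte("5173")))
	fmt.Println(tracker.record("localhost", []byte("5173")))
}
```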
classifier time.Now() syscall per child per tick. [LOW]
- internal/app/classifier.go:54 (and the IdleMS / TitleIdleMS helpers it transitively calls in internal/app/child.go:343-374) each call time.Now(). Reading time on Linux is fast (vDSO) but with N children × 4 time.Now() per tick × 4 ticks/sec it's wasted work that can be batched.
- Fix direction: capture now := time.Now().UnixNano() once at the top of classifyAll and thread it into classifyOne and the helpers as a parameter.
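A sketch of threading one timestamp through a tick. The `child` type, `idleMS` and `classifyAt` are simplified, hypothetical stand-ins for the classifier helpers; the real classification logic is reduced here to an idle-threshold check.

```go
package main

import (
	"fmt"
	"time"
)

// child is a stand-in holding the last-output timestamp in UnixNano.
type child struct{ lastOutputNano int64 }

// idleMS takes the tick timestamp as a parameter instead of reading the clock.
func (c *child) idleMS(now int64) int64 {
	return (now - c.lastOutputNano) / int64(time.Millisecond)
}

// classifyAt marks each child idle or busy using the supplied timestamp.
func classifyAt(children []*child, now, idleThresholdMS int64) []bool {
	idle := make([]bool, len(children))
	for i, c := range children {
		idle[i] = c.idleMS(now) >= idleThresholdMS
	}
	return idle
}

// classifyAll reads the clock once per tick and passes the value down.
func classifyAll(children []*child, idleThresholdMS int64) []bool {
	now := time.Now().UnixNano() // the only clock read for this tick
	return classifyAt(children, now, idleThresholdMS)
}

func main() {
	kids := []*child{
		{lastOutputNano: time.Now().Add(-2 * time.Second).UnixNano()},
		{lastOutputNano: time.Now().UnixNano()},
	}
	fmt.Println(classifyAll(kids, 500))
}
```

A side benefit is that all children in a tick are judged against the same instant, and the helpers become testable with a fixed timestamp.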
wait_for_pattern subscribes a listener for every call. [LOW]
- internal/app/host.go:472-474: each invocation calls Session.Subscribe(wake) which clones the listener slice and swaps the atomic pointer; the defer Unsubscribe does the same on exit. Two allocations per wait_for_pattern. The agent pattern of looping on wait_for_pattern after every tool call pays this churn on the steady-state path.
- Fix direction: a per-child chunkBroadcaster registered once at child spawn that hands out lightweight subscriber tokens, rather than going through the full session listener machinery.

On Hold

An unexpected Unicode glyph (<?>) is displayed in opencode [ON HOLD]
- Investigated 2026-05-14: patterm passes ghostty grapheme codepoints through unchanged (vt/ghostty.go:452-462), so the <?> glyph is most likely the host terminal's font fallback for opencode's Nerd Font private-use codepoints, not a patterm substitution. Need a concrete reproduction (which codepoint, which host terminal/font) before changing rendering.
After codex rips for like 15 minutes, the terminal becomes quite slow. [ON HOLD / VERIFYING]
- 2026-05-14: Perf plan P1-P11 landed (see CHANGELOG). Needs a real long-running codex session to confirm whether the steady-state slowdown is gone or some hotspot remains. Capture a pprof if it still feels slow after ≥15 minutes — the structural drivers the audit named are all addressed, so a remaining symptom is a new one and probably wants fresh profiling.
