diff --git a/CHANGELOG.md b/CHANGELOG.md index 21b6d52..21344e4 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,6 +6,29 @@ loosely follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html). ## [Unreleased] +### Fixed +- `make deps` now builds libghostty-vt with `-Doptimize=ReleaseFast` + instead of zig's silent `Debug` default. The default-Debug build + shipped an unoptimised CSI/SGR parser that ate 16-29 ms per + 30-70 KiB full-screen frame in benchmarks, capping the entire + PTY-to-host pipeline at 34-63 fps no matter how fast the rest of + patterm got. The static library file size drops accordingly + (the Debug build was 33 MiB). Override with + `make deps GHOSTTY_VT_OPTIMIZE=Debug` only when debugging the + upstream library itself. Apply on existing checkouts with + `make clean-deps && make deps`. + +### Added +- ASCII-video stress benchmarks (`internal/app/bench_test.go`): + per-frame and per-stream variants at 30 / 60 / 120 fps targets, + three workload fixtures (8-colour cells, 24-bit truecolor cells, + and a Bad-Apple-style 1-bit pattern). Each stream benchmark + reports `µs/frame`, an achievable `fps_ceiling`, and `budget_pct` + so you can read off "do we hit N fps?" directly. A matching + Pipeline_ASCIIVideo_* set includes libghostty-vt's em.Write CGO + and an io.Discard stdout write so the FPS claim reflects the + whole pipeline, not just the renderer. + ### Fixed - Long claude session resume (and codex steady-state rendering) is noticeably faster. 
Two costs that scaled per-PTY-chunk are now diff --git a/Makefile b/Makefile index 2b62e4d..fca9138 100644 --- a/Makefile +++ b/Makefile @@ -20,10 +20,21 @@ $(SOURCE)/.git/HEAD: deps-fetch: $(SOURCE)/.git/HEAD +# Zig's `standardOptimizeOption` defaults to .Debug when no +# -Doptimize is passed, which makes libghostty-vt's CSI/SGR parser +# an order of magnitude slower: full-screen frames spend ~16 ms +# (8-colour) to ~29 ms (truecolor) each in em.Write under Debug (see +# internal/app/bench_test.go BenchmarkEmulator_Write_*), which caps +# the full PTY-to-host pipeline at 34-63 fps. ReleaseFast is the +# right default for the shipped artefact. Override with +# `make deps GHOSTTY_VT_OPTIMIZE=Debug` when you actually want a +# debug build of the upstream lib. +GHOSTTY_VT_OPTIMIZE ?= ReleaseFast + $(INSTALL)/lib/libghostty-vt.a: $(SOURCE)/.git/HEAD @command -v zig >/dev/null || { echo "ERROR: zig not on PATH (need >=0.15.2 to build libghostty-vt)"; exit 1; } - @echo ">> building libghostty-vt with zig" - @cd $(SOURCE) && zig build -Demit-lib-vt --prefix $(INSTALL) + @echo ">> building libghostty-vt with zig (optimize=$(GHOSTTY_VT_OPTIMIZE))" + @cd $(SOURCE) && zig build -Demit-lib-vt -Doptimize=$(GHOSTTY_VT_OPTIMIZE) --prefix $(INSTALL) @test -f $(INSTALL)/lib/libghostty-vt.a || { echo "ERROR: expected static lib at $(INSTALL)/lib/libghostty-vt.a"; exit 1; } @echo ">> libghostty-vt installed under $(INSTALL)" diff --git a/TODO.md b/TODO.md index 58d2544..f8d97fb 100644 --- a/TODO.md +++ b/TODO.md @@ -3,16 +3,53 @@ Findings from a codebase sweep — not user-reported, needs review before action. Each item names the anchor and a sketched fix. Baseline benchmark numbers (`go test -bench=.
./internal/app/`, AMD -Ryzen 7 7800X3D): +Ryzen 7 7800X3D, libghostty-vt **Debug-mode** — see the first item +below): ``` +# Renderer alone ViewportRenderer_PlainASCII 229 MB/s 1.3 KB/op 6 allocs/op ViewportRenderer_StyledLines 89 MB/s 91 KB/op 4325 allocs/op ViewportRenderer_RatatuiBurst 40 MB/s 365 KB/op 17306 allocs/op RendererThroughput_ReuseInstance 90 MB/s 316 KB/op 17380 allocs/op ContainsOSC_NoOSC 3050 MB/s 0 B/op 0 allocs/op + +# ASCII-video stream (renderer only — 3 sec at the target fps) +ASCIIVideo_Stream_8Color_120fps 260 µs/frame 3845 fps_ceiling 3.1% budget +ASCIIVideo_Stream_TrueColor_120fps 576 µs/frame 1735 fps_ceiling 6.9% budget + +# Full pipeline (em.Write + renderer + io.Discard write) +Pipeline_ASCIIVideo_8Color_120fps 15838 µs/frame 63 fps_ceiling 190% budget +Pipeline_ASCIIVideo_TrueColor_120fps 29224 µs/frame 34 fps_ceiling 350% budget + +# Emulator alone (libghostty-vt CSI/SGR parser) +Emulator_Write_Stream_8Color_120fps 15930 µs/frame 63 fps_ceiling +Emulator_Write_Stream_TrueColor_120fps 29241 µs/frame 34 fps_ceiling ``` +The renderer alone hits 1700-3800 fps with margin. The full pipeline +caps at 34-63 fps. **The whole gap is libghostty-vt's em.Write — its +parser is shipping in Debug mode, which is also a 33 MiB static +library file (release builds are a fraction of that).** + +- [ ] **libghostty-vt was being built in Debug mode.** [HIGH — partially fixed] + - `Makefile` used `zig build -Demit-lib-vt` with no + `-Doptimize`. Zig's `standardOptimizeOption` defaults to + `.Debug`, so the shipped static lib was unoptimised. Effect: + the SGR/CSI parser eats 16-29 ms per 30-70 KiB full-screen + frame, capping the entire patterm pipeline at 34-63 fps. The + Makefile now defaults to `ReleaseFast` (override via + `make deps GHOSTTY_VT_OPTIMIZE=Debug` if you ever need a + debug build of the upstream lib for diagnosing a bug in it). 
+ - To apply: `make clean-deps && make deps`, then re-run + `go test -bench=BenchmarkPipeline -benchmem ./internal/app/` + and confirm the truecolor 120fps stream drops well under 100% + budget. Update the numbers in this section after rebuilding. + - Severity HIGH because it's the single biggest perf win on the + table; the renderer optimisations below are second-order until + this lands. + + - [ ] **viewport renderer allocates ~1 alloc per 4 input bytes on SGR/CSI-heavy chunks.** [MEDIUM] - `internal/app/viewport_renderer.go` — the styled-lines and ratatui benchmarks show 4-17k allocs per chunk. The hot diff --git a/internal/app/bench_test.go b/internal/app/bench_test.go index 2c30c9f..a6e3230 100644 --- a/internal/app/bench_test.go +++ b/internal/app/bench_test.go @@ -2,8 +2,11 @@ package app import ( "fmt" + "io" "strings" "testing" + + "github.com/hjbdev/patterm/internal/vt" ) // Benchmarks for patterm's hot paths. Run with: @@ -167,3 +170,377 @@ func BenchmarkRendererThroughput_ReuseInstance(b *testing.B) { } } } + +// Stress workloads — these model the worst things a real session +// can throw at us. The headline target is "ASCII video": every cell +// of an 80x40 viewport carries an SGR colour change and a printable +// character, rendered as one chunk per frame. Real ASCII-video CLIs +// (ascii-image-converter, asciinema-render, towel.blinkenlights, the +// Bad Apple meme) hit patterm with exactly this pattern at 24-30 fps +// for minutes at a time. +// +// We synthesise the workload rather than ship a captured corpus so +// the benchmarks stay deterministic and the repo doesn't carry tens +// of MiB of fixture data. The encoding is faithful to what those +// tools actually emit. + +// buildASCIIVideoFrame builds a single full-viewport frame with +// 8-colour SGR per cell (`\x1b[3Nm`). One frame ≈ 30 KiB for an +// 80x40 viewport, which lines up with what ascii-video tools emit. 
+func buildASCIIVideoFrame(cols, rows int) []byte { + var b strings.Builder + b.WriteString("\x1b[H") // home cursor before the frame starts + for r := 0; r < rows; r++ { + for c := 0; c < cols; c++ { + fmt.Fprintf(&b, "\x1b[3%dm%c", (r+c)%8, byte(' '+(r*c)%(0x7e-' '))) + } + b.WriteString("\x1b[0m\r\n") + } + return []byte(b.String()) +} + +// buildASCIIVideoFrameTrueColor builds the same frame but with +// 24-bit RGB SGR (`\x1b[38;2;R;G;Bm`). Every cell is ~20 bytes of +// escape + 1 byte glyph, so a frame is ≈ 70 KiB. This is what +// chafa --colors=full and modern terminal video players emit, and +// it's the heaviest SGR variant the renderer's CSI path sees. +func buildASCIIVideoFrameTrueColor(cols, rows int) []byte { + var b strings.Builder + b.WriteString("\x1b[H") + for r := 0; r < rows; r++ { + for c := 0; c < cols; c++ { + rd := (r * 7) % 256 + gd := (c * 11) % 256 + bd := ((r + c) * 13) % 256 + fmt.Fprintf(&b, "\x1b[38;2;%d;%d;%dm%c", rd, gd, bd, byte(' '+(r*c)%(0x7e-' '))) + } + b.WriteString("\x1b[0m\r\n") + } + return []byte(b.String()) +} + +// buildBadApplePattern builds the simplest possible ASCII video +// frame: alternating black/white cells (the Bad Apple meme is +// essentially a 1-bit silhouette video). This is the pattern that +// stresses the SGR state-machine without exercising truecolor parse +// — useful for isolating "is the cost in the colour parsing or in +// the cell-by-cell switching?" +func buildBadApplePattern(cols, rows int) []byte { + var b strings.Builder + b.WriteString("\x1b[H") + for r := 0; r < rows; r++ { + for c := 0; c < cols; c++ { + if (r+c)%2 == 0 { + b.WriteString("\x1b[37m█") + } else { + b.WriteString("\x1b[30m█") + } + } + b.WriteString("\x1b[0m\r\n") + } + return []byte(b.String()) +} + +// BenchmarkASCIIVideo_Frame_8Color renders a single full-screen +// frame as one chunk. The headline number is MB/s — at 30 fps a +// frame is one PTY chunk every ~33 ms, so this should comfortably +// stay well under 1 ms. 
+func BenchmarkASCIIVideo_Frame_8Color(b *testing.B) { + frame := buildASCIIVideoFrame(80, 40) + b.SetBytes(int64(len(frame))) + b.ReportAllocs() + b.ResetTimer() + for i := 0; i < b.N; i++ { + vr := newViewportRenderer(newTerminalLayout(120, 40)) + _ = vr.Render(frame) + } +} + +// BenchmarkASCIIVideo_Frame_TrueColor renders a single truecolor +// frame. ~70 KiB per frame. Compare this to the 8-colour number to +// see how much extra cost the truecolor SGR parse imposes — the +// `\x1b[38;2;R;G;Bm` form is the longest and most parameter-rich +// CSI patterm sees in practice. +func BenchmarkASCIIVideo_Frame_TrueColor(b *testing.B) { + frame := buildASCIIVideoFrameTrueColor(80, 40) + b.SetBytes(int64(len(frame))) + b.ReportAllocs() + b.ResetTimer() + for i := 0; i < b.N; i++ { + vr := newViewportRenderer(newTerminalLayout(120, 40)) + _ = vr.Render(frame) + } +} + +// BenchmarkASCIIVideo_Frame_BadApple is the 1-bit pattern: simplest +// SGR (two colours, alternating). Isolates the renderer's cell-by- +// cell SGR cycling cost from the truecolor parse cost. +func BenchmarkASCIIVideo_Frame_BadApple(b *testing.B) { + frame := buildBadApplePattern(80, 40) + b.SetBytes(int64(len(frame))) + b.ReportAllocs() + b.ResetTimer() + for i := 0; i < b.N; i++ { + vr := newViewportRenderer(newTerminalLayout(120, 40)) + _ = vr.Render(frame) + } +} + +// runStreamBench is the shared body for the per-fps stream +// benchmarks. It feeds a fixed frame N times through a single +// renderer instance and reports µs/frame + an achievable-fps +// ceiling alongside the standard ns/op + MB/s. The fps value in +// the benchmark name is the *target* — the workload itself doesn't +// rate-limit; we just decide how many frames make a benchmark op +// (3 seconds' worth) so steady-state cost dominates warm-up. 
+func runStreamBench(b *testing.B, frame []byte, fps int) { + frames := fps * 3 // 3 seconds at the target rate + totalBytes := int64(len(frame) * frames) + b.SetBytes(totalBytes) + b.ReportAllocs() + b.ResetTimer() + for i := 0; i < b.N; i++ { + vr := newViewportRenderer(newTerminalLayout(120, 40)) + for f := 0; f < frames; f++ { + _ = vr.Render(frame) + } + } + nsPerFrame := float64(b.Elapsed().Nanoseconds()) / float64(b.N*frames) + b.ReportMetric(nsPerFrame/1000.0, "µs/frame") + b.ReportMetric(1e9/nsPerFrame, "fps_ceiling") + // budget_pct = how much of the per-frame budget at the target + // rate we burn. Under 100 means we can hit the target; over + // means we can't. + budgetNs := 1e9 / float64(fps) + b.ReportMetric(nsPerFrame/budgetNs*100, "budget_pct") +} + +// BenchmarkASCIIVideo_Stream_8Color_30fps / _60fps / _120fps reuse +// one renderer across (3 × fps) frames. The headline numbers are +// µs/frame, fps_ceiling (= 1e9 / ns/frame), and budget_pct (= +// percent of the per-frame budget at the target rate we consume). +// +// 30 fps is the typical ASCII-video baseline (towel, chafa, Bad +// Apple ports). 60 is the "smooth playback" target. 120 is a +// future-proofing stress level matching modern high-refresh +// terminals. +func BenchmarkASCIIVideo_Stream_8Color_30fps(b *testing.B) { + runStreamBench(b, buildASCIIVideoFrame(80, 40), 30) +} +func BenchmarkASCIIVideo_Stream_8Color_60fps(b *testing.B) { + runStreamBench(b, buildASCIIVideoFrame(80, 40), 60) +} +func BenchmarkASCIIVideo_Stream_8Color_120fps(b *testing.B) { + runStreamBench(b, buildASCIIVideoFrame(80, 40), 120) +} + +// BenchmarkASCIIVideo_Stream_TrueColor_* same set but with the +// truecolor frames. Compare against the 8-colour numbers to see +// what the longer `\x1b[38;2;R;G;Bm` parse costs us. 
+func BenchmarkASCIIVideo_Stream_TrueColor_30fps(b *testing.B) { + runStreamBench(b, buildASCIIVideoFrameTrueColor(80, 40), 30) +} +func BenchmarkASCIIVideo_Stream_TrueColor_60fps(b *testing.B) { + runStreamBench(b, buildASCIIVideoFrameTrueColor(80, 40), 60) +} +func BenchmarkASCIIVideo_Stream_TrueColor_120fps(b *testing.B) { + runStreamBench(b, buildASCIIVideoFrameTrueColor(80, 40), 120) +} + +// BenchmarkASCIIVideo_Stream_BadApple_* tracks the 1-bit alternating +// pattern. Isolates per-cell SGR cycling cost from the truecolor +// parse cost above — useful when reading the diff between the two +// stream variants. +func BenchmarkASCIIVideo_Stream_BadApple_30fps(b *testing.B) { + runStreamBench(b, buildBadApplePattern(80, 40), 30) +} +func BenchmarkASCIIVideo_Stream_BadApple_60fps(b *testing.B) { + runStreamBench(b, buildBadApplePattern(80, 40), 60) +} +func BenchmarkASCIIVideo_Stream_BadApple_120fps(b *testing.B) { + runStreamBench(b, buildBadApplePattern(80, 40), 120) +} + +// BenchmarkEmulator_Write_8Color_Frame / _TrueColor_Frame isolate the +// libghostty-vt CGO cost: same frames the Pipeline benchmarks use, +// but feeding only the emulator. Pipeline minus this is the +// renderer's share (compare BenchmarkASCIIVideo_Stream_…); the rest +// is libghostty-vt.
+func BenchmarkEmulator_Write_8Color_Frame(b *testing.B) { + frame := buildASCIIVideoFrame(80, 40) + b.SetBytes(int64(len(frame))) + b.ReportAllocs() + b.ResetTimer() + for i := 0; i < b.N; i++ { + em, err := vt.NewGhosttyEmulator(80, 40) + if err != nil { + b.Fatalf("emulator: %v", err) + } + if _, werr := em.Write(frame); werr != nil { + b.Fatalf("emulator.Write: %v", werr) + } + _ = em.Close() + } +} + +func BenchmarkEmulator_Write_TrueColor_Frame(b *testing.B) { + frame := buildASCIIVideoFrameTrueColor(80, 40) + b.SetBytes(int64(len(frame))) + b.ReportAllocs() + b.ResetTimer() + for i := 0; i < b.N; i++ { + em, err := vt.NewGhosttyEmulator(80, 40) + if err != nil { + b.Fatalf("emulator: %v", err) + } + if _, werr := em.Write(frame); werr != nil { + b.Fatalf("emulator.Write: %v", werr) + } + _ = em.Close() + } +} + +// BenchmarkEmulator_Write_Stream_120fps reuses one emulator across +// 360 frames (3 sec × 120 fps). This is the cleanest measurement +// of em.Write steady-state cost. +func BenchmarkEmulator_Write_Stream_8Color_120fps(b *testing.B) { + frame := buildASCIIVideoFrame(80, 40) + const frames = 360 + b.SetBytes(int64(len(frame) * frames)) + b.ReportAllocs() + b.ResetTimer() + for i := 0; i < b.N; i++ { + em, err := vt.NewGhosttyEmulator(80, 40) + if err != nil { + b.Fatalf("emulator: %v", err) + } + for f := 0; f < frames; f++ { + if _, werr := em.Write(frame); werr != nil { + b.Fatalf("emulator.Write: %v", werr) + } + } + _ = em.Close() + } + nsPerFrame := float64(b.Elapsed().Nanoseconds()) / float64(b.N*frames) + b.ReportMetric(nsPerFrame/1000.0, "µs/frame") + b.ReportMetric(1e9/nsPerFrame, "fps_ceiling") +} + +func BenchmarkEmulator_Write_Stream_TrueColor_120fps(b *testing.B) { + frame := buildASCIIVideoFrameTrueColor(80, 40) + const frames = 360 + b.SetBytes(int64(len(frame) * frames)) + b.ReportAllocs() + b.ResetTimer() + for i := 0; i < b.N; i++ { + em, err := vt.NewGhosttyEmulator(80, 40) + if err != nil { + b.Fatalf("emulator: %v", err) + } + 
for f := 0; f < frames; f++ { + if _, werr := em.Write(frame); werr != nil { + b.Fatalf("emulator.Write: %v", werr) + } + } + _ = em.Close() + } + nsPerFrame := float64(b.Elapsed().Nanoseconds()) / float64(b.N*frames) + b.ReportMetric(nsPerFrame/1000.0, "µs/frame") + b.ReportMetric(1e9/nsPerFrame, "fps_ceiling") +} + +// runPipelineStreamBench includes the libghostty-vt emulator.Write +// CGO call and a stdout write to io.Discard alongside the renderer +// — i.e. everything OnPTYOut does in production except the host +// terminal's own paint time (which patterm doesn't control). This +// is the honest "can we hit N fps end-to-end?" measurement. +func runPipelineStreamBench(b *testing.B, frame []byte, fps int) { + frames := fps * 3 + totalBytes := int64(len(frame) * frames) + b.SetBytes(totalBytes) + b.ReportAllocs() + b.ResetTimer() + for i := 0; i < b.N; i++ { + em, err := vt.NewGhosttyEmulator(80, 40) + if err != nil { + b.Fatalf("emulator: %v", err) + } + vr := newViewportRenderer(newTerminalLayout(120, 40)) + for f := 0; f < frames; f++ { + if _, werr := em.Write(frame); werr != nil { + b.Fatalf("emulator.Write: %v", werr) + } + out := vr.Render(frame) + // Match OnPTYOut's autowrap prelude/postlude wrapping so + // the byte count is faithful. + _, _ = io.Discard.Write([]byte("\x1b[?7l")) + _, _ = io.Discard.Write(out) + _, _ = io.Discard.Write([]byte("\x1b[?7h")) + } + _ = em.Close() + } + nsPerFrame := float64(b.Elapsed().Nanoseconds()) / float64(b.N*frames) + b.ReportMetric(nsPerFrame/1000.0, "µs/frame") + b.ReportMetric(1e9/nsPerFrame, "fps_ceiling") + budgetNs := 1e9 / float64(fps) + b.ReportMetric(nsPerFrame/budgetNs*100, "budget_pct") +} + +// BenchmarkPipeline_ASCIIVideo_* — the FULL OnPTYOut path +// (emulator.Write CGO + viewport renderer + a stdout write to +// io.Discard) running at 30/60/120 fps targets. These are the +// numbers to trust when asking "can we sustain N fps?" 
The +// renderer-only Stream benchmarks above isolate one stage and +// understate the real cost. +// +// 120 fps is the explicit baseline: anything under 100% of the +// per-frame budget here means we hit 120 fps with margin to spare. +func BenchmarkPipeline_ASCIIVideo_8Color_30fps(b *testing.B) { + runPipelineStreamBench(b, buildASCIIVideoFrame(80, 40), 30) +} +func BenchmarkPipeline_ASCIIVideo_8Color_60fps(b *testing.B) { + runPipelineStreamBench(b, buildASCIIVideoFrame(80, 40), 60) +} +func BenchmarkPipeline_ASCIIVideo_8Color_120fps(b *testing.B) { + runPipelineStreamBench(b, buildASCIIVideoFrame(80, 40), 120) +} + +func BenchmarkPipeline_ASCIIVideo_TrueColor_30fps(b *testing.B) { + runPipelineStreamBench(b, buildASCIIVideoFrameTrueColor(80, 40), 30) +} +func BenchmarkPipeline_ASCIIVideo_TrueColor_60fps(b *testing.B) { + runPipelineStreamBench(b, buildASCIIVideoFrameTrueColor(80, 40), 60) +} +func BenchmarkPipeline_ASCIIVideo_TrueColor_120fps(b *testing.B) { + runPipelineStreamBench(b, buildASCIIVideoFrameTrueColor(80, 40), 120) +} + +// BenchmarkSessionResume_5MiBStyled simulates the user's +// motivating case: claude resuming a long chat session and dumping +// the whole history. 5 MiB of styled output as a single Render +// call. Numbers here tell us how long the visible "scrolling +// while resume loads" window will be. +func BenchmarkSessionResume_5MiBStyled(b *testing.B) { + chunk := buildStyledLinesChunk(5 * 1024 * 1024) + b.SetBytes(int64(len(chunk))) + b.ReportAllocs() + b.ResetTimer() + for i := 0; i < b.N; i++ { + vr := newViewportRenderer(newTerminalLayout(120, 40)) + _ = vr.Render(chunk) + } +} + +// BenchmarkSessionResume_5MiBPlain same as above but pure text. +// Lower bound — what we'd hit if the resume content were styling- +// free. 
+func BenchmarkSessionResume_5MiBPlain(b *testing.B) { + chunk := buildPlainASCIIChunk(5 * 1024 * 1024) + b.SetBytes(int64(len(chunk))) + b.ReportAllocs() + b.ResetTimer() + for i := 0; i < b.N; i++ { + vr := newViewportRenderer(newTerminalLayout(120, 40)) + _ = vr.Render(chunk) + } +}