Sync TODO.md perf-audit review pass
Removed low/marginal items from the original sweep; remaining items have measured or workflow evidence to justify action.
This commit is contained in:
205
TODO.md
205
TODO.md
@@ -1,6 +1,7 @@
|
|||||||
# Perf Audit (auto-generated 2026-05-15)
|
# Perf Audit (reviewed 2026-05-15)
|
||||||
Findings from a codebase sweep — not user-reported, needs review before
|
Findings that survived the 2026-05-15 review pass. Low and marginal
|
||||||
action. Each item names the anchor and a sketched fix.
|
items from the original sweep were removed; remaining items have enough
|
||||||
|
measured or workflow evidence to justify action.
|
||||||
|
|
||||||
Baseline benchmark numbers (`go test -bench=. ./internal/app/`, AMD
|
Baseline benchmark numbers (`go test -bench=. ./internal/app/`, AMD
|
||||||
Ryzen 7 7800X3D, libghostty-vt **ReleaseFast** after the Makefile
|
Ryzen 7 7800X3D, libghostty-vt **ReleaseFast** after the Makefile
|
||||||
@@ -27,145 +28,75 @@ Emulator_Write_Stream_8Color_120fps 257 µs/frame 3890 fps_ceiling
|
|||||||
Emulator_Write_Stream_TrueColor_120fps 488 µs/frame 2051 fps_ceiling
|
Emulator_Write_Stream_TrueColor_120fps 488 µs/frame 2051 fps_ceiling
|
||||||
```
|
```
|
||||||
|
|
||||||
Result of the fix below: 27-32× pipeline speedup, 60× emulator
|
The current pipeline still has large 120 fps headroom. The remaining
|
||||||
speedup. Pipeline hits 930-2030 fps end-to-end — 7-16× headroom
|
renderer concern is multi-MiB styled replay latency and allocation
|
||||||
over the 120 fps target on the heaviest workload (truecolor
|
churn, not normal steady-state frame budget.
|
||||||
full-screen redraws).
|
|
||||||
|
|
||||||
|
|
||||||
- [ ] **viewport renderer allocates ~1 alloc per 4 input bytes on SGR/CSI-heavy chunks.** [MEDIUM]
|
- [ ] **viewport renderer allocates heavily on SGR/CSI-heavy chunks.** [MEDIUM]
|
||||||
- `internal/app/viewport_renderer.go` — the styled-lines and
|
- Review evidence: five benchmark reps confirmed
|
||||||
ratatui benchmarks show 4-17k allocs per chunk. The hot
|
`ViewportRenderer_StyledLines` at about 4,325 allocs per 16 KiB
|
||||||
contributors are likely (a) `string(vr.buf)` / `string(params)`
|
chunk (~91.5 KB/op, roughly 1 alloc per 3.8 input bytes), and
|
||||||
conversions in `emitCSI` for every escape sequence, (b) the
|
`ViewportRenderer_RatatuiBurst` at about 17,306 allocs per chunk
|
||||||
`pending strings.Builder` resizing as fragments arrive, and (c)
|
(~365 KB/op). A 5 MiB styled resume benchmark allocated about
|
||||||
`vr.shifter.Shift(vr.buf)` returning a fresh slice per CSI.
|
31 MB across 1.38M objects.
|
||||||
- Fix direction: switch CSI param parsing to byte-slice
|
- Likely hot paths: generic CSI/SGR output in
|
||||||
comparison (no string conversion); reuse `vr.buf` and
|
`internal/app/viewport_renderer.go` sends many sequences through
|
||||||
`vr.pending` backing arrays across `Render` calls by
|
`vr.shifter.Shift(vr.buf)`, while `internal/app/cursorshift.go`
|
||||||
pre-growing in `newViewportRenderer`; have `cursorShifter.Shift`
|
returns a fresh `[]byte` via `pending.String()` on every
|
||||||
return into a caller-owned buffer instead of allocating.
|
`Shift` call and parses CSI params through `string(raw)` /
|
||||||
Profile-guided: run the styled-lines bench, point pprof at the
|
`strings.Split`. The mode-helper `string(params)` conversions
|
||||||
allocs profile, fix the top three call sites.
|
are real, but probably not the main SGR-heavy cost.
|
||||||
|
- Fix direction: make `cursorShifter` write into caller-owned
|
||||||
|
scratch output or directly into the viewport renderer's pending
|
||||||
|
builder; parse CSI params from byte slices; pre-grow/reuse
|
||||||
|
renderer and shifter buffers. Re-run styled-lines, ratatui, and
|
||||||
|
5 MiB resume benchmarks; use pprof when available to confirm the
|
||||||
|
top allocation sites.
|
||||||
|
|
||||||
- [ ] **viewport renderer throughput (~90 MB/s styled) limits codex steady-state.** [MEDIUM]
|
- [ ] **large styled resume/replay dumps spend visible time in viewport rendering.** [MEDIUM]
|
||||||
- The styled-lines and ratatui benchmarks come in at 89 MB/s and
|
- Review evidence: `BenchmarkSessionResume_5MiBStyled` measured
|
||||||
40 MB/s respectively. A 100 KB/s codex burst is far under that
|
about 58 ms median and 63 ms p95 over five reps. The plain 5 MiB
|
||||||
limit, but a session-resume dump of a 5 MiB chat history takes
|
benchmark was about 23-24 ms with only 21 allocs. The live path
|
||||||
50-130 ms of pure renderer time at those rates — enough to be
|
renders focused PTY chunks through `renderer.Render`, then still
|
||||||
user-visible at the start of a long resume.
|
pays emulator writes, ring writes, event dispatch, stdout writes,
|
||||||
- Fix direction: same as the alloc fix above; once the per-call
|
and real terminal paint.
|
||||||
allocation cost drops, the throughput ceiling rises with it.
|
- Scope: this is not a Codex steady-state throughput limit. A
|
||||||
Worth re-running the benches after fixing the allocs and only
|
100 KB/s stream is far below the styled renderer's ~80-90 MB/s
|
||||||
investing further if the styled-lines bench is still under
|
ceiling. It matters for multi-MiB burst replay, resume/startup
|
||||||
~300 MB/s.
|
dumps, and dense full-screen churn.
|
||||||
|
- Fix direction: do the allocation fix first, since it should also
|
||||||
|
improve throughput. After that, invest further only if styled
|
||||||
|
resume traces remain user-visible or the styled-lines benchmark
|
||||||
|
is still under roughly 300 MB/s.
|
||||||
|
|
||||||
- [ ] **Session.Children() allocates a fresh slice on every call.** [MEDIUM]
|
- [ ] **wait_for_pattern re-scans the entire stream/grid while waiting.** [MEDIUM]
|
||||||
- `internal/app/session.go:530-541` walks `s.order` under `s.mu` and
|
- `internal/app/host.go:476-493` (the `check` closure). On
|
||||||
builds a new `[]*Child` slice every time. Callers on hot paths:
|
`scope="scrollback"` it calls `c.StreamRead(0)` followed by
|
||||||
`drawSidebar` calls it twice per frame
|
`stripANSIBytes(nil, b)`, so each check can copy, strip, and
|
||||||
(`internal/app/sidebar.go:139` and `:171`); `drawTabBar` calls it
|
search the full 1 MiB ring. On `scope="grid"` it calls
|
||||||
once per frame (`internal/app/tabbar.go:37`); the classifier
|
`PlainText()` and runs the regex against the full grid string.
|
||||||
iterates it every 250 ms (`internal/app/classifier.go:38`); and
|
- Caveat from review: the current chunk notifier coalesces bursts
|
||||||
palette/navigation hit it on every Ctrl-A/D/W/S keystroke.
|
with a buffered channel and has a 500 ms fallback, so this is not
|
||||||
- Fix direction: store the snapshot in an `atomic.Pointer[[]*Child]`
|
necessarily one full scan per PTY chunk. It is still meaningful
|
||||||
on `Session`, refresh it under `s.mu` only when `Spawn` / `delete`
|
for active waits on chatty panes.
|
||||||
mutates the map. Readers get O(1) `Load()` with zero allocation —
|
- Fix direction: for `scrollback`, track the last checked stream
|
||||||
same pattern already used for `listeners` (session.go:118-123).
|
offset and search only new output plus a bounded overlap/scratch
|
||||||
|
buffer so matches spanning chunks are not missed. For `grid`,
|
||||||
|
dedupe on `ScreenVersion()` and skip work when the version has
|
||||||
|
not changed.
|
||||||
|
|
||||||
- [ ] **wait_for_pattern re-scans the entire stream/grid every iteration.** [MEDIUM]
|
- [ ] **search_output rebuilds and searches whole scrollback on every call.** [MEDIUM]
|
||||||
- `internal/app/host.go:476-493` (the `check` closure). On `scope =
|
- `internal/app/host.go:428-437` compiles a fresh regex, reads the
|
||||||
"scrollback"` it calls `c.StreamRead(0)` followed by
|
stream from offset 0, strips ANSI for `kind="rendered"`, converts
|
||||||
`stripANSIBytes(nil, b)` over the entire ring on every wake — a
|
the full buffer to a string, and splits it into lines before
|
||||||
full O(ring size) walk per chunk arrival. On `grid` it goes
|
applying `limit`. This is meaningful when agents poll the same
|
||||||
through PlainText (one CGO call) plus a regex match against the
|
pattern; it is low impact for ad hoc searches.
|
||||||
full grid string. For an agent waiting on a marker in a chatty
|
- Fix direction: cache compiled regexes by pattern; cache stripped
|
||||||
pane, every PTY chunk fires `check()`.
|
rendered output by child id and stream end offset; avoid
|
||||||
- Fix direction: for `scrollback`, track the offset of the last
|
`strings.Split` over the whole ring when only the first `limit`
|
||||||
check and run the regex only over the new tail, reusing a
|
matches are needed. Prefer an incremental search shape if this
|
||||||
per-call scratch buffer for ANSI stripping. For `grid`, dedupe
|
becomes the standard "watch for marker" path.
|
||||||
on `ScreenVersion()` — skip when version hasn't changed.
|
|
||||||
|
|
||||||
- [ ] **search_output compiles regex + strips ANSI on every call.** [MEDIUM]
|
|
||||||
- `internal/app/host.go:428` compiles a fresh `regexp.Regexp` per
|
|
||||||
invocation; `:434` strips ANSI over the entire ring buffer when
|
|
||||||
`kind="rendered"`. Agents that poll `search_output` with the same
|
|
||||||
pattern (the typical "watch for marker" loop) repay both costs on
|
|
||||||
every call.
|
|
||||||
- Fix direction: small LRU of compiled regexes keyed by pattern
|
|
||||||
string (cap maybe 32) on `toolHost`. Cache the stripped-ANSI
|
|
||||||
buffer keyed by `c.ScreenVersion()` so consecutive searches over
|
|
||||||
an unchanged ring reuse the strip.
|
|
||||||
|
|
||||||
- [ ] **GetProcessOutput grid mode acquires the emulator twice.** [MEDIUM]
|
|
||||||
- `internal/app/host.go:375-391` does `em := c.Emulator()` for
|
|
||||||
ActiveScreen / Cursor / Size, then at line 387 re-fetches
|
|
||||||
`em := c.Emulator()` for PlainText. Each `Emulator()` call goes
|
|
||||||
through `ptyMu` and inspects the live PTY pointer. Under a
|
|
||||||
chatty agent polling `get_process_output` every 100 ms this is
|
|
||||||
a redundant lock and pointer chase per call.
|
|
||||||
- Fix direction: hold the emulator reference from the first
|
|
||||||
lookup; reuse it for PlainText. The check `if em == nil` still
|
|
||||||
runs cleanly because the variable is captured.
|
|
||||||
|
|
||||||
- [ ] **FindChildByIdentity is O(N) under the session lock.** [LOW]
|
|
||||||
- `internal/app/session.go:553-565` scans the children map looking
|
|
||||||
for a matching `Identity` token on every new mcp-stdio
|
|
||||||
connection. Not a steady-state hot path — only fires once per
|
|
||||||
child spawn — but with many short-lived sub-agents it adds up
|
|
||||||
and contends with everyone else taking `s.mu`.
|
|
||||||
- Fix direction: maintain an `identityIndex map[string]string`
|
|
||||||
(identity → child id) updated alongside spawn / exit, give the
|
|
||||||
lookup an O(1) read.
|
|
||||||
|
|
||||||
- [ ] **Per-promoter regex matches in the idle classifier.** [LOW]
|
|
||||||
- `internal/app/idle.go:175-182` (`matchAny`) walks each compiled
|
|
||||||
pattern and runs the DFA over the same 4 KiB tail. A preset with
|
|
||||||
five permission patterns + five error patterns is ten DFA
|
|
||||||
invocations per child per 250 ms tick.
|
|
||||||
- Fix direction: at preset load time, compile each `_patterns`
|
|
||||||
list into a single alternation regex (`(?:p1)|(?:p2)|…`). The
|
|
||||||
classifier then makes one Match call per category per tick.
|
|
||||||
|
|
||||||
- [ ] **Port-detection dedup is O(N²) over c.ports.** [LOW]
|
|
||||||
- `internal/app/child.go:461-467`: for each fresh URL match the
|
|
||||||
code linearly scans the existing port list. The list rarely
|
|
||||||
grows past a handful, but a dev server that lists "all open
|
|
||||||
ports" in one log line interacts badly: M new matches × N
|
|
||||||
existing entries.
|
|
||||||
- Fix direction: keep a `seenPorts map[int]struct{}` next to
|
|
||||||
`c.ports`, rebuilt on prune (none today). O(1) per match.
|
|
||||||
|
|
||||||
- [ ] **Port-sighting string allocations happen before the dedup check.** [LOW]
|
|
||||||
- `internal/app/child.go:455-456` allocates `urlForm` and `portStr`
|
|
||||||
before line 461's `seen` walk. Both strings are wasted when the
|
|
||||||
port is already in `c.ports`. Inside `c.portsMu` for the whole
|
|
||||||
loop body too, blocking the `Ports()` reader path.
|
|
||||||
- Fix direction: bind the port int first (cheap parse from
|
|
||||||
`m[1]`), do the seen check, only then allocate the URL string
|
|
||||||
for the surviving sighting.
|
|
||||||
|
|
||||||
- [ ] **classifier `time.Now()` syscall per child per tick.** [LOW]
|
|
||||||
- `internal/app/classifier.go:54` (and the `IdleMS` /
|
|
||||||
`TitleIdleMS` helpers it transitively calls in
|
|
||||||
`internal/app/child.go:343-374`) each call `time.Now()`.
|
|
||||||
Reading time on Linux is fast (vDSO) but with N children × 4
|
|
||||||
`time.Now()` per tick × 4 ticks/sec it's wasted work that can
|
|
||||||
be batched.
|
|
||||||
- Fix direction: capture `now := time.Now().UnixNano()` once at
|
|
||||||
the top of `classifyAll` and thread it into `classifyOne` and
|
|
||||||
the helpers as a parameter.
|
|
||||||
|
|
||||||
- [ ] **wait_for_pattern subscribes a listener for every call.** [LOW]
|
|
||||||
- `internal/app/host.go:472-474`: each invocation calls
|
|
||||||
`Session.Subscribe(wake)` which clones the listener slice and
|
|
||||||
swaps the atomic pointer; the `defer Unsubscribe` does the same
|
|
||||||
on exit. Two allocations per `wait_for_pattern`. The agent
|
|
||||||
pattern of looping on `wait_for_pattern` after every tool call
|
|
||||||
pays this churn on the steady-state path.
|
|
||||||
- Fix direction: a per-child `chunkBroadcaster` registered once
|
|
||||||
at child spawn that hands out lightweight subscriber tokens,
|
|
||||||
rather than going through the full session listener machinery.
|
|
||||||
|
|
||||||
# On Hold
|
# On Hold
|
||||||
- [ ] There's a unicode <?> being displayed in opencode [ON HOLD]
|
- [ ] There's a unicode <?> being displayed in opencode [ON HOLD]
|
||||||
|
|||||||
Reference in New Issue
Block a user