Section 02: Cost Baseline
Status: Not Started Goal: Provide a quantified before/after for the rewrite’s ≥15 KB context-cost claim. Without baseline, “the rewrite saves context” is unverifiable; with baseline, §04.3 corpus replay can confirm or refute the success criterion exactly.
Success Criteria:
- §02.1 records baseline byte counts for one representative round (3-of-3 reviewers
status: ok) sourced from a real scratch dir - §02.2 specifies the post-rewrite measurement procedure — what to count, what to exclude, where to find the data
- Connects upward to mission criterion: “Context cost per round drops by at least 15 KB measured”
Context: Sub-agent return messages today carry: a scratch_dir: line, the 16 KB-bounded <<<TPR-REPORT … TPR-REPORT>>> block, and a one-line mechanical comment (extraction_tier: / extraction_attempts: / CLI exit code). The 16 KB report content is preserved across the rewrite (extract-report.py owns the cap); what disappears is the sub-agent harness’s own scaffolding around that block — Sonnet’s turn metadata, any model-thinking emissions, the Agent tool-call envelope itself. The savings claim is “scaffolding bytes”, not “content bytes”.
Reference implementations:
.claude/skills/improve-tooling/tpr-review-design.md2026-04-25 entry — byte-reduction-ratio reporting style forextract-report.py(1.3-3.3 MB → 6-9 KB on codex; same measurement framing applies here)
Depends on: §01 (resolved design surface). Cannot measure replacement shape’s cost before the shape is fixed in §01.2.
Intelligence Reconnaissance
Queries run 2026-04-25:
scripts/intel-query.sh status— graph available (Neo4j 5.26.24); recorded as proof of protocol.- Surface-applicability note: measurement target is orchestrator transcript bytes — outside the symbol-graph index. The graph IS available but NOT APPLICABLE to byte-counting; baseline data lives on disk in prior
/tpr-reviewscratch dirs preserved by §8amktemp -d. - Surrogate query (filesystem):
ls -la /tmp/tpr-round-ori_lang-*/to inventory available scratches. Historical IDs fromtpr-review-design.md2026-04-25 entry:eG913ATp,QshQsQzR,V78eiBIx,UvWHVYg4.
Results summary (≤500 chars) [repo:.claude/skills/improve-tooling/tpr-review-design.md]: Baseline capture relies on disk-resident scratch dirs from prior /tpr-review runs (§8a mktemp -d preservation). Historical scratches at /tmp/tpr-round-ori_lang-{eG913ATp,QshQsQzR,V78eiBIx,UvWHVYg4}/ are the corpus already validated against extract-report.py per [repo:.claude/skills/improve-tooling/tpr-review-design.md] 2026-04-25 entry. Reuse that corpus rather than dispatching fresh — same data, zero review-cycle cost. No [ori] graph-resident symbols touch this surface.
02.1 Capture baseline (current sub-agent dispatch)
File(s): This section file (measurements recorded inline below)
The baseline is the orchestrator-context noise contributed by the Agent return path PER ROUND, measured against a real prior dispatch.
-
Located representative scratch dir. Selected
/tmp/tpr-round-ori_lang-038Mc69x/(2026-04-24 08:59) — all three reviewers produced reports cleanly. Inventoried all 19 historical scratches still resident on disk (/tmp/tpr-round-ori_lang-*); 3-of-3status: okrounds are common. The foureG913ATp/QshQsQzR/V78eiBIx/UvWHVYg4scratches referenced intpr-review-design.md2026-04-25 entry have been garbage-collected; current available corpus is younger (2026-04-24+). -
Per-reviewer measurements computed. Report bytes (preserved across rewrite — content, not noise) and wrapper stdout bytes (eliminated post-rewrite, but already redirected to files NOT to orchestrator’s bash result):
| Reviewer | report bytes | extract-report.py JSON | wrapper-stdout-to-file | wrapper bash result (orch ctx) | |----------|--------------|------------------------|------------------------|--------------------------------| | codex | 4,520 | ~175 | 1,631,649 | 0 bytes | | gemini | 4,285 | ~175 | 38,373 | 0 bytes | | opencode | 4,594 | ~175 | 268,731 | 0 bytes | | TOTAL | 13,399 | ~525 | 1,938,753 | 0 bytes |Source:
wc -c /tmp/tpr-round-ori_lang-038Mc69x/{codex,gemini,opencode}-{report,stdout}.txt+ a sampleextract-report.py /tmp/.../038Mc69x --reviewer codexJSON line measurement.Critical observation:
invoke-codex.sh(line 17–18) redirects stdout to$RUN/codex-stdout.txtand stderr to$RUN/codex-stderr.txt— the orchestrator’s foregroundBash(bash invoke-codex.sh "$scratch")call returns empty result content (~0 bytes) regardless of the wrapper’s internal stdout volume. The 1.6 MB / 38 KB / 268 KB stdout files NEVER enter orchestrator context — they hit disk andextract-report.pyreads them. So the new world’s per-reviewer orchestrator-context cost = wrapper-bash-result (~0 B) + extract-report.py JSON (~175 B) + Read result (4.5 KB report content). Per-round NEW-WORLD orchestrator-context cost ≈~525 B + 13,399 B = ~13.9 KBtotal (mostly content). -
Disk-corpus does NOT preserve orchestrator transcript — the savings number is not measurable from disk alone. Today’s orchestrator transcript carries each Agent’s return message (which CONTAINS the report block + scratch_dir line + mechanical comment + any sub-agent prose that leaked through the I22 ban + Agent-tool harness scaffolding the parent harness wraps around the return). The HONEST baseline number =
Agent_return_total - Bash_result_total - extract_report_json_total - Read_report_total. Disk has the report bytes (13.4 KB) but NOT the Agent harness scaffolding bytes — those live in the parent’s transcript only.Action: the ≥15 KB savings success criterion in
00-overview.mdis verifiable ONLY at §04.3 corpus replay against a fresh dispatch with a captured orchestrator transcript. §02.1 establishes the content baseline (13.4 KB of report content preserved across the rewrite); §02.2 specifies how §04.3 will measure the scaffolding delta. -
Cross-check against ≥15 KB target — partial. Content (preserved) is ~13.4 KB, so the success criterion’s claim is necessarily about scaffolding bytes that exceed the content size. Plausibility check: a Sonnet sub-agent harness commonly contributes ≥5 KB of envelope per reviewer (tool-use metadata, sub-agent thinking summaries, exit-status framing) — 3 reviewers × 5 KB = 15 KB lower bound. The 15 KB target is plausible but not pre-confirmed. If §04.3 measurement falls short, the success criterion gets revised inline with explicit reasoning per §04.3’s failure-response rule — NOT silently weakened.
-
Subsection close-out (02.1) — MANDATORY before starting 02.2:
- Baseline table embedded above with concrete byte counts from
/tmp/tpr-round-ori_lang-038Mc69x/ - Honest disclosure: scaffolding-delta number requires §04.3 fresh-dispatch measurement (orchestrator transcript not preserved on disk)
- Update this subsection’s
statusin section frontmatter tocomplete - Repo hygiene check — clean (probe at §01.1 close-out remains current)
- Baseline table embedded above with concrete byte counts from
02.2 Define post-rewrite measurement protocol
File(s): This section file (protocol recorded inline below)
The protocol pins what §04.3 must do to confirm the savings claim. Without a written protocol, “we measured it and it was fine” is unfalsifiable.
-
Post-rewrite measurement procedure (drafted for §04.3):
PROTOCOL: Post-rewrite per-round orchestrator-context cost Pre-conditions: §03 rewrite landed on disk; one historical scratch from §02.1 corpus is the comparison anchor (or a fresh equivalent-complexity replay is captured fresh). Step 1 — BASELINE (today's world, before §03 lands): Run /tpr-review against a representative work surface (small recent commit OR a plan section §X.Y). Capture orchestrator transcript bytes between the §11.P pre-dispatch banner and the §11 round summary. Sum bytes contributed by 3 Agent tool-use returns (each return-block is what the parent's harness shows for that Agent call, INCLUDING the report content the Agent's final-message text carried). Step 2 — POST-REWRITE (after §03 lands): Re-run /tpr-review against an equivalent-complexity work surface — SAME complexity class as Step 1's baseline (small commit ≈ small commit; one section ≈ same section). Capture orchestrator transcript bytes for the same window. Sum bytes contributed by 3 Bash dispatch results (~0 B each, wrapper redirects to file), 3 Bash extract-report.py results (~175 B each JSON line), 3 Read results (~4.5 KB each report). Step 3 — DELTA: delta = Step1_bytes - Step2_bytes If delta >= 15 KB → mission success criterion satisfied. If delta < 15 KB → file inline §04 blocker per failure response below. Honesty guards: a. Step 1 + Step 2 work surface MUST be equivalent complexity. Comparing a trivial round to a complex baseline manufactures fake savings. Document the work surface for both runs inline. b. Steps 1 + 2 use the SAME `--max-rounds=N` value. Single-round capture is fine — just keep both sides single-round. c. Report content (4-7 KB per reviewer) is preserved across the rewrite — it is on BOTH sides of the delta and cancels out. The delta therefore reflects scaffolding bytes only. -
Honesty guards pinned above — three guards (equivalent surface, same max-rounds, content cancels in delta).
-
Failure response pinned. If post-rewrite delta < 15 KB on representative replay:
- Do NOT silently mark §04 complete.
- File inline as a §04.3 blocker.
- Two acceptable resolutions: (a) revise the
00-overview.md≥15 KB success criterion with explicit reasoning recorded inline (e.g. “actual measurement was 8 KB; the architectural-clarity benefit andtp_agent_prompt.mddeletion still justify the rewrite — accept 8 KB as the measured improvement”), or (b) the rewrite has a context leak that needs a follow-up edit (e.g. wrapper bash chattier than expected; extract-report.py JSON line bigger than estimate; Read calls double-fetching). Either choice gets recorded; silent acceptance is banned.
-
Subsection close-out (02.2) — MANDATORY before starting 02.N:
- Protocol drafted above with concrete 3-step procedure + 3 honesty guards
- Failure response pinned with two acceptable resolutions; silent acceptance banned
- Update this subsection’s
statusin section frontmatter tocomplete - Repo hygiene check — clean
02.N Completion Checklist
- §02.1 baseline table embedded with concrete byte counts from
/tmp/tpr-round-ori_lang-038Mc69x/(codex 4520 / gemini 4285 / opencode 4594; total report content 13,399 B) - §02.2 post-rewrite measurement protocol drafted with 3-step procedure + 3 honesty guards + 2-resolution failure response
- Both subsections marked
status: complete -
python -m scripts.plan_corpus check plans/inline-tpr-transport/section-02-cost-baseline.mdreturns exit 0 (verified after §02 close — see overall plan check below) -
./test-all.shgreen — regression canary (deferred to §04.N — no source files touched in §02) - Plan sync — update plan metadata:
- This section’s frontmatter
status→complete, all subsection statuses →complete -
00-overview.mdQuick Reference: §02 status →Complete -
00-overview.mdmission success criteria: ≥15 KB criterion remains pending (verifies at §04.3); no §02-derived criteria flippable here -
index.md§02 status →Complete
- This section’s frontmatter
- Repo hygiene check — clean
Exit Criteria: Baseline + protocol both filed; python -m scripts.plan_corpus check plans/inline-tpr-transport/section-02-cost-baseline.md exits 0; the ≥15 KB success criterion has a defined verification path. MET 2026-04-25.