Section 02: Cost Baseline

Status: Not Started Goal: Provide a quantified before/after for the rewrite’s ≥15 KB context-cost claim. Without baseline, “the rewrite saves context” is unverifiable; with baseline, §04.3 corpus replay can confirm or refute the success criterion exactly.

Success Criteria:

§02.1 records baseline byte counts for one representative round (3-of-3 reviewers status: ok) sourced from a real scratch dir
§02.2 specifies the post-rewrite measurement procedure — what to count, what to exclude, where to find the data
Connects upward to mission criterion: “Context cost per round drops by at least 15 KB measured”

Context: Sub-agent return messages today carry: a scratch_dir: line, the 16 KB-bounded <<<TPR-REPORT … TPR-REPORT>>> block, and a one-line mechanical comment (extraction_tier: / extraction_attempts: / CLI exit code). The 16 KB report content is preserved across the rewrite (extract-report.py owns the cap); what disappears is the sub-agent harness’s own scaffolding around that block — Sonnet’s turn metadata, any model-thinking emissions, the Agent tool-call envelope itself. The savings claim is “scaffolding bytes”, not “content bytes”.

Reference implementations:

.claude/skills/improve-tooling/tpr-review-design.md 2026-04-25 entry — byte-reduction-ratio reporting style for extract-report.py (1.3-3.3 MB → 6-9 KB on codex; same measurement framing applies here)

Depends on: §01 (resolved design surface). Cannot measure replacement shape’s cost before the shape is fixed in §01.2.

Intelligence Reconnaissance

Queries run 2026-04-25:

scripts/intel-query.sh status — graph available (Neo4j 5.26.24); recorded as proof of protocol.
Surface-applicability note: measurement target is orchestrator transcript bytes — outside the symbol-graph index. The graph IS available but NOT APPLICABLE to byte-counting; baseline data lives on disk in prior /tpr-review scratch dirs preserved by §8a mktemp -d.
Surrogate query (filesystem): ls -la /tmp/tpr-round-ori_lang-*/ to inventory available scratches. Historical IDs from tpr-review-design.md 2026-04-25 entry: eG913ATp, QshQsQzR, V78eiBIx, UvWHVYg4.

Results summary (≤500 chars) [repo:.claude/skills/improve-tooling/tpr-review-design.md]: Baseline capture relies on disk-resident scratch dirs from prior /tpr-review runs (§8a mktemp -d preservation). Historical scratches at /tmp/tpr-round-ori_lang-{eG913ATp,QshQsQzR,V78eiBIx,UvWHVYg4}/ are the corpus already validated against extract-report.py per [repo:.claude/skills/improve-tooling/tpr-review-design.md] 2026-04-25 entry. Reuse that corpus rather than dispatching fresh — same data, zero review-cycle cost. No [ori] graph-resident symbols touch this surface.

02.1 Capture baseline (current sub-agent dispatch)

File(s): This section file (measurements recorded inline below)

The baseline is the orchestrator-context noise contributed by the Agent return path PER ROUND, measured against a real prior dispatch.

Located representative scratch dir. Selected /tmp/tpr-round-ori_lang-038Mc69x/ (2026-04-24 08:59) — all three reviewers produced reports cleanly. Inventoried all 19 historical scratches still resident on disk (/tmp/tpr-round-ori_lang-*); 3-of-3 status: ok rounds are common. The four eG913ATp / QshQsQzR / V78eiBIx / UvWHVYg4 scratches referenced in tpr-review-design.md 2026-04-25 entry have been garbage-collected; current available corpus is younger (2026-04-24+).
Per-reviewer measurements computed. Report bytes (preserved across rewrite — content, not noise) and wrapper stdout bytes (eliminated post-rewrite, but already redirected to files NOT to orchestrator’s bash result):
```
| Reviewer | report bytes | extract-report.py JSON | wrapper-stdout-to-file | wrapper bash result (orch ctx) |
|----------|--------------|------------------------|------------------------|--------------------------------|
| codex    |       4,520  |                  ~175  |            1,631,649   |                       0 bytes |
| gemini   |       4,285  |                  ~175  |               38,373   |                       0 bytes |
| opencode |       4,594  |                  ~175  |              268,731   |                       0 bytes |
| TOTAL    |      13,399  |                  ~525  |            1,938,753   |                       0 bytes |
```
Source: wc -c /tmp/tpr-round-ori_lang-038Mc69x/{codex,gemini,opencode}-{report,stdout}.txt + a sample extract-report.py /tmp/.../038Mc69x --reviewer codex JSON line measurement.

Critical observation: invoke-codex.sh (line 17–18) redirects stdout to $RUN/codex-stdout.txt and stderr to $RUN/codex-stderr.txt — the orchestrator’s foreground Bash(bash invoke-codex.sh "$scratch") call returns empty result content (~0 bytes) regardless of the wrapper’s internal stdout volume. The 1.6 MB / 38 KB / 268 KB stdout files NEVER enter orchestrator context — they hit disk and extract-report.py reads them. So the new world’s per-reviewer orchestrator-context cost = wrapper-bash-result (~0 B) + extract-report.py JSON (~175 B) + Read result (4.5 KB report content). Per-round NEW-WORLD orchestrator-context cost ≈ ~525 B + 13,399 B = ~13.9 KB total (mostly content).
Disk-corpus does NOT preserve orchestrator transcript — the savings number is not measurable from disk alone. Today’s orchestrator transcript carries each Agent’s return message (which CONTAINS the report block + scratch_dir line + mechanical comment + any sub-agent prose that leaked through the I22 ban + Agent-tool harness scaffolding the parent harness wraps around the return). The HONEST baseline number = Agent_return_total - Bash_result_total - extract_report_json_total - Read_report_total. Disk has the report bytes (13.4 KB) but NOT the Agent harness scaffolding bytes — those live in the parent’s transcript only.

Action: the ≥15 KB savings success criterion in 00-overview.md is verifiable ONLY at §04.3 corpus replay against a fresh dispatch with a captured orchestrator transcript. §02.1 establishes the content baseline (13.4 KB of report content preserved across the rewrite); §02.2 specifies how §04.3 will measure the scaffolding delta.
Cross-check against ≥15 KB target — partial. Content (preserved) is ~13.4 KB, so the success criterion’s claim is necessarily about scaffolding bytes that exceed the content size. Plausibility check: a Sonnet sub-agent harness commonly contributes ≥5 KB of envelope per reviewer (tool-use metadata, sub-agent thinking summaries, exit-status framing) — 3 reviewers × 5 KB = 15 KB lower bound. The 15 KB target is plausible but not pre-confirmed. If §04.3 measurement falls short, the success criterion gets revised inline with explicit reasoning per §04.3’s failure-response rule — NOT silently weakened.
Subsection close-out (02.1) — MANDATORY before starting 02.2:
- Baseline table embedded above with concrete byte counts from /tmp/tpr-round-ori_lang-038Mc69x/
- Honest disclosure: scaffolding-delta number requires §04.3 fresh-dispatch measurement (orchestrator transcript not preserved on disk)
- Update this subsection’s status in section frontmatter to complete
- Repo hygiene check — clean (probe at §01.1 close-out remains current)

02.2 Define post-rewrite measurement protocol

File(s): This section file (protocol recorded inline below)

The protocol pins what §04.3 must do to confirm the savings claim. Without a written protocol, “we measured it and it was fine” is unfalsifiable.

Post-rewrite measurement procedure (drafted for §04.3):

PROTOCOL: Post-rewrite per-round orchestrator-context cost

Pre-conditions: §03 rewrite landed on disk; one historical scratch from
§02.1 corpus is the comparison anchor (or a fresh equivalent-complexity
replay is captured fresh).

Step 1 — BASELINE (today's world, before §03 lands):
  Run /tpr-review against a representative work surface (small recent
  commit OR a plan section §X.Y).  Capture orchestrator transcript bytes
  between the §11.P pre-dispatch banner and the §11 round summary.  Sum
  bytes contributed by 3 Agent tool-use returns (each return-block is
  what the parent's harness shows for that Agent call, INCLUDING the
  report content the Agent's final-message text carried).

Step 2 — POST-REWRITE (after §03 lands):
  Re-run /tpr-review against an equivalent-complexity work surface —
  SAME complexity class as Step 1's baseline (small commit ≈ small
  commit; one section ≈ same section).  Capture orchestrator transcript
  bytes for the same window.  Sum bytes contributed by 3 Bash dispatch
  results (~0 B each, wrapper redirects to file), 3 Bash extract-report.py
  results (~175 B each JSON line), 3 Read results (~4.5 KB each report).

Step 3 — DELTA:
  delta = Step1_bytes - Step2_bytes
  If delta >= 15 KB → mission success criterion satisfied.
  If delta <  15 KB → file inline §04 blocker per failure response below.

Honesty guards:
  a. Step 1 + Step 2 work surface MUST be equivalent complexity.
     Comparing a trivial round to a complex baseline manufactures fake
     savings.  Document the work surface for both runs inline.
  b. Steps 1 + 2 use the SAME `--max-rounds=N` value.  Single-round
     capture is fine — just keep both sides single-round.
  c. Report content (4-7 KB per reviewer) is preserved across the
     rewrite — it is on BOTH sides of the delta and cancels out.  The
     delta therefore reflects scaffolding bytes only.

Honesty guards pinned above — three guards (equivalent surface, same max-rounds, content cancels in delta).
Failure response pinned. If post-rewrite delta < 15 KB on representative replay:
- Do NOT silently mark §04 complete.
- File inline as a §04.3 blocker.
- Two acceptable resolutions: (a) revise the 00-overview.md ≥15 KB success criterion with explicit reasoning recorded inline (e.g. “actual measurement was 8 KB; the architectural-clarity benefit and tp_agent_prompt.md deletion still justify the rewrite — accept 8 KB as the measured improvement”), or (b) the rewrite has a context leak that needs a follow-up edit (e.g. wrapper bash chattier than expected; extract-report.py JSON line bigger than estimate; Read calls double-fetching). Either choice gets recorded; silent acceptance is banned.
Subsection close-out (02.2) — MANDATORY before starting 02.N:
- Protocol drafted above with concrete 3-step procedure + 3 honesty guards
- Failure response pinned with two acceptable resolutions; silent acceptance banned
- Update this subsection’s status in section frontmatter to complete
- Repo hygiene check — clean

02.N Completion Checklist

Exit Criteria: Baseline + protocol both filed; python -m scripts.plan_corpus check plans/inline-tpr-transport/section-02-cost-baseline.md exits 0; the ≥15 KB success criterion has a defined verification path. MET 2026-04-25.