87%

Section 02: Cost Baseline

Status: Not Started Goal: Provide a quantified before/after for the rewrite’s ≥15 KB context-cost claim. Without baseline, “the rewrite saves context” is unverifiable; with baseline, §04.3 corpus replay can confirm or refute the success criterion exactly.

Success Criteria:

  • §02.1 records baseline byte counts for one representative round (3-of-3 reviewers status: ok) sourced from a real scratch dir
  • §02.2 specifies the post-rewrite measurement procedure — what to count, what to exclude, where to find the data
  • Connects upward to mission criterion: “Context cost per round drops by at least 15 KB measured”

Context: Sub-agent return messages today carry: a scratch_dir: line, the 16 KB-bounded <<<TPR-REPORT … TPR-REPORT>>> block, and a one-line mechanical comment (extraction_tier: / extraction_attempts: / CLI exit code). The 16 KB report content is preserved across the rewrite (extract-report.py owns the cap); what disappears is the sub-agent harness’s own scaffolding around that block — Sonnet’s turn metadata, any model-thinking emissions, the Agent tool-call envelope itself. The savings claim is “scaffolding bytes”, not “content bytes”.

Reference implementations:

  • .claude/skills/improve-tooling/tpr-review-design.md 2026-04-25 entry — byte-reduction-ratio reporting style for extract-report.py (1.3-3.3 MB → 6-9 KB on codex; same measurement framing applies here)

Depends on: §01 (resolved design surface). Cannot measure replacement shape’s cost before the shape is fixed in §01.2.


Intelligence Reconnaissance

Queries run 2026-04-25:

  • scripts/intel-query.sh status — graph available (Neo4j 5.26.24); recorded as proof of protocol.
  • Surface-applicability note: measurement target is orchestrator transcript bytes — outside the symbol-graph index. The graph IS available but NOT APPLICABLE to byte-counting; baseline data lives on disk in prior /tpr-review scratch dirs preserved by §8a mktemp -d.
  • Surrogate query (filesystem): ls -la /tmp/tpr-round-ori_lang-*/ to inventory available scratches. Historical IDs from tpr-review-design.md 2026-04-25 entry: eG913ATp, QshQsQzR, V78eiBIx, UvWHVYg4.

Results summary (≤500 chars) [repo:.claude/skills/improve-tooling/tpr-review-design.md]: Baseline capture relies on disk-resident scratch dirs from prior /tpr-review runs (§8a mktemp -d preservation). Historical scratches at /tmp/tpr-round-ori_lang-{eG913ATp,QshQsQzR,V78eiBIx,UvWHVYg4}/ are the corpus already validated against extract-report.py per [repo:.claude/skills/improve-tooling/tpr-review-design.md] 2026-04-25 entry. Reuse that corpus rather than dispatching fresh — same data, zero review-cycle cost. No [ori] graph-resident symbols touch this surface.


02.1 Capture baseline (current sub-agent dispatch)

File(s): This section file (measurements recorded inline below)

The baseline is the orchestrator-context noise contributed by the Agent return path PER ROUND, measured against a real prior dispatch.

  • Located representative scratch dir. Selected /tmp/tpr-round-ori_lang-038Mc69x/ (2026-04-24 08:59) — all three reviewers produced reports cleanly. Inventoried all 19 historical scratches still resident on disk (/tmp/tpr-round-ori_lang-*); 3-of-3 status: ok rounds are common. The four eG913ATp / QshQsQzR / V78eiBIx / UvWHVYg4 scratches referenced in tpr-review-design.md 2026-04-25 entry have been garbage-collected; current available corpus is younger (2026-04-24+).

  • Per-reviewer measurements computed. Report bytes (preserved across rewrite — content, not noise) and wrapper stdout bytes (eliminated post-rewrite, but already redirected to files NOT to orchestrator’s bash result):

    | Reviewer | report bytes | extract-report.py JSON | wrapper-stdout-to-file | wrapper bash result (orch ctx) |
    |----------|--------------|------------------------|------------------------|--------------------------------|
    | codex    |       4,520  |                  ~175  |            1,631,649   |                       0 bytes |
    | gemini   |       4,285  |                  ~175  |               38,373   |                       0 bytes |
    | opencode |       4,594  |                  ~175  |              268,731   |                       0 bytes |
    | TOTAL    |      13,399  |                  ~525  |            1,938,753   |                       0 bytes |

    Source: wc -c /tmp/tpr-round-ori_lang-038Mc69x/{codex,gemini,opencode}-{report,stdout}.txt + a sample extract-report.py /tmp/.../038Mc69x --reviewer codex JSON line measurement.

    Critical observation: invoke-codex.sh (line 17–18) redirects stdout to $RUN/codex-stdout.txt and stderr to $RUN/codex-stderr.txt — the orchestrator’s foreground Bash(bash invoke-codex.sh "$scratch") call returns empty result content (~0 bytes) regardless of the wrapper’s internal stdout volume. The 1.6 MB / 38 KB / 268 KB stdout files NEVER enter orchestrator context — they hit disk and extract-report.py reads them. So the new world’s per-reviewer orchestrator-context cost = wrapper-bash-result (~0 B) + extract-report.py JSON (~175 B) + Read result (4.5 KB report content). Per-round NEW-WORLD orchestrator-context cost ≈ ~525 B + 13,399 B = ~13.9 KB total (mostly content).

  • Disk-corpus does NOT preserve orchestrator transcript — the savings number is not measurable from disk alone. Today’s orchestrator transcript carries each Agent’s return message (which CONTAINS the report block + scratch_dir line + mechanical comment + any sub-agent prose that leaked through the I22 ban + Agent-tool harness scaffolding the parent harness wraps around the return). The HONEST baseline number = Agent_return_total - Bash_result_total - extract_report_json_total - Read_report_total. Disk has the report bytes (13.4 KB) but NOT the Agent harness scaffolding bytes — those live in the parent’s transcript only.

    Action: the ≥15 KB savings success criterion in 00-overview.md is verifiable ONLY at §04.3 corpus replay against a fresh dispatch with a captured orchestrator transcript. §02.1 establishes the content baseline (13.4 KB of report content preserved across the rewrite); §02.2 specifies how §04.3 will measure the scaffolding delta.

  • Cross-check against ≥15 KB target — partial. Content (preserved) is ~13.4 KB, so the success criterion’s claim is necessarily about scaffolding bytes that exceed the content size. Plausibility check: a Sonnet sub-agent harness commonly contributes ≥5 KB of envelope per reviewer (tool-use metadata, sub-agent thinking summaries, exit-status framing) — 3 reviewers × 5 KB = 15 KB lower bound. The 15 KB target is plausible but not pre-confirmed. If §04.3 measurement falls short, the success criterion gets revised inline with explicit reasoning per §04.3’s failure-response rule — NOT silently weakened.

  • Subsection close-out (02.1) — MANDATORY before starting 02.2:

    • Baseline table embedded above with concrete byte counts from /tmp/tpr-round-ori_lang-038Mc69x/
    • Honest disclosure: scaffolding-delta number requires §04.3 fresh-dispatch measurement (orchestrator transcript not preserved on disk)
    • Update this subsection’s status in section frontmatter to complete
    • Repo hygiene check — clean (probe at §01.1 close-out remains current)

02.2 Define post-rewrite measurement protocol

File(s): This section file (protocol recorded inline below)

The protocol pins what §04.3 must do to confirm the savings claim. Without a written protocol, “we measured it and it was fine” is unfalsifiable.

  • Post-rewrite measurement procedure (drafted for §04.3):

    PROTOCOL: Post-rewrite per-round orchestrator-context cost
    
    Pre-conditions: §03 rewrite landed on disk; one historical scratch from
    §02.1 corpus is the comparison anchor (or a fresh equivalent-complexity
    replay is captured fresh).
    
    Step 1 — BASELINE (today's world, before §03 lands):
      Run /tpr-review against a representative work surface (small recent
      commit OR a plan section §X.Y).  Capture orchestrator transcript bytes
      between the §11.P pre-dispatch banner and the §11 round summary.  Sum
      bytes contributed by 3 Agent tool-use returns (each return-block is
      what the parent's harness shows for that Agent call, INCLUDING the
      report content the Agent's final-message text carried).
    
    Step 2 — POST-REWRITE (after §03 lands):
      Re-run /tpr-review against an equivalent-complexity work surface —
      SAME complexity class as Step 1's baseline (small commit ≈ small
      commit; one section ≈ same section).  Capture orchestrator transcript
      bytes for the same window.  Sum bytes contributed by 3 Bash dispatch
      results (~0 B each, wrapper redirects to file), 3 Bash extract-report.py
      results (~175 B each JSON line), 3 Read results (~4.5 KB each report).
    
    Step 3 — DELTA:
      delta = Step1_bytes - Step2_bytes
      If delta >= 15 KB → mission success criterion satisfied.
      If delta <  15 KB → file inline §04 blocker per failure response below.
    
    Honesty guards:
      a. Step 1 + Step 2 work surface MUST be equivalent complexity.
         Comparing a trivial round to a complex baseline manufactures fake
         savings.  Document the work surface for both runs inline.
      b. Steps 1 + 2 use the SAME `--max-rounds=N` value.  Single-round
         capture is fine — just keep both sides single-round.
      c. Report content (4-7 KB per reviewer) is preserved across the
         rewrite — it is on BOTH sides of the delta and cancels out.  The
         delta therefore reflects scaffolding bytes only.
  • Honesty guards pinned above — three guards (equivalent surface, same max-rounds, content cancels in delta).

  • Failure response pinned. If post-rewrite delta < 15 KB on representative replay:

    • Do NOT silently mark §04 complete.
    • File inline as a §04.3 blocker.
    • Two acceptable resolutions: (a) revise the 00-overview.md ≥15 KB success criterion with explicit reasoning recorded inline (e.g. “actual measurement was 8 KB; the architectural-clarity benefit and tp_agent_prompt.md deletion still justify the rewrite — accept 8 KB as the measured improvement”), or (b) the rewrite has a context leak that needs a follow-up edit (e.g. wrapper bash chattier than expected; extract-report.py JSON line bigger than estimate; Read calls double-fetching). Either choice gets recorded; silent acceptance is banned.
  • Subsection close-out (02.2) — MANDATORY before starting 02.N:

    • Protocol drafted above with concrete 3-step procedure + 3 honesty guards
    • Failure response pinned with two acceptable resolutions; silent acceptance banned
    • Update this subsection’s status in section frontmatter to complete
    • Repo hygiene check — clean

02.N Completion Checklist

  • §02.1 baseline table embedded with concrete byte counts from /tmp/tpr-round-ori_lang-038Mc69x/ (codex 4520 / gemini 4285 / opencode 4594; total report content 13,399 B)
  • §02.2 post-rewrite measurement protocol drafted with 3-step procedure + 3 honesty guards + 2-resolution failure response
  • Both subsections marked status: complete
  • python -m scripts.plan_corpus check plans/inline-tpr-transport/section-02-cost-baseline.md returns exit 0 (verified after §02 close — see overall plan check below)
  • ./test-all.sh green — regression canary (deferred to §04.N — no source files touched in §02)
  • Plan sync — update plan metadata:
    • This section’s frontmatter statuscomplete, all subsection statuses → complete
    • 00-overview.md Quick Reference: §02 status → Complete
    • 00-overview.md mission success criteria: ≥15 KB criterion remains pending (verifies at §04.3); no §02-derived criteria flippable here
    • index.md §02 status → Complete
  • Repo hygiene check — clean

Exit Criteria: Baseline + protocol both filed; python -m scripts.plan_corpus check plans/inline-tpr-transport/section-02-cost-baseline.md exits 0; the ≥15 KB success criterion has a defined verification path. MET 2026-04-25.