Section 07: Integration and Re-scoring
Status: Not Started
Goal: Integrate all improvements from Sections 01-06, re-score all 12 journeys, and build a regression test suite that prevents false positives from returning. All journeys should score ≥ 7.5, with ≤ 2 documented false positives in total.
Context: This section is the validation step — it doesn’t add new analysis capability but proves the system works end-to-end. The existing rescore-report.md documents the V1 algorithm results; this section produces a V2 rescore report showing the false positive reduction.
Depends on: Sections 01 (effect summaries), 02 (CFG balance), 03 (cross-function ownership), 04 (IR parser), 06 (attribute compliance). Section 05 (ARC IR verification) is independent and does not affect Python tool scoring.
NOTE (incremental integration): Integration testing can begin after Phase 0 (Sections 01 + 04). Re-score J8 and J9 after Phase 0 to validate the effect summary and quoted-name fixes before proceeding to Phase 1. This provides an early gate and reduces risk of discovering Phase 0 issues only at Phase 3.
07.1 Pipeline Integration
File(s): `.claude/skills/code-journey/extract-metrics.py`
- Update `extract-metrics.py` imports to include the new modules:

  ```python
  from effect_summaries import RUNTIME_EFFECTS  # Section 01
  from rc_state import analyze_function_cfg     # Section 02
  from arc_metrics import compute_arc_metrics   # updated with 01 + 03
  ```

- Verify that `compute_arc_metrics()` calls `get_effect()` for each callee (Section 01) and computes `module_balanced` (Section 03)
- Add a `--verbose` flag that outputs per-function detail, including:
  - Effect summary matches (which runtime functions were recognized)
  - Conditional RC annotations (which operations are behind branches)
  - Ownership transfer annotations (which imbalances are cross-function)
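One possible shape for the `--verbose` rendering is sketched below. The detail keys (`effect_matches`, `conditional_ops`, `ownership_transfers`) are illustrative assumptions, not the tool's confirmed field names:

```python
def format_verbose(per_function: dict) -> str:
    """Render per-function --verbose detail (illustrative sketch only)."""
    lines = []
    for name, detail in sorted(per_function.items()):
        lines.append(f"function {name}:")
        for match in detail.get("effect_matches", []):
            lines.append(f"  effect summary match: {match}")
        for op in detail.get("conditional_ops", []):
            lines.append(f"  conditional RC op: {op}")
        for transfer in detail.get("ownership_transfers", []):
            lines.append(f"  ownership transfer: {transfer}")
    return "\n".join(lines)

# Hypothetical per-function detail for one journey function:
example = {
    "_ori_main": {
        "effect_matches": ["ori_str_from_raw: +1", "ori_rc_dec: -1"],
        "conditional_ops": ["ori_rc_dec in block if.then"],
        "ownership_transfers": [],
    }
}
report = format_verbose(example)
```

Keeping the verbose path as a pure string-builder like this makes it easy to unit-test without capturing stdout.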
- Ensure the JSON output includes the new fields:

  ```json
  {
    "arc_module_balanced": true,
    "arc_ownership_transfers": 3,
    "arc_conditional_ops": 2,
    "arc_effect_matches": ["ori_str_from_raw: +1", "ori_rc_dec: -1"],
    "attr_not_applicable_count": 5
  }
  ```
- Update the attribute compliance calculation: when Section 06 marks attributes as not applicable for closure functions, `total_applicable` decreases (those checks are excluded from the denominator), which improves the compliance percentage. Verify that `extract-metrics.py` passes through the updated `attr_applicable` count.
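The denominator change can be illustrated with a small sketch. The function name and exact formula here are assumptions matching the description above, not `extract-metrics.py`'s actual code:

```python
def compliance_pct(total_checks: int, passed: int, not_applicable: int) -> float:
    """Attribute compliance with not-applicable checks excluded from the denominator.

    Excluding N/A checks can only raise the percentage when the excluded
    checks would otherwise have counted as failures.
    """
    applicable = total_checks - not_applicable
    if applicable == 0:
        return 100.0  # nothing applicable: vacuously compliant
    return 100.0 * passed / applicable

# 10 checks, 7 passed; 2 of the 3 failures are closure functions marked N/A:
# V1 treats all 10 as applicable (70%), V2 divides by 8 instead (87.5%).
assert compliance_pct(10, 7, 0) == 70.0
assert compliance_pct(10, 7, 2) == 87.5
```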
- Per-function attribute detail: include `not_applicable_reason` in the per-function output for debugging:

  ```python
  for fm in atm.per_function:
      detail = {"checks": len(fm.checks)}
      if fm.not_applicable_reason:
          detail["not_applicable_reason"] = fm.not_applicable_reason
      per_function.setdefault(fm.name, {})["attribute"] = detail
  ```
07.2 Re-score All Journeys
File(s): `.claude/skills/code-journey/rescore-v2.sh` (new, or manual process)
- Extract IR from existing results: use `extract_ir_from_results.py` to pull the LLVM IR from each journey's `*-results.md` file (journey IR is embedded inside the results files, not standalone):

  ```bash
  for results_file in plans/code-journeys/j*/results.md; do
    journey=$(basename "$(dirname "$results_file")")
    python3 extract_ir_from_results.py "$results_file" -o "/tmp/journey_ir/${journey}.ll"
  done
  ```
- Handle missing IR gracefully (some journeys may not have a `#### Generated LLVM IR` section)
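Graceful handling could look like the sketch below. `extract_ir_or_none` is a hypothetical helper; the real `extract_ir_from_results.py` may structure this differently, and the assumption here is that the IR is the first fenced block under the heading:

```python
from typing import Optional

IR_HEADING = "#### Generated LLVM IR"

def extract_ir_or_none(results_text: str) -> Optional[str]:
    """Return the first fenced block under the IR heading, or None if absent."""
    if IR_HEADING not in results_text:
        return None  # journey has no embedded IR: skip it, don't fail
    after = results_text.split(IR_HEADING, 1)[1]
    try:
        _, fenced, _ = after.split("```", 2)
    except ValueError:
        return None  # heading present but no complete fenced block
    # Drop an optional language tag (e.g. "llvm") on the opening fence line
    newline = fenced.find("\n")
    return fenced[newline + 1:] if newline != -1 else fenced
```

Journeys whose results file lacks the heading return `None`, so the rescore loop can log and continue rather than abort.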
- Re-run `extract-metrics.py` on all 12 journey IR files
- Compare V1 vs. V2 scores:

  | Journey | V1 Score | V2 Score (target) | V1 Violations | V2 Violations (target) |
  |---------|----------|-------------------|---------------|------------------------|
  | J1      | 9.8      | ≥ 9.8             | 0             | 0                      |
  | J5      | 6.7      | ≥ 8.5             | 14            | ≤ 3                    |
  | J9      | 7.4      | ≥ 8.5             | 9             | ≤ 2                    |
  | J10     | 5.2      | ≥ 7.5             | 15            | ≤ 5                    |
- Document any remaining violations as genuine issues (not false positives)
- Write `plans/journey-tooling-v2/rescore-v2-report.md` with the full results
- Update `plans/code-journeys/overview.md` with the V2 scores
07.3 Regression Test Suite
File(s): `.claude/skills/code-journey/tests/`
Build a test suite that prevents false positives from returning:
- Golden file tests: store the expected `extract-metrics.py` output for each journey:

  ```
  tests/golden/
  ├── j01_metrics.json
  ├── j05_metrics.json   # closure ownership transfer
  ├── j09_metrics.json   # string effect summaries
  └── j10_metrics.json   # list effect summaries
  ```
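Golden comparisons fail more usefully when the assertion names the drifted keys instead of dumping two large dicts. A helper along these lines (hypothetical, not an existing test utility) could back the golden tests:

```python
def diff_against_golden(actual: dict, golden: dict) -> list:
    """List human-readable differences between actual metrics and a golden file."""
    diffs = []
    for key in sorted(set(actual) | set(golden)):
        if actual.get(key) != golden.get(key):
            diffs.append(f"{key}: golden={golden.get(key)!r} actual={actual.get(key)!r}")
    return diffs

# Synthetic example: a drifted run versus its golden metrics.
golden = {"arc_has_unbalanced": False, "arc_violations": 0}
drifted = {"arc_has_unbalanced": True, "arc_violations": 2}
assert diff_against_golden(golden, golden) == []
assert diff_against_golden(drifted, golden) == [
    "arc_has_unbalanced: golden=False actual=True",
    "arc_violations: golden=0 actual=2",
]
```

Each golden test then reduces to `assert diff_against_golden(actual, golden) == []`, with the failure message carrying the exact drift.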
- Synthetic IR tests: test each false positive category in isolation (this one uses `ori_list_alloc_data`, which returns `ptr` directly, unlike `ori_str_from_raw`, which uses `sret`):

  ```python
  def test_list_alloc_data_counted_as_allocation():
      """ori_list_alloc_data should count as +1 for RC balance."""
      ir = """
      define fastcc void @_ori_build_list() {
      entry:
        %data = call ptr @ori_list_alloc_data(i64 4, i64 8)
        call void @ori_buffer_rc_dec(ptr %data, i64 0, i64 4, i64 8, ptr null)
        ret void
      }
      declare ptr @ori_list_alloc_data(i64, i64)
      declare void @ori_buffer_rc_dec(ptr, i64, i64, i64, ptr)
      """
      module = parse_module(ir)
      metrics = compute_arc_metrics(module)
      # +1 from list_alloc_data, -1 from buffer_rc_dec
      assert metrics.has_unbalanced is False
  ```
- Regression guard: test that J5, J9, and J10 produce `arc_has_unbalanced: false` with the new tooling
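A sketch of that guard, assuming per-journey metrics are saved as `<journey>_metrics.json` files (the layout and helper name are assumptions; the demo below runs against synthetic files rather than real journey output):

```python
import json
import tempfile
from pathlib import Path

AFFECTED = ["j05", "j09", "j10"]  # previously false-positive journeys

def check_no_false_positives(metrics_dir: Path) -> list:
    """Return the journeys whose metrics still report unbalanced ARC."""
    failures = []
    for journey in AFFECTED:
        metrics = json.loads((metrics_dir / f"{journey}_metrics.json").read_text())
        if metrics.get("arc_has_unbalanced", True):
            failures.append(journey)
    return failures

# Self-contained demo with synthetic metrics files (j09 deliberately regressed):
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    for j in AFFECTED:
        (d / f"{j}_metrics.json").write_text(json.dumps({"arc_has_unbalanced": False}))
    (d / "j09_metrics.json").write_text(json.dumps({"arc_has_unbalanced": True}))
    still_failing = check_no_false_positives(d)
# The real test would assert check_no_false_positives(...) == []
```

Defaulting a missing `arc_has_unbalanced` key to `True` makes the guard fail loudly if the field is ever renamed.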
- Full pipeline test: a single pytest test that:
  - Takes a known IR text (from a golden file or inline)
  - Runs `extract_metrics()` (the Python function, not the CLI)
  - Asserts all output fields match expected values
  - Asserts `parse_errors` is `None` or empty
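The shape of that test, sketched with a stand-in for `extract_metrics` so the snippet is self-contained (the real function lives in `extract-metrics.py`; the stub and its return fields here are illustrative assumptions):

```python
# Stand-in so the test shape is runnable here; the real extract_metrics
# parses the IR and computes the full metrics dict.
def extract_metrics(ir_text: str) -> dict:
    return {"arc_has_unbalanced": False, "arc_violations": 0, "parse_errors": None}

KNOWN_IR = """\
define fastcc void @_ori_noop() {
entry:
  ret void
}
"""

def test_full_pipeline():
    metrics = extract_metrics(KNOWN_IR)
    assert metrics["arc_has_unbalanced"] is False  # expected field values
    assert metrics["arc_violations"] == 0
    assert not metrics["parse_errors"]             # None or empty both pass

test_full_pipeline()
```

Calling the Python function rather than the CLI keeps the test fast and lets assertion failures point at individual fields.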
- `score.py` compatibility check: verify that `score.py` can consume the new output fields without breaking. The new JSON fields (`arc_module_balanced`, `arc_ownership_transfers`, `arc_conditional_ops`, `arc_effect_matches`) are informational: `score.py` reads JSON by key and ignores unknown fields. The semantic change is that `arc_violations` and `arc_has_unbalanced` values change (fewer violations). Verify that `score.py`'s scoring formulas produce correct results with the reduced violation counts; this is the intended improvement, not a regression.
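Why the new fields are harmless can be demonstrated with a small sketch. `read_arc_fields` below is a hypothetical stand-in mimicking key-based reads, not `score.py`'s actual code:

```python
import json

def read_arc_fields(metrics_json: str) -> dict:
    """Mimic key-based JSON reads: known keys are used, unknown keys ignored."""
    data = json.loads(metrics_json)
    return {
        "arc_violations": data.get("arc_violations", 0),
        "arc_has_unbalanced": data.get("arc_has_unbalanced", False),
    }

# V2 output carries extra informational fields alongside the old ones:
v2_output = json.dumps({
    "arc_violations": 2,
    "arc_has_unbalanced": False,
    "arc_module_balanced": True,                     # new V2 field
    "arc_effect_matches": ["ori_str_from_raw: +1"],  # new V2 field
})
fields = read_arc_fields(v2_output)
# Only the keys the scorer asks for come back; the V2 additions are invisible.
```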
- Rescore using the existing `rescore-all.sh`: the existing script already handles:
  - Finding `*-results.md` files
  - Extracting IR via `extract_ir_from_results.py`
  - Running `extract-metrics.py`
  - Running `score.py`
  - Generating a comparison report

  The V2 rescore should either:
  - Run the existing script as-is (it will automatically use the updated Python modules), OR
  - Create `rescore-v2.sh`, which adds the `--output-dir` for V2-specific reporting

  The existing report output goes to `plans/journey-scoring-algorithms/rescore-report.md`; the V2 report should go to `plans/journey-tooling-v2/rescore-v2-report.md`.
07.4 Completion Checklist
- `extract-metrics.py` integrates all improvements from Sections 01-06
- All 12 journeys re-scored with the V2 algorithms
- All journeys score ≥ 7.5
- ≤ 2 documented false positives across all journeys in total
- Golden file tests for J5, J9, and J10 (the previously affected journeys)
- Synthetic IR tests for each false positive category
- Full pipeline end-to-end test (IR -> parse -> metrics -> score)
- `score.py` compatibility verified (new JSON fields ignored; reduced `arc_violations` produces higher scores)
- `rescore-v2.sh` script created and tested
- IR extraction from `*-results.md` files works for all 12 journeys
- `rescore-v2-report.md` written with the full comparison
- `python3 -m pytest tests/` passes (all test files)
Exit Criteria: Running `extract-metrics.py` on all 12 journey IR files produces scores where every journey is ≥ 7.5, with J5 ≥ 8.5, J9 ≥ 8.5, and J10 ≥ 7.5. The `rescore-v2-report.md` shows the V1→V2 improvement for each journey and documents any remaining violations as genuine codegen issues.