
05 Code Graph: Parser Adapters

05.0 Goal

Set up the tree-sitter parsing infrastructure that all code graph work depends on. This section delivers three things: (1) reliable grammar loading for all 9 supported languages, (2) a formal adapter API contract that Section 06 consumes, and (3) query file families (not just tags.scm) that prepare for relationship extraction. The section does NOT extract symbols or import into Neo4j — it ensures every repo can be parsed and that the adapter layer exposes everything downstream sections need.

Success Criteria:

  • All 9 tree-sitter grammars load successfully with pinned, compatible versions (8 PyPI + 1 source-built Koka with scanner patch)
  • Parser adapter API contract exposes repo_id, language_id, relative_path, source_bytes, byte_count, tree, had_error, error_node_count, query_handles, coverage_status, content_hash
  • Query file families (decls.scm, calls.scm, imports.scm, impls.scm) exist for every supported language (32 files across 8 languages + lean symlink to cpp)
  • Matrix validation: Language x (Valid/Malformed/Empty) x query family all pass (108/108)
  • Full parse of all reference repos completes in <60 seconds (28.1s actual)
  • Unblocks mission criteria: “tree-sitter parses all 9 supported languages” (parsing half — extraction is Section 06’s deliverable)

Context: Section 06 (Symbol Extraction) needs more than just parse trees — it needs compiled query handles for declarations, calls, imports, and implementations. If Section 05 only delivers tags.scm parsing, Section 06 must reinvent query infrastructure. This section front-loads that work.

Reference implementations:

  • Sourcegraph SCIP: Multi-language indexing with per-language adapter pattern
  • nvim-treesitter: Query file organization (queries/{lang}/{tags,highlights,locals}.scm)

Depends on: None (independent pillar start).


05.1 Python Dependencies & Version Compatibility

File(s): ~/projects/lang_intelligence/.venv/, ~/projects/lang_intelligence/requirements.txt

Grammar packages pin different tree-sitter core versions. A blanket pip install tree-sitter>=0.25.0 will fail or produce ABI mismatches. The correct approach: pin exact versions after a compatibility smoke test.

Modern tree-sitter Python API (0.22+): The build_library() / Language() pattern from pre-0.22 is deprecated. In tree-sitter 0.22+, grammar packages expose a language() function directly:

# Modern API (tree-sitter >= 0.22)
import tree_sitter_rust
from tree_sitter import Language, Parser

RUST = Language(tree_sitter_rust.language())
parser = Parser(RUST)
tree = parser.parse(source_bytes)

There is NO shared library building step. Grammar packages are Python modules with compiled bindings.

  • Create requirements.txt with exact pinned versions. Start with latest compatible set and run smoke test:

    tree-sitter==0.25.2
    tree-sitter-rust==0.24.2
    tree-sitter-go==0.25.0
    tree-sitter-zig==1.1.2
    tree-sitter-typescript==0.23.2
    tree-sitter-haskell==0.23.1
    tree-sitter-swift==0.0.1
    tree-sitter-cpp==0.23.4

    Version selection rule: use the latest versions available from PyPI (verified 2026-04-13); core 0.25.2 is compatible with all grammar packages.

  • Run compatibility smoke test: for each grammar package, Language(mod.language()) must succeed, and Parser(lang).parse(b"") must return a tree without segfault
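A minimal sketch of that smoke test, assuming the pinned package set from requirements.txt (module names follow the tree_sitter_<lang> convention; the helper name is illustrative, not the real validate-parsers.py implementation):

```python
import importlib

def smoke_test(modules: dict[str, str]) -> dict[str, bool]:
    """Return {lang_id: True/False}: True iff the grammar package imports,
    wraps into a Language, and parses an empty buffer without crashing."""
    results = {}
    for lang_id, mod_name in modules.items():
        try:
            from tree_sitter import Language, Parser  # core bindings (0.25.x)
            mod = importlib.import_module(mod_name)   # e.g. tree_sitter_rust
            lang = Language(mod.language())
            tree = Parser(lang).parse(b"")
            results[lang_id] = tree.root_node is not None
        except Exception:
            results[lang_id] = False  # load/ABI failure -> hard error upstream
    return results

# Example: smoke_test({"rust": "tree_sitter_rust", "go": "tree_sitter_go"})
```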

  • Record the compatibility matrix in requirements.txt comments:

    # Compatibility matrix (verified 2026-04-13):
    # tree-sitter-rust       0.24.2 + core 0.25.2: OK
    # tree-sitter-go         0.25.0 + core 0.25.2: OK
    # tree-sitter-zig        1.1.2  + core 0.25.2: OK
    # tree-sitter-typescript 0.23.2 + core 0.25.2: OK
    # tree-sitter-haskell    0.23.1 + core 0.25.2: OK
    # tree-sitter-swift      0.0.1  + core 0.25.2: OK
    # tree-sitter-cpp        0.23.4 + core 0.25.2: OK
    # tree-sitter-koka       0.1.0 (source, scanner patch) + core 0.25.2: OK
  • Swift grammar: tree-sitter-swift==0.0.1 from PyPI loads successfully with core 0.25.2. No source build needed.

  • Koka grammar: NOT on PyPI. Cloned koka-community/tree-sitter-koka and installed from source. Required patching setup.py to include src/scanner.c (upstream bug: external scanner not listed in ext_modules sources). Requires python3-dev headers. Loads successfully after patch.

  • Verify all grammars load: created scripts/validate-parsers.py with --smoke mode. 8/8 grammars pass. Skips coverage_status: custom entries (Ori).

  • Document: Lean .lean files have 86% parse error rate. Lean4 repo is parsed via C++ grammar for runtime code only. Coverage status: partial (not “unsupported” — some of the repo IS parseable via C++ grammar). Documented in languages.yaml (created in 05.2).

  • Document: Ori uses its own Rust parser (no tree-sitter grammar). Ori adapter is implemented in Section 09.3. Ori appears in languages.yaml with grammar: native and coverage_status: custom. validate-parsers.py skips coverage_status: custom entries.

  • Create scripts/setup-parsers.sh that automates: venv creation, pip install -r requirements.txt, Koka source build (with scanner patch), validate-parsers.py --smoke run. Supports --verbose and --skip-koka flags.

  • Subsection close-out (05.1)

    • All tasks above are [x] and all grammars load via smoke test
    • Update this subsection’s status in section frontmatter to complete
    • Run /improve-tooling retrospectively on THIS subsection — Retrospective 05.1: Koka scanner patch was the main friction point (upstream bug in setup.py). setup-parsers.sh already has --verbose flag. No additional tooling gaps — validate-parsers.py --smoke gives clear per-grammar pass/fail.
    • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files. Clean (2026-04-13).

05.2 Language Adapter Manifests

File(s): ~/projects/lang_intelligence/languages.yaml, ~/projects/lang_intelligence/repos.yaml

These manifests are the single source of truth for the entire code graph pipeline. Every downstream script (Section 06 extraction, Section 07 import, Section 09 sync) reads them. Getting the schema right here prevents cascading fixes later.

languages.yaml — per-language capabilities:

rust:
  grammar: tree-sitter-rust       # pip package name or "source" or "native"
  grammar_version: "0.24.2"       # pinned version (must match requirements.txt)
  extensions: [".rs"]
  query_families:                  # which .scm query files exist for this language
    - decls                        # declarations (functions, types, traits, etc.)
    - calls                        # call sites
    - imports                      # use/import statements
    - impls                        # impl/instance/conformance blocks
  coverage_status: full            # full | partial | custom
  maturity: stable
  expected_error_rate: 0.09
  notes: ""

# Ori — native parser, not tree-sitter
ori:
  grammar: native
  extensions: [".ori"]
  query_families: []               # N/A — uses Ori's own Rust parser via FFI
  coverage_status: custom
  maturity: stable
  expected_error_rate: 0.0
  notes: "Parsed by ori_parse (Rust). Adapter in Section 09."

lean:
  grammar: tree-sitter-cpp         # .lean files skipped; only C++ runtime parsed
  extensions: [".cpp", ".h"]       # NOT .lean
  query_families: [decls, calls, imports, impls]  # all C++ query families
  coverage_status: partial
  maturity: stable
  expected_error_rate: 0.02
  notes: ".lean files have 86% error rate — skipped. C++ runtime code only."

repos.yaml — per-repo source mapping:

The local corpus has both go/ (issue tracker only) and golang/ (source code). The manifest MUST canonicalize these:

go:
  repo_id: go                                    # canonical ID used in Neo4j
  source_root: ${REFERENCE_REPOS_ROOT}/golang    # resolved at runtime by adapter
  issue_root: ${REFERENCE_REPOS_ROOT}/go         # resolved at runtime by adapter
  languages: [go]
  include:
    - cmd/compile/
    - go/types/
    - internal/types/
  exclude:
    - test/
    - vendor/
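The ${VAR} templates above are expanded at runtime; a sketch of the resolver (the helper name matches 05.3's resolve_repo_path, but the default roots shown are placeholders, not the real defaults):

```python
import os
import re

# Placeholder fallbacks -- the real defaults live in parser_adapter.py.
_DEFAULTS = {
    "REFERENCE_REPOS_ROOT": os.path.expanduser("~/reference_repos"),
    "LANG_INTELLIGENCE_ROOT": os.path.expanduser("~/lang_intelligence"),
    "ORI_LANG_ROOT": os.path.expanduser("~/ori"),
}

_VAR_RE = re.compile(r"\$\{([A-Z_]+)\}")

def resolve_repo_path(template: str) -> str:
    """Expand ${VAR} placeholders: environment first, then defaults.
    Unknown variables are left intact rather than silently dropped."""
    def repl(match: re.Match) -> str:
        name = match.group(1)
        return os.environ.get(name, _DEFAULTS.get(name, match.group(0)))
    return _VAR_RE.sub(repl, template)
```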
  • Create languages.yaml with all 10 language configs (9 tree-sitter + Ori native), including all required fields. TypeScript has custom module_name/language_func fields for its language_typescript() API.

  • Create repos.yaml with curated include/exclude roots for all 11 repos. All paths use ${REFERENCE_REPOS_ROOT}, ${ORI_LANG_ROOT}, ${LANG_INTELLIGENCE_ROOT} env-var patterns.

  • Canonicalize the go/golang duality: repo_id: go, source_root: ${REFERENCE_REPOS_ROOT}/golang, issue_root: ${REFERENCE_REPOS_ROOT}/go

  • For mixed-language repos, all applicable languages listed:

    • Gleam: [rust], Roc: [rust], Elm: [haskell], Koka: [haskell, koka], Lean4: [cpp], Swift: [swift, cpp]
  • Validate: every languages: entry in repos.yaml references a valid key in languages.yaml (13/13 refs OK)

  • Validate: every source_root path exists on disk (11/11 OK); every issue_root path exists (11/11 OK)

  • Subsection close-out (05.2)

    • All tasks above are [x] and both manifests validate
    • Update this subsection’s status in section frontmatter to complete
    • Run /improve-tooling retrospectively on THIS subsection — Retrospective 05.2: Validation is currently inline Python. Built into parser_adapter.py in 05.3 (resolve_repo_path + manifest loading). No separate validate-manifests.py needed — validate-parsers.py --smoke already covers grammar loading, and the inline validation covered path resolution. No tooling gaps.
    • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files. Clean (2026-04-13).

05.3 Parser Adapter API Contract

File(s): ~/projects/lang_intelligence/neo4j/parser_adapter.py

The parser adapter is the boundary between “raw tree-sitter” and “everything downstream.” Section 06 (extraction), Section 07 (import), and Section 09 (sync) all consume this API. The contract must be explicit, typed, and documented.

Adapter output contract (per file):

@dataclass
class ParseResult:
    repo_id: str               # canonical repo identifier from repos.yaml
    language_id: str           # language key from languages.yaml
    relative_path: str         # path relative to source_root
    source_bytes: bytes        # raw file content (needed by Section 06 for qualified names, signature_hash)
    byte_count: int            # len(source_bytes)
    tree: Tree | None          # tree-sitter Tree (None on load failure)
    had_error: bool            # True if tree contains ERROR nodes
    error_node_count: int      # count of ERROR nodes in tree
    query_handles: dict[str, Query]  # compiled queries by family name
    coverage_status: str       # "full" | "partial" | "custom"
    content_hash: str          # SHA-256 of file content (for incremental sync)

class CoverageStatus(Enum):
    FULL = "full"              # grammar parses this language well
    PARTIAL = "partial"        # grammar has known gaps (e.g., Lean C++ only)
    CUSTOM = "custom"          # not tree-sitter (e.g., Ori native parser)

Error handling policy:

  • Per-file parse failures (I/O error, encoding error): soft — skip file, log warning, continue. A single bad file must NOT abort the pipeline.

  • Grammar load failures (missing package, ABI mismatch): hard — abort immediately with clear error message. A broken grammar affects ALL files for that language.

  • Query compilation failures (malformed .scm file): hard — abort immediately. A broken query produces wrong extraction results silently.
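The two per-file computations behind this contract — counting ERROR nodes and hashing content — are small; a sketch, duck-typed against tree-sitter's Node API (node.type == "ERROR" marks parse errors; the real adapter may also use the node.has_error fast path):

```python
import hashlib

def count_error_nodes(node) -> int:
    """Recursively count ERROR nodes under a tree-sitter node."""
    count = 1 if node.type == "ERROR" else 0
    for child in node.children:
        count += count_error_nodes(child)
    return count

def content_hash(source_bytes: bytes) -> str:
    """Deterministic SHA-256 of the raw bytes, for incremental sync (09)."""
    return hashlib.sha256(source_bytes).hexdigest()
```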

  • Implement ParseResult dataclass with all fields listed above

  • Implement CoverageStatus enum

  • Implement parse_file(repo_config, lang_config, file_path) -> ParseResult that:

    • Loads grammar from languages.yaml config (with caching)
    • Reads file bytes (soft-fail on I/O/encoding errors — returns None)
    • Parses with tree-sitter
    • Counts ERROR nodes recursively
    • Compiles and attaches query handles for all query families listed in languages.yaml (with caching)
    • Computes SHA-256 content hash (for Section 09 incremental sync)
  • Implement parse_repo(repo_id) -> Iterator[ParseResult] that:

    • Reads repos.yaml for include/exclude patterns
    • Walks the file tree, filtering by extensions from languages.yaml
    • Calls parse_file for each matching file
    • Logs per-file soft failures without aborting
    • Skips coverage_status: custom languages (native parsers)
  • Implement hard error handling: grammar load failures raise RuntimeError with module/func context; query compilation failures raise RuntimeError with .scm path. Per-file I/O errors return None with warning log.

  • --parallel flag: removed unused parameter from parse_repo — sequential parsing meets the <60s target (28.1s actual). ProcessPoolExecutor can be added to validate-parsers.py --full if needed in the future.

  • Implement resolve_repo_path(template) that expands ${REFERENCE_REPOS_ROOT}, ${LANG_INTELLIGENCE_ROOT}, and ${ORI_LANG_ROOT} env vars via regex substitution. Checks env vars first, falls back to defaults. Also exposed load_manifests() for downstream consumers.

  • Verify adapter output: smoke test on Gleam repo (188 files, 0 errors). All ParseResult fields populated. Content hash deterministic (SHA-256, same file = same hash on re-parse).

  • TPR checkpoint: /tpr-review covering 05.1–05.3 implementation work (superseded by full-section TPR in 05.N — iter-4 clean pass on 2026-04-12)

  • Subsection close-out (05.3)

    • All tasks above are [x] and adapter API is documented and tested
    • Update this subsection’s status in section frontmatter to complete
    • Run /improve-tooling retrospectively on THIS subsection — Retrospective 05.3: Query file warning spam was noisy during testing (expected — no query files yet). Added logging properly so downstream consumers control verbosity. Grammar cache and query cache prevent redundant loads. No tooling gaps.
    • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files. Clean (2026-04-13).

05.4 Query File Families

File(s): ~/projects/lang_intelligence/queries/{lang}/{family}.scm

Official tags.scm files vary by language in what they capture. Some (Rust, Go) already include @reference.call and @reference.implementation captures alongside declarations. Others (TypeScript, Swift) primarily capture declarations. This subsection standardizes query file families for ALL languages, adapting existing upstream queries where possible and writing custom ones where needed.

Query families:

  • decls.scm — declarations: functions, types, traits, methods, constants
  • calls.scm — call expressions: function calls, method calls
  • imports.scm — import/use/require statements
  • impls.scm — impl blocks, interface conformance, instance declarations

Per-language query file status:

| Language   | decls.scm                  | calls.scm                      | imports.scm                   | impls.scm                                | Source                 |
|------------|----------------------------|--------------------------------|-------------------------------|------------------------------------------|------------------------|
| Rust       | Official (has decls)       | Official (has @reference.call) | Custom                        | Official (has @reference.implementation) | tree-sitter-rust       |
| Go         | Official (has decls)       | Official (has @reference.call) | Official (has package/import) | N/A (implicit)                           | tree-sitter-go         |
| Zig        | Custom (no official tags)  | Custom                         | Custom                        | N/A                                      | tree-sitter-zig        |
| TypeScript | Official tags.scm adapted  | Custom                         | Custom                        | Custom                                   | tree-sitter-typescript |
| Haskell    | Custom (no official tags)  | Custom                         | Custom                        | Custom                                   | tree-sitter-haskell    |
| Swift      | Official tags.scm adapted  | Custom                         | Custom                        | Custom                                   | tree-sitter-swift      |
| C++        | Official tags.scm adapted  | Custom                         | Custom                        | N/A                                      | tree-sitter-cpp        |
| Koka       | Custom (if grammar works)  | Custom                         | Custom                        | Custom                                   | tree-sitter-koka       |

Implementation approach:

  1. For languages WITH official tags.scm: adapt/rename to decls.scm, then write calls.scm, imports.scm, impls.scm from scratch using each grammar’s node-types.json as reference.
  2. For languages WITHOUT official tags.scm (Zig, Haskell, Koka): write all four families from scratch.
  3. Some families may be empty stubs for some languages (e.g., Go has no explicit impl blocks — impls.scm is empty). Empty stubs are valid — they return zero captures. The adapter contract handles this gracefully.
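Running one family against one parsed file is the same few lines for every language; a sketch assuming the py-tree-sitter 0.25 Query/QueryCursor API and the queries/{lang}/{family}.scm layout (helper name illustrative):

```python
from pathlib import Path

def capture_counts(language, tree, lang_id: str, family: str,
                   queries_root: Path = Path("queries")) -> dict[str, int]:
    """Compile one query family and count captures per capture name.
    An empty stub compiles cleanly and simply yields no captures."""
    source = (queries_root / lang_id / f"{family}.scm").read_text()
    from tree_sitter import Query, QueryCursor  # py-tree-sitter >= 0.25
    query = Query(language, source)             # raises on malformed .scm
    captures = QueryCursor(query).captures(tree.root_node)  # name -> [Node]
    return {name: len(nodes) for name, nodes in captures.items()}
```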
  • Rust (queries/rust/): decls.scm (function_item, struct_item, enum_item, type_item, trait_item, const_item, static_item, mod_item, macro_definition), calls.scm (call_expression, macro_invocation), imports.scm (use_declaration), impls.scm (impl_item). 98 decls / 1356 calls / 12 imports / 2 impls on analyse.rs.

  • Go (queries/go/): decls.scm (function_declaration, method_declaration, type_declaration, const_declaration, var_declaration), calls.scm (call_expression), imports.scm (import_declaration, package_clause). impls.scm is empty stub. 60 decls / 28 calls / 2 imports.

  • Zig (queries/zig/): All four from scratch. decls.scm (function_declaration, variable_declaration, container_field), calls.scm (call_expression + field_expression), imports.scm (builtin_function @import). impls.scm empty stub. 1378 decls / 4664 calls / 93 imports.

  • TypeScript (queries/typescript/): decls.scm (function_declaration, class_declaration, interface_declaration, type_alias_declaration, enum_declaration, method_definition), calls.scm (call_expression, new_expression), imports.scm (import_statement, export with source), impls.scm (implements_clause). 822 decls / 2076 calls / 4 imports.

  • Haskell (queries/haskell/): All from scratch. decls.scm (function, signature, data_type, newtype, type_synomym), calls.scm (apply + variable), imports.scm (import + module), impls.scm (instance). 20 decls / 78 calls / 34 imports.

  • Swift (queries/swift/): decls.scm (function_declaration, class_declaration, protocol_declaration, typealias_declaration, property_declaration — note: tree-sitter-swift 0.0.1 lacks struct_declaration and enum_declaration), calls.scm (call_expression), imports.scm (import_declaration), impls.scm (inheritance_specifier). 38 decls / 32 calls / 8 imports / 6 impls.

  • C++ (queries/cpp/): decls.scm (function_definition, class_specifier, struct_specifier, enum_specifier, namespace_definition with namespace_identifier), calls.scm (call_expression), imports.scm (preproc_include). impls.scm empty stub. 66 decls / 214 calls / 10 imports.

  • Koka (queries/koka/): Grammar loaded (with scanner patch from 05.1). decls.scm (fundecl, puredecl, typedecl), calls.scm (opexpr/atom/name/qidentifier), imports.scm (import + modulepath), impls.scm empty stub. 24 decls / 212 calls. Koka grammar works for .kk files.

  • Test each query file against at least one real file from its repo. 29/32 produce captures; 3 WARN are test-file selection (files lacking instances/imports — not query bugs). All 32 queries compile. 4 declared stubs return zero captures.

  • Create golden file probes: tests/golden-probes.yaml with 8 probes (one per language), recording expected capture counts per query family with 10% tolerance.
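The 10% tolerance check is worth pinning down, since a zero baseline (a declared stub) must not tolerate any drift; a sketch with an illustrative helper name:

```python
def within_tolerance(expected: int, actual: int, tol: float = 0.10) -> bool:
    """Golden-probe check: actual capture count within +/- tol of baseline.
    A zero baseline (declared stub) must stay exactly zero."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) <= expected * tol
```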

  • Subsection close-out (05.4)

    • All tasks above are [x] and all query files compile (non-stubs produce captures, declared stubs return zero captures)
    • Update this subsection’s status in section frontmatter to complete
    • Run /improve-tooling retrospectively on THIS subsection — Retrospective 05.4: Node type discovery was the main friction (had to iterate multiple times fixing “Impossible pattern” errors). The inline test script used for validation should be formalized into validate-parsers.py --matrix in 05.5. Key lesson: always check named node types from the grammar BEFORE writing queries (the Language.node_kind_for_id() API). tree-sitter-swift 0.0.1 lacks struct_declaration/enum_declaration — noted in query file comments.
    • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files. Clean (2026-04-13).

05.5 Parse Validation & Matrix Testing

File(s): ~/projects/lang_intelligence/scripts/validate-parsers.py

A comprehensive validation script that tests the full parser adapter stack: grammar loading, file parsing, query compilation, and capture accuracy.

Matrix dimensions:

  1. Language (9 tree-sitter languages)
  2. File condition (Valid source / Malformed source / Empty file)
  3. Query family (decls / calls / imports / impls)
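Each matrix cell reduces to one parse plus a condition-specific behavioral assertion (the kind TPR-05-002-codex required); a sketch, duck-typed against Parser.parse and Node.has_error:

```python
def check_cell(parser, source: bytes, condition: str) -> None:
    """Assert the behavior one (language, condition) cell demands.
    condition is "valid", "malformed", or "empty"; raises AssertionError."""
    tree = parser.parse(source)  # must never crash, whatever the input
    assert tree.root_node is not None
    if condition == "valid":
        assert not tree.root_node.has_error, "valid source must parse clean"
    elif condition == "malformed":
        assert tree.root_node.has_error, "malformed source needs ERROR nodes"
    elif condition == "empty":
        assert source == b"" and not tree.root_node.has_error
```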
  • Implement validate-parsers.py with all four test modes:

    • --smoke: 8/8 grammars, <0.02s (target <5s)
    • --matrix: 108/108 cells pass, 0.53s (target <30s). Tests 9 languages x 3 conditions x 4 families.
    • --full: 5581 files, 131.7 MB, 28.1s (target <60s). Reports per-repo: files, MB, errors, rate, throughput.
    • --golden: 8 probes pass, 0.26s. All capture counts within 10% tolerance.
  • Malformed file handling: Created tests/malformed/ with deliberately broken files. Matrix test verifies parser produces tree with ERROR nodes (no crash). --matrix includes valid/malformed/empty conditions.

  • Empty file handling: --matrix tests empty files for all languages. Parser produces empty tree, no crash.

  • Error rate validation: --full reports per-repo error rates. Key rates: Gleam 0%, Rust 5.9%, Go 4.5%, Swift 28.7% (0.0.1 grammar), Lean 73.4% (C++ grammar for runtime only).

  • Query compilation validation: --matrix compiles all .scm files per language x family. Reports stubs separately. All non-stubs produce captures on valid source.

  • Performance reporting: --full reports files/sec per repo. Aggregate: ~200 files/sec, 28.1s total. Zig slowest (23 files/s, large files), Gleam fastest (660 files/s, small files).

  • Golden file probes: 8 probes in tests/golden-probes.yaml (one per language). Baseline captured 2026-04-13. 10% tolerance for grammar version drift.

  • Incremental hashing verification: SHA-256 content hash is deterministic (verified in 05.3 smoke test — same file = same hash on re-parse).

  • Grammar update policy: Documented in validate-parsers.py --help (module docstring). Run --golden before/after grammar bump. CI gate: --matrix on every version bump.

  • Subsection close-out (05.5)

    • All tasks above are [x] and validate-parsers.py --matrix passes (108/108)
    • Update this subsection’s status in section frontmatter to complete
    • Run /improve-tooling retrospectively on THIS subsection — Retrospective 05.5: Script output is clear for human consumption. JSON output for CI could be useful but not blocking (no CI pipeline yet). BLESS=1 for golden probes is a nice-to-have for 05.N. No urgent tooling gaps.
    • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files. Clean (2026-04-13).

05.R Third Party Review Findings

  • [TPR-05-001-codex][high] plans/lang-intelligence/section-06-symbol-extraction.md:53 — Close the LEAK between ParseResult and Section 06. Evidence: Section 06.2 reads repos.yaml/languages.yaml/tags.scm directly instead of consuming the ParseResult adapter. Resolved: Fixed on 2026-04-12. Added plan-sync item to 05.N requiring Section 06 update to consume adapter.
  • [TPR-05-002-codex][high] plans/lang-intelligence/section-05-parser-adapters.md:176 — Remove LEAK of machine-local roots from repos.yaml. Evidence: repos.yaml hardcoded ~/projects/… paths. Changed to ${REFERENCE_REPOS_ROOT} env-var pattern with runtime resolver. Resolved: Fixed on 2026-04-12. Changed repos.yaml contract to env-var pattern, added resolve_repo_path() item to 05.3.
  • [TPR-05-003-codex][high] plans/lang-intelligence/section-05-parser-adapters.md:237 — Fix GAP in ParseResult source payloads. Evidence: ParseResult had byte_count but not source_bytes — Section 06 needs source slices for qualified names. Resolved: Fixed on 2026-04-12. Added source_bytes field to ParseResult contract.
  • [TPR-05-004-codex][medium] plans/lang-intelligence/section-05-parser-adapters.md:292 — Eliminate WASTE from tags.scm baseline. Evidence: Rust/Go official tags.scm already include call/impl references. Blanket “declarations only” claim was inaccurate. Resolved: Fixed on 2026-04-12. Nuanced per-language, updated table and implementation notes.
  • [TPR-05-005-codex][medium] plans/lang-intelligence/00-overview.md:135 — Resolve DRIFT in overview language matrix. Evidence: Overview said Swift=source build, Lean=tree-sitter-lean. Section 05 says Swift=try PyPI first, Lean=C++ only. Resolved: Fixed on 2026-04-12. Synced overview matrix with Section 05 strategy.
  • [TPR-05-001-gemini][high] plans/lang-intelligence/section-05-parser-adapters.md:125 — Change subsection close-out headers to checklist items. Evidence: Used ### headers instead of - [ ] checklist items per plan-schema.md. Resolved: Fixed on 2026-04-12. Converted all 5 close-out blocks to checklist item format.
  • [TPR-05-002-gemini][medium] plans/lang-intelligence/section-05-parser-adapters.md:284 — Move TPR checkpoint above subsection close-out. Evidence: TPR checkpoint was placed after 05.3 close-out instead of before it. Resolved: Fixed on 2026-04-12. Moved TPR checkpoint to before close-out block.
  • [TPR-05-003-gemini][high] plans/lang-intelligence/section-05-parser-adapters.md:393 — Add task to update Section 06 for query file rename. Evidence: Section 05 renames tags.scm to decls.scm but no plan-sync item to update Section 06. Resolved: Fixed on 2026-04-12. Added Section 06 update item to plan-sync block. (Overlaps with TPR-05-001-codex.)
  • [TPR-05-001-codex][high] (iter 2) section-06-symbol-extraction.md:54 — Update Section 06 to consume adapter. Evidence: Section 06.2 still reads repos.yaml/tags.scm directly. Resolved: Fixed on 2026-04-12. Updated Section 06.2 contract to consume ParseResult/parse_repo().
  • [TPR-05-002-codex][medium] (iter 2) section-05-parser-adapters.md:324 — Stub query validation contradiction. Evidence: Plan says stubs are valid (zero captures) but also requires all queries to produce captures. Resolved: Fixed on 2026-04-12. Qualified validation: non-stubs must produce captures, stubs must compile cleanly.
  • [TPR-05-003-codex][medium] (iter 2) section-05-parser-adapters.md:62 — Success criteria overstates extraction. Evidence: Section 05 claims “extracts structural symbols” but extraction is Section 06’s deliverable. Resolved: Fixed on 2026-04-12. Changed to “Unblocks mission criteria” (parsing half only).
  • [TPR-05-001-codex][high] (iter 3) scripts/setup-parsers.sh:78 — Koka scanner patch detection uses grep on directory instead of file-existence check. Evidence: grep -q 'src/scanner.c' src/ searches file CONTENTS, not checks file existence. Patch never applied on fresh bootstrap. Resolved: Fixed on 2026-04-13. Changed to [ -f src/scanner.c ].
  • [TPR-05-002-codex][high] (iter 3) scripts/validate-parsers.py:159 — Matrix test assertions too weak: no behavioral checks for malformed/empty/valid conditions. Evidence: --matrix only checks query compilation, not ERROR node presence on malformed, clean on empty, or >0 captures on valid. Resolved: Fixed on 2026-04-13. Added behavioral assertions per condition.
  • [TPR-05-003-codex][high] (iter 3) neo4j/parser_adapter.py:196 — Missing declared query family only logs warning instead of hard error. Evidence: Missing .scm file returns {} silently, violating the hard-error contract. Resolved: Fixed on 2026-04-13. Promoted to RuntimeError.
  • [TPR-05-004-codex][medium] (iter 3) scripts/validate-parsers.py:155 — Malformed fixtures use .txt extension but lookup uses native extensions. Evidence: tests/malformed/rust.txt never matched by f"{lang_id}{ext}" lookup for .rs. Resolved: Fixed on 2026-04-13. Renamed fixtures to native extensions (.rs, .go, .zig, etc.).
  • [TPR-05-005-codex][medium] (iter 3) neo4j/parser_adapter.py:119 — load_manifests() missing path validation for source_root/issue_root. Evidence: Bad paths silently accepted. Path-existence validation only ran as inline ad-hoc check. Resolved: Fixed on 2026-04-13. Added path validation to load_manifests().
  • [TPR-05-001-gemini][high] (iter 3) neo4j/parser_adapter.py:250 — parallel parameter accepted but ignored in parse_repo. Evidence: parallel: bool = False plumbed through but never used. False API contract. Resolved: Fixed on 2026-04-13. Removed unused parameter. Sequential parsing meets <60s target.
  • [TPR-05-002-gemini][medium] (iter 3) scripts/validate-parsers.py:115 — Same as TPR-05-004-codex (malformed fixture extension mismatch). Resolved: Fixed on 2026-04-13. Same fix as TPR-05-004-codex.
  • [TPR-05-003-gemini][medium] (iter 3) neo4j/parser_adapter.py:44 — ParseResult missing source_root field for downstream absolute path construction. Evidence: Downstream consumers need absolute paths but only get relative_path. Must break encapsulation to reconstruct. Resolved: Fixed on 2026-04-13. Added source_root field to ParseResult.

05.N Completion Checklist

  • All 9 tree-sitter grammars load with pinned versions (requirements.txt verified compatibility matrix)
  • languages.yaml defines all 10 languages (9 tree-sitter + Ori native) with coverage_status, grammar_version, query_families; Lean is partial
  • repos.yaml defines all 11 repos with canonicalized repo_id / source_root / issue_root (resolves go/golang duality)
  • Parser adapter API (parser_adapter.py) exposes ParseResult with all contract fields; error handling: soft per-file, hard grammar/query
  • Query file families (decls.scm, calls.scm, imports.scm, impls.scm) exist for all 9 languages (stubs where appropriate; lean symlinks to cpp)
  • validate-parsers.py --matrix passes (108/108), --golden probes pass (8/8), --full completes in 28.1s (<60s)
  • setup-parsers.sh automates full environment setup; --verbose and --skip-koka flags
  • Content hashing deterministic (SHA-256, same file = same hash); grammar update policy documented in validate-parsers.py
  • Plan annotation cleanup: no stale plan references in code (infrastructure plan, no compiler code annotations)
  • All intermediate TPR checkpoint findings resolved (pre-implementation TPR covered in 05.R)
  • Plan sync — update plan metadata to reflect this section’s completion:
    • This section’s frontmatter status -> complete, subsection statuses updated
    • 00-overview.md Quick Reference table status updated (not-started → in-progress)
    • 00-overview.md mission success criteria: line 24 not checked — requires Section 06 (extraction) to complete “and extracts structural symbols” half
    • index.md section status updated (not-started → in-progress)
    • Next section’s (06) depends_on: ["05"] verified — correct, no stale assumptions
    • Update Section 06 plan — already updated during plan review (TPR iter-2): Section 06.2 contract now consumes ParseResult/parse_repo() and uses query family handles.
  • /tpr-review passed (final, full-section) — iter-3: 8 findings, all fixed. iter-4: 0 findings, clean pass. Both reviewers confirmed.
  • /impl-hygiene-review — N/A for Python infrastructure (skill targets Rust compiler code). Code quality verified by TPR reviewers.
  • /improve-tooling section-close sweep — Per-subsection retrospectives all documented (05.1 through 05.5). Cross-subsection pattern: node type discovery friction repeated across 05.4 query writing — addressed by the enriched matrix test (validates all queries compile and produce captures on real source). No additional cross-subsection gaps. Section-close sweep: per-subsection retrospectives covered everything; no cross-subsection patterns required new tooling.

Exit Criteria: validate-parsers.py --full --golden passes with all 9 languages within expected error rates, <60 seconds total parse time, all golden probes within tolerance, and parser_adapter.py API contract verified by Section 06’s extraction script importing and using it without modification.