
05 Code Graph: Parser Adapters

05.0 Goal

Set up the tree-sitter parsing infrastructure that all code graph work depends on. This section delivers three things: (1) reliable grammar loading for all 9 supported languages, (2) a formal adapter API contract that Section 06 consumes, and (3) query file families (not just tags.scm) that prepare for relationship extraction. The section does NOT extract symbols or import into Neo4j — it ensures every repo can be parsed and that the adapter layer exposes everything downstream sections need.

Success Criteria:

  • All 9 tree-sitter grammars load successfully with pinned, compatible versions (8 PyPI + 1 source-built Koka with scanner patch)
  • Parser adapter API contract exposes repo_id, language_id, relative_path, source_bytes, byte_count, tree, had_error, error_node_count, query_handles, coverage_status, content_hash
  • Query file families (decls.scm, calls.scm, imports.scm, impls.scm) exist for every supported language (32 files across 8 languages + lean symlink to cpp)
  • Matrix validation: Language x (Valid/Malformed/Empty) x query family all pass (108/108)
  • Full parse of all reference repos completes in <60 seconds (28.1s actual)
  • Unblocks mission criteria: “tree-sitter parses all 9 supported languages” (parsing half — extraction is Section 06’s deliverable)

Context: Section 06 (Symbol Extraction) needs more than just parse trees — it needs compiled query handles for declarations, calls, imports, and implementations. If Section 05 only delivers tags.scm parsing, Section 06 must reinvent query infrastructure. This section front-loads that work.

Reference implementations:

  • Sourcegraph SCIP: Multi-language indexing with per-language adapter pattern
  • nvim-treesitter: Query file organization (queries/{lang}/{tags,highlights,locals}.scm)

Depends on: None (independent pillar start).


05.1 Python Dependencies & Version Compatibility

File(s): ~/projects/lang_intelligence/.venv/, ~/projects/lang_intelligence/requirements.txt

Grammar packages pin different tree-sitter core versions. A blanket pip install tree-sitter>=0.25.0 will fail or produce ABI mismatches. The correct approach: pin exact versions after a compatibility smoke test.

Modern tree-sitter Python API (0.22+): The build_library() / Language() pattern from pre-0.22 is deprecated. In tree-sitter 0.22+, grammar packages expose a language() function directly:

# Modern API (tree-sitter >= 0.22)
import tree_sitter_rust
from tree_sitter import Language, Parser

RUST = Language(tree_sitter_rust.language())
parser = Parser(RUST)
tree = parser.parse(source_bytes)

There is NO shared library building step. Grammar packages are Python modules with compiled bindings.

  • Create requirements.txt with exact pinned versions. Start with latest compatible set and run smoke test:

    tree-sitter==0.25.2
    tree-sitter-rust==0.24.2
    tree-sitter-go==0.25.0
    tree-sitter-zig==1.1.2
    tree-sitter-typescript==0.23.2
    tree-sitter-haskell==0.23.1
    tree-sitter-swift==0.0.1
    tree-sitter-cpp==0.23.4

    Version selection rule: use the latest versions available from PyPI (verified 2026-04-13); core 0.25.2 is compatible with all grammar packages.

  • Run compatibility smoke test: for each grammar package, Language(mod.language()) must succeed, and Parser(lang).parse(b"") must return a tree without segfault
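A minimal sketch of that smoke test, assuming the pinned package set from requirements.txt (module names follow the tree_sitter_<lang> convention; the helper name is illustrative, not the real validate-parsers.py implementation):

```python
import importlib

def smoke_test(modules: dict[str, str]) -> dict[str, bool]:
    """Return {lang_id: True/False}: True iff the grammar package imports,
    wraps into a Language, and parses an empty buffer without crashing."""
    results = {}
    for lang_id, mod_name in modules.items():
        try:
            from tree_sitter import Language, Parser  # core bindings (0.25.x)
            mod = importlib.import_module(mod_name)   # e.g. tree_sitter_rust
            lang = Language(mod.language())
            tree = Parser(lang).parse(b"")
            results[lang_id] = tree.root_node is not None
        except Exception:
            results[lang_id] = False  # load/ABI failure -> hard error upstream
    return results

# Example: smoke_test({"rust": "tree_sitter_rust", "go": "tree_sitter_go"})
```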

  • Record the compatibility matrix in requirements.txt comments:

    # Compatibility matrix (verified 2026-04-13):
    # tree-sitter-rust       0.24.2 + core 0.25.2: OK
    # tree-sitter-go         0.25.0 + core 0.25.2: OK
    # tree-sitter-zig        1.1.2  + core 0.25.2: OK
    # tree-sitter-typescript 0.23.2 + core 0.25.2: OK
    # tree-sitter-haskell    0.23.1 + core 0.25.2: OK
    # tree-sitter-swift      0.0.1  + core 0.25.2: OK
    # tree-sitter-cpp        0.23.4 + core 0.25.2: OK
    # tree-sitter-koka       0.1.0 (source, scanner patch) + core 0.25.2: OK
  • Swift grammar: tree-sitter-swift==0.0.1 from PyPI loads successfully with core 0.25.2. No source build needed.

  • Koka grammar: NOT on PyPI. Cloned koka-community/tree-sitter-koka and installed from source. Required patching setup.py to include src/scanner.c (upstream bug: external scanner not listed in ext_modules sources). Requires python3-dev headers. Loads successfully after patch.

  • Verify all grammars load: created scripts/validate-parsers.py with --smoke mode. 8/8 grammars pass. Skips coverage_status: custom entries (Ori).

  • Document: Lean .lean files have 86% parse error rate. Lean4 repo is parsed via C++ grammar for runtime code only. Coverage status: partial (not “unsupported” — some of the repo IS parseable via C++ grammar). Documented in languages.yaml (created in 05.2).

  • Document: Ori uses its own Rust parser (no tree-sitter grammar). Ori adapter is implemented in Section 09.3. Ori appears in languages.yaml with grammar: native and coverage_status: custom. validate-parsers.py skips coverage_status: custom entries.

  • Create scripts/setup-parsers.sh that automates: venv creation, pip install -r requirements.txt, Koka source build (with scanner patch), validate-parsers.py --smoke run. Supports --verbose and --skip-koka flags.

  • Subsection close-out (05.1)

    • All tasks above are [x] and all grammars load via smoke test
    • Update this subsection’s status in section frontmatter to complete
    • Run /improve-tooling retrospectively on THIS subsection — Retrospective 05.1: Koka scanner patch was the main friction point (upstream bug in setup.py). setup-parsers.sh already has --verbose flag. No additional tooling gaps — validate-parsers.py --smoke gives clear per-grammar pass/fail.
    • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files. Clean (2026-04-13).

05.2 Language Adapter Manifests

File(s): ~/projects/lang_intelligence/languages.yaml, ~/projects/lang_intelligence/repos.yaml

These manifests are the single source of truth for the entire code graph pipeline. Every downstream script (Section 06 extraction, Section 07 import, Section 09 sync) reads them. Getting the schema right here prevents cascading fixes later.

languages.yaml — per-language capabilities:

rust:
  grammar: tree-sitter-rust       # pip package name or "source" or "native"
  grammar_version: "0.24.2"       # pinned version (must match requirements.txt)
  extensions: [".rs"]
  query_families:                  # which .scm query files exist for this language
    - decls                        # declarations (functions, types, traits, etc.)
    - calls                        # call sites
    - imports                      # use/import statements
    - impls                        # impl/instance/conformance blocks
  coverage_status: full            # full | partial | custom
  maturity: stable
  expected_error_rate: 0.09
  notes: ""

# Ori — native parser, not tree-sitter
ori:
  grammar: native
  extensions: [".ori"]
  query_families: []               # N/A — uses Ori's own Rust parser via FFI
  coverage_status: custom
  maturity: stable
  expected_error_rate: 0.0
  notes: "Parsed by ori_parse (Rust). Adapter in Section 09."

lean:
  grammar: tree-sitter-cpp         # .lean files skipped; only C++ runtime parsed
  extensions: [".cpp", ".h"]       # NOT .lean
  query_families: [decls, calls, imports, impls]  # all C++ query families
  coverage_status: partial
  maturity: stable
  expected_error_rate: 0.02
  notes: ".lean files have 86% error rate — skipped. C++ runtime code only."

repos.yaml — per-repo source mapping:

The local corpus has both go/ (issue tracker only) and golang/ (source code). The manifest MUST canonicalize these:

go:
  repo_id: go                                    # canonical ID used in Neo4j
  source_root: ${REFERENCE_REPOS_ROOT}/golang    # resolved at runtime by adapter
  issue_root: ${REFERENCE_REPOS_ROOT}/go         # resolved at runtime by adapter
  languages: [go]
  include:
    - cmd/compile/
    - go/types/
    - internal/types/
  exclude:
    - test/
    - vendor/
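The ${VAR} templates above are expanded at runtime; a sketch of the resolver (the helper name matches 05.3's resolve_repo_path, but the default roots shown are placeholders, not the real defaults):

```python
import os
import re

# Placeholder fallbacks -- the real defaults live in parser_adapter.py.
_DEFAULTS = {
    "REFERENCE_REPOS_ROOT": os.path.expanduser("~/reference_repos"),
    "LANG_INTELLIGENCE_ROOT": os.path.expanduser("~/lang_intelligence"),
    "ORI_LANG_ROOT": os.path.expanduser("~/ori"),
}

_VAR_RE = re.compile(r"\$\{([A-Z_]+)\}")

def resolve_repo_path(template: str) -> str:
    """Expand ${VAR} placeholders: environment first, then defaults.
    Unknown variables are left intact rather than silently dropped."""
    def repl(match: re.Match) -> str:
        name = match.group(1)
        return os.environ.get(name, _DEFAULTS.get(name, match.group(0)))
    return _VAR_RE.sub(repl, template)
```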
  • Create languages.yaml with all 10 language configs (9 tree-sitter + Ori native), including all required fields. TypeScript has custom module_name/language_func fields for its language_typescript() API.

  • Create repos.yaml with curated include/exclude roots for all 11 repos. All paths use ${REFERENCE_REPOS_ROOT}, ${ORI_LANG_ROOT}, ${LANG_INTELLIGENCE_ROOT} env-var patterns.

  • Canonicalize the go/golang duality: repo_id: go, source_root: ${REFERENCE_REPOS_ROOT}/golang, issue_root: ${REFERENCE_REPOS_ROOT}/go

  • For mixed-language repos, all applicable languages listed:

    • Gleam: [rust], Roc: [rust], Elm: [haskell], Koka: [haskell, koka], Lean4: [cpp], Swift: [swift, cpp]
  • Validate: every languages: entry in repos.yaml references a valid key in languages.yaml (13/13 refs OK)

  • Validate: every source_root path exists on disk (11/11 OK); every issue_root path exists (11/11 OK)

  • Subsection close-out (05.2)

    • All tasks above are [x] and both manifests validate
    • Update this subsection’s status in section frontmatter to complete
    • Run /improve-tooling retrospectively on THIS subsection — Retrospective 05.2: Validation is currently inline Python. Built into parser_adapter.py in 05.3 (resolve_repo_path + manifest loading). No separate validate-manifests.py needed — validate-parsers.py --smoke already covers grammar loading, and the inline validation covered path resolution. No tooling gaps.
    • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files. Clean (2026-04-13).

05.3 Parser Adapter API Contract

File(s): ~/projects/lang_intelligence/neo4j/parser_adapter.py

The parser adapter is the boundary between “raw tree-sitter” and “everything downstream.” Section 06 (extraction), Section 07 (import), and Section 09 (sync) all consume this API. The contract must be explicit, typed, and documented.

Adapter output contract (per file):

@dataclass
class ParseResult:
    repo_id: str               # canonical repo identifier from repos.yaml
    language_id: str           # language key from languages.yaml
    relative_path: str         # path relative to source_root
    source_bytes: bytes        # raw file content (needed by Section 06 for qualified names, signature_hash)
    byte_count: int            # len(source_bytes)
    tree: Tree | None          # tree-sitter Tree (None on load failure)
    had_error: bool            # True if tree contains ERROR nodes
    error_node_count: int      # count of ERROR nodes in tree
    query_handles: dict[str, Query]  # compiled queries by family name
    coverage_status: str       # "full" | "partial" | "custom"
    content_hash: str          # SHA-256 of file content (for incremental sync)

class CoverageStatus(Enum):
    FULL = "full"              # grammar parses this language well
    PARTIAL = "partial"        # grammar has known gaps (e.g., Lean C++ only)
    CUSTOM = "custom"          # not tree-sitter (e.g., Ori native parser)

Error handling policy:

  • Per-file parse failures (I/O error, encoding error): soft — skip file, log warning, continue. A single bad file must NOT abort the pipeline.

  • Grammar load failures (missing package, ABI mismatch): hard — abort immediately with clear error message. A broken grammar affects ALL files for that language.

  • Query compilation failures (malformed .scm file): hard — abort immediately. A broken query produces wrong extraction results silently.
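The two per-file computations behind this contract — counting ERROR nodes and hashing content — are small; a sketch, duck-typed against tree-sitter's Node API (node.type == "ERROR" marks parse errors; the real adapter may also use the node.has_error fast path):

```python
import hashlib

def count_error_nodes(node) -> int:
    """Recursively count ERROR nodes under a tree-sitter node."""
    count = 1 if node.type == "ERROR" else 0
    for child in node.children:
        count += count_error_nodes(child)
    return count

def content_hash(source_bytes: bytes) -> str:
    """Deterministic SHA-256 of the raw bytes, for incremental sync (09)."""
    return hashlib.sha256(source_bytes).hexdigest()
```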

  • Implement ParseResult dataclass with all fields listed above

  • Implement CoverageStatus enum

  • Implement parse_file(repo_config, lang_config, file_path) -> ParseResult that:

    • Loads grammar from languages.yaml config (with caching)
    • Reads file bytes (soft-fail on I/O/encoding errors — returns None)
    • Parses with tree-sitter
    • Counts ERROR nodes recursively
    • Compiles and attaches query handles for all query families listed in languages.yaml (with caching)
    • Computes SHA-256 content hash (for Section 09 incremental sync)
  • Implement parse_repo(repo_id) -> Iterator[ParseResult] that:

    • Reads repos.yaml for include/exclude patterns
    • Walks the file tree, filtering by extensions from languages.yaml
    • Calls parse_file for each matching file
    • Logs per-file soft failures without aborting
    • Skips coverage_status: custom languages (native parsers)
  • Implement hard error handling: grammar load failures raise RuntimeError with module/func context; query compilation failures raise RuntimeError with .scm path. Per-file I/O errors return None with warning log.

  • --parallel flag: removed unused parameter from parse_repo — sequential parsing meets the <60s target (28.1s actual). ProcessPoolExecutor can be added to validate-parsers.py --full if needed in the future.

  • Implement resolve_repo_path(template) that expands ${REFERENCE_REPOS_ROOT}, ${LANG_INTELLIGENCE_ROOT}, and ${ORI_LANG_ROOT} env vars via regex substitution. Checks env vars first, falls back to defaults. Also exposed load_manifests() for downstream consumers.

  • Verify adapter output: smoke test on Gleam repo (188 files, 0 errors). All ParseResult fields populated. Content hash deterministic (SHA-256, same file = same hash on re-parse).

  • TPR checkpoint: /tpr-review covering 05.1–05.3 implementation work (superseded by full-section TPR in 05.N — iter-4 clean pass on 2026-04-12)

  • Subsection close-out (05.3)

    • All tasks above are [x] and adapter API is documented and tested
    • Update this subsection’s status in section frontmatter to complete
    • Run /improve-tooling retrospectively on THIS subsection — Retrospective 05.3: Query file warning spam was noisy during testing (expected — no query files yet). Added logging properly so downstream consumers control verbosity. Grammar cache and query cache prevent redundant loads. No tooling gaps.
    • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files. Clean (2026-04-13).

05.4 Query File Families

File(s): ~/projects/lang_intelligence/queries/{lang}/{family}.scm

Official tags.scm files vary by language in what they capture. Some (Rust, Go) already include @reference.call and @reference.implementation captures alongside declarations. Others (TypeScript, Swift) primarily capture declarations. This subsection standardizes query file families for ALL languages, adapting existing upstream queries where possible and writing custom ones where needed.

Query families:

  • decls.scm — declarations: functions, types, traits, methods, constants
  • calls.scm — call expressions: function calls, method calls
  • imports.scm — import/use/require statements
  • impls.scm — impl blocks, interface conformance, instance declarations

Per-language query file status:

| Language   | decls.scm                  | calls.scm                      | imports.scm                   | impls.scm                                | Source                 |
|------------|----------------------------|--------------------------------|-------------------------------|------------------------------------------|------------------------|
| Rust       | Official (has decls)       | Official (has @reference.call) | Custom                        | Official (has @reference.implementation) | tree-sitter-rust       |
| Go         | Official (has decls)       | Official (has @reference.call) | Official (has package/import) | N/A (implicit)                           | tree-sitter-go         |
| Zig        | Custom (no official tags)  | Custom                         | Custom                        | N/A                                      | tree-sitter-zig        |
| TypeScript | Official tags.scm adapted  | Custom                         | Custom                        | Custom                                   | tree-sitter-typescript |
| Haskell    | Custom (no official tags)  | Custom                         | Custom                        | Custom                                   | tree-sitter-haskell    |
| Swift      | Official tags.scm adapted  | Custom                         | Custom                        | Custom                                   | tree-sitter-swift      |
| C++        | Official tags.scm adapted  | Custom                         | Custom                        | N/A                                      | tree-sitter-cpp        |
| Koka       | Custom (if grammar works)  | Custom                         | Custom                        | Custom                                   | tree-sitter-koka       |

Implementation approach:

  1. For languages WITH official tags.scm: adapt/rename to decls.scm, then write calls.scm, imports.scm, impls.scm from scratch using each grammar’s node-types.json as reference.
  2. For languages WITHOUT official tags.scm (Zig, Haskell, Koka): write all four families from scratch.
  3. Some families may be empty stubs for some languages (e.g., Go has no explicit impl blocks — impls.scm is empty). Empty stubs are valid — they return zero captures. The adapter contract handles this gracefully.
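Running one family against one parsed file is the same few lines for every language; a sketch assuming the py-tree-sitter 0.25 Query/QueryCursor API and the queries/{lang}/{family}.scm layout (helper name illustrative):

```python
from pathlib import Path

def capture_counts(language, tree, lang_id: str, family: str,
                   queries_root: Path = Path("queries")) -> dict[str, int]:
    """Compile one query family and count captures per capture name.
    An empty stub compiles cleanly and simply yields no captures."""
    source = (queries_root / lang_id / f"{family}.scm").read_text()
    from tree_sitter import Query, QueryCursor  # py-tree-sitter >= 0.25
    query = Query(language, source)             # raises on malformed .scm
    captures = QueryCursor(query).captures(tree.root_node)  # name -> [Node]
    return {name: len(nodes) for name, nodes in captures.items()}
```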
  • Rust (queries/rust/): decls.scm (function_item, struct_item, enum_item, type_item, trait_item, const_item, static_item, mod_item, macro_definition), calls.scm (call_expression, macro_invocation), imports.scm (use_declaration), impls.scm (impl_item). 98 decls / 1356 calls / 12 imports / 2 impls on analyse.rs.

  • Go (queries/go/): decls.scm (function_declaration, method_declaration, type_declaration, const_declaration, var_declaration), calls.scm (call_expression), imports.scm (import_declaration, package_clause). impls.scm is empty stub. 60 decls / 28 calls / 2 imports.

  • Zig (queries/zig/): All four from scratch. decls.scm (function_declaration, variable_declaration, container_field), calls.scm (call_expression + field_expression), imports.scm (builtin_function @import). impls.scm empty stub. 1378 decls / 4664 calls / 93 imports.

  • TypeScript (queries/typescript/): decls.scm (function_declaration, class_declaration, interface_declaration, type_alias_declaration, enum_declaration, method_definition), calls.scm (call_expression, new_expression), imports.scm (import_statement, export with source), impls.scm (implements_clause). 822 decls / 2076 calls / 4 imports.

  • Haskell (queries/haskell/): All from scratch. decls.scm (function, signature, data_type, newtype, type_synomym), calls.scm (apply + variable), imports.scm (import + module), impls.scm (instance). 20 decls / 78 calls / 34 imports.

  • Swift (queries/swift/): decls.scm (function_declaration, class_declaration, protocol_declaration, typealias_declaration, property_declaration — note: tree-sitter-swift 0.0.1 lacks struct_declaration and enum_declaration), calls.scm (call_expression), imports.scm (import_declaration), impls.scm (inheritance_specifier). 38 decls / 32 calls / 8 imports / 6 impls.

  • C++ (queries/cpp/): decls.scm (function_definition, class_specifier, struct_specifier, enum_specifier, namespace_definition with namespace_identifier), calls.scm (call_expression), imports.scm (preproc_include). impls.scm empty stub. 66 decls / 214 calls / 10 imports.

  • Koka (queries/koka/): Grammar loaded (with scanner patch from 05.1). decls.scm (fundecl, puredecl, typedecl), calls.scm (opexpr/atom/name/qidentifier), imports.scm (import + modulepath), impls.scm empty stub. 24 decls / 212 calls. Koka grammar works for .kk files.

  • Test each query file against at least one real file from its repo. 29/32 produce captures; 3 WARN are test-file selection (files lacking instances/imports — not query bugs). All 32 queries compile. 4 declared stubs return zero captures.

  • Create golden file probes: tests/golden-probes.yaml with 8 probes (one per language), recording expected capture counts per query family with 10% tolerance.
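The 10% tolerance check is worth pinning down, since a zero baseline (a declared stub) must not tolerate any drift; a sketch with an illustrative helper name:

```python
def within_tolerance(expected: int, actual: int, tol: float = 0.10) -> bool:
    """Golden-probe check: actual capture count within +/- tol of baseline.
    A zero baseline (declared stub) must stay exactly zero."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) <= expected * tol
```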

  • Subsection close-out (05.4)

    • All tasks above are [x] and all query files compile (non-stubs produce captures, declared stubs return zero captures)
    • Update this subsection’s status in section frontmatter to complete
    • Run /improve-tooling retrospectively on THIS subsection — Retrospective 05.4: Node type discovery was the main friction (had to iterate multiple times fixing “Impossible pattern” errors). The inline test script used for validation should be formalized into validate-parsers.py --matrix in 05.5. Key lesson: always check named node types from the grammar BEFORE writing queries (the Language.node_kind_for_id() API). tree-sitter-swift 0.0.1 lacks struct_declaration/enum_declaration — noted in query file comments.
    • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files. Clean (2026-04-13).

05.5 Parse Validation & Matrix Testing

File(s): ~/projects/lang_intelligence/scripts/validate-parsers.py

A comprehensive validation script that tests the full parser adapter stack: grammar loading, file parsing, query compilation, and capture accuracy.

Matrix dimensions:

  1. Language (9 tree-sitter languages)
  2. File condition (Valid source / Malformed source / Empty file)
  3. Query family (decls / calls / imports / impls)
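Each matrix cell reduces to one parse plus a condition-specific behavioral assertion (the kind TPR-05-002-codex required); a sketch, duck-typed against Parser.parse and Node.has_error:

```python
def check_cell(parser, source: bytes, condition: str) -> None:
    """Assert the behavior one (language, condition) cell demands.
    condition is "valid", "malformed", or "empty"; raises AssertionError."""
    tree = parser.parse(source)  # must never crash, whatever the input
    assert tree.root_node is not None
    if condition == "valid":
        assert not tree.root_node.has_error, "valid source must parse clean"
    elif condition == "malformed":
        assert tree.root_node.has_error, "malformed source needs ERROR nodes"
    elif condition == "empty":
        assert source == b"" and not tree.root_node.has_error
```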
  • Implement validate-parsers.py with all four test modes:

    • --smoke: 8/8 grammars, <0.02s (target <5s)
    • --matrix: 108/108 cells pass, 0.53s (target <30s). Tests 9 languages x 3 conditions x 4 families.
    • --full: 5581 files, 131.7 MB, 28.1s (target <60s). Reports per-repo: files, MB, errors, rate, throughput.
    • --golden: 8 probes pass, 0.26s. All capture counts within 10% tolerance.
  • Malformed file handling: Created tests/malformed/ with deliberately broken files. Matrix test verifies parser produces tree with ERROR nodes (no crash). --matrix includes valid/malformed/empty conditions.

  • Empty file handling: --matrix tests empty files for all languages. Parser produces empty tree, no crash.

  • Error rate validation: --full reports per-repo error rates. Key rates: Gleam 0%, Rust 5.9%, Go 4.5%, Swift 28.7% (0.0.1 grammar), Lean 73.4% (C++ grammar for runtime only).

  • Query compilation validation: --matrix compiles all .scm files per language x family. Reports stubs separately. All non-stubs produce captures on valid source.

  • Performance reporting: --full reports files/sec per repo. Aggregate: ~200 files/sec, 28.1s total. Zig slowest (23 files/s, large files), Gleam fastest (660 files/s, small files).

  • Golden file probes: 8 probes in tests/golden-probes.yaml (one per language). Baseline captured 2026-04-13. 10% tolerance for grammar version drift.

  • Incremental hashing verification: SHA-256 content hash is deterministic (verified in 05.3 smoke test — same file = same hash on re-parse).

  • Grammar update policy: Documented in validate-parsers.py --help (module docstring). Run --golden before/after grammar bump. CI gate: --matrix on every version bump.

  • Subsection close-out (05.5)

    • All tasks above are [x] and validate-parsers.py --matrix passes (108/108)
    • Update this subsection’s status in section frontmatter to complete
    • Run /improve-tooling retrospectively on THIS subsection — Retrospective 05.5: Script output is clear for human consumption. JSON output for CI could be useful but not blocking (no CI pipeline yet). BLESS=1 for golden probes is a nice-to-have for 05.N. No urgent tooling gaps.
    • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files. Clean (2026-04-13).

05.R Third Party Review Findings

  • [TPR-05-001-codex][high] plans/lang-intelligence/section-06-symbol-extraction.md:53 — Close the LEAK between ParseResult and Section 06. Evidence: Section 06.2 reads repos.yaml/languages.yaml/tags.scm directly instead of consuming the ParseResult adapter. Resolved: Fixed on 2026-04-12. Added plan-sync item to 05.N requiring Section 06 update to consume adapter.
  • [TPR-05-002-codex][high] plans/lang-intelligence/section-05-parser-adapters.md:176 — Remove LEAK of machine-local roots from repos.yaml. Evidence: repos.yaml hardcoded ~/projects/… paths. Changed to ${REFERENCE_REPOS_ROOT} env-var pattern with runtime resolver. Resolved: Fixed on 2026-04-12. Changed repos.yaml contract to env-var pattern, added resolve_repo_path() item to 05.3.
  • [TPR-05-003-codex][high] plans/lang-intelligence/section-05-parser-adapters.md:237 — Fix GAP in ParseResult source payloads. Evidence: ParseResult had byte_count but not source_bytes — Section 06 needs source slices for qualified names. Resolved: Fixed on 2026-04-12. Added source_bytes field to ParseResult contract.
  • [TPR-05-004-codex][medium] plans/lang-intelligence/section-05-parser-adapters.md:292 — Eliminate WASTE from tags.scm baseline. Evidence: Rust/Go official tags.scm already include call/impl references. Blanket “declarations only” claim was inaccurate. Resolved: Fixed on 2026-04-12. Nuanced per-language, updated table and implementation notes.
  • [TPR-05-005-codex][medium] plans/lang-intelligence/00-overview.md:135 — Resolve DRIFT in overview language matrix. Evidence: Overview said Swift=source build, Lean=tree-sitter-lean. Section 05 says Swift=try PyPI first, Lean=C++ only. Resolved: Fixed on 2026-04-12. Synced overview matrix with Section 05 strategy.
  • [TPR-05-001-gemini][high] plans/lang-intelligence/section-05-parser-adapters.md:125 — Change subsection close-out headers to checklist items. Evidence: Used ### headers instead of - [ ] checklist items per plan-schema.md. Resolved: Fixed on 2026-04-12. Converted all 5 close-out blocks to checklist item format.
  • [TPR-05-002-gemini][medium] plans/lang-intelligence/section-05-parser-adapters.md:284 — Move TPR checkpoint above subsection close-out. Evidence: TPR checkpoint was placed after 05.3 close-out instead of before it. Resolved: Fixed on 2026-04-12. Moved TPR checkpoint to before close-out block.
  • [TPR-05-003-gemini][high] plans/lang-intelligence/section-05-parser-adapters.md:393 — Add task to update Section 06 for query file rename. Evidence: Section 05 renames tags.scm to decls.scm but no plan-sync item to update Section 06. Resolved: Fixed on 2026-04-12. Added Section 06 update item to plan-sync block. (Overlaps with TPR-05-001-codex.)
  • [TPR-05-001-codex][high] (iter 2) section-06-symbol-extraction.md:54 — Update Section 06 to consume adapter. Evidence: Section 06.2 still reads repos.yaml/tags.scm directly. Resolved: Fixed on 2026-04-12. Updated Section 06.2 contract to consume ParseResult/parse_repo().
  • [TPR-05-002-codex][medium] (iter 2) section-05-parser-adapters.md:324 — Stub query validation contradiction. Evidence: Plan says stubs are valid (zero captures) but also requires all queries to produce captures. Resolved: Fixed on 2026-04-12. Qualified validation: non-stubs must produce captures, stubs must compile cleanly.
  • [TPR-05-003-codex][medium] (iter 2) section-05-parser-adapters.md:62 — Success criteria overstates extraction. Evidence: Section 05 claims “extracts structural symbols” but extraction is Section 06’s deliverable. Resolved: Fixed on 2026-04-12. Changed to “Unblocks mission criteria” (parsing half only).
  • [TPR-05-001-codex][high] (iter 3) scripts/setup-parsers.sh:78 — Koka scanner patch detection uses grep on directory instead of file-existence check. Evidence: grep -q 'src/scanner.c' src/ searches file CONTENTS, not checks file existence. Patch never applied on fresh bootstrap. Resolved: Fixed on 2026-04-13. Changed to [ -f src/scanner.c ].
  • [TPR-05-002-codex][high] (iter 3) scripts/validate-parsers.py:159 — Matrix test assertions too weak: no behavioral checks for malformed/empty/valid conditions. Evidence: --matrix only checks query compilation, not ERROR node presence on malformed, clean on empty, or >0 captures on valid. Resolved: Fixed on 2026-04-13. Added behavioral assertions per condition.
  • [TPR-05-003-codex][high] (iter 3) neo4j/parser_adapter.py:196 — Missing declared query family only logs warning instead of hard error. Evidence: Missing .scm file returns {} silently, violating the hard-error contract. Resolved: Fixed on 2026-04-13. Promoted to RuntimeError.
  • [TPR-05-004-codex][medium] (iter 3) scripts/validate-parsers.py:155 — Malformed fixtures use .txt extension but lookup uses native extensions. Evidence: tests/malformed/rust.txt never matched by f"{lang_id}{ext}" lookup for .rs. Resolved: Fixed on 2026-04-13. Renamed fixtures to native extensions (.rs, .go, .zig, etc.).
  • [TPR-05-005-codex][medium] (iter 3) neo4j/parser_adapter.py:119 — load_manifests() missing path validation for source_root/issue_root. Evidence: Bad paths silently accepted. Path-existence validation only ran as inline ad-hoc check. Resolved: Fixed on 2026-04-13. Added path validation to load_manifests().
  • [TPR-05-001-gemini][high] (iter 3) neo4j/parser_adapter.py:250 — parallel parameter accepted but ignored in parse_repo. Evidence: parallel: bool = False plumbed through but never used. False API contract. Resolved: Fixed on 2026-04-13. Removed unused parameter. Sequential parsing meets <60s target.
  • [TPR-05-002-gemini][medium] (iter 3) scripts/validate-parsers.py:115 — Same as TPR-05-004-codex (malformed fixture extension mismatch). Resolved: Fixed on 2026-04-13. Same fix as TPR-05-004-codex.
  • [TPR-05-003-gemini][medium] (iter 3) neo4j/parser_adapter.py:44 — ParseResult missing source_root field for downstream absolute path construction. Evidence: Downstream consumers need absolute paths but only get relative_path. Must break encapsulation to reconstruct. Resolved: Fixed on 2026-04-13. Added source_root field to ParseResult.

05.N Completion Checklist

  • All 9 tree-sitter grammars load with pinned versions (requirements.txt verified compatibility matrix)
  • languages.yaml defines all 10 languages (9 tree-sitter + Ori native) with coverage_status, grammar_version, query_families; Lean is partial
  • repos.yaml defines all 11 repos with canonicalized repo_id / source_root / issue_root (resolves go/golang duality)
  • Parser adapter API (parser_adapter.py) exposes ParseResult with all contract fields; error handling: soft per-file, hard grammar/query
  • Query file families (decls.scm, calls.scm, imports.scm, impls.scm) exist for all 9 languages (stubs where appropriate; lean symlinks to cpp)
  • validate-parsers.py --matrix passes (108/108), --golden probes pass (8/8), --full completes in 28.1s (<60s)
  • setup-parsers.sh automates full environment setup; --verbose and --skip-koka flags
  • Content hashing deterministic (SHA-256, same file = same hash); grammar update policy documented in validate-parsers.py
  • Plan annotation cleanup: no stale plan references in code (infrastructure plan, no compiler code annotations)
  • All intermediate TPR checkpoint findings resolved (pre-implementation TPR covered in 05.R)
  • Plan sync — update plan metadata to reflect this section’s completion:
    • This section’s frontmatter status -> complete, subsection statuses updated
    • 00-overview.md Quick Reference table status updated (not-started → in-progress)
    • 00-overview.md mission success criteria: line 24 not checked — requires Section 06 (extraction) to complete “and extracts structural symbols” half
    • index.md section status updated (not-started → in-progress)
    • Next section’s (06) depends_on: ["05"] verified — correct, no stale assumptions
    • Update Section 06 plan — already updated during plan review (TPR iter-2): Section 06.2 contract now consumes ParseResult/parse_repo() and uses query family handles.
  • /tpr-review passed (final, full-section) — iter-3: 8 findings, all fixed. iter-4: 0 findings, clean pass. Both reviewers confirmed.
  • /impl-hygiene-review — N/A for Python infrastructure (skill targets Rust compiler code). Code quality verified by TPR reviewers.
  • /improve-tooling section-close sweep — Per-subsection retrospectives all documented (05.1 through 05.5). Cross-subsection pattern: node type discovery friction repeated across 05.4 query writing — addressed by the enriched matrix test (validates all queries compile and produce captures on real source). No additional cross-subsection gaps. Section-close sweep: per-subsection retrospectives covered everything; no cross-subsection patterns required new tooling.

Exit Criteria: validate-parsers.py --full --golden passes with all 9 languages within expected error rates, <60 seconds total parse time, all golden probes within tolerance, and parser_adapter.py API contract verified by Section 06’s extraction script importing and using it without modification.