06 Code Graph: Symbol Extraction

06.0 Goal

Transform tree-sitter ASTs into normalized symbol and relationship records. This is the bridge between raw parsing (Section 05) and Neo4j import (Section 07). The output is language-neutral JSONL — the same format regardless of whether the input is Rust, Go, or Haskell.

Key architectural decision: data-driven, not per-language. The .scm query files from Section 05 already standardize capture names (@definition.function, @definition.class, @definition.interface, etc.) across all 9 languages. The Python extraction layer must map those capture names to normalized kinds via a registry table — NOT via 9 bespoke per-language extractor functions. Per-language functions are a LEAK: the .scm files are the canonical source of “what constitutes a declaration in language X,” and the Python layer duplicating that knowledge creates a shadow home that will drift.

Downstream consumers and their contracts:

  • Section 07 (Import Pipeline): Consumes JSONL records. Needs qualified_name, kind, line, end_line, signature_hash, visibility. Section 07’s schema creates Symbol nodes keyed by (repo, qualified_name).
  • Section 08 (Issue-to-Code Bridge): Matches issue text to Symbol.qualified_name and File.path. Needs line ranges (line, end_line) for code reference resolution.
  • Section 09 (Ori Live Sync): Uses signature_hash + content_hash for incremental diffing. The signature_hash MUST be body-independent or every function body edit triggers a full re-sync.

06.1 Capture-to-Kind Registry

File: ~/projects/lang_intelligence/neo4j/extract_symbols.py (top-level constant)

The .scm query files use these capture names (verified by reading all 32 query files):

| Capture Name | Languages Using It | Example Node Types |
|---|---|---|
| @definition.function | all 9 | function_item (Rust), function_declaration (Go/TS/Swift), fundecl (Koka) |
| @definition.class | rust, go, ts, swift, haskell, cpp, koka | struct_item/enum_item (Rust), type_declaration (Go), class_declaration (TS/Swift), data_type (Haskell) |
| @definition.interface | rust, ts, swift | trait_item (Rust), interface_declaration (TS), protocol_declaration (Swift) |
| @definition.method | go, ts | method_declaration (Go), method_definition (TS) |
| @definition.constant | rust, go | const_item/static_item (Rust), const_declaration (Go) |
| @definition.variable | go, ts, swift, zig | var_declaration (Go), lexical_declaration (TS), property_declaration (Swift), variable_declaration (Zig) |
| @definition.module | rust, go, cpp | mod_item (Rust), package_clause (Go, in imports.scm!), namespace_definition (C++) |
| @definition.field | zig | container_field (Zig) |
| @definition.macro | rust | macro_definition (Rust) |
| @reference.call | all 9 | call_expression (most), macro_invocation (Rust), apply (Haskell), opexpr (Koka) |
| @reference.import | all 9 | use_declaration (Rust), import_declaration (Go), import_statement (TS), etc. |
| @reference.implementation | rust, ts, swift, haskell | impl_item (Rust), class_declaration+implements_clause (TS), instance (Haskell) |

Semantic vs helper captures: Query files contain both semantic captures (mapped to normalized kinds via CAPTURE_TO_KIND) and helper captures used for sub-node extraction:

  • @name — symbol/target name (semantic: consumed by extractor)
  • @type — implementing type in IMPLEMENTS relationships (semantic: consumed by extractor for Rust impls)
  • @_fn, @_prefix — internal query anchors (helper: ignored by extractor, prefixed with _)

The extractor MUST consume @name and @type from match results. It MUST ignore any capture name starting with _ (helper captures). All other non-CAPTURE_TO_KIND captures should be logged at debug level as unrecognized.

Rust IMPLEMENTS captures: Rust’s impls.scm has TWO patterns — one captures trait: @name + type: @type (impl with trait), the other captures only type: @name (inherent impl). The extractor must handle both: for trait impls, @name = trait name, @type = implementing type (use match_captures.get("type") — safe access, since inherent impls lack @type); for inherent impls, @name = implementing type, no trait.

Rust impl dedup rule: Both patterns can match the SAME impl_item node. For impl Foo for Bar {}, the trait-impl pattern matches with name=Foo, type=Bar, AND the type-only pattern matches with name=Bar. The extractor MUST deduplicate: if matches() yields two matches for the same impl_item node (same start_byte), and one has a @type capture while the other does not, keep only the trait-impl match (the one with @type) and discard the type-only match. This prevents spurious inherent-impl records for every trait impl.
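The dedup rule can be sketched as a pure function over match results. The shapes below are illustrative — the real extractor works on tree-sitter match dicts, keyed here by the impl_item node's start_byte:

```python
def dedup_rust_impl_matches(matches):
    """Keep one match per impl_item node, preferring the trait-impl pattern.

    `matches` is a list of (impl_start_byte, captures) pairs, where
    `captures` maps capture names to captured text. When the same node
    produced both a trait-impl match (has "type") and a type-only match,
    only the trait-impl match survives.
    """
    by_node = {}
    for start_byte, captures in matches:
        kept = by_node.get(start_byte)
        # Replace a type-only match with the trait-impl match for the same node.
        if kept is None or ("type" in captures and "type" not in kept):
            by_node[start_byte] = captures
    return [by_node[k] for k in sorted(by_node)]

# impl Foo for Bar {} matches twice; impl Baz {} matches once.
matches = [
    (100, {"name": "Foo", "type": "Bar"}),  # trait-impl pattern
    (100, {"name": "Bar"}),                 # type-only pattern, same node
    (200, {"name": "Baz"}),                 # genuine inherent impl
]
```

The rule is order-independent: whichever pattern matches first, the match carrying @type wins for that node.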

Capture name to normalized kind mapping:

CAPTURE_TO_KIND = {
    # Declarations
    "definition.function":  "function",
    "definition.method":    "method",
    "definition.class":     "type",       # struct, enum, class, data — NOT OOP "class"
    "definition.interface": "trait_like",  # trait, interface, protocol, typeclass
    "definition.constant":  "const",
    "definition.variable":  "variable",   # filtered in scope pass unless top-level
    "definition.module":    "module",
    "definition.field":     "field",
    "definition.macro":     "macro",
    # References
    "reference.call":           "CALLS",
    "reference.import":         "IMPORTS",
    "reference.implementation": "IMPLEMENTS",
}

Critical: Rust maps BOTH struct_item AND enum_item to @definition.class. The plan’s original kind table had separate type vs sum_type kinds, but the .scm files do not distinguish them — both are @definition.class. To recover sum_type, the extractor would need to check the tree-sitter node type (enum_item vs struct_item). This is the ONE place where the node type matters beyond the capture name.

Node-type refinement table (applied after capture-to-kind mapping):

| Language | Capture | Node Type | Refined Kind |
|---|---|---|---|
| rust | @definition.class | enum_item | sum_type |
| rust | @definition.class | type_item | type_alias |
| typescript | @definition.class | type_alias_declaration | type_alias |
| typescript | @definition.class | enum_declaration | sum_type |
| swift | @definition.class | typealias_declaration | type_alias |
| haskell | @definition.class | type_synonym | type_alias |
| go | @definition.class | type_spec with child interface_type | trait_like |
| haskell | @definition.function | signature | skip (type sig, not a definition — deduplicate with the function node) |

Go interface detection: Go’s decls.scm captures all type_declaration nodes as @definition.class. To distinguish type Foo interface { ... } from type Bar struct { ... }, the extractor must inspect the type_spec child: if it contains an interface_type child, refine to trait_like. This requires walking one level deeper than the capture node.
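A minimal sketch of that child inspection, using a stand-in Node class with the same `.type`/`.children` fields tree-sitter exposes:

```python
class Node:  # stand-in for tree_sitter.Node (.type and .children only)
    def __init__(self, type_, children=()):
        self.type = type_
        self.children = list(children)

def refine_go_class_kind(type_declaration_node):
    """Refine Go @definition.class: interface types become trait_like."""
    for spec in type_declaration_node.children:
        if spec.type == "type_spec" and any(
            child.type == "interface_type" for child in spec.children
        ):
            return "trait_like"
    return "type"

iface = Node("type_declaration", [Node("type_spec", [Node("interface_type")])])
strct = Node("type_declaration", [Node("type_spec", [Node("struct_type")])])
```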

Haskell signature deduplication: Haskell’s decls.scm captures both (function ...) and (signature ...) as @definition.function. A function with a type signature produces TWO captures for the same name. The extractor must deduplicate: if both a function and a signature capture exist for the same name in the same file, keep only the function capture (it has the body, line range, and is the actual definition).
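That dedup can be expressed as a filter over emitted records. The record shape here is illustrative (dicts with `file`, `name`, and `language_kind`); signature-only names without a function body survive, per the rule above:

```python
def dedup_haskell_signatures(records):
    """Drop signature records whose (file, name) also has a function record."""
    defined = {
        (r["file"], r["name"])
        for r in records
        if r["language_kind"] == "function"
    }
    return [
        r for r in records
        if not (r["language_kind"] == "signature"
                and (r["file"], r["name"]) in defined)
    ]

records = [
    {"file": "A.hs", "name": "toReport", "language_kind": "signature"},
    {"file": "A.hs", "name": "toReport", "language_kind": "function"},
    {"file": "A.hs", "name": "helper", "language_kind": "signature"},  # sig only
]
```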

  • Define CAPTURE_TO_KIND registry as a module-level dict in extract_symbols.py
  • Define NODE_TYPE_REFINEMENTS as a (language_id, capture_name, node_type) -> refined_kind table for the 8 cases above (6 original + Go interface + Haskell signature dedup)
  • Preserve language_kind as the raw tree-sitter node type name (e.g., fn_item, struct_item, enum_item) for downstream consumers that need language-specific detail
  • Verify the registry covers all 12 semantic capture names found in the .scm files (9 @definition.* + 3 @reference.*). Additionally, define the extractor’s handling of the 2 sub-captures (@name, @type) and document that _-prefixed captures (e.g., @_fn) are ignored
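One possible shape for the refinement table, keyed exactly as the checklist describes. Only the six pure node-type cases fit a lookup; the Go interface child inspection and Haskell signature dedup need code, not table entries:

```python
# (language_id, capture_name, node_type) -> refined kind.
NODE_TYPE_REFINEMENTS = {
    ("rust", "definition.class", "enum_item"): "sum_type",
    ("rust", "definition.class", "type_item"): "type_alias",
    ("typescript", "definition.class", "type_alias_declaration"): "type_alias",
    ("typescript", "definition.class", "enum_declaration"): "sum_type",
    ("swift", "definition.class", "typealias_declaration"): "type_alias",
    ("haskell", "definition.class", "type_synonym"): "type_alias",
}

def refine_kind(language_id, capture_name, node_type, base_kind):
    """Look up a refinement; fall back to the registry-mapped kind."""
    return NODE_TYPE_REFINEMENTS.get(
        (language_id, capture_name, node_type), base_kind
    )
```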

Go @definition.module in imports.scm — NOT decls.scm

Go’s package_clause capture (@definition.module) lives in queries/go/imports.scm, not queries/go/decls.scm. The extractor MUST check BOTH query_handles["decls"] AND query_handles["imports"] for @definition.* captures. Alternatively, move the package_clause pattern from imports.scm to decls.scm in Go’s query files (the correct fix — package declarations are declarations, not imports).

  • Move (package_clause (package_identifier) @name) @definition.module from queries/go/imports.scm to queries/go/decls.scm — this is a declaration, not an import
  • Verify parse_repo("go", manifests) still works after the move (query_handles keys unchanged since the family “decls” already exists)
  • Update tests/golden-probes.yaml Go baseline: imports count drops by 1 (package_clause moves to decls), decls count rises by 1
  • Update any Section 05 documentation that references Go capture counts (the prose in section-05 mentions Go imports=2)

Subsection 06.1 close-out

/improve-tooling retrospective: Registry covered all 12 captures across 9 languages. Go package_clause successfully moved from imports.scm to decls.scm. Golden probes updated. No captures fell through to unknown. Refinement table handles 6 node-type cases + Go interface (child inspection) + Haskell signature dedup. Retrospective 06.1: no tooling gaps.


06.2 Qualified Name Derivation

File: ~/projects/lang_intelligence/neo4j/extract_symbols.py (helper function)

Tree-sitter is a single-file parser — it has NO concept of module paths. A function parse_expr in file compiler/rustc_parse/src/parser/expr.rs must be assigned the qualified name rustc_parse::parser::expr::parse_expr, but tree-sitter sees only the function name.

Derivation strategy per language:

| Language | Module Path Source | Separator | Example |
|---|---|---|---|
| Rust | File path: strip src/, replace / with ::, strip extension. Crate name from top-level dir. Nested mod_item nodes add segments. | :: | compiler/rustc_parse/src/parser/expr.rs:parse_expr -> rustc_parse::parser::expr::parse_expr |
| Go | package_clause node (from imports.scm or decls.scm after move). Import path from directory structure. | . | src/go/types/errors.go:Checker.err -> go.types.Checker.err |
| TypeScript | File path relative to project root, replace / with ., strip extension. | . | src/compiler/checker.ts:checkExpression -> compiler.checker.checkExpression |
| Haskell | module declaration in file, or fallback to file path. | . | compiler/src/Reporting/Error/Type.hs:toReport -> Reporting.Error.Type.toReport |
| Zig | File path: replace / with ., strip extension. | . | src/Sema.zig:resolveType -> Sema.resolveType |
| Swift | File path relative to source root, no src/ stripping. | . | lib/SILOptimizer/ARC/ARCSequenceOpts.cpp:optimize -> SILOptimizer.ARC.ARCSequenceOpts.optimize |
| C++ | Namespace from namespace_definition nodes + class from class_specifier nodes, fallback to file path. | :: | src/runtime/object.cpp:lean::object::inc_ref -> lean::object::inc_ref |
| Koka | Module path from file path (.kk files), replace / with /. | / | src/Type/Infer.kk:inferExpr -> Type/Infer.inferExpr |
| Lean | Shares C++ strategy (Lean repo only parses C++ runtime, not .lean files due to 86% error rate). Namespace + class from C++ nodes, fallback to file path. | :: | src/runtime/object.cpp:lean::object::inc_ref -> lean::object::inc_ref |

Implementation approach:

  1. File-path-based module prefix (works for all languages): Convert ParseResult.relative_path into a module path using the language’s separator and path conventions. This is the PRIMARY strategy.
  2. AST-based nesting (augments file path): Walk the tree-sitter tree to find enclosing mod_item (Rust), namespace_definition (C++), impl_item (Rust), class_declaration (TS/Swift) nodes. These add segments to the qualified name.
  3. Determinism requirement: The same file with the same content MUST produce the same qualified_name on every run. No randomness, no timestamp-dependent derivation.
  • Implement derive_module_prefix(language_id: str, relative_path: str) -> str — file-path-based module prefix using per-language conventions (separator, path stripping rules)
  • Implement derive_qualified_name(language_id: str, module_prefix: str, symbol_node, tree) -> str — walks parent nodes to find enclosing scopes (modules, classes, namespaces, impl blocks) and appends the symbol name
  • Handle Rust mod.rs / lib.rs conventions: src/parser/mod.rs -> module parser (strip mod), src/lib.rs -> crate root
  • Handle Rust impl blocks: method check inside impl TypeChecker gets qualified name ending in TypeChecker::check, not just check
  • Handle Go package_clause: use the package name as the first segment, NOT the file name
  • Handle C++ namespaces: extract namespace_identifier from enclosing namespace_definition nodes
  • Test: determinism — parsing the same file twice produces identical qualified_names
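A sketch of derive_module_prefix for two of the conventions above (Rust and TypeScript). The src-stripping and mod.rs/lib.rs rules follow the table; the exact helper shape is an assumption:

```python
from pathlib import PurePosixPath

def derive_module_prefix(language_id: str, relative_path: str) -> str:
    """File-path-based module prefix (sketch covering rust and typescript)."""
    parts = list(PurePosixPath(relative_path).with_suffix("").parts)
    if language_id == "rust":
        if "src" in parts:
            i = parts.index("src")
            # Crate name is the dir just above src/; drop everything before it.
            parts = parts[max(i - 1, 0):i] + parts[i + 1:]
        if parts and parts[-1] in ("mod", "lib"):
            parts.pop()  # mod.rs/lib.rs name the enclosing module / crate root
        return "::".join(parts)
    if language_id == "typescript":
        if parts and parts[0] == "src":
            parts = parts[1:]
        return ".".join(parts)
    raise NotImplementedError(language_id)
```

The symbol name is appended later by derive_qualified_name, which also accounts for enclosing scopes (impl blocks, namespaces, nested modules).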

Subsection 06.2 close-out

/improve-tooling retrospective: File-path derivation works for all 8 tested cases. Default handler strips common prefixes (src/, lib/, compiler/) recursively. Haskell path fix needed for compiler/src/ double-prefix. Tested against Gleam repo — 6885 symbols with correct qualified names. Retrospective 06.2: no tooling gaps.


06.3 Signature Hash (Body-Independent)

File: ~/projects/lang_intelligence/neo4j/extract_symbols.py (helper function)

Problem: Tree-sitter captures the entire function node including its body. Hashing the full captured text makes signature_hash body-dependent — every body edit changes the hash, which breaks Section 09’s incremental sync (every commit triggers a full re-sync of every function).

Solution: Hash only the signature portion of function/method nodes.

Strategy per node type:

| Node Type | Signature = | Body (excluded from hash) |
|---|---|---|
| function_item (Rust) | Everything before the block child | The block child |
| function_declaration (Go, TS, Swift) | Everything before the block/statement_block/function_body child | The block child |
| fundecl (Koka) | The identifier + type annotation children | The body |
| function (Haskell) | The variable name + patterns | The RHS expression |
| Non-function declarations | Full node text (structs, enums, traits change “signature” when fields change) | N/A |

Implementation:

import hashlib

def compute_signature_hash(node, source_bytes: bytes) -> str:
    """Hash only the signature portion of a node, excluding the body.

    For function-like nodes, finds the body child (block, statement_block,
    function_body) and hashes everything before it. For non-function nodes,
    hashes the entire node text.
    """
    BODY_CHILD_TYPES = {"block", "statement_block", "function_body", "compound_statement",
                        "match",      # Haskell: function body wrapper (verified via parser probe)
                        "funbody"}    # Koka: function body wrapper (verified via parser probe)
    body_child = None
    for child in node.children:
        if child.type in BODY_CHILD_TYPES:
            body_child = child
            break

    if body_child:
        sig_bytes = source_bytes[node.start_byte:body_child.start_byte]
    else:
        sig_bytes = source_bytes[node.start_byte:node.end_byte]

    return hashlib.sha256(sig_bytes).hexdigest()[:16]  # 16 hex chars = 64 bits
  • Implement compute_signature_hash(node, source_bytes) as above
  • Verify the BODY_CHILD_TYPES set against actual tree-sitter node type definitions for ALL languages. The initial set covers C-family blocks + Haskell expressions + Koka bodyexpr, but may need expansion for edge cases. When a body child type is not found, the function hashes the entire node — verify this doesn’t happen for any language’s function declarations by testing each language
  • Test: changing a function body does NOT change its signature_hash
  • Test: changing a function’s parameter list DOES change its signature_hash
  • Test: changing a struct’s fields DOES change its signature_hash (full node hash for non-functions)
  • Truncate to 16 hex chars (64-bit fingerprint) — sufficient for change detection, not cryptographic
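The body-independence and parameter-sensitivity tests can be exercised without a grammar by mimicking the node shape with stubs. StubNode is illustrative; the byte-slicing logic matches compute_signature_hash above:

```python
import hashlib

class StubNode:  # mimics tree_sitter.Node's type/start_byte/end_byte/children
    def __init__(self, type_, start_byte, end_byte, children=()):
        self.type = type_
        self.start_byte = start_byte
        self.end_byte = end_byte
        self.children = list(children)

def sig_hash(src: bytes) -> str:
    """Build a function_item stub whose block child starts at the '{'."""
    brace = src.index(b"{")
    node = StubNode("function_item", 0, len(src),
                    [StubNode("block", brace, len(src))])
    body = next(c for c in node.children if c.type == "block")
    # Hash only the bytes before the body child, as compute_signature_hash does.
    return hashlib.sha256(src[node.start_byte:body.start_byte]).hexdigest()[:16]

same_sig_a = sig_hash(b"fn add(a: i32, b: i32) -> i32 { a + b }")
same_sig_b = sig_hash(b"fn add(a: i32, b: i32) -> i32 { b + a }")  # body edit
new_params = sig_hash(b"fn add(a: i64, b: i32) -> i32 { a + b }")  # param edit
```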

Subsection 06.3 close-out

/improve-tooling retrospective: BODY_CHILD_TYPES expanded per TPR iteration 2 (codex ran parser probes). match (Haskell) and funbody (Koka) verified. Body-independence test passes: changing Rust function body does NOT change hash, changing params DOES. Struct field changes correctly affect hash (full-node for non-functions). Retrospective 06.3: no tooling gaps.


06.4 Extraction Pipeline

File: ~/projects/lang_intelligence/neo4j/extract_symbols.py

Contract:

Usage: python3 neo4j/extract_symbols.py <repo_name> [--output symbols.jsonl]
Consumes: parser_adapter.parse_repo() -> Iterator[ParseResult]
  (ParseResult includes: source_bytes, tree, query_handles, had_error,
   error_node_count, coverage_status, content_hash, relative_path, language_id)
Outputs: JSONL with one record per symbol or relationship

Note: This script does NOT read repos.yaml, languages.yaml, or query .scm files directly. All grammar loading, file walking, query compilation, and error handling is the responsibility of Section 05’s parser_adapter.py. This script operates on ParseResult objects — it extracts symbols from pre-parsed trees using pre-compiled query handles.

Ori is NOT in scope for this section. parser_adapter.py skips coverage_status: "custom" languages (line 348), so parse_repo("ori", manifests) yields zero ParseResult objects. Ori extraction is Section 09.3’s responsibility — it uses Ori’s own Rust parser via CLI, not tree-sitter. This script handles the 9 tree-sitter-parseable languages only (the 10 reference repos minus Ori, noting Lean only parses C++ runtime code).

Symbol record format:

{"type": "symbol", "repo": "rust", "file": "compiler/rustc_parse/src/parser/expr.rs",
 "name": "parse_expr", "qualified_name": "rustc_parse::parser::expr::parse_expr",
 "kind": "function", "language": "rust", "language_kind": "function_item",
 "line": 42, "end_line": 120, "visibility": "pub",
 "signature_hash": "a1b2c3d4e5f6g7h8",
 "had_error": false, "coverage_status": "full"}

Relationship record format:

{"type": "relationship", "kind": "CALLS",
 "source_qualified_name": "rustc_parse::parser::expr::parse_expr",
 "target_identifier": "parse_item",
 "repo": "rust", "file": "compiler/rustc_parse/src/parser/expr.rs", "line": 67}

Critical: CALLS/IMPORTS targets are UNRESOLVED. Tree-sitter captures bare identifiers (e.g., parse_item) or scoped identifiers (e.g., self::parse_item), not fully qualified names. The relationship record stores target_identifier as the raw captured text. Resolution to a qualified_name is NOT this script’s responsibility.

Downstream resolution contract: Section 07 (Neo4j Import Pipeline) must handle unresolved targets by MERGE-ing a stub Symbol node (or UnresolvedSymbol node) keyed by (repo, target_identifier) as the edge target. The relationship edge is created pointing at this stub. Section 08 (Issue-to-Code Bridge) or a future resolution pass can later merge the stub with the actual resolved Symbol node once cross-file analysis identifies the target. This ensures Section 07 can import ALL relationship records without dropping edges that lack resolved targets.
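A hypothetical shape for that stub merge, written as the Cypher string Section 07 might send via the Python neo4j driver. The labels and property names here are assumptions, not the final Section 07 schema:

```python
# Assumed labels/properties (Symbol, UnresolvedSymbol); Section 07 owns the
# real schema. The edge is MERGEd against the stub so no relationship records
# are dropped for unresolved targets.
STUB_MERGE_CALLS = """
MERGE (src:Symbol {repo: $repo, qualified_name: $source_qualified_name})
MERGE (tgt:UnresolvedSymbol {repo: $repo, target_identifier: $target_identifier})
MERGE (src)-[:CALLS {file: $file, line: $line}]->(tgt)
"""
```

A later resolution pass can re-point these edges at the real Symbol node and delete the stub once cross-file analysis identifies the target.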

Section 07 action items (to be addressed when Section 07 is reviewed/implemented):

  1. Add stub-merge strategy for unresolved relationship targets — MERGE stub nodes keyed by (repo, target_identifier) before creating edges
  2. Update Symbol node property list to include ALL fields from the JSONL format: end_line, visibility, name, language_kind, had_error, coverage_status (currently only lists kind, language, qualified_name, line, signature_hash)
  3. Define the follow-up merge path when a stub target later resolves to a qualified symbol

Data-driven extraction loop (single function for all languages):

from tree_sitter import QueryCursor

def extract_from_parse_result(result: ParseResult) -> Iterator[dict]:
    module_prefix = derive_module_prefix(result.language_id, result.relative_path)

    # 1. Extract declarations from "decls" query family
    decl_query = result.query_handles.get("decls")
    if decl_query:
        cursor = QueryCursor(decl_query)
        # matches() groups sub-captures by match, so @name is paired with
        # its parent @definition.* in each match dict
        for _pattern_idx, match_captures in cursor.matches(result.tree.root_node):
            # match_captures: dict[str, list[Node]]
            for capture_name, nodes in match_captures.items():
                if capture_name == "name" or capture_name == "type":
                    continue  # sub-captures, processed with their parent
                kind = CAPTURE_TO_KIND.get(capture_name)
                if kind is None:
                    continue
                name_nodes = match_captures.get("name", [])
                # ... derive symbol name from name_nodes, apply refinements, emit record

    # 2. Extract relationships from "calls", "imports", "impls" query families
    for family in ("calls", "imports", "impls"):
        query = result.query_handles.get(family)
        if query:
            cursor = QueryCursor(query)
            for _pattern_idx, match_captures in cursor.matches(result.tree.root_node):
                # Each match groups the parent capture with its @name sub-capture
                # ... map, emit relationship record
  • Implement extract_from_parse_result(result: ParseResult) -> Iterator[dict] as a single data-driven function that processes ALL languages via the CAPTURE_TO_KIND registry — NO per-language if/elif branches
  • Use tree-sitter’s QueryCursor(query).matches(node) API — NOT .captures(). matches() yields (pattern_index, match_captures_dict) tuples where match_captures_dict is dict[str, list[Node]] keyed by capture name. This groups the @name sub-capture with its parent @definition.* or @reference.* capture in each match. captures() returns a flat dict of all captures grouped by name — usable for counting but NOT for associating sub-captures with their parents. The existing validate-parsers.py uses captures() for counting; extraction needs matches() for grouping.
  • For @definition.* matches: match_captures["name"][0] gives the symbol name node. The outer node (e.g., match_captures["definition.function"][0]) gives the full node for line, end_line, signature_hash
  • For @reference.* matches: match_captures["name"][0] gives the target identifier text. The outer node gives the source location (line). For Rust IMPLEMENTS, use match_captures.get("type") (safe access — inherent impls lack @type)
  • Deduplicate Rust IMPLEMENTS matches: when two matches share the same impl_item node (start_byte), keep only the trait-impl match (has @type) and discard the type-only match
  • Determine visibility by checking for pub / export / visibility modifiers on the parent/sibling nodes. Use "pub" if found, "" otherwise. This is best-effort — correctness is NOT critical for the intelligence graph
  • Output JSONL (streaming, one record per line, flushed immediately) for memory efficiency on large repos
  • Include had_error and coverage_status from ParseResult in every symbol record — downstream consumers need this to assess trustworthiness
  • Handle query_handles entries that are None (stub queries that matched no patterns) — skip silently, do not error
  • CLI: argparse with repo_name positional arg, --output optional (default: stdout), --stats flag to print summary counts
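The streaming JSONL output and --stats summary can be sketched as follows. Record shapes follow the formats above; the writer helper itself is an assumption:

```python
import json
import sys
from collections import Counter

def write_jsonl(records, out=sys.stdout):
    """Stream records one per line, flushing immediately; return summary counts."""
    counts = Counter()
    for record in records:
        out.write(json.dumps(record, sort_keys=True) + "\n")
        out.flush()  # keep memory flat on large repos
        counts[record.get("type", "unknown")] += 1
    return dict(counts)
```

A main() would wire this up after argparse resolves --output to a file handle (default stdout) and print the returned counts when --stats is set.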

Haskell @reference.implementation edge case

Haskell’s impls.scm captures (instance) @reference.implementation with NO @name sub-capture. The instance declaration node text contains both the class name and the type, e.g., instance Show Foo where. The extractor must parse the node text to extract the trait name and type name, or emit the full instance text as target_identifier and let Section 08 resolve it.

  • Handle Haskell @reference.implementation captures that lack @name: extract class name and type from the instance node’s children ((class_name) and (instance_type) if available), or use full node text as fallback

Subsection 06.4 close-out

/improve-tooling retrospective: Extraction accurate on Gleam repo (188 files, 6885 symbols, 29351 relationships in 1.3s). Rust impl dedup works correctly (tested via impl Foo for Bar {} test). Haskell instance fallback uses full node text. No false positives detected in initial run. Data-driven loop handles all languages except 3 refinement points (Go interface child inspection, Haskell signature dedup, Rust impl dedup). Retrospective 06.4: no tooling gaps.


06.5 Scope Filtering & Parse Metadata

File: ~/projects/lang_intelligence/neo4j/extract_symbols.py (filter pass)

Problem: Several query files capture local-scope symbols that should NOT appear in the structural graph:

  • TypeScript decls.scm: (lexical_declaration ...) captures const/let inside function bodies
  • Go decls.scm: (var_declaration ...) captures local variables
  • Zig decls.scm: (variable_declaration ...) captures local var/const

The success criterion “No expression or statement-level nodes — structural only” requires filtering these out.

Scope filtering strategy:

A symbol captured by @definition.variable is LOCAL if its tree-sitter node is nested inside a function/method body node. Walk parent nodes upward: if any ancestor is a body node (block, statement_block, function_body, compound_statement, block_expression), the variable is local and should be excluded.

Captures that are ALWAYS structural regardless of nesting:

  • @definition.function, @definition.method — functions are always structural
  • @definition.class, @definition.interface — types are always structural
  • @definition.module — modules are always structural
  • @definition.macro — macros are always structural

Captures that need scope checking:

  • @definition.variable — local unless at module/namespace/file top level

  • @definition.constant — same (though most are top-level in practice)

  • @definition.field — always structural (fields are type members, not local)

  • Implement is_local_scope(node) -> bool that walks parent nodes looking for function body ancestors

  • Apply scope filter to @definition.variable and @definition.constant captures — exclude local-scope instances

  • Do NOT filter @definition.field (Zig container fields are always structural)

  • Emit a --include-locals CLI flag for debugging that disables the scope filter

  • Log filtered-out count per file at debug level for diagnostics
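The ancestor walk can be sketched with a stand-in Node exposing `.parent`/`.type` the way tree-sitter does:

```python
BODY_ANCESTOR_TYPES = {"block", "statement_block", "function_body",
                       "compound_statement", "block_expression"}

class Node:  # stand-in for tree_sitter.Node (.parent and .type only)
    def __init__(self, type_, parent=None):
        self.type = type_
        self.parent = parent

def is_local_scope(node) -> bool:
    """True if any ancestor is a function-body node."""
    current = node.parent
    while current is not None:
        if current.type in BODY_ANCESTOR_TYPES:
            return True
        current = current.parent
    return False

module = Node("source_file")
fn_body = Node("block", parent=Node("function_declaration", parent=module))
local_var = Node("var_declaration", parent=fn_body)
top_var = Node("var_declaration", parent=module)
```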

Parse metadata propagation

ParseResult carries had_error, error_node_count, and coverage_status that downstream consumers need to assess whether extracted symbols are trustworthy.

  • Include had_error: bool in every symbol JSONL record (copied from the originating ParseResult)
  • Include coverage_status: str in every symbol JSONL record
  • When had_error is true, log a warning with the file path and error_node_count — symbols from error trees may be incomplete or mislocated
  • Do NOT skip files with errors — extract what tree-sitter recovered, but mark the records

Subsection 06.5 close-out

/improve-tooling retrospective: Scope filtering tested on Go (local var x int inside function filtered correctly) and Zig (container_field always structural). Parse metadata (had_error, coverage_status) included in all symbol records. --include-locals flag available for debugging. Retrospective 06.5: no tooling gaps.


06.6 Cross-Language Normalization Tests

File: ~/projects/lang_intelligence/tests/test_extract_symbols.py

Each test uses a small inline source snippet, writes it to a temporary file (via tempfile.NamedTemporaryFile with the correct extension for the language), parses it via parse_file() from parser_adapter.py, feeds the ParseResult to extract_from_parse_result(), and asserts on the output records.

Test fixture helper (required — parse_file() reads from disk, it does NOT accept raw source text):

def parse_snippet(lang_id: str, source: str, manifests: Manifests) -> ParseResult:
    """Parse an inline source snippet by writing to a temp file."""
    lang_config = manifests.languages[lang_id]
    ext = lang_config["extensions"][0]  # e.g., ".rs", ".go"
    with tempfile.NamedTemporaryFile(suffix=ext, mode="w", delete=False) as f:
        f.write(source)
        f.flush()
        return parse_file(
            repo_config={"repo_id": "test", "source_root": str(Path(f.name).parent)},
            lang_id=lang_id,
            lang_config=lang_config,
            file_path=Path(f.name),
            source_root=Path(f.name).parent,
        )
  • Implement parse_snippet() test helper in test_extract_symbols.py for inline source fixture tests

Normalization correctness

  • Test: Rust trait Foo { ... } produces kind: "trait_like", language_kind: "trait_item"
  • Test: Go type Foo interface { ... } produces kind: "trait_like" via @definition.class with Go-specific refinement (inspects type_spec child for interface_type)
  • Test: TypeScript interface Foo { ... } produces kind: "trait_like", language_kind: "interface_declaration"
  • Test: Swift protocol Foo { ... } produces kind: "trait_like", language_kind: "protocol_declaration"
  • Test: Rust enum Color { ... } produces kind: "sum_type" via node-type refinement
  • Test: Rust type Alias = OtherType; produces kind: "type_alias" via node-type refinement
  • Test: TypeScript enum Direction { ... } produces kind: "sum_type" via node-type refinement
  • Test: TypeScript type Alias = string | number; produces kind: "type_alias" via node-type refinement

Relationship extraction

  • Test: Function calls in Rust, Go, and TypeScript all produce kind: "CALLS" relationship records
  • Test: Import statements in Rust, Go, and TypeScript all produce kind: "IMPORTS" relationship records
  • Test: Rust impl Foo for Bar { ... } produces kind: "IMPLEMENTS" relationship with both trait (Foo) and implementing type (Bar) captured, after dedup (only one record, not two)
  • Test: TypeScript class Bar implements Foo { ... } produces kind: "IMPLEMENTS" relationship
  • Test: CALLS target_identifier is the raw captured text (bare identifier or scoped identifier), NOT a qualified name

Qualified name

  • Test: qualified_name is deterministic — same file parsed twice yields identical qualified_names
  • Test: Rust src/parser/expr.rs:parse_expr -> qualified_name contains parser::expr::parse_expr
  • Test: Go function in package types -> qualified_name starts with types.

Signature hash

  • Test: Changing a function body does NOT change signature_hash
  • Test: Changing a function’s parameter list DOES change signature_hash
  • Test: Struct/enum full node changes DO change signature_hash

Scope filtering

  • Test: TypeScript const x = 1 inside a function body is filtered out (not emitted)
  • Test: TypeScript const X = 1 at module top-level is NOT filtered (emitted as kind: "variable" — TypeScript const/let both produce @definition.variable, not @definition.constant)
  • Test: Go var x int inside a function body is filtered out
  • Test: Zig container_field inside a struct is NOT filtered (it’s @definition.field, always structural)

Parse metadata

  • Test: Symbol records from error-containing files include had_error: true
  • Test: Symbol records include coverage_status matching the language config

Subsection 06.6 close-out

/improve-tooling retrospective: 24 tests covering normalization (6), relationships (4), qualified-name (3), signature-hash (3), scope-filtering (2), parse-metadata (3), registry-coverage (3). All pass in 0.10s. Tests use parse_snippet() temp-file helper. Deterministic qualified-name test uses same temp file for both parses. Retrospective 06.6: no tooling gaps.


06.R Third Party Review Findings

  • [TPR-06-001-codex][high] section-06-symbol-extraction.md:294 — Correct the QueryCursor capture contract. Resolved: Fixed on 2026-04-13. Rewrote extraction loop to use QueryCursor.matches() and corrected the API documentation in checklist items.
  • [TPR-06-002-codex][high] section-06-symbol-extraction.md:268 — Add a resolution stage for unresolved relationship targets. Resolved: Fixed on 2026-04-13. Added downstream resolution contract specifying Section 07 must MERGE stub Symbol nodes for unresolved targets.
  • [TPR-06-003-codex][medium] section-06-symbol-extraction.md:295 — Specify how IMPLEMENTS records recover the implementing type. Resolved: Fixed on 2026-04-13. Added semantic vs helper capture contract, documented Rust IMPLEMENTS @name/@type dual-capture pattern, added checklist item.
  • [TPR-06-004-codex][medium] section-06-symbol-extraction.md:364 — Rewrite the tests around an executable parse helper. Resolved: Fixed on 2026-04-13. Added parse_snippet() temp-file fixture helper with implementation and checklist item.
  • [TPR-06-005-codex][medium] section-06-symbol-extraction.md:148 — Add Lean and fix the inconsistent Koka qualified-name rules. Resolved: Fixed on 2026-04-13. Added Lean row (shares C++ strategy), fixed Koka row (.kk extension, / separator).
  • [TPR-06-001-gemini][high] section-06-symbol-extraction.md:282 — Use dictionary get without string modification for capture names. Resolved: Fixed on 2026-04-13. Removed erroneous lstrip("definition.")CAPTURE_TO_KIND keys are already fully prefixed. Same fix as [TPR-06-001-codex] (convergent).
  • [TPR-06-002-gemini][high] section-06-symbol-extraction.md:294 — Use QueryCursor matches to group sub-captures. Resolved: Fixed on 2026-04-13. Same fix as [TPR-06-001-codex] — rewrote to use .matches().
  • [TPR-06-003-gemini][medium] section-06-symbol-extraction.md:206 — Support expression bodies for Haskell and Koka signature hashes. Resolved: Fixed on 2026-04-13. Expanded BODY_CHILD_TYPES to include Haskell expression nodes and Koka bodyexpr.
  • [TPR-06-004-gemini][medium] section-06-symbol-extraction.md:268 — Clarify Neo4j target resolution strategy for CALLS and IMPORTS. Resolved: Fixed on 2026-04-13. Same fix as [TPR-06-002-codex] — added stub-merge contract for Section 07.
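The shape of the corrected extraction loop from the two high-severity findings (plain dictionary get, match-grouped captures) can be sketched as below. In recent tree-sitter Python bindings, QueryCursor.matches() yields (pattern_index, captures) pairs where captures maps capture names to node lists, so @name/@type sub-captures stay grouped with their @definition.* parent, which a flat .captures() stream cannot guarantee. The registry entries and record fields here are an illustrative subset.

```python
# Illustrative subset of the registry; keys are fully prefixed capture
# names, so a plain dict get suffices -- no lstrip("definition.").
CAPTURE_TO_KIND = {
    "definition.function": "function",
    "definition.class": "class",
}

def records_from_matches(matches):
    """Build records from match-grouped captures (sketch)."""
    records = []
    for _pattern_index, caps in matches:
        for cap_name, nodes in caps.items():
            kind = CAPTURE_TO_KIND.get(cap_name)
            if kind is None:
                continue  # helper captures (@name, @type, _-prefixed) never emit alone
            name_nodes = caps.get("name", [])
            type_nodes = caps.get("type", [])  # safe access: @type may be absent
            records.append({
                "kind": kind,
                "name": name_nodes[0] if name_nodes else None,
                "implements_type": type_nodes[0] if type_nodes else None,
            })
    return records
```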

Iteration 2 findings:

  • [TPR-06-001-codex][high] section-07-code-import.md:73 — Add unresolved-target stub merge contract to Section 07. Resolved: Noted as Section 07 action item in Section 06’s downstream contract on 2026-04-13. Cross-section: Section 07’s review will address.
  • [TPR-06-002-codex][high] section-06-symbol-extraction.md:216 — Expand BODY_CHILD_TYPES to real Haskell/Koka wrappers. Resolved: Fixed on 2026-04-13. Replaced Haskell expression nodes with match, Koka bodyexpr with funbody (verified via parser probes).
  • [TPR-06-003-codex][medium] section-06-symbol-extraction.md:90 — Deduplicate Rust trait impl matches. Resolved: Fixed on 2026-04-13. Added dedup rule: same impl_item with overlapping patterns → keep trait-impl match (has @type), discard type-only.
  • [TPR-06-004-codex][medium] section-06-symbol-extraction.md:121 — Use typescript not ts in refinement table. Resolved: Fixed on 2026-04-13. Replaced ts with typescript to match languages.yaml language_id.
  • [TPR-06-001-gemini][medium] section-06-symbol-extraction.md:322 — Use safe access for @type sub-captures. Resolved: Fixed on 2026-04-13. Changed to match_captures.get("type") for safe access.
  • [TPR-06-002-gemini][high] section-07-code-import.md:74 — Add stub-merge strategy to Section 07. Resolved: Same as [TPR-06-001-codex] — noted as Section 07 action item.
  • [TPR-06-003-gemini][medium] section-07-code-import.md:73 — Include all extracted properties in Section 07 Symbol nodes. Resolved: Noted as Section 07 action item on 2026-04-13.

Iteration 3 findings:

  • [TPR-06-001-codex][medium] section-06-symbol-extraction.md:143 — Sync Go package-clause move with Section 05 golden baseline. Resolved: Fixed on 2026-04-13. Added checklist items for golden-probes.yaml and Section 05 documentation updates.
  • [TPR-06-002-codex][medium] section-06-symbol-extraction.md:455 — TypeScript top-level const test expects wrong kind. Resolved: Fixed on 2026-04-13. Changed expected kind from “const” to “variable” (TS uses @definition.variable for both const/let).
  • [TPR-06-003-codex][medium] section-06-symbol-extraction.md:436 — Use valid Rust trait-impl syntax in test. Resolved: Fixed on 2026-04-13. Changed impl Bar: Foo to impl Foo for Bar (valid Rust syntax).

06.N Completion Checklist

  • extract_symbols.py produces JSONL for all 9 supported languages via a single data-driven pipeline
  • CAPTURE_TO_KIND registry covers all 12 semantic capture names from the .scm files, plus @name/@type sub-capture handling and _-prefix ignore rule
  • NODE_TYPE_REFINEMENTS handles the 8 node-type refinement cases (sum_type, type_alias, Go interface, Haskell signature dedup)
  • Rust IMPLEMENTS extraction handles both @name (trait) and @type (implementing type) captures
  • Symbol kinds normalized across languages per the capture-to-kind registry
  • CALLS, IMPORTS, IMPLEMENTS relationships extracted with unresolved target identifiers
  • qualified_name derived from filesystem path + AST nesting, deterministic
  • signature_hash is body-independent (hashes signature only, not function body) — BODY_CHILD_TYPES covers all 9 languages including Haskell/Koka expression bodies
  • Local-scope symbols filtered out (@definition.variable inside function bodies)
  • parse_snippet() test helper implemented for inline source fixture tests
  • Extraction loop uses QueryCursor.matches() (NOT .captures()) to group sub-captures with their parents
  • Unresolved target contract documented — Section 07 needs stub-merge strategy for edge targets
  • Parse metadata (had_error, coverage_status) propagated to output records
  • Go package_clause moved from imports.scm to decls.scm
  • Haskell instance capture handled (no @name sub-capture)
  • end_line included in symbol records (required by Section 08 for code reference resolution)
  • Normalization tests pass (all tests in test_extract_symbols.py)
  • timeout 150 ./test-all.sh green — no regressions from integration changes
  • Frontmatter status updated to complete
  • /tpr-review clean
  • /impl-hygiene-review clean
  • /improve-tooling section-close sweep