06 Code Graph: Symbol Extraction
06.0 Goal
Transform tree-sitter ASTs into normalized symbol and relationship records. This is the bridge between raw parsing (Section 05) and Neo4j import (Section 07). The output is language-neutral JSONL — the same format regardless of whether the input is Rust, Go, or Haskell.
Key architectural decision: data-driven, not per-language. The .scm query files from Section 05 already standardize capture names (@definition.function, @definition.class, @definition.interface, etc.) across all 9 languages. The Python extraction layer must map those capture names to normalized kinds via a registry table — NOT via 9 bespoke per-language extractor functions. Per-language functions are a LEAK: the .scm files are the canonical source of “what constitutes a declaration in language X,” and the Python layer duplicating that knowledge creates a shadow home that will drift.
Downstream consumers and their contracts:
- Section 07 (Import Pipeline): Consumes JSONL records. Needs qualified_name, kind, line, end_line, signature_hash, visibility. Section 07's schema creates Symbol nodes keyed by (repo, qualified_name).
- Section 08 (Issue-to-Code Bridge): Matches issue text to Symbol.qualified_name and File.path. Needs line ranges (line, end_line) for code reference resolution.
- Section 09 (Ori Live Sync): Uses signature_hash + content_hash for incremental diffing. The signature_hash MUST be body-independent or every function body edit triggers a full re-sync.
06.1 Capture-to-Kind Registry
File: ~/projects/lang_intelligence/neo4j/extract_symbols.py (top-level constant)
The .scm query files use these capture names (verified by reading all 32 query files):
| Capture Name | Languages Using It | Example Node Types |
|---|---|---|
| @definition.function | all 9 | function_item (Rust), function_declaration (Go/TS/Swift), fundecl (Koka) |
| @definition.class | rust, go, ts, swift, haskell, cpp, koka | struct_item/enum_item (Rust), type_declaration (Go), class_declaration (TS/Swift), data_type (Haskell) |
| @definition.interface | rust, ts, swift | trait_item (Rust), interface_declaration (TS), protocol_declaration (Swift) |
| @definition.method | go, ts | method_declaration (Go), method_definition (TS) |
| @definition.constant | rust, go | const_item/static_item (Rust), const_declaration (Go) |
| @definition.variable | go, ts, swift, zig | var_declaration (Go), lexical_declaration (TS), property_declaration (Swift), variable_declaration (Zig) |
| @definition.module | rust, go, cpp | mod_item (Rust), package_clause (Go, in imports.scm!), namespace_definition (C++) |
| @definition.field | zig | container_field (Zig) |
| @definition.macro | rust | macro_definition (Rust) |
| @reference.call | all 9 | call_expression (most), macro_invocation (Rust), apply (Haskell), opexpr (Koka) |
| @reference.import | all 9 | use_declaration (Rust), import_declaration (Go), import_statement (TS), etc. |
| @reference.implementation | rust, ts, swift, haskell | impl_item (Rust), class_declaration+implements_clause (TS), instance (Haskell) |
Semantic vs helper captures: Query files contain both semantic captures (mapped to normalized kinds via CAPTURE_TO_KIND) and helper captures used for sub-node extraction:
- @name — symbol/target name (semantic: consumed by extractor)
- @type — implementing type in IMPLEMENTS relationships (semantic: consumed by extractor for Rust impls)
- @_fn, @_prefix — internal query anchors (helper: ignored by extractor, prefixed with _)
The extractor MUST consume @name and @type from match results. It MUST ignore any capture name starting with _ (helper captures). All other non-CAPTURE_TO_KIND captures should be logged at debug level as unrecognized.
Rust IMPLEMENTS captures: Rust’s impls.scm has TWO patterns — one captures trait: @name + type: @type (impl with trait), the other captures only type: @name (inherent impl). The extractor must handle both: for trait impls, @name = trait name, @type = implementing type (use match_captures.get("type") — safe access, since inherent impls lack @type); for inherent impls, @name = implementing type, no trait.
Rust impl dedup rule: Both patterns can match the SAME impl_item node. For impl Foo for Bar {}, the trait-impl pattern matches with name=Foo, type=Bar, AND the type-only pattern matches with name=Bar. The extractor MUST deduplicate: if matches() yields two matches for the same impl_item node (same start_byte), and one has a @type capture while the other does not, keep only the trait-impl match (the one with @type) and discard the type-only match. This prevents spurious inherent-impl records for every trait impl.
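The dedup rule above can be sketched as follows — a minimal sketch assuming matches arrive as (start_byte, captures) pairs with captured text already decoded; the pair shape and the dedup_impl_matches name are illustrative, not the actual implementation:

```python
def dedup_impl_matches(matches):
    """Keep one match per impl_item node: prefer the trait-impl match.

    `matches` is a list of (start_byte, captures) pairs, where `captures`
    maps capture names ("name", "type", ...) to captured text. A trait-impl
    match carries a "type" capture; an inherent-impl match does not.
    """
    by_node = {}  # impl_item start_byte -> best captures seen so far
    for start_byte, captures in matches:
        existing = by_node.get(start_byte)
        if existing is None or ("type" in captures and "type" not in existing):
            by_node[start_byte] = captures
    return list(by_node.values())

# For `impl Foo for Bar {}` both patterns match the same impl_item node,
# so only the trait-impl match survives:
matches = [(10, {"name": "Foo", "type": "Bar"}),  # trait-impl pattern
           (10, {"name": "Bar"})]                 # type-only pattern
assert dedup_impl_matches(matches) == [{"name": "Foo", "type": "Bar"}]
```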
Capture name to normalized kind mapping:
```python
CAPTURE_TO_KIND = {
    # Declarations
    "definition.function": "function",
    "definition.method": "method",
    "definition.class": "type",            # struct, enum, class, data — NOT OOP "class"
    "definition.interface": "trait_like",  # trait, interface, protocol, typeclass
    "definition.constant": "const",
    "definition.variable": "variable",     # filtered in scope pass unless top-level
    "definition.module": "module",
    "definition.field": "field",
    "definition.macro": "macro",
    # References
    "reference.call": "CALLS",
    "reference.import": "IMPORTS",
    "reference.implementation": "IMPLEMENTS",
}
```
Critical: Rust maps BOTH struct_item AND enum_item to @definition.class. The plan’s original kind table had separate type vs sum_type kinds, but the .scm files do not distinguish them — both are @definition.class. To recover sum_type, the extractor would need to check the tree-sitter node type (enum_item vs struct_item). This is the ONE place where the node type matters beyond the capture name.
Node-type refinement table (applied after capture-to-kind mapping):
| Language | Capture | Node Type | Refined Kind |
|---|---|---|---|
| rust | @definition.class | enum_item | sum_type |
| rust | @definition.class | type_item | type_alias |
| typescript | @definition.class | type_alias_declaration | type_alias |
| typescript | @definition.class | enum_declaration | sum_type |
| swift | @definition.class | typealias_declaration | type_alias |
| haskell | @definition.class | type_synonym | type_alias |
| go | @definition.class | type_spec with child interface_type | trait_like |
| haskell | @definition.function | signature | skip (type sig, not a definition — deduplicate with the function node) |
Go interface detection: Go’s decls.scm captures all type_declaration nodes as @definition.class. To distinguish type Foo interface { ... } from type Bar struct { ... }, the extractor must inspect the type_spec child: if it contains an interface_type child, refine to trait_like. This requires walking one level deeper than the capture node.
Haskell signature deduplication: Haskell’s decls.scm captures both (function ...) and (signature ...) as @definition.function. A function with a type signature produces TWO captures for the same name. The extractor must deduplicate: if both a function and a signature capture exist for the same name in the same file, keep only the function capture (it has the body, line range, and is the actual definition).
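The refinement pass described above can be sketched in one helper. The Node dataclass is a stand-in for tree-sitter nodes and refine_kind is illustrative; note the Haskell rule is simplified here to skipping signature nodes outright, whereas the full rule deduplicates by name within a file:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:  # stand-in for a tree-sitter Node
    type: str
    children: list = field(default_factory=list)

NODE_TYPE_REFINEMENTS = {
    ("rust", "definition.class", "enum_item"): "sum_type",
    ("rust", "definition.class", "type_item"): "type_alias",
    ("typescript", "definition.class", "type_alias_declaration"): "type_alias",
    ("typescript", "definition.class", "enum_declaration"): "sum_type",
    ("swift", "definition.class", "typealias_declaration"): "type_alias",
    ("haskell", "definition.class", "type_synonym"): "type_alias",
}

def refine_kind(language_id: str, capture_name: str, node: Node,
                base_kind: str) -> Optional[str]:
    """Apply node-type refinements; return None for captures to skip entirely."""
    # Haskell: a bare type signature is not a definition
    if (language_id == "haskell" and capture_name == "definition.function"
            and node.type == "signature"):
        return None
    # Go: a type_declaration wrapping an interface_type is trait_like, not "type"
    if language_id == "go" and capture_name == "definition.class":
        for spec in node.children:
            if spec.type == "type_spec" and any(
                    c.type == "interface_type" for c in spec.children):
                return "trait_like"
    return NODE_TYPE_REFINEMENTS.get((language_id, capture_name, node.type), base_kind)
```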
- Define CAPTURE_TO_KIND registry as a module-level dict in extract_symbols.py
- Define NODE_TYPE_REFINEMENTS as a (language_id, capture_name, node_type) -> refined_kind table for the 8 cases above (6 original + Go interface + Haskell signature dedup)
- Preserve language_kind as the raw tree-sitter node type name (e.g., function_item, struct_item, enum_item) for downstream consumers that need language-specific detail
- Verify the registry covers all 12 semantic capture names found in the .scm files (9 @definition.* + 3 @reference.*). Additionally, define the extractor's handling of the 2 sub-captures (@name, @type) and document that _-prefixed captures (e.g., @_fn) are ignored
Go @definition.module in imports.scm — NOT decls.scm
Go’s package_clause capture (@definition.module) lives in queries/go/imports.scm, not queries/go/decls.scm. The extractor MUST check BOTH query_handles["decls"] AND query_handles["imports"] for @definition.* captures. Alternatively, move the package_clause pattern from imports.scm to decls.scm in Go’s query files (the correct fix — package declarations are declarations, not imports).
- Move (package_clause (package_identifier) @name) @definition.module from queries/go/imports.scm to queries/go/decls.scm — this is a declaration, not an import
- Verify parse_repo("go", manifests) still works after the move (query_handles keys unchanged since the family "decls" already exists)
- Update tests/golden-probes.yaml Go baseline: imports count drops by 1 (package_clause moves to decls), decls count rises by 1
- Update any Section 05 documentation that references Go capture counts (the prose in section-05 mentions Go imports=2)
Subsection 06.1 close-out
/improve-tooling retrospective: Registry covered all 12 captures across 9 languages. Go package_clause successfully moved from imports.scm to decls.scm. Golden probes updated. No captures fell through to unknown. Refinement table handles 6 node-type cases + Go interface (child inspection) + Haskell signature dedup. Retrospective 06.1: no tooling gaps.
06.2 Qualified Name Derivation
File: ~/projects/lang_intelligence/neo4j/extract_symbols.py (helper function)
Tree-sitter is a single-file parser — it has NO concept of module paths. A function parse_expr in file compiler/rustc_parse/src/parser/expr.rs must be assigned the qualified name rustc_parse::parser::expr::parse_expr, but tree-sitter sees only the function name.
Derivation strategy per language:
| Language | Module Path Source | Separator | Example |
|---|---|---|---|
| Rust | File path: strip src/, replace / with ::, strip extension. Crate name from top-level dir. Nested mod_item nodes add segments. | :: | compiler/rustc_parse/src/parser/expr.rs:parse_expr -> rustc_parse::parser::expr::parse_expr |
| Go | package_clause node (from imports.scm or decls.scm after move). Import path from directory structure. | . | src/go/types/errors.go:Checker.err -> go.types.Checker.err |
| TypeScript | File path relative to project root, replace / with ., strip extension | . | src/compiler/checker.ts:checkExpression -> compiler.checker.checkExpression |
| Haskell | module declaration in file, or fallback to file path. | . | compiler/src/Reporting/Error/Type.hs:toReport -> Reporting.Error.Type.toReport |
| Zig | File path: replace / with ., strip extension | . | src/Sema.zig:resolveType -> Sema.resolveType |
| Swift | File path relative to source root, no src/ stripping | . | lib/SILOptimizer/ARC/ARCSequenceOpts.cpp:optimize -> SILOptimizer.ARC.ARCSequenceOpts.optimize |
| C++ | Namespace from namespace_definition nodes + class from class_specifier nodes, fallback to file path | :: | src/runtime/object.cpp:lean::object::inc_ref -> lean::object::inc_ref |
| Koka | Module path from file path (.kk files); the path separator / is kept as the module separator | / | src/Type/Infer.kk:inferExpr -> Type/Infer.inferExpr |
| Lean | Shares C++ strategy (Lean repo only parses C++ runtime, not .lean files due to 86% error rate). Namespace + class from C++ nodes, fallback to file path | :: | src/runtime/object.cpp:lean::object::inc_ref -> lean::object::inc_ref |
Implementation approach:
- File-path-based module prefix (works for all languages): Convert ParseResult.relative_path into a module path using the language's separator and path conventions. This is the PRIMARY strategy.
- AST-based nesting (augments file path): Walk the tree-sitter tree to find enclosing mod_item (Rust), namespace_definition (C++), impl_item (Rust), class_declaration (TS/Swift) nodes. These add segments to the qualified name.
- Determinism requirement: The same file with the same content MUST produce the same qualified_name on every run. No randomness, no timestamp-dependent derivation.
- Implement derive_module_prefix(language_id: str, relative_path: str) -> str — file-path-based module prefix using per-language conventions (separator, path stripping rules)
- Implement derive_qualified_name(language_id: str, module_prefix: str, symbol_node, tree) -> str — walks parent nodes to find enclosing scopes (modules, classes, namespaces, impl blocks) and appends the symbol name
- Handle Rust mod.rs/lib.rs conventions: src/parser/mod.rs -> module parser (strip mod), src/lib.rs -> crate root
- Handle Rust impl blocks: method check inside impl TypeChecker gets qualified name ending in TypeChecker::check, not just check
- Handle Go package_clause: use the package name as the first segment, NOT the file name
- Handle C++ namespaces: extract namespace_identifier from enclosing namespace_definition nodes
- Test: determinism — parsing the same file twice produces identical qualified_names
Subsection 06.2 close-out
/improve-tooling retrospective: File-path derivation works for all 8 tested cases. Default handler strips common prefixes (src/, lib/, compiler/) recursively. Haskell path fix needed for compiler/src/ double-prefix. Tested against Gleam repo — 6885 symbols with correct qualified names. Retrospective 06.2: no tooling gaps.
06.3 Signature Hash (Body-Independent)
File: ~/projects/lang_intelligence/neo4j/extract_symbols.py (helper function)
Problem: Tree-sitter captures the entire function node including its body. Hashing the full captured text makes signature_hash body-dependent — every body edit changes the hash, which breaks Section 09’s incremental sync (every commit triggers a full re-sync of every function).
Solution: Hash only the signature portion of function/method nodes.
Strategy per node type:
| Node Type | Signature = | Body = (excluded from hash) |
|---|---|---|
| function_item (Rust) | Everything before the block child | The block child |
| function_declaration (Go, TS, Swift) | Everything before the block/statement_block/function_body child | The block child |
| fundecl (Koka) | The identifier + type annotation children | The body |
| function (Haskell) | The variable name + patterns | The RHS expression |
| Non-function declarations | Full node text (structs, enums, traits change “signature” when fields change) | N/A |
Implementation:
```python
import hashlib

def compute_signature_hash(node, source_bytes: bytes) -> str:
    """Hash only the signature portion of a node, excluding the body.

    For function-like nodes, finds the body child (block, statement_block,
    function_body) and hashes everything before it. For non-function nodes,
    hashes the entire node text.
    """
    BODY_CHILD_TYPES = {
        "block", "statement_block", "function_body", "compound_statement",
        "match",    # Haskell: function body wrapper (verified via parser probe)
        "funbody",  # Koka: function body wrapper (verified via parser probe)
    }
    body_child = None
    for child in node.children:
        if child.type in BODY_CHILD_TYPES:
            body_child = child
            break
    if body_child is not None:
        sig_bytes = source_bytes[node.start_byte:body_child.start_byte]
    else:
        sig_bytes = source_bytes[node.start_byte:node.end_byte]
    return hashlib.sha256(sig_bytes).hexdigest()[:16]  # 16 hex chars = 64 bits
```
- Implement compute_signature_hash(node, source_bytes) as above
- Verify the BODY_CHILD_TYPES set against actual tree-sitter node type definitions for ALL languages. The initial set covers C-family blocks + Haskell expressions + Koka bodyexpr, but may need expansion for edge cases. When a body child type is not found, the function hashes the entire node — verify this doesn't happen for any language's function declarations by testing each language
- Test: changing a function body does NOT change its signature_hash
- Test: changing a function's parameter list DOES change its signature_hash
- Test: changing a struct's fields DOES change its signature_hash (full node hash for non-functions)
- Truncate to 16 hex chars (64-bit fingerprint) — sufficient for change detection, not cryptographic
Subsection 06.3 close-out
/improve-tooling retrospective: BODY_CHILD_TYPES expanded per TPR iteration 2 (codex ran parser probes). match (Haskell) and funbody (Koka) verified. Body-independence test passes: changing Rust function body does NOT change hash, changing params DOES. Struct field changes correctly affect hash (full-node for non-functions). Retrospective 06.3: no tooling gaps.
06.4 Extraction Pipeline
File: ~/projects/lang_intelligence/neo4j/extract_symbols.py
Contract:
Usage: python3 neo4j/extract_symbols.py <repo_name> [--output symbols.jsonl]
Consumes: parser_adapter.parse_repo() -> Iterator[ParseResult]
(ParseResult includes: source_bytes, tree, query_handles, had_error,
error_node_count, coverage_status, content_hash, relative_path, language_id)
Outputs: JSONL with one record per symbol or relationship
Note: This script does NOT read repos.yaml, languages.yaml, or query .scm files directly. All grammar loading, file walking, query compilation, and error handling is the responsibility of Section 05’s parser_adapter.py. This script operates on ParseResult objects — it extracts symbols from pre-parsed trees using pre-compiled query handles.
Ori is NOT in scope for this section. parser_adapter.py skips coverage_status: "custom" languages (line 348), so parse_repo("ori", manifests) yields zero ParseResult objects. Ori extraction is Section 09.3’s responsibility — it uses Ori’s own Rust parser via CLI, not tree-sitter. This script handles the 9 tree-sitter-parseable languages only (the 10 reference repos minus Ori, noting Lean only parses C++ runtime code).
Symbol record format:
{"type": "symbol", "repo": "rust", "file": "compiler/rustc_parse/src/parser/expr.rs",
"name": "parse_expr", "qualified_name": "rustc_parse::parser::expr::parse_expr",
"kind": "function", "language": "rust", "language_kind": "function_item",
"line": 42, "end_line": 120, "visibility": "pub",
"signature_hash": "a1b2c3d4e5f6g7h8",
"had_error": false, "coverage_status": "full"}
Relationship record format:
{"type": "relationship", "kind": "CALLS",
"source_qualified_name": "rustc_parse::parser::expr::parse_expr",
"target_identifier": "parse_item",
"repo": "rust", "file": "compiler/rustc_parse/src/parser/expr.rs", "line": 67}
Critical: CALLS/IMPORTS targets are UNRESOLVED. Tree-sitter captures bare identifiers (e.g., parse_item) or scoped identifiers (e.g., self::parse_item), not fully qualified names. The relationship record stores target_identifier as the raw captured text. Resolution to a qualified_name is NOT this script’s responsibility.
Downstream resolution contract: Section 07 (Neo4j Import Pipeline) must handle unresolved targets by MERGE-ing a stub Symbol node (or UnresolvedSymbol node) keyed by (repo, target_identifier) as the edge target. The relationship edge is created pointing at this stub. Section 08 (Issue-to-Code Bridge) or a future resolution pass can later merge the stub with the actual resolved Symbol node once cross-file analysis identifies the target. This ensures Section 07 can import ALL relationship records without dropping edges that lack resolved targets.
Section 07 action items (to be addressed when Section 07 is reviewed/implemented):
- Add stub-merge strategy for unresolved relationship targets — MERGE stub nodes keyed by (repo, target_identifier) before creating edges
- Update Symbol node property list to include ALL fields from the JSONL format: end_line, visibility, name, language_kind, had_error, coverage_status (currently only lists kind, language, qualified_name, line, signature_hash)
- Define the follow-up merge path when a stub target later resolves to a qualified symbol
Data-driven extraction loop (single function for all languages):
```python
from tree_sitter import QueryCursor

def extract_from_parse_result(result: ParseResult) -> Iterator[dict]:
    module_prefix = derive_module_prefix(result.language_id, result.relative_path)

    # 1. Extract declarations from the "decls" query family
    decl_query = result.query_handles.get("decls")
    if decl_query:
        cursor = QueryCursor(decl_query)
        # matches() groups sub-captures by match, so @name is paired with
        # its parent @definition.* in each match dict
        for _pattern_idx, match_captures in cursor.matches(result.tree.root_node):
            # match_captures: dict[str, list[Node]]
            for capture_name, nodes in match_captures.items():
                if capture_name in ("name", "type"):
                    continue  # sub-captures, processed with their parent
                kind = CAPTURE_TO_KIND.get(capture_name)
                if kind is None:
                    continue
                name_nodes = match_captures.get("name", [])
                # ... derive symbol name from name_nodes, apply refinements, emit record

    # 2. Extract relationships from the "calls", "imports", "impls" query families
    for family in ("calls", "imports", "impls"):
        query = result.query_handles.get(family)
        if query:
            cursor = QueryCursor(query)
            for _pattern_idx, match_captures in cursor.matches(result.tree.root_node):
                # Each match groups the parent capture with its @name sub-capture
                # ... map, emit relationship record
                pass
```
- Implement extract_from_parse_result(result: ParseResult) -> Iterator[dict] as a single data-driven function that processes ALL languages via the CAPTURE_TO_KIND registry — NO per-language if/elif branches
- Use tree-sitter's QueryCursor(query).matches(node) API — NOT .captures(). .matches() yields (pattern_index, match_captures_dict) tuples where match_captures_dict is dict[str, list[Node]] keyed by capture name. This groups the @name sub-capture with its parent @definition.* or @reference.* capture in each match. .captures() returns a flat dict of all captures grouped by name — usable for counting but NOT for associating sub-captures with their parents. The existing validate-parsers.py uses captures() for counting; extraction needs matches() for grouping.
- For @definition.* matches: match_captures["name"][0] gives the symbol name node. The outer node (e.g., match_captures["definition.function"][0]) gives the full node for line, end_line, signature_hash
- For @reference.* matches: match_captures["name"][0] gives the target identifier text. The outer node gives the source location (line). For Rust IMPLEMENTS, use match_captures.get("type") (safe access — inherent impls lack @type)
- Deduplicate Rust IMPLEMENTS matches: when two matches share the same impl_item node (start_byte), keep only the trait-impl match (has @type) and discard the type-only match
- Determine visibility by checking for pub/export/visibility modifiers on the parent/sibling nodes. Use "pub" if found, "" otherwise. This is best-effort — correctness is NOT critical for the intelligence graph
- Output JSONL (streaming, one record per line, flushed immediately) for memory efficiency on large repos
- Include had_error and coverage_status from ParseResult in every symbol record — downstream consumers need this to assess trustworthiness
- Handle query_handles entries that are None (stub queries that matched no patterns) — skip silently, do not error
- CLI: argparse with repo_name positional arg, --output optional (default: stdout), --stats flag to print summary counts
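Best-effort visibility detection might look like the following sketch. The Node class is a stand-in; visibility_modifier (Rust) and export_statement (TypeScript) are real grammar node types, but the helper itself is illustrative:

```python
class Node:  # stand-in for a tree-sitter Node with a parent link
    def __init__(self, type, children=(), parent=None):
        self.type = type
        self.children = list(children)
        self.parent = parent

def detect_visibility(node) -> str:
    """Best-effort: "pub" if a visibility marker is found, "" otherwise."""
    # Rust: `pub fn ...` puts a visibility_modifier child on the item node
    if any(c.type == "visibility_modifier" for c in node.children):
        return "pub"
    # TypeScript: `export function ...` wraps the declaration in export_statement
    parent = getattr(node, "parent", None)
    if parent is not None and parent.type == "export_statement":
        return "pub"
    return ""
```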
Haskell @reference.implementation edge case
Haskell’s impls.scm captures (instance) @reference.implementation with NO @name sub-capture. The instance declaration node text contains both the class name and the type, e.g., instance Show Foo where. The extractor must parse the node text to extract the trait name and type name, or emit the full instance text as target_identifier and let Section 08 resolve it.
- Handle Haskell @reference.implementation captures that lack @name: extract class name and type from the instance node's children ((class_name) and (instance_type) if available), or use full node text as fallback
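The text-parsing fallback can be sketched as a heuristic split of the instance head. parse_instance_head is illustrative and ignores contexts such as instance Eq a => ...; the child-node extraction in the checklist is preferred when the grammar exposes it:

```python
def parse_instance_head(instance_text: str):
    """Heuristic: split `instance <Class> <Type...> where` into (class, type).

    Returns (None, full_text) when the head is not recognized, so the caller
    can fall back to emitting the full instance text as target_identifier.
    """
    words = instance_text.split()
    if len(words) >= 3 and words[0] == "instance":
        head = words[1:words.index("where")] if "where" in words else words[1:]
        if len(head) >= 2:
            return head[0], " ".join(head[1:])  # (class name, instance type)
    return None, instance_text
```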
Subsection 06.4 close-out
/improve-tooling retrospective: Extraction accurate on Gleam repo (188 files, 6885 symbols, 29351 relationships in 1.3s). Rust impl dedup works correctly (tested via impl Foo for Bar {} test). Haskell instance fallback uses full node text. No false positives detected in initial run. Data-driven loop handles all languages except 3 refinement points (Go interface child inspection, Haskell signature dedup, Rust impl dedup). Retrospective 06.4: no tooling gaps.
06.5 Scope Filtering & Parse Metadata
File: ~/projects/lang_intelligence/neo4j/extract_symbols.py (filter pass)
Problem: Several query files capture local-scope symbols that should NOT appear in the structural graph:
- TypeScript decls.scm: (lexical_declaration ...) captures const/let inside function bodies
- Go decls.scm: (var_declaration ...) captures local variables
- Zig decls.scm: (variable_declaration ...) captures local var/const
The success criterion “No expression or statement-level nodes — structural only” requires filtering these out.
Scope filtering strategy:
A symbol captured by @definition.variable is LOCAL if its tree-sitter node is nested inside a function/method body node. Walk parent nodes upward: if any ancestor is a body node (block, statement_block, function_body, compound_statement, block_expression), the variable is local and should be excluded.
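The upward walk can be sketched as follows, assuming a stand-in node with a .parent link (real tree-sitter nodes expose .parent and .type); the Node class is illustrative:

```python
BODY_NODE_TYPES = {"block", "statement_block", "function_body",
                   "compound_statement", "block_expression"}

class Node:  # stand-in for a tree-sitter Node with a parent link
    def __init__(self, type, parent=None):
        self.type = type
        self.parent = parent

def is_local_scope(node) -> bool:
    """True if any ancestor is a function/method body node."""
    current = node.parent
    while current is not None:
        if current.type in BODY_NODE_TYPES:
            return True
        current = current.parent
    return False
```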
Captures that are ALWAYS structural regardless of nesting:
@definition.function,@definition.method— functions are always structural@definition.class,@definition.interface— types are always structural@definition.module— modules are always structural@definition.macro— macros are always structural
Captures that need scope checking:
- @definition.variable — local unless at module/namespace/file top level
- @definition.constant — same (though most are top-level in practice)
- @definition.field — always structural (fields are type members, not local)

- Implement is_local_scope(node) -> bool that walks parent nodes looking for function body ancestors
- Apply scope filter to @definition.variable and @definition.constant captures — exclude local-scope instances
- Do NOT filter @definition.field (Zig container fields are always structural)
- Emit a --include-locals CLI flag for debugging that disables the scope filter
- Log filtered-out count per file at debug level for diagnostics
Parse metadata propagation
ParseResult carries had_error, error_node_count, and coverage_status that downstream consumers need to assess whether extracted symbols are trustworthy.
- Include had_error: bool in every symbol JSONL record (copied from the originating ParseResult)
- Include coverage_status: str in every symbol JSONL record
- When had_error is true, log a warning with the file path and error_node_count — symbols from error trees may be incomplete or mislocated
- Do NOT skip files with errors — extract what tree-sitter recovered, but mark the records
Subsection 06.5 close-out
/improve-tooling retrospective: Scope filtering tested on Go (local var x int inside function filtered correctly) and Zig (container_field always structural). Parse metadata (had_error, coverage_status) included in all symbol records. --include-locals flag available for debugging. Retrospective 06.5: no tooling gaps.
06.6 Cross-Language Normalization Tests
File: ~/projects/lang_intelligence/tests/test_extract_symbols.py
Each test uses a small inline source snippet, writes it to a temporary file (via tempfile.NamedTemporaryFile with the correct extension for the language), parses it via parse_file() from parser_adapter.py, feeds the ParseResult to extract_from_parse_result(), and asserts on the output records.
Test fixture helper (required — parse_file() reads from disk, it does NOT accept raw source text):
```python
def parse_snippet(lang_id: str, source: str, manifests: Manifests) -> ParseResult:
    """Parse an inline source snippet by writing to a temp file."""
    lang_config = manifests.languages[lang_id]
    ext = lang_config["extensions"][0]  # e.g., ".rs", ".go"
    with tempfile.NamedTemporaryFile(suffix=ext, mode="w", delete=False) as f:
        f.write(source)
    # file is flushed and closed when the with-block exits; parse_file reads from disk
    return parse_file(
        repo_config={"repo_id": "test", "source_root": str(Path(f.name).parent)},
        lang_id=lang_id,
        lang_config=lang_config,
        file_path=Path(f.name),
        source_root=Path(f.name).parent,
    )
```
- Implement parse_snippet() test helper in test_extract_symbols.py for inline source fixture tests
Normalization correctness
- Test: Rust trait Foo { ... } produces kind: "trait_like", language_kind: "trait_item"
- Test: Go type Foo interface { ... } produces kind: "trait_like" via @definition.class with Go-specific refinement (inspects type_spec child for interface_type)
- Test: TypeScript interface Foo { ... } produces kind: "trait_like", language_kind: "interface_declaration"
- Test: Swift protocol Foo { ... } produces kind: "trait_like", language_kind: "protocol_declaration"
- Test: Rust enum Color { ... } produces kind: "sum_type" via node-type refinement
- Test: Rust type Alias = OtherType; produces kind: "type_alias" via node-type refinement
- Test: TypeScript enum Direction { ... } produces kind: "sum_type" via node-type refinement
- Test: TypeScript type Alias = string | number; produces kind: "type_alias" via node-type refinement
Relationship extraction
- Test: Function calls in Rust, Go, and TypeScript all produce kind: "CALLS" relationship records
- Test: Import statements in Rust, Go, and TypeScript all produce kind: "IMPORTS" relationship records
- Test: Rust impl Foo for Bar { ... } produces kind: "IMPLEMENTS" relationship with both trait (Foo) and implementing type (Bar) captured, after dedup (only one record, not two)
- Test: TypeScript class Bar implements Foo { ... } produces kind: "IMPLEMENTS" relationship
- Test: CALLS target_identifier is the raw captured text (bare identifier or scoped identifier), NOT a qualified name
Qualified name
- Test: qualified_name is deterministic — same file parsed twice yields identical qualified_names
- Test: Rust src/parser/expr.rs:parse_expr -> qualified_name contains parser::expr::parse_expr
- Test: Go function in package types -> qualified_name starts with types.
Signature hash
- Test: Changing a function body does NOT change signature_hash
- Test: Changing a function's parameter list DOES change signature_hash
- Test: Struct/enum full node changes DO change signature_hash
Scope filtering
- Test: TypeScript const x = 1 inside a function body is filtered out (not emitted)
- Test: TypeScript const X = 1 at module top-level is NOT filtered (emitted as kind: "variable" — TypeScript const/let both produce @definition.variable, not @definition.constant)
- Test: Go var x int inside a function body is filtered out
- Test: Zig container_field inside a struct is NOT filtered (it's @definition.field, always structural)
Parse metadata
- Test: Symbol records from error-containing files include had_error: true
- Test: Symbol records include coverage_status matching the language config
Subsection 06.6 close-out
/improve-tooling retrospective: 24 tests covering normalization (6), relationships (4), qualified-name (3), signature-hash (3), scope-filtering (2), parse-metadata (3), registry-coverage (3). All pass in 0.10s. Tests use parse_snippet() temp-file helper. Deterministic qualified-name test uses same temp file for both parses. Retrospective 06.6: no tooling gaps.
06.R Third Party Review Findings
- [TPR-06-001-codex][high] `section-06-symbol-extraction.md:294` — Correct the QueryCursor capture contract. Resolved: Fixed on 2026-04-13. Rewrote the extraction loop to use `QueryCursor.matches()` and corrected the API documentation in checklist items.
- [TPR-06-002-codex][high] `section-06-symbol-extraction.md:268` — Add a resolution stage for unresolved relationship targets. Resolved: Fixed on 2026-04-13. Added a downstream resolution contract specifying that Section 07 must MERGE stub Symbol nodes for unresolved targets.
- [TPR-06-003-codex][medium] `section-06-symbol-extraction.md:295` — Specify how IMPLEMENTS records recover the implementing type. Resolved: Fixed on 2026-04-13. Added the semantic-vs-helper capture contract, documented the Rust IMPLEMENTS `@name`/`@type` dual-capture pattern, and added a checklist item.
- [TPR-06-004-codex][medium] `section-06-symbol-extraction.md:364` — Rewrite the tests around an executable parse helper. Resolved: Fixed on 2026-04-13. Added the `parse_snippet()` temp-file fixture helper with implementation and a checklist item.
- [TPR-06-005-codex][medium] `section-06-symbol-extraction.md:148` — Add Lean and fix the inconsistent Koka qualified-name rules. Resolved: Fixed on 2026-04-13. Added a Lean row (shares the C++ strategy) and fixed the Koka row (`.kk` extension, `/` separator).
- [TPR-06-001-gemini][high] `section-06-symbol-extraction.md:282` — Use dictionary `get` without string modification for capture names. Resolved: Fixed on 2026-04-13. Removed the erroneous `lstrip("definition.")` — `CAPTURE_TO_KIND` keys are already fully prefixed. Same fix as [TPR-06-001-codex] (convergent).
- [TPR-06-002-gemini][high] `section-06-symbol-extraction.md:294` — Use QueryCursor matches to group sub-captures. Resolved: Fixed on 2026-04-13. Same fix as [TPR-06-001-codex] — rewrote to use `.matches()`.
- [TPR-06-003-gemini][medium] `section-06-symbol-extraction.md:206` — Support expression bodies for Haskell and Koka signature hashes. Resolved: Fixed on 2026-04-13. Expanded `BODY_CHILD_TYPES` to include Haskell expression nodes and Koka `bodyexpr`.
- [TPR-06-004-gemini][medium] `section-06-symbol-extraction.md:268` — Clarify the Neo4j target resolution strategy for CALLS and IMPORTS. Resolved: Fixed on 2026-04-13. Same fix as [TPR-06-002-codex] — added the stub-merge contract for Section 07.
Iteration 2 findings:
- [TPR-06-001-codex][high] `section-07-code-import.md:73` — Add the unresolved-target stub-merge contract to Section 07. Resolved: Noted as a Section 07 action item in Section 06's downstream contract on 2026-04-13. Cross-section: Section 07's review will address it.
- [TPR-06-002-codex][high] `section-06-symbol-extraction.md:216` — Expand BODY_CHILD_TYPES to the real Haskell/Koka wrappers. Resolved: Fixed on 2026-04-13. Replaced the Haskell expression nodes with `match` and Koka `bodyexpr` with `funbody` (verified via parser probes).
- [TPR-06-003-codex][medium] `section-06-symbol-extraction.md:90` — Deduplicate Rust trait-impl matches. Resolved: Fixed on 2026-04-13. Added the dedup rule: for the same `impl_item` matched by overlapping patterns, keep the trait-impl match (has `@type`) and discard the type-only match.
- [TPR-06-004-codex][medium] `section-06-symbol-extraction.md:121` — Use `typescript`, not `ts`, in the refinement table. Resolved: Fixed on 2026-04-13. Replaced `ts` with `typescript` to match the `languages.yaml` `language_id`.
- [TPR-06-001-gemini][medium] `section-06-symbol-extraction.md:322` — Use safe access for `@type` sub-captures. Resolved: Fixed on 2026-04-13. Changed to `match_captures.get("type")` for safe access.
- [TPR-06-002-gemini][high] `section-07-code-import.md:74` — Add the stub-merge strategy to Section 07. Resolved: Same as [TPR-06-001-codex] — noted as a Section 07 action item.
- [TPR-06-003-gemini][medium] `section-07-code-import.md:73` — Include all extracted properties in Section 07 Symbol nodes. Resolved: Noted as a Section 07 action item on 2026-04-13.
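The body-independent `signature_hash` contract behind the BODY_CHILD_TYPES findings can be illustrated with a small sketch. Node shapes are simplified to dicts, and the `BODY_CHILD_TYPES` set here is an assumed subset of the real per-language table (Rust `block`, Koka `funbody`, Haskell `match`, per the fixes above):

```python
import hashlib

# Assumed subset; the real BODY_CHILD_TYPES table covers all 9 languages.
BODY_CHILD_TYPES = {"block", "funbody", "match"}


def signature_hash(node: dict) -> str:
    # Hash the concatenated text of every non-body child, so edits inside
    # the function body never change the hash. Section 09's incremental
    # diffing depends on this: a body-sensitive hash would turn every body
    # edit into a full re-sync.
    sig = "|".join(
        child["text"]
        for child in node["children"]
        if child["type"] not in BODY_CHILD_TYPES
    )
    return hashlib.sha256(sig.encode()).hexdigest()[:16]
```

Two parses of the same function with different bodies hash identically, while a changed parameter list produces a new hash.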
Iteration 3 findings:
- [TPR-06-001-codex][medium] `section-06-symbol-extraction.md:143` — Sync the Go package-clause move with the Section 05 golden baseline. Resolved: Fixed on 2026-04-13. Added checklist items for `golden-probes.yaml` and the Section 05 documentation updates.
- [TPR-06-002-codex][medium] `section-06-symbol-extraction.md:455` — The TypeScript top-level const test expects the wrong kind. Resolved: Fixed on 2026-04-13. Changed the expected kind from "const" to "variable" (TypeScript uses `@definition.variable` for both `const` and `let`).
- [TPR-06-003-codex][medium] `section-06-symbol-extraction.md:436` — Use valid Rust trait-impl syntax in the test. Resolved: Fixed on 2026-04-13. Changed `impl Bar: Foo` to `impl Foo for Bar` (valid Rust syntax).
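The TypeScript kind fix comes down to the refinement lookup. A minimal sketch, with assumed keys and node types (the entries shown are illustrative, not the full 8-case table), showing why a top-level `const` normalizes to kind `variable` and why keys must use the `languages.yaml` `language_id` (`typescript`, not `ts`):

```python
# Assumed shape: (language_id, node_type) -> refined kind. Entries are
# illustrative; the real table handles sum_type, type_alias, Go interface,
# and Haskell signature dedup as well.
NODE_TYPE_REFINEMENTS = {
    ("typescript", "lexical_declaration"): "variable",  # const and let both map here
    ("go", "interface_type"): "interface",
}


def refine_kind(language_id: str, node_type: str, base_kind: str) -> str:
    # Fall back to the registry-derived kind when no refinement applies.
    return NODE_TYPE_REFINEMENTS.get((language_id, node_type), base_kind)
```

Because `const`/`let` share one node type, the test expectation in the finding above had to change from "const" to "variable".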
06.N Completion Checklist
- `extract_symbols.py` produces JSONL for all 9 supported languages via a single data-driven pipeline
- `CAPTURE_TO_KIND` registry covers all 12 semantic capture names from the `.scm` files, plus `@name`/`@type` sub-capture handling and the `_`-prefix ignore rule
- `NODE_TYPE_REFINEMENTS` handles the 8 node-type refinement cases (sum_type, type_alias, Go interface, Haskell signature dedup)
- Rust IMPLEMENTS extraction handles both `@name` (trait) and `@type` (implementing type) captures
- Symbol kinds normalized across languages per the capture-to-kind registry
- CALLS, IMPORTS, IMPLEMENTS relationships extracted with unresolved target identifiers
- `qualified_name` derived from filesystem path + AST nesting, deterministic
- `signature_hash` is body-independent (hashes signature only, not function body) — `BODY_CHILD_TYPES` covers all 9 languages including Haskell/Koka expression bodies
- Local-scope symbols filtered out (`@definition.variable` inside function bodies)
- `parse_snippet()` test helper implemented for inline source fixture tests
- Extraction loop uses `QueryCursor.matches()` (NOT `.captures()`) to group sub-captures with their parents
- Unresolved target contract documented — Section 07 needs a stub-merge strategy for edge targets
- Parse metadata (`had_error`, `coverage_status`) propagated to output records
- Go `package_clause` moved from `imports.scm` to `decls.scm`
- Haskell `instance` capture handled (no `@name` sub-capture)
- `end_line` included in symbol records (required by Section 08 for code reference resolution)
- Normalization tests pass (all tests in `test_extract_symbols.py`)
- `timeout 150 ./test-all.sh` green — no regressions from integration changes
- Frontmatter `status` updated to `complete`
- `/tpr-review` clean
- `/impl-hygiene-review` clean
- `/improve-tooling` section-close sweep