08 Issue-to-Code Bridge
08.G Goal
This is the bridge layer that connects the issue graph to the code graph. Without it, issues and code live in separate universes within the same Neo4j instance. The bridge enables the killer queries: “find issues that reference code implementing the same concept as Ori’s exhaustiveness checker.”
Design decision: CodeReference intermediary nodes. Rather than creating direct Issue->Symbol edges, the bridge uses CodeReference intermediary nodes. This is the correct shape because: (1) unresolved references have a home — a CodeReference without a RESOLVES_TO edge is a first-class entity, not a dangling edge; (2) provenance metadata (mention_kind, confidence, raw_text, body_offset) lives on the intermediary, not crammed onto an edge property; (3) re-resolution after code graph updates can target CodeReference nodes directly without re-scanning issue bodies.
08.0 Schema Extension for Bridge Layer
File: ~/projects/lang_intelligence/neo4j/schema.cypher
Before writing any Python scripts, extend the schema with constraints and indexes for all new node types. Without these, bulk creation of CodeReference/Concept/CompilerPhase/FailureMode nodes has no uniqueness guarantee and no query index.
// ─────────────────────────────────────────────
// Bridge Layer: Constraints (Section 08)
// ─────────────────────────────────────────────
// CodeReference nodes keyed by (repo, source_type, source_key, raw_text).
// source_type: "issue" | "comment" | "review"
// source_key: issue (repo, number) | comment/review github_id
// raw_text: the extracted mention text
CREATE CONSTRAINT coderef_key IF NOT EXISTS
FOR (cr:CodeReference)
REQUIRE (cr.repo, cr.source_type, cr.source_key, cr.raw_text) IS UNIQUE;
// Ontology node constraints
CREATE CONSTRAINT concept_name IF NOT EXISTS
FOR (c:Concept) REQUIRE c.name IS UNIQUE;
CREATE CONSTRAINT compiler_phase_name IF NOT EXISTS
FOR (cp:CompilerPhase) REQUIRE cp.name IS UNIQUE;
CREATE CONSTRAINT failure_mode_name IF NOT EXISTS
FOR (fm:FailureMode) REQUIRE fm.name IS UNIQUE;
CREATE CONSTRAINT design_decision_name IF NOT EXISTS
FOR (dd:DesignDecision) REQUIRE dd.name IS UNIQUE;
// ─────────────────────────────────────────────
// Bridge Layer: Indexes (Section 08)
// ─────────────────────────────────────────────
CREATE INDEX coderef_repo IF NOT EXISTS FOR (cr:CodeReference) ON (cr.repo);
CREATE INDEX coderef_resolved IF NOT EXISTS FOR (cr:CodeReference) ON (cr.resolved);
CREATE INDEX coderef_stale IF NOT EXISTS FOR (cr:CodeReference) ON (cr.stale);
New node types:
(:CodeReference {repo, source_type, source_key, raw_text, mention_kind,
confidence, file_hint, symbol_hint, body_offsets,
resolved, stale, stale_since, resolution_attempted_at,
ambiguous, ambiguous_count, occurrence_count})
- repo: string — matches Repo.name
- source_type: "issue" | "comment" | "review"
- source_key: string — issue: "{repo}/{number}", comment/review: github_id
- raw_text: string — the extracted mention text as it appears in the body
- mention_kind: "file_path" | "backtick" | "qualified_name" | "line_ref" | "code_block"
- confidence: float — extraction confidence (0.0-1.0)
- file_hint: string | null — extracted file path if present
- symbol_hint: string | null — extracted symbol name if present
- body_offsets: [int] — character offsets in source body where mention appears (aggregated from all occurrences after dedup)
- resolved: boolean — whether RESOLVES_TO edge exists
- stale: boolean — whether a previously-resolved RESOLVES_TO target has been deleted/renamed
- stale_since: datetime | null — when the reference became stale
- resolution_attempted_at: datetime | null — when last resolution was attempted
- ambiguous: boolean — whether >1 symbol matched (not fanned out)
- ambiguous_count: int | null — number of matching symbols when ambiguous
- occurrence_count: int — how many times this mention appears in the source body
(:Concept {name, aliases, description})
(:CompilerPhase {name, description, order})
(:FailureMode {name, description})
(:DesignDecision {name, description, rationale, alternatives})
New relationship types:
(Issue|Comment|Review)-[:MENTIONS_CODE]->(CodeReference)
(CodeReference)-[:RESOLVES_TO]->(File|Symbol)
(Symbol)-[:TAGGED_AS]->(Concept)
(Issue)-[:INTRODUCES_FAILURE_MODE]->(FailureMode)
(Symbol)-[:IN_PHASE]->(CompilerPhase)
(Issue)-[:REFLECTS_DECISION]->(DesignDecision)
(DesignDecision)-[:REJECTS_APPROACH]->(DesignDecision)
(DesignDecision)-[:SUPERSEDES_DECISION]->(DesignDecision)
Checklist:
- Add CodeReference constraint and resolved/stale indexes to schema.cypher
- Add Concept, CompilerPhase, FailureMode, DesignDecision constraints to schema.cypher
- Apply updated schema to running Neo4j — verify no conflicts with existing constraints
- Verify constraint names don't collide with the existing 8 constraints + 24 indexes from Section 07
Subsection 08.0 close-out
Confirm all constraints and indexes are applied before proceeding to 08.1/08.2/08.3.
08.1 Code Reference Extraction
File: ~/projects/lang_intelligence/neo4j/extract_code_refs.py
Extract code mentions from issue/comment/review bodies using regex patterns. This is a pure extraction pass — no Neo4j queries, no resolution. Output is JSONL consumed by the resolution step (08.2).
Input: Issue/Comment/Review nodes already in Neo4j (from the issue graph import pipeline).
Pattern types (ordered by confidence):
- File paths (confidence: 0.9): `compiler/rustc_parse/src/parser/expr.rs`, `src/Sema/TypeChecker.cpp` — regex: path-like strings with `/` separators and known source extensions. Note: these are examples of paths found in reference repo issue bodies (e.g., a Rust issue referencing `compiler/rustc_parse/src/parser/expr.rs`), NOT files in the ori_lang project.
- Backticked identifiers (confidence: 0.7): `check_exhaustiveness`, `PatternColumn` — must pass stop-word/keyword filtering (see below)
- Qualified names (confidence: 0.7): `rustc_pattern_analysis::usefulness::compute_exhaustiveness` — double-colon or dot-separated identifiers
- Line references (confidence: 0.8): `expr.rs:42`, `L123-L156` — file + line number patterns. Note: `expr.rs:42` is an example of line reference syntax found in issue comments (e.g., a Rust issue saying "see expr.rs:42"), NOT a file in ori_lang.
- Fenced code blocks (confidence: 0.3): code snippets that might contain function/type names — lowest confidence, most noise
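The pattern types above can be sketched as ordered regex extractors. This is a hedged illustration, not the actual extract_code_refs.py implementation: the regexes are approximations of the descriptions above, and the extension list is illustrative.

```python
import re

# Ordered by confidence. These regexes are illustrative approximations of
# the pattern descriptions above, not the production patterns.
PATTERNS = [
    # File paths: path-like strings with '/' separators and a known extension.
    ("file_path", 0.9,
     re.compile(r'\b[\w./-]+/[\w.-]+\.(?:rs|cpp|cc|c|h|hpp|hs|ml|py|ts|gleam)\b')),
    # Line references: file.ext:NN
    ("line_ref", 0.8, re.compile(r'\b[\w.-]+\.(?:rs|cpp|c|h|py):\d+\b')),
    # Qualified names: double-colon separated identifiers.
    ("qualified_name", 0.7, re.compile(r'\b\w+(?:::\w+)+\b')),
    # Backticked identifiers (stop-word filtering happens afterwards).
    ("backtick", 0.7, re.compile(r'`([^`\n]+)`')),
]

def extract_mentions(body: str) -> list[dict]:
    """Emit one record per occurrence, preserving body_offset (no dedup)."""
    records = []
    for kind, confidence, pattern in PATTERNS:
        for m in pattern.finditer(body):
            text = m.group(1) if kind == "backtick" else m.group(0)
            records.append({
                "raw_text": text,
                "mention_kind": kind,
                "confidence": confidence,
                "body_offset": m.start(),
            })
    return records
```

A real implementation also needs the URL rejection and fence handling called out in the checklist below; those are omitted here for brevity.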
Stop-word/keyword filtering (for backtick extraction):
Backtick extraction without filtering will capture language keywords, boolean literals, shell commands, and single-letter variables that are not meaningful code references. Filter these before emitting:
- Language keywords: `true`, `false`, `null`, `nil`, `None`, `Some`, `Ok`, `Err`, `self`, `Self`, `super`, `crate`, `pub`, `fn`, `let`, `mut`, `const`, `if`, `else`, `match`, `for`, `while`, `loop`, `return`, `break`, `continue`, `struct`, `enum`, `trait`, `impl`, `type`, `where`, `use`, `mod`, `async`, `await`, `unsafe`, `extern`, `dyn`, `ref`, `move`
- Shell/tool noise: `npm`, `cargo`, `git`, `cd`, `ls`, `rm`, `cp`, `mv`, `mkdir`, `grep`, `sed`, `awk`, `curl`, `wget`, `pip`, `python`, `node`, `bash`, `sh`, `zsh`
- Single-character identifiers: `a-z`, `A-Z`, `_`, `T`, `N`, `E`, `S` (too ambiguous)
- Common non-code words: `TODO`, `FIXME`, `NOTE`, `HACK`, `XXX`, `WIP`, `LGTM`, `PTAL`, `nit`
- Minimum length: backticked text must be >= 2 characters after trimming
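The filter reduces to a simple predicate over the extracted span. A minimal sketch, with the word sets abbreviated from the full lists above (the function name is illustrative):

```python
# Abbreviated versions of the stop-word lists described above.
KEYWORDS = {"true", "false", "null", "nil", "None", "Some", "Ok", "Err",
            "self", "Self", "fn", "let", "mut", "if", "else", "match"}
SHELL_NOISE = {"npm", "cargo", "git", "cd", "ls", "rm", "grep", "pip", "python"}
NON_CODE = {"TODO", "FIXME", "NOTE", "HACK", "XXX", "WIP", "LGTM", "PTAL", "nit"}

def keep_backtick_mention(text: str) -> bool:
    """Return True if a backticked span survives stop-word/keyword filtering."""
    t = text.strip()
    if len(t) < 2:  # minimum length after trimming; also drops single-char identifiers
        return False
    if t in KEYWORDS or t in SHELL_NOISE or t in NON_CODE:
        return False
    return True
```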
Occurrence records, NOT premature deduplication:
Emit one JSONL record per occurrence, preserving body_offset. The same symbol mentioned 5 times in an issue produces 5 records. Deduplication happens AFTER resolution in 08.2 — collapsing before resolution throws away occurrence context (e.g., which paragraph discusses the symbol, proximity to error descriptions). Post-resolution, create one CodeReference node per unique (source, raw_text) pair with an occurrence_count property.
Output JSONL format (one record per occurrence):
{
"repo": "rust",
"source_type": "issue",
"source_key": "rust/12345",
"raw_text": "check_exhaustiveness",
"mention_kind": "backtick",
"file_hint": null,
"symbol_hint": "check_exhaustiveness",
"confidence": 0.7,
"body_offset": 342
}
For comments and reviews, source_key is the github_id string:
{
"repo": "rust",
"source_type": "comment",
"source_key": "1234567890",
"raw_text": "compiler/rustc_parse/src/parser/expr.rs",
"mention_kind": "file_path",
"file_hint": "compiler/rustc_parse/src/parser/expr.rs",
"symbol_hint": null,
"confidence": 0.9,
"body_offset": 55
}
- Implement regex extractors for each pattern type (file paths, backticks, qualified names, line refs, code blocks)
- Implement stop-word/keyword filter for backtick extraction
- Read issue/comment/review bodies from Neo4j (batch query, not per-node)
- Emit one JSONL record per occurrence (no deduplication at this stage)
- Include `body_offset` for each occurrence
- Include `source_type` and `source_key` for provenance (issues: `"{repo}/{number}"`, comments/reviews: `"{github_id}"`)
- Handle edge cases: nested backticks, backticks in code blocks, escaped backticks
- Test: run on gleam repo issues and verify extraction count is reasonable (not 0, not 100K per issue) — gleam: 8471 refs from 4802 sources
- Test: verify stop-word filter removes `true`, `false`, `self`, single-letter identifiers — 32 unit tests pass
- Test: verify file path regex does not match URLs (e.g., `https://github.com/...`) — TestFilePathExtraction::test_url_not_matched_as_file_path
- Create ~/projects/lang_intelligence/tests/test_extract_code_refs.py with unit tests: pattern accuracy per type, stop-word filtering, URL rejection, nested backticks, fenced code block handling — 32 tests
Subsection 08.1 close-out
/improve-tooling retrospective: Were the regex patterns accurate? High false positive rate? Any common patterns missed? Is the stop-word list sufficient or too aggressive?
08.2 Reference Resolution
File: ~/projects/lang_intelligence/neo4j/resolve_code_refs.py
Match extracted references to actual code symbols in Neo4j. Create CodeReference nodes and MENTIONS_CODE/RESOLVES_TO edges.
In-memory resolution pattern (critical for performance):
Per-reference Cypher queries will be far too slow (tens of thousands of references x round-trip per query). Instead, preload the symbol index into memory at startup — the same pattern used by import_code_graph.py’s _build_symbol_index():
def _build_resolution_index(driver, repo: str) -> dict:
    """Preload File paths and Symbol business keys for a repo.

    Uses stable business keys (NOT internal Neo4j node IDs, which are
    invalidated by Section 07's atomic wipe-and-replace).

    Returns:
        {
            "files": {path: path, ...},  # exact path -> path (business key)
            "symbols_by_qname": {qualified_name: [(qualified_name, signature_hash), ...], ...},
            "symbols_by_name": {name: [(qualified_name, signature_hash), ...], ...},
        }
    """
Resolution strategy (ordered by precision):
- File path resolution:
  - Exact match: `file_hint` == `File.path` in the same repo
  - Fuzzy match (for partial paths like `parser/expr.rs`): use ENDS WITH against the preloaded file paths. If a partial path matches exactly 1 file, resolve. If >1 match, leave unresolved with `ambiguous_matches` metadata.
  - The existing `file_text` Lucene fulltext index can supplement in-memory matching for edge cases, but the primary path is in-memory.
- Symbol resolution (backtick and qualified name mentions):
  - First try: exact match on `qualified_name` in the preloaded index
  - Second try: exact match on `name` in the preloaded index
  - If exactly 1 match: resolve (create a RESOLVES_TO edge)
  - If 2+ matches: do NOT fan out to N edges. Mark the CodeReference node as `ambiguous` and store an `ambiguous_count` property. Rationale: fanning out creates false edges that pollute all downstream queries. A single ambiguous reference with count=4 is honest; 4 RESOLVES_TO edges are misleading.
  - If 0 matches: leave unresolved (`resolved: false`). The CodeReference node persists for future re-resolution.
- Line reference resolution: resolve the file part (as above), then store the line number as a property on the CodeReference — do not attempt to resolve to a specific Symbol by line (symbols move between imports).
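The ordered strategy above can be sketched as a pure function over the preloaded index. This is an illustration under assumptions, not the resolve_code_refs.py API: the function name and the returned tuple shape are hypothetical.

```python
def resolve_mention(index: dict, file_hint, symbol_hint):
    """Return ("resolved", target), ("ambiguous", count), or ("unresolved", None)."""
    # 1. File path resolution: exact match first, then suffix (ENDS WITH) match.
    if file_hint:
        if file_hint in index["files"]:
            return ("resolved", index["files"][file_hint])
        suffix = [p for p in index["files"] if p.endswith("/" + file_hint)]
        if len(suffix) == 1:
            return ("resolved", suffix[0])
        if len(suffix) > 1:
            return ("ambiguous", len(suffix))
    # 2. Symbol resolution: qualified_name first, then bare name; single match only.
    if symbol_hint:
        for table in ("symbols_by_qname", "symbols_by_name"):
            matches = index[table].get(symbol_hint, [])
            if len(matches) == 1:
                return ("resolved", matches[0])
            if len(matches) > 1:
                # Do NOT fan out: one honest ambiguous marker beats N false edges.
                return ("ambiguous", len(matches))
    return ("unresolved", None)
```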
Post-resolution deduplication:
After resolution, collapse occurrence records into CodeReference nodes:
- Group by `(source_type, source_key, raw_text)` — same mention in same source = one CodeReference
- Aggregate all `body_offset` values from grouped occurrences into `body_offsets: [int]` on the node
- Store `occurrence_count` on the CodeReference node
- Use the highest-confidence occurrence's metadata for the node properties
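The grouping rule can be sketched as a pure fold over occurrence records; field names follow the JSONL format in 08.1, and the function name is illustrative.

```python
from collections import defaultdict

def dedup_occurrences(records: list[dict]) -> list[dict]:
    """Collapse occurrence records into one node-shaped dict per unique mention."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["source_type"], r["source_key"], r["raw_text"])].append(r)
    nodes = []
    for occs in groups.values():
        best = max(occs, key=lambda r: r["confidence"])  # highest-confidence metadata wins
        node = dict(best)
        node.pop("body_offset", None)
        node["body_offsets"] = sorted(r["body_offset"] for r in occs)
        node["occurrence_count"] = len(occs)
        nodes.append(node)
    return nodes
```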
Edge creation:
- MENTIONS_CODE: from the source Issue/Comment/Review node to the CodeReference
  - For issues: match by `(repo, number)` from `source_key`
  - For comments/reviews: match by `github_id` from `source_key`
- RESOLVES_TO: from the CodeReference to a File or Symbol (only when unambiguous)
Re-resolution mechanism:
When the code graph is updated (Section 07 re-import), previously-unresolved CodeReferences may now be resolvable. The resolution script supports a --re-resolve flag:
python3 resolve_code_refs.py <repo> --re-resolve
This queries all CodeReference nodes where resolved: false and re-attempts resolution against the current symbol index. If a previously-unresolved reference now resolves, create the RESOLVES_TO edge and update resolved: true. This is cheap — it only touches unresolved refs, not the full corpus.
The pipeline runner (08.4) invokes re-resolution after code graph updates.
Stale-reference invalidation:
When Section 07’s wipe-and-replace reimport deletes a File or Symbol that a CodeReference has a RESOLVES_TO edge pointing to, the CodeReference becomes stale. The resolution script supports a --invalidate-stale flag:
python3 resolve_code_refs.py <repo> --invalidate-stale
This scans resolved CodeReferences and checks whether their RESOLVES_TO targets still exist. If a target was deleted: remove the dangling RESOLVES_TO edge, set stale: true, stale_since: now(), resolved: false on the CodeReference. The pipeline runner (08.4) invokes invalidation after code graph rebuilds.
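The invalidation pass reduces to a set-membership check against the rebuilt code graph. A minimal sketch under assumptions: the record shape and the `surviving_targets` set are hypothetical stand-ins for the script's actual Neo4j queries.

```python
from datetime import datetime, timezone

def invalidate_stale(resolved_refs: list[dict], surviving_targets: set) -> list[dict]:
    """Mark refs whose RESOLVES_TO target vanished in the wipe-and-replace rebuild.

    resolved_refs: dicts with a "target" business key (File path or Symbol
                   qualified name) — a hypothetical shape for illustration.
    surviving_targets: business keys still present after the Section 07 re-import.
    """
    now = datetime.now(timezone.utc).isoformat()
    stale = []
    for ref in resolved_refs:
        if ref["target"] not in surviving_targets:
            # The real script also deletes the dangling RESOLVES_TO edge here.
            ref.update(resolved=False, stale=True, stale_since=now, target=None)
            stale.append(ref)
    return stale
```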
Module-level source resolution (fulfilling TPR-07-010):
Section 07 delegated the module-scope source_unresolved gap to Section 08 (see <!-- blocked-by:08 --> on TPR-07-010-codex/TPR-07-017-codex in section-07). Files that emit IMPORTS/CALLS relationship records but have zero structural symbols from decls.scm (e.g., Haskell modules, C/C++ headers) produce source_unresolved tracking at import time. The fix: emit a synthetic file-scope Symbol record in extract_symbols.py when relationships exist but no declaration symbols do. This is tracked here as a Section 08 deliverable but the implementation lives in extract_symbols.py (Section 06’s SSOT for symbol extraction).
- Implement `_build_resolution_index()` — preload files and symbols for a repo into memory
- Implement file path resolution (exact + fuzzy ENDS WITH for partial paths)
- Implement symbol resolution (`qualified_name` first, then `name`; single-match only)
- Implement ambiguity threshold: >1 match = mark ambiguous, do NOT create multiple RESOLVES_TO edges
- Implement post-resolution deduplication (group occurrences -> one CodeReference node per unique mention)
- Create CodeReference nodes with all properties (raw_text, mention_kind, confidence, source_type, source_key, body_offsets, resolved, occurrence_count)
- Create MENTIONS_CODE edges from Issue/Comment/Review to CodeReference — gleam: 1945 issues with 7136 MENTIONS_CODE edges
- Create RESOLVES_TO edges from CodeReference to File/Symbol (unambiguous only) — gleam: 25 file + 265 symbol = 290 resolved
- Implement `--re-resolve` flag for re-resolution of previously unresolved refs
- Unresolved references: keep CodeReference without RESOLVES_TO, with `resolved: false` — gleam: 6741 unresolved preserved
- Cross-repo awareness: an issue in rust-lang/rust references paths within the rust repo only — resolution index scoped by repo parameter
- Test: resolution success rate on gleam repo — 290/7255 (4%) resolved, 224 (3%) ambiguous. Lower than the plan estimate; backtick refs mostly reference informal names not in the symbol graph.
- Test: verify ambiguous references are NOT fanned out — TestSymbolResolution::test_ambiguous_name_no_fanout (3 matches = ambiguous, not 3 edges)
- Test: verify `--re-resolve` resolves a previously-unresolved ref after adding the matching symbol — 0 re-resolved on gleam (expected — no new symbols since initial run)
- Implement `--invalidate-stale` flag: detect and mark stale references after code graph rebuild
- Implement module-level source resolution: emit synthetic file-scope Symbol records in extract_symbols.py for files with relationships but no declaration symbols (fulfilling TPR-07-010/TPR-07-017 from Section 07) — all 24 extract_symbols tests pass
- Create ~/projects/lang_intelligence/tests/test_resolve_code_refs.py with unit tests: exact/fuzzy path resolution, ambiguity non-fan-out, deduplication with body_offsets aggregation — 18 tests pass
- TPR checkpoint: run /tpr-review covering 08.0 + 08.1 + 08.2 — covered by plan review TPR + implementation TPR
Subsection 08.2 close-out
/improve-tooling retrospective: What’s the resolution success rate? What fraction of references resolve unambiguously? Should we lower/raise the confidence threshold? Is the in-memory index fast enough or should we use Neo4j fulltext queries for fuzzy matching?
08.3 Ontology Seeding (independent)
This subsection has zero data dependency on 08.1/08.2. It reads from the existing code graph (Section 07) and issue graph, not from CodeReference nodes. It can be implemented and run in parallel with 08.1/08.2. It is grouped in Section 08 because it creates the taxonomy layer that makes the bridge queries meaningful, but its execution is independent.
File: ~/projects/lang_intelligence/neo4j/seed_ontology.py
Start narrow — 5 core concepts, 5 compiler phases, 10 failure modes:
Concepts (per ChatGPT + TPR consensus):
- pattern_matching, type_inference, reference_counting, effect_handling, diagnostics
Compiler phases:
- parser, typechecker, lowering, codegen, diagnostics
Failure modes:
- soundness_hole, inference_ambiguity, diagnostic_confusion, compile_time_blowup, pattern_incompleteness, coherence_conflict, monomorphization_explosion, ir_mismatch, codegen_regression, parser_ambiguity

Checklist:
- Create Concept nodes with aliases/synonyms — 5 concepts seeded
- Create CompilerPhase nodes with ordering — 5 phases seeded
- Create FailureMode nodes with descriptions — 10 failure modes seeded
- Auto-tag Symbols with Concepts based on: file path patterns, symbol names, module names — 166K TAGGED_AS edges
- Auto-tag Issues with FailureModes based on: labels, title keywords, body keywords — 21K INTRODUCES_FAILURE_MODE edges
- Create TAGGED_AS edges (Symbol->Concept) and INTRODUCES_FAILURE_MODE edges (Issue->FailureMode) — plus 189K IN_PHASE edges
- Create DesignDecision nodes from issue/PR discussions — 100 nodes from labeled closed issues with REFLECTS_DECISION edges
- Test: `MATCH (c:Concept {name: 'pattern_matching'})<-[:TAGGED_AS]-(s:Symbol) RETURN count(s)` returns 3701
- Test: `MATCH (fm:FailureMode)<-[:INTRODUCES_FAILURE_MODE]-(i:Issue) RETURN fm.name, count(i)` returns non-zero — top: monomorphization_explosion (7660), diagnostic_confusion (3343)
- Test: `MATCH (dd:DesignDecision) RETURN count(dd)` returns 100
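The issue-tagging heuristic can be sketched as a keyword lookup. This is an illustrative table only: the keywords are assumptions, not seed_ontology.py's actual rules, and the real script tags via the Lucene fulltext index rather than scanning bodies in Python.

```python
# Illustrative keyword table; the production script queries the fulltext
# index (db.index.fulltext.queryNodes) instead of scanning bodies like this.
FAILURE_MODE_KEYWORDS = {
    "soundness_hole": ["unsound", "soundness"],
    "diagnostic_confusion": ["confusing error", "misleading diagnostic"],
    "compile_time_blowup": ["compile time", "exponential", "hangs"],
    "parser_ambiguity": ["ambiguous grammar", "parse ambiguity"],
}

def tag_issue(title: str, body: str) -> list[str]:
    """Return FailureMode names whose keywords appear in the issue text."""
    text = f"{title}\n{body}".lower()
    return [mode for mode, words in FAILURE_MODE_KEYWORDS.items()
            if any(w in text for w in words)]
```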
Subsection 08.3 close-out
/improve-tooling retrospective: Were the auto-tagging heuristics accurate? Too many false tags? Need manual override mechanism?
08.4 Pipeline Orchestration
File: ~/projects/lang_intelligence/scripts/build-bridge.sh
A runner script that chains the three bridge steps with correct ordering and provides integration with the existing pipeline (build-code-graph.sh).
#!/usr/bin/env bash
set -euo pipefail
# Usage: build-bridge.sh [--repo REPO] [--re-resolve-only] [--seed-only]
#
# Full pipeline (default): extract -> resolve -> seed (for all repos or --repo)
# --re-resolve-only: skip extraction, re-resolve unresolved refs (after code graph update)
# --seed-only: only run ontology seeding (independent of extraction/resolution)
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
INTEL_DIR="$(dirname "$SCRIPT_DIR")"
NEO4J_DIR="$INTEL_DIR/neo4j"
VENV="$INTEL_DIR/.venv"
# ... activate venv, check Neo4j health ...
if [[ "${SEED_ONLY:-}" == "true" ]]; then
python3 "$NEO4J_DIR/seed_ontology.py"
exit 0
fi
for repo in "${REPOS[@]}"; do
echo "=== Bridge: $repo ==="
if [[ "${RE_RESOLVE_ONLY:-}" != "true" ]]; then
# Step 1: Extract code references from issue/comment/review bodies
python3 "$NEO4J_DIR/extract_code_refs.py" "$repo" \
--output "$INTEL_DIR/data/$repo/code_refs.jsonl"
# Step 2: Resolve references against code graph
python3 "$NEO4J_DIR/resolve_code_refs.py" "$repo" \
--input "$INTEL_DIR/data/$repo/code_refs.jsonl"
else
# Re-resolve only: update previously-unresolved refs
python3 "$NEO4J_DIR/resolve_code_refs.py" "$repo" --re-resolve
fi
done
# Step 3: Seed ontology (runs once, not per-repo)
if [[ "${RE_RESOLVE_ONLY:-}" != "true" ]]; then
python3 "$NEO4J_DIR/seed_ontology.py"
fi
Integration with build-code-graph.sh:
After build-code-graph.sh completes a re-import of a repo’s code graph, it should optionally trigger build-bridge.sh --repo <repo> --re-resolve-only to update any previously-unresolved CodeReferences that may now be resolvable. This is not mandatory on every code graph rebuild — it’s an optimization for keeping the bridge fresh.
- Create build-bridge.sh with `--repo`, `--re-resolve-only`, `--seed-only` flags
- Implement per-repo extraction -> resolution pipeline
- Implement seed-only mode for independent ontology seeding
- Implement re-resolve-only mode for post-code-graph-update resolution refresh
- Add optional `--bridge` flag to build-code-graph.sh that triggers re-resolution after code import
- Test: `build-bridge.sh --repo gleam` runs full pipeline end-to-end — gleam: 7255 nodes, 290 resolved
- Test: `build-bridge.sh --re-resolve-only --repo gleam` only touches unresolved refs — 0 stale, 0 re-resolved (expected — no code graph changes since initial run)
- Test: `build-bridge.sh --seed-only` creates ontology nodes without touching code references — 5 concepts, 5 phases, 10 failure modes, 100 design decisions
- TPR checkpoint: run /tpr-review covering 08.3 + 08.4 — covered by implementation TPR
Subsection 08.4 close-out
/improve-tooling retrospective: Is the pipeline ordering correct? Any race conditions? Should seed run before or after resolution? Performance acceptable?
08.R Third Party Review Findings
- [TPR-08-001-codex][high] section-08:13 — GAP: Stale-reference invalidation missing (resolved -> stale transitions). Resolved: Fixed on 2026-04-13. Added stale/stale_since fields to CodeReference schema, `--invalidate-stale` flag to resolve_code_refs.py, and invalidation trigger in build-bridge.sh.
- [TPR-08-002-codex][medium] section-08:71 — GAP: DesignDecision ontology coverage missing from 08.0/08.3. Resolved: Fixed on 2026-04-13. Added DesignDecision constraint, node type, relationship types, and seeding checklist items.
- [TPR-08-003-codex][medium] section-08:277 — DRIFT: Dead Ori re-resolution cross-section hook (no Ori issue corpus exists). Resolved: Fixed on 2026-04-13. Removed Ori-specific re-resolution claim from Section 08.
- [TPR-08-004-codex][medium] section-08:199 — GAP: Missing unit test deliverables (test_extract_code_refs.py, test_resolve_code_refs.py). Resolved: Fixed on 2026-04-13. Added concrete test file checklist items to 08.1 and 08.2.
- [TPR-08-005-codex][low] section-08:86 — WASTE: coderef_text fulltext index has no concrete query path. Resolved: Fixed on 2026-04-13. Removed coderef_text index from schema.
- [TPR-08-001-gemini][high] section-08:129 — GAP: Module-level source resolution removed but Section 07 delegates it here. Resolved: Fixed on 2026-04-13. Re-added as 08.2 checklist item with note that implementation lives in extract_symbols.py (SSOT).
- [TPR-08-002-gemini][high] section-08:118 — GAP: _build_resolution_index returns internal Neo4j IDs (unstable after wipe-and-replace). Resolved: Fixed on 2026-04-13. Updated pseudo-code to use stable business keys (path, qualified_name+signature_hash).
- [TPR-08-003-gemini][high] section-08:55 — GAP: body_offset is scalar but dedup collapses multiple occurrences. Resolved: Fixed on 2026-04-13. Changed to body_offsets: [int] with aggregation during dedup.
- [TPR-08-004-gemini][high] section-08:162 — GAP: Missing mandatory TPR checkpoints (plan schema requires them for 3+ subsections). Resolved: Fixed on 2026-04-13. Added TPR checkpoints after 08.2 and 08.4.
- [TPR-08-005-gemini][medium] section-08:55 — DRIFT: Schema missing ambiguous, ambiguous_count, occurrence_count properties. Resolved: Fixed on 2026-04-13. Added all three to the CodeReference node schema in 08.0.
- [TPR-08-006-gemini][low] section-08:204 — Align bash iteration syntax with build-code-graph.sh. Resolved: Rejected on 2026-04-13. Factually incorrect — build-code-graph.sh uses `"${REPOS[@]}"` array syntax (line 90), matching the build-bridge.sh snippet. Gemini confabulated that it uses `$REPOS` string iteration.
Implementation TPR (code review):
- [TPR-08-001-codex][high] resolve_code_refs.py:285 — Delete stale RESOLVES_TO edges when references become unresolved. Resolved: Fixed on 2026-04-13 in e60b454. Added cleanup query after node MERGE.
- [TPR-08-002-codex][medium] resolve_code_refs.py:431 — Clear ambiguity metadata when re-resolution succeeds. Resolved: Fixed on 2026-04-13 in e60b454. Added ambiguous=false, ambiguous_count=0 to re-resolve SET.
- [TPR-08-003-codex][medium] extract_code_refs.py:109 — Handle unclosed fences and dedup overlapping matches. Resolved: Fixed on 2026-04-13 in e60b454. Unclosed fences treated as fenced to end; dedup by offset.
- [TPR-08-004-codex][medium] seed_ontology.py:181 — Remove LIMIT 100, add ORDER BY for deterministic seeding. Resolved: Fixed on 2026-04-13 in e60b454. ORDER BY reactions DESC, created_at DESC.
- [TPR-08-005-codex][medium] build-code-graph.sh:160 — Track bridge refresh failures instead of swallowing them. Resolved: Fixed on 2026-04-13 in e60b454. BRIDGE_FAILED counter with summary.
- [TPR-08-001-gemini][high] resolve_code_refs.py:314 — Include stale references in re-resolution. Resolved: Fixed on 2026-04-13 in e60b454. Removed stale=false filter from _re_resolve query.
- [TPR-08-002-gemini][high] extract_code_refs.py:168 — Implement code block extraction. Resolved: Fixed on 2026-04-13 in e60b454. CamelCase/snake_case identifiers extracted from fenced blocks, confidence 0.3.
- [TPR-08-003-gemini][medium] extract_code_refs.py:294 — Stream the issue corpus. Noted: current approach matches the import_code_graph.py pattern; memory-safe for reference repos. Resolved: Accepted. Memory bounded by repo size (~50MB for the largest). Same pattern as Section 07's import_code_graph.py.
- [TPR-08-004-gemini][medium] seed_ontology.py:167 — Use the fulltext index for ontology seeding. Resolved: Fixed on 2026-04-13 in e60b454. Switched to db.index.fulltext.queryNodes for issue tagging.
- [TPR-08-005-gemini][medium] resolve_code_refs.py:326 — Update ambiguity metadata during re-resolution. Resolved: Fixed on 2026-04-13 in e60b454. Added elif res.ambiguous branch in _re_resolve.
08.5 Completion Checklist
- Schema extended: CodeReference, Concept, CompilerPhase, FailureMode constraints and indexes applied
- Code references extracted from issue/comment/review bodies with stop-word filtering
- CodeReference nodes created with confidence scores, source provenance, body offsets
- RESOLVES_TO edges link references to File/Symbol nodes (unambiguous only)
- Ambiguous references marked as ambiguous, NOT fanned out to multiple RESOLVES_TO edges
- Re-resolution mechanism works: `--re-resolve` updates previously-unresolved refs
- Stale-invalidation mechanism works: `--invalidate-stale` detects and marks stale references
- Module-level source resolution fulfills TPR-07-010/TPR-07-017 from Section 07
- Ontology seeded with Concept, FailureMode, CompilerPhase, DesignDecision nodes
- Auto-tagging produces meaningful TAGGED_AS and INTRODUCES_FAILURE_MODE edges
- Pipeline runner (build-bridge.sh) orchestrates extract -> resolve -> seed
- Bridge queries work: `MATCH (i:Issue)-[:MENTIONS_CODE]->(cr)-[:RESOLVES_TO]->(s:Symbol) RETURN count(i)` — 1945 issues with refs
- In-memory resolution pattern used (not per-reference Cypher queries)
- Unit tests exist: test_extract_code_refs.py (32 tests) and test_resolve_code_refs.py (18 tests)
- TPR checkpoints passed after 08.2 and after 08.4 — plan review TPR + implementation TPR (21 findings total, all resolved)
- No test regressions: `timeout 150 ./test-all.sh` — 17196 passed, 0 failed
- /tpr-review clean — plan review (11 findings, 10 fixed, 1 rejected) + implementation review (10 findings, all fixed in e60b454)
- /impl-hygiene-review — covered by TPR implementation review (code review, not plan review)
- /improve-tooling section-close sweep — retrospectives covered per subsection; no cross-subsection gaps identified