
09 Ori Live Sync

09.0 Prerequisites & Repo Bootstrap

Ori is the one repo where the code graph must stay current during active development. For the 10 reference repos, periodic batch rebuilds via build-code-graph.sh are sufficient. For Ori, the graph should be updated after every commit.

Architectural boundary: The live sync lives entirely in ~/projects/lang_intelligence/ — per the architectural decision from Section 07 TPR (Codex finding #6), ori_lang has NO dependency on or knowledge of the intelligence DB’s schema, sync logic, or JSONL format. A lefthook hook in ori_lang provides the trigger (a shell one-liner that calls an external script); all sync logic is external. The compiler exposes compiler-native data via existing phase dump flags (ORI_DUMP_AFTER_PARSE=1 etc.); the intelligence layer owns the normalization from compiler output to JSONL/Neo4j.

Performance model: The original plan targeted <500ms per-file sync using tree-sitter incremental parsing. This is infeasible for Ori because: (1) Ori has no tree-sitter grammar (grammar: native in languages.yaml), (2) cargo run has multi-second cold-start overhead and discards Salsa incrementality between invocations, and (3) each sync invokes the built Ori binary which includes process startup + parser init. The realistic target is <5s per file for the common case (built binary already exists, Neo4j is warm). This is still fast enough for a post-commit hook that runs in the background — the developer never waits for it.

Why not a long-lived daemon? The ori watch command (compiler/oric/src/commands/watch.rs) demonstrates persistent CompilerDb + Salsa incrementality + debounce, and could theoretically provide sub-100ms re-parse. However, a daemon adds operational complexity (lifecycle management, crash recovery, stale state) that is not warranted for a developer tool where commits happen at most a few times per minute. The background-process-per-commit model is simpler, more reliable, and sufficient. A daemon upgrade can be revisited if the <5s target proves insufficient in practice.

Prerequisite: Ori :Repo node. The build-code-graph.sh pipeline skips repos without a :Repo node in Neo4j (see line 74: if r in neo4j_repos). The 10 reference repos get their :Repo nodes from import_graph.py (the issue graph import). Ori has no issue graph data, so its :Repo node must be created explicitly. import_code_graph.py checks for the Repo node at lines 328-334 and exits with an error if missing.

success_criteria:

  • Ori :Repo node exists in Neo4j with name: "ori"

  • import_code_graph.py ori <jsonl> succeeds (Repo check passes)

  • ori_adapter.py extracts .ori files and standard tree-sitter pipeline extracts .rs files — combined JSONL imports via import_code_graph.py ori

  • Create Ori :Repo node via a bootstrap Cypher in sync-ori-graph.sh --bootstrap:

    MERGE (r:Repo {name: "ori"})
    SET r.full_name = "ori-lang/ori",
        r.description = "The Ori programming language compiler",
        r.is_custom = true

    The is_custom: true property distinguishes Ori from the 10 reference repos (which have issue graph data). This bootstrap is idempotent (MERGE).

  • Verify import_code_graph.py accepts the bootstrapped Repo node

  • Verify logs/ directory is created by the sync script if it does not exist (mkdir -p)

  • Subsection close-out (09.0) — MANDATORY before starting 09.1:

    • All tasks above are [x] and the subsection’s behavior is verified — Ori Repo node exists, bootstrap idempotent, logs/ created
    • Update this subsection’s status in section frontmatter to complete
    • Run /improve-tooling retrospectively on THIS subsection — Retrospective 09.0: sync-ori-graph.sh already auto-bootstraps on every run (idempotent MERGE). build-code-graph.sh handles custom repos via --repo ori flag. No tooling gaps — bootstrap infrastructure is solid.
    • Repo hygiene check — diagnostics/repo-hygiene.sh --check: clean
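
The bootstrap MERGE above could be issued from Python as a small helper. This is a minimal sketch: the run_query parameter stands in for whatever Neo4j query runner the project uses (for example, a wrapper around a driver session), so the helper itself stays connection-agnostic.

```python
# Sketch of the --bootstrap step. The Cypher is the plan's bootstrap query;
# `run_query` is a stand-in for the project's Neo4j query runner.
BOOTSTRAP_CYPHER = """
MERGE (r:Repo {name: "ori"})
SET r.full_name = "ori-lang/ori",
    r.description = "The Ori programming language compiler",
    r.is_custom = true
"""

def bootstrap_ori_repo(run_query):
    # Idempotent by construction: MERGE matches the existing node or creates
    # it, so this is safe to run before every sync.
    run_query(BOOTSTRAP_CYPHER)
```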

09.1 Lefthook Post-Commit Hook

File: lefthook.yml (in ori_lang)

Add an async post-commit hook that triggers the external sync script. The hook must:

  1. Return immediately (background the sync with &)
  2. Be a no-op when ../lang_intelligence/ doesn’t exist
  3. Not interfere with existing pre-commit hooks
  4. Use git diff-tree to identify changed files (NOT {staged_files} — lefthook does NOT expose {staged_files} in post-commit context; files are already committed)

success_criteria:

  • Hook returns in <100ms (verified: 2ms — the -x test short-circuits when script absent)
  • Hook is a no-op when ../lang_intelligence/ is absent
  • No interference with existing pre-commit hooks (fmt, full-check, version-sync, spec-proposal-gate)
Hook definition (lefthook.yml):

post-commit:
  commands:
    intel-sync:
      run: |
        if [ -x ../lang_intelligence/scripts/sync-ori-graph.sh ]; then
          CHANGED=$(git diff-tree --no-commit-id --name-only -r HEAD -- 'compiler/*.rs' 'library/*.ori' 'library/*.rs')
          if [ -n "$CHANGED" ]; then
            mkdir -p ../lang_intelligence/logs
            ../lang_intelligence/scripts/sync-ori-graph.sh --changed "$CHANGED" >> ../lang_intelligence/logs/ori-sync.log 2>&1 &
          fi
        fi
      # Fire-and-forget: returns immediately, sync runs in background
      # If lang_intelligence doesn't exist, the -x test fails silently
      # Errors logged to ori-sync.log, not swallowed

Key design decisions:

  • git diff-tree --no-commit-id --name-only -r HEAD identifies files changed in the just-committed revision. The pathspecs 'compiler/*.rs' 'library/*.ori' 'library/*.rs' scope to the Ori code-graph’s include roots defined in repos.yaml (compiler/ and library/). This prevents fixtures, diagnostics, examples, tools, and other out-of-scope files from triggering unnecessary syncs or polluting the graph.

  • Log redirection: stdout and stderr go to ori-sync.log. The original plan used fire-and-forget & with no output capture, which makes errors invisible (Finding #7). Logging to a file makes failures diagnosable.

  • Conditional trigger: Only runs if $CHANGED is non-empty (no sync needed for docs-only commits).

  • Add post-commit section with intel-sync command to lefthook.yml

  • Verify hook returns immediately (<100ms) — 2ms measured

  • Verify hook is a no-op when ../lang_intelligence/ doesn’t exist

  • Verify hook doesn’t interfere with existing pre-commit hooks

  • Verify git diff-tree correctly identifies changed .ori and .rs files in compiler/library scope

  • Verify errors are captured in ori-sync.log (not silently dropped) — verified with sync script

  • Subsection close-out (09.1):

    • All tasks above are [x] and the subsection’s behavior is verified — hook returns <2ms, no-op without lang_intelligence, no interference
    • Update this subsection’s status in section frontmatter to complete
    • Retrospective 09.1: git diff-tree filter is sufficient — .toml changes don’t affect code symbols. flock in sync script prevents race conditions from rapid successive commits. No tooling gaps.
    • Repo hygiene check — diagnostics/repo-hygiene.sh --check: clean

09.2 Sync Script & Error Handling

File: ~/projects/lang_intelligence/scripts/sync-ori-graph.sh

Three modes:

  • Incremental (default): sync-ori-graph.sh --changed "file1.ori file2.rs ..." — extract+upsert only changed files
  • Full rebuild: sync-ori-graph.sh --full — re-extract entire Ori codebase
  • Bootstrap: sync-ori-graph.sh --bootstrap — create the Ori :Repo node (idempotent, runs before first sync)

success_criteria:

  • Incremental mode processes only the listed files — verified via --changed flag routing
  • Full mode re-extracts and upserts all Ori source files — verified: 47,096 symbols from 1,462 files
  • Parse failures short-circuit before upsert_file_symbols() — last-good state preserved
  • Lock file prevents concurrent syncs from colliding — verified: flock skip on concurrent run
  • All operations logged to logs/ori-sync.log — verified: log redirection in hook + script
  • Exit code 0 on success, non-zero on failure (for health monitoring) — set -euo pipefail

Incremental flow:

  1. Acquire lock (flock on ~/projects/lang_intelligence/.ori-sync.lock)
  2. Auto-bootstrap: ensure Ori :Repo node exists (idempotent MERGE)
  3. Ensure logs/ directory exists (mkdir -p ~/projects/lang_intelligence/logs)
  4. For each changed file (that still exists on disk):
     a. Route by extension: .ori files → ori_adapter.extract_ori_file() (09.3); .rs files → standard tree-sitter pipeline (parse_file() + extract_from_parse_result() from extract_symbols.py). This is critical: the hook triggers on both .ori and .rs changes, but ori_adapter.py only handles .ori files. Routing .rs files to ori_adapter.py would fail silently.
     b. If extraction fails (Python scanner exception): log the error and skip this file — do NOT call upsert_file_symbols() with empty symbols. This is the “retain last-good” contract: the existing graph state for this file remains intact.
     c. If extraction succeeds: call upsert_file_symbols() from import_code_graph.py for this file. This function implements an atomic file-scoped symbol diff (see import_code_graph.py lines 45-202): it deletes stale symbols, merges updated symbols, and creates DECLARES/IN_REPO edges — all in a single transaction.
     d. After symbol upsert: resolve per-file relationships (CALLS/IMPORTS/IMPLEMENTS) for this file. upsert_file_symbols() only handles symbol nodes and DECLARES/IN_REPO edges — it does NOT rebuild CALLS/IMPORTS/IMPLEMENTS; those are handled by the bulk importer’s separate Phase 2 relationship pass (import_code_graph.py lines 464-520). To avoid algorithmic duplication (LEAK:algorithmic-duplication), the incremental sync must use the same shared logic as the bulk importer: extract the Phase 2 resolution code from import_code_graph.py::main() into a reusable function (e.g., resolve_file_relationships(driver, repo_name, file_path, relationships)) that both the bulk importer and the incremental sync call. The incremental sync invokes this function per-file: delete stale outgoing relationship edges, then resolve and create new ones from the extraction JSONL.
  5. For deleted files (detected in Python via os.path.exists() — the shell wrapper passes all changed paths from git diff-tree --name-only, including paths that no longer exist on disk):
     a. Delete the old file’s (:File) node and all connected (:Symbol) nodes and edges from Neo4j.
     b. Git reports renames as separate add+delete entries (without -M). The deleted path is handled here; the new path is handled as a new file in step 4. This delete+add model is simpler and sufficient for live sync correctness.
     c. This prevents stale nodes from persisting until the next full rebuild.
  6. Update :Repo node’s last_code_import_at timestamp after all files are processed. Without this, the --health check would falsely report the sync as stale after 24h regardless of how many incremental syncs ran.
  7. Reverse-dependency note: When a changed file deletes or renames symbols, incoming edges from UNCHANGED files (e.g., a caller that CALLS a now-deleted function) become dangling. The incremental sync does NOT repair these — that would require re-extracting and re-resolving all files that reference the changed symbols, which approaches full-rebuild cost. This is an explicit simplification: incremental sync keeps symbols and outgoing edges correct; incoming edges from other files are eventually consistent via periodic --full rebuilds. Recommended practice: run --full weekly or after commits that delete/rename many symbols.
  8. Release lock
  9. Log summary (files processed, files deleted, files skipped due to errors, elapsed time)
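
Steps 4-5 can be sketched as a small routing loop. The function names mirror the plan (extract_ori_file, upsert_file_symbols, and so on) but are passed in as callables here, so the sketch stays independent of the real modules:

```python
import os

def sync_changed_files(changed_paths, extract_ori, extract_rs, upsert, delete_file, log):
    """Route each changed path from the hook. On extraction failure the file is
    skipped, preserving its last-good graph state (the retain-last-good contract)."""
    stats = {"processed": 0, "deleted": 0, "skipped": 0}
    for path in changed_paths:
        if not os.path.exists(path):        # step 5: deleted (or a rename's old name)
            delete_file(path)
            stats["deleted"] += 1
            continue
        extractor = extract_ori if path.endswith(".ori") else extract_rs
        try:
            records = extractor(path)       # step 4a: route by extension
        except Exception as exc:            # step 4b: log and skip; never upsert empty
            log(f"ERROR extracting {path}: {exc}")
            stats["skipped"] += 1
            continue
        upsert(path, records)               # step 4c: atomic file-scoped diff
        stats["processed"] += 1
    return stats
```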

Full rebuild flow:

  1. Acquire lock
  2. Auto-bootstrap Repo node
  3. Extract BOTH .ori and .rs symbols into a single combined JSONL. extract_symbols.py ori processes ZERO files of any type because parser_adapter.py:parse_repo() (lines 343-348) skips the entire repo when coverage_status: custom. The full-rebuild path must therefore enumerate all files itself and route per-file:
     a. Enumerate all .ori and .rs files within the Ori repo’s include roots from repos.yaml (compiler/, library/), respecting exclude patterns.
     b. .ori files → ori_adapter.extract_ori_file() (the standalone adapter from 09.3).
     c. .rs files → tree-sitter Rust pipeline per-file via parse_file() + extract_from_parse_result() from extract_symbols.py (parse_file() works per-file even for custom repos — it’s parse_repo() that skips).
     d. Parse-failed files during full rebuild: unlike incremental sync (where parse failures skip the file to preserve last-good state), full rebuild IS the canonical state reset — it produces the authoritative graph. Files that fail extraction are still included in the JSONL with had_error: true and zero symbols, which causes upsert_file_symbols() to remove their old symbols. This is correct for full rebuild: if a file can’t be parsed, its graph representation should reflect that (no symbols). The “retain last-good” contract applies ONLY to incremental sync, where a temporary parse error shouldn’t destroy previously-good data. Parse failures during full rebuild are logged prominently so the developer knows to fix the broken files.
     e. Combine all successful outputs into a single JSONL temp file — critical because import_code_graph.py’s ghost file deletion removes files absent from the JSONL. The combined JSONL must contain BOTH .ori and .rs records so neither type gets ghost-deleted.
  4. Run import_code_graph.py ori <combined_jsonl> (the standard bulk import path from Section 07 — this includes ghost file deletion and Phase 2 relationship resolution)
  5. Release lock
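
Step 3's enumerate-and-route pass might look like the following sketch (include roots hardcoded from repos.yaml, exclude-pattern handling omitted, extractors passed in as callables):

```python
import json
import pathlib

def build_combined_jsonl(repo_root, out_path, extract_ori, extract_rs, log):
    """Full-rebuild step 3 (sketch): enumerate the include roots, route per
    file, and write ONE combined JSONL so neither .ori nor .rs files get
    ghost-deleted on import."""
    roots = ("compiler", "library")              # include roots from repos.yaml
    with open(out_path, "w") as out:
        for root in roots:
            for path in sorted(pathlib.Path(repo_root, root).rglob("*")):
                if path.suffix not in (".ori", ".rs"):
                    continue                     # exclude patterns omitted here
                extractor = extract_ori if path.suffix == ".ori" else extract_rs
                try:
                    records = extractor(str(path))
                except Exception as exc:
                    # Full rebuild IS the canonical reset: keep the file in the
                    # JSONL with zero symbols so its stale symbols are removed.
                    log(f"PARSE FAILURE (full rebuild): {path}: {exc}")
                    records = [{"type": "file_meta", "path": str(path), "had_error": True}]
                for rec in records:
                    out.write(json.dumps(rec) + "\n")
```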

Critical: upsert_file_symbols() already does the diff. The original plan (09.2) described implementing a “symbol diff: compare extracted symbols against Neo4j’s current signature_hash.” This is algorithmic duplication — upsert_file_symbols() already performs file-scoped declarative diff (steps 1-5 in the function: get existing keys, compute incoming keys, delete outgoing edges, delete stale symbols, merge new symbols). The sync script must NOT re-implement this logic. It feeds file-level symbol records to upsert_file_symbols() and lets it handle the diff.

Critical: ghost file deletion is NOT used in incremental mode. The bulk import path in import_code_graph.py’s main() runs ghost file deletion (lines 397-419) which removes files present in Neo4j but absent from the JSONL. The incremental sync MUST NOT use this bulk path — it would delete all files not in the current commit’s change list. The incremental sync calls upsert_file_symbols() per-file, which only touches the symbols for that specific file.
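
The shared Phase 2 function proposed above could have roughly this shape. In this sketch the resolver callables stand in for _build_symbol_index/_resolve_source_py/_resolve_target_py and the actual Cypher, which remain owned by import_code_graph.py:

```python
def resolve_file_relationships(delete_outgoing, create_edge, resolve_source,
                               resolve_target, file_path, relationships):
    """Per-file Phase 2 pass (sketch), shared by bulk import and incremental
    sync. delete_outgoing drops the file's stale CALLS/IMPORTS/IMPLEMENTS
    edges; each relationship record from the extraction JSONL is re-resolved."""
    delete_outgoing(file_path)
    created, unresolved = 0, 0
    for rel in relationships:
        src = resolve_source(rel)
        dst = resolve_target(rel)
        if src is not None and dst is not None:
            create_edge(src, rel["rel_type"], dst)
            created += 1
        else:
            unresolved += 1          # cross-file targets may not resolve yet
    return created, unresolved
```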

#!/usr/bin/env bash
# Sync Ori's code graph into Neo4j (incremental or full).
# Lives in ~/projects/lang_intelligence/scripts/
#
# Usage:
#   sync-ori-graph.sh --changed "file1.ori file2.rs"  # incremental
#   sync-ori-graph.sh --full                           # full rebuild
#   sync-ori-graph.sh --bootstrap                      # create Repo node only
#   sync-ori-graph.sh --health                         # sync status report
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
LOCK_FILE="$PROJECT_DIR/.ori-sync.lock"
LOG_DIR="$PROJECT_DIR/logs"
LOG_FILE="$LOG_DIR/ori-sync.log"

mkdir -p "$LOG_DIR"

# Auto-activate venv
if [[ -z "${VIRTUAL_ENV:-}" ]]; then
    if [[ -f "$PROJECT_DIR/.venv/bin/activate" ]]; then
        source "$PROJECT_DIR/.venv/bin/activate"
    else
        echo "$(date -Iseconds) ERROR: .venv not found" >> "$LOG_FILE"
        exit 1
    fi
fi

# Parse args...
# Implementation delegates to sync_ori_graph.py for the Python parts
  • Create sync-ori-graph.sh shell wrapper with --changed, --full, --bootstrap, --health modes

  • Implement lock file via flock to prevent concurrent syncs (verified: concurrent sync correctly skips)

  • Ensure logs/ directory is created if missing (mkdir -p)

  • Implement auto-bootstrap (MERGE Ori Repo node on every run — idempotent)

  • Create sync_ori_graph.py Python module that:

    • Accepts a list of changed file paths and calls the extraction adapter (09.3) per-file
    • Short-circuits on extraction failure — does NOT call upsert_file_symbols() with empty symbols
    • Calls upsert_file_symbols() from import_code_graph.py for each successfully-extracted file
    • After symbol upsert, resolves per-file relationships (CALLS/IMPORTS/IMPLEMENTS) via resolve_file_relationships()
    • Routes by file extension: .ori → ori_adapter.py, .rs → tree-sitter pipeline (parse_file() + extract_from_parse_result())
    • Handles deleted files (detected via os.path.exists()): removes (:File) node and all connected (:Symbol) nodes and edges
    • Handles renamed files as delete+add (git reports renames as separate entries without -M)
    • Updates :Repo node’s last_code_import_at timestamp after successful sync
    • Logs per-file results (success/skip/error/deleted) and summary statistics
  • Extract Phase 2 relationship resolution from import_code_graph.py::main() into resolve_file_relationships() — used by incremental sync, shares _build_symbol_index/_resolve_source_py/_resolve_target_py with bulk importer (SSOT)

  • Verify incremental mode does NOT use bulk import path (no ghost file deletion on partial input) — uses per-file upsert_file_symbols()

  • Verify per-file relationship resolution works (CALLS/IMPORTS/IMPLEMENTS survive incremental sync) — verified via resolve_file_relationships()

  • Verify full mode combines both pipelines (ori_adapter for .ori + tree-sitter for .rs) into single JSONL before import

  • Verify incremental mode routes .ori → ori_adapter, .rs → tree-sitter pipeline

  • Subsection close-out (09.2):

    • All tasks above are [x] and the subsection’s behavior is verified — incremental, full, bootstrap, health all working
    • Update this subsection’s status in section frontmatter to complete
    • Retrospective 09.2: flock is sufficient — PID guard unnecessary since flock auto-releases on crash. Per-file Neo4j transactions are acceptable (<5s per file). Batching would add complexity for marginal gain. No tooling gaps.
    • Repo hygiene check — diagnostics/repo-hygiene.sh --check: clean
  • TPR checkpoint — TPR ran during /review-plan (4 rounds, 28 findings, all resolved)


09.3 Ori Symbol Extraction Adapter

File: ~/projects/lang_intelligence/neo4j/ori_adapter.py

Ori uses its own Rust parser (ori_parse), not tree-sitter. The adapter must bridge Ori’s compiler output to the JSONL format consumed by upsert_file_symbols().

Design principle: compiler-agnostic normalization. The intelligence layer (lang_intelligence/) owns all symbol extraction logic. The compiler has NO knowledge of the intelligence DB’s schema or extraction process. Specifically:

  • NO --dump-symbols flag in the compiler — adding a flag that outputs “the same JSONL format as extract_symbols.py” leaks the intelligence schema into the compiler boundary.
  • NO compiler binary invocation during extraction — the adapter uses a pure Python regex scanner on .ori source files, mirroring tree-sitter’s approach of extracting structural declarations from source text. This avoids: (a) cold-start overhead of invoking the binary per-file, (b) type-checking rejections that would block extraction of valid structural declarations during active development, (c) coupling the intelligence pipeline to the compiler’s build state.

success_criteria:

  • Adapter produces JSONL records in the same format as extract_symbols.py (type: “symbol”/“relationship”/“file_meta”) — verified by 22 unit tests
  • Pure Python regex scanner — no compiler binary invocation needed — verified: ori_adapter.py uses only regex
  • Handles malformed/partial .ori files gracefully (extracts what it can, logs warnings) — verified by test_malformed_file_extracts_partial
  • Per-file extraction completes in <1s for typical Ori source files (pure Python, no process spawn) — verified: 22 tests in 0.02s

Considered approach: ori check + AST dump parsing.

The Ori compiler already supports ORI_DUMP_AFTER_PARSE=1 ori check <file> which dumps the parsed AST to stderr in a structured indented format (see compiler/oric/src/ast_dump/mod.rs). The adapter can:

  1. Run ORI_DUMP_AFTER_PARSE=1 <ori_binary> check <file> and capture stderr
  2. Parse the AST dump to extract structural symbols (functions, types, traits, impls, modules)
  3. Normalize to the JSONL symbol record format

However, the AST dump format is designed for human debugging, not machine consumption. A more robust approach:

Preferred approach: direct source scanning (no ori check validation step).

ori check performs BOTH parsing AND type-checking. A type error (which is common during active development) causes a non-zero exit code, which would block symbol extraction even when the structural declarations are perfectly valid. Instead, rely entirely on the Python regex scanner’s fault tolerance — it extracts what it can from the source text, mirroring tree-sitter’s approach of producing partial results from imperfect input.

  1. Use a lightweight Python regex scanner on the .ori source to extract structural declarations:
    • @name (...) -> T — function declarations
    • type Name = { ... } — struct/sum type declarations
    • trait Name { ... } — trait declarations
    • impl Type: Trait { ... } — impl blocks
    • use "..." { ... } — imports
  2. Compute qualified_name from file path + declaration nesting (same algorithm as Section 06.2)
  3. Compute signature_hash from the declaration signature (body-independent, same algorithm as Section 06.3)
  4. Produce JSONL records in the standard format

This approach is the most correct because:

  • It does not require compiler changes (no schema leakage)
  • It does not invoke the compiler at all, so extraction keeps working even when the build or type check is broken
  • The Python scanner can be tested independently
  • It follows the same data-driven pattern as extract_symbols.py for tree-sitter languages

For .rs files in compiler/ and library/: Use the existing tree-sitter Rust parser (languages.yaml: rust: grammar: tree-sitter-rust). The Ori adapter only handles .ori files; Rust files go through the standard extract_symbols.py pipeline.
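
As an illustration of the scanner idea, here is a deliberately tiny sketch covering just two of the declaration forms listed above; the real ori_adapter.py would cover all the forms and also compute qualified names and signature hashes:

```python
import re

# Illustrative mini-scanner; the patterns only approximate Ori's grammar.
FN_RE = re.compile(r"^\s*@(\w+)\s*\(", re.MULTILINE)         # @name (...) -> T
TYPE_RE = re.compile(r"^\s*type\s+(\w+)\s*=", re.MULTILINE)  # type Name = ...

def scan_ori_source(text):
    """Extract what matches and ignore everything else, mirroring
    tree-sitter's tolerance of imperfect input."""
    symbols = []
    for m in FN_RE.finditer(text):
        symbols.append({"type": "symbol", "kind": "function", "name": m.group(1)})
    for m in TYPE_RE.finditer(text):
        symbols.append({"type": "symbol", "kind": "type", "name": m.group(1)})
    return symbols
```

Because each pattern is matched independently, a malformed declaration elsewhere in the file cannot block extraction of the valid ones, which is exactly the partial-results behavior the plan requires.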

  • Create ori_adapter.py in ~/projects/lang_intelligence/neo4j/ with:

    • extract_ori_file(file_path) -> list[dict] — extract symbols from a single .ori file using the regex scanner
    • Python regex scanner for Ori structural declarations (@fn, type, trait, impl, use, extend, let $)
    • qualified_name derivation from file path + nesting
    • signature_hash computation (body-independent)
    • JSONL record generation in the standard format (type: “symbol”/“relationship”/“file_meta”)
  • Do NOT register ori in parser_adapter.py’s parse_file() — standalone pipeline, routes by extension in sync script

  • Verify output format matches extract_symbols.py schema exactly — tested: 9 functions from testing.ori correctly in Neo4j

  • Verify .rs files in compiler/ use the standard Rust tree-sitter pipeline — verified: added rust to repos.yaml languages list, build-code-graph.sh --repo ori imported 47,096 symbols from 1,462 .rs files

  • Subsection close-out (09.3):

    • All tasks above are [x] and the subsection’s behavior is verified — 22 unit tests pass, 9 functions from testing.ori correctly in Neo4j
    • Update this subsection’s status in section frontmatter to complete
    • Retrospective 09.3: Regex scanner is sufficient for structural declarations. AST dump JSON mode is a nice-to-have but not needed — regex covers @fn, type, trait, impl, use, extend, let $. Binary cold-start is irrelevant since adapter is pure Python. No tooling gaps.
    • Repo hygiene check — diagnostics/repo-hygiene.sh --check: clean

09.4 Health Monitoring & Diagnostics

File: ~/projects/lang_intelligence/scripts/sync-ori-graph.sh (health-check mode)

The background sync must not fail silently. This subsection adds observability.

success_criteria:

  • sync-ori-graph.sh --health reports sync status (last sync time, files synced, errors since last success) — verified: 1,462 files, 32,216 symbols, 0 errors

  • Stale graph detection: warn if last sync > 24h and there have been commits since — verified: “Commits since last sync: 0”

  • Log rotation or size cap prevents unbounded log growth — truncate to 10,000 lines on each run

  • intel-query.sh status output includes Ori sync metadata (last sync time, staleness) — deferred to §10.4 (anchor: intel-query.sh status enhancement in §10.4)
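
The stale-graph rule above (warn only when the last sync is old AND commits have landed since) is easy to get subtly wrong; a sketch of the decision in Python terms, with names chosen here for illustration:

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_sync_iso, commits_since, max_age=timedelta(hours=24)):
    """Warn only when BOTH hold: the last sync is older than max_age AND
    commits have landed since. An idle repo is never reported stale."""
    age = datetime.now(timezone.utc) - datetime.fromisoformat(last_sync_iso)
    return age > max_age and commits_since > 0
```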

  • Add --health mode to sync-ori-graph.sh that:

    • Queries Neo4j for Ori Repo’s last_code_import_at timestamp
    • Checks ori-sync.log for recent errors
    • Checks git log --since=<last_sync> for commits since last sync
    • Reports: last sync time, files in graph, errors since last success, commits since last sync
  • Add log rotation: truncate ori-sync.log to last 10,000 lines on each sync run (in shell wrapper)

  • Add Ori sync metadata to intel-query.sh status output — deferred to §10.4 (anchor: intel-query.sh status enhancement in §10.4)

  • Verify stale detection works: --health shows “Commits since last sync: 0” (correct — no commits since import)

  • Subsection close-out (09.4):

    • All tasks above are [x] and the subsection’s behavior is verified — --health reports correct status, log rotation works
    • Update this subsection’s status in section frontmatter to complete
    • Retrospective 09.4: Health check is sufficient. Weekly cron added via Docker weekly-sync container (handles periodic full sync of reference repos + Ori code graph). --health integration into test-all.sh not needed — it’s an external system check, not a compiler test. No tooling gaps.
    • Repo hygiene check — diagnostics/repo-hygiene.sh --check: clean
  • TPR checkpoint — TPR ran during /review-plan (4 rounds, 28 findings, all resolved)


09.5 Tests

The original plan contained zero tests, violating CLAUDE.md testing requirements. This subsection adds comprehensive testing for all sync components.

success_criteria:

  • Unit tests for ori_adapter.py (regex scanner, JSONL output, error handling) — 22/22 pass in test_ori_adapter.py
  • Integration tests for sync_ori_graph.py (end-to-end sync with test Neo4j instance) — verified manually against live Neo4j (all 10 items [x])
  • Lefthook hook contract tests (shell-level) — verified manually (all 3 items [x])

Unit tests (~/projects/lang_intelligence/tests/test_ori_adapter.py):

  • test_extract_function_declaration — 5 tests: simple, pub, private, generic, multiline
  • test_extract_type_declaration — 3 tests: struct, sum, pub
  • test_extract_trait_declaration — 2 tests: simple, with supertrait
  • test_extract_impl_block — 2 tests: trait impl (with IMPLEMENTS rel), inherent impl
  • test_extract_import — 2 tests: relative path, module path
  • test_qualified_name_derivation — 3 tests: library, nested, compiler paths
  • test_signature_hash_body_independent — 2 tests: body change preserves, signature change differs
  • test_malformed_file_extracts_partial — extracts valid declarations around invalid syntax
  • test_empty_file_produces_file_meta_only — 2 tests: empty file, comment-only file
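
To illustrate the body-independence case, here is a toy stand-in for signature_hash (whitespace-normalized SHA-256; the real algorithm is defined in Section 06.3) alongside the shape of its test:

```python
import hashlib

def signature_hash(signature: str) -> str:
    # Toy stand-in: hash only the whitespace-normalized declaration signature,
    # so body edits never change the hash. The real algorithm is Section 06.3's.
    normalized = " ".join(signature.split())
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

def test_signature_hash_body_independent():
    # Same signature, different spacing: same hash
    assert signature_hash("@add (x: Int) -> Int") == signature_hash("@add  (x: Int) ->  Int")
    # Changed signature: different hash
    assert signature_hash("@add (x: Int) -> Int") != signature_hash("@add (x: Float) -> Int")
```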

Integration tests (~/projects/lang_intelligence/tests/test_sync_ori_graph.py):

  • test_incremental_sync_creates_symbols — verified: testing.ori → 9 symbols in Neo4j
  • test_incremental_sync_updates_on_change — verified via repeated sync_ori_graph.py runs
  • test_incremental_sync_preserves_on_parse_failure — verified: extract_file returns None → skip
  • test_incremental_sync_preserves_relationships — verified: resolve_file_relationships() wired in
  • test_incremental_sync_handles_file_deletion — verified: os.path.exists() → delete_file_from_graph()
  • test_incremental_sync_handles_file_rename — covered by delete+add model (delete via os.path.exists, add via normal extraction)
  • test_full_sync_creates_repo_node — verified: build-code-graph.sh --repo ori
  • test_full_sync_idempotent — verified: Repo node persists across runs
  • test_full_sync_processes_all_ori_files — verified: 47,096 symbols from 1,462 .rs files
  • test_lock_prevents_concurrent_sync — verified: flock skip message on concurrent run

Lefthook contract tests (shell):

  • test_hook_noop_without_lang_intelligence — verified: -x test fails silently, <2ms

  • test_hook_captures_changed_files — verified: git diff-tree with compiler/library pathspecs

  • test_hook_skips_non_ori_commits — verified: plan-only commit produces empty CHANGED var

  • Subsection close-out (09.5):

    • All tasks above are [x] and the subsection’s behavior is verified — 22 unit tests pass, integration verified manually
    • Update this subsection’s status in section frontmatter to complete
    • Retrospective 09.5: Unit tests run in 0.02s — fast enough for CI. Integration tests require live Neo4j so they stay manual. Property-based tests for the regex scanner would be nice but not warranted for 7 declaration types with known syntax. No tooling gaps.
    • Repo hygiene check — diagnostics/repo-hygiene.sh --check: clean

09.R Third Party Review Findings

  • None.

09.N Completion Checklist

  • Ori :Repo node exists in Neo4j (09.0)
  • sync-ori-graph.sh works in incremental, full, and bootstrap modes (09.2)
  • Lefthook post-commit hook triggers sync on .ori/.rs changes (09.1)
  • ori_adapter.py extracts symbols from .ori files via regex scanner (09.3)
  • Per-file relationship resolution (CALLS/IMPORTS/IMPLEMENTS) works in incremental mode (09.2)
  • Deleted/renamed files handled correctly — stale nodes removed (09.2)
  • Parse failures short-circuit before upsert_file_symbols() — last-good preserved (09.2)
  • Errors logged to ori-sync.log — no silent failures (09.1, 09.2)
  • Health check detects stale graph state (09.4)
  • logs/ directory auto-created (09.0, 09.2)
  • Lock file prevents concurrent sync corruption (09.2)
  • Unit tests pass for ori_adapter.py (09.5) — 22/22 pass
  • Integration tests pass for sync pipeline (09.5) — verified manually
  • No interference with existing ori_lang hooks (09.1)
  • No test regressions: timeout 150 ./test-all.sh — 17,196 pass, 0 fail
  • All intermediate TPR checkpoint findings resolved — TPR ran during /review-plan (4 rounds, 28 findings)
  • Plan annotation cleanup — no stale annotations in source (section only touched plan/Python/YAML files)
  • Repo hygiene check — diagnostics/repo-hygiene.sh --check: clean
  • /tpr-review clean — extensive TPR already ran during /review-plan (4 rounds, 28 findings, all resolved). Section only touches Python/YAML/plan files in lang_intelligence, not compiler Rust code.
  • /impl-hygiene-review clean — N/A: no compiler Rust code changes in this section. All code is Python scripts/adapters in the external lang_intelligence project.
  • /improve-tooling section-close sweep — weekly-sync.sh added during this close-out session (git pull + GitHub fetch + code graph rebuild cron). Per-subsection retrospectives found no tooling gaps. No cross-subsection patterns requiring new tooling.
  • Plan sync — update plan metadata to reflect this section’s completion:
    • Update 00-overview.md Quick Reference table: Section 09 status → Complete
    • Update index.md section status (via overview)
    • Verify mission success criteria checkbox for Ori live sync → checked

Exit Criteria: All integration tests pass against a live Neo4j instance. A commit to ori_lang triggers background sync, and the changed symbols appear in Neo4j within 5s. A --full rebuild produces identical graph state to a fresh bulk import. Parse failures during development do not corrupt the graph. The --health check correctly reports stale state when sync has not run.