s00 — Reference Corpus + Paper Corpus Pre-work
Goal
The three influence-source trees (llvm-project, gcc, wasmtime/Cranelift) exist locally as shallow clones, are queryable through the intel graph like the existing 10 reference compilers, and a curated modern-backend paper corpus + clean-room discipline doc anchor every later research section.
Implementation Sketch
- Clones land in ~/projects/reference_repos/lang_repos/ beside the existing repos: git clone --depth 1 of llvm/llvm-project, gcc-mirror/gcc, bytecodealliance/wasmtime (Cranelift lives at wasmtime/cranelift/). Record cloned SHAs in the dossier file.
- Intel-graph indexing extends the existing multi-repo import (intel_repo sync scripts; the swift repo proves C++ indexing works). Verify post-index: scripts/intel-query.sh file-symbols "cranelift/codegen" --repo wasmtime, similar "regalloc" --repo wasmtime,golang, and a GCC IPA symbol lookup all return results.
- Paper corpus is an annotated bibliography authored into this plan’s content/ (one entry per paper: claim, relevance to which section, adoption verdict): aegraphs (Fallin et al.), ISLE (Fallin), regalloc2 design notes, Braun et al. 2013 SSA construction, Click sea-of-nodes, RVSDG, copy-and-patch (Xu & Kjolstad), Perceus (PLDI 2021), Go SSA backend README/regalloc notes, linear-scan (Poletto & Sarkar) + extensions.
- Clean-room discipline doc: ZERO code copying from ANY source (user directive: a uniform rule that includes Apache-2.0 Cranelift/LLVM); GCC (GPLv3) is study-only with no code reuse possible regardless; ideas/algorithms are re-implemented from understanding; port test INTENT, never test text. Every research dossier carries this header.
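The clone-and-record step above can be sketched as a small script. This is a minimal sketch, not the plan's actual tooling: the GitHub URLs are an assumption inferred from the org/repo names, and the helper names (clone_command, head_sha, clone_all, dossier_lines) and the dossier line format are hypothetical.

```python
import subprocess
from pathlib import Path

# Org/repo names come from the plan; hosting on GitHub is an assumption.
REPOS = {
    "llvm-project": "https://github.com/llvm/llvm-project.git",
    "gcc": "https://github.com/gcc-mirror/gcc.git",
    "wasmtime": "https://github.com/bytecodealliance/wasmtime.git",
}
DEST = Path.home() / "projects" / "reference_repos" / "lang_repos"


def clone_command(url: str, target: Path) -> list[str]:
    """Build the shallow-clone argv for one repository."""
    return ["git", "clone", "--depth", "1", url, str(target)]


def head_sha(repo_dir: Path) -> str:
    """Return the commit SHA that HEAD points at in an existing clone."""
    out = subprocess.run(
        ["git", "-C", str(repo_dir), "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()


def clone_all(dest: Path = DEST) -> dict[str, str]:
    """Clone each repo that is not already present and return {name: sha}."""
    dest.mkdir(parents=True, exist_ok=True)
    shas = {}
    for name, url in REPOS.items():
        target = dest / name
        if not target.exists():
            subprocess.run(clone_command(url, target), check=True)
        shas[name] = head_sha(target)
    return shas


def dossier_lines(shas: dict[str, str], dest: Path = DEST) -> list[str]:
    """Format one line per repo: name, SHA, and the clone command used."""
    return [
        f"{name} {sha} :: {' '.join(clone_command(REPOS[name], dest / name))}"
        for name, sha in sorted(shas.items())
    ]
```

Calling dossier_lines(clone_all()) would yield the lines to paste into the dossier. Nothing runs at import time, so the command builder can be reviewed before any network traffic happens.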
Constraints
- No compiler-source changes in this section; tooling edits (intel_repo import config) follow /improve-tooling discipline.
- Indexing capacity: llvm-project + gcc are an order of magnitude larger than existing repos. If Neo4j ingestion needs batching/config changes, fix the importer (tooling-first); never trim the corpus silently.
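If batching turns out to be needed, the shape of the fix might look like the sketch below. It is generic and does not describe the existing importer: write_batch is a hypothetical callback standing in for whatever the importer does per transaction, and the batch size is a free parameter. The point it illustrates is that batching bounds the size of each write without dropping any input.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def ingest(paths: Iterable[str], size: int,
           write_batch: Callable[[list[str]], None]) -> int:
    """Send every path to `write_batch` in bounded batches.

    `write_batch` is a hypothetical stand-in for one importer transaction.
    The return value is the number of paths sent, so a caller can compare
    it against the expected corpus size and detect silent trimming.
    """
    total = 0
    for chunk in batched(paths, size):
        write_batch(chunk)
        total += len(chunk)
    return total
```

Comparing the returned count with the number of files discovered on disk gives a cheap check that the corpus was ingested whole.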
Work Items
- Shallow-clone llvm-project, gcc, and wasmtime into ~/projects/reference_repos/lang_repos/ (--depth 1); record SHAs + clone commands in a content dossier.
- Extend the intel-graph import config to index the three new repos; run the import; verify file-symbols/similar/callers queries return results for cranelift/, llvm/lib/Transforms/, gcc/gcc/ipa-*.
- Author the annotated paper-corpus bibliography (>= 10 entries, each with claim/relevance/adoption-verdict fields) into this plan’s content dir.
- Author the clean-room licensing discipline doc (zero-copy rule incl. permissive licenses; GCC study-only; test-intent-not-text porting rule) and reference it from every later research section.
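The two authoring items have acceptance conditions that can be checked mechanically. The following is a minimal sketch under stated assumptions: the PaperEntry field names mirror the claim/relevance/adoption-verdict fields named above, but the storage format is not specified by the plan, and CLEAN_ROOM_MARKER is a hypothetical placeholder for whatever header text the discipline doc ends up defining.

```python
from dataclasses import dataclass

MIN_ENTRIES = 10  # from the work item: ">= 10 entries"
# Hypothetical marker; the real header wording belongs to the discipline doc.
CLEAN_ROOM_MARKER = "CLEAN-ROOM"


@dataclass
class PaperEntry:
    title: str
    claim: str
    relevance: str         # which later section the paper informs
    adoption_verdict: str  # free text; the plan fixes no vocabulary


def bibliography_problems(entries: list[PaperEntry]) -> list[str]:
    """Return human-readable problems; an empty list means the corpus passes."""
    problems = []
    if len(entries) < MIN_ENTRIES:
        problems.append(f"only {len(entries)} entries, need {MIN_ENTRIES}")
    for e in entries:
        for field in ("claim", "relevance", "adoption_verdict"):
            if not getattr(e, field).strip():
                problems.append(f"{e.title}: empty {field}")
    return problems


def has_clean_room_header(dossier_text: str, head_lines: int = 10) -> bool:
    """True if the marker appears within the first `head_lines` lines."""
    head = dossier_text.splitlines()[:head_lines]
    return any(CLEAN_ROOM_MARKER in line for line in head)
```

Returning a list of problems rather than raising on the first one lets a reviewer see every gap in the bibliography in a single pass.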