String Interning
The Ori compiler interns all identifiers to enable O(1) comparison and reduce memory usage.
What is String Interning?
String interning stores each unique string once and represents it with a small ID. When the same string appears multiple times, they all share the same ID.
let interner = Interner::new();
let name1 = interner.intern("foo"); // Name(0)
let name2 = interner.intern("foo"); // Name(0) - same!
let name3 = interner.intern("bar"); // Name(1)
assert_eq!(name1, name2); // O(1) comparison
Implementation
Name Type
/// Interned string identifier
#[derive(Clone, Copy, Eq, PartialEq, Hash, Debug)]
pub struct Name(pub u32);
impl Name {
pub fn index(self) -> usize {
self.0 as usize
}
}
Name is just a 32-bit index. It’s:
Copy- Cheap to pass aroundEq, Hash- Can be used in collections- 4 bytes (vs ~24 bytes for String on 64-bit)
Interner
#[derive(Clone, Debug, Default)]
pub struct Interner {
/// All interned strings
strings: Vec<String>,
/// Map from string to Name
lookup: HashMap<String, Name>,
}
impl Interner {
pub fn new() -> Self {
Self::default()
}
/// Intern a string, returning its Name
pub fn intern(&mut self, s: &str) -> Name {
if let Some(&name) = self.lookup.get(s) {
return name;
}
let name = Name(self.strings.len() as u32);
self.strings.push(s.to_string());
self.lookup.insert(s.to_string(), name);
name
}
/// Resolve Name back to string
pub fn resolve(&self, name: Name) -> &str {
&self.strings[name.index()]
}
/// Check if string is interned
pub fn get(&self, s: &str) -> Option<Name> {
self.lookup.get(s).copied()
}
}
Usage
During Lexing
The lexer interns identifiers as it tokenizes:
impl Lexer {
fn scan_identifier(&mut self) -> Token {
let text = self.read_while(is_identifier_char);
let name = self.interner.intern(&text);
Token::Ident(name)
}
}
In the AST
AST nodes store Names, not Strings:
struct Function {
name: Name, // Not String
params: Vec<Param>,
body: ExprId,
}
struct Param {
name: Name, // Not String
ty: Type,
}
enum ExprKind {
Ident(Name), // Variable reference
Field { name: Name }, // Field access
// ...
}
In the Type Checker
Type names are also interned:
enum Type {
Named(Name), // User-defined type name
// ...
}
In the Evaluator
Environment uses Names for variable lookup:
struct Environment {
scopes: Vec<HashMap<Name, Value>>,
}
impl Environment {
fn get(&self, name: Name) -> Option<&Value> {
for scope in self.scopes.iter().rev() {
if let Some(value) = scope.get(&name) {
return Some(value);
}
}
None
}
}
Benefits
1. Fast Comparison
// String comparison: O(n) where n = string length
"long_identifier_name" == "long_identifier_name"
// Name comparison: O(1)
Name(42) == Name(42)
2. Reduced Memory
Without interning:
let x = "foo" // "foo" allocated
let y = "foo" // Another "foo" allocated
With interning:
let x = intern("foo") // Name(0), "foo" stored once
let y = intern("foo") // Name(0), reuses existing
3. Hashable
Names can be used in HashMaps without hashing strings:
// HashMap<String, Value> - must hash entire string
// HashMap<Name, Value> - hashes 4-byte integer
4. Salsa Compatible
Name is Copy + Eq + Hash, perfect for Salsa queries.
Thread Safety
The current Interner is not thread-safe. For parallel compilation:
// Option 1: Mutex-protected
struct SharedInterner {
inner: Mutex<Interner>,
}
// Option 2: Concurrent hashmap
struct ConcurrentInterner {
strings: DashMap<String, Name>,
reverse: Vec<RwLock<String>>,
}
Currently, the compiler is single-threaded during lexing/parsing, so this isn’t needed yet.
Common Identifiers
Frequently used identifiers are pre-interned:
impl Interner {
pub fn with_keywords() -> Self {
let mut interner = Self::new();
// Pre-intern keywords
interner.intern("if");
interner.intern("else");
interner.intern("let");
interner.intern("fn");
// ...
interner
}
}
This ensures keywords have predictable Names (useful for fast keyword checking).
Debugging
Printing Names
When debugging, resolve Names back to strings:
fn debug_expr(interner: &Interner, expr: &Expr) {
match &expr.kind {
ExprKind::Ident(name) => {
println!("Ident: {}", interner.resolve(*name));
}
// ...
}
}
Display Implementations
impl Name {
pub fn display<'a>(&self, interner: &'a Interner) -> impl Display + 'a {
interner.resolve(*self)
}
}
StringLookup Trait
The StringLookup trait provides a minimal interface for Name resolution, avoiding
circular dependencies between crates:
// In ori_ir::interner
pub trait StringLookup {
fn lookup(&self, name: Name) -> &str;
}
impl StringLookup for StringInterner {
fn lookup(&self, name: Name) -> &str {
StringInterner::lookup(self, name)
}
}
This trait is re-exported from ori_patterns and used by Value::type_name_with_interner():
// In ori_patterns::value
pub fn type_name_with_interner<I: StringLookup>(&self, interner: &I) -> Cow<'static, str> {
match self {
Value::Struct(s) => Cow::Owned(interner.lookup(s.type_name).to_string()),
_ => Cow::Borrowed(self.type_name()),
}
}
This pattern allows the value crate to resolve struct type names without depending on the full interner implementation.
Limitations
- Interned strings are never freed - The interner only grows
- Requires interner access - Must pass interner to display names
- Global state - Interner must be accessible throughout compilation
These are acceptable tradeoffs for the performance benefits.