Lexer Overview
The Ori lexer converts source text into a stream of tokens. It’s implemented using the logos crate for DFA-based tokenization.
Location
compiler/ori_lexer/src/lib.rs (~643 lines)
The lexer is a separate crate with minimal dependencies:
- ori_ir - for Token, TokenKind, Span, TokenList, StringInterner
- logos - for DFA-based tokenization
Design Goals
- Fast tokenization via logos DFA
- String interning for identifiers
- Special literals (duration, size)
- No lexer errors - invalid input becomes an Error token
Architecture
Source Text
│
│ logos::Lexer
▼
Raw Tokens (logos-generated)
│
│ Post-processing
▼
TokenList (with spans, interned names)
Token Definition
Tokens are defined using the logos derive macro:
#[derive(Logos, Debug, Clone, PartialEq)]
pub enum TokenKind {
    // Keywords
    #[token("let")]
    Let,
    #[token("if")]
    If,
    #[token("else")]
    Else,
    #[token("then")]
    Then,
    // ...

    // Operators
    #[token("+")]
    Plus,
    #[token("-")]
    Minus,
    #[token("*")]
    Star,
    // ...

    // Literals
    #[regex(r"[0-9]+", |lex| lex.slice().parse().ok())]
    Int(i64),
    #[regex(r"[0-9]+\.[0-9]+", |lex| lex.slice().parse().ok())]
    Float(f64),
    #[regex(r#""[^"]*""#, |lex| {
        let s = lex.slice();
        Some(s[1..s.len()-1].to_string())
    })]
    String(String),

    // Identifiers (interned later)
    #[regex(r"[a-zA-Z_][a-zA-Z0-9_]*")]
    Ident,

    // Special - handled in post-processing
    #[regex(r"[0-9]+(ms|s|m|h)")]
    Duration,
    #[regex(r"[0-9]+(b|kb|mb|gb)")]
    Size,

    // Error fallback
    #[error]
    Error,
}
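The derive macro also generates a TokenKind::lexer() constructor that can be driven directly. A minimal sketch of that usage, assuming the pre-0.13 logos API implied by the #[error] attribute above and that whitespace skipping is configured in one of the elided attributes:

use logos::Logos;

// Illustrative only: iterates the generated lexer over a small snippet.
fn lexer_smoke_check() {
    let mut lex = TokenKind::lexer("if x then 1 else 2");

    assert_eq!(lex.next(), Some(TokenKind::If));
    assert_eq!(lex.next(), Some(TokenKind::Ident));
    assert_eq!(lex.slice(), "x"); // text of the most recently returned token
    assert_eq!(lex.next(), Some(TokenKind::Then));
    assert_eq!(lex.next(), Some(TokenKind::Int(1)));
}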
Tokenization Process
1. Initial Tokenization
pub fn tokenize(db: &dyn Db, source: &str) -> TokenList {
    let lexer = TokenKind::lexer(source);
    let mut tokens = Vec::new();
    let mut spans = Vec::new();

    for (kind, span) in lexer.spanned() {
        tokens.push(kind);
        spans.push(Span::new(span.start, span.end));
    }
    // ...
}
2. Post-Processing
After logos tokenization:
// Intern identifiers
for (i, token) in tokens.iter_mut().enumerate() {
    if let TokenKind::Ident = token {
        let text = &source[spans[i].start..spans[i].end];
        let name = db.interner().intern(text);
        // The raw Ident token is replaced by a variant carrying the interned name
        *token = TokenKind::Identifier(name);
    }
}

// Parse duration literals
for (i, token) in tokens.iter_mut().enumerate() {
    if let TokenKind::Duration = token {
        let text = &source[spans[i].start..spans[i].end];
        *token = parse_duration(text);
    }
}
3. Result
TokenList {
    tokens,
    spans,
}
Escape Sequence Handling
String and character unescaping share a common resolve_escape() helper:
fn resolve_escape(c: char) -> Option<char> {
    match c {
        'n' => Some('\n'),
        't' => Some('\t'),
        'r' => Some('\r'),
        '0' => Some('\0'),
        '\\' => Some('\\'),
        '"' => Some('"'),
        '\'' => Some('\''),
        _ => None,
    }
}
Both unescape_string() and unescape_char() delegate to this function, avoiding duplicated escape logic.
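For illustration, a sketch of what unescape_string() might look like on top of this helper (the real signature and error reporting may differ):

fn unescape_string(raw: &str) -> Option<String> {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            // A backslash must be followed by a recognized escape character
            out.push(resolve_escape(chars.next()?)?);
        } else {
            out.push(c);
        }
    }
    Some(out)
}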
Special Literals
Duration Literals
100ms -> Duration(Milliseconds(100))
5s -> Duration(Seconds(5))
2m -> Duration(Minutes(2))
1h -> Duration(Hours(1))
Size Literals
1024b -> Size(Bytes(1024))
4kb -> Size(Kilobytes(4))
10mb -> Size(Megabytes(10))
2gb -> Size(Gigabytes(2))
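The parse_duration() call from the post-processing step splits the digits from the unit suffix. A sketch of that logic under assumed types (DurationValue and parse_duration_value() are illustrative; the real payload type lives in ori_ir, and the real parse_duration() wraps the result back into a TokenKind):

// Hypothetical payload type standing in for the real representation.
enum DurationValue {
    Milliseconds(u64),
    Seconds(u64),
    Minutes(u64),
    Hours(u64),
}

fn parse_duration_value(text: &str) -> DurationValue {
    // Split "100ms" into ("100", "ms"); the Duration regex guarantees this shape.
    let split = text.find(|c: char| !c.is_ascii_digit()).unwrap_or(text.len());
    let (digits, unit) = text.split_at(split);
    let value: u64 = digits.parse().unwrap_or(0);

    match unit {
        "ms" => DurationValue::Milliseconds(value),
        "s" => DurationValue::Seconds(value),
        "m" => DurationValue::Minutes(value),
        "h" => DurationValue::Hours(value),
        _ => unreachable!("the Duration regex only admits ms, s, m, h"),
    }
}

Size literals would follow the same pattern, mapping the b/kb/mb/gb suffixes to their byte-count variants.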
TokenList Structure
#[derive(Clone, Eq, PartialEq, Hash, Debug)]
pub struct TokenList {
    tokens: Vec<TokenKind>,
    spans: Vec<Span>,
}

impl TokenList {
    pub fn get(&self, index: usize) -> Option<&TokenKind> {
        self.tokens.get(index)
    }

    pub fn span(&self, index: usize) -> Span {
        self.spans[index]
    }

    pub fn len(&self) -> usize {
        self.tokens.len()
    }
}
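Consumers index into the list by position. A small hypothetical helper showing the accessor API in use (error_spans() is illustrative, not part of the crate):

// Collect the spans of all Error tokens, e.g. for later diagnostics.
fn error_spans(tokens: &TokenList) -> Vec<Span> {
    let mut spans = Vec::new();
    for i in 0..tokens.len() {
        if let Some(TokenKind::Error) = tokens.get(i) {
            spans.push(tokens.span(i));
        }
    }
    spans
}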
No Error Recovery
The lexer does not attempt error recovery. Invalid characters become Error tokens:
// Input: "let x = @#$"
// Output: [Let, Ident, Eq, Error, Error, Error]
Error handling is deferred to the parser, which can provide better diagnostics with context.
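A sketch of the kind of test this policy allows (hypothetical; it drives the raw logos lexer directly rather than the db-aware tokenize()):

#[test]
fn invalid_input_still_tokenizes() {
    use logos::Logos;

    // No input can make lexing fail; junk surfaces as Error tokens and
    // lexing continues past it.
    let kinds: Vec<TokenKind> = TokenKind::lexer("let x = @#$").collect();

    assert_eq!(kinds[0], TokenKind::Let);
    assert!(kinds.iter().any(|k| *k == TokenKind::Error));
}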
Performance
Using logos provides:
- DFA-based - O(n) tokenization
- Zero-copy where possible
- Compiled regex - patterns compiled at build time
Salsa Integration
Tokenization is a Salsa query:
#[salsa::tracked]
pub fn tokens(db: &dyn Db, file: SourceFile) -> TokenList {
    let source = file.text(db);
    tokenize(db, &source)
}
This enables:
- Caching of token results
- Early cutoff if tokens unchanged
- Dependency tracking
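For example, a downstream query that reads the token list automatically picks up these benefits (a hypothetical sketch; parse(), Module, and build_module() are illustrative, not the real API):

// Hypothetical downstream query: reading tokens(db, file) records a dependency.
// If a later revision produces an equal TokenList, Salsa's early cutoff reuses
// the cached parse result instead of re-running this body.
#[salsa::tracked]
pub fn parse(db: &dyn Db, file: SourceFile) -> Module {
    let token_list = tokens(db, file);
    build_module(db, &token_list)
}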
Related Documents
- Token Design - Token type details
- Architecture: Pipeline - Pipeline overview