Proposal: Character and Byte Classification Methods

Status: Approved
Author: Eric (with Claude)
Created: 2026-03-05
Approved: 2026-03-05


Summary

Define standard classification methods on char and byte types for character category testing. These are essential for text processing, lexers, parsers, and input validation.

'A'.is_alphabetic()    // true
b'\t'.is_whitespace()  // true
'3'.is_digit()         // true
b'\xFF'.is_ascii()     // false

Motivation

The Problem

Ori has char and byte types but no spec-defined methods for character classification. Writing a lexer requires constant range checks:

// Current: manual range checks
if (ch >= b'a' && ch <= b'z') || (ch >= b'A' && ch <= b'Z') then ...
if ch >= b'0' && ch <= b'9' then ...
if ch == b' ' || ch == b'\t' || ch == b'\n' || ch == b'\r' then ...

This is verbose, error-prone (it is easy to miss a case), and obscures intent.

What We Want

if ch.is_alpha() then ...
if ch.is_digit() then ...
if ch.is_whitespace() then ...

Self-documenting, correct, and concise.

Prior Art

Comparable languages provide these in their standard libraries:

Language  Module/Type  Example Methods
Rust      char, u8     is_alphabetic(), is_ascii_digit(), is_whitespace()
Go        unicode      IsLetter(), IsDigit(), IsSpace()
Swift     Character    isLetter, isNumber, isWhitespace
Zig       std.ascii    isAlphabetic(), isDigit(), isWhitespace()

Archived Design

The Ori archived design doc (03-type-system/01-primitive-types.md) already lists is_alphabetic(), is_numeric(), is_whitespace(), is_ascii() on char. This proposal formalizes and extends that design.


Design

Char Methods (Unicode)

Methods on char operate on full Unicode:

Method             Description        Unicode Category
is_alphabetic()    Letter             L* (Letter)
is_digit()         Decimal digit      Nd (Decimal Number); includes digits from all scripts
is_alphanumeric()  Letter or digit    L* or Nd
is_whitespace()    Whitespace         Zs, \t, \n, \r, \u{000B}, \u{000C}, \u{0085}, \u{2028}, \u{2029}
is_uppercase()     Uppercase letter   Lu
is_lowercase()     Lowercase letter   Ll
is_ascii()         In ASCII range     U+0000..U+007F
is_control()       Control character  Cc

All return bool. All are pure (no self-mutation).
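
The whitespace row is the only one that mixes a general category with individual code points, so it may help to see it written out. The following is a minimal Rust sketch, not Ori code and not the proposed implementation: the function name is hypothetical, and the Zs members listed are an assumption based on current Unicode versions.

```rust
/// Hypothetical model of `char.is_whitespace()` as specified in the table:
/// category Zs plus the individually listed code points.
fn is_whitespace_sketch(c: char) -> bool {
    matches!(
        c,
        // Individually listed code points
        '\t' | '\n' | '\r' | '\u{000B}' | '\u{000C}'
            | '\u{0085}' | '\u{2028}' | '\u{2029}'
            // Category Zs (Space Separator), assumed membership
            | ' ' | '\u{00A0}' | '\u{1680}'
            | '\u{2000}'..='\u{200A}'
            | '\u{202F}' | '\u{205F}' | '\u{3000}'
    )
}
```

Under these assumptions the set coincides with the Unicode White_Space property, which is also what Rust's `char::is_whitespace` tests.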

Char Methods (ASCII)

ASCII-scoped methods on char for when Unicode classification is not desired:

Method                   Description            Range
is_ascii_alphabetic()    ASCII letter           a-z, A-Z
is_ascii_digit()         ASCII digit            0-9
is_ascii_alphanumeric()  ASCII letter or digit  a-z, A-Z, 0-9
is_ascii_whitespace()    ASCII whitespace       ' ', \t, \n, \r, \x0B, \x0C
is_ascii_uppercase()     ASCII uppercase        A-Z
is_ascii_lowercase()     ASCII lowercase        a-z
is_ascii_hex_digit()     Hex digit              0-9, a-f, A-F
is_ascii_punctuation()   ASCII punctuation      !-/, :-@, [-`, {-~
is_ascii_control()       ASCII control          0x00..0x1F, 0x7F

These return false for any char outside the ASCII range (U+0080 and above). This mirrors Rust’s char::is_ascii_*() family.

Rationale: Without ASCII-scoped methods on char, checking whether a character is an ASCII digit requires ch.is_ascii() && (ch as byte).is_digit(), which is awkward. Lexers and parsers frequently work with char values but only care about ASCII.
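
To make the rationale concrete, here is a small Rust model of the two forms. It is only a sketch: both function names are hypothetical, and Rust's `u8` stands in for Ori's byte.

```rust
/// Hypothetical model of `char.is_ascii_digit()`: a single range check.
/// The range itself excludes everything at U+0080 and above.
fn is_ascii_digit_sketch(c: char) -> bool {
    ('0'..='9').contains(&c)
}

/// The two-step workaround from the rationale, for comparison:
/// test for ASCII first, then narrow to a byte and classify.
fn is_ascii_digit_via_byte(c: char) -> bool {
    c.is_ascii() && (c as u8).is_ascii_digit()
}
```

Both return true for '7' and false for U+0663 (ARABIC-INDIC DIGIT THREE), which is a decimal digit in Unicode terms but lies outside the ASCII range.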

Byte Methods

Methods on byte operate on ASCII only (0-127). For bytes outside the ASCII range (0x80..0xFF), every classification method returns false, is_ascii() included:

Method                   Description            Range
is_ascii()               In ASCII range         0x00..0x7F
is_ascii_alpha()         ASCII letter           a-z, A-Z
is_ascii_digit()         ASCII digit            0-9
is_ascii_alphanumeric()  ASCII letter or digit  a-z, A-Z, 0-9
is_ascii_whitespace()    ASCII whitespace       ' ', \t, \n, \r, \x0B, \x0C
is_ascii_uppercase()     ASCII uppercase        A-Z
is_ascii_lowercase()     ASCII lowercase        a-z
is_ascii_punctuation()   ASCII punctuation      !-/, :-@, [-`, {-~
is_ascii_control()       ASCII control          0x00..0x1F, 0x7F
is_ascii_hex_digit()     Hex digit              0-9, a-f, A-F
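
The multi-range rows of this table can be read as plain byte patterns. The Rust sketch below illustrates three of them; the function names are hypothetical and this is not the proposed implementation. The whitespace check is written by hand because Rust's own `u8::is_ascii_whitespace` excludes vertical tab (0x0B), whereas the table above includes it.

```rust
/// Punctuation: the four ranges !-/, :-@, [-`, {-~ from the table.
fn is_ascii_punctuation_sketch(b: u8) -> bool {
    matches!(b, b'!'..=b'/' | b':'..=b'@' | b'['..=b'`' | b'{'..=b'~')
}

/// Control: 0x00..0x1F plus DEL (0x7F).
fn is_ascii_control_sketch(b: u8) -> bool {
    b <= 0x1F || b == 0x7F
}

/// Whitespace as listed in the table, including vertical tab (0x0B).
fn is_ascii_whitespace_sketch(b: u8) -> bool {
    matches!(b, b' ' | b'\t' | b'\n' | b'\r' | 0x0B | 0x0C)
}
```

None of the patterns can match a byte at 0x80 or above, so the "false outside ASCII" rule falls out without a separate guard.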

Short Aliases on Byte

For conciseness in byte-heavy code (lexers), short aliases are provided:

Short            Full                     Reason
is_alpha()       is_ascii_alpha()         Bytes are inherently ASCII-scoped
is_digit()       is_ascii_digit()         Unambiguous for bytes
is_alnum()       is_ascii_alphanumeric()  Common abbreviation
is_whitespace()  is_ascii_whitespace()    Most common use
is_upper()       is_ascii_uppercase()     Concise
is_lower()       is_ascii_lowercase()     Concise
is_hex_digit()   is_ascii_hex_digit()     Common in parsers

Rationale: byte is an 8-bit unsigned integer. It has no Unicode semantics. The is_ascii_* prefix is redundant for bytes — all byte classification is inherently ASCII. The short aliases remove this noise.

No short aliases on char. char methods use full names (is_alphabetic(), not is_alpha()) to reinforce that they operate on full Unicode. The naming distinction between char.is_alphabetic() (Unicode) and byte.is_alpha() (ASCII) signals the semantic difference.
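
As an illustration of how the short aliases read in byte-oriented lexer code, the Rust sketch below models three of them with an extension trait on `u8`. The trait name `ByteAliases` and the helper `ident_len` are hypothetical; in Ori these would be inherent methods on byte, not a trait.

```rust
/// Hypothetical extension trait modelling the short byte aliases.
trait ByteAliases {
    fn is_alpha(self) -> bool;
    fn is_digit(self) -> bool;
    fn is_alnum(self) -> bool;
}

impl ByteAliases for u8 {
    // Each alias simply delegates to the fully named ASCII check.
    fn is_alpha(self) -> bool { self.is_ascii_alphabetic() }
    fn is_digit(self) -> bool { self.is_ascii_digit() }
    fn is_alnum(self) -> bool { self.is_ascii_alphanumeric() }
}

/// Length of the identifier at the start of `src`, or 0 if there is none.
/// An identifier starts with a letter or '_' and continues with
/// letters, digits, or '_'.
fn ident_len(src: &[u8]) -> usize {
    match src.first() {
        Some(&b) if b.is_alpha() || b == b'_' => {}
        _ => return 0,
    }
    src.iter().take_while(|&&b| b.is_alnum() || b == b'_').count()
}
```

For the input `foo_1 = 2` the helper returns 5; for `9lives` it returns 0 because the first byte is a digit.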

Conversion Methods

Method                 On          Returns      Description
to_ascii_uppercase()   char, byte  char, byte   Uppercase if ASCII letter, else self
to_ascii_lowercase()   char, byte  char, byte   Lowercase if ASCII letter, else self
to_digit(radix: int)   char, byte  Option<int>  Convert digit char to numeric value

b'a'.to_ascii_uppercase()    // b'A'
'3'.to_digit(radix: 10)      // Some(3)
'f'.to_digit(radix: 16)      // Some(15)
'z'.to_digit(radix: 10)      // None
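
The "else self" rule for the ASCII case conversions can be sketched in Rust as follows. The function names are hypothetical and this is only one possible shape for the logic, not the proposed implementation.

```rust
/// Uppercase an ASCII lowercase letter; return every other byte unchanged.
fn byte_to_ascii_upper(b: u8) -> u8 {
    if (b'a'..=b'z').contains(&b) {
        b - (b'a' - b'A') // the two cases are 32 apart in ASCII
    } else {
        b
    }
}

/// Same rule on a Unicode scalar value: non-ASCII input is returned as-is,
/// even when it is a lowercase letter in Unicode terms.
fn char_to_ascii_upper(c: char) -> char {
    if c.is_ascii() {
        byte_to_ascii_upper(c as u8) as char
    } else {
        c
    }
}
```

Under this reading, 'q' becomes 'Q', while 'é' and 'ß' are returned unchanged because they are not ASCII letters.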

to_digit Radix Rules

The radix parameter shall be in the range 2..=36 (inclusive). Values outside this range panic:

'a'.to_digit(radix: 16)      // Some(10)
'a'.to_digit(radix: 10)      // None ('a' is not a base-10 digit)
'a'.to_digit(radix: 0)       // panic: radix must be in range 2..=36
'a'.to_digit(radix: 37)      // panic: radix must be in range 2..=36

For radices > 10, letters a-z / A-Z represent digit values 10-35 (case-insensitive).
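
The radix rules can be summarized in a short Rust sketch. The function name is hypothetical, `u32` stands in for Ori's int, and the sketch assumes that only ASCII digits and letters are accepted, which the text above does not state explicitly.

```rust
/// Hypothetical model of `to_digit(radix:)`.
/// Panics if `radix` is outside 2..=36.
fn to_digit_sketch(c: char, radix: u32) -> Option<u32> {
    assert!((2..=36).contains(&radix), "radix must be in range 2..=36");
    // Map the character to its digit value, ignoring case for letters.
    let value = match c {
        '0'..='9' => c as u32 - '0' as u32,
        'a'..='z' => c as u32 - 'a' as u32 + 10,
        'A'..='Z' => c as u32 - 'A' as u32 + 10,
        _ => return None, // assumption: non-ASCII digits are not accepted
    };
    // A digit is only valid if it is strictly below the radix.
    if value < radix { Some(value) } else { None }
}
```

The final comparison is what makes 'a' a valid digit in base 16 but not in base 10, and '2' invalid in base 2.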

Full Unicode Case Conversion

Full Unicode case conversion (to_uppercase(), to_lowercase()) is deferred. These are complex (locale-sensitive, one-to-many mappings like ß -> SS) and will be addressed in a future proposal if needed. The ASCII conversion methods cover the common case.


Implementation

Char Methods

Unicode-aware methods use lookup tables generated from the Unicode Character Database (UCD). The compiler or stdlib includes compressed tables for:

  • L* (Letter) categories
  • Nd (Decimal Number) category
  • Zs (Space Separator) category
  • Lu, Ll (Upper/Lowercase Letter)

ASCII-scoped methods on char compile to simple range checks (same as byte methods, with an additional ch <= '\x7F' guard).
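
One common way to represent such a table is a sorted list of inclusive code point ranges searched by binary search. The Rust sketch below shows the idea for Zs only. The names `ZS_RANGES` and `in_ranges` are hypothetical, the range list is an assumption based on current Unicode versions, and the actual table encoding is not specified by this proposal.

```rust
use std::cmp::Ordering;

/// Inclusive code point ranges for category Zs, sorted ascending.
/// Assumed membership; a real table would be generated from the UCD.
const ZS_RANGES: &[(u32, u32)] = &[
    (0x0020, 0x0020),
    (0x00A0, 0x00A0),
    (0x1680, 0x1680),
    (0x2000, 0x200A),
    (0x202F, 0x202F),
    (0x205F, 0x205F),
    (0x3000, 0x3000),
];

/// Returns true if `c` falls inside any range of a sorted range table.
fn in_ranges(table: &[(u32, u32)], c: char) -> bool {
    let cp = c as u32;
    table
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less // range lies entirely below the code point
            } else if lo > cp {
                Ordering::Greater // range lies entirely above it
            } else {
                Ordering::Equal // code point is inside this range
            }
        })
        .is_ok()
}
```

A lookup costs a logarithmic number of comparisons in the number of ranges, which keeps even the large L* table cheap to query.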

Byte Methods

All byte methods compile to simple range checks — no tables needed:

// is_ascii_alpha() — conceptual implementation
@is_ascii_alpha (self) -> bool =
    (self >= b'a' && self <= b'z') || (self >= b'A' && self <= b'Z');

These are trivially inlineable.


Where These Methods Live

These are inherent methods on primitive types, defined in the prelude:

impl char {
    @is_alphabetic (self) -> bool = ...;
    @is_digit (self) -> bool = ...;
    @is_ascii_alphabetic (self) -> bool = ...;
    @is_ascii_digit (self) -> bool = ...;
    // ...
}

impl byte {
    @is_ascii (self) -> bool = ...;
    @is_ascii_alpha (self) -> bool = ...;
    @is_alpha (self) -> bool = ...;       // alias for is_ascii_alpha
    // ...
}

Migration / Compatibility

  • No breaking changes. These are new methods on existing types.
  • Prelude inclusion: All methods are available without import.

Depends On

  • byte-literals-proposal.md [approved] — uses byte literals in examples and implementations

Changelog

  • 2026-03-05: Initial draft
  • 2026-03-05: Approved — added ASCII-scoped methods on char; resolved to_digit radix rules (panic on invalid, 2..=36); no short aliases on char; deferred full Unicode case conversion; resolved all open questions