Proposal: Limbs Trait — Compiler-Driven Multi-Word Arithmetic
Status: Draft
Author: Eric (with AI assistance)
Created: 2026-04-14
Affects: Compiler (type checker, AIMS pipeline, LLVM codegen), type system, standard library, spec
Depends On: value-trait-proposal.md, overflow-behavior-proposal.md, intrinsics-v2-byte-simd-proposal.md, const-generics-proposal.md
Summary
Introduce a Limbs trait that marks a type as a fixed-width multi-word integer. Types conforming to Limbs declare their structure — limb type, limb count — and the compiler derives optimal arithmetic, comparison, hashing, and serialization automatically. The programmer describes what the type is; the compiler generates how to operate on it.
type U256: Value, Limbs, Eq, Comparable, Hashable, Add, Mul = {
digits: [int, max 4],
}
// That's it. The compiler derives:
// - Carry-chained addition/subtraction (unrolled for $WIDTH <= 8, SIMD-vectorized otherwise)
// - Schoolbook/Karatsuba/Toom-Cook multiplication (selected by WIDTH)
// - MSB-first comparison, limb-wise equality, FNV-1a hashing
// - All operations are ARC-free (Value), register-allocated when possible
This is the Ori approach to high-performance numerics: traits as compiler-consumable meta-information, not hand-written optimized code.
Motivation
The GMP Problem
High-performance arbitrary-precision arithmetic (GMP, libtommath, OpenSSL BIGNUM) is implemented by hand-tuning algorithms and assembly for each target architecture. This approach:
- Requires expert manual optimization — GMP has 30+ years of hand-written assembly for 13 CPU families
- Duplicates algorithm logic per representation — the same Karatsuba is reimplemented for each limb size and type
- Cannot benefit from compiler analysis — the optimizer sees opaque pointer arithmetic, not semantic intent
- Creates a C dependency — every language binds to GMP via FFI rather than expressing the problem natively
The Ori Opportunity
Ori already has the building blocks to solve this differently:
- Value trait — guarantees no ARC, inline storage, bitwise copy. The compiler knows there is no reference counting overhead, not merely optimizes it away.
- Const generics ($N: int) — compile-time parameterization over limb count. Enables monomorphization per size.
- Compile-time reflection ($for, $if) — generic code that expands to specialized code per instantiation. Loop unrolling is a compiler consequence of knowing $WIDTH, not a manual optimization.
- SIMD intrinsics generic over [T, max N] — hardware instruction selection driven by type, not by programmer-written dispatch tables.
- AIMS lattice — functional-style code (creating new values) compiles to imperative code (in-place mutation) when uniqueness is proven. The programmer writes clean algorithms; AIMS makes them fast.
- checked_add/wrapping_add/wrapping_mul — safe overflow control for limb arithmetic without dropping to unsafe.
What’s missing is the trait that ties these together — the declaration that says “this type is a multi-word integer” and unlocks the compiler’s ability to derive everything from that fact.
The Philosophy
In C, the programmer optimizes. In Ori, the programmer informs and the compiler optimizes. The Limbs trait is the information channel: it tells the compiler “these are limbs, here is their structure” and the compiler generates carry chains, selects algorithms, chooses SIMD widths, and proves in-place mutation safety — all from the type declaration.
This is not syntactic sugar. It is a categorically different approach to high-performance numerics.
Design
The Limbs Trait
trait Limbs: Value {
/// The type of each individual limb.
/// Must be Value (inline, no ARC). Typically `int` (64-bit) or `byte` (8-bit).
type Limb: Value
/// Number of limbs, known at compile time.
/// Drives algorithm selection, SIMD width, unrolling decisions.
let $WIDTH: int
/// Access the limb at the given index (0 = least significant).
@limb (self, index: int) -> Self.Limb
/// Return a copy with one limb replaced.
@with_limb (self, index: int, value: Self.Limb) -> Self
/// Number of significant limbs (may be less than $WIDTH).
@limb_count (self) -> int
}
Supertrait rationale: Value is required because multi-word integers must be ARC-free and bitwise-copyable for arithmetic to be efficient. Heap-allocated, reference-counted limb arrays would defeat the purpose. The Value supertrait is a hard constraint, not a convenience.
Declaration Syntax
// 256-bit unsigned integer: 4 × 64-bit limbs
type U256: Value, Limbs, Eq, Comparable, Hashable = {
digits: [int, max 4],
}
// 512-bit unsigned integer: 8 × 64-bit limbs
type U512: Value, Limbs, Eq, Comparable, Hashable = {
digits: [int, max 8],
}
// 2048-bit for RSA key operations: 32 × 64-bit limbs
type U2048: Value, Limbs, Eq, Comparable, Hashable = {
digits: [int, max 32],
}
// Byte-oriented 256-bit for constant-time crypto
type CryptoU256: Value, Limbs, Eq = {
bytes: [byte, max 32],
}
Each declaration tells the compiler: “this is a multi-word integer with these dimensions.” The compiler derives everything else.
Compiler-Derived Trait Implementations
When a type declares Limbs, the compiler can derive implementations for arithmetic and comparison traits based on the limb structure. These derived implementations use the $WIDTH and Limb associated type to select optimal strategies.
Arithmetic: Add, Sub
// Compiler-derived addition for any T: Limbs
// The compiler generates this — the user does NOT write it.
impl<T: Limbs> T: Add {
@add (self, other: Self) -> Self uses Intrinsics = {
$if T.$WIDTH <= 8 then {
// Small: fully unrolled at compile time via $for
let $result = Self.zero()
let $carry = 0
$for i in 0..T.$WIDTH do {
let $a = self.limb(index: i)
let $b = other.limb(index: i)
let $sum_with_carry = wrapping_add(a: wrapping_add(a: a, b: b), b: carry)
carry = $if is_type(T.Limb, int) then {
// 64-bit limbs: carry if a + b wrapped, or if adding the incoming carry wrapped
let $sum_no_carry = wrapping_add(a: a, b: b)
if sum_no_carry < a || sum_with_carry < carry then 1 else 0
} else {
// byte limbs: carry via checked_add
match checked_add(a: a, b: b) {
None -> 1
Some(_) -> if sum_with_carry < carry then 1 else 0
}
}
result = result.with_limb(index: i, value: sum_with_carry)
}
result
} else {
// Large: SIMD-vectorized, auto-selected width
add_vectorized<T>(a: self, b: other)
}
}
}
For U256 ($WIDTH=4), the $for expands to four straight-line addition sequences — no loop overhead, fully register-allocated. For U2048 ($WIDTH=32), the compiler emits a SIMD-vectorized loop using simd_add at the widest available lane width for the target.
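What the unrolled carry chain computes can be modeled in a few lines of Python (an illustrative sketch, not Ori code or compiler output; Python integers with explicit masking stand in for the wrapping intrinsics, and the helper names are invented here):

```python
MASK = (1 << 64) - 1  # model Ori's 64-bit int limbs

def to_limbs(n, width):
    """Encode n as a little-endian limb list (limb 0 = least significant, per Rule 5)."""
    return [(n >> (64 * i)) & MASK for i in range(width)]

def from_limbs(limbs):
    return sum(x << (64 * i) for i, x in enumerate(limbs))

def limbs_add(a, b):
    """Wrapping multi-word addition via the comparison-based carry chain."""
    out, carry = [], 0
    for x, y in zip(a, b):
        s = (x + y) & MASK        # wrapping_add(a: x, b: y)
        t = (s + carry) & MASK    # wrapping_add(a: s, b: carry)
        # carry out if x + y wrapped, or if adding the incoming carry wrapped
        carry = 1 if (s < x or t < carry) else 0
        out.append(t)
    return out
```

The two comparisons are exactly the carry-detection trick in the derived `@add`: an unsigned sum wrapped iff the result is smaller than an operand, and adding a carry of 1 can only wrap to zero.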
Multiplication: Mul
// Compiler-derived multiplication — algorithm selected by $WIDTH
impl<T: Limbs> T: Mul {
@multiply (self, other: Self) -> Self uses Intrinsics = {
$if T.$WIDTH <= 4 then {
schoolbook<T>(a: self, b: other)
} else $if T.$WIDTH <= 32 then {
karatsuba<T>(a: self, b: other)
} else {
toom_cook_3<T>(a: self, b: other)
}
}
}
The $if on T.$WIDTH is compile-time dead-code elimination. Each monomorphized instance contains only the algorithm that applies to its size. U256 gets schoolbook (optimal for 4 limbs). U2048 gets Karatsuba. No runtime dispatch.
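For concreteness, the truncating schoolbook product that the small-`$WIDTH` path computes can be sketched in Python (an illustrative model under the proposal's semantics, not compiler output; Python big ints stand in for the 64×64→128 widening step):

```python
MASK = (1 << 64) - 1

def to_limbs(n, width):
    return [(n >> (64 * i)) & MASK for i in range(width)]

def from_limbs(limbs):
    return sum(x << (64 * i) for i, x in enumerate(limbs))

def schoolbook_mul(a, b):
    """Truncating (modular) schoolbook product of two little-endian limb lists."""
    width = len(a)
    acc = [0] * width
    for i in range(width):
        carry = 0
        for j in range(width - i):    # partial products past limb `width` wrap away
            # one 64x64 -> 128 step, modeled with a Python big-int multiply
            total = acc[i + j] + a[i] * b[j] + carry
            acc[i + j] = total & MASK
            carry = total >> 64
    return acc
```

For `$WIDTH = 4` this is 10 widening multiplies in straight-line code after unrolling; Karatsuba and Toom-Cook trade some of those multiplies for additions, which pays off only at larger widths.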
Comparison: Eq, Comparable
// Derived Eq: limb-by-limb equality
// For Value types, the compiler may further optimize to memcmp
impl<T: Limbs> T: Eq {
@equals (self, other: Self) -> bool = {
$for i in 0..T.$WIDTH do {
if self.limb(index: i) != other.limb(index: i) then return false
}
true
}
}
// Derived Comparable: MSB-first lexicographic ordering
impl<T: Limbs> T: Comparable {
@compare (self, other: Self) -> Ordering = {
// Most significant limb first
$for i in (0..T.$WIDTH).rev() do {
let $a = self.limb(index: i)
let $b = other.limb(index: i)
if a < b then return Ordering.Less
if a > b then return Ordering.Greater
}
Ordering.Equal
}
}
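The MSB-first ordering the derived `@compare` implements can be stated in three lines of Python (illustrative sketch; `-1`/`0`/`1` stand in for `Ordering.Less`/`Equal`/`Greater`):

```python
def limbs_compare(a, b):
    """MSB-first lexicographic compare of little-endian limb lists: -1, 0, or 1."""
    for x, y in zip(reversed(a), reversed(b)):  # most significant limb first
        if x != y:
            return -1 if x < y else 1
    return 0
```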
Hashing: Hashable
// Derived Hashable: FNV-1a across limbs
impl<T: Limbs> T: Hashable {
@hash (self) -> int = {
let $h = 0
$for i in 0..T.$WIDTH do {
h = hash_combine(seed: h, value: self.limb(index: i))
}
h
}
}
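One plausible expansion of `hash_combine` as FNV-1a is sketched below (an assumption — the proposal does not pin down `hash_combine`'s definition, and the derived code above seeds with 0 where standard FNV-1a uses the offset basis shown here):

```python
FNV_OFFSET_BASIS = 0xcbf29ce484222325  # standard 64-bit FNV-1a parameters
FNV_PRIME = 0x100000001b3
MASK = (1 << 64) - 1

def limbs_hash(limbs):
    """FNV-1a folded over each limb's 8 little-endian bytes, limb 0 first."""
    h = FNV_OFFSET_BASIS
    for limb in limbs:
        for byte in limb.to_bytes(8, "little"):
            h = ((h ^ byte) * FNV_PRIME) & MASK
    return h
```

Because the fold visits limbs in index order, equal values hash equally regardless of how many limbs are significant — consistent with the derived limb-wise Eq.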
AIMS Interaction
AIMS sees Limbs types as Value — the entire AIMS pipeline already handles this optimally:
- Uniqueness = Unique for local values → in-place mutation of limbs
- No RC operations anywhere — the Value supertrait guarantees this
- FIP certification possible for pure arithmetic functions — the compiler can certify that add, mul, etc. perform zero net allocations
- COW is irrelevant — there are no shared references to copy-on-write from
Functional-style code:
let $a = U256.from_int(42)
let $b = a + U256.from_int(58) // 'a' not used after this line
AIMS proves a is consumed. The addition mutates a’s storage in place. The programmer wrote pure functional code. The emitted code is imperative.
SIMD Integration
The Limbs trait’s $WIDTH and Limb type directly inform SIMD instruction selection:
| Limb type | $WIDTH | Target | SIMD strategy |
|---|---|---|---|
| int (64-bit) | 2 | SSE2 | 1 × paddq (128-bit) |
| int (64-bit) | 4 | AVX2 | 1 × vpaddq (256-bit) |
| int (64-bit) | 8 | AVX-512 | 1 × vpaddq (512-bit) |
| int (64-bit) | 32 | AVX2 | 8 × vpaddq loop |
| byte (8-bit) | 32 | AVX2 | 1 × vpaddb (256-bit) |
The compiler maps simd_add<T.Limb, N> to the right instruction. The programmer never names an instruction — the type carries the information.
Runtime feature detection via Intrinsics.cpu_has_feature(feature:) allows the compiler to generate multi-versioned functions that select the widest available SIMD path at runtime, driven by the same $WIDTH information.
Standard Library Surface
The Limbs trait enables a family of fixed-width integer types in std.math:
// std.math.fixed_int
pub type U128: Value, Limbs, Eq, Comparable, Hashable, Add, Sub, Mul, Div, Rem = {
digits: [int, max 2],
}
pub type U256: Value, Limbs, Eq, Comparable, Hashable, Add, Sub, Mul, Div, Rem = {
digits: [int, max 4],
}
pub type U512: Value, Limbs, Eq, Comparable, Hashable, Add, Sub, Mul, Div, Rem = {
digits: [int, max 8],
}
pub type U1024: Value, Limbs, Eq, Comparable, Hashable, Add, Sub, Mul, Div, Rem = {
digits: [int, max 16],
}
pub type U2048: Value, Limbs, Eq, Comparable, Hashable, Add, Sub, Mul, Div, Rem = {
digits: [int, max 32],
}
pub type U4096: Value, Limbs, Eq, Comparable, Hashable, Add, Sub, Mul, Div, Rem = {
digits: [int, max 64],
}
Each type is a one-line declaration. The compiler generates the entire optimized implementation from the trait declarations. No hand-written arithmetic, no platform-specific code, no FFI.
Users can also define their own:
// Domain-specific: 384-bit for P-384 elliptic curve
type P384Int: Value, Limbs, Eq, Comparable = {
digits: [int, max 6],
}
// Game engine: 96-bit fixed-point with 32-bit fractional part
type FixedPoint96: Value, Limbs = {
digits: [int, max 2], // 1 integer limb + 1 fractional limb (custom interpretation)
}
Common Operations
Beyond arithmetic traits, Limbs types support standard operations:
// Construction
let $a = U256.zero() // All limbs zero
let $b = U256.from_int(42) // From native int
let $c = U256.from_str("0xFF00FF00FF") // From hex/decimal string
// Bitwise (derived from Limbs structure)
let $d = a & b // Limb-wise AND
let $e = a | b // Limb-wise OR
let $f = a ^ b // Limb-wise XOR
let $g = ~a // Limb-wise complement
let $h = a << 64 // Cross-limb shift (limb shuffle + intra-limb shift)
let $i = a >> 1 // Cross-limb shift right
// Inspection
let $bits = a.leading_zeros() // CLZ across limbs
let $is_zero = a == U256.zero()
let $top = a.limb(index: 3) // Direct limb access
Cross-limb shifts are derived from $WIDTH — shifting by a multiple of the limb bit-width becomes a limb shuffle (register rename), and the residual shift is intra-limb. The compiler generates this decomposition automatically.
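The decomposition can be sketched in Python (illustrative model of the derived left shift, not Ori code; helper names are invented):

```python
MASK = (1 << 64) - 1

def to_limbs(n, width):
    return [(n >> (64 * i)) & MASK for i in range(width)]

def from_limbs(limbs):
    return sum(x << (64 * i) for i, x in enumerate(limbs))

def limbs_shl(limbs, n):
    """Left shift across 64-bit limbs: whole-limb shuffle, then residual bit shift."""
    width = len(limbs)
    q, r = divmod(n, 64)                                        # q whole limbs + r bits
    shuffled = [0] * min(q, width) + limbs[:max(width - q, 0)]  # the limb shuffle
    if r == 0:
        return shuffled                                          # shuffle only — no bit work
    out, carry = [], 0
    for x in shuffled:                                           # residual intra-limb shift
        out.append(((x << r) | carry) & MASK)
        carry = x >> (64 - r)                                    # bits carried to next limb
    return out
```

A shift by 64 hits the `r == 0` fast path — pure limb movement, which the compiler can lower to register renaming.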
Widening Multiply (Future Extension)
The one operation that would significantly benefit Limbs arithmetic but is not yet available is widening multiply — 64×64→128-bit. This proposal recommends a companion intrinsic:
trait Intrinsics {
// ... existing intrinsics ...
/// Widening multiply: returns (low, high) halves of a × b.
/// For int (64-bit): 64×64→128, returned as two 64-bit values.
@widening_mul (a: int, b: int) -> (int, int)
}
Without this intrinsic, Limbs multiplication uses the standard split-into-halves approach (as mini-gmp and libtommath do). With it, the inner loop of schoolbook and Karatsuba multiplication operates at full hardware speed. This is a single intrinsic addition, not a language redesign.
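The split-into-halves fallback is worth seeing once; here is a Python sketch that builds the 128-bit product from four 32×32→64 partial products using only 64-bit-safe operations (illustrative, not Ori or library code):

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def widening_mul_split(a, b):
    """64x64 -> 128 multiply from four 32x32 -> 64 partial products.

    Returns (low, high) — what the fallback computes when no widening_mul
    intrinsic is available."""
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi
    # fold the two middle products into the low word, carrying into the high word
    mid = (ll >> 32) + (lh & MASK32) + (hl & MASK32)
    low = ((mid & MASK32) << 32) | (ll & MASK32)
    high = (hh + (lh >> 32) + (hl >> 32) + (mid >> 32)) & MASK64
    return low, high
```

Four multiplies plus adds and shifts, versus a single mulx-style instruction with the intrinsic — that gap is the motivation for Phase 7.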
Semantic Rules
Rule 1: Value Supertrait Required
A type implementing Limbs must also implement Value. Attempting to declare Limbs without Value is a compile error:
// Error E2050: Limbs requires Value supertrait
type Bad: Limbs = { digits: [int, max 4] }
// Correct
type Good: Value, Limbs = { digits: [int, max 4] }
Rule 2: $WIDTH Must Be Positive
The associated constant $WIDTH must be a positive integer known at compile time:
// Error E2051: Limbs $WIDTH must be a positive compile-time integer
type Bad: Value, Limbs = { ... } // where $WIDTH = 0
Rule 3: Limb Type Must Be Value
The associated type Limb must satisfy Value. Non-value limb types are rejected:
// Error E2052: Limbs.Limb must satisfy Value
type Bad: Value, Limbs = { digits: [str, max 4] } // str is not Value
Rule 4: Derived Arithmetic Is Total
Arithmetic operations on Limbs types are modular (wrapping) by default — overflow wraps around modulo 2^(total bit width) rather than panicking. This matches the behavior of fixed-width hardware integers and GMP's mpn_ layer. For checked variants:
let $a = U256.max_value()
let $b = a + U256.from_int(1) // Wraps to zero (modular)
let $c = a.checked_add(U256.from_int(1)) // Returns None (overflow)
let $d = a.saturating_add(U256.from_int(1)) // Returns U256.max_value()
Rule 5: Limb Ordering Is Little-Endian
Limb index 0 is the least significant. This is the universal convention in multi-word arithmetic (GMP, libtommath, Go math/big) and enables carry propagation from low to high.
Rule 6: Size Limit Follows Value Trait
Limbs types inherit Value’s size limits: warning above 256 bytes, error above 512 bytes. This constrains $WIDTH:
| Limb type | Max $WIDTH (256B warning) | Max $WIDTH (512B hard limit) |
|---|---|---|
| int (8 bytes) | 32 (2048-bit) | 64 (4096-bit) |
| byte (1 byte) | 256 (2048-bit) | 512 (4096-bit) |
For larger multi-word integers, use heap-allocated [int] without the Value/Limbs traits.
Prior Art
| Language/Library | Approach | Limb representation | Algorithm selection | Compiler integration |
|---|---|---|---|---|
| GMP (C) | Hand-tuned asm per arch | mp_limb_t (machine word) | Manual dispatch by size | None — opaque to compiler |
| Rust num-bigint | Portable Rust + unsafe | Vec<u64> (heap) | Runtime size check | None — generic library |
| Go math/big | Go + asm kernels | []Word (slice) | Runtime size check | Minimal — escape analysis |
| Haskell Integer | GMP FFI | GMP's internal repr | GMP's internal dispatch | None — FFI boundary |
| Zig std.math.big | Comptime + SIMD | Fixed arrays | Comptime dispatch | Partial — comptime evaluation |
| Ori Limbs | Trait-driven codegen | [T, max N] (inline) | Compile-time via $WIDTH | Full — AIMS + SIMD + unrolling |
Ori’s approach is unique: no other language uses a trait declaration to drive algorithm selection, SIMD width mapping, and memory model optimization simultaneously. Zig’s comptime comes closest but lacks the unified memory model (AIMS) and trait-driven derivation.
Examples
Cryptography: 256-bit Arithmetic
use std.math.fixed_int { U256 }
@mod_exp (base: U256, exp: U256, modulus: U256) -> U256 = {
let $result = U256.from_int(1)
let $b = base % modulus
let $e = exp
while e > U256.zero() do {
if e.limb(index: 0) & 1 == 1 then {
result = (result * b) % modulus
}
e = e >> 1
b = (b * b) % modulus
}
result
}
The entire function operates on stack-allocated Value types. No heap allocation, no ARC, no GC. AIMS certifies this as FIP — zero net allocations. The compiler selects schoolbook multiplication and fully unrolls the carry chains, driven by U256's $WIDTH of 4.
Blockchain: Hash Computation
use std.math.fixed_int { U256 }
type BlockHeader: Value, Eq = {
prev_hash: U256,
merkle_root: U256,
timestamp: int,
difficulty: U256,
nonce: int,
}
@meets_difficulty (hash: U256, target: U256) -> bool = {
hash <= target
}
U256 comparison is MSB-first, derived automatically from Limbs + Comparable. No manual implementation needed.
Scientific Computing: Extended Precision
use std.math.fixed_int { U128 }
// Exact 128-bit accumulator for summing many floats
@kahan_sum_exact (values: [float]) -> U128 = {
let $sum = U128.zero()
for v in values do {
sum = sum + U128.from_float_bits(v)
}
sum
}
Custom Domain: 384-bit for P-384 Elliptic Curve
type P384: Value, Limbs, Eq, Comparable, Add, Sub, Mul = {
digits: [int, max 6],
}
@p384_field_mul (a: P384, b: P384, modulus: P384) -> P384 = {
(a * b) % modulus
}
Six limbs. The compiler selects schoolbook multiplication (optimal for $WIDTH=6), unrolls the carry chain via $for, and SIMD-accelerates the reduction. One type declaration, zero hand-tuning.
Implementation
Phase 1: Trait Definition and Validation
- Define Limbs trait in prelude with Limb, $WIDTH, limb, with_limb, limb_count
- Type checker validates Value supertrait, positive $WIDTH, Limb: Value
- Error codes E2050–E2052
- Tests: validation of trait conformance, rejection of invalid declarations
Phase 2: Derived Arithmetic
- Compiler-generated Add, Sub implementations with carry chains
- $for unrolling for small $WIDTH, loop emission for large
- checked_add/wrapping_add integration for carry detection
- Tests: correctness across all standard sizes (U128–U4096), edge cases (max + 1, zero, carry propagation across all limbs)
Phase 3: Derived Comparison and Hashing
- Compiler-generated Eq (limb-wise), Comparable (MSB-first), Hashable (FNV-1a across limbs)
- Tests: ordering correctness, hash consistency with equality
Phase 4: SIMD Integration
- Map simd_add/simd_cmplt to carry-chained vectorized arithmetic
- Automatic SIMD width selection based on $WIDTH and target capabilities
- Runtime multi-versioning via cpu_has_feature
- Tests: correctness parity between scalar and SIMD paths, cross-platform (x86-64, aarch64, wasm32)
Phase 5: Multiplication Algorithms
- Schoolbook for $WIDTH <= 4
- Karatsuba for $WIDTH <= 32
- Toom-Cook-3 for larger
- Tests: multiplication correctness across algorithm boundaries, cross-size verification
Phase 6: Standard Library Types
- std.math.fixed_int module with U128, U256, U512, U1024, U2048, U4096
- Bitwise operations, shifts, string conversion
- Tests: full arithmetic test matrix, string round-trip, bitwise operations
Phase 7: Widening Multiply Intrinsic (Optional, Companion)
- Intrinsics.widening_mul(a: int, b: int) -> (int, int)
- LLVM codegen to target-specific widening multiply instructions
- Integration into Limbs schoolbook/Karatsuba inner loops
- Tests: widening multiply correctness, performance comparison with split-half approach
Spec & Grammar Impact
- Spec Clause 8 (Types): Add Limbs to marker trait inventory alongside Value and Sendable
- Spec Clause 12 (Traits): Document Limbs trait definition, associated type and constant requirements
- Spec Clause 20 (Capabilities): Add widening_mul to Intrinsics trait (Phase 7)
- No grammar changes — Limbs uses existing trait syntax and $ const generic syntax
Design Decisions
Why a trait, not a built-in type?
A built-in BigInt type would be one specific implementation baked into the compiler. A trait lets users define their own sizes, their own limb types, their own representations — all getting the same compiler-driven optimization. This follows Ori’s purity principle: the compiler provides the analysis machinery, the library provides the types.
Why Value supertrait, not optional?
Multi-word arithmetic with ARC overhead is pointless — you’d spend more time on reference counting than on the actual computation. Making Value a hard requirement ensures the compiler can always generate optimal code. Users who need heap-allocated arbitrary-precision integers should use a separate BigInt type backed by [int], which does not conform to Limbs.
Why little-endian limb ordering?
Every major multi-word arithmetic library uses least-significant-limb-first ordering (GMP, libtommath, Go math/big, Rust num-bigint). Carry propagation naturally flows from low to high indices. Deviating would create gratuitous incompatibility with existing algorithms and FFI interop.
Why modular (wrapping) arithmetic by default?
Fixed-width integers wrap on overflow in hardware (x86 add, ARM add). Ori’s default int panics on overflow for safety, but multi-word integers are used in domains (cryptography, hashing, checksums) where wrapping is the correct semantics. Checked and saturating variants are available for domains that need them.
Why compile-time algorithm selection, not runtime?
The $WIDTH is a compile-time constant. There is no reason to dispatch at runtime when the compiler knows the optimal algorithm at compile time. Runtime dispatch would add branch prediction overhead to every arithmetic operation for zero benefit. The $if compile-time branching ensures each monomorphized instance contains only the code it needs.
Future Extensions
Signed Multi-Word Integers
The Limbs trait as designed is unsigned. Signed variants (I256, I512, etc.) can be built on top with a sign field and two’s-complement or sign-magnitude representation — both representable as a Limbs type plus a bool or a convention on the high limb.
Arbitrary-Precision BigInt
A heap-allocated BigInt backed by [int] would not conform to Limbs (violates Value supertrait) but could use the same algorithm implementations internally. The compiler-generated algorithms for Limbs types could be extracted as generic functions usable by BigInt.
Montgomery Multiplication
Cryptographic applications need modular multiplication in Montgomery form. This could be a trait extension:
trait MontgomeryLimbs: Limbs {
let $MODULUS: Self
let $MONT_R: Self
@to_montgomery (self) -> Self
@from_montgomery (self) -> Self
@mont_mul (self, other: Self) -> Self
}
Constant-Time Operations
Cryptographic code requires constant-time arithmetic (no data-dependent branches). A ConstantTime marker trait could compose with Limbs to instruct the compiler to use branchless implementations:
type CryptoU256: Value, Limbs, ConstantTime = { digits: [int, max 4] }
// Compiler generates cmov-based carry chains, no branches
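The branchless pattern such a derivation would rely on is the classic mask-based select; a minimal Python sketch of the idea (illustrative only — the actual compiler strategy, e.g. cmov selection, is not specified by this proposal):

```python
MASK = (1 << 64) - 1

def ct_select(flag, a, b):
    """Branchless per-limb select: a if flag == 1 else b, no data-dependent branch."""
    mask = (-flag) & MASK  # flag 1 -> all-ones mask, flag 0 -> all-zeros mask
    return [(x & mask) | (y & (~mask & MASK)) for x, y in zip(a, b)]
```

The same mask trick extends to constant-time carry propagation and modular reduction: every decision becomes arithmetic on masks, so the instruction trace is identical for all inputs.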
Summary Table
| Feature | Mechanism | Compiler action |
|---|---|---|
| No ARC | Value supertrait | AIMS skips all RC analysis |
| Inline storage | [T, max N] limbs | Register allocation, no heap |
| Algorithm selection | $if T.$WIDTH | Dead-code elimination per size |
| Loop unrolling | $for i in 0..T.$WIDTH | Compile-time expansion |
| SIMD acceleration | simd_add<T.Limb, N> | Target-specific instruction |
| In-place mutation | AIMS uniqueness proof | Functional code → imperative codegen |
| Carry detection | checked_add / wrapping_add | Safe, no unsafe needed |
| Comparison | Derived from Limbs + Comparable | MSB-first unrolled comparison |
| Hashing | Derived from Limbs + Hashable | FNV-1a across limbs |