Proposal: Intrinsics Capability

Status: Approved
Author: Eric (with AI assistance)
Created: 2026-01-30
Approved: 2026-01-30
Affects: Compiler, capabilities, low-level operations


Summary

This proposal formalizes the Intrinsics capability for low-level, platform-specific operations including SIMD, bit manipulation, and hardware feature detection.


Scope

This proposal covers:

  • SIMD operations (arithmetic, comparisons, reductions)
  • Bit manipulation operations
  • CPU feature detection

Deferred to separate proposal:

  • Atomic operations (require integration with memory model)
  • Memory operations (prefetch, memory_fence)

Capability Definition

Intrinsics is a capability trait providing low-level hardware operations:

trait Intrinsics {
    // SIMD float operations (128-bit / 4-wide)
    @simd_add_f32x4 (a: [float, max 4], b: [float, max 4]) -> [float, max 4]
    @simd_sub_f32x4 (a: [float, max 4], b: [float, max 4]) -> [float, max 4]
    @simd_mul_f32x4 (a: [float, max 4], b: [float, max 4]) -> [float, max 4]
    @simd_div_f32x4 (a: [float, max 4], b: [float, max 4]) -> [float, max 4]
    @simd_min_f32x4 (a: [float, max 4], b: [float, max 4]) -> [float, max 4]
    @simd_max_f32x4 (a: [float, max 4], b: [float, max 4]) -> [float, max 4]
    @simd_sqrt_f32x4 (a: [float, max 4]) -> [float, max 4]
    @simd_abs_f32x4 (a: [float, max 4]) -> [float, max 4]
    @simd_eq_f32x4 (a: [float, max 4], b: [float, max 4]) -> [bool, max 4]
    @simd_lt_f32x4 (a: [float, max 4], b: [float, max 4]) -> [bool, max 4]
    @simd_gt_f32x4 (a: [float, max 4], b: [float, max 4]) -> [bool, max 4]
    @simd_sum_f32x4 (a: [float, max 4]) -> float  // Horizontal sum

    // SIMD float operations (256-bit / 8-wide, AVX)
    @simd_add_f32x8 (a: [float, max 8], b: [float, max 8]) -> [float, max 8]
    @simd_sub_f32x8 (a: [float, max 8], b: [float, max 8]) -> [float, max 8]
    @simd_mul_f32x8 (a: [float, max 8], b: [float, max 8]) -> [float, max 8]
    @simd_div_f32x8 (a: [float, max 8], b: [float, max 8]) -> [float, max 8]
    @simd_min_f32x8 (a: [float, max 8], b: [float, max 8]) -> [float, max 8]
    @simd_max_f32x8 (a: [float, max 8], b: [float, max 8]) -> [float, max 8]
    @simd_sqrt_f32x8 (a: [float, max 8]) -> [float, max 8]
    @simd_abs_f32x8 (a: [float, max 8]) -> [float, max 8]
    @simd_eq_f32x8 (a: [float, max 8], b: [float, max 8]) -> [bool, max 8]
    @simd_lt_f32x8 (a: [float, max 8], b: [float, max 8]) -> [bool, max 8]
    @simd_gt_f32x8 (a: [float, max 8], b: [float, max 8]) -> [bool, max 8]
    @simd_sum_f32x8 (a: [float, max 8]) -> float

    // SIMD float operations (512-bit / 16-wide, AVX-512)
    @simd_add_f32x16 (a: [float, max 16], b: [float, max 16]) -> [float, max 16]
    @simd_sub_f32x16 (a: [float, max 16], b: [float, max 16]) -> [float, max 16]
    @simd_mul_f32x16 (a: [float, max 16], b: [float, max 16]) -> [float, max 16]
    @simd_div_f32x16 (a: [float, max 16], b: [float, max 16]) -> [float, max 16]
    @simd_min_f32x16 (a: [float, max 16], b: [float, max 16]) -> [float, max 16]
    @simd_max_f32x16 (a: [float, max 16], b: [float, max 16]) -> [float, max 16]
    @simd_sqrt_f32x16 (a: [float, max 16]) -> [float, max 16]
    @simd_abs_f32x16 (a: [float, max 16]) -> [float, max 16]
    @simd_eq_f32x16 (a: [float, max 16], b: [float, max 16]) -> [bool, max 16]
    @simd_lt_f32x16 (a: [float, max 16], b: [float, max 16]) -> [bool, max 16]
    @simd_gt_f32x16 (a: [float, max 16], b: [float, max 16]) -> [bool, max 16]
    @simd_sum_f32x16 (a: [float, max 16]) -> float

    // SIMD 64-bit integer operations (128-bit / 2-wide)
    @simd_add_i64x2 (a: [int, max 2], b: [int, max 2]) -> [int, max 2]
    @simd_sub_i64x2 (a: [int, max 2], b: [int, max 2]) -> [int, max 2]
    @simd_mul_i64x2 (a: [int, max 2], b: [int, max 2]) -> [int, max 2]
    @simd_min_i64x2 (a: [int, max 2], b: [int, max 2]) -> [int, max 2]
    @simd_max_i64x2 (a: [int, max 2], b: [int, max 2]) -> [int, max 2]
    @simd_eq_i64x2 (a: [int, max 2], b: [int, max 2]) -> [bool, max 2]
    @simd_lt_i64x2 (a: [int, max 2], b: [int, max 2]) -> [bool, max 2]
    @simd_gt_i64x2 (a: [int, max 2], b: [int, max 2]) -> [bool, max 2]
    @simd_sum_i64x2 (a: [int, max 2]) -> int

    // SIMD 64-bit integer operations (256-bit / 4-wide, AVX2)
    @simd_add_i64x4 (a: [int, max 4], b: [int, max 4]) -> [int, max 4]
    @simd_sub_i64x4 (a: [int, max 4], b: [int, max 4]) -> [int, max 4]
    @simd_mul_i64x4 (a: [int, max 4], b: [int, max 4]) -> [int, max 4]
    @simd_min_i64x4 (a: [int, max 4], b: [int, max 4]) -> [int, max 4]
    @simd_max_i64x4 (a: [int, max 4], b: [int, max 4]) -> [int, max 4]
    @simd_eq_i64x4 (a: [int, max 4], b: [int, max 4]) -> [bool, max 4]
    @simd_lt_i64x4 (a: [int, max 4], b: [int, max 4]) -> [bool, max 4]
    @simd_gt_i64x4 (a: [int, max 4], b: [int, max 4]) -> [bool, max 4]
    @simd_sum_i64x4 (a: [int, max 4]) -> int

    // Bit operations
    @count_leading_zeros (value: int) -> int
    @count_trailing_zeros (value: int) -> int
    @count_ones (value: int) -> int
    @rotate_left (value: int, amount: int) -> int
    @rotate_right (value: int, amount: int) -> int

    // Hardware queries
    @cpu_has_feature (feature: str) -> bool
}

Usage

Function Declaration

@fast_dot_product (a: [float], b: [float]) -> float uses Intrinsics =
    // Use SIMD intrinsics for vectorized computation
    ...

Capability Provision

The default def impl Intrinsics uses native instructions when available, falling back to scalar emulation when not:

// Uses default (NativeWithFallback)
@compute () -> float uses Intrinsics =
    Intrinsics.simd_add_f32x4(a: vec1, b: vec2)

Override for testing or explicit control:

with Intrinsics = EmulatedIntrinsics {} in
    fast_operation()  // Always uses scalar fallback

Conditional Use with Feature Detection

@dot_product (a: [float], b: [float]) -> float uses Intrinsics =
    if Intrinsics.cpu_has_feature(feature: "avx2") then
        avx2_dot_product(a, b)
    else if Intrinsics.cpu_has_feature(feature: "sse4.1") then
        sse4_dot_product(a, b)
    else
        scalar_dot_product(a, b)

Compile-Time Platform Targeting

#target(arch: "x86_64")
@fast_checksum (data: [byte]) -> int uses Intrinsics =
    Intrinsics.crc32(data: data)

#target(not_arch: "x86_64")
@fast_checksum (data: [byte]) -> int =
    data.fold(initial: 0, op: (acc, b) -> acc ^ (b as int))

SIMD Operations

Vector Types

SIMD operations work on fixed-capacity lists:

  • [float, max 4] — 128-bit (SSE, NEON)
  • [float, max 8] — 256-bit (AVX, AVX2)
  • [float, max 16] — 512-bit (AVX-512)
  • [int, max 2] — 128-bit i64 (SSE2)
  • [int, max 4] — 256-bit i64 (AVX2)

The int type is 64-bit in Ori, so integer SIMD uses i64 lanes.
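
As a minimal sketch of integer-lane usage (the sum_of_maxes name is illustrative only, not part of the proposal), the 2-wide i64 operations compose the same way as the float ones:

// Element-wise max of two 2-lane vectors, then a horizontal sum
@sum_of_maxes (a: [int, max 2], b: [int, max 2]) -> int uses Intrinsics = {
    let maxes = Intrinsics.simd_max_i64x2(a: a, b: b)
    Intrinsics.simd_sum_i64x2(a: maxes)
}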

Core Operations

| Category   | Float Operations   | Int Operations |
|------------|--------------------|----------------|
| Arithmetic | add, sub, mul, div | add, sub, mul  |
| Comparison | eq, lt, gt         | eq, lt, gt     |
| Min/Max    | min, max           | min, max       |
| Math       | sqrt, abs          |                |
| Reduction  | sum                | sum            |

Example: SIMD Dot Product

@simd_dot_4 (a: [float, max 4], b: [float, max 4]) -> float uses Intrinsics = {
    let products = Intrinsics.simd_mul_f32x4(a: a, b: b)
    Intrinsics.simd_sum_f32x4(a: products)
}
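
A call with concrete vectors (values chosen purely for illustration) evaluates as expected:

// 1.0*4.0 + 2.0*3.0 + 3.0*2.0 + 4.0*1.0 == 20.0
simd_dot_4(a: [1.0, 2.0, 3.0, 4.0], b: [4.0, 3.0, 2.0, 1.0])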

Platform Availability

| Target  | 128-bit (x4)   | 256-bit (x8) | 512-bit (x16) |
|---------|----------------|--------------|---------------|
| x86_64  | SSE (baseline) | AVX/AVX2     | AVX-512       |
| aarch64 | NEON           |              |               |
| wasm32  | SIMD128        |              |               |

Bit Manipulation

Operations

// Number of set bits (population count)
@count_ones (value: int) -> int uses Intrinsics

// Number of leading zero bits
@count_leading_zeros (value: int) -> int uses Intrinsics

// Number of trailing zero bits
@count_trailing_zeros (value: int) -> int uses Intrinsics

// Bitwise rotation
@rotate_left (value: int, amount: int) -> int uses Intrinsics
@rotate_right (value: int, amount: int) -> int uses Intrinsics

Example

@is_power_of_two (n: int) -> bool uses Intrinsics =
    n > 0 && Intrinsics.count_ones(value: n) == 1
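
The leading-zero count composes in the same way; as a sketch (the bit_width name is illustrative), the number of bits needed to represent a positive value:

// Bit width of a positive integer, e.g. bit_width(n: 255) == 8
@bit_width (n: int) -> int uses Intrinsics =
    64 - Intrinsics.count_leading_zeros(value: n)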

Hardware Feature Detection

Runtime Detection

@cpu_has_feature (feature: str) -> bool uses Intrinsics

Valid feature strings:

| Platform | Features |
|----------|----------|
| x86_64   | "sse", "sse2", "sse3", "sse4.1", "sse4.2", "avx", "avx2", "avx512f" |
| aarch64  | "neon" |
| wasm32   | "simd128" |

Unknown feature strings cause a panic; when the string is a literal, the compiler can report the error ahead of time (see E1062 below).

Example

@optimized_compute (data: [float]) -> float uses Intrinsics =
    if Intrinsics.cpu_has_feature(feature: "avx2") then
        avx2_compute(data)
    else if Intrinsics.cpu_has_feature(feature: "sse4.1") then
        sse4_compute(data)
    else
        scalar_compute(data)

Platform-Specific Behavior

Auto-Fallback (Default)

The default def impl Intrinsics provides NativeWithFallback:

  • Uses native SIMD instructions when the operation is supported on the current platform
  • Falls back to scalar emulation when not supported
  • Always works, but may be slower on emulated paths

// This always works, even on platforms without AVX
Intrinsics.simd_add_f32x8(a, b)  // Uses AVX if available, emulates otherwise

Explicit Control

For performance-critical code, use feature detection to select optimal paths:

@fast_path (data: [float]) -> float uses Intrinsics =
    if Intrinsics.cpu_has_feature(feature: "avx2") then
        // Known to use native AVX2
        avx2_implementation(data)
    else
        // Known to use scalar
        scalar_implementation(data)

Implementation Providers

| Provider           | Behavior                                          |
|--------------------|---------------------------------------------------|
| NativeWithFallback | Native when available, scalar fallback (default)  |
| EmulatedIntrinsics | Always uses scalar operations (for testing)       |
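
As a sketch of how a test might pin the scalar provider (the check_dot_product_scalar name and the earlier simd_dot_4 helper are illustrative assumptions, not part of the proposal):

// Force scalar emulation so the result does not depend on the host CPU
@check_dot_product_scalar () -> bool uses Intrinsics =
    with Intrinsics = EmulatedIntrinsics {} in
        simd_dot_4(a: [1.0, 2.0, 3.0, 4.0], b: [4.0, 3.0, 2.0, 1.0]) == 20.0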

Safety Guarantees

No Undefined Behavior

Unlike C intrinsics, Ori intrinsics:

  • Check input sizes at runtime
  • Panic on invalid inputs (not UB)
  • Don’t allow arbitrary memory access

SIMD Safety

SIMD operations check their input sizes; a wrong-sized vector panics rather than producing undefined behavior:

Intrinsics.simd_add_f32x4(a: [1.0, 2.0], b: [...])  // panic: expected 4 elements

Bit Operation Safety

Rotation amounts are taken modulo 64:

Intrinsics.rotate_left(value: 1, amount: 65)  // Same as amount: 1

Error Messages

Missing Capability

error[E1060]: `simd_add_f32x4` requires `Intrinsics` capability
  --> src/main.ori:5:5
   |
 5 |     Intrinsics.simd_add_f32x4(a, b)
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ requires `uses Intrinsics`
   |
   = help: add `uses Intrinsics` to function signature

Unknown Feature

error[E1062]: unknown CPU feature `"avx3"`
  --> src/main.ori:5:45
   |
 5 |     if Intrinsics.cpu_has_feature(feature: "avx3") then
   |                                            ^^^^^^ unknown feature
   |
   = note: valid features for x86_64: "sse", "sse2", "sse3", "sse4.1", "sse4.2", "avx", "avx2", "avx512f"

Wrong Vector Size

error[E1063]: SIMD operation requires exactly 4 elements
  --> src/main.ori:5:5
   |
 5 |     Intrinsics.simd_add_f32x4(a: [1.0, 2.0], b: b)
   |                                  ^^^^^^^^^^ has 2 elements

Spec Changes Required

Update 14-capabilities.md

Add Intrinsics to standard capabilities table:

| Capability | Purpose | Suspends |
|------------|---------|----------|
| `Intrinsics` | Low-level SIMD and bit operations | No |

Add section describing available operations and platform behavior.


Implementation

Phase 6.14: Intrinsics Capability

  1. Add Intrinsics trait to prelude
  2. Implement def impl Intrinsics with NativeWithFallback
  3. Add EmulatedIntrinsics provider
  4. Implement SIMD codegen for LLVM backend
  5. Add feature detection for x86_64, aarch64, wasm32
  6. Add comprehensive tests

LLVM Backend

SIMD and bit operations map to LLVM vector instructions and intrinsics:

  • simd_add_f32x4 → fadd <4 x float>
  • count_ones → llvm.ctpop.i64
  • cpu_has_feature → Runtime CPUID check

Summary

| Category   | Operations |
|------------|------------|
| SIMD Float | add, sub, mul, div, min, max, sqrt, abs, eq, lt, gt, sum |
| SIMD Int   | add, sub, mul, min, max, eq, lt, gt, sum |
| Widths     | 4-wide (128-bit), 8-wide (256-bit), 16-wide (512-bit) |
| Bit        | count_ones, count_leading_zeros, count_trailing_zeros, rotate_left, rotate_right |
| Query      | cpu_has_feature |

| Aspect    | Behavior |
|-----------|----------|
| Safety    | Bounds-checked, panic on invalid inputs |
| Fallback  | Auto-emulation (default), explicit via EmulatedIntrinsics |
| Detection | cpu_has_feature for runtime, #target for compile-time |
| Provider  | Capability trait with def impl |

Design Decisions

  1. Atomics deferred — Atomic operations require integration with Ori’s memory model and proper pointer types. A separate proposal will address these.
  2. Auto-fallback default — The default def impl uses native instructions when available and emulates otherwise, ensuring code always works across platforms.
  3. 64-bit integers only — Integer SIMD uses Ori’s native int (i64) to avoid truncation complexity.
  4. String-based feature detection — Simple cpu_has_feature("avx2") pattern with documented valid strings and panic on unknown features.
  5. Core operation set — Includes arithmetic, comparisons, min/max, math (sqrt/abs), and horizontal sum. More exotic operations (shuffle, blend, FMA) can be added in future proposals.

Errata (added 2026-03-05)

Superseded by intrinsics-v2-byte-simd-proposal: This proposal’s explicit-width naming scheme (simd_add_f32x4, simd_add_i64x2, etc.) is replaced by a generic API (simd_add<T, $N>). Key changes:

  1. Generic API: All operations are generic over lane type T and width $N. The compiler monomorphizes based on the fixed-capacity list type at the call site.
  2. Float lane width fix: f32 in names was incorrect — Ori’s float is f64. [float, max 2] = 128-bit (not [float, max 4] as this proposal stated).
  3. Mask<$N> type: Comparison operations now return Mask<$N> instead of [bool, max N].
  4. Byte SIMD added: 12 byte-level operations at 128/256/512-bit widths.
  5. std.bytes module: High-level byte search functions backed by SIMD.
  6. V1 names deprecated: Explicit-width names remain as aliases but emit deprecation warnings.