Proposal: Byte-Level String Access

Status: Approved
Author: Eric (with Claude)
Created: 2026-03-05
Approved: 2026-03-05


Summary

Add methods for accessing the raw UTF-8 bytes of a str value, and provide efficient byte-buffer types for byte-oriented processing like lexers and parsers.

let bytes: [byte] = "hello".to_bytes();
let view: [byte] = "hello".as_bytes();   // zero-copy view (seamless slice)

let ch: byte = bytes[0];                  // b'h' — O(1) access

Motivation

The Problem

Ori’s str type is UTF-8 encoded, and str[i] returns a single-codepoint str (not a byte). This is correct for user-facing string manipulation, but wrong for byte-oriented processing.

Lexers, parsers, protocol handlers, and binary format readers need:

  • O(1) byte-level indexing
  • Byte pattern matching
  • Byte-by-byte iteration
  • No UTF-8 decoding overhead

// Current: str[i] returns str, not byte
let source = "let x = 42";
let first = source[0];        // "l" (type: str, not byte)
// No way to get the raw byte value

What We Want

let source = "let x = 42";
let buf = source.as_bytes();   // [byte] view — zero copy
let first = buf[0];            // b'l' (type: byte, O(1))

// Lexer can now work at byte level:
match buf[pos] {
    b'a'..b'z' | b'A'..b'Z' -> read_ident(),
    b'0'..b'9' -> read_number(),
    b' ' | b'\t' | b'\n' -> skip_whitespace(),
    _ -> error(),
}
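
For comparison, here is a minimal sketch of the same byte-level dispatch in Rust, one of the prior-art languages listed below. It is an analogy only and does not model Ori semantics; the Class enum and the classify function are hypothetical names introduced for illustration.

```rust
/// Token classes for the sketch; the names are hypothetical.
#[derive(Debug, PartialEq)]
enum Class {
    Ident,
    Number,
    Whitespace,
    Error,
}

/// Classify the byte at `pos` without decoding any UTF-8.
/// Panics if `pos` is out of bounds, like ordinary slice indexing.
fn classify(buf: &[u8], pos: usize) -> Class {
    match buf[pos] {
        b'a'..=b'z' | b'A'..=b'Z' => Class::Ident,
        b'0'..=b'9' => Class::Number,
        b' ' | b'\t' | b'\n' => Class::Whitespace,
        _ => Class::Error,
    }
}
```

Calling classify("let x = 42".as_bytes(), 0) inspects a single byte in O(1); no code point is decoded along the way.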

Prior Art

Language  Method            Returns          Zero-Copy
Rust      s.as_bytes()      &[u8]            Yes (borrow)
Go        []byte(s)         []byte           No (copy)
Swift     s.utf8            String.UTF8View  Yes (view)
Zig       s.ptr[0..s.len]   []const u8       Yes (slice)

Design

Methods on str

Method      Returns  Copy?                Description
as_bytes()  [byte]   No (seamless slice)  Zero-copy view of UTF-8 bytes
to_bytes()  [byte]   Yes                  Independent copy of UTF-8 bytes
byte_len()  int      N/A                  Number of UTF-8 bytes (O(1))

let s = "hello";
s.byte_len()     // 5
s.as_bytes()     // [104, 101, 108, 108, 111] — zero-copy view
s.to_bytes()     // [104, 101, 108, 108, 111] — independent copy

as_bytes() — Zero-Copy Semantics

as_bytes() returns a [byte] that shares the underlying allocation with the source str via seamless slicing (spec 21.4). No data is copied. The view is read-only with respect to the source: copy-on-write (COW) semantics apply, so modifying the [byte] first copies the bytes and leaves the original str unaffected.

let s = "hello";
let bytes = s.as_bytes();     // shares allocation (seamless slice)
let b = bytes[0];             // b'h' — O(1), no copy
bytes[0] = b'H';              // COW: bytes gets its own copy, s unaffected
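
The view-then-copy behavior can be approximated in Rust with std::borrow::Cow. This is a rough analogy under stated assumptions, not Ori's runtime mechanism: Ori's COW is implicit, while Rust's is an explicit wrapper type. The helper names byte_view and capitalize_first are hypothetical.

```rust
use std::borrow::Cow;

/// Return a byte view of `s` that shares its storage until someone writes.
fn byte_view(s: &str) -> Cow<'_, [u8]> {
    Cow::Borrowed(s.as_bytes())
}

/// Upper-case the first byte if it is an ASCII letter.
/// `to_mut()` clones borrowed bytes into an owned Vec before the write,
/// so the string the view came from is never modified.
fn capitalize_first(mut bytes: Cow<'_, [u8]>) -> Cow<'_, [u8]> {
    if let Some(first) = bytes.to_mut().first_mut() {
        first.make_ascii_uppercase();
    }
    bytes
}
```

Reading through the borrowed view touches the original allocation; only the first write pays for a copy.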

Flattening for Substrings

If the source str is itself a seamless slice (e.g., from .substring() or .trim()), as_bytes() produces a single-level [byte] view of the same byte range. No nested slices are created — the implementation takes the substring’s pointer and length and creates a byte slice directly from those.

let s = "hello world";
let sub = s.substring(start: 0, end: 5);  // "hello" — seamless slice of s
let bytes = sub.as_bytes();                // [byte] view of bytes 0..5 — single level
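
As an illustration of the single-level idea, the Rust sketch below builds a byte view from a substring's own pointer and length. Rust slices are always flat (pointer, length) pairs, so this only mirrors the intended outcome and says nothing about how Ori implements it; substring_bytes is a hypothetical helper.

```rust
/// Byte view of `s[start..end]`, derived directly from the substring's
/// pointer and length rather than from a view of a view.
/// Panics if the range is out of bounds or splits a code point.
fn substring_bytes(s: &str, start: usize, end: usize) -> &[u8] {
    let sub: &str = &s[start..end];
    sub.as_bytes()
}
```

The returned slice points into the same storage as the original string, offset by start bytes.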

to_bytes() — Owned Copy

to_bytes() returns an independent [byte] copy, made at call time. Use it when you need to mutate the bytes without affecting the source string.

byte_len() vs len()

Method        Returns
s.len()       Number of characters (grapheme clusters or code points; TBD)
s.byte_len()  Number of UTF-8 bytes

For ASCII strings, these are equal. For strings containing multibyte characters, they differ:

let s = "cafe\u{0301}";      // "café" (e + combining acute): 6 bytes, 5 code points
s.byte_len()                  // 6
s.len()                       // 5 code points or 4 grapheme clusters, depending on the len() definition
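
The byte and code point counts for this string can be checked in Rust, where str::len() counts UTF-8 bytes (the role byte_len() plays here) and chars().count() counts code points. The helper name byte_and_char_len is hypothetical.

```rust
/// Return (UTF-8 byte count, code point count) for `s`.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    // `len()` is O(1); `chars().count()` walks the whole string.
    (s.len(), s.chars().count())
}
```

For "cafe\u{0301}" this yields (6, 5): four ASCII bytes plus the two-byte encoding of U+0301.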

Constructing str from [byte]

@from_utf8 (bytes: [byte]) -> Result<str, Error>
@from_utf8_unchecked (bytes: [byte]) -> str   // unsafe — caller guarantees valid UTF-8

from_utf8 validates UTF-8 encoding and returns an error on invalid sequences.

from_utf8_unchecked skips validation and requires unsafe. If called with invalid UTF-8, the behavior is unspecified but memory-safe — the program may panic, produce garbled string output, or behave unexpectedly, but it shall never cause memory corruption, buffer overflows, or use-after-free. This is consistent with Ori’s safety philosophy: unsafe relaxes type-level guarantees but does not permit memory unsafety.

These are associated functions on str:

let bytes: [byte] = [104, 101, 108, 108, 111];
let s = str.from_utf8(bytes:);               // Ok("hello")

unsafe {
    let s = str.from_utf8_unchecked(bytes:);  // "hello"
}
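
For reference, the validating path looks like this in Rust, using the standard std::str::from_utf8. The wrapper parse_utf8 is a hypothetical name, and its error value (the length of the longest valid prefix) is a choice made for this sketch, not something the proposal specifies.

```rust
/// Validate `bytes` as UTF-8 and borrow them as a string slice.
/// On failure, report how many leading bytes were valid.
fn parse_utf8(bytes: &[u8]) -> Result<&str, usize> {
    std::str::from_utf8(bytes).map_err(|e| e.valid_up_to())
}
```

One difference worth noting: Rust's from_utf8_unchecked is undefined behavior on invalid input, whereas this proposal keeps the unchecked variant memory-safe.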

Iteration

// Iterate over bytes
for b in "hello".as_bytes().iter() do { ... }

// Iterate over chars (existing)
for c in "hello".chars() do { ... }
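
A short Rust sketch of why byte iteration suffices for ASCII scanning: every byte of a multibyte UTF-8 sequence is 0x80 or above, so it can never be mistaken for an ASCII digit, letter, or delimiter. The function name count_ascii_digits is hypothetical.

```rust
/// Count ASCII digits by scanning raw bytes, with no UTF-8 decoding.
fn count_ascii_digits(s: &str) -> usize {
    s.bytes().filter(|b| b.is_ascii_digit()).count()
}
```

Non-ASCII digits such as U+0663 are encoded entirely in bytes above 0x7F and are therefore skipped.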

Interaction with Seamless Slicing

as_bytes() leverages the existing seamless slicing mechanism (spec 21.4):

  • The [byte] view shares the str’s heap allocation
  • The SLICE_FLAG in the capacity field marks it as a view
  • ori_buffer_rc_dec handles cleanup for both regular and slice-backed lists
  • COW on the [byte] view triggers materialization (copy) — the original str is never affected

This is the same mechanism used by list.take(), list.skip(), str.substring(), and str.trim().
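
To make the flag idea concrete, here is a toy model in Rust. Everything in it is an assumption made for illustration: the proposal does not state which bit SLICE_FLAG occupies or how the header is laid out, and the Header type, the offset field (standing in for a data pointer), and both functions are hypothetical.

```rust
/// Hypothetical marker bit: the top bit of the capacity word.
const SLICE_FLAG: usize = 1 << (usize::BITS - 1);

/// Hypothetical list header. `offset` stands in for the data pointer so the
/// sketch needs no raw pointers.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Header {
    offset: usize, // start of the data within the shared allocation
    len: usize,    // number of bytes visible through this header
    cap: usize,    // real capacity, or SLICE_FLAG for a view
}

/// True if the header describes a view into a shared allocation.
fn is_view(h: Header) -> bool {
    (h.cap & SLICE_FLAG) != 0
}

/// Create a view of `parent[start..end]`. The result addresses the
/// allocation directly, so a view of a view stays one level deep.
fn view(parent: Header, start: usize, end: usize) -> Header {
    assert!(start <= end && end <= parent.len, "range out of bounds");
    Header {
        offset: parent.offset + start,
        len: end - start,
        cap: SLICE_FLAG,
    }
}
```

In this model, taking a view of bytes 6..11 and then a view of 1..3 within it yields offset 7 and length 2, with no reference to the intermediate view.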


Migration / Compatibility

  • No breaking changes. New methods on existing types.
  • as_bytes() is the preferred method for read-only byte access; use to_bytes() when mutation is needed.

Depends On

  • byte-literals-proposal.md [approved] — uses byte literals in examples
  • Seamless slicing (spec 21.4) [implemented] — runtime mechanism for zero-copy views

Changelog

  • 2026-03-05: Initial draft
  • 2026-03-05: Approved — specified flatten behavior for substring slices; clarified from_utf8_unchecked as unspecified-but-memory-safe (no true UB); resolved byte_len naming; resolved all open questions (byte string literals deferred)