Proposal: Byte-Level String Access
Status: Approved | Author: Eric (with Claude) | Created: 2026-03-05 | Approved: 2026-03-05
Summary
Add methods for accessing the raw UTF-8 bytes of a str value, and provide efficient byte-buffer types for byte-oriented processing like lexers and parsers.
let bytes: [byte] = "hello".to_bytes();
let view: [byte] = "hello".as_bytes(); // zero-copy view (seamless slice)
let ch: byte = bytes[0]; // b'h' — O(1) access
Motivation
The Problem
Ori’s str type is UTF-8 encoded, and str[i] returns a single-codepoint str (not a byte). This is correct for user-facing string manipulation, but wrong for byte-oriented processing.
Lexers, parsers, protocol handlers, and binary format readers need:
- O(1) byte-level indexing
- Byte pattern matching
- Byte-by-byte iteration
- No UTF-8 decoding overhead
// Current: str[i] returns str, not byte
let source = "let x = 42";
let first = source[0]; // "l" (type: str, not byte)
// No way to get the raw byte value
What We Want
let source = "let x = 42";
let buf = source.as_bytes(); // [byte] view — zero copy
let first = buf[0]; // b'l' (type: byte, O(1))
// Lexer can now work at byte level:
match buf[pos] {
b'a'..b'z' | b'A'..b'Z' -> read_ident(),
b'0'..b'9' -> read_number(),
b' ' | b'\t' | b'\n' -> skip_whitespace(),
_ -> error(),
}
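For comparison, the same dispatch can be sketched in Rust, one of the prior-art languages. This is an illustration rather than Ori code: the TokenKind enum and the classify function are hypothetical names, and the range patterns use Rust's inclusive `..=` syntax.

```rust
/// Hypothetical token classes, for illustration only.
#[derive(Debug, PartialEq)]
pub enum TokenKind {
    Ident,
    Number,
    Whitespace,
    Error,
}

/// Classify the byte at `pos` using O(1) indexing and no UTF-8 decoding.
pub fn classify(source: &str, pos: usize) -> Option<TokenKind> {
    let buf: &[u8] = source.as_bytes(); // zero-copy borrow of the UTF-8 bytes
    let b = *buf.get(pos)?; // None when `pos` is out of bounds
    Some(match b {
        b'a'..=b'z' | b'A'..=b'Z' => TokenKind::Ident,
        b'0'..=b'9' => TokenKind::Number,
        b' ' | b'\t' | b'\n' => TokenKind::Whitespace,
        _ => TokenKind::Error,
    })
}
```

The point of the sketch is that every branch compares a single byte, so a lexer never pays for code point decoding on ASCII-dominated source text.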
Prior Art
| Language | Method | Returns | Zero-Copy |
|---|---|---|---|
| Rust | s.as_bytes() | &[u8] | Yes (borrow) |
| Go | []byte(s) | []byte | No (copy) |
| Swift | s.utf8 | String.UTF8View | Yes (view) |
| Zig | s.ptr[0..s.len] | []const u8 | Yes (slice) |
Design
Methods on str
| Method | Returns | Copy? | Description |
|---|---|---|---|
| as_bytes() | [byte] | No (seamless slice) | Zero-copy view of UTF-8 bytes |
| to_bytes() | [byte] | Yes | Independent copy of UTF-8 bytes |
| byte_len() | int | N/A | Number of UTF-8 bytes (O(1)) |
let s = "hello";
s.byte_len() // 5
s.as_bytes() // [104, 101, 108, 108, 111] — zero-copy view
s.to_bytes() // [104, 101, 108, 108, 111] — independent copy
as_bytes() — Zero-Copy Semantics
as_bytes() returns a [byte] that shares the underlying allocation with the source str via seamless slicing (spec 21.4). No data is copied. The view never writes through to the source: copy-on-write (COW) semantics apply, so modifying the [byte] triggers a copy and leaves the original str unaffected.
let s = "hello";
let bytes = s.as_bytes(); // shares allocation (seamless slice)
let b = bytes[0]; // b'h' — O(1), no copy
bytes[0] = b'H'; // COW: bytes gets its own copy, s unaffected
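The copy-on-write behavior can be modeled in Rust with reference counting. This is a sketch of the general technique, not Ori's runtime: Rc<Vec<u8>> stands in for the shared allocation, and cow_demo is a hypothetical name.

```rust
use std::rc::Rc;

/// Returns (source bytes after the write, view bytes after the write,
/// whether the two handles still share one allocation).
pub fn cow_demo() -> (Vec<u8>, Vec<u8>, bool) {
    let source: Rc<Vec<u8>> = Rc::new(b"hello".to_vec());
    let mut view = Rc::clone(&source); // shares the allocation, no bytes copied
    assert!(Rc::ptr_eq(&source, &view));

    // First write: the allocation is shared, so make_mut clones it
    // before handing out a mutable reference.
    Rc::make_mut(&mut view)[0] = b'H';

    let still_shared = Rc::ptr_eq(&source, &view);
    (source.to_vec(), view.to_vec(), still_shared)
}
```

Reads through the view cost nothing extra; only the first write pays for a copy, and the source keeps its original bytes.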
Flattening for Substrings
If the source str is itself a seamless slice (e.g., from .substring() or .trim()), as_bytes() produces a single-level [byte] view of the same byte range. No nested slices are created — the implementation takes the substring’s pointer and length and creates a byte slice directly from those.
let s = "hello world";
let sub = s.substring(start: 0, end: 5); // "hello" — seamless slice of s
let bytes = sub.as_bytes(); // [byte] view of bytes 0..5 — single level
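Rust's borrowed slices behave in an analogous way, which makes the flattening easy to observe. In this sketch (byte_view_range is a hypothetical helper, and Rust borrows are only an analogy for Ori's seamless slices), the byte view of a substring is a plain pointer and length into the parent's buffer, with no reference back to the intermediate substring.

```rust
/// Returns (byte offset, byte length) of `sub`'s bytes within `parent`'s buffer.
/// Assumption: `sub` was obtained by slicing `parent`; otherwise the offset
/// is meaningless.
pub fn byte_view_range(parent: &str, sub: &str) -> (usize, usize) {
    let bytes: &[u8] = sub.as_bytes(); // flat view: just a pointer and a length
    let offset = (bytes.as_ptr() as usize).wrapping_sub(parent.as_ptr() as usize);
    (offset, bytes.len())
}
```

For example, with the parent "hello world" and the substring covering bytes 6..11, the helper reports an offset of 6 and a length of 5.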
to_bytes() — Owned Copy
to_bytes() returns an independent [byte] copy. Use it when you need to mutate the bytes without affecting the source string.
byte_len() vs len()
| Method | Returns |
|---|---|
| s.len() | Number of characters (grapheme clusters or code points; TBD) |
| s.byte_len() | Number of UTF-8 bytes |
For ASCII strings, these are equal. For multibyte characters, they differ:
let s = "cafe\u{0301}"; // "café" (e + combining acute accent): 6 bytes, 5 code points
s.byte_len() // 6
s.len() // 5 if len() counts code points, 4 if it counts grapheme clusters
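The byte and code point counts for this string can be checked in Rust. Note the naming difference: Rust's str::len() returns the byte count (what this proposal calls byte_len()), while code points are counted with chars().count(). The helper name below is hypothetical, and grapheme cluster counting is omitted because it needs a library outside the Rust standard library.

```rust
/// Returns (UTF-8 byte count, code point count) for `s`.
pub fn byte_and_codepoint_len(s: &str) -> (usize, usize) {
    // str::len is O(1) and counts bytes; chars().count() decodes and is O(n).
    (s.len(), s.chars().count())
}
```

For "cafe\u{0301}" this yields 6 bytes and 5 code points: four ASCII letters at one byte each, plus U+0301, which encodes as two bytes.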
Constructing str from [byte]
@from_utf8 (bytes: [byte]) -> Result<str, Error>
@from_utf8_unchecked (bytes: [byte]) -> str // unsafe — caller guarantees valid UTF-8
from_utf8 validates UTF-8 encoding and returns an error on invalid sequences.
from_utf8_unchecked skips validation and requires unsafe. If called with invalid UTF-8, the behavior is unspecified but memory-safe — the program may panic, produce garbled string output, or behave unexpectedly, but it shall never cause memory corruption, buffer overflows, or use-after-free. This is consistent with Ori’s safety philosophy: unsafe relaxes type-level guarantees but does not permit memory unsafety.
These are associated functions on str:
let bytes: [byte] = [104, 101, 108, 108, 111];
let s = str.from_utf8(bytes:); // Ok("hello")
unsafe {
let s = str.from_utf8_unchecked(bytes:); // "hello"
}
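The validating path mirrors Rust's String::from_utf8, sketched here for comparison. The decode wrapper and its error message format are hypothetical. One difference worth noting: in Rust, the unchecked variant has undefined behavior on invalid input, whereas this proposal specifies unspecified-but-memory-safe behavior.

```rust
/// Validate `bytes` as UTF-8 and return an owned string,
/// or a description of where validation failed.
pub fn decode(bytes: Vec<u8>) -> Result<String, String> {
    String::from_utf8(bytes).map_err(|e| {
        // valid_up_to() is the length of the longest valid UTF-8 prefix.
        format!("invalid UTF-8 after {} valid bytes", e.utf8_error().valid_up_to())
    })
}
```

Calling decode with the bytes 104, 101, 108, 108, 111 returns the string "hello"; a buffer containing 0xFF fails, because that byte never appears in valid UTF-8.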
Iteration
// Iterate over bytes
for b in "hello".as_bytes().iter() do { ... }
// Iterate over chars (existing)
for c in "hello".chars() do { ... }
Interaction with Seamless Slicing
as_bytes() leverages the existing seamless slicing mechanism (spec 21.4):
- The [byte] view shares the str’s heap allocation
- The SLICE_FLAG in the capacity field marks it as a view
- ori_buffer_rc_dec handles cleanup for both regular and slice-backed lists
- COW on the [byte] view triggers materialization (copy); the original str is never affected
This is the same mechanism used by list.take(), list.skip(), str.substring(), and str.trim().
Migration / Compatibility
- No breaking changes. New methods on existing types.
- as_bytes() is the preferred method for read-only byte access; to_bytes() is for when mutation is needed.
Depends On
- byte-literals-proposal.md [approved]: uses byte literals in examples
- Seamless slicing (spec 21.4) [implemented]: runtime mechanism for zero-copy views
Changelog
- 2026-03-05: Initial draft
- 2026-03-05: Approved. Specified flatten behavior for substring slices; clarified from_utf8_unchecked as unspecified-but-memory-safe (no true UB); resolved byte_len naming; resolved all open questions (byte string literals deferred)