
Section 7E: std.text — Comprehensive Text Processing

Goal: Implement std.text per the approved stdlib-text-api-proposal — a 7-layer text library covering Unicode properties, grapheme segmentation, display width, normalization, case folding, string similarity, text analysis (from Pretext), pluggable measurement, production-quality line breaking, bidi, confusable detection, encoding conversion, and high-level convenience functions.

PROPOSAL: proposals/approved/stdlib-text-api-proposal.md — Full API design
REFERENCE: ~/projects/reference_repos/pretext/ — Pretext source (analysis pipeline, layout engine)
REFERENCE: Rust crates unicode-segmentation, unicode-width, unicode-normalization for table generation patterns


Architecture

  ori_rt (Rust)                      library/std/text/ (Ori)
  ┌──────────────────────┐           ┌──────────────────────────────────┐
  │ Unicode tables       │           │ std.text (root)                  │
  │ ├── grapheme_break   │◄──────────│ ├── unicode/                     │
  │ ├── word_break       │   FFI     │ │   ├── segmentation             │
  │ ├── east_asian_width │   calls   │ │   ├── normalization            │
  │ ├── normalization    │           │ │   ├── bidi                     │
  │ ├── case_fold        │           │ │   └── security                 │
  │ ├── bidi_class       │           │ ├── width/                       │
  │ ├── confusables      │           │ ├── similarity/                  │
  │ └── line_break       │           │ ├── case/                        │
  └──────────────────────┘           │ ├── analysis/    ← from Pretext  │
                                     │ ├── measure/     ← TextMeasure   │
                                     │ ├── layout/      ← from Pretext  │
                                     │ └── transform/                   │
                                     └──────────────────────────────────┘

Key design: Unicode data tables live in ori_rt as Rust compile-time constants (~230KB total, tree-shaken per function). All algorithms above the table layer are pure Ori.

Binary size tree-shaking (linker DCE per table symbol):

  Usage                 Tables linked                   Approx size
  display_width only    EAW + grapheme break            ~21KB
  + wrap                + word break                    ~30KB
  + normalize           + decomposition/composition     ~90KB
  + case_fold           + case folding                  ~105KB
  All features          All tables                      ~230KB

7E.1 Phase 1: Unicode Foundation + Display Width + Convenience

Goal: Ship Unicode character properties, grapheme/word segmentation, East Asian Width, display width calculation, ANSI escape handling, and high-level convenience functions (wrap, truncate, pad, indent, dedent). This phase alone puts Ori’s text handling on par with Swift and Elixir, ahead of most other languages’ standard libraries.

Estimated scope: ~3,000 LOC Rust (tables + state machines in ori_rt), ~1,500 LOC Ori (API + tests)

7E.1.1 Unicode Data Table Generation

  • Create scripts/generate-unicode-tables.py — downloads Unicode Character Database files from unicode.org, generates Rust source in compiler/ori_rt/src/unicode/tables/
    • Two-level trie generator for GeneralCategory, Script (~80KB)
    • Sorted range list generator for East_Asian_Width (~6KB)
    • Grapheme break property table generator (GraphemeBreakProperty.txt → ~15KB)
    • Word break property table generator (WordBreakProperty.txt → ~10KB)
    • Generate mod.rs re-exporting all tables
  • Verify all tables match Unicode 16.0 test data

7E.1.2 Character Property Functions in ori_rt

  • Implement in compiler/ori_rt/src/unicode/props.rs:
    • ori_char_general_category(c: u32) -> u8 — lookup in two-level trie (see the sketch after this list)
    • ori_char_script(c: u32) -> u8 — lookup in two-level trie
    • ori_char_is_cjk(c: u32) -> bool — range check (CJK Unified + extensions)
    • ori_char_east_asian_width(c: u32) -> u8 — lookup in range list
    • ori_char_width(c: u32) -> i8 — 0/1/2 terminal column width
    • Derived: is_letter, is_digit, is_whitespace, is_alphabetic, etc.
  • Register all functions as LLVM extern declarations in ori_llvm
  • Register all functions in evaluator built-in dispatch
  • Tests: Rust unit tests against Unicode test data files (100+ property checks)
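
The two-level trie lookups above reduce to two array indexes. A minimal Rust sketch of the generated shape; the table names, chunk size, and zeroed placeholder data are illustrative, not the generator's actual output:

  const CHUNK: usize = 64;

  // Stage 1: code point / CHUNK → index of a CHUNK-sized block in stage 2.
  static GC_STAGE1: [u16; 0x110000 / CHUNK] = [0; 0x110000 / CHUNK];
  // Stage 2: concatenated blocks of property values (one zeroed
  // placeholder block here; the generator emits the real data).
  static GC_STAGE2: [u8; CHUNK] = [0; CHUNK];

  #[no_mangle]
  pub extern "C" fn ori_char_general_category(c: u32) -> u8 {
      if c > 0x10FFFF {
          return 0; // treat out-of-range input as Cn (Unassigned)
      }
      let block = GC_STAGE1[c as usize / CHUNK] as usize;
      GC_STAGE2[block * CHUNK + c as usize % CHUNK]
  }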

7E.1.3 Grapheme Cluster Segmentation (UAX #29)

  • Implement UAX #29 grapheme break state machine in compiler/ori_rt/src/unicode/grapheme.rs
    • ori_grapheme_next(s: *const u8, len: usize, offset: usize) -> usize — returns byte offset of next grapheme boundary
    • ori_grapheme_count(s: *const u8, len: usize) -> usize — count grapheme clusters
    • ori_is_grapheme_boundary(s: *const u8, len: usize, offset: usize) -> bool
  • Expose as Ori functions in library/std/text/unicode/segmentation.ori:
    • @graphemes (s: str) -> Iterator<str> — yields grapheme cluster slices (driver loop sketched below)
    • @grapheme_count (s: str) -> int
    • @grapheme_indices (s: str) -> Iterator<(int, str)>
    • @is_grapheme_boundary (s: str, byte_offset: int) -> bool
  • Tests: Run against GraphemeBreakTest.txt (700+ test cases from Unicode)
    • Matrix: ASCII, Latin + combining marks, CJK, Hangul jamo, emoji (ZWJ, flags, skin tone), Thai, Devanagari, mixed script
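
For concreteness, a Rust sketch of how a caller can drive ori_grapheme_next to produce the @graphemes iterator (the shipped wrapper is Ori). Only the extern signature comes from this plan; the wrapper around it is assumed:

  extern "C" {
      // Declared above; implemented in compiler/ori_rt/src/unicode/grapheme.rs.
      fn ori_grapheme_next(s: *const u8, len: usize, offset: usize) -> usize;
  }

  // Yields each grapheme cluster of `s` as a subslice by repeatedly
  // asking the runtime for the next boundary.
  fn graphemes(s: &str) -> impl Iterator<Item = &str> + '_ {
      let bytes = s.as_bytes();
      let mut start = 0;
      std::iter::from_fn(move || {
          if start >= bytes.len() {
              return None;
          }
          // Safety: pointer and len come from a live &str, and the runtime
          // is assumed to return an offset in (start, len] on a UTF-8 boundary.
          let end = unsafe { ori_grapheme_next(bytes.as_ptr(), bytes.len(), start) };
          let cluster = &s[start..end];
          start = end;
          Some(cluster)
      })
  }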

7E.1.4 Word Segmentation (UAX #29)

  • Implement UAX #29 word break state machine in compiler/ori_rt/src/unicode/word.rs
    • ori_word_next(s: *const u8, len: usize, offset: usize) -> (usize, bool) — (boundary, is_word_like)
  • Expose as @words (s: str) -> Iterator<WordSegment> in library/std/text/unicode/segmentation.ori
  • Tests: Run against WordBreakTest.txt from Unicode
    • Matrix: English, CJK (per-char), contractions (“don’t”), hyphenated words, numeric, mixed

7E.1.5 Display Width (std.text.width)

  • Implement library/std/text/width/mod.ori:
    • @east_asian_width (c: char) -> EastAsianWidth — calls ori_char_east_asian_width
    • @char_width (c: char) -> int — calls ori_char_width
    • @display_width (s: str) -> int — iterates grapheme clusters, sums widths (sketched below)
      • Fast path: ASCII-only → byte length
      • CJK grapheme → 2
      • Emoji grapheme (ZWJ, flag) → 2
      • Combining mark → 0
      • ANSI escape → 0 (auto-strip)
    • @truncate_to_width (s: str, max_width: int, suffix: str = "…") -> str
    • @pad_to_width (s: str, target_width: int, fill: char = ' ', align: Alignment = Alignment.Left) -> str
    • @center_to_width (s: str, target_width: int, fill: char = ' ') -> str
  • Tests:
    • Semantic pin: display_width("こんにちは") == 10 (CJK = 2 each)
    • Semantic pin: display_width("👨‍👩‍👧‍👦") == 2 (ZWJ emoji = 2)
    • Semantic pin: display_width("\x1b[31mHi\x1b[0m") == 2 (ANSI = 0)
    • Matrix: ASCII, CJK, emoji, combining marks, control chars, ANSI escapes, mixed
    • Truncation: grapheme-safe (never splits cluster), CJK at boundary, suffix accounting
    • Padding: left/right/center, CJK content, emoji content
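
A minimal Rust sketch of the display_width flow above (the shipped version is Ori and calls the ori_rt tables). The width rules and the per-char graphemes stand-in are crude placeholders for the real table lookups and the FFI iterator from 7E.1.3, and ANSI stripping is assumed to happen first (7E.1.6):

  // Per-char stand-in for the FFI-backed iterator sketched in 7E.1.3.
  fn graphemes(s: &str) -> impl Iterator<Item = &str> + '_ {
      s.split("").filter(|g| !g.is_empty())
  }

  fn display_width(s: &str) -> usize {
      // Fast path: printable ASCII, width equals byte length.
      if s.bytes().all(|b| (0x20..0x7F).contains(&b)) {
          return s.len();
      }
      graphemes(s).map(cluster_width).sum()
  }

  fn cluster_width(cluster: &str) -> usize {
      // ZWJ sequences and regional-indicator flags render as one wide glyph.
      let wide_emoji = cluster
          .chars()
          .any(|c| c == '\u{200D}' || ('\u{1F1E6}'..='\u{1F1FF}').contains(&c));
      if wide_emoji {
          return 2;
      }
      cluster.chars().map(char_width).max().unwrap_or(0)
  }

  // Stand-in for ori_char_width: a few wide ranges; the real table
  // covers all of East_Asian_Width plus combining marks.
  fn char_width(c: char) -> usize {
      match c {
          '\u{0300}'..='\u{036F}' => 0, // combining marks take no columns
          '\u{1100}'..='\u{115F}'
          | '\u{2E80}'..='\u{9FFF}'
          | '\u{AC00}'..='\u{D7A3}'
          | '\u{F900}'..='\u{FAFF}'
          | '\u{FF00}'..='\u{FF60}' => 2, // CJK and fullwidth forms
          _ => 1,
      }
  }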

7E.1.6 ANSI Escape Handling (std.text.transform.ansi)

  • Implement library/std/text/transform/ansi.ori:
    • @strip_ansi (s: str) -> str — see the sketch below
    • @ansi_display_width (s: str) -> int — single pass, no intermediate allocation
    • @has_ansi (s: str) -> bool
    • @parse_ansi (s: str) -> Iterator<AnsiSegment> (Text | Escape variants)
  • Tests: CSI sequences, OSC sequences, SGR reset, nested colors, empty strings
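
A minimal sketch of @strip_ansi as a single-pass scanner, in Rust for concreteness. It covers the two sequence families named in the tests: CSI (ESC [ ... final byte in @ through ~) and OSC (ESC ] ... terminated by BEL or ESC \):

  fn strip_ansi(s: &str) -> String {
      let mut out = String::with_capacity(s.len());
      let mut chars = s.chars().peekable();
      while let Some(c) = chars.next() {
          if c != '\u{1B}' {
              out.push(c);
              continue;
          }
          match chars.next() {
              Some('[') => {
                  // CSI: skip params/intermediates until a final byte @..~.
                  for c in chars.by_ref() {
                      if ('\u{40}'..='\u{7E}').contains(&c) {
                          break;
                      }
                  }
              }
              Some(']') => {
                  // OSC: skip until BEL or the ESC \ string terminator.
                  while let Some(c) = chars.next() {
                      if c == '\u{07}' {
                          break;
                      }
                      if c == '\u{1B}' && chars.peek() == Some(&'\\') {
                          chars.next();
                          break;
                      }
                  }
              }
              _ => {} // lone ESC or other introducer: drop it
          }
      }
      out
  }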

7E.1.7 High-Level Convenience Functions (std.text root)

  • Implement library/std/text/mod.ori (replacing current stub):
    • @wrap (text: str, width: int) -> [str] — uses display_width + word segmentation
    • @wrap_lines (text: str, width: int) -> str — wrap + join with "\n"
    • @truncate (text: str, max_graphemes: int, suffix: str = "...") -> str
    • @indent (text: str, prefix: str) -> str
    • @dedent (text: str) -> str — strips the common leading whitespace (sketched below)
    • @is_blank (text: str) -> bool — empty or all Unicode whitespace
    • Re-exports from submodules (display_width, graphemes, etc.)
  • Tests:
    • wrap: English, CJK (no spaces), mixed CJK+Latin, long words, empty strings
    • Semantic pin: wrap("日本語テスト", 6) == ["日本語", "テスト"] (CJK width-aware)
    • truncate: grapheme-safe, suffix accounting
    • indent/dedent: multiline, mixed indentation, empty lines
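
A minimal Rust sketch of the @dedent semantics: compute the smallest leading-whitespace margin over non-blank lines, then strip it from every line. It assumes ASCII indentation; the shipped Ori version must also handle tabs and Unicode whitespace:

  fn dedent(text: &str) -> String {
      // Margin = smallest leading-whitespace byte count over non-blank lines.
      let margin = text
          .lines()
          .filter(|l| !l.trim().is_empty()) // blank lines don't constrain it
          .map(|l| l.len() - l.trim_start().len())
          .min()
          .unwrap_or(0);
      text.lines()
          .map(|l| {
              // Assumes ASCII indentation, so byte slicing is safe here.
              if l.len() >= margin { &l[margin..] } else { l.trim_start() }
          })
          .collect::<Vec<_>>()
          .join("\n")
  }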

7E.1.R Third-Party Review Findings

  • None.

7E.1 Completion Checklist

  • All functions implemented in library/std/text/ and compiler/ori_rt/src/unicode/

  • All Unicode test data suites pass (GraphemeBreakTest, WordBreakTest)

  • Spec tests in tests/spec/text/ cover all public APIs

  • Evaluator and LLVM backends produce identical results for all tests

  • ORI_CHECK_LEAKS=1 reports zero leaks

  • ./test-all.sh passes

  • /tpr-review

  • /impl-hygiene-review

  • /tpr-review passed — independent review found no critical or major issues (or all findings triaged)

  • /impl-hygiene-review passed — hygiene review clean. MUST run AFTER /tpr-review is clean.

  • Subsection close-out (7E.1) — MANDATORY before starting the next subsection. Run /improve-tooling retrospectively on THIS subsection’s debugging journey (per .claude/skills/improve-tooling/SKILL.md “Per-Subsection Workflow”): which diagnostics/ scripts you ran, where you added dbg!/tracing calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via SEPARATE /commit-push using a valid conventional-commit type (e.g. build(diagnostics): ... — surfaced by section-7E.1 retrospective; build/test/chore/ci/docs are valid, tools(...) is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: “Retrospective 7E.1: no tooling gaps”. Update this subsection’s status in section frontmatter to complete.

  • /sync-claude section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.

  • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files.


7E.2 Phase 2: Normalization + Case + Similarity + Transforms

Goal: Ship Unicode normalization (NFC/NFD/NFKC/NFKD), case folding, locale-aware case conversion, string similarity functions (edit distance, Jaro-Winkler, closest match), natural sort, slugification, and case style conversion. Enables compiler diagnostics (“did you mean?”), search, data processing, URL generation.

Estimated scope: ~1,500 LOC Rust (tables), ~2,000 LOC Ori (algorithms + tests)

Depends on: 7E.1 (grapheme segmentation used by similarity functions)

7E.2.1 Normalization Tables in ori_rt

  • Generate normalization data tables via scripts/generate-unicode-tables.py:
    • Canonical decomposition mapping (~30KB)
    • Canonical composition mapping (~20KB)
    • Compatibility decomposition mapping (~10KB)
    • NFC/NFD Quick_Check properties (~5KB)
    • Canonical_Combining_Class table (~5KB)
  • Implement in compiler/ori_rt/src/unicode/normalize.rs:
    • ori_normalize_nfc(s: *const u8, len: usize, out: *mut u8, out_len: *mut usize)
    • ori_is_normalized_nfc(s: *const u8, len: usize) -> bool — quick check fast path (sketched below)
    • NFD, NFKC, NFKD variants
  • Tests: Run against NormalizationTest.txt from Unicode (18,000+ test cases)
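
The quick-check fast path follows UAX #15 section 9: one scan checking the NFC Quick_Check property and the canonical ordering of combining classes. A Rust sketch with stubbed table lookups; a real implementation resolves Maybe by normalizing the affected span instead of conservatively answering false:

  enum QuickCheck { Yes, No, Maybe }

  fn nfc_quick_check(_c: char) -> QuickCheck { QuickCheck::Yes } // table stub
  fn combining_class(_c: char) -> u8 { 0 }                       // table stub

  fn is_normalized_nfc(s: &str) -> bool {
      let mut last_ccc = 0u8;
      for c in s.chars() {
          let ccc = combining_class(c);
          if ccc != 0 && last_ccc > ccc {
              return false; // combining marks out of canonical order
          }
          match nfc_quick_check(c) {
              QuickCheck::Yes => {}
              // No is definitive; a full implementation re-checks Maybe
              // by normalizing, rather than answering false as done here.
              QuickCheck::No | QuickCheck::Maybe => return false,
          }
          last_ccc = ccc;
      }
      true
  }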

7E.2.2 Normalization Ori API (std.text.unicode.normalization)

  • Implement library/std/text/unicode/normalization.ori:
    • @normalize (s: str, form: NormalizationForm = NormalizationForm.NFC) -> str
    • @is_normalized (s: str, form: NormalizationForm = NormalizationForm.NFC) -> bool
    • @canonical_equals (a: str, b: str) -> bool (fast-path layering sketched below)
    • @compatibility_equals (a: str, b: str) -> bool
  • Tests:
    • Semantic pin: canonical_equals("café", "cafe\u{0301}") == true
    • Semantic pin: is_normalized("Hello", NFC) == true (ASCII fast path, no allocation)
    • Matrix: ASCII (fast path), precomposed, decomposed, compatibility, Hangul, mixed
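
How @canonical_equals can layer its fast paths, sketched in Rust; both helpers are stubs for the FFI-backed routines (the quick check is sketched in 7E.2.1):

  fn is_normalized_nfc(s: &str) -> bool {
      s.is_ascii() // stand-in: ASCII is always NFC; see the 7E.2.1 quick check
  }

  fn normalize_nfc(s: &str) -> String {
      s.to_string() // stand-in for ori_normalize_nfc
  }

  fn canonical_equals(a: &str, b: &str) -> bool {
      if a == b {
          return true; // identical bytes are canonically equal
      }
      if is_normalized_nfc(a) && is_normalized_nfc(b) {
          return false; // both already NFC and bytewise different
      }
      normalize_nfc(a) == normalize_nfc(b)
  }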

7E.2.3 Case Folding Tables in ori_rt

  • Generate case folding table from CaseFolding.txt (~15KB)
  • Generate special casing data from SpecialCasing.txt (~5KB)
  • Implement ori_case_fold(s, len, out, out_len) in ori_rt

7E.2.4 Case Operations (std.text.case)

  • Implement library/std/text/case/mod.ori:
    • @case_fold (s: str) -> str
    • @case_fold_equals (a: str, b: str) -> bool
    • @case_fold_compare (a: str, b: str) -> Ordering
    • @to_uppercase (s: str, locale: CaseLocale = CaseLocale.Default) -> str
    • @to_lowercase (s: str, locale: CaseLocale = CaseLocale.Default) -> str (Turkish special case sketched below)
    • @to_titlecase (s: str, locale: CaseLocale = CaseLocale.Default) -> str
  • Tests:
    • Semantic pin: case_fold_equals("straße", "STRASSE") == true
    • Matrix: ASCII, Latin extended, Turkish İ/ı, German ß→SS, Greek final σ→ς, Lithuanian
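
The Turkish İ/ı rows in the matrix are the reason case conversion takes a locale: dotted and dotless I do not follow the default Unicode mapping. A Rust sketch of just that special case (CaseLocale mirrors the Ori enum above):

  enum CaseLocale { Default, Turkish }

  fn lowercase_char(c: char, locale: CaseLocale) -> String {
      match (c, locale) {
          ('I', CaseLocale::Turkish) => "ı".to_string(), // dotless lowercase
          ('İ', CaseLocale::Turkish) => "i".to_string(), // dotted capital → plain i
          // Default locale: full Unicode mapping (İ lowers to i + U+0307).
          _ => c.to_lowercase().collect(),
      }
      // lowercase_char('I', CaseLocale::Turkish) == "ı"; with Default it is "i"
  }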

7E.2.5 String Similarity (std.text.similarity)

  • Implement library/std/text/similarity/mod.ori (pure Ori):
    • @edit_distance (a: str, b: str) -> int — Wagner-Fischer O(n·m), O(min(n,m)) space (sketched below)
    • @damerau_levenshtein (a: str, b: str) -> int — with transposition
    • @jaro_winkler (a: str, b: str) -> float — 0.0 to 1.0
    • @similarity_ratio (a: str, b: str) -> float — normalized edit distance
    • @longest_common_subsequence (a: str, b: str) -> str
    • @closest_match (needle: str, haystack: [str]) -> Option<str> — Jaro-Winkler, threshold 0.6
    • @closest_matches (needle: str, haystack: [str], max_results: int = 3) -> [str]
    • @natural_compare (a: str, b: str) -> Ordering
    • @natural_sort (items: [str]) -> [str]
  • All similarity functions operate on grapheme clusters
  • Tests:
    • Semantic pin: edit_distance("kitten", "sitting") == 3
    • Semantic pin: closest_match("prnt", ["print", "panic", "parse"]) == Some("print")
    • Matrix: empty strings, identical, completely different, Unicode (CJK, emoji), single char, long strings
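
A Rust sketch of the one-row Wagner-Fischer variant named above, O(n·m) time and O(min(n,m)) space. It iterates chars for brevity where the shipped version iterates grapheme clusters:

  fn edit_distance(a: &str, b: &str) -> usize {
      // Operate on the shorter string as the DP row.
      let (short, long) = {
          let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
          if a.len() <= b.len() { (a, b) } else { (b, a) }
      };
      let mut row: Vec<usize> = (0..=short.len()).collect();
      for (i, lc) in long.iter().enumerate() {
          let mut prev_diag = row[0]; // matrix[i][0] before it is overwritten
          row[0] = i + 1;
          for (j, sc) in short.iter().enumerate() {
              let cost = if lc == sc { 0 } else { 1 };
              // min of substitution (diagonal), insertion, deletion
              let next = (prev_diag + cost).min(row[j] + 1).min(row[j + 1] + 1);
              prev_diag = row[j + 1];
              row[j + 1] = next;
          }
      }
      row[short.len()] // edit_distance("kitten", "sitting") == 3
  }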

7E.2.6 Text Transforms (std.text.transform)

  • Implement library/std/text/transform/slug.ori:
    • @slugify (s: str, options: SlugOptions = SlugOptions {}) -> str
  • Implement library/std/text/transform/mod.ori:
    • @remove_diacritics (s: str) -> str — NFKD + strip combining marks
    • @to_ascii_approximation (s: str) -> str — diacritics + basic transliteration
    • @to_snake_case, @to_camel_case, @to_pascal_case, @to_kebab_case, @to_screaming_snake (boundary detection sketched below)
  • Tests: slugify with diacritics, CJK, spaces, special chars; case conversion with camelCase, PascalCase, snake_case, mixed
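
A minimal Rust sketch of the boundary detection shared by the case-style converters, shown for @to_snake_case. It deliberately ignores acronym runs such as HTMLParser, which the shipped version should handle:

  fn to_snake_case(s: &str) -> String {
      let mut out = String::new();
      let mut prev_lower = false;
      for c in s.chars() {
          if c == '-' || c == '_' || c.is_whitespace() {
              // Existing separators collapse into a single underscore.
              if !out.is_empty() && !out.ends_with('_') {
                  out.push('_');
              }
              prev_lower = false;
          } else if c.is_uppercase() {
              // camelCase / PascalCase boundary: lower→upper separates.
              if prev_lower {
                  out.push('_');
              }
              out.extend(c.to_lowercase());
              prev_lower = false;
          } else {
              out.push(c);
              prev_lower = c.is_lowercase() || c.is_numeric();
          }
      }
      // to_snake_case("PascalCase") == "pascal_case"
      out.trim_matches('_').to_string()
  }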

7E.2.R Third-Party Review Findings

  • None.

7E.2 Completion Checklist

  • All normalization test suite entries pass (NormalizationTest.txt)

  • All public APIs have spec tests

  • Evaluator and LLVM produce identical results

  • ORI_CHECK_LEAKS=1 zero leaks

  • ./test-all.sh passes

  • /tpr-review

  • /impl-hygiene-review

  • /tpr-review passed — independent review found no critical or major issues (or all findings triaged)

  • /impl-hygiene-review passed — hygiene review clean. MUST run AFTER /tpr-review is clean.

  • Subsection close-out (7E.2) — MANDATORY before starting the next subsection. Run /improve-tooling retrospectively on THIS subsection’s debugging journey (per .claude/skills/improve-tooling/SKILL.md “Per-Subsection Workflow”): which diagnostics/ scripts you ran, where you added dbg!/tracing calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via SEPARATE /commit-push using a valid conventional-commit type (e.g. build(diagnostics): ... — surfaced by section-7E.2 retrospective; build/test/chore/ci/docs are valid, tools(...) is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: “Retrospective 7E.2: no tooling gaps”. Update this subsection’s status in section frontmatter to complete.

  • /sync-claude section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.

  • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files.


7E.3 Phase 3: Analysis Pipeline + Layout Engine

Goal: Port Pretext’s text analysis pipeline (15 linguistic merging passes) and greedy line-breaking engine to Ori. Ship the TextMeasure trait with built-in MonospaceMeasure, TerminalMeasure, and CachedMeasure. Upgrade wrap() to use the full pipeline. This enables production-quality text layout for TUIs, editors, GPU widgets, and browser engines.

Estimated scope: ~3,000 LOC Ori (pipeline + engine), ~2,000 LOC Ori (tests)

Depends on: 7E.1 (segmentation, display width), 7E.2 (normalization for whitespace handling)

Reference: ~/projects/reference_repos/pretext/src/analysis.ts, measurement.ts, layout.ts, line-break.ts

7E.3.1 TextMeasure Trait and Built-in Implementations

  • Implement library/std/text/measure/mod.ori:
    • pub trait TextMeasure { @measure (self, text: str) -> float } (Rust rendering sketched below)
    • type MonospaceMeasure = { char_width: float } + impl
    • type TerminalMeasure = { narrow_width: float, wide_width: float } + impl
    • type CachedMeasure<M: TextMeasure> = { inner: M, cache: {str: float} } + impl
    • Constructor helpers: @monospace(), @terminal(), @cached<M>(inner: M)
  • Tests: measure ASCII, CJK, emoji, empty string with each measurer
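
A Rust rendering of the trait and two built-ins, to pin down the intended semantics (the shipped definitions are the Ori signatures above; measure takes &mut self here only because the cache mutates):

  use std::collections::HashMap;

  trait TextMeasure {
      fn measure(&mut self, text: &str) -> f64;
  }

  struct MonospaceMeasure { char_width: f64 }

  impl TextMeasure for MonospaceMeasure {
      fn measure(&mut self, text: &str) -> f64 {
          // The real version counts grapheme clusters, not chars.
          text.chars().count() as f64 * self.char_width
      }
  }

  struct CachedMeasure<M: TextMeasure> {
      inner: M,
      cache: HashMap<String, f64>,
  }

  impl<M: TextMeasure> TextMeasure for CachedMeasure<M> {
      fn measure(&mut self, text: &str) -> f64 {
          if let Some(&w) = self.cache.get(text) {
              return w; // memoized width
          }
          let w = self.inner.measure(text);
          self.cache.insert(text.to_string(), w);
          w
      }
  }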

7E.3.2 Kinsoku Tables and Segment Classification

  • Implement library/std/text/analysis/kinsoku.ori:
    • let $kinsoku_start: Set<char> — CJK line-start-prohibited characters (26 chars)
    • let $kinsoku_end: Set<char> — CJK line-end-prohibited characters (18 chars)
    • let $left_sticky_punctuation: Set<char> — left-sticky punctuation (~30 chars)
    • let $arabic_no_space_trailing: Set<char> — Arabic trailing punctuation
    • let $myanmar_medial_glue: Set<char> — Myanmar medial connectors
    • Classification functions: is_left_sticky_segment, is_forward_sticky_segment, etc.
  • Port from: pretext/src/analysis.ts lines 129-207 (kinsoku tables and classification)

7E.3.3 Text Analysis Pipeline

  • Implement library/std/text/analysis/mod.ori:
    • Whitespace normalization (Normal, PreWrap, Pre, PreLine modes)
    • Segment-by-break-kind splitting
    • 15 merging passes (each a linear scan; pass 1 sketched below):
      1. Left-sticky punctuation merge
      2. CJK kinsoku merge (line-start/end prohibited)
      3. Forward-sticky cluster carry
      4. Arabic no-space punctuation merge
      5. Myanmar medial glue merge
      6. Escaped quote cluster merge
      7. Repeated single-char run merge
      8. Glue-connected text run merge (NBSP)
      9. URL-like run merge
      10. URL query run merge
      11. Numeric run merge
      12. ASCII punctuation chain merge
      13. Hyphenated numeric split
      14. Forward-sticky carry across CJK
      15. Arabic space+mark split
    • Hard-break chunk compilation
    • @analyze (text: str, options: AnalysisOptions) -> TextAnalysis
  • Port from: pretext/src/analysis.ts (full file, ~1020 lines)
  • Tests:
    • Per-pass unit tests (each merging pass tested independently)
    • Integration: English, CJK, Arabic, Thai, Myanmar, URLs, numeric, emoji, mixed
    • Semantic pin: analyze("better.").segments.len() == 1 (punctuation merged)
    • Semantic pin: URL segments grouped correctly
    • Corpus tests adapted from Pretext’s accuracy suite
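
All merging passes share one shape: a linear scan that folds a segment into its neighbor when a predicate fires. A Rust sketch of pass 1 (left-sticky punctuation) with simplified stand-in types; it reproduces the semantic pin that "better." analyzes to a single segment:

  struct Segment { text: String }

  // Stand-in predicate: closing punctuation that must not begin a line.
  fn is_left_sticky(seg: &Segment) -> bool {
      matches!(seg.text.as_str(), "." | "," | "!" | "?" | ")" | "”")
  }

  fn merge_left_sticky(segments: Vec<Segment>) -> Vec<Segment> {
      let mut out: Vec<Segment> = Vec::with_capacity(segments.len());
      for seg in segments {
          match out.last_mut() {
              // Fold "better" + "." into one unbreakable segment.
              Some(prev) if is_left_sticky(&seg) => prev.text.push_str(&seg.text),
              _ => out.push(seg),
          }
      }
      out
  }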

7E.3.4 Line Breaking Engine

  • Implement library/std/text/layout/line_break.ori:
    • PreparedText type with SoA internal arrays (widths, lineEndFitAdvances, lineEndPaintAdvances, kinds, breakableWidths, chunks)
    • @prepare<M: TextMeasure> (text, measurer, options) -> PreparedText — analysis + measurement
    • Simple fast path walker (no tabs, soft hyphens, preserved spaces)
    • Full path walker (tabs, soft hyphens, preserved spaces, hard breaks)
    • Dual fit/paint width tracking (trailing whitespace hanging; sketched below)
    • Overflow-wrap grapheme-level breaking
    • Soft-hyphen fitting (fitSoftHyphenBreak)
    • Tab advance calculation (8-space stops)
  • Port from: pretext/src/line-break.ts (full file, ~1060 lines)
  • Tests:
    • Line count correctness at various widths
    • Semantic pin: trailing spaces don’t trigger breaks
    • Semantic pin: soft hyphens invisible unless chosen as break
    • Matrix: maxWidth variations (narrow, wide, exact fit), all segment kinds
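
A deliberately reduced Rust sketch of the walker's dual-width idea: the fit test ignores a piece's trailing whitespace, so spaces hang past max_width instead of forcing a break. The Piece model and field names are assumptions, not the PreparedText SoA layout:

  // Assumed piece model: fit_width excludes the piece's own trailing
  // whitespace, paint_width includes it.
  struct Piece { fit_width: f64, paint_width: f64 }

  fn walk_lines(pieces: &[Piece], max_width: f64) -> usize {
      let mut lines = 1;
      let mut pen = 0.0; // painted advance on the current line
      for p in pieces {
          // Fit test uses fit_width, so trailing spaces hang past
          // max_width rather than forcing a break on their own.
          if pen + p.fit_width > max_width && pen > 0.0 {
              lines += 1; // greedy break before this piece
              pen = 0.0;
          }
          pen += p.paint_width;
      }
      lines
  }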

7E.3.5 Layout API

  • Implement library/std/text/layout/mod.ori:
    • @layout (prepared, max_width, line_height) -> LayoutResult — hot path, pure arithmetic
    • @layout_lines (prepared, max_width) -> Iterator<LayoutLine> — with text materialization
    • @layout_next_line (prepared, start, max_width) -> Option<LayoutLine> — streaming
    • @walk_line_ranges (prepared, max_width) -> Iterator<LayoutLineRange> — non-materializing
    • @natural_width (prepared) -> float — intrinsic width
  • Port from: pretext/src/layout.ts lines 495-716
  • Tests:
    • Round-trip: layout_lines line count matches layout line count
    • layout_next_line with variable widths
    • natural_width for single-line and multi-line text
    • Matrix: various measurers (monospace, terminal), various content types

7E.3.6 Upgrade wrap() to Full Pipeline

  • Replace Phase 1’s basic wrap() with pipeline-backed implementation:
    • wrap() now calls analyze() + prepare() + layout_lines()
    • wrap_measured<M: TextMeasure>() for custom measurers
  • Verify all existing Phase 1 wrap tests still pass
  • Add new tests for kinsoku, soft hyphens, URLs, numeric runs

7E.3.R Third-Party Review Findings

  • None.

7E.3 Completion Checklist

  • Full Pretext analysis pipeline ported and tested

  • Line breaking matches Pretext behavior on corpus texts

  • All Phase 1 wrap tests still pass after upgrade

  • Evaluator and LLVM produce identical results

  • ORI_CHECK_LEAKS=1 zero leaks

  • ./test-all.sh passes

  • /tpr-review

  • /impl-hygiene-review

  • /tpr-review passed — independent review found no critical or major issues (or all findings triaged)

  • /impl-hygiene-review passed — hygiene review clean. MUST run AFTER /tpr-review is clean.

  • Subsection close-out (7E.3) — MANDATORY before starting the next subsection. Run /improve-tooling retrospectively on THIS subsection’s debugging journey (per .claude/skills/improve-tooling/SKILL.md “Per-Subsection Workflow”): which diagnostics/ scripts you ran, where you added dbg!/tracing calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via SEPARATE /commit-push using a valid conventional-commit type (e.g. build(diagnostics): ... — surfaced by section-7E.3 retrospective; build/test/chore/ci/docs are valid, tools(...) is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: “Retrospective 7E.3: no tooling gaps”. Update this subsection’s status in section frontmatter to complete.

  • /sync-claude section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.

  • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files.


7E.4 Phase 4: Bidi + Security + Inline Flow + Encoding

Goal: Ship the bidirectional text algorithm, confusable detection, mixed inline content layout, and legacy encoding conversion. This completes the full std.text API surface.

Estimated scope: ~1,000 LOC Rust (bidi class + confusable tables), ~2,000 LOC Ori (algorithms + tests)

Depends on: 7E.1 (character properties), 7E.3 (layout engine for inline flow)

7E.4.1 Bidi Tables and Algorithm

  • Generate bidi class table in ori_rt (~10KB)
  • Implement library/std/text/unicode/bidi.ori:
    • @bidi_class (c: char) -> BidiClass
    • @paragraph_direction (s: str) -> Direction (first-strong rule sketched below)
    • @bidi_levels (s: str) -> Option<[BidiLevel]> — simplified UAX #9 (W1-W7, N1-N2, I1-I2)
    • @has_bidi_controls (s: str) -> bool — security check
    • @strip_bidi_controls (s: str) -> str
    • @reorder_visual (s: str, levels: [BidiLevel]) -> str
  • Port from: pretext/src/bidi.ts (174 lines)
  • Tests: Pure LTR (fast path → None), pure RTL, mixed LTR+RTL, Arabic with numbers, bidi control detection
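
@paragraph_direction is UAX #9 rules P2/P3: the first strong directional character decides, defaulting to LTR. A Rust sketch with a crude stand-in for the bidi class table:

  enum Direction { Ltr, Rtl }

  enum BidiClass { L, R, Al, Other }

  // Crude stand-in for the generated bidi class table.
  fn bidi_class(c: char) -> BidiClass {
      match c {
          '\u{0590}'..='\u{05FF}' => BidiClass::R,  // Hebrew block
          '\u{0600}'..='\u{06FF}' => BidiClass::Al, // Arabic block
          c if c.is_alphabetic() => BidiClass::L,   // rough: other letters as L
          _ => BidiClass::Other,
      }
  }

  // UAX #9 P2/P3: the first strong character sets the paragraph direction.
  fn paragraph_direction(s: &str) -> Direction {
      for c in s.chars() {
          match bidi_class(c) {
              BidiClass::L => return Direction::Ltr,
              BidiClass::R | BidiClass::Al => return Direction::Rtl,
              BidiClass::Other => {}
          }
      }
      Direction::Ltr // default when no strong character appears
  }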

7E.4.2 Confusable Detection (UTS #39)

  • Generate confusable mappings table in ori_rt from confusables.txt (~50KB)
  • Implement library/std/text/unicode/security.ori:
    • @skeleton (s: str) -> str — UTS #39 skeleton transform (sketched below)
    • @is_confusable (a: str, b: str) -> bool
    • @mixed_script_status (s: str) -> MixedScriptStatus
    • @restriction_level (s: str) -> RestrictionLevel
  • Tests:
    • Semantic pin: is_confusable("аpple", "apple") == true (Cyrillic а vs Latin a)
    • Mixed script: Latin-only → SingleScript, Latin+Common → SafeMix, Latin+Cyrillic → SuspiciousMix
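
The UTS #39 skeleton transform behind @skeleton and @is_confusable: NFD, map each code point to its confusable prototype, NFD again; two strings are confusable when their skeletons match. A Rust sketch with stubbed lookups and one illustrative table entry:

  // One illustrative entry; the real table has thousands from confusables.txt.
  fn confusable_prototype(c: char) -> String {
      match c {
          'а' => "a".to_string(), // Cyrillic а maps to Latin a
          _ => c.to_string(),
      }
  }

  fn nfd(s: &str) -> String {
      s.to_string() // stub for the FFI-backed NFD from 7E.2
  }

  // UTS #39: skeleton(X) = NFD(map(NFD(X))).
  fn skeleton(s: &str) -> String {
      let mapped: String = nfd(s).chars().map(confusable_prototype).collect();
      nfd(&mapped)
  }

  fn is_confusable(a: &str, b: &str) -> bool {
      skeleton(a) == skeleton(b) // "аpple" (Cyrillic а) vs "apple" → true
  }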

7E.4.3 Inline Flow Layout

  • Implement library/std/text/layout/inline_flow.ori:
    • @prepare_inline_flow<M> (items, measurer) -> PreparedInlineFlow
    • @layout_inline_flow_lines (prepared, max_width) -> Iterator<InlineFlowLine>
    • @measure_inline_flow (prepared, max_width, line_height) -> LayoutResult
    • Atomic items (break: Never), boundary whitespace collapse, gap calculation
  • Port from: pretext/src/inline-flow.ts (344 lines)
  • Tests: Mixed text + atomic chips, boundary whitespace, variable item widths

7E.4.4 Encoding Conversion

  • Generate encoding tables in ori_rt or FFI to encoding_rs:
    • UTF-8/16/32 interconversion
    • Latin-1, Windows-1252
    • Shift-JIS, EUC-JP, ISO-2022-JP
    • GB2312, GBK, GB18030
    • Big5, EUC-KR
    • ISO 8859 parts 1-16
  • Implement library/std/text/transform/encoding.ori:
    • @decode (bytes: [byte], encoding: Encoding) -> Result<str, EncodingError> (encoding_rs delegation sketched below)
    • @encode (s: str, encoding: Encoding) -> Result<[byte], EncodingError>
    • @detect_encoding (bytes: [byte]) -> Encoding — heuristic detection
  • Tests: Round-trip encode/decode for each encoding, invalid byte sequences, BOM detection
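
If the encoding_rs route is taken, @decode's per-encoding arms reduce to thin wrappers. A sketch of what the Shift-JIS arm might delegate to, assuming the encoding_rs crate; the error shape here is a placeholder for EncodingError:

  use encoding_rs::SHIFT_JIS;

  fn decode_shift_jis(bytes: &[u8]) -> Result<String, &'static str> {
      // decode() returns (text, encoding actually used, had_errors).
      let (text, _encoding_used, had_errors) = SHIFT_JIS.decode(bytes);
      if had_errors {
          Err("invalid byte sequence") // placeholder for EncodingError
      } else {
          Ok(text.into_owned())
      }
  }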

7E.4.R Third-Party Review Findings

  • None.

7E.4 Completion Checklist

  • All bidi, security, inline flow, encoding APIs implemented

  • Evaluator and LLVM produce identical results

  • ORI_CHECK_LEAKS=1 zero leaks

  • ./test-all.sh passes

  • /tpr-review

  • /impl-hygiene-review

  • /tpr-review passed — independent review found no critical or major issues (or all findings triaged)

  • /impl-hygiene-review passed — hygiene review clean. MUST run AFTER /tpr-review is clean.

  • Subsection close-out (7E.4) — MANDATORY before starting the next subsection. Run /improve-tooling retrospectively on THIS subsection’s debugging journey (per .claude/skills/improve-tooling/SKILL.md “Per-Subsection Workflow”): which diagnostics/ scripts you ran, where you added dbg!/tracing calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via SEPARATE /commit-push using a valid conventional-commit type (e.g. build(diagnostics): ... — surfaced by section-7E.4 retrospective; build/test/chore/ci/docs are valid, tools(...) is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: “Retrospective 7E.4: no tooling gaps”. Update this subsection’s status in section frontmatter to complete.

  • /sync-claude section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.

  • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files.


7E.5 Section Completion Checklist

  • All 4 phases implemented

  • Full std.text API surface matches approved proposal

  • All Unicode test suites pass (Grapheme, Word, Normalization)

  • Pretext corpus tests adapted and passing

  • std.text 0.1.0 version tagged (Phase 1)

  • std.text 0.2.0 version tagged (Phase 2)

  • std.text 0.3.0 version tagged (Phase 3)

  • std.text 0.4.0 version tagged (Phase 4)

  • library/std/text/mod.ori stub replaced with full implementation

  • Performance meets targets from proposal (display_width < 5ns/ASCII char, layout < 200ns/block)

  • /tpr-review (final)

  • /impl-hygiene-review (final)

  • /improve-tooling retrospective completed — MANDATORY at section close, after both reviews are clean. Reflect on the section’s debugging journey (which diagnostics/ scripts you ran, which command sequences you repeated, where you added ad-hoc dbg!/tracing calls, where output was hard to interpret) and identify any tool/log/diagnostic improvement that would have made this section materially easier OR that would help the next section touching this area. Implement every accepted improvement NOW (zero deferral) and commit each via SEPARATE /commit-push. The retrospective is mandatory even when nothing felt painful — that is exactly when blind spots accumulate. See .claude/skills/improve-tooling/SKILL.md “Retrospective Mode” for the full protocol.

  • /tpr-review passed — independent review found no critical or major issues (or all findings triaged)

  • /impl-hygiene-review passed — hygiene review clean. MUST run AFTER /tpr-review is clean.

  • Subsection close-out (7E.5) — MANDATORY before starting the next subsection. Run /improve-tooling retrospectively on THIS subsection’s debugging journey (per .claude/skills/improve-tooling/SKILL.md “Per-Subsection Workflow”): which diagnostics/ scripts you ran, where you added dbg!/tracing calls, where output was hard to interpret, where test failures gave unhelpful messages, where you ran the same command sequence repeatedly. Forward-look: what tool/log/diagnostic would shorten the next regression in this code path by 10 minutes? Implement improvements NOW (zero deferral) and commit each via SEPARATE /commit-push using a valid conventional-commit type (e.g. build(diagnostics): ... — surfaced by section-7E.5 retrospective; build/test/chore/ci/docs are valid, tools(...) is rejected by the lefthook commit-msg hook). Mandatory even when nothing felt painful. If genuinely no gaps, document briefly: “Retrospective 7E.5: no tooling gaps”. Update this subsection’s status in section frontmatter to complete.

  • /sync-claude section-close doc sync — verify Claude artifacts across all section commits. Map changed crates to rules files, check CLAUDE.md, canon.md. Fix drift NOW.

  • Repo hygiene check — run diagnostics/repo-hygiene.sh --check and clean any detected temp files.

Inspired By

  • Pretext (github.com/chenglou/pretext) — text analysis + layout engine
  • ICU4X — modular Unicode algorithms
  • Rust unicode-segmentation, unicode-width, textwrap crates
  • Swift String — grapheme cluster correctness
  • Elixir String — built-in similarity functions