From 22f1b4a4487307f0ac22cbef5d2f8a73f740abc8 Mon Sep 17 00:00:00 2001
From: ZHU Yuhao <dr.yuhao.zhu@outlook.com>
Date: Wed, 8 Apr 2026 21:01:49 +0200
Subject: [PATCH 1/5] Add planning

---
 docs/plans/decimal128_enhancement.md | 757 +++++++++++++++++++++++++++
 1 file changed, 757 insertions(+)
 create mode 100644 docs/plans/decimal128_enhancement.md

diff --git a/docs/plans/decimal128_enhancement.md b/docs/plans/decimal128_enhancement.md
new file mode 100644
index 00000000..16af1106
--- /dev/null
+++ b/docs/plans/decimal128_enhancement.md
@@ -0,0 +1,757 @@
+# Decimal128 Enhancement Plan
+
+> **Date**: 2026-04-08  
+> **Target**: decimo >=0.9.0  
+> **Mojo Version**: >=0.26.2  
+>
+> 子曰工欲善其事必先利其器  
+> The mechanic, who wishes to do his work well, must first sharpen his tools -- Confucius
+
+This document is a thorough audit of the `src/decimo/decimal128/` module. I compared our
+implementation against other 128-bit fixed-precision decimal libraries — C# `System.Decimal`,
+Rust `rust_decimal`, Apache Arrow `Decimal128`, and Go `govalues/decimal` — and recorded
+everything I found: correctness bugs, performance bottlenecks, and improvement opportunities.
+
+**Scope:** Only 128-bit (or near-128-bit) fixed-precision, non-floating-point decimal types.
+Arbitrary-precision decimals (Python `decimal.Decimal`, Java `BigDecimal`) are out of scope here —
+they are covered by `BigDecimal`. IEEE 754 decimal128 is also out of scope — it is a floating-point
+format with discontinuous representation, not comparable to our fixed-point design.
+
+---
+
+## 1. Cross-Language Comparison
+
+### 1.1 Storage & Layout
+
+| Feature                | Decimo Decimal128    | C# System.Decimal    | Rust rust_decimal    | Arrow Decimal128           | Go govalues/decimal |
+| ---------------------- | -------------------- | -------------------- | -------------------- | -------------------------- | ------------------- |
+| Total bits             | 128                  | 128                  | 128                  | 128                        | 128 (bool+u64+int)  |
+| Coefficient storage    | 96-bit (3×UInt32 LE) | 96-bit (3×UInt32 LE) | 96-bit (3×UInt32 LE) | 128-bit signed two's compl | 64-bit unsigned     |
+| Max coefficient        | 2^96 − 1             | 2^96 − 1             | 2^96 − 1             | 10^38 − 1                  | 10^19 − 1           |
+| Bound type             | Binary               | Binary               | Binary               | Decimal                    | Decimal             |
+| Max significant digits | 29*                  | 29*                  | 29*                  | 38                         | 19                  |
+| Scale range            | 0–28                 | 0–28                 | 0–28                 | User-defined               | 0–19                |
+| Sign storage           | Bit 31 of flags      | Bit 31 of flags      | Bit 31 of flags      | Two's complement           | Bool field          |
+| Endianness             | Little-endian        | Little-endian        | Little-endian        | Platform-native            | N/A                 |
+
+\* 29 digits, but the leading digit is at most 7 and not every 29-digit value starting with 7 fits,
+since the bound is 2^96 − 1 = 79,228,162,514,264,337,593,543,950,335. See §2 for why this matters.
+
+**Observation:** Decimo, C#, and Rust share the same layout — a proven design. Arrow and govalues
+use a fundamentally different approach with decimal-bounded coefficients (10^p − 1 instead of
+2^N − 1), which gives them cleaner digit semantics at the cost of unused bit range.
+
+### 1.2 Special Values
+
+| Feature       | Decimo                | C#         | Rust rust_decimal | Arrow Decimal128 | Go govalues/decimal |
+| ------------- | --------------------- | ---------- | ----------------- | ---------------- | ------------------- |
+| +Infinity     | ✓ (broken — see §3.1) | ✗ (throws) | ✗                 | ✗                | ✗                   |
+| −Infinity     | ✓ (broken)            | ✗          | ✗                 | ✗                | ✗                   |
+| NaN           | ✓ (broken — see §3.1) | ✗          | ✗                 | ✗                | ✗                   |
+| Negative zero | ✗                     | ✗          | ✗                 | ✗                | ✗                   |
+| Subnormals    | ✗                     | ✗          | ✗                 | ✗                | ✗                   |
+
+**Observation:** None of the comparable 128-bit fixed-precision libraries support NaN or Infinity.
+We are the only one, and our implementation is broken (§3.1). I think we should seriously consider
+removing NaN/Infinity support to match the established paradigm — all four comparable libraries
+simply throw or return an error for undefined operations. If we keep them, the bugs must be fixed
+and full arithmetic propagation must be added, which is a significant effort for a feature no peer
+library provides.
+
+### 1.3 Rounding Modes
+
+| Mode                 | Decimo | C#                       | Rust        | Arrow           | Go govalues    |
+| -------------------- | ------ | ------------------------ | ----------- | --------------- | -------------- |
+| HALF_EVEN (banker's) | ✓      | ✓ (default)              | ✓ (default) | ✓ (default)     | ✓ (default)    |
+| HALF_UP              | ✓      | ✓ (`AwayFromZero`)       | ✓           | ✓ (`HALF_UP`)   | ✓ (`HalfUp`)   |
+| HALF_DOWN            | ✓      | ✗                        | ✓           | ✓ (`HALF_DOWN`) | ✓ (`HalfDown`) |
+| UP (away from zero)  | ✓      | ✗                        | ✓           | ✓ (`UP`)        | ✓ (`Up`)       |
+| DOWN (truncate)      | ✓      | ✓ (`ToZero`)             | ✓           | ✓ (`DOWN`)      | ✓ (`Down`)     |
+| CEILING              | ✓      | ✓ (`ToPositiveInfinity`) | ✓           | ✓ (`CEILING`)   | ✓ (`Ceiling`)  |
+| FLOOR                | ✓      | ✓ (`ToNegativeInfinity`) | ✓           | ✓ (`FLOOR`)     | ✓ (`Floor`)    |
+
+**Observation:** Apart from the C# gaps marked ✗, every library supports all 7 modes. We are on par.
+
+### 1.4 Arithmetic Coverage
+
+| Operation            | Decimo          | C#               | Rust rust_decimal | Arrow Decimal128 | Go govalues  |
+| -------------------- | --------------- | ---------------- | ----------------- | ---------------- | ------------ |
+| add                  | ✓               | ✓                | ✓                 | ✓                | ✓            |
+| subtract             | ✓ (via add)     | ✓                | ✓                 | ✓                | ✓            |
+| multiply             | ✓               | ✓                | ✓                 | ✓                | ✓            |
+| divide               | ✓               | ✓                | ✓                 | ✓                | ✓ (`Quo`)    |
+| truncate_divide      | ✓               | ✓ (`Truncate`)   | ✓                 | ✗                | ✓ (`QuoRem`) |
+| modulo               | ✓               | ✓ (`%`)          | ✓                 | ✗                | ✓ (`Rem`)    |
+| power (int exponent) | ✓               | ✗ (use Math.Pow) | ✓ (`powi`)        | ✗                | ✓ (`PowInt`) |
+| sqrt                 | ✓               | ✗                | ✓                 | ✗                | ✓            |
+| root (nth)           | ✓               | ✗                | ✗                 | ✗                | ✗            |
+| exp                  | ✓               | ✗                | ✓                 | ✗                | ✗            |
+| ln                   | ✓               | ✗                | ✓                 | ✗                | ✗            |
+| log10                | ✓               | ✗                | ✓ (`log10`)       | ✗                | ✗            |
+| log (arbitrary base) | ✓               | ✗                | ✗                 | ✗                | ✗            |
+| abs                  | ✓               | ✓                | ✓                 | ✓                | ✓            |
+| negate               | ✓               | ✓                | ✓                 | ✓                | ✓            |
+| round                | ✓               | ✓                | ✓                 | ✓                | ✓            |
+| quantize             | ✓               | ✗                | ✗                 | via round        | ✗            |
+| factorial            | ✓ (0–27 lookup) | ✗                | ✗                 | ✗                | ✗            |
+| min / max            | ✗               | ✓                | ✓                 | ✓                | ✓            |
+| normalize            | ✗               | ✗                | ✓                 | ✗                | ✗            |
+
+**Observation:** Our arithmetic coverage is the most complete among all five libraries. We are the
+only one with `root`, `log`, and `factorial`, and we match Rust on `exp`, `ln`, and `log10`. The
+gap is `min`/`max`, which every other library provides and we do not.
+
+---
+
+## 2. The Coefficient Bound Problem
+
+This is probably the biggest architectural concern I found. It affects performance, code complexity,
+and user-facing semantics.
+
+### 2.1 The Problem
+
+Our max coefficient is 2^96 − 1 = 79,228,162,514,264,337,593,543,950,335. This is a 29-digit
+number, but the leading digit can only be 0–7. The number 80,000,000,000,000,000,000,000,000,000
+(which has only 1 significant digit) is out of range. Meanwhile, all 28-digit numbers fit.
+
+This creates a messy boundary: after every arithmetic operation that might produce a wide result
+(multiplication, addition with carry, etc.), I need to check whether the coefficient exceeds
+2^96 − 1 and, if so, round it down. The rounding itself is non-trivial because the boundary is not
+at a clean decimal digit — I cannot just drop the last digit. The `truncate_to_max` function in
+`utility.mojo` handles this, and it is one of the most complex functions in the codebase.
+
+### 2.2 How Other Libraries Handle This
+
+#### C# System.Decimal — `ScaleResult()` (binary bound, same as us)
+
+.NET's approach is heavily optimized. The core function `ScaleResult()` in `Decimal.DecCalc.cs`:
+
+1. Estimates how many decimal digits to remove using `LeadingZeroCount` and the constant
+   `log10(2) ≈ 77/256`.
+2. Divides the wide result (stored in a `Buf24`, up to 192 bits) by powers of 10, using
+   `DivByConst()` specialized per constant (10^1 through 10^9) for maximum speed — on 64-bit
+   targets, these use compiler-generated multiply-by-reciprocal.
+3. Applies banker's rounding with a sticky bit for lost precision.
+4. If rounding causes a carry that pushes above 96 bits again, scales down by 10 one more time.
+
+Additional C# tricks:
+
+- `SearchScale()` — binary search using precomputed `OVFL_MAX_N_HI` constants to find the largest
+  safe scale-up factor.
+- `PowerOvflValues[]` — table of largest 96-bit values that won't overflow when multiplied by
+  10^1 through 10^8.
+- `Unscale()` — efficiently removes trailing zeros using binary search: try 10^8, 10^4, 10^2, 10^1,
+  with quick-reject bit checks (e.g., `(low & 0xF) == 0` before trying 10^4).
+- `OverflowUnscale()` — when quotient overflows by exactly 1 bit, feeds the carry back in and
+  divides by 10, avoiding a full rescale.
+
+The bottom line: .NET has hundreds of lines of intricate, heavily-optimized code just for this
+boundary handling. Every multiply that exceeds 96 bits pays for multi-word division.
+
+#### Rust rust_decimal — Port of .NET (binary bound, same as us)
+
+`rust_decimal` is essentially a Rust port of .NET's `DecCalc`. The `Buf24::rescale()` in
+`ops/common.rs` is the equivalent of `ScaleResult()`. Same `log10(2) × 256 = 77` trick, same
+`OVERFLOW_MAX_N_HI` constants, same `POWER_OVERFLOW_VALUES` table.
+
+One difference: `rust_decimal` returns `CalculationResult::Overflow` instead of throwing, letting
+the caller handle it.
+
+#### Apache Arrow Decimal128 — 256-bit promotion (decimal bound)
+
+Arrow sidesteps the problem entirely by capping at 10^38 − 1 instead of 2^127 − 1. The overflow
+check is just `abs(value) < 10^precision` — a single comparison against a precomputed constant.
+
+For multiplication that might overflow 128 bits, Arrow promotes to `int256_t` (Boost or compiler
+`__int128`-based), multiplies, scales down by `10^delta_scale` in one clean division, then converts
+back. This replaces .NET's iterative divide-and-round loop with a single wide multiplication +
+one division.
+
+The `FitsInPrecision(precision)` check is trivially a comparison against a table entry. No
+multi-step rescaling needed for the check itself.
+
+#### Go govalues/decimal — Two-tier fast path (decimal bound)
+
+`govalues/decimal` uses the most elegant approach. Max coefficient = 10^19 − 1 (fits in `uint64`).
+
+1. **Fast path:** Try the operation using native 64-bit arithmetic. If overflow detected (e.g.,
+   `z/y != x || z > maxFint`), fall through.
+2. **Slow path:** Redo with `big.Int` (arbitrary precision), compute exact result, then round to
+   19 digits using `rshHalfEven` (right-shift in decimal = divide by 10^N, round half-to-even).
+
+Since the bound IS a power of 10, the rounding is clean: just count digits, divide by 10^excess,
+round. No awkward non-decimal boundary to deal with.
+
+### 2.3 Comparison
+
+| Strategy            | Used by              | Bound check cost      | Overflow rounding cost                      | Code complexity |
+| ------------------- | -------------------- | --------------------- | ------------------------------------------- | --------------- |
+| Binary (2^96 − 1)   | C#, Rust, **Decimo** | Cheap (bit compare)   | **Expensive** (multi-word ÷10^N, iterative) | **High**        |
+| Decimal (10^38 − 1) | Arrow, SQL Server    | Cheap (one compare)   | Cheap (one wide ÷10^N)                      | Low             |
+| Decimal (10^19 − 1) | govalues/decimal     | Trivial (one compare) | Trivial (big.Int fallback + round)          | Very low        |
+
+### 2.4 Implications for Decimo
+
+Since we follow the C#/Rust paradigm (binary bound, 2^96 − 1), the `truncate_to_max` complexity
+is inherent. There are a few things I can do:
+
+1. **Adopt .NET's optimization tricks.** Specifically: the `log10(2) ≈ 77/256` estimation for
+   scale-down amount, precomputed `POWER_OVERFLOW_VALUES` table for safe scale-up, and the
+   `Unscale()` trailing-zero removal with quick-reject bit checks. These would make our existing
+   `truncate_to_max` and `round_to_keep_first_n_digits` much faster.
+
+2. **Consider decimal bound for a future Decimal256.** If I ever widen to full 128-bit coefficient,
+   using 10^38 − 1 as the max (matching Arrow and SQL Server) would eliminate this problem entirely
+   and give us interoperability with Arrow wire format and SQL `decimal(38)`.
+
+3. **Document the "29 digits but not 29 nines" behavior clearly.** Users should know that
+   "29 digits of precision" really means "28 full digits plus a leading digit 0–7".
+
+---
+
+## 3. Correctness Bugs
+
+### 3.1 CRITICAL: NaN/Infinity Implementation Is Broken
+
+**File:** `decimal128.mojo`
+
+There are multiple compounding bugs in the NaN/Infinity support:
+
+**Bug A — NaN mask mismatch:**
+
+```txt
+NAN_MASK = UInt32(0x00000002)    # bit 1
+```
+
+But the constructors set a different bit:
+
+```txt
+NAN()          → Self(0, 0, 0, 0x00000010)  # bit 4
+NEGATIVE_NAN() → Self(0, 0, 0, 0x80000010)  # bit 4
+```
+
+So `Decimal128.NAN().is_nan()` returns **False**.
+
+**Bug B — `is_zero()` returns True for NaN and Infinity:**
+
+```python
+fn is_zero(self) -> Bool:
+    return self.low == 0 and self.mid == 0 and self.high == 0
+```
+
+NaN and Infinity have zero coefficients, so `INFINITY().is_zero()` → **True**.
+
+**Bug C — No arithmetic propagation:** None of the arithmetic operations check for NaN or Infinity
+inputs. Passing them into `add()`, `multiply()`, etc., produces garbage silently.
+
+**Recommendation:** Since no comparable 128-bit fixed-precision library supports NaN or Infinity
+(see §1.2), I think the cleanest fix is to **remove NaN/Infinity support entirely** and raise
+errors for undefined operations (matching C#, Rust, Arrow, and govalues). If we decide to keep
+them, all three bugs must be fixed:
+
+- Fix A: Change `NAN()` → `Self(0, 0, 0, 0x00000002)` and `NEGATIVE_NAN()` →
+  `Self(0, 0, 0, 0x80000002)` to match the mask.
+- Fix B: Guard `is_zero()` with `if self.is_nan() or self.is_infinity(): return False`.
+- Fix C: Add early-return NaN/Inf checks to every arithmetic function (significant effort).
+
+### 3.2 MEDIUM: `from_words` Uses `testing.assert_true` in Production Code
+
+**File:** `decimal128.mojo`, lines ~344, 351
+
+```python
+from testing import assert_true
+assert_true(scale <= Self.MAX_SCALE, ...)
+assert_true(coefficient <= Self.MAX_AS_UINT128, ...)
+```
+
+This imports the test framework into production code. `assert_true` panics with a test failure
+message rather than raising a recoverable error. Users expect `raise Error(...)`.
+
+**Fix:**
+
+```python
+if scale > Self.MAX_SCALE:
+    raise Error("Error in Decimal128.from_words(): Scale must be <= 28, got " + String(scale))
+if coefficient > Self.MAX_AS_UINT128:
+    raise Error("Error in Decimal128.from_words(): Coefficient exceeds 96-bit max")
+```
+
+### 3.3 MEDIUM: Division Hardcodes Rounding Behavior
+
+**File:** `arithmetics.mojo`, inside `divide()`
+
+The long division loop always uses banker's rounding (HALF_EVEN). The function does not accept a
+rounding mode parameter.
+
+This matches C# and Rust behavior — both hardcode banker's rounding for the `/` operator. Arrow
+and govalues also default to HALF_EVEN for division. So this is consistent with all comparable
+libraries and not really a bug.
+
+If we ever want configurable rounding in division, we would add a
+`divide(x, y, rounding_mode)` overload. Low priority.
+
+### 3.4 LOW: `compare_absolute` Potential Overflow
+
+**File:** `comparison.mojo`
+
+When comparing fractional parts, the code computes:
+
+```python
+fractional_1 = UInt128(x1_frac_part) * UInt128(10) ** scale_diff
+```
+
+Since `scale_diff` can be up to 28 and `x1_frac_part` can be up to 2^96 − 1, the product can be
+as large as (2^96 − 1) × 10^28 ≈ 7.9 × 10^56. UInt128 max is ~3.4 × 10^38. This overflows.
+
+**Fix:** Use UInt256 for this computation, or normalize both values to the same scale before
+extracting integer and fractional parts.
+
+### 3.5 LOW: `is_one()` Might Be Incomplete
+
+**File:** Used in `exponential.mojo` (e.g., `log()` calls `x.is_one()`)
+
+I should verify `is_one()` handles all representations of 1: `1` (coef=1, scale=0), `1.0`
+(coef=10, scale=1), `1.00` (coef=100, scale=2), etc. If it only checks
+`coefficient == 1 and scale == 0`, it misses the other forms.
+
+---
+
+## 4. Performance Bottlenecks
+
+### 4.1 `number_of_bits()` Uses a Loop
+
+**File:** `utility.mojo`
+
+```python
+fn number_of_bits(n: UInt128) -> Int:
+    var count = 0
+    var x = n
+    while x > 0:
+        count += 1
+        x >>= 1
+    return count
+```
+
+O(n) in bit count — up to 96 iterations. C#'s `ScaleResult` uses `LeadingZeroCount` which is
+a single instruction on modern CPUs.
+
+**Fix:** Use `count_leading_zeros()` or `bit_width()` if available in Mojo. If not available for
+UInt128, split into two UInt64s and use CLZ on the high word.
+
+### 4.2 `power_of_10` Is Not Using Precomputed Constants Efficiently
+
+**File:** `utility.mojo`
+
+The function already has hardcoded return values for n=0 through n=32, but for n=33 through n=56+
+it falls back to `ValueType(10) ** n` which computes via loop.
+
+Since `power_of_10` is called in nearly every arithmetic operation (scale adjustment,
+`number_of_digits`, comparison, division), the fallback path matters.
+
+**Fix:** Extend the hardcoded constants up to n=58 (the maximum needed for UInt256 products of
+two 29-digit numbers). The pattern is already there for n≤32 — just continue it. For UInt256
+values too large for integer literals, use the `@always_inline` constant function pattern from
+`constants.mojo`.
+
+### 4.3 `truncate_to_max` Could Use .NET Tricks
+
+**File:** `utility.mojo`
+
+Our `truncate_to_max` computes `number_of_digits`, then divides by `power_of_10`, then rounds.
+.NET's `ScaleResult` uses:
+
+- `LeadingZeroCount` × `77/256` to estimate digit count without a full `number_of_digits` call
+- Division by compile-time constants (10^1 through 10^9) using multiply-by-reciprocal
+- `Unscale()` trailing-zero removal with bit-check quick-rejects
+
+These tricks could significantly speed up our overflow handling.
+
+### 4.4 `ln()` Range Reduction Uses Loops
+
+**File:** `exponential.mojo`
+
+The `ln()` function reduces input to [0.5, 2.0) by repeatedly dividing by 10, then by 2:
+
+```python
+while x_reduced > Decimal128(2, 0, 0, 0):
+    x_reduced = decimo.decimal128.arithmetics.divide(x_reduced, Decimal128(2, 0, 0, 0))
+    halving_count += 1
+```
+
+For `ln(1e28)`, this loop runs ~93 times (since 10^28 ≈ 2^93), each doing a full Decimal128
+division.
+
+**Fix:** Use the identity `ln(a × 10^n) = ln(a) + n × ln(10)` to strip the scale in one step.
+Then only the coefficient (which is in [1, 2^96)) needs halving, which is at most ~96 iterations
+but could also be reduced by computing the bit width and dividing by the appropriate power of 2
+in one shot.
+
+### 4.5 `subtract()` Creates a Temporary
+
+**File:** `arithmetics.mojo`
+
+```python
+def subtract(x1: Decimal128, x2: Decimal128) raises -> Decimal128:
+    return add(x1, negative(x2))
+```
+
+Creates a temporary just to flip the sign bit. Probably minor — Decimal128 is 16 bytes and should
+be in registers. Low priority unless profiling shows otherwise.
+
+### 4.6 Series Computations Cap at 500 Iterations
+
+**File:** `exponential.mojo`
+
+Both `exp_series` and `ln_series` loop up to 500 with convergence check `term.is_zero()`. In
+practice they converge in 30-60 iterations. The issue: `is_zero()` triggers only when the term
+underflows to exactly zero, which may require a few extra iterations beyond when the term is already
+too small to affect the result.
+
+**Fix:** Break early if the term is smaller than 10^(−29) — it cannot change the result at our
+precision.
+
+### 4.7 `from_string` Processes Digits One at a Time
+
+**File:** `decimal128.mojo`
+
+```python
+coef = coef * 10 + digit
+```
+
+Each iteration does a UInt128 multiply-by-10 and add.
+
+**Fix:** Batch up to 9 digits into a UInt64, then multiply `coef` by the appropriate power of 10
+and add the batch. Reduces 128-bit multiplications from ~29 to ~4 for a max-length number.
+
+### 4.8 Division Loop: Separate `//` and `%` Operations
+
+**File:** `arithmetics.mojo`
+
+```python
+digit = rem // x2_coef
+rem = rem % x2_coef
+```
+
+Two separate 128-bit divisions on the same operands. Most hardware produces both quotient and
+remainder in a single `div` instruction.
+
+**Fix:** Use `divmod()` if available in Mojo.
+
+---
+
+## 5. Improvement Opportunities
+
+### 5.1 Add `__hash__` Support
+
+C# and Rust both support hashing their decimal types. We should implement `__hash__` so Decimal128
+can be used as a dictionary key or in sets.
+
+**Approach:** Hash the normalized form (strip trailing zeros, then hash coefficient + scale + sign).
+
+### 5.2 Add `Stringable` / `Representable` Protocol Conformance
+
+Check that we conform to Mojo's `Stringable` and `Representable` traits for `str()` and `repr()`.
+
+### 5.3 Better `from_float` Accuracy
+
+The current `from_float` (Float64) path likely goes through string conversion. Consider exact float
+decomposition (extracting mantissa and exponent from IEEE 754 double bits).
+
+**Comparison:** Rust `rust_decimal` uses exact mantissa/exponent extraction from f64 bits.
+
+### 5.4 `min()` / `max()` / `clamp()`
+
+All four comparable libraries provide `min`/`max`. We should too. Easy to implement.
+
+### 5.5 Canonicalization / `normalize()`
+
+Strip trailing zeros: `1.200` (coef=1200, scale=3) → `1.2` (coef=12, scale=1). Useful for
+hashing (§5.1) and reducing coefficient size for faster subsequent arithmetic. Rust `rust_decimal`
+has `normalize()`.
+
+### 5.6 Wider Testing for Edge Cases
+
+I recommend adding test cases for:
+
+| Test Case                                    | Expected Behavior            |
+| -------------------------------------------- | ---------------------------- |
+| `from_words` with scale > 28                 | Error (not assertion panic)  |
+| `from_words` with coefficient > 2^96 − 1     | Error (not assertion panic)  |
+| `compare_absolute` with max scale difference | Correct result (no overflow) |
+| `is_one()` with 1.0, 1.00, 1.000             | True                         |
+| Max coefficient after multiply               | Correct rounding             |
+| 29-digit numbers near 2^96 − 1 boundary      | Correct truncate/round       |
+
+---
+
+## 6. Priority Summary
+
+| #   | Issue                                   | Severity    | Effort                        | Priority |
+| --- | --------------------------------------- | ----------- | ----------------------------- | -------- |
+| 3.1 | NaN/Inf broken (fix or remove)          | Critical    | Small (remove) / Medium (fix) | **P0**   |
+| 3.2 | `from_words` uses `testing.assert_true` | Medium      | Small                         | **P1**   |
+| 4.2 | `power_of_10` not fully precomputed     | High        | Small                         | **P1**   |
+| 4.3 | `truncate_to_max` lacks .NET tricks     | High        | Medium                        | **P1**   |
+| 3.4 | `compare_absolute` overflow             | Medium      | Small                         | **P2**   |
+| 4.1 | `number_of_bits` loop                   | Medium      | Small                         | **P2**   |
+| 4.8 | Separate `//` and `%` in division       | Medium      | Small                         | **P2**   |
+| 4.7 | `from_string` digit-by-digit            | Medium      | Medium                        | **P2**   |
+| 4.4 | `ln()` range reduction loops            | Medium      | Medium                        | **P2**   |
+| 5.4 | `min/max/clamp`                         | Enhancement | Trivial                       | **P3**   |
+| 5.5 | `normalize()`                           | Enhancement | Small                         | **P3**   |
+| 5.1 | `__hash__`                              | Enhancement | Small                         | **P3**   |
+| 5.6 | Edge case tests                         | Enhancement | Medium                        | **P3**   |
+| 4.5 | `subtract` temporary                    | Low         | Trivial                       | **P4**   |
+| 4.6 | Series convergence tolerance            | Low         | Small                         | **P4**   |
+| 3.3 | Division rounding mode (configurable)   | Low         | Medium                        | **P4**   |
+| 5.3 | Better `from_float`                     | Enhancement | Medium                        | **P4**   |
+| 5.2 | `Stringable` conformance                | Enhancement | Trivial                       | **P4**   |
+
+---
+
+## 7. Execution Order
+
+**Phase 1 — Critical correctness:**
+
+1. Decide: remove NaN/Infinity or fix them (§3.1). If removing, delete `NAN()`, `NEGATIVE_NAN()`,
+   `INFINITY()`, `NEGATIVE_INFINITY()`, `NAN_MASK`, `INFINITY_MASK`, `is_nan()`, `is_infinity()`
+   and change callers to raise errors. If fixing, apply fixes A/B/C from §3.1.
+2. Fix `from_words` to use `raise Error` instead of `testing.assert_true` (§3.2).
+3. Add edge case tests (§5.6).
+
+**Phase 2 — Performance (coefficient bound):**
+
+1. Extend `power_of_10` hardcoded constants up to n=58 (§4.2).
+2. Add .NET-style tricks to `truncate_to_max`: CLZ-based digit estimation, divide-by-constant
+   optimization, trailing-zero quick-reject (§4.3).
+3. Replace `number_of_bits` with hardware CLZ (§4.1).
+4. Fix `compare_absolute` overflow with UInt256 (§3.4).
+
+**Phase 3 — Performance (general):**
+
+1. Optimize `from_string` digit batching (§4.7).
+2. Use single divmod in division loop (§4.8).
+3. Improve `ln()` range reduction (§4.4).
+
+**Phase 4 — Enhancements:**
+
+1. Add `min/max/clamp` (§5.4).
+2. Add `normalize()` (§5.5).
+3. Add `__hash__` (§5.1).
+4. Better `from_float` (§5.3).
+
+---
+
+## Appendix A. Survey of 128-Bit Fixed-Precision Decimal Types
+
+> **Scope:** 128-bit (or near-128-bit) fixed-precision, non-floating-point decimal types across
+> programming languages and libraries. This explicitly **excludes** arbitrary-precision decimals
+> (Python `decimal.Decimal`, Java `BigDecimal`, Go `shopspring/decimal`, Go `cockroachdb/apd`) and
+> IEEE 754 decimal128 (which is a floating-point format with 34-digit significand, exponent range
+> −6176 to +6111, and NaN/Infinity/subnormals).
+
+### A.1 Detailed Comparison Table
+
+| Name                 | Language / Platform           | Total Bits                      | Coefficient Storage                              | Max Coefficient                                      | Max Sig. Digits | Scale Range                | NaN / ±Inf |
+| -------------------- | ----------------------------- | ------------------------------- | ------------------------------------------------ | ---------------------------------------------------- | --------------- | -------------------------- | ---------- |
+| **System.Decimal**   | C# / .NET CLR                 | 128                             | 96-bit unsigned (3×Int32: lo, mid, hi)           | 2^96 − 1 = 79,228,162,514,264,337,593,543,950,335    | 29              | 0–28                       | No         |
+| **rust_decimal**     | Rust (crate)                  | 128                             | 96-bit unsigned (3×u32: lo, mid, hi)             | 2^96 − 1 (same as C#)                                | 29              | 0–28                       | No         |
+| **VB.NET Decimal**   | VB.NET / .NET CLR             | 128                             | Identical to C# (same CLR type `System.Decimal`) | 2^96 − 1                                             | 29              | 0–28                       | No         |
+| **Arrow Decimal128** | Apache Arrow (cross-language) | 128                             | 128-bit two's complement signed integer          | 10^p − 1 (bounded by declared precision p, max p=38) | 38              | User-defined (any integer) | No         |
+| **govalues/decimal** | Go (module)                   | 128 (1 bool + 1 uint64 + 1 int) | 64-bit unsigned integer (uint64 coefficient)     | 10^19 − 1 = 9,999,999,999,999,999,999                | 19              | 0–19                       | No         |
+
+Other non-128-bit types for reference:
+
+| Name                          | Language / Platform    | Total Bits     | Coefficient Storage                    | Max Coefficient                                                                   | Max Sig. Digits   | Scale Range            | NaN / ±Inf |
+| ----------------------------- | ---------------------- | -------------- | -------------------------------------- | --------------------------------------------------------------------------------- | ----------------- | ---------------------- | ---------- |
+| **Swift Decimal** (NSDecimal) | Swift / Foundation     | 160 (20 bytes) | 128-bit mantissa (8×UInt16)            | Up to 38 decimal digits; mantissa is 128 bits but capped at 10^38 − 1 in practice | 38                | Exponent: −128 to +127 | NaN only   |
+| **SQL Server decimal(38)**    | T-SQL / SQL Server     | 136 (17 bytes) | 128-bit unsigned integer (4×Int32)     | 10^38 − 1 = 99,999,999,999,999,999,999,999,999,999,999,999,999                    | 38                | 0 to p (max 38)        | No         |
+| **Delphi Currency**           | Delphi / Object Pascal | 64             | 64-bit signed integer (scaled by 10^4) | 2^63 − 1 (max value 922,337,203,685,477.5807)                                     | 19 (4 fractional) | Fixed at 4             | No         |
+
+### A.2 Notes on Each Type
+
+#### C# System.Decimal (and VB.NET, F#)
+
+The .NET CLR `System.Decimal` is the reference design that Decimo Decimal128, Rust `rust_decimal`,
+and several others copy. Its 128-bit layout packs a 96-bit unsigned coefficient into three 32-bit
+words (`lo`, `mid`, `hi`), with a 32-bit flags word encoding: sign in bit 31, scale in bits 16–23
+(value 0–28), and bits 0–15 and 24–30 reserved (must be zero).
+
+Constructor: `Decimal(Int32 lo, Int32 mid, Int32 hi, Boolean isNegative, Byte scale)`.
+
+Max value: ±79,228,162,514,264,337,593,543,950,335 (= 2^96 − 1). This is **not** a round decimal
+number — the upper bound is a power-of-2 boundary, not 10^29 − 1.
+
+No NaN, no Infinity. `INumberBase<Decimal>.IsNaN()` always returns `false`.
+
+#### Rust rust_decimal
+
+Mirrors the C# layout exactly: `lo: u32, mid: u32, hi: u32` for the 96-bit coefficient, flags word
+with sign + scale. `MAX = 79_228_162_514_264_337_593_543_950_335`. Serializes to 16 bytes
+(4 bytes flags + 12 bytes coefficient). The `from_parts(lo, mid, hi, negative, scale)` constructor
+matches C# directly.
+
+No NaN, no Infinity. The `MathematicalOps` trait adds `sqrt()`, `exp()`, `ln()`, `powi()`, etc.
+
+#### Apache Arrow Decimal128
+
+Fundamentally different from the C#/Rust design. Arrow Decimal128 stores the value as a **128-bit
+two's complement signed integer**, not a 96-bit unsigned coefficient. The value represents
+`integer_value / 10^scale`, where `precision` (1–38) and `scale` are declared in the schema.
+
+The max representable coefficient for `decimal128(38, 0)` is 10^38 − 1 (= 38 nines), which is
+below the signed 128-bit maximum 2^127 − 1 ≈ 1.7 × 10^38. Arrow deliberately caps at 10^precision − 1 rather
+than using the full bit range, to ensure consistent decimal digit semantics.
+
+Schema definition (Apache Arrow `Schema.fbs`):
+
+```txt
+table Decimal { precision: int; scale: int; bitWidth: int = 128; }
+```
+
+No NaN, no Infinity. Each column has a single fixed precision and scale.
+
+#### Swift Foundation Decimal (NSDecimal)
+
+Not truly 128-bit — the struct is **160 bits (20 bytes)**. Layout:
+
+- `exponent: Int8` (−128 to +127)
+- `lengthFlagsAndReserved: UInt8` (4-bit length, 1-bit isNegative, 1-bit isCompact, 2-bit reserved)
+- `reserved: UInt16`
+- `mantissa: (UInt16, UInt16, UInt16, UInt16, UInt16, UInt16, UInt16, UInt16)` — 8×UInt16 = 128 bits
+
+The 128-bit mantissa can theoretically hold values up to 2^128 − 1, but the `_length` field
+(4 bits, max 15) indicates how many of the 8 UInt16 slots are used, and Apple documents the max as
+38 significant decimal digits (i.e., effectively capped at 10^38 − 1).
+
+Unlike C# and Rust, Swift Decimal **supports NaN** (`isNaN` property). It does NOT support Infinity
+in practice — the `isInfinite` property exists (inherited from `FloatingPoint` protocol) but Apple's
+implementation does not produce or handle Infinity values meaningfully.
+
+#### SQL Server decimal / numeric
+
+SQL Server `decimal(p, s)` with max precision 38. Storage size varies by precision:
+
+| Precision | Storage Bytes |
+| --------- | ------------- |
+| 1–9       | 5             |
+| 10–19     | 9             |
+| 20–28     | 13            |
+| 29–38     | 17            |
+
+At precision 29–38, the storage is 17 bytes: 1 byte for sign + 16 bytes (128 bits) for the
+unsigned integer coefficient. The max value is 10^38 − 1 (38 nines). This is a decimal-bounded
+maximum, not a binary one. Valid values range from `-(10^p - 1)` to `+(10^p - 1)`.
+
+No NaN, no Infinity. `decimal` and `numeric` are synonyms; both are fixed precision and scale.
+
+#### Go govalues/decimal
+
+A high-performance, zero-allocation decimal designed for financial systems. Internally uses a
+`uint64` coefficient (max 10^19 − 1 = 9,999,999,999,999,999,999) with 19-digit precision and
+scale 0–19. The struct fits in 128 bits total (bool sign + uint64 coefficient + int scale, though
+Go struct layout may pad slightly).
+
+No NaN, no Infinity, no negative zero, no subnormals. Immutable, panic-free (returns errors).
+Uses half-to-even rounding by default. Falls back to `big.Int` for intermediate calculations to
+maintain correctness, but final results are always rounded to 19 digits.
+
+#### Delphi Currency
+
+Only 64 bits, included for completeness. A 64-bit signed integer scaled by 10^4 (i.e., always
+exactly 4 decimal places). Max value: 922,337,203,685,477.5807. Not truly 128-bit, but it is a
+notable example of a fixed-point decimal type in the wild.
+
+### A.3 Eliminated Candidates
+
+The following were investigated but **excluded** because they are either arbitrary-precision or not
+decimal types at all (i.e., not fixed-precision decimals within 128 bits):
+
+| Name                   | Language   | Reason for Exclusion                                                                                                           |
+| ---------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------ |
+| **shopspring/decimal** | Go         | Arbitrary precision; uses `math/big.Int` internally. No fixed bit width.                                                       |
+| **cockroachdb/apd**    | Go         | Arbitrary precision; `Decimal.Coeff` is a `BigInt` (wrapper around `big.Int`). Implements the General Decimal Arithmetic spec. |
+| **PostgreSQL numeric** | PostgreSQL | Variable-length arbitrary precision. Stored as variable-length array of base-10000 digits. No 128-bit limit.                   |
+| **GCC __int128**       | C/C++      | A 128-bit integer, not a decimal type. Could be used to build a decimal library but is not one itself.                         |
+
+### A.4 Critical Analysis: Coefficient Upper Bound Approaches
+
+There are **two fundamentally different approaches** to bounding the coefficient in fixed-precision
+decimal types:
+
+#### Approach 1: Binary Bound (2^N − 1)
+
+**Used by:** C# System.Decimal, Rust rust_decimal, Decimo Decimal128
+
+The coefficient is an N-bit unsigned integer, and the maximum value is the full binary range
+2^N − 1. For 96-bit coefficients:
+
+- Max = 2^96 − 1 = **79,228,162,514,264,337,593,543,950,335**
+- This is a 29-digit number, but the leading digit can only be 0–7 (since 10^29 − 1 > 2^96 − 1)
+- Consequence: The first significant digit is constrained. You get 29 digits only when the leading
+  digit is ≤ 7. You get the full 0–9 range only for 28-digit numbers.
+
+**Pros:**
+
+- Natural fit for hardware — the coefficient is just a native (multi-word) integer
+- Simple bounds check: just compare against 2^96 − 1
+- Slightly larger range than 10^28 − 1 (about 7.9× more values in the 29th digit range)
+
+**Cons:**
+
+- The "truncate-to-max" problem: when an operation produces a coefficient > 2^96 − 1, you must
+  either raise an error or round. The boundary is not at a clean decimal digit boundary, which
+  makes rounding semantics awkward. E.g., 80,000,000,000,000,000,000,000,000,000 (8×10^28) is
+  out of range, even though it has only 1 significant digit.
+- Non-uniform digit range: the 29th digit has range 0–7, not 0–9. This is confusing for users.
+
+#### Approach 2: Decimal Bound (10^p − 1)
+
+**Used by:** Apache Arrow Decimal128, SQL Server decimal(38), Swift Decimal, govalues/decimal
+
+The coefficient is bounded by 10^p − 1, where p is the declared precision. Even if the underlying
+storage has more bits available, values above 10^p − 1 are not representable.
+
+For Arrow/SQL precision 38:
+
+- Max = 10^38 − 1 = **99,999,999,999,999,999,999,999,999,999,999,999,999** (38 nines)
+- Fits in 128 bits (10^38 − 1 < 2^127 − 1), with 127 of the 128 bits used
+- Every digit position has the full 0–9 range
+
+For govalues/decimal precision 19:
+
+- Max = 10^19 − 1 = **9,999,999,999,999,999,999** (19 nines)
+- Fits in a single uint64 (10^19 − 1 < 2^64 − 1), with ~63.1 of the 64 bits used
+
+**Pros:**
+
+- Clean decimal semantics: every digit position has the full 0–9 range
+- No "truncate-to-max" confusion at non-decimal boundaries
+- Precision is exactly p significant decimal digits, no asterisks
+- Easier to reason about for financial applications
+
+**Cons:**
+
+- Wastes some of the available bit range (Arrow uses ~126.2 of 128 bits; govalues uses ~63.1 of 64)
+- Bounds checking requires a comparison against a decimal constant, not a simple overflow check
+
+#### Implications for Decimo
+
+Decimo follows the C#/Rust approach (binary bound, 2^96 − 1). This means:
+
+1. **The `truncate_to_max` style bound-checking IS a concern:** When multiplying two 29-digit
+   numbers, the intermediate product can have up to 58 digits. If the result after scale adjustment
+   still exceeds 2^96 − 1, we must handle it. The current behavior should be documented: do we
+   raise an error, or do we round to fit?
+
+2. **The non-uniform 29th digit** should be documented. Users may expect that "29 digits of
+   precision" means they can represent any 29-digit number, but
+   `99,999,999,999,999,999,999,999,999,999` (29 nines, 10^29 − 1) exceeds 2^96 − 1 and is therefore
+   out of range. The actual guarantee is "28 full digits plus a leading digit 0–7".
+
+3. **If we ever consider a Decimal256 or widen to full 128-bit coefficient:** We should evaluate
+   whether to switch to the decimal-bounded approach (10^38 − 1 with 128-bit storage, matching
+   Arrow/SQL) vs. staying with binary bound (2^128 − 1, giving ~38.5 digits with non-uniform
+   leading digit). The Arrow/SQL approach would give us exact compatibility with SQL Server
+   `decimal(38)` and Arrow `Decimal128` wire format, which is a significant interoperability
+   advantage.
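
The bound arithmetic in A.4 is easy to check with arbitrary-precision integers. The following Python sketch is only an illustration: the helper names `fits_binary_bound` and `fits_decimal_bound` are hypothetical and are not Decimo APIs. It confirms the 29-digit maximum, the "28 full digits" guarantee, and how many bits the decimal bounds actually need.

```python
# Illustrative check of the coefficient bounds discussed in A.4.
# Helper names are hypothetical; they are not part of Decimo.

MAX_96 = 2**96 - 1  # binary bound used by C#, rust_decimal, and Decimo


def fits_binary_bound(coefficient: int) -> bool:
    """True if the coefficient fits in an unsigned 96-bit integer."""
    return 0 <= coefficient <= MAX_96


def fits_decimal_bound(coefficient: int, precision: int) -> bool:
    """True if the coefficient has at most `precision` decimal digits."""
    return 0 <= coefficient < 10**precision


print(MAX_96)                         # 79228162514264337593543950335
print(len(str(MAX_96)))               # 29 digits
print(fits_binary_bound(10**28 - 1))  # True: every 28-digit number fits
print(fits_binary_bound(8 * 10**28))  # False: a 29-digit number starting with 8
print((10**38 - 1).bit_length())      # 127 bits needed for 38 nines
print((10**19 - 1).bit_length())      # 64 bits needed for 19 nines
```

Both checks are a single comparison. The cost difference between the two approaches only shows up when an out-of-range result has to be rounded back into range.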

From 66e69bfbeb64bd605dd163f465da0869e13fc369 Mon Sep 17 00:00:00 2001
From: ZHU Yuhao <dr.yuhao.zhu@outlook.com>
Date: Wed, 8 Apr 2026 21:47:55 +0200
Subject: [PATCH 2/5] Update doc

---
 docs/plans/decimal128_enhancement.md | 418 +++++++++++----------------
 1 file changed, 162 insertions(+), 256 deletions(-)
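
The `ScaleResult()`-style steps summarized in §2.2 of this revision (estimate the digits to drop with `77/256`, divide, round half to even, drop one more digit if the carry overflows) can be sketched in Python. This is a simplified illustration under assumptions, not .NET's or Decimo's code: `scale_to_fit` is a hypothetical name, it works on Python integers rather than 32-bit limbs, and it re-rounds from the original value instead of tracking a sticky bit.

```python
# Simplified sketch of scaling a wide intermediate result back into 96 bits.
# Hypothetical helper; not the .NET or Decimo implementation.

MAX_96 = 2**96 - 1


def scale_to_fit(wide: int, scale: int) -> tuple[int, int]:
    """Drop low decimal digits (round half to even) until `wide` fits in 96 bits.

    Returns the new (coefficient, scale). Raises OverflowError when the scale
    cannot absorb the digits that would have to be dropped.
    """
    excess_bits = wide.bit_length() - 96
    if excess_bits <= 0:
        return wide, scale  # already fits

    # log10(2) ~= 77/256: a lower estimate of the decimal digits to remove.
    drop = (((excess_bits - 1) * 77) >> 8) + 1

    while True:
        if drop > scale:
            raise OverflowError("result does not fit even at scale 0")
        power = 10**drop
        quotient, remainder = divmod(wide, power)
        half = power // 2
        # Round half to even (banker's rounding).
        if remainder > half or (remainder == half and quotient & 1):
            quotient += 1
        if quotient <= MAX_96:
            return quotient, scale - drop
        drop += 1  # estimate too low, or rounding carried past 96 bits


# Rounding MAX_96.5 up would carry past 96 bits, so one more digit is dropped.
print(scale_to_fit(MAX_96 * 10 + 5, 5))
```

Because the estimate never removes more digits than necessary, the loop body usually runs once or twice. A production version would replace `divmod` on big integers with fixed-width division by precomputed powers of ten.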

diff --git a/docs/plans/decimal128_enhancement.md b/docs/plans/decimal128_enhancement.md
index 16af1106..71c3fb88 100644
--- a/docs/plans/decimal128_enhancement.md
+++ b/docs/plans/decimal128_enhancement.md
@@ -1,23 +1,15 @@
 # Decimal128 Enhancement Plan
 
-> **Date**: 2026-04-08  
-> **Target**: decimo >=0.9.0  
-> **Mojo Version**: >=0.26.2  
+> **Date**: 2026-04-08
+> **Target**: decimo >=0.9.0
+> **Mojo Version**: >=0.26.2
 >
-> 子曰工欲善其事必先利其器  
+> 子曰工欲善其事必先利其器
 > The mechanic, who wishes to do his work well, must first sharpen his tools -- Confucius
 
-This document is a thorough audit of the `src/decimo/decimal128/` module. I compared our
-implementation against other 128-bit fixed-precision decimal libraries — C# `System.Decimal`,
-Rust `rust_decimal`, Apache Arrow `Decimal128`, and Go `govalues/decimal` — and recorded
-everything I found: correctness bugs, performance bottlenecks, and improvement opportunities.
+I did a thorough audit of `src/decimo/decimal128/` and compared it against other 128-bit fixed-precision decimal libraries — C# [`System.Decimal`](https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Decimal.DecCalc.cs), Rust [`rust_decimal`](https://github.com/paupino/rust-decimal), Apache Arrow [`Decimal128`](https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/basic_decimal.h), and Go [`govalues/decimal`](https://github.com/govalues/decimal). This document records everything I found: correctness bugs, performance bottlenecks, and improvement opportunities.
 
-**Scope:** Only 128-bit (or near-128-bit) fixed-precision, non-floating-point decimal types.
-Arbitrary-precision decimals (Python `decimal.Decimal`, Java `BigDecimal`) are out of scope here —
-they are covered by `BigDecimal`. IEEE 754 decimal128 is also out of scope — it is a floating-point
-format with discontinuous representation, not comparable to our fixed-point design.
-
----
+Scope: only 128-bit (or near-128-bit) fixed-precision, non-floating-point decimal types. Arbitrary-precision decimals (Python `decimal.Decimal`, Java `BigDecimal`) are out of scope — they are covered by `BigDecimal`. IEEE 754 decimal128 is also out of scope — it is a floating-point format with discontinuous representation, not comparable to our fixed-point design.
 
 ## 1. Cross-Language Comparison
 
@@ -34,29 +26,21 @@ format with discontinuous representation, not comparable to our fixed-point desi
 | Sign storage           | Bit 31 of flags      | Bit 31 of flags      | Bit 31 of flags      | Two's complement           | Bool field          |
 | Endianness             | Little-endian        | Little-endian        | Little-endian        | Platform-native            | N/A                 |
 
-\* 29 digits, but the leading digit can only be 0–7 (since 10^29 − 1 > 2^96 − 1). See §2 for why
-this matters.
+\* 29 digits, but the leading digit can only be 0–7 (since 10^29 − 1 > 2^96 − 1). This boundary is messy and difficult to handle. I think the current implementation is not well optimized; it needs to be checked and refined.
 
-**Observation:** Decimo, C#, and Rust share the same layout — a proven design. Arrow and govalues
-use a fundamentally different approach with decimal-bounded coefficients (10^p − 1 instead of
-2^N − 1), which gives them cleaner digit semantics at the cost of unused bit range.
+Decimo, C#, and Rust share the same layout — a proven design. Arrow and govalues use a fundamentally different approach with decimal-bounded coefficients (10^p − 1 instead of 2^N − 1), which gives them cleaner digit semantics at the cost of unused bit range.
 
 ### 1.2 Special Values
 
-| Feature       | Decimo                | C#         | Rust rust_decimal | Arrow Decimal128 | Go govalues/decimal |
-| ------------- | --------------------- | ---------- | ----------------- | ---------------- | ------------------- |
-| +Infinity     | ✓ (broken — see §3.1) | ✗ (throws) | ✗                 | ✗                | ✗                   |
-| −Infinity     | ✓ (broken)            | ✗          | ✗                 | ✗                | ✗                   |
-| NaN           | ✓ (broken — see §3.1) | ✗          | ✗                 | ✗                | ✗                   |
-| Negative zero | ✗                     | ✗          | ✗                 | ✗                | ✗                   |
-| Subnormals    | ✗                     | ✗          | ✗                 | ✗                | ✗                   |
-
-**Observation:** None of the comparable 128-bit fixed-precision libraries support NaN or Infinity.
-We are the only one, and our implementation is broken (§3.1). I think we should seriously consider
-removing NaN/Infinity support to match the established paradigm — all four comparable libraries
-simply throw or return an error for undefined operations. If we keep them, the bugs must be fixed
-and full arithmetic propagation must be added, which is a significant effort for a feature no peer
-library provides.
+| Feature       | Decimo                | C#         | Rust | Arrow | Go govalues |
+| ------------- | --------------------- | ---------- | ---- | ----- | ----------- |
+| +Infinity     | ✓ (broken — see §3.1) | ✗ (throws) | ✗    | ✗     | ✗           |
+| −Infinity     | ✓ (broken)            | ✗          | ✗    | ✗     | ✗           |
+| NaN           | ✓ (broken — see §3.1) | ✗          | ✗    | ✗     | ✗           |
+| Negative zero | ✗                     | ✗          | ✗    | ✗     | ✗           |
+| Subnormals    | ✗                     | ✗          | ✗    | ✗     | ✗           |
+
+None of the comparable 128-bit fixed-precision libraries support NaN or Infinity. We are the only one, and our implementation is broken (§3.1). Maybe consider removing NaN/Infinity support to match the established paradigm — all four comparable libraries simply throw or return an error for undefined operations. If we keep them, the bugs must be fixed and full arithmetic propagation must be added, which is a lot of effort for a feature no peer library provides.
 
 ### 1.3 Rounding Modes
 
@@ -70,154 +54,126 @@ library provides.
 | CEILING              | ✓      | ✓ (`ToPositiveInfinity`) | ✓           | ✓ (`CEILING`)   | ✓ (`Ceiling`)  |
 | FLOOR                | ✓      | ✓ (`ToNegativeInfinity`) | ✓           | ✓ (`FLOOR`)     | ✓ (`Floor`)    |
 
-**Observation:** All five libraries (including us) support these 7 rounding modes. We are on par.
+All five libraries (including us) support these 7 rounding modes. We are on par.
 
 ### 1.4 Arithmetic Coverage
 
-| Operation            | Decimo          | C#               | Rust rust_decimal | Arrow Decimal128 | Go govalues  |
-| -------------------- | --------------- | ---------------- | ----------------- | ---------------- | ------------ |
-| add                  | ✓               | ✓                | ✓                 | ✓                | ✓            |
-| subtract             | ✓ (via add)     | ✓                | ✓                 | ✓                | ✓            |
-| multiply             | ✓               | ✓                | ✓                 | ✓                | ✓            |
-| divide               | ✓               | ✓                | ✓                 | ✓                | ✓ (`Quo`)    |
-| truncate_divide      | ✓               | ✓ (`Truncate`)   | ✓                 | ✗                | ✓ (`QuoRem`) |
-| modulo               | ✓               | ✓ (`%`)          | ✓                 | ✗                | ✓ (`Rem`)    |
-| power (int exponent) | ✓               | ✗ (use Math.Pow) | ✓ (`powi`)        | ✗                | ✓ (`PowInt`) |
-| sqrt                 | ✓               | ✗                | ✓                 | ✗                | ✓            |
-| root (nth)           | ✓               | ✗                | ✗                 | ✗                | ✗            |
-| exp                  | ✓               | ✗                | ✓                 | ✗                | ✗            |
-| ln                   | ✓               | ✗                | ✓                 | ✗                | ✗            |
-| log10                | ✓               | ✗                | ✗                 | ✗                | ✗            |
-| log (arbitrary base) | ✓               | ✗                | ✗                 | ✗                | ✗            |
-| abs                  | ✓               | ✓                | ✓                 | ✓                | ✓            |
-| negate               | ✓               | ✓                | ✓                 | ✓                | ✓            |
-| round                | ✓               | ✓                | ✓                 | ✓                | ✓            |
-| quantize             | ✓               | ✗                | ✗                 | via round        | ✗            |
-| factorial            | ✓ (0–27 lookup) | ✗                | ✗                 | ✗                | ✗            |
-| min / max            | ✗               | ✓                | ✓                 | ✓                | ✓            |
-| normalize            | ✗               | ✗                | ✓                 | ✗                | ✗            |
-
-**Observation:** Our arithmetic coverage is the most complete among all five libraries — we are the
-only one with `root`, `log10`, `log`, and `factorial`. Matching Rust on `exp` and `ln`. The gap is
-`min`/`max` which every other library provides and we do not.
-
----
+| Operation            | Decimo          | C#               | Rust       | Arrow     | Go govalues  |
+| -------------------- | --------------- | ---------------- | ---------- | --------- | ------------ |
+| add                  | ✓               | ✓                | ✓          | ✓         | ✓            |
+| subtract             | ✓ (via add)     | ✓                | ✓          | ✓         | ✓            |
+| multiply             | ✓               | ✓                | ✓          | ✓         | ✓            |
+| divide               | ✓               | ✓                | ✓          | ✓         | ✓ (`Quo`)    |
+| truncate_divide      | ✓               | ✓ (`Truncate`)   | ✓          | ✗         | ✓ (`QuoRem`) |
+| modulo               | ✓               | ✓ (`%`)          | ✓          | ✗         | ✓ (`Rem`)    |
+| power (int exponent) | ✓               | ✗ (use Math.Pow) | ✓ (`powi`) | ✗         | ✓ (`PowInt`) |
+| sqrt                 | ✓               | ✗                | ✓          | ✗         | ✓            |
+| root (nth)           | ✓               | ✗                | ✗          | ✗         | ✗            |
+| exp                  | ✓               | ✗                | ✓          | ✗         | ✗            |
+| ln                   | ✓               | ✗                | ✓          | ✗         | ✗            |
+| log10                | ✓               | ✗                | ✗          | ✗         | ✗            |
+| log (arbitrary base) | ✓               | ✗                | ✗          | ✗         | ✗            |
+| abs                  | ✓               | ✓                | ✓          | ✓         | ✓            |
+| negate               | ✓               | ✓                | ✓          | ✓         | ✓            |
+| round                | ✓               | ✓                | ✓          | ✓         | ✓            |
+| quantize             | ✓               | ✗                | ✗          | via round | ✗            |
+| factorial            | ✓ (0–27 lookup) | ✗                | ✗          | ✗         | ✗            |
+| min / max            | ✗               | ✓                | ✓          | ✓         | ✓            |
+| normalize            | ✗               | ✗                | ✓          | ✗         | ✗            |
+
+Our arithmetic coverage is the most complete among all five libraries: we are the only one with `root`, `log10`, `log`, and `factorial`, and we match Rust on `exp` and `ln`. The gap is `min`/`max`, which every other library provides and we do not.
 
 ## 2. The Coefficient Bound Problem
 
-This is probably the biggest architectural concern I found. It affects performance, code complexity,
-and user-facing semantics.
+This is probably the biggest architectural concern I found. It affects performance, code complexity, and user-facing semantics.
 
 ### 2.1 The Problem
 
-Our max coefficient is 2^96 − 1 = 79,228,162,514,264,337,593,543,950,335. This is a 29-digit
-number, but the leading digit can only be 0–7. The number 80,000,000,000,000,000,000,000,000,000
-(which has only 2 significant digits) is out of range. Meanwhile, all 28-digit numbers fit.
+Our max coefficient is 2^96 − 1 = 79,228,162,514,264,337,593,543,950,335. This is a 29-digit number, but the leading digit can only be 0–7. The number 80,000,000,000,000,000,000,000,000,000 (which has only 1 significant digit) is out of range. Meanwhile, all 28-digit numbers fit.
 
-This creates a messy boundary: after every arithmetic operation that might produce a wide result
-(multiplication, addition with carry, etc.), I need to check whether the coefficient exceeds
-2^96 − 1 and, if so, round it down. The rounding itself is non-trivial because the boundary is not
-at a clean decimal digit — I cannot just drop the last digit. The `truncate_to_max` function in
-`utility.mojo` handles this, and it is one of the most complex functions in the codebase.
+This creates a messy boundary: after every arithmetic operation that might produce a wide result (multiplication, addition with carry, etc.), I need to check whether the coefficient exceeds 2^96 − 1 and, if so, round it down. The rounding itself is non-trivial because the boundary is not at a clean decimal digit — I cannot just drop the last digit. The `truncate_to_max` function in `utility.mojo` handles this, and it is one of the most complex functions in the codebase.
 
 ### 2.2 How Other Libraries Handle This
 
-#### C# System.Decimal — `ScaleResult()` (binary bound, same as us)
+#### C# System.Decimal — [`ScaleResult()`](https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Decimal.DecCalc.cs) (binary bound, same as us)
 
 .NET's approach is heavily optimized. The core function `ScaleResult()` in `Decimal.DecCalc.cs`:
 
-1. Estimates how many decimal digits to remove using `LeadingZeroCount` and the constant
-   `log10(2) ≈ 77/256`.
-2. Divides the wide result (stored in a `Buf24`, up to 192 bits) by powers of 10, using
-   `DivByConst()` specialized per constant (10^1 through 10^9) for maximum speed — on 64-bit
-   targets, these use compiler-generated multiply-by-reciprocal.
+1. Estimates how many decimal digits to remove using `LeadingZeroCount` and the constant `log10(2) ≈ 77/256`.
+2. Divides the wide result (stored in a `Buf24`, up to 192 bits) by powers of 10, using `DivByConst()` specialized per constant (10^1 through 10^9) for maximum speed — on 64-bit targets, these use compiler-generated multiply-by-reciprocal.
 3. Applies banker's rounding with a sticky bit for lost precision.
 4. If rounding causes a carry that pushes above 96 bits again, scales down by 10 one more time.
 
 Additional C# tricks:
 
-- `SearchScale()` — binary search using precomputed `OVFL_MAX_N_HI` constants to find the largest
-  safe scale-up factor.
-- `PowerOvflValues[]` — table of largest 96-bit values that won't overflow when multiplied by
-  10^1 through 10^8.
-- `Unscale()` — efficiently removes trailing zeros using binary search: try 10^8, 10^4, 10^2, 10^1,
-  with quick-reject bit checks (e.g., `(low & 0xF) == 0` before trying 10^4).
-- `OverflowUnscale()` — when quotient overflows by exactly 1 bit, feeds the carry back in and
-  divides by 10, avoiding a full rescale.
+- `SearchScale()` — binary search using precomputed `OVFL_MAX_N_HI` constants to find the largest safe scale-up factor.
+- `PowerOvflValues[]` — table of largest 96-bit values that won't overflow when multiplied by 10^1 through 10^8.
+- `Unscale()` — efficiently removes trailing zeros using binary search: try 10^8, 10^4, 10^2, 10^1, with quick-reject bit checks (e.g., `(low & 0xF) == 0` before trying 10^4).
+- `OverflowUnscale()` — when quotient overflows by exactly 1 bit, feeds the carry back in and divides by 10, avoiding a full rescale.
 
-The bottom line: .NET has hundreds of lines of intricate, heavily-optimized code just for this
-boundary handling. Every multiply that exceeds 96 bits pays for multi-word division.
+The bottom line: .NET has hundreds of lines of intricate, heavily-optimized code just for this boundary handling. Every multiply that exceeds 96 bits pays for multi-word division.
 
-#### Rust rust_decimal — Port of .NET (binary bound, same as us)
+#### Rust rust_decimal — [Port of .NET](https://github.com/paupino/rust-decimal/blob/master/src/ops/common.rs) (binary bound, same as us)
 
-`rust_decimal` is essentially a Rust port of .NET's `DecCalc`. The `Buf24::rescale()` in
-`ops/common.rs` is the equivalent of `ScaleResult()`. Same `log10(2) × 256 = 77` trick, same
-`OVERFLOW_MAX_N_HI` constants, same `POWER_OVERFLOW_VALUES` table.
+`rust_decimal` is essentially a Rust port of .NET's `DecCalc`. The `Buf24::rescale()` in `ops/common.rs` is the equivalent of `ScaleResult()`. Same `log10(2) × 256 = 77` trick, same `OVERFLOW_MAX_N_HI` constants, same `POWER_OVERFLOW_VALUES` table.
 
-One difference: `rust_decimal` returns `CalculationResult::Overflow` instead of throwing, letting
-the caller handle it.
+One difference: `rust_decimal` returns `CalculationResult::Overflow` instead of throwing, letting the caller handle it.
 
-#### Apache Arrow Decimal128 — 256-bit promotion (decimal bound)
+#### Apache Arrow Decimal128 — [256-bit promotion](https://github.com/apache/arrow/blob/main/cpp/src/gandiva/precompiled/decimal_ops.cc) (decimal bound)
 
-Arrow sidesteps the problem entirely by capping at 10^38 − 1 instead of 2^128 − 1. The overflow
-check is just `abs(value) < 10^precision` — a single comparison against a precomputed constant.
+Arrow sidesteps the problem entirely by capping at 10^38 − 1 instead of 2^128 − 1. The overflow check is just `abs(value) < 10^precision` — a single comparison against a precomputed constant.
 
-For multiplication that might overflow 128 bits, Arrow promotes to `int256_t` (Boost or compiler
-`__int128`-based), multiplies, scales down by `10^delta_scale` in one clean division, then converts
-back. This replaces .NET's iterative divide-and-round loop with a single wide multiplication +
-one division.
+For multiplication that might overflow 128 bits, Arrow promotes to `int256_t` (Boost or compiler `__int128`-based), multiplies, scales down by `10^delta_scale` in one clean division, then converts back. This replaces .NET's iterative divide-and-round loop with a single wide multiplication + one division.
 
-The `FitsInPrecision(precision)` check is trivially a comparison against a table entry. No
-multi-step rescaling needed for the check itself.
+The `FitsInPrecision(precision)` check is trivially a comparison against a table entry. No multi-step rescaling needed for the check itself.
 
-#### Go govalues/decimal — Two-tier fast path (decimal bound)
+#### Go govalues/decimal — [Two-tier fast path](https://github.com/govalues/decimal/blob/main/decimal.go) (decimal bound)
 
 `govalues/decimal` uses the most elegant approach. Max coefficient = 10^19 − 1 (fits in `uint64`).
 
-1. **Fast path:** Try the operation using native 64-bit arithmetic. If overflow detected (e.g.,
-   `z/y != x || z > maxFint`), fall through.
-2. **Slow path:** Redo with `big.Int` (arbitrary precision), compute exact result, then round to
-   19 digits using `rshHalfEven` (right-shift in decimal = divide by 10^N, round half-to-even).
+1. Fast path: try the operation using native 64-bit arithmetic. If overflow is detected (e.g., `z/y != x || z > maxFint`), fall through to the slow path.
+2. Slow path: redo with `big.Int` (arbitrary precision), compute exact result, then round to 19 digits using `rshHalfEven` (right-shift in decimal = divide by 10^N, round half-to-even).
 
-Since the bound IS a power of 10, the rounding is clean: just count digits, divide by 10^excess,
-round. No awkward non-decimal boundary to deal with.
+Since the bound sits at a power-of-10 boundary (10^19 − 1), the rounding is clean: just count digits, divide by 10^excess, round. No awkward non-decimal boundary to deal with.
 
 ### 2.3 Comparison
 
-| Strategy            | Used by              | Bound check cost      | Overflow rounding cost                      | Code complexity |
-| ------------------- | -------------------- | --------------------- | ------------------------------------------- | --------------- |
-| Binary (2^96 − 1)   | C#, Rust, **Decimo** | Cheap (bit compare)   | **Expensive** (multi-word ÷10^N, iterative) | **High**        |
-| Decimal (10^38 − 1) | Arrow, SQL Server    | Cheap (one compare)   | Cheap (one wide ÷10^N)                      | Low             |
-| Decimal (10^19 − 1) | govalues/decimal     | Trivial (one compare) | Trivial (big.Int fallback + round)          | Very low        |
+| Strategy            | Used by           | Bound check cost      | Overflow rounding cost                  | Code complexity |
+| ------------------- | ----------------- | --------------------- | --------------------------------------- | --------------- |
+| Binary (2^96 − 1)   | C#, Rust, Decimo  | Cheap (bit compare)   | Expensive (multi-word ÷10^N, iterative) | High            |
+| Decimal (10^38 − 1) | Arrow, SQL Server | Cheap (one compare)   | Cheap (one wide ÷10^N)                  | Low             |
+| Decimal (10^19 − 1) | govalues/decimal  | Trivial (one compare) | Trivial (big.Int fallback + round)      | Very low        |
 
 ### 2.4 Implications for Decimo
 
-Since we follow the C#/Rust paradigm (binary bound, 2^96 − 1), the `truncate_to_max` complexity
-is inherent. There are a few things I can do:
+Since we follow the C#/Rust paradigm (binary bound, 2^96 − 1), the `truncate_to_max` complexity is inherent. There are a few things I can do:
+
+1. Adopt .NET's optimization tricks. Specifically: the `log10(2) ≈ 77/256` estimation for scale-down amount, precomputed `POWER_OVERFLOW_VALUES` table for safe scale-up, and the `Unscale()` trailing-zero removal with quick-reject bit checks. These would make our existing `truncate_to_max` and `round_to_keep_first_n_digits` much faster.
+
+2. Consider a decimal bound for a future Decimal256. If I ever widen to a full 128-bit coefficient, using 10^38 − 1 as the max (matching Arrow and SQL Server) would eliminate this problem entirely and give us interoperability with the Arrow wire format and SQL `decimal(38)`.
+
+3. Document the "29 digits but not 29 nines" behavior clearly. Users should know that "29 digits of precision" really means "28 full digits plus a leading digit 0–7".
 
-1. **Adopt .NET's optimization tricks.** Specifically: the `log10(2) ≈ 77/256` estimation for
-   scale-down amount, precomputed `POWER_OVERFLOW_VALUES` table for safe scale-up, and the
-   `Unscale()` trailing-zero removal with quick-reject bit checks. These would make our existing
-   `truncate_to_max` and `round_to_keep_first_n_digits` much faster.
+### 2.5 Using UInt128/UInt256 as Acceleration Bridge
 
-2. **Consider decimal bound for a future Decimal256.** If I ever widen to full 128-bit coefficient,
-   using 10^38 − 1 as the max (matching Arrow and SQL Server) would eliminate this problem entirely
-   and give us interoperability with Arrow wire format and SQL `decimal(38)`.
+Mojo now has native `UInt128` and `UInt256` types (via `Scalar[DType.uint128]` and `Scalar[DType.uint256]`). The codebase already uses them — `coefficient()` bitcasts the three UInt32 words to UInt128, and `multiply()` uses UInt256 for intermediate products. But there are more opportunities:
 
-3. **Document the "29 digits but not 29 nines" behavior clearly.** Users should know that
-   "29 digits of precision" really means "28 full digits plus a leading digit 0–7".
+- In `truncate_to_max` and `round_to_keep_first_n_digits`, the divmod operations could run directly on UInt128/UInt256. Mojo compiles to LLVM IR, and on 64-bit targets a UInt128 division is typically lowered to a runtime helper routine or an inline sequence built on 64-bit `div` instructions rather than to a fixed pair of hardware `div`s. That should still be faster than manually managing three 32-bit limbs, but it is worth benchmarking. We are already partially doing this, but some code paths still work with individual UInt32 words when they could just operate on the UInt128 directly.
+- The `number_of_bits` loop (§4.1) could be replaced by casting UInt128 to two UInt64s and using `count_leading_zeros` on the high word — this gives O(1) bit width instead of a 96-iteration loop.
+- Arrow's approach of promoting to 256-bit for multiply is directly applicable since we already have UInt256. Instead of the current 3×UInt32 partial-product approach in some code paths, we could do: `UInt256(x_coef) * UInt256(y_coef)`, then scale/truncate the result. This is simpler and likely just as fast since LLVM will optimize the wide multiply.
+- For the `compare_absolute` overflow bug (§3.4), using UInt256 instead of UInt128 for the fractional scaling is a one-line fix thanks to the type already being available.
 
----
+In short: we already depend on UInt128/UInt256 for the core paths. The opportunity is to use them more consistently and eliminate the remaining manual multi-word arithmetic.
 
 ## 3. Correctness Bugs
 
-### 3.1 CRITICAL: NaN/Infinity Implementation Is Broken
+### 3.1 NaN/Infinity Implementation Is Broken
 
-**File:** `decimal128.mojo`
+File: `decimal128.mojo`
 
 There are multiple compounding bugs in the NaN/Infinity support:
 
-**Bug A — NaN mask mismatch:**
+Bug A — NaN mask mismatch:
 
 ```txt
 NAN_MASK = UInt32(0x00000002)    # bit 1
@@ -230,33 +186,28 @@ NAN()          → Self(0, 0, 0, 0x00000010)  # bit 4
 NEGATIVE_NAN() → Self(0, 0, 0, 0x80000010)  # bit 4
 ```
 
-So `Decimal128.NAN().is_nan()` returns **False**.
+So `Decimal128.NAN().is_nan()` returns False.
 
-**Bug B — `is_zero()` returns True for NaN and Infinity:**
+Bug B — `is_zero()` returns True for NaN and Infinity:
 
 ```python
 fn is_zero(self) -> Bool:
     return self.low == 0 and self.mid == 0 and self.high == 0
 ```
 
-NaN and Infinity have zero coefficients, so `INFINITY().is_zero()` → **True**.
+NaN and Infinity have zero coefficients, so `INFINITY().is_zero()` → True.
 
-**Bug C — No arithmetic propagation:** None of the arithmetic operations check for NaN or Infinity
-inputs. Passing them into `add()`, `multiply()`, etc., produces garbage silently.
+Bug C — No arithmetic propagation: none of the arithmetic operations check for NaN or Infinity inputs. Passing them into `add()`, `multiply()`, etc., produces garbage silently.
 
-**Recommendation:** Since no comparable 128-bit fixed-precision library supports NaN or Infinity
-(see §1.2), I think the cleanest fix is to **remove NaN/Infinity support entirely** and raise
-errors for undefined operations (matching C#, Rust, Arrow, and govalues). If we decide to keep
-them, all three bugs must be fixed:
+Since no comparable 128-bit fixed-precision library supports NaN or Infinity (see §1.2), maybe the cleanest fix is to just remove NaN/Infinity support entirely and raise errors for undefined operations (matching C#, Rust, Arrow, and govalues). If I decide to keep them, all three bugs must be fixed:
 
-- Fix A: Change `NAN()` → `Self(0, 0, 0, 0x00000002)` and `NEGATIVE_NAN()` →
-  `Self(0, 0, 0, 0x80000002)` to match the mask.
+- Fix A: Change `NAN()` → `Self(0, 0, 0, 0x00000002)` and `NEGATIVE_NAN()` → `Self(0, 0, 0, 0x80000002)` to match the mask.
 - Fix B: Guard `is_zero()` with `if self.is_nan() or self.is_infinity(): return False`.
 - Fix C: Add early-return NaN/Inf checks to every arithmetic function (significant effort).
 
-### 3.2 MEDIUM: `from_words` Uses `testing.assert_true` in Production Code
+### 3.2 `from_words` Uses `testing.assert_true` in Production Code
 
-**File:** `decimal128.mojo`, lines ~344, 351
+File: `decimal128.mojo`, lines ~344, 351
 
 ```python
 from testing import assert_true
@@ -264,10 +215,9 @@ assert_true(scale <= Self.MAX_SCALE, ...)
 assert_true(coefficient <= Self.MAX_AS_UINT128, ...)
 ```
 
-This imports the test framework into production code. `assert_true` panics with a test failure
-message rather than raising a recoverable error. Users expect `raise Error(...)`.
+This imports the test framework into production code. `assert_true` panics with a test failure message rather than raising a recoverable error. Users expect `raise Error(...)`.
 
-**Fix:**
+Fix:
 
 ```python
 if scale > Self.MAX_SCALE:
@@ -276,23 +226,19 @@ if coefficient > Self.MAX_AS_UINT128:
     raise Error("Error in Decimal128.from_words(): Coefficient exceeds 96-bit max")
 ```
 
-### 3.3 MEDIUM: Division Hardcodes Rounding Behavior
+### 3.3 Division Hardcodes Rounding Behavior
 
-**File:** `arithmetics.mojo`, inside `divide()`
+File: `arithmetics.mojo`, inside `divide()`
 
-The long division loop always uses banker's rounding (HALF_EVEN). The function does not accept a
-rounding mode parameter.
+The long division loop always uses banker's rounding (HALF_EVEN). The function does not accept a rounding mode parameter.
 
-This matches C# and Rust behavior — both hardcode banker's rounding for the `/` operator. Arrow
-and govalues also default to HALF_EVEN for division. So this is consistent with all comparable
-libraries and not really a bug.
+This matches C# and Rust behavior — both hardcode banker's rounding for the `/` operator. Arrow and govalues also default to HALF_EVEN for division. So this is consistent with all comparable libraries and not really a bug.
 
-If we ever want configurable rounding in division, we would add a
-`divide(x, y, rounding_mode)` overload. Low priority.
+If I ever want configurable rounding in division, I would add a `divide(x, y, rounding_mode)` overload. Low priority.
 
-### 3.4 LOW: `compare_absolute` Potential Overflow
+### 3.4 `compare_absolute` Potential Overflow
 
-**File:** `comparison.mojo`
+File: `comparison.mojo`
 
 When comparing fractional parts, the code computes:
 
@@ -300,27 +246,21 @@ When comparing fractional parts, the code computes:
 fractional_1 = UInt128(x1_frac_part) * UInt128(10) ** scale_diff
 ```
 
-Since `scale_diff` can be up to 28 and `x1_frac_part` can be up to 2^96 − 1, the product can be
-as large as (2^96 − 1) × 10^28 ≈ 7.9 × 10^57. UInt128 max is ~3.4 × 10^38. This overflows.
+Since `scale_diff` can be up to 28 and `x1_frac_part` can be up to 2^96 − 1, the product can be as large as (2^96 − 1) × 10^28 ≈ 7.9 × 10^56. UInt128 max is ~3.4 × 10^38. This overflows.
 
-**Fix:** Use UInt256 for this computation, or normalize both values to the same scale before
-extracting integer and fractional parts.
+Fix: use UInt256 for this computation (see §2.5), or normalize both values to the same scale before extracting integer and fractional parts.
 
-### 3.5 LOW: `is_one()` Might Be Incomplete
+### 3.5 `is_one()` Might Be Incomplete
 
-**File:** Used in `exponential.mojo` (e.g., `log()` calls `x.is_one()`)
+File: used in `exponential.mojo` (e.g., `log()` calls `x.is_one()`)
 
-I should verify `is_one()` handles all representations of 1: `1` (coef=1, scale=0), `1.0`
-(coef=10, scale=1), `1.00` (coef=100, scale=2), etc. If it only checks
-`coefficient == 1 and scale == 0`, it misses the other forms.
-
----
+I should verify `is_one()` handles all representations of 1: `1` (coef=1, scale=0), `1.0` (coef=10, scale=1), `1.00` (coef=100, scale=2), etc. If it only checks `coefficient == 1 and scale == 0`, it misses the other forms.
 
 ## 4. Performance Bottlenecks
 
 ### 4.1 `number_of_bits()` Uses a Loop
 
-**File:** `utility.mojo`
+File: `utility.mojo`
 
 ```python
 fn number_of_bits(n: UInt128) -> Int:
@@ -332,33 +272,25 @@ fn number_of_bits(n: UInt128) -> Int:
     return count
 ```
 
-O(n) in bit count — up to 96 iterations. C#'s `ScaleResult` uses `LeadingZeroCount` which is
-a single instruction on modern CPUs.
+O(n) in bit count — up to 96 iterations. C#'s [`ScaleResult`](https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Decimal.DecCalc.cs) uses `LeadingZeroCount` which is a single instruction on modern CPUs.
 
-**Fix:** Use `count_leading_zeros()` or `bit_width()` if available in Mojo. If not available for
-UInt128, split into two UInt64s and use CLZ on the high word.
+Fix: split UInt128 into two UInt64s and use `count_leading_zeros` on the high word. If the high word is zero, use CLZ on the low word. This gives O(1) bit width. See §2.5 for more on using UInt128/UInt256 as acceleration.
 
 ### 4.2 `power_of_10` Is Not Using Precomputed Constants Efficiently
 
-**File:** `utility.mojo`
+File: `utility.mojo`
 
-The function already has hardcoded return values for n=0 through n=32, but for n=33 through n=56+
-it falls back to `ValueType(10) ** n` which computes via loop.
+The function already has hardcoded return values for n=0 through n=32, but for n=33 through n=56+ it falls back to `ValueType(10) ** n` which computes via loop.
 
-Since `power_of_10` is called in nearly every arithmetic operation (scale adjustment,
-`number_of_digits`, comparison, division), the fallback path matters.
+Since `power_of_10` is called in nearly every arithmetic operation (scale adjustment, `number_of_digits`, comparison, division), the fallback path matters.
 
-**Fix:** Extend the hardcoded constants up to n=58 (the maximum needed for UInt256 products of
-two 29-digit numbers). The pattern is already there for n≤32 — just continue it. For UInt256
-values too large for integer literals, use the `@always_inline` constant function pattern from
-`constants.mojo`.
+Fix: extend the hardcoded constants up to n=58 (the maximum needed for UInt256 products of two 29-digit numbers). The pattern is already there for n ≤ 32 — just continue it. For UInt256 values too large for integer literals, use the `@always_inline` constant function pattern from `constants.mojo`.
 
 ### 4.3 `truncate_to_max` Could Use .NET Tricks
 
-**File:** `utility.mojo`
+File: `utility.mojo`
 
-Our `truncate_to_max` computes `number_of_digits`, then divides by `power_of_10`, then rounds.
-.NET's `ScaleResult` uses:
+Our `truncate_to_max` computes `number_of_digits`, then divides by `power_of_10`, then rounds. .NET's [`ScaleResult`](https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Decimal.DecCalc.cs) uses:
 
 - `LeadingZeroCount` × `77/256` to estimate digit count without a full `number_of_digits` call
 - Division by compile-time constants (10^1 through 10^9) using multiply-by-reciprocal
@@ -368,7 +300,7 @@ These tricks could significantly speed up our overflow handling.
 
 ### 4.4 `ln()` Range Reduction Uses Loops
 
-**File:** `exponential.mojo`
+File: `exponential.mojo`
 
 The `ln()` function reduces input to [0.5, 2.0) by repeatedly dividing by 10, then by 2:
 
@@ -378,41 +310,32 @@ while x_reduced > Decimal128(2, 0, 0, 0):
     halving_count += 1
 ```
 
-For `ln(1e28)`, this loop runs ~93 times (since 10^28 ≈ 2^93), each doing a full Decimal128
-division.
+For `ln(1e28)`, this loop runs ~93 times (since 10^28 ≈ 2^93), each doing a full Decimal128 division.
 
-**Fix:** Use the identity `ln(a × 10^n) = ln(a) + n × ln(10)` to strip the scale in one step.
-Then only the coefficient (which is in [1, 2^96)) needs halving, which is at most ~96 iterations
-but could also be reduced by computing the bit width and dividing by the appropriate power of 2
-in one shot.
+Fix: use the identity `ln(a × 10^n) = ln(a) + n × ln(10)` to strip the scale in one step. Then only the coefficient (which is in [1, 2^96)) needs halving, which could be reduced by computing the bit width and dividing by the appropriate power of 2 in one shot.
 
 ### 4.5 `subtract()` Creates a Temporary
 
-**File:** `arithmetics.mojo`
+File: `arithmetics.mojo`
 
 ```python
 def subtract(x1: Decimal128, x2: Decimal128) raises -> Decimal128:
     return add(x1, negative(x2))
 ```
 
-Creates a temporary just to flip the sign bit. Probably minor — Decimal128 is 16 bytes and should
-be in registers. Low priority unless profiling shows otherwise.
+Creates a temporary just to flip the sign bit. Probably minor — Decimal128 is 16 bytes and should be in registers. Low priority unless profiling shows otherwise.
 
 ### 4.6 Series Computations Cap at 500 Iterations
 
-**File:** `exponential.mojo`
+File: `exponential.mojo`
 
-Both `exp_series` and `ln_series` loop up to 500 with convergence check `term.is_zero()`. In
-practice they converge in 30-60 iterations. The issue: `is_zero()` triggers only when the term
-underflows to exactly zero, which may require a few extra iterations beyond when the term is already
-too small to affect the result.
+Both `exp_series` and `ln_series` loop up to 500 with convergence check `term.is_zero()`. In practice they converge in 30–60 iterations. The issue: `is_zero()` triggers only when the term underflows to exactly zero, which may require a few extra iterations beyond when the term is already too small to affect the result.
 
-**Fix:** Break early if the term is smaller than 10^(−29) — it cannot change the result at our
-precision.
+Fix: break early if the term is smaller than 10^(−29) — it cannot change the result at our precision.
 
 ### 4.7 `from_string` Processes Digits One at a Time
 
-**File:** `decimal128.mojo`
+File: `decimal128.mojo`
 
 ```python
 coef = coef * 10 + digit
@@ -420,33 +343,28 @@ coef = coef * 10 + digit
 
 Each iteration does a UInt128 multiply-by-10 and add.
 
-**Fix:** Batch up to 9 digits into a UInt64, then multiply `coef` by the appropriate power of 10
-and add the batch. Reduces 128-bit multiplications from ~29 to ~4 for a max-length number.
+Fix: batch up to 9 digits into a UInt64, then multiply `coef` by the appropriate power of 10 and add the batch. Reduces 128-bit multiplications from ~29 to ~4 for a max-length number.
 
 ### 4.8 Division Loop: Separate `//` and `%` Operations
 
-**File:** `arithmetics.mojo`
+File: `arithmetics.mojo`
 
 ```python
 digit = rem // x2_coef
 rem = rem % x2_coef
 ```
 
-Two separate 128-bit divisions on the same operands. Most hardware produces both quotient and
-remainder in a single `div` instruction.
-
-**Fix:** Use `divmod()` if available in Mojo.
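+Two separate 128-bit divisions on the same operands. The quotient and remainder come out of the same computation (a hardware `div` yields both for word-sized operands), so doing the division twice is redundant work unless the compiler merges the pair.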
 
----
+Fix: use `divmod()` if available in Mojo.
 
 ## 5. Improvement Opportunities
 
 ### 5.1 Add `__hash__` Support
 
-C# and Rust both support hashing their decimal types. We should implement `__hash__` so Decimal128
-can be used as a dictionary key or in sets.
+C# and Rust both support hashing their decimal types. I should implement `__hash__` so Decimal128 can be used as a dictionary key or in sets.
 
-**Approach:** Hash the normalized form (strip trailing zeros, then hash coefficient + scale + sign).
+Approach: hash the normalized form (strip trailing zeros, then hash coefficient + scale + sign).
 
 ### 5.2 Add `Stringable` / `Representable` Protocol Conformance
 
@@ -454,24 +372,21 @@ Check that we conform to Mojo's `Stringable` and `Representable` traits for `str
 
 ### 5.3 Better `from_float` Accuracy
 
-The current `from_float` (Float64) path likely goes through string conversion. Consider exact float
-decomposition (extracting mantissa and exponent from IEEE 754 double bits).
+The current `from_float` (Float64) path likely goes through string conversion. Maybe consider exact float decomposition (extracting mantissa and exponent from IEEE 754 double bits).
 
-**Comparison:** Rust `rust_decimal` uses exact mantissa/exponent extraction from f64 bits.
+For reference: Rust [`rust_decimal`](https://github.com/paupino/rust-decimal) uses exact mantissa/exponent extraction from f64 bits.
 
 ### 5.4 `min()` / `max()` / `clamp()`
 
-All four comparable libraries provide `min`/`max`. We should too. Easy to implement.
+All four comparable libraries provide `min`/`max`. Easy to implement.
 
 ### 5.5 Canonicalization / `normalize()`
 
-Strip trailing zeros: `1.200` (coef=1200, scale=3) → `1.2` (coef=12, scale=1). Useful for
-hashing (§5.1) and reducing coefficient size for faster subsequent arithmetic. Rust `rust_decimal`
-has `normalize()`.
+Strip trailing zeros: `1.200` (coef=1200, scale=3) → `1.2` (coef=12, scale=1). Useful for hashing (§5.1) and reducing coefficient size for faster subsequent arithmetic. Rust [`rust_decimal`](https://docs.rs/rust_decimal/latest/rust_decimal/struct.Decimal.html#method.normalize) has `normalize()`.
 
 ### 5.6 Wider Testing for Edge Cases
 
-I recommend adding test cases for:
+Some test cases worth adding:
 
 | Test Case                                    | Expected Behavior            |
 | -------------------------------------------- | ---------------------------- |
@@ -482,66 +397,57 @@ I recommend adding test cases for:
 | Max coefficient after multiply               | Correct rounding             |
 | 29-digit numbers near 2^96 − 1 boundary      | Correct truncate/round       |
 
----
-
 ## 6. Priority Summary
 
 | #   | Issue                                   | Severity    | Effort                        | Priority |
 | --- | --------------------------------------- | ----------- | ----------------------------- | -------- |
-| 3.1 | NaN/Inf broken (fix or remove)          | Critical    | Small (remove) / Medium (fix) | **P0**   |
-| 3.2 | `from_words` uses `testing.assert_true` | Medium      | Small                         | **P1**   |
-| 4.2 | `power_of_10` not fully precomputed     | High        | Small                         | **P1**   |
-| 4.3 | `truncate_to_max` lacks .NET tricks     | High        | Medium                        | **P1**   |
-| 3.4 | `compare_absolute` overflow             | Medium      | Small                         | **P2**   |
-| 4.1 | `number_of_bits` loop                   | Medium      | Small                         | **P2**   |
-| 4.8 | Separate `//` and `%` in division       | Medium      | Small                         | **P2**   |
-| 4.7 | `from_string` digit-by-digit            | Medium      | Medium                        | **P2**   |
-| 4.4 | `ln()` range reduction loops            | Medium      | Medium                        | **P2**   |
-| 5.4 | `min/max/clamp`                         | Enhancement | Trivial                       | **P3**   |
-| 5.5 | `normalize()`                           | Enhancement | Small                         | **P3**   |
-| 5.1 | `__hash__`                              | Enhancement | Small                         | **P3**   |
-| 5.6 | Edge case tests                         | Enhancement | Medium                        | **P3**   |
-| 4.5 | `subtract` temporary                    | Low         | Trivial                       | **P4**   |
-| 4.6 | Series convergence tolerance            | Low         | Small                         | **P4**   |
-| 3.3 | Division rounding mode (configurable)   | Low         | Medium                        | **P4**   |
-| 5.3 | Better `from_float`                     | Enhancement | Medium                        | **P4**   |
-| 5.2 | `Stringable` conformance                | Enhancement | Trivial                       | **P4**   |
-
----
+| 3.1 | NaN/Inf broken (fix or remove)          | Critical    | Small (remove) / Medium (fix) | P0       |
+| 3.2 | `from_words` uses `testing.assert_true` | Medium      | Small                         | P1       |
+| 4.2 | `power_of_10` not fully precomputed     | High        | Small                         | P1       |
+| 4.3 | `truncate_to_max` lacks .NET tricks     | High        | Medium                        | P1       |
+| 3.4 | `compare_absolute` overflow             | Medium      | Small                         | P2       |
+| 4.1 | `number_of_bits` loop                   | Medium      | Small                         | P2       |
+| 4.8 | Separate `//` and `%` in division       | Medium      | Small                         | P2       |
+| 4.7 | `from_string` digit-by-digit            | Medium      | Medium                        | P2       |
+| 4.4 | `ln()` range reduction loops            | Medium      | Medium                        | P2       |
+| 5.4 | `min/max/clamp`                         | Enhancement | Trivial                       | P3       |
+| 5.5 | `normalize()`                           | Enhancement | Small                         | P3       |
+| 5.1 | `__hash__`                              | Enhancement | Small                         | P3       |
+| 5.6 | Edge case tests                         | Enhancement | Medium                        | P3       |
+| 4.5 | `subtract` temporary                    | Low         | Trivial                       | P4       |
+| 4.6 | Series convergence tolerance            | Low         | Small                         | P4       |
+| 3.3 | Division rounding mode (configurable)   | Low         | Medium                        | P4       |
+| 5.3 | Better `from_float`                     | Enhancement | Medium                        | P4       |
+| 5.2 | `Stringable` conformance                | Enhancement | Trivial                       | P4       |
 
 ## 7. Execution Order
 
-**Phase 1 — Critical correctness:**
+Phase 1 — correctness:
 
-1. Decide: remove NaN/Infinity or fix them (§3.1). If removing, delete `NAN()`, `NEGATIVE_NAN()`,
-   `INFINITY()`, `NEGATIVE_INFINITY()`, `NAN_MASK`, `INFINITY_MASK`, `is_nan()`, `is_infinity()`
-   and change callers to raise errors. If fixing, apply fixes A/B/C from §3.1.
+1. Decide: remove NaN/Infinity or fix them (§3.1). If removing, delete `NAN()`, `NEGATIVE_NAN()`, `INFINITY()`, `NEGATIVE_INFINITY()`, `NAN_MASK`, `INFINITY_MASK`, `is_nan()`, `is_infinity()` and change callers to raise errors. If fixing, apply fixes A/B/C from §3.1.
 2. Fix `from_words` to use `raise Error` instead of `testing.assert_true` (§3.2).
 3. Add edge case tests (§5.6).
 
-**Phase 2 — Performance (coefficient bound):**
+Phase 2 — performance (coefficient bound):
 
 1. Extend `power_of_10` hardcoded constants up to n=58 (§4.2).
-2. Add .NET-style tricks to `truncate_to_max`: CLZ-based digit estimation, divide-by-constant
-   optimization, trailing-zero quick-reject (§4.3).
-3. Replace `number_of_bits` with hardware CLZ (§4.1).
+2. Add .NET-style tricks to `truncate_to_max`: CLZ-based digit estimation, divide-by-constant optimization, trailing-zero quick-reject (§4.3).
+3. Replace `number_of_bits` with hardware CLZ via UInt64 split (§4.1).
 4. Fix `compare_absolute` overflow with UInt256 (§3.4).
 
-**Phase 3 — Performance (general):**
+Phase 3 — performance (general):
 
 1. Optimize `from_string` digit batching (§4.7).
 2. Use single divmod in division loop (§4.8).
 3. Improve `ln()` range reduction (§4.4).
 
-**Phase 4 — Enhancements:**
+Phase 4 — enhancements:
 
 1. Add `min/max/clamp` (§5.4).
 2. Add `normalize()` (§5.5).
 3. Add `__hash__` (§5.1).
 4. Better `from_float` (§5.3).
 
----
-
 ## Appendix A. Survey of 128-Bit Fixed-Precision Decimal Types
 
 > **Scope:** 128-bit (or near-128-bit) fixed-precision, non-floating-point decimal types across

From b33eba3d57f3be6c956c005722eb51c5431d91bf Mon Sep 17 00:00:00 2001
From: ZHU Yuhao <dr.yuhao.zhu@outlook.com>
Date: Wed, 8 Apr 2026 21:52:20 +0200
Subject: [PATCH 3/5] Address comments

---
 docs/plans/decimal128_enhancement.md | 38 ++++++++++++++--------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/docs/plans/decimal128_enhancement.md b/docs/plans/decimal128_enhancement.md
index 71c3fb88..9002c309 100644
--- a/docs/plans/decimal128_enhancement.md
+++ b/docs/plans/decimal128_enhancement.md
@@ -15,16 +15,16 @@ Scope: only 128-bit (or near-128-bit) fixed-precision, non-floating-point decima
 
 ### 1.1 Storage & Layout
 
-| Feature                | Decimo Decimal128    | C# System.Decimal    | Rust rust_decimal    | Arrow Decimal128           | Go govalues/decimal |
-| ---------------------- | -------------------- | -------------------- | -------------------- | -------------------------- | ------------------- |
-| Total bits             | 128                  | 128                  | 128                  | 128                        | 128 (bool+u64+int)  |
-| Coefficient storage    | 96-bit (3×UInt32 LE) | 96-bit (3×UInt32 LE) | 96-bit (3×UInt32 LE) | 128-bit signed two's compl | 64-bit unsigned     |
-| Max coefficient        | 2^96 − 1             | 2^96 − 1             | 2^96 − 1             | 10^38 − 1                  | 10^19 − 1           |
-| Bound type             | Binary               | Binary               | Binary               | Decimal                    | Decimal             |
-| Max significant digits | 29*                  | 29*                  | 29*                  | 38                         | 19                  |
-| Scale range            | 0–28                 | 0–28                 | 0–28                 | User-defined               | 0–19                |
-| Sign storage           | Bit 31 of flags      | Bit 31 of flags      | Bit 31 of flags      | Two's complement           | Bool field          |
-| Endianness             | Little-endian        | Little-endian        | Little-endian        | Platform-native            | N/A                 |
+| Feature                | Decimo Decimal128    | C# System.Decimal    | Rust rust_decimal    | Arrow Decimal128           | Go govalues/decimal             |
+| ---------------------- | -------------------- | -------------------- | -------------------- | -------------------------- | ------------------------------- |
+| Total bits             | 128                  | 128                  | 128                  | 128                        | ~128 (bool+uint64+int, may pad) |
+| Coefficient storage    | 96-bit (3×UInt32 LE) | 96-bit (3×UInt32 LE) | 96-bit (3×UInt32 LE) | 128-bit signed two's compl | 64-bit unsigned                 |
+| Max coefficient        | 2^96 − 1             | 2^96 − 1             | 2^96 − 1             | 10^38 − 1                  | 10^19 − 1                       |
+| Bound type             | Binary               | Binary               | Binary               | Decimal                    | Decimal                         |
+| Max significant digits | 29*                  | 29*                  | 29*                  | 38                         | 19                              |
+| Scale range            | 0–28                 | 0–28                 | 0–28                 | User-defined               | 0–19                            |
+| Sign storage           | Bit 31 of flags      | Bit 31 of flags      | Bit 31 of flags      | Two's complement           | Bool field                      |
+| Endianness             | Little-endian        | Little-endian        | Little-endian        | Platform-native            | N/A                             |
 
 \* 29 digits, but the leading digit can only be 0–7 (since 10^29 − 1 > 2^96 − 1). This is pretty dirty and difficult to handle. I think the current implmention is not the most optimized. Need to check and refine.
 
@@ -190,7 +190,7 @@ So `Decimal128.NAN().is_nan()` returns False.
 
 Bug B — `is_zero()` returns True for NaN and Infinity:
 
-```python
+```mojo
 fn is_zero(self) -> Bool:
     return self.low == 0 and self.mid == 0 and self.high == 0
 ```
@@ -209,7 +209,7 @@ Since no comparable 128-bit fixed-precision library supports NaN or Infinity (se
 
 File: `decimal128.mojo`, lines ~344, 351
 
-```python
+```mojo
 from testing import assert_true
 assert_true(scale <= Self.MAX_SCALE, ...)
 assert_true(coefficient <= Self.MAX_AS_UINT128, ...)
@@ -219,7 +219,7 @@ This imports the test framework into production code. `assert_true` panics with
 
 Fix:
 
-```python
+```mojo
 if scale > Self.MAX_SCALE:
     raise Error("Error in Decimal128.from_words(): Scale must be <= 28, got " + str(scale))
 if coefficient > Self.MAX_AS_UINT128:
@@ -242,7 +242,7 @@ File: `comparison.mojo`
 
 When comparing fractional parts, the code computes:
 
-```python
+```mojo
 fractional_1 = UInt128(x1_frac_part) * UInt128(10) ** scale_diff
 ```
 
@@ -262,7 +262,7 @@ I should verify `is_one()` handles all representations of 1: `1` (coef=1, scale=
 
 File: `utility.mojo`
 
-```python
+```mojo
 fn number_of_bits(n: UInt128) -> Int:
     var count = 0
     var x = n
@@ -304,7 +304,7 @@ File: `exponential.mojo`
 
 The `ln()` function reduces input to [0.5, 2.0) by repeatedly dividing by 10, then by 2:
 
-```python
+```mojo
 while x_reduced > Decimal128(2, 0, 0, 0):
     x_reduced = decimo.decimal128.arithmetics.divide(x_reduced, Decimal128(2, 0, 0, 0))
     halving_count += 1
@@ -318,7 +318,7 @@ Fix: use the identity `ln(a × 10^n) = ln(a) + n × ln(10)` to strip the scale i
 
 File: `arithmetics.mojo`
 
-```python
+```mojo
 def subtract(x1: Decimal128, x2: Decimal128) raises -> Decimal128:
     return add(x1, negative(x2))
 ```
@@ -337,7 +337,7 @@ Fix: break early if the term is smaller than 10^(−29) — it cannot change the
 
 File: `decimal128.mojo`
 
-```python
+```mojo
 coef = coef * 10 + digit
 ```
 
@@ -349,7 +349,7 @@ Fix: batch up to 9 digits into a UInt64, then multiply `coef` by the appropriate
 
 File: `arithmetics.mojo`
 
-```python
+```mojo
 digit = rem // x2_coef
 rem = rem % x2_coef
 ```

From 8edf22cfd12b1a2bb685830a20dd6cf0134f6f86 Mon Sep 17 00:00:00 2001
From: ZHU Yuhao <dr.yuhao.zhu@outlook.com>
Date: Wed, 8 Apr 2026 21:54:26 +0200
Subject: [PATCH 4/5] Change debug level

---
 tests/test_bigdecimal.sh | 2 +-
 tests/test_bigfloat.sh   | 2 +-
 tests/test_bigint.sh     | 2 +-
 tests/test_bigint10.sh   | 2 +-
 tests/test_biguint.sh    | 2 +-
 tests/test_cli.sh        | 2 +-
 tests/test_decimal128.sh | 2 +-
 tests/test_toml.sh       | 2 +-
 8 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/tests/test_bigdecimal.sh b/tests/test_bigdecimal.sh
index b4451691..43f07cb9 100755
--- a/tests/test_bigdecimal.sh
+++ b/tests/test_bigdecimal.sh
@@ -2,5 +2,5 @@
 set -e
 
 for f in tests/bigdecimal/*.mojo; do
-    pixi run mojo run -I src -D ASSERT=all --debug-level=line-tables "$f"
+    pixi run mojo run -I src -D ASSERT=all --debug-level=full "$f"
 done
diff --git a/tests/test_bigfloat.sh b/tests/test_bigfloat.sh
index f9f5081a..d623a0f2 100644
--- a/tests/test_bigfloat.sh
+++ b/tests/test_bigfloat.sh
@@ -37,7 +37,7 @@ trap cleanup EXIT
 for f in tests/bigfloat/*.mojo; do
     echo "=== $f ==="
     TMPBIN=$(mktemp /tmp/decimo_test_bigfloat_XXXXXX)
-    pixi run mojo build -I src --debug-level=line-tables \
+    pixi run mojo build -I src --debug-level=full \
         -Xlinker -L./"$WRAPPER_DIR" -Xlinker -ldecimo_gmp_wrapper \
         -o "$TMPBIN" "$f"
     DYLD_LIBRARY_PATH="./$WRAPPER_DIR" LD_LIBRARY_PATH="./$WRAPPER_DIR" "$TMPBIN"
diff --git a/tests/test_bigint.sh b/tests/test_bigint.sh
index 94a45c5d..88790488 100755
--- a/tests/test_bigint.sh
+++ b/tests/test_bigint.sh
@@ -2,5 +2,5 @@
 set -e
 
 for f in tests/bigint/*.mojo; do
-    pixi run mojo run -I src -D ASSERT=all --debug-level=line-tables "$f"
+    pixi run mojo run -I src -D ASSERT=all --debug-level=full "$f"
 done
diff --git a/tests/test_bigint10.sh b/tests/test_bigint10.sh
index db746be8..a5043d71 100755
--- a/tests/test_bigint10.sh
+++ b/tests/test_bigint10.sh
@@ -2,5 +2,5 @@
 set -e
 
 for f in tests/bigint10/*.mojo; do
-    pixi run mojo run -I src -D ASSERT=all --debug-level=line-tables "$f"
+    pixi run mojo run -I src -D ASSERT=all --debug-level=full "$f"
 done
diff --git a/tests/test_biguint.sh b/tests/test_biguint.sh
index 7e09a61c..309884ff 100755
--- a/tests/test_biguint.sh
+++ b/tests/test_biguint.sh
@@ -2,5 +2,5 @@
 set -e
 
 for f in tests/biguint/*.mojo; do
-    pixi run mojo run -I src -D ASSERT=all --debug-level=line-tables "$f"
+    pixi run mojo run -I src -D ASSERT=all --debug-level=full "$f"
 done
diff --git a/tests/test_cli.sh b/tests/test_cli.sh
index 79f78ee4..5ab2b56b 100644
--- a/tests/test_cli.sh
+++ b/tests/test_cli.sh
@@ -3,7 +3,7 @@ set -e  # Exit immediately if any command fails
 
 # ── Unit tests ─────────────────────────────────────────────────────────────
 for f in tests/cli/*.mojo; do
-    pixi run mojo run -I src -I src/cli -D ASSERT=all --debug-level=line-tables "$f"
+    pixi run mojo run -I src -I src/cli -D ASSERT=all --debug-level=full "$f"
 done
 
 # ── Integration tests (exercise the compiled binary) ───────────────────────
diff --git a/tests/test_decimal128.sh b/tests/test_decimal128.sh
index e845e8b2..e229f0c1 100755
--- a/tests/test_decimal128.sh
+++ b/tests/test_decimal128.sh
@@ -2,5 +2,5 @@
 set -e
 
 for f in tests/decimal128/*.mojo; do
-    pixi run mojo run -I src -D ASSERT=all --debug-level=line-tables "$f"
+    pixi run mojo run -I src -D ASSERT=all --debug-level=full "$f"
 done
diff --git a/tests/test_toml.sh b/tests/test_toml.sh
index 78efb439..7fe978a3 100755
--- a/tests/test_toml.sh
+++ b/tests/test_toml.sh
@@ -2,5 +2,5 @@
 set -e
 
 for f in tests/toml/*.mojo; do
-    pixi run mojo run -I src -D ASSERT=all --debug-level=line-tables "$f"
+    pixi run mojo run -I src -D ASSERT=all --debug-level=full "$f"
 done

From 00196d129f3c1787b398c33e225e01316731b45e Mon Sep 17 00:00:00 2001
From: ZHU Yuhao <dr.yuhao.zhu@outlook.com>
Date: Wed, 8 Apr 2026 22:01:26 +0200
Subject: [PATCH 5/5] Test on macOS instead of Ubuntu

---
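Note (not part of the applied diff): the workflow below repeats the same
three-attempt retry loop in every build and test step. As a possible
follow-up, the loop could be factored into one shell function. The sketch
below is only an illustration under stated assumptions: the name `retry`,
its argument order, and the idea of sourcing it from a shared script are
hypothetical and do not exist in this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of this patch.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
# Runs the command until it succeeds or max_attempts is reached.
retry() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt
    for ((attempt = 1; attempt <= max_attempts; attempt++)); do
        echo "=== attempt $attempt of $max_attempts: $* ==="
        # Running the command inside `if` keeps `set -e` from aborting
        # the shell on a failed attempt.
        if "$@"; then
            echo "=== succeeded on attempt $attempt ==="
            return 0
        fi
        # Sleep only when another attempt will follow.
        if [ "$attempt" -lt "$max_attempts" ]; then
            echo "=== failed, retrying in ${delay}s... ==="
            sleep "$delay"
        fi
    done
    echo "=== failed after $max_attempts attempts ==="
    return 1
}
```

With such a helper sourced in a step, a call like
`retry 3 5 bash ./tests/test_bigint.sh` would behave like the inline loops
in this patch: up to three attempts, five seconds apart, and a non-zero
exit status if all of them fail. Each `run:` step starts a fresh shell, so
the function would have to be sourced from a script in every step that
uses it.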
 .github/workflows/run_tests.yaml | 350 ++++++++++++++++++++++++-------
 1 file changed, 278 insertions(+), 72 deletions(-)

diff --git a/.github/workflows/run_tests.yaml b/.github/workflows/run_tests.yaml
index cad7a1a6..58eca682 100644
--- a/.github/workflows/run_tests.yaml
+++ b/.github/workflows/run_tests.yaml
@@ -1,4 +1,5 @@
-name: CI
+name: Decimo Unit Tests
+
 on:
   push:
     branches: [main, dev]
@@ -20,10 +21,8 @@ jobs:
   # ── Test: BigDecimal ─────────────────────────────────────────────────────────
   test-bigdecimal:
     name: Test BigDecimal
-    runs-on: ubuntu-22.04
+    runs-on: macos-latest
     timeout-minutes: 30
-    env:
-      DEBIAN_FRONTEND: noninteractive
     steps:
       - uses: actions/checkout@v4
       - name: Install pixi
@@ -34,19 +33,42 @@ jobs:
           echo "$HOME/.pixi/bin" >> $GITHUB_PATH
       - name: pixi install
         run: pixi install
-      - name: Build packages
+      - name: Build packages (with retry for Mojo compiler intermittent crashes)
+        run: |
+          for attempt in 1 2 3; do
+            echo "=== build attempt $attempt ==="
+            if pixi run mojo package src/decimo && cp decimo.mojopkg tests/; then
+              echo "=== build succeeded on attempt $attempt ==="
+              break
+            fi
+            if [ "$attempt" -eq 3 ]; then
+              echo "=== build failed after 3 attempts ==="
+              exit 1
+            fi
+            echo "=== build crashed, retrying in 5s... ==="
+            sleep 5
+          done
+      - name: Run tests (with retry for Mojo compiler intermittent crashes)
         run: |
-          pixi run mojo package src/decimo && cp decimo.mojopkg tests/
-      - name: Run tests
-        run: bash ./tests/test_bigdecimal.sh
+          for attempt in 1 2 3; do
+            echo "=== test attempt $attempt ==="
+            if bash ./tests/test_bigdecimal.sh; then
+              echo "=== tests passed on attempt $attempt ==="
+              break
+            fi
+            if [ "$attempt" -eq 3 ]; then
+              echo "=== tests failed after 3 attempts ==="
+              exit 1
+            fi
+            echo "=== test run crashed, retrying in 5s... ==="
+            sleep 5
+          done
 
   # ── Test: BigInt ─────────────────────────────────────────────────────────────
   test-bigint:
     name: Test BigInt
-    runs-on: ubuntu-22.04
+    runs-on: macos-latest
     timeout-minutes: 30
-    env:
-      DEBIAN_FRONTEND: noninteractive
     steps:
       - uses: actions/checkout@v4
       - name: Install pixi
@@ -57,19 +79,42 @@ jobs:
           echo "$HOME/.pixi/bin" >> $GITHUB_PATH
       - name: pixi install
         run: pixi install
-      - name: Build packages
+      - name: Build packages (with retry for Mojo compiler intermittent crashes)
+        run: |
+          for attempt in 1 2 3; do
+            echo "=== build attempt $attempt ==="
+            if pixi run mojo package src/decimo && cp decimo.mojopkg tests/; then
+              echo "=== build succeeded on attempt $attempt ==="
+              break
+            fi
+            if [ "$attempt" -eq 3 ]; then
+              echo "=== build failed after 3 attempts ==="
+              exit 1
+            fi
+            echo "=== build crashed, retrying in 5s... ==="
+            sleep 5
+          done
+      - name: Run tests (with retry for Mojo compiler intermittent crashes)
         run: |
-          pixi run mojo package src/decimo && cp decimo.mojopkg tests/
-      - name: Run tests
-        run: bash ./tests/test_bigint.sh
+          for attempt in 1 2 3; do
+            echo "=== test attempt $attempt ==="
+            if bash ./tests/test_bigint.sh; then
+              echo "=== tests passed on attempt $attempt ==="
+              break
+            fi
+            if [ "$attempt" -eq 3 ]; then
+              echo "=== tests failed after 3 attempts ==="
+              exit 1
+            fi
+            echo "=== test run crashed, retrying in 5s... ==="
+            sleep 5
+          done
 
   # ── Test: BigUint ────────────────────────────────────────────────────────────
   test-biguint:
     name: Test BigUint
-    runs-on: ubuntu-22.04
+    runs-on: macos-latest
     timeout-minutes: 30
-    env:
-      DEBIAN_FRONTEND: noninteractive
     steps:
       - uses: actions/checkout@v4
       - name: Install pixi
@@ -80,19 +125,42 @@ jobs:
           echo "$HOME/.pixi/bin" >> $GITHUB_PATH
       - name: pixi install
         run: pixi install
-      - name: Build packages
+      - name: Build packages (with retry for Mojo compiler intermittent crashes)
         run: |
-          pixi run mojo package src/decimo && cp decimo.mojopkg tests/
-      - name: Run tests
-        run: bash ./tests/test_biguint.sh
+          for attempt in 1 2 3; do
+            echo "=== build attempt $attempt ==="
+            if pixi run mojo package src/decimo && cp decimo.mojopkg tests/; then
+              echo "=== build succeeded on attempt $attempt ==="
+              break
+            fi
+            if [ "$attempt" -eq 3 ]; then
+              echo "=== build failed after 3 attempts ==="
+              exit 1
+            fi
+            echo "=== build crashed, retrying in 5s... ==="
+            sleep 5
+          done
+      - name: Run tests (with retry for Mojo compiler intermittent crashes)
+        run: |
+          for attempt in 1 2 3; do
+            echo "=== test attempt $attempt ==="
+            if bash ./tests/test_biguint.sh; then
+              echo "=== tests passed on attempt $attempt ==="
+              break
+            fi
+            if [ "$attempt" -eq 3 ]; then
+              echo "=== tests failed after 3 attempts ==="
+              exit 1
+            fi
+            echo "=== test run crashed, retrying in 5s... ==="
+            sleep 5
+          done
 
   # ── Test: BigInt10 ───────────────────────────────────────────────────────────
   test-bigint10:
     name: Test BigInt10
-    runs-on: ubuntu-22.04
+    runs-on: macos-latest
     timeout-minutes: 30
-    env:
-      DEBIAN_FRONTEND: noninteractive
     steps:
       - uses: actions/checkout@v4
       - name: Install pixi
@@ -103,19 +171,42 @@ jobs:
           echo "$HOME/.pixi/bin" >> $GITHUB_PATH
       - name: pixi install
         run: pixi install
-      - name: Build packages
+      - name: Build packages (with retry for Mojo compiler intermittent crashes)
+        run: |
+          for attempt in 1 2 3; do
+            echo "=== build attempt $attempt ==="
+            if pixi run mojo package src/decimo && cp decimo.mojopkg tests/; then
+              echo "=== build succeeded on attempt $attempt ==="
+              break
+            fi
+            if [ "$attempt" -eq 3 ]; then
+              echo "=== build failed after 3 attempts ==="
+              exit 1
+            fi
+            echo "=== build crashed, retrying in 5s... ==="
+            sleep 5
+          done
+      - name: Run tests (with retry for Mojo compiler intermittent crashes)
         run: |
-          pixi run mojo package src/decimo && cp decimo.mojopkg tests/
-      - name: Run tests
-        run: bash ./tests/test_bigint10.sh
+          for attempt in 1 2 3; do
+            echo "=== test attempt $attempt ==="
+            if bash ./tests/test_bigint10.sh; then
+              echo "=== tests passed on attempt $attempt ==="
+              break
+            fi
+            if [ "$attempt" -eq 3 ]; then
+              echo "=== tests failed after 3 attempts ==="
+              exit 1
+            fi
+            echo "=== test run crashed, retrying in 5s... ==="
+            sleep 5
+          done
 
   # ── Test: Decimal128 ─────────────────────────────────────────────────────────
   test-decimal128:
     name: Test Decimal128
-    runs-on: ubuntu-22.04
+    runs-on: macos-latest
     timeout-minutes: 30
-    env:
-      DEBIAN_FRONTEND: noninteractive
     steps:
       - uses: actions/checkout@v4
       - name: Install pixi
@@ -126,19 +217,42 @@ jobs:
           echo "$HOME/.pixi/bin" >> $GITHUB_PATH
       - name: pixi install
         run: pixi install
-      - name: Build packages
+      - name: Build packages (with retry for Mojo compiler intermittent crashes)
+        run: |
+          for attempt in 1 2 3; do
+            echo "=== build attempt $attempt ==="
+            if pixi run mojo package src/decimo && cp decimo.mojopkg tests/; then
+              echo "=== build succeeded on attempt $attempt ==="
+              break
+            fi
+            if [ "$attempt" -eq 3 ]; then
+              echo "=== build failed after 3 attempts ==="
+              exit 1
+            fi
+            echo "=== build crashed, retrying in 5s... ==="
+            sleep 5
+          done
+      - name: Run tests (with retry for Mojo compiler intermittent crashes)
         run: |
-          pixi run mojo package src/decimo && cp decimo.mojopkg tests/
-      - name: Run tests
-        run: bash ./tests/test_decimal128.sh
+          for attempt in 1 2 3; do
+            echo "=== test attempt $attempt ==="
+            if bash ./tests/test_decimal128.sh; then
+              echo "=== tests passed on attempt $attempt ==="
+              break
+            fi
+            if [ "$attempt" -eq 3 ]; then
+              echo "=== tests failed after 3 attempts ==="
+              exit 1
+            fi
+            echo "=== test run crashed, retrying in 5s... ==="
+            sleep 5
+          done
 
   # ── Test: BigFloat ─────────────────────────────────────────────────────────
   test-bigfloat:
     name: Test BigFloat
-    runs-on: ubuntu-22.04
+    runs-on: macos-latest
     timeout-minutes: 30
-    env:
-      DEBIAN_FRONTEND: noninteractive
     steps:
       - uses: actions/checkout@v4
       - name: Install pixi
@@ -150,20 +264,43 @@ jobs:
       - name: pixi install
         run: pixi install
       - name: Install MPFR
-        run: sudo apt-get update && sudo apt-get install -y libmpfr-dev
-      - name: Build packages
+        run: brew install mpfr
+      - name: Build packages (with retry for Mojo compiler intermittent crashes)
         run: |
-          pixi run mojo package src/decimo && cp decimo.mojopkg tests/
-      - name: Run tests
-        run: bash ./tests/test_bigfloat.sh
+          for attempt in 1 2 3; do
+            echo "=== build attempt $attempt ==="
+            if pixi run mojo package src/decimo && cp decimo.mojopkg tests/; then
+              echo "=== build succeeded on attempt $attempt ==="
+              break
+            fi
+            if [ "$attempt" -eq 3 ]; then
+              echo "=== build failed after 3 attempts ==="
+              exit 1
+            fi
+            echo "=== build crashed, retrying in 5s... ==="
+            sleep 5
+          done
+      - name: Run tests (with retry for Mojo compiler intermittent crashes)
+        run: |
+          for attempt in 1 2 3; do
+            echo "=== test attempt $attempt ==="
+            if bash ./tests/test_bigfloat.sh; then
+              echo "=== tests passed on attempt $attempt ==="
+              break
+            fi
+            if [ "$attempt" -eq 3 ]; then
+              echo "=== tests failed after 3 attempts ==="
+              exit 1
+            fi
+            echo "=== test run crashed, retrying in 5s... ==="
+            sleep 5
+          done
 
   # ── Test: TOML parser ─────────────────────────────────────────────────────
   test-toml:
     name: Test TOML parser
-    runs-on: ubuntu-22.04
+    runs-on: macos-latest
     timeout-minutes: 15
-    env:
-      DEBIAN_FRONTEND: noninteractive
     steps:
       - uses: actions/checkout@v4
       - name: Install pixi
@@ -174,19 +311,42 @@ jobs:
           echo "$HOME/.pixi/bin" >> $GITHUB_PATH
       - name: pixi install
         run: pixi install
-      - name: Build packages
+      - name: Build packages (with retry for Mojo compiler intermittent crashes)
+        run: |
+          for attempt in 1 2 3; do
+            echo "=== build attempt $attempt ==="
+            if pixi run mojo package src/decimo && cp decimo.mojopkg tests/; then
+              echo "=== build succeeded on attempt $attempt ==="
+              break
+            fi
+            if [ "$attempt" -eq 3 ]; then
+              echo "=== build failed after 3 attempts ==="
+              exit 1
+            fi
+            echo "=== build crashed, retrying in 5s... ==="
+            sleep 5
+          done
+      - name: Run tests (with retry for Mojo compiler intermittent crashes)
         run: |
-          pixi run mojo package src/decimo && cp decimo.mojopkg tests/
-      - name: Run tests
-        run: bash ./tests/test_toml.sh
+          for attempt in 1 2 3; do
+            echo "=== test attempt $attempt ==="
+            if bash ./tests/test_toml.sh; then
+              echo "=== tests passed on attempt $attempt ==="
+              break
+            fi
+            if [ "$attempt" -eq 3 ]; then
+              echo "=== tests failed after 3 attempts ==="
+              exit 1
+            fi
+            echo "=== test run crashed, retrying in 5s... ==="
+            sleep 5
+          done
 
   # ── Test: CLI ────────────────────────────────────────────────────────────────
   test-cli:
     name: Test CLI
-    runs-on: ubuntu-22.04
+    runs-on: macos-latest
     timeout-minutes: 15
-    env:
-      DEBIAN_FRONTEND: noninteractive
     steps:
       - uses: actions/checkout@v4
       - name: Install pixi
@@ -197,18 +357,42 @@ jobs:
           echo "$HOME/.pixi/bin" >> $GITHUB_PATH
       - name: pixi install
         run: pixi install
-      - name: Build CLI binary
-        run: pixi run buildcli
-      - name: Run tests
-        run: bash ./tests/test_cli.sh
+      - name: Build CLI binary (with retry for Mojo compiler intermittent crashes)
+        run: |
+          for attempt in 1 2 3; do
+            echo "=== build attempt $attempt ==="
+            if pixi run buildcli; then
+              echo "=== build succeeded on attempt $attempt ==="
+              break
+            fi
+            if [ "$attempt" -eq 3 ]; then
+              echo "=== build failed after 3 attempts ==="
+              exit 1
+            fi
+            echo "=== build crashed, retrying in 5s... ==="
+            sleep 5
+          done
+      - name: Run tests (with retry for Mojo compiler intermittent crashes)
+        run: |
+          for attempt in 1 2 3; do
+            echo "=== test attempt $attempt ==="
+            if bash ./tests/test_cli.sh; then
+              echo "=== tests passed on attempt $attempt ==="
+              break
+            fi
+            if [ "$attempt" -eq 3 ]; then
+              echo "=== tests failed after 3 attempts ==="
+              exit 1
+            fi
+            echo "=== test run crashed, retrying in 5s... ==="
+            sleep 5
+          done
 
   # ── Test: Python bindings ────────────────────────────────────────────────────
   test-python:
     name: Test Python bindings
-    runs-on: ubuntu-22.04
+    runs-on: macos-latest
     timeout-minutes: 15
-    env:
-      DEBIAN_FRONTEND: noninteractive
     steps:
       - uses: actions/checkout@v4
       - name: Install pixi
@@ -219,16 +403,27 @@ jobs:
           echo "$HOME/.pixi/bin" >> $GITHUB_PATH
       - name: pixi install
         run: pixi install
-      - name: Build & run Python tests
-        run: pixi run testpy
+      - name: Build & run Python tests (with retry for Mojo compiler intermittent crashes)
+        run: |
+          for attempt in 1 2 3; do
+            echo "=== test attempt $attempt ==="
+            if pixi run testpy; then
+              echo "=== tests passed on attempt $attempt ==="
+              break
+            fi
+            if [ "$attempt" -eq 3 ]; then
+              echo "=== tests failed after 3 attempts ==="
+              exit 1
+            fi
+            echo "=== test run crashed, retrying in 5s... ==="
+            sleep 5
+          done
 
   # ── Doc generation check ──────────────────────────────────────────────────────
   doc-check:
     name: Doc generation
-    runs-on: ubuntu-22.04
+    runs-on: macos-latest
     timeout-minutes: 10
-    env:
-      DEBIAN_FRONTEND: noninteractive
     steps:
       - uses: actions/checkout@v4
       - name: Install pixi
@@ -239,16 +434,27 @@ jobs:
           echo "$HOME/.pixi/bin" >> $GITHUB_PATH
       - name: pixi install
         run: pixi install
-      - name: Generate docs
-        run: pixi run doc
+      - name: Generate docs (with retry for Mojo compiler intermittent crashes)
+        run: |
+          for attempt in 1 2 3; do
+            echo "=== doc attempt $attempt ==="
+            if pixi run doc; then
+              echo "=== doc succeeded on attempt $attempt ==="
+              break
+            fi
+            if [ "$attempt" -eq 3 ]; then
+              echo "=== doc failed after 3 attempts ==="
+              exit 1
+            fi
+            echo "=== doc crashed, retrying in 5s... ==="
+            sleep 5
+          done
 
   # ── Format check ─────────────────────────────────────────────────────────────
   format-check:
     name: Format check
-    runs-on: ubuntu-22.04
+    runs-on: macos-latest
     timeout-minutes: 10
-    env:
-      DEBIAN_FRONTEND: noninteractive
     steps:
       - uses: actions/checkout@v4
       - name: Install pixi
@@ -262,4 +468,4 @@ jobs:
       - name: Install pre-commit
         run: pip install pre-commit
       - name: Run format check
-        run: pre-commit run --all-files
\ No newline at end of file
+        run: pre-commit run --all-files