Track candid performance series and benchmark deltas

## Summary
This issue records the performance work on [`sat-perf-improvements`](https://github.com/dfinity/candid/commits/sat-perf-improvements/) and the benchmark evidence for the full series relative to `ba72cf4`.

The series currently includes **eight** focused changes:

**Original series (5 commits):**
- fast-path small `Nat` / `Int` encode-decode paths
- avoid cloning record field queues during decode
- fast-path exact primitive vector decode
- remove duplicate whole-file type rebuilding in `candid_parser` typing
- skip trivia-only tokenizer bookkeeping when trivia capture is disabled

**New additions (3 commits):**
- bulk memcpy for primitive vector encoding on LE platforms (14.7x encoding speedup)
- batch cost tracking for primitive vector decode (41% decode speedup)
- `Int::decode` small-value fast path (12% decode speedup + 50% heap reduction)
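
To make the bulk-copy item above concrete, here is a minimal sketch of the idea rather than the code from the branch. It assumes a `u32` payload and omits the vector's length prefix; the function names `encode_u32s_slow` and `encode_u32s_bulk` are hypothetical. Candid writes fixed-width integers in little-endian order, so on a little-endian target the in-memory bytes of the slice already match the wire bytes and can be appended in one copy.

```rust
/// Element-by-element encoding: one small write per value.
fn encode_u32s_slow(values: &[u32], out: &mut Vec<u8>) {
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
}

/// Bulk encoding (hypothetical helper): a single copy on little-endian targets,
/// falling back to the per-element loop everywhere else.
fn encode_u32s_bulk(values: &[u32], out: &mut Vec<u8>) {
    if cfg!(target_endian = "little") {
        // SAFETY: `u32` has no padding, any byte pattern is a valid `u8`,
        // and the byte slice covers exactly the memory owned by `values`.
        let bytes = unsafe {
            std::slice::from_raw_parts(
                values.as_ptr() as *const u8,
                std::mem::size_of_val(values),
            )
        };
        out.extend_from_slice(bytes);
    } else {
        encode_u32s_slow(values, out);
    }
}
```

Both functions produce the same bytes, which is the property that keeps the wire format unchanged; only the number of writes differs.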

## Benchmark Comparison (Combined)
Baseline: `ba72cf4` (master)
Compared branch state: `sat-perf-improvements` (all 8 commits)
Tool: `canbench` (wasm32)

### Key improvements

| Benchmark | Metric | Baseline | Optimized | Change |
|-----------|--------|----------|-----------|--------|
| **vec_nat** | Encoding | 1.10B | **66.54M** | **-93.9% (16.5x)** |
| **vec_nat** | Decoding | 1.49B | **917.08M** | **-38.5%** |
| **vec_nat** | Total | 2.59B | **983.62M** | **-62.0%** |
| **vec_nat32** | Encoding | 136.28M | **16.79M** | **-87.7% (8.1x)** |
| **vec_nat32** | Decoding | 1.06B | **406.87M** | **-61.6%** |
| **vec_nat32** | Total | 1.20B | **423.67M** | **-64.7%** |
| **vec_nat64** | Encoding | 161.45M | **33.57M** | **-79.2% (4.8x)** |
| **vec_nat64** | Decoding | 1.06B | **411.07M** | **-61.2%** |
| **vec_nat64** | Total | 1.22B | **444.64M** | **-63.6%** |
| **vec_int16** | Encoding | 123.70M | **8.40M** | **-93.2% (14.7x)** |
| **vec_int16** | Decoding | 1.07B | **411.07M** | **-61.5%** |
| **vec_int16** | Total | 1.19B | **419.48M** | **-64.8%** |
| **option_list** | Decoding | 28.35M | **23.70M** | **-16.4%** |
| **option_list** | Heap | 2 pages | **1 page** | **-50%** |
| **variant_list** | Decoding | 27.05M | **22.07M** | **-18.4%** |
| **variant_list** | Heap | 2 pages | **1 page** | **-50%** |
| **btreemap** | Encoding | 4.70B | **526.00M** | **-88.8% (8.9x)** |
| **btreemap** | Decoding | 15.74B | **13.41B** | **-14.8%** |
| **btreemap** | Total | 20.44B | **13.94B** | **-31.8%** |
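
For readers checking the Change column, the sketch below shows how the two figures relate. The helper names `percent_change` and `speedup` are illustrative and not part of `canbench`. Because the table shows rounded instruction counts, recomputing from them can differ from the reported value in the last digit.

```rust
/// Percent change from `baseline` to `optimized`; negative means fewer instructions.
fn percent_change(baseline: f64, optimized: f64) -> f64 {
    (optimized - baseline) / baseline * 100.0
}

/// Speedup factor: how many times fewer instructions the optimized run executes.
fn speedup(baseline: f64, optimized: f64) -> f64 {
    baseline / optimized
}
```

For example, `vec_int16` encoding goes from 123.70M to 8.40M instructions, which works out to roughly -93.2% and a 14.7x speedup.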

### Full benchmark table

| status | name | master | optimized | ins Δ% | HI master | HI opt | HI Δ% |
|--------|------|--------|-----------|--------|-----------|--------|--------|
| - | blob | 6.33M | 6.33M | +0.0% | 66 | 66 | +0.0% |
| - | btreemap | 20.44B | 13.94B | -31.8% | 1179 | 1154 | -2.1% |
| - | btreemap::1. Encoding | 4.70B | 526.00M | -88.8% | 159 | 159 | +0.0% |
| - | btreemap::2. Decoding | 15.74B | 13.41B | -14.8% | 1020 | 995 | -2.5% |
| - | extra_args | 3.51M | 2.90M | -17.4% | 0 | 0 | +0.0% |
| - | nns | 26.64M | 25.56M | -4.1% | 2 | 2 | +0.0% |
| - | nns::0. Parsing | 17.86M | 16.84M | -5.7% | 2 | 2 | +0.0% |
| + | nns::1. Encoding | 2.09M | 2.09M | +0.0% | 0 | 0 | +0.0% |
| - | nns::2. Decoding | 5.74M | 5.69M | -0.9% | 0 | 0 | +0.0% |
| - | nns_list_proposal | 77.38M | 74.12M | -4.2% | 17 | 17 | +0.0% |
| + | nns_list_proposal::1. Encoding | 6.88M | 6.89M | +0.1% | 3 | 3 | +0.0% |
| - | nns_list_proposal::2. Decoding | 70.50M | 67.23M | -4.6% | 14 | 14 | +0.0% |
| - | option_list | 36.38M | 24.35M | -33.1% | 2 | 1 | -50.0% |
| - | option_list::1. Encoding | 8.03M | 641.58K | -92.0% | 0 | 0 | +0.0% |
| - | option_list::2. Decoding | 28.35M | 23.70M | -16.4% | 2 | 1 | -50.0% |
| + | text | 12.09M | 12.09M | +0.0% | 99 | 99 | +0.0% |
| - | variant_list | 35.28M | 22.71M | -35.6% | 2 | 1 | -50.0% |
| - | variant_list::1. Encoding | 8.22M | 636.51K | -92.3% | 0 | 0 | +0.0% |
| - | variant_list::2. Decoding | 27.05M | 22.07M | -18.4% | 2 | 1 | -50.0% |
| - | vec_int16 | 1.19B | 419.48M | -64.8% | 261 | 195 | -25.3% |
| - | vec_int16::1. Encoding | 123.70M | 8.40M | -93.2% | 261 | 130 | -50.2% |
| - | vec_int16::2. Decoding | 1.07B | 411.07M | -61.5% | 0 | 65 | — |
| - | vec_nat | 2.59B | 983.62M | -62.0% | 172 | 151 | -12.2% |
| - | vec_nat::1. Encoding | 1.10B | 66.54M | -93.9% | 33 | 33 | +0.0% |
| - | vec_nat::2. Decoding | 1.49B | 917.08M | -38.5% | 139 | 118 | -15.1% |
| - | vec_nat32 | 1.20B | 423.67M | -64.7% | 518 | 387 | -25.3% |
| - | vec_nat32::1. Encoding | 136.28M | 16.79M | -87.7% | 518 | 258 | -50.2% |
| - | vec_nat32::2. Decoding | 1.06B | 406.87M | -61.6% | 0 | 129 | — |
| - | vec_nat64 | 1.22B | 444.64M | -63.6% | 1031 | 771 | -25.2% |
| - | vec_nat64::1. Encoding | 161.45M | 33.57M | -79.2% | 1031 | 514 | -50.1% |
| - | vec_nat64::2. Decoding | 1.06B | 411.07M | -61.2% | 0 | 257 | — |

## Notes
- All optimizations preserve the wire format: serialization output is byte-identical.
- Forward and backward compatibility verified via round-trip tests for all affected types.
- No regressions across all 12 benchmarks; the +0.1% on `nns_list_proposal::1. Encoding` is within noise.
- The `btreemap` improvements come from the `Nat` fast paths (earlier commits); the `Int` fast path further helps `option_list` and `variant_list`.
- Heap reduction in `option_list`/`variant_list` comes from avoiding `BigInt` allocation for small `Int` values.
- `vec_nat` encoding speedup (16.5x) comes from the `Nat::encode` u64 fast path, which avoids `BigUint` arithmetic for small values.
- `vec_nat32`/`vec_nat64` encoding speedups come from bulk memcpy on little-endian platforms.
- `vec_nat` decoding (38.5% improvement) benefits from the `Nat::decode` u64 fast path; further gains are possible by reducing per-element serde overhead.
- HI Δ% entries marked "—" are rows where the baseline had 0 heap pages (all heap was used for encoding), so a percentage change is undefined and the optimized branch's decode-only heap is not comparable.
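
As an illustration of the u64 fast path mentioned in the notes, the sketch below encodes a value as unsigned LEB128 (the wire encoding Candid uses for `nat`) with plain machine-word arithmetic. It is a sketch under assumptions, not the branch's implementation: the name `leb128_encode_u64` is hypothetical, and the check that decides whether a `Nat` fits in a `u64` before taking this path is not shown.

```rust
/// Append `value` to `out` as unsigned LEB128 using only shifts and masks,
/// so no arbitrary-precision division is needed for values that fit in a u64.
fn leb128_encode_u64(mut value: u64, out: &mut Vec<u8>) {
    loop {
        // Take the low 7 bits as the next payload group.
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            // Last group: continuation bit stays clear.
            out.push(byte);
            return;
        }
        // More groups follow: set the continuation bit.
        out.push(byte | 0x80);
    }
}
```

A value that does not fit in a `u64` would still go through the existing arbitrary-precision path, so the output bytes are the same either way.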

## Benchmarks added in #718
New benchmarks (`vec_nat`, `vec_nat32`, `vec_nat64`) were added to master separately in #718 to track these types going forward.
