Optimize allocation, lockless accumulation, and tape flushing by Aethdv · Pull Request #351 · official-clockwork/Clockwork

Aethdv · 2026-01-09T20:24:56Z

Greatly improves the evaltune system by redesigning part of the process, bringing all-around speedups of 16% to 200% depending on the machine setup.

Startup QoL and speedup: Pre-scan the FEN files to count lines, then reserve the positions and results vectors at their exact size upfront to prevent reallocations.
Huge Pages (neutral): Apply MADV_HUGEPAGE to the reserved capacity. No measurable speedup, but including it doesn't hurt.
Concurrency improvement: Replace mutex-based gradient accumulation with per-thread buffers.
Optimized Reduction: Use std::barrier to sum the per-thread gradients and apply optimizer steps, removing lock contention.
Reintroduce Microbatching: Process positions in chunks of 160 to flush the autograd tape (cleanup/zero_grad) frequently. The chunk size was found by trial and error to maximize speed.

Measured per-epoch time on a Ryzen 7 7735HS: 10s (baseline) -> 8.5s -> 5s.
Also fixes memory usage exceeding the expected estimate of ~5.2G: in my case it reached 7.7G (stabilized), and on giovok's machine it could exceed 20GB while loading before stabilizing at around 11GB.

Passed STC nonregr:

Test  | evaltune-optm
Elo   | 0.72 +- 1.90 (95%)
SPRT  | 8.0+0.08s Threads=1 Hash=16MB
LLR   | 2.96 (-2.94, 2.94) [-3.00, 0.00]
Games | N: 43444 W: 10734 L: 10644 D: 22066
Penta | [520, 5205, 10184, 5291, 522]

https://clockworkopenbench.pythonanywhere.com/test/1100/
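
As a sanity check on the table above, the reported Elo can be recomputed from the game counts. The sketch below assumes the usual logistic Elo model and the usual pentanomial convention (game pairs scoring 0, 0.5, 1, 1.5 and 2 points); the function names are illustrative and not taken from OpenBench.

```cpp
#include <cassert>
#include <cmath>

// Match score in [0, 1] from win/loss/draw counts.
double score_from_wdl(long wins, long losses, long draws) {
    return (wins + 0.5 * draws) / static_cast<double>(wins + losses + draws);
}

// Match score from pentanomial counts: penta[k] game pairs scored k/2 points
// out of 2, so the total points are divided by twice the number of pairs.
double score_from_penta(const long (&penta)[5]) {
    double points = 0.0;
    long pairs = 0;
    for (int k = 0; k < 5; ++k) {
        points += 0.5 * k * penta[k];
        pairs += penta[k];
    }
    return points / (2.0 * pairs);
}

// Elo difference implied by a score strictly between 0 and 1 (logistic model).
double elo_from_score(double score) {
    return -400.0 * std::log10(1.0 / score - 1.0);
}
```

With W: 10734, L: 10644, D: 22066, the score is 21767 / 43444 and `elo_from_score` gives roughly 0.72, matching the reported value. The pentanomial row sums to 21722 pairs (43444 games) and yields the same score.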

Bench: 13922193

- **Startup**: Pre-scan FEN files to count lines. Reserve the `positions` and `results` vectors at their exact size upfront to prevent reallocations.
- **Huge Pages**: Apply `MADV_HUGEPAGE` to the reserved capacity. (measured no speedup)
- **Concurrency**: Replace mutex-based gradient accumulation with per-thread buffers.
- **Reduction**: Use `std::barrier` to sum thread gradients and apply optimizer steps, removing lock contention.
- **Tape Management**: Process positions in chunks of 1024 to flush the autograd tape (`cleanup`/`zero_grad`) frequently.
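
The startup change can be illustrated with a two-pass loader: count the lines first, reserve once, then fill. This is a sketch under assumptions, not the tuner's loader: the `Dataset` struct, the function names, and the `<fen>;<result>` line format are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container mirroring the positions and results vectors.
struct Dataset {
    std::vector<std::string> positions;
    std::vector<double> results;
};

// First pass: count non-empty lines so the vectors can be sized exactly.
std::size_t count_lines(std::istream& in) {
    std::size_t n = 0;
    std::string line;
    while (std::getline(in, line))
        if (!line.empty()) ++n;
    return n;
}

// Second pass: reserve once, then fill without any reallocation.
// Assumed line format (illustrative only): "<fen>;<result>".
Dataset load_dataset(std::istream& in) {
    const std::size_t n = count_lines(in);
    in.clear();   // reset the EOF state left by the first pass
    in.seekg(0);  // rewind to the start of the stream

    Dataset data;
    data.positions.reserve(n);
    data.results.reserve(n);

    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        const std::size_t sep = line.rfind(';');
        if (sep == std::string::npos) continue;  // skip malformed lines
        data.positions.push_back(line.substr(0, sep));
        data.results.push_back(std::stod(line.substr(sep + 1)));
    }
    return data;
}
```

Reading the input twice costs an extra sequential scan, but it avoids the repeated grow-and-copy cycles of a vector that starts empty, which is also where transient memory spikes during loading tend to come from.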

Bench: 13922193

Aethdv and others added 5 commits January 9, 2026 15:31

Merge branch 'official-clockwork:main' into evaltune-optm

87daf47

nits & warning fixes

6c8828f

Huge speedup by tuning the microbatch size

ec52fd1

Tuned values

7bc17ef

Bench: 13922193

TheRealGioviok merged commit 1313df4 into official-clockwork:main Jan 14, 2026
28 checks passed

TheRealGioviok merged 5 commits into official-clockwork:main from Aethdv:evaltune-optm

Aethdv commented Jan 9, 2026 • edited by TheRealGioviok
