Optimize allocation, lockless accumulation, and tape flushing #351
Merged
TheRealGioviok merged 5 commits into official-clockwork:main on Jan 14, 2026
Greatly improve the evaltune system by redesigning part of the process. Brings all-around speedups ranging from 16% to 200%, depending on the machine setup.
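- **Startup**: Pre-scan FEN files to count lines. Reserve the `positions` and `results` vectors at exact size upfront to prevent reallocations.
- **Huge Pages**: Apply `MADV_HUGEPAGE` to the reserved capacity (measured no speedup, but including this doesn't hurt).
- **Concurrency**: Replace mutex-based gradient accumulation with per-thread buffers.
- **Reduction**: Use `std::barrier` to sum thread gradients and apply optimizer steps, removing lock contention.
- **Tape Management**: Process positions in chunks of 1024 to flush the autograd tape (`cleanup`/`zero_grad`) frequently. This value was obtained by trial and error to maximize speed.

Measured improvement over baseline on a Ryzen 7 7735HS: 10s -> 8.5s -> 5s per epoch.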
This also fixes memory usage exceeding the expected estimate of ~5.2 GB: in my case it reached 7.7 GB (stabilized), and on giovok's machine it could reach over 20 GB while loading before stabilizing at around 11 GB.
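The pre-scan-then-reserve startup step can be sketched roughly as below. This is an illustration under assumptions, not the PR's loader: the `Dataset` struct, the `count_lines` and `load` helpers, and the `<fen>;<result>` line format are all hypothetical. The idea is that a single exact `reserve` avoids the repeated grow-and-copy cycles that can push peak memory well above the final size.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record layout; the real tuner's types are not shown here.
struct Dataset {
    std::vector<std::string> positions;
    std::vector<double> results;
};

// First pass: count non-empty lines so the vectors can be sized once.
std::size_t count_lines(std::istream& in) {
    std::size_t n = 0;
    std::string line;
    while (std::getline(in, line))
        if (!line.empty()) ++n;
    return n;
}

// Second pass: reserve the exact capacity, then fill without reallocating.
// Assumes each line looks like "<fen>;<result>" (an illustrative format only).
Dataset load(std::istream& in) {
    const std::size_t n = count_lines(in);
    in.clear();                  // clear EOF so the stream can be rewound
    in.seekg(0, std::ios::beg);

    Dataset d;
    d.positions.reserve(n);
    d.results.reserve(n);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        const auto sep = line.rfind(';');
        if (sep == std::string::npos) continue;  // skip malformed lines
        d.positions.push_back(line.substr(0, sep));
        d.results.push_back(std::stod(line.substr(sep + 1)));
    }
    return d;
}
```

Reading the file twice costs extra I/O at startup, but the count is cheap compared with parsing, and it gives both vectors a known final size.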
Passed STC nonregr:
https://clockworkopenbench.pythonanywhere.com/test/1100/
Bench: 13922193