Feature/performance change: lazy parsing #16
We ran into a problem when parsing somewhat sizable EDI files (>25 MB): the parser blew through our 2 GB memory limit.

After a bit of fiddling it turned out that the buffering the parser does, both in the `Tokenizer` and in `convertTokensToSegments`, is quite heavy on memory. For our 26 MB test file the `getTokens` buffer alone added about 1650 MB, which was not released until all segments had been produced. The OOM then hit soon after, somewhere in `convertTokensToSegments`.

Since `convertTokensToSegments` already returns a generator, I extended that concept: `TokenizerInterface` now returns an iterable and yields tokens on demand. `convertTokensToSegments` consumes this generator, pulls tokens on demand, and, instead of first collecting all segments and yielding them when done, immediately yields each segment before starting to read the next one.

This allows the consumer to start processing segments pretty much instantly and reduces the parser's memory overhead from ~90× the file size to ~1× the file size.
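Roughly, the shape of the change looks like this (a simplified sketch; the `Segment` type, the character-level "tokens", and the EDIFACT `'` terminator handling here are illustrative stand-ins, not the library's actual internals):

```php
<?php

// Simplified sketch of the lazy pipeline; the real Token/Segment types
// and tokenizer internals in the library look different.

final class Segment
{
    /** @param string[] $elements */
    public function __construct(public readonly array $elements) {}
}

/**
 * Yield raw tokens on demand instead of buffering them all in an array
 * (the old behavior kept every token in memory until parsing finished).
 */
function getTokens(string $edi): \Generator
{
    foreach (str_split($edi) as $char) {
        yield $char; // one hypothetical "token" per character
    }
}

/**
 * Consume the token generator and yield each segment as soon as it is
 * complete, instead of collecting all segments first.
 */
function convertTokensToSegments(iterable $tokens): \Generator
{
    $current = '';
    foreach ($tokens as $token) {
        if ($token === "'") { // default EDIFACT segment terminator
            yield new Segment(explode('+', $current));
            $current = '';
        } else {
            $current .= $token;
        }
    }
}

// Only one token and one partially built segment are in memory at a time.
$edi = "UNB+UNOC:3+SENDER+RECEIVER'UNH+1+ORDERS:D:96A:UN'";
foreach (convertTokensToSegments(getTokens($edi)) as $segment) {
    echo implode(' / ', $segment->elements), PHP_EOL;
}
```

Because both stages are generators, the consumer pulls the pipeline forward one segment at a time and nothing accumulates behind it.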
In my tests this also improves parsing time for large files by ~20%, since very little time is spent on memory allocation.
Here are some numbers from the benchmarks I ran:
The phpunit performance test showed no significant change for me.
And here is bench.php, so you can see what is actually benchmarked:
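(The script itself is not reproduced in this excerpt. A minimal harness along these lines, reusing the `convertTokensToSegments()` sketch above and reporting wall time plus PHP's peak memory for a full parse, would capture the same idea; `streamTokens()` is a stand-in, not the PR's code:)

```php
<?php

// Hypothetical benchmark, not the actual bench.php from this PR.

/** Stream tokens (here: single characters) from a file on demand. */
function streamTokens(string $path): \Generator
{
    $fh = fopen($path, 'rb');
    while (($char = fgetc($fh)) !== false) {
        yield $char;
    }
    fclose($fh);
}

$file  = $argv[1] ?? 'large.edi';
$start = microtime(true);

$count = 0;
foreach (convertTokensToSegments(streamTokens($file)) as $segment) {
    $count++; // consume lazily; the benchmark itself buffers nothing
}

printf(
    "%d segments in %.2f s, peak memory %.1f MB\n",
    $count,
    microtime(true) - $start,
    memory_get_peak_usage(true) / 1048576
);
```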
I am aware that this hides a crucial step: someone, somewhere, has to do something with these segments and will likely want to buffer some of them.

My experience, at least from other applications, is that whatever the consumer does with those segments uses less memory than the token buffer did. Consumers can also process them in a lazy, iterative fashion, e.g. handling a full message before starting the next one, as sketched below. So peak memory can be lowered this way, which is what we wanted to achieve.
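For illustration, a consumer that buffers only one message at a time might look like this (the UNH/UNT envelope is standard EDIFACT, but `getName()` and `handleMessage()` are assumptions, not the library's API):

```php
<?php

// Hypothetical consumer: buffers segments only until the current message
// ends, so peak memory is bounded by one message rather than the file.

function processMessages(iterable $segments): void
{
    $message = [];
    foreach ($segments as $segment) {
        $message[] = $segment;
        if ($segment->getName() === 'UNT') { // UNT closes an EDIFACT message
            handleMessage($message);         // e.g. map to a domain object
            $message = [];                   // release the buffer
        }
    }
}
```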
BC Break:
I think this, if accepted, requires releasing a new major version, since `TokenizerInterface::getToken` changes its return type in an incompatible way.
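For reference, the shape of the incompatible change (sketched from the description above, not copied from the diff):

```php
<?php

// Before (sketch): tokens came back as a fully buffered array.
//
//     /** @return Token[] */
//     public function getToken(): array;

interface TokenizerInterface
{
    // After (sketch): tokens are yielded on demand, so every existing
    // implementation of the interface has to change its return type.
    /** @return iterable<Token> */
    public function getToken(): iterable;
}
```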