Conversation

@themasch (Contributor) commented Aug 21, 2025

We ran into a problem when parsing somewhat sizable EDI files (>25 MB): the parser blew through our 2 GB memory limit.

After a bit of fiddling it turned out that the buffering the parser does, both in the Tokenizer and in convertTokensToSegments, is quite heavy on memory. For our 26 MB test file the getTokens buffer alone added about 1,650 MB, which would not be released until all segments had been produced. The OOM then hit soon afterwards, somewhere in convertTokensToSegments.

Since convertTokensToSegments already returns a generator, I extended that concept: TokenizerInterface::getTokens now returns an iterable and yields tokens on demand.

convertTokensToSegments consumes this iterable, pulling tokens on demand, and, instead of first collecting all segments and only yielding them when done, it yields each segment immediately before starting to read the next one.
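To make the shape of the change concrete, here is a minimal, hypothetical sketch of the pattern. The names and delimiters are simplified for brevity; this is not the actual library code.

<?php

// Hypothetical sketch: a lazy tokenizer and a segment builder that yields
// each segment as soon as it is complete. Delimiters are hard-coded here;
// the real Tokenizer derives them from the message.

/** @return iterable<string> yields one raw token at a time */
function lazyTokens(string $input): iterable
{
    $buffer = '';
    $length = strlen($input);
    for ($i = 0; $i < $length; $i++) {
        $char = $input[$i];
        if ($char === '+' || $char === "'") {
            yield $buffer;   // data token
            yield $char;     // control token (separator / terminator)
            $buffer = '';
        } else {
            $buffer .= $char;
        }
    }
}

/** @return iterable<array> yields one segment (array of elements) at a time */
function segmentsFromTokens(iterable $tokens): iterable
{
    $segment = [];
    foreach ($tokens as $token) {
        if ($token === "'") {
            yield $segment;  // hand the segment to the consumer right away
            $segment = [];   // nothing stays buffered between segments
        } elseif ($token !== '+') {
            $segment[] = $token;
        }
    }
}

// Only one segment (plus one token) is held in memory at any time.
foreach (segmentsFromTokens(lazyTokens("UNB+UNOC:3+SENDER'UNH+1+ORDERS'")) as $segment) {
    // process $segment, then forget it
}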

This allows the consumer to start processing segments almost instantly and reduces the parser's memory overhead from roughly 90× the file size to roughly 1×.

In my tests this reduces parsing time for large files by ~20%, since very little time is spent on memory allocation.

Here are some numbers from benchmarks I ran for this:

# first the current main branch
> php -dmemory_limit=6G bench.php  AN_UNGODLY_AMOUNT_OF_EDI
current memory:     27.0 MB| peak memory:     27.1 MB | after reading file
current memory:    187.9 MB| peak memory:  2,412.9 MB | after parsing all segments
parsing all segments took 10787 ms

### next with just the first commit, changing Tokenizer
# again 3 runs to check for variance
> php -dmemory_limit=6G bench.php  AN_UNGODLY_AMOUNT_OF_EDI
current memory:     27.1 MB| peak memory:     27.1 MB | after reading file
current memory:     53.7 MB| peak memory:    833.7 MB | after parsing all segments
parsing all segments took 9173 ms
> php -dmemory_limit=6G bench.php  AN_UNGODLY_AMOUNT_OF_EDI
current memory:     27.0 MB| peak memory:     27.1 MB | after reading file
current memory:     53.7 MB| peak memory:    833.7 MB | after parsing all segments
parsing all segments took 8892 ms
> php -dmemory_limit=6G bench.php  AN_UNGODLY_AMOUNT_OF_EDI
current memory:     27.0 MB| peak memory:     27.1 MB | after reading file
current memory:     53.7 MB| peak memory:    833.7 MB | after parsing all segments
parsing all segments took 9026 ms

### and this is both changes together
# again 3 runs to check for variance
> php -dmemory_limit=6G bench.php  AN_UNGODLY_AMOUNT_OF_EDI
current memory:     27.0 MB| peak memory:     27.1 MB | after reading file
current memory:     53.7 MB| peak memory:     53.7 MB | after parsing all segments
parsing all segments took 8419 ms
> php -dmemory_limit=6G bench.php  AN_UNGODLY_AMOUNT_OF_EDI
current memory:     27.0 MB| peak memory:     27.1 MB | after reading file
current memory:     53.7 MB| peak memory:     53.7 MB | after parsing all segments
parsing all segments took 8211 ms
> php -dmemory_limit=6G bench.php  AN_UNGODLY_AMOUNT_OF_EDI
current memory:     27.0 MB| peak memory:     27.1 MB | after reading file
current memory:     53.7 MB| peak memory:     53.7 MB | after parsing all segments
parsing all segments took 8316 ms

The phpunit performance test showed no significant change in performance for me.

And here's bench.php, so you can see what's benchmarked:

<?php

require_once __DIR__ . '/vendor/autoload.php';

$parser = new \Estrato\Edifact\Parser();
$data = file_get_contents($argv[1]);

printf(
    "current memory: %8s MB| peak memory: %8s MB | after reading file\n",
    number_format(memory_get_usage() / 1_000_000, 1),
    number_format(memory_get_peak_usage() / 1_000_000, 1)
);

$start = microtime(true);
foreach ($parser->parse($data) as $segment) {
    // do something clever with segment but forget it ASAP
}
$duration = microtime(true) - $start;

printf(
    "current memory: %8s MB| peak memory: %8s MB | after parsing all segments\n",
    number_format(memory_get_usage() / 1_000_000, 1),
    number_format(memory_get_peak_usage() / 1_000_000, 1)
);

printf("parsing all segments took %d ms\n", (int)($duration * 1000));

I am aware that this hides a crucial step: someone, somewhere, has to do something with these segments and will likely want to buffer some of them.
My experience, at least from other applications, is that whatever the consumer does with those segments uses less memory than the parser did. Consumers may also process them in a lazy, iterative fashion, e.g. handling one full message before starting the next one. So peak memory can be lowered this way, which is what we wanted to achieve.
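For example, a consumer could buffer only one message at a time. This is only a sketch; segmentName() and handleMessage() are placeholders for whatever accessor and processing your application actually uses.

<?php

// Sketch of a lazy consumer (not part of this PR): buffer segments only
// until the UNT message trailer, process the message, then drop it.

require_once __DIR__ . '/vendor/autoload.php';

/**
 * Placeholder, assumed for this sketch: adapt to however the parser's
 * segments actually expose their tag.
 */
function segmentName($segment): string
{
    return is_array($segment) ? (string) ($segment[0] ?? '') : '';
}

/** Hypothetical application callback. */
function handleMessage(array $segments): void
{
    // ... application-specific processing ...
}

$parser = new \Estrato\Edifact\Parser();
$message = [];

foreach ($parser->parse(file_get_contents($argv[1])) as $segment) {
    $message[] = $segment;
    if (segmentName($segment) === 'UNT') {
        handleMessage($message); // process one complete message
        $message = [];           // release it before reading on
    }
}

if ($message !== []) {
    handleMessage($message);     // trailing segments without a UNT, if any
}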

BC Break:

I think this, if accepted, requires releasing a new major version, since TokenizerInterface::getTokens changes its return type in an incompatible way.
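For reference, the incompatible part of the interface change looks roughly like this (simplified sketch; any parameters getTokens takes are omitted, only the return type is the point here):

<?php

interface TokenizerInterface
{
    // before: all tokens buffered into one array
    // public function getTokens(): array;

    // after: tokens are produced on demand
    public function getTokens(): iterable;
}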

Mark Schmale added 2 commits August 21, 2025 13:55
This allows the tokenizer to be lazy and only produce the next token when the parser needs it. This reduces the amount of memory used for buffering all tokens before starting to convert them into segments.

This is a BC break, because the API of the TokenizerInterface changes: getTokens no longer returns an array, but any iterable.
Let the parser emit each segment as soon as it starts reading the next one. There is no need to buffer all of them before we start emitting.

This, again, reduces the peak memory used before the consumer can start processing segments.