BENCHMARK(benchmark-jmh): Add benchmark for Operations.determinize() #15232
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
Some of these regexps may now come out deterministic, or even minimal and deterministic, straight from the parsing phase (this is a good thing!), so we may have to get more creative with the regexps to force determinize() to do actual work. In that case, the best way to check the regexes is to write little throwaway unit tests similar to lucene/lucene/core/src/test/org/apache/lucene/util/automaton/TestRegExpParsing.java, lines 527 to 533 in 0020946.
Basically, if the result from lucene/lucene/test-framework/src/java/org/apache/lucene/tests/util/automaton/AutomatonTestUtil.java Lines 394 to 413 in de1ed71
@rmuir Thank you so much for your help. I confirmed the hypothesis that the original benchmark patterns were already optimised to DFAs by Lucene's parser, causing Operations.determinize() to be a no-op.
I wrote a throwaway test script using the assertCleanDFA method as you suggested, extracted and tested 27 patterns from OpenJDK's test suite (from the SO post), and found 11 NFA patterns that force determinization work.
I also noticed something very interesting: the two patterns explicitly marked as "Nondeterministic group" in OpenJDK's test suite seem to be optimized to DFAs by Lucene's parser (≈ 10⁻⁶ ms/op), contradicting OpenJDK's "nondeterministic" label:
// Nondeterministic group
(a+b)+
(a|b)+
I've updated RegexDeterminizeBenchmark.java with the 11 verified NFA patterns, ran the benchmark, and committed the results. All patterns now show meaningful determinization work (0.001-0.014 ms/op vs ≈ 10⁻⁶ ms/op previously), so the benchmark appears to be working correctly now. Thank you again for your help, and please let me know if any additional changes are needed!
Introduces a new JMH benchmark for Operations.determinize() to provide performance data for future optimizations, addressing issue #11025.
The benchmark measures the performance of the NFA-to-DFA conversion against a curated set of regular expressions.
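To make the distinction discussed in this thread concrete, here is a minimal, self-contained sketch of textbook subset construction in plain Java. It is not Lucene's implementation of Operations.determinize(), and the class, record, and method names (SubsetConstructionSketch, Edge, isDeterministic, determinize) are hypothetical names chosen for illustration. It assumes a start state of 0 and no epsilon transitions. It shows why an input that is already deterministic gives the conversion nothing to merge, while a genuinely nondeterministic input yields DFA states that stand for several NFA states at once.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SubsetConstructionSketch {

  /** One NFA transition: from --label--> to. Hypothetical type for this sketch only. */
  public record Edge(int from, char label, int to) {}

  /** True if no state has two transitions on the same label leading to different targets. */
  public static boolean isDeterministic(List<Edge> edges) {
    Map<String, Integer> seen = new HashMap<>();
    for (Edge e : edges) {
      Integer previous = seen.putIfAbsent(e.from() + ":" + e.label(), e.to());
      if (previous != null && previous != e.to()) {
        return false;
      }
    }
    return true;
  }

  /**
   * Textbook subset construction from start state 0 (no epsilon transitions).
   * Each key of the result is the set of NFA states that one DFA state stands for.
   */
  public static Map<Set<Integer>, Map<Character, Set<Integer>>> determinize(List<Edge> edges) {
    Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
    Deque<Set<Integer>> work = new ArrayDeque<>();
    Set<Integer> start = new TreeSet<>(Set.of(0));
    dfa.put(start, new HashMap<>());
    work.add(start);
    while (!work.isEmpty()) {
      Set<Integer> current = work.poll();
      Map<Character, Set<Integer>> moves = dfa.get(current);
      // Group, by label, every target reachable from any member of the current set.
      for (Edge e : edges) {
        if (current.contains(e.from())) {
          moves.computeIfAbsent(e.label(), k -> new TreeSet<>()).add(e.to());
        }
      }
      // Each target set not seen before becomes a new DFA state to explore.
      for (Set<Integer> target : moves.values()) {
        if (!dfa.containsKey(target)) {
          dfa.put(target, new HashMap<>());
          work.add(target);
        }
      }
    }
    return dfa;
  }

  public static void main(String[] args) {
    // Already deterministic: accepts "ab" via 0 -a-> 1 -b-> 2.
    List<Edge> alreadyDfa = List.of(new Edge(0, 'a', 1), new Edge(1, 'b', 2));
    // Nondeterministic: state 0 has two 'a' transitions (to 0 and to 1).
    List<Edge> nfa = List.of(
        new Edge(0, 'a', 0), new Edge(0, 'b', 0), new Edge(0, 'a', 1), new Edge(1, 'b', 2));

    System.out.println(isDeterministic(alreadyDfa) + " " + determinize(alreadyDfa).keySet());
    System.out.println(isDeterministic(nfa) + " " + determinize(nfa).keySet());
  }
}
```

For the deterministic input every resulting DFA state is a singleton set, so the output mirrors the input. For the nondeterministic input the construction has to build the merged states {0, 1} and {0, 2}, which is the kind of work a determinization benchmark needs its patterns to trigger.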