BENCHMARK(benchmark-jmh): Add benchmark for Operations.determinize() #15232

Jacky040124 · 2025-09-25T16:16:26Z

Introduces a new JMH benchmark for Operations.determinize() to provide performance data for future optimizations, addressing issue #11025.

The benchmark measures the performance of the NFA-to-DFA conversion against a curated set of regular expressions.

github-actions · 2025-09-25T16:17:20Z

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

rmuir · 2025-09-25T19:33:03Z

Some of these regexps may now come out deterministic, or even minimal and deterministic, straight from the parsing phase (this is a good thing!), so we may have to get more creative with the regexps to force determinize() to do actual work.

In that case, Operations.determinize() is a no-op, which I think is why you see some of the crazy-fast numbers here.
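To illustrate why an already-deterministic input costs almost nothing, here is a minimal, self-contained sketch of subset construction with an early return. It is a toy model only: `ToyAutomaton` and its methods are hypothetical names invented for this example, it handles single-character labels without epsilon transitions, and it is not Lucene's `Automaton` or the actual `Operations.determinize()` implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy automaton over single chars (hypothetical; NOT Lucene's Automaton). State 0 is initial. */
final class ToyAutomaton {
  // transitions.get(state) maps a label to the set of target states
  final List<Map<Character, Set<Integer>>> transitions = new ArrayList<>();
  final Set<Integer> accept = new TreeSet<>();

  int addState(boolean isAccept) {
    transitions.add(new TreeMap<>());
    int state = transitions.size() - 1;
    if (isAccept) accept.add(state);
    return state;
  }

  void addTransition(int from, char label, int to) {
    transitions.get(from).computeIfAbsent(label, k -> new TreeSet<>()).add(to);
  }

  int numStates() {
    return transitions.size();
  }

  /** In this toy model, deterministic means no state has two targets for one label. */
  boolean isDeterministic() {
    for (Map<Character, Set<Integer>> byLabel : transitions) {
      for (Set<Integer> targets : byLabel.values()) {
        if (targets.size() > 1) return false;
      }
    }
    return true;
  }

  /** Subset construction; returns the input unchanged if it is already deterministic. */
  static ToyAutomaton determinize(ToyAutomaton nfa) {
    if (nfa.isDeterministic()) return nfa; // the "no-op" path a benchmark would measure

    ToyAutomaton dfa = new ToyAutomaton();
    Map<Set<Integer>, Integer> ids = new HashMap<>(); // NFA state subset -> DFA state id
    Deque<Set<Integer>> work = new ArrayDeque<>();

    Set<Integer> start = new TreeSet<>();
    start.add(0);
    ids.put(start, dfa.addState(nfa.accept.contains(0)));
    work.add(start);

    while (!work.isEmpty()) {
      Set<Integer> subset = work.poll();
      int from = ids.get(subset);
      // Union the targets of every member state, grouped by label.
      Map<Character, Set<Integer>> merged = new TreeMap<>();
      for (int s : subset) {
        nfa.transitions.get(s)
            .forEach((c, t) -> merged.computeIfAbsent(c, k -> new TreeSet<>()).addAll(t));
      }
      for (Map.Entry<Character, Set<Integer>> e : merged.entrySet()) {
        Set<Integer> target = e.getValue();
        Integer id = ids.get(target);
        if (id == null) {
          // First time we see this subset: it becomes a new DFA state.
          id = dfa.addState(target.stream().anyMatch(nfa.accept::contains));
          ids.put(target, id);
          work.add(target);
        }
        dfa.addTransition(from, e.getKey(), id);
      }
    }
    return dfa;
  }
}
```

With this model, a benchmark over inputs that are already deterministic only ever times the `isDeterministic()` check, whereas an input with two `a` transitions out of the same state forces the work-list loop to run.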

The best way to check the regexes is to write little throwaway unit tests similar to:

lucene/lucene/core/src/test/org/apache/lucene/util/automaton/TestRegExpParsing.java

Lines 527 to 533 in 0020946

    
    public void testRepeat0() {
      RegExp re = new RegExp("a*");
      assertEquals("(a)*", re.toString());
      assertEquals(String.join("\n", "REGEXP_REPEAT", "  REGEXP_CHAR char=a\n"), re.toStringTree());
      Automaton actual = re.toAutomaton();
      AutomatonTestUtil.assertMinimalDFA(actual);

Basically, if the result from toAutomaton() passes assertCleanDFA(), then you know it is already a DFA and determinize() won't do anything. See the assertions here:

lucene/lucene/test-framework/src/java/org/apache/lucene/tests/util/automaton/AutomatonTestUtil.java

Lines 394 to 413 in de1ed71

    
    /** Asserts that an automaton is a minimal DFA. */
    public static void assertMinimalDFA(Automaton automaton) {
      assertCleanDFA(automaton);
      Automaton minimized = minimizeSimple(automaton);
      assertEquals(minimized.getNumStates(), automaton.getNumStates());
    }

    /** Asserts that an automaton is a DFA with no dead states */
    public static void assertCleanDFA(Automaton automaton) {
      assertCleanNFA(automaton);
      assertTrue("must be deterministic", automaton.isDeterministic());
    }

    /** Asserts that an automaton has no dead states */
    public static void assertCleanNFA(Automaton automaton) {
      assertFalse(
          "has dead states reachable from initial", Operations.hasDeadStatesFromInitial(automaton));
      assertFalse("has dead states leading to accept", Operations.hasDeadStatesToAccept(automaton));
      assertFalse("has unreachable dead states (ghost states)", Operations.hasDeadStates(automaton));
    }

Jacky040124 · 2025-10-02T03:24:29Z

@rmuir Thank you so much for your help.

I confirmed the hypothesis that the original benchmark patterns were already optimised to DFAs by Lucene's parser, causing Operations.determinize() to be a no-op.

Pattern: a+                        → Clean DFA
Pattern: .*                        → Clean DFA
Pattern: [0-9]+                    → Clean DFA
Pattern: a(b+|c+)d                 → Clean DFA
Pattern: (cat|dog|bird|fish|mouse) → Clean DFA
Pattern: (a+)+                     → NFA (0.003 ms/op), the only pattern that was not already a DFA

I wrote a throwaway test script using the assertCleanDFA method as you suggested, extracted and tested 27 patterns from OpenJDK's test suite (from the SO post), and found 11 NFA patterns that force determinization work:

^(a)?a, ((a|b)?b)+, (aaa)?aaa, ^(a(b(c)?)?)?abc,
(a+)+, (a*)+, (b+)+, (|f)?+, (y+)*,
(foo|foobar)*, (aa+|bb+)+

I also noticed something interesting: the two patterns explicitly marked as "Nondeterministic group" in OpenJDK's test suite seem to be optimized to DFAs by Lucene's parser (≈ 10⁻⁶ ms/op), contradicting OpenJDK's "nondeterministic" label:

// Nondeterministic group
(a+b)+
(a|b)+

I've updated RegexDeterminizeBenchmark.java with the 11 verified NFA patterns, ran the benchmark, and committed the results. All patterns now show meaningful determinization work (0.001-0.014 ms/op, i.e. roughly 1-14 µs per call, vs ≈ 10⁻⁶ ms/op, about 1 ns, previously). The benchmark appears to be working correctly now. Please let me know if any additional changes are needed.

Thank you again for your help!

Jacky040124 and others added 3 commits September 17, 2025 13:39

Add benchmark for Operations.determinize() (apache#11025)

e00e2b5

test: include test result

d472f7a

Merge branch 'apache:main' into main

652a7d5

Merge branch 'main' of https://github.com/Jacky040124/lucene

dbfefa5

iverase force-pushed the main branch from 978b2d3 to 38fc368 Compare September 30, 2025 14:57

test: update test cases

9ed2177
