Skip to content

Conversation

Jacky040124
Copy link

Introduces a new JMH benchmark for Operations.determinize() to provide performance data for future optimizations, addressing issue #11025.

The benchmark measures the performance of the NFA-to-DFA conversion against a curated set of regular expressions.

Copy link
Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

Copy link
Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@rmuir
Copy link
Member

rmuir commented Sep 25, 2025

For some of these regexps, they may now come out deterministic or even minimal+deterministic to begin with at the parsing phase (this is a good thing!): we may have to get more creative with the regexps to force the determinize() to do actual work.

In such a case, the Operations.determinize() is a no-op, which is why I think you see some of the crazy-fast numbers here.

Best way to check out the regexes is to just write little throwaway unit-tests similar to:

public void testRepeat0() {
RegExp re = new RegExp("a*");
assertEquals("(a)*", re.toString());
assertEquals(String.join("\n", "REGEXP_REPEAT", " REGEXP_CHAR char=a\n"), re.toStringTree());
Automaton actual = re.toAutomaton();
AutomatonTestUtil.assertMinimalDFA(actual);

Basically, if the result from toAutomaton() passes assertCleanDFA(), then you know it is already a DFA and determinize() wont do anything. See the assertions here:

/** Asserts that an automaton is a minimal DFA. */
public static void assertMinimalDFA(Automaton automaton) {
assertCleanDFA(automaton);
Automaton minimized = minimizeSimple(automaton);
assertEquals(minimized.getNumStates(), automaton.getNumStates());
}
/** Asserts that an automaton is a DFA with no dead states */
public static void assertCleanDFA(Automaton automaton) {
assertCleanNFA(automaton);
assertTrue("must be deterministic", automaton.isDeterministic());
}
/** Asserts that an automaton has no dead states */
public static void assertCleanNFA(Automaton automaton) {
assertFalse(
"has dead states reachable from initial", Operations.hasDeadStatesFromInitial(automaton));
assertFalse("has dead states leading to accept", Operations.hasDeadStatesToAccept(automaton));
assertFalse("has unreachable dead states (ghost states)", Operations.hasDeadStates(automaton));
}

Copy link
Contributor

github-actions bot commented Oct 2, 2025

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@Jacky040124
Copy link
Author

@rmuir Thank you so much for your help.

I confirmed the hypothesis that the original benchmark patterns were already optimised to DFAs by Lucene's parser, causing Operations.determinize() to be a no-op.

Pattern: a+                    → Clean DFA
Pattern: .*                    → Clean DFA
Pattern: [0-9]+                → Clean DFA 
Pattern: a(b+|c+)d             → Clean DFA
Pattern: (cat|dog|bird|fish|mouse) → Clean DFA 
Only pattern: (a+)+           → NFA (0.003 ms/op)

I've written a thrown away test script using the assertCleanDFA method as you suggested, extracted and tested 27 patterns from OpenJDK's test suite (from the SO post), and found 11 NFA patterns that force determinization work:

^(a)?a, ((a|b)?b)+, (aaa)?aaa, ^(a(b(c)?)?)?abc,
(a+)+, (a*)+, (b+)+, (|f)?+, (y+)*,
(foo|foobar)*, (aa+|bb+)+

Also notice something very interesting, the two patterns explicitly marked as "Nondeterministic group" in OpenJDK's test suite seems to be optimized to DFAs by Lucene's parser (≈ 10⁻⁶ ms/op), contradicting OpenJDK's "nondeterministic" label

// Nondeterministic group
(a+b)+
(a|b)+

I've updated RegexDeterminizeBenchmark.java with the 11 verified NFA patterns, ran the benchmark, and committed the results. All patterns now show meaningful determination work (0.001-0.014 ms/op vs ≈ 10⁻⁶ ms/op previously). The benchmark appears to be working correctly now. Please let me know if any additional changes are needed.

Thank you again for your help, please let me know if any extra change is needed !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants