fix ci flakiness leaderelection test #4388

Draft
epugh wants to merge 4 commits into apache:main from epugh:copilot/fix-ci-flakiness-leaderelection-test

epugh commented May 2, 2026

I found this while running some tests on Crave:

./gradlew :solr:core:test --tests "org.apache.solr.cloud.LeaderElectionIntegrationTest.testSimpleSliceLeaderElection" "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1 -XX:ReservedCodeCacheSize=120m" -Ptests.seed=F08428E6CA23E0C6 -Ptests.timeoutSuite=600000! -Ptests.useSecurityManager=true -Ptests.file.encoding=UTF-8

I asked Copilot for a tip on what was failing, and this is what it gave me. I believe I've seen essentially this same timing fix proposed for other flaky tests.

Copilot AI and others added 2 commits May 2, 2026 14:08
…erElection

- Increase cluster.waitForActiveCollection timeout from 10s to 60s
- Replace ad-hoc polling loop after expireZkSession with waitForState
  (waits until leader moves away from the expired-session node)
- Replace Thread.sleep with waitForState for node rejoining live nodes
- Replace final polling loop + assertEquals with waitForState
  (waits until original node becomes leader again)

Agent-Logs-Url: https://github.com/epugh/solr/sessions/1eab1dea-7bf6-4911-93ff-03f3c6614cfd

Co-authored-by: epugh <22395+epugh@users.noreply.github.com>
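
The idea behind the bullets above can be sketched without Solr: rather than sleeping for a fixed interval and hoping the cluster has settled, the test waits on a predicate over the observed state and fails only when a generous timeout expires. This is a minimal standalone illustration, not Solr's `waitForState` API; `StateWaiter`, `waitFor`, the node names, and the 200 ms simulated election delay are all hypothetical.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class StateWaiter {

  /** Polls the supplied state until the predicate holds or the timeout elapses. */
  static <T> T waitFor(Supplier<T> state, Predicate<T> condition, long timeout, TimeUnit unit)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (true) {
      T current = state.get();
      if (condition.test(current)) {
        return current; // condition met: return immediately, no extra sleeping
      }
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException("Condition not met; last state: " + current);
      }
      Thread.sleep(10); // short poll interval (illustrative value)
    }
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical stand-in for "current shard leader".
    AtomicReference<String> leader = new AtomicReference<>("node1");

    // Simulate an election that completes after an unpredictable delay.
    Thread election = new Thread(() -> {
      try {
        Thread.sleep(200);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      leader.set("node2");
    });
    election.start();

    // Wait until leadership has moved away from node1, up to 60 seconds.
    String newLeader = waitFor(leader::get, l -> !"node1".equals(l), 60, TimeUnit.SECONDS);
    System.out.println("Leader moved to " + newLeader);
    election.join();
  }
}
```

A fixed `Thread.sleep` is either too short on a slow CI machine or wastefully long on a fast one; a predicate-based wait adapts to both.
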
epugh marked this pull request as draft May 2, 2026 14:15
createCollection(collection);

- cluster.waitForActiveCollection(collection, 10, TimeUnit.SECONDS, 2, 6);
+ cluster.waitForActiveCollection(collection, 60, TimeUnit.SECONDS, 2, 6);
Contributor Author

I believe this just sets an overall upper limit on how long the test waits before failing!
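
To illustrate that point: a bounded wait typically returns as soon as its condition is satisfied, so raising the bound should only lengthen runs that would otherwise have failed. The sketch below uses the JDK's `CountDownLatch` rather than Solr code, and assumes `waitForActiveCollection` behaves like other bounded waits; `BoundedWaitDemo`, `timeBoundedWait`, and the 100 ms readiness delay are illustrative.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class BoundedWaitDemo {

  /** Waits up to boundSeconds for a worker to signal readiness; returns elapsed milliseconds. */
  static long timeBoundedWait(long delayMillis, long boundSeconds) throws InterruptedException {
    CountDownLatch ready = new CountDownLatch(1);

    // Hypothetical stand-in for a collection becoming active after some delay.
    Thread worker = new Thread(() -> {
      try {
        Thread.sleep(delayMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      ready.countDown();
    });
    worker.start();

    long start = System.nanoTime();
    // await returns true as soon as the latch reaches zero, false only on timeout.
    boolean becameReady = ready.await(boundSeconds, TimeUnit.SECONDS);
    long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    worker.join();

    if (!becameReady) {
      throw new IllegalStateException("Timed out after " + boundSeconds + "s");
    }
    return elapsedMillis;
  }

  public static void main(String[] args) throws InterruptedException {
    // Both calls should take roughly the readiness delay, not the bound.
    System.out.println("10s bound: waited ~" + timeBoundedWait(100, 10) + " ms");
    System.out.println("60s bound: waited ~" + timeBoundedWait(100, 60) + " ms");
  }
}
```
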


// make sure we have waited long enough for the first leader to have come back
Thread.sleep(ZkTestServer.TICK_TIME * 2 + 100);
// Wait until leadership has moved away from the expired-session node
Contributor Author

I like using waitForState instead of Thread.sleep.

Copilot AI and others added 2 commits May 2, 2026 14:28
…nd ZkShardTermsRecoveryTest

Agent-Logs-Url: https://github.com/epugh/solr/sessions/e666fe4d-9dd3-4036-90d5-2d08cbdec281

Co-authored-by: epugh <22395+epugh@users.noreply.github.com>
…in LeaderElectionIntegrationTest

Agent-Logs-Url: https://github.com/epugh/solr/sessions/e666fe4d-9dd3-4036-90d5-2d08cbdec281

Co-authored-by: epugh <22395+epugh@users.noreply.github.com>
2 participants