Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does / why we need it: At QDR, we've recently see Dataverse stop responding without any out of memory or high load issues.
The one thing I've seen in the log that is potentially related are errors related to the postgres connection:
Showing AI how the DataSource was configured, it pointed to an issue where we are specifying a pooling connector while Payara is also setting up a pool, causing duplication and potential issues under high load.
This PR switches to the recommended "Simple" connector and adds a new "fish.payara.fail-all-connections=${MPCONFIG=dataverse.db.fail-all-connections:true}" option that seems useful for recovering more quickly when connections are going bad.
While the relevant Dataverse code hasn't changed lately, it's possible that the update to Payara 7/new Eclipselink has changed something that is making the problem more visible.
Even if this is not related to the problem we're seeing at QDR, it seems like a useful update.
Which issue(s) this PR closes:
Special notes for your reviewer:
AI also recommended several settings that we don't have by default, which ~match Payara's documentation on advanced db settings. I don't know if changing the defaults in our code, or adding these to the recommended settings when upgrading/the docker default config would be useful:
-Ddataverse.db.is-connection-validation-required=true
-Ddataverse.db.validate-atmost-once-period-in-seconds=30
-Ddataverse.db.connection-validation-method=custom-validation
-Ddataverse.db.validation-classname=org.glassfish.api.jdbc.validation.PostgresConnectionValidation
-Ddataverse.db.fail-all-connections=true
-Ddataverse.db.connection-leak-timeout-in-seconds=300
Suggestions on how to test this:
Nominally make sure Dataverse works after the change, probably perf test and assure performance doesn't drop and performance under load is as/more stable than before.
FWIW: All of these changes are on QDR test machines - we'll be monitoring performance and hopefully pushing them to production where we've seen outages.
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?:
Additional documentation: