`admin fate` improvements, `LockID`s use for fate stores improved/fixed #5028

kevinrr888 · 2024-10-31T20:02:28Z

This PR makes several improvements/fixes: improvements to the `admin fate` util which were made possible by #4524; replaced incorrect use of `createDummyLockID()` in real code (now only used in tests); `<User/Meta>FateStore` now support a null lock id if they will be used as read-only stores (write ops will fail on a store with a null lock); and some other misc. changes.

Full list of changes:

- Removed the check for a dead Manager in the Admin fate util (`AdminUtil`), which ran before `admin fate delete <tx>` or `admin fate fail <tx>` was able to run. This check is no longer needed with the changes from #4524 (Fate reservations moved out of memory), which moved reservations out of Manager memory into the FATE table (for `UserFateStore`) and into ZK (for `MetaFateStore`). Prior to this, the Admin process had no way of knowing whether the Manager process had a transaction reserved, so the Manager had to be shut down to ensure it did not. Now that reservations are visible to any process, we can try to reserve the transaction in Admin, and if it cannot be reserved and deleted/failed in a reasonable time, we let the user know that the Manager would need to be shut down if deleting/failing the transaction is still desired.
  - This has several benefits:
    - It is one less thing to worry about when implementing multiple managers in the future, since Admin assumes only one Manager for these commands. However, there is still the case where the Manager may keep a transaction reserved for a long period of time and the Admin can never reserve it. In this case, we inform the user that the transaction could not be deleted/failed and that if deleting/failing is still desired, the Manager may need to be shut down.
    - It covers a potential issue in the previously existing code, where nothing prevented a Manager from being started after the check was performed in Admin but before the delete/fail was executed.
    - It should also make the commands easier to use, since the Manager is no longer required to be shut down before use.
- Changed and added some tests for `admin fate fail` and `admin fate delete`: they ensure the Manager is not required to be down to fail/delete a transaction, and that if the Manager does have a transaction reserved, admin will fail to reserve and fail/delete the transaction.
- Another change, needed as a prerequisite for the above changes, was creating a ZK lock for Admin so transactions can be properly reserved by the command. Added new constant `Constants.ZADMIN_LOCK = "/admin/lock"`, changed `ServiceLockPaths`, and added `Admin.createAdminLock()` to support this.
- New class `TestLock` (in test package) which is used by tests to create a real ZK lock, or a dummy one. Removed `createDummyLockID()` from `AbstractFateStore` (moved to `TestLock`), and `createDummyLock()` is now only used in test code. Added new constant `ZTEST_LOCK = "/test/lock"`, changed `ServiceLockPaths`, and added `createTestLock()`, which is used to create a real lock id (one held in ZK) that is needed for some tests.
  - This fixes an unexpected failure that could have occurred for `ExternalCompaction_1_IT`. The test was using a dummy lock for the store, and the fate data was being stored in the same locations that the Manager uses for its fate data. The `DeadReservationCleaner` running in Manager would have cleaned up reservations created using this store if it ran when reservations were present. Now the test creates a real ZK lock so the `DeadReservationCleaner` won't clean these up unexpectedly.
- Stores now support a null lock id for the situation where they will be used as read-only stores. A store with a null lock id will fail on write ops. Changed all existing uses of stores to only have a lock id if writes will occur (previously, all instances of the stores had a lock id).
- Removed unused or unnecessary constructors for `AbstractFateStore`, `MetaFateStore`, `UserFateStore`.
- Ensured all changed tests, all FATE tests, and sunny day tests still pass.
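
For readers skimming the list, the "try to reserve, give up after a reasonable time" behavior described for `admin fate delete`/`admin fate fail` can be sketched roughly as below. This is an illustration only, not the code in this PR: `ReserveWithTimeout`, `tryReserve`, and the timeout/poll values are hypothetical names, and the store's reservation attempt is modeled as a plain `BooleanSupplier`.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: poll a reservation attempt until it succeeds or a deadline passes. */
final class ReserveWithTimeout {

  /**
   * @param tryReserve one non-blocking reservation attempt (stand-in for the fate store call)
   * @return true if the reservation was obtained before the timeout elapsed
   */
  static boolean reserve(BooleanSupplier tryReserve, Duration timeout, Duration pollInterval) {
    long deadline = System.nanoTime() + timeout.toNanos();
    do {
      if (tryReserve.getAsBoolean()) {
        return true;
      }
      try {
        Thread.sleep(pollInterval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve interrupt status and give up
        return false;
      }
    } while (System.nanoTime() - deadline < 0);
    return false;
  }

  public static void main(String[] args) {
    // Simulates a Manager that keeps the transaction reserved for the whole window.
    boolean reserved = reserve(() -> false, Duration.ofMillis(200), Duration.ofMillis(50));
    if (!reserved) {
      System.out.println("Could not reserve the transaction in time; "
          + "the Manager may need to be shut down to delete/fail it.");
    }
  }
}
```

The point of the sketch is that the Admin side no longer needs a "Manager is dead" precondition: it simply competes for the reservation and reports back if it loses.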

closes #4904
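
The read-only store behavior (a null lock id is allowed, but write ops fail) can be illustrated with a small stand-alone sketch. `SketchFateStore`, its `String` lock id, its methods, and the exception type are all hypothetical stand-ins chosen for the example; they are not the real `AbstractFateStore` API.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for a fate store whose lock id may be null (read-only use). */
final class SketchFateStore {
  private final String lockId; // null means this instance may only read
  private final Map<String, String> backing; // shared storage, e.g. a table or ZK in practice

  SketchFateStore(String lockId, Map<String, String> backing) {
    this.lockId = lockId;
    this.backing = backing;
  }

  /** Reads never require a lock id. */
  Optional<String> getStatus(String txId) {
    return Optional.ofNullable(backing.get(txId));
  }

  /** Writes are rejected up front when the store was created without a lock id. */
  void setStatus(String txId, String status) {
    verifyLock("setStatus");
    backing.put(txId, status);
  }

  private void verifyLock(String op) {
    if (lockId == null) {
      throw new IllegalStateException(op + " requires a lock id; this store is read-only");
    }
  }
}
```

With this shape, a caller that only lists or prints transactions can construct the store with `null`, while a caller that deletes or fails transactions has to hold a real lock first.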

keith-turner

These changes look good, made some minor comments but did not see problems.

kevinrr888 · 2024-11-21T16:56:29Z

One thing I noticed when working on this and running the tests again was that in a couple of the new tests added here (`testFate<Delete/Fail>CommandTimeout`), the test would fail/time out on `fate.shutdown()`. There appears to still be one active transaction worker running ("Fate <META/USER> is waiting for worker threads to terminate" is logged until timeout). This only happens sometimes, and I lowered the shutdown timeout to avoid a test failure in this case (this just avoids a test failure and allows `shutdownNow` to be called on the pool, but doesn't fix the issue). It seems strange to me that a worker would still be running... there is only one transaction which is even attempted to be reserved (the one we seed in the test) and that is always followed by an unreserve (verified with logs). @keith-turner maybe you have some ideas?

keith-turner · 2024-12-05T22:18:41Z

> It seems strange to me that a worker would still be running... there is only one transaction which is even attempted to be reserved (the one we seed in the test) and that is always followed by an unreserve (verified with logs). @keith-turner maybe you have some ideas?

I tried running these tests and did not see a problem. I saw an unrelated problem and made a comment about that. If I could see the problem, I think I would try to jstack the process and see what the thread was doing.

kevinrr888 · 2024-12-10T21:33:00Z

> I tried running these tests and did not see a problem. I saw an unrelated problem and made a comment about that. If I could see the problem, I think I would try to jstack the process and see what the thread was doing.

@keith-turner - I don't see the problem/comment you are referring to. I did notice that the exit status used when Admin loses the lock caused some test failures (noticed when resolving conflicts), so I fixed that in 8cc7b39.

I'm going to try to figure out the worker issue with your jstack suggestion.

kevinrr888 · 2024-12-11T18:17:41Z

So, I was testing this yesterday, and was able to reproduce the same issue running UserFateOpsCommandsIT.testFateFailCommandTimeout(). I got a jstack trace of the thread:

"accumulo.pool.manager.fate-Worker-3" #64 daemon prio=5 os_prio=0 cpu=130772.51ms elapsed=138.20s tid=0x000075971000c9b0 nid=0x610aa runnable  [0x0000759863ffd000]
   java.lang.Thread.State: RUNNABLE
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@17.0.12/LinkedTransferQueue.java:652)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@17.0.12/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.poll(java.base@17.0.12/LinkedTransferQueue.java:1294)
	at org.apache.accumulo.core.fate.Fate$TransactionRunner.reserveFateTx(Fate.java:134)
	at org.apache.accumulo.core.fate.Fate$TransactionRunner.run(Fate.java:154)
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.12/ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.12/ThreadPoolExecutor.java:635)
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52)
	at java.lang.Thread.run(java.base@17.0.12/Thread.java:840)

And this is seen again with this repeatedly printed (statement printed from Fate.shutdown()):

Fate USER is waiting for worker threads to terminate

I tried to reproduce again today with some debugging statements to better figure out why the worker would still be running, but after many, many attempts, I did not see the same failure again. Without being able to consistently reproduce the bug and without understanding based on the code how this could be occurring, I'm out of ideas and things to do to figure this out.

I'm not sure if this is a bug with my testing logic for the tests where I have seen this bug (testFateFailCommandTimeout, testFateDeleteCommandTimeout) (in which case, the bug doesn't matter), or if this is a bug with the preexisting Fate.java code (and is unrelated to these tests and PR and just happened to show up here).

Anyways, I'm out of things to try. Whenever you get the chance @keith-turner maybe you could look at the test logic, Fate.java code, and above stack trace and see if something sticks out to you. Or if you think this potential bug with Fate.java is not really something to worry about, we can just move on.

kevinrr888 · 2024-12-11T18:42:58Z

One thing I am considering is maybe this order of shutdown in Fate.shutdown is the source of the problem:

if (keepRunning.compareAndSet(true, false)) {
      fatePoolWatcher.shutdown();
      transactionExecutor.shutdown();
      workFinder.interrupt();

The work finder is interrupted after the workers are shut down... I think it makes more sense to interrupt the work finder and then shut down the worker pool.
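
A rough, self-contained model of that reordering is sketched below. It is not `Fate.java`: `ShutdownOrderSketch`, the two-thread pool, and the 100 ms poll/transfer timeouts are assumptions made for the example, and nothing here shows that the ordering is actually the cause of the stuck worker.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedTransferQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TransferQueue;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical, simplified model of a work finder feeding a pool of workers. */
final class ShutdownOrderSketch {
  private final AtomicBoolean keepRunning = new AtomicBoolean(true);
  private final TransferQueue<String> workQueue = new LinkedTransferQueue<>();
  private final ExecutorService workers = Executors.newFixedThreadPool(2);
  private final Thread workFinder;

  ShutdownOrderSketch() {
    for (int i = 0; i < 2; i++) {
      workers.execute(() -> {
        // Workers poll with a timeout so they re-check keepRunning regularly.
        while (keepRunning.get()) {
          try {
            workQueue.poll(100, TimeUnit.MILLISECONDS); // work would be processed here
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
          }
        }
      });
    }
    workFinder = new Thread(() -> {
      while (keepRunning.get()) {
        try {
          // Hands work to a waiting worker, giving up after a short wait.
          workQueue.tryTransfer("tx", 100, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
          return; // interrupted during shutdown
        }
      }
    }, "work-finder-sketch");
    workFinder.start();
  }

  /** Stops the producer first, then the consumers; returns true if the workers terminated. */
  boolean shutdown(long timeout, TimeUnit unit) {
    if (!keepRunning.compareAndSet(true, false)) {
      return workers.isTerminated();
    }
    try {
      workFinder.interrupt(); // 1. stop handing out new work
      workFinder.join(Math.max(1, unit.toMillis(timeout)));
      workers.shutdown(); // 2. let the workers notice keepRunning and exit
      if (workers.awaitTermination(timeout, unit)) {
        return true;
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    workers.shutdownNow(); // 3. last resort if a worker is still running
    return false;
  }
}
```

Calling `new ShutdownOrderSketch().shutdown(2, TimeUnit.SECONDS)` should return `true` in this model because every loop re-checks `keepRunning` at least every 100 ms.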

kevinrr888 · 2025-01-10T17:08:57Z

I just saw this same bug again when running some tests on FATE changes unrelated to this PR, so this isn't tied to this PR and is preexisting.

I can resolve the merge conflicts and these changes can be merged in.

Maybe an issue should be opened regarding this after this is merged, if this bug is actually worth it. I don't remember the specifics, but I think it's just that a worker thread (`TransactionRunner`) gets stuck somehow when we try to `shutdown()` fate. It doesn't make sense how this can occur if no transactions are running, but since `shutdownNow()` will eventually be called on the pool, it probably doesn't matter.

keith-turner · 2025-01-10T18:24:16Z

> Maybe an issue should be opened regarding this after this is merged, if this bug is actually worth it. I don't remember the specifics, but I think it's just that a worker thread (`TransactionRunner`) gets stuck somehow when we try to `shutdown()` fate. It doesn't make sense how this can occur if no transactions are running, but since `shutdownNow()` will eventually be called on the pool, it probably doesn't matter.

Wondering if this will impact shutting down accumulo in any way. For cases where there is an observed bug whose cause cannot be tracked down, I have added logging to give clues about it the next time it happens. Is there any logging that can be added for this case that would be helpful?
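
As one possible shape for such logging, the sketch below builds a jstack-like dump of the worker threads that could be logged when shutdown times out. It is only an assumption about what might help: `WorkerStackLogger` and the name-fragment filter are hypothetical, while `Thread.getAllStackTraces()` is the standard JDK call.

```java
import java.util.Map;

/** Hypothetical helper: collect stack traces of live threads whose name matches a fragment. */
final class WorkerStackLogger {

  /** Returns a jstack-like text dump, or an empty string if no thread name matches. */
  static String dump(String nameFragment) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
      Thread thread = entry.getKey();
      if (!thread.getName().contains(nameFragment)) {
        continue;
      }
      sb.append('"').append(thread.getName()).append("\" state=").append(thread.getState())
          .append('\n');
      for (StackTraceElement frame : entry.getValue()) {
        sb.append("\tat ").append(frame).append('\n');
      }
    }
    return sb.toString();
  }
}
```

For the thread seen in the trace above, something like `WorkerStackLogger.dump("fate-Worker")` could be logged just before `shutdownNow()` is called on the pool.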

kevinrr888 · 2025-01-10T19:52:51Z

> Is there any logging that can be added for this case that would be helpful?

Other than the info I've provided, I don't think there is anything else to add. The logs I looked through when I've seen this occur were not able to help me diagnose the problem, only showed me that there was a problem. I've tried to add my own logs to figure this out, but without being able to consistently produce the error, this wasn't much help either.

kevinrr888 · 2025-01-16T16:18:39Z

I resolved all of the merge conflicts, ensured sunny day, all fate tests, and all tests changed still pass. I also checked over all of my changes again to ensure nothing was accidentally added, deleted, or changed that shouldn't have been in the merge.

Once the build completes, I will merge this into main.

kevinrr888 added 2 commits October 31, 2024 16:00

Merge branch 'main' into 4.0-feature-4904

6957092

kevinrr888 self-assigned this Oct 31, 2024

kevinrr888 added this to the 4.0.0 milestone Oct 31, 2024

kevinrr888 requested a review from keith-turner October 31, 2024 20:16

keith-turner approved these changes Nov 10, 2024

kevinrr888 mentioned this pull request Nov 19, 2024

Trivial: Added missing info to admin fate delete help msg #5080

Merged

Addresses review

8e59023

Merge branch 'main' into 4.0-feature-4904

dc707e7

kevinrr888 added 2 commits December 10, 2024 15:36

Merge branch 'main' into 4.0-feature-4904

f3fd473

fix Admin lostLock

8cc7b39

kevinrr888 added 3 commits January 15, 2025 16:40

Merge branch 'main' into 4.0-feature-4904

a357127

formatting

f9908ed

added 'private' to a couple fields missing it

8155118

kevinrr888 merged commit cb1158e into apache:main Jan 16, 2025
8 checks passed

kevinrr888 deleted the 4.0-feature-4904 branch January 16, 2025 16:28
