Better tracking host maintenance and handling of migration jobs #3425
andrijapanicsb merged 19 commits into apache:master
| @Override | ||
| public ConfigKey<?>[] getConfigKeys() { | ||
| return new ConfigKey<?>[] {HostMaintenanceRetries}; | ||
| return new ConfigKey<?>[] {KvmSshToAgentEnabled}; |
No need to add this config key as it is already present in the DB
For the sake of completeness (readability), should we not export it anyway? I mean, if there are no repercussions to it, should we not keep it as is?
Is this exported by some other manager/class? Then there is no need to re-export it.
|
Can you fix the conflicts @anuragaw and rebase against latest master? |
| final boolean hasVmsInFailureStates = CollectionUtils.isNotEmpty(errorVms); | ||
| errorVms.addAll(failedMigrations); | ||
|
|
||
| if (!hasPendingMigrationWorks && (hasRunningVms || (!hasRunningVms && !hasMigratingVms && hasVmsInFailureStates))) { |
Maybe this could be a separate method with a javadoc to explain it? To improve readability
Moving to another method.
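One way to act on this suggestion is sketched below: the compound condition is moved into a named predicate with a Javadoc. The class name `MaintenanceChecks`, the method name, and the wording of the Javadoc are assumptions made for illustration; only the four boolean flags and the condition itself come from the diff above. The inner `!hasRunningVms` is dropped because that clause is only evaluated when `hasRunningVms` is already false.

```java
/** Hypothetical helper for illustration; not part of the PR. */
final class MaintenanceChecks {

    private MaintenanceChecks() {
    }

    /**
     * Returns true when no migration work is pending and either VMs are still
     * running on the host, or nothing is migrating while some VMs are in
     * failure states.
     *
     * Logically equivalent to the expression in the diff:
     * {@code !pending && (running || (!running && !migrating && failed))}
     */
    static boolean hasLeftoverVmsAfterMigrations(boolean hasPendingMigrationWorks,
                                                 boolean hasRunningVms,
                                                 boolean hasMigratingVms,
                                                 boolean hasVmsInFailureStates) {
        if (hasPendingMigrationWorks) {
            // Migration work is still queued, so the condition cannot hold yet.
            return false;
        }
        if (hasRunningVms) {
            return true;
        }
        return !hasMigratingVms && hasVmsInFailureStates;
    }
}
```

The call site would then read `if (MaintenanceChecks.hasLeftoverVmsAfterMigrations(...))`, which keeps the decision logic in one documented place.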
| return setHostIntoErrorInMaintenance(host, errorVms); | ||
| } | ||
|
|
||
| if ((hasVmsInFailureStates || hasFailedMigrations) && (hasPendingMigrationWorks || hasMigratingVms || CollectionUtils.isNotEmpty(_vmDao.findByHostInStates(hostId, State.Stopping)))) { |
Refactoring to different methods seems like an overhead. I've refactored a little with some javadocs to improve readability.
c02ee4d to 5a15b8d
|
@blueorangutan package |
|
@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✔centos6 ✔centos7 ✔debian. JID-73 |
|
@blueorangutan test |
|
@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
| ErrorInMaintenance, | ||
| Maintenance, | ||
| Error, | ||
| PrepareForMaintenanceErrorsPresent; |
Can you rename PrepareForMaintenanceErrorsPresent to ErrorInPrepareForMaintenance to be uniform with other names such as ErrorInMaintenance?
PrepareForMaintenanceErrorsPresent is a transitional state that can go back to PrepareForMaintenance if the admin fixes the errors.
The host ultimately ends in the ErrorInMaintenance or Maintenance state, which are the final states to be reached. Let me update the PR description @rhtyd
(minor nit)
I understand. If the intent of the new state is to say errors happened while doing ... then I'm simply asking for the name of the new state to be uniform with the other state names. For example, Maintenance has ErrorInMaintenance, so PrepareForMaintenance can have a similar ErrorInPrepareForMaintenance state.
Okay. I get your point. Fixing in the next commit.
|
|
||
| protected static final Logger s_logger = Logger.getLogger(HighAvailabilityManagerImpl.class); | ||
| private ConfigKey<Integer> MaxRetries = new ConfigKey<>("Advanced", Integer.class, | ||
| "max.retries","5", |
max.retries is too generic, can you rename this to vm.ha.migration.max.retries or something suitable?
This is an old config in the code, but it was not exposed to be configured. Renaming makes sense. Doing it.
If the global setting already exists, then don't rename or re-define it (for backward compatibility reasons). If it is a new setting, rename it.
These existed before. Leaving them as is.
I checked 4.13: the global setting max.retries does not exist in it; however, it is indeed referenced in code. Okay to leave as is in that case.
Yes. This was a bug because max.retries wasn't accessible via the API. We have fixed that here.
@anuragaw @rhtyd
Sorry, I have some late comments, as this PR has already been merged.
- max.retries was not used in the past, so I suggest renaming it before the 4.14 release.
- This config has scope=Cluster, but the cluster-wide configuration is never used in code.
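The tension in this thread between a more specific name and backward compatibility could be eased by reading the specific key first and falling back to the legacy one. The sketch below is only an illustration under stated assumptions: `RetrySettingLookup`, the plain `Map` of settings, and the fallback behaviour are hypothetical and are not CloudStack's `ConfigKey` API. The key `vm.ha.migration.max.retries` is the name floated earlier in the thread, and `5` is the default shown in the diff.

```java
import java.util.Map;

/** Hypothetical lookup helper; a stand-in for illustration, not CloudStack code. */
final class RetrySettingLookup {

    // Name suggested in the review thread (assumed here, never adopted in the diff).
    static final String SPECIFIC_KEY = "vm.ha.migration.max.retries";
    // Name used in the diff.
    static final String LEGACY_KEY = "max.retries";
    // Default value shown in the diff.
    static final int DEFAULT_MAX_RETRIES = 5;

    private RetrySettingLookup() {
    }

    /**
     * Prefers the specific key, falls back to the legacy key, and finally to
     * the default. Throws NumberFormatException if the stored value is not an
     * integer.
     */
    static int resolveMaxRetries(Map<String, String> settings) {
        String raw = settings.get(SPECIFIC_KEY);
        if (raw == null) {
            raw = settings.get(LEGACY_KEY);
        }
        return raw == null ? DEFAULT_MAX_RETRIES : Integer.parseInt(raw.trim());
    }
}
```

With this shape, an installation that only ever set `max.retries` keeps its value, while new installations can use the more descriptive name.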
| } | ||
|
|
||
| @Test | ||
| public void testCheckAndMaintainEnterMaintenanceMode() throws NoTransitionException { |
|
@nvazquez @borisstoyanov have you reviewed and tested it? |
|
Trillian test result (tid-88) |
|
Refactoring and addressing reviews today. Apologies but I got occupied on other PRs. |
|
Moved this to 4.13.1.0 unless we can manage to get it reviewed/tested cc @borisstoyanov @PaulAngus |
bb20254 to d660930
|
@blueorangutan package |
|
@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✖centos6 ✖centos7 ✖debian. JID-149 |
borisstoyanov left a comment
@anuragaw there seem to be some issues building this one
[INFO] Apache CloudStack Server .......................... FAILURE [1:22.280s]
[ERROR] Failures:
[ERROR] ResourceManagerImplTest.testCheckAndMaintainErrorInMaintenanceFailedMigrations:203
Wanted but not invoked:
resourceManagerImpl.setHostIntoErrorInMaintenance(
host,
[vm1, vm2]
);
-> at com.cloud.resource.ResourceManagerImplTest.testCheckAndMaintainErrorInMaintenanceFailedMigrations(ResourceManagerImplTest.java:203)
However, there were other interactions with this mock:
resourceManagerImpl.checkAndMaintain(1);
-> at com.cloud.resource.ResourceManagerImplTest.testCheckAndMaintainErrorInMaintenanceFailedMigrations(ResourceManagerImplTest.java:201)
resourceManagerImpl.attemptMaintain(host);
-> at com.cloud.resource.ResourceManagerImplTest.testCheckAndMaintainErrorInMaintenanceFailedMigrations(ResourceManagerImplTest.java:201)
resourceManagerImpl.setHostIntoMaintenance(
host
);
-> at com.cloud.resource.ResourceManagerImplTest.testCheckAndMaintainErrorInMaintenanceFailedMigrations(ResourceManagerImplTest.java:201)
resourceManagerImpl.resourceStateTransitTo(
host,
InternalEnterMaintenance,
2886795268
);
-> at com.cloud.resource.ResourceManagerImplTest.testCheckAndMaintainErrorInMaintenanceFailedMigrations(ResourceManagerImplTest.java:201)
[ERROR] ResourceManagerImplTest.testCheckAndMaintainErrorInMaintenanceMigratingVms:195
[ERROR] ResourceManagerImplTest.testCheckAndMaintainErrorInMaintenanceRunningVms:187
[INFO]
[ERROR] Tests run: 791, Failures: 3, Errors: 0, Skipped: 5
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
|
@borisstoyanov - looking at it Bobby, looks like I may have missed pushing some changes to the PR before transitioning to some urgent work. |
|
Is this still in progress or ready for review/testing - @anuragaw cc @borisstoyanov @andrijapanicsb ? |
|
This has undergone some testing as far as I know from @borisstoyanov and should be ready for merging once either @andrijapanicsb or @PaulAngus gives a thumbs up. I'll remove the WIP after that so that it can be merged. |
borisstoyanov left a comment
LGTM based on code review and test results
test_results.xlsx
|
ping @anuragaw, there seem to be some conflicts - can you please handle them? |
1c90d88 to a43482a
|
Rebased against master. |
|
@blueorangutan package |
|
@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✖centos6 ✔centos7 ✔debian. JID-483 |
|
@blueorangutan test matrix |
|
@anuragaw a Trillian-Jenkins matrix job (centos7 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests |
a43482a to 1c90d88
|
@andrijapanicsb - I've reverted to the commit that existed in the morning according to the GitHub history, as per our offline conversation. @rhtyd , @DaanHoogland , @shwstppr , @nvazquez - can one of you also take a quick look and LGTM? It has only one LGTM from @borisstoyanov |
|
@blueorangutan package |
|
@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✖centos6 ✔centos7 ✔debian. JID-488 |
|
@blueorangutan test matrix |
|
@andrijapanicsb a Trillian-Jenkins matrix job (centos7 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests |
|
Let's wait for the tests to confirm there is no regression, since the PR was rebased and there was a commit/revert in the meantime. Otherwise, I give it LGTM |
|
Trillian test result (tid-643)
|
|
KVM and XenServer results are expected with more delay as the job had failed initially. |
|
Trillian test result (tid-645)
|
|
Trillian test result (tid-644)
|
|
Test failure in XenServer isn't related. @andrijapanicsb , @DaanHoogland - ready to merge? |
|
Looks good - all tests fine and enough LGTM. |


We want to update how a host enters maintenance and how its states change. There have been instances where a host was stuck in the PrepareForMaintenance state indefinitely because not all states were accounted for.
Also, instead of moving directly to the ErrorInMaintenance state on the first failure, we want to wait for all legitimate operations to complete before entering the ErrorInMaintenance state. If errors are encountered during the PrepareForMaintenance state, we move the host into a transitory ErrorInPrepareForMaintenance state. This makes it clearer whether the operation has failed.
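The transitions described above can be sketched as a small lookup table. This is a simplified illustration, not the real CloudStack resource state machine: the names `HostMaintenanceState` and `HostMaintenanceFlow` are hypothetical, events are left out, and the exact set of allowed moves is an assumption pieced together from this description and the review discussion (the transitory state can return to PrepareForMaintenance once errors are fixed, and the flow ends in Maintenance or ErrorInMaintenance). Any exits from those two end states are deliberately omitted.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Subset of the host states named in this PR. */
enum HostMaintenanceState {
    PrepareForMaintenance,
    ErrorInPrepareForMaintenance,
    ErrorInMaintenance,
    Maintenance
}

/** Hypothetical, simplified transition table; an assumption, not the real FSM. */
final class HostMaintenanceFlow {

    private static final Map<HostMaintenanceState, Set<HostMaintenanceState>> ALLOWED =
            new EnumMap<>(HostMaintenanceState.class);

    static {
        // Errors no longer jump straight to ErrorInMaintenance on first failure.
        ALLOWED.put(HostMaintenanceState.PrepareForMaintenance,
                EnumSet.of(HostMaintenanceState.ErrorInPrepareForMaintenance,
                        HostMaintenanceState.Maintenance));
        // Transitory state: back to preparing if errors are fixed, else give up.
        ALLOWED.put(HostMaintenanceState.ErrorInPrepareForMaintenance,
                EnumSet.of(HostMaintenanceState.PrepareForMaintenance,
                        HostMaintenanceState.ErrorInMaintenance));
        // Treated as end states in this sketch.
        ALLOWED.put(HostMaintenanceState.ErrorInMaintenance,
                EnumSet.noneOf(HostMaintenanceState.class));
        ALLOWED.put(HostMaintenanceState.Maintenance,
                EnumSet.noneOf(HostMaintenanceState.class));
    }

    private HostMaintenanceFlow() {
    }

    /** Returns true if the sketch allows moving from one state to the other. */
    static boolean canTransit(HostMaintenanceState from, HostMaintenanceState to) {
        return ALLOWED.get(from).contains(to);
    }
}
```

For example, `HostMaintenanceFlow.canTransit(HostMaintenanceState.PrepareForMaintenance, HostMaintenanceState.ErrorInMaintenance)` is false in this sketch, which mirrors the intent of not entering ErrorInMaintenance on the first failure.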
Checks have been added at the start of prepareHostForMaintenance. We should fail the API without any migration attempts if -
Below is the flow of the new host FSM
