Skip to content

Better tracking host maintanence and handling of migration jobs#3425

Merged
andrijapanicsb merged 19 commits intoapache:masterfrom
shapeblue:host_maintainence
Dec 19, 2019
Merged

Better tracking host maintanence and handling of migration jobs#3425
andrijapanicsb merged 19 commits intoapache:masterfrom
shapeblue:host_maintainence

Conversation

@anuragaw
Copy link
Copy Markdown
Contributor

@anuragaw anuragaw commented Jun 26, 2019

We want to update how host enters maintenance and it's states change. There have been instances when host was stuck in PrepareForMaintenance state indefinitely because all states weren't accounted for.

Also instead of moving direct to ErrorInMaintenance state on first fail, we want to wait for all legitimate operations to complete before entering ErrorInMaintenance state. If there are errors during in PrepareForMaintenance state, and errors are encountered, we enter the host into a transitory ErrorInPrepareForMaintenance state. This allows for better clarity if the operation had failed.

There are checks added to starting prepareHostForMaintenance. We should fail the API without any migration attempts if -

  1. there are VMs in Starting, Error, Shutdown state on host or
  2. if there are incoming migrations to the host.

Below is the flow of new host FSM
hostFSM

Description

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

How Has This Been Tested?

  • Unit Tests
  • Marvin Tests
  • Manually

@Override
public ConfigKey<?>[] getConfigKeys() {
return new ConfigKey<?>[] {HostMaintenanceRetries};
return new ConfigKey<?>[] {KvmSshToAgentEnabled};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to add this config key as it is already present on DB

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sake of completeness (readability) should we not export it anyway? I mean if there are no repurcussions to it we should we not keep it as is?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this exported by some other manager/class? Then there is no need to re-export it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread server/src/main/java/com/cloud/resource/ResourceManagerImpl.java Outdated
@Override
public ConfigKey<?>[] getConfigKeys() {
return new ConfigKey<?>[] {HostMaintenanceRetries};
return new ConfigKey<?>[] {KvmSshToAgentEnabled};
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sake of completeness (readability) should we not export it anyway? I mean if there are no repurcussions to it we should we not keep it as is?

@yadvr
Copy link
Copy Markdown
Member

yadvr commented Jun 28, 2019

Can you fix the conflicts @anuragaw and rebase against latest master?

final boolean hasVmsInFailureStates = CollectionUtils.isNotEmpty(errorVms);
errorVms.addAll(failedMigrations);

if (!hasPendingMigrationWorks && (hasRunningVms || (!hasRunningVms && !hasMigratingVms && hasVmsInFailureStates))) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this could be a separate method with a javadoc to explain it? To improve readability

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving to another method.

return setHostIntoErrorInMaintenance(host, errorVms);
}

if ((hasVmsInFailureStates || hasFailedMigrations) && (hasPendingMigrationWorks || hasMigratingVms || CollectionUtils.isNotEmpty(_vmDao.findByHostInStates(hostId, State.Stopping)))) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactoring to different methods seems like an overhead. I've refactored a little with some javadocs to better readability.

@anuragaw anuragaw force-pushed the host_maintainence branch from c02ee4d to 5a15b8d Compare July 1, 2019 09:02
@borisstoyanov
Copy link
Copy Markdown
Contributor

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-73

@yadvr
Copy link
Copy Markdown
Member

yadvr commented Jul 3, 2019

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

ErrorInMaintenance,
Maintenance,
Error,
PrepareForMaintenanceErrorsPresent;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename PrepareForMaintenanceErrorsPresent to ErrorInPrepareForMaintenance to be uniform with other names such as ErrorInMaintenance.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PrepareForMaintenanceErrorsPresent is a transitional state that can go back to PrepareForMaintenance if Admin fixes errors.

This ultimately ends in ErrorInMaintenance or Maintenance state which are final states to be reached. Let me update the PR description @rhtyd

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor nit)

I understand, if the intent of the new state to say errors happened while doing ... then I'm simply asking the name of the new state to be uniform with other state names. For example, Maintenance has ErrorInMaintenance; so PrepareForMaintenance can have a similar ErrorInPrepareForMaintenance state.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I get your point. Fixing in the next commit.


protected static final Logger s_logger = Logger.getLogger(HighAvailabilityManagerImpl.class);
private ConfigKey<Integer> MaxRetries = new ConfigKey<>("Advanced", Integer.class,
"max.retries","5",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max.retries is too generic, can you rename this to vm.ha.migration.max.retries or something suitable?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an old config file in the code but was not exposed to be configured. Renaming makes sense. Doing it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the global setting already exists, then don't rename them (for backward compatibility reasons) or re-define them. If they are new settings rename them.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These existed before. Leaving them as is.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked 4.13, the global setting max.retries does not exist in it however in code it is indeed referenced. Okay to leave as is, in that case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This was a bug because max.retries wasn't accessible via the API. We have fixed that here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anuragaw @rhtyd
sorry I have some late comments as this PR has been merged.

  1. max.retries is not used in the past so I suggest to rename it before 4.14 release.
  2. this config has scope=Cluster but cluster-wide configuration is never used in code.

}

@Test
public void testCheckAndMaintainEnterMaintenanceMode() throws NoTransitionException {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anuragaw Can you add unit tests for the new cases, or instead of removing old ones fix/refactor them?

Copy link
Copy Markdown
Contributor Author

@anuragaw anuragaw Jul 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing it @rhtyd , along with @nvazquez refactoring suggestions.

@yadvr
Copy link
Copy Markdown
Member

yadvr commented Jul 3, 2019

@nvazquez @borisstoyanov have you reviewed and tested it?

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-88)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 28692 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3425-t88-kvm-centos7.zip
Smoke tests completed. 69 look OK, 2 have error(s)
Only failed tests results shown below:

@anuragaw
Copy link
Copy Markdown
Contributor Author

anuragaw commented Jul 8, 2019

Refactoring and addressing reviews today. Apologies but I got occupied on other PRs.

@yadvr yadvr modified the milestones: 4.13.0.0, 4.13.1.0 Jul 11, 2019
@yadvr
Copy link
Copy Markdown
Member

yadvr commented Jul 11, 2019

Moved this to 4.13.1.0 unless we can manage to get it reviewed/tested cc @borisstoyanov @PaulAngus

@anuragaw anuragaw force-pushed the host_maintainence branch 2 times, most recently from bb20254 to d660930 Compare July 12, 2019 07:34
@borisstoyanov
Copy link
Copy Markdown
Contributor

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✖centos6 ✖centos7 ✖debian. JID-149

Copy link
Copy Markdown
Contributor

@borisstoyanov borisstoyanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anuragaw there seems to be some issues building this one
[INFO] Apache CloudStack Server .......................... FAILURE [1:22.280s]

[ERROR] Failures:
[ERROR] ResourceManagerImplTest.testCheckAndMaintainErrorInMaintenanceFailedMigrations:203
Wanted but not invoked:
resourceManagerImpl.setHostIntoErrorInMaintenance(
host,
[vm1, vm2]
);
-> at com.cloud.resource.ResourceManagerImplTest.testCheckAndMaintainErrorInMaintenanceFailedMigrations(ResourceManagerImplTest.java:203)

However, there were other interactions with this mock:
resourceManagerImpl.checkAndMaintain(1);
-> at com.cloud.resource.ResourceManagerImplTest.testCheckAndMaintainErrorInMaintenanceFailedMigrations(ResourceManagerImplTest.java:201)

resourceManagerImpl.attemptMaintain(host);
-> at com.cloud.resource.ResourceManagerImplTest.testCheckAndMaintainErrorInMaintenanceFailedMigrations(ResourceManagerImplTest.java:201)

resourceManagerImpl.setHostIntoMaintenance(
host
);
-> at com.cloud.resource.ResourceManagerImplTest.testCheckAndMaintainErrorInMaintenanceFailedMigrations(ResourceManagerImplTest.java:201)

resourceManagerImpl.resourceStateTransitTo(
host,
InternalEnterMaintenance,
2886795268
);
-> at com.cloud.resource.ResourceManagerImplTest.testCheckAndMaintainErrorInMaintenanceFailedMigrations(ResourceManagerImplTest.java:201)

[ERROR] ResourceManagerImplTest.testCheckAndMaintainErrorInMaintenanceMigratingVms:195
[ERROR] ResourceManagerImplTest.testCheckAndMaintainErrorInMaintenanceRunningVms:187
[INFO]
[ERROR] Tests run: 791, Failures: 3, Errors: 0, Skipped: 5
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:

@anuragaw
Copy link
Copy Markdown
Contributor Author

@borisstoyanov - looking at it Bobby, looks like I may have missed pushing some changes to the PR before transitioning to some urgent work.

@yadvr yadvr added this to the 4.14.0.0 milestone Dec 7, 2019
@yadvr
Copy link
Copy Markdown
Member

yadvr commented Dec 7, 2019

Is this still in progress or ready for review/testing - @anuragaw cc @borisstoyanov @andrijapanicsb ?

@anuragaw
Copy link
Copy Markdown
Contributor Author

anuragaw commented Dec 8, 2019

This has undergone some testing as far as I know from @borisstoyanov and should be ready for merging once either @andrijapanicsb or @PaulAngus give up a thumbs up. I'll remove the WIP after that so that it can be merged.

Copy link
Copy Markdown
Contributor

@borisstoyanov borisstoyanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM based on code review and test results
test_results.xlsx

@andrijapanicsb
Copy link
Copy Markdown
Contributor

ping @anuragaw seems some conflicts - can you please handle it

@anuragaw
Copy link
Copy Markdown
Contributor Author

Rebased against master.
ping for final review @rhtyd , @nvazquez , @DaanHoogland , @borisstoyanov , @andrijapanicsb , @shwstppr

@anuragaw
Copy link
Copy Markdown
Contributor Author

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-483

@anuragaw
Copy link
Copy Markdown
Contributor Author

@blueorangutan test matrix

@blueorangutan
Copy link
Copy Markdown

@anuragaw a Trillian-Jenkins matrix job (centos7 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@anuragaw
Copy link
Copy Markdown
Contributor Author

anuragaw commented Dec 18, 2019

@andrijapanicsb - I've reverted to the commit that existed in the morning as per Github history per our offline conversation.
image
image

@rhtyd , @DaanHoogland , @shwstppr , @nvazquez - can someone of you also take a quick look and LGTM? It has only one LGTM from @borisstoyanov

@anuragaw
Copy link
Copy Markdown
Contributor Author

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-488

@andrijapanicsb
Copy link
Copy Markdown
Contributor

@blueorangutan test matrix

@blueorangutan
Copy link
Copy Markdown

@andrijapanicsb a Trillian-Jenkins matrix job (centos7 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@andrijapanicsb
Copy link
Copy Markdown
Contributor

Let's wait for the tests to confirm no regression since PR was rebased and commit/revert commit in the meantime. Otherwise, I give it LGTM

Copy link
Copy Markdown
Contributor

@shwstppr shwstppr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code changes LGTM

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-643)
Environment: vmware-65u2 (x2), Advanced Networking with Mgmt server 7
Total time taken: 31523 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3425-t643-vmware-65u2.zip
Intermittent failure detected: /marvin/tests/smoke/test_ssvm.py
Smoke tests completed. 77 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@anuragaw
Copy link
Copy Markdown
Contributor Author

KVM and XenServer results are expected with more delay as the job had failed initially.

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-645)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 29508 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3425-t645-xenserver-71.zip
Intermittent failure detected: /marvin/tests/smoke/test_scale_vm.py
Smoke tests completed. 76 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_scale_vm Failure 13.35 test_scale_vm.py

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-644)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 30407 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3425-t644-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_ssvm.py
Smoke tests completed. 77 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@anuragaw
Copy link
Copy Markdown
Contributor Author

Test failure in XenServer isnt related. @andrijapanicsb , @DaanHoogland - ready to merge?

@anuragaw anuragaw changed the title [WIP DO NOT MERGE] Better tracking host maintanence success and failure Better tracking host maintanence and handling of migration jobs Dec 19, 2019
@andrijapanicsb
Copy link
Copy Markdown
Contributor

Looks good - all tests fine and enough LGTM.

@andrijapanicsb andrijapanicsb merged commit 4b43c26 into apache:master Dec 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants