Skip to content

Fix failure on agent reconnection#8089

Merged
DaanHoogland merged 6 commits intoapache:4.18from
shapeblue:fix-failure-on-agent-reconnection
Oct 26, 2023
Merged

Fix failure on agent reconnection#8089
DaanHoogland merged 6 commits intoapache:4.18from
shapeblue:fix-failure-on-agent-reconnection

Conversation

@vishesh92
Copy link
Copy Markdown
Member

Description

Depending on the agents' configuration, restarting a management server (preferred MS for the agent) will make the agent connect to another management server (non preferred MS). When the preferred MS comes back up, agent will try to disconnect with non-preferred MS and connect with the preferred MS. A race condition can happen during this process in which disconnection from non-preferred MS completes after the connection with preferred MS. This leads to agent to go into an Alert state. During this time, agent is still sending Ping to the preferred MS.

This PR solves this issue by:

  • Taking a lock in database which ensures that for a host, only connection or disconnection can happen across different MS.
  • While processing the Ping command if the Host is not in Up state, we request the agent to send a startup command again to the connection. If the startup is successful, the agent will come back in Up state.

To reproduce the issue,

  1. apply the patch below.
diff --git a/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java b/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java
index b74c11cf..aec8a8dd 100644
--- a/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java
+++ b/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java
@@ -836,6 +836,9 @@ public class AgentManagerImpl extends ManagerBase implements AgentManager, Handl
 
         removeAgent(attache, nextStatus);
         // update the DB
+        try {
+            Thread.sleep(5000L);
+        } catch (Exception e){}
         if (host != null && transitState) {
             disconnectAgent(host, event, _nodeId);
         }
  1. Setup an environment with two management servers and below global configuration
agent.lb.enabled = true
indirect.agent.lb.algorithm = roundrobin
indirect.agent.lb.check.interval = 30
  1. Restart the preferred management server.
  2. Check status of host in database or in the UI

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

  1. Tried by applying the above patch.
  2. Updated the state of a host to Alert state in database. After getting a ping, it gets a startup command after which it turns back to Up state.

How did you try to break this feature and the system with this change?

@codecov
Copy link
Copy Markdown

codecov bot commented Oct 12, 2023

Codecov Report

Merging #8089 (0169fc9) into 4.18 (3e7f21a) will increase coverage by 0.00%.
Report is 14 commits behind head on 4.18.
The diff coverage is 0.00%.

@@            Coverage Diff            @@
##               4.18    #8089   +/-   ##
=========================================
  Coverage     13.06%   13.07%           
- Complexity     9109     9111    +2     
=========================================
  Files          2720     2720           
  Lines        257526   257566   +40     
  Branches      40150    40154    +4     
=========================================
+ Hits          33655    33666   +11     
- Misses       219644   219671   +27     
- Partials       4227     4229    +2     
Files Coverage Δ
...java/com/cloud/agent/manager/AgentManagerImpl.java 4.82% <0.00%> (-0.12%) ⬇️

... and 8 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@vishesh92
Copy link
Copy Markdown
Member Author

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@vishesh92 a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7351

@yadvr yadvr added this to the 4.18.2.0 milestone Oct 13, 2023
Copy link
Copy Markdown
Member

@yadvr yadvr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - didn't test it though

@yadvr
Copy link
Copy Markdown
Member

yadvr commented Oct 13, 2023

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@rohityadavcloud a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@yadvr
Copy link
Copy Markdown
Member

yadvr commented Oct 13, 2023

@blueorangutan test alma8 kvm-alma8

@blueorangutan
Copy link
Copy Markdown

@rohityadavcloud a [SF] Trillian-Jenkins test job (alma8 mgmt + kvm-alma8) has been kicked to run smoke tests

Comment thread engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java Outdated
Comment thread engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java Outdated
Comment thread engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java Outdated
Comment thread engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java Outdated
Comment thread test/integration/smoke/test_host_ping.py Outdated
Comment thread test/integration/smoke/test_host_ping.py Outdated
@blueorangutan
Copy link
Copy Markdown

[SF] Trillian test result (tid-7962)
Environment: kvm-alma8 (x2), Advanced Networking with Mgmt server a8
Total time taken: 42908 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8089-t7962-kvm-alma8.zip
Smoke tests completed. 107 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_03_deploy_vm_wrong_checksum Error 43.84 test_templates.py
test_09_list_templates_download_details Failure 0.06 test_templates.py
test_01_migrate_VM_and_root_volume Error 83.08 test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks Error 50.71 test_vm_life_cycle.py
test_08_migrate_vm Error 47.03 test_vm_life_cycle.py

@blueorangutan
Copy link
Copy Markdown

[SF] Trillian test result (tid-7961)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 48364 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8089-t7961-kvm-centos7.zip
Smoke tests completed. 105 look OK, 4 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_03_deploy_vm_wrong_checksum Error 40.57 test_templates.py
test_09_list_templates_download_details Failure 0.05 test_templates.py
test_08_upgrade_kubernetes_ha_cluster Failure 3643.84 test_kubernetes_clusters.py
test_09_delete_kubernetes_ha_cluster Failure 0.06 test_kubernetes_clusters.py
ContextSuite context=TestKubernetesCluster>:teardown Error 324.61 test_kubernetes_clusters.py
test_08_migrate_vm Error 43.77 test_vm_life_cycle.py
test_hostha_enable_ha_when_host_in_maintenance Error 303.79 test_hostha_kvm.py
test_hostha_kvm_host_degraded Error 9.15 test_hostha_kvm.py

Copy link
Copy Markdown
Member

@weizhouapache weizhouapache left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code lgtm

@vishesh92 vishesh92 force-pushed the fix-failure-on-agent-reconnection branch from 04ef4c2 to 17d6385 Compare October 18, 2023 06:58
@vishesh92 vishesh92 force-pushed the fix-failure-on-agent-reconnection branch from 17d6385 to 39b9cfe Compare October 18, 2023 07:16
@yadvr
Copy link
Copy Markdown
Member

yadvr commented Oct 18, 2023

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@rohityadavcloud a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7411

// update the DB
if (host != null && transitState) {
disconnectAgent(host, event, _nodeId);
// update the DB
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// update the DB

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was part of the initial code and we are updating the state on disconnecting the agent.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean this comment should stay there @vishesh92 ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, yes. The comment should be more detailed instead.

Copy link
Copy Markdown
Contributor

@shwstppr shwstppr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM

Copy link
Copy Markdown
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note the use of host.getUuid() instead of host.getId() in my suggestions!?

Comment thread engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java Outdated
Comment thread engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java Outdated
Comment thread engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java Outdated
@vishesh92 vishesh92 force-pushed the fix-failure-on-agent-reconnection branch from 7bab9b3 to c8dd68f Compare October 23, 2023 07:24
@vishesh92
Copy link
Copy Markdown
Member Author

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7468

@vishesh92
Copy link
Copy Markdown
Member Author

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@vishesh92 a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@DaanHoogland
Copy link
Copy Markdown
Contributor

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link
Copy Markdown

[SF] Trillian test result (tid-8071)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 38840 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8089-t8071-kvm-centos7.zip
Smoke tests completed. 107 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_08_migrate_vm Error 45.83 test_vm_life_cycle.py
test_hostha_enable_ha_when_host_in_maintenance Error 305.88 test_hostha_kvm.py

@DaanHoogland
Copy link
Copy Markdown
Contributor

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7509

@DaanHoogland
Copy link
Copy Markdown
Contributor

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@DaanHoogland
Copy link
Copy Markdown
Contributor

tried to break this by setting a breakpoint on the host check in the disconnect handler on the secondary MS/host and could see the host getting back up and connected to its original MS/host.

@blueorangutan
Copy link
Copy Markdown

[SF] Trillian test result (tid-8096)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 41889 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8089-t8096-kvm-centos7.zip
Smoke tests completed. 109 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@DaanHoogland DaanHoogland merged commit 3b11663 into apache:4.18 Oct 26, 2023
@DaanHoogland DaanHoogland deleted the fix-failure-on-agent-reconnection branch October 26, 2023 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants