Fix failure on agent reconnection #8089
Codecov Report
@@ Coverage Diff @@
## 4.18 #8089 +/- ##
=========================================
Coverage 13.06% 13.07%
- Complexity 9109 9111 +2
=========================================
Files 2720 2720
Lines 257526 257566 +40
Branches 40150 40154 +4
=========================================
+ Hits 33655 33666 +11
- Misses 219644 219671 +27
- Partials 4227 4229 +2
... and 8 files with indirect coverage changes
@blueorangutan package
@vishesh92 a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7351
@blueorangutan test
@rohityadavcloud a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
@blueorangutan test alma8 kvm-alma8
@rohityadavcloud a [SF] Trillian-Jenkins test job (alma8 mgmt + kvm-alma8) has been kicked to run smoke tests
[SF] Trillian test result (tid-7962)
[SF] Trillian test result (tid-7961)
04ef4c2 to 17d6385
17d6385 to 39b9cfe
@blueorangutan package
@rohityadavcloud a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7411
// update the DB
if (host != null && transitState) {
disconnectAgent(host, event, _nodeId);
// update the DB
It was part of the initial code and we are updating the state on disconnecting the agent.
Do you mean this comment should stay there, @vishesh92?
IMO, yes. The comment should be more detailed instead.
DaanHoogland left a comment
Note the use of host.getUuid() instead of host.getId() in my suggestions!?
7bab9b3 to c8dd68f
@blueorangutan package
@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7468
@blueorangutan test
@vishesh92 a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
@blueorangutan test
@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
[SF] Trillian test result (tid-8071)
@blueorangutan package
@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7509
@blueorangutan test
@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
Tried to break this by setting a breakpoint on the host check in the disconnect handler on the secondary MS/host, and could see the host getting back up and connecting to its original MS/host.
[SF] Trillian test result (tid-8096)
Description
Depending on the agents' configuration, restarting a management server (the preferred MS for the agent) makes the agent connect to another management server (a non-preferred MS). When the preferred MS comes back up, the agent tries to disconnect from the non-preferred MS and connect to the preferred MS. A race condition can occur during this process, in which the disconnection from the non-preferred MS completes after the connection to the preferred MS. This causes the agent to go into an Alert state. During this time, the agent is still sending Ping to the preferred MS.
This PR solves this issue by:
On a Ping command, if the Host is not in Up state, we request the agent to send a startup command again to the connection. If the startup is successful, the agent will come back in Up state.
To reproduce the issue, [...]
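The Ping handling described above can be illustrated with a minimal, self-contained sketch. It is not the CloudStack code from this PR: the class `PingReconnectSketch`, the enums `HostState` and `PingReply`, and all method names are hypothetical and only model the idea that a Ping from a host that is not Up is answered with a request to send the startup command again.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the preferred management server; not CloudStack's API.
class PingReconnectSketch {
    // Only Up and Alert are named in the PR description.
    enum HostState { UP, ALERT }

    // Hypothetical replies to a Ping: acknowledge it, or ask for a new startup.
    enum PingReply { ACK, REQUEST_STARTUP }

    private final Map<Long, HostState> hostStates = new ConcurrentHashMap<>();

    // A successful startup command brings the host (back) to Up.
    void onStartup(long hostId) {
        hostStates.put(hostId, HostState.UP);
    }

    // Models the race: the disconnect from the non-preferred MS is processed
    // after the host has already connected here, leaving it in Alert.
    void onLateDisconnect(long hostId) {
        hostStates.put(hostId, HostState.ALERT);
    }

    // On Ping, a host that is not Up is asked to send startup again.
    PingReply onPing(long hostId) {
        HostState state = hostStates.get(hostId);
        return state == HostState.UP ? PingReply.ACK : PingReply.REQUEST_STARTUP;
    }

    HostState stateOf(long hostId) {
        return hostStates.get(hostId);
    }

    public static void main(String[] args) {
        PingReconnectSketch ms = new PingReconnectSketch();
        long hostId = 1L;

        ms.onStartup(hostId);        // agent connects to the preferred MS
        ms.onLateDisconnect(hostId); // stale disconnect arrives afterwards

        // The agent keeps pinging; the reply tells it to start up again.
        if (ms.onPing(hostId) == PingReply.REQUEST_STARTUP) {
            ms.onStartup(hostId);
        }
        System.out.println("Host state after recovery: " + ms.stateOf(hostId));
    }
}
```

In this sketch the recovery is driven entirely by the agent's next Ping, so no extra timer or background task is needed on the management server side.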
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
[...] Alert state in database. After getting a ping, it gets a startup command, after which it turns back to Up state.
How did you try to break this feature and the system with this change?