CA-404658: Split heartbeat thread#17

Merged
GeraldEV merged 3 commits into xenserver:master from BengangY:private/bengangy/CA-404658
Mar 5, 2025

Conversation

@BengangY
Contributor

  1. Split heartbeat thread
  2. Print thread ID at startup

@BengangY BengangY marked this pull request as ready for review February 14, 2025 10:02
@minglumlu
Contributor

Can you please share the test results w/ and w/o the splitting change?

@BengangY BengangY force-pushed the private/bengangy/CA-404658 branch from addc01e to be67a8c on February 18, 2025 13:57

// start heartbeat sending thread
ret = pthread_create(&hb_send_thread, xhad_pthread_attr, hb_send, NULL);
if (ret)
Contributor


Do we need to do anything here to cleanup the receiving thread if we fail to start the sending thread?

Contributor Author


No need. Currently, if a heartbeat thread fails to create, hb_initialize returns a non-zero status and hb_cleanup_objects is called, but that function does nothing (the code inside it is compiled out).

Collaborator


This is correct with regard to a failure during the setup of these threads, but we're missing correct cleanup of the HB threads after they have been set up successfully.
Obviously in the case of an HA failure which causes a fence this isn't an issue, but if HA reaches a natural termination (e.g. disabling HA) it may leave hanging threads, which could cause issues.
For example, the changes in #26 will require threads to be cleaned up before clearing up the logging lock: #26 (comment)

Collaborator


Ideally the receive thread would be cleaned up if creating the send thread fails but cleanup in general here seems lacking.

Collaborator


I'm going to take on the task of cleaning up threads in another PR so leave this to me

@BengangY BengangY force-pushed the private/bengangy/CA-404658 branch from be67a8c to 4f0163e on February 20, 2025 02:07
@BengangY
Contributor Author

> Can you please share the test results w/ and w/o the splitting change?

I created a crontab to run "cat /proc/net/udp | grep '02B6'" on each host every 5 minutes to record UDP port 694 packet drops. The test results are below:

  1. With the split heartbeat thread:
    HA ran for about 5 days (from Feb 14 7:50 to Feb 19 5:35) with no packet drops.
  2. Without the split heartbeat thread:
    HA ran for less than 1 day (from Feb 19 5:40 to Feb 20 2:10), and I clearly observed packet drops on each host. The drops began at Feb 20 0:25.
    20250215-udp-drop-no-split-heartbeat-thread.txt

@edwintorok
Contributor

I've recently added CI to this repo; if you rebase your branch on top of the latest master then we should start to see workflow runs here.

Contributor

@alexbrett alexbrett left a comment


I've run some testing with this change, and in a situation where xha was failing to keep up with receiving heartbeats this does seem to resolve it.

There is a separate effort underway to identify why the receiving is getting 'bogged down', as it isn't a huge load so it should cope even with sending heartbeats occasionally as well, but this change definitely appears to be an improvement so I think is worth taking regardless of the outcome of that investigation.

@BengangY BengangY force-pushed the private/bengangy/CA-404658 branch from 4f0163e to a4bf3d5 on February 26, 2025 15:00
@BengangY BengangY force-pushed the private/bengangy/CA-404658 branch 3 times, most recently from e57c590 to 1770cff on February 27, 2025 10:27
@BengangY
Contributor Author

> I've recently added CI to this repo; if you rebase your branch on top of the latest master then we should start to see workflow runs here.

I have rebased on master and fixed the CI failures. All checks now pass.

// Refresh watchdog counter to Wh
if (hbvar.watchdog != INVALID_WATCHDOG_HANDLE_VALUE)
{
watchdog_set(hbvar.watchdog, _Wh);
Collaborator


With the send and receive threads split, I would have expected that each thread gets its own watchdog handle. Otherwise it is possible that the send thread gets stuck but the watchdog doesn't notice because the receive thread continues refreshing it.

Contributor Author


If the send thread gets stuck, other hosts will not receive heartbeats from this host, which will eventually trigger a self-fence.

Collaborator


Indeed. I was thinking it would be better if it could be detected more directly on this host, but I guess this works too.

static MTC_BOND_STATUS bond_status = BOND_STATUS_NOERR;
PCOM_DATA_BM pbm;

log_message(MTC_LOG_INFO, "BM: thread ID: %ld.\n", syscall(SYS_gettid));
Collaborator


It might be nicer to create a gettid() wrapper function than using a raw syscall() everywhere.



BengangY added 2 commits March 5, 2025 13:16
Split heartbeat thread into sending
heartbeat thread and receiving heartbeat thread.

Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>
Print all threads' ID in the xha.log at startup.

Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>
@BengangY BengangY force-pushed the private/bengangy/CA-404658 branch from 1770cff to 74c021e on March 5, 2025 14:17
hbvar.terminate = TRUE;
hb_spin_unlock();
// wait for receive thread termination
if ((ret = pthread_join(hb_receive_thread, NULL)))

Check warning (Code scanning / CodeChecker): Although the value stored to 'ret' is used in the enclosing expression, the value is never actually read from 'ret'
if (hb_send_thread)
{
// wait for send thread termination
if ((ret = pthread_join(hb_send_thread, NULL)))

Check warning (Code scanning / CodeChecker): Although the value stored to 'ret' is used in the enclosing expression, the value is never actually read from 'ret'

MTC_STATIC void *
hb_send(
void *ignore)

Check warning (Code scanning / CodeChecker): unused parameter 'ignore'
@GeraldEV
Collaborator

GeraldEV commented Mar 5, 2025

I've resolved the conflicts on this branch; it was just a whitespace change near a lock that was replaced with rwlocking.

@GeraldEV GeraldEV merged commit f39a424 into xenserver:master Mar 5, 2025
4 checks passed
8 participants