CA-404658: Split heartbeat thread #17
BengangY
commented
Feb 14, 2025
- Split heartbeat thread
- Print thread ID at startup
Can you please share the test results w/ and w/o the splitting change?
addc01e to
be67a8c
Compare
    // start heartbeat sending thread
    ret = pthread_create(&hb_send_thread, xhad_pthread_attr, hb_send, NULL);
    if (ret)
Do we need to do anything here to cleanup the receiving thread if we fail to start the sending thread?
No need. Currently, if a heartbeat thread fails to be created, hb_initialize returns a non-zero status and hb_cleanup_objects is called, but that function does nothing (the code in it is not compiled).
This is correct with regard to a failure during the setup of these threads, but we're missing correct cleanup of the HB threads after they have been set up successfully.
In the case of an HA failure that causes a fence this isn't an issue, but if HA reaches a natural termination (e.g. disabling HA) it may leave hanging threads, which could cause issues.
For example, the changes in #26 will require threads to be cleaned up before clearing up the logging lock: #26 (comment)
Ideally the receive thread would be cleaned up if creating the send thread fails but cleanup in general here seems lacking.
I'm going to take on the task of cleaning up threads in another PR, so leave this to me.
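The rollback discussed in this thread (stop the receive thread if the send thread cannot be created) could look roughly like the sketch below. It is not the xha code: `terminate_flag`, `hb_loop`, `start_hb_threads`, `stop_hb_threads` and the injectable `create_fn` are hypothetical names used only for illustration, and the loop body is a placeholder for the real socket and timer work.

```c
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the daemon's terminate flag and thread handles. */
static atomic_bool terminate_flag;
static pthread_t recv_thread, send_thread;

/* Placeholder loop: the real threads would block on socket I/O and timers. */
static void *hb_loop(void *ignore)
{
    (void)ignore;
    while (!atomic_load(&terminate_flag))
        sched_yield();
    return NULL;
}

/* Same shape as pthread_create, so a failure can be injected for testing. */
typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
                         void *(*)(void *), void *);

/* Always fails; used only to exercise the error path. */
int failing_create(pthread_t *t, const pthread_attr_t *a,
                   void *(*fn)(void *), void *arg)
{
    (void)t; (void)a; (void)fn; (void)arg;
    return EAGAIN;
}

/* Start the receive thread, then the send thread; undo the first on failure. */
int start_hb_threads(create_fn create_send)
{
    int ret;

    atomic_store(&terminate_flag, false);
    ret = pthread_create(&recv_thread, NULL, hb_loop, NULL);
    if (ret)
        return ret;

    ret = create_send(&send_thread, NULL, hb_loop, NULL);
    if (ret) {
        /* Roll back: ask the receive thread to exit and wait for it. */
        atomic_store(&terminate_flag, true);
        pthread_join(recv_thread, NULL);
    }
    return ret;
}

/* Normal shutdown: signal both threads, then join both. */
void stop_hb_threads(void)
{
    atomic_store(&terminate_flag, true);
    pthread_join(recv_thread, NULL);
    pthread_join(send_thread, NULL);
}
```

Calling `start_hb_threads(pthread_create)` takes the normal path, while `start_hb_threads(failing_create)` returns `EAGAIN` with no thread left running. The same signal-then-join pattern would apply to a natural termination such as disabling HA.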
be67a8c to
4f0163e
Compare
I created a crontab entry to run "cat /proc/net/udp | grep '02B6'" on each host every 5 minutes to record packet drops on UDP port 694. The test results are below:
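For reference, `02B6` in the grep pattern above is port 694 in hexadecimal, and the drop counter is the last column of each `/proc/net/udp` row. A minimal sketch of extracting that counter programmatically follows; `parse_udp_row` and `udp_drops_for_port` are hypothetical helper names, and the code assumes the usual Linux layout where `drops` is the final field.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse one /proc/net/udp row. Returns 0 on success, -1 for the header
 * line or anything else that is not a socket row. */
int parse_udp_row(const char *line, unsigned *port, unsigned long *drops)
{
    unsigned p;
    const char *end, *tok;

    /* Row starts with "  <sl>: <hexaddr>:<hexport> ..." */
    if (sscanf(line, " %*d: %*[0-9A-Fa-f]:%x", &p) != 1)
        return -1;

    /* Assumption: the drop counter is the last whitespace-separated field. */
    end = line + strlen(line);
    while (end > line && isspace((unsigned char)end[-1]))
        end--;
    tok = end;
    while (tok > line && !isspace((unsigned char)tok[-1]))
        tok--;
    if (tok == end)
        return -1;

    *port = p;
    *drops = strtoul(tok, NULL, 10);
    return 0;
}

/* Sum the drop counters of all sockets bound to want_port. */
unsigned long udp_drops_for_port(FILE *f, unsigned want_port)
{
    char line[512];
    unsigned port;
    unsigned long drops, total = 0;

    while (fgets(line, sizeof line, f))
        if (parse_udp_row(line, &port, &drops) == 0 && port == want_port)
            total += drops;
    return total;
}
```

A caller would open `/proc/net/udp` with `fopen` and pass the stream with `want_port` set to 694. Comparing the value between two samples gives the number of drops in that interval.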
I've recently added CI to this repo; if you rebase your branch on top of the latest master, we should start to see workflow runs here.
alexbrett
left a comment
I've run some testing with this change, and in a situation where xha was failing to keep up with receiving heartbeats this does seem to resolve it.
There is a separate effort underway to identify why the receiving side is getting 'bogged down': the load isn't huge, so it should cope even while occasionally sending heartbeats as well. This change definitely appears to be an improvement, though, so I think it is worth taking regardless of the outcome of that investigation.
4f0163e to
a4bf3d5
Compare
e57c590 to
1770cff
Compare
I have rebased on master and resolved the CI issues. It now passes all the checks.
    // Refresh watchdog counter to Wh
    if (hbvar.watchdog != INVALID_WATCHDOG_HANDLE_VALUE)
    {
        watchdog_set(hbvar.watchdog, _Wh);
With the send and receive threads split, I would have expected that each thread gets its own watchdog handle. Otherwise it is possible that the send thread gets stuck but the watchdog doesn't notice because the receive thread continues refreshing it.
If the send thread gets stuck, other hosts will not receive heartbeats from this host, which will eventually trigger a self-fence.
Indeed. I was thinking it would be better if it could be detected more directly on this host, but I guess this works too.
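One way the more direct, on-host detection mentioned above could work, without a second watchdog handle, is for each thread to record its own progress and for the watchdog refresh to be gated on all threads being recent. This is only a sketch of that alternative, not what the PR does; `hb_liveness`, `hb_liveness_mark` and `hb_all_alive` are hypothetical names, and the timestamps are whatever monotonic unit the caller chooses.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One record per heartbeat thread (hypothetical type). */
typedef struct {
    atomic_long last_progress; /* time of the thread's last completed loop */
} hb_liveness;

/* Called by a thread each time it completes a send or receive cycle. */
void hb_liveness_mark(hb_liveness *l, long now)
{
    atomic_store(&l->last_progress, now);
}

/* True only if every thread made progress within max_age of now. */
bool hb_all_alive(hb_liveness *threads, size_t n, long now, long max_age)
{
    for (size_t i = 0; i < n; i++)
        if (now - atomic_load(&threads[i].last_progress) > max_age)
            return false;
    return true;
}
```

With this in place, whichever thread refreshes the watchdog would call `watchdog_set` only when `hb_all_alive` returns true, so a stuck send thread would let the watchdog expire locally instead of relying on peers noticing the missing heartbeats.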
daemon/bond_mon.c
Outdated
    static MTC_BOND_STATUS bond_status = BOND_STATUS_NOERR;
    PCOM_DATA_BM pbm;

    log_message(MTC_LOG_INFO, "BM: thread ID: %ld.\n", syscall(SYS_gettid));
It might be nicer to create a gettid() wrapper function than using a raw syscall() everywhere.
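A wrapper along the lines suggested here might look like the following sketch. The name `xha_gettid` is hypothetical; it is deliberately not called `gettid`, because glibc 2.30 and later already declare a function of that name when `_GNU_SOURCE` is defined.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the kernel thread ID of the calling thread (Linux only). */
static inline pid_t xha_gettid(void)
{
    return (pid_t)syscall(SYS_gettid);
}
```

Call sites would then become `log_message(MTC_LOG_INFO, "BM: thread ID: %ld.\n", (long)xha_gettid());`, with the cast keeping the existing `%ld` format specifier valid.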
Split heartbeat thread into sending heartbeat thread and receiving heartbeat thread. Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>
Print all threads' ID in the xha.log at startup. Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>
1770cff to
74c021e
Compare
    hbvar.terminate = TRUE;
    hb_spin_unlock();
    // wait for receive thread termination
    if ((ret = pthread_join(hb_receive_thread, NULL)))
Check warning
Code scanning / CodeChecker
Although the value stored to 'ret' is used in the enclosing expression, the value is never actually read from 'ret' Warning
    if (hb_send_thread)
    {
        // wait for send thread termination
        if ((ret = pthread_join(hb_send_thread, NULL)))
Check warning
Code scanning / CodeChecker
Although the value stored to 'ret' is used in the enclosing expression, the value is never actually read from 'ret' Warning
    MTC_STATIC void *
    hb_send(
        void *ignore)
Check warning
Code scanning / CodeChecker
unused parameter 'ignore' Warning
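Both kinds of warning above have conventional fixes: read the stored return value (for example by logging it), and mark the intentionally unused parameter with a `(void)` cast. The sketch below shows the pattern in isolation; `noop_worker` and `join_and_report` are hypothetical names, and `fprintf` stands in for the daemon's own `log_message`.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Thread entry point that does not need its argument. */
void *noop_worker(void *ignore)
{
    (void)ignore; /* silences the unused-parameter warning */
    return NULL;
}

/* Join a thread and actually use the result, so the store is not dead. */
int join_and_report(pthread_t thread, const char *name)
{
    int ret = pthread_join(thread, NULL);

    if (ret)
        fprintf(stderr, "pthread_join(%s) failed: %s\n", name, strerror(ret));
    return ret;
}
```

`pthread_join` returns an error number directly rather than setting `errno`, which is why the value is passed to `strerror`.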
I've resolved the conflicts on this branch. It was just a whitespace change near a lock that was replaced with rwlocking.