Skip to content

Conversation

@guzu
Copy link

@guzu guzu commented Nov 10, 2025

For tests that fails rarely or only on CI we need a way to investigate the failure.
This PR is perform a log collection (only using xen-bugtool for the moment)

xen-bugtool reports can be very big (~100MB or more) so we might change that
later, or filter report.
Timestamps for each tests are recorded and the start and stop of tests, so that corresponding
logs can be extracted for analysis (see extract_logs.py). Timestamps are recorded in XCP-ng to avoid difference in
clocks.

No change is required to enable log collection. But console capture is only enabled
on demand, using @pytest.mark.capture_console.

Log collection do not happen at each test, but at the end of the test session.
For Windows (mainly), a screenshot can be done for each failing test; this uses the VNC console
through XO-lite.

@guzu guzu requested review from dinhngtu and ydirson November 10, 2025 14:28
Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>
@guzu guzu force-pushed the eva/log-test-failure branch from 2d37684 to 28d7eed Compare November 10, 2025 14:32
Copy link
Member

@dinhngtu dinhngtu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed up to "lib: use option to put ssh in background". Could you squash the console capture commits, as it's a bit difficult to review otherwise?

Comment on lines 903 to 921
for fixturename in getattr(request, 'fixturenames', []):
try:
fixture_value = request.getfixturevalue(fixturename)
# Check if fixture is a Host
if isinstance(fixture_value, Host):
test_hosts.add(fixture_value)
# Check if fixture is a VM
elif isinstance(fixture_value, VM):
test_vms.add(fixture_value)
test_hosts.add(fixture_value.host)
# Check if fixture is a list
elif isinstance(fixture_value, list):
for item in fixture_value:
if isinstance(item, Host):
test_hosts.add(item)
elif isinstance(item, VM):
test_vms.add(item)
test_hosts.add(item.host)
# Check if fixture is a dict with Host or VM values
elif isinstance(fixture_value, dict):
for value in fixture_value.values():
if isinstance(value, Host):
test_hosts.add(value)
elif isinstance(value, VM):
test_vms.add(value)
test_hosts.add(value.host)
# Check if fixture has a 'host' attribute (like SR, Pool objects)
elif hasattr(fixture_value, 'host') and isinstance(fixture_value.host, Host):
test_hosts.add(fixture_value.host)
# Check if fixture is a Pool
elif hasattr(fixture_value, 'hosts') and isinstance(fixture_value.hosts, list):
test_hosts.update(h for h in fixture_value.hosts if isinstance(h, Host))
except Exception:
# Some fixtures may not be available yet or may fail to load
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check seems a bit "magical" and brittle to me. Perhaps there's a clearer way to manage the list of hosts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to find something easier to understand.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is a bit magical, it is the solution with the minimal friction with existing code.
Other solutions require adding code to tests, or will make maintenance harder, or even add coupling between Host and VM class and the .
Maybe @ydirson might come up with a better solution ?

In the meantime I tried to make it a bit more readable and do introspection only in error case.

@guzu
Copy link
Author

guzu commented Nov 12, 2025

I've reviewed up to "lib: use option to put ssh in background". Could you squash the console capture commits, as it's a bit difficult to review otherwise?

Sure, I could do that.
About the script for console capture and VNC client part, it was "vibe coded". I didn't find a proper library to handle just the VNC decoding.

@dinhngtu
Copy link
Member

Sure, I could do that. About the script for console capture and VNC client part, it was "vibe coded". I didn't find a proper library to handle just the VNC decoding.

I found https://github.com/sibson/vncdotool which looks quite feature-rich.

@guzu
Copy link
Author

guzu commented Nov 12, 2025

Sure, I could do that. About the script for console capture and VNC client part, it was "vibe coded". I didn't find a proper library to handle just the VNC decoding.

I found https://github.com/sibson/vncdotool which looks quite feature-rich.

I saw it, but it seems to handle the connection itself. In our case the VNC connection is handled by XAPI via websocket.

@guzu
Copy link
Author

guzu commented Nov 12, 2025

Sure, I could do that. About the script for console capture and VNC client part, it was "vibe coded". I didn't find a proper library to handle just the VNC decoding.

I found https://github.com/sibson/vncdotool which looks quite feature-rich.

I saw it, but it seems to handle the connection itself. In our case the VNC connection is handled by XAPI via websocket.

Maybe could just do a proxy that creates a websocket and expose a local port that could be used by an external tool or library. I could investigate that.

@guzu guzu force-pushed the eva/log-test-failure branch 2 times, most recently from 6da69d7 to 550cd61 Compare November 12, 2025 17:02
@guzu
Copy link
Author

guzu commented Nov 12, 2025

I change VM console capture and removed the VNC client implementation. Now the script mainly act as a proxy and uses a library for VNC capture.

@guzu guzu force-pushed the eva/log-test-failure branch from 550cd61 to 6216473 Compare November 13, 2025 17:14
@dinhngtu
Copy link
Member

"lib: check presence of xen-bugtool reports" has become empty. Could this commit be dropped, or did something get lost during the rebase?

@guzu
Copy link
Author

guzu commented Nov 17, 2025

"lib: check presence of xen-bugtool reports" has become empty. Could this commit be dropped, or did something get lost during the rebase?

Yes I remember seeing that. It seems that I forgot to remove the commit, but it wasn't lost it was merged into another. And it looks like it was merged into the wrong one, it should be in 10f4f1f and not 61d2e2f.

Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>
@guzu guzu force-pushed the eva/log-test-failure branch from 6216473 to 1ae1d60 Compare November 19, 2025 18:49
@stormi stormi requested a review from a team November 20, 2025 16:11
@stormi
Copy link
Member

stormi commented Nov 20, 2025

Similarly to what I said regarding the follow-up commit, such big change needs to come with an explanation for the reviewers/maintainers regarding what the goal is, how it is meant to be used and when, how it is supposed to integrate in CI jobs, why this is the appropriate solution, etc.

Here I'm not sure that @xcp-ng/os-platform-release as maintainers have been provided with these answers.

@guzu
Copy link
Author

guzu commented Nov 24, 2025

Similarly to what I said regarding the follow-up commit, such big change needs to come with an explanation for the reviewers/maintainers regarding what the goal is, how it is meant to be used and when, how it is supposed to integrate in CI jobs, why this is the appropriate solution, etc.

Description was updated

Copy link
Member

@glehmann glehmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add the types in the signature of the new methods in conftest.py?

conftest.py Outdated
test_hosts.add(fixture_value)
# Check if fixture is a list of hosts
elif isinstance(fixture_value, list) and all(isinstance(h, Host) for h in fixture_value):
test_hosts.update(fixture_value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add the element if it's a Host, to be consistent with what is done in Check if fixture is a Pool?

conftest.py Outdated
logging.debug(f"Record timestamp (end): {class_name}")
host.ssh(f"echo end {class_name} $(date '+%s') >> {HOST_TIMESTAMPS_FILE}")


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit is mostly empty and doesn't match the description

conftest.py Outdated

# Helper function for immediate console capture
def capture_vm_console(vm: VM, log_dir: str) -> str:
import sys
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imports should be at the top of the file, unless required by a circular import

parser = argparse.ArgumentParser(
description='VNC WebSocket Console Capture',
formatter_class=argparse.RawDescriptionHelpFormatter,
usage="""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't argparse able to generate that by itself?


else:
# Neither mode specified - show nice usage
print("VNC WebSocket Console Capture")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also probably be generated by argparse

@guzu guzu force-pushed the eva/log-test-failure branch from 1ae1d60 to 0257aef Compare November 25, 2025 14:13
guzu added 2 commits November 25, 2025 15:16
Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>
And not solely rely on subprocess and early quit.

Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>
@guzu guzu force-pushed the eva/log-test-failure branch from 0257aef to 911bee5 Compare November 25, 2025 14:17
guzu added 4 commits November 27, 2025 09:48
Screenshot of the VM screen will be saved for failed tests,
if marked with pytest.mark.capture_console.
This is based on new script capture-console.py that connect to
XO-lite for capture.

Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>
Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>
Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>
Extract from syslog type logs based on epoch timestamps.

Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>
Test for log collection need to fail in order to trigger the log
collection; so pytest.mark.xfail can't be used.
But tests need to by referenced by a job, otherwise './jobs.py check'
will fail.

Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>
@guzu guzu force-pushed the eva/log-test-failure branch from 911bee5 to 8ebf7af Compare November 27, 2025 08:48
Copy link
Contributor

@ydirson ydirson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The review below is only for the host log collection feature.
The VM console capture and others should IMO live a separate PRs, there seems to be no reason one would block the other.
Same for the "smaller general improvements" like bumping pyright and ssh fix, they could have their own PR which could be merged faster.

Comment on lines 481 to +482
name = "pyright"
version = "1.1.402"
version = "1.1.407"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this kind of change we should remember to get the branch rebased before merging (to catch anything new in master that would not pass 1.1.407)

assert CACHE_IMPORTED_VM in [True, False]

# Session-level tracking for failed tests and their associated hosts
FAILED_TESTS_HOSTS = {} # {test_nodeid: set(Host)}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment essentially looks like a good candidate for being a type hint?


# Session-level tracking for failed tests and their associated hosts
FAILED_TESTS_HOSTS = {} # {test_nodeid: set(Host)}
SESSION_HAS_FAILURES = False
Copy link
Contributor

@ydirson ydirson Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need SESSION_HAS_FAILURES? It looks like it just reflects that FAILED_TESTS_HOSTS has contents or not?

Comment on lines +110 to +121
parser.addoption(
"--collect-logs-on-failure",
action="store_true",
default=True,
help="Automatically collect xen-bugtool logs from hosts when tests fail (default: True)"
)
parser.addoption(
"--no-collect-logs-on-failure",
action="store_false",
dest="collect_logs_on_failure",
help="Disable automatic log collection on test failure"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may not be necessary to have a flag for expliciting the default behaviour?

Comment on lines +785 to +793
# Session-scoped fixture to create a shared log directory for all artifacts
@pytest.fixture(scope='session')
def session_log_dir():
"""
Create and return a session-wide log directory for storing all test artifacts.
The directory is created at session start and shared by all fixtures.
Directory naming includes BUILD_NUMBER if running in Jenkins CI environment.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc says this fixtures creates the dir, but it is actually created by a different one.

Comment on lines +809 to +811
# Record test timestamps to easily extract logs
@pytest.fixture(scope='class', autouse=True)
def bugreport_timestamp(request, host):
Copy link
Contributor

@ydirson ydirson Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Record test start and end timestamps"? Could make sense to be a docstring?

The "to easily extract logs" could have more details, as is it is no obvious how those timestamps help that.
That indeed seems to be related to "scripts: add log extraction script for investigation", woudn't it make sense to regroup it into that other commit?

Comment on lines +828 to +831
# Check all fixtures used by this test
# XXX: this is a bit hard to understand but this introspection of fixtures
# is the best way to gather VMs and Hosts. Other alternatives require,
# maintenance, boilerplate and are more prone to miss some cases.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's "hard to understand" maybe an overview of how it works would be useful (it's not that hard to understand actually 😉)
I'm a bit unsure this is a good approach: it still requires to enumerate the contexts in which a host can be found, so it seems possible that we miss some, or some other contexts get added in the future and we would need (or forget) to add them here.

Can't we instead get the Host ctor to cheaply register all hosts used (somewhere in the request), so we can just pick them when we see the test fail?

(sidenote: if we keep it, the function name is a bit too generic and describe how it does things, not what it does)

Comment on lines 783 to 785


# Session-scoped fixture to create a shared log directory for all artifacts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there is no single obvious entrypoint, it could be useful to have a little write-up first, on the overall architecture of the feature, and how the fixtures work together towards the goal. Right here before defining them could be a good-enough place?

Maybe session_log_dir could come last in definition order, as it looks more of an implementation detail, and collect_logs_on_session_failure could come first?

Comment on lines +53 to +54
# Path to timestamp file on remote hosts for log extraction
HOST_TIMESTAMPS_FILE = "/tmp/pytest-timestamps.log"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would likely want to use a per-session temporary file, or there might be issues when several sessions run simultaneously.

Comment on lines +819 to +824
@pytest.fixture(scope='session', autouse=True)
def check_bug_reports(host):
"""
That could be the sign of interrupted tests and some cleanup might be
required. We don't want to accumulate too much of these files.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring should start with a one-line explaining what the fixture does

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants