Collect logs in case of test failure #362

guzu · 2025-11-10T14:28:53Z

For tests that fail rarely or only on CI, we need a way to investigate the failure.
This PR performs log collection (only using xen-bugtool for the moment).

xen-bugtool reports can be very big (~100MB or more), so we might change that
later, or filter the report.
Timestamps are recorded at the start and stop of each test, so that the corresponding
logs can be extracted for analysis (see extract_logs.py). Timestamps are recorded on the XCP-ng host to avoid
clock differences.

No change is required to enable log collection, but console capture is only enabled
on demand, using @pytest.mark.capture_console.

Log collection does not happen after each test, but at the end of the test session.
For Windows (mainly), a screenshot can be taken for each failing test; this uses the VNC console
through XO-lite.

Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>

dinhngtu

I've reviewed up to "lib: use option to put ssh in background". Could you squash the console capture commits, as it's a bit difficult to review otherwise?

dinhngtu · 2025-11-12T08:39:43Z

conftest.py

+    for fixturename in getattr(request, 'fixturenames', []):
+        try:
+            fixture_value = request.getfixturevalue(fixturename)
+            # Check if fixture is a Host
+            if isinstance(fixture_value, Host):
+                test_hosts.add(fixture_value)
+            # Check if fixture is a VM
+            elif isinstance(fixture_value, VM):
+                test_vms.add(fixture_value)
+                test_hosts.add(fixture_value.host)
+            # Check if fixture is a list
+            elif isinstance(fixture_value, list):
+                for item in fixture_value:
+                    if isinstance(item, Host):
+                        test_hosts.add(item)
+                    elif isinstance(item, VM):
+                        test_vms.add(item)
+                        test_hosts.add(item.host)
+            # Check if fixture is a dict with Host or VM values
+            elif isinstance(fixture_value, dict):
+                for value in fixture_value.values():
+                    if isinstance(value, Host):
+                        test_hosts.add(value)
+                    elif isinstance(value, VM):
+                        test_vms.add(value)
+                        test_hosts.add(value.host)
+            # Check if fixture has a 'host' attribute (like SR, Pool objects)
+            elif hasattr(fixture_value, 'host') and isinstance(fixture_value.host, Host):
+                test_hosts.add(fixture_value.host)
+            # Check if fixture is a Pool
+            elif hasattr(fixture_value, 'hosts') and isinstance(fixture_value.hosts, list):
+                test_hosts.update(h for h in fixture_value.hosts if isinstance(h, Host))
+        except Exception:
+            # Some fixtures may not be available yet or may fail to load
+            pass


This check seems a bit "magical" and brittle to me. Perhaps there's a clearer way to manage the list of hosts?

I will try to find something easier to understand.

I agree that this is a bit magical, but it is the solution with the least friction with the existing code.
Other solutions require adding code to tests, or will make maintenance harder, or even add coupling between Host and VM class and the .
Maybe @ydirson can come up with a better solution?

In the meantime I tried to make it a bit more readable and do introspection only in error case.

conftest.py

lib/host.py

guzu · 2025-11-12T09:53:01Z

Sure, I could do that.
About the script for console capture and VNC client part, it was "vibe coded". I didn't find a proper library to handle just the VNC decoding.

dinhngtu · 2025-11-12T10:05:47Z

I found https://github.com/sibson/vncdotool which looks quite feature-rich.

guzu · 2025-11-12T10:10:06Z

I saw it, but it seems to handle the connection itself. In our case the VNC connection is handled by XAPI via websocket.

guzu · 2025-11-12T10:13:10Z

Maybe I could just write a proxy that creates the websocket and exposes a local port that could be used by an external tool or library. I could investigate that.

guzu · 2025-11-12T17:05:27Z

I changed the VM console capture and removed the VNC client implementation. Now the script mainly acts as a proxy and uses a library for VNC capture.

dinhngtu · 2025-11-17T14:10:33Z

"lib: check presence of xen-bugtool reports" has become empty. Could this commit be dropped, or did something get lost during the rebase?

guzu · 2025-11-17T14:23:58Z

Yes, I remember seeing that. I forgot to remove the commit; its content wasn't lost, it was merged into another commit. It looks like it was merged into the wrong one, though: it should be in 10f4f1f and not 61d2e2f.

stormi · 2025-11-20T16:16:01Z

Similarly to what I said regarding the follow-up commit, such a big change needs to come with an explanation for the reviewers/maintainers regarding what the goal is, how and when it is meant to be used, how it is supposed to integrate into CI jobs, why this is the appropriate solution, etc.

Here I'm not sure that @xcp-ng/os-platform-release as maintainers have been provided with these answers.

guzu · 2025-11-24T16:06:34Z

Description was updated

glehmann

Could you add the types in the signature of the new methods in conftest.py?

glehmann · 2025-11-25T09:30:28Z

conftest.py

+                test_hosts.add(fixture_value)
+            # Check if fixture is a list of hosts
+            elif isinstance(fixture_value, list) and all(isinstance(h, Host) for h in fixture_value):
+                test_hosts.update(fixture_value)


Maybe add the element if it's a Host, to be consistent with what is done in "Check if fixture is a Pool"?
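
A minimal sketch of that suggestion, assuming a stand-in `Host` class and a hypothetical helper `hosts_in` (neither is the code from this PR): the list is filtered element by element instead of requiring every element to be a `Host`.

```python
class Host:
    """Stand-in for lib.host.Host, only here to make the sketch runnable."""

    def __init__(self, name: str) -> None:
        self.name = name


def hosts_in(fixture_value: object) -> set[Host]:
    """Return the Host instances found in a list fixture, ignoring other items."""
    if isinstance(fixture_value, list):
        return {item for item in fixture_value if isinstance(item, Host)}
    return set()


h1, h2 = Host("host1"), Host("host2")
mixed = [h1, "not a host", h2]
found = hosts_in(mixed)  # only h1 and h2 are kept
```

With this form a list that mixes hosts and other objects still contributes its hosts, which matches the filtering already done for `fixture_value.hosts`.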

glehmann · 2025-11-25T09:32:48Z

conftest.py

    logging.debug(f"Record timestamp (end): {class_name}")
    host.ssh(f"echo end   {class_name} $(date '+%s') >> {HOST_TIMESTAMPS_FILE}")

-


This commit is mostly empty and doesn't match its description.

glehmann · 2025-11-25T09:34:52Z

conftest.py


+# Helper function for immediate console capture
+def capture_vm_console(vm: VM, log_dir: str) -> str:
+    import sys


Imports should be at the top of the file, unless required by a circular import.

glehmann · 2025-11-25T09:39:21Z

scripts/capture-console.py

+    parser = argparse.ArgumentParser(
+        description='VNC WebSocket Console Capture',
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        usage="""


Isn't argparse able to generate that by itself?

glehmann · 2025-11-25T09:39:48Z

scripts/capture-console.py

+
+    else:
+        # Neither mode specified - show nice usage
+        print("VNC WebSocket Console Capture")


This should also probably be generated by argparse
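
As a rough illustration of the two argparse remarks above: argparse builds the usage line itself, and a required mutually exclusive group makes it print that usage with an error when no mode is given. The `--screenshot` and `--proxy` options below are hypothetical; the script's real options are not shown in this thread.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="capture-console.py",
        description="VNC WebSocket Console Capture",
    )
    # Hypothetical modes, used only to show the mechanism.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--screenshot", metavar="FILE", help="save one screenshot to FILE")
    mode.add_argument("--proxy", metavar="PORT", type=int, help="expose the console on local PORT")
    return parser


parser = build_parser()
usage = parser.format_usage()  # generated by argparse, no hand-written text
args = parser.parse_args(["--screenshot", "out.png"])
```

Calling `parser.parse_args([])` here would exit with the generated usage and an error message, so no manual "neither mode specified" branch is needed.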

The test for log collection needs to fail in order to trigger the log collection, so pytest.mark.xfail can't be used. But tests need to be referenced by a job, otherwise './jobs.py check' will fail. Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>

ydirson

The review below is only for the host log collection feature.
The VM console capture and the other features should IMO live in separate PRs; there seems to be no reason one would block the other.
Same for the "smaller general improvements" like bumping pyright and the ssh fix: they could have their own PRs, which could be merged faster.

ydirson · 2025-11-27T09:57:48Z

uv.lock

 name = "pyright"
-version = "1.1.402"
+version = "1.1.407"


For this kind of change we should remember to get the branch rebased before merging (to catch anything new in master that would not pass 1.1.407).

ydirson · 2025-11-27T10:02:24Z

conftest.py

 assert CACHE_IMPORTED_VM in [True, False]

+# Session-level tracking for failed tests and their associated hosts
+FAILED_TESTS_HOSTS = {}  # {test_nodeid: set(Host)}


This comment essentially looks like a good candidate for being a type hint?
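
A minimal sketch of the annotated form, with a stand-in `Host` class and a made-up test node id (both are assumptions for illustration):

```python
class Host:
    """Stand-in for lib.host.Host so the annotation can be evaluated here."""


# Maps each failed test's node id to the hosts it used.
FAILED_TESTS_HOSTS: dict[str, set[Host]] = {}

# Hypothetical node id, only to show how an entry would be recorded.
FAILED_TESTS_HOSTS.setdefault("tests/example/test_x.py::test_y", set()).add(Host())
```

The annotation carries the same information as the trailing comment, and a type checker such as pyright can then verify it.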

ydirson · 2025-11-27T10:03:44Z

conftest.py


+# Session-level tracking for failed tests and their associated hosts
+FAILED_TESTS_HOSTS = {}  # {test_nodeid: set(Host)}
+SESSION_HAS_FAILURES = False


Do we really need SESSION_HAS_FAILURES? It looks like it just reflects that FAILED_TESTS_HOSTS has contents or not?
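
If the flag only mirrors the dict, it could be derived instead of stored. A small sketch, using plain strings in place of `Host` objects and a hypothetical helper name:

```python
# Host names stand in for Host objects to keep the sketch self-contained.
FAILED_TESTS_HOSTS: dict[str, set[str]] = {}


def session_has_failures() -> bool:
    """True as soon as at least one failed test has been recorded."""
    return bool(FAILED_TESTS_HOSTS)


before = session_has_failures()
FAILED_TESTS_HOSTS["test_a"] = {"host1"}
after = session_has_failures()
```

This only works if every failing test is recorded in the dict, including tests for which no host could be found.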

ydirson · 2025-11-27T10:05:52Z

conftest.py

+    parser.addoption(
+        "--collect-logs-on-failure",
+        action="store_true",
+        default=True,
+        help="Automatically collect xen-bugtool logs from hosts when tests fail (default: True)"
+    )
+    parser.addoption(
+        "--no-collect-logs-on-failure",
+        action="store_false",
+        dest="collect_logs_on_failure",
+        help="Disable automatic log collection on test failure"
+    )


It may not be necessary to have a flag that makes the default behaviour explicit?
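
A sketch of the single-flag variant, using plain argparse as a stand-in for pytest's `parser.addoption` (which, as far as I know, accepts the same `action`, `dest` and `help` keywords):

```python
import argparse

parser = argparse.ArgumentParser()
# Collection is on by default; only the opt-out flag is exposed.
parser.add_argument(
    "--no-collect-logs-on-failure",
    action="store_false",
    dest="collect_logs_on_failure",
    help="Disable automatic log collection on test failure",
)

default_args = parser.parse_args([])
opt_out_args = parser.parse_args(["--no-collect-logs-on-failure"])
```

With `store_false` the destination defaults to `True`, so the positive flag becomes redundant.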

ydirson · 2025-11-27T10:09:48Z

conftest.py

+# Session-scoped fixture to create a shared log directory for all artifacts
+@pytest.fixture(scope='session')
+def session_log_dir():
+    """
+    Create and return a session-wide log directory for storing all test artifacts.
+    The directory is created at session start and shared by all fixtures.
+
+    Directory naming includes BUILD_NUMBER if running in Jenkins CI environment.
+    """


This docstring says this fixture creates the dir, but it is actually created by a different one.

ydirson · 2025-11-27T10:11:26Z

conftest.py

+# Record test timestamps to easily extract logs
+@pytest.fixture(scope='class', autouse=True)
+def bugreport_timestamp(request, host):


"Record test start and end timestamps"? Could make sense to be a docstring?

The "to easily extract logs" part could have more details; as is, it is not obvious how those timestamps help with that.
That indeed seems to be related to "scripts: add log extraction script for investigation"; wouldn't it make sense to regroup it into that other commit?
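
For readers wondering how the timestamps could help, here is a hedged sketch. It is not how extract_logs.py is implemented; it assumes a `start` line that mirrors the `end <class_name> <epoch>` line written by the fixture, and the helper names are hypothetical.

```python
def parse_windows(lines: list[str]) -> dict[str, tuple[int, int]]:
    """Map each test class name to its (start, end) epoch pair."""
    starts: dict[str, int] = {}
    windows: dict[str, tuple[int, int]] = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip lines that do not have the expected three fields
        kind, name, epoch = fields[0], fields[1], int(fields[2])
        if kind == "start":
            starts[name] = epoch
        elif kind == "end" and name in starts:
            windows[name] = (starts[name], epoch)
    return windows


def in_window(epoch: int, window: tuple[int, int]) -> bool:
    """True if a log entry's epoch falls inside the test's time window."""
    start, end = window
    return start <= epoch <= end


# Made-up sample content of the timestamp file.
sample = [
    "start TestFoo 1762784933",
    "end   TestFoo 1762784990",
]
windows = parse_windows(sample)
```

Log lines whose timestamp falls inside a class's window are then the ones worth keeping for that failure.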

ydirson · 2025-11-27T10:14:54Z

conftest.py

+    # Check all fixtures used by this test
+    # XXX: this is a bit hard to understand but this introspection of fixtures
+    #      is the best way to gather VMs and Hosts. Other alternatives require,
+    #      maintenance, boilerplate and are more prone to miss some cases.


If it's "hard to understand" maybe an overview of how it works would be useful (it's not that hard to understand actually 😉)
I'm a bit unsure this is a good approach: it still requires enumerating the contexts in which a host can be found, so it seems possible that we miss some, or that other contexts get added in the future and we would need (or forget) to add them here.

Can't we instead get the Host ctor to cheaply register all hosts used (somewhere in the request), so we can just pick them when we see the test fail?

(sidenote: if we keep it, the function name is a bit too generic and describes how it does things, not what it does)
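
A minimal sketch of the constructor-registration idea, with a stand-in `Host` class. The class-level `_registry` and `all_hosts` names are hypothetical, and a session-wide list is a simplification of "somewhere in the request":

```python
class Host:
    """Stand-in for lib.host.Host; only the registration logic matters here."""

    # Hypothetical class-level registry of every Host created in the session.
    _registry: "list[Host]" = []

    def __init__(self, hostname_or_ip: str) -> None:
        self.hostname_or_ip = hostname_or_ip
        Host._registry.append(self)

    @classmethod
    def all_hosts(cls) -> "list[Host]":
        """Return a copy so callers cannot mutate the registry."""
        return list(cls._registry)


master = Host("10.0.0.1")
member = Host("10.0.0.2")
used = [h.hostname_or_ip for h in Host.all_hosts()]
```

On failure, the hook would read the registry instead of introspecting fixtures, at the cost of collecting from hosts that the failing test may not have touched.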

ydirson · 2025-11-27T10:30:35Z

conftest.py

+
+
+# Session-scoped fixture to create a shared log directory for all artifacts


Since there is no single obvious entrypoint, it could be useful to have a little write-up first, on the overall architecture of the feature, and how the fixtures work together towards the goal. Right here before defining them could be a good-enough place?

Maybe session_log_dir could come last in definition order, as it looks more of an implementation detail, and collect_logs_on_session_failure could come first?

ydirson · 2025-11-27T10:47:18Z

conftest.py

+# Path to timestamp file on remote hosts for log extraction
+HOST_TIMESTAMPS_FILE = "/tmp/pytest-timestamps.log"


We would likely want to use a per-session temporary file, or there might be issues when several sessions run simultaneously.
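
One possible shape for this, as a sketch only: derive the path from an identifier generated once per session. The helper name and the use of `uuid` are assumptions; running `mktemp` on the host would be another option.

```python
import uuid


def host_timestamps_file(session_id: str) -> str:
    """Build a timestamp file path that is unique to one pytest session."""
    return f"/tmp/pytest-timestamps-{session_id}.log"


# One id generated once per session, e.g. in a session-scoped fixture.
SESSION_ID = uuid.uuid4().hex
HOST_TIMESTAMPS_FILE = host_timestamps_file(SESSION_ID)
```

Two sessions running against the same host would then append to different files.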

ydirson · 2025-11-27T10:49:17Z

conftest.py

+@pytest.fixture(scope='session', autouse=True)
+def check_bug_reports(host):
+    """
+    That could be the sign of interrupted tests and some cleanup might be
+    required. We don't want to accumulate too much of these files.
+    """


The docstring should start with a one-line summary explaining what the fixture does.

guzu requested review from dinhngtu and ydirson November 10, 2025 14:28

Upgrade pyright to 1.1.407

dde7220

Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>

guzu force-pushed the eva/log-test-failure branch from 2d37684 to 28d7eed Compare November 10, 2025 14:32

dinhngtu requested changes Nov 12, 2025

guzu force-pushed the eva/log-test-failure branch 2 times, most recently from 6da69d7 to 550cd61 Compare November 12, 2025 17:02

guzu force-pushed the eva/log-test-failure branch from 550cd61 to 6216473 Compare November 13, 2025 17:14

Collect logs on test failure

5844e40

Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>

guzu force-pushed the eva/log-test-failure branch from 6216473 to 1ae1d60 Compare November 19, 2025 18:49

guzu mentioned this pull request Nov 20, 2025

Add support for VM serial/UART console logging #364

Merged

stormi requested a review from a team November 20, 2025 16:11

glehmann reviewed Nov 25, 2025

guzu force-pushed the eva/log-test-failure branch from 1ae1d60 to 0257aef Compare November 25, 2025 14:13

guzu added 2 commits November 25, 2025 15:16

lib: check presence of too much xen-bugtool reports

ec04550

Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>

lib: use option to put ssh in background

f178db0

And not solely rely on subprocess and early quit. Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>

guzu force-pushed the eva/log-test-failure branch from 0257aef to 911bee5 Compare November 25, 2025 14:17

guzu added 4 commits November 27, 2025 09:48

Capture VM console for some tests

c19542b

A screenshot of the VM screen will be saved for failed tests, if marked with pytest.mark.capture_console. This is based on the new script capture-console.py that connects to XO-lite for capture. Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>

tests: ask for capture of windows guest tools on failure

c46efae

Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>

tests: test log collection on failure

5168437

Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>

scripts: add log extraction script for investigation

7273e22

Extract from syslog type logs based on epoch timestamps. Signed-off-by: Emmanuel Varagnat <emmanuel.varagnat@vates.tech>

guzu force-pushed the eva/log-test-failure branch from 911bee5 to 8ebf7af Compare November 27, 2025 08:48

ydirson requested changes Nov 27, 2025

