
[DPE-8588] Automated tutorial testing#487

Open
izmalk wants to merge 61 commits into canonical:main from izmalk:spread-test-3

Conversation

Contributor

@izmalk izmalk commented Apr 14, 2026

Description

Created an automated solution for the tutorial end-to-end tests using Spread + Multipass.

  1. The extract_commands.py script parses the existing tutorial content and extracts the necessary commands in the correct order.
  2. The tutorial files are modified with special comment blocks, invisible to users, that adjust the flow for an automated run when needed.
  3. Spread tasks are also generated from the tutorial files.
  4. A spread.yaml configuration allows running the tests either locally (tox -e tutorial) or via CI/CD.

The tests run in a Multipass VM.
Estimated time from start to finish is ~1.5–2 hours.
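As an illustration of step 1, the core of the extraction can be sketched like this (a hypothetical minimal version, not the actual extract_commands.py, which also processes the test:* comment blocks):

```shell
# Hypothetical sketch of the extraction step: print the contents of every
# shell-fenced code block in a Markdown file, in document order.
extract_shell_blocks() {
    awk '
        /^```shell[[:space:]]*$/ { in_block = 1; next }  # opening fence of an executable block
        /^```[[:space:]]*$/      { in_block = 0; next }  # closing fence
        in_block                 { print }               # emit the commands as-is
    ' "$1"
}
```

A page containing one shell-fenced block and one text-fenced block would yield only the commands from the shell-fenced block.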

The run finished with output like this:

2026-04-14 11:17:17 Allocation results of multipass:ubuntu-24.04-64: map[string]string{}
2026-04-14 11:17:17 Worker terminated.
2026-04-14 11:17:17 Successful tasks: 8
2026-04-14 11:17:17 Aborted tasks: 0
  tutorial: OK (5253.81=setup[0.03]+cmd[0.02,0.03,0.02,0.02,0.02,0.03,0.02,0.02,5253.59] seconds)
  congratulations :) (5253.86 seconds)

https://warthogs.atlassian.net/browse/DPE-8588

I've added TESTING.md describing the solution and agents.md to improve AI-agent reliability, and updated the README.

Checklist

  • I have added or updated any relevant documentation.
  • I have cleaned any remaining cloud resources from my accounts.

@izmalk izmalk self-assigned this Apr 14, 2026
@izmalk izmalk added the documentation (Improvements or additions to documentation) and enhancement (New feature, UI change, or workload upgrade) labels Apr 14, 2026
@izmalk izmalk mentioned this pull request Apr 14, 2026
@izmalk izmalk changed the title Automated tutorial testing [DPE-8588] Automated tutorial testing Apr 14, 2026
Comment thread tests/tutorial/helpers.sh
Comment on lines +35 to +103
wait_idle() {
    local timeout=600
    local interval=30
    local allow_blocked=""

    # Parse named options, consuming two tokens per flag (name + value).
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --timeout) timeout="$2"; shift 2 ;;
            --interval) interval="$2"; shift 2 ;;
            --allow-blocked) allow_blocked="$2"; shift 2 ;;
            *) echo "wait_idle: unknown option: $1" >&2; return 1 ;;
        esac
    done

    local elapsed=0
    echo "Waiting for all Juju units to be active/idle (timeout=${timeout}s, poll=${interval}s)…"

    while [[ "$elapsed" -lt "$timeout" ]]; do
        local not_ready
        # Run the poll pipeline with pipefail disabled so a non-zero exit from
        # "juju status" (common while machines are still provisioning) does not
        # abort a calling script that has set -euo pipefail active.
        not_ready=$(
            set +o pipefail
            export ALLOW_BLOCKED="$allow_blocked"
            juju status --format=json 2>/dev/null | python3 -c '
import json, sys, os
try:
    data = json.load(sys.stdin)
    allowed = set(os.environ.get("ALLOW_BLOCKED", "").split(",")) - {""}
    not_ready = 0
    total_units = 0
    for app_name, app in data.get("applications", {}).items():
        for unit in app.get("units", {}).values():
            total_units += 1
            ws = unit.get("workload-status", {}).get("current", "")
            js = unit.get("juju-status", {}).get("current", "")
            if ws == "active" and js == "idle":
                continue
            if ws == "blocked" and js == "idle" and app_name in allowed:
                continue
            not_ready += 1
    if total_units == 0:
        print("provisioning")
    else:
        print(not_ready)
except Exception:
    print("provisioning")
'
        ) || not_ready="provisioning"

        if [[ "$not_ready" == "0" ]]; then
            echo "All units active/idle after ${elapsed}s."
            juju status
            return 0
        elif [[ "$not_ready" == "provisioning" ]]; then
            echo "[${elapsed}s elapsed] still provisioning – rechecking in ${interval}s…"
        else
            echo "[${elapsed}s elapsed] ${not_ready} unit(s) not yet active/idle – rechecking in ${interval}s…"
        fi
        sleep "$interval"
        elapsed=$(( elapsed + interval ))
    done

    echo "Timed out after ${timeout}s. Final status:"
    juju status
    return 1
}
Contributor Author

@izmalk izmalk Apr 14, 2026

This is one of the most important features: the ability to test the Juju model and detect its active/idle state. We rely on it frequently, so its reliability and configuration directly affect both the reliability of the tests and their overall duration.

I experimented with replacing it with juju wait-for, but it was far too unstable. It often failed to detect the active/idle state at all, timed out, and then, when I checked manually, the status was actually already active/idle. It simply did not detect it.

There is probably a simple reason for this: juju wait-for works differently from juju status (see source):

The wait-for command streams delta changes from the underlying database, unlike the status command which performs a full query of the database.

Contributor

I think your observation needs more investigation (and maybe reporting upstream) since we rely on wait-for in almost all our integration tests.

Anyway, as I mentioned in the comments, we could achieve this much more simply in Python, either using jubilant.Juju.wait(), or, if you don't trust that, using a simple retry-and-polling mechanism in Python:

import tenacity

# `juju` here is assumed to be a jubilant.Juju instance
for attempt in tenacity.Retrying(
    wait=tenacity.wait_fixed(10),
    stop=tenacity.stop_after_delay(600),
    reraise=True,
):
    with attempt:
        status = juju.status()
        assert all(
            app.app_status.current == "active" and app.juju_status.current == "idle"
            for app in status.apps.values()
        )

Contributor

I agree. Do we have examples of the instability? It's the same logic we're using in tests so I don't think it's unstable.

Contributor Author

@izmalk izmalk Apr 15, 2026

Here is the log output.

Sorry it's long; wait-for spams model "tutorial" found, waiting... quite often.
Search for:

ERROR timed out waiting for "tutorial" to reach goal state

Basically, juju wait-for model runs until it reaches its timeout. I've tried making the maximum time quite large; it didn't help. And when I added a status check after the wait-for, it showed active/idle. The runs were even slower overall, most likely because each wait ran to its full timeout. So, after many attempts at making it work, I dropped the idea and reverted to the well-working solution for now.

After discussing the jubilant approach with Iman, I feel committed to implementing alternative functions, so that we have all three options and can test and compare them.

Contributor

@imanenami imanenami left a comment

First of all, I really liked the idea of using inline HTML metadata for test automation. Good job on that 👍

Before doing a detailed review, I need some input for my own clarification so that I don't spam the review with generic comments:

  1. While I like the HTML inline comments idea, why are we inventing another mini testing language here? Since jubilant is the current charm engineering standard, I think we can safely use it with a leaner syntax and remove much of the "test compiling" logic in this PR.

  2. With the current implementation, I have a more general concern about raising the bar even higher than it is now for future documentation contributions, especially with a brand-new test DSL (domain-specific language) and inline shell commands now required to pass the docs tests...

Comment thread tests/tutorial/helpers.sh
Contributor

@imanenami imanenami Apr 15, 2026

question: why are we using a bash script here? We could write this in a much more readable and maintainable form in Python 🤔

Contributor

I agree, I don't think that we should be encapsulating this logic here.

Contributor Author

@izmalk izmalk Apr 20, 2026

Summarising the discussion we had: yes, Python looks much nicer and easier to maintain. But our tutorial tells our users to run bash commands, and I'd like to keep the tests as close as possible to the tutorial for now, as that is the objective.

And to address the valid concern about the shell-script implementation being rather ugly and hard to maintain: I have already implemented a jubilant alternative to wait_idle() in the helpers. I'm planning to implement the juju wait-for alternative too, as soon as I can make it actually work =D. You can see the jubilant version in this other branch. But it will be merged a bit later, as, let's say, version 0.2 of the solution, where I will specifically test and compare which implementation works better and why. For now, I'd like to stick with the old-reliable shell script, since it has shown the best results so far.

Comment thread tox.ini
Comment on lines +89 to +101
[testenv:tutorial-extract]
description = Generate tutorial test scripts from Markdown sources
skip_install = True
commands =
python3 tests/tutorial/extract_commands.py docs/tutorial/environment.md tests/tutorial/01_environment.sh
python3 tests/tutorial/extract_commands.py docs/tutorial/deploy.md tests/tutorial/02_deploy.sh
python3 tests/tutorial/extract_commands.py docs/tutorial/integrate-with-client-applications.md tests/tutorial/03_client.sh
python3 tests/tutorial/extract_commands.py docs/tutorial/manage-passwords.md tests/tutorial/04_passwords.sh
python3 tests/tutorial/extract_commands.py docs/tutorial/enable-encryption.md tests/tutorial/05_encryption.sh
python3 tests/tutorial/extract_commands.py docs/tutorial/use-kafka-connect.md tests/tutorial/06_kafka_connect.sh
python3 tests/tutorial/extract_commands.py docs/tutorial/rebalance-partitions.md tests/tutorial/07_rebalance.sh
python3 tests/tutorial/extract_commands.py docs/tutorial/cleanup.md tests/tutorial/08_cleanup.sh

Contributor

todo: I'd highly suggest keeping the tox file lean and moving all this into a separate script.

Contributor

We can probably have a bash one-liner to list files and run them with xargs or something in sorted order
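For illustration, the sorted-list idea could look roughly like this (hypothetical sketch; note that plain alphabetical order would not match the tutorial's reading order, so the source pages would need a sortable prefix for this to be fully correct):

```shell
# Hypothetical replacement for the per-file tox commands: feed every tutorial
# page, in sorted (glob) order, to the extractor with a numbered output name.
generate_tutorial_scripts() {
    local extract=${EXTRACT:-tests/tutorial/extract_commands.py}
    local i=0 md out
    for md in docs/tutorial/*.md; do
        [ -e "$md" ] || continue          # no-op when the glob matches nothing
        i=$((i + 1))
        out=$(printf 'tests/tutorial/%02d_%s.sh' "$i" "$(basename "$md" .md | tr '-' '_')")
        python3 "$extract" "$md" "$out"
    done
}
```

The EXTRACT override is only there to make the sketch testable in isolation.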

Contributor Author

Fixed in 365ade9

Comment thread tox.ini
Comment on lines +107 to +117
commands =
python3 tests/tutorial/extract_commands.py docs/tutorial/environment.md tests/tutorial/01_environment.sh
python3 tests/tutorial/extract_commands.py docs/tutorial/deploy.md tests/tutorial/02_deploy.sh
python3 tests/tutorial/extract_commands.py docs/tutorial/integrate-with-client-applications.md tests/tutorial/03_client.sh
python3 tests/tutorial/extract_commands.py docs/tutorial/manage-passwords.md tests/tutorial/04_passwords.sh
python3 tests/tutorial/extract_commands.py docs/tutorial/enable-encryption.md tests/tutorial/05_encryption.sh
python3 tests/tutorial/extract_commands.py docs/tutorial/use-kafka-connect.md tests/tutorial/06_kafka_connect.sh
python3 tests/tutorial/extract_commands.py docs/tutorial/rebalance-partitions.md tests/tutorial/07_rebalance.sh
python3 tests/tutorial/extract_commands.py docs/tutorial/cleanup.md tests/tutorial/08_cleanup.sh
spread -vv multipass:ubuntu-24.04-64:tests/tutorial/

Contributor

same comment as above

Contributor Author

Fixed in 365ade9

Comment thread docs/agents.md
Comment on lines +48 to +50
Commands are extracted **only** from `` ```shell `` fenced blocks.
Use `` ```bash `` or `` ```text `` for output examples or commands that should
not be executed by the test harness.
Contributor

I think we should only use text for command outputs and not recommend using bash. After all, bash is a form of shell.

Contributor Author

I’m not sure I understand. This tells the agent that shell blocks are extracted, while other block types are not. text and bash are examples: bash is for commands, and text is for output (or any other text that is not syntax-highlighted).

Outside of the extractor script I built, bash and shell code blocks are functionally identical, as far as I know.

Comment thread docs/tutorial/rebalance-partitions.md Outdated
Comment on lines +123 to +144
<!-- test:run
# Cruise Control needs time to collect metrics from the cluster. Rather than
# a fixed sleep, retry the dryrun rebalance until CC reports ready (up to 40
# minutes, polling every 2 minutes).
elapsed=0
timeout=2400
interval=120
echo "Waiting for Cruise Control to be ready (timeout=${timeout}s, poll=${interval}s)…"
while [ "$elapsed" -lt "$timeout" ]; do
    if juju run kraft/leader rebalance mode=add brokerid=103 --wait=2m 2>&1; then
        echo "Cruise Control ready after ${elapsed}s."
        break
    fi
    echo "[${elapsed}s elapsed] Cruise Control not ready – retrying in ${interval}s…"
    sleep "$interval"
    elapsed=$((elapsed + interval))
done
if [ "$elapsed" -ge "$timeout" ]; then
    echo "ERROR: Cruise Control did not become ready within ${timeout}s"
    exit 1
fi
-->
Contributor

todo: This is a no-go, it's far too complex. We need a better way of abstracting this to make contributions more streamlined.

Contributor Author

We had a great discussion about this. I'll be implementing a helpers.sh function to make invoking such commands trivial. The most direct approach (to keep it close to the tutorial instructions) is to call the command every 20 minutes or so, for ~3 attempts. The exact command will stay here; the rest will be automated in the helpers.sh file.
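The helper idea can be sketched as a generic retry wrapper (hypothetical; the name retry_until and its interface are illustrative, not the final helpers.sh API):

```shell
# Hypothetical helpers.sh addition: retry a command until it succeeds or a
# deadline passes, so a tutorial page only needs a one-line annotation such as:
#   retry_until 2400 120 juju run kraft/leader rebalance mode=add brokerid=103
retry_until() {
    local timeout=$1 interval=$2 elapsed=0
    shift 2
    while [ "$elapsed" -lt "$timeout" ]; do
        if "$@"; then
            echo "Succeeded after ${elapsed}s."
            return 0
        fi
        echo "[${elapsed}s elapsed] not ready – retrying in ${interval}s…"
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "ERROR: command did not succeed within ${timeout}s" >&2
    return 1
}
```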

Comment thread docs/tutorial/rebalance-partitions.md Outdated
Comment on lines +313 to +333
<!-- test:run
# CC may need time to recollect metrics after the topology change. Retry the
# dryrun until it succeeds (up to 20 minutes, polling every 2 minutes).
elapsed=0
timeout=1200
interval=120
echo "Waiting for Cruise Control to be ready for full rebalance (timeout=${timeout}s)…"
while [ "$elapsed" -lt "$timeout" ]; do
    if juju run kraft/leader rebalance mode=full --wait=3m 2>&1; then
        echo "Full rebalance dryrun succeeded after ${elapsed}s."
        break
    fi
    echo "[${elapsed}s elapsed] Not ready – retrying in ${interval}s…"
    sleep "$interval"
    elapsed=$((elapsed + interval))
done
if [ "$elapsed" -ge "$timeout" ]; then
    echo "ERROR: Cruise Control full rebalance dryrun did not succeed within ${timeout}s"
    exit 1
fi
-->
Contributor

todo: Same as above.


@imanenami
Contributor

imanenami commented Apr 16, 2026

I had a productive discussion yesterday with @izmalk and it helped clarify some of the underlying decisions made in the PR. Posting my summary here for future reference:

  • Because tutorial tests sit at the top of the testing pyramid, it's justifiable that we use the same tools available to normal users for the tests, i.e. bash commands in this case. Although it'd also be desirable to add options for other flavors of testing. Now that I think about it, magics in IPython (Jupyter) could be a good source of inspiration. For example, if I put %%jubilant in a test annotation, it means I'm using jubilant; same for bash, python, and other usable testing flavors.

  • Vladimir mentioned that some of the complex bash commands in the test blocks of this PR are there for demonstrative purposes. That also alleviates my concern about raising the skills bar for documentation work (writing tutorials).
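The magic-style dispatch could be prototyped roughly like this (purely a sketch of the proposal; the flavor names and dispatch targets are illustrative):

```shell
# Hypothetical dispatcher for test annotations whose first line is an
# IPython-style magic such as "%%bash" or "%%jubilant".
run_test_block() {
    local block=$1 flavor body
    flavor=$(printf '%s\n' "$block" | head -n 1)
    body=$(printf '%s\n' "$block" | tail -n +2)
    case "$flavor" in
        '%%bash')     bash -euo pipefail -c "$body" ;;
        '%%jubilant') python3 -c "$body" ;;  # assumes jubilant is importable
        *)            echo "unknown flavor: $flavor" >&2; return 1 ;;
    esac
}
```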

Contributor

@imanenami imanenami left a comment

I can break this PR into three pieces:

  1. The testing syntax (we can call it the Frontend): This involves the Annotation reference section in the TESTING.md file. This is the part that I care most about at the current stage, and I think it needs to be set right from the ground up, to be reusable across projects. This involves the HTML inline comments and the proposed syntax.
  2. The test compiler (or we can call it the Backend): This involves extract_commands.py, helpers.sh, and the make utility. I'm not going to be nitpicky at this stage; I believe we can improve this in the future once we agree on the testing syntax. Ideally, imo, this should be in pure, type-hinted, quality-checked Python so that it can be maintained by Python developers :)
  3. The implementation of the syntax + compiler for Kafka tutorials: This is already good in my opinion 👍

With that mental model, the only major change I want to see in this PR before approving is the one that Vladimir already mentioned, and I reiterated in my previous comment: the possibility of adding different flavors of test assertions via a syntax that could, for example, be inspired by IPython magics.


Labels

documentation (Improvements or additions to documentation), enhancement (New feature, UI change, or workload upgrade)


3 participants