Skip to content

Conversation

@def-
Copy link
Contributor

@def- def- commented Nov 25, 2025

Fixes: https://github.com/MaterializeInc/database-issues/issues/9929

Test run: https://buildkite.com/materialize/nightly/builds/14267

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@def- def- force-pushed the pr-orchestratord-timing-0dt branch 3 times, most recently from f887e54 to b036c0d Compare November 25, 2025 18:39
@def-
Copy link
Contributor Author

def- commented Nov 25, 2025

@jubrad Does this make sense as the way to measure downtime? I'm counting the time to run kubectl port-forward, create a new SQL connection and run a query. It's all pretty fast since it's local anyway. https://github.com/MaterializeInc/materialize/pull/34286/files#diff-0ef29e4d4b1771d6d4b7d91d28b9e9cf9b2d0cb37fef898377dbc3792aff9ab9R1526

Edit: The initial connection took 2.2 s, after the 0dt upgrade 9.6 s. This is with no objects created at all.

@def- def- force-pushed the pr-orchestratord-timing-0dt branch from b036c0d to c2558f8 Compare November 25, 2025 18:58
@def- def- marked this pull request as ready for review November 25, 2025 18:58
@def- def- requested a review from a team as a code owner November 25, 2025 18:58
@def- def- marked this pull request as draft November 25, 2025 18:59
@def- def- force-pushed the pr-orchestratord-timing-0dt branch 2 times, most recently from febf00e to 86d9bce Compare November 27, 2025 10:08
@def- def- marked this pull request as ready for review November 27, 2025 10:20
@def- def- force-pushed the pr-orchestratord-timing-0dt branch from 86d9bce to 73b0c42 Compare November 27, 2025 10:24
@def-
Copy link
Contributor Author

def- commented Nov 27, 2025

All green in CI, ready for review!

@def- def- requested review from ggevay and teskje December 1, 2025 12:36
@def-
Copy link
Contributor Author

def- commented Dec 1, 2025

Ready for review!

Copy link
Contributor

@jubrad jubrad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm debating whether we should add a standard set of items to the catalog to ensure that catalog migration doesn't explode the time in a more realistic scenario. - we can follow up with this later as well.

preexec_fn=os.setpgrp,
)
connect_port_forward = False
time.sleep(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we sleeping 1 here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't want to retry too quickly, but after I removed the print below, I could also remove the sleep again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

"jsonpath={.items[0].metadata.name}",
]
).strip()
port_forward_process = subprocess.Popen(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The port forward connection can drop... it might be worth spinning up a task to maintain and reopen the portforward if it closes, or use a loop in your subprocess while true; do...<port forward>; done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the SQL connection fails, we reopen it.

if runtime > 2:
print(f"Downtime: {runtime}s")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not always print?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So spammy. Only the ones > 2 s looked interesting, otherwise it's pretty much instant.

@jubrad
Copy link
Contributor

jubrad commented Dec 1, 2025

Does this make sense as the way to measure downtime? I'm counting the time to run kubectl port-forward, create a new SQL connection and run a query. It's all pretty fast since it's local anyway. https://github.com/MaterializeInc/materialize/pull/34286/files#diff-0ef29e4d4b1771d6d4b7d91d28b9e9cf9b2d0cb37fef898377dbc3792aff9ab9R1526

Yeah, I think this is a pretty reasonable approach.

@def-
Copy link
Contributor Author

def- commented Dec 1, 2025

I'm debating whether we should add a standard set of items to the catalog to ensure that catalog migration doesn't explode the time in a more realistic scenario. - we can follow up with this later as well.

Sure, we can do that, but then it starts overlapping more with what we already have in test/0dt/mzcompose.py's workflow_*rehydration.

@def- def- force-pushed the pr-orchestratord-timing-0dt branch from 73b0c42 to f6495ec Compare December 1, 2025 17:42
@def- def- requested a review from jubrad December 1, 2025 21:15
@def- def- merged commit 88c0cb6 into MaterializeInc:main Dec 2, 2025
133 checks passed
@def- def- deleted the pr-orchestratord-timing-0dt branch December 2, 2025 22:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants