Skip to content

Create public preview docs for new Kubernetes operator #19942

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 26 commits into
base: main
Choose a base branch
from

Conversation

jhlodin
Copy link
Contributor

@jhlodin jhlodin commented Jul 15, 2025

Publicly document the new CockroachDB operator (prev. "enterprise operator") as a preview.

  • Publish content from the limited preview, adding a preview notice
  • Add rollback instructions to migration guides
  • Add a page describing pod and flag overrides
  • Add introductory content for the CockroachDB operator
  • Formally rename the previous operator to "Public operator"
  • Add references to the CockroachDB operator from content for older Kubernetes deployment methods
  • Update deprecated localityLabels field to the new localityMappings field

TODO:

  • Dupe changes into v25.3 directory

Copy link

github-actions bot commented Jul 15, 2025

Files changed:

Copy link

netlify bot commented Jul 15, 2025

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

Name Link
🔨 Latest commit f38499e
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-interactivetutorials-docs/deploys/6894f622386c6d0008ba98cd

Copy link

netlify bot commented Jul 15, 2025

Deploy Preview for cockroachdb-api-docs canceled.

Name Link
🔨 Latest commit f38499e
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-api-docs/deploys/6894f622569e12000851b0a1

Copy link

netlify bot commented Jul 15, 2025

Netlify Preview

Name Link
🔨 Latest commit f38499e
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-docs/deploys/6894f62233a9890008a7f18e
😎 Deploy Preview https://deploy-preview-19942--cockroachdb-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Copy link
Contributor

@taroface taroface left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite a lot of work -- apologies for taking a while on the review.

I have a lot of suggested edits here, but am blocking for the following:

  • (Obviously, as you mentioned) Still needs the operator overview doc, and the Preview callout.
  • Markdown formatting doesn't match our docs set, as we discussed already.
  • Similarly, some style-related notes on capitalization, use of "we" in tutorials, etc.
  • Related: Code and output snippets should be presented cleanly and consistently with how we use them in the docs (see my notes).
  • In the guidance, some example yaml is missing where it would greatly help clarity. In particular, the instructions in deploy-cockroachdb-with-kubernetes-operator on configuring locality are confusing to me. I think these parts can be tightened up.
  • The two migration docs should more clearly emphasize that they aren't for prod deployments.
  • Ensure the "informal" product terminology for the new vs. old operator is consistent throughout.

Otherwise, I took the time to make suggested rewrites, but some of the copy feels more conversational than is helpful (especially in kubernetes-operator-performance). This isn't a blocking comment per se, and a conversational tone can be nice with such a complicated topic. But at times (in the performance doc, and with the locality stuff I mentioned above), I think it obfuscates the guidance itself. Let me know if you want to discuss. I also replaced passive with active language where I saw it.

@jhlodin jhlodin force-pushed the jl/doc-14248 branch 2 times, most recently from 6e2d630 to 5fec404 Compare August 4, 2025 20:24
@jhlodin jhlodin marked this pull request as ready for review August 4, 2025 20:24
@jhlodin jhlodin force-pushed the jl/doc-14248 branch 3 times, most recently from 938d9a8 to 152ac55 Compare August 5, 2025 02:49
Copy link
Contributor Author

@jhlodin jhlodin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Annotating whose reviews are needed where

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PM question, do we need to start an equivalent release notes page for the CockroachDB operator now? Or should we re-evaluate that for GA once things stabilize?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @pritesh-lahoti I think we were hoping to align our operator/helm releases to crdb releases and so could perhaps have them in the same release notes. Can you please confirm?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had plans to align the release versions, but not necessarily the release timelines themselves.

For example, a new feature getting introduced within CRDB for a particular release version could also be exposed via Helm/Operator in the same release version [best-effort], so while the release versions would be the same, the timeline of these two events could differ.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK. Something to revisit when we make user-facing changes to the operator while it's in public preview.


The Operator updates all nodes and triggers a rolling restart of the pods with the new storage capacity.

To verify that the storage capacity has been updated, run `kubectl get pvc` to view the persistent volume claims (PVCs). It will take a few minutes before the PVCs are completely updated.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PM question - see above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs PM and eng review (this is the introduction page for the new operator):

  • For PM: Any feedback on this intro, would you like to introduce the new operator to readers any differently?
  • For eng: The "Kubernetes terminology" section below was taken from the existing Kubernetes intro. Are there any changes or new fields you would suggest adding to explain concepts to readers? Consider that this is a very high level introduction.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me. We could optionally also introduce service.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBD on service, I'd want to learn more about the other use cases for that field.

@jhlodin jhlodin force-pushed the jl/doc-14248 branch 2 times, most recently from 38eb43e to d708151 Compare August 5, 2025 04:27
@jhlodin jhlodin dismissed taroface’s stale review August 5, 2025 15:01

Ryan is OOO - addressed all of his comments last week

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me. We could optionally also introduce service.

Copy link
Contributor

@mwang1026 mwang1026 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitting review for now but I am not done going thru. Just have some meetings and wanted to get you the comments thus far so you can take action. Thanks!

The {{ site.data.products.cockroachdb-operator }} is a fully-featured [Kubernetes operator](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/) that allows you to deploy and manage CockroachDB self-hosted clusters.

{{site.data.alerts.callout_info}}
The {{ site.data.products.cockroachdb-operator }} is in [Preview]({% link {{ page.version.version }}/cockroachdb-feature-availability.md %}).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to also have the recommendation here re something like "It is the recommended choice for new Kubernetes deployments due to its better support for CockroachDB’s operational best practices."

People on Revenue might be sad but I think we'd all be sadder if they chose the old operator

Copy link
Contributor Author

@jhlodin jhlodin Aug 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's tough to pitch in the docs as long as we have the "preview" tag. Either way though, I did my best attempt at that pitch in this boilerplate that I spread throughout the existing kubernetes docs:

Cockroach Labs recommends that new deployments of CockroachDB on Kubernetes use the CockroachDB operator. To migrate an existing deployment to use the CockroachDB operator, read the Helm and Public operator migration guides.

That boilerplate just isn't in this specific document.

The subject alternative names are based on a release called `my-release` in the `cockroach-ns` namespace. Make sure they match the services created with the release during Helm install.
{{site.data.alerts.end}}

If you wish to supply certificates with [cert-manager](https://cert-manager.io/), set `cockroachdb.tls.certManager.enabled` to `true`, and `cockroachdb.tls.certManager.issuer` to an IssuerRef (as they appear in certificate resources) pointing to a clusterIssuer or issuer that you have set up in the cluster. The following Kubernetes application describes an example issuer:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should cert-manager be its own section? it does in the values.yaml so we probably want to outline those steps separately from external certs? I'm a bit out of my depth here tbh and could use eng help @pritesh-lahoti

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to move TLS config to a totally separate section or page, it's a lot of info to try and digest in the middle of a numbered list of instructions. Can tackle that later based on customer feedback if they find it confusing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, makes sense.

cockroach start --locality region=us-central1,zone=us-central1-c,dc=dc2
~~~

Optionally, review the `cockroachdb.crdbCluster.topologySpreadConstraints` configuration and set `topologyKey` to a locality variable that will have distinct values for each node. By default the lowest locality level is `zone`, so the following configuration sets that value as the `topologyKey`:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does locality variable mean in this case? Is it the nodeLabel or the localityLabel? I think it's nodeLabel right? This is currently ambiguous

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, we could explain here too why it's important to spread across zones for maximum ersilience from CockroachDB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified to "...set topologyKey to the nodeLabel value of a locality level that has..." which is a mouthful but technically accurate.

Thinking about if there's a good landing page to send readers to for locality best practice. Adding a link to https://www.cockroachlabs.com/docs/stable/topology-patterns.html but not sure if that's helpful.

Copy link
Contributor

@mwang1026 mwang1026 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some changes requested. Thanks again for your help!

The Helm chart creates a public Service that exposes both SQL and gRPC connections over a single power. However, the operator uses a different port for gRPC communication. To ensure compatibility, update the public Service to reflect the correct gRPC port used by the operator.
Apply the updated Service manifest:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pritesh-lahoti what happens in the meantime? if someone tries to e.g. call a CLI command that hits RPCs will they fail?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mwang1026 Yes, the gRPC will have downtime until the public service is applied with new config (only if somebody is using public service to call the grpc endpoint).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This got me thinking and I have decided to do this exploration.

Originally Statefulsets runs gRPC and sql on 26257 port. Now, with the operator gRPC will be moved to port 26258. Yes, there will be downtime after all the nodes are migrated and until the service is patched.

After migrating half of replica to the cockroachdb operator, we can patch the service to use the new gRPC port. Now there are replicas to serve on the new port, the gRPC commands works perfectly fine. With this approach there will not be any downtime for gRPC also.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will still depend on what pod the command is routed to though right? Regardless I think we should call this out as a temporary limitation. The cli commands I know off the top of my head that call RPCs are cockroach node drain and cockroach node decommission

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a warning to the helm migration guide about there being a limited-scope downtime specifically for commands that hit RPCs - https://github.com/cockroachdb/docs/pull/19942/files#diff-b3d91dedb7197a1f4d3b45de3839ee0b45f8146a3837981d1e1c3700ec318cbfR28


For more context on how these rules work, see the [Kubernetes documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/). The [custom resource definition](https://github.com/cockroachdb/helm-charts/blob/master/cockroachdb-parent/charts/cockroachdb/values.yaml) details the fields supported by the operator.

### Add a pod affinity or anti-affinity
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pritesh-lahoti re: the node anti-affinity that we have hardcoded to allow only one pod per k8s node, is that the only anti-affinity that can't be altered via the helm chart? I'm having a hard time keeping up with these changes 😓 (good problem to have). WDYT about mentioning something about that here if it is the case? And then pointing to podtemplate as needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The operator hard-codes the pod anti-affinity (and not the node anti-affinity). Right, it makes sense to call it out here.
All of it can be overridden by the corresponding affinity policies specified in the podTemplate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


You should provision an appropriate amount of disk storage for your workload. For recommendations on this, see the [Production Checklist]({% link {{ page.version.version }}/recommended-production-settings.md %}#storage).

### Expand disk size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pritesh-lahoti this only works with disks that can dynamically scale up right? so like EBS and such? I think it'd be good to make that clear here and what behavior would be if you didn't have EBS like disks (and that we don't recommend doing that and that they should horizontally scale...?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


You can use the operator to configure the CockroachDB logging system. This allows you to output logs to [configurable log sinks]({% link {{ page.version.version }}/configure-logs.md %}#configure-log-sinks) such as file or network logging destinations.

The logging configuration is defined in a [ConfigMap](https://kubernetes.io/docs/concepts/configuration/configmap/) object, using a key named `logs.yaml`. For example:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is logs.yaml the same format as the standard logs.yaml file that cockroach grabs for logging configs? If so maybe we link to docs about the logs configuration via yaml?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that link in line 238 goes to https://www.cockroachlabs.com/docs/stable/configure-logs.html#configure-log-sinks

I'm on board with adding that link to the values.yaml as a comment.

@jhlodin jhlodin force-pushed the jl/doc-14248 branch 2 times, most recently from e0559fc to e05da7b Compare August 7, 2025 15:31
@jhlodin jhlodin requested a review from mwang1026 August 7, 2025 17:50
Copy link
Contributor

@florence-crl florence-crl left a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants