Member

@bwplotka bwplotka commented Sep 9, 2025

As discussed in various places (e.g. prometheus/prometheus#17036 (comment) and the delta WG), we decided to create a formal proposal on what native CT storage in Prometheus could look like and how to make it useful for (i.e. unblock) delta temporality.

@bwplotka bwplotka changed the title proposal[PROM-60]: Prometheus CT Storage proposal: Prometheus CT Storage Sep 9, 2025
@bwplotka bwplotka force-pushed the ctstorage branch 4 times, most recently from 4f9eb07 to 03c37e3 Compare September 10, 2025 14:04
@bwplotka bwplotka changed the title proposal: Prometheus CT Storage proposal: Native TSDB Support for Cumulative CT (and Delta (ST) on the way) Sep 10, 2025
@bwplotka bwplotka changed the title proposal: Native TSDB Support for Cumulative CT (and Delta (ST) on the way) proposal: Native TSDB Support for Cumulative CT (and Delta ST on the way) Sep 10, 2025
@bwplotka bwplotka changed the title proposal: Native TSDB Support for Cumulative CT (and Delta ST on the way) proposal: TSDB Support for Cumulative CT (and Delta ST on the way) Sep 10, 2025
@bwplotka bwplotka force-pushed the ctstorage branch 4 times, most recently from 033d077 to 704dee5 Compare September 11, 2025 13:23
@bwplotka bwplotka marked this pull request as ready for review September 11, 2025 13:23
Member Author

bwplotka commented Sep 12, 2025

FYI: We met for 1h with the delta WG (@ArthurSens @carrieedwards @fionaliao @ywwg) for an initial discussion of the decisions in this proposal. Thanks for the productive time!

Here are some notes:

  • Bartek introducing proposal details.
  • Fiona adding more context on reset hints (see proposal: TSDB Support for Cumulative CT (and Delta ST on the way) #60 (comment)).
  • Fiona sharing suggestions for delta being a "mini-cumulative" story.
  • General alignment on technical decisions.
  • Fiona suggested we double-check whether the start/create-to-end time interval is inclusive (left as a TODO; Otel is inclusive on the end time only).
  • Fiona asked about differences between gauge and counter CT storage, especially given that gauges in some systems can have a CT.
    • Bartek (now): There are none for now, mentioned that in general decisions.
  • On TSDB read (programmatic) interfaces:
    • Owen: Should we discuss failover algorithms here, i.e. what can you do? Don't take on too many invariants/assumptions; leave room for flexibility, e.g. a lack of CTs.
    • Fiona: Is it worth adding any assumptions around CT semantics (e.g. on append)?
    • Owen: We should document our assumptions and evolve with reality.
    • Arthur: Fiona has a detailed proposal.
    • Bartek: So far we haven't put ANY requirements on CT on write.
    • Bartek (now): I added a related section, # Proposed CT semantics and validation -- @ywwg could you help me explore what those restrictions could look like? And what if we use SHOULD or MUST for those?
  • Arthur noted that the delta feature is a "SHOULD" goal for the CT proposal; he aligned expectations around Grafana's interest in delivering delta support.
    • Bartek: Ack. Happy to move to MUST if it helps. I left it as SHOULD to stay open-minded about extreme cases where the solutions for cumulative CT and delta ST are better off being entirely different; it would be silly to push in a single inefficient direction in this proposal. I don't see this being the case now, though.
  • Outlining pros & cons for CT -> ST renaming alternative:
    • Fiona/Ar/B: Just stick with one naming
    • Bartek/Fiona: No strong opinion at this point
    • Arthur: I'd vote for CT
    • Owen: There are more future users than previous users, I'd vote for changing to ST.
    • Bartek (now): I added one more argument to keep CT -- CT or ST naming is equally correct/incorrect in this context - Prometheus is a cumulative-first system, so choosing CT might be fair.

Also updated the proposal today with some learnings. Finally, proposed a single feature flag for this work (ct-storage).

Still lots of TODOs and anyone is welcome to help!

Signed-off-by: bwplotka <bwplotka@gmail.com>
Member

@ywwg ywwg left a comment

Thank you for this proposal! I added a bunch of comments, some of which are answered by the paragraph right after the comment 😅. I think my main concern is nailing down the Goals section. This is not at all to question whether we should do the work, just that I think our statement of intent needs to be unequivocal.


> TL;DR: We propose to extend Prometheus TSDB storage sample definition to include an extra int64 that will represent the cumulative created timestamp (CT) and, for the future delta temporality ([PROM-48](https://github.com/prometheus/proposals/pull/48)), a delta start timestamp (ST).
> We propose introducing persisting CT logic behind a single flag `ct-storage`.
> Once implemented, wee propose to deprecate the `created-timestamps-zero-injection` experimental feature.
Member

Suggested change
> Once implemented, wee propose to deprecate the `created-timestamps-zero-injection` experimental feature.
> Once implemented, we propose to deprecate the `created-timestamps-zero-injection` experimental feature.


See the details, motivations and discussions about the delta temporality in [PROM-48](https://github.com/prometheus/proposals/pull/48).

The core TL;DR relevant for this proposal is that the delta temporality counter sample can be conceptually seen as a "mini-cumulative counter". Essentially delta is a single-sample (value) cumulative counter for a period between (sometimes inclusive sometimes exclusive depending on a system) start(ST)/create(CT) timestamp and a (end)timestamp (inclusive).
Member

this is the first time Start Timestamp is mentioned, and for people reading linearly it will kind of pop up out of nowhere. Let's move the next section, or at least parts of it, up higher and get the terminology questions out of the way as soon as possible. And then later on we can clarify the subtleties.

Member

Maybe the way to do that is to acknowledge Otel explicitly, and briefly summarize how they define a delta sample


The core TL;DR relevant for this proposal is that the delta temporality counter sample can be conceptually seen as a "mini-cumulative counter". Essentially delta is a single-sample (value) cumulative counter for a period between (sometimes inclusive sometimes exclusive depending on a system) start(ST)/create(CT) timestamp and a (end)timestamp (inclusive).

In other words, instant query for `increase(<cumulative counter>[5m])` produces a single delta sample for a `[t-5m, t]` period (V: `increase(<counter>[5m])`, CT/ST: `now()-5m`, T: `now()`).
Member

I am not sure if it's just me, but I am confused by the idea of producing a delta sample. To me, delta samples are things that are written to storage and not produced by promQL statements. For me, PromQL queries only ever produce instant/range scalar/vector results. And an increase() function would operate on one or more delta samples in the desired range, producing an "instant scalar" result, not a "delta sample."
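
For readers following this thread, here is a minimal Python sketch (not Prometheus code) of the (V, CT/ST, T) triple that the quoted proposal text describes for a window over a cumulative counter. The `Sample` type and the `delta_for_window` helper are hypothetical names introduced only for illustration. Unlike PromQL's `increase()`, this sketch does no extrapolation to the window edges; it only sums raw differences and assumes that a drop in value means the counter reset to zero.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sample:
    """Hypothetical sample shape: value plus start and end timestamps in ms."""
    value: float
    ct_ms: int  # CT/ST: start of the interval the value covers
    t_ms: int   # T: end of the interval (the sample timestamp)


def delta_for_window(cumulative: List[Sample], start_ms: int, end_ms: int) -> Optional[Sample]:
    """Collapse cumulative samples with start_ms < T <= end_ms into one triple."""
    window = [s for s in cumulative if start_ms < s.t_ms <= end_ms]
    if len(window) < 2:
        return None  # not enough points to compute an increase
    total = 0.0
    for prev, cur in zip(window, window[1:]):
        if cur.value >= prev.value:
            total += cur.value - prev.value
        else:
            total += cur.value  # value dropped: assume a reset to zero
    return Sample(value=total, ct_ms=start_ms, t_ms=end_ms)


series = [
    Sample(10.0, 0, 60_000),
    Sample(25.0, 0, 120_000),
    Sample(5.0, 150_000, 180_000),  # new CT: the counter restarted
]
delta = delta_for_window(series, start_ms=0, end_ms=300_000)
```

Whether such a triple should be called a "delta sample" or just an instant query result is exactly the terminology question raised in the comment above; the sketch only shows the arithmetic.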

* Average: in the order of ~weeks/months for stable workloads, ~days/weeks for more dynamic environments (Kubernetes).
* Best case: it never changes (infinite count) e.g days_since_X_total.
* Worst case: it changes for every sample.
* For the delta we expect CT to change for every sample.
Member

All the definitions I can find in the otel docs say that start time is "strongly encouraged" but not required. (ref: https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/metrics/v1/metrics.proto#L163-L186)

So I think we need to loosen this statement just a bit to consider the cases where CT does not change for every sample. Perhaps: "For the delta we expect CT to change for every sample but can make a best-effort attempt to adapt to situations where the start time is missing."

Contributor

The only reason why start timestamp wasn't made a hard requirement was for Prometheus compatibility. All SDKs are required to set the start timestamp. Deltas always set the start timestamp as well, to my knowledge.

Expectation that delta start time changes each sample is documented here: https://github.com/open-telemetry/opentelemetry-specification/blob/1e04e1be0e17cae6d01c862049bbeb298e0ffa06/specification/metrics/data-model.md?plain=1#L421

> When the aggregation temporality is "delta", we expect to have
> no overlap in time windows for metric streams.
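
The "no overlap" expectation can be sketched as a simple check. This is an illustrative Python sketch, not code from Prometheus or OpenTelemetry: the tuple layout and the `overlapping_windows` helper are assumptions, and windows are treated as half-open `(CT, T]`, as the proposal text under review in this conversation describes them. It does not handle samples with a missing start time.

```python
from typing import List, Tuple

# Hypothetical layout: (ct_ms, t_ms, value); the window is half-open: (ct_ms, t_ms].
DeltaSample = Tuple[int, int, float]


def overlapping_windows(samples: List[DeltaSample]) -> List[int]:
    """Return indices of samples whose window starts before the previous one ended."""
    bad = []
    # Compare neighbours in end-timestamp order, keeping the original indices.
    ordered = sorted(range(len(samples)), key=lambda i: samples[i][1])
    for prev_i, cur_i in zip(ordered, ordered[1:]):
        prev_t = samples[prev_i][1]
        cur_ct = samples[cur_i][0]
        if cur_ct < prev_t:  # (ct, t] reaches back into the previous window
            bad.append(cur_i)
    return bad


clean = [(0, 60, 3.0), (60, 120, 1.0), (120, 180, 4.0)]
dirty = [(0, 60, 3.0), (30, 120, 1.0), (120, 180, 4.0)]
```

With half-open windows, a sample whose CT equals the previous sample's T is adjacent rather than overlapping, which is why the comparison is strict.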

Comment on lines +110 to +115
* Otel) More detailed, but still descriptive SHOULD rules only: In OpenTelemetry data model ([temporality section](https://opentelemetry.io/docs/specs/otel/metrics/data-model/#sums:~:text=name%2Dvalue%20pairs.-,A%20time%20window%20(of%20(start%2C%20end%5D)%20time%20for%20which%20the,The%20time%20interval%20is%20inclusive%20of%20the%20end%20time.,-Times%20are%20specified)) CT is generally optional, but strongly recommended. Rules are also soft and we assume they are "SHOULD" because there is section of [handling overlaps](https://opentelemetry.io/docs/specs/otel/metrics/data-model/#overlap). However, it provides more examples, which allow us to capture some specifics:
  * Time intervals are half open `(CT, T]`.
  * Generally, CT SHOULD:
    * `CT[i] < T[i]` (equal means unknown)
  * For a cumulative, CT SHOULD:
    * Be the same across all samples for the same count:
Member

ok yes this addresses my point above
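
The SHOULD rules quoted above can be written down as small checks. The following is a hedged Python sketch: the tuple layout, `ct_status`, and `new_count_starts` are hypothetical names, and treating every CT change as the start of a new count is an assumption of this sketch rather than something the proposal states.

```python
from typing import List, Tuple

Sample = Tuple[int, int, float]  # hypothetical layout: (ct_ms, t_ms, value)


def ct_status(ct_ms: int, t_ms: int) -> str:
    """Classify one sample against the `CT[i] < T[i]` rule."""
    if ct_ms < t_ms:
        return "ok"
    if ct_ms == t_ms:
        return "unknown"  # equal means the start time is unknown
    return "invalid"      # CT after T cannot form a (CT, T] interval


def new_count_starts(samples: List[Sample]) -> List[int]:
    """Indices where CT differs from the previous sample (assumed: a new count begins)."""
    return [i for i in range(1, len(samples)) if samples[i][0] != samples[i - 1][0]]


cumulative = [(0, 15, 1.0), (0, 30, 4.0), (40, 45, 1.0), (40, 60, 2.0)]
statuses = [ct_status(ct, t) for ct, t, _ in cumulative]
```

Here the first two samples share one CT and the last two share another, so the sketch reports a single boundary between two counts.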


Notably, given the persistence of this feature, similar to example storage, if users enabled and then disabled this feature, they might still be able to access their CTs through all already persisted pieces (e.g. WAL).

This feature could be considered to be switched to opt-out, only after it's finished (this proposal is fully implemented) stable, provably adopted and when the previous LTS Prometheus version is compatible with this feature.
Member

Suggested change
This feature could be considered to be switched to opt-out, only after it's finished (this proposal is fully implemented) stable, provably adopted and when the previous LTS Prometheus version is compatible with this feature.
This feature could be considered to be switched to opt-out only after it's stable (i.e., this proposal is fully implemented), provably adopted and when the previous LTS Prometheus version is compatible with this feature.


TODO: Just a draft, to be discussed.
TODO: There are questions around:
* Should we do inclusive vs exclusive intervals?
Member

I do worry about exclusive/inclusive. Will this need to be a flag / config option? Google has a strict opinion but it sounds like other systems do not.

TODO: Just a draft, to be discussed.
TODO: There are questions around:
* Should we do inclusive vs exclusive intervals?
* Given optionality of this feature, can we even reject sample on TSDB Append if CT is invalid? (MUST or SHOULD on interface?)
Member

it might depend on how it's invalid? Would it be possible to accept invalid samples and solve the problem at query time? I have a hunch that people would prefer to have dirty data than have their data rejected. (I am unsure what the general Prometheus philosophy is on this question). As an example, there used to be a hard rule against OOO, but now there are a number of adaptations to support it.
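
One way to picture the MUST versus SHOULD trade-off is an append path with a strictness switch. This is only a Python sketch under stated assumptions: the `append` function, the `strict` parameter, and the `InvalidCT` error are hypothetical and do not correspond to any existing Prometheus TSDB interface.

```python
from typing import List, Tuple

Sample = Tuple[int, int, float]  # hypothetical layout: (ct_ms, t_ms, value)


class InvalidCT(ValueError):
    """Raised in strict mode when a sample's CT is after its timestamp."""


def append(series: List[Sample], sample: Sample, strict: bool) -> bool:
    """Append a sample and report whether its CT looked valid.

    strict=True models a MUST rule (the sample is rejected).
    strict=False models a SHOULD rule (the sample is stored as-is and
    any clean-up is left to query time).
    """
    ct_ms, t_ms, _ = sample
    valid = ct_ms <= t_ms  # CT == T is treated as "unknown", not invalid
    if not valid and strict:
        raise InvalidCT(f"CT {ct_ms} is after T {t_ms}")
    series.append(sample)
    return valid


lenient: List[Sample] = []
ok = append(lenient, (0, 10, 1.0), strict=False)
kept_dirty = append(lenient, (20, 10, 2.0), strict=False)  # stored despite the bad CT
```

In the lenient mode the dirty sample stays in storage, which matches the "dirty data over rejected data" preference described in the comment above; the strict mode raises instead.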


The CT/ST notion was popularized by OpenTelemetry, and early experience exposed a big challenge: CT/ST data is extremely unclean given early adoption, mixed instrumentation support and multiple (all imperfect) ways of ["auto-generation"](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/metricstarttimeprocessor) (`subtract_initial_point` might be the most universally "correct", but it was added only recently). This means that handling (or reducing) CT errors is an important detail for consumers of this data.

TODO: Just a draft, to be discussed.
Member

do we also have to consider changes to the exposition format? I am thinking of the extra cost of doubling the size of the exposition just to include a second timestamp, vs adding a new term to the exposition line in a way that old systems won't be able to read.

* Changing names often creates confusion
* Having a slightly changed name already makes it clear that we talk about Prometheus semantics of the same thing

**Rejected** due to not enough arguments for renaming (feel free to challenge this!).
Member

Oh, I thought we had more interest in renaming than this. Maybe an informal poll on Slack is sufficient to get an accurate temperature read?

* This unlocks the most benefits (e.g. also delta) for the same amount of work, and it makes the code simpler.
* We don't know if we need a special cumulative best-case optimization (yet); it would also apply only to some "best" cases. Once we know, we can always add those optimizations.

2. Similarly, we propose to not have special CT storage cases per metric types. TSDB storage is not metric type aware, plus some systems allow optional CTs on gauges (e.g. OpenTelemetry). We propose keeping that storage flexibility.
Contributor

Yeah, the start time on gauges is very confusing, but preserving it is probably the right thing to do.

@dashpole
Contributor

This makes a lot of sense to me. I think performance/benchmarks are probably the biggest potential blocker.
