proposal: TSDB Support for Cumulative CT (and Delta ST on the way) #60

bwplotka · 2025-09-09T11:26:26Z

As discussed in various places (e.g. prometheus/prometheus#17036 (comment) and the delta WG), we decided to create a formal proposal on what native CT storage in Prometheus could look like and how to make it useful for (i.e. unblock) delta temporality.

proposals/0060-ct-storage.md

bwplotka · 2025-09-12T04:34:25Z

FYI: We met for 1h with the delta WG (@ArthurSens @carrieedwards @fionaliao @ywwg) for an initial discussion around this proposal's decisions. Thanks for the productive time!

Here are some notes:

Bartek introducing proposal details.
Fiona adding more context on reset hints proposal: TSDB Support for Cumulative CT (and Delta ST on the way) #60 (comment)
Fiona sharing suggestions for delta being a "mini-cumulative" story.
General alignment on technical decisions.
Fiona suggested we double-check whether the start/create-to-end time interval is inclusive (left a TODO; OTel is inclusive on end time only).
Fiona asked about gauge vs counter CT storage differences, especially given that gauges in some systems can have a CT.
- Bartek (now): There are none for now, mentioned that in general decisions.
On TSDB read (programmatic) interfaces:
- Owen: Should we discuss failover algorithms here, i.e. what can you do? Don't take on too many invariants/assumptions; leave room for flexibility, e.g. lack of CTs.
- Fiona: Is it worth adding any assumptions around CT semantics (e.g. on append)
- Owen: We should document our assumption, and evolve with reality
- Arthur: Fiona's proposal has a lot of details.
- Bartek: So far we didn't put ANY requirements on CT on write
- Bartek (now): I added a related section, # Proposed CT semantics and validation -- @ywwg could you help me explore what those restrictions could look like? And what if we use SHOULD or MUST for those?
Arthur noticed the delta feature is a "SHOULD" goal for the CT proposal; he aligned expectations around Grafana's interest in delivering delta support.
- Bartek: Ack. Happy to move to MUST if it helps. I left SHOULD to stay open-minded for extreme cases where the solutions for cumulative CT and delta ST are better off being entirely different; it would be silly to push in a single inefficient direction in this proposal. I don't see this being the case now, though.
Outlining pros & cons for CT -> ST renaming alternative:
- Fiona/Ar/B: Just stick with one naming
- Bartek/Fiona: No strong opinion at this point
- Arthur: I'd vote for CT
- Owen: There are more future users than previous users, I'd vote for changing to ST.
- Bartek (now): I added one more argument to keep CT -- CT or ST naming is equally correct/incorrect in this context. Prometheus is a cumulative-first system, so choosing CT might be fair.

Also updated proposal today with some learnings. Finally proposed a single feature flag for this work (ct-storage).

Still lots of TODOs and anyone is welcome to help!

Signed-off-by: bwplotka <bwplotka@gmail.com>

ywwg

Thank you for this proposal! I added a bunch of comments, some of which are answered by the paragraph right after the comment 😅 . I think my main concern is nailing down the Goals section. This is not at all to question whether we should do the work, just that I think our statement of intent needs to be unequivocal.

ywwg · 2025-09-12T15:11:03Z

proposals/0060-ct-storage.md

+
+> TL;DR: We propose to extend Prometheus TSDB storage sample definition to include an extra int64 that will represent the cumulative created timestamp (CT) and, for the future delta temporality ([PROM-48](https://github.com/prometheus/proposals/pull/48)), a delta start timestamp (ST).
+> We propose introducing persisting CT logic behind a single flag `ct-storage`.
+> Once implemented, wee propose to deprecate the `created-timestamps-zero-injection` experimental feature.


Suggested change

> Once implemented, wee propose to deprecate the `created-timestamps-zero-injection` experimental feature.

> Once implemented, we propose to deprecate the `created-timestamps-zero-injection` experimental feature.
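
For readers skimming the thread, here is a minimal sketch of what "a sample with an extra int64" could mean in practice. The `Sample` type, its field names, and the convention that a zero `CT` means "unknown" are all assumptions made for illustration; none of them are taken from the proposal or from the Prometheus code base.

```go
package main

import "fmt"

// Sample is a hypothetical illustration of a TSDB sample extended with one
// extra int64. It is NOT the actual Prometheus type.
type Sample struct {
	CT int64   // created (cumulative) or start (delta) timestamp in ms; 0 = unknown (assumption)
	T  int64   // sample timestamp in ms
	V  float64 // sample value
}

// HasCT reports whether the sample carries a known created/start timestamp.
func (s Sample) HasCT() bool { return s.CT != 0 }

func main() {
	s := Sample{CT: 1_000, T: 61_000, V: 42}
	// Print whether CT is known and the length of the covered period in ms.
	fmt.Println(s.HasCT(), s.T-s.CT)
}
```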

ywwg · 2025-09-12T15:16:24Z

proposals/0060-ct-storage.md

+
+See the details, motivations and discussions about the delta temporality in [PROM-48](https://github.com/prometheus/proposals/pull/48).
+
+The core TL;DR relevant for this proposal is that the delta temporality counter sample can be conceptually seen as a "mini-cumulative counter". Essentially delta is a single-sample (value) cumulative counter for a period between (sometimes inclusive sometimes exclusive depending on a system) start(ST)/create(CT) timestamp and a (end)timestamp (inclusive).


this is the first time Start Timestamp is mentioned, and for people reading linearly it will kind of pop up out of nowhere. Let's move the next section, or at least parts of it, up higher and get the terminology questions out of the way as soon as possible. And then later on we can clarify the subtleties.

Maybe the way to do that is to acknowledge Otel explicitly, and briefly summarize how they define a delta sample

ywwg · 2025-09-12T15:19:52Z

proposals/0060-ct-storage.md

+
+The core TL;DR relevant for this proposal is that the delta temporality counter sample can be conceptually seen as a "mini-cumulative counter". Essentially delta is a single-sample (value) cumulative counter for a period between (sometimes inclusive sometimes exclusive depending on a system) start(ST)/create(CT) timestamp and a (end)timestamp (inclusive).
+
+In other words, instant query for `increase(<cumulative counter>[5m])` produces a single delta sample for a `[t-5m, t]` period (V: `increase(<counter>[5m])`, CT/ST: `now()-5m`, T: `now()`).


I am not sure if it's just me, but I am confused by the idea of producing a delta sample. To me, delta samples are things that are written to storage and not produced by PromQL statements. For me, PromQL queries only ever produce instant/range scalar/vector results. And an increase() function would operate on one or more delta samples in the desired range, producing an "instant scalar" result, not a "delta sample."
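
To make the "mini-cumulative" framing from the quoted text concrete, the sketch below collapses a window of cumulative samples into one value with a start and an end timestamp. It is a simplification under stated assumptions: the `Sample` type and `toDelta` helper are hypothetical, counter resets are ignored, and no extrapolation is done, so it does not reproduce what PromQL's `increase()` actually computes.

```go
package main

import "fmt"

// Sample is a hypothetical (CT, T, V) triple used only for this illustration.
type Sample struct {
	CT int64 // created/start timestamp in ms
	T  int64 // sample (end) timestamp in ms
	V  float64
}

// toDelta collapses cumulative samples (sorted by T, no counter resets
// assumed) into a single delta-style sample whose window starts at the first
// sample's timestamp and ends at the last sample's timestamp.
func toDelta(cum []Sample) (Sample, bool) {
	if len(cum) < 2 {
		return Sample{}, false // need at least two points to form a difference
	}
	first, last := cum[0], cum[len(cum)-1]
	return Sample{CT: first.T, T: last.T, V: last.V - first.V}, true
}

func main() {
	cum := []Sample{{T: 0, V: 10}, {T: 60_000, V: 25}, {T: 300_000, V: 40}}
	if d, ok := toDelta(cum); ok {
		fmt.Printf("CT/ST=%d T=%d V=%g\n", d.CT, d.T, d.V)
	}
}
```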

ywwg · 2025-09-12T15:29:10Z

proposals/0060-ct-storage.md

+  * Average: in the order of ~weeks/months for stable workloads, ~days/weeks for more dynamic environments (Kubernetes).
+  * Best case: it never changes (infinite count) e.g days_since_X_total.
+  * Worse case: it changes for every sample.
+* For the delta we expect CT to change for every sample.


All the definitions I can find in the otel docs say that start time is "strongly encouraged" but not required. (ref: https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/metrics/v1/metrics.proto#L163-L186)

So I think we need to loosen this statement just a bit to consider the cases where CT does not change for every sample. Perhaps: "For the delta we expect CT to change for every sample but can make a best-effort attempt to adapt to situations where the start time is missing."

The only reason why the start timestamp wasn't made a hard requirement was for Prometheus compatibility. All SDKs are required to set the start timestamp. Deltas always set the start timestamp as well, to my knowledge.

Expectation that delta start time changes each sample is documented here: https://github.com/open-telemetry/opentelemetry-specification/blob/1e04e1be0e17cae6d01c862049bbeb298e0ffa06/specification/metrics/data-model.md?plain=1#L421

> When the aggregation temporality is "delta", we expect to have no overlap in time windows for metric streams.
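
As a rough illustration of the "no overlap" expectation quoted above, the sketch below flags delta samples whose window starts before the previous window ended. The `Delta` type and `overlapping` helper are hypothetical, windows are treated as half-open `(ST, T]`, and the input is assumed to be sorted by `T`; missing start times are not handled here.

```go
package main

import "fmt"

// Delta is a hypothetical delta sample covering the half-open window (ST, T].
type Delta struct {
	ST int64 // start timestamp in ms
	T  int64 // end timestamp in ms
	V  float64
}

// overlapping returns the indices of samples whose window starts before the
// previous sample's window ended. The series must be sorted by T.
func overlapping(series []Delta) []int {
	var bad []int
	for i := 1; i < len(series); i++ {
		// With (ST, T] windows, ST equal to the previous T is NOT an overlap.
		if series[i].ST < series[i-1].T {
			bad = append(bad, i)
		}
	}
	return bad
}

func main() {
	series := []Delta{
		{ST: 0, T: 10, V: 1},
		{ST: 10, T: 20, V: 2}, // adjacent to the previous window
		{ST: 15, T: 30, V: 3}, // starts inside the previous window
	}
	fmt.Println(overlapping(series))
}
```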

ywwg · 2025-09-12T15:31:09Z

proposals/0060-ct-storage.md

+* Otel) More detailed, but still descriptive SHOULD rules only: In OpenTelemetry data model ([temporality section](https://opentelemetry.io/docs/specs/otel/metrics/data-model/#sums:~:text=name%2Dvalue%20pairs.-,A%20time%20window%20(of%20(start%2C%20end%5D)%20time%20for%20which%20the,The%20time%20interval%20is%20inclusive%20of%20the%20end%20time.,-Times%20are%20specified)) CT is generally optional, but strongly recommended. Rules are also soft and we assume they are "SHOULD" because there is section of [handling overlaps](https://opentelemetry.io/docs/specs/otel/metrics/data-model/#overlap). However, it provides more examples, which allow us to capture some specifics:
+  * Time intervals are half open `(CT, T]`.
+  * Generally, CT SHOULD:
+    * `CT[i] < T[i]` (equal means unknown)
+    * For a cumulative, CT SHOULD:
+      * Be the same across all samples for the same count:


ok yes this addresses my point above
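
A small sketch of how the quoted `CT[i] < T[i]` (equal means unknown) rule could be checked per sample. The `Sample` type, the `CTStatus` names, and the `classify` function are assumptions made for this example only; the proposal itself still lists validation as a TODO.

```go
package main

import "fmt"

// Sample is a hypothetical (CT, T, V) triple used only for this illustration.
type Sample struct {
	CT int64
	T  int64
	V  float64
}

// CTStatus describes how a sample's CT relates to its timestamp.
type CTStatus int

const (
	CTValid   CTStatus = iota // CT < T: a usable (CT, T] interval
	CTUnknown                 // CT == T: treated as "unknown" per the quoted rule
	CTInvalid                 // CT > T: the interval would be negative
)

// classify applies the per-sample rule `CT[i] < T[i]` (equal means unknown).
func classify(s Sample) CTStatus {
	switch {
	case s.CT < s.T:
		return CTValid
	case s.CT == s.T:
		return CTUnknown
	default:
		return CTInvalid
	}
}

func main() {
	for _, s := range []Sample{{CT: 5, T: 10}, {CT: 10, T: 10}, {CT: 11, T: 10}} {
		fmt.Println(s.CT, s.T, classify(s))
	}
}
```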

ywwg · 2025-09-12T15:48:23Z

proposals/0060-ct-storage.md

+
+Notably, given persistence of this feature, similar to example storage, if users enabled and then disabled this feature, users will might be able to access their CTs through all already persistent pieces e.g. WAL).
+
+This feature could be considered to be switched to opt-out, only after it's finished (this proposal is fully implemented) stable, provably adopted and when the previous LTS Prometheus version is compatible with this feature.


Suggested change

This feature could be considered to be switched to opt-out, only after it's finished (this proposal is fully implemented) stable, provably adopted and when the previous LTS Prometheus version is compatible with this feature.

This feature could be considered to be switched to opt-out only after it's stable (i.e., this proposal is fully implemented), provably adopted and when the previous LTS Prometheus version is compatible with this feature.

ywwg · 2025-09-12T15:49:28Z

proposals/0060-ct-storage.md

+
+TODO: Just a draft, to be discussed.
+TODO: There are questions around:
+* Should we do inclusive vs exclusive intervals?


I do worry about exclusive/inclusive. Will this need to be a flag / config option? Google has a strict opinion but it sounds like other systems do not.

ywwg · 2025-09-12T15:50:42Z

proposals/0060-ct-storage.md

+TODO: Just a draft, to be discussed.
+TODO: There are questions around:
+* Should we do inclusive vs exclusive intervals?
+* Given optionality of this feature, can we even reject sample on TSDB Append if CT is invalid? (MUST or SHOULD on interface?)


it might depend on how it's invalid? Would it be possible to accept invalid samples and solve the problem at query time? I have a hunch that people would prefer to have dirty data than have their data rejected. (I am unsure what the general Prometheus philosophy is on this question). As an example, there used to be a hard rule against OOO, but now there are a number of adaptations to support it.
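
To illustrate the difference between the two options being discussed, here is a sketch of an appender with a strict (MUST) mode that rejects a sample whose CT is after its timestamp and a lenient (SHOULD) mode that stores it and only counts it as dirty. Everything here (`Appender`, `Strict`, `Dirty`, `ErrInvalidCT`) is hypothetical and unrelated to the real Prometheus `Appender` interface.

```go
package main

import (
	"errors"
	"fmt"
)

// Sample is a hypothetical (CT, T, V) triple used only for this illustration.
type Sample struct {
	CT int64
	T  int64
	V  float64
}

// ErrInvalidCT is returned in strict mode when CT is after the sample timestamp.
var ErrInvalidCT = errors.New("invalid CT: CT is after sample timestamp")

// Appender is a toy in-memory appender, not the Prometheus storage interface.
type Appender struct {
	Strict  bool     // true: MUST semantics (reject); false: SHOULD semantics (accept)
	Samples []Sample // accepted samples
	Dirty   int      // number of accepted samples with an invalid CT
}

// Append stores the sample, rejecting invalid CTs only in strict mode.
func (a *Appender) Append(s Sample) error {
	if s.CT > s.T {
		if a.Strict {
			return ErrInvalidCT
		}
		a.Dirty++ // keep the data; leave cleanup to query time
	}
	a.Samples = append(a.Samples, s)
	return nil
}

func main() {
	bad := Sample{CT: 20, T: 10, V: 1}
	strict, lenient := &Appender{Strict: true}, &Appender{}
	fmt.Println(strict.Append(bad), lenient.Append(bad), lenient.Dirty)
}
```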

ywwg · 2025-09-12T15:53:22Z

proposals/0060-ct-storage.md

+
+CT/ST notion was popularized by OpenTelemetry and early experience exposed a big challenge: CT/ST data is extremely unclean given early adoption, mixed instrumentation support and multiple (all imperfect) ways of ["auto-generation"](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/metricstarttimeprocessor) (`subtract_initial_point` might the most universally "correct", but it added only recently). This means that handling (or reducing ) CT errors is an important detail for consumers of this data.
+
+TODO: Just a draft, to be discussed.


do we also have to consider changes to the exposition format? I am thinking of the extra cost of doubling the size of the exposition just to include a second timestamp, vs adding a new term to the exposition line in a way that old systems won't be able to read.

ywwg · 2025-09-12T15:58:02Z

proposals/0060-ct-storage.md

+* Changing names often creates confusion
+* Having a slight changed name already makes it clear that we talk about Prometheus semantics of the same thing
+
+**Rejected** due to not enough arguments for renaming (feel free to challenge this!).


Oh, I thought we had more interest in renaming than this. Maybe an informal poll on Slack is sufficient to get an accurate temperature read?

dashpole · 2025-09-12T19:05:33Z

proposals/0060-ct-storage.md

+* This unlocks the most amount of benefits (e.g. also delta) for the same amount of work, it makes code simpler.
+* We don't know if we need special cumulative best case optimization (yet); also it would be also for some "best" cases. Once we know we can always add those optimizations.
+
+2. Similarly, we propose to not have special CT storage cases per metric types. TSDB storage is not metric type aware, plus some systems allow optional CTs on gauges (e.g. OpenTelemetry). We propose keeping that storage flexibility.


Yeah, the start time on gauges is very confusing, but preserving it is probably the right thing to do.

dashpole · 2025-09-12T19:16:27Z

This makes a lot of sense to me. I think performance/benchmarks are probably the biggest potential blocker.

bwplotka changed the title ~~proposal[PROM-60]: Prometheus CT Storage~~ proposal: Prometheus CT Storage Sep 9, 2025

bwplotka added the proposal label Sep 9, 2025

bwplotka force-pushed the ctstorage branch 4 times, most recently from 4f9eb07 to 03c37e3 Compare September 10, 2025 14:04

bwplotka changed the title ~~proposal: Prometheus CT Storage~~ proposal: Native TSDB Support for Cumulative CT (and Delta (ST) on the way) Sep 10, 2025

bwplotka changed the title ~~proposal: Native TSDB Support for Cumulative CT (and Delta (ST) on the way)~~ proposal: Native TSDB Support for Cumulative CT (and Delta ST on the way) Sep 10, 2025

bwplotka mentioned this pull request Sep 10, 2025

prw: Remote Write 2.0 CT per Sample/Histogram prometheus/prometheus#17036

Draft

bwplotka force-pushed the ctstorage branch from 03c37e3 to 7df3d64 Compare September 10, 2025 14:11

bwplotka changed the title ~~proposal: Native TSDB Support for Cumulative CT (and Delta ST on the way)~~ proposal: TSDB Support for Cumulative CT (and Delta ST on the way) Sep 10, 2025

bwplotka force-pushed the ctstorage branch 4 times, most recently from 033d077 to 704dee5 Compare September 11, 2025 13:23

bwplotka marked this pull request as ready for review September 11, 2025 13:23

fionaliao reviewed Sep 11, 2025

bwplotka force-pushed the ctstorage branch from 704dee5 to 9dee382 Compare September 12, 2025 04:34

proposal[PROM-60]: Prometheus CT Storage

7a00542

Signed-off-by: bwplotka <bwplotka@gmail.com>

bwplotka force-pushed the ctstorage branch from 9dee382 to 7a00542 Compare September 12, 2025 07:54

bwplotka requested a review from dashpole September 12, 2025 07:55

ywwg reviewed Sep 12, 2025

fionaliao mentioned this pull request Sep 12, 2025

Proposal: OTEL delta temporality support #48

Open

dashpole reviewed Sep 12, 2025
