proposal: TSDB Support for Cumulative CT (and Delta ST on the way) #60
FYI: We met for 1h with the delta WG (@ArthurSens @carrieedwards @fionaliao @ywwg) for an initial discussion around this proposal's decisions. Thanks for this productive time! Here are some notes:
Also updated the proposal today with some learnings. Finally proposed a single feature flag for this work ( Still lots of TODOs, and anyone is welcome to help!
Signed-off-by: bwplotka <bwplotka@gmail.com>
Thank you for this proposal! I added a bunch of comments, some of which are answered by the paragraph right after the comment 😅. I think my main concern is nailing down the Goals section. This is not at all to question whether we should do the work, just that I think our statement of intent needs to be unequivocal.

> TL;DR: We propose to extend Prometheus TSDB storage sample definition to include an extra int64 that will represent the cumulative created timestamp (CT) and, for the future delta temporality ([PROM-48](https://github.com/prometheus/proposals/pull/48)), a delta start timestamp (ST).
> We propose introducing persisting CT logic behind a single flag `ct-storage`.
> Once implemented, wee propose to deprecate the `created-timestamps-zero-injection` experimental feature.
- > Once implemented, wee propose to deprecate the `created-timestamps-zero-injection` experimental feature.
+ > Once implemented, we propose to deprecate the `created-timestamps-zero-injection` experimental feature.

See the details, motivations and discussions about the delta temporality in [PROM-48](https://github.com/prometheus/proposals/pull/48).

The core TL;DR relevant for this proposal is that the delta temporality counter sample can be conceptually seen as a "mini-cumulative counter". Essentially, a delta is a single-sample (value) cumulative counter for the period between a start (ST)/created (CT) timestamp (sometimes inclusive, sometimes exclusive, depending on the system) and an end timestamp (inclusive).
this is the first time Start Timestamp is mentioned, and for people reading linearly it will kind of pop up out of nowhere. Let's move the next section, or at least parts of it, up higher and get the terminology questions out of the way as soon as possible. And then later on we can clarify the subtleties.
Maybe the way to do that is to acknowledge Otel explicitly, and briefly summarize how they define a delta sample
In other words, an instant query for `increase(<cumulative counter>[5m])` produces a single delta sample for the `[t-5m, t]` period (V: `increase(<counter>[5m])`, CT/ST: `now()-5m`, T: `now()`).
I am not sure if it's just me, but I am confused by the idea of producing a delta sample. To me, delta samples are things that are written to storage and not produced by promQL statements. For me, PromQL queries only ever produce instant/range scalar/vector results. And an increase() function would operate on one or more delta samples in the desired range, producing an "instant scalar" result, not a "delta sample."
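To make the "mini-cumulative counter" framing in the quoted paragraph concrete, here is a minimal Go sketch. The `Sample` type and `DeltaFromCumulative` are hypothetical names invented for illustration; they are not Prometheus TSDB or PromQL APIs, and the sketch ignores the extrapolation and reset handling that the real `increase()` performs.

```go
package main

import "fmt"

// Sample is a hypothetical illustration type, not the Prometheus TSDB sample.
// All timestamps are Unix milliseconds.
type Sample struct {
	CT int64   // created/start timestamp (CT/ST)
	T  int64   // sample (end) timestamp
	V  float64 // value
}

// DeltaFromCumulative collapses the increase between two cumulative samples
// into one delta-style sample whose window starts at prev.T and ends at cur.T.
// Assumption: no counter reset happened between prev and cur.
func DeltaFromCumulative(prev, cur Sample) Sample {
	return Sample{CT: prev.T, T: cur.T, V: cur.V - prev.V}
}

func main() {
	prev := Sample{CT: 1000, T: 60_000, V: 10}
	cur := Sample{CT: 1000, T: 360_000, V: 25}
	d := DeltaFromCumulative(prev, cur)
	fmt.Printf("delta: V=%v CT/ST=%d T=%d\n", d.V, d.CT, d.T)
}
```

Whether such a value should be called a "delta sample" or an instant query result is exactly the terminology question raised in the comment above; the sketch only shows that the same three fields describe both.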
* Average: on the order of ~weeks/months for stable workloads, ~days/weeks for more dynamic environments (Kubernetes).
* Best case: it never changes (infinite count), e.g. `days_since_X_total`.
* Worst case: it changes for every sample.
* For the delta we expect CT to change for every sample.
All the definitions I can find in the otel docs say that start time is "strongly encouraged" but not required. (ref: https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/metrics/v1/metrics.proto#L163-L186)
So I think we need to loosen this statement just a bit to consider the cases where CT does not change for every sample. perhaps: "For the delta we expect CT to change for every sample but can make a best-effort attempt to adapt to situations where the start time is missing."
The only reason why start timestamp wasn't made a hard requirement was for prometheus compatibility. All SDKs are required to set the start timestamp. Deltas always set the start timestamp as well to my knowledge.
Expectation that delta start time changes each sample is documented here: https://github.com/open-telemetry/opentelemetry-specification/blob/1e04e1be0e17cae6d01c862049bbeb298e0ffa06/specification/metrics/data-model.md?plain=1#L421
> When the aggregation temporality is "delta", we expect to have no overlap in time windows for metric streams.
* Otel) More detailed, but still descriptive SHOULD rules only: In OpenTelemetry data model ([temporality section](https://opentelemetry.io/docs/specs/otel/metrics/data-model/#sums:~:text=name%2Dvalue%20pairs.-,A%20time%20window%20(of%20(start%2C%20end%5D)%20time%20for%20which%20the,The%20time%20interval%20is%20inclusive%20of%20the%20end%20time.,-Times%20are%20specified)) CT is generally optional, but strongly recommended. Rules are also soft and we assume they are "SHOULD" because there is a section on [handling overlaps](https://opentelemetry.io/docs/specs/otel/metrics/data-model/#overlap). However, it provides more examples, which allow us to capture some specifics:
  * Time intervals are half open `(CT, T]`.
  * Generally, CT SHOULD:
    * `CT[i] < T[i]` (equal means unknown)
  * For a cumulative, CT SHOULD:
    * Be the same across all samples for the same count:
ok yes this addresses my point above

Notably, given the persistence of this feature, similar to exemplar storage, if users enable and then disable this feature, they might still be able to access their CTs through all already persisted pieces (e.g. WAL).

This feature could be considered to be switched to opt-out, only after it's finished (this proposal is fully implemented) stable, provably adopted and when the previous LTS Prometheus version is compatible with this feature.
- This feature could be considered to be switched to opt-out, only after it's finished (this proposal is fully implemented) stable, provably adopted and when the previous LTS Prometheus version is compatible with this feature.
+ This feature could be considered to be switched to opt-out only after it's stable (i.e., this proposal is fully implemented), provably adopted and when the previous LTS Prometheus version is compatible with this feature.

TODO: Just a draft, to be discussed.
TODO: There are questions around:
* Should we do inclusive vs exclusive intervals?
I do worry about exclusive/inclusive. Will this need to be a flag / config option? Google has a strict opinion but it sounds like other systems do not.
TODO: Just a draft, to be discussed.
TODO: There are questions around:
* Should we do inclusive vs exclusive intervals?
* Given optionality of this feature, can we even reject sample on TSDB Append if CT is invalid? (MUST or SHOULD on interface?)
it might depend on how it's invalid? Would it be possible to accept invalid samples and solve the problem at query time? I have a hunch that people would prefer to have dirty data than have their data rejected. (I am unsure what the general Prometheus philosophy is on this question). As an example, there used to be a hard rule against OOO, but now there are a number of adaptations to support it.

CT/ST notion was popularized by OpenTelemetry and early experience exposed a big challenge: CT/ST data is extremely unclean given early adoption, mixed instrumentation support and multiple (all imperfect) ways of ["auto-generation"](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/metricstarttimeprocessor) (`subtract_initial_point` might be the most universally "correct", but it was added only recently). This means that handling (or reducing) CT errors is an important detail for consumers of this data.

TODO: Just a draft, to be discussed.
do we also have to consider changes to the exposition format? I am thinking of the extra cost of doubling the size of the exposition just to include a second timestamp, vs adding a new term to the exposition line in a way that old systems won't be able to read.
* Changing names often creates confusion
* Having a slightly changed name already makes it clear that we talk about Prometheus semantics of the same thing

**Rejected** due to not enough arguments for renaming (feel free to challenge this!).
oh, I thought we had more interest in renaming than this. Maybe an informal poll on slack is sufficient to get an accurate temperature-read?
* This unlocks the most benefits (e.g. also delta) for the same amount of work, and it makes the code simpler.
* We don't know (yet) whether we need a special optimization for the cumulative best case; it would also only help in some "best" cases. Once we know, we can always add those optimizations.

2. Similarly, we propose to not have special CT storage cases per metric type. TSDB storage is not metric type aware, plus some systems allow optional CTs on gauges (e.g. OpenTelemetry). We propose keeping that storage flexibility.
Yeah, the start time on gauges is very confusing, but preserving it is probably the right thing to do.
This makes a lot of sense to me. I think performance/benchmarks are probably the biggest potential blocker.
As discussed in various places (e.g. prometheus/prometheus#17036 (comment) and delta WG), we decided to create a formal proposal on what CT-native Prometheus storage could look like and how to make it useful for (unblock) delta temporality.