Editorial review: Document PerformanceNavigationTiming.confidence #43528
chrisdavidmills wants to merge 5 commits into mdn:main
Conversation
@chrisdavidmills This says "technical review". Is it ready for me to look at?
@hamishwillee. Not yet; I requested a tech review from the browser engineers yesterday. Once it is ready, I'll flip it to "Editorial review".
mmocny
left a comment
The docs look great, thanks for doing it!
I'm not the primary engineering contact for this, so hopefully Mike Jackson at Msft will have a chance to take a look.
One detail that is missing from the docs: how should you use this value and interpret the data on the server? This feels like the most important part of the API, and also the hardest part for developers to understand.
Mike has done some presentations on this, and I see that he added a NOTE to the very bottom of this section of the spec: https://www.w3.org/TR/navigation-timing-2/#sec-PerformanceNavigationTiming
> This section is intended to help RUM providers and developers interpret confidence
...that section might be worth including in docs here?
Cheers.
This makes sense. For the moment, I've gone for including all the text in this section. Anyway, I'll include that in my next commit.
Cool, thanks, @mwjacksonmsft. I'll move this to the editorial review stage. @hamishwillee, ready for you to have a look, if you've still got time early next week.
## Interpreting confidence data
Since the {{domxref("PerformanceTimingConfidence.randomizedTriggerRate", "randomizedTriggerRate")}} can vary across records, per-record weighting is needed to recover unbiased aggregates. The procedure below illustrates how weighting based on {{domxref("PerformanceTimingConfidence.value", "value")}} can be applied before computing summary statistics.
There is always an argument around what you should expect a developer to reasonably interpret and how much hand holding you should do. To my mind though, basic questions like "why do I need to do this?" have not been answered.
Let's start backwards. Why are summary statistics needed and how do I use those?
What is an unbiased aggregate? Why do I care to recover one? Blah blah.
To put it another way, I can see myself following these instructions and generating the data, but then not knowing what to do with it.
@mwjacksonmsft can you provide a short paragraph that answers these questions, which I can edit into some finished prose? I don't know the answers to these questions.
I didn't want to write new prose for @mwjacksonmsft here, but went looking for existing text on the subject.
The spec doesn't really elaborate, but the original design doc does.
Stealing from there:
Summary
Web applications may suffer from bimodal distribution in page load performance, due to factors outside of the web application’s control. For example:
- When a user agent first launches (a "cold start" scenario), it must perform many expensive initialization tasks that compete for resources on the system.
- Browser extensions can affect the performance of a website. For instance, some extensions run additional code on every page you visit, which can increase CPU usage and result in slower response times.
- When a machine is busy performing intensive tasks, it can lead to slower loading of web pages.
In these scenarios, content the web app attempts to load will be in competition with other work happening on the system. This makes it difficult to detect if performance issues exist within web applications themselves, or because of external factors.
Teams we have worked with have been surprised at the difference between real-world dashboard metrics and what they observe in page profiling tools. Without more information, it is challenging for developers to understand if (and when) their applications may be misbehaving or are simply being loaded in a contended period.
A new ‘confidence’ field on the PerformanceNavigationTiming object will enable developers to discern if the navigation timings are representative for their web application.
Also, re-reading this patch, the description in the main navigation timing doc section Performance timing confidence seems to answer these questions already, so may just need linking to?
If the question here is even more broad, such as: why do developers measure performance in the field? Then I would also point to existing docs on the subject rather than answer them here.
The point of this feature (confidence) is to help segment field data into two distinct groups, with the observation that the high-confidence results are more stable over time, but the relative distribution between the groups can change.
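For illustration only (not part of the patch), that segmentation step might look something like the sketch below in a RUM script, assuming `confidence` ships as documented in this PR with `value` being `"high"` or `"low"`:

```js
// Split buffered navigation timing records into the two groups described above.
// Optional chaining guards against browsers that don't expose `confidence`.
const groups = { high: [], low: [] };

new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    const bucket = entry.confidence?.value;
    if (bucket === "high" || bucket === "low") {
      groups[bucket].push(entry.duration); // duration ≈ total page-load time
    }
  }
}).observe({ type: "navigation", buffered: true });
```

The observation above is then that the `high` group should stay relatively stable over time, while the split between the two groups can shift.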
OK, thanks. I've done some research of my own, and added a bit of information on why unbiased aggregates are needed.
It seems to me that summary statistics just refers to the statistics you produce from the raw data, which you actually give people to read. I therefore don't think this needs a huge amount of explanation, but I have added a few more words to indicate that they are statistics based on the confidence data.
@hamishwillee, let me know what you think.
The comment @mmocny left captures the why.
> Teams we have worked with have been surprised at the difference between real-world dashboard metrics and what they observe in page profiling tools. Without more information, it is challenging for developers to understand if (and when) their applications may be misbehaving or are simply being loaded in a contended period.
Developers can debias the data, and then focus on measuring and improving perf for things under their control.
Thanks, @mwjacksonmsft. I've added a couple more sentences to capture some of these thoughts.
{{APIRef("Performance API")}}{{SeeCompatTable}}
The **`randomizedTriggerRate`** read-only property of the {{domxref("PerformanceTimingConfidence")}} interface is a number representing a percentage value that indicates how often noise is applied when exposing the {{domxref("PerformanceTimingConfidence.value")}}.
This is very complete and accurate, but it is hard to parse. Possibly it is not necessary to capture this all in one sentence here, since you should do that in the `value` docs.
Suggested change:

The **`randomizedTriggerRate`** read-only property of the {{domxref("PerformanceTimingConfidence")}} interface is a number representing a percentage value that indicates how often noise is applied when exposing the {{domxref("PerformanceTimingConfidence.value")}}.

The **`randomizedTriggerRate`** read-only property of the {{domxref("PerformanceTimingConfidence")}} indicates how often noise is applied when exposing the {{domxref("PerformanceTimingConfidence.value")}}.
Either way
- why do we add noise? We should say, and also state what a high rate actually means vs a low rate.
- So 100% (1) would mean every `PerformanceTimingConfidence` has noise applied to the value.
- If noise is applied, does that flip the `value`?
Just trying to get a feel for what a developer might do or not do with this knowledge.
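To make my reading of it concrete, here's a hypothetical snippet (assuming `randomizedTriggerRate` is a fraction between 0 and 1, as the `p = 0.1` phrasing elsewhere in this thread suggests, and that "noise" means the reported `value` was randomly assigned):

```js
const [nav] = performance.getEntriesByType("navigation");

if (nav?.confidence) {
  const { value, randomizedTriggerRate: p } = nav.confidence;
  // p === 0   -> `value` is always the browser's real assessment (no noise).
  // p === 1   -> `value` is always randomly assigned, so it carries no signal.
  // p === 0.1 -> roughly 10% of reported values were randomly assigned, which
  //              is why aggregates need a debiasing/weighting step.
  console.log(`confidence: ${value}, randomizedTriggerRate: ${p}`);
}
```

If that reading is wrong, that's exactly the kind of thing the docs should spell out.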
@mwjacksonmsft can you provide answers to these questions?
@hamishwillee I've done a bit of research here too, and added more details about the randomized trigger rate and noise. Let me know if that answers your questions.
hamishwillee
left a comment
Looks pretty good. I have questions.
It might be nice to mention this in https://developer.mozilla.org/en-US/docs/Web/API/Performance_API/Navigation_timing
Good point; I've added a section to cover it.
hamishwillee
left a comment
Those changes look excellent. Can you ping me again directly when you've got the remaining answers and integrated them?
Yes, will do—cheers mate.
Note, I haven't been pinged back on this one AFAIK.
@mmocny @mwjacksonmsft, there are a couple of outstanding questions that came up in the editorial review that are blocking publication of this documentation. Can you look at them and help me with some answers? I've closed all the resolved comments, so they should be easy to find. Thanks!
I spot only a single unresolved comment in the patch at this point, but the GitHub UI claims there are two unresolved comments. (Perhaps one is stale from a line in the patch that has been removed, not sure.) Let me know if I haven't found all the questions that need answering.
### Interpreting confidence data
Since the {{domxref("PerformanceTimingConfidence.randomizedTriggerRate", "randomizedTriggerRate")}} can vary across records, per-record weighting is needed to recover unbiased aggregates, to improve consistency of data, cut down on compound errors, and generally produce more realistic and reliable results. The procedures below illustrate how weighting based on {{domxref("PerformanceTimingConfidence.value", "value")}} can be applied before computing summary statistics based on the confidence data.
Once you have debiased the data and computed realistic summary statistics, you can focus on measuring and improving performance for issues under your control.
@chrisdavidmills Same problem as I highlighted before - I didn't understand the "point" from this text and what you would do when you have the debiased data.
I asked Claude if this paragraph was just marketing and apparently it isn't :-). Apparently the term "recover unbiased aggregates" means something :-)
The point that is a bit buried is that value is not deterministic. The browser uses randomization to assign "high" or "low". When p = 0.1 (say), it means 10% of the time the value you see was randomly assigned regardless of actual conditions.
So you can't just filter out "low" records and average the "high" ones to work out your real performance — you'd be throwing away records that were actually fine but happened to get a random "low", and keeping records that were actually bad but got a random "high".
The debiasing math corrects for the random noise so that your aggregate statistics (mean, p75, etc.) are statistically valid. This is what the paragraph above did not make clear to me. Perhaps I am dim.
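To make that concrete, here is a rough sketch of the kind of correction involved. It assumes the classic randomized-response model (with probability `randomizedTriggerRate` the reported `value` is replaced by a uniform random draw from `"high"`/`"low"`); the normative weighting formulas are the ones in the spec's note, so treat this purely as an illustration of the shape of the math:

```js
// Per-record weight that, in expectation, counts a truly-high record as 1 and a
// truly-low record as 0, so the random flips cancel out across a large sample.
function highWeight(entry) {
  const { value, randomizedTriggerRate: p } = entry.confidence;
  if (p >= 1) return 0; // a fully randomized record carries no signal
  return value === "high" ? (1 - p / 2) / (1 - p) : -(p / 2) / (1 - p);
}

// Debiased mean load time over the (estimated) truly high-confidence records,
// rather than naively averaging only the records *reported* as "high".
function debiasedHighConfidenceMean(entries) {
  let weightedSum = 0;
  let weightTotal = 0;
  for (const entry of entries) {
    const w = highWeight(entry);
    weightedSum += w * entry.duration;
    weightTotal += w;
  }
  return weightTotal > 0 ? weightedSum / weightTotal : NaN;
}
```

Individual weights can be negative, so this only makes sense as an aggregate over many records; that's the "per-record weighting is needed to recover unbiased aggregates" point in the patch text.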
Claude says this is what I'd do with the data if collecting navigation timing data (e.g., for a real-user monitoring dashboard):
- Collect records via `PerformanceObserver` as normal.
- For each record, also grab `entry.confidence.value` and `entry.confidence.randomizedTriggerRate`.
- When computing your p75 LCP or mean load time, apply the weighting formulas instead of a plain average — this gives you separate, corrected metrics for "typical" loads vs. "degraded" loads.
- Use the `"high"` confidence mean/percentile as your "real" performance baseline, and use the `"low"` one to understand how bad things get in cold-start scenarios.
This last bit is what I meant by "what do you do with the data" - use it as a new baseline.
Does my problem now make sense?
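For what it's worth, the collection step in that list could look roughly like this; the `/rum` endpoint and payload shape are made up for illustration, and the weighting itself would then run server-side:

```js
// Beacon each navigation timing record together with its confidence fields so
// the server has everything it needs to debias the aggregates later.
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    navigator.sendBeacon(
      "/rum",
      JSON.stringify({
        loadTime: entry.duration,
        confidence: entry.confidence?.value ?? null,
        randomizedTriggerRate: entry.confidence?.randomizedTriggerRate ?? null,
      })
    );
  }
}).observe({ type: "navigation", buffered: true });
```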
Description
Chrome 145 adds support for the `PerformanceNavigationTiming.confidence` property, and the associated `PerformanceTimingConfidence` interface. See https://chromestatus.com/feature/5186950448283648.

This PR adds documentation for both features mentioned above.
Motivation
Additional details
Related issues and pull requests