Replies: 2 comments
-
This is a good analysis. Let me dive into it a bit.
-
Thanks for compiling this. Looking at the next steps, do we intend to analyze the impact of warmup iterations or high iteration counts? It would be very interesting to see per-iteration runtimes over, say, 1000 iterations run back to back. Does the latency stabilize, or does it spike randomly? Can we rely on the median latency over a large number of iterations? A rough sketch of that kind of check is below.
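A minimal sketch of the proposed check, assuming a hypothetical `run_inference_once()` helper that performs one inference and returns its latency in milliseconds (not part of the current infra):

```python
# Rough sketch: run N back-to-back iterations and check whether the rolling
# median settles or keeps spiking. `run_inference_once` is a hypothetical
# helper that runs one inference and returns latency in ms.
import numpy as np

def collect_latencies(run_inference_once, n_iters=1000, warmup=10):
    for _ in range(warmup):                      # warmup iterations, discarded
        run_inference_once()
    return np.array([run_inference_once() for _ in range(n_iters)])

def rolling_median(latencies_ms, window=50):
    x = np.asarray(latencies_ms, dtype=float)
    return np.array([np.median(x[max(0, i - window + 1):i + 1])
                     for i in range(len(x))])

# If the rolling median stays flat after the first few windows, the median is
# a reliable summary; large late swings would point to throttling or scheduling noise.
```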
-
Benchmark Infra Stability Assessment with private AWS devices
TL;DR
Here are the highlights and lowlights of what the plain data shows.
My conclusions and recommendations based on the data are at the end.
Understanding Stability Metrics
To properly assess the stability of ML model inference latency, I use several key statistical metrics:
A composite stability score (0-100 scale) is then calculated from a weighted combination of the CV, the Max/Min ratio, and the P99/P50 ratio.
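For concreteness, here is a minimal sketch of how these metrics can be computed from a raw latency series. The composite weights and the exact 0-100 scaling are illustrative assumptions; the actual analysis script may weight the terms differently.

```python
# Minimal sketch of the stability metrics above, computed from a raw latency
# series (ms). The composite weights and 0-100 scaling are illustrative
# assumptions, not the exact formula used in the analysis script.
import numpy as np

def stability_metrics(latencies_ms):
    x = np.asarray(latencies_ms, dtype=float)
    cv = x.std(ddof=1) / x.mean() * 100.0        # coefficient of variation, %
    p50, p99 = np.percentile(x, [50, 99])
    max_min = x.max() / x.min()
    p99_p50 = p99 / p50
    # Hypothetical weighted composite: higher score = more stable, clamped to 0-100.
    penalty = 0.5 * cv + 2.5 * (max_min - 1.0) + 2.5 * (p99_p50 - 1.0)
    return {
        "cv_pct": cv,
        "max_min_ratio": max_min,
        "p99_p50_ratio": p99_p50,
        "stability_score": max(0.0, 100.0 - penalty),
    }
```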
Intra-primary (private) Dataset Stability Comparison
Overall Summary:
Device-based Comparison:
The analysis of latency stability across private AWS devices reveals certain patterns in performance consistency. Here are the insights and recommendations:
Inter-dataset (private & public) Stability Comparison
NOTE: Unfortunately, the Dashboard does not provide all the data needed for a full inter-dataset comparison. The reference (public) table is built from whatever data is available.
Primary (Private) Datasets Summary:
Reference (Public) Datasets Summary:
Private vs Public Comparison:
Overall Insights and Recommendations
Detailed Stability Analysis on Individual Datasets - Primary (Private)
The full set of individual dataset analyses can be downloaded here. In this section I highlight detailed statistical metrics for a few selected datasets.
1. Latency Stability Analysis: llama3_spinq+s22_android13 (Primary)
2. Latency Stability Analysis: mv3_qnn+s22_android13 (Primary)
3. Latency Stability Analysis: mv3_xnnq8+s22_android13 (Primary)
4. Latency Stability Analysis: llama3_spinq+iphone15max_ios17 (Primary)
Conclusion and Recommendations for Next Steps
Android Benchmarking
The analysis shows that private AWS devices provide significantly better stability for Android benchmarking. The data supports a specific configuration strategy:
As next steps, we should:
iOS Benchmarking
Both private and public iOS devices show poor stability across all models (CV values of 21-37%), indicating fundamental flaws in our iOS benchmarking methodology/app. Until methodology improvements have been validated with new benchmark data, we risk not being able to draw a meaningful conclusion on moving iOS benchmarking forward by the end of June.
DevX Improvements
Our current benchmarking infrastructure has critical gaps that limit our ability to understand and address stability issues. These limitations are particularly problematic when trying to diagnose the root causes of performance variations we've observed across devices.
Current Gaps
Addressing these gaps is urgent for establishing a reliable benchmarking infrastructure. Without these improvements, we risk being unable to make timely decisions and basing conclusions on misleading or incomplete data.
References
Here I have attached the data sources and my script so that anyone can repeat the work. Please also use them as a reference when filling the infra gaps above.
The script used for analysis
Data source:
Datasets from Primary/Private AWS devices:
Benchmark Dataset with Private AWS Devices.xlsx
Datasets from Reference/Public AWS devices:
Benchmark Dataset with Public AWS Devices.xlsx
Each tab represents one dataset collected with one model+config+device combination. The data are copied from the ExecuTorch benchmark dashboard.
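If you want to re-run the analysis from the workbooks, something along these lines should work. The sheet layout and the "latency_ms" column name are assumptions; adjust them to match the actual spreadsheets.

```python
# Sketch of loading the attached workbooks for re-analysis. The "latency_ms"
# column name is an assumption; adjust it to match the actual spreadsheets.
import pandas as pd

sheets = pd.read_excel("Benchmark Dataset with Private AWS Devices.xlsx",
                       sheet_name=None)              # dict: tab name -> DataFrame
for name, df in sheets.items():
    lat = df["latency_ms"].dropna()                  # assumed column name
    cv = lat.std(ddof=1) / lat.mean() * 100.0        # coefficient of variation, %
    print(f"{name}: n={len(lat)}, CV={cv:.1f}%, "
          f"p99/p50={lat.quantile(0.99) / lat.quantile(0.5):.2f}")
```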