Replies: 2 comments
-
This is a good analysis. Let me dive into it a bit.
-
Thanks for compiling this. Looking at the next steps, do we intend to analyze the impact of warmup iterations or high iteration counts? It would be very interesting to see per-iteration runtimes over, say, 1000 iterations run back to back. Does the latency stabilize, or does it spike randomly? Can we rely on the median latency over a large number of iterations? A rough sketch of that kind of check is below.
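A minimal sketch of the proposed check, assuming a hypothetical `run_inference_once()` helper that performs one inference and returns its latency in milliseconds (not part of the current infra):

```python
# Rough sketch: run N back-to-back iterations and check whether the rolling
# median settles or keeps spiking. `run_inference_once` is a hypothetical
# helper that runs one inference and returns latency in ms.
import numpy as np

def collect_latencies(run_inference_once, n_iters=1000, warmup=10):
    for _ in range(warmup):                      # warmup iterations, discarded
        run_inference_once()
    return np.array([run_inference_once() for _ in range(n_iters)])

def rolling_median(latencies_ms, window=50):
    x = np.asarray(latencies_ms, dtype=float)
    return np.array([np.median(x[max(0, i - window + 1):i + 1])
                     for i in range(len(x))])

# If the rolling median stays flat after the first few windows, the median is
# a reliable summary; large late swings would point to throttling or scheduling noise.
```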
-
Benchmark Infra Stability Assessment with private AWS devices
TL;DR
Here are the highlights and lowlights of what the plain data shows.
My conclusions and recommendations based on the data are at the end.
Understanding Stability Metrics
To properly assess the stability of ML model inference latency, I use several key statistical metrics:
A composite stability score (0-100 scale) is then calculated from a weighted combination of the CV, the Max/Min ratio, and the P99/P50 ratio.
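For concreteness, here is a minimal sketch of how these metrics can be computed from a raw latency series. The composite weights and the exact 0-100 scaling are illustrative assumptions; the actual analysis script may weight the terms differently.

```python
# Minimal sketch of the stability metrics above, computed from a raw latency
# series (ms). The composite weights and 0-100 scaling are illustrative
# assumptions, not the exact formula used in the analysis script.
import numpy as np

def stability_metrics(latencies_ms):
    x = np.asarray(latencies_ms, dtype=float)
    cv = x.std(ddof=1) / x.mean() * 100.0        # coefficient of variation, %
    p50, p99 = np.percentile(x, [50, 99])
    max_min = x.max() / x.min()
    p99_p50 = p99 / p50
    # Hypothetical weighted composite: higher score = more stable, clamped to 0-100.
    penalty = 0.5 * cv + 2.5 * (max_min - 1.0) + 2.5 * (p99_p50 - 1.0)
    return {
        "cv_pct": cv,
        "max_min_ratio": max_min,
        "p99_p50_ratio": p99_p50,
        "stability_score": max(0.0, 100.0 - penalty),
    }
```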
Intra-primary (private) Dataset Stability Comparison
Overall Summary:
Device-based Comparison:
The analysis of latency stability across private AWS devices reveals certain patterns in performance consistency. Here are the insights and recommendations:
Inter-dataset (private & public) Stability Comparison
NOTE: Unfortunately, the Dashboard does not provide all the data needed for a full inter-dataset comparison. The reference (public) table is built from whatever data is available.
Primary (Private) Datasets Summary:
Reference (Public) Datasets Summary:
Private vs Public Comparison:
Overall Insights and Recommendations
Detailed Stability Analysis on Individual Datasets - Primary (Private)
The full set of individual dataset analyses can be downloaded here. In this section I highlight detailed statistical metrics for a few selected datasets.
1. Latency Stability Analysis: llama3_spinq+s22_android13 (Primary)
2. Latency Stability Analysis: mv3_qnn+s22_android13 (Primary)
3. Latency Stability Analysis: mv3_xnnq8+s22_android13 (Primary)
4. Latency Stability Analysis: llama3_spinq+iphone15max_ios17 (Primary)
Conclusion and Recommendations for Next Steps
Android Benchmarking
The analysis shows that private AWS devices provide significantly better stability for Android benchmarking. The data supports a specific configuration strategy:
As next steps, we should:
iOS Benchmarking
Both private and public iOS devices show poor stability across all models (CV values of 21-37%), indicating fundamental flaws in our iOS benchmarking methodology/app. Until methodology improvements have been validated with new benchmark data, we risk not being able to draw a meaningful conclusion on moving iOS benchmarking forward by the end of June.
DevX Improvements
Our current benchmarking infrastructure has critical gaps that limit our ability to understand and address stability issues. These limitations are particularly problematic when trying to diagnose the root causes of performance variations we've observed across devices.
Current Gaps
Addressing these gaps is urgent for establishing a reliable benchmarking infrastructure. Without these improvements, we risk being unable to make timely decisions and basing conclusions on misleading or incomplete data.
References
Here I have attached the data sources and my script so that anyone can repeat the work. Please also use them as a reference when filling the infra gaps above.
The script used for analysis
Data source:
Datasets from Primary/Private AWS devices:
Benchmark Dataset with Private AWS Devices.xlsx
Datasets from Reference/Public AWS devices:
Benchmark Dataset with Public AWS Devices.xlsx
Each tab represents one dataset collected with one model+config+device combination. The data are copied from the ExecuTorch benchmark dashboard.
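If you want to re-run the analysis from the workbooks, something along these lines should work. The sheet layout and the "latency_ms" column name are assumptions; adjust them to match the actual spreadsheets.

```python
# Sketch of loading the attached workbooks for re-analysis. The "latency_ms"
# column name is an assumption; adjust it to match the actual spreadsheets.
import pandas as pd

sheets = pd.read_excel("Benchmark Dataset with Private AWS Devices.xlsx",
                       sheet_name=None)              # dict: tab name -> DataFrame
for name, df in sheets.items():
    lat = df["latency_ms"].dropna()                  # assumed column name
    cv = lat.std(ddof=1) / lat.mean() * 100.0        # coefficient of variation, %
    print(f"{name}: n={len(lat)}, CV={cv:.1f}%, "
          f"p99/p50={lat.quantile(0.99) / lat.quantile(0.5):.2f}")
```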