## Problem
APE-RV's LLM integration (`aperv:sata_mop_llm`) has a 37.3% no_match rate — 3,554 of 9,525 LLM calls fail to map coordinates to a `ModelAction`. Each no_match wastes 1-3 s of LLM overhead without benefit, and the fallback algorithmic action may be suboptimal.

Impact: exp3 showed `aperv:sata_mop_llm` (27.60% method coverage) did NOT outperform the non-LLM baseline `aperv:sata_mop_v1` (28.35%, p=0.014). Reducing no_match from 37% to <20% could unlock the LLM's potential.
Root causes identified (from architectural analysis in docs/20260318_aperv_coordenadas_gh46.md):
- Timing gap: LLM sees a fresh screenshot but matching uses stale bounds from an earlier UIAutomator dump
- Over-abstraction: GUITreeBuilder filters widgets that exist in the accessibility tree
- Prompt format: Current prompt may not be optimal for 4B model (Qwen3-VL)
- Matching algorithm: Fixed tolerances may miss edge cases
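The timing-gap and fixed-tolerance failure modes compound: a correct LLM tap on the fresh screenshot can fall outside the tolerance-expanded bounds from the stale dump. A minimal Python sketch of this interaction (the function name, widget representation, and 10 px tolerance are illustrative assumptions, not APE-RV's actual matcher):

```python
TOLERANCE_PX = 10  # assumed fixed tolerance, for illustration only

def match_widget(x, y, widgets, tol=TOLERANCE_PX):
    """Return the resource id of the first widget whose tolerance-expanded
    bounds (left, top, right, bottom) contain the tap point (x, y)."""
    for rid, (left, top, right, bottom) in widgets:
        if left - tol <= x <= right + tol and top - tol <= y <= bottom + tol:
            return rid
    return None  # nothing under the tap: counted as a no_match

# Bounds from a UIAutomator dump taken before the screenshot:
stale = [("btn_ok", (40, 100, 200, 148))]

# The fresh screenshot shows the button 80 px lower, so the LLM taps (120, 204):
print(match_widget(120, 204, stale))  # None: stale bounds plus fixed tolerance
print(match_widget(120, 124, stale))  # btn_ok: would match if bounds were fresh
```

Widening the tolerance alone is not a fix: it trades timing-gap misses for wrong-widget matches on dense layouts, which is why Phase C waits for the data from A/A'/B.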
## Approach
Create a new `aperv-llm-validation` module that replicates the APE-RV LLM pipeline offline (`ImageProcessor`, `ApePromptBuilder`, `ToolCallParser`, `CoordinateNormalizer`, `mapToModelAction`) against 468 existing screenshots with UIAutomator ground truth. This enables:
- Phase A (Replay Forensics): Classify each exp3 no_match by root cause using trace replay.
- Phase A' (Ground Truth): Re-run a subset with enriched logging and artifact preservation.
- Phase B (Prompt Optimization): Test 5-6 prompt variants with a `reasoning` parameter to surface LLM intent; identify the optimal prompt for the 4B model.
- Phase C (Matching Improvements): Improve the matching algorithm based on data from A/A'/B.
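Phase A's replay classification can be sketched as a decision over which widget source, if any, contains the LLM's tap point. This is a hypothetical sketch, not the `aperv-llm-validation` code; the category labels and helper names are assumptions derived from the root-cause list above:

```python
def _contains(bounds, tap):
    """True if the tap (x, y) lies inside (left, top, right, bottom)."""
    left, top, right, bottom = bounds
    x, y = tap
    return left <= x <= right and top <= y <= bottom

def classify_no_match(tap, dump_bounds, a11y_bounds):
    """Coarse root-cause label for a failed coordinate-to-action mapping."""
    if any(_contains(b, tap) for b in dump_bounds):
        # A dump widget was under the tap yet the matcher still failed.
        return "matching_algorithm"
    if any(_contains(b, tap) for b in a11y_bounds):
        # Present in the accessibility tree but filtered by GUITreeBuilder.
        return "over_abstraction"
    # Nothing under the tap in either source: stale bounds or bad LLM output.
    return "timing_gap_or_prompt"

print(classify_no_match((120, 204), [], [(40, 180, 200, 228)]))
```

Running each of the 3,554 exp3 no_match cases through a classifier like this would give per-cause counts to prioritize Phases B and C.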
## Success Criteria
| Metric | Baseline (exp3) | Target |
| --- | --- | --- |
| no_match rate | 37.3% | <20% |
| APKs with 100% no_match | 8 | 0 |
| match rate | 62.1% | >80% |
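As a sanity check, the baseline no_match rate follows directly from the exp3 counts in the Problem section. Note that match (62.1%) and no_match (37.3%) sum to 99.4%, so a small residual outcome category presumably exists in the exp3 data:

```python
# Verify the headline exp3 figure: 3,554 no_match outcomes in 9,525 LLM calls.
total_calls, no_match = 9525, 3554
print(f"{no_match / total_calls:.1%}")  # 37.3%
```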
## References

- Plan: docs/20260318_aperv_coordenadas_gh46.md
- Calibration dependency: docs/20260318_rvape_calibracao.md (MICRO phase blocked on this)
- Prior visual grounding work: docs/vision/ (rvsec-vision-llm benchmark results)
- Exp3 results: data/results/exp3_*/