CompliBench is a benchmark for evaluating large language models on real-world Android privacy compliance. We build three regulation-aligned datasets (LGPD, PDPA, PIPEDA) from actual Android repositories in repos/ (including AIRAVAT, Android_Spy_App, L3MON, pounce-keys, PrivacyBreacher, Rafel_Rat, and rdroid) and annotate violations at the article level. The benchmark assesses models with two complementary tasks and aggregates results with principled, reproducible scoring.
Datasets and tasks
- Datasets (3 jurisdictions): LGPD (Brazil), PDPA (Singapore), PIPEDA (Canada), curated from real Android projects with regulation-article annotations.
- Task 1 (Multi-granularity detection): Retrieval-style evaluation at three code scales (file, module, and line) to identify violated articles in repositories.
- Task 2 (Multi-label classification): Snippet-level prediction of violated articles with multi-label ground truth.
CompliBench answers whether LLMs can detect privacy compliance violations across jurisdictions and code scales by combining:
- Raw metrics appropriate to each task (e.g., Acc@k/MRR/MAP/nDCG for Task 1; F1/Jaccard/NCE/1−Hamming for Task 2).
- Aggregation within scale/regulation (SGS): Generalized mean (p = -1) with consistency penalty via coefficient of variation.
- Aggregation across metrics (RCS): TOPSIS with Mahalanobis distance (with ridge regularization) to account for metric correlations.
- Aggregation across regulations (CRGS): Geometric mean with cross-regulation variance penalty.
- Overall score (OCS): Final single-number score using harmonic-mean cross-task coupling with a balance penalty and cross-regulation stability (variance) penalty.
The benchmark utilizes real-world Android applications from the repos/ directory, representing diverse categories of privacy-sensitive software:
- AIRAVAT: Android surveillance application with web panel
- Android_Spy_App: Comprehensive monitoring application
- L3MON: Remote access tool with extensive Android capabilities
- pounce-keys: Android keylogger with stealth features
- PrivacyBreacher: Privacy testing and assessment application
- Rafel_Rat: Remote administration tool with data collection features
- rdroid: Android remote control system
Android applications are ideal for privacy compliance evaluation because they:
- Handle sensitive user data (contacts, location, messages, etc.)
- Require explicit permission declarations
- Implement complex data processing workflows
- Must comply with multiple international privacy regulations
- Represent real-world software engineering practices
# Run complete evaluation across all regulations, tasks, and models
python evaluate.py --regulation all --task all --model all
# Evaluate specific regulation and task
python evaluate.py --regulation LGPD --task 1 --model claude-3-5-sonnet-20241022
# Advanced aggregation methods with harmonic mean OCS (default)
python evaluate.py --outdir results_advanced
# Customize OCS parameters
python evaluate.py --ocs_beta 2.0 --ocs_delta 2.0 --outdir results_custom
# Generate radar chart visualizations
python create_combined_radar_charts.py --outdir evaluation_results
Evaluation Results:
- cross_reg_task1.txt - Task 1 cross-regulation comparison table
- cross_reg_task2.txt - Task 2 cross-regulation comparison table
- overall_scores.txt - Overall Composite Score (OCS) rankings
- raw_metrics_task1_table.txt - Detailed Task 1 raw metrics with SGS summary
- raw_metrics_task2_table.txt - Detailed Task 2 raw metrics with NCE
Visualization Outputs:
- radars/Task1_file_level_combined.png - Task 1 file-level radar charts
- radars/Task1_module_level_combined.png - Task 1 module-level radar charts
- radars/Task1_line_level_combined.png - Task 1 line-level radar charts
- radars/Task2_combined.png - Task 2 radar charts
After successful evaluation, you should see:
evaluation_results/
├── cross_reg_task1.txt           # Cross-regulation Task 1 table
├── cross_reg_task2.txt           # Cross-regulation Task 2 table
├── overall_scores.txt            # Final OCS rankings table
├── raw_metrics_task1_table.txt   # All Task 1 metrics + SGS summary
├── raw_metrics_task2_table.txt   # All Task 2 metrics (including NCE)
├── radars/                       # Radar chart visualizations
│   ├── Task1_file_level_combined.png
│   ├── Task1_module_level_combined.png
│   ├── Task1_line_level_combined.png
│   └── Task2_combined.png
└── {LGPD,PDPA,PIPEDA}/
    └── Task{1,2}/
        ├── {model}/metrics.json  # Individual detailed results
        └── summary.txt           # Per-regulation summaries
All table files include evaluation settings for reproducibility.
Objective: Detect privacy regulation violations in Android codebases at three granularity levels:
- File-level: Identify violated articles for entire Android source files (Activities, Services, etc.)
- Module-level: Detect violations within specific Android modules/classes (e.g., permission handlers, data collectors)
- Line-level: Pinpoint violations at specific line ranges in Android code
Data Format: Each sample contains Android code annotations at multiple levels with violated_articles lists referencing specific regulation articles.
Key Challenges:
- Hierarchical violation detection across different Android code granularities
- Understanding Android-specific privacy patterns (permissions, data access, etc.)
- Ranking quality of predicted violation articles
- Cross-scale consistency in detection performance
Objective: Classify privacy violations in isolated Android code snippets using multi-label prediction.
Data Format: Android code snippets (methods, classes, permission requests) with associated violated_articles representing multiple possible privacy violations.
Key Challenges:
- Multi-label classification with imbalanced article distributions in Android contexts
- Understanding Android framework-specific privacy implications
- Ranking quality when multiple violations may apply to the same code pattern
- Coverage depth for comprehensive violation detection in mobile app contexts
Accuracy@k: Fraction of relevant articles found in top-k predictions
Acc@k = |{relevant articles} ∩ {top-k predictions}| / k
Range: [0,1], higher is better
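As a minimal illustration of this definition (the function name and example values are ours, not taken from evaluate.py):

```python
def acc_at_k(relevant, predictions, k):
    """Share of the top-k predicted articles that are actually violated."""
    return len(set(relevant) & set(predictions[:k])) / k

# Example: 2 of the top-5 predicted articles are relevant -> Acc@5 = 0.4
acc_at_k(relevant=[1, 3, 5], predictions=[3, 7, 1, 8, 9], k=5)
```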
R-Precision: Precision at rank R, where R = number of relevant articles
R-Precision = |{relevant articles} ∩ {top-R predictions}| / R
Range: [0,1], higher is better
MRR (Mean Reciprocal Rank): Average of 1/rank for first correct prediction
MRR = 1 / rank_of_first_relevant_item
Range: [0,1], higher is better
MAP (Mean Average Precision): Average precision across all relevant articles
MAP = (1/|relevant|) × Σ(precision@rank_i) for each relevant item i
Range: [0,1], higher is better
nDCG@5: Normalized Discounted Cumulative Gain at rank 5
DCG@5 = Σ(1/log₂(rank+1)) for relevant items in top-5
nDCG@5 = DCG@5 / IDCG@5
Range: [0,1], higher is better
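The ranking metrics above can be sketched in a few lines of Python. This is an illustrative binary-relevance reference, not the evaluate.py implementation:

```python
import math

def reciprocal_rank(relevant, predictions):
    """1/rank of the first relevant article, 0 if none is predicted."""
    for rank, article in enumerate(predictions, start=1):
        if article in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(relevant, predictions):
    """Mean of precision@rank taken at each relevant article's rank."""
    hits, precisions = 0, []
    for rank, article in enumerate(predictions, start=1):
        if article in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def ndcg_at_5(relevant, predictions):
    """Binary-relevance nDCG@5 with 1/log2(rank+1) gains."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, article in enumerate(predictions[:5], start=1)
              if article in relevant)
    idcg = sum(1.0 / math.log2(rank + 1)
               for rank in range(1, min(len(relevant), 5) + 1))
    return dcg / idcg if idcg > 0 else 0.0
```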
SGS (Scale-wise Generalized Score): Cross-scale consistency with penalty
For each metric m:
values = [file_level_m, module_level_m, line_level_m]
gm = generalized_mean(values, p=-1) # p=-1 for harmonic mean
cv² = (std(values) / mean(values))²  # coefficient of variation squared
SGS_m = gm × exp(-α × cv²)  # α=1.0 consistency penalty
Range: [0,1], higher is better
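A small worked example of SGS (the metric values here are made up):

```python
import numpy as np

# Hypothetical MRR values at file, module, and line granularity
values = np.array([0.80, 0.70, 0.40])

gm = len(values) / np.sum(1.0 / values)          # harmonic mean (p = -1) ≈ 0.579
cv_sq = (np.std(values) / np.mean(values)) ** 2  # squared coefficient of variation ≈ 0.072
sgs = gm * np.exp(-1.0 * cv_sq)                  # ≈ 0.539: penalized for cross-scale inconsistency
```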
T1-RCS (Task 1 Regulation Composite Score): TOPSIS with Mahalanobis distance
Input: [sgs_acc@1, sgs_acc@5, sgs_r_precision, sgs_mrr, sgs_map, sgs_ndcg@5]
Z = metrics_matrix (models Γ 6_metrics)
C = covariance(Z) + λ×I  # λ=0.1 ridge regularization
ideal = [1,1,1,1,1,1], worst = [0,0,0,0,0,0]
d_plus = mahalanobis_distance(Z[i], ideal)
d_minus = mahalanobis_distance(Z[i], worst)
T1-RCS = d_minus / (d_plus + d_minus)
Range: [0,1], higher is better
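A minimal NumPy sketch of this TOPSIS variant, assuming the metrics matrix is already in [0,1]; the actual evaluate.py implementation may differ in details:

```python
import numpy as np

def topsis_mahalanobis(Z, ridge=0.1):
    """TOPSIS closeness scores using Mahalanobis distance.

    Z: (n_models, n_metrics) array of metrics already in [0, 1].
    """
    n_metrics = Z.shape[1]
    C = np.cov(Z, rowvar=False) + ridge * np.eye(n_metrics)  # ridge-regularized covariance
    C_inv = np.linalg.inv(C)
    ideal, worst = np.ones(n_metrics), np.zeros(n_metrics)

    def mahalanobis(x, ref):
        d = x - ref
        return np.sqrt(d @ C_inv @ d)

    d_plus = np.array([mahalanobis(row, ideal) for row in Z])
    d_minus = np.array([mahalanobis(row, worst) for row in Z])
    return d_minus / (d_plus + d_minus)
```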
T1-CRGS (Cross-Regulation Generalized Score): Geometric mean with variance penalty
rcs_values = [T1-RCS_LGPD, T1-RCS_PDPA, T1-RCS_PIPEDA]
gm = geometric_mean(rcs_values)
var_penalty = exp(-β × variance(rcs_values))  # β=2.0
T1-CRGS = gm Γ var_penalty
Range: [0,1], higher is better
Micro-F1: F1 score averaged across all label instances
Micro-F1 = 2 × (micro_precision × micro_recall) / (micro_precision + micro_recall)
micro_precision = TP_total / (TP_total + FP_total)
micro_recall = TP_total / (TP_total + FN_total)
Range: [0,1], higher is better
Macro-F1: F1 score averaged across labels (unweighted)
Macro-F1 = (1/|labels|) × Σ F1_score(label_i)
Range: [0,1], higher is better
Weighted-F1: F1 score weighted by label support
Weighted-F1 = Σ (support_i × F1_score(label_i)) / total_support
Range: [0,1], higher is better
Jaccard (samples): Average Jaccard similarity across samples
For each sample: J = |predicted ∩ true| / |predicted ∪ true|
Jaccard = average(J_across_samples)
Range: [0,1], higher is better
Normalized Coverage Error (NCE): Depth required to cover all true labels
For each sample: CE = max_rank_of_true_labels - 1
NCE = CE / (total_labels - 1)
Range: [0,1], lower is better (inverted to inv_nce = 1-NCE for RCS)
Hamming Loss: Fraction of incorrect label predictions
Hamming = (1/samples) × Σ |predicted_i Δ true_i| / |all_labels|, where Δ is the symmetric difference
Range: [0,1], lower is better (inverted to inv_hamming = 1-Hamming for RCS)
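For reference, these Task 2 metrics can be sketched over binary label matrices as follows. This is illustrative code, not the evaluate.py implementation; note that empty-vs-empty Jaccard is scored 0 here, whereas the --jaccard_empty_same_one option described below would score it 1:

```python
import numpy as np

def jaccard_samples(Y_true, Y_pred):
    """Per-sample Jaccard similarity, averaged over samples."""
    inter = np.logical_and(Y_true, Y_pred).sum(axis=1)
    union = np.logical_or(Y_true, Y_pred).sum(axis=1)
    return float(np.mean(np.where(union > 0, inter / np.maximum(union, 1), 0.0)))

def hamming_loss(Y_true, Y_pred):
    """Fraction of label slots that are predicted incorrectly."""
    return float(np.mean(Y_true != Y_pred))

def nce(Y_true, scores):
    """Normalized Coverage Error: ranking depth needed to cover all true labels."""
    n_labels = Y_true.shape[1]
    depths = []
    for truth, s in zip(Y_true.astype(bool), scores):
        order = np.argsort(-s)                     # label indices ranked by confidence
        ranks = np.empty(n_labels, dtype=int)
        ranks[order] = np.arange(1, n_labels + 1)  # rank of each label (1 = most confident)
        depths.append(ranks[truth].max() - 1 if truth.any() else 0)
    return float(np.mean(depths)) / (n_labels - 1)
```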
T2-RCS (Task 2 Regulation Composite Score): TOPSIS with Mahalanobis distance
Input: [micro_f1, macro_f1, weighted_f1, jaccard_samples, inv_nce, inv_hamming]
where inv_nce = 1-NCE, inv_hamming = 1-Hamming (converted to "higher is better")
Z = metrics_matrix (models Γ 6_metrics)
C = covariance(Z) + λ×I  # λ=0.1 ridge regularization
ideal = [1,1,1,1,1,1], worst = [0,0,0,0,0,0]
d_plus = mahalanobis_distance(Z[i], ideal)
d_minus = mahalanobis_distance(Z[i], worst)
T2-RCS = d_minus / (d_plus + d_minus)
Range: [0,1], higher is better
T2-CRGS (Cross-Regulation Generalized Score): Same as T1-CRGS but for Task 2
rcs_values = [T2-RCS_LGPD, T2-RCS_PDPA, T2-RCS_PIPEDA]
gm = geometric_mean(rcs_values)
var_penalty = exp(-β × variance(rcs_values))  # β=2.0
T2-CRGS = gm Γ var_penalty
Range: [0,1], higher is better
OCS (Overall Composite Score): Cross-task composite score with multiple modes
OCS Basic: Simple linear combination
OCS = λ × T1-CRGS + (1-λ) × T2-CRGS  # λ=0.5 default
Range: [0,1], higher is better
OCS: Harmonic mean cross-task coupling
For each regulation r:
hm_r = 2 × T1-RCS_r × T2-RCS_r / (T1-RCS_r + T2-RCS_r)  # harmonic mean
S_r = hm_r × exp(-β × |T1-RCS_r - T2-RCS_r|)  # balance penalty
OCS = geometric_mean(S_r) × exp(-δ × variance(S_r))
Parameters: β=2.0, δ=2.0
Range: [0,1], higher is better
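A short worked example with made-up per-regulation RCS values shows how the coupling and penalties interact:

```python
import math
from statistics import pvariance

# Hypothetical (T1-RCS, T2-RCS) pairs per regulation for one model
rcs = {"LGPD": (0.80, 0.60), "PDPA": (0.75, 0.70), "PIPEDA": (0.65, 0.55)}
beta, delta = 2.0, 2.0

S = []
for r1, r2 in rcs.values():
    hm = 2 * r1 * r2 / (r1 + r2)                   # harmonic mean couples both tasks
    S.append(hm * math.exp(-beta * abs(r1 - r2)))  # shrink when the tasks are imbalanced

gm = math.prod(S) ** (1 / len(S))                  # geometric mean across regulations
ocs = gm * math.exp(-delta * pvariance(S))         # shrink when regulations disagree; ≈ 0.52 here
```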
# Generalized mean with consistency penalty
def sgs(values, p=-1.0, alpha=1.0):
    gm = generalized_mean(values, p=p)   # p=-1 for harmonic mean
    cv = std(values) / mean(values)      # coefficient of variation
    return gm * exp(-alpha * cv**2)      # consistency penalty

# TOPSIS with Mahalanobis distance (no normalization - metrics already in [0,1])
def rcs(metrics_matrix):
    Z = metrics_matrix                   # already in [0,1] range
    C = covariance_matrix(Z) + λ*I       # λ=0.1 ridge regularization
    ideal = [1, 1, ..., 1]               # best possible
    worst = [0, 0, ..., 0]               # worst possible
    return topsis_scores(Z, C, ideal, worst)

# Geometric mean with variance penalty
def crgs(rcs_values, beta=2.0):
    gm = geometric_mean(rcs_values)
    var = variance(rcs_values)
    return gm * exp(-beta * var)         # variance penalty

# Harmonic mean coupling with balance penalty
def ocs(regulation_scores, beta=2.0, delta=2.0):
    coupling_scores = []
    for reg in regulations:
        r1, r2 = regulation_scores[reg]["T1-RCS"], regulation_scores[reg]["T2-RCS"]
        hm = (2 * r1 * r2) / (r1 + r2)   # harmonic mean
        balance_penalty = exp(-beta * abs(r1 - r2))
        S_r = hm * balance_penalty
        coupling_scores.append(S_r)
    # Cross-regulation aggregation
    gm = geometric_mean(coupling_scores)
    var_penalty = exp(-delta * variance(coupling_scores))
    return gm * var_penalty

The framework now uses advanced aggregation methods by default:
- SGS: Replaces simple harmonic mean with generalized mean + consistency penalty
- RCS: Replaces linear weighting with TOPSIS multi-criteria decision analysis
- CRGS: Replaces arithmetic mean with geometric mean + variance penalty
- OCS: Multiple computation modes for cross-task evaluation
The --norm parameter is now for display purposes only and doesn't affect RCS computation:
- --norm none (default): Absolute performance scale - no normalization for display
- --norm minmax: Relative ranking - min-max normalization for display only
- --norm robust: Percentile-based - P10-P90 range for display only
The framework computes the Overall Composite Score using harmonic mean cross-task coupling with balance penalty and cross-regulation stability.
Generate combined radar charts for academic papers and presentations:
# Generate all combined radar charts
python create_combined_radar_charts.py --outdir evaluation_results
# Charts are saved to evaluation_results/radars/
Output Files:
- Task1_file_level_combined.png - Task 1 file-level radar charts (3 models × 2 rows)
- Task1_module_level_combined.png - Task 1 module-level radar charts
- Task1_line_level_combined.png - Task 1 line-level radar charts
- Task2_combined.png - Task 2 radar charts (3 models × 2 rows)
Chart Features:
- Academic-quality: Times New Roman font, high DPI (600), professional styling
- Multi-regulation visualization: Each radar shows LGPD, PDPA, and PIPEDA performance
- Fixed metrics: 6 standardized axes per radar chart
- Task 1: Acc@1, Acc@5, MRR, R-Precision, MAP, nDCG@5
- Task 2: Weighted-F1, Micro-F1, Macro-F1, 1−Hamming, Jaccard, NCE
- Paper-ready: Optimized for inclusion in academic papers with proper spacing and sizing
Parameters:
- --ocs_beta 2.0: Balance penalty strength (penalizes task imbalance)
- --ocs_delta 2.0: Cross-regulation variance penalty
- --relax_keys: Enable relaxed key matching for Task 1
  - When strict matching fails, fall back to file-level aggregation
  - Useful for debugging format mismatches between predictions and gold data
  - Provides diagnostic info: strict_matched_items vs relaxed_matched_items
- --jaccard_empty_same_one: Treat empty-empty samples as a perfect match (1.0) in Jaccard computation
- --task2_coverage_mode {scores,pred_on_top}: Choose the ranking construction method for Coverage Error
  - scores: Use per-label confidence scores (if available in predictions) - recommended for better discriminative power
  - pred_on_top: Place predicted labels first, others after (stable fallback mode)
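A minimal sketch of how the two coverage ranking modes could be built; the names and structure are illustrative, not the evaluate.py internals:

```python
import numpy as np

def coverage_ranking(all_labels, predicted, article_scores=None, mode="scores"):
    """Return per-label scores used to rank labels for Coverage Error."""
    if mode == "scores" and article_scores:
        # Use the model's per-article confidence directly (JSON keys are strings)
        return np.array([article_scores.get(str(label), 0.0) for label in all_labels])
    # pred_on_top: predicted labels first (in prediction order), all others after
    scores = np.zeros(len(all_labels))
    for rank, label in enumerate(predicted):
        if label in all_labels:
            scores[all_labels.index(label)] = len(predicted) - rank  # earlier = higher
    return scores

# Example with hypothetical articles 1..8 and a three-article prediction
labels = list(range(1, 9))
ranking = coverage_ranking(labels, predicted=[2, 4, 7], mode="pred_on_top")
```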
High SGS values indicate both good performance and consistency across granularities. Low SGS values suggest either poor performance or high cross-scale inconsistency (high coefficient of variation).
Example diagnostics:
DIAG file_level_violations: matched 45/90 (strict:15, relaxed:30)
- Gold items: 90 total annotations
- Matched: 45 had predictions
- Strict: 15 exact key matches
- Relaxed: 30 additional matches via file-level fallback
Coverage Error measures ranking depth - how far down you need to go to find all true labels.
- Lower values are better
- NCE normalizes to [0,1] range for cross-dataset comparison
Jaccard vs F1:
- Jaccard focuses on set overlap
- F1 accounts for precision/recall balance
- Both complement each other for multi-label evaluation
OCS Rankings provide overall model comparison:
- Higher OCS = better overall compliance detection capability
- Cross-task coupling: Harmonic mean ensures both Task 1 and Task 2 perform well
- Balance penalty: exp(-β×|T1-T2|) reduces scores for large performance gaps between tasks
- Cross-regulation stability: Geometric mean with variance penalty ensures consistent performance across LGPD/PDPA/PIPEDA
- Interpretable components: Each regulation's coupling score S_r shows task balance quality
CompliBench/
├── evaluate.py                       # Main evaluation framework
├── README.md                         # This documentation
├── LGPD/                             # Brazilian LGPD regulation datasets
│   ├── LGPD_task1_dataset.json       # Task 1 gold standard (Android code violations)
│   ├── LGPD_task2_dataset.json       # Task 2 gold standard (Android snippets)
│   ├── Compliance_Task1/             # Task 1 model predictions
│   ├── Compliance_Task2/             # Task 2 model predictions
│   ├── create_task1_dataset.py       # Dataset generation script
│   └── create_task2_dataset.py       # Dataset generation script
├── PDPA/                             # Singapore PDPA regulation datasets
├── PIPEDA/                           # Canada PIPEDA regulation datasets
├── repos/                            # Real-world Android app source code
│   ├── AIRAVAT/                      # Android surveillance app
│   ├── Android_Spy_App/              # Android monitoring application
│   ├── L3MON/                        # Remote access tool for Android
│   ├── pounce-keys/                  # Android keylogger
│   ├── PrivacyBreacher/              # Privacy testing application
│   ├── Rafel_Rat/                    # Remote administration tool
│   └── rdroid/                       # Android remote control system
└── evaluation_results/               # Output directory
    ├── cross_reg_task1.txt           # Task 1 cross-regulation comparison table
    ├── cross_reg_task2.txt           # Task 2 cross-regulation comparison table
    ├── overall_scores.txt            # Overall composite scores table
    ├── raw_metrics_task1_table.txt   # Detailed Task 1 raw metrics with SGS
    ├── raw_metrics_task2_table.txt   # Detailed Task 2 raw metrics
    └── {REGULATION}/
        └── Task{1,2}/
            ├── {model}/
            │   └── metrics.json      # Individual model results
            └── summary.txt           # Regulation-task summary
To ensure proper evaluation, model predictions must follow the exact JSON schema below:
[
{
"repo_url": "https://github.com/example/android-app",
"Commit_ID": "abc123def456",
"file_level_violations": [
{
"file_path": "app/src/main/java/MainActivity.java",
"violated_articles": [1, 3, 5]
}
],
"module_level_violations": [
{
"file_path": "app/src/main/java/MainActivity.java",
"module_name": "MainActivity",
"violated_articles": [1, 3]
}
],
"line_level_violations": [
{
"file_path": "app/src/main/java/MainActivity.java",
"line_spans": "45-47",
"violated_articles": [1]
}
]
}
]
Basic Format:
[
{
"repo_url": "https://github.com/example/android-app",
"Commit_ID": "abc123def456",
"code_snippet_path": "snippets/location_access.java",
"violated_articles": [2, 4, 7]
}
]
Enhanced Format with Scores (Recommended for better Coverage Error):
[
{
"repo_url": "https://github.com/example/android-app",
"Commit_ID": "abc123def456",
"code_snippet_path": "snippets/location_access.java",
"violated_articles": [2, 4, 7],
"article_scores": {
"1": 0.1, "2": 0.9, "3": 0.2, "4": 0.8,
"5": 0.3, "6": 0.1, "7": 0.7, "8": 0.2
}
}
]
- repo_url: String - Repository URL identifier
- Commit_ID: String - Git commit hash or identifier
- violated_articles: Array[Integer] - List of violated article numbers
- file_path: String - Relative path to source file
- module_name: String - Class/module name (module_level only)
- line_spans: String - Line range in format "start-end" or single number (line_level only)
- code_snippet_path: String - Path to code snippet file
- article_scores: Object (optional) - Per-article confidence scores for better Coverage Error computation
- Article numbering: Use regulation-specific article numbers (e.g., LGPD Article 1, 2, 3...)
- Zero handling: Use 0 to indicate "no violation" or as padding
- Order preservation: Maintain prediction order in violated_articles for ranking metrics
- Deduplication: Avoid duplicate article numbers within the same violation instance
- File paths: Use forward slashes / consistently across platforms
Before running evaluation, verify your predictions:
- JSON is valid and parseable
- All required fields are present
- violated_articles contains only valid article numbers or 0
- File paths match the expected Android project structure
- No duplicate articles within the same violation instance
- Prediction order reflects model confidence (higher confidence first)
- For Task 2: Include article_scores if using --task2_coverage_mode scores
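To check several of these points programmatically, a small stand-alone sketch along these lines can help; the file name and article range are hypothetical, and this script is not part of evaluate.py:

```python
import json

def check_task2_predictions(path, valid_articles):
    """Lightweight sanity checks for a Task 2 prediction file (illustrative only)."""
    with open(path) as f:
        samples = json.load(f)                      # fails here if the JSON is not parseable
    problems = []
    for i, sample in enumerate(samples):
        for field in ("repo_url", "Commit_ID", "code_snippet_path", "violated_articles"):
            if field not in sample:
                problems.append(f"sample {i}: missing required field '{field}'")
        articles = sample.get("violated_articles", [])
        if len(articles) != len(set(articles)):
            problems.append(f"sample {i}: duplicate articles in violated_articles")
        if any(a != 0 and a not in valid_articles for a in articles):
            problems.append(f"sample {i}: article number outside the regulation's range")
    return problems

# Hypothetical file name and article range; adjust both to your regulation and model
issues = check_task2_predictions("LGPD/Compliance_Task2/my_model.json", set(range(1, 66)))
```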
- Key Construction: Build unique identifiers for each violation instance (a sketch follows this list)
  - File: (repo_url, commit_id, file_path)
  - Module: (repo_url, commit_id, file_path, module_name)
  - Line: (repo_url, commit_id, file_path, line_spans)
- Order Preservation: Maintain prediction ranking for MRR/MAP/nDCG computation
  - Use stable deduplication to handle repeated predictions
  - Preserve first occurrence position for ranking metrics
- Relaxed Matching: Optional fallback to file-level aggregation when strict keys don't match
- Multi-label Matrix: Convert to binary matrix Y_true[samples, labels]
- Coverage Error: Compute ranking depth needed to cover all true labels
- Ranking Construction:
  - scores mode: Use per-label confidence scores
  - pred_on_top mode: Place predictions first, maintaining order
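The key construction and stable deduplication steps can be sketched as follows; the helper names are hypothetical and the actual evaluate.py internals may differ:

```python
# Hypothetical helpers, not the evaluate.py implementation.
def violation_key(item, level):
    """Unique identifier for one violation instance at the given granularity."""
    base = (item["repo_url"], item["Commit_ID"], item["file_path"])
    if level == "module":
        return base + (item["module_name"],)
    if level == "line":
        return base + (item["line_spans"],)
    return base  # file level

def stable_dedup(articles):
    """Drop repeated article numbers, preserving first-occurrence order for MRR/MAP/nDCG."""
    seen, ordered = set(), []
    for article in articles:
        if article not in seen:
            seen.add(article)
            ordered.append(article)
    return ordered
```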
- SGS: Generalized mean (p=-1) with consistency penalty exp(-α·CV²) across granularities
- RCS: TOPSIS with Mahalanobis distance considering metric correlations
- CRGS: Geometric mean with variance penalty exp(-β·Var) across regulations
- OCS: Cross-task coupling with cross-regulation stability (multiple modes available)
The framework supports pluggable metrics via the registry system:
# Register a new Task 2 metric
register_task2_metric("custom_metric", orientation="max")
# Metrics are automatically included in TOPSIS computation

To add support for a new regulation:
- Create a regulation directory: NEW_REG/
- Add it to REGULATIONS = ["LGPD", "PDPA", "PIPEDA", "NEW_REG"]
- Provide gold datasets: NEW_REG_task1_dataset.json, NEW_REG_task2_dataset.json
- Place model predictions in NEW_REG/Compliance_Task{1,2}/
Low Task 1 scores with many "matched 0/N" diagnostics:
- Check key alignment between predictions and gold data
- Try --relax_keys to use file-level fallback
- Verify prediction format matches expected schema
Task 2 Coverage Error seems too high:
- Check if predictions maintain meaningful ranking
- Consider --task2_coverage_mode scores if confidence scores are available
- Verify label encoding consistency
RCS values interpretation:
- RCS uses TOPSIS scores in [0,1] range automatically
- Higher values indicate greater closeness to the ideal solution (relative to the worst point)
- Values consider metric correlations via Mahalanobis distance
- Use --regulation SPECIFIC and --task SPECIFIC for focused evaluation
- Individual model evaluation: --model specific_model_name
- Parallel processing: Run different regulations separately and combine results
This project is licensed under the MIT License - see the LICENSE file for details.
- CompliBench framework: MIT License
- Evaluation datasets: CC BY 4.0 License
- Android app source code: Original licenses apply (see individual repos)
- Third-party dependencies: See requirements.txt and individual package licenses
For detailed technical specifications and implementation details, see the inline documentation in evaluate.py.