
CompliBench: Can Large Language Models Detect Real-World Android Software Compliance Violations?

CompliBench is a benchmark for evaluating large language models on real-world Android privacy compliance. We build three regulation-aligned datasets (LGPD, PDPA, PIPEDA) from actual Android repositories in repos/ (AIRAVAT, Android_Spy_App, L3MON, pounce-keys, PrivacyBreacher, Rafel_Rat, and rdroid) and annotate violations at the article level. The benchmark assesses models with two complementary tasks and aggregates results with principled, reproducible scoring.

Overview

Datasets and tasks

  • Datasets (3 jurisdictions): LGPD (Brazil), PDPA (Singapore), PIPEDA (Canada), curated from real Android projects with regulation-article annotations.
  • Task 1 – Multi-granularity detection: Retrieval-style evaluation at three code scales (file, module, and line) to identify violated articles in repositories.
  • Task 2 – Multi-label classification: Snippet-level prediction of violated articles with multi-label ground truth.

What we measure

CompliBench answers whether LLMs can detect privacy compliance violations across jurisdictions and code scales by combining:

  • Raw metrics appropriate to each task (e.g., Acc@k/MRR/MAP/nDCG for Task 1; F1/Jaccard/NCE/1-Hamming for Task 2).
  • Aggregation within scale/regulation (SGS): Generalized mean (p = -1) with consistency penalty via coefficient of variation.
  • Aggregation across metrics (RCS): TOPSIS with Mahalanobis distance (with ridge regularization) to account for metric correlations.
  • Aggregation across regulations (CRGS): Geometric mean with cross-regulation variance penalty.
  • Overall score (OCS): Final single-number score using harmonic-mean cross-task coupling with a balance penalty and cross-regulation stability (variance) penalty.

Android App Dataset

The benchmark utilizes real-world Android applications from the repos/ directory, representing diverse categories of privacy-sensitive software:

Included Applications

  • AIRAVAT: Android surveillance application with web panel
  • Android_Spy_App: Comprehensive monitoring application
  • L3MON: Remote access tool with extensive Android capabilities
  • pounce-keys: Android keylogger with stealth features
  • PrivacyBreacher: Privacy testing and assessment application
  • Rafel_Rat: Remote administration tool with data collection features
  • rdroid: Android remote control system

Why Android Apps?

Android applications are ideal for privacy compliance evaluation because they:

  • Handle sensitive user data (contacts, location, messages, etc.)
  • Require explicit permission declarations
  • Implement complex data processing workflows
  • Must comply with multiple international privacy regulations
  • Represent real-world software engineering practices

Quick Start

Basic Usage

# Run complete evaluation across all regulations, tasks, and models
python evaluate.py --regulation all --task all --model all

# Evaluate specific regulation and task
python evaluate.py --regulation LGPD --task 1 --model claude-3-5-sonnet-20241022

# Advanced aggregation methods with harmonic mean OCS (default)
python evaluate.py --outdir results_advanced

# Customize OCS parameters
python evaluate.py --ocs_beta 2.0 --ocs_delta 2.0 --outdir results_custom

# Generate radar chart visualizations
python create_combined_radar_charts.py --outdir evaluation_results

Key Output Files

Evaluation Results:

  • cross_reg_task1.txt - Task 1 cross-regulation comparison table
  • cross_reg_task2.txt - Task 2 cross-regulation comparison table
  • overall_scores.txt - Overall Composite Score (OCS) rankings
  • raw_metrics_task1_table.txt - Detailed Task 1 raw metrics with SGS summary
  • raw_metrics_task2_table.txt - Detailed Task 2 raw metrics with NCE

Visualization Outputs:

  • radars/Task1_file_level_combined.png - Task 1 file-level radar charts
  • radars/Task1_module_level_combined.png - Task 1 module-level radar charts
  • radars/Task1_line_level_combined.png - Task 1 line-level radar charts
  • radars/Task2_combined.png - Task 2 radar charts

Expected Output Structure

After successful evaluation, you should see:

evaluation_results/
├── cross_reg_task1.txt          # Cross-regulation Task 1 table
├── cross_reg_task2.txt          # Cross-regulation Task 2 table
├── overall_scores.txt           # Final OCS rankings table
├── raw_metrics_task1_table.txt  # All Task 1 metrics + SGS summary
├── raw_metrics_task2_table.txt  # All Task 2 metrics (including NCE)
├── radars/                      # Radar chart visualizations
│   ├── Task1_file_level_combined.png
│   ├── Task1_module_level_combined.png
│   ├── Task1_line_level_combined.png
│   └── Task2_combined.png
└── {LGPD,PDPA,PIPEDA}/
    └── Task{1,2}/
        ├── {model}/metrics.json # Individual detailed results
        └── summary.txt          # Per-regulation summaries

All table files include evaluation settings for reproducibility.

Task Definitions

Task 1: Multi-Granularity Android Violation Detection

Objective: Detect privacy regulation violations in Android codebases at three granularity levels:

  • File-level: Identify violated articles for entire Android source files (Activities, Services, etc.)
  • Module-level: Detect violations within specific Android modules/classes (e.g., permission handlers, data collectors)
  • Line-level: Pinpoint violations at specific line ranges in Android code

Data Format: Each sample contains Android code annotations at multiple levels with violated_articles lists referencing specific regulation articles.

Key Challenges:

  • Hierarchical violation detection across different Android code granularities
  • Understanding Android-specific privacy patterns (permissions, data access, etc.)
  • Ranking quality of predicted violation articles
  • Cross-scale consistency in detection performance

Task 2: Android Code Snippet Multi-Label Classification

Objective: Classify privacy violations in isolated Android code snippets using multi-label prediction.

Data Format: Android code snippets (methods, classes, permission requests) with associated violated_articles representing multiple possible privacy violations.

Key Challenges:

  • Multi-label classification with imbalanced article distributions in Android contexts
  • Understanding Android framework-specific privacy implications
  • Ranking quality when multiple violations may apply to the same code pattern
  • Coverage depth for comprehensive violation detection in mobile app contexts

Advanced Evaluation Metrics

Task 1 Metrics

Core Ranking Metrics (per granularity level)

Accuracy@k: Fraction of the top-k predictions that are relevant (violated) articles

Acc@k = |{relevant articles} ∩ {top-k predictions}| / k
Range: [0,1], higher is better

R-Precision: Precision at rank R, where R = number of relevant articles

R-Precision = |{relevant articles} ∩ {top-R predictions}| / R
Range: [0,1], higher is better

MRR (Mean Reciprocal Rank): Average of 1/rank for first correct prediction

MRR = 1 / rank_of_first_relevant_item
Range: [0,1], higher is better

MAP (Mean Average Precision): Average precision across all relevant articles

MAP = (1/|relevant|) × Σ(precision@rank_i) for each relevant item i
Range: [0,1], higher is better

nDCG@5: Normalized Discounted Cumulative Gain at rank 5

DCG@5 = Σ(1/log₂(rank+1)) for relevant items in top-5
nDCG@5 = DCG@5 / IDCG@5
Range: [0,1], higher is better
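
As a minimal, self-contained sketch of how these ranking metrics are computed for a single sample, the following pure-Python example uses a hypothetical gold set and prediction list (not taken from the benchmark data):

import math

relevant = {1, 3, 5}                  # hypothetical gold violated articles
ranked = [3, 2, 5, 7, 1, 4]           # hypothetical predictions, highest confidence first

k = 5
topk = ranked[:k]
acc_at_k = len(relevant & set(topk)) / k                    # Acc@5

R = len(relevant)
r_precision = len(relevant & set(ranked[:R])) / R           # R-Precision

rr = next((1.0 / (i + 1) for i, a in enumerate(ranked) if a in relevant), 0.0)  # reciprocal rank

hits, ap = 0, 0.0
for i, a in enumerate(ranked):
    if a in relevant:
        hits += 1
        ap += hits / (i + 1)
ap /= len(relevant)                                         # average precision (mean over samples gives MAP)

dcg = sum(1.0 / math.log2(i + 2) for i, a in enumerate(topk) if a in relevant)
idcg = sum(1.0 / math.log2(i + 2) for i in range(min(R, k)))
ndcg_at_5 = dcg / idcg                                      # nDCG@5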

Advanced Composite Metrics

SGS (Scale-wise Generalized Score): Cross-scale consistency with penalty

For each metric m:
values = [file_level_m, module_level_m, line_level_m]
gm = generalized_mean(values, p=-1)  # p=-1 for harmonic mean
cv² = (std(values) / mean(values))²  # coefficient of variation squared
SGS_m = gm × exp(-α × cv²)           # α=1.0 consistency penalty
Range: [0,1], higher is better
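
For example, with hypothetical file/module/line values of 0.8, 0.6, and 0.7 for one metric, the harmonic mean is ≈ 0.690, the squared coefficient of variation is ≈ 0.014, and SGS ≈ 0.690 × exp(-0.014) ≈ 0.681, slightly below the harmonic mean because the three scales disagree.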

T1-RCS (Task 1 Regulation Composite Score): TOPSIS with Mahalanobis distance

Input: [sgs_acc@1, sgs_acc@5, sgs_r_precision, sgs_mrr, sgs_map, sgs_ndcg@5]
Z = metrics_matrix (models × 6_metrics)
C = covariance(Z) + λ×I  # λ=0.1 ridge regularization
ideal = [1,1,1,1,1,1], worst = [0,0,0,0,0,0]
d_plus = mahalanobis_distance(Z[i], ideal)
d_minus = mahalanobis_distance(Z[i], worst)
T1-RCS = d_minus / (d_plus + d_minus)
Range: [0,1], higher is better

T1-CRGS (Cross-Regulation Generalized Score): Geometric mean with variance penalty

rcs_values = [T1-RCS_LGPD, T1-RCS_PDPA, T1-RCS_PIPEDA]
gm = geometric_mean(rcs_values)
var_penalty = exp(-β × variance(rcs_values))  # β=2.0
T1-CRGS = gm Γ— var_penalty
Range: [0,1], higher is better

Task 2 Metrics

Core Classification Metrics

Micro-F1: F1 score averaged across all label instances

Micro-F1 = 2 × (micro_precision × micro_recall) / (micro_precision + micro_recall)
micro_precision = TP_total / (TP_total + FP_total)
micro_recall = TP_total / (TP_total + FN_total)
Range: [0,1], higher is better

Macro-F1: F1 score averaged across labels (unweighted)

Macro-F1 = (1/|labels|) × Σ F1_score(label_i)
Range: [0,1], higher is better

Weighted-F1: F1 score weighted by label support

Weighted-F1 = Σ (support_i × F1_score(label_i)) / total_support
Range: [0,1], higher is better

Jaccard (samples): Average Jaccard similarity across samples

For each sample: J = |predicted ∩ true| / |predicted ∪ true|
Jaccard = average(J_across_samples)
Range: [0,1], higher is better

Normalized Coverage Error (NCE): Depth required to cover all true labels

For each sample: CE = max_rank_of_true_labels - 1
NCE = CE / (total_labels - 1)
Range: [0,1], lower is better (inverted to inv_nce = 1-NCE for RCS)

Hamming Loss: Fraction of incorrect label predictions

Hamming = (1/samples) × Σ |predicted_i ⊕ true_i| / |all_labels|
Range: [0,1], lower is better (inverted to inv_hamming = 1-Hamming for RCS)
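
As a minimal sketch of how these classification metrics can be computed with scikit-learn on a hypothetical 3-snippet × 5-article label matrix (the benchmark's own implementation in evaluate.py may differ in details such as zero handling):

import numpy as np
from sklearn.metrics import f1_score, jaccard_score, hamming_loss, coverage_error

# Hypothetical binary label matrices: 3 snippets x 5 articles
y_true = np.array([[1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 1],
                   [1, 1, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 0, 0],
                   [0, 1, 0, 0, 1],
                   [1, 0, 0, 1, 0]])
# Hypothetical per-article confidence scores (used for Coverage Error in "scores" mode)
y_score = np.array([[0.9, 0.1, 0.4, 0.2, 0.1],
                    [0.1, 0.8, 0.2, 0.1, 0.7],
                    [0.8, 0.3, 0.2, 0.6, 0.1]])

micro_f1    = f1_score(y_true, y_pred, average="micro")
macro_f1    = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
jaccard     = jaccard_score(y_true, y_pred, average="samples")
hamming     = hamming_loss(y_true, y_pred)

ce  = coverage_error(y_true, y_score)       # average rank depth needed to cover all true labels
nce = (ce - 1) / (y_true.shape[1] - 1)      # normalized to [0, 1] per the definition above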

Advanced Composite Metrics

T2-RCS (Task 2 Regulation Composite Score): TOPSIS with Mahalanobis distance

Input: [micro_f1, macro_f1, weighted_f1, jaccard_samples, inv_nce, inv_hamming]
where inv_nce = 1-NCE, inv_hamming = 1-Hamming (converted to "higher is better")
Z = metrics_matrix (models × 6_metrics)
C = covariance(Z) + λ×I  # λ=0.1 ridge regularization
ideal = [1,1,1,1,1,1], worst = [0,0,0,0,0,0]
d_plus = mahalanobis_distance(Z[i], ideal)
d_minus = mahalanobis_distance(Z[i], worst)
T2-RCS = d_minus / (d_plus + d_minus)
Range: [0,1], higher is better

T2-CRGS (Cross-Regulation Generalized Score): Same as T1-CRGS but for Task 2

rcs_values = [T2-RCS_LGPD, T2-RCS_PDPA, T2-RCS_PIPEDA]
gm = geometric_mean(rcs_values)
var_penalty = exp(-β × variance(rcs_values))  # β=2.0
T2-CRGS = gm Γ— var_penalty
Range: [0,1], higher is better

Overall Metrics

OCS (Overall Composite Score): Cross-task composite score with multiple modes

OCS Basic: Simple linear combination

OCS = λ × T1-CRGS + (1-λ) × T2-CRGS  # λ=0.5 default
Range: [0,1], higher is better

OCS: Harmonic mean cross-task coupling

For each regulation r: 
  hm_r = 2×T1-RCS_r×T2-RCS_r / (T1-RCS_r + T2-RCS_r)  # harmonic mean
  S_r = hm_r × exp(-β × |T1-RCS_r - T2-RCS_r|)         # balance penalty
OCS = geometric_mean(S_r) × exp(-δ × variance(S_r))
Parameters: β=2.0, δ=2.0
Range: [0,1], higher is better

Advanced Aggregation Details

SGS Implementation

# Generalized mean with consistency penalty
import numpy as np

def sgs(values, p=-1.0, alpha=1.0):
    values = np.asarray(values, dtype=float)
    gm = np.mean(values ** p) ** (1.0 / p)  # generalized mean; p=-1 gives the harmonic mean
    cv = np.std(values) / np.mean(values)   # coefficient of variation
    return gm * np.exp(-alpha * cv ** 2)    # consistency penalty

RCS Implementation

# TOPSIS with Mahalanobis distance (no normalization - metrics already in [0,1])
def rcs(metrics_matrix):
    Z = metrics_matrix  # already in [0,1] range
    C = covariance_matrix(Z) + λ*I  # λ=0.1 ridge regularization
    ideal = [1, 1, ..., 1]          # best possible
    worst = [0, 0, ..., 0]          # worst possible
    return topsis_scores(Z, C, ideal, worst)
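
For reference, a runnable NumPy sketch of this TOPSIS step (assuming several models are scored together, with one row per model and one column per metric; the exact implementation in evaluate.py may differ):

import numpy as np

def rcs_topsis_mahalanobis(Z, ridge=0.1):
    # Z: (n_models, n_metrics) matrix, values already in [0, 1]
    Z = np.asarray(Z, dtype=float)
    n_metrics = Z.shape[1]
    C = np.cov(Z, rowvar=False) + ridge * np.eye(n_metrics)  # ridge-regularized covariance
    C_inv = np.linalg.inv(C)
    ideal, worst = np.ones(n_metrics), np.zeros(n_metrics)

    def mahalanobis(x, ref):
        d = x - ref
        return float(np.sqrt(d @ C_inv @ d))

    scores = []
    for row in Z:
        d_plus = mahalanobis(row, ideal)     # distance to the ideal point
        d_minus = mahalanobis(row, worst)    # distance to the anti-ideal point
        scores.append(d_minus / (d_plus + d_minus))
    return np.array(scores)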

CRGS Implementation

# Geometric mean with variance penalty
import numpy as np

def crgs(rcs_values, beta=2.0):
    rcs_values = np.asarray(rcs_values, dtype=float)
    gm = np.exp(np.mean(np.log(rcs_values)))  # geometric mean
    var = np.var(rcs_values)
    return gm * np.exp(-beta * var)           # variance penalty

OCS Implementation

# Harmonic mean coupling with balance penalty
import numpy as np

def ocs(regulation_scores, beta=2.0, delta=2.0):
    # regulation_scores: {"LGPD": {"T1-RCS": ..., "T2-RCS": ...}, ...}
    coupling_scores = []
    for reg, scores in regulation_scores.items():
        r1, r2 = scores["T1-RCS"], scores["T2-RCS"]
        hm = (2 * r1 * r2) / (r1 + r2)                  # harmonic mean
        balance_penalty = np.exp(-beta * abs(r1 - r2))  # penalize task imbalance
        coupling_scores.append(hm * balance_penalty)    # S_r per regulation

    # Cross-regulation aggregation
    gm = np.exp(np.mean(np.log(coupling_scores)))       # geometric mean
    var_penalty = np.exp(-delta * np.var(coupling_scores))
    return gm * var_penalty
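
Example call with hypothetical per-regulation scores (values for illustration only):

scores = {
    "LGPD":   {"T1-RCS": 0.62, "T2-RCS": 0.58},
    "PDPA":   {"T1-RCS": 0.55, "T2-RCS": 0.60},
    "PIPEDA": {"T1-RCS": 0.48, "T2-RCS": 0.52},
}
print(ocs(scores))  # single OCS value in [0, 1]; drops when tasks or regulations diverge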

Advanced Features

Aggregation Methods

The framework now uses advanced aggregation methods by default:

  • SGS: Replaces simple harmonic mean with generalized mean + consistency penalty
  • RCS: Replaces linear weighting with TOPSIS multi-criteria decision analysis
  • CRGS: Replaces arithmetic mean with geometric mean + variance penalty
  • OCS: Multiple computation modes for cross-task evaluation

Legacy Normalization (Display Only)

The --norm parameter is now for display purposes only and doesn't affect RCS computation:

  • --norm none (default): Absolute performance scale - no normalization for display
  • --norm minmax: Relative ranking - min-max normalization for display only
  • --norm robust: Percentile-based - P10-P90 range for display only

OCS (Overall Composite Score)

The framework computes the Overall Composite Score using harmonic mean cross-task coupling with balance penalty and cross-regulation stability.

Radar Chart Visualization

Generate combined radar charts for academic papers and presentations:

# Generate all combined radar charts
python create_combined_radar_charts.py --outdir evaluation_results

# Charts are saved to evaluation_results/radars/

Output Files:

  • Task1_file_level_combined.png - Task 1 file-level radar charts (3 models Γ— 2 rows)
  • Task1_module_level_combined.png - Task 1 module-level radar charts
  • Task1_line_level_combined.png - Task 1 line-level radar charts
  • Task2_combined.png - Task 2 radar charts (3 models Γ— 2 rows)

Chart Features:

  • Academic-quality: Times New Roman font, high DPI (600), professional styling
  • Multi-regulation visualization: Each radar shows LGPD, PDPA, and PIPEDA performance
  • Fixed metrics: 6 standardized axes per radar chart
    • Task 1: Acc@1, Acc@5, MRR, R-Precision, MAP, nDCG@5
    • Task 2: Weighted-F1, Micro-F1, Macro-F1, 1-Hamming, Jaccard, NCE
  • Paper-ready: Optimized for inclusion in academic papers with proper spacing and sizing

Parameters:

  • --ocs_beta 2.0: Balance penalty strength (penalizes task imbalance)
  • --ocs_delta 2.0: Cross-regulation variance penalty

Key Alignment Options

  • --relax_keys: Enable relaxed key matching for Task 1
    • When strict matching fails, fallback to file-level aggregation
    • Useful for debugging format mismatches between predictions and gold data
    • Provides diagnostic info: strict_matched_items vs relaxed_matched_items

Task 2 Specialized Options

  • --jaccard_empty_same_one: Treat empty-empty samples as perfect match (1.0) in Jaccard computation
  • --task2_coverage_mode {scores,pred_on_top}: Choose ranking construction method for Coverage Error
    • scores: Use per-label confidence scores (if available in predictions) - recommended for better discriminative power
    • pred_on_top: Place predicted labels first, others after (stable fallback mode)

Understanding the Results

Interpreting Task 1 Results

High SGS values indicate both good performance and consistency across granularities. Low SGS values suggest either poor performance or high cross-scale inconsistency (high coefficient of variation).

Example diagnostics:

DIAG file_level_violations: matched 45/90 (strict:15, relaxed:30)
  • Gold items: 90 total annotations
  • Matched: 45 had predictions
  • Strict: 15 exact key matches
  • Relaxed: 30 additional matches via file-level fallback

Interpreting Task 2 Results

Coverage Error measures ranking depth - how far down you need to go to find all true labels.

  • Lower values are better
  • NCE normalizes to [0,1] range for cross-dataset comparison

Jaccard vs F1:

  • Jaccard focuses on set overlap
  • F1 accounts for precision/recall balance
  • Both complement each other for multi-label evaluation

Model Ranking Interpretation

OCS Rankings provide overall model comparison:

  • Higher OCS = better overall compliance detection capability
  • Cross-task coupling: Harmonic mean ensures both Task 1 and Task 2 perform well
  • Balance penalty: exp(-β×|T1-T2|) reduces scores for large performance gaps between tasks
  • Cross-regulation stability: Geometric mean with variance penalty ensures consistent performance across LGPD/PDPA/PIPEDA
  • Interpretable components: Each regulation's coupling score S_r shows task balance quality

File Structure

CompliBench/
├── evaluate.py                     # Main evaluation framework
├── README.md                       # This documentation
├── LGPD/                           # Brazilian LGPD regulation datasets
│   ├── LGPD_task1_dataset.json     # Task 1 gold standard (Android code violations)
│   ├── LGPD_task2_dataset.json     # Task 2 gold standard (Android snippets)
│   ├── Compliance_Task1/           # Task 1 model predictions
│   ├── Compliance_Task2/           # Task 2 model predictions
│   ├── create_task1_dataset.py     # Dataset generation script
│   └── create_task2_dataset.py     # Dataset generation script
├── PDPA/                           # Singapore PDPA regulation datasets
├── PIPEDA/                         # Canada PIPEDA regulation datasets
├── repos/                          # Real-world Android app source code
│   ├── AIRAVAT/                    # Android surveillance app
│   ├── Android_Spy_App/            # Android monitoring application
│   ├── L3MON/                      # Remote access tool for Android
│   ├── pounce-keys/                # Android keylogger
│   ├── PrivacyBreacher/            # Privacy testing application
│   ├── Rafel_Rat/                  # Remote administration tool
│   └── rdroid/                     # Android remote control system
└── evaluation_results/             # Output directory
    ├── cross_reg_task1.txt         # Task 1 cross-regulation comparison table
    ├── cross_reg_task2.txt         # Task 2 cross-regulation comparison table
    ├── overall_scores.txt          # Overall composite scores table
    ├── raw_metrics_task1_table.txt # Detailed Task 1 raw metrics with SGS
    ├── raw_metrics_task2_table.txt # Detailed Task 2 raw metrics
    └── {REGULATION}/
        └── Task{1,2}/
            ├── {model}/
            │   └── metrics.json    # Individual model results
            └── summary.txt         # Regulation-task summary

Prediction File Format

To ensure proper evaluation, model predictions must follow the JSON schemas below exactly:

Task 1 Prediction Schema

[
  {
    "repo_url": "https://github.com/example/android-app",
    "Commit_ID": "abc123def456",
    "file_level_violations": [
      {
        "file_path": "app/src/main/java/MainActivity.java",
        "violated_articles": [1, 3, 5]
      }
    ],
    "module_level_violations": [
      {
        "file_path": "app/src/main/java/MainActivity.java", 
        "module_name": "MainActivity",
        "violated_articles": [1, 3]
      }
    ],
    "line_level_violations": [
      {
        "file_path": "app/src/main/java/MainActivity.java",
        "line_spans": "45-47",
        "violated_articles": [1]
      }
    ]
  }
]

Task 2 Prediction Schema

Basic Format:

[
  {
    "repo_url": "https://github.com/example/android-app",
    "Commit_ID": "abc123def456", 
    "code_snippet_path": "snippets/location_access.java",
    "violated_articles": [2, 4, 7]
  }
]

Enhanced Format with Scores (Recommended for better Coverage Error):

[
  {
    "repo_url": "https://github.com/example/android-app",
    "Commit_ID": "abc123def456", 
    "code_snippet_path": "snippets/location_access.java",
    "violated_articles": [2, 4, 7],
    "article_scores": {
      "1": 0.1, "2": 0.9, "3": 0.2, "4": 0.8,
      "5": 0.3, "6": 0.1, "7": 0.7, "8": 0.2
    }
  }
]

Schema Requirements

Required Fields (All Tasks)

  • repo_url: String - Repository URL identifier
  • Commit_ID: String - Git commit hash or identifier
  • violated_articles: Array[Integer] - List of violated article numbers

Task 1 Specific Fields

  • file_path: String - Relative path to source file
  • module_name: String - Class/module name (module_level only)
  • line_spans: String - Line range in format "start-end" or single number (line_level only)

Task 2 Specific Fields

  • code_snippet_path: String - Path to code snippet file
  • article_scores: Object (Optional) - Per-article confidence scores for better Coverage Error computation

Important Notes

  • Article numbering: Use regulation-specific article numbers (e.g., LGPD Article 1, 2, 3...)
  • Zero handling: Use 0 to indicate "no violation" or as padding
  • Order preservation: Maintain prediction order in violated_articles for ranking metrics
  • Deduplication: Avoid duplicate article numbers within the same violation instance
  • File paths: Use forward slashes / consistently across platforms

Validation Checklist

Before running evaluation, verify your predictions (a checker sketch follows this list):

  • JSON is valid and parseable
  • All required fields are present
  • violated_articles contains only valid article numbers or 0
  • File paths match the expected Android project structure
  • No duplicate articles within the same violation instance
  • Prediction order reflects model confidence (higher confidence first)
  • For Task 2: Include article_scores if using --task2_coverage_mode scores
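
The following standalone sketch (not shipped with CompliBench) automates the basic checks for a Task 2 prediction file; field names follow the schemas above, and the article-number bound is an assumption you should adjust per regulation:

import json
import sys

REQUIRED = ("repo_url", "Commit_ID", "code_snippet_path", "violated_articles")

def check_task2_predictions(path, max_article=60):
    # max_article is a placeholder upper bound; set it to the regulation's article count
    with open(path, encoding="utf-8") as f:
        preds = json.load(f)                      # fails here if the JSON is not parseable
    problems = []
    for i, item in enumerate(preds):
        missing = [k for k in REQUIRED if k not in item]
        if missing:
            problems.append(f"item {i}: missing fields {missing}")
            continue
        arts = item["violated_articles"]
        if not all(isinstance(a, int) and 0 <= a <= max_article for a in arts):
            problems.append(f"item {i}: invalid article numbers {arts}")
        if len(arts) != len(set(arts)):
            problems.append(f"item {i}: duplicate articles in {arts}")
        if "\\" in item["code_snippet_path"]:
            problems.append(f"item {i}: use forward slashes in paths")
    return problems

if __name__ == "__main__":
    for problem in check_task2_predictions(sys.argv[1]):
        print(problem)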

Technical Implementation

Task 1 Algorithm

  1. Key Construction: Build unique identifiers for each violation instance (see the sketch after this list)

    • File: (repo_url, commit_id, file_path)
    • Module: (repo_url, commit_id, file_path, module_name)
    • Line: (repo_url, commit_id, file_path, line_spans)
  2. Order Preservation: Maintain prediction ranking for MRR/MAP/nDCG computation

    • Use stable deduplication to handle repeated predictions
    • Preserve first occurrence position for ranking metrics
  3. Relaxed Matching: Optional fallback to file-level aggregation when strict keys don't match
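
A small sketch of steps 1 and 2 (key tuples plus order-preserving deduplication); field names follow the prediction schema above, while the exact key normalization inside evaluate.py may differ:

def file_key(item, violation):
    # Step 1: unique identifier for a file-level violation instance
    return (item["repo_url"], item["Commit_ID"], violation["file_path"])

def stable_dedup(articles):
    # Step 2: drop repeated article predictions while keeping first-occurrence order,
    # so MRR/MAP/nDCG still see the model's original ranking
    seen, ordered = set(), []
    for a in articles:
        if a not in seen:
            seen.add(a)
            ordered.append(a)
    return ordered

# stable_dedup([3, 1, 3, 5, 1]) -> [3, 1, 5]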

Task 2 Algorithm

  1. Multi-label Matrix: Convert to binary matrix Y_true[samples, labels]
  2. Coverage Error: Compute ranking depth needed to cover all true labels
  3. Ranking Construction (illustrated in the sketch after this list):
    • scores mode: Use per-label confidence scores
    • pred_on_top mode: Place predictions first, maintaining order
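
A hypothetical illustration of the two ranking-construction modes (assumed behavior; tie-breaking in evaluate.py may differ):

labels = [1, 2, 3, 4, 5]
pred = {"violated_articles": [2, 4],
        "article_scores": {"1": 0.1, "2": 0.9, "3": 0.2, "4": 0.8, "5": 0.3}}

# scores mode: rank every label by its confidence score, highest first
rank_scores = sorted(labels, key=lambda a: -pred["article_scores"][str(a)])
# -> [2, 4, 5, 3, 1]

# pred_on_top mode: predicted labels first (in prediction order), remaining labels after
rest = [a for a in labels if a not in pred["violated_articles"]]
rank_pred_on_top = pred["violated_articles"] + rest
# -> [2, 4, 1, 3, 5]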

Advanced Composite Score Calculation

  1. SGS: Generalized mean (p=-1) with consistency penalty exp(-α·CV²) across granularities
  2. RCS: TOPSIS with Mahalanobis distance considering metric correlations
  3. CRGS: Geometric mean with variance penalty exp(-β·Var) across regulations
  4. OCS: Cross-task coupling with cross-regulation stability (multiple modes available)

Extensibility

Adding New Metrics

The framework supports pluggable metrics via the registry system:

# Register a new Task 2 metric
register_task2_metric("custom_metric", orientation="max")

# Metrics are automatically included in TOPSIS computation

Adding New Regulations

  1. Create regulation directory: NEW_REG/
  2. Add to REGULATIONS = ["LGPD", "PDPA", "PIPEDA", "NEW_REG"]
  3. Provide gold datasets: NEW_REG_task1_dataset.json, NEW_REG_task2_dataset.json
  4. Place model predictions in NEW_REG/Compliance_Task{1,2}/

Troubleshooting

Common Issues

Low Task 1 scores with many "matched 0/N" diagnostics:

  • Check key alignment between predictions and gold data
  • Try --relax_keys to use file-level fallback
  • Verify prediction format matches expected schema

Task 2 Coverage Error seems too high:

  • Check if predictions maintain meaningful ranking
  • Consider --task2_coverage_mode scores if confidence scores available
  • Verify label encoding consistency

RCS values interpretation:

  • RCS values are TOPSIS closeness scores and already lie in the [0,1] range
  • Higher values mean a model sits closer to the ideal point than to the anti-ideal point
  • Values consider metric correlations via Mahalanobis distance

Performance Optimization

  • Use --regulation SPECIFIC and --task SPECIFIC for focused evaluation
  • Individual model evaluation: --model specific_model_name
  • Parallel processing: Run different regulations separately and combine results

License

This project is licensed under the MIT License - see the LICENSE file for details.

Dataset and Code Attribution

  • CompliBench framework: MIT License
  • Evaluation datasets: CC BY 4.0 License
  • Android app source code: Original licenses apply (see individual repos)
  • Third-party dependencies: See requirements.txt and individual package licenses

For detailed technical specifications and implementation details, see the inline documentation in evaluate.py.
