[FEATURE]: Data Contamination Detection via Configurable Benchmark Filtering #50

@DhruvK278

Description

Feature and its Use Cases

What is the feature?

Introduce an optional preprocessing module that detects potential data contamination (test leakage) by scanning the dataset for n-gram overlaps with evaluation benchmarks (e.g., MMLU, HumanEval, GSM8K).

Instead of hardcoding a fixed set of benchmarks, the benchmark selection could be configurable via:

  • a CLI flag (e.g., --benchmarks)
  • or a configuration file (e.g., benchmarks.yaml)

The pipeline could dynamically load the selected benchmark datasets (via the Hugging Face datasets library or local files), extract their text, and construct a Bloom filter from their n-grams.

During dataset processing, text chunks would be checked against this Bloom filter to detect potential overlaps with benchmark data.
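The n-gram/Bloom-filter check described above can be sketched as follows. This is a minimal pure-Python stand-in for illustration; a production implementation would likely use a C-backed bit-array library (e.g. pybloom-live), and the class and function names here are illustrative, not part of any existing API:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: a bit array probed by k salted hashes."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different salts.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, but never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

def ngrams(text, n=10):
    """Yield whitespace-token n-grams of a text as strings."""
    tokens = text.split()
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i : i + n])
```

Benchmark texts would be inserted via `add()`, and each dataset chunk's n-grams checked with `in`; whitespace tokenization here is a simplification of whatever tokenizer the pipeline settles on.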


How would users benefit from it?

While the project currently provides cryptographic verification of dataset artifacts, it does not yet provide a mechanism to verify that the dataset is free from benchmark contamination.

This feature would allow researchers to verify that their training dataset does not overlap with specific evaluation benchmarks, improving the transparency, reliability, and credibility of model evaluation results.


What scenarios would this feature address?

  • Detecting and filtering benchmark datasets to prevent test leakage.
  • Allowing organizations to verify datasets against internal or proprietary evaluation sets.
  • Recording which benchmark datasets were checked in the final dataset_manifest.json.

Example manifest entry:

{
  "contamination_checks_passed": ["gsm8k", "mmlu"]
}

Additional Context

Possible Implementation Approach

1. Configuration Input

Add support for a --benchmarks CLI argument or a benchmarks.yaml configuration file.

If no configuration is provided, a default benchmark suite (e.g., MMLU, GSM8K) could be used for convenience.
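A sketch of the configuration input using argparse. Only the `--benchmarks` flag is named in this issue; the `--benchmarks-config` flag, the `DEFAULT_BENCHMARKS` list, and the function name are illustrative assumptions:

```python
import argparse

# Fallback suite used when no configuration is supplied (per the issue text).
DEFAULT_BENCHMARKS = ["mmlu", "gsm8k"]

def parse_benchmark_args(argv=None):
    """Parse benchmark selection from the CLI, falling back to a default suite."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--benchmarks",
        nargs="+",
        default=None,
        help="Benchmark dataset names to screen against (e.g. mmlu gsm8k)",
    )
    parser.add_argument(
        "--benchmarks-config",
        default=None,
        help="Optional path to a benchmarks.yaml file (hypothetical flag)",
    )
    args = parser.parse_args(argv)
    if args.benchmarks is None and args.benchmarks_config is None:
        args.benchmarks = DEFAULT_BENCHMARKS
    return args
```

When both sources are given, the CLI flag would presumably override the file; that precedence is a design choice left open by the issue.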


2. Filter Construction

Load the specified benchmark datasets using the Hugging Face datasets library or local files.

Extract text from benchmark questions/prompts, split them into n-grams (e.g., 10–13 tokens), and insert them into a Bloom filter.

The Bloom filter can be serialized and cached locally (e.g., filter.bin) to avoid recomputation.
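The build-and-cache step could look like the sketch below. For brevity a plain set of n-grams stands in for the Bloom filter (the serialization logic is the point here); the function name and the `filter.bin` cache path follow the issue text, everything else is illustrative:

```python
import os
import pickle

def ngram_set(text, n=10):
    """Whitespace-token n-grams of one text, as a set of strings."""
    tokens = text.split()
    return {" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}

def build_or_load_filter(texts, cache_path="filter.bin", n=10):
    """Build the n-gram filter over benchmark texts, serializing it to
    cache_path so later runs skip recomputation.

    `texts` could come from e.g.
    datasets.load_dataset("gsm8k", "main", split="test")["question"],
    or from local files. A set stands in for the Bloom filter here.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    grams = set()
    for text in texts:
        grams |= ngram_set(text, n=n)
    with open(cache_path, "wb") as fh:
        pickle.dump(grams, fh)
    return grams
```

The cache key should probably also encode the benchmark list, dataset versions, and n-gram size, so a stale filter is not reused after the configuration changes.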


3. Pipeline Integration

During dataset processing, text chunks are checked against the Bloom filter.

If a potential match is detected:

  • perform an exact string comparison to avoid Bloom filter false positives
  • optionally discard or redact the contaminated chunk
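The two-stage check above (probabilistic pre-check, then exact confirmation) can be sketched like this; `screen_chunk` and its return shape are illustrative names, and `candidate_filter` can be any object supporting `in` (the Bloom filter, or a set in tests):

```python
def screen_chunk(chunk, candidate_filter, benchmark_ngrams, n=10):
    """Screen one text chunk for benchmark overlap.

    candidate_filter -- fast probabilistic membership check (Bloom filter)
    benchmark_ngrams -- exact set of benchmark n-grams used to confirm hits
    Returns (keep, matches): keep is False when the chunk should be
    discarded or redacted, matches lists the confirmed overlapping n-grams.
    """
    tokens = chunk.split()
    matches = []
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i : i + n])
        if gram in candidate_filter:       # fast probabilistic pre-check
            if gram in benchmark_ngrams:   # exact comparison: rules out false positives
                matches.append(gram)
    return len(matches) == 0, matches
```

Keeping the exact n-gram set (or an on-disk index of it) alongside the Bloom filter is the cost of eliminating false positives; only chunks flagged by the filter ever touch it, so the common path stays cheap.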

4. Manifest Update

Extend the dataset_manifest.json generation logic to include metadata describing:

  • which benchmark datasets were used
  • dataset versions
  • contamination check results

This ensures transparent and reproducible dataset verification.
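The manifest extension might look like the sketch below. The `contamination_checks_passed` field comes from the example earlier in this issue; the `contamination_checks` field, the `results` shape, and the function name are assumptions:

```python
def add_contamination_metadata(manifest, results):
    """Extend a manifest dict with contamination-check metadata.

    `results` maps benchmark name -> {"version": ..., "passed": ...}.
    Records which benchmarks were checked, their versions, and the
    per-benchmark outcome, plus the flat list of passed checks.
    """
    manifest["contamination_checks"] = {
        name: {"version": info["version"], "passed": info["passed"]}
        for name, info in results.items()
    }
    manifest["contamination_checks_passed"] = sorted(
        name for name, info in results.items() if info["passed"]
    )
    return manifest
```

Sorting the passed list keeps the serialized dataset_manifest.json deterministic across runs, which matters for reproducible verification.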

Code of Conduct

  • I have joined the Discord server and will post updates there
  • I have searched existing issues to avoid duplicates

Metadata

Labels: enhancement (New feature or request)