Description
Feature and its Use Cases
What is the feature?
Introduce an optional preprocessing module that detects potential data contamination (test leakage) by scanning the dataset for n-gram overlaps with evaluation benchmarks (e.g., MMLU, HumanEval, GSM8K).
Instead of hardcoding a fixed set of benchmarks, the benchmark selection could be configurable via:
- a CLI flag (e.g., --benchmarks)
- a configuration file (e.g., benchmarks.yaml)
The pipeline could dynamically load the selected benchmark datasets (via the Hugging Face datasets library or local files), extract their text, and construct a Bloom filter from their n-grams.
During dataset processing, text chunks would be checked against this Bloom filter to detect potential overlaps with benchmark data.
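The n-gram extraction that underpins both filter construction and chunk checking can be sketched in a few lines. This is a minimal illustration assuming naive whitespace tokenization; the helper name and the 13-token default are illustrative, drawn from the window sizes suggested below.

```python
# Minimal sketch: extract word-level n-grams from a text chunk.
# Whitespace tokenization is a simplifying assumption; a real
# implementation might normalize punctuation or use a tokenizer.
def ngrams(text: str, n: int = 13) -> list[str]:
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

A chunk shorter than n tokens yields no n-grams and is trivially considered clean at this granularity.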
How would users benefit from it?
While the project currently provides cryptographic verification of dataset artifacts, it does not yet provide a mechanism to verify that the dataset is free from benchmark contamination.
This feature would allow researchers to verify that their training dataset does not overlap with specific evaluation benchmarks, improving the transparency, reliability, and credibility of model evaluation results.
What scenarios would this feature address?
- Detecting and filtering benchmark datasets to prevent test leakage.
- Allowing organizations to verify datasets against internal evaluation sets or proprietary factual checks.
- Recording which benchmark datasets were checked in the final dataset_manifest.json.
Example manifest entry:
```json
{
  "contamination_checks_passed": ["gsm8k", "mmlu"]
}
```

Additional Context
Possible Implementation Approach
1. Configuration Input
Add support for a --benchmarks CLI argument or a benchmarks.yaml configuration file.
If no configuration is provided, a default benchmark suite (e.g., MMLU, GSM8K) could be used for convenience.
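A benchmarks.yaml for this step might look like the sketch below. The schema is purely illustrative (none of these field names exist yet); the dataset ids are examples of Hugging Face identifiers.

```yaml
# Hypothetical benchmarks.yaml — schema and field names are illustrative only.
ngram_size: 13
benchmarks:
  - name: mmlu
    source: cais/mmlu   # Hugging Face dataset id
    split: test
  - name: gsm8k
    source: gsm8k
    split: test
```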
2. Filter Construction
Load the specified benchmark datasets using the Hugging Face datasets library or local files.
Extract text from benchmark questions/prompts, split them into n-grams (e.g., 10–13 tokens), and insert them into a Bloom filter.
The Bloom filter can be serialized and cached locally (e.g., filter.bin) to avoid recomputation.
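As a sketch of this step, the filter can be a plain bit array probed at k positions derived from a single digest (double hashing). Everything here is illustrative: the class, `build_filter`, and the sizing defaults are assumptions, not existing code; an off-the-shelf Bloom filter library would also work.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: a bit array probed by k hash positions."""

    def __init__(self, size_bits: int = 1 << 22, num_hashes: int = 7):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str) -> list[int]:
        # Double hashing: derive k bit positions from one SHA-256 digest.
        d = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def build_filter(benchmark_texts: list[str], n: int = 13) -> BloomFilter:
    """Insert every n-gram of every benchmark text into the filter."""
    bf = BloomFilter()
    for text in benchmark_texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            bf.add(" ".join(tokens[i:i + n]))
    return bf
```

For the caching step, `bf.bits` is a plain byte buffer, so it can be written to and read back from a file such as filter.bin alongside the size and hash-count parameters.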
3. Pipeline Integration
During dataset processing, text chunks are checked against the Bloom filter.
If a potential match is detected:
- perform an exact string comparison to avoid Bloom filter false positives
- optionally discard or redact the contaminated chunk
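The two-stage check described above might be sketched as follows. The function and parameter names are hypothetical; `approx_filter` stands in for the Bloom filter, and `exact_ngrams` is the ground-truth n-gram set used for the confirming exact comparison.

```python
def contaminated(chunk: str, approx_filter, exact_ngrams: set[str],
                 n: int = 13) -> bool:
    """Return True if any n-gram of the chunk matches benchmark data.

    Illustrative sketch: `approx_filter` is any object supporting `in`
    (e.g., a Bloom filter); `exact_ngrams` confirms hits so that Bloom
    filter false positives never flag a clean chunk.
    """
    tokens = chunk.lower().split()
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i:i + n])
        if gram in approx_filter:       # fast probabilistic membership test
            if gram in exact_ngrams:    # exact comparison rules out false positives
                return True
    return False
```

A flagged chunk can then be dropped or redacted by the caller, keeping the filtering policy separate from the detection logic.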
4. Manifest Update
Extend the dataset_manifest.json generation logic to include metadata describing:
- which benchmark datasets were used
- dataset versions
- contamination check results
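Building on the manifest entry shown earlier, the extended metadata might look like the fragment below. All fields beyond contamination_checks_passed, and the version strings, are illustrative placeholders.

```json
{
  "contamination_checks_passed": ["gsm8k", "mmlu"],
  "benchmark_versions": {
    "gsm8k": "main",
    "mmlu": "all"
  },
  "ngram_size": 13,
  "contaminated_chunks_removed": 0
}
```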
This ensures transparent and reproducible dataset verification.
Code of Conduct
- I have joined the Discord server and will post updates there
- I have searched existing issues to avoid duplicates