Description
Feature and its Use Cases
What is the feature?
Introduce an optional preprocessing module that detects potential data contamination (test leakage) by scanning the dataset for n-gram overlaps with evaluation benchmarks (e.g., MMLU, HumanEval, GSM8K).
Instead of hardcoding a fixed set of benchmarks, the benchmark selection could be configurable via:
- a CLI flag (e.g., --benchmarks)
- a configuration file (e.g., benchmarks.yaml)
The pipeline could dynamically load the selected benchmark datasets (via the Hugging Face datasets library or local files), extract their text, and construct a Bloom filter from their n-grams.
During dataset processing, text chunks would be checked against this Bloom filter to detect potential overlaps with benchmark data.
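The n-gram extraction that underpins both filter construction and chunk checking can be sketched in a few lines. This is a minimal illustration assuming naive whitespace tokenization; the helper name and the 13-token default are illustrative, drawn from the window sizes suggested below.

```python
# Minimal sketch: extract word-level n-grams from a text chunk.
# Whitespace tokenization is a simplifying assumption; a real
# implementation might normalize punctuation or use a tokenizer.
def ngrams(text: str, n: int = 13) -> list[str]:
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

A chunk shorter than n tokens yields no n-grams and is trivially considered clean at this granularity.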
How would users benefit from it?
While the project currently provides cryptographic verification of dataset artifacts, it does not yet provide a mechanism to verify that the dataset is free from benchmark contamination.
This feature would allow researchers to verify that their training dataset does not overlap with specific evaluation benchmarks, improving the transparency, reliability, and credibility of model evaluation results.
What scenarios would this feature address?
- Detecting and filtering benchmark datasets to prevent test leakage.
- Allowing organizations to verify datasets against internal evaluation sets or proprietary factual checks.
- Recording which benchmark datasets were checked in the final dataset_manifest.json.
Example manifest entry:
```json
{
  "contamination_checks_passed": ["gsm8k", "mmlu"]
}
```

Additional Context
Possible Implementation Approach
1. Configuration Input
Add support for a --benchmarks CLI argument or a benchmarks.yaml configuration file.
If no configuration is provided, a default benchmark suite (e.g., MMLU, GSM8K) could be used for convenience.
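A benchmarks.yaml for this step might look like the sketch below. The schema is purely illustrative (none of these field names exist yet); the dataset ids are examples of Hugging Face identifiers.

```yaml
# Hypothetical benchmarks.yaml — schema and field names are illustrative only.
ngram_size: 13
benchmarks:
  - name: mmlu
    source: cais/mmlu   # Hugging Face dataset id
    split: test
  - name: gsm8k
    source: gsm8k
    split: test
```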
2. Filter Construction
Load the specified benchmark datasets using the Hugging Face datasets library or local files.
Extract text from benchmark questions/prompts, split them into n-grams (e.g., 10–13 tokens), and insert them into a Bloom filter.
The Bloom filter can be serialized and cached locally (e.g., filter.bin) to avoid recomputation.
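As a sketch of this step, the filter can be a plain bit array probed at k positions derived from a single digest (double hashing). Everything here is illustrative: the class, `build_filter`, and the sizing defaults are assumptions, not existing code; an off-the-shelf Bloom filter library would also work.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: a bit array probed by k hash positions."""

    def __init__(self, size_bits: int = 1 << 22, num_hashes: int = 7):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str) -> list[int]:
        # Double hashing: derive k bit positions from one SHA-256 digest.
        d = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def build_filter(benchmark_texts: list[str], n: int = 13) -> BloomFilter:
    """Insert every n-gram of every benchmark text into the filter."""
    bf = BloomFilter()
    for text in benchmark_texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            bf.add(" ".join(tokens[i:i + n]))
    return bf
```

For the caching step, `bf.bits` is a plain byte buffer, so it can be written to and read back from a file such as filter.bin alongside the size and hash-count parameters.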
3. Pipeline Integration
During dataset processing, text chunks are checked against the Bloom filter.
If a potential match is detected:
- perform an exact string comparison to avoid Bloom filter false positives
- optionally discard or redact the contaminated chunk
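The two-stage check described above might be sketched as follows. The function and parameter names are hypothetical; `approx_filter` stands in for the Bloom filter, and `exact_ngrams` is the ground-truth n-gram set used for the confirming exact comparison.

```python
def contaminated(chunk: str, approx_filter, exact_ngrams: set[str],
                 n: int = 13) -> bool:
    """Return True if any n-gram of the chunk matches benchmark data.

    Illustrative sketch: `approx_filter` is any object supporting `in`
    (e.g., a Bloom filter); `exact_ngrams` confirms hits so that Bloom
    filter false positives never flag a clean chunk.
    """
    tokens = chunk.lower().split()
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i:i + n])
        if gram in approx_filter:       # fast probabilistic membership test
            if gram in exact_ngrams:    # exact comparison rules out false positives
                return True
    return False
```

A flagged chunk can then be dropped or redacted by the caller, keeping the filtering policy separate from the detection logic.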
4. Manifest Update
Extend the dataset_manifest.json generation logic to include metadata describing:
- which benchmark datasets were used
- dataset versions
- contamination check results
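Building on the manifest entry shown earlier, the extended metadata might look like the fragment below. All fields beyond contamination_checks_passed, and the version strings, are illustrative placeholders.

```json
{
  "contamination_checks_passed": ["gsm8k", "mmlu"],
  "benchmark_versions": {
    "gsm8k": "main",
    "mmlu": "all"
  },
  "ngram_size": 13,
  "contaminated_chunks_removed": 0
}
```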
This ensures transparent and reproducible dataset verification.
Code of Conduct
- I have joined the Discord server and will post updates there
- I have searched existing issues to avoid duplicates