The success of supervised deep learning depends on the quantity and quality of labeled training data. However, as training datasets grow larger, they inevitably contain noisy or even harmful instances, increasing computational costs and hindering model performance. Existing data selection strategies typically focus either on improving training effectiveness by selecting high-quality data or on increasing efficiency by reducing training time without sacrificing accuracy.
Although numerous data selection methods have been proposed, ranging from efficiency-focused to effectiveness-focused strategies and from model-agnostic to model-aware approaches, their performance varies significantly across datasets and tasks, with no universally superior method identified to date. This work presents the first comprehensive survey and empirical evaluation of representative data selection methods, spanning traditional deep learning and supervised fine-tuning (SFT) for large language models (LLMs). We introduce a unified open-source testbed that implements 15 methods and systematically evaluates them across diverse datasets and model architectures. Furthermore, we extend the evaluation to SFT for LLMs, benchmarking 10 popular methods on 4 downstream tasks. Our results provide actionable insights into the strengths, limitations, and practical trade-offs of different strategies, offering valuable guidance to researchers and practitioners in selecting appropriate data selection methods for their specific scenarios.
- Overview
- Architecture
- Installation
- Quick Start
- Deep Learning Data Selection
- LLMs SFT Data Selection
- Datasets
- Evaluation Metrics
- Configuration
- Usage Examples
- Running Experiments
- Contributing
- License
This testbed provides a comprehensive framework for evaluating data selection strategies across diverse datasets and model architectures. It consists of three main modules:
- Configuration Loader: Allows users to configure datasets, models, parameters, data selection strategies, and experimental controls (e.g., logging).
- Data Selector: Offers 15 data selection strategies for efficient and effective training. It runs the strategy specified in the user configuration to select a subset of the data and passes it to the Model Evaluator. For iterative strategies, the Data Selector and Model Evaluator interact to continuously improve model performance.
- Model Evaluator: Trains the target model on the selected data subset and evaluates its performance on the test set, measuring both accuracy and latency.
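As an illustration of how these modules fit together, the pipeline can be sketched as follows. All names here are hypothetical stand-ins, not the testbed's actual API; the selector is a random-sampling placeholder for the implemented strategies.

```python
import random

def load_config():
    # Configuration Loader: dataset, model, strategy, and experimental controls
    return {"method": "random", "selection_fraction": 0.1, "rounds": 1}

def select_subset(train_set, fraction, seed=0):
    # Data Selector: random sampling stands in for the 15 real strategies
    rng = random.Random(seed)
    k = max(1, int(len(train_set) * fraction))
    return [train_set[i] for i in rng.sample(range(len(train_set)), k)]

def train_and_evaluate(subset):
    # Model Evaluator: train on the subset, then report accuracy and latency
    return {"n_train": len(subset), "accuracy": None, "latency_s": None}

config = load_config()
train_set = list(range(1000))        # stand-in for a labeled training set
for _ in range(config["rounds"]):    # iterative strategies repeat this loop
    subset = select_subset(train_set, config["selection_fraction"])
    report = train_and_evaluate(subset)
print(report["n_train"])  # -> 100
```

For one-shot strategies the loop runs once; iterative strategies would re-select based on the evaluator's feedback each round.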
The testbed is designed with a modular architecture that enables seamless switching between different methods and easy integration of new data selection strategies, datasets, and model architectures for benchmarking or research purposes.
- Modular Design: Easy integration of new methods and datasets
- Seamless Switching: Quick comparison between different strategies
- Scalable Framework: Support for various model architectures and datasets
- Python 3.7+ - Core programming language
- PyTorch 1.0+ - Deep learning framework
- CUDA (optional) - GPU acceleration support
```bash
# Clone the repository
git clone <repository-url>
cd data_selection_lib

# Install dependencies
pip install -r requirements.txt
```

Get started in minutes with our comprehensive data selection framework!
```python
import torch
from torch.utils.data import DataLoader

import nets
from mydatasets import adult
from selection.kmeans import KMeansSelection
from model_utils import train_model, evaluate_model

# Basic configuration
data_path = "mydatasets"
method = 'kmeans'
task_type = 'classification'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load dataset
channel, im_size, num_classes, class_names, mean, std, dst_train, dst_val, dst_test = adult(data_path, balance_target=True)

# Initialize model
base_model = nets.__dict__['MLP'](
    channel=channel,
    num_classes=num_classes,
    im_size=im_size,
    pretrained=False
)

# Data selection
selector = KMeansSelection()
core_loader = selector.select(
    model=base_model,
    selection_fraction=0.1,
    k=num_classes,
    train_set=dst_train,
    task_type=task_type,
    device=device
)

# Train and evaluate
trained_model = train_model(
    model=base_model,
    train_loader=core_loader,
    val_loader=DataLoader(dst_val, batch_size=128, shuffle=False),
    device=device,
    task_type=task_type,
    epochs=100,
    patience=10
)

# Evaluate model performance
evaluate_model(
    model=trained_model,
    test_loader=DataLoader(dst_test, batch_size=128, shuffle=False),
    device=device,
    task_type=task_type
)
```

Our testbed implements 15 representative data selection strategies covering both efficient and effective training approaches:
- Model-Agnostic Methods: Work with any model architecture
- Model-Aware Methods: Leverage model-specific information for better selection
| Method | Category | Description | Implementation |
|---|---|---|---|
| K-Means | Model-agnostic | Selection using the K-means clustering algorithm | selection/kmeans.py |
| Herding | Model-agnostic | Selection using the herding algorithm | selection/herding.py |
| K-Center Greedy | Model-agnostic | Greedy k-center clustering for data selection | selection/kcentergreedy.py |
| Random | Model-agnostic | Random sampling baseline | selection/random.py |
| Full | Model-agnostic | Full-dataset training baseline | selection/full.py |
| CRAIG | Model-aware | Gradient-based coreset selection | selection/craig.py |
| GradMatch | Model-aware | Gradient matching for data selection | selection/gradmatch.py |
| GLISTER | Model-aware | Generalization-based bilevel subset selection | selection/glister.py |
| CREST | Model-aware | Coreset selection with gradient-based scoring | selection/crest.py |
| TracIn | Model-aware | Training-data influence estimation | selection/TracIn.py |
| LiSSA | Model-aware | Linear-time second-order influence algorithm | selection/LiSSA.py |
| Arnoldi | Model-aware | Arnoldi iteration for influence computation | selection/Arnoldi.py |
| TMCS | Model-aware | Truncated Monte Carlo Shapley value estimation | selection/TMCS.py |
| KNN-Shapley | Model-aware | K-nearest-neighbor Shapley value estimation | selection/KNN_shapley.py |
| CS-Shapley | Model-aware | Class-wise Shapley value computation | selection/cs_shapley.py |
| G-Shapley | Model-aware | Gradient-based Shapley value estimation | selection/G_shapley.py |
| DVRL | Model-aware | Data valuation using reinforcement learning | selection/dvrl.py |
| DBAL | Model-aware | Deep Bayesian active learning | selection/dbal.py |
| DFAL | Model-aware | DeepFool-based active learning | selection/dfal.py |
| Boundary Aware | Model-aware | Boundary-aware data selection | selection/boundary_aware.py |
| CG-Influence | Model-aware | Conjugate-gradient influence estimation | selection/cg_influence.py |
| SVP Max Entropy | Model-aware | Maximum-entropy selection via a proxy model | selection/svp_max_entropy.py |
The testbed supports various model architectures:
- CNNs: ResNet, VGG, AlexNet, InceptionV3, MobileNetV3, WideResNet, ViT
- MLPs: Standard MLP, TabNet, TabTransformer
- RNNs: LSTM
- Transformers: BERT, T5
- Specialized: DSN, EfficientNet
In-distribution (ID) datasets:
- CIFAR-100 - Large-scale image classification (100 classes)
- TinyImageNet - Image classification with rich visual diversity (200 classes)
- Adult - Census income prediction (tabular data)
- IMDB-Large - Large-scale text classification (>20M records)
- Covtype - Large-scale multiclass classification (tabular data)
- Bike - Regression benchmark (tabular data)
- IMDB - Binary text classification
- News - Multiclass text classification
Out-of-distribution (OOD) datasets:
- MNIST→MNIST-M - Domain adaptation (synthetic backgrounds)
- MNIST-M→MNIST - Reverse domain adaptation
- HR - Employee attrition prediction (cross-departmental)
- House - House price prediction (cross-city)
This section is dedicated to methods for selecting high-quality data for supervised fine-tuning (SFT) of Large Language Models (LLMs). Each method reflects a different perspective on what constitutes "valuable" data.
| Method | Family | Goal | Implementation Path / Link |
|---|---|---|---|
| DSIR | Geometry | Task-specific | dsir |
| BM25 | Geometry | Task-specific | sft/data-select/bm25.py |
| RDS+ | Gradient | Task-specific | sft/data-select/rds.py |
| LESS | Influence | Task-specific | less |
| SHED | Shapley | Task-specific | shed |
| Superfilter | Uncertainty | General instruction | superfilter |
| PPL Score | Uncertainty | General instruction | sft/data-select/ppl_score.py |
| NLL Score | Uncertainty | General instruction | sft/data-select/nll_score.py |
| SelectIT | Uncertainty | General instruction | selectit |
| TAGCOS | Gradient | General instruction | tagcos |
We recommend using the following standardized CLI arguments for all methods:
- `--input_file`: Path to the input dataset (e.g., merged.jsonl)
- `--model_path`: Pretrained model path (e.g., LLaMA, Mistral, etc.)
- `--output_file`: Path to the output filtered dataset
- `--select_ratio`: Ratio of data to select (for scoring-based methods)
- `--batch_size`: Batch size for inference-based scoring
- `--query_file`: (For retrieval-based) Query set to match from
- `--top_n`: (For retrieval-based) Top-k examples to retrieve
Below are usage examples per method, using the standardized argument names.
PPL Score:
```bash
torchrun --nproc_per_node=4 sft/data-select/ppl_score.py \
    --model_path /path/to/mistral-7b \
    --input_file /path/to/merged.jsonl \
    --output_file ppl_scores.jsonl \
    --batch_size 8
```

NLL Score:
```bash
torchrun --nproc_per_node=8 sft/data-select/nll_score.py \
    --model_path /path/to/llama2-7b \
    --input_file /path/to/merged.jsonl \
    --output_file nll_scores.jsonl \
    --batch_size 16
```

BM25:
```bash
python sft/data-select/bm25.py \
    --input_file /path/to/documents.jsonl \
    --query_file /path/to/queries.jsonl \
    --output_file bm25_top100.jsonl \
    --top_n 100
```

RDS+:
```bash
python sft/data-select/rds.py \
    --input_file /path/to/documents.jsonl \
    --query_file /path/to/queries.jsonl \
    --output_file rds_top100.jsonl \
    --top_n 100 \
    --model_path bert-base-uncased
```

For third-party methods (DSIR, LESS, SHED, Superfilter, SelectIT, TAGCOS), please refer to their official repositories for setup and usage.
| Dataset | Type | Modality | Classes | Dataset Size | Distribution |
|---|---|---|---|---|---|
| CIFAR-100 | Classification | Image | 100 | 60,000 | ID |
| TinyImageNet | Classification | Image | 200 | 110,000 | ID |
| Adult | Classification | Tabular | 2 | 48,842 | ID |
| IMDB-Large | Classification | Text | 2 | 22,500,000+ | ID |
| Covtype | Classification | Tabular | 7 | 581,012 | ID |
| Bike | Regression | Tabular | - | 17,358 | ID |
| IMDB | Classification | Text | 2 | 50,000 | ID |
| News | Classification | Text | 4 | 142,170 | ID |
| MNIST→MNIST-M | Classification | Image | 10 | 70,000 | OOD |
| MNIST-M→MNIST | Classification | Image | 10 | 70,000 | OOD |
| HR | Classification | Tabular | 2 | 12,500 | OOD |
| House | Regression | Tabular | - | 18,750 | OOD |
- In-Distribution (ID): Training and test sets share the same data distribution
- Out-of-Distribution (OOD): Training and test sets come from different distributions
- Train/Val/Test Split: 80%/10%/10% for ID datasets
- OOD Split: Following Data Shapley methodology
Classification Tasks: F1-score
- F1-score = 2 × (Precision × Recall) / (Precision + Recall)

Regression Tasks: Mean Squared Error (MSE)
- MSE = (1/n) × Σ(y − y_predict)²

Efficiency metrics:
- Data Selection Time: Time required for subset selection
- Model Training Time: Total training time with selected subset
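Under the definitions above, the two quality metrics can be computed directly; this is a minimal plain-Python sketch (in practice a library such as scikit-learn would typically be used), not the testbed's own implementation:

```python
def f1_score(tp, fp, fn):
    # F1 = 2 * (precision * recall) / (precision + recall)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mse(y_true, y_pred):
    # MSE = (1/n) * sum((y - y_predict)^2)
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

print(f1_score(tp=8, fp=2, fn=2))             # precision = recall = 0.8 -> 0.8
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # (0 + 0 + 4) / 3
```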
The testbed uses YAML configuration files for easy customization:
- Flexible Configuration: Easy parameter tuning
- Modular Settings: Separate configs for different components
- Batch Experimentation: Run multiple experiments with different parameters
The testbed uses experiments.yaml for experiment configuration and experiment_runner.py for batch execution:
```yaml
# experiments.yaml - Data Selection Experiment Configuration
general:
  # Dataset Configuration
  data_path: "mydatasets"
  dataset: "adult"
  balance_target: true

  # Model Configuration
  model_name: "MLP"
  task_type: "classification"
  device: "auto"  # auto, cpu, cuda

  # Training Parameters
  epochs: 20
  batch_size: 128
  patience: 8
  selection_lr: 0.001
  selection_momentum: 0.9
  selection_weight_decay: 0.0005

# Experiment List - Multiple Methods and Fractions Comparison
experiments:
  - name: "arnoldi_10percent"
    method: "arnoldi"
    selection_fraction: 0.1
    recursion_depth: 20
    damping: 0.01
    scale: 25.0
    num_test_samples: 1
    pretrain_epochs: 3

  - name: "dvrl_05percent"
    method: "dvrl"
    selection_fraction: 0.05
    num_epochs: 10
    learning_rate: 0.001

# Results Save Configuration
results:
  save_path: "results"
  log_path: "logs"
  models_path: "best_models"
  comparison_file: "comparison_results.csv"
```

```bash
python main.py --method kmeans --dataset adult --model MLP --selection_fraction 0.1

# Run all methods on a dataset
python run_experiments.py --dataset adult --methods kmeans,herding,craig,glister
```

The testbed provides a powerful batch experimentation system using YAML configuration:
```bash
# Run all experiments defined in experiments.yaml
python experiment_runner.py

# Use a custom configuration file
python experiment_runner.py custom_experiments.yaml
```

Key Features:
- Batch Execution: Run multiple experiments with different parameters
- Detailed Logging: Each experiment gets its own log file
- Results Comparison: Automatic CSV export with performance metrics
- Parameter Sweeping: Easy testing of different selection fractions and methods
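Conceptually, the batch runner merges the shared `general` block with each entry in `experiments` and runs them in sequence. The sketch below inlines a trimmed config as a dict to show the merge; function names and the stand-in runner are illustrative, not the actual `experiment_runner.py` code:

```python
config = {
    "general": {"dataset": "adult", "model_name": "MLP", "epochs": 20},
    "experiments": [
        {"name": "arnoldi_10percent", "method": "arnoldi", "selection_fraction": 0.1},
        {"name": "dvrl_05percent", "method": "dvrl", "selection_fraction": 0.05},
    ],
}

def run_all(config, run_one):
    results = {}
    for exp in config["experiments"]:
        # per-experiment settings override the shared general settings
        settings = {**config["general"], **exp}
        results[exp["name"]] = run_one(settings)
    return results

# stand-in for training + evaluating a single experiment
summary = run_all(config, lambda s: (s["method"], s["selection_fraction"]))
print(summary["dvrl_05percent"])  # -> ('dvrl', 0.05)
```

This dict-merge pattern is what makes parameter sweeping cheap: each experiment entry only lists the settings that differ from `general`.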
```python
from mydatasets import YourDataset
from methods.your_method import YourSelectionMethod

# Load your dataset
dataset = YourDataset(data_path)

# Use your selection method
selector = YourSelectionMethod()
selected_data = selector.select(dataset, fraction=0.1)
```

Hardware used for the experiments:
- GPU: 4× NVIDIA RTX 3090 (24GB each)
- CPU: Dual Intel Xeon Gold 6148 (80 threads total)
- RAM: 1 TiB system memory
The testbed provides comprehensive experiment management through:
- Configuration-Driven: Define experiments in YAML files
- Batch Processing: Run multiple experiments automatically
- Result Tracking: Automatic logging and result comparison
- Parameter Sweeping: Test different selection fractions and methods
After running experiments, results are automatically saved in multiple formats:
- CSV Comparison: Multi-index format for easy analysis
- Individual Logs: Detailed logs for each experiment
- Model Checkpoints: Best models saved for each experiment
- Performance Metrics: Accuracy, F1-score, timing information
We employ Optuna for hyperparameter optimization using grid search. See the Appendix for detailed hyperparameter configurations.
- Automated Tuning: Optimize parameters automatically
- Grid Search: Systematic parameter exploration
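While the testbed delegates the search to Optuna, grid search itself is simple: enumerate the Cartesian product of candidate values and keep the best-scoring combination. A self-contained sketch with a hypothetical objective (a stand-in for "train with these parameters, return the validation score"):

```python
from itertools import product

# candidate values for two of the testbed's training parameters
grid = {
    "selection_lr": [0.001, 0.01],
    "selection_momentum": [0.9, 0.99],
}

def objective(params):
    # Hypothetical score function; peaks at lr=0.01, momentum=0.9
    return -abs(params["selection_lr"] - 0.01) - abs(params["selection_momentum"] - 0.9)

best_params, best_score = None, float("-inf")
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = objective(params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params)  # -> {'selection_lr': 0.01, 'selection_momentum': 0.9}
```

Optuna adds pruning, logging, and parallelism on top of this basic loop.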
We welcome contributions! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Open Source: Community-driven development
- Easy Integration: Simple process for adding new methods
To add a new data selection method:
1. Create a new file in the selection/ directory
2. Implement the selection interface
3. Add the method to import_methods.py
4. Update the documentation
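Following the `selector.select(...)` pattern shown in the Quick Start, a new strategy might look like the sketch below. This is a hedged illustration: the exact interface (required arguments, return type) is defined by the repository, and random sampling stands in for a real scoring rule.

```python
import random

class MyRandomSelection:
    """Illustrative strategy for selection/; the real interface may also
    take model, device, task_type, and other keyword arguments."""

    def select(self, train_set, selection_fraction, seed=0, **kwargs):
        # Score/choose indices; here, uniform random sampling as a placeholder
        rng = random.Random(seed)
        n = len(train_set)
        k = max(1, int(n * selection_fraction))
        indices = rng.sample(range(n), k)
        return [train_set[i] for i in indices]

selector = MyRandomSelection()
subset = selector.select(list(range(100)), selection_fraction=0.1)
print(len(subset))  # -> 10
```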
To add new experiments:
1. Update experiments.yaml with new experiment configurations
2. Add a method mapping in experiment_runner.py if needed
3. Test with a single experiment before batch running
4. Verify results in the generated CSV files
To add a new dataset:
1. Create a new file in the mydatasets/ directory
2. Implement the dataset interface
3. Add preprocessing and loading functions
4. Update the dataset registry
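A new loader in `mydatasets/` should return the same tuple the Quick Start unpacks from `adult(...)`. The sketch below mirrors that signature with dummy in-memory splits; a real loader would read and preprocess files under `data_path`, and the dataset name and contents here are entirely hypothetical.

```python
def my_tabular_dataset(data_path, balance_target=False):
    # Mirrors the Quick Start's adult(...) return signature; the splits
    # below are dummy placeholders rather than data read from data_path.
    channel, im_size = 1, (1, 14)    # tabular: 14 features as a flat "image"
    num_classes = 2
    class_names = ["negative", "positive"]
    mean, std = [0.0], [1.0]         # normalization statistics
    rows = [([0.0] * 14, i % num_classes) for i in range(100)]
    # 80/10/10 split, matching the testbed's ID split convention
    dst_train, dst_val, dst_test = rows[:80], rows[80:90], rows[90:]
    return (channel, im_size, num_classes, class_names,
            mean, std, dst_train, dst_val, dst_test)

out = my_tabular_dataset("mydatasets")
print(len(out[6]), len(out[7]), len(out[8]))  # -> 80 10 10
```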
This project is licensed under the MIT License - see the LICENSE file for details.
Note: This is the first open-source testbed for comprehensive data selection evaluation. We hope it will facilitate research and development in the field of efficient and effective data selection for machine learning.

