
Data Selection Testbed


Data Selection Survey

The success of supervised deep learning depends on the quantity and quality of labeled training data. However, as training datasets grow larger, they inevitably contain noisy or even harmful instances, increasing computational costs and hindering model performance. Existing data selection strategies typically focus on improving training effectiveness by selecting high-quality data or increasing efficiency by reducing training time without sacrificing accuracy.

Although numerous data selection methods have been proposed, ranging from efficiency-focused to effectiveness-focused strategies and from model-agnostic to model-aware approaches, their performance varies significantly across datasets and tasks, with no universally superior method identified to date. This work presents the first comprehensive survey and empirical evaluation of representative data selection methods, spanning traditional deep learning and supervised fine-tuning (SFT) for large language models (LLMs). We introduce a unified open-source testbed that implements 15 methods and systematically evaluates them across diverse datasets and model architectures. Furthermore, we extend the evaluation to SFT for LLMs, benchmarking 10 popular methods on 4 downstream tasks. Our results provide actionable insights into the strengths, limitations, and practical trade-offs of different strategies, offering valuable guidance to researchers and practitioners in selecting the appropriate data selection methods for their specific scenarios.


🎯 Overview

This testbed provides a comprehensive framework for evaluating data selection strategies across diverse datasets and model architectures. It consists of three main modules:

🔧 Configuration Loader: Allows users to configure datasets, models, parameters, data selection strategies, and experimental controls (e.g., logging).

🎯 Data Selector: Offers 15 data selection strategies for efficient and effective training. It runs the data selection strategy specified in the user configuration to select a subset of data and passes it to the Model Evaluator. For iterative strategies, the Data Selector and Model Evaluator interact to continuously improve model performance.

📊 Model Evaluator: Trains the target model using the selected data subset and evaluates its performance on the test set, measuring both accuracy and latency.
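The interaction between the three modules can be sketched roughly as follows. All names here are illustrative stand-ins rather than the testbed's actual API, and the random-sampling baseline stands in for whichever strategy the configuration selects:

```python
import random

def load_config():
    # Configuration Loader: dataset, model, strategy, and experimental controls
    return {"method": "random", "selection_fraction": 0.1, "seed": 0}

def select_subset(train_set, cfg):
    # Data Selector: here, the random-sampling baseline stands in for
    # any of the implemented strategies
    rng = random.Random(cfg["seed"])
    k = max(1, int(len(train_set) * cfg["selection_fraction"]))
    return rng.sample(train_set, k)

def train_and_evaluate(subset, test_set):
    # Model Evaluator stub: would train on the subset and report
    # accuracy and latency on the test set
    return {"n_train": len(subset), "n_test": len(test_set)}

cfg = load_config()
train_set = list(range(1000))
subset = select_subset(train_set, cfg)
report = train_and_evaluate(subset, list(range(100)))
print(report)  # {'n_train': 100, 'n_test': 100}
```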

Data Selection Workflow

πŸ—οΈ Architecture

πŸ›οΈ The testbed is designed with a modular architecture that enables seamless switching between different methods and makes it easy to integrate new strategies for benchmarking or research purposes. The modular design facilitates easy integration of new data selection methods, datasets, and model architectures.

πŸ”Œ Modular Design: Easy integration of new methods and datasets πŸ”„ Seamless Switching: Quick comparison between different strategies
πŸ“ˆ Scalable Framework: Support for various model architectures and datasets

🚀 Installation

📋 Prerequisites

  • 🐍 Python 3.7+ - Core programming language
  • 🔥 PyTorch 1.0+ - Deep learning framework
  • ⚡ CUDA (optional) - GPU acceleration support

⚙️ Setup

# 📥 Clone the repository
git clone <repository-url>
cd data_selection_lib

# 📦 Install dependencies
pip install -r requirements.txt

⚡ Quick Start

🚀 Get started in minutes with our comprehensive data selection framework!

import torch
from torch.utils.data import DataLoader
import nets
from mydatasets import adult
from selection.kmeans import KMeansSelection
from model_utils import train_model, evaluate_model

# ⚙️ Basic configuration
data_path = "mydatasets"
method = 'kmeans'
task_type = 'classification'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 📊 Load dataset
channel, im_size, num_classes, class_names, mean, std, dst_train, dst_val, dst_test = adult(data_path, balance_target=True)

# 🤖 Initialize model
base_model = nets.__dict__['MLP'](
    channel=channel,
    num_classes=num_classes,
    im_size=im_size,
    pretrained=False
)

# 🎯 Data selection
selector = KMeansSelection()
core_loader = selector.select(
    model=base_model,
    selection_fraction=0.1,
    k=num_classes,
    train_set=dst_train,
    task_type=task_type,
    device=device
)

# 🏋️ Train and evaluate
trained_model = train_model(
    model=base_model,
    train_loader=core_loader,
    val_loader=DataLoader(dst_val, batch_size=128, shuffle=False),
    device=device,
    task_type=task_type,
    epochs=100,
    patience=10
)

# 📈 Evaluate model performance
evaluate_model(
    model=trained_model,
    test_loader=DataLoader(dst_test, batch_size=128, shuffle=False),
    device=device,
    task_type=task_type
)

🧠 Deep Learning Data Selection

🎯 Our testbed implements 15 representative data selection strategies covering both efficient and effective training approaches:

  • 💡 Model-Agnostic Methods: Work with any model architecture
  • 🧠 Model-Aware Methods: Leverage model-specific information for better selection

Implemented Methods

| Method | Category | Description | Implementation |
|---|---|---|---|
| K-Means | Model-agnostic | Selection using the K-means algorithm | ✅ selection/kmeans.py |
| Herding | Model-agnostic | Selection using the herding algorithm | ✅ selection/herding.py |
| K-Center Greedy | Model-agnostic | Greedy k-center clustering for data selection | ✅ selection/kcentergreedy.py |
| Random | Model-agnostic | Random sampling baseline | ✅ selection/random.py |
| Full | Model-agnostic | Full-dataset training baseline | ✅ selection/full.py |
| CRAIG | Model-aware | Gradient-based core-set selection | ✅ selection/craig.py |
| GradMatch | Model-aware | Gradient matching for data selection | ✅ selection/gradmatch.py |
| GLISTER | Model-aware | Generalization-based subset selection | ✅ selection/glister.py |
| CREST | Model-aware | Core-set selection with gradient-based scoring | ✅ selection/crest.py |
| TracIn | Model-aware | Training data influence estimation | ✅ selection/TracIn.py |
| LiSSA | Model-aware | Linear-time second-order influence approximation | ✅ selection/LiSSA.py |
| Arnoldi | Model-aware | Arnoldi iteration for influence computation | ✅ selection/Arnoldi.py |
| TMCS | Model-aware | Truncated Monte Carlo Shapley value estimation | ✅ selection/TMCS.py |
| KNN-Shapley | Model-aware | K-nearest-neighbors Shapley value estimation | ✅ selection/KNN_shapley.py |
| CS-Shapley | Model-aware | Class-wise Shapley value computation | ✅ selection/cs_shapley.py |
| G-Shapley | Model-aware | G-Shapley value estimation | ✅ selection/G_shapley.py |
| DVRL | Model-aware | Data valuation using reinforcement learning | ✅ selection/dvrl.py |
| DBAL | Model-aware | Deep Bayesian active learning | ✅ selection/dbal.py |
| DFAL | Model-aware | DeepFool-based active learning | ✅ selection/dfal.py |
| Boundary Aware | Model-aware | Boundary-aware data selection | ✅ selection/boundary_aware.py |
| CG-Influence | Model-aware | Conjugate-gradient influence estimation | ✅ selection/cg_influence.py |
| SVP Max Entropy | Model-aware | Maximum entropy scoring with a proxy model | ✅ selection/svp_max_entropy.py |

🤖 Supported Models

🖼️ The testbed supports various model architectures:

  • 🔧 CNNs: ResNet, VGG, AlexNet, InceptionV3, MobileNetV3, WideResNet
  • 🧠 MLPs: Standard MLP, TabNet, TabTransformer
  • 🔄 RNNs: LSTM
  • ⚡ Transformers: BERT, T5, ViT
  • 🎯 Specialized: DSN, EfficientNet

📊 Supported Datasets

🎯 In-Distribution (ID) Datasets

  • 🖼️ CIFAR-100 - Large-scale image classification (100 classes)
  • 🖼️ TinyImageNet - Image classification with rich visual diversity (200 classes)
  • 📊 Adult - Census income prediction (tabular data)
  • 📝 IMDB-Large - Large-scale text classification (>20M records)
  • 📊 Covtype - Large-scale multiclass classification (tabular data)
  • 🚴 Bike - Regression benchmark (tabular data)
  • 📝 IMDB - Binary text classification
  • 📰 News - Multiclass text classification

🔄 Out-of-Distribution (OOD) Datasets

  • 🔄 MNIST→MNIST-M - Domain adaptation (synthetic backgrounds)
  • 🔄 MNIST-M→MNIST - Reverse domain adaptation
  • 👥 HR - Employee attrition prediction (cross-departmental)
  • 🏠 House - House price prediction (cross-city)

🤖 LLM SFT Data Selection

This section is dedicated to methods for selecting high-quality data for supervised fine-tuning (SFT) of Large Language Models (LLMs). Each method reflects a different perspective on what constitutes "valuable" data.

πŸ” Methods Comparison Table

| Method | Family | Goal | Implementation Path / Link |
|---|---|---|---|
| DSIR | ● Geometry | Task-specific | dsir |
| BM25 | ● Geometry | Task-specific | sft/data-select/bm25.py |
| RDS+ | ■ Gradient | Task-specific | sft/data-select/rds.py |
| LESS | ◆ Influence | Task-specific | less |
| SHED | ▼ Shapley | Task-specific | shed |
| Superfilter | ➕ Uncertainty | General instruction | superfilter |
| PPL Score | ➕ Uncertainty | General instruction | sft/data-select/ppl_score.py |
| NLL Score | ➕ Uncertainty | General instruction | sft/data-select/nll_score.py |
| SelectIT | ➕ Uncertainty | General instruction | selectit |
| TAGCOS | ■ Gradient | General instruction | tagcos |

🧰 Unified CLI Parameters

We recommend using the following standardized CLI arguments for all methods:

  • --input_file: Path to the input dataset (e.g., merged .jsonl)
  • --model_path: Pretrained model path (e.g., LLaMA, Mistral, etc.)
  • --output_file: Path to the output filtered dataset
  • --select_ratio: Ratio of data to select (for scoring-based methods)
  • --batch_size: Batch size for inference-based scoring
  • --query_file: (For retrieval-based) Query set to match from
  • --top_n: (For retrieval-based) Top-k examples to retrieve
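A sketch of how these standardized arguments might be parsed with Python's argparse. The argument names come from the list above; which flags are required, and the defaults, are illustrative assumptions rather than the scripts' actual behavior:

```python
import argparse

def build_parser():
    # Standardized CLI arguments shared by the selection scripts
    # (required flags and defaults here are illustrative assumptions)
    p = argparse.ArgumentParser(description="Unified data-selection CLI")
    p.add_argument("--input_file", required=True, help="input dataset (e.g., merged .jsonl)")
    p.add_argument("--model_path", help="pretrained model path (LLaMA, Mistral, etc.)")
    p.add_argument("--output_file", required=True, help="output filtered dataset")
    p.add_argument("--select_ratio", type=float, help="ratio of data to select")
    p.add_argument("--batch_size", type=int, default=8, help="batch size for scoring")
    p.add_argument("--query_file", help="query set (retrieval-based methods)")
    p.add_argument("--top_n", type=int, help="top-k examples to retrieve")
    return p

args = build_parser().parse_args(
    ["--input_file", "merged.jsonl", "--output_file", "out.jsonl",
     "--select_ratio", "0.1"]
)
print(args.select_ratio)  # 0.1
```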

🚀 Example Usage

Below are usage examples per method, using the standardized argument names.

🔸 PPL Score

torchrun --nproc_per_node=4 sft/data-select/ppl_score.py \
    --model_path /path/to/mistral-7b \
    --input_file /path/to/merged.jsonl \
    --output_file ppl_scores.jsonl \
    --batch_size 8

🔸 NLL Score

torchrun --nproc_per_node=8 sft/data-select/nll_score.py \
    --model_path /path/to/llama2-7b \
    --input_file /path/to/merged.jsonl \
    --output_file nll_scores.jsonl \
    --batch_size 16

🔸 BM25

python sft/data-select/bm25.py \
    --input_file /path/to/documents.jsonl \
    --query_file /path/to/queries.jsonl \
    --output_file bm25_top100.jsonl \
    --top_n 100

🔸 RDS+

python sft/data-select/rds.py \
    --input_file /path/to/documents.jsonl \
    --query_file /path/to/queries.jsonl \
    --output_file rds_top100.jsonl \
    --top_n 100 \
    --model_path bert-base-uncased

📌 For third-party methods (DSIR, LESS, SHED, Superfilter, SelectIT, TAGCOS), please refer to their official repositories for setup and usage.

📊 Datasets

Dataset Statistics

| Dataset | Type | Modality | Classes | Size | Distribution |
|---|---|---|---|---|---|
| CIFAR-100 | Classification | Image | 100 | 60,000 | ID |
| TinyImageNet | Classification | Image | 200 | 110,000 | ID |
| Adult | Classification | Tabular | 2 | 48,842 | ID |
| IMDB-Large | Classification | Text | 2 | 22,500,000+ | ID |
| Covtype | Classification | Tabular | 7 | 581,012 | ID |
| Bike | Regression | Tabular | - | 17,358 | ID |
| IMDB | Classification | Text | 2 | 50,000 | ID |
| News | Classification | Text | 4 | 142,170 | ID |
| MNIST→MNIST-M | Classification | Image | 10 | 70,000 | OOD |
| MNIST-M→MNIST | Classification | Image | 10 | 70,000 | OOD |
| HR | Classification | Tabular | 2 | 12,500 | OOD |
| House | Regression | Tabular | - | 18,750 | OOD |

Data Distribution

  • In-Distribution (ID): Training and test sets share the same data distribution
  • Out-of-Distribution (OOD): Training and test sets come from different distributions
  • Train/Val/Test Split: 80%/10%/10% for ID datasets
  • OOD Split: Following Data Shapley methodology
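The 80%/10%/10% ID split can be sketched as a simple shuffled partition. This is illustrative; the testbed's own splitting code may differ in details such as seeding and stratification:

```python
import random

def split_80_10_10(indices, seed=0):
    # Shuffle once, then carve out 80% train / 10% val / 10% test
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    n = len(idx)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

# e.g. the Adult dataset (48,842 examples)
train, val, test = split_80_10_10(range(48842))
print(len(train), len(val), len(test))  # 39073 4884 4885
```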

📈 Evaluation Metrics

🎯 Effectiveness Metrics

📊 Classification Tasks: F1-score

  • F1-score = 2 × (Precision × Recall) / (Precision + Recall)

📈 Regression Tasks: Mean Squared Error (MSE)

  • MSE = (1/n) × Σ(y - y_predict)²
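Both metrics can be computed directly from their definitions; a plain-Python sketch:

```python
def f1_score(tp, fp, fn):
    # F1 = 2 * (Precision * Recall) / (Precision + Recall),
    # from true positives, false positives, and false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mse(y_true, y_pred):
    # MSE = (1/n) * sum((y - y_predict)^2)
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

print(f1_score(tp=80, fp=10, fn=20))           # 0.8421052631578948
print(mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.5]))   # 0.16666666666666666
```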

⚡ Efficiency Metrics

  • ⏱️ Data Selection Time: Time required for subset selection
  • 🏋️ Model Training Time: Total training time with the selected subset

🔧 Configuration

⚙️ The testbed uses YAML configuration files for easy customization:

  • 📝 Flexible Configuration: Easy parameter tuning
  • 🎛️ Modular Settings: Separate configs for different components
  • 🔄 Batch Experimentation: Run multiple experiments with different parameters

📋 Configuration Structure

The testbed uses experiments.yaml for experiment configuration and experiment_runner.py for batch execution:

# experiments.yaml - Data Selection Experiment Configuration
general:
  # Dataset Configuration
  data_path: "mydatasets"
  dataset: "adult"
  balance_target: true
  
  # Model Configuration
  model_name: "MLP"
  task_type: "classification"
  device: "auto"  # auto, cpu, cuda
  
  # Training Parameters
  epochs: 20
  batch_size: 128
  patience: 8
  selection_lr: 0.001
  selection_momentum: 0.9
  selection_weight_decay: 0.0005

# Experiment List - Multiple Methods and Fractions Comparison
experiments:
  - name: "arnoldi_10percent"
    method: "arnoldi"
    selection_fraction: 0.1
    recursion_depth: 20
    damping: 0.01
    scale: 25.0
    num_test_samples: 1
    pretrain_epochs: 3

  - name: "dvrl_05percent"
    method: "dvrl"
    selection_fraction: 0.05
    num_epochs: 10
    learning_rate: 0.001

# Results Save Configuration
results:
  save_path: "results"
  log_path: "logs"
  models_path: "best_models"
  comparison_file: "comparison_results.csv"
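The runner presumably combines each experiment's entry with the `general` block before launching it; a minimal sketch of that merge (illustrative, not the actual `experiment_runner.py` logic):

```python
# Merge general settings with per-experiment overrides, mirroring the
# experiments.yaml structure above. Illustrative, not the runner's code.
general = {"dataset": "adult", "model_name": "MLP",
           "epochs": 20, "batch_size": 128}
experiments = [
    {"name": "arnoldi_10percent", "method": "arnoldi",
     "selection_fraction": 0.1},
    {"name": "dvrl_05percent", "method": "dvrl",
     "selection_fraction": 0.05, "num_epochs": 10},
]

# Experiment-specific keys override the general ones
runs = [{**general, **exp} for exp in experiments]
print(runs[0]["method"], runs[0]["epochs"])  # arnoldi 20
```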

📚 Usage Examples

🚀 Running a Single Experiment

python main.py --method kmeans --dataset adult --model MLP --selection_fraction 0.1

🔄 Running Multiple Methods

# 🎯 Run all methods on a dataset
python run_experiments.py --dataset adult --methods kmeans,herding,craig,glister

🧪 Batch Experimentation with Configuration

The testbed provides a powerful batch experimentation system using YAML configuration:

# 🚀 Run all experiments defined in experiments.yaml
python experiment_runner.py

# 📋 Use custom configuration file
python experiment_runner.py custom_experiments.yaml

Key Features:

  • 📊 Batch Execution: Run multiple experiments with different parameters
  • 📝 Detailed Logging: Each experiment gets its own log file
  • 📈 Results Comparison: Automatic CSV export with performance metrics
  • 🔄 Parameter Sweeping: Easy testing of different selection fractions and methods

🔌 Custom Dataset Integration

from mydatasets import YourDataset
from methods.your_method import YourSelectionMethod

# 📊 Load your dataset
dataset = YourDataset(data_path)

# 🎯 Use your selection method
selector = YourSelectionMethod()
selected_data = selector.select(dataset, fraction=0.1)

πŸƒβ€β™‚οΈ Running Experiments

πŸ’» Hardware Requirements

πŸ–₯️ GPU: 4Γ— NVIDIA RTX 3090 (24GB each) ⚑ CPU: Dual Intel Xeon Gold 6148 (80 threads total) πŸ’Ύ RAM: 1 TiB system memory

🧪 Experiment Management

The testbed provides comprehensive experiment management through:

  • 📋 Configuration-Driven: Define experiments in YAML files
  • 🔄 Batch Processing: Run multiple experiments automatically
  • 📊 Result Tracking: Automatic logging and result comparison
  • 🎯 Parameter Sweeping: Test different selection fractions and methods

📊 Result Analysis

After running experiments, results are automatically saved in multiple formats:

  • 📈 CSV Comparison: Multi-index format for easy analysis
  • 📝 Individual Logs: Detailed logs for each experiment
  • 🤖 Model Checkpoints: Best models saved for each experiment
  • 📊 Performance Metrics: Accuracy, F1-score, timing information

🎛️ Hyperparameter Optimization

🔍 We employ Optuna for hyperparameter optimization using grid search. See Appendix for detailed hyperparameter configurations.

  • ⚙️ Automated Tuning: Optimize parameters automatically
  • 📊 Grid Search: Systematic parameter exploration

🤝 Contributing

🌟 We welcome contributions! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  • 💡 Open Source: Community-driven development
  • 🔧 Easy Integration: Simple process for adding new methods

🔧 Adding New Methods

  📁 1. Create a new file in the selection/ directory
  ⚙️ 2. Implement the selection interface
  🔗 3. Add the method to import_methods.py
  📝 4. Update the documentation

🧪 Adding New Experiments

  📋 1. Update experiments.yaml with new experiment configurations
  🔧 2. Add method mapping in experiment_runner.py if needed
  📊 3. Test with a single experiment before batch running
  📈 4. Verify results in the generated CSV files

📊 Adding New Datasets

  📁 1. Create a new file in the mydatasets/ directory
  ⚙️ 2. Implement the dataset interface
  🔧 3. Add preprocessing and loading functions
  📋 4. Update the dataset registry

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Note: This is the first open-source testbed for comprehensive data selection evaluation. We hope it will facilitate research and development in the field of efficient and effective data selection for machine learning.
