The success of supervised deep learning depends on the quantity and quality of labeled training data. However, as training datasets grow larger, they inevitably contain noisy or even harmful instances, increasing computational costs and hindering model performance. Existing data selection strategies typically focus either on improving training effectiveness by selecting high-quality data or on increasing efficiency by reducing training time without sacrificing accuracy.
Although numerous data selection methods have been proposed, ranging from efficiency-focused to effectiveness-focused strategies and from model-agnostic to model-aware approaches, their performance varies significantly across datasets and tasks, with no universally superior method identified to date. This work presents the first comprehensive survey and empirical evaluation of representative data selection methods, spanning traditional deep learning and supervised fine-tuning (SFT) for large language models (LLMs). We introduce a unified open-source testbed that implements 15 methods and systematically evaluates them across diverse datasets and model architectures. Furthermore, we extend the evaluation to SFT for LLMs, benchmarking 10 popular methods on 4 downstream tasks. Our results provide actionable insights into the strengths, limitations, and practical trade-offs of different strategies, offering valuable guidance to researchers and practitioners in selecting appropriate data selection methods for their specific scenarios.
- Overview
- Architecture
- Installation
- Quick Start
- Deep Learning Data Selection
- LLMs SFT Data Selection
- Datasets
- Evaluation Metrics
- Configuration
- Usage Examples
- Running Experiments
- Contributing
- License
This testbed provides a comprehensive framework for evaluating data selection strategies across diverse datasets and model architectures. It consists of three main modules:
- Configuration Loader: Allows users to configure datasets, models, parameters, data selection strategies, and experimental controls (e.g., logging).
- Data Selector: Offers 15 data selection strategies for efficient and effective training. It runs the strategy specified in the user configuration to select a subset of the data and passes it to the Model Evaluator. For iterative strategies, the Data Selector and Model Evaluator interact to continuously improve model performance.
- Model Evaluator: Trains the target model on the selected data subset and evaluates its performance on the test set, measuring both accuracy and latency.
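As an illustration of how these modules fit together, the pipeline can be sketched as follows. All names here are hypothetical stand-ins, not the testbed's actual API; the selector is a random-sampling placeholder for the implemented strategies.

```python
import random

def load_config():
    # Configuration Loader: dataset, model, strategy, and experimental controls
    return {"method": "random", "selection_fraction": 0.1, "rounds": 1}

def select_subset(train_set, fraction, seed=0):
    # Data Selector: random sampling stands in for the 15 real strategies
    rng = random.Random(seed)
    k = max(1, int(len(train_set) * fraction))
    return [train_set[i] for i in rng.sample(range(len(train_set)), k)]

def train_and_evaluate(subset):
    # Model Evaluator: train on the subset, then report accuracy and latency
    return {"n_train": len(subset), "accuracy": None, "latency_s": None}

config = load_config()
train_set = list(range(1000))        # stand-in for a labeled training set
for _ in range(config["rounds"]):    # iterative strategies repeat this loop
    subset = select_subset(train_set, config["selection_fraction"])
    report = train_and_evaluate(subset)
print(report["n_train"])  # -> 100
```

For one-shot strategies the loop runs once; iterative strategies would re-select based on the evaluator's feedback each round.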
The testbed is designed with a modular architecture that enables seamless switching between different methods and easy integration of new data selection strategies, datasets, and model architectures for benchmarking or research purposes.
- Modular Design: Easy integration of new methods and datasets
- Seamless Switching: Quick comparison between different strategies
- Scalable Framework: Support for various model architectures and datasets
- Python 3.7+ - Core programming language
- PyTorch 1.0+ - Deep learning framework
- CUDA (optional) - GPU acceleration support
```bash
# Clone the repository
git clone <repository-url>
cd data_selection_lib

# Install dependencies
pip install -r requirements.txt
```

Get started in minutes with our comprehensive data selection framework!
```python
import torch
from torch.utils.data import DataLoader

import nets
from mydatasets import adult
from selection.kmeans import KMeansSelection
from model_utils import train_model, evaluate_model

# Basic configuration
data_path = "mydatasets"
method = 'kmeans'
task_type = 'classification'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load dataset
channel, im_size, num_classes, class_names, mean, std, dst_train, dst_val, dst_test = adult(data_path, balance_target=True)

# Initialize model
base_model = nets.__dict__['MLP'](
    channel=channel,
    num_classes=num_classes,
    im_size=im_size,
    pretrained=False
)

# Data selection
selector = KMeansSelection()
core_loader = selector.select(
    model=base_model,
    selection_fraction=0.1,
    k=num_classes,
    train_set=dst_train,
    task_type=task_type,
    device=device
)

# Train and evaluate
trained_model = train_model(
    model=base_model,
    train_loader=core_loader,
    val_loader=DataLoader(dst_val, batch_size=128, shuffle=False),
    device=device,
    task_type=task_type,
    epochs=100,
    patience=10
)

# Evaluate model performance
evaluate_model(
    model=trained_model,
    test_loader=DataLoader(dst_test, batch_size=128, shuffle=False),
    device=device,
    task_type=task_type
)
```

Our testbed implements 15 representative data selection strategies covering both efficient and effective training approaches:
- Model-Agnostic Methods: Work with any model architecture
- Model-Aware Methods: Leverage model-specific information for better selection
| Method | Category | Description | Implementation |
|---|---|---|---|
| K-Means | Model-agnostic | Selection using the K-means clustering algorithm | selection/kmeans.py |
| Herding | Model-agnostic | Selection using the herding algorithm | selection/herding.py |
| K-Center Greedy | Model-agnostic | Greedy k-center clustering for data selection | selection/kcentergreedy.py |
| Random | Model-agnostic | Random sampling baseline | selection/random.py |
| Full | Model-agnostic | Full-dataset training baseline | selection/full.py |
| CRAIG | Model-aware | Gradient-based coreset selection | selection/craig.py |
| GradMatch | Model-aware | Gradient matching for data selection | selection/gradmatch.py |
| GLISTER | Model-aware | Generalization-based bilevel subset selection | selection/glister.py |
| CREST | Model-aware | Coreset selection with gradient-based scoring | selection/crest.py |
| TracIn | Model-aware | Training-data influence estimation | selection/TracIn.py |
| LiSSA | Model-aware | Linear-time second-order influence algorithm | selection/LiSSA.py |
| Arnoldi | Model-aware | Arnoldi iteration for influence computation | selection/Arnoldi.py |
| TMCS | Model-aware | Truncated Monte Carlo Shapley value estimation | selection/TMCS.py |
| KNN-Shapley | Model-aware | K-nearest-neighbor Shapley value estimation | selection/KNN_shapley.py |
| CS-Shapley | Model-aware | Class-wise Shapley value computation | selection/cs_shapley.py |
| G-Shapley | Model-aware | Gradient-based Shapley value estimation | selection/G_shapley.py |
| DVRL | Model-aware | Data valuation using reinforcement learning | selection/dvrl.py |
| DBAL | Model-aware | Deep Bayesian active learning | selection/dbal.py |
| DFAL | Model-aware | DeepFool-based active learning | selection/dfal.py |
| Boundary Aware | Model-aware | Boundary-aware data selection | selection/boundary_aware.py |
| CG-Influence | Model-aware | Conjugate-gradient influence estimation | selection/cg_influence.py |
| SVP Max Entropy | Model-aware | Maximum-entropy selection via a proxy model | selection/svp_max_entropy.py |
The testbed supports various model architectures:
- CNNs: ResNet, VGG, AlexNet, InceptionV3, MobileNetV3, WideResNet, ViT
- MLPs: Standard MLP, TabNet, TabTransformer
- RNNs: LSTM
- Transformers: BERT, T5
- Specialized: DSN, EfficientNet
In-distribution (ID) datasets:
- CIFAR-100 - Large-scale image classification (100 classes)
- TinyImageNet - Image classification with rich visual diversity (200 classes)
- Adult - Census income prediction (tabular data)
- IMDB-Large - Large-scale text classification (>20M records)
- Covtype - Large-scale multiclass classification (tabular data)
- Bike - Regression benchmark (tabular data)
- IMDB - Binary text classification
- News - Multiclass text classification
Out-of-distribution (OOD) datasets:
- MNIST→MNIST-M - Domain adaptation (synthetic backgrounds)
- MNIST-M→MNIST - Reverse domain adaptation
- HR - Employee attrition prediction (cross-departmental)
- House - House price prediction (cross-city)
This section is dedicated to methods for selecting high-quality data for supervised fine-tuning (SFT) of Large Language Models (LLMs). Each method reflects a different perspective on what constitutes "valuable" data.
| Method | Family | Goal | Implementation Path / Link |
|---|---|---|---|
| DSIR | Geometry | Task-specific | dsir |
| BM25 | Geometry | Task-specific | sft/data-select/bm25.py |
| RDS+ | Gradient | Task-specific | sft/data-select/rds.py |
| LESS | Influence | Task-specific | less |
| SHED | Shapley | Task-specific | shed |
| Superfilter | Uncertainty | General instruction | superfilter |
| PPL Score | Uncertainty | General instruction | sft/data-select/ppl_score.py |
| NLL Score | Uncertainty | General instruction | sft/data-select/nll_score.py |
| SelectIT | Uncertainty | General instruction | selectit |
| TAGCOS | Gradient | General instruction | tagcos |
We recommend using the following standardized CLI arguments for all methods:
- `--input_file`: Path to the input dataset (e.g., merged.jsonl)
- `--model_path`: Pretrained model path (e.g., LLaMA, Mistral, etc.)
- `--output_file`: Path to the output filtered dataset
- `--select_ratio`: Ratio of data to select (for scoring-based methods)
- `--batch_size`: Batch size for inference-based scoring
- `--query_file`: (For retrieval-based) Query set to match from
- `--top_n`: (For retrieval-based) Top-k examples to retrieve
Below are usage examples per method, using the standardized argument names.
PPL Score:
```bash
torchrun --nproc_per_node=4 sft/data-select/ppl_score.py \
    --model_path /path/to/mistral-7b \
    --input_file /path/to/merged.jsonl \
    --output_file ppl_scores.jsonl \
    --batch_size 8
```

NLL Score:
```bash
torchrun --nproc_per_node=8 sft/data-select/nll_score.py \
    --model_path /path/to/llama2-7b \
    --input_file /path/to/merged.jsonl \
    --output_file nll_scores.jsonl \
    --batch_size 16
```

BM25:
```bash
python sft/data-select/bm25.py \
    --input_file /path/to/documents.jsonl \
    --query_file /path/to/queries.jsonl \
    --output_file bm25_top100.jsonl \
    --top_n 100
```

RDS+:
```bash
python sft/data-select/rds.py \
    --input_file /path/to/documents.jsonl \
    --query_file /path/to/queries.jsonl \
    --output_file rds_top100.jsonl \
    --top_n 100 \
    --model_path bert-base-uncased
```

For third-party methods (DSIR, LESS, SHED, Superfilter, SelectIT, TAGCOS), please refer to their official repositories for setup and usage.
| Dataset | Type | Modality | Classes | Dataset Size | Distribution |
|---|---|---|---|---|---|
| CIFAR-100 | Classification | Image | 100 | 60,000 | ID |
| TinyImageNet | Classification | Image | 200 | 110,000 | ID |
| Adult | Classification | Tabular | 2 | 48,842 | ID |
| IMDB-Large | Classification | Text | 2 | 22,500,000+ | ID |
| Covtype | Classification | Tabular | 7 | 581,012 | ID |
| Bike | Regression | Tabular | - | 17,358 | ID |
| IMDB | Classification | Text | 2 | 50,000 | ID |
| News | Classification | Text | 4 | 142,170 | ID |
| MNIST→MNIST-M | Classification | Image | 10 | 70,000 | OOD |
| MNIST-M→MNIST | Classification | Image | 10 | 70,000 | OOD |
| HR | Classification | Tabular | 2 | 12,500 | OOD |
| House | Regression | Tabular | - | 18,750 | OOD |
- In-Distribution (ID): Training and test sets share the same data distribution
- Out-of-Distribution (OOD): Training and test sets come from different distributions
- Train/Val/Test Split: 80%/10%/10% for ID datasets
- OOD Split: Following Data Shapley methodology
Classification Tasks: F1-score
- F1-score = 2 × (Precision × Recall) / (Precision + Recall)

Regression Tasks: Mean Squared Error (MSE)
- MSE = (1/n) × Σ(y − y_predict)²

Efficiency metrics:
- Data Selection Time: Time required for subset selection
- Model Training Time: Total training time with selected subset
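Under the definitions above, the two quality metrics can be computed directly; this is a minimal plain-Python sketch (in practice a library such as scikit-learn would typically be used), not the testbed's own implementation:

```python
def f1_score(tp, fp, fn):
    # F1 = 2 * (precision * recall) / (precision + recall)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mse(y_true, y_pred):
    # MSE = (1/n) * sum((y - y_predict)^2)
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

print(f1_score(tp=8, fp=2, fn=2))             # precision = recall = 0.8 -> 0.8
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # (0 + 0 + 4) / 3
```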
The testbed uses YAML configuration files for easy customization:
- Flexible Configuration: Easy parameter tuning
- Modular Settings: Separate configs for different components
- Batch Experimentation: Run multiple experiments with different parameters
The testbed uses experiments.yaml for experiment configuration and experiment_runner.py for batch execution:
```yaml
# experiments.yaml - Data Selection Experiment Configuration
general:
  # Dataset Configuration
  data_path: "mydatasets"
  dataset: "adult"
  balance_target: true

  # Model Configuration
  model_name: "MLP"
  task_type: "classification"
  device: "auto"  # auto, cpu, cuda

  # Training Parameters
  epochs: 20
  batch_size: 128
  patience: 8
  selection_lr: 0.001
  selection_momentum: 0.9
  selection_weight_decay: 0.0005

# Experiment List - Multiple Methods and Fractions Comparison
experiments:
  - name: "arnoldi_10percent"
    method: "arnoldi"
    selection_fraction: 0.1
    recursion_depth: 20
    damping: 0.01
    scale: 25.0
    num_test_samples: 1
    pretrain_epochs: 3

  - name: "dvrl_05percent"
    method: "dvrl"
    selection_fraction: 0.05
    num_epochs: 10
    learning_rate: 0.001

# Results Save Configuration
results:
  save_path: "results"
  log_path: "logs"
  models_path: "best_models"
  comparison_file: "comparison_results.csv"
```

```bash
python main.py --method kmeans --dataset adult --model MLP --selection_fraction 0.1

# Run all methods on a dataset
python run_experiments.py --dataset adult --methods kmeans,herding,craig,glister
```

The testbed provides a powerful batch experimentation system using YAML configuration:
```bash
# Run all experiments defined in experiments.yaml
python experiment_runner.py

# Use a custom configuration file
python experiment_runner.py custom_experiments.yaml
```

Key Features:
- Batch Execution: Run multiple experiments with different parameters
- Detailed Logging: Each experiment gets its own log file
- Results Comparison: Automatic CSV export with performance metrics
- Parameter Sweeping: Easy testing of different selection fractions and methods
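Conceptually, the batch runner merges the shared `general` block with each entry in `experiments` and runs them in sequence. The sketch below inlines a trimmed config as a dict to show the merge; function names and the stand-in runner are illustrative, not the actual `experiment_runner.py` code:

```python
config = {
    "general": {"dataset": "adult", "model_name": "MLP", "epochs": 20},
    "experiments": [
        {"name": "arnoldi_10percent", "method": "arnoldi", "selection_fraction": 0.1},
        {"name": "dvrl_05percent", "method": "dvrl", "selection_fraction": 0.05},
    ],
}

def run_all(config, run_one):
    results = {}
    for exp in config["experiments"]:
        # per-experiment settings override the shared general settings
        settings = {**config["general"], **exp}
        results[exp["name"]] = run_one(settings)
    return results

# stand-in for training + evaluating a single experiment
summary = run_all(config, lambda s: (s["method"], s["selection_fraction"]))
print(summary["dvrl_05percent"])  # -> ('dvrl', 0.05)
```

This dict-merge pattern is what makes parameter sweeping cheap: each experiment entry only lists the settings that differ from `general`.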
```python
from mydatasets import YourDataset
from methods.your_method import YourSelectionMethod

# Load your dataset
dataset = YourDataset(data_path)

# Use your selection method
selector = YourSelectionMethod()
selected_data = selector.select(dataset, fraction=0.1)
```

Hardware used for the experiments:
- GPU: 4× NVIDIA RTX 3090 (24GB each)
- CPU: Dual Intel Xeon Gold 6148 (80 threads total)
- RAM: 1 TiB system memory
The testbed provides comprehensive experiment management through:
- Configuration-Driven: Define experiments in YAML files
- Batch Processing: Run multiple experiments automatically
- Result Tracking: Automatic logging and result comparison
- Parameter Sweeping: Test different selection fractions and methods
After running experiments, results are automatically saved in multiple formats:
- CSV Comparison: Multi-index format for easy analysis
- Individual Logs: Detailed logs for each experiment
- Model Checkpoints: Best models saved for each experiment
- Performance Metrics: Accuracy, F1-score, timing information
We employ Optuna for hyperparameter optimization using grid search. See the Appendix for detailed hyperparameter configurations.
- Automated Tuning: Optimize parameters automatically
- Grid Search: Systematic parameter exploration
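While the testbed delegates the search to Optuna, grid search itself is simple: enumerate the Cartesian product of candidate values and keep the best-scoring combination. A self-contained sketch with a hypothetical objective (a stand-in for "train with these parameters, return the validation score"):

```python
from itertools import product

# candidate values for two of the testbed's training parameters
grid = {
    "selection_lr": [0.001, 0.01],
    "selection_momentum": [0.9, 0.99],
}

def objective(params):
    # Hypothetical score function; peaks at lr=0.01, momentum=0.9
    return -abs(params["selection_lr"] - 0.01) - abs(params["selection_momentum"] - 0.9)

best_params, best_score = None, float("-inf")
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = objective(params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params)  # -> {'selection_lr': 0.01, 'selection_momentum': 0.9}
```

Optuna adds pruning, logging, and parallelism on top of this basic loop.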
We welcome contributions! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Open Source: Community-driven development
- Easy Integration: Simple process for adding new methods
To add a new data selection method:
1. Create a new file in the selection/ directory
2. Implement the selection interface
3. Add the method to import_methods.py
4. Update the documentation
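Following the `selector.select(...)` pattern shown in the Quick Start, a new strategy might look like the sketch below. This is a hedged illustration: the exact interface (required arguments, return type) is defined by the repository, and random sampling stands in for a real scoring rule.

```python
import random

class MyRandomSelection:
    """Illustrative strategy for selection/; the real interface may also
    take model, device, task_type, and other keyword arguments."""

    def select(self, train_set, selection_fraction, seed=0, **kwargs):
        # Score/choose indices; here, uniform random sampling as a placeholder
        rng = random.Random(seed)
        n = len(train_set)
        k = max(1, int(n * selection_fraction))
        indices = rng.sample(range(n), k)
        return [train_set[i] for i in indices]

selector = MyRandomSelection()
subset = selector.select(list(range(100)), selection_fraction=0.1)
print(len(subset))  # -> 10
```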
To add new experiments:
1. Update experiments.yaml with new experiment configurations
2. Add a method mapping in experiment_runner.py if needed
3. Test with a single experiment before batch running
4. Verify results in the generated CSV files
To add a new dataset:
1. Create a new file in the mydatasets/ directory
2. Implement the dataset interface
3. Add preprocessing and loading functions
4. Update the dataset registry
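A new loader in `mydatasets/` should return the same tuple the Quick Start unpacks from `adult(...)`. The sketch below mirrors that signature with dummy in-memory splits; a real loader would read and preprocess files under `data_path`, and the dataset name and contents here are entirely hypothetical.

```python
def my_tabular_dataset(data_path, balance_target=False):
    # Mirrors the Quick Start's adult(...) return signature; the splits
    # below are dummy placeholders rather than data read from data_path.
    channel, im_size = 1, (1, 14)    # tabular: 14 features as a flat "image"
    num_classes = 2
    class_names = ["negative", "positive"]
    mean, std = [0.0], [1.0]         # normalization statistics
    rows = [([0.0] * 14, i % num_classes) for i in range(100)]
    # 80/10/10 split, matching the testbed's ID split convention
    dst_train, dst_val, dst_test = rows[:80], rows[80:90], rows[90:]
    return (channel, im_size, num_classes, class_names,
            mean, std, dst_train, dst_val, dst_test)

out = my_tabular_dataset("mydatasets")
print(len(out[6]), len(out[7]), len(out[8]))  # -> 80 10 10
```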
This project is licensed under the MIT License - see the LICENSE file for details.
Note: This is the first open-source testbed for comprehensive data selection evaluation. We hope it will facilitate research and development in the field of efficient and effective data selection for machine learning.

