This project implements several data valuation methods and evaluates their performance across multiple datasets.
- `main.py`: The entry point of the project. It parses command-line arguments, loads the configuration, creates the experiment, runs the data valuation methods, and saves the results to CSV files.
- `utils/data.py`: Utility functions for data processing, model evaluation, and experiment setup. Key functions include:
  - `evaluate_model`: Evaluates the performance of a scikit-learn model.
  - `set_seed`: Sets the random seed for reproducibility.
  - `create_output_dir`: Creates the output directory for saving results.
  - `process_data_for_sklearn`: Prepares data for use with scikit-learn models.
  - `train_model_on_subset`: Trains a scikit-learn model on a subset of the data.
  - `remove_points_one_by_one`: Evaluates model performance as data points are removed one by one.
- `utils/plotting.py`: Functions for plotting the aggregated results with confidence intervals across multiple random seeds.
- `evaluators/`: Custom data valuation methods implemented as subclasses of `DataEvaluator`, including:
  - `BipartiteMatchingEvaluator`: A data valuation method based on category-aware greedy bipartite matching (see the sketch after this list).
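To give a sense of the idea behind category-aware greedy bipartite matching, here is a minimal, self-contained sketch. The function name, the similarity score, and the greedy matching order are illustrative assumptions, not the exact implementation in `evaluators/`:

```python
import numpy as np

def greedy_matching_values(x_train, y_train, x_valid, y_valid):
    """Toy valuation: greedily match each validation point to its most
    similar unmatched training point of the same category; a matched
    training point is valued by its similarity to its match."""
    values = np.zeros(len(x_train))
    for label in np.unique(y_valid):
        # Only training points of the same category are match candidates
        unmatched = set(np.where(y_train == label)[0])
        for v in np.where(y_valid == label)[0]:
            if not unmatched:
                break
            cand = np.fromiter(unmatched, dtype=int)
            dists = np.linalg.norm(x_train[cand] - x_valid[v], axis=1)
            best = cand[np.argmin(dists)]
            values[best] = 1.0 / (1.0 + dists.min())  # similarity score
            unmatched.remove(best)
    return values  # training points that are never matched keep value 0.0
```

Leaving unmatched training points at zero value is one simple way to reflect redundancy within a category: once the validation points of a class are covered, additional near-duplicates add little.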
To run the project, use the `main.py` script with the desired command-line arguments. The script loads the configuration from `config/base_config.yaml` and `config/cifar_config.yaml` (for the CIFAR-10 dataset) and saves the results in the `results/` directory.
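For example, a CIFAR-10 run with a fixed seed might look like the following (the flag names here are assumptions; run `python main.py --help` to see the actual arguments):

```bash
python main.py --config config/cifar_config.yaml --seed 0
```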
The project uses YAML files for configuration. `config/base_config.yaml` contains the default settings, which dataset-specific files such as `config/cifar_config.yaml` can override. The configuration files specify the dataset, model, and experiment settings.
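The override pattern can be thought of as a shallow merge of the two YAML files, along the lines of this sketch (the helper below is illustrative; `main.py` may merge configurations differently):

```python
import yaml

def load_config(base_path="config/base_config.yaml",
                override_path="config/cifar_config.yaml"):
    """Load the base config, then let dataset-specific keys override it."""
    with open(base_path) as f:
        cfg = yaml.safe_load(f) or {}
    with open(override_path) as f:
        cfg.update(yaml.safe_load(f) or {})  # shallow merge: override wins
    return cfg
```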
The results of the data valuation experiments are saved in the `results/` directory, organized by dataset and random seed. The `addition_experiment_results.csv` file in each seed directory contains the performance metrics for each data valuation method as data points are added one by one.
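To work with the saved results programmatically, the per-seed CSV files can be concatenated with pandas. The `seed_*` directory pattern below is an assumption about the exact layout:

```python
import glob
import pandas as pd

paths = glob.glob("results/cifar10/seed_*/addition_experiment_results.csv")
# Stack all seeds into one frame, keeping the source path for traceability
df = pd.concat((pd.read_csv(p).assign(source=p) for p in paths),
               ignore_index=True)
```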
The `utils/plotting.py` script plots the aggregated results with confidence intervals across multiple random seeds. The plots are saved in the `results/{dataset}/aggregate/` directory.
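The aggregation amounts to averaging per-seed metric curves and shading a confidence band, roughly as in this sketch (a normal-approximation 95% interval is assumed here; `utils/plotting.py` may compute its intervals differently):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_with_ci(curves, label):
    """curves: (n_seeds, n_steps) array of a metric recorded per seed."""
    curves = np.asarray(curves, dtype=float)
    mean = curves.mean(axis=0)
    # 95% normal-approximation CI of the mean across seeds
    ci = 1.96 * curves.std(axis=0, ddof=1) / np.sqrt(curves.shape[0])
    steps = np.arange(curves.shape[1])
    plt.plot(steps, mean, label=label)
    plt.fill_between(steps, mean - ci, mean + ci, alpha=0.2)
```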