
DaRUS Dataset Interaction

License: GPL v3 · Python 3.8+ · Code style: black

A Python package for easily downloading datasets from DaRUS, the Data Repository of the University of Stuttgart. The DaRUS web interface currently limits downloads to 2 GB, which makes it hard to obtain big datasets like the FEM dataset in the example below. This package downloads entire datasets (or specific files) and handles authentication and directory management.

Installation

This package can be installed using pip or uv (uv is recommended):

pip install git+https://github.com/BaumSebastian/DaRUS-Dataset-Interaction.git

CLI Usage

After installation, you can use the command line interface:

Basic Usage

Download all files from a dataset to ./data directory:

darus-download --url "https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/DARUS-4801"
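The dataset URL carries the persistent identifier (a DOI) as a query parameter; it can be extracted with the standard library, for example:

```python
from urllib.parse import parse_qs, urlparse

# Extract the persistent identifier from a DaRUS dataset URL.
url = "https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/DARUS-4801"
pid = parse_qs(urlparse(url).query)["persistentId"][0]
print(pid)  # doi:10.18419/DARUS-4801
```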

Specify Download Path

Choose where to save the downloaded files:

darus-download --url "https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/DARUS-4801" --path "./downloads"

Note: Every file has a directory value in its metadata (see Add a File to Dataset). The program creates the corresponding directory and stores the downloaded file under path/directory.

Download Specific Files Only

Download only selected files instead of the entire dataset:

darus-download --url "https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/DARUS-4801" --files metadata.tab

Note: DaRUS converts tabular data such as .csv files into .tab format on upload. This package downloads the original file format (e.g. .csv) when available. Since metadata.tab is the name displayed by DaRUS, pass metadata.tab (not metadata.csv) to --files.

Private Datasets with API Token

Access restricted datasets using your DaRUS API token:

darus-download --url "https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/DARUS-4801" --token "your-api-token"

Use Custom Config File

Store settings in a YAML file for repeated use:

darus-download --config config.yaml
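A config file presumably mirrors the CLI flags; the sketch below is an assumption about the key names, so check the example config.yaml shipped in the repository root for the actual schema:

```yaml
# Sketch of a config.yaml — key names are assumptions,
# see the example config.yaml in the repository.
url: "https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/DARUS-4801"
path: "./downloads"
token: "your-api-token"   # optional, for restricted datasets
files:                    # optional, omit to download everything
  - metadata.tab
```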

Available Arguments:

  • --url, -u: Dataset URL
  • --path, -p: Download directory path [optional] (default: ./data)
  • --token, -t: API token for authentication [optional]
  • --files, -f: Specific files to download [optional] (space-separated)
  • --config, -c: Config file path [optional]
  • --help: Show help message

Python API Usage

Basic Usage

Download an entire dataset:

from darus import Dataset 

url = "https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/DARUS-4801"
path = "./data"

# Download the complete dataset
ds = Dataset(url)
ds.download(path)

Note: Every file has a directory value in its metadata (see Add a File to Dataset). Dataset creates the corresponding directory and stores the downloaded file under path/directory.
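The resulting location can be illustrated with pathlib; this is purely an illustration of how the download path and a file's directory metadata combine (the directory value "h5" is taken from the sample output further below):

```python
from pathlib import Path

# Illustration only: how a file's final location is derived from the
# download path and the 'directory' value in its metadata.
path = Path("./data")        # download path passed to ds.download()
directory = "h5"             # 'directory' value from the file's metadata
name = "113525_116825.zip"   # file name in the dataset
target = path / directory / name
print(target.as_posix())  # data/h5/113525_116825.zip
```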

Download Specific Files

Download only selected files (here metadata.tab) from a dataset:

from darus import Dataset 

url = "https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/DARUS-4801"
path = "./data"

files = ["metadata.tab"] 

ds = Dataset(url)
ds.download(path, files=files)

Note: DaRUS converts tabular data such as .csv files into .tab format on upload. This package downloads the original file format (e.g. .csv) when available. Since metadata.tab is the name displayed by DaRUS, pass metadata.tab (not metadata.csv) in the files argument.

Private Datasets

For datasets that require authentication, use the API token of your DaRUS account:

from darus import Dataset 

url = "https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/DARUS-4801"
path = "./data"

api_token = 'xxxx-xxxx-xxxx-xxxx'

ds = Dataset(url, api_token=api_token)
ds.download(path)
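To avoid committing tokens to source control, the token can be read from an environment variable instead of being hardcoded. A sketch (DARUS_API_TOKEN is an assumed variable name, not one the package defines):

```python
import os

# Read the API token from the environment instead of hardcoding it.
# DARUS_API_TOKEN is an assumed variable name, not defined by the package.
api_token = os.environ.get("DARUS_API_TOKEN", "")
# Then pass it on as above: Dataset(url, api_token=api_token)
```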

Post Processing

The download method of Dataset accepts two optional arguments:

  • post_process: Zip archives are automatically extracted after the download completes. Default: True.
  • remove_after_pp: Zip archives are deleted after extraction. Default: True.

Sample Output

Executing the following script results in the output below.

from darus import Dataset 

url = "https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/DARUS-4801"
path = "./data"

ds = Dataset(url)
ds.summary()
ds.download(path)

Note: The Dataset Summary and Files in Dataset tables are only printed when ds.summary() is called.

The output looks like the following:

Dataset Summary
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Property      ┃ Value                                                                             ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL           │ https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/DARUS-4801 │
│ Persistent ID │ doi:10.18419/DARUS-4801                                                           │
│ Last Update   │ 2025-03-12 12:32:17                                                               │
│ License       │ CC BY 4.0                                                                         │
└───────────────┴───────────────────────────────────────────────────────────────────────────────────┘
Files in Dataset
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name              ┃ Size    ┃ Original Available ┃ Description                                                 ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 113525_116825.zip │ 59.2 GB │                    │ Contains all simulations with ID between 113525 and 116825. │
│ 116826_211007.zip │ 59.2 GB │                    │ Contains all simulations with ID between 116826 and 211007. │
│ 16039_19338.zip   │ 59.2 GB │                    │ Contains all simulations with ID between 16039 and 19338.   │
│ 19339_113524.zip  │ 59.1 GB │                    │ Contains all simulations with ID between 19339 and 113524.  │
│ 257076_260375.zip │ 59.7 GB │                    │ Contains all simulations with ID between 257076 and 260375. │
│ 260376_306443.zip │ 59.8 GB │                    │ Contains all simulations with ID between 260376 and 306443. │
│ 306444_309743.zip │ 59.7 GB │                    │ Contains all simulations with ID between 306444 and 309743. │
│ 309744_403925.zip │ 59.6 GB │                    │ Contains all simulations with ID between 309744 and 403925. │
│ 403926_406296.zip │ 42.6 GB │                    │ Contains all simulations with ID between 403926 and 406296. │
│ metadata.tab      │ 2.5 MB  │ ✓(metadata.csv)    │ Metadata of the simulations.                                │
└───────────────────┴─────────┴────────────────────┴─────────────────────────────────────────────────────────────┘
Downloading...
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name              ┃ Size    ┃ Directory  ┃ Download Original ┃ Description                                                 ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 113525_116825.zip │ 59.2 GB │ .\data\h5\ │                   │ Contains all simulations with ID between 113525 and 116825. │
│ 116826_211007.zip │ 59.2 GB │ .\data\h5\ │                   │ Contains all simulations with ID between 116826 and 211007. │
│ 16039_19338.zip   │ 59.2 GB │ .\data\h5\ │                   │ Contains all simulations with ID between 16039 and 19338.   │
│ 19339_113524.zip  │ 59.1 GB │ .\data\h5\ │                   │ Contains all simulations with ID between 19339 and 113524.  │
│ 257076_260375.zip │ 59.7 GB │ .\data\h5\ │                   │ Contains all simulations with ID between 257076 and 260375. │
│ 260376_306443.zip │ 59.8 GB │ .\data\h5\ │                   │ Contains all simulations with ID between 260376 and 306443. │
│ 306444_309743.zip │ 59.7 GB │ .\data\h5\ │                   │ Contains all simulations with ID between 306444 and 309743. │
│ 309744_403925.zip │ 59.6 GB │ .\data\h5\ │                   │ Contains all simulations with ID between 309744 and 403925. │
│ 403926_406296.zip │ 42.6 GB │ .\data\h5\ │                   │ Contains all simulations with ID between 403926 and 406296. │
│ metadata.tab      │ 2.5 MB  │ .\data\    │ ✓(metadata.csv)   │ Metadata of the simulations.                                │
└───────────────────┴─────────┴────────────┴───────────────────┴─────────────────────────────────────────────────────────────┘
Downloading 113525_116825.zip ━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9% • 5.2/59.2 GB • 0:08:07 • 1:22:43 • 10.9 MB/s
....

Development

Project Structure

darus/
├── darus/              # Main package
│   ├── __init__.py     # Package initialization
│   ├── cli.py          # Command line interface
│   ├── Dataset.py      # Main Dataset class
│   ├── DatasetFile.py  # File download and processing
│   └── utils.py        # Utility functions and logging
├── tests/              # Test suite
│   ├── fixtures/       # Test data and fixtures
│   ├── test_dataset.py # Dataset class tests
│   └── test_dataset_file.py # DatasetFile tests
├── config.yaml         # Example configuration
└── setup.py           # Package configuration

Setup Development Environment

Clone the repository and install in development mode:

git clone https://github.com/BaumSebastian/DaRUS-Dataset-Interaction.git
cd DaRUS-Dataset-Interaction
pip install -e .[dev]  # Includes testing tools

Running from Source

Test the CLI directly from source:

python -m darus.cli --url "https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/DARUS-4801"

Running Tests

Run the tests locally:

# Run the full test suite:
pytest -v

# Run tests with coverage:
pytest --cov=darus --cov-report=html

# Run specific test file:
pytest tests/test_dataset.py -v

Code Quality

Useful commands for maintaining code quality:

# Format code with Black:
black darus/ tests/

# Type checking (if mypy is installed)
mypy darus/

# Linting
flake8 darus/ tests/

VSCode Debugging

For VSCode users, you can create .vscode/launch.json with debug configurations:

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Debug CLI - Demo Dataset",
            "type": "debugpy",
            "request": "launch",
            "module": "darus.cli",
            "args": [
                "--url", "https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/NIVKU0",
                "--path", "./debug_downloads"
            ],
            "console": "integratedTerminal",
            "cwd": "${workspaceFolder}"
        },
        {
            "name": "Debug Tests - All",
            "type": "debugpy",
            "request": "launch",
            "module": "pytest",
            "args": ["-v", "tests/"],
            "console": "integratedTerminal",
            "cwd": "${workspaceFolder}"
        }
    ]
}

Set breakpoints and press F5 to start debugging.

Contributing Guidelines

  1. Fork the repository and create a feature branch
  2. Write tests for new functionality
  3. Ensure tests pass: pytest -v
  4. Format code: black darus/ tests/
  5. Update documentation if needed
  6. Submit a pull request with a clear description
