CSV Warehouse

A simple ETL utility that extracts data from multiple CSV files, transforms it, and loads it into a consolidated SQLite database.

Features

  • Batch Import: Import multiple CSV files from a directory in one operation
  • Configurable Mapping: Define column mappings and data type conversions via YAML config
  • Data Cleaning: Automatic null handling and duplicate removal
  • Logging: Comprehensive logging of all ETL operations
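The core idea — read a CSV file and load its rows into a SQLite table — can be sketched with the standard library alone. This is only an illustration of the approach; the function name `load_csv` and its behavior (all columns stored as TEXT, header row taken as column names) are assumptions for this sketch, not the package's actual implementation:

```python
import csv
import sqlite3

def load_csv(db_path, csv_path, table):
    """Load one CSV file into a SQLite table (all columns stored as TEXT)."""
    con = sqlite3.connect(db_path)
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)  # first row supplies the column names
        cols = ", ".join(f'"{c}"' for c in header)
        marks = ", ".join("?" for _ in header)
        con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        # executemany consumes the remaining rows straight from the reader
        con.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    con.commit()
    return con
```

The real package layers column mapping, type conversion, and cleaning on top of this basic extract-and-load step.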

Installation

pip install csv-warehouse

Or install from source:

git clone https://github.com/example/csv-warehouse.git
cd csv-warehouse
pip install -e .

Quick Start

Command Line Usage

# Basic usage - import all CSVs from a directory
csv-warehouse /path/to/csv/files output.db

# With configuration file
csv-warehouse /path/to/csv/files output.db --config config.yaml

# Verbose output
csv-warehouse /path/to/csv/files output.db -v

Python API

from csv_warehouse import Warehouse

# Initialize warehouse
warehouse = Warehouse('my_database.db')

# Import CSV files
warehouse.import_directory('/path/to/csv/files')

# Or import a single file
warehouse.import_file('/path/to/data.csv', table_name='my_table')

# Close the connection
warehouse.close()

Using Configuration

from csv_warehouse import Warehouse, Config

# Load configuration
config = Config.from_yaml('config.yaml')

# Initialize with config
warehouse = Warehouse('my_database.db', config=config)
warehouse.import_directory('/path/to/csv/files')

Configuration

Create a YAML configuration file to customize the ETL process:

# config.yaml
database:
  path: "warehouse.db"

import_settings:
  encoding: "utf-8"
  delimiter: ","
  skip_errors: false

column_mappings:
  # Map source columns to target columns
  "Original Name": "original_name"
  "Email Address": "email"
  "Phone Number": "phone"

data_types:
  # Specify column data types
  id: integer
  created_at: datetime
  price: float
  is_active: boolean

cleaning:
  remove_duplicates: true
  null_handling: "drop"  # or "fill" with fill_value
  fill_value: ""
  trim_whitespace: true

Data Type Conversions

Supported data types for automatic conversion:

  Type      Description           Example
  string    Text data (default)   "hello"
  integer   Whole numbers         42
  float     Decimal numbers       3.14
  boolean   True/False values     true, false, 1, 0
  datetime  Date and time         2024-01-15 10:30:00
  date      Date only             2024-01-15
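A converter table for these types could look like the following sketch. The `CONVERTERS` dict and the exact parsing formats are hypothetical; the package's real conversion logic may differ (e.g. it likely accepts more date formats):

```python
from datetime import datetime

# One callable per supported type; each takes the raw CSV string.
CONVERTERS = {
    "string": str,
    "integer": int,
    "float": float,
    # true/1 -> True; anything else (false, 0, ...) -> False
    "boolean": lambda v: str(v).strip().lower() in ("true", "1"),
    "datetime": lambda v: datetime.strptime(v, "%Y-%m-%d %H:%M:%S"),
    "date": lambda v: datetime.strptime(v, "%Y-%m-%d").date(),
}
```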

Null Handling Options

  • drop: Remove rows with null values
  • fill: Replace nulls with a specified value
  • keep: Keep null values as-is
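The semantics of the three options can be sketched in plain Python. The `handle_nulls` helper below is illustrative only (the library itself works on DataFrames, not lists of dicts):

```python
def handle_nulls(rows, mode, fill_value=""):
    """Illustrate drop / fill / keep semantics on a list of dict rows."""
    if mode == "drop":
        # Discard any row containing a null value
        return [r for r in rows if all(v is not None for v in r.values())]
    if mode == "fill":
        # Replace each null with the configured fill_value
        return [
            {k: (fill_value if v is None else v) for k, v in r.items()}
            for r in rows
        ]
    # keep: leave nulls untouched
    return rows
```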

API Reference

Warehouse

Main class for ETL operations.

Warehouse(db_path: str, config: Config = None)

Methods:

  • import_directory(path: str) -> dict: Import all CSV files from a directory
  • import_file(path: str, table_name: str = None) -> int: Import a single CSV file
  • close(): Close the database connection

Config

Configuration management class.

Config.from_yaml(path: str) -> Config
Config.from_dict(data: dict) -> Config

Transformer

Data transformation utilities.

Transformer(config: Config = None)

Methods:

  • transform(df: DataFrame) -> DataFrame: Apply all transformations
  • clean_nulls(df: DataFrame) -> DataFrame: Handle null values
  • remove_duplicates(df: DataFrame) -> DataFrame: Remove duplicate rows
  • apply_mappings(df: DataFrame) -> DataFrame: Apply column mappings
  • convert_types(df: DataFrame) -> DataFrame: Convert data types
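Conceptually, apply_mappings renames source columns to their configured target names. The dict-based sketch below only illustrates that renaming; the actual method operates on DataFrames:

```python
def apply_mappings(rows, mappings):
    """Rename keys per the mapping; unmapped columns pass through unchanged."""
    return [{mappings.get(k, k): v for k, v in r.items()} for r in rows]
```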

Examples

Importing Sales Data

from csv_warehouse import Warehouse, Config

config = Config.from_dict({
    'column_mappings': {
        'Sale Date': 'sale_date',
        'Product ID': 'product_id',
        'Quantity Sold': 'quantity',
        'Unit Price': 'unit_price'
    },
    'data_types': {
        'sale_date': 'datetime',
        'product_id': 'integer',
        'quantity': 'integer',
        'unit_price': 'float'
    },
    'cleaning': {
        'remove_duplicates': True,
        'null_handling': 'drop'
    }
})

warehouse = Warehouse('sales.db', config=config)
result = warehouse.import_directory('./sales_data/')

print(f"Imported {result['total_rows']} rows from {result['files_processed']} files")

Custom Logging

import logging
from csv_warehouse import Warehouse, setup_logger

# Configure logging
setup_logger(level=logging.DEBUG, log_file='etl.log')

warehouse = Warehouse('data.db')
warehouse.import_directory('./data/')

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

v0.1.0

  • Initial release
  • Batch CSV import
  • Configurable column mapping
  • Data type conversion
  • Null handling and duplicate removal
  • SQLite database support
