A simple ETL utility that extracts data from multiple CSV files, transforms it, and loads it into a consolidated SQLite database.
- Batch Import: Import multiple CSV files from a directory in one operation
- Configurable Mapping: Define column mappings and data type conversions via YAML config
- Data Cleaning: Automatic null handling and duplicate removal
- Logging: Comprehensive logging of all ETL operations
```bash
pip install csv-warehouse
```

Or install from source:
```bash
git clone https://github.com/example/csv-warehouse.git
cd csv-warehouse
pip install -e .
```

```bash
# Basic usage - import all CSVs from a directory
csv-warehouse /path/to/csv/files output.db

# With a configuration file
csv-warehouse /path/to/csv/files output.db --config config.yaml

# Verbose output
csv-warehouse /path/to/csv/files output.db -v
```

```python
from csv_warehouse import Warehouse

# Initialize the warehouse
warehouse = Warehouse('my_database.db')

# Import CSV files
warehouse.import_directory('/path/to/csv/files')

# Or import a single file
warehouse.import_file('/path/to/data.csv', table_name='my_table')

# Close the connection
warehouse.close()
```

```python
from csv_warehouse import Warehouse, Config

# Load configuration
config = Config.from_yaml('config.yaml')

# Initialize with config
warehouse = Warehouse('my_database.db', config=config)
warehouse.import_directory('/path/to/csv/files')
```

Create a YAML configuration file to customize the ETL process:
```yaml
# config.yaml
database:
  path: "warehouse.db"

import_settings:
  encoding: "utf-8"
  delimiter: ","
  skip_errors: false

column_mappings:
  # Map source columns to target columns
  "Original Name": "original_name"
  "Email Address": "email"
  "Phone Number": "phone"

data_types:
  # Specify column data types
  id: integer
  created_at: datetime
  price: float
  is_active: boolean

cleaning:
  remove_duplicates: true
  null_handling: "drop"  # or "fill" with fill_value
  fill_value: ""
  trim_whitespace: true
```

Supported data types for automatic conversion:
| Type | Description | Example |
|---|---|---|
| `string` | Text data (default) | `"hello"` |
| `integer` | Whole numbers | `42` |
| `float` | Decimal numbers | `3.14` |
| `boolean` | True/False values | `true`, `false`, `1`, `0` |
| `datetime` | Date and time | `2024-01-15 10:30:00` |
| `date` | Date only | `2024-01-15` |
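As a rough illustration of how conversions like these can be applied, here is a stdlib-only sketch of a per-type converter. The function name `convert` and the accepted boolean spellings are assumptions for illustration, not csv-warehouse's actual API:

```python
from datetime import datetime

# Hypothetical boolean spellings; the library may accept a different set.
_TRUE = {"true", "1", "yes"}
_FALSE = {"false", "0", "no"}

def convert(value: str, type_name: str):
    """Convert a raw CSV string to the requested type from the table above."""
    if type_name == "integer":
        return int(value)
    if type_name == "float":
        return float(value)
    if type_name == "boolean":
        v = value.strip().lower()
        if v in _TRUE:
            return True
        if v in _FALSE:
            return False
        raise ValueError(f"not a boolean: {value!r}")
    if type_name == "datetime":
        return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    if type_name == "date":
        return datetime.strptime(value, "%Y-%m-%d").date()
    return value  # "string" is the default: pass through unchanged
```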
- `drop`: Remove rows with null values
- `fill`: Replace nulls with the specified `fill_value`
- `keep`: Keep null values as-is
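The three strategies can be sketched on plain Python rows. This is a simplified stand-in for the library's DataFrame-based cleaning; treating both `None` and empty strings as nulls is an assumption:

```python
def handle_nulls(rows, strategy="drop", fill_value=""):
    """Apply a null-handling strategy to a list of dict rows (illustrative only)."""
    def is_null(v):
        # Assumption: None and "" both count as null.
        return v is None or v == ""

    if strategy == "drop":
        # Remove any row containing at least one null value.
        return [r for r in rows if not any(is_null(v) for v in r.values())]
    if strategy == "fill":
        # Replace every null with the configured fill_value.
        return [{k: (fill_value if is_null(v) else v) for k, v in r.items()}
                for r in rows]
    return rows  # "keep": leave nulls untouched
```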
Main class for ETL operations.

```python
Warehouse(db_path: str, config: Config = None)
```

Methods:

- `import_directory(path: str) -> dict`: Import all CSV files from a directory
- `import_file(path: str, table_name: str = None) -> int`: Import a single CSV file
- `close()`: Close the database connection
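For intuition, here is a minimal stdlib sketch of what a directory import can look like, using `csv` and `sqlite3`. It illustrates the general pattern (one table per file, named after the file), not csv-warehouse's actual implementation:

```python
import csv
import sqlite3
from pathlib import Path

def import_directory(db_path: str, directory: str) -> dict:
    """Load every *.csv in `directory` into a table named after the file."""
    conn = sqlite3.connect(db_path)
    total_rows = files = 0
    for csv_path in sorted(Path(directory).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            header = next(reader)
            # SQLite allows untyped columns; real code would map data_types here.
            cols = ", ".join(f'"{c}"' for c in header)
            table = csv_path.stem
            conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
            placeholders = ", ".join("?" for _ in header)
            rows = list(reader)
            conn.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
            total_rows += len(rows)
            files += 1
    conn.commit()
    conn.close()
    return {"total_rows": total_rows, "files_processed": files}
```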
Configuration management class.

```python
Config.from_yaml(path: str) -> Config
Config.from_dict(data: dict) -> Config
```

Data transformation utilities.

```python
Transformer(config: Config = None)
```

Methods:

- `transform(df: DataFrame) -> DataFrame`: Apply all transformations
- `clean_nulls(df: DataFrame) -> DataFrame`: Handle null values
- `remove_duplicates(df: DataFrame) -> DataFrame`: Remove duplicate rows
- `apply_mappings(df: DataFrame) -> DataFrame`: Apply column mappings
- `convert_types(df: DataFrame) -> DataFrame`: Convert data types
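To make the transformation pipeline concrete, here is a toy transformer over lists of dicts. The class name, the order in which steps run, and the dict-based rows are all assumptions for illustration; the real `Transformer` operates on DataFrames:

```python
class MiniTransformer:
    """Toy stand-in for Transformer, operating on lists of dicts."""

    def __init__(self, mappings=None, remove_duplicates=True):
        self.mappings = mappings or {}
        self.dedupe = remove_duplicates

    def apply_mappings(self, rows):
        # Rename keys according to the column mappings.
        return [{self.mappings.get(k, k): v for k, v in r.items()}
                for r in rows]

    def remove_duplicates(self, rows):
        # Keep the first occurrence of each identical row.
        seen, out = set(), []
        for r in rows:
            key = tuple(sorted(r.items()))
            if key not in seen:
                seen.add(key)
                out.append(r)
        return out

    def transform(self, rows):
        # Assumed step order: mappings first, then deduplication.
        rows = self.apply_mappings(rows)
        if self.dedupe:
            rows = self.remove_duplicates(rows)
        return rows
```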
```python
from csv_warehouse import Warehouse, Config

config = Config.from_dict({
    'column_mappings': {
        'Sale Date': 'sale_date',
        'Product ID': 'product_id',
        'Quantity Sold': 'quantity',
        'Unit Price': 'unit_price'
    },
    'data_types': {
        'sale_date': 'datetime',
        'product_id': 'integer',
        'quantity': 'integer',
        'unit_price': 'float'
    },
    'cleaning': {
        'remove_duplicates': True,
        'null_handling': 'drop'
    }
})

warehouse = Warehouse('sales.db', config=config)
result = warehouse.import_directory('./sales_data/')
print(f"Imported {result['total_rows']} rows from {result['files_processed']} files")
```

```python
import logging
from csv_warehouse import Warehouse, setup_logger

# Configure logging
setup_logger(level=logging.DEBUG, log_file='etl.log')

warehouse = Warehouse('data.db')
warehouse.import_directory('./data/')
```

Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Initial release
- Batch CSV import
- Configurable column mapping
- Data type conversion
- Null handling and duplicate removal
- SQLite database support