A Python package for efficiently loading and querying data from ClickHouse databases with a clean, intuitive API.
- Create a GitHub Personal Access Token (Classic)
  - Go to GitHub Settings → Developer settings → Personal access tokens → Tokens (classic)
  - Click "Generate new token (classic)"
  - Give it a descriptive name (e.g., "DataLoader Package Access")
  - Select the `repo` scope (this grants full control of private repositories)
  - Click "Generate token"
- Install the package:

```bash
pip install git+https://<PAT>@github.com/SSMIF-Quant/dataloader.git@main
```

Replace `<PAT>` with your Personal Access Token. Example:

```bash
pip install git+https://ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxx@github.com/SSMIF-Quant/dataloader.git@main
```

Project structure:

```text
dataloader/
├── __init__.py   # Package initialization
├── _client.py    # ClickHouse client wrapper
├── _manager.py   # Connection manager
├── _env.py       # Environment configuration
├── _pool.py      # Connection pooling
└── loader.py     # Main DataLoader class
```
- `DataLoader`: Main interface for querying data
- `Client`: Handles ClickHouse queries and connections
- `Manager`: Manages database connections and pooling
- `ConnectionPool`: Maintains reusable database connections
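The actual pooling logic lives in `_pool.py`; as a rough illustration of the idea (not the package's API — the `SimplePool` class below is invented for this sketch), a pool pre-creates a fixed set of connections and hands them out for reuse instead of opening a new connection per query:

```python
import queue

class SimplePool:
    """Illustrative connection pool: hands out reusable connections
    from a fixed-size FIFO queue instead of opening one per query."""

    def __init__(self, factory, size=4):
        self._conns = queue.Queue(maxsize=size)
        for _ in range(size):
            self._conns.put(factory())  # pre-create connections up front

    def acquire(self):
        return self._conns.get()  # blocks until a connection is free

    def release(self, conn):
        self._conns.put(conn)  # return the connection for reuse

# Demo with a stand-in "connection" (plain objects instead of sockets)
pool = SimplePool(factory=object, size=2)
c1 = pool.acquire()
c2 = pool.acquire()
pool.release(c1)
c3 = pool.acquire()
print(c3 is c1)  # → True: the released connection is reused
```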
`DataLoader.get_data`

Fetch data from ClickHouse with flexible filtering and column selection.
Parameters:
- `source` (str): Table or materialized view name
- `columns_list` (Optional[List[str]]): Explicit list of columns to select
- `column_pattern` (Optional[List[str]]): Pattern-based column selection (e.g., `["IS_*"]` for all columns starting with "IS_")
- `filters` (Optional[Dict[str, Any]]): Key-value pairs for WHERE clause filtering
- `limit` (Optional[int]): Maximum number of rows to return
- `offset` (Optional[int]): Number of rows to skip

Returns: `pd.DataFrame` with `Date` as index (sorted and deduplicated)
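To make the selection semantics concrete, here is a standalone sketch (no database needed) of how a `column_pattern` like `["IS_*"]` can expand against a table's columns, and how a `filters` dict can translate into a WHERE clause. The helper names (`expand_patterns`, `filters_to_where`) are illustrative, not the package's internals:

```python
from fnmatch import fnmatch

def expand_patterns(all_columns, patterns):
    """Expand shell-style patterns (e.g. 'IS_*') against available columns."""
    return [c for c in all_columns if any(fnmatch(c, p) for p in patterns)]

def filters_to_where(filters):
    """Render a filters dict as a WHERE clause: lists become IN (...)."""
    clauses = []
    for col, val in filters.items():
        if isinstance(val, list):
            quoted = ", ".join(f"'{v}'" for v in val)
            clauses.append(f"{col} IN ({quoted})")
        else:
            clauses.append(f"{col} = '{val}'")
    return "WHERE " + " AND ".join(clauses)

cols = ["date", "symbol", "PX_LAST", "IS_REVENUE", "IS_NET_INCOME", "BS_ASSETS"]
print(expand_patterns(cols, ["IS_*"]))                  # ['IS_REVENUE', 'IS_NET_INCOME']
print(filters_to_where({"symbol": ["AAPL", "MSFT"]}))   # WHERE symbol IN ('AAPL', 'MSFT')
```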
`DataLoader.show_tables`

List all available tables in the database.

Returns: `List[str]` of table names
`DataLoader.show_table_column`

Get all columns for a specific table.

Parameters:

- `source` (str): Table or view name

Returns: `List[str]` of column names
```python
from dataloader import DataLoader

# Get all income statement columns for Apple
equities_data = DataLoader.get_data(
    source="equities",
    column_pattern=["IS_*"],  # All columns starting with "IS_"
    filters={"symbol": "AAPL"}
)
print(equities_data.head())
```

```python
from dataloader import DataLoader

# Get specific macro indicators
macro_data = DataLoader.get_data(
    source="macro",
    columns_list=["XLB", "THREEFYTP10", "DSWP10", "DGS10", "BAMLEMHBHYCRPIOAS", "T10Y2Y"]
)
print(macro_data.info())
```

```python
from dataloader import DataLoader

# Get data for multiple symbols with limit
multi_symbol_data = DataLoader.get_data(
    source="equities",
    columns_list=["PX_LAST", "CUR_MKT_CAP", "PX_VOLUME"],
    filters={"symbol": ["AAPL", "GOOGL", "MSFT"]},
    limit=1000
)
print(multi_symbol_data)
```

```python
from dataloader import DataLoader

# List all available tables
tables = DataLoader.show_tables()
print("Available tables:", tables)

# Get columns for a specific table
equity_columns = DataLoader.show_table_column("equities")
print("Equity columns:", equity_columns)

macro_columns = DataLoader.show_table_column("macro")
print("Macro columns:", macro_columns)
```

```python
from dataloader import DataLoader

# Mix explicit columns and patterns
combined_data = DataLoader.get_data(
    source="equities",
    columns_list=["PX_LAST"],         # Explicit columns
    column_pattern=["IS_*", "BS_*"],  # Income Statement + Balance Sheet columns
    filters={"symbol": "TSLA"}
)
print(combined_data.columns.tolist())
```

```python
from dataloader import DataLoader

# Get data in chunks
page_1 = DataLoader.get_data(
    source="equities",
    columns_list=["PX_LAST"],
    filters={"symbol": "AAPL"},
    limit=100,
    offset=0
)
page_2 = DataLoader.get_data(
    source="equities",
    columns_list=["PX_LAST"],
    filters={"symbol": "AAPL"},
    limit=100,
    offset=100
)
print(f"Page 1: {len(page_1)} rows")
print(f"Page 2: {len(page_2)} rows")
```

✅ Dynamic Column Selection: Use patterns to select multiple columns at once
✅ Flexible Filtering: Support for single values and lists
✅ Automatic Date Handling: Returns DataFrame with Date index
✅ Deduplication: Automatically removes duplicate dates
✅ Connection Pooling: Efficient database connection management
All queries return a `pandas.DataFrame` with:

- `Date` column as the index (datetime type)
- Rows sorted by date (ascending)
- Duplicates removed based on date
- All selected columns as specified
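The post-processing described above (datetime index, ascending sort, de-duplication by date) can be reproduced with plain pandas. This standalone sketch mimics what a raw query result goes through; it is an illustration of the behavior, not the package's actual code path:

```python
import pandas as pd

# A raw result with out-of-order and duplicate dates
raw = pd.DataFrame({
    "date": ["2024-01-03", "2024-01-01", "2024-01-01", "2024-01-02"],
    "PX_LAST": [103.0, 101.0, 101.5, 102.0],
})

df = (
    raw.assign(date=pd.to_datetime(raw["date"]))  # convert to datetime
       .sort_values("date")                       # ascending by date
       .drop_duplicates(subset="date")            # one row per date
       .set_index("date")                         # Date as the index
)
print(df.index.is_monotonic_increasing)  # → True
print(len(df))                           # → 3 (duplicate 2024-01-01 removed)
```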
- The `date` column is automatically included in all queries
- The `symbol` column is included when filtering by multiple symbols
- The columns are renamed for clarity when filtering by a single symbol
- Dates are converted to pandas datetime objects
- Results are always sorted by date
- The default database is `ssmif_quant`
- Connection credentials are managed via environment variables
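One common pattern for environment-based configuration is to fail loudly on missing variables rather than fall back to silent defaults. The variable names below are hypothetical — the actual names are defined in the package's `_env.py`, so check there before relying on these:

```python
import os

# Hypothetical variable names for illustration only; see _env.py for the real ones.
os.environ.setdefault("CLICKHOUSE_HOST", "localhost")
os.environ.setdefault("CLICKHOUSE_USER", "default")

def load_config():
    """Read connection settings from the environment, raising an explicit
    error for anything missing instead of using a silent default."""
    required = ["CLICKHOUSE_HOST", "CLICKHOUSE_USER"]
    missing = [name for name in required if name not in os.environ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    return {name: os.environ[name] for name in required}

print(load_config())
```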
For issues or questions, please contact the SSMIF Quant team or open an issue in the repository.