A minimal template repository for working with Google Cloud Storage (GCS) and BigQuery, specifically designed to handle large files (5-6 GB each) split across multiple GCS buckets.
- Authentication: Service account JSON key file authentication
- Dataset Management: Create and manage BigQuery datasets
- Table Creation: Create external tables or load data from GCS files
- Large File Support: Handle hundreds of large files (5-6 GB each)
- Batch Processing: Efficient file discovery and table creation for large file lists
- Multiple Paths: Combine files from multiple GCS paths/buckets
macOS/Linux:

```bash
pip3 install -r requirements.txt
```

Windows:

```bash
pip install -r requirements.txt
```

Note: On macOS/Linux, you may need to use `pip3` instead of `pip` depending on your Python installation. If you encounter permission errors, use `pip3 install --user -r requirements.txt` or install in a virtual environment.
Copy .env.example to .env and fill in your values:
macOS/Linux:

```bash
cp .env.example .env
```

Windows PowerShell:

```powershell
Copy-Item .env.example .env
```

Windows Command Prompt:

```cmd
copy .env.example .env
```

Required environment variables:

- `GOOGLE_APPLICATION_CREDENTIALS`: Path to your service account JSON key file
  - macOS/Linux example: `/Users/username/.gcp/service-account-key.json`
  - Windows example: `C:\Users\YourUsername\.gcp\service-account-key.json`
- `GCP_PROJECT_ID`: Your Google Cloud Project ID
- `GCS_BUCKET_NAME`: Name of your GCS bucket
- `BIGQUERY_DATASET`: Name of your BigQuery dataset

Optional:

- `BIGQUERY_DATASET_LOCATION`: Dataset location (default: US)
- `GCS_PATH_PREFIX`: GCS path prefix for file discovery
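For reference, a filled-in `.env` might look like the sketch below; every value is a placeholder to replace with your own key path, project, bucket, and dataset names:

```
GOOGLE_APPLICATION_CREDENTIALS=/Users/username/.gcp/service-account-key.json
GCP_PROJECT_ID=my-gcp-project
GCS_BUCKET_NAME=my-data-bucket
BIGQUERY_DATASET=my_dataset
BIGQUERY_DATASET_LOCATION=US
GCS_PATH_PREFIX=path/to/data/
```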
- Create a service account in Google Cloud Console
- Grant the following roles:
  - Storage Object Viewer (for reading GCS objects)
  - Storage Legacy Bucket Reader (for bucket metadata access)
  - BigQuery Data Editor (for creating datasets and tables)
  - BigQuery Job User (for running BigQuery jobs)
- Download the JSON key file
- Set `GOOGLE_APPLICATION_CREDENTIALS` to the path of the key file
Setting the path:
macOS/Linux:

```bash
# Option 1: Set in .env file (recommended)
# Edit .env and set GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json

# Option 2: Set as environment variable
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"
```

Windows PowerShell:

```powershell
# Option 1: Set in .env file (recommended)
# Edit .env and set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\key.json

# Option 2: Set as environment variable
$env:GOOGLE_APPLICATION_CREDENTIALS="C:\path\to\key.json"
```

Windows Command Prompt:

```cmd
REM Option 1: Set in .env file (recommended)
REM Edit .env and set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\key.json

REM Option 2: Set as environment variable
set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\key.json
```

See GCP Setup Guide for detailed instructions.
macOS/Linux:

```bash
# Create a dataset
python3 examples/create_dataset_example.py

# Create tables from GCS files
python3 examples/create_table_from_gcs.py
```

Windows:

```bash
# Create a dataset
python examples/create_dataset_example.py

# Create tables from GCS files
python examples/create_table_from_gcs.py
```

Note: On macOS/Linux, you may need to use `python3` instead of `python` depending on your Python installation.
```python
from config.gcp_config import GCPConfig
from utils.dataset_manager import DatasetManager

# Initialize
gcp_config = GCPConfig()
dataset_manager = DatasetManager(gcp_config)

# Create dataset
dataset_manager.create_dataset(
    dataset_name="my_dataset",
    location="US",
    description="My dataset description",
    if_exists="ignore"  # or "error" or "overwrite"
)
```

```python
from config.gcp_config import GCPConfig
from utils.table_manager import TableManager

# Initialize
gcp_config = GCPConfig()
table_manager = TableManager(gcp_config)

# Discover files
files = table_manager.discover_gcs_files(
    prefixes=["path/to/data/"],
    file_extensions=['.parquet']
)

# Create external table
table_manager.create_table_from_gcs_files(
    table_name="my_external_table",
    gcs_paths=files,
    source_format="PARQUET",
    table_type="external"
)
```

```python
# Create loaded table (data is copied to BigQuery)
table_manager.create_table_from_gcs_files(
    table_name="my_loaded_table",
    gcs_paths=files,
    source_format="PARQUET",
    table_type="loaded",
    write_disposition="WRITE_TRUNCATE"
)
```

```python
# Combine files from multiple GCS paths
table_manager.combine_files_from_multiple_paths(
    table_name="combined_table",
    gcs_paths=[
        "path/to/data1/",
        "path/to/data2/",
        "another/path/"
    ],
    file_extensions=['.parquet'],
    source_format="PARQUET",
    table_type="external"
)
```

For very large file lists (hundreds of files), use batch methods:

```python
# Discover files (handles large lists efficiently)
files = table_manager.discover_gcs_files(
    prefixes=["path/to/data/"],
    file_extensions=['.parquet'],
    max_files=None  # None for all files
)

# Create GCS URIs
gcs_uris = gcp_config.create_gcs_uris(files)

# Create external table (BigQuery handles large URI lists well)
table_manager.create_external_table_batch(
    table_name="large_table",
    gcs_uris=gcs_uris,
    source_format="PARQUET"
)

# Or load in batches (for loaded tables)
table_manager.load_files_batch(
    table_name="large_loaded_table",
    gcs_uris=gcs_uris,
    batch_size=50,
    source_format="PARQUET"
)
```

GCPConfig (`config/gcp_config.py`) is the main configuration class for GCP authentication and basic operations.
- `list_gcs_files(prefix, max_results, file_extensions)`: List files in GCS bucket
- `discover_gcs_files_batch(prefixes, file_extensions, max_files_per_prefix)`: Discover files across multiple prefixes
- `create_gcs_uris(blob_paths)`: Convert blob paths to GCS URIs
- `check_bigquery_table_exists(table_name)`: Check if table exists
- `create_external_table_from_gcs(table_name, gcs_uris, schema, source_format)`: Create external table
- `load_files_to_bigquery(table_name, gcs_uris, source_format, write_disposition)`: Load files into BigQuery
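A brief sketch of using these helpers directly is shown below. The method and parameter names come from the list above; the prefix and table name are placeholders, and omitting `max_results` and `schema` assumes those parameters have sensible defaults:

```python
from config.gcp_config import GCPConfig

gcp_config = GCPConfig()

# List Parquet files under a prefix and convert them to gs:// URIs
files = gcp_config.list_gcs_files(
    prefix="path/to/data/",  # placeholder prefix
    file_extensions=['.parquet']
)
gcs_uris = gcp_config.create_gcs_uris(files)

# Create the external table only if it does not already exist
if not gcp_config.check_bigquery_table_exists("my_external_table"):
    gcp_config.create_external_table_from_gcs(
        table_name="my_external_table",
        gcs_uris=gcs_uris,
        source_format="PARQUET"
    )
```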
DatasetManager (`utils/dataset_manager.py`) manages BigQuery dataset operations.
- `create_dataset(dataset_name, location, description, if_exists)`: Create a dataset
- `dataset_exists(dataset_name)`: Check if dataset exists
- `delete_dataset(dataset_name, delete_contents)`: Delete a dataset
- `list_datasets()`: List all datasets in project
- `get_dataset_info(dataset_name)`: Get dataset information
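As a sketch, a small housekeeping script built on these methods might look like the following; the dataset names are placeholders, printing the return values assumes they are simple objects, and calling `create_dataset` without `description`/`if_exists` assumes those parameters have defaults:

```python
from config.gcp_config import GCPConfig
from utils.dataset_manager import DatasetManager

gcp_config = GCPConfig()
dataset_manager = DatasetManager(gcp_config)

# See what already exists in the project
for dataset in dataset_manager.list_datasets():
    print(dataset)

# Create the dataset only if it is missing, then inspect it
if not dataset_manager.dataset_exists("my_dataset"):
    dataset_manager.create_dataset(dataset_name="my_dataset", location="US")
print(dataset_manager.get_dataset_info("my_dataset"))

# Remove a scratch dataset together with any tables it contains
dataset_manager.delete_dataset("scratch_dataset", delete_contents=True)
```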
TableManager (`utils/table_manager.py`) manages BigQuery table creation from GCS files.
- `discover_gcs_files(prefixes, file_extensions, max_files)`: Discover files matching criteria
- `create_table_from_gcs_files(table_name, gcs_paths, gcs_uris, source_format, table_type, schema, write_disposition, if_exists)`: Create table from GCS files
- `create_external_table_batch(table_name, gcs_uris, batch_size, schema, source_format)`: Create external table from large file list
- `load_files_batch(table_name, gcs_uris, batch_size, source_format, write_disposition)`: Load files in batches
- `combine_files_from_multiple_paths(table_name, gcs_paths, file_pattern, file_extensions, source_format, table_type)`: Combine files from multiple paths
- `delete_table(table_name)`: Delete a table
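For example, rebuilding a table from scratch could combine `delete_table` with the creation call. This is a sketch: the table name and prefix are placeholders, and `check_bigquery_table_exists` comes from GCPConfig above:

```python
from config.gcp_config import GCPConfig
from utils.table_manager import TableManager

gcp_config = GCPConfig()
table_manager = TableManager(gcp_config)

# Drop the previous version of the table if it exists
if gcp_config.check_bigquery_table_exists("my_external_table"):
    table_manager.delete_table("my_external_table")

# Recreate it from the first 100 matching files under the prefix
files = table_manager.discover_gcs_files(
    prefixes=["path/to/data/"],
    file_extensions=['.parquet'],
    max_files=100
)
table_manager.create_table_from_gcs_files(
    table_name="my_external_table",
    gcs_paths=files,
    source_format="PARQUET",
    table_type="external"
)
```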
External tables (`table_type="external"`):

- Pros: No data movement, fast setup, always reflects latest GCS data
- Cons: Query performance may be slower, requires GCS access permissions
- Use when: Data changes frequently, you want to avoid data duplication, and query performance is acceptable

Loaded tables (`table_type="loaded"`):

- Pros: Better query performance, data is stored in BigQuery
- Cons: Data is copied (storage costs), setup takes longer for large files
- Use when: Query performance is critical and data is relatively static
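Under the hood, the two table types correspond to two different google-cloud-bigquery calls. The sketch below is not this template's implementation, just a minimal illustration of the difference using the raw client; the project, dataset, table, and URI names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()
uris = ["gs://my-data-bucket/path/to/data/*.parquet"]  # placeholder URIs

# External table: BigQuery only stores a pointer to the GCS files
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = uris
table = bigquery.Table("my-project.my_dataset.my_external_table")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# Loaded table: the data is copied into BigQuery storage
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
load_job = client.load_table_from_uri(
    uris, "my-project.my_dataset.my_loaded_table", job_config=job_config
)
load_job.result()  # wait for the load to finish
```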
This template is designed to handle large files (5-6 GB each) efficiently:
- File Discovery: Uses efficient GCS listing without downloading files
- Batch Processing: Supports batch operations for large file lists
- Progress Tracking: Provides progress updates for long-running operations
- Error Handling: Includes error recovery and retry logic
- Use external tables when possible (no data movement)
- For loaded tables, use batch loading for better control
- Monitor BigQuery job status for long-running operations
- Consider splitting very large datasets into multiple tables
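On monitoring job status: if you need more visibility than the template's helpers provide, you can inspect the underlying BigQuery load job directly. A minimal sketch with the google-cloud-bigquery client; the project, dataset, table, and file names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Start a load job and check its status instead of blocking immediately
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
job = client.load_table_from_uri(
    ["gs://my-data-bucket/path/to/data/part-000.parquet"],
    "my-project.my_dataset.my_loaded_table",
    job_config=job_config,
)

print(job.job_id, job.state)  # e.g. PENDING or RUNNING
job.result(timeout=600)       # block until done (raises if the job failed or timed out)
print(job.state)              # DONE
```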
This repository works on Windows, macOS, and Linux. The code uses Python's cross-platform libraries (os, pathlib) for file path handling, so no platform-specific code changes are needed.
- macOS/Linux: Use forward slashes (`/`) in paths
  - Example: `/Users/username/.gcp/service-account-key.json`
- Windows: Use backslashes (`\`) or forward slashes (`/`) in paths
  - Example: `C:\Users\username\.gcp\service-account-key.json` or `C:/Users/username/.gcp/service-account-key.json`
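As a quick illustration of why no platform-specific changes are needed, pathlib accepts forward slashes everywhere and renders paths with the native separator (the key path here is a placeholder):

```python
from pathlib import Path

# Forward slashes work on every platform; pathlib prints the
# native separator for the current OS.
key_path = Path("C:/Users/username/.gcp") / "service-account-key.json"
print(key_path)           # C:\Users\username\.gcp\service-account-key.json on Windows
print(key_path.exists())  # same call on Windows, macOS, and Linux
```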
macOS:

```bash
# Using Homebrew (recommended)
brew install python3
# Or download from python.org

# Then install dependencies
pip3 install -r requirements.txt
```

Windows:

```bash
# Download Python from python.org
# Then install dependencies
pip install -r requirements.txt
```

Linux:

```bash
# Using package manager
sudo apt-get install python3 python3-pip  # Debian/Ubuntu
# or
sudo yum install python3 python3-pip  # CentOS/RHEL

# Then install dependencies
pip3 install -r requirements.txt
```

Error: "Could not automatically determine credentials"
Solution:
- Verify `GOOGLE_APPLICATION_CREDENTIALS` is set correctly
- Check that the JSON key file path is valid
  - macOS/Linux: Ensure the path uses forward slashes and is absolute (starts with `/`)
  - Windows: Ensure the path uses backslashes or forward slashes and includes the drive letter
- Ensure the service account has required permissions
- On macOS/Linux, verify file permissions allow reading: `chmod 600 /path/to/key.json`
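A quick way to confirm the first two points from Python (a minimal sketch; it only reports what your environment currently provides):

```python
import os

# Confirm the variable is set and points at a readable file
key_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
print("GOOGLE_APPLICATION_CREDENTIALS =", key_path)
print("File exists:", bool(key_path) and os.path.isfile(key_path))
print("Readable:", bool(key_path) and os.access(key_path, os.R_OK))
```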
Error: "Permission denied" or "Access denied"
Solution:
- Verify service account has required roles:
  - Storage Object Viewer
  - Storage Legacy Bucket Reader
  - BigQuery Data Editor
  - BigQuery Job User
- Check bucket and dataset permissions
Error: "Dataset already exists"
Solution: Use if_exists="ignore" or if_exists="replace" when creating datasets
Error: "Table already exists"
Solution: Use if_exists="replace" or if_exists="ignore" when creating tables
Error: "Too many URIs" or timeout errors
Solution:
- Use `load_files_batch()` for loaded tables
- Consider splitting into multiple tables
- Use external tables, which handle large URI lists better
```
bigquery-gcs-utils/
├── README.md                    # This file
├── .env.example                 # Environment variable template
├── requirements.txt             # Python dependencies
├── config/
│   ├── __init__.py
│   └── gcp_config.py            # GCP configuration class
├── utils/
│   ├── __init__.py
│   ├── dataset_manager.py       # Dataset management utilities
│   └── table_manager.py         # Table creation utilities
└── examples/
    ├── __init__.py
    ├── create_dataset_example.py
    └── create_table_from_gcs.py
```
This is a template repository. Modify as needed for your use case.
This is a minimal template. Feel free to extend it with additional features as needed.