BigQuery GCS Utils

A minimal template repository for working with Google Cloud Storage (GCS) and BigQuery, specifically designed to handle large files (5-6 GB each) split across multiple GCS buckets.

Features

  • Authentication: Service account JSON key file authentication
  • Dataset Management: Create and manage BigQuery datasets
  • Table Creation: Create external tables or load data from GCS files
  • Large File Support: Handle hundreds of large files (5-6 GB each)
  • Batch Processing: Efficient file discovery and table creation for large file lists
  • Multiple Paths: Combine files from multiple GCS paths/buckets

Quick Start

1. Install Dependencies

macOS/Linux:

pip3 install -r requirements.txt

Windows:

pip install -r requirements.txt

Note: On macOS/Linux, you may need to use pip3 instead of pip depending on your Python installation. If you encounter permission errors, use pip3 install --user -r requirements.txt or install in a virtual environment.

2. Configure Environment Variables

Copy .env.example to .env and fill in your values:

macOS/Linux:

cp .env.example .env

Windows PowerShell:

Copy-Item .env.example .env

Windows Command Prompt:

copy .env.example .env

Required environment variables:

  • GOOGLE_APPLICATION_CREDENTIALS: Path to your service account JSON key file
    • macOS/Linux example: /Users/username/.gcp/service-account-key.json
    • Windows example: C:\Users\YourUsername\.gcp\service-account-key.json
  • GCP_PROJECT_ID: Your Google Cloud Project ID
  • GCS_BUCKET_NAME: Name of your GCS bucket
  • BIGQUERY_DATASET: Name of your BigQuery dataset

Optional:

  • BIGQUERY_DATASET_LOCATION: Dataset location (default: US)
  • GCS_PATH_PREFIX: GCS path prefix for file discovery
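
To confirm the variables are actually visible to Python before running anything else, here is a minimal sketch (it assumes the values have already been exported or loaded from .env, for example by the project's configuration code):

import os

# Required variables listed above; optional ones fall back to documented defaults.
REQUIRED = [
    "GOOGLE_APPLICATION_CREDENTIALS",
    "GCP_PROJECT_ID",
    "GCS_BUCKET_NAME",
    "BIGQUERY_DATASET",
]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")

print("Dataset location:", os.environ.get("BIGQUERY_DATASET_LOCATION", "US"))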

3. Set Up Service Account

  1. Create a service account in Google Cloud Console
  2. Grant the following roles:
    • Storage Object Viewer (for reading GCS objects)
    • Storage Legacy Bucket Reader (for bucket metadata access)
    • BigQuery Data Editor (for creating datasets and tables)
    • BigQuery Job User (for running BigQuery jobs)
  3. Download the JSON key file
  4. Set GOOGLE_APPLICATION_CREDENTIALS to the path of the key file

Setting the path:

macOS/Linux:

# Option 1: Set in .env file (recommended)
# Edit .env and set GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json

# Option 2: Set as environment variable
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"

Windows PowerShell:

# Option 1: Set in .env file (recommended)
# Edit .env and set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\key.json

# Option 2: Set as environment variable
$env:GOOGLE_APPLICATION_CREDENTIALS="C:\path\to\key.json"

Windows Command Prompt:

# Option 1: Set in .env file (recommended)
# Edit .env and set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\key.json

# Option 2: Set as environment variable
set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\key.json

See GCP Setup Guide for detailed instructions.
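
To confirm the key file and roles work before touching the template's classes, a quick check with the standard google-auth and BigQuery client libraries is one option (this sketch uses only the public Google client APIs, not code from this repository):

import os

from google.cloud import bigquery
from google.oauth2 import service_account

# Build credentials explicitly from the key file and open a BigQuery client.
key_path = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
credentials = service_account.Credentials.from_service_account_file(key_path)
client = bigquery.Client(project=os.environ["GCP_PROJECT_ID"],
                         credentials=credentials)

# Listing datasets is a cheap call that exercises both the key file and the
# BigQuery permissions granted to the service account.
print([dataset.dataset_id for dataset in client.list_datasets()])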

4. Run Examples

macOS/Linux:

# Create a dataset
python3 examples/create_dataset_example.py

# Create tables from GCS files
python3 examples/create_table_from_gcs.py

Windows:

# Create a dataset
python examples/create_dataset_example.py

# Create tables from GCS files
python examples/create_table_from_gcs.py

Note: On macOS/Linux, you may need to use python3 instead of python depending on your Python installation.

Usage

Creating a Dataset

from config.gcp_config import GCPConfig
from utils.dataset_manager import DatasetManager

# Initialize
gcp_config = GCPConfig()
dataset_manager = DatasetManager(gcp_config)

# Create dataset
dataset_manager.create_dataset(
    dataset_name="my_dataset",
    location="US",
    description="My dataset description",
    if_exists="ignore"  # or "error" or "overwrite"
)

Creating Tables from GCS Files

External Table (References GCS files, no data movement)

from config.gcp_config import GCPConfig
from utils.table_manager import TableManager

# Initialize
gcp_config = GCPConfig()
table_manager = TableManager(gcp_config)

# Discover files
files = table_manager.discover_gcs_files(
    prefixes=["path/to/data/"],
    file_extensions=['.parquet']
)

# Create external table
table_manager.create_table_from_gcs_files(
    table_name="my_external_table",
    gcs_paths=files,
    source_format="PARQUET",
    table_type="external"
)

Loaded Table (Copies data to BigQuery)

# Create loaded table (data is copied to BigQuery)
table_manager.create_table_from_gcs_files(
    table_name="my_loaded_table",
    gcs_paths=files,
    source_format="PARQUET",
    table_type="loaded",
    write_disposition="WRITE_TRUNCATE"
)

Combining Files from Multiple Paths

# Combine files from multiple GCS paths
table_manager.combine_files_from_multiple_paths(
    table_name="combined_table",
    gcs_paths=[
        "path/to/data1/",
        "path/to/data2/",
        "another/path/"
    ],
    file_extensions=['.parquet'],
    source_format="PARQUET",
    table_type="external"
)

Handling Large File Lists

For very large file lists (hundreds of files), use batch methods:

# Discover files (handles large lists efficiently)
files = table_manager.discover_gcs_files(
    prefixes=["path/to/data/"],
    file_extensions=['.parquet'],
    max_files=None  # None for all files
)

# Create GCS URIs
gcs_uris = gcp_config.create_gcs_uris(files)

# Create external table (BigQuery handles large URI lists well)
table_manager.create_external_table_batch(
    table_name="large_table",
    gcs_uris=gcs_uris,
    source_format="PARQUET"
)

# Or load in batches (for loaded tables)
table_manager.load_files_batch(
    table_name="large_loaded_table",
    gcs_uris=gcs_uris,
    batch_size=50,
    source_format="PARQUET"
)

API Reference

GCPConfig

Main configuration class for GCP authentication and basic operations.

Methods

  • list_gcs_files(prefix, max_results, file_extensions): List files in GCS bucket
  • discover_gcs_files_batch(prefixes, file_extensions, max_files_per_prefix): Discover files across multiple prefixes
  • create_gcs_uris(blob_paths): Convert blob paths to GCS URIs
  • check_bigquery_table_exists(table_name): Check if table exists
  • create_external_table_from_gcs(table_name, gcs_uris, schema, source_format): Create external table
  • load_files_to_bigquery(table_name, gcs_uris, source_format, write_disposition): Load files into BigQuery
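
A sketch combining a few of these calls, based only on the signatures listed above (the argument values, and the assumption that schema may be left as None for auto-detection, are illustrative):

from config.gcp_config import GCPConfig

gcp_config = GCPConfig()

# Convert discovered blob paths into gs:// URIs, then create the external
# table only if it is not already present.
gcs_uris = gcp_config.create_gcs_uris(["path/to/data/file_001.parquet"])

if not gcp_config.check_bigquery_table_exists("my_external_table"):
    gcp_config.create_external_table_from_gcs(
        table_name="my_external_table",
        gcs_uris=gcs_uris,
        schema=None,  # assumption: None lets BigQuery infer the schema
        source_format="PARQUET"
    )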

DatasetManager

Manages BigQuery dataset operations.

Methods

  • create_dataset(dataset_name, location, description, if_exists): Create a dataset
  • dataset_exists(dataset_name): Check if dataset exists
  • delete_dataset(dataset_name, delete_contents): Delete a dataset
  • list_datasets(): List all datasets in project
  • get_dataset_info(dataset_name): Get dataset information
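
The inspection helpers pair naturally with create_dataset; a short sketch using the method names above (return values are printed as-is, since their exact shape is not documented here):

from config.gcp_config import GCPConfig
from utils.dataset_manager import DatasetManager

dataset_manager = DatasetManager(GCPConfig())

# Look before creating: list what exists, then inspect or create "my_dataset".
print(dataset_manager.list_datasets())

if dataset_manager.dataset_exists("my_dataset"):
    print(dataset_manager.get_dataset_info("my_dataset"))
else:
    dataset_manager.create_dataset(
        dataset_name="my_dataset",
        location="US",
        description="My dataset description",
        if_exists="ignore"
    )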

TableManager

Manages BigQuery table creation from GCS files.

Methods

  • discover_gcs_files(prefixes, file_extensions, max_files): Discover files matching criteria
  • create_table_from_gcs_files(table_name, gcs_paths, gcs_uris, source_format, table_type, schema, write_disposition, if_exists): Create table from GCS files
  • create_external_table_batch(table_name, gcs_uris, batch_size, schema, source_format): Create external table from large file list
  • load_files_batch(table_name, gcs_uris, batch_size, source_format, write_disposition): Load files in batches
  • combine_files_from_multiple_paths(table_name, gcs_paths, file_pattern, file_extensions, source_format, table_type): Combine files from multiple paths
  • delete_table(table_name): Delete a table
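
delete_table is the only method not shown in the Usage section; cleaning up a table created by the earlier examples is a short sketch:

from config.gcp_config import GCPConfig
from utils.table_manager import TableManager

table_manager = TableManager(GCPConfig())

# Remove a table created by one of the examples above.
table_manager.delete_table("my_external_table")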

Table Types

External Tables

  • Pros: No data movement, fast setup, always reflects latest GCS data
  • Cons: Query performance may be slower, requires GCS access permissions
  • Use when: Data changes frequently, want to avoid data duplication, query performance is acceptable

Loaded Tables

  • Pros: Better query performance, data is stored in BigQuery
  • Cons: Data is copied (storage costs), setup takes longer for large files
  • Use when: Query performance is critical, data is relatively static

Large File Handling

This template is designed to handle large files (5-6 GB each) efficiently:

  1. File Discovery: Uses efficient GCS listing without downloading files
  2. Batch Processing: Supports batch operations for large file lists
  3. Progress Tracking: Provides progress updates for long-running operations
  4. Error Handling: Includes error recovery and retry logic

Best Practices for Large Files

  • Use external tables when possible (no data movement)
  • For loaded tables, use batch loading for better control
  • Monitor BigQuery job status for long-running operations (see the sketch after this list)
  • Consider splitting very large datasets into multiple tables
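
For the job-monitoring point above, the standard google-cloud-bigquery client is enough; a sketch (the job ID is hypothetical, and this uses the public client API rather than a helper from this repository):

from google.cloud import bigquery

client = bigquery.Client()

# List recent jobs in the project and report their state; handy while a
# large batch load started by load_files_batch() is still running.
for job in client.list_jobs(max_results=10):
    print(job.job_id, job.job_type, job.state)

# For one specific job, block until it finishes and surface any errors.
job = client.get_job("my-load-job-id")  # hypothetical job ID
job.result()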

Platform-Specific Notes

This repository works on Windows, macOS, and Linux. The code uses Python's cross-platform libraries (os, pathlib) for file path handling, so no platform-specific code changes are needed.

Path Formatting

  • macOS/Linux: Use forward slashes (/) in paths
    • Example: /Users/username/.gcp/service-account-key.json
  • Windows: Use backslashes (\) or forward slashes (/) in paths
    • Example: C:\Users\username\.gcp\service-account-key.json or C:/Users/username/.gcp/service-account-key.json
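
As a quick illustration of why no platform-specific changes are needed, pathlib accepts either style; a minimal sketch using the example path from this section:

from pathlib import Path

# Forward slashes work on every platform; on Windows the path prints back
# with the native backslash separator.
key_path = Path("C:/Users/username/.gcp/service-account-key.json")
print(key_path)
print(key_path.suffix)  # .json on any platform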

Python Installation

macOS:

# Using Homebrew (recommended)
brew install python3

# Or download from python.org
# Then install dependencies
pip3 install -r requirements.txt

Windows:

# Download Python from python.org
# Then install dependencies
pip install -r requirements.txt

Linux:

# Using package manager
sudo apt-get install python3 python3-pip  # Debian/Ubuntu
# or
sudo yum install python3 python3-pip      # CentOS/RHEL

# Then install dependencies
pip3 install -r requirements.txt

Troubleshooting

Authentication Errors

Error: "Could not automatically determine credentials"

Solution:

  • Verify GOOGLE_APPLICATION_CREDENTIALS is set correctly
  • Check that the JSON key file path is valid
    • macOS/Linux: Ensure the path uses forward slashes and is absolute (starts with /)
    • Windows: Ensure the path uses backslashes or forward slashes and includes the drive letter
  • Ensure the service account has required permissions
  • On macOS/Linux, make sure the key file is readable by your user and not by others: chmod 600 /path/to/key.json

Permission Errors

Error: "Permission denied" or "Access denied"

Solution:

  • Verify service account has required roles:
    • Storage Object Viewer
    • Storage Legacy Bucket Reader
    • BigQuery Data Editor
    • BigQuery Job User
  • Check bucket and dataset permissions

Dataset Creation Errors

Error: "Dataset already exists"

Solution: Use if_exists="ignore" or if_exists="overwrite" when creating datasets (these match the options shown in the dataset example above)

Table Creation Errors

Error: "Table already exists"

Solution: Use if_exists="replace" or if_exists="ignore" when creating tables

Large File List Errors

Error: "Too many URIs" or timeout errors

Solution:

  • Use load_files_batch() for loaded tables
  • Consider splitting into multiple tables
  • Use external tables which handle large URI lists better

Project Structure

bigquery-gcs-utils/
├── README.md                    # This file
├── .env.example                 # Environment variable template
├── requirements.txt             # Python dependencies
├── config/
│   ├── __init__.py
│   └── gcp_config.py           # GCP configuration class
├── utils/
│   ├── __init__.py
│   ├── dataset_manager.py      # Dataset management utilities
│   └── table_manager.py        # Table creation utilities
└── examples/
    ├── __init__.py
    ├── create_dataset_example.py
    └── create_table_from_gcs.py

License

This is a template repository. Modify as needed for your use case.

Contributing

This is a minimal template. Feel free to extend it with additional features as needed.
