BigQuery GCS Utils

A minimal template repository for working with Google Cloud Storage (GCS) and BigQuery, specifically designed to handle large files (5-6 GB each) split across multiple GCS buckets.

Features

  • Authentication: Service account JSON key file authentication
  • Dataset Management: Create and manage BigQuery datasets
  • Table Creation: Create external tables or load data from GCS files
  • Large File Support: Handle hundreds of large files (5-6 GB each)
  • Batch Processing: Efficient file discovery and table creation for large file lists
  • Multiple Paths: Combine files from multiple GCS paths/buckets

Quick Start

1. Install Dependencies

macOS/Linux:

pip3 install -r requirements.txt

Windows:

pip install -r requirements.txt

Note: On macOS/Linux, you may need to use pip3 instead of pip depending on your Python installation. If you encounter permission errors, use pip3 install --user -r requirements.txt or install in a virtual environment.

2. Configure Environment Variables

Copy .env.example to .env and fill in your values:

macOS/Linux:

cp .env.example .env

Windows PowerShell:

Copy-Item .env.example .env

Windows Command Prompt:

copy .env.example .env

Required environment variables:

  • GOOGLE_APPLICATION_CREDENTIALS: Path to your service account JSON key file
    • macOS/Linux example: /Users/username/.gcp/service-account-key.json
    • Windows example: C:\Users\YourUsername\.gcp\service-account-key.json
  • GCP_PROJECT_ID: Your Google Cloud Project ID
  • GCS_BUCKET_NAME: Name of your GCS bucket
  • BIGQUERY_DATASET: Name of your BigQuery dataset

Optional:

  • BIGQUERY_DATASET_LOCATION: Dataset location (default: US)
  • GCS_PATH_PREFIX: GCS path prefix for file discovery
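
To confirm the variables are actually visible to Python before running anything else, here is a minimal sketch (it assumes the values have already been exported or loaded from .env, for example by the project's configuration code):

import os

# Required variables listed above; optional ones fall back to documented defaults.
REQUIRED = [
    "GOOGLE_APPLICATION_CREDENTIALS",
    "GCP_PROJECT_ID",
    "GCS_BUCKET_NAME",
    "BIGQUERY_DATASET",
]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")

print("Dataset location:", os.environ.get("BIGQUERY_DATASET_LOCATION", "US"))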

3. Set Up Service Account

  1. Create a service account in Google Cloud Console
  2. Grant the following roles:
    • Storage Object Viewer (for reading GCS objects)
    • Storage Legacy Bucket Reader (for bucket metadata access)
    • BigQuery Data Editor (for creating datasets and tables)
    • BigQuery Job User (for running BigQuery jobs)
  3. Download the JSON key file
  4. Set GOOGLE_APPLICATION_CREDENTIALS to the path of the key file

Setting the path:

macOS/Linux:

# Option 1: Set in .env file (recommended)
# Edit .env and set GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json

# Option 2: Set as environment variable
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"

Windows PowerShell:

# Option 1: Set in .env file (recommended)
# Edit .env and set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\key.json

# Option 2: Set as environment variable
$env:GOOGLE_APPLICATION_CREDENTIALS="C:\path\to\key.json"

Windows Command Prompt:

# Option 1: Set in .env file (recommended)
# Edit .env and set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\key.json

# Option 2: Set as environment variable
set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\key.json

See GCP Setup Guide for detailed instructions.
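
To confirm the key file and roles work before touching the template's classes, a quick check with the standard google-auth and BigQuery client libraries is one option (this sketch uses only the public Google client APIs, not code from this repository):

import os

from google.cloud import bigquery
from google.oauth2 import service_account

# Build credentials explicitly from the key file and open a BigQuery client.
key_path = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
credentials = service_account.Credentials.from_service_account_file(key_path)
client = bigquery.Client(project=os.environ["GCP_PROJECT_ID"],
                         credentials=credentials)

# Listing datasets is a cheap call that exercises both the key file and the
# BigQuery permissions granted to the service account.
print([dataset.dataset_id for dataset in client.list_datasets()])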

4. Run Examples

macOS/Linux:

# Create a dataset
python3 examples/create_dataset_example.py

# Create tables from GCS files
python3 examples/create_table_from_gcs.py

Windows:

# Create a dataset
python examples/create_dataset_example.py

# Create tables from GCS files
python examples/create_table_from_gcs.py

Note: On macOS/Linux, you may need to use python3 instead of python depending on your Python installation.

Usage

Creating a Dataset

from config.gcp_config import GCPConfig
from utils.dataset_manager import DatasetManager

# Initialize
gcp_config = GCPConfig()
dataset_manager = DatasetManager(gcp_config)

# Create dataset
dataset_manager.create_dataset(
    dataset_name="my_dataset",
    location="US",
    description="My dataset description",
    if_exists="ignore"  # or "error" or "overwrite"
)

Creating Tables from GCS Files

External Table (References GCS files, no data movement)

from config.gcp_config import GCPConfig
from utils.table_manager import TableManager

# Initialize
gcp_config = GCPConfig()
table_manager = TableManager(gcp_config)

# Discover files
files = table_manager.discover_gcs_files(
    prefixes=["path/to/data/"],
    file_extensions=['.parquet']
)

# Create external table
table_manager.create_table_from_gcs_files(
    table_name="my_external_table",
    gcs_paths=files,
    source_format="PARQUET",
    table_type="external"
)

Loaded Table (Copies data to BigQuery)

# Create loaded table (data is copied to BigQuery)
table_manager.create_table_from_gcs_files(
    table_name="my_loaded_table",
    gcs_paths=files,
    source_format="PARQUET",
    table_type="loaded",
    write_disposition="WRITE_TRUNCATE"
)

Combining Files from Multiple Paths

# Combine files from multiple GCS paths
table_manager.combine_files_from_multiple_paths(
    table_name="combined_table",
    gcs_paths=[
        "path/to/data1/",
        "path/to/data2/",
        "another/path/"
    ],
    file_extensions=['.parquet'],
    source_format="PARQUET",
    table_type="external"
)

Handling Large File Lists

For very large file lists (hundreds of files), use batch methods:

# Discover files (handles large lists efficiently)
files = table_manager.discover_gcs_files(
    prefixes=["path/to/data/"],
    file_extensions=['.parquet'],
    max_files=None  # None for all files
)

# Create GCS URIs
gcs_uris = gcp_config.create_gcs_uris(files)

# Create external table (BigQuery handles large URI lists well)
table_manager.create_external_table_batch(
    table_name="large_table",
    gcs_uris=gcs_uris,
    source_format="PARQUET"
)

# Or load in batches (for loaded tables)
table_manager.load_files_batch(
    table_name="large_loaded_table",
    gcs_uris=gcs_uris,
    batch_size=50,
    source_format="PARQUET"
)

API Reference

GCPConfig

Main configuration class for GCP authentication and basic operations.

Methods

  • list_gcs_files(prefix, max_results, file_extensions): List files in GCS bucket
  • discover_gcs_files_batch(prefixes, file_extensions, max_files_per_prefix): Discover files across multiple prefixes
  • create_gcs_uris(blob_paths): Convert blob paths to GCS URIs
  • check_bigquery_table_exists(table_name): Check if table exists
  • create_external_table_from_gcs(table_name, gcs_uris, schema, source_format): Create external table
  • load_files_to_bigquery(table_name, gcs_uris, source_format, write_disposition): Load files into BigQuery
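
A sketch combining a few of these calls, based only on the signatures listed above (the argument values, and the assumption that schema may be left as None for auto-detection, are illustrative):

from config.gcp_config import GCPConfig

gcp_config = GCPConfig()

# Convert discovered blob paths into gs:// URIs, then create the external
# table only if it is not already present.
gcs_uris = gcp_config.create_gcs_uris(["path/to/data/file_001.parquet"])

if not gcp_config.check_bigquery_table_exists("my_external_table"):
    gcp_config.create_external_table_from_gcs(
        table_name="my_external_table",
        gcs_uris=gcs_uris,
        schema=None,  # assumption: None lets BigQuery infer the schema
        source_format="PARQUET"
    )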

DatasetManager

Manages BigQuery dataset operations.

Methods

  • create_dataset(dataset_name, location, description, if_exists): Create a dataset
  • dataset_exists(dataset_name): Check if dataset exists
  • delete_dataset(dataset_name, delete_contents): Delete a dataset
  • list_datasets(): List all datasets in project
  • get_dataset_info(dataset_name): Get dataset information
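
The inspection helpers pair naturally with create_dataset; a short sketch using the method names above (return values are printed as-is, since their exact shape is not documented here):

from config.gcp_config import GCPConfig
from utils.dataset_manager import DatasetManager

dataset_manager = DatasetManager(GCPConfig())

# Look before creating: list what exists, then inspect or create "my_dataset".
print(dataset_manager.list_datasets())

if dataset_manager.dataset_exists("my_dataset"):
    print(dataset_manager.get_dataset_info("my_dataset"))
else:
    dataset_manager.create_dataset(
        dataset_name="my_dataset",
        location="US",
        description="My dataset description",
        if_exists="ignore"
    )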

TableManager

Manages BigQuery table creation from GCS files.

Methods

  • discover_gcs_files(prefixes, file_extensions, max_files): Discover files matching criteria
  • create_table_from_gcs_files(table_name, gcs_paths, gcs_uris, source_format, table_type, schema, write_disposition, if_exists): Create table from GCS files
  • create_external_table_batch(table_name, gcs_uris, batch_size, schema, source_format): Create external table from large file list
  • load_files_batch(table_name, gcs_uris, batch_size, source_format, write_disposition): Load files in batches
  • combine_files_from_multiple_paths(table_name, gcs_paths, file_pattern, file_extensions, source_format, table_type): Combine files from multiple paths
  • delete_table(table_name): Delete a table
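
delete_table is the only method not shown in the Usage section; cleaning up a table created by the earlier examples is a short sketch:

from config.gcp_config import GCPConfig
from utils.table_manager import TableManager

table_manager = TableManager(GCPConfig())

# Remove a table created by one of the examples above.
table_manager.delete_table("my_external_table")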

Table Types

External Tables

  • Pros: No data movement, fast setup, always reflects latest GCS data
  • Cons: Query performance may be slower, requires GCS access permissions
  • Use when: Data changes frequently, want to avoid data duplication, query performance is acceptable

Loaded Tables

  • Pros: Better query performance, data is stored in BigQuery
  • Cons: Data is copied (storage costs), setup takes longer for large files
  • Use when: Query performance is critical, data is relatively static

Large File Handling

This template is designed to handle large files (5-6 GB each) efficiently:

  1. File Discovery: Uses efficient GCS listing without downloading files
  2. Batch Processing: Supports batch operations for large file lists
  3. Progress Tracking: Provides progress updates for long-running operations
  4. Error Handling: Includes error recovery and retry logic

Best Practices for Large Files

  • Use external tables when possible (no data movement)
  • For loaded tables, use batch loading for better control
  • Monitor BigQuery job status for long-running operations (see the sketch after this list)
  • Consider splitting very large datasets into multiple tables
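
For the job-monitoring point above, the standard google-cloud-bigquery client is enough; a sketch (the job ID is hypothetical, and this uses the public client API rather than a helper from this repository):

from google.cloud import bigquery

client = bigquery.Client()

# List recent jobs in the project and report their state; handy while a
# large batch load started by load_files_batch() is still running.
for job in client.list_jobs(max_results=10):
    print(job.job_id, job.job_type, job.state)

# For one specific job, block until it finishes and surface any errors.
job = client.get_job("my-load-job-id")  # hypothetical job ID
job.result()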

Platform-Specific Notes

This repository works on Windows, macOS, and Linux. The code uses Python's cross-platform libraries (os, pathlib) for file path handling, so no platform-specific code changes are needed.

Path Formatting

  • macOS/Linux: Use forward slashes (/) in paths
    • Example: /Users/username/.gcp/service-account-key.json
  • Windows: Use backslashes (\) or forward slashes (/) in paths
    • Example: C:\Users\username\.gcp\service-account-key.json or C:/Users/username/.gcp/service-account-key.json
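
As a quick illustration of why no platform-specific changes are needed, pathlib accepts either style; a minimal sketch using the example path from this section:

from pathlib import Path

# Forward slashes work on every platform; on Windows the path prints back
# with the native backslash separator.
key_path = Path("C:/Users/username/.gcp/service-account-key.json")
print(key_path)
print(key_path.suffix)  # .json on any platform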

Python Installation

macOS:

# Using Homebrew (recommended)
brew install python3

# Or download from python.org
# Then install dependencies
pip3 install -r requirements.txt

Windows:

# Download Python from python.org
# Then install dependencies
pip install -r requirements.txt

Linux:

# Using package manager
sudo apt-get install python3 python3-pip  # Debian/Ubuntu
# or
sudo yum install python3 python3-pip      # CentOS/RHEL

# Then install dependencies
pip3 install -r requirements.txt

Troubleshooting

Authentication Errors

Error: "Could not automatically determine credentials"

Solution:

  • Verify GOOGLE_APPLICATION_CREDENTIALS is set correctly
  • Check that the JSON key file path is valid
    • macOS/Linux: Ensure the path uses forward slashes and is absolute (starts with /)
    • Windows: Ensure the path uses backslashes or forward slashes and includes the drive letter
  • Ensure the service account has required permissions
  • On macOS/Linux, make sure the key file is readable by your user and not by others: chmod 600 /path/to/key.json

Permission Errors

Error: "Permission denied" or "Access denied"

Solution:

  • Verify service account has required roles:
    • Storage Object Viewer
    • Storage Legacy Bucket Reader
    • BigQuery Data Editor
    • BigQuery Job User
  • Check bucket and dataset permissions

Dataset Creation Errors

Error: "Dataset already exists"

Solution: Use if_exists="ignore" or if_exists="overwrite" when creating datasets (these match the options shown in the dataset example above)

Table Creation Errors

Error: "Table already exists"

Solution: Use if_exists="replace" or if_exists="ignore" when creating tables

Large File List Errors

Error: "Too many URIs" or timeout errors

Solution:

  • Use load_files_batch() for loaded tables
  • Consider splitting into multiple tables
  • Use external tables which handle large URI lists better

Project Structure

bigquery-gcs-utils/
├── README.md                    # This file
├── .env.example                 # Environment variable template
├── requirements.txt             # Python dependencies
├── config/
│   ├── __init__.py
│   └── gcp_config.py           # GCP configuration class
├── utils/
│   ├── __init__.py
│   ├── dataset_manager.py      # Dataset management utilities
│   └── table_manager.py        # Table creation utilities
└── examples/
    ├── __init__.py
    ├── create_dataset_example.py
    └── create_table_from_gcs.py

License

This is a template repository. Modify as needed for your use case.

Contributing

This is a minimal template. Feel free to extend it with additional features as needed.
