# Bilka2Go Scraper

A web scraper for Bilka2Go (Danish supermarket chain) that extracts product information from various categories and stores the data in Google BigQuery.

## Features

- Multi-category scraping: Scrapes products from 22 different categories
- Structured data extraction: Extracts product details including name, price, producer, quantity, and labels
- BigQuery integration: Automatically stores scraped data in Google BigQuery
- Docker support: Containerized application for easy deployment
- CI/CD pipeline: Automated testing and deployment with GitHub Actions
- Caching: Built-in caching for improved performance
- Robust error handling: Comprehensive logging and error management
## Prerequisites

- Python 3.12+
- Google Cloud Platform account with BigQuery API enabled
- Docker (optional, for containerized deployment)
## Installation

1. Clone the repository

   ```bash
   git clone <repository-url>
   cd bilka2go-scraper
   ```

2. Install UV (recommended package manager)

   ```bash
   pip install uv
   ```

3. Create and activate a virtual environment

   ```bash
   uv venv
   source .venv/bin/activate
   ```

4. Install dependencies

   ```bash
   uv sync
   # or
   uv pip install -e .
   ```

5. Install Playwright browsers

   ```bash
   playwright install
   ```
## Docker Deployment

1. Build the Docker image

   ```bash
   docker build -t bilka2go-scraper .
   ```

2. Run the container

   ```bash
   docker run \
     -v $(pwd)/key.json:/usr/local/appuser/key.json \
     -e GOOGLE_APPLICATION_CREDENTIALS=/usr/local/appuser/key.json \
     bilka2go-scraper
   ```
## Configuration

Create a `.env` file in the root directory with the following variables:

```env
# Google Cloud Configuration
GOOGLE_CLOUD_PROJECT_ID=your-gcp-project-id
GOOGLE_CLOUD_BIGQUERY_DATASET=your-bq-dataset-name
```
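A minimal sketch of how these variables could be loaded at startup with python-dotenv (a listed dependency); this is illustrative and falls back to plain environment variables if the package is absent — the actual loading code in `src/main.py` may differ:

```python
import os

try:
    # python-dotenv is a listed project dependency
    from dotenv import load_dotenv
    load_dotenv()  # reads .env from the current working directory
except ImportError:
    pass  # fall back to variables already set in the environment

project_id = os.getenv("GOOGLE_CLOUD_PROJECT_ID", "")
dataset = os.getenv("GOOGLE_CLOUD_BIGQUERY_DATASET", "")
```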
### Google Cloud Setup

1. Create a Google Cloud project

   - Go to the Google Cloud Console
   - Create a new project or select an existing one

2. Enable the BigQuery API

   - Navigate to APIs & Services > Library
   - Search for "BigQuery API" and enable it

3. Create a service account

   - Go to IAM & Admin > Service Accounts
   - Create a new service account with the BigQuery Admin role
   - Download the JSON key file and save it as `key.json` in the project root

4. Set up the BigQuery dataset

   - The scraper will automatically create the dataset and table if they don't exist
   - Or create them manually in the BigQuery console
## Usage

The scraper supports various command line arguments:

```bash
# Make sure your virtual environment is activated
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

python src/main.py [OPTIONS]
```

Options:

- `--category`: Specify a category to scrape (default: all)
- `--headless`: Run the browser in headless mode (default: True)
- `--verbose`: Enable verbose logging (default: False)
- `--log-level`: Set the logging level (default: INFO)
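For illustration, here is one way these options could be wired up with `argparse`; `build_parser` is a hypothetical helper, and the real parser in `src/main.py` may be implemented differently (in particular, `BooleanOptionalAction` is an assumption that also provides a `--no-headless` flag):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented CLI options."""
    parser = argparse.ArgumentParser(description="Bilka2Go scraper")
    parser.add_argument("--category", default="all",
                        help="category to scrape (default: all)")
    # BooleanOptionalAction generates both --headless and --no-headless
    parser.add_argument("--headless", action=argparse.BooleanOptionalAction,
                        default=True, help="run the browser headless")
    parser.add_argument("--verbose", action="store_true",
                        help="enable verbose logging")
    parser.add_argument("--log-level", default="INFO",
                        help="logging level (default: INFO)")
    return parser

args = build_parser().parse_args(["--category", "fruits-and-vegetables"])
```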
### Examples

```bash
# Activate virtual environment first
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Scrape all categories (default)
python src/main.py

# Scrape a specific category
python src/main.py --category fruits-and-vegetables

# Run with verbose logging
python src/main.py --verbose

# Run with a different log level
python src/main.py --log-level DEBUG

# Run in non-headless mode (with visible browser)
python src/main.py --headless
```

## Categories

The scraper supports the following categories:
| Danish Name | English Translation |
|---|---|
| frugt-og-groent | fruits-and-vegetables |
| koed-og-fisk | meat-and-fish |
| mejeri-og-koel | dairy-and-chilled |
| drikkevarer | beverages |
| broed-og-kager | bread-and-cakes |
| kolonial | groceries |
| mad-fra-hele-verden | world-food |
| slik-og-snacks | sweets-and-snacks |
| frost | frozen-food |
| kiosk | kiosk |
| dyremad | pet-food |
| husholdning | household |
| personlig-pleje | personal-care |
| baby-og-boern | baby-and-children |
| bolig-og-koekken | home-and-kitchen |
| fritid-og-sport | leisure-and-sport |
| toej-og-sko | clothing-and-shoes |
| elektronik | electronics |
| have | garden |
| leg | toys |
| byggemarked | hardware-store |
| biludstyr | car-accessories |
## Extracted Data

The scraper extracts the following information for each product:

```json
{
  "name": "Product Name",
  "price": "Price in DKK",
  "image_url": "Product image URL",
  "product_url": "Product page URL",
  "producer": "Brand/Producer",
  "quantity": "Package size/quantity",
  "price_per_unit": "Price per unit (kg, L, etc.)",
  "label1": "Product label 1",
  "label2": "Product label 2",
  "label3": "Product label 3",
  "category": "Product category",
  "scraped_at": "Timestamp"
}
```

## Project Structure

```
src/
├── main.py                    # Main scraper logic
├── config/                    # Configuration files (empty)
├── models/                    # Data models (empty)
├── services/                  # Business logic services (empty)
├── storage/                   # Scraped data storage (JSON files)
│   ├── baby-and-children/
│   ├── beverages/
│   ├── bread-and-cakes/
│   └── ...
└── utils/
    ├── __init__.py
    └── bigquery_connector.py  # BigQuery integration
```
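For illustration, a row matching the extracted-data schema above could be assembled like this; `build_row` is a hypothetical helper, and the actual row shaping lives in `src/utils/bigquery_connector.py`:

```python
from datetime import datetime, timezone

def build_row(name: str, price: str, category: str, **optional: str) -> dict:
    """Assemble one product row with the field names from the schema above."""
    return {
        "name": name,
        "price": price,
        "image_url": optional.get("image_url"),
        "product_url": optional.get("product_url"),
        "producer": optional.get("producer"),
        "quantity": optional.get("quantity"),
        "price_per_unit": optional.get("price_per_unit"),
        "label1": optional.get("label1"),
        "label2": optional.get("label2"),
        "label3": optional.get("label3"),
        "category": category,
        # timestamp recorded at scrape time, in UTC ISO-8601 form
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }

row = build_row("Æbler", "12.95", "fruits-and-vegetables", producer="Bilka")
```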
## CI/CD Pipeline

The project includes a GitHub Actions workflow that handles:

- Testing: Runs tests on Python 3.12
- Docker Build: Builds the Docker image for pull requests
- Docker Push: Pushes the image to Google Artifact Registry on the main branch
Configure the following secrets in your GitHub repository:

- `SERVICE_ACCOUNT`: GCP service account email
- `PROJECT_ID`: Google Cloud project ID
- `SERVICE_ACCOUNT_KEY`: Service account JSON key
- `GAR_REGION`: Google Artifact Registry region
- `GAR_REPO`: Google Artifact Registry repository name
## Development

Always activate your virtual environment before development:

```bash
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

To deactivate when done:

```bash
deactivate
```

## Dependencies

- crawl4ai: Web scraping framework with a Playwright backend
- Google Cloud BigQuery: Data warehouse for storing scraped data
- loguru: Advanced logging
- python-dotenv: Environment variable management
### Adding a New Category

1. Add the Danish category name to the `CATEGORIES_DK` list
2. Add the translation to the `CATEGORIES_TRANSLATED` dictionary
3. Update the README with the new category
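A sketch of what these two structures might look like after adding a category; the names `CATEGORIES_DK` and `CATEGORIES_TRANSLATED` come from the steps above, the entries shown are a subset of the table in this README, and `vin-og-spiritus` is a hypothetical new category used only as an example:

```python
CATEGORIES_DK = [
    "frugt-og-groent",
    "koed-og-fisk",
    "drikkevarer",
    "vin-og-spiritus",  # hypothetical newly added category
]

CATEGORIES_TRANSLATED = {
    "frugt-og-groent": "fruits-and-vegetables",
    "koed-og-fisk": "meat-and-fish",
    "drikkevarer": "beverages",
    "vin-og-spiritus": "wine-and-spirits",  # translation for the new entry
}

def translate(category_dk: str) -> str:
    """Map a Danish category slug to its English translation."""
    return CATEGORIES_TRANSLATED.get(category_dk, category_dk)
```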
### Customizing Extracted Fields

Modify the `EXTRACTION_STRATEGY` in `main.py` to add or change extracted fields:

```python
EXTRACTION_STRATEGY = JsonCssExtractionStrategy(
    schema={
        "name": "product_list",
        "baseSelector": "div.product-item",
        "fields": [
            {
                "name": "new_field",
                "selector": "css-selector",
                "type": "text",  # or "attribute"
            },
            # ... existing fields
        ],
    }
)
```
## Troubleshooting

1. Playwright browser not found

   ```bash
   # Make sure the virtual environment is activated
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   playwright install
   ```

2. BigQuery authentication error

   - Ensure `GOOGLE_APPLICATION_CREDENTIALS` points to a valid service account key
   - Verify the service account has BigQuery Admin permissions

3. Memory issues

   - The scraper includes built-in delays and rate limiting
   - Adjust timeouts in the crawler configuration if needed
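The kind of delay-and-retry behavior mentioned above can be sketched with a small backoff helper; `retry_with_backoff` is hypothetical and the scraper's actual delays live in its crawler configuration:

```python
import time

def retry_with_backoff(func, retries: int = 3, base_delay: float = 0.1):
    """Call func, sleeping base_delay * 2**attempt between failed attempts."""
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the last error
            time.sleep(base_delay * 2 ** attempt)

# Demonstration with a function that fails twice, then succeeds
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient error")
    return "ok"

result = retry_with_backoff(flaky)
```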
## Logging

Logs are written to the console only. To view logs in real time or save them to a file:

```bash
# View logs in real time and save to file
python src/main.py | tee scraper.log

# Save logs to file only
python src/main.py > scraper.log 2>&1

# Search for errors in saved logs
grep -i error scraper.log
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## Support

For support, please open an issue in the GitHub repository or contact the maintainers.
Note: This scraper is for educational and research purposes. Please respect the website's robots.txt and terms of service when using this tool.