feat: implement modular email extractor by Neidel · Pull Request #1 · RemiBp/choice-scraping

Neidel · 2025-06-24T19:51:09Z

This commit introduces a comprehensive business email extraction system for the CHOICE platform, designed to support cold outreach campaigns by extracting valid business emails from producer websites.

New Features

Core Email Extractor (`email_extractor.py`)

CLI entrypoint with test and production modes
MongoDB integration for producer data input
CSV output with specified schema: place_name, website_url, email, category
Parallel processing with configurable thread pools
Built-in test mode that processes the first 10 MongoDB records and writes the CSV output automatically
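
A minimal sketch of how the CLI entrypoint could parse its modes. The flag names come from the usage examples in this description; `build_parser` and `select_mode` are hypothetical names for illustration and are not taken from `email_extractor.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        description="Extract business emails from producer websites"
    )
    parser.add_argument("--production", action="store_true",
                        help="process every MongoDB record instead of the first 10")
    parser.add_argument("--output", default=None,
                        help="path of the CSV file to write")
    parser.add_argument("--test", nargs=2, metavar=("NAME", "URL"),
                        help="run against a single place name and website")
    parser.add_argument("--use-async", action="store_true",
                        help="use the async HTTP path instead of Selenium")
    return parser


def select_mode(args: argparse.Namespace) -> str:
    """Return 'single', 'production', or 'test' (the default)."""
    if args.test:
        return "single"
    return "production" if args.production else "test"


if __name__ == "__main__":
    print(select_mode(build_parser().parse_args()))
```

Keeping mode selection in a small pure function makes the default (test mode) easy to check without touching MongoDB.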

Modular Architecture (`email_extractor_modules/`)

`constants.py`

Centralized configuration constants and regex patterns
Email filtering patterns for business email validation
Default timeouts, thread counts, and processing parameters

`email_extraction.py`

Multi-method email extraction system:
- Primary: Selenium WebDriver with stealth anti-detection
- Fallback: Async HTTP extraction using aiohttp
- Alternative: extract-emails library integration
Intelligent subpage discovery (contact, about, footer pages)
Robust error handling and retry mechanisms
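
The primary/fallback/alternative ordering with retries can be sketched as below. This is an illustration under assumptions, not the code in `email_extraction.py`: each method is modelled as a plain callable taking a URL, and `extract_with_fallback` is a hypothetical name.

```python
from typing import Callable, Iterable, Optional, Set, Tuple

# Hypothetical shape: an extractor takes a URL and returns the emails it found.
Extractor = Callable[[str], Set[str]]


def extract_with_fallback(
    url: str,
    extractors: Iterable[Tuple[str, Extractor]],
    retries: int = 1,
) -> Tuple[Optional[str], Set[str]]:
    """Try each method in order and return (method_name, emails) for the first hit."""
    for name, extractor in extractors:
        for _attempt in range(retries + 1):
            try:
                emails = extractor(url)
            except Exception:
                continue  # the method failed: retry it until attempts run out
            if emails:
                return name, emails
            break  # the method ran cleanly but found nothing: try the next one
    return None, set()
```

In the real module the three callables would presumably wrap Selenium, aiohttp, and the extract-emails library respectively.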

`filtering.py`

Business email validation and filtering
Excludes noreply@ addresses, social platform addresses, and generic mailbox providers (gmail, yahoo, etc.)
Supports all TLD patterns (.com, .net, .fr, etc.) via regex
Removes placeholder and invalid email addresses
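
A hedged sketch of the kind of check `filtering.py` describes. The regex, the blocked prefixes, and the blocked domains below are illustrative assumptions; the actual patterns live in `constants.py` and may differ.

```python
import re

# Assumed pattern: any ASCII alphabetic TLD of two or more letters.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Illustrative lists only, not the project's real configuration.
BLOCKED_LOCAL_PREFIXES = ("noreply", "no-reply", "donotreply")
BLOCKED_DOMAINS = {
    "gmail.com", "yahoo.com",        # generic mailbox providers
    "facebook.com", "instagram.com", # social platforms
    "example.com",                   # placeholder domain
}


def is_business_email(address: str) -> bool:
    """Return True if the address looks like a usable business email."""
    address = address.strip().lower()
    if not EMAIL_RE.fullmatch(address):
        return False
    local, _, domain = address.rpartition("@")
    if local.startswith(BLOCKED_LOCAL_PREFIXES):
        return False
    return domain not in BLOCKED_DOMAINS
```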

`io_utils.py`

MongoDB data loading with field normalization
CSV output writing with proper encoding
Handles various input field name variations (website vs url vs site_web)
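
Field normalization could look roughly like this. The website key variants are the ones listed above; the `name` and `category` source keys are assumptions about the MongoDB documents, and `normalize_record` is a hypothetical name.

```python
# Website key variants named in this description, in lookup order.
WEBSITE_KEYS = ("website", "url", "site_web")


def normalize_record(doc: dict) -> dict:
    """Map a raw producer document onto the fields the extractor needs."""
    # Take the first website-like key that holds a non-empty value.
    website = next(
        (str(doc[key]).strip() for key in WEBSITE_KEYS if doc.get(key)),
        "",
    )
    return {
        "place_name": doc.get("name", ""),    # assumed source key
        "website_url": website,
        "category": doc.get("category", ""),  # assumed source key
    }
```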

`pipeline.py`

Processing orchestration for different execution modes:
- Single entry testing with detailed logging
- Threaded batch processing for Selenium extraction
- Async batch processing for high-speed HTTP extraction
- Progress tracking and performance metrics
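
Threaded batch processing with progress output can be sketched with the standard library. This assumes each entry is a dict with a `place_name` key and that `extract_one` is whatever per-site function the pipeline uses; both names are hypothetical here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_threaded(entries, extract_one, max_workers=5):
    """Run extract_one over entries in a thread pool, reporting progress."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its entry so failures can be attributed.
        futures = {pool.submit(extract_one, entry): entry for entry in entries}
        for done, future in enumerate(as_completed(futures), start=1):
            entry = futures[future]
            try:
                emails = future.result()
            except Exception:
                emails = set()  # one failing site should not stop the batch
            results.append((entry, emails))
            print(f"[{done}/{len(futures)}] {entry['place_name']}: {len(emails)} email(s)")
    return results
```

Results arrive in completion order, not input order, so callers that need stable output should sort them afterwards.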

`env.py`

Environment setup and warning suppression
Selenium/Chrome driver noise reduction
Logging configuration for clean output
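
As a rough sketch of this kind of setup, using only the standard library: the logger names `selenium` and `urllib3` are the ones those packages log under, while the function name and the choice of levels are assumptions rather than the contents of `env.py`.

```python
import logging
import warnings

# Loggers that tend to be chatty during WebDriver sessions.
NOISY_LOGGERS = ("selenium", "urllib3")


def configure_environment(level: int = logging.INFO) -> None:
    """Quiet third-party noise and set a compact log format."""
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    logging.basicConfig(level=level, format="%(asctime)s %(levelname)s %(message)s")
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(logging.WARNING)
```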

`integration.py`

Shared email extraction utilities for cross-script integration
Designed for use by other CHOICE platform scripts:
- wellness.py (beauty/wellness venues)
- billetreduc_shotgun_mistral.py (event venues)
- Future platform extensions

Technical Implementation

Email Extraction Methods

Selenium WebDriver: Primary method with JavaScript execution, handles SPAs
Async HTTP: Fast fallback using aiohttp for static content
extract-emails: Library-based extraction with additional coverage
Subpage Crawling: Automatic discovery of contact/about pages via link analysis
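
Subpage discovery via link analysis can be illustrated with the standard-library HTML parser. The keyword list and the same-host rule are assumptions; the PR does not spell out the exact heuristics, and `discover_subpages` is a hypothetical name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed keywords; the real module may match more variants.
SUBPAGE_KEYWORDS = ("contact", "about")


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def discover_subpages(base_url: str, html: str) -> list:
    """Return same-host links whose path mentions a contact/about keyword."""
    collector = _LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    found = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.netloc != base_host:
            continue  # skips external sites and mailto: links
        if any(k in parsed.path.lower() for k in SUBPAGE_KEYWORDS) and absolute not in found:
            found.append(absolute)
    return found
```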

Filtering & Validation

Comprehensive business email filtering excluding non-business addresses
Supports all international TLDs rather than restricting matches to .fr domains
Deduplication and validation pipeline ensuring clean output
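
One way to picture the TLD-agnostic matching and deduplication step is shown below. The regex is an assumption (it accepts ASCII alphabetic TLDs only), and `unique_emails` is a hypothetical helper, not a function from this PR.

```python
import re

# Assumed pattern: no restriction to a specific TLD such as .fr.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def unique_emails(text: str) -> list:
    """Find emails in text, lowercase them, and drop duplicates in order."""
    seen = {}
    for match in EMAIL_RE.findall(text):
        seen.setdefault(match.lower(), None)  # dict keys keep first-seen order
    return list(seen)
```

Lowercasing before deduplication treats `Contact@Bistro.fr` and `contact@bistro.fr` as the same address.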

Performance & Scalability

Configurable parallel processing (default: 5 threads for Selenium)
Batch processing for large datasets with progress tracking
Async processing option for speed-critical operations
Built-in performance metrics and efficiency reporting
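
The async option can be sketched as bounded-concurrency fetching. To stay self-contained, this version takes the fetch coroutine as a parameter; in the real pipeline that would presumably wrap an aiohttp session. `run_async` and `max_concurrency` are hypothetical names.

```python
import asyncio


async def run_async(urls, fetch, max_concurrency=20):
    """Fetch every URL concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception:
                return url, ""  # treat a failed fetch as an empty page

    # gather preserves the order of the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))
```

A caller would run it with `asyncio.run(run_async(urls, fetch))` and pass each returned page body to the email matcher.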

Integration Ready

Modular design following SOLID principles
DRY implementation preventing code duplication across platform scripts
Clean separation of concerns enabling selective feature usage
Comprehensive error handling and logging for production stability

Usage Examples

# Test mode (default) - processes first 10 MongoDB records
python email_extractor.py

# Production mode with full MongoDB processing
python email_extractor.py --production --output emails.csv

# Single site testing
python email_extractor.py --test "Restaurant Name" "restaurant-website.com"

# Async mode for faster processing
python email_extractor.py --production --output emails.csv --use-async

Output Schema

CSV format with fields: place_name, website_url, email, category
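
For illustration, writing rows in this schema with the standard `csv` module might look like the following. The field order is the one stated above; `write_results` is a hypothetical helper and the sample row is made up.

```python
import csv
import io

CSV_FIELDS = ["place_name", "website_url", "email", "category"]


def write_results(rows, stream) -> None:
    """Write result dicts to an open text stream using the output schema."""
    writer = csv.DictWriter(stream, fieldnames=CSV_FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_results(
        [{"place_name": "Café Zoé", "website_url": "https://cafe-zoe.example",
          "email": "contact@cafe-zoe.example", "category": "restaurant"}],
        buffer,
    )
    print(buffer.getvalue())
```

When writing to disk, opening the file with `open(path, "w", newline="", encoding="utf-8")` keeps accented place names intact and avoids blank lines on Windows.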

Dependencies

selenium, aiohttp, extract-emails, pymongo
Chrome/Chromium browser for WebDriver functionality
MongoDB connection for producer data input

This implementation establishes the foundation for systematic business email collection across the CHOICE platform ecosystem while maintaining code reusability and performance.


Neidel wants to merge 1 commit into RemiBp:master from Neidel:feature/email-extractor
