feat: implement modular email extractor #1

Open
Neidel wants to merge 1 commit into RemiBp:master from Neidel:feature/email-extractor
Neidel (Collaborator) commented Jun 24, 2025

This commit introduces a comprehensive business email extraction system for the CHOICE platform, designed to support cold outreach campaigns by extracting valid business emails from producer websites.

New Features

Core Email Extractor (email_extractor.py)

  • CLI entrypoint with test and production modes
  • MongoDB integration for producer data input
  • CSV output with specified schema: place_name, website_url, email, category
  • Parallel processing with configurable thread pools
  • Built-in test mode that processes the first 10 MongoDB records and writes the CSV output automatically

Modular Architecture (email_extractor_modules/)

constants.py

  • Centralized configuration constants and regex patterns
  • Email filtering patterns for business email validation
  • Default timeouts, thread counts, and processing parameters

email_extraction.py

  • Multi-method email extraction system:
    • Primary: Selenium WebDriver with stealth anti-detection
    • Fallback: Async HTTP extraction using aiohttp
    • Alternative: extract-emails library integration
  • Intelligent subpage discovery (contact, about, footer pages)
  • Robust error handling and retry mechanisms
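
The primary/fallback/alternative ordering above can be sketched as a chain of extractor callables tried in order. This is only an illustration under assumptions, not the code in this PR: `extract_with_fallback` and the two stub extractors are hypothetical names, and the real extractors would wrap Selenium, aiohttp, and extract-emails.

```python
from typing import Callable, Iterable

# Hypothetical type: an extractor takes a URL and returns the emails it found.
Extractor = Callable[[str], set[str]]


def extract_with_fallback(url: str, extractors: Iterable[Extractor]) -> set[str]:
    """Try each extractor in order and return the first non-empty result."""
    for extractor in extractors:
        try:
            emails = extractor(url)
        except Exception:
            # A failing method should not abort the chain; try the next one.
            continue
        if emails:
            return emails
    return set()


# Stand-in extractors for illustration only.
def selenium_stub(url: str) -> set[str]:
    raise TimeoutError("browser did not respond")


def http_stub(url: str) -> set[str]:
    return {"contact@example.com"}


found = extract_with_fallback("https://example.com", [selenium_stub, http_stub])
print(found)
```

Here the first stub raises, so the chain falls through to the second one. Whether the actual module stops at the first non-empty result or merges results from all methods is not stated in this description.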

filtering.py

  • Business email validation and filtering
  • Excludes: noreply@, social platforms, generic domains (gmail, yahoo, etc.)
  • Supports all TLD patterns (.com, .net, .fr, etc.) via regex
  • Removes placeholder and invalid email addresses
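
A minimal sketch of the kind of filter described above, assuming a simple syntax regex plus exclusion sets. The pattern, the set contents, and the name `is_business_email` are illustrative assumptions; the real values live in `constants.py` and may differ.

```python
import re

# Assumed pattern: accepts any alphabetic TLD of two or more letters.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

# Illustrative exclusion lists, not the PR's actual constants.
EXCLUDED_LOCAL_PARTS = {"noreply", "no-reply", "donotreply"}
EXCLUDED_DOMAINS = {"gmail.com", "yahoo.com", "facebook.com", "example.com"}


def is_business_email(email: str) -> bool:
    """Return True if the address looks valid and is not on an exclusion list."""
    email = email.strip().lower()
    if not EMAIL_RE.match(email):
        return False
    local, _, domain = email.partition("@")
    if local in EXCLUDED_LOCAL_PARTS:
        return False
    return domain not in EXCLUDED_DOMAINS


for candidate in ["contact@boulangerie-dupont.fr", "noreply@shop.com", "someone@gmail.com"]:
    print(candidate, is_business_email(candidate))
```

Keeping the exclusion lists as plain sets makes them easy to extend without touching the matching logic.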

io_utils.py

  • MongoDB data loading with field normalization
  • CSV output writing with proper encoding
  • Handles various input field name variations (website vs url vs site_web)
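
Field normalization and CSV writing could look roughly like the sketch below. The three website key names and the four CSV columns come from this description; the lookup order, the `name` field, and the function names are assumptions made for the example.

```python
import csv
import io

CSV_FIELDS = ["place_name", "website_url", "email", "category"]
# Field-name variants mentioned in the description; the lookup order is an assumption.
WEBSITE_KEYS = ("website", "url", "site_web")


def normalize_record(doc: dict) -> dict:
    """Map a MongoDB-style document onto a single website_url field."""
    website = next((doc[k] for k in WEBSITE_KEYS if doc.get(k)), "")
    return {
        "place_name": doc.get("name", ""),  # "name" is a hypothetical source field
        "website_url": website,
        "category": doc.get("category", ""),
    }


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize rows using the output schema, header first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


row = normalize_record({"name": "Chez Anna", "site_web": "chez-anna.fr", "category": "restaurant"})
row["email"] = "contact@chez-anna.fr"
print(rows_to_csv([row]))
```

When writing to disk instead of a string buffer, opening the file with `newline=""` and `encoding="utf-8"` avoids blank lines on Windows and keeps accented place names intact.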

pipeline.py

  • Processing orchestration for different execution modes:
    • Single entry testing with detailed logging
    • Threaded batch processing for Selenium extraction
    • Async batch processing for high-speed HTTP extraction
    • Progress tracking and performance metrics
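
The threaded batch mode can be approximated with `concurrent.futures`. This is a sketch under assumptions: `process_batch` and its `extract` callback are hypothetical, entries are assumed to be hashable (for example URL strings), and the default of 5 workers mirrors the Selenium default quoted in this description.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_batch(entries, extract, max_workers=5):
    """Run extract(entry) across a thread pool and collect results per entry."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract, entry): entry for entry in entries}
        for done, future in enumerate(as_completed(futures), start=1):
            entry = futures[future]
            try:
                results[entry] = future.result()
            except Exception as exc:
                # One failing site should not stop the rest of the batch.
                results[entry] = []
                print(f"{entry}: failed ({exc})")
            print(f"progress: {done}/{len(futures)}")
    return results


batch = process_batch(["a.fr", "b.fr"], lambda url: [f"contact@{url}"], max_workers=2)
print(batch)
```

Using `as_completed` reports progress as each site finishes rather than in submission order, which is useful when some pages take much longer than others.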

env.py

  • Environment setup and warning suppression
  • Selenium/Chrome driver noise reduction
  • Logging configuration for clean output
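
As a rough idea of what such an environment setup involves, the sketch below uses only the standard library. The function name `configure_environment` is hypothetical, and the logger names are assumed to follow the `selenium` and `urllib3` package names; the PR's `env.py` may silence different sources.

```python
import logging
import warnings


def configure_environment(level: int = logging.INFO) -> None:
    """Quiet common noise sources and set a compact log format."""
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    logging.basicConfig(level=level, format="%(levelname)s %(message)s")
    # Assumed logger names, matching the package names of the libraries.
    for noisy in ("selenium", "urllib3"):
        logging.getLogger(noisy).setLevel(logging.ERROR)


configure_environment()
logging.getLogger(__name__).info("environment configured")
```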

integration.py

  • Shared email extraction utilities for cross-script integration
  • Designed for use by other CHOICE platform scripts:
    • wellness.py (beauty/wellness venues)
    • billetreduc_shotgun_mistral.py (event venues)
    • Future platform extensions

Technical Implementation

Email Extraction Methods

  • Selenium WebDriver: Primary method with JavaScript execution, handles SPAs
  • Async HTTP: Fast fallback using aiohttp for static content
  • extract-emails: Library-based extraction with additional coverage
  • Subpage Crawling: Automatic discovery of contact/about pages via link analysis
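
Subpage discovery via link analysis can be sketched with the standard library alone. The keyword list and the names `LinkCollector` and `find_subpages` are assumptions for illustration; the PR does not show its actual matching rules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed keywords; the PR's actual list is not shown.
SUBPAGE_KEYWORDS = ("contact", "about", "a-propos", "mentions-legales")


class LinkCollector(HTMLParser):
    """Collect the href value of every anchor tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_subpages(base_url: str, html: str) -> list[str]:
    """Return same-site links whose path contains one of the keywords."""
    parser = LinkCollector()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    found = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        same_site = parsed.netloc == base_host
        if same_site and any(k in parsed.path.lower() for k in SUBPAGE_KEYWORDS):
            if absolute not in found:
                found.append(absolute)
    return found


page = '<a href="/contact">Contact</a><a href="https://other.com/about">Other</a><a href="/menu">Menu</a>'
print(find_subpages("https://example.fr/", page))
```

Restricting matches to the same host keeps the crawler from following links to social networks or partner sites.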

Filtering & Validation

  • Comprehensive business email filtering excluding non-business addresses
  • Supports all TLDs rather than restricting matches to .fr domains
  • Deduplication and validation pipeline ensuring clean output
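
Deduplication is simple enough to show in a few lines. This sketch assumes addresses are compared case-insensitively after trimming whitespace, which is a common practical choice but not confirmed by the PR; `dedupe_emails` is a hypothetical name.

```python
def dedupe_emails(emails):
    """Drop case-insensitive duplicates while keeping first-seen order."""
    seen = set()
    unique = []
    for email in emails:
        key = email.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


print(dedupe_emails(["Info@Shop.fr", "info@shop.fr ", "", "sales@shop.fr"]))
```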

Performance & Scalability

  • Configurable parallel processing (default: 5 threads for Selenium)
  • Batch processing for large datasets with progress tracking
  • Async processing option for speed-critical operations
  • Built-in performance metrics and efficiency reporting
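
The async option presumably bounds how many requests run at once. A minimal sketch of that pattern with `asyncio` follows; `process_async`, the `fetch` callback, and the concurrency value are assumptions, and `fake_fetch` stands in for a real aiohttp request.

```python
import asyncio


async def process_async(urls, fetch, concurrency=10):
    """Run an async fetch(url) over all URLs with bounded concurrency."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception:
                # Treat a failed request as "no emails found" for that URL.
                return url, []

    pairs = await asyncio.gather(*(guarded(u) for u in urls))
    return dict(pairs)


async def fake_fetch(url):
    # Placeholder for a real HTTP call.
    await asyncio.sleep(0)
    return [f"contact@{url}"]


result = asyncio.run(process_async(["a.fr", "b.fr"], fake_fetch, concurrency=2))
print(result)
```

The semaphore caps in-flight requests so a large production run does not open thousands of connections at the same time.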

Integration Ready

  • Modular design following SOLID principles
  • DRY implementation preventing code duplication across platform scripts
  • Clean separation of concerns enabling selective feature usage
  • Comprehensive error handling and logging for production stability

Usage Examples

# Test mode (default) - processes first 10 MongoDB records
python email_extractor.py

# Production mode with full MongoDB processing
python email_extractor.py --production --output emails.csv

# Single site testing
python email_extractor.py --test "Restaurant Name" "restaurant-website.com"

# Async mode for faster processing
python email_extractor.py --production --output emails.csv --use-async
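
For reviewers who want to see how these flags might be wired, here is an `argparse` sketch. The flag names are taken from the commands above; the help strings, the metavar names, and the `build_parser` function are assumptions, and the actual entrypoint may be structured differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser accepting the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(prog="email_extractor.py")
    parser.add_argument("--production", action="store_true",
                        help="process all MongoDB records instead of the first 10")
    parser.add_argument("--output", help="path of the CSV file to write")
    parser.add_argument("--test", nargs=2, metavar=("NAME", "WEBSITE"),
                        help="extract emails for a single site")
    parser.add_argument("--use-async", action="store_true",
                        help="use the async HTTP pipeline")
    return parser


args = build_parser().parse_args(["--production", "--output", "emails.csv", "--use-async"])
print(args.production, args.output, args.use_async, args.test)
```

With no arguments, `production` is false, which matches the description of test mode as the default.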

Output Schema

CSV format with fields: place_name, website_url, email, category

Dependencies

  • selenium, aiohttp, extract-emails, pymongo
  • Chrome/Chromium browser for WebDriver functionality
  • MongoDB connection for producer data input

This implementation establishes the foundation for systematic business email collection across the CHOICE platform ecosystem while maintaining code reusability and performance.
