…multi-method extraction

This commit introduces a comprehensive business email extraction system for the CHOICE platform, designed to support cold outreach campaigns by extracting valid business emails from producer websites.

## New Features

### Core Email Extractor (`email_extractor.py`)

- CLI entrypoint with test and production modes
- MongoDB integration for producer data input
- CSV output with specified schema: place_name, website_url, email, category
- Parallel processing with configurable thread pools
- Built-in test mode processing first 10 MongoDB records with automatic CSV output

### Modular Architecture (`email_extractor_modules/`)

#### `constants.py`

- Centralized configuration constants and regex patterns
- Email filtering patterns for business email validation
- Default timeouts, thread counts, and processing parameters

#### `email_extraction.py`

- Multi-method email extraction system:
  - Primary: Selenium WebDriver with stealth anti-detection
  - Fallback: Async HTTP extraction using aiohttp
  - Alternative: extract-emails library integration
- Intelligent subpage discovery (contact, about, footer pages)
- Robust error handling and retry mechanisms

#### `filtering.py`

- Business email validation and filtering
- Excludes: noreply@, social platforms, generic domains (gmail, yahoo, etc.)
- Supports all TLD patterns (.com, .net, .fr, etc.) via regex
- Removes placeholder and invalid email addresses

#### `io_utils.py`

- MongoDB data loading with field normalization
- CSV output writing with proper encoding
- Handles various input field name variations (website vs url vs site_web)

#### `pipeline.py`

- Processing orchestration for different execution modes:
  - Single entry testing with detailed logging
  - Threaded batch processing for Selenium extraction
  - Async batch processing for high-speed HTTP extraction
- Progress tracking and performance metrics

#### `env.py`

- Environment setup and warning suppression
- Selenium/Chrome driver noise reduction
- Logging configuration for clean output

#### `integration.py`

- Shared email extraction utilities for cross-script integration
- Designed for use by other CHOICE platform scripts:
  - `wellness.py` (beauty/wellness venues)
  - `billetreduc_shotgun_mistral.py` (event venues)
  - Future platform extensions

## Technical Implementation

### Email Extraction Methods

- **Selenium WebDriver**: Primary method with JavaScript execution, handles SPAs
- **Async HTTP**: Fast fallback using aiohttp for static content
- **extract-emails**: Library-based extraction with additional coverage
- **Subpage Crawling**: Automatic discovery of contact/about pages via link analysis

### Filtering & Validation

- Comprehensive business email filtering excluding non-business addresses
- Support for all international TLDs beyond just .fr domain restriction
- Deduplication and validation pipeline ensuring clean output

### Performance & Scalability

- Configurable parallel processing (default: 5 threads for Selenium)
- Batch processing for large datasets with progress tracking
- Async processing option for speed-critical operations
- Built-in performance metrics and efficiency reporting

### Integration Ready

- Modular design following SOLID principles
- DRY implementation preventing code duplication across platform scripts
- Clean separation of concerns enabling selective feature usage
- Comprehensive error handling and logging for production stability

## Usage Examples

```bash
# Test mode (default) - processes first 10 MongoDB records
python email_extractor.py

# Production mode with full MongoDB processing
python email_extractor.py --production --output emails.csv

# Single site testing
python email_extractor.py --test "Restaurant Name" "restaurant-website.com"

# Async mode for faster processing
python email_extractor.py --production --output emails.csv --use-async
```

## Output Schema

CSV format with fields: place_name, website_url, email, category

## Dependencies

- selenium, aiohttp, extract-emails, pymongo
- Chrome/Chromium browser for WebDriver functionality
- MongoDB connection for producer data input

This implementation establishes the foundation for systematic business email collection across the CHOICE platform ecosystem while maintaining code reusability and performance.
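
For reviewers, the filtering rules described for `filtering.py` (exclude noreply@ addresses, social platforms, generic and placeholder domains; accept any TLD; deduplicate) could look roughly like the sketch below. This is not the code in this PR: the regex, the blocklists, and the function name `extract_business_emails` are illustrative assumptions, since the actual patterns in `constants.py` are not shown in this description.

```python
import re

# Hypothetical pattern and blocklists; the real ones live in constants.py
# and may differ. The TLD part accepts any alphabetic TLD of 2+ letters.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BLOCKED_LOCAL_PARTS = {"noreply", "no-reply", "donotreply"}
GENERIC_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}
SOCIAL_DOMAINS = {"facebook.com", "instagram.com", "twitter.com", "linkedin.com"}
PLACEHOLDER_DOMAINS = {"example.com", "domain.com"}
BLOCKED_DOMAINS = GENERIC_DOMAINS | SOCIAL_DOMAINS | PLACEHOLDER_DOMAINS


def extract_business_emails(text: str) -> list[str]:
    """Return unique, lowercased business emails in order of first appearance."""
    seen: set[str] = set()
    result: list[str] = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        local, _, domain = email.partition("@")
        if local in BLOCKED_LOCAL_PARTS:
            continue  # automated sender, not a contact address
        if domain in BLOCKED_DOMAINS:
            continue  # generic, social, or placeholder domain
        if email in seen:
            continue  # duplicate (case-insensitive)
        seen.add(email)
        result.append(email)
    return result
```

Keeping the blocklists as plain sets makes them easy to centralize next to the regex, which matches the stated role of `constants.py`.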
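
The field normalization and CSV output described for `io_utils.py` might be sketched as follows. Only the four output columns and the `website` / `url` / `site_web` variations come from this description; the `name` input key and the helper names `normalize_record` and `write_rows` are assumptions made for illustration.

```python
import csv
import io

# Output columns as stated in the Output Schema section.
CSV_FIELDS = ["place_name", "website_url", "email", "category"]
# Input key variations named in the description, checked in this order.
URL_KEYS = ("website", "url", "site_web")


def normalize_record(doc: dict, email: str) -> dict:
    """Map one MongoDB-style document to the CSV schema."""
    url = next((doc[key] for key in URL_KEYS if doc.get(key)), "")
    return {
        "place_name": doc.get("name", ""),  # "name" is an assumed input key
        "website_url": url,
        "email": email,
        "category": doc.get("category", ""),
    }


def write_rows(rows: list[dict], stream) -> None:
    """Write a header plus rows to any text stream."""
    writer = csv.DictWriter(stream, fieldnames=CSV_FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Usage with an in-memory buffer.
buffer = io.StringIO()
record = normalize_record(
    {"name": "Ferme Dupont", "site_web": "https://ferme-dupont.fr", "category": "farm"},
    "contact@ferme-dupont.fr",
)
write_rows([record], buffer)
```

When writing to disk, opening the file with `open(path, "w", newline="", encoding="utf-8")` avoids blank lines on Windows and keeps accented place names intact.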