s33k3r - AI-Powered Email Finder

Version: 0.3.0 (Phase 4 - Advanced Extraction & Validation)
Status: In Development
Stack: TypeScript, Node.js, Playwright, Anthropic Claude, OpenAI GPT-4V


📋 Overview

s33k3r is an AI-powered email finder that uses autonomous agents to discover contact information from websites. The system employs a multi-agent architecture with browser automation to intelligently navigate, extract, and validate email addresses.

Current Implementation: Phase 4

This branch implements Phase 4 (Advanced Extraction & Validation), which includes:

Phase 4 Features (NEW!):

  • Vision-based Extraction - GPT-4V screenshot analysis for emails in images
  • LLM-Enhanced Validation - Intelligent name/role extraction from context
  • Contact Form Detection - Vision-based form identification
  • Enhanced Quality Scoring - Improved confidence calculations
  • Graceful Fallbacks - Works with or without OpenAI API key

Phase 3 Features:

  • Orchestrator Agent - High-level strategy planning and agent coordination
  • Agent Coordinator - Unified search workflow with progress tracking
  • Link Scoring Algorithm - Intelligent link prioritization
  • DOM Distillation - Efficient page content reduction (10-100x token savings)
  • Strategy Selection - Automatic strategy selection (Common Page, Deep Crawl, LinkedIn)
  • Complete Workflow - End-to-end automated email discovery

Phase 2 Foundation:

  • Base Agent System - Foundation for all AI agents with LLM integration
  • Navigation Agent - Scouts pages to identify high-value targets
  • Extraction Agent - Extracts emails using multiple methods (regex, mailto, JS, LLM)
  • Validation Agent - Validates and scores email quality
  • Browser Automation - Playwright-based stealth browsing
  • Type Safety - Full TypeScript with Zod schemas

🏗️ Architecture

┌─────────────────────────────────────────────────────────┐
│                    Orchestrator Agent                    │
│  • High-level task planning                             │
│  • Strategy selection                                    │
│  • Result aggregation                                    │
└────────────────┬────────────────────────────────────────┘
                 │
                 ├──────────────┬──────────────┬─────────────
                 │              │              │
        ┌────────▼──────┐  ┌───▼──────┐  ┌───▼──────────┐
        │ Navigation     │  │ Extraction│  │ Validation   │
        │ Agent          │  │ Agent     │  │ Agent        │
        │                │  │           │  │              │
        │ • Browse pages │  │ • Parse   │  │ • Verify     │
        │ • Follow links │  │   content │  │   emails     │
        │ • Handle auth  │  │ • Pattern │  │ • Dedupe     │
        └────────┬───────┘  │   match   │  │ • Score      │
                 │          └─────┬─────┘  └──────┬───────┘
                 │                │                │
        ┌────────▼────────────────▼────────────────▼───────┐
        │           Browser Control Layer                   │
        │              (Playwright + Stealth)               │
        └───────────────────────────────────────────────────┘

🚀 Quick Start

Prerequisites

Installation

# Install Bun (if not already installed)
curl -fsSL https://bun.sh/install | bash

# Clone the repository
git clone https://github.com/adam7rans/s33k3r.git
cd s33k3r

# Install dependencies
bun install

# Install Playwright browsers
cd backend && bunx playwright install chromium && cd ..

Configuration

# Copy environment templates
cp backend/.env.example backend/.env
cp frontend/.env.local.example frontend/.env.local

# Edit backend/.env and add your API keys
GEMINI_API_KEY=your_gemini_api_key_here
CONVEX_URL=your_convex_deployment_url_here

# Edit frontend/.env.local
NEXT_PUBLIC_CONVEX_URL=your_convex_deployment_url_here

Initialize Convex

# Run Convex dev server to sync schema and generate types
bunx convex dev

This will authenticate you with Convex, create your deployment, and push the database schema.

Running the Email Finder

# Run with default URL (example.com)
bun dev

# Or specify a target website
cd backend && bun dev https://yourwebsite.com

# Build and run in production mode
# (a bare `bun build` invokes Bun's built-in bundler, so run the package script explicitly)
bun run build
bun start https://yourwebsite.com

This will:

  1. Initialize the multi-agent system
  2. Analyze the target website structure
  3. Select the optimal search strategy (Common Page, Deep Crawl, etc.)
  4. Navigate to high-value pages (contact, about, team)
  5. Extract emails using multiple methods
  6. Validate and score all discovered emails
  7. Display detailed results with progress tracking

🧪 Testing

# Run all tests
pnpm test

# Run tests with coverage
pnpm test:coverage

# Run tests in watch mode
pnpm test -- --watch

Test Coverage

Current tests cover:

  • ✅ BaseAgent initialization and conversation history
  • ✅ NavigationAgent high-value page detection
  • ✅ ExtractionAgent email finding methods
  • ✅ ValidationAgent scoring and validation
  • ✅ BrowserManager stealth mode
  • ✅ PageNavigator retry logic

📁 Project Structure

s33k3r/
├── backend/
│   ├── src/
│   │   ├── agents/              # AI Agent implementations
│   │   │   ├── BaseAgent.ts           # Abstract base class
│   │   │   ├── OrchestratorAgent.ts   # Strategy planning ⭐ NEW
│   │   │   ├── AgentCoordinator.ts    # Workflow coordination ⭐ NEW
│   │   │   ├── NavigationAgent.ts     # Page navigation
│   │   │   ├── ExtractionAgent.ts     # Email extraction
│   │   │   └── ValidationAgent.ts     # Email validation
│   │   ├── browser/             # Browser automation
│   │   │   ├── BrowserManager.ts      # Playwright browser pool
│   │   │   ├── PageNavigator.ts       # Navigation utilities
│   │   │   ├── AccessibilityExtractor.ts  # A11y tree parsing
│   │   │   └── ScreenshotManager.ts   # Screenshot capture
│   │   ├── utils/               # Utilities
│   │   │   ├── logger.ts              # Logging system
│   │   │   ├── linkScorer.ts          # Link scoring ⭐ NEW
│   │   │   └── domDistillation.ts     # DOM reduction ⭐ NEW
│   │   ├── types/               # TypeScript types & Zod schemas
│   │   └── index.ts             # Main entry point
│   ├── tests/                   # Unit and integration tests
│   ├── package.json
│   ├── tsconfig.json
│   └── vitest.config.ts
├── docs/                        # Documentation
│   ├── EMAIL_FINDER_AGENT_SPEC.md
│   ├── AGENT_AND_USER_FLOWS.md
│   └── MULTIPHASE_IMPLEMENTATION_PLAN.md (on other branch)
└── README.md                    # This file

🤖 Agent Details

OrchestratorAgent ⭐ NEW

Purpose: High-level strategic planning and agent coordination

Capabilities:

  • Analyzes target website structure
  • Selects optimal search strategy based on site characteristics
  • Decomposes strategy into prioritized tasks
  • Coordinates Navigation, Extraction, and Validation agents
  • Provides real-time progress updates

Strategies:

  • Common Page - Fast search of standard pages (contact, about, team)
  • Deep Crawl - Systematic exploration of all pages
  • LinkedIn - Extract LinkedIn profiles and construct emails
  • API Discovery - Look for public APIs exposing contact info

Example Strategy Output:

{
  "strategy": "common_page",
  "tasks": [
    {"type": "navigate", "url": "/contact", "priority": 10},
    {"type": "navigate", "url": "/about", "priority": 9},
    {"type": "navigate", "url": "/team", "priority": 8}
  ],
  "estimatedPages": 5,
  "reasoning": "Site has clear navigation with contact page"
}
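
A plan of this shape can be consumed by sorting its tasks by priority before dispatching them. The sketch below is illustrative only: the `StrategyPlan` and `StrategyTask` types and the `orderTasks` helper are hypothetical names inferred from the example JSON, not the project's actual Zod schemas.

```typescript
// Hypothetical types mirroring the example strategy output above.
interface StrategyTask {
  type: string;
  url: string;
  priority: number; // higher values run first (assumption based on the example)
}

interface StrategyPlan {
  strategy: string;
  tasks: StrategyTask[];
  estimatedPages: number;
  reasoning: string;
}

// Return the tasks from highest to lowest priority without mutating the plan.
function orderTasks(plan: StrategyPlan): StrategyTask[] {
  return plan.tasks.slice().sort((a, b) => b.priority - a.priority);
}

const examplePlan: StrategyPlan = {
  strategy: 'common_page',
  tasks: [
    { type: 'navigate', url: '/team', priority: 8 },
    { type: 'navigate', url: '/contact', priority: 10 },
    { type: 'navigate', url: '/about', priority: 9 },
  ],
  estimatedPages: 5,
  reasoning: 'Site has clear navigation with contact page',
};

const orderedUrls = orderTasks(examplePlan).map((task) => task.url);
console.log(orderedUrls.join(', '));
```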

AgentCoordinator ⭐ NEW

Purpose: Unified search workflow management

Features:

  • Initializes and manages all agents
  • Executes complete search workflow
  • Progress tracking with callbacks
  • Error handling and recovery
  • Resource cleanup

Workflow:

  1. Initialize browser and agents
  2. Navigate to target URL
  3. Plan search strategy
  4. Execute strategy with progress updates
  5. Validate and deduplicate emails
  6. Return structured results
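
Step 5 of this workflow (deduplication) could be sketched as follows. This is a minimal illustration, not the coordinator's actual code: `ScoredEmail` and `dedupeEmails` are hypothetical names, and treating addresses as case-insensitive is an assumption.

```typescript
// Hypothetical result shape: an address plus the score assigned during validation.
interface ScoredEmail {
  address: string;
  score: number;
}

// Keep one entry per address (compared case-insensitively), preferring the
// highest score, and return the survivors sorted from best to worst.
function dedupeEmails(emails: ScoredEmail[]): ScoredEmail[] {
  const best = new Map<string, ScoredEmail>();
  for (const email of emails) {
    const key = email.address.trim().toLowerCase();
    const existing = best.get(key);
    if (existing === undefined || email.score > existing.score) {
      best.set(key, { address: key, score: email.score });
    }
  }
  return Array.from(best.values()).sort((a, b) => b.score - a.score);
}

const unique = dedupeEmails([
  { address: 'John.Doe@Example.com', score: 50 },
  { address: 'john.doe@example.com', score: 70 },
  { address: 'info@example.com', score: 50 },
]);
console.log(unique);
```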

BaseAgent

Abstract base class that all agents inherit from. Provides:

  • LLM conversation management (Anthropic Claude)
  • Message history tracking
  • Think/respond cycle
  • Configurable model parameters

NavigationAgent

Purpose: Scout pages and identify high-value targets

Capabilities:

  • Analyzes accessibility tree (lightweight observation)
  • Identifies pages likely to contain emails (contact, about, team)
  • Uses ReAct prompting for reasoning
  • Decides which pages to hand off to Extraction Agent

System Prompt:

You are a Navigation Agent specialized in scouting web pages to find
contact information. Use the accessibility tree to understand page
structure and identify high-value pages. Use ReAct format:
THOUGHT: [Your reasoning]
ACTION: [navigate|click|extract|finish]
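
A reply in this format can be parsed with two line-anchored regular expressions. The sketch below is an assumption about how such parsing might look; `parseReAct` and `ReActStep` are hypothetical names, and the real agent may handle replies differently.

```typescript
// The four actions listed in the system prompt above.
type NavAction = 'navigate' | 'click' | 'extract' | 'finish';

interface ReActStep {
  thought: string;
  action: NavAction;
}

const KNOWN_ACTIONS: ReadonlyArray<NavAction> = ['navigate', 'click', 'extract', 'finish'];

// Parse a model reply; return null when either line is missing or the action is unknown.
function parseReAct(reply: string): ReActStep | null {
  const thought = /^THOUGHT:\s*(.+)$/m.exec(reply);
  const action = /^ACTION:\s*([A-Za-z]+)/m.exec(reply);
  if (thought === null || action === null) {
    return null;
  }
  const name = (action[1] ?? '').toLowerCase();
  const matched = KNOWN_ACTIONS.find((candidate) => candidate === name);
  if (matched === undefined) {
    return null;
  }
  return { thought: (thought[1] ?? '').trim(), action: matched };
}

const step = parseReAct('THOUGHT: The footer links to a contact page.\nACTION: navigate');
console.log(step);
```

Returning `null` for malformed replies lets the caller decide whether to retry the prompt or stop.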

ExtractionAgent

Purpose: Deep email extraction from high-value pages

Methods:

  1. Regex Scanning - Pattern matching on page text
  2. mailto: Links - Clickable email links
  3. JavaScript Variables - Emails stored in JS code
  4. LLM Extraction - Claude-based intelligent parsing

Output: Email candidates with context
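
The first two methods can be illustrated with plain regular expressions over the page HTML. This is a simplified sketch, not the agent's implementation: `extractCandidates` and `EmailCandidate` are hypothetical names, the email pattern is deliberately loose, and a `mailto:` link with several comma-separated addresses is not split.

```typescript
// Hypothetical candidate shape: the address plus the method that found it.
interface EmailCandidate {
  email: string;
  method: 'regex' | 'mailto';
}

function extractCandidates(html: string): EmailCandidate[] {
  const found = new Map<string, EmailCandidate>();

  // Method 2: mailto: links. Checked first so the more specific method is recorded.
  const mailto = /href=["']mailto:([^"'?]+)/gi;
  let match: RegExpExecArray | null;
  while ((match = mailto.exec(html)) !== null) {
    const email = (match[1] ?? '').trim().toLowerCase();
    if (email.length > 0) {
      found.set(email, { email, method: 'mailto' });
    }
  }

  // Method 1: regex scan for anything that looks like an address.
  const plain = html.match(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g) ?? [];
  for (const raw of plain) {
    const email = raw.toLowerCase();
    if (!found.has(email)) {
      found.set(email, { email, method: 'regex' });
    }
  }

  return Array.from(found.values());
}

const candidates = extractCandidates(
  '<a href="mailto:Jane@Example.com?subject=Hi">Email Jane</a> <p>Sales: sales@example.com</p>'
);
console.log(candidates);
```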

ValidationAgent

Purpose: Validate and score email quality

Validation Checks:

  • ✅ Format validation (RFC 5322)
  • ✅ Domain match detection (+40 bonus if matches)
  • ✅ MX record lookup (+20 if exists)
  • ✅ Generic role detection (-10 for info@, contact@)
  • ✅ Suspicious pattern detection (-30 for noreply@)
  • ✅ Context presence (+10 if has name/role)

Critical Design:

  • Domain mismatch is NOT penalized
  • Personal emails (@gmail.com) are valid findings
  • Only penalize truly suspicious patterns

Scoring Example:

john.doe@example.com (from example.com)
→ +40 (domain match) +10 (context) +20 (MX) = 70 points ✓

jane@gmail.com (from example.com)
→ +0 (no penalty!) +10 (context) +20 (MX) = 30 points ✓

info@example.com (from example.com)
→ +40 (domain match) -10 (generic) +20 (MX) = 50 points
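
The arithmetic in these examples can be reproduced with a small scoring function. The sketch below uses the weights listed under Validation Checks, but everything else is an assumption: `scoreEmail` and `ScoreInput` are hypothetical names, format validation is omitted, and the MX lookup result is passed in as a boolean rather than performed here.

```typescript
// Hypothetical input: facts about an address gathered by earlier steps.
interface ScoreInput {
  email: string;
  sourceDomain: string; // domain of the site the address was found on
  hasMxRecord: boolean; // outcome of a DNS MX lookup done elsewhere
  hasContext: boolean;  // a name or role was found near the address
}

// Local parts named in the checks above; a real list would likely be longer.
const GENERIC_LOCAL_PARTS = new Set(['info', 'contact']);
const SUSPICIOUS_LOCAL_PARTS = new Set(['noreply']);

function scoreEmail(input: ScoreInput): number {
  const [local = '', domain = ''] = input.email.toLowerCase().split('@');
  let score = 0;
  if (domain === input.sourceDomain.toLowerCase()) score += 40; // bonus only, no mismatch penalty
  if (input.hasMxRecord) score += 20;
  if (input.hasContext) score += 10;
  if (GENERIC_LOCAL_PARTS.has(local)) score -= 10;
  if (SUSPICIOUS_LOCAL_PARTS.has(local)) score -= 30;
  return score;
}

const johnScore = scoreEmail({ email: 'john.doe@example.com', sourceDomain: 'example.com', hasMxRecord: true, hasContext: true });
const janeScore = scoreEmail({ email: 'jane@gmail.com', sourceDomain: 'example.com', hasMxRecord: true, hasContext: true });
const infoScore = scoreEmail({ email: 'info@example.com', sourceDomain: 'example.com', hasMxRecord: true, hasContext: false });
console.log(johnScore, janeScore, infoScore); // 70 30 50
```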

🔧 Development

Code Quality

# Lint code
pnpm lint

# Format code
pnpm format

# Type check
pnpm build

Adding a New Agent

  1. Create a new file in src/agents/
  2. Extend BaseAgent
  3. Implement getSystemPrompt()
  4. Add agent-specific methods
  5. Write tests in tests/agents/

Example:

import { BaseAgent, AgentConfig } from './BaseAgent.js';

export class MyAgent extends BaseAgent {
  constructor(config: Partial<AgentConfig> = {}) {
    super({
      name: 'MyAgent',
      model: 'claude-3-5-sonnet-20241022',
      temperature: 0.5,
      maxTokens: 2000,
      ...config,
    });
  }

  getSystemPrompt(): string {
    return 'You are MyAgent. Your role is...';
  }

  async performTask(): Promise<void> {
    const response = await this.think('Task prompt here');
    // Process response...
  }
}

🔧 Phase 3 Components

Link Scoring Algorithm

Intelligently scores links based on their likelihood to contain emails:

  • High-Value Keywords: contact (+10), about (+9), team (+9), people (+8)
  • Avoid Keywords: login (-20), cart (-15), privacy (-10)
  • Same Domain Bonus: +20 points
  • Path Depth Penalty: -2 per level
  • Root Level Bonus: +5 points
  • Email Text Indicators: +5 if text mentions email/contact

Example:

import { linkScorer } from './utils/linkScorer';

const scored = linkScorer.scoreLink(
  { href: '/contact', text: 'Contact Us' },
  'https://example.com'
);
// Result: score ~35 (contact +10, same domain +20, root +5)
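
How the rules above might combine is sketched below. This is not the code in `linkScorer.ts`: `scoreLinkSketch` is a hypothetical function, and it assumes that keywords are matched against the URL path, that "root level" means at most one path segment, and that the depth penalty applies to each segment beyond the first. Because it also applies the link-text indicator, its total for `/contact` comes out slightly higher than the approximate figure quoted above.

```typescript
interface LinkInput {
  href: string;
  text: string;
}

// Keyword weights copied from the list above (positive = high value, negative = avoid).
const KEYWORD_WEIGHTS: ReadonlyArray<readonly [string, number]> = [
  ['contact', 10], ['about', 9], ['team', 9], ['people', 8],
  ['login', -20], ['cart', -15], ['privacy', -10],
];

function scoreLinkSketch(link: LinkInput, baseUrl: string): number {
  const base = new URL(baseUrl);
  const target = new URL(link.href, base); // resolves relative hrefs against the base
  const path = target.pathname.toLowerCase();
  const depth = path.split('/').filter((part) => part.length > 0).length;

  let score = 0;
  for (const [keyword, weight] of KEYWORD_WEIGHTS) {
    if (path.indexOf(keyword) !== -1) score += weight;
  }
  if (target.hostname === base.hostname) score += 20; // same-domain bonus
  if (depth <= 1) score += 5;                         // root-level bonus
  score -= 2 * Math.max(0, depth - 1);                // depth penalty
  if (/email|contact/i.test(link.text)) score += 5;   // link-text indicator
  return score;
}

const contactScore = scoreLinkSketch({ href: '/contact', text: 'Contact Us' }, 'https://example.com');
const cartScore = scoreLinkSketch({ href: 'https://shop.example.org/cart/login', text: 'Sign in' }, 'https://example.com');
console.log(contactScore, cartScore);
```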

DOM Distillation

Reduces full page DOM to email-relevant content only:

  • 10-100x token reduction
  • Extracts high-value sections (footer, header, contact areas)
  • Identifies relevant links automatically
  • Detects contact forms and social links
  • Compact summaries for LLM processing

Before: 50KB full HTML → After: 2KB distilled content
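
The idea can be illustrated with a string-only sketch that drops noisy elements and keeps header, footer, and address sections. This is a rough approximation under stated assumptions: `distillHtml` is a hypothetical function, it uses regular expressions rather than the live DOM the project works with through Playwright, and it will not handle nested or malformed markup reliably.

```typescript
// Reduce an HTML string to the text of its header/footer/address sections.
// Falls back to the whole (de-noised) page when no such section exists.
function distillHtml(html: string): string {
  const withoutNoise = html
    .replace(/<(script|style|noscript|svg)\b[\s\S]*?<\/\1>/gi, ' ')
    .replace(/<!--[\s\S]*?-->/g, ' ');

  const sections = withoutNoise.match(/<(header|footer|address)\b[\s\S]*?<\/\1>/gi) ?? [];
  const source = sections.length > 0 ? sections.join(' ') : withoutNoise;

  return source
    // Surface mailto: targets as plain text before the tags are stripped.
    .replace(/<a\b[^>]*href=["']mailto:([^"'?]+)[^>]*>/gi, ' $1 ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

const page =
  '<html><head><style>p{color:red}</style></head><body>' +
  '<header><a href="/about">About</a></header>' +
  '<main><p>Long article text</p><script>var x = 1;</script></main>' +
  '<footer>Contact: <a href="mailto:hello@example.com">Email us</a></footer>' +
  '</body></html>';

const distilled = distillHtml(page);
console.log(distilled);
```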


📊 Phase 3 Completion Status

  • Orchestrator Agent with strategy planning
  • Agent Coordinator service
  • Link scoring algorithm
  • DOM distillation utilities
  • Strategy selection logic
  • Complete search workflow
  • Progress tracking system
  • Integration tests
  • Updated documentation

Next: Phase 4 - Advanced Extraction & Validation


🔗 Related Documentation


🤝 Contributing

This is currently a development branch implementing Phase 4 of the multiphase plan. See the implementation plan for the full roadmap.


📝 License

MIT


🎯 Roadmap

  • Phase 0: Project setup
  • Phase 1: Core browser automation
  • Phase 2: Basic agent framework
  • Phase 3: Multi-agent orchestration
  • Phase 4: Advanced extraction & validation ← YOU ARE HERE
  • Phase 5: Frontend & UI
  • Phase 6: Lead enrichment
  • Phase 7: Production deployment

Built with ❤️ using TypeScript, Playwright, and Claude AI
