Version: 0.3.0 (Phase 4 - Advanced Extraction & Validation)
Status: In Development
Stack: TypeScript, Node.js, Playwright, Anthropic Claude, OpenAI GPT-4V
s33k3r is an AI-powered email finder that uses autonomous agents to discover contact information from websites. The system employs a multi-agent architecture with browser automation to intelligently navigate, extract, and validate email addresses.
This branch implements Phase 4: Advanced Extraction & Validation, which includes:
Phase 4 Features (NEW!):
- ✅ Vision-based Extraction - GPT-4V screenshot analysis for emails in images
- ✅ LLM-Enhanced Validation - Intelligent name/role extraction from context
- ✅ Contact Form Detection - Vision-based form identification
- ✅ Enhanced Quality Scoring - Improved confidence calculations
- ✅ Graceful Fallbacks - Works with or without OpenAI API key
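The graceful-fallback behavior can be sketched as a simple capability check. This is an illustrative sketch, not the actual s33k3r code: the names `ExtractionMode` and `pickExtractionMode` are assumptions, but the idea matches the feature above — vision-based extraction only runs when an OpenAI key is configured, otherwise the text-only pipeline is used.

```typescript
// Hypothetical helper (names are illustrative, not the real s33k3r API):
// decide whether GPT-4V vision extraction is available for this run.
type ExtractionMode = "vision+text" | "text-only";

function pickExtractionMode(env: Record<string, string | undefined>): ExtractionMode {
  // Vision-based extraction needs an OpenAI key; without one, the
  // pipeline falls back to regex/mailto/LLM text extraction only.
  return env.OPENAI_API_KEY ? "vision+text" : "text-only";
}
```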
Phase 3 Features:
- ✅ Orchestrator Agent - High-level strategy planning and agent coordination
- ✅ Agent Coordinator - Unified search workflow with progress tracking
- ✅ Link Scoring Algorithm - Intelligent link prioritization
- ✅ DOM Distillation - Efficient page content reduction (10-100x token savings)
- ✅ Strategy Selection - Automatic strategy selection (Common Page, Deep Crawl, LinkedIn)
- ✅ Complete Workflow - End-to-end automated email discovery
Phase 2 Foundation:
- ✅ Base Agent System - Foundation for all AI agents with LLM integration
- ✅ Navigation Agent - Scouts pages to identify high-value targets
- ✅ Extraction Agent - Extracts emails using multiple methods (regex, mailto, JS, LLM)
- ✅ Validation Agent - Validates and scores email quality
- ✅ Browser Automation - Playwright-based stealth browsing
- ✅ Type Safety - Full TypeScript with Zod schemas
┌─────────────────────────────────────────────────────────┐
│ Orchestrator Agent │
│ • High-level task planning │
│ • Strategy selection │
│ • Result aggregation │
└────────────────┬────────────────────────────────────────┘
│
├──────────────┬──────────────┬─────────────
│ │ │
┌────────▼──────┐ ┌───▼──────┐ ┌───▼──────────┐
│ Navigation │ │ Extraction│ │ Validation │
│ Agent │ │ Agent │ │ Agent │
│ │ │ │ │ │
│ • Browse pages │ │ • Parse │ │ • Verify │
│ • Follow links │ │ content │ │ emails │
│ • Handle auth │ │ • Pattern │ │ • Dedupe │
└────────┬───────┘ │ match │ │ • Score │
│ └─────┬─────┘ └──────┬───────┘
│ │ │
┌────────▼────────────────▼────────────────▼───────┐
│ Browser Control Layer │
│ (Playwright + Stealth) │
└───────────────────────────────────────────────────┘
- Bun 1.0+ (https://bun.sh)
- Gemini API Key (Google Gemini 2.5)
- Convex Account (https://convex.dev)
# Install Bun (if not already installed)
curl -fsSL https://bun.sh/install | bash
# Clone the repository
git clone https://github.com/adam7rans/s33k3r.git
cd s33k3r
# Install dependencies
bun install
# Install Playwright browsers
cd backend && bunx playwright install chromium && cd ..

# Copy environment templates
cp backend/.env.example backend/.env
cp frontend/.env.local.example frontend/.env.local
# Edit backend/.env and add your API keys
GEMINI_API_KEY=your_gemini_api_key_here
CONVEX_URL=your_convex_deployment_url_here
# Edit frontend/.env.local
NEXT_PUBLIC_CONVEX_URL=your_convex_deployment_url_here

# Run Convex dev server to sync schema and generate types
bunx convex dev

This will authenticate you with Convex, create your deployment, and push the database schema.
# Run with default URL (example.com)
bun dev
# Or specify a target website
cd backend && bun dev https://yourwebsite.com
# Build and run in production mode
bun build
bun start https://yourwebsite.com

This will:
- Initialize the multi-agent system
- Analyze the target website structure
- Select the optimal search strategy (Common Page, Deep Crawl, etc.)
- Navigate to high-value pages (contact, about, team)
- Extract emails using multiple methods
- Validate and score all discovered emails
- Display detailed results with progress tracking
# Run all tests
bun run test

# Run tests with coverage
bun run test:coverage

# Run tests in watch mode
bun run test -- --watch

Current tests cover:
- ✅ BaseAgent initialization and conversation history
- ✅ NavigationAgent high-value page detection
- ✅ ExtractionAgent email finding methods
- ✅ ValidationAgent scoring and validation
- ✅ BrowserManager stealth mode
- ✅ PageNavigator retry logic
s33k3r/
├── backend/
│ ├── src/
│ │ ├── agents/ # AI Agent implementations
│ │ │ ├── BaseAgent.ts # Abstract base class
│ │ │ ├── OrchestratorAgent.ts # Strategy planning ⭐ NEW
│ │ │ ├── AgentCoordinator.ts # Workflow coordination ⭐ NEW
│ │ │ ├── NavigationAgent.ts # Page navigation
│ │ │ ├── ExtractionAgent.ts # Email extraction
│ │ │ └── ValidationAgent.ts # Email validation
│ │ ├── browser/ # Browser automation
│ │ │ ├── BrowserManager.ts # Playwright browser pool
│ │ │ ├── PageNavigator.ts # Navigation utilities
│ │ │ ├── AccessibilityExtractor.ts # A11y tree parsing
│ │ │ └── ScreenshotManager.ts # Screenshot capture
│ │ ├── utils/ # Utilities
│ │ │ ├── logger.ts # Logging system
│ │ │ ├── linkScorer.ts # Link scoring ⭐ NEW
│ │ │ └── domDistillation.ts # DOM reduction ⭐ NEW
│ │ ├── types/ # TypeScript types & Zod schemas
│ │ └── index.ts # Main entry point
│ ├── tests/ # Unit and integration tests
│ ├── package.json
│ ├── tsconfig.json
│ └── vitest.config.ts
├── docs/ # Documentation
│ ├── EMAIL_FINDER_AGENT_SPEC.md
│ ├── AGENT_AND_USER_FLOWS.md
│ └── MULTIPHASE_IMPLEMENTATION_PLAN.md (on other branch)
└── README.md # This file
Purpose: High-level strategic planning and agent coordination
Capabilities:
- Analyzes target website structure
- Selects optimal search strategy based on site characteristics
- Decomposes strategy into prioritized tasks
- Coordinates Navigation, Extraction, and Validation agents
- Provides real-time progress updates
Strategies:
- Common Page - Fast search of standard pages (contact, about, team)
- Deep Crawl - Systematic exploration of all pages
- LinkedIn - Extract LinkedIn profiles and construct emails
- API Discovery - Look for public APIs exposing contact info
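The strategy choice can be sketched as a mapping from simple site signals to one of the strategies above. This is a hedged sketch of assumed heuristics, not the actual OrchestratorAgent logic; `SiteSignals` and `selectStrategy` are illustrative names.

```typescript
// Illustrative heuristics (assumed, not the real OrchestratorAgent code):
// map coarse site signals to one of the four documented strategies.
type Strategy = "common_page" | "deep_crawl" | "linkedin" | "api_discovery";

interface SiteSignals {
  hasContactLink: boolean;   // nav/footer links to /contact, /about, /team
  hasLinkedInLinks: boolean; // outbound links to linkedin.com profiles
  pageCount: number;         // rough size estimate from sitemap or crawl
}

function selectStrategy(s: SiteSignals): Strategy {
  if (s.hasContactLink) return "common_page"; // fastest path when standard pages exist
  if (s.hasLinkedInLinks) return "linkedin";  // derive emails from profile names
  if (s.pageCount > 50) return "deep_crawl";  // no shortcuts; explore systematically
  return "api_discovery";                     // small site; look for exposed APIs
}
```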
Example Strategy Output:
{
"strategy": "common_page",
"tasks": [
{"type": "navigate", "url": "/contact", "priority": 10},
{"type": "navigate", "url": "/about", "priority": 9},
{"type": "navigate", "url": "/team", "priority": 8}
],
"estimatedPages": 5,
"reasoning": "Site has clear navigation with contact page"
}

Purpose: Unified search workflow management
Features:
- Initializes and manages all agents
- Executes complete search workflow
- Progress tracking with callbacks
- Error handling and recovery
- Resource cleanup
Workflow:
- Initialize browser and agents
- Navigate to target URL
- Plan search strategy
- Execute strategy with progress updates
- Validate and deduplicate emails
- Return structured results
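The workflow above can be sketched as an ordered pipeline with a progress callback. This is a minimal sketch of the coordination pattern only; the step names mirror the list above, while the `onProgress` callback shape is an assumption, not the real AgentCoordinator signature.

```typescript
// Sketch of the coordinator's loop: step names follow the documented
// workflow; the onProgress callback shape is an assumed illustration.
type Step = "init" | "navigate" | "plan" | "execute" | "validate" | "report";

function runWorkflow(onProgress: (step: Step, done: number, total: number) => void): Step[] {
  const steps: Step[] = ["init", "navigate", "plan", "execute", "validate", "report"];
  const completed: Step[] = [];
  for (let i = 0; i < steps.length; i++) {
    // Real agents do browser/LLM work at each step; this sketch only
    // demonstrates the ordering and progress reporting.
    completed.push(steps[i]);
    onProgress(steps[i], i + 1, steps.length);
  }
  return completed;
}
```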
Abstract base class that all agents inherit from. Provides:
- LLM conversation management (Anthropic Claude)
- Message history tracking
- Think/respond cycle
- Configurable model parameters
Purpose: Scout pages and identify high-value targets
Capabilities:
- Analyzes accessibility tree (lightweight observation)
- Identifies pages likely to contain emails (contact, about, team)
- Uses ReAct prompting for reasoning
- Decides which pages to hand off to Extraction Agent
System Prompt:
You are a Navigation Agent specialized in scouting web pages to find
contact information. Use the accessibility tree to understand page
structure and identify high-value pages. Use ReAct format:
THOUGHT: [Your reasoning]
ACTION: [navigate|click|extract|finish]
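A reply in this ReAct format has to be parsed before the agent can act on it. A minimal sketch of that parsing step, assuming replies follow the exact `THOUGHT:`/`ACTION:` layout shown above (the real NavigationAgent's parser may differ):

```typescript
// Parse one ReAct-formatted model reply into thought + action.
// Assumes the exact THOUGHT:/ACTION: format from the system prompt above.
interface ReActStep {
  thought: string;
  action: "navigate" | "click" | "extract" | "finish" | null;
}

function parseReAct(reply: string): ReActStep {
  const thought = /THOUGHT:\s*(.+)/.exec(reply)?.[1]?.trim() ?? "";
  const action = /ACTION:\s*(navigate|click|extract|finish)/.exec(reply)?.[1] ?? null;
  return { thought, action: action as ReActStep["action"] };
}
```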
Purpose: Deep email extraction from high-value pages
Methods:
- Regex Scanning - Pattern matching on page text
- mailto: Links - Clickable email links
- JavaScript Variables - Emails stored in JS code
- LLM Extraction - Claude-based intelligent parsing
Output: Email candidates with context
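The two cheapest methods, regex scanning and mailto links, can be sketched as follows. The email pattern here is a deliberately simplified one (not full RFC 5322), and these helpers are illustrations rather than the actual ExtractionAgent internals.

```typescript
// Simplified email pattern for illustration; real validation happens
// later in the ValidationAgent against stricter rules.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Method 1: regex scan over visible page text.
function extractFromText(text: string): string[] {
  return [...new Set(text.match(EMAIL_RE) ?? [])];
}

// Method 2: mailto: links are high-confidence candidates;
// strip any ?subject=... query string from the href.
function extractFromMailto(html: string): string[] {
  const re = /href=["']mailto:([^"'?]+)/g;
  return [...new Set([...html.matchAll(re)].map((m) => m[1]))];
}
```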
Purpose: Validate and score email quality
Validation Checks:
- ✅ Format validation (RFC 5322)
- ✅ Domain match detection (+40 bonus if matches)
- ✅ MX record lookup (+20 if exists)
- ✅ Generic role detection (-10 for info@, contact@)
- ✅ Suspicious pattern detection (-30 for noreply@)
- ✅ Context presence (+10 if has name/role)
Critical Design:
- Domain mismatch is NOT penalized
- Personal emails (@gmail.com) are valid findings
- Only penalize truly suspicious patterns
Scoring Example:
john.doe@example.com (from example.com)
→ +40 (domain match) +10 (context) +20 (MX) = 70 points ✓
jane@gmail.com (from example.com)
→ +0 (no penalty!) +10 (context) +20 (MX) = 30 points ✓
info@example.com (from example.com)
→ +40 (domain match) -10 (generic) +20 (MX) = 50 points
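The scoring rules above can be sketched directly from the documented weights. The weights come straight from the checklist; the function signature and `EmailSignals` shape are assumptions for illustration, not the actual ValidationAgent API.

```typescript
// Weights taken from the documented checklist; the signal struct and
// function shape are illustrative assumptions.
interface EmailSignals {
  domainMatches: boolean; // email domain matches the target site
  hasMx: boolean;         // MX record exists for the email's domain
  hasContext: boolean;    // a name/role appears near the email
  localPart: string;      // part before the @
}

function scoreEmail(s: EmailSignals): number {
  let score = 0;
  if (s.domainMatches) score += 40; // bonus only; a mismatch is NOT penalized
  if (s.hasMx) score += 20;         // deliverable domain
  if (s.hasContext) score += 10;    // name/role context nearby
  if (["info", "contact"].includes(s.localPart)) score -= 10; // generic role
  if (["noreply", "no-reply"].includes(s.localPart)) score -= 30; // suspicious
  return score;
}
```

Running the three documented examples through this sketch reproduces the 70 / 30 / 50 scores shown above.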
# Lint code
bun run lint

# Format code
bun run format

# Type check
bun run build

To add a new agent:

- Create a new file in src/agents/
- Extend BaseAgent
- Implement getSystemPrompt()
- Add agent-specific methods
- Write tests in tests/agents/
Example:
import { BaseAgent, AgentConfig } from './BaseAgent.js';
export class MyAgent extends BaseAgent {
constructor(config: Partial<AgentConfig> = {}) {
super({
name: 'MyAgent',
model: 'claude-3-5-sonnet-20241022',
temperature: 0.5,
maxTokens: 2000,
...config,
});
}
getSystemPrompt(): string {
return 'You are MyAgent. Your role is...';
}
async performTask(): Promise<void> {
const response = await this.think('Task prompt here');
// Process response...
}
}

Intelligently scores links based on their likelihood to contain emails:
- High-Value Keywords: contact (+10), about (+9), team (+9), people (+8)
- Avoid Keywords: login (-20), cart (-15), privacy (-10)
- Same Domain Bonus: +20 points
- Path Depth Penalty: -2 per level
- Root Level Bonus: +5 points
- Email Text Indicators: +5 if text mentions email/contact
Example:
import { linkScorer } from './utils/linkScorer';
const scored = linkScorer.scoreLink(
{ href: '/contact', text: 'Contact Us' },
'https://example.com'
);
// Result: score ~35 (contact +10, same domain +20, root +5)

Reduces full page DOM to email-relevant content only:
- 10-100x token reduction
- Extracts high-value sections (footer, header, contact areas)
- Identifies relevant links automatically
- Detects contact forms and social links
- Compact summaries for LLM processing
Before: 50KB full HTML → After: 2KB distilled content
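The distillation idea can be sketched as keeping only the sections most likely to hold contact info and measuring the reduction. The selectors here (footer/header only) are a simplification; the real domDistillation.ts also pulls contact areas, forms, and social links.

```typescript
// Minimal distillation sketch (assumed selectors; the real
// domDistillation.ts keeps more sections): retain footer/header
// blocks and report the size-reduction ratio.
function distill(html: string): { content: string; ratio: number } {
  const keep = [...html.matchAll(/<(footer|header)[^>]*>[\s\S]*?<\/\1>/gi)]
    .map((m) => m[0]);
  const content = keep.join("\n");
  // Ratio > 1 means the distilled content is smaller than the input.
  const ratio = html.length / Math.max(content.length, 1);
  return { content, ratio };
}
```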
- Orchestrator Agent with strategy planning
- Agent Coordinator service
- Link scoring algorithm
- DOM distillation utilities
- Strategy selection logic
- Complete search workflow
- Progress tracking system
- Integration tests
- Updated documentation
Next: Phase 4 - Advanced Extraction & Validation
This is currently a development branch implementing Phase 4 of the multiphase plan. See the implementation plan for the full roadmap.
MIT
- Phase 0: Project setup
- Phase 1: Core browser automation
- Phase 2: Basic agent framework
- Phase 3: Multi-agent orchestration
- Phase 4: Advanced extraction & validation ← YOU ARE HERE
- Phase 5: Frontend & UI
- Phase 6: Lead enrichment
- Phase 7: Production deployment
Built with ❤️ using TypeScript, Playwright, and Claude AI