Version: 0.3.0 (Phase 4 - Advanced Extraction & Validation)
Status: In Development
Stack: TypeScript, Node.js, Playwright, Anthropic Claude, OpenAI GPT-4V
s33k3r is an AI-powered email finder that uses autonomous agents to discover contact information from websites. The system employs a multi-agent architecture with browser automation to intelligently navigate, extract, and validate email addresses.
This branch implements Phase 4: Advanced Extraction & Validation, which includes:
Phase 4 Features (NEW!):
- ✅ Vision-based Extraction - GPT-4V screenshot analysis for emails in images
- ✅ LLM-Enhanced Validation - Intelligent name/role extraction from context
- ✅ Contact Form Detection - Vision-based form identification
- ✅ Enhanced Quality Scoring - Improved confidence calculations
- ✅ Graceful Fallbacks - Works with or without OpenAI API key
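The graceful-fallback behavior can be sketched as a simple capability check. This is an illustrative sketch, not the actual s33k3r code: the names `ExtractionMode` and `pickExtractionMode` are assumptions, but the idea matches the feature above — vision-based extraction only runs when an OpenAI key is configured, otherwise the text-only pipeline is used.

```typescript
// Hypothetical helper (names are illustrative, not the real s33k3r API):
// decide whether GPT-4V vision extraction is available for this run.
type ExtractionMode = "vision+text" | "text-only";

function pickExtractionMode(env: Record<string, string | undefined>): ExtractionMode {
  // Vision-based extraction needs an OpenAI key; without one, the
  // pipeline falls back to regex/mailto/LLM text extraction only.
  return env.OPENAI_API_KEY ? "vision+text" : "text-only";
}
```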
Phase 3 Features:
- ✅ Orchestrator Agent - High-level strategy planning and agent coordination
- ✅ Agent Coordinator - Unified search workflow with progress tracking
- ✅ Link Scoring Algorithm - Intelligent link prioritization
- ✅ DOM Distillation - Efficient page content reduction (10-100x token savings)
- ✅ Strategy Selection - Automatic strategy selection (Common Page, Deep Crawl, LinkedIn)
- ✅ Complete Workflow - End-to-end automated email discovery
Phase 2 Foundation:
- ✅ Base Agent System - Foundation for all AI agents with LLM integration
- ✅ Navigation Agent - Scouts pages to identify high-value targets
- ✅ Extraction Agent - Extracts emails using multiple methods (regex, mailto, JS, LLM)
- ✅ Validation Agent - Validates and scores email quality
- ✅ Browser Automation - Playwright-based stealth browsing
- ✅ Type Safety - Full TypeScript with Zod schemas
┌─────────────────────────────────────────────────────────┐
│ Orchestrator Agent │
│ • High-level task planning │
│ • Strategy selection │
│ • Result aggregation │
└────────────────┬────────────────────────────────────────┘
│
├──────────────┬──────────────┬─────────────
│ │ │
┌────────▼──────┐ ┌───▼──────┐ ┌───▼──────────┐
│ Navigation │ │ Extraction│ │ Validation │
│ Agent │ │ Agent │ │ Agent │
│ │ │ │ │ │
│ • Browse pages │ │ • Parse │ │ • Verify │
│ • Follow links │ │ content │ │ emails │
│ • Handle auth │ │ • Pattern │ │ • Dedupe │
└────────┬───────┘ │ match │ │ • Score │
│ └─────┬─────┘ └──────┬───────┘
│ │ │
┌────────▼────────────────▼────────────────▼───────┐
│ Browser Control Layer │
│ (Playwright + Stealth) │
└───────────────────────────────────────────────────┘
- Bun 1.0+ (https://bun.sh)
- Gemini API Key (Google Gemini 2.5)
- Convex Account (https://convex.dev)
# Install Bun (if not already installed)
curl -fsSL https://bun.sh/install | bash
# Clone the repository
git clone https://github.com/adam7rans/s33k3r.git
cd s33k3r
# Install dependencies
bun install
# Install Playwright browsers
cd backend && bunx playwright install chromium && cd ..

# Copy environment templates
cp backend/.env.example backend/.env
cp frontend/.env.local.example frontend/.env.local
# Edit backend/.env and add your API keys
GEMINI_API_KEY=your_gemini_api_key_here
CONVEX_URL=your_convex_deployment_url_here
# Edit frontend/.env.local
NEXT_PUBLIC_CONVEX_URL=your_convex_deployment_url_here

# Run Convex dev server to sync schema and generate types
bunx convex dev

This will authenticate you with Convex, create your deployment, and push the database schema.
# Run with default URL (example.com)
bun dev
# Or specify a target website
cd backend && bun dev https://yourwebsite.com
# Build and run in production mode
bun build
bun start https://yourwebsite.com

This will:
- Initialize the multi-agent system
- Analyze the target website structure
- Select the optimal search strategy (Common Page, Deep Crawl, etc.)
- Navigate to high-value pages (contact, about, team)
- Extract emails using multiple methods
- Validate and score all discovered emails
- Display detailed results with progress tracking
# Run all tests
bun run test

# Run tests with coverage
bun run test:coverage

# Run tests in watch mode
bun run test -- --watch

Current tests cover:
- ✅ BaseAgent initialization and conversation history
- ✅ NavigationAgent high-value page detection
- ✅ ExtractionAgent email finding methods
- ✅ ValidationAgent scoring and validation
- ✅ BrowserManager stealth mode
- ✅ PageNavigator retry logic
s33k3r/
├── backend/
│ ├── src/
│ │ ├── agents/ # AI Agent implementations
│ │ │ ├── BaseAgent.ts # Abstract base class
│ │ │ ├── OrchestratorAgent.ts # Strategy planning ⭐ NEW
│ │ │ ├── AgentCoordinator.ts # Workflow coordination ⭐ NEW
│ │ │ ├── NavigationAgent.ts # Page navigation
│ │ │ ├── ExtractionAgent.ts # Email extraction
│ │ │ └── ValidationAgent.ts # Email validation
│ │ ├── browser/ # Browser automation
│ │ │ ├── BrowserManager.ts # Playwright browser pool
│ │ │ ├── PageNavigator.ts # Navigation utilities
│ │ │ ├── AccessibilityExtractor.ts # A11y tree parsing
│ │ │ └── ScreenshotManager.ts # Screenshot capture
│ │ ├── utils/ # Utilities
│ │ │ ├── logger.ts # Logging system
│ │ │ ├── linkScorer.ts # Link scoring ⭐ NEW
│ │ │ └── domDistillation.ts # DOM reduction ⭐ NEW
│ │ ├── types/ # TypeScript types & Zod schemas
│ │ └── index.ts # Main entry point
│ ├── tests/ # Unit and integration tests
│ ├── package.json
│ ├── tsconfig.json
│ └── vitest.config.ts
├── docs/ # Documentation
│ ├── EMAIL_FINDER_AGENT_SPEC.md
│ ├── AGENT_AND_USER_FLOWS.md
│ └── MULTIPHASE_IMPLEMENTATION_PLAN.md (on other branch)
└── README.md # This file
Purpose: High-level strategic planning and agent coordination
Capabilities:
- Analyzes target website structure
- Selects optimal search strategy based on site characteristics
- Decomposes strategy into prioritized tasks
- Coordinates Navigation, Extraction, and Validation agents
- Provides real-time progress updates
Strategies:
- Common Page - Fast search of standard pages (contact, about, team)
- Deep Crawl - Systematic exploration of all pages
- LinkedIn - Extract LinkedIn profiles and construct emails
- API Discovery - Look for public APIs exposing contact info
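The strategy choice can be sketched as a mapping from simple site signals to one of the strategies above. This is a hedged sketch of assumed heuristics, not the actual OrchestratorAgent logic; `SiteSignals` and `selectStrategy` are illustrative names.

```typescript
// Illustrative heuristics (assumed, not the real OrchestratorAgent code):
// map coarse site signals to one of the four documented strategies.
type Strategy = "common_page" | "deep_crawl" | "linkedin" | "api_discovery";

interface SiteSignals {
  hasContactLink: boolean;   // nav/footer links to /contact, /about, /team
  hasLinkedInLinks: boolean; // outbound links to linkedin.com profiles
  pageCount: number;         // rough size estimate from sitemap or crawl
}

function selectStrategy(s: SiteSignals): Strategy {
  if (s.hasContactLink) return "common_page"; // fastest path when standard pages exist
  if (s.hasLinkedInLinks) return "linkedin";  // derive emails from profile names
  if (s.pageCount > 50) return "deep_crawl";  // no shortcuts; explore systematically
  return "api_discovery";                     // small site; look for exposed APIs
}
```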
Example Strategy Output:
{
"strategy": "common_page",
"tasks": [
{"type": "navigate", "url": "/contact", "priority": 10},
{"type": "navigate", "url": "/about", "priority": 9},
{"type": "navigate", "url": "/team", "priority": 8}
],
"estimatedPages": 5,
"reasoning": "Site has clear navigation with contact page"
}

Purpose: Unified search workflow management
Features:
- Initializes and manages all agents
- Executes complete search workflow
- Progress tracking with callbacks
- Error handling and recovery
- Resource cleanup
Workflow:
- Initialize browser and agents
- Navigate to target URL
- Plan search strategy
- Execute strategy with progress updates
- Validate and deduplicate emails
- Return structured results
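The workflow above can be sketched as an ordered pipeline with a progress callback. This is a minimal sketch of the coordination pattern only; the step names mirror the list above, while the `onProgress` callback shape is an assumption, not the real AgentCoordinator signature.

```typescript
// Sketch of the coordinator's loop: step names follow the documented
// workflow; the onProgress callback shape is an assumed illustration.
type Step = "init" | "navigate" | "plan" | "execute" | "validate" | "report";

function runWorkflow(onProgress: (step: Step, done: number, total: number) => void): Step[] {
  const steps: Step[] = ["init", "navigate", "plan", "execute", "validate", "report"];
  const completed: Step[] = [];
  for (let i = 0; i < steps.length; i++) {
    // Real agents do browser/LLM work at each step; this sketch only
    // demonstrates the ordering and progress reporting.
    completed.push(steps[i]);
    onProgress(steps[i], i + 1, steps.length);
  }
  return completed;
}
```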
Abstract base class that all agents inherit from. Provides:
- LLM conversation management (Anthropic Claude)
- Message history tracking
- Think/respond cycle
- Configurable model parameters
Purpose: Scout pages and identify high-value targets
Capabilities:
- Analyzes accessibility tree (lightweight observation)
- Identifies pages likely to contain emails (contact, about, team)
- Uses ReAct prompting for reasoning
- Decides which pages to hand off to Extraction Agent
System Prompt:
You are a Navigation Agent specialized in scouting web pages to find
contact information. Use the accessibility tree to understand page
structure and identify high-value pages. Use ReAct format:
THOUGHT: [Your reasoning]
ACTION: [navigate|click|extract|finish]
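A reply in this ReAct format has to be parsed before the agent can act on it. A minimal sketch of that parsing step, assuming replies follow the exact `THOUGHT:`/`ACTION:` layout shown above (the real NavigationAgent's parser may differ):

```typescript
// Parse one ReAct-formatted model reply into thought + action.
// Assumes the exact THOUGHT:/ACTION: format from the system prompt above.
interface ReActStep {
  thought: string;
  action: "navigate" | "click" | "extract" | "finish" | null;
}

function parseReAct(reply: string): ReActStep {
  const thought = /THOUGHT:\s*(.+)/.exec(reply)?.[1]?.trim() ?? "";
  const action = /ACTION:\s*(navigate|click|extract|finish)/.exec(reply)?.[1] ?? null;
  return { thought, action: action as ReActStep["action"] };
}
```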
Purpose: Deep email extraction from high-value pages
Methods:
- Regex Scanning - Pattern matching on page text
- mailto: Links - Clickable email links
- JavaScript Variables - Emails stored in JS code
- LLM Extraction - Claude-based intelligent parsing
Output: Email candidates with context
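The two cheapest methods, regex scanning and mailto links, can be sketched as follows. The email pattern here is a deliberately simplified one (not full RFC 5322), and these helpers are illustrations rather than the actual ExtractionAgent internals.

```typescript
// Simplified email pattern for illustration; real validation happens
// later in the ValidationAgent against stricter rules.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Method 1: regex scan over visible page text.
function extractFromText(text: string): string[] {
  return [...new Set(text.match(EMAIL_RE) ?? [])];
}

// Method 2: mailto: links are high-confidence candidates;
// strip any ?subject=... query string from the href.
function extractFromMailto(html: string): string[] {
  const re = /href=["']mailto:([^"'?]+)/g;
  return [...new Set([...html.matchAll(re)].map((m) => m[1]))];
}
```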
Purpose: Validate and score email quality
Validation Checks:
- ✅ Format validation (RFC 5322)
- ✅ Domain match detection (+40 bonus if matches)
- ✅ MX record lookup (+20 if exists)
- ✅ Generic role detection (-10 for info@, contact@)
- ✅ Suspicious pattern detection (-30 for noreply@)
- ✅ Context presence (+10 if has name/role)
Critical Design:
- Domain mismatch is NOT penalized
- Personal emails (@gmail.com) are valid findings
- Only penalize truly suspicious patterns
Scoring Example:
john.doe@example.com (from example.com)
→ +40 (domain match) +10 (context) +20 (MX) = 70 points ✓
jane@gmail.com (from example.com)
→ +0 (no penalty!) +10 (context) +20 (MX) = 30 points ✓
info@example.com (from example.com)
→ +40 (domain match) -10 (generic) +20 (MX) = 50 points
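The scoring rules above can be sketched directly from the documented weights. The weights come straight from the checklist; the function signature and `EmailSignals` shape are assumptions for illustration, not the actual ValidationAgent API.

```typescript
// Weights taken from the documented checklist; the signal struct and
// function shape are illustrative assumptions.
interface EmailSignals {
  domainMatches: boolean; // email domain matches the target site
  hasMx: boolean;         // MX record exists for the email's domain
  hasContext: boolean;    // a name/role appears near the email
  localPart: string;      // part before the @
}

function scoreEmail(s: EmailSignals): number {
  let score = 0;
  if (s.domainMatches) score += 40; // bonus only; a mismatch is NOT penalized
  if (s.hasMx) score += 20;         // deliverable domain
  if (s.hasContext) score += 10;    // name/role context nearby
  if (["info", "contact"].includes(s.localPart)) score -= 10; // generic role
  if (["noreply", "no-reply"].includes(s.localPart)) score -= 30; // suspicious
  return score;
}
```

Running the three documented examples through this sketch reproduces the 70 / 30 / 50 scores shown above.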
# Lint code
bun run lint

# Format code
bun run format

# Type check
bun run build

To add a new agent:

- Create a new file in src/agents/
- Extend BaseAgent
- Implement getSystemPrompt()
- Add agent-specific methods
- Write tests in tests/agents/
Example:
import { BaseAgent, AgentConfig } from './BaseAgent.js';
export class MyAgent extends BaseAgent {
constructor(config: Partial<AgentConfig> = {}) {
super({
name: 'MyAgent',
model: 'claude-3-5-sonnet-20241022',
temperature: 0.5,
maxTokens: 2000,
...config,
});
}
getSystemPrompt(): string {
return 'You are MyAgent. Your role is...';
}
async performTask(): Promise<void> {
const response = await this.think('Task prompt here');
// Process response...
}
}

Intelligently scores links based on their likelihood to contain emails:
- High-Value Keywords: contact (+10), about (+9), team (+9), people (+8)
- Avoid Keywords: login (-20), cart (-15), privacy (-10)
- Same Domain Bonus: +20 points
- Path Depth Penalty: -2 per level
- Root Level Bonus: +5 points
- Email Text Indicators: +5 if text mentions email/contact
Example:
import { linkScorer } from './utils/linkScorer';
const scored = linkScorer.scoreLink(
{ href: '/contact', text: 'Contact Us' },
'https://example.com'
);
// Result: score ~35 (contact +10, same domain +20, root +5)

Reduces full page DOM to email-relevant content only:
- 10-100x token reduction
- Extracts high-value sections (footer, header, contact areas)
- Identifies relevant links automatically
- Detects contact forms and social links
- Compact summaries for LLM processing
Before: 50KB full HTML → After: 2KB distilled content
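The distillation idea can be sketched as keeping only the sections most likely to hold contact info and measuring the reduction. The selectors here (footer/header only) are a simplification; the real domDistillation.ts also pulls contact areas, forms, and social links.

```typescript
// Minimal distillation sketch (assumed selectors; the real
// domDistillation.ts keeps more sections): retain footer/header
// blocks and report the size-reduction ratio.
function distill(html: string): { content: string; ratio: number } {
  const keep = [...html.matchAll(/<(footer|header)[^>]*>[\s\S]*?<\/\1>/gi)]
    .map((m) => m[0]);
  const content = keep.join("\n");
  // Ratio > 1 means the distilled content is smaller than the input.
  const ratio = html.length / Math.max(content.length, 1);
  return { content, ratio };
}
```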
- Orchestrator Agent with strategy planning
- Agent Coordinator service
- Link scoring algorithm
- DOM distillation utilities
- Strategy selection logic
- Complete search workflow
- Progress tracking system
- Integration tests
- Updated documentation
Next: Phase 4 - Advanced Extraction & Validation
This is currently a development branch implementing Phase 4 of the multiphase plan. See the implementation plan for the full roadmap.
MIT
- Phase 0: Project setup
- Phase 1: Core browser automation
- Phase 2: Basic agent framework
- Phase 3: Multi-agent orchestration
- Phase 4: Advanced extraction & validation ← YOU ARE HERE
- Phase 5: Frontend & UI
- Phase 6: Lead enrichment
- Phase 7: Production deployment
Built with ❤️ using TypeScript, Playwright, and Claude AI