LeadScout

An AI agent that browses websites, extracts job listings and company leads, and exports structured results — all driven by the Model Context Protocol (MCP).

Built with the OpenAI Agents SDK, Playwright, and Gradio.


What it does

Give LeadScout a target website and a description of what you want — it will:

  1. Navigate the site using a real browser (Playwright handles JavaScript, pagination, dynamic content)
  2. Extract job listings, company details, contacts, or other leads
  3. Save every find to a structured vault as it browses
  4. Export the full list to CSV or JSON in one click

Results appear in a live Gradio UI alongside a real-time log of every MCP tool call the agent makes.


MCP Concepts

What is MCP?

The Model Context Protocol (MCP) is an open standard that defines how AI applications connect to external tools and data sources. Think of it as USB-C for AI — a single, consistent interface that lets any MCP-compatible AI host talk to any MCP-compatible server, regardless of who built either one.

Before MCP, every AI application had to build its own integrations: custom code to call a browser, custom code to read files, custom code to query a database. With MCP, those integrations are written once as MCP servers and can be reused by any AI host.

The Three-Layer Architecture

MCP defines three distinct roles:

┌─────────────────────────────────────────────────────┐
│                      MCP HOST                       │
│  The AI application that controls the agent loop.   │
│  Owns the LLM, decides which servers to connect,    │
│  and manages the overall conversation.              │
│                                                     │
│  In LeadScout: AgentManager + OpenAI Agents SDK     │
└──────────────────┬──────────────────────────────────┘
                   │  spawns & talks to
       ┌───────────┼───────────┐
       │           │           │
┌──────▼──────┐ ┌──▼──────┐ ┌──▼──────────┐
│ MCP CLIENT  │ │ MCP     │ │ MCP         │
│ (built into │ │ CLIENT  │ │ CLIENT      │
│  the SDK)   │ │         │ │             │
├──────▼──────┤ ├──▼──────┤ ├──▼──────────┤
│ MCP SERVER  │ │ MCP     │ │ MCP         │
│ Playwright  │ │ Fetch   │ │ ProspectV.  │
└─────────────┘ └─────────┘ └─────────────┘
| Role | Responsibility | Example in LeadScout |
| --- | --- | --- |
| Host | Runs the agent loop, connects to servers, sends tool results to the LLM | AgentManager using the OpenAI Agents SDK |
| Client | One connection to one server — manages the protocol session | Created automatically by the SDK per server |
| Server | Exposes tools, resources, or prompts to the host | Playwright, Fetch, ProspectVault |

The host and client usually live in the same process. The server can be local (subprocess) or remote (over the network).
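Concretely, each server is just a command the host knows how to spawn. A minimal sketch of how LeadScout's three stdio servers could be declared (the real definitions live in src/leadscout/config.py; the dict layout here is illustrative):

```python
# Illustrative server declarations -- one subprocess (and one MCP client
# session) per entry. Commands match the ones used elsewhere in this README.
SERVER_DEFINITIONS = [
    {
        "name": "playwright",
        "command": "npx",
        "args": ["@playwright/mcp@latest"],
    },
    {
        "name": "fetch",
        "command": "uvx",
        "args": ["mcp-server-fetch"],
    },
    {
        "name": "prospect_vault",
        "command": "python",
        "args": ["src/leadscout/servers/prospect_vault_server.py"],
    },
]

for server in SERVER_DEFINITIONS:
    print(server["name"], "->", server["command"], *server["args"])
```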


MCP Primitives

MCP servers expose three types of capabilities:

Tools

Functions the LLM can call to take actions or retrieve data. The server declares them; the LLM decides when to use them; the host executes them and returns results.

LLM decides to call add_prospect(name="Acme Corp", type="company", ...)
  → Host sends the call to ProspectVault MCP server
  → Server writes to vault, returns {"status": "saved", "id": "..."}
  → Host sends the result back to the LLM
  → LLM continues reasoning
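The same round trip can be sketched with a stubbed tool registry standing in for ProspectVault (the names match the flow above; everything else is illustrative, not LeadScout's real code):

```python
import json

# Stub "server": a tool registry mapping names to handlers,
# as a real MCP server would expose them.
vault = []

def add_prospect(name: str, type: str, **fields) -> dict:
    """Stub for the ProspectVault add_prospect tool."""
    vault.append({"name": name, "type": type, **fields})
    return {"status": "saved", "id": str(len(vault))}

TOOLS = {"add_prospect": add_prospect}

def host_execute(tool_call: dict) -> str:
    """Host side: route the LLM's tool call to the right handler and
    return the JSON result that is fed back to the LLM."""
    handler = TOOLS[tool_call["name"]]
    result = handler(**tool_call["arguments"])
    return json.dumps(result)

# The LLM emits a structured tool call; the host executes and replies.
call = {"name": "add_prospect",
        "arguments": {"name": "Acme Corp", "type": "company"}}
print(host_execute(call))  # {"status": "saved", "id": "1"}
```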

Resources

Read-only data the host can fetch at any time — like file system reads or database queries. Identified by URI (e.g. vault://prospects). The host, not the LLM, decides when to read them.

Prompts

Reusable prompt templates defined by the server. Less common, mostly used in IDE-style integrations where the user selects a predefined workflow.

LeadScout uses tools (for actions like add_prospect, browser_navigate) and resources (for vault reads like vault://stats).
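A minimal sketch of the tools-vs-resources split, using the vault URIs above (the handlers are stand-ins, not LeadScout's real implementation):

```python
import json

prospects = [{"name": "Acme Corp", "type": "company"}]

# Tools: actions the LLM chooses to invoke.
def add_prospect(name, type):
    prospects.append({"name": name, "type": type})
    return {"status": "saved"}

# Resources: read-only data the *host* fetches by URI, on its own schedule.
RESOURCES = {
    "vault://prospects": lambda: json.dumps(prospects),
    "vault://stats": lambda: json.dumps({"count": len(prospects)}),
}

print(RESOURCES["vault://stats"]())  # {"count": 1}
```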


Transports: stdio vs SSE

MCP does not mandate a single transport. The protocol itself is JSON-RPC messages, and those messages can travel over different channels. The two main options are stdio and SSE.

stdio (Standard I/O)

The host spawns the server as a child process. Messages travel over the process's stdin/stdout as newline-delimited JSON. The server lives and dies with the host process.

Host process
│
├── spawn: python prospect_vault_server.py
│         stdin  ──────────────────────────► server reads requests
│         stdout ◄────────────────────────── server writes responses
│
├── spawn: npx @playwright/mcp
│         stdin/stdout ←──────────────────── same pattern
│
└── spawn: uvx mcp-server-fetch
          stdin/stdout ←──────────────────── same pattern

When to use stdio:

  • Local tools (browser automation, file system, databases on the same machine)
  • Development and prototyping
  • Tools that should not be exposed over the network
  • Simple deployment — no server to run separately, no ports to manage

All three servers in LeadScout use stdio.
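Under the hood, each stdio message is one JSON-RPC object per line. A sketch of the framing, using the real `tools/call` method name but illustrative arguments:

```python
import json

def frame(message: dict) -> bytes:
    """Serialize one JSON-RPC message for a stdio transport:
    one JSON object per line, newline-terminated."""
    return (json.dumps(message) + "\n").encode()

def unframe(line: bytes) -> dict:
    """Parse one newline-delimited JSON-RPC message."""
    return json.loads(line.decode())

# A tools/call request as the host would write it to the server's stdin.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "add_prospect",
        "arguments": {"name": "Acme Corp", "type": "company"},
    },
}
wire = frame(request)
assert unframe(wire)["method"] == "tools/call"
```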

SSE (Server-Sent Events)

The server runs as a standalone HTTP service. The host connects over HTTP — commands go as POST requests, responses stream back as SSE events. The server runs independently and can serve multiple hosts simultaneously.

Host process                    Remote / separate process
│                               │
│  POST /message  ─────────────►│  MCP Server (HTTP)
│  GET  /sse      ◄─────────────│  streams responses
│                               │
│                               │  can serve many hosts at once
│                               │  survives host restarts

When to use SSE:

  • Shared infrastructure (one server instance used by many agents or users)
  • Remote tools — the server is on a different machine or in the cloud
  • Long-running services that should persist independently (e.g. a company-wide knowledge base server)
  • Microservice architectures where tools are deployed separately

Comparison

| | stdio | SSE |
| --- | --- | --- |
| Deployment | Subprocess, same machine | Standalone HTTP server |
| Startup | On-demand, spawned by host | Always-on, host connects to it |
| Scope | One host at a time | Many hosts simultaneously |
| Network | None needed | HTTP/HTTPS |
| Security | OS process isolation | Network auth (API keys, OAuth) |
| Best for | Local tools, dev | Shared/remote tools, prod infra |

MCP Server Types

1. Prebuilt servers (community ecosystem)

Ready-to-use servers published to npm or PyPI. You run them with npx or uvx — no installation, no code to write.

# Playwright browser automation
npx @playwright/mcp@latest

# Fetch / HTTP retrieval
uvx mcp-server-fetch

# Filesystem access
uvx mcp-server-filesystem /path/to/dir

# GitHub
uvx mcp-server-github

The growing ecosystem means most common integrations (web search, databases, cloud services, dev tools) already have a prebuilt server. Browse the full list at modelcontextprotocol.io/servers.

Tradeoffs: Fast to get started, no maintenance burden. But you're limited to what the server exposes — no custom business logic.

2. Self-hosted custom servers

You write and run your own MCP server. This is the right choice when you need:

  • Custom business logic (e.g. "save a lead with our specific schema")
  • Integration with proprietary internal systems
  • Fine-grained control over what the LLM can and cannot do
  • Domain-specific tools that don't exist in the ecosystem

The simplest way to build one in Python is FastMCP, which ships with the mcp[cli] package:

import json

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("MyServer")
db = SomeStore()  # placeholder for your storage layer

@mcp.tool()
def save_record(name: str, value: str) -> dict:
    """Save a record to the database."""
    db.insert(name, value)
    return {"status": "saved"}

@mcp.resource("data://records")
def list_records() -> str:
    """Return all records as JSON."""
    return json.dumps(db.all())

if __name__ == "__main__":
    mcp.run()  # stdio by default; mcp.run(transport="sse") for HTTP

The @mcp.tool() decorator auto-generates the JSON schema from type hints and docstrings — the LLM sees a clean tool definition with no extra work.

ProspectVault (src/leadscout/servers/prospect_vault_server.py) is this project's custom server. It stores leads in a JSON file, supports filtering, and exports to CSV/JSON — business logic that doesn't exist in any prebuilt server.


How the Agent Loop Uses MCP

When you start a search, here is the full flow:

1. User submits query in Gradio UI

2. AgentManager.initialize() (first run only)
   └── Spawns three MCP servers as subprocesses (stdio)
   └── Each server handshakes with the host: declares its tools and resources
   └── SDK caches the tool list from each server

3. Runner.run(agent, message)
   └── SDK builds a system prompt that includes all available MCP tools
   └── LLM receives: instructions + tool definitions + user message

4. LLM decides to use a tool (e.g. browser_navigate)
   └── SDK routes the call to the correct MCP server (Playwright)
   └── Server executes the action (real browser navigates to URL)
   └── Server returns result as text
   └── Result is fed back to the LLM as a tool response

5. LLM continues — may call more tools, read resources, reason over results

6. LLM calls add_prospect() for each find
   └── ProspectVault server saves to sandbox/prospects.json
   └── Gradio UI polls the file and updates the Prospects tab live

7. LLM produces a final text summary → displayed in chat

The MCPActivityTracer intercepts every span emitted by the SDK (tool calls, list-tools handshakes, LLM generation) and streams them into the Live MCP Activity tab so you can watch the agent work in real time.
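Steps 3–7 can be condensed into a toy loop with a canned stand-in for the LLM (nothing here is the SDK's real API; it only shows the control flow):

```python
import json

def fake_llm(messages):
    """Stand-in for the real model: call one tool, then finish."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "add_prospect",
                              "arguments": {"name": "Acme Corp",
                                            "type": "company"}}}
    return {"final": "Saved 1 prospect."}

def run_agent(user_message, tools):
    messages = [{"role": "user", "content": user_message}]
    while True:
        reply = fake_llm(messages)
        if "final" in reply:                      # final text summary
            return reply["final"]
        call = reply["tool_call"]                 # LLM picks a tool
        result = tools[call["name"]](**call["arguments"])  # host routes + executes
        messages.append({"role": "tool",          # result fed back to the LLM
                         "content": json.dumps(result)})

vault = []
tools = {"add_prospect": lambda **f: (vault.append(f) or {"status": "saved"})}
print(run_agent("Find companies on example.com", tools))  # Saved 1 prospect.
```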


Architecture

┌──────────────────────────────────────────────────────────────┐
│                     Gradio UI (app.py)                       │
│  Search input + chat │ MCP Activity │ Prospects │ Servers    │
└──────────────────────────────┬───────────────────────────────┘
                               │
                 AgentManager (agent_manager.py)
                 openai-agents SDK + AsyncExitStack
                               │
          ┌────────────────────┼────────────────────┐
          │                    │                    │
   ┌──────▼──────┐      ┌──────▼──────┐      ┌──────▼──────┐
   │   Fetch     │      │ Playwright  │      │ProspectVault│
   │   Server    │      │   Server    │      │   Server    │
   │   (uvx)     │      │   (npx)     │      │  (python)   │
   └─────────────┘      └─────────────┘      └─────────────┘
    Prebuilt             Prebuilt             Custom
    Fast page            Full browser         Stores & exports
    fetching             automation           discovered leads
                               │
                 MCPActivityTracer (tracing.py)
     Captures every tool call, LLM span → shown in UI

MCP Servers

| Server | Type | Transport | What it provides |
| --- | --- | --- | --- |
| Playwright | Prebuilt (npx) | stdio | Browser automation — navigate, click, scroll, snapshot, screenshot |
| Fetch | Prebuilt (uvx) | stdio | Fast page content retrieval for simpler URLs |
| ProspectVault | Custom (Python) | stdio | Save leads, filter by type/location, export to CSV/JSON |

ProspectVault tools & resources

| Tool | Description |
| --- | --- |
| add_prospect | Save a found job or company lead with structured fields |
| list_prospects | List all saved prospects with optional type/location filters |
| get_prospect | Retrieve full detail of a single prospect |
| update_notes | Append notes to an existing prospect |
| export_prospects | Write all prospects to sandbox/ as CSV or JSON |
| clear_vault | Wipe all prospects to start a fresh session |

| Resource URI | Description |
| --- | --- |
| vault://prospects | All prospects as a JSON array |
| vault://stats | Counts by type and top locations |

Project Structure

mcp/
├── .env.example
├── .gitignore
├── pyproject.toml
├── README.md
├── sandbox/                       # Agent outputs — gitignored
│   ├── prospects.json             # Live vault data
│   └── prospects_YYYYMMDD_*.csv  # Exports
└── src/
    └── leadscout/
        ├── __main__.py
        ├── app.py                 # Gradio UI
        ├── agent_manager.py       # Server lifecycle + agent runner
        ├── config.py              # Server definitions + agent prompt
        ├── tracing.py             # MCP activity capture
        └── servers/
            └── prospect_vault_server.py  # Custom FastMCP server

Setup

Prerequisites

| Requirement | Notes |
| --- | --- |
| Python 3.13+ | |
| UV | Package manager — `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| Node.js 18+ | Required for Playwright MCP — nodejs.org |
| OpenAI API key | gpt-4o is the default model |

Install

cd mcp
uv sync
cp .env.example .env
# Edit .env — set OPENAI_API_KEY=sk-...

Playwright's browser binaries are downloaded automatically on first run by @playwright/mcp.

Run

uv run leadscout
# or
uv run python -m leadscout

Open http://localhost:7860


Usage

Basic search

  1. Enter a Target URL — e.g. https://jobs.lever.co/anthropic
  2. Describe what to look for — e.g. Remote Python backend roles paying $120k+
  3. Select a Prospect type (job / company / contact / lead, or Any)
  4. Click ▶ Start Search

The agent will navigate the site, extract listings, and save each one to the vault. Watch the MCP Activity tab to see every browser action and vault save in real time.

Follow-up questions

After a search completes, use the Follow-up chat to refine:

"Filter to remote-only roles"
"Which of these companies are Series A or earlier?"
"Export everything you found to CSV"

Export

Click Export CSV or Export JSON — the agent writes the file to sandbox/ and a download link appears in the UI.

Example prompts

| Prompt | What happens |
| --- | --- |
| Browse YC's work-at-a-startup page and save all engineering roles | Playwright navigates workatastartup.com, saves each role |
| Find ML engineer openings at AI labs and note the salary ranges | Browses company career pages, extracts compensation data |
| Find B2B SaaS companies that raised in the last 6 months | Fetches news/funding pages, saves company leads |
| List all jobs you found and tell me which looks best for a senior engineer | Uses ProspectVault list_prospects, agent summarises |

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| OPENAI_API_KEY | (required) | Your OpenAI key |
| OPENAI_MODEL | gpt-4o | Model for the agent |
| SANDBOX_DIR | ./sandbox | Where vault data and exports are written |
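A sketch of how these variables might be read, with the defaults from the table (the real app loads .env via python-dotenv first; the helper name is illustrative):

```python
import os

def load_settings(env=os.environ):
    """Read LeadScout's configuration from environment variables.
    OPENAI_API_KEY is required; the rest fall back to documented defaults."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is required")
    return {
        "api_key": api_key,
        "model": env.get("OPENAI_MODEL", "gpt-4o"),
        "sandbox_dir": env.get("SANDBOX_DIR", "./sandbox"),
    }

settings = load_settings({"OPENAI_API_KEY": "sk-test"})
print(settings["model"], settings["sandbox_dir"])  # gpt-4o ./sandbox
```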

Dependencies

| Package | Purpose |
| --- | --- |
| openai-agents | Agent orchestration + MCP client |
| mcp[cli] | FastMCP server framework |
| gradio | Web UI |
| python-dotenv | Env var loading |
