TalentStream: an automated AI job hunter. A smart, async scraper that discovers, parses (via LLM), filters, and sends real-time notifications about new career opportunities.
- Automated Search — Scheduled job scraping via Serper API with configurable intervals
- Intelligent Parsing — LLM-powered extraction (OpenAI GPT-4 with Gemini fallback) of structured job data from raw HTML
- Smart Deduplication — Redis-based atomic duplicate detection preventing redundant processing
- Customizable Filtering — Keyword, location, and salary-based job filtering engine
- Real-time Notifications — Slack integration for instant job alerts
- Production-Ready — Async task queue (TaskIQ), database migrations (Alembic), structured logging (Structlog), and error tracking (Sentry)
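The filtering feature above can be sketched as a single predicate. The `Job` fields and the `passes_filters` signature here are illustrative assumptions, not the project's actual API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    """Illustrative parsed-job shape (not the project's actual model)."""
    title: str
    description: str
    location: str
    salary_min: Optional[int] = None

def passes_filters(job: Job, keywords: list[str], location: str, salary_min: int) -> bool:
    """Keyword (OR logic), location substring, and minimum-salary checks."""
    text = f"{job.title} {job.description}".lower()
    if not any(kw.lower() in text for kw in keywords):
        return False                       # no keyword matched
    if location and location.lower() not in job.location.lower():
        return False                       # wrong location
    if job.salary_min is not None and job.salary_min < salary_min:
        return False                       # below the salary floor
    return True
```

A job passes only when all three criteria hold, while the keyword list itself is OR-combined.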
The bot delivers structured job alerts directly to your Slack channel, including:
- Job title and seniority level
- Company name and location
- Salary range (when available)
- Required skills and technologies
- Direct "View Job" link to the original posting
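An alert carrying the fields above could be assembled as a Slack Block Kit payload. The input keys (`salary_range`, `seniority`, `skills`) are assumptions about the parsed-job schema, not the project's actual one:

```python
def build_job_alert(job: dict) -> list[dict]:
    """Assemble Slack Block Kit blocks for one job alert.

    The job dict keys used here are illustrative assumptions.
    """
    salary = job.get("salary_range") or "Not listed"
    skills = ", ".join(job.get("skills", [])) or "n/a"
    header = (
        f"*{job['title']}* ({job.get('seniority', 'n/a')})\n"
        f"{job['company']}, {job.get('location', 'n/a')}\n"
        f"Salary: {salary}\n"
        f"Skills: {skills}"
    )
    return [
        {"type": "section", "text": {"type": "mrkdwn", "text": header}},
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "View Job"},
                    "url": job["url"],
                }
            ],
        },
    ]
```

The resulting blocks can then be posted with `slack_sdk`'s `chat_postMessage(channel=..., blocks=...)`.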
- Python 3.13 — Runtime with asyncio support
- FastAPI — REST API framework for manual triggers and health checks
- PostgreSQL 15 — Primary data store with JSONB support
- Redis 7 — Deduplication cache and task broker
- TaskIQ — Async task queue with Redis broker
- TaskIQ Scheduler — Cron-based job scheduling
- SQLAlchemy 2.0 — Async ORM with declarative models
- Pydantic v2 — Schema validation with Rust core
- Alembic — Database migration management
- Serper API — Google search and web page content extraction
- OpenAI API — GPT-4 for job data parsing (primary)
- Google Gemini — Fallback LLM provider
- Slack SDK — Notification delivery
- Docker + Docker Compose — Containerized deployment
- Structlog — Structured JSON logging
- Sentry — Error tracking and monitoring
- Pre-commit — Code quality enforcement (Ruff linting/formatting)
- Docker & Docker Compose
- Python 3.13+ (for local development)
- Poetry 1.8+
- Clone the repository:

```bash
git clone https://github.com/PyDevDeep/TalentStream.git
cd TalentStream
```

- Configure environment variables:

```bash
cp .env.example .env
# Edit .env with your API keys and settings
```

Required environment variables:

```env
# Database
DATABASE_URL=postgresql+psycopg://user:password@localhost:5432/jobscraper

# Redis
REDIS_URL=redis://localhost:6379/0

# API Keys
SERPER_API_KEY=your_serper_api_key
OPENAI_API_KEY=your_openai_api_key
GEMINI_API_KEY=your_gemini_api_key  # Fallback

# Slack
SLACK_BOT_TOKEN=xoxb-your-slack-bot-token
SLACK_CHANNEL_ID=C01234567890

# Filters
FILTER_KEYWORDS=python,backend,fastapi
FILTER_LOCATION=Remote
FILTER_SALARY_MIN=50000

# Scheduler
SCRAPE_CRON=0 */6 * * *  # Every 6 hours
SCRAPE_QUERY=python backend developer remote

# Sentry (optional)
SENTRY_DSN=your_sentry_dsn
```

- Launch with Docker Compose:

```bash
docker compose up -d
```

This starts:
- PostgreSQL (port 5432)
- Redis (port 6379)
- FastAPI app (port 8000)
- TaskIQ worker
- TaskIQ scheduler
- Run database migrations:

```bash
docker compose exec app alembic upgrade head
```

Trigger a scrape manually:

```bash
curl -X POST http://localhost:8000/trigger-scrape \
  -H "Content-Type: application/json" \
  -d '{"query": "python senior developer remote"}'
```

Check service health:

```bash
curl http://localhost:8000/health
```

Follow logs:

```bash
docker compose logs -f app
docker compose logs -f worker
```

Pipeline overview:

```
┌─────────────┐      ┌──────────────┐      ┌─────────────┐
│  Scheduler  │─────▶│ Scrape Task  │─────▶│ Parse Tasks │
│   (cron)    │      │ (Serper API) │      │  (per URL)  │
└─────────────┘      └──────────────┘      └─────────────┘
                                                  │
                                                  ▼
                                  ┌───────────────────────────────┐
                                  │       Redis Dedup Check       │
                                  │      (atomic SET NX EX)       │
                                  └───────────────────────────────┘
                                                  │
                                                  ▼
                                  ┌───────────────────────────────┐
                                  │      Fetch Page Content       │
                                  │       (Serper View API)       │
                                  └───────────────────────────────┘
                                                  │
                                                  ▼
                                  ┌───────────────────────────────┐
                                  │          Strip Noise          │
                                  │  (regex patterns, truncate)   │
                                  └───────────────────────────────┘
                                                  │
                                                  ▼
                                  ┌───────────────────────────────┐
                                  │          LLM Parsing          │
                                  │    (OpenAI GPT-4 → Gemini)    │
                                  └───────────────────────────────┘
                                                  │
                                                  ▼
                                  ┌───────────────────────────────┐
                                  │         Filter Engine         │
                                  │ (keywords, location, salary)  │
                                  └───────────────────────────────┘
                                                  │
                                                  ▼
                                  ┌───────────────────────────────┐
                                  │      Store in PostgreSQL      │
                                  │    (upsert with URL dedup)    │
                                  └───────────────────────────────┘
                                                  │
                                                  ▼
                                  ┌───────────────────────────────┐
                                  │          Notify Task          │
                                  │        (Slack message)        │
                                  └───────────────────────────────┘
```
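The "Strip Noise" stage can be sketched with a few regex passes; the patterns and the 5000-character cap here are illustrative assumptions, not the project's actual list:

```python
import re

# Illustrative patterns; the project's actual regex list may differ.
TAG_BLOCKS = re.compile(r"<(script|style|nav|footer|header)[^>]*>.*?</\1>", re.S | re.I)
ANY_TAG = re.compile(r"<[^>]+>")
EXTRA_WS = re.compile(r"\s{2,}")

def strip_noise(html: str, max_chars: int = 5000) -> str:
    """Drop boilerplate blocks and tags, collapse whitespace, truncate for the LLM."""
    text = TAG_BLOCKS.sub(" ", html)
    text = ANY_TAG.sub(" ", text)
    text = EXTRA_WS.sub(" ", text).strip()
    return text[:max_chars]
```

Truncating before the LLM call keeps the token bill bounded regardless of page size.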
Tasks (app/tasks/)
- `scrape.py` — Search job URLs via Serper, queue parse tasks
- `parse.py` — Full pipeline: dedup → fetch → parse → filter → store
- `notify.py` — Fetch unnotified jobs from DB, send to Slack
Services (app/services/)
- `dedup.py` — Redis-based duplicate detection with TTL
- `filter.py` — Multi-criteria job filtering (keywords, location, salary)
- `noise_stripper.py` — HTML noise removal (nav, footer, boilerplate)
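The dedup service's atomic check (the `SET NX EX` step in the pipeline) can be sketched as follows. `is_new_job` and the key format are hypothetical names, and `FakeRedis` is a stand-in so the example runs without a server:

```python
import asyncio

class FakeRedis:
    """In-memory stand-in for redis.asyncio.Redis so the sketch runs offline."""
    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    async def set(self, key, value, nx=False, ex=None):
        # Mimics SET key value NX EX: True on the first claim, None afterwards.
        if nx and key in self._store:
            return None
        self._store[key] = value
        return True

async def is_new_job(redis, url: str, ttl_seconds: int = 7 * 24 * 3600) -> bool:
    """Atomically claim a URL; only the first caller for a given URL gets True."""
    return bool(await redis.set(f"job:dedup:{url}", "1", nx=True, ex=ttl_seconds))
```

Against a real server you would pass a `redis.asyncio.Redis` instance instead; its `set(..., nx=True, ex=...)` has the same semantics, and the TTL lets stale entries expire on their own.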
Clients (app/clients/)
- `serper.py` — Wrapper for Serper Search and View APIs
- `llm/router.py` — LLM routing with OpenAI primary, Gemini fallback
- `llm/openai_client.py` — GPT-4 structured output extraction
- `llm/gemini_client.py` — Gemini fallback implementation
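The router's primary-then-fallback behavior can be sketched generically; `parse_with_fallback` and the provider list are illustrative, not the actual `llm/router.py` API:

```python
import asyncio

async def parse_with_fallback(text: str, providers) -> dict:
    """Try providers in priority order; any exception triggers the next one."""
    last_error: Exception | None = None
    for _name, parse in providers:
        try:
            return await parse(text)
        except Exception as exc:  # rate limits, timeouts, malformed output, ...
            last_error = exc
    raise RuntimeError("all LLM providers failed") from last_error
```

With `providers = [("openai", openai_parse), ("gemini", gemini_parse)]`, Gemini is only consulted when the OpenAI call raises.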
Database (app/db/)
- `models/job.py` — SQLAlchemy Job model with JSONB metadata
- `repository.py` — CRUD operations with upsert and notification tracking
- `session.py` — Async database session factory
```bash
# Install development dependencies
poetry install

# Run all tests
pytest

# Run with coverage
pytest --cov=app --cov-report=html

# Run a specific test suite
pytest tests/unit/
pytest tests/integration/
```

Test structure:

- `tests/unit/` — Isolated component tests (clients, filters, schemas)
- `tests/integration/` — Database, Redis, and API integration tests
- `tests/e2e/` — Full pipeline end-to-end tests
```bash
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```

Production overrides include:
- Restart policies
- Resource limits
- Log rotation
- Health checks
GitHub Actions workflows (.github/workflows/):
- ci.yml — Linting, type checking, tests on every PR
- deploy.yml — Docker build and deployment on tag push
- Latency: p95 < 30s per job (Serper + LLM + store)
- Throughput: ≥ 100 jobs/hour with 4 workers
- Dedup Effectiveness: < 5% duplicates in Slack
- Uptime: 99% scheduler availability
- Error Rate: < 2% failed tasks per batch
All services output structured JSON logs via Structlog:
```json
{
  "event": "job_stored_successfully",
  "job_id": 123,
  "url": "https://example.com/job",
  "timestamp": "2026-04-23T12:00:00Z"
}
```

Sentry integration captures:
- Unhandled exceptions
- LLM parsing failures
- API rate limit errors
- Database connection issues
Filters (edit in `.env`):

```env
# Comma-separated keywords (OR logic)
FILTER_KEYWORDS=python,fastapi,asyncio,backend

# Location string matching
FILTER_LOCATION=Remote

# Minimum annual salary (USD)
FILTER_SALARY_MIN=80000
```

Schedule (cron expression in `.env`):

```env
# Run every 6 hours at minute 0
SCRAPE_CRON=0 */6 * * *

# Serper search query
SCRAPE_QUERY=senior python developer remote usa
```

LLM provider priority order:
- OpenAI GPT-4 (primary, fastest)
- Google Gemini (fallback on OpenAI failure)
Configure the provider order in `app/clients/llm/router.py`.
Serper API:
- Free tier: 2500 credits/month
- Search: 2 credits per query
- View: 2, 6, or 10 credits per URL
- Recommended: paid plan ($50/month for 50K credits) for production
OpenAI API:
- Model: GPT-4 Turbo
- Input: ~2000 tokens/call × $0.01/1K = $0.02/call
- Output: ~500 tokens/call × $0.03/1K = $0.015/call
- Estimated cost: ~$252/month at 7200 calls/month (240 jobs/day)
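As a sanity check, the per-call figures multiply out to:

```python
# Sanity-check the per-call figures against the monthly volume.
input_cost = 2000 / 1000 * 0.01     # ~2000 input tokens at $0.01/1K = $0.02/call
output_cost = 500 / 1000 * 0.03     # ~500 output tokens at $0.03/1K = $0.015/call
per_call = input_cost + output_cost
monthly = per_call * 7200           # 240 jobs/day * 30 days = 7200 calls
print(f"${per_call:.3f}/call, ${monthly:.0f}/month")  # → $0.035/call, $252/month
```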
Slack API:
- Rate limit: 1 message/second per channel
- Mitigation: `asyncio.sleep(1)` between messages
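The mitigation above can be sketched as a sequential sender; `send` stands in for a wrapper around the Slack SDK's `chat_postMessage` (an assumption about how the notify task is wired):

```python
import asyncio

async def send_with_rate_limit(send, messages, interval: float = 1.0):
    """Send messages one at a time, sleeping between sends to respect
    Slack's ~1 message/second/channel limit."""
    for i, message in enumerate(messages):
        if i:                      # no delay before the first message
            await asyncio.sleep(interval)
        await send(message)
```

Sending sequentially (rather than `asyncio.gather`) is deliberate here: concurrency is exactly what would trip the per-channel limit.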
[INSERT LICENSE TYPE]
Contributions welcome! Please:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Run pre-commit checks (`pre-commit run --all-files`)
4. Commit changes (`git commit -m 'Add amazing feature'`)
5. Push to the branch (`git push origin feature/amazing-feature`)
6. Open a Pull Request
Issue: Tasks not processing
```bash
# Check worker logs
docker compose logs worker

# Verify Redis connection
docker compose exec app python -c "from redis.asyncio import Redis; import asyncio; asyncio.run(Redis.from_url('redis://redis:6379/0').ping())"
```

Issue: Database connection errors
```bash
# Check PostgreSQL status
docker compose ps postgres

# Run migrations
docker compose exec app alembic upgrade head
```

Issue: Slack notifications not sending
```bash
# Test Slack token
docker compose exec app python test_notify.py
```

Issue: LLM parsing failures
- Check API keys in `.env`
- Review logs for rate-limit errors
- Verify the noise stripper output length (should be < 5000 chars)
For issues and questions:
- GitHub Issues: https://github.com/PyDevDeep/TalentStream/issues
- Repository: https://github.com/PyDevDeep/TalentStream
Built with ❤️ using Python, FastAPI, and asyncio
