Automated AI Job Hunter. A smart, async scraper that discovers, parses (with AI), filters, and sends notifications about new career opportunities in real time.
- Automated Search – Scheduled job scraping via the Serper API with configurable intervals
- Intelligent Parsing – LLM-powered extraction (OpenAI GPT-4 with Gemini fallback) of structured job data from raw HTML
- Smart Deduplication – Redis-based atomic duplicate detection that prevents redundant processing
- Customizable Filtering – Keyword-, location-, and salary-based job filtering engine
- Real-time Notifications – Slack integration for instant job alerts
- Production-Ready – Async task queue (TaskIQ), database migrations (Alembic), structured logging (Structlog), and error tracking (Sentry)
The bot delivers structured job alerts directly to your Slack channel (see the sketch after this list), including:
- Job title and seniority level
- Company name and location
- Salary range (when available)
- Required skills and technologies
- Direct "View Job" link to the original posting
- Python 3.13 – Runtime with asyncio support
- FastAPI – REST API framework for manual triggers and health checks
- PostgreSQL 15 – Primary data store with JSONB support
- Redis 7 – Deduplication cache and task broker
- TaskIQ – Async task queue with Redis broker
- TaskIQ Scheduler – Cron-based job scheduling
- SQLAlchemy 2.0 – Async ORM with declarative models
- Pydantic v2 – Schema validation with Rust core
- Alembic – Database migration management
- Serper API – Google search and web page content extraction
- OpenAI API – GPT-4 for job data parsing (primary)
- Google Gemini – Fallback LLM provider
- Slack SDK – Notification delivery
- Docker + Docker Compose – Containerized deployment
- Structlog – Structured JSON logging
- Sentry – Error tracking and monitoring
- Pre-commit – Code quality enforcement (Ruff linting/formatting)
- Docker & Docker Compose
- Python 3.13+ (for local development)
- Poetry 1.8+
- Clone the repository

```bash
git clone https://github.com/PyDevDeep/TalentStream.git
cd TalentStream
```

- Configure environment variables

```bash
cp .env.example .env
# Edit .env with your API keys and settings
```

Required environment variables:
```env
# Database
DATABASE_URL=postgresql+psycopg://user:password@localhost:5432/jobscraper

# Redis
REDIS_URL=redis://localhost:6379/0

# API Keys
SERPER_API_KEY=your_serper_api_key
OPENAI_API_KEY=your_openai_api_key
GEMINI_API_KEY=your_gemini_api_key  # Fallback

# Slack
SLACK_BOT_TOKEN=xoxb-your-slack-bot-token
SLACK_CHANNEL_ID=C01234567890

# Filters
FILTER_KEYWORDS=python,backend,fastapi
FILTER_LOCATION=Remote
FILTER_SALARY_MIN=50000

# Scheduler
SCRAPE_CRON=0 */6 * * *  # Every 6 hours
SCRAPE_QUERY=python backend developer remote

# Sentry (optional)
SENTRY_DSN=your_sentry_dsn
```

- Launch with Docker Compose
```bash
docker compose up -d
```

This starts:
- PostgreSQL (port 5432)
- Redis (port 6379)
- FastAPI app (port 8000)
- TaskIQ worker
- TaskIQ scheduler
- Run database migrations

```bash
docker compose exec app alembic upgrade head
```

Trigger a manual scrape:

```bash
curl -X POST http://localhost:8000/trigger-scrape \
  -H "Content-Type: application/json" \
  -d '{"query": "python senior developer remote"}'
```

Check service health:

```bash
curl http://localhost:8000/health
```

Follow logs:

```bash
docker compose logs -f app
docker compose logs -f worker
```

Pipeline overview:

```
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  Scheduler  │────▶│ Scrape Task  │────▶│ Parse Tasks │
│   (cron)    │     │ (Serper API) │     │  (per URL)  │
└─────────────┘     └──────────────┘     └─────────────┘
                                                │
                                                ▼
                                ┌───────────────────────────────┐
                                │       Redis Dedup Check       │
                                │      (atomic SET NX EX)       │
                                └───────────────────────────────┘
                                                │
                                                ▼
                                ┌───────────────────────────────┐
                                │      Fetch Page Content       │
                                │       (Serper View API)       │
                                └───────────────────────────────┘
                                                │
                                                ▼
                                ┌───────────────────────────────┐
                                │          Strip Noise          │
                                │  (regex patterns, truncate)   │
                                └───────────────────────────────┘
                                                │
                                                ▼
                                ┌───────────────────────────────┐
                                │          LLM Parsing          │
                                │    (OpenAI GPT-4 → Gemini)    │
                                └───────────────────────────────┘
                                                │
                                                ▼
                                ┌───────────────────────────────┐
                                │         Filter Engine         │
                                │ (keywords, location, salary)  │
                                └───────────────────────────────┘
                                                │
                                                ▼
                                ┌───────────────────────────────┐
                                │      Store in PostgreSQL      │
                                │    (upsert with URL dedup)    │
                                └───────────────────────────────┘
                                                │
                                                ▼
                                ┌───────────────────────────────┐
                                │          Notify Task          │
                                │        (Slack message)        │
                                └───────────────────────────────┘
```
Tasks (`app/tasks/`)
- `scrape.py` – Search job URLs via Serper and queue parse tasks (see the fan-out sketch below)
- `parse.py` – Full pipeline: dedup → fetch → parse → filter → store
- `notify.py` – Fetch unnotified jobs from the DB and send them to Slack
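As a rough illustration of the fan-out between `scrape.py` and `parse.py`, here is a minimal TaskIQ sketch; the helper, task bodies, and module layout are assumptions, not the repo's actual code:

```python
# Illustrative fan-out sketch (names and bodies are assumptions).
from taskiq_redis import ListQueueBroker

broker = ListQueueBroker(url="redis://localhost:6379/0")


async def search_job_urls(query: str) -> list[str]:
    # Stand-in for the Serper search client.
    return ["https://example.com/job/1", "https://example.com/job/2"]


@broker.task
async def parse_job(url: str) -> None:
    # Per-URL pipeline: dedup -> fetch -> strip noise -> LLM parse -> filter -> store.
    print(f"parsing {url}")


@broker.task
async def scrape_jobs(query: str) -> None:
    # One parse task per discovered URL; .kiq() enqueues without waiting.
    for url in await search_job_urls(query):
        await parse_job.kiq(url)
```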
Services (`app/services/`)
- `dedup.py` – Redis-based duplicate detection with TTL (see the sketch below)
- `filter.py` – Multi-criteria job filtering (keywords, location, salary)
- `noise_stripper.py` – HTML noise removal (nav, footer, boilerplate)
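The atomic check in `dedup.py` can be expressed as a single Redis `SET NX EX` call; a minimal sketch, assuming a `job:seen:` key prefix and a one-week TTL:

```python
from redis.asyncio import Redis

redis = Redis.from_url("redis://localhost:6379/0")


async def is_new_job(url: str, ttl_seconds: int = 7 * 24 * 3600) -> bool:
    # SET ... NX EX is atomic: only the first caller gets True, so two
    # workers can never both claim the same URL. The TTL lets re-posted
    # jobs surface again after a week.
    return bool(await redis.set(f"job:seen:{url}", "1", nx=True, ex=ttl_seconds))
```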
Clients (`app/clients/`)
- `serper.py` – Wrapper for the Serper Search and View APIs
- `llm/router.py` – LLM routing with OpenAI primary, Gemini fallback (sketched below)
- `llm/openai_client.py` – GPT-4 structured output extraction
- `llm/gemini_client.py` – Gemini fallback implementation
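A minimal sketch of the primary/fallback routing idea; the model names, prompt, and broad exception handling are assumptions, not the actual `router.py`:

```python
# Primary/fallback LLM routing sketch (illustrative, not the repo's code).
import google.generativeai as genai
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
genai.configure(api_key="your_gemini_api_key")


async def parse_job_posting(text: str) -> str:
    prompt = f"Extract title, company, location, and salary as JSON:\n{text}"
    try:
        resp = await openai_client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        return resp.choices[0].message.content
    except Exception:
        # Any primary failure (rate limit, timeout, 5xx) falls back to Gemini.
        model = genai.GenerativeModel("gemini-1.5-flash")
        resp = await model.generate_content_async(prompt)
        return resp.text
```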
Database (`app/db/`)
- `models/job.py` – SQLAlchemy Job model with JSONB metadata
- `repository.py` – CRUD operations with upsert and notification tracking (see the upsert sketch below)
- `session.py` – Async database session factory
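At the storage layer, URL-level dedup typically maps to PostgreSQL's `INSERT ... ON CONFLICT`; a simplified sketch with a stand-in `Job` model rather than the real one:

```python
# Upsert sketch (simplified model; the real Job model has more columns).
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Job(Base):
    __tablename__ = "jobs"
    id: Mapped[int] = mapped_column(primary_key=True)
    url: Mapped[str] = mapped_column(unique=True)
    title: Mapped[str]


async def upsert_job(session: AsyncSession, url: str, title: str) -> None:
    # INSERT ... ON CONFLICT (url) DO UPDATE keeps one row per posting URL.
    stmt = insert(Job).values(url=url, title=title).on_conflict_do_update(
        index_elements=["url"], set_={"title": title}
    )
    await session.execute(stmt)
    await session.commit()
```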
```bash
# Install development dependencies
poetry install

# Run all tests
pytest

# Run with coverage
pytest --cov=app --cov-report=html

# Run specific test suites
pytest tests/unit/
pytest tests/integration/
```

Test structure:
- `tests/unit/` – Isolated component tests (clients, filters, schemas); see the example below
- `tests/integration/` – Database, Redis, and API integration tests
- `tests/e2e/` – Full pipeline end-to-end tests
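For flavor, a minimal unit test of the OR-logic keyword rule; `matches_keywords` is a hypothetical helper, not necessarily the repo's actual filter API:

```python
# tests/unit/test_filter_sketch.py (illustrative)
def matches_keywords(text: str, keywords: list[str]) -> bool:
    # OR logic: any single keyword is enough to pass the filter.
    lowered = text.lower()
    return any(kw.lower() in lowered for kw in keywords)


def test_any_keyword_passes():
    assert matches_keywords("Senior Python Developer (Remote)", ["python", "fastapi"])


def test_no_keyword_fails():
    assert not matches_keywords("Java Engineer", ["python", "fastapi"])
```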
```bash
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```

Production overrides include:
- Restart policies
- Resource limits
- Log rotation
- Health checks
GitHub Actions workflows (`.github/workflows/`):
- `ci.yml` – Linting, type checking, and tests on every PR
- `deploy.yml` – Docker build and deployment on tag push
- Latency: p95 < 30s per job (Serper + LLM + store)
- Throughput: ≥ 100 jobs/hour with 4 workers
- Dedup Effectiveness: < 5% duplicates in Slack
- Uptime: 99% scheduler availability
- Error Rate: < 2% failed tasks per batch
All services output structured JSON logs via Structlog:

```json
{
  "event": "job_stored_successfully",
  "job_id": 123,
  "url": "https://example.com/job",
  "timestamp": "2026-04-23T12:00:00Z"
}
```

Sentry integration captures:
- Unhandled exceptions
- LLM parsing failures
- API rate limit errors
- Database connection issues
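A minimal setup sketch that would produce logs in that shape; the processor list and Sentry wiring are assumptions, not the repo's exact configuration:

```python
import os

import sentry_sdk
import structlog

# Error tracking: an empty DSN disables Sentry, so this is safe to run locally.
sentry_sdk.init(dsn=os.environ.get("SENTRY_DSN", ""))

# Structured JSON logging with an ISO "timestamp" key, as in the example above.
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso", key="timestamp"),
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger()
log.info("job_stored_successfully", job_id=123, url="https://example.com/job")
```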
Edit in `.env`:

```env
# Comma-separated keywords (OR logic)
FILTER_KEYWORDS=python,fastapi,asyncio,backend

# Location string matching
FILTER_LOCATION=Remote

# Minimum annual salary (USD)
FILTER_SALARY_MIN=80000
```

Cron expression in `.env`:

```env
# Run every 6 hours at minute 0
SCRAPE_CRON=0 */6 * * *

# Serper search query
SCRAPE_QUERY=senior python developer remote usa
```
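A sketch of how that cron expression can drive a task via TaskIQ's label-based scheduler; the broker choice and module layout are assumptions:

```python
# scheduler_sketch.py (illustrative wiring, not the repo's actual module)
from taskiq import TaskiqScheduler
from taskiq.schedule_sources import LabelScheduleSource
from taskiq_redis import ListQueueBroker

broker = ListQueueBroker(url="redis://localhost:6379/0")


@broker.task(schedule=[{"cron": "0 */6 * * *"}])  # every 6 hours at minute 0
async def scheduled_scrape() -> None:
    # In the real pipeline this would kick off the Serper search task.
    print("scrape tick")


scheduler = TaskiqScheduler(broker=broker, sources=[LabelScheduleSource(broker)])
```

The scheduler process is then started with TaskIQ's CLI (`taskiq scheduler scheduler_sketch:scheduler`), alongside the worker.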
Priority order:
- OpenAI GPT-4 (primary, fastest)
- Google Gemini (fallback on OpenAI failure)

Configure in `app/clients/llm/router.py`.
- Free Tier: 2500 credits/month
- Search: 2 credits per query
- View: 2, 6, or 10 credits per URL
- Recommended: Paid plan ($50/month for 50K credits) for production
- Model: GPT-4 Turbo
- Input: ~2000 tokens/call × $0.01/1K = $0.02/call
- Output: ~500 tokens/call × $0.03/1K = $0.015/call
- Estimated Cost: ~$0.035/call, i.e. roughly $8/day at 240 jobs/day (≈ $252/month at 7200 calls/month)
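Spelling out the arithmetic behind those estimates:

```python
input_cost = 2000 / 1000 * 0.01      # $0.020 per call (input tokens)
output_cost = 500 / 1000 * 0.03      # $0.015 per call (output tokens)
per_call = input_cost + output_cost  # $0.035 per call
print(f"daily:   ${per_call * 240:.2f}")   # ~$8.40 at 240 jobs/day
print(f"monthly: ${per_call * 7200:.2f}")  # ~$252 at 7200 calls/month
```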
- Rate Limit: 1 message/second per channel
- Mitigation: `asyncio.sleep(1)` between messages (see the sketch below)
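A minimal pacing sketch with the Slack SDK's async client; the channel handling and message list are placeholders:

```python
import asyncio
import os

from slack_sdk.web.async_client import AsyncWebClient


async def notify_all(messages: list[str]) -> None:
    client = AsyncWebClient(token=os.environ["SLACK_BOT_TOKEN"])
    for text in messages:
        await client.chat_postMessage(channel=os.environ["SLACK_CHANNEL_ID"], text=text)
        await asyncio.sleep(1)  # stay under 1 message/second per channel
```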
[INSERT LICENSE TYPE]
Contributions welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Run pre-commit checks (`pre-commit run --all-files`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Issue: Tasks not processing

```bash
# Check worker logs
docker compose logs worker

# Verify Redis connection
docker compose exec app python -c "from redis.asyncio import Redis; import asyncio; asyncio.run(Redis.from_url('redis://redis:6379/0').ping())"
```

Issue: Database connection errors

```bash
# Check PostgreSQL status
docker compose ps postgres

# Run migrations
docker compose exec app alembic upgrade head
```

Issue: Slack notifications not sending

```bash
# Test Slack token
docker compose exec app python test_notify.py
```

Issue: LLM parsing failures
- Check API keys in `.env`
- Review logs for rate limit errors
- Verify noise stripper output length (should be < 5000 chars)
For issues and questions:
- GitHub Issues: https://github.com/PyDevDeep/TalentStream/issues
- Repository: https://github.com/PyDevDeep/TalentStream
Built with ❤️ using Python, FastAPI, and asyncio
