A powerful, extensible content scraping system for collecting authentic content from public figures across multiple platforms.
Features • Quick Start • Installation • Usage • Documentation • Contributing
Scrape, validate, and analyze content from your favorite thought leaders across Twitter, YouTube, Blogs, Podcasts, and Books. Built with authenticity validation, AI-powered processing, and vector embeddings for semantic search.
Currently supports:
- Balaji Srinivasan (@balajis)
- Tim Ferriss (@tferriss)
Easily extensible to any public figure!
- Twitter/X: Full tweet history + automatic thread reconstruction
- YouTube: Video metadata + automatic transcript extraction
- Blogs: Full article text from personal blogs (tim.blog, balajis.com)
- Podcasts: RSS feed parsing + episode metadata
- Books: Online books & blog excerpts
- Domain Verification: Ensures content is from official sources
- Platform-Specific Checks: Twitter handles, YouTube channels, etc.
- Authenticity Scoring: 0-100 score for each piece of content
- Configurable Filters: Only save high-quality, authentic content
- Text Cleaning: Automatic normalization and cleaning
- Keyword Extraction: Identify main topics and themes
- Content Chunking: Smart chunking with configurable overlap (see the sketch after this list)
- OpenAI Embeddings: Generate vector embeddings for semantic search
- Structured Data Extraction: Extract goals, strategies, principles
- SQL Database: SQLAlchemy with SQLite/PostgreSQL support
- Vector Stores: Pinecone, ChromaDB, or Weaviate integration
- JSON Export: Export data in standard formats
- Incremental Updates: Only scrape new content
- Rate Limiting: Respects API limits with token bucket algorithm
- Robots.txt Compliance: Ethical web scraping
- Retry Logic: Exponential backoff for failed requests
- Comprehensive Logging: Debug and monitor with loguru
- Error Handling: Graceful degradation and error recovery
- Progress Tracking: Real-time progress bars with tqdm
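As an illustration of the chunking feature above, here is a minimal sketch of overlap-based chunking. It is character-based and purely illustrative; the real logic lives in `processing/text_processor.py` and may differ.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks (illustrative sketch, not the project API)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults from `config/settings.py` (`CHUNK_SIZE = 1000`, `CHUNK_OVERLAP = 200`), consecutive chunks share 200 characters of context, which helps embeddings preserve meaning across chunk boundaries.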
```bash
# Clone the repository
git clone https://github.com/REDFOX1899/content-scraper.git
cd content-scraper

# Run automated setup
./setup.sh
```

Or manually:
```bash
# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Copy environment template
cp .env.example .env
```

Edit `.env` and add your API keys:
```
TWITTER_BEARER_TOKEN=your_token_here
YOUTUBE_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here  # Optional, for embeddings
```

```bash
# Scrape Tim Ferriss blog posts
python main.py scrape --author tim_ferriss --platform blog --max-items 20
# Scrape Balaji's tweets
python main.py scrape --author balaji_srinivasan --platform twitter --max-items 50
# Scrape with embeddings for AI applications
python main.py scrape --author tim_ferriss --platform blog --embed --max-items 100

# Scrape specific platform
python main.py scrape --author tim_ferriss --platform blog --max-items 50
# Scrape multiple platforms
python main.py scrape --author balaji_srinivasan \
--platform twitter \
--platform youtube \
--max-items 100
# Scrape with date filter
python main.py scrape --author tim_ferriss \
--date-from 2023-01-01 \
--date-to 2024-01-01
# Only save authentic content
python main.py scrape --author balaji_srinivasan --authentic-only
# Process existing data
python main.py process --limit 100 --embed
# View statistics
python main.py stats
# Export to JSON
python main.py export --author tim_ferriss --output data.json
```

```python
from scrapers.blog_scraper import BlogScraper
from validators.authenticity_validator import AuthenticityValidator
from storage.database import ContentDatabase
# Initialize scraper
scraper = BlogScraper('tim_ferriss', author_config)
# Scrape content
content = scraper.scrape(max_pages=10)
# Validate authenticity
validator = AuthenticityValidator()
validated = validator.validate_batch(content)
# Store in database
db = ContentDatabase()
db.save_batch(validated)
```

See example_usage.py for more examples.
```
┌─────────────────┐
│   User Input    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  CLI Interface  │
└────────┬────────┘
         │
         ▼
┌─────────────────────────────┐
│        Orchestrator         │
│  ┌───────────────────────┐  │
│  │   Platform Scrapers   │  │
│  │   • Blog              │  │
│  │   • Twitter           │  │
│  │   • YouTube           │  │
│  │   • Podcast           │  │
│  │   • Book              │  │
│  └───────────────────────┘  │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────┐
│    Validator    │
│  (Score 0-100)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Processor    │
│   • Clean       │
│   • Extract     │
│   • Chunk       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Embeddings    │
│    (OpenAI)     │
└────────┬────────┘
         │
         ▼
┌─────────────────────┐
│       Storage       │
│  ┌───────────────┐  │
│  │ SQL Database  │  │
│  │ Vector Store  │  │
│  │ JSON Export   │  │
│  └───────────────┘  │
└─────────────────────┘
```
```
content-scraper/
├── config/                  # Configuration
│   ├── settings.py          # Main settings
│   └── authors.json         # Author profiles
├── scrapers/                # Platform scrapers
│   ├── base_scraper.py      # Base class
│   ├── blog_scraper.py
│   ├── twitter_scraper.py
│   ├── youtube_scraper.py
│   ├── podcast_scraper.py
│   └── book_scraper.py
├── validators/              # Content validation
│   └── authenticity_validator.py
├── storage/                 # Data storage
│   ├── database.py          # SQL database
│   └── vector_store.py      # Vector stores
├── processing/              # Content processing
│   ├── text_processor.py
│   └── content_extractor.py
├── utils/                   # Utilities
│   └── rate_limiter.py
├── main.py                  # CLI interface
├── example_usage.py         # Examples
└── README.md                # This file
```
Build a semantic search engine over your favorite thought leader's content:

```bash
# Scrape with embeddings
python main.py scrape --author tim_ferriss --embed
```

```python
# Use vector store for semantic search
from storage.vector_store import create_vector_store

store = create_vector_store("chroma")
# question_embedding is the embedding vector for your query text
results = store.query(question_embedding, top_k=5)
```

Analyze trends, topics, and insights:
```bash
# Export data
python main.py export --output data.json
```

```python
# Analyze with pandas
import pandas as pd

df = pd.read_json('data.json')
df['keywords'].value_counts()
```

Curate the best content automatically:
```bash
# Get only high-quality, authentic content
python main.py scrape --author balaji_srinivasan \
    --authentic-only \
    --date-from 2024-01-01
```

Train AI chatbots on authentic content:
- Scrape content with embeddings
- Store in vector database
- Build RAG (Retrieval-Augmented Generation) system
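A minimal sketch of the retrieval step for such a RAG pipeline, reusing the vector-store helper shown above. The OpenAI client usage, the result format, and the prompt assembly are illustrative assumptions, not part of this project:

```python
# Illustrative RAG retrieval sketch. create_vector_store() and store.query() come
# from this project; the OpenAI calls and result handling are assumptions.
from openai import OpenAI
from storage.vector_store import create_vector_store

client = OpenAI()  # requires OPENAI_API_KEY in the environment
store = create_vector_store("chroma")

question = "What does Tim Ferriss say about morning routines?"

# Embed the question with the same model used for the scraped content
question_embedding = client.embeddings.create(
    model="text-embedding-ada-002",
    input=question,
).data[0].embedding

# Retrieve the most similar chunks and assemble a grounded prompt
results = store.query(question_embedding, top_k=5)
context = "\n\n".join(r["content"] for r in results)  # assumes dict-like results
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The prompt can then be sent to any chat model; because the retrieved context comes from validated content, answers stay grounded in what the author actually said.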
Edit config/authors.json:

```json
{
  "new_author": {
    "name": "Author Name",
    "twitter": {"handle": "username"},
    "youtube_channels": [{
      "name": "Channel Name",
      "channel_id": "UCxxxxx"
    }],
    "blogs": [{
      "name": "Blog Name",
      "url": "https://blog.com"
    }],
    "official_domains": ["blog.com", "website.com"]
  }
}
```

Edit config/settings.py:
```python
# Rate limiting
RATE_LIMIT_CALLS = 10
RATE_LIMIT_PERIOD = 60  # seconds

# Content filtering
MIN_AUTHENTICITY_SCORE = 75
MIN_CONTENT_LENGTH = 100

# Text processing
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200

# Embeddings
EMBEDDING_MODEL = "text-embedding-ada-002"
```

```sql
CREATE TABLE content (
    id VARCHAR(64) PRIMARY KEY,
    author VARCHAR(100) NOT NULL,
    platform VARCHAR(50) NOT NULL,
    content_type VARCHAR(50),
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    url TEXT NOT NULL,
    date_published DATETIME,
    date_scraped DATETIME NOT NULL,
    authenticity_score INTEGER,
    processed BOOLEAN DEFAULT FALSE,
    embedded BOOLEAN DEFAULT FALSE,
    metadata JSON,
    word_count INTEGER
);
```

**Twitter API**

- Go to Twitter Developer Portal
- Create a new app
- Copy the Bearer Token
**YouTube API**

- Go to Google Cloud Console
- Create project → Enable YouTube Data API v3
- Create credentials → Copy API Key
**OpenAI API**

- Go to OpenAI Platform
- Create API key
- Used for embeddings and content analysis
- Twitter: ~300 requests per 15 minutes (managed automatically)
- YouTube: 10,000 quota units per day
- Blogs: Respectful 2-second delays between requests
- Robots.txt: Always respected
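The scraper manages these limits automatically via the token bucket algorithm in `utils/rate_limiter.py`. As a rough illustration of the token bucket idea (names and API here are illustrative, not the module's actual interface):

```python
import time

class TokenBucket:
    """Illustrative token bucket: allow `calls` requests per `period` seconds."""

    def __init__(self, calls: int = 10, period: float = 60.0):
        self.capacity = calls
        self.tokens = float(calls)
        self.refill_rate = calls / period      # tokens added per second
        self.last_refill = time.monotonic()

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.refill_rate)
```

The defaults mirror `RATE_LIMIT_CALLS = 10` and `RATE_LIMIT_PERIOD = 60` from `config/settings.py`: short bursts go through immediately, while sustained traffic is smoothed to the configured rate.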
Best Practices:

- Start with `--max-items 10` to test
- Use `--date-from` for incremental updates
- Use `--authentic-only` for quality data
- Monitor `logs/scraper.log`
- Export data regularly
We welcome contributions! Here's how you can help:
- Create a new scraper inheriting from `BaseScraper`
- Implement the `scrape()` method
- Add platform validation
- Submit a PR!
```python
from scrapers.base_scraper import BaseScraper

class NewPlatformScraper(BaseScraper):
    def scrape(self, **kwargs):
        # Your scraping logic: build and return a list of scraped content items
        return content_list
```

To add a new author:

- Add configuration to `config/authors.json`
- Add official domains for validation
- Test thoroughly
- Submit a PR!
See CONTRIBUTING.md for detailed guidelines.
- Quick Start Guide - Get started in 5 minutes
- Example Usage - Code examples
- API Documentation - Detailed API docs (coming soon)
"Twitter API key not found"
```bash
# Add to .env
TWITTER_BEARER_TOKEN=your_token_here
```

"Rate limit exceeded"

```bash
# Wait and retry, or reduce max-items
python main.py scrape --author tim_ferriss --max-items 10
```

"No module named 'tweepy'"

```bash
pip install -r requirements.txt
```

Database locked

```bash
# Only one process can write at a time
# Wait for the current operation to complete
```

See QUICKSTART.md for more troubleshooting tips.
This project is licensed under the MIT License - see the LICENSE file for details.
- ✅ Only scrapes publicly available content
- ✅ Respects robots.txt files
- ✅ Implements rate limiting
- ❌ Does NOT scrape private or paywalled content
- ✅ For personal use, research, and education
Important: Always respect the terms of service of the platforms you're scraping. This tool is designed for ethical, legal use only.
Built for learning from:
- Balaji Srinivasan (@balajis) - Entrepreneur, investor, thought leader
- Tim Ferriss (@tferriss) - Author, podcaster, entrepreneur
This tool helps fans and researchers analyze and learn from their public content.
If you find this project useful, please consider giving it a star! ⭐
- Add more authors (Paul Graham, Naval Ravikant, etc.)
- Web dashboard for browsing scraped content
- REST API endpoints
- Docker support
- Incremental update scheduler
- Content deduplication
- Advanced ML-based topic modeling
- Notion/Obsidian export
- Browser extension
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Pull Requests: Contributing Guide
Built with ❤️ for the learning community