
🚀 Multi-Source Content Scraper

Python | License: MIT | Code style: black | PRs Welcome

A powerful, extensible content scraping system for collecting authentic content from public figures across multiple platforms.

Features • Quick Start • Installation • Usage • Documentation • Contributing


🌟 Overview

Scrape, validate, and analyze content from your favorite thought leaders across Twitter, YouTube, Blogs, Podcasts, and Books. Built with authenticity validation, AI-powered processing, and vector embeddings for semantic search.

Currently supports:

  • 🎯 Balaji Srinivasan (@balajis)
  • πŸ“š Tim Ferriss (@tferriss)

Easily extensible to any public figure!

✨ Features

πŸ” Multi-Platform Scraping

  • Twitter/X: Full tweet history + automatic thread reconstruction
  • YouTube: Video metadata + automatic transcript extraction
  • Blogs: Full article text from personal blogs (tim.blog, balajis.com)
  • Podcasts: RSS feed parsing + episode metadata
  • Books: Online books & blog excerpts

✅ Authenticity Validation

  • Domain Verification: Ensures content is from official sources
  • Platform-Specific Checks: Twitter handles, YouTube channels, etc.
  • Authenticity Scoring: 0-100 score for each piece of content
  • Configurable Filters: Only save high-quality, authentic content

🧠 AI-Powered Processing

  • Text Cleaning: Automatic normalization and cleaning
  • Keyword Extraction: Identify main topics and themes
  • Content Chunking: Smart chunking with configurable overlap
  • OpenAI Embeddings: Generate vector embeddings for semantic search
  • Structured Data Extraction: Extract goals, strategies, principles

💾 Flexible Storage

  • SQL Database: SQLAlchemy with SQLite/PostgreSQL support
  • Vector Stores: Pinecone, ChromaDB, or Weaviate integration
  • JSON Export: Export data in standard formats
  • Incremental Updates: Only scrape new content

πŸ›‘οΈ Production-Ready

  • Rate Limiting: Respects API limits with token bucket algorithm
  • Robots.txt Compliance: Ethical web scraping
  • Retry Logic: Exponential backoff for failed requests
  • Comprehensive Logging: Debug and monitor with loguru
  • Error Handling: Graceful degradation and error recovery
  • Progress Tracking: Real-time progress bars with tqdm

🚀 Quick Start

1️⃣ Installation

# Clone the repository
git clone https://github.com/REDFOX1899/content-scraper.git
cd content-scraper

# Run automated setup
./setup.sh

Or manually:

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Copy environment template
cp .env.example .env

2️⃣ Configuration

Edit .env and add your API keys:

TWITTER_BEARER_TOKEN=your_token_here
YOUTUBE_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here  # Optional, for embeddings
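
These variables are picked up when the scraper starts. A sketch of how that typically works, assuming config/settings.py reads them via python-dotenv (an assumption, not confirmed from the source):

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # load variables from .env in the project root
TWITTER_BEARER_TOKEN = os.getenv("TWITTER_BEARER_TOKEN")
YOUTUBE_API_KEY = os.getenv("YOUTUBE_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")  # may be None; embeddings are optional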

3️⃣ Start Scraping!

# Scrape Tim Ferriss blog posts
python main.py scrape --author tim_ferriss --platform blog --max-items 20

# Scrape Balaji's tweets
python main.py scrape --author balaji_srinivasan --platform twitter --max-items 50

# Scrape with embeddings for AI applications
python main.py scrape --author tim_ferriss --platform blog --embed --max-items 100

📋 Usage

Basic Commands

# Scrape specific platform
python main.py scrape --author tim_ferriss --platform blog --max-items 50

# Scrape multiple platforms
python main.py scrape --author balaji_srinivasan \
  --platform twitter \
  --platform youtube \
  --max-items 100

# Scrape with date filter
python main.py scrape --author tim_ferriss \
  --date-from 2023-01-01 \
  --date-to 2024-01-01

# Only save authentic content
python main.py scrape --author balaji_srinivasan --authentic-only

# Process existing data
python main.py process --limit 100 --embed

# View statistics
python main.py stats

# Export to JSON
python main.py export --author tim_ferriss --output data.json

Python API

import json

from scrapers.blog_scraper import BlogScraper
from validators.authenticity_validator import AuthenticityValidator
from storage.database import ContentDatabase

# Load the author profile from config/authors.json (structure per the Configuration section)
with open('config/authors.json') as f:
    author_config = json.load(f)['tim_ferriss']

# Initialize scraper
scraper = BlogScraper('tim_ferriss', author_config)

# Scrape content
content = scraper.scrape(max_pages=10)

# Validate authenticity
validator = AuthenticityValidator()
validated = validator.validate_batch(content)

# Store in database
db = ContentDatabase()
db.save_batch(validated)

See example_usage.py for more examples.

πŸ—οΈ Architecture

┌─────────────────┐
│   User Input    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  CLI Interface  │
└────────┬────────┘
         │
         ▼
┌─────────────────────────────────┐
│          Orchestrator           │
│  ┌───────────────────────────┐  │
│  │   Platform Scrapers       │  │
│  │   • Blog                  │  │
│  │   • Twitter               │  │
│  │   • YouTube               │  │
│  │   • Podcast               │  │
│  │   • Book                  │  │
│  └───────────────────────────┘  │
└────────────────┬────────────────┘
                 │
                 ▼
         ┌───────────────┐
         │   Validator   │
         │ (Score 0-100) │
         └───────┬───────┘
                 │
                 ▼
         ┌───────────────┐
         │   Processor   │
         │   • Clean     │
         │   • Extract   │
         │   • Chunk     │
         └───────┬───────┘
                 │
                 ▼
         ┌───────────────┐
         │  Embeddings   │
         │   (OpenAI)    │
         └───────┬───────┘
                 │
                 ▼
        ┌───────────────────┐
        │      Storage      │
        │  ┌──────────────┐ │
        │  │ SQL Database │ │
        │  │ Vector Store │ │
        │  │ JSON Export  │ │
        │  └──────────────┘ │
        └───────────────────┘

πŸ“ Project Structure

content-scraper/
├── config/                     # Configuration
│   ├── settings.py            # Main settings
│   └── authors.json           # Author profiles
├── scrapers/                   # Platform scrapers
│   ├── base_scraper.py        # Base class
│   ├── blog_scraper.py
│   ├── twitter_scraper.py
│   ├── youtube_scraper.py
│   ├── podcast_scraper.py
│   └── book_scraper.py
├── validators/                 # Content validation
│   └── authenticity_validator.py
├── storage/                    # Data storage
│   ├── database.py            # SQL database
│   └── vector_store.py        # Vector stores
├── processing/                 # Content processing
│   ├── text_processor.py
│   └── content_extractor.py
├── utils/                      # Utilities
│   └── rate_limiter.py
├── main.py                     # CLI interface
├── example_usage.py            # Examples
└── README.md                   # This file

🎯 Use Cases

1. AI-Powered Knowledge Base

Build a semantic search engine over your favorite thought leader's content:

# Scrape with embeddings
python main.py scrape --author tim_ferriss --embed

# Use vector store for semantic search
from storage.vector_store import create_vector_store
store = create_vector_store("chroma")
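# question_embedding is assumed here: a query vector, e.g. from OpenAI's embeddings API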
results = store.query(question_embedding, top_k=5)

2. Research & Analysis

Analyze trends, topics, and insights:

# Export data
python main.py export --output data.json

# Analyze with pandas
import pandas as pd
df = pd.read_json('data.json')
df['keywords'].value_counts()
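# if the 'keywords' cells hold lists, flatten first: df['keywords'].explode().value_counts()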

3. Content Curation

Curate the best content automatically:

# Get only high-quality, authentic content
python main.py scrape --author balaji_srinivasan \
  --authentic-only \
  --date-from 2024-01-01

4. Chatbot Training

Train AI chatbots on authentic content:

  • Scrape content with embeddings
  • Store in vector database
  • Build RAG (Retrieval-Augmented Generation) system

🔧 Configuration

Adding New Authors

Edit config/authors.json:

{
  "new_author": {
    "name": "Author Name",
    "twitter": {"handle": "username"},
    "youtube_channels": [{
      "name": "Channel Name",
      "channel_id": "UCxxxxx"
    }],
    "blogs": [{
      "name": "Blog Name",
      "url": "https://blog.com"
    }],
    "official_domains": ["blog.com", "website.com"]
  }
}

Customizing Settings

Edit config/settings.py:

# Rate limiting
RATE_LIMIT_CALLS = 10
RATE_LIMIT_PERIOD = 60  # seconds

# Content filtering
MIN_AUTHENTICITY_SCORE = 75
MIN_CONTENT_LENGTH = 100

# Text processing
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200

# Embeddings
EMBEDDING_MODEL = "text-embedding-ada-002"

📊 Database Schema

CREATE TABLE content (
    id VARCHAR(64) PRIMARY KEY,
    author VARCHAR(100) NOT NULL,
    platform VARCHAR(50) NOT NULL,
    content_type VARCHAR(50),
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    url TEXT NOT NULL,
    date_published DATETIME,
    date_scraped DATETIME NOT NULL,
    authenticity_score INTEGER,
    processed BOOLEAN DEFAULT FALSE,
    embedded BOOLEAN DEFAULT FALSE,
    metadata JSON,
    word_count INTEGER
);
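
For quick inspection you can query the table directly. A sketch assuming the default SQLite backend (the database filename here is a guess; check config/settings.py for the real path):

import sqlite3

conn = sqlite3.connect("content.db")  # filename is an assumption
rows = conn.execute(
    """
    SELECT author, platform, title, authenticity_score
    FROM content
    WHERE authenticity_score >= 75
    ORDER BY date_published DESC
    LIMIT 10
    """
).fetchall()
for author, platform, title, score in rows:
    print(f"[{score:3d}] {author}/{platform}: {title}")
conn.close()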

🔑 API Keys

Twitter API

  1. Go to Twitter Developer Portal
  2. Create a new app
  3. Copy the Bearer Token

YouTube Data API

  1. Go to Google Cloud Console
  2. Create project → Enable YouTube Data API v3
  3. Create credentials → Copy API Key

OpenAI API (Optional)

  1. Go to OpenAI Platform
  2. Create API key
  3. Used for embeddings and content analysis

🚦 Rate Limits & Best Practices

  • Twitter: ~300 requests per 15 minutes (managed automatically)
  • YouTube: 10,000 quota units per day
  • Blogs: Respectful 2-second delays between requests
  • Robots.txt: Always respected

Best Practices:

  • Start with --max-items 10 to test
  • Use --date-from for incremental updates
  • Use --authentic-only for quality data
  • Monitor logs/scraper.log
  • Export data regularly

🤝 Contributing

We welcome contributions! Here's how you can help:

Adding New Platforms

  1. Create a new scraper inheriting from BaseScraper
  2. Implement the scrape() method
  3. Add platform validation
  4. Submit a PR!

A minimal skeleton:

from scrapers.base_scraper import BaseScraper

class NewPlatformScraper(BaseScraper):
    def scrape(self, **kwargs):
        # Your scraping logic
        return content_list

Adding New Authors

  1. Add configuration to config/authors.json
  2. Add official domains for validation
  3. Test thoroughly
  4. Submit a PR!

See CONTRIBUTING.md for detailed guidelines.

📖 Documentation

  • QUICKSTART.md - setup help and more troubleshooting tips
  • CONTRIBUTING.md - detailed contribution guidelines
  • example_usage.py - Python API examples

πŸ› Troubleshooting

Common Issues

"Twitter API key not found"

# Add to .env
TWITTER_BEARER_TOKEN=your_token_here

"Rate limit exceeded"

# Wait and retry, or reduce max-items
python main.py scrape --author tim_ferriss --max-items 10

"No module named 'tweepy'"

pip install -r requirements.txt

Database locked

# Only one process can write at a time
# Wait for current operation to complete

See QUICKSTART.md for more troubleshooting tips.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

βš–οΈ Legal & Ethics

  • βœ… Only scrapes publicly available content
  • βœ… Respects robots.txt files
  • βœ… Implements rate limiting
  • βœ… Does NOT scrape private or paywalled content
  • βœ… For personal use, research, and education

Important: Always respect the terms of service of the platforms you're scraping. This tool is designed for ethical, legal use only.

πŸ™ Acknowledgments

Built for learning from:

  • Balaji Srinivasan (@balajis) - Entrepreneur, investor, thought leader
  • Tim Ferriss (@tferriss) - Author, podcaster, entrepreneur

This tool helps fans and researchers analyze and learn from their public content.

⭐ Star History

If you find this project useful, please consider giving it a star! ⭐

πŸ—ΊοΈ Roadmap

  • Add more authors (Paul Graham, Naval Ravikant, etc.)
  • Web dashboard for browsing scraped content
  • REST API endpoints
  • Docker support
  • Incremental update scheduler
  • Content deduplication
  • Advanced ML-based topic modeling
  • Notion/Obsidian export
  • Browser extension


Built with ❤️ for the learning community
