X-Scraper


Twitter/X scraper built with Playwright for browser automation and OpenAI for AI-powered tweet analysis.

Features

Core Scraping Capabilities

  • Timeline Scraping: Extract tweets from any user's timeline with full metadata
  • Historical Search: Scrape tweets from specific date ranges
  • Keyword Search: Search for tweets by keywords, hashtags, or phrases
  • Mixed Strategy: Combine timeline + historical search for comprehensive coverage

Advanced Features

  • AI Analysis: Automatic sentiment analysis, topic extraction, and summaries using ChatGPT
  • Checkpoint/Resume: Resume interrupted scrapes from where they stopped
  • Progress Tracking: Real-time progress bars and detailed logging
  • Proxy Support: Built-in support for residential/mobile proxies
  • Session Persistence: Cookie-based authentication for reliable long-term scraping
  • Rate Limit Handling: Intelligent retry mechanisms with exponential backoff (see the sketch after this list)
  • Deduplication: Automatic removal of duplicate tweets
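
For illustration, here is a minimal exponential-backoff wrapper in the spirit of the retry handling above. This is a sketch only, not the scraper's internal implementation; the fetch_next_page callable in the usage note is hypothetical.

import random
import time

def with_backoff(action, max_retries=5, base_delay=2.0):
    """Retry a flaky operation, doubling the wait after each failure."""
    for attempt in range(max_retries):
        try:
            return action()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Exponential backoff with jitter so retries don't synchronize
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))

# Usage (hypothetical callable):
# tweets = with_backoff(lambda: fetch_next_page())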

Data Quality

  • Rich Metadata: Captures tweets, user info, engagement metrics, media, hashtags, URLs
  • Structured Output: Clean JSON format with proper typing and validation

Prerequisites

  • Python 3.11+
  • Chrome/Chromium browser (installed automatically by Playwright)
  • Twitter/X account credentials
  • Proxy service for IP rotation
  • (Optional) OpenAI API key for AI analysis

Installation

Standard Installation

git clone https://github.com/proxidize/x-scraper
cd x-scraper

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install uv
uv sync

# Install Playwright browsers
playwright install chromium

# Copy config template
cp config.ini.template config.ini

Docker Installation

docker build -t x-scraper .

Configuration

Edit config.ini with your settings:

[TWITTER]
username = your_username
email = your_email@example.com
password = your_password

[PROXY]
use_proxy = true
proxy_url = http://your-proxy:port

[AI]
enable_analysis = true
openai_api_key = sk-your-api-key-here
model = gpt-4o-mini
batch_size = 10

[SCRAPING]
output_directory = ./data
max_tweets_per_session = 1000
scroll_delay_min = 2.0
scroll_delay_max = 5.0
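
These settings map directly onto Python's standard configparser module. Below is a minimal loading sketch; load_config is an illustrative helper, not the project's actual loader.

import configparser
from pathlib import Path

def load_config(path="config.ini"):
    # Illustrative helper: parse the INI file shown above.
    if not Path(path).exists():
        raise FileNotFoundError(f"{path} not found; copy config.ini.template first")
    config = configparser.ConfigParser()
    config.read(path)
    return config

config = load_config()
use_proxy = config.getboolean("PROXY", "use_proxy", fallback=False)
max_tweets = config.getint("SCRAPING", "max_tweets_per_session", fallback=1000)
scroll_min = config.getfloat("SCRAPING", "scroll_delay_min", fallback=2.0)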

Required Fields

  • [TWITTER]: username, email, password
  • [SCRAPING]: output_directory, max_tweets_per_session

Optional Fields

  • [AI]: All fields (for AI analysis)

Usage

Note: Throughout these examples, we use FabrizioRomano (a popular football transfer news reporter) as our test account. This account was used during development and testing due to its high tweet volume and consistent posting patterns.

Interactive Mode

python main.py interactive

Launches an interactive menu with guided options for all scraping modes.

Command Line Interface

1. Scrape User Timeline

python main.py user <username> [OPTIONS]

# Example: Scrape 500 tweets from FabrizioRomano
python main.py user FabrizioRomano --max-tweets 500

# With AI analysis
python main.py user FabrizioRomano --max-tweets 500 --analysis sentiment,topics,summary

Options:

  • --max-tweets: Maximum number of tweets to scrape (default: from config)
  • --analysis: AI analysis types (sentiment, topics, summary, all)

Performance Note:

  • A typical timeline scraping session takes approximately 30-40 minutes
  • Produces roughly 800-1000 tweets
  • For comprehensive coverage beyond this limit, use historical search by date (see below)

2. Keyword Search

python main.py search <query> [OPTIONS]

# Example: Search for "Here we go" tweets
python main.py search "Here we go" --max-tweets 200

# Search with filters
python main.py search "#TransferNews" --max-tweets 500 --analysis all

Options:

  • --max-tweets: Maximum number of tweets to scrape
  • --analysis: AI analysis types

3. Historical Search (Date Range)

python main.py search-historical <username> --since YYYY-MM-DD --until YYYY-MM-DD [OPTIONS]

# Example: Scrape FabrizioRomano's tweets from December 2021
python main.py search-historical FabrizioRomano \
  --since 2021-12-01 \
  --until 2021-12-31 \
  --max-tweets 500

# With AI analysis
python main.py search-historical FabrizioRomano \
  --since 2023-07-01 \
  --until 2023-07-31 \
  --analysis sentiment,topics

Options:

  • --since: Start date (YYYY-MM-DD)
  • --until: End date (YYYY-MM-DD)
  • --max-tweets: Maximum tweets per date chunk
  • --analysis: AI analysis types

Note: Historical search automatically chunks large date ranges into weekly intervals for better coverage.
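
A minimal sketch of that weekly chunking (illustrative; the scraper's internal implementation may differ):

from datetime import date, timedelta

def weekly_chunks(since, until):
    # Split [since, until] into consecutive 7-day windows.
    start = since
    while start <= until:
        end = min(start + timedelta(days=6), until)
        yield start, end
        start = end + timedelta(days=1)

# Example: December 2021 becomes five windows.
for s, e in weekly_chunks(date(2021, 12, 1), date(2021, 12, 31)):
    print(s.isoformat(), "->", e.isoformat())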

Recommended Strategy: Mixed Approach

For comprehensive tweet coverage, combine both methods and merge the resulting files (a deduplication sketch follows the steps):

  1. Start with Timeline Scraping: Capture the most recent 800-1000 tweets

    python main.py user FabrizioRomano --max-tweets 1000
  2. Use Historical Search for Older Tweets: Go beyond timeline limitations

    python main.py search-historical FabrizioRomano \
      --since 2023-01-01 \
      --until 2023-12-31 \
      --max-tweets 500
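
The two runs above write separate JSON files that may overlap. A minimal merge-and-dedup sketch, assuming the output format shown in the next section (the file names here are illustrative; use the paths the scraper wrote to your output directory):

import json

def merge_outputs(paths):
    # Combine timeline and historical output files, dropping duplicate IDs.
    seen, merged = set(), []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        for tweet in data["tweets"]:
            if tweet["id"] not in seen:
                seen.add(tweet["id"])
                merged.append(tweet)
    return merged

tweets = merge_outputs(["data/timeline.json", "data/historical_2023.json"])
print(f"{len(tweets)} unique tweets after merging")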

Output Structure

Tweet Data

{
  "username": "FabrizioRomano",
  "tweet_count": 303,
  "unique_tweet_count": 303,
  "tweets": [
    {
      "id": "1468352445113942019",
      "text": "Tweet content here...",
      "created_at": "Tue Dec 07 22:51:18 +0000 2021",
      "user": {
        "id": "330262748",
        "followers_count": 26490513,
        "verified": true
      },
      "metrics": {
        "retweet_count": 3,
        "favorite_count": 401,
        "reply_count": 7,
        "quote_count": 1
      },
      "hashtags": ["TransferNews"],
      "media": [],
      "is_retweet": false,
      "is_reply": true
    }
  ],
  "date_range": {
    "since": "2021-12-01",
    "until": "2021-12-31"
  }
}

AI Analysis Output

{
  "total_tweets": 303,
  "analysis": {
    "sentiment": {
      "positive": 156,
      "neutral": 120,
      "negative": 27
    },
    "topics": [
      {"topic": "Transfer News", "count": 89},
      {"topic": "Contract Extensions", "count": 45}
    ],
    "summary": "Analysis of 303 tweets shows..."
  }
}
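
For example, the sentiment counts can be turned into percentages with a few lines of Python (the file path is illustrative; the field names match the JSON above):

import json

with open("data/analysis.json", encoding="utf-8") as f:  # illustrative path
    report = json.load(f)

total = report["total_tweets"]
for label, count in report["analysis"]["sentiment"].items():
    print(f"{label}: {count} ({count / total:.1%})")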

Contributing

We welcome contributions! Here's how:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Commit your changes
  5. Push to the branch
  6. Submit a pull request

License

This project is for educational and research purposes only.

Important Legal Notes

  • Respect Twitter/X's Terms of Service
  • Do not use for commercial scraping without proper authorization
  • Be mindful of rate limits and API usage

Disclaimer: The authors are not responsible for misuse of this tool. Use responsibly and ethically.

Blog Post

For a detailed walkthrough of how this Twitter/X scraper was built, including challenges faced and solutions implemented, read our comprehensive blog post:

Twitter/X Scraper: How to Scrape Twitter for Free

The blog post covers:

  • Why Python and Playwright were chosen
  • How Twitter/X's infinite scroll was handled
  • Timeline vs. historical search strategies
  • Proxy rotation and error handling
  • AI integration with OpenAI

Support

For issues, questions, or feature requests, please open an issue on GitHub or contact support@proxidize.com.


Note: This tool is designed for ethical data collection and research purposes. Always comply with Twitter/X's Terms of Service and respect rate limits.
