Twitter/X scraper built with Playwright for browser automation and OpenAI for AI-powered tweet analysis.
- Timeline Scraping: Extract tweets from any user's timeline with full metadata
- Historical Search: Scrape tweets from specific date ranges
- Keyword Search: Search for tweets by keywords, hashtags, or phrases
- Mixed Strategy: Combine timeline + historical search for comprehensive coverage
- AI Analysis: Automatic sentiment analysis, topic extraction, and summaries using ChatGPT
- Checkpoint/Resume: Resume interrupted scrapes from where they stopped
- Progress Tracking: Real-time progress bars and detailed logging
- Proxy Support: Built-in support for residential/mobile proxies
- Session Persistence: Cookie-based authentication for reliable long-term scraping
- Rate Limit Handling: Intelligent retry mechanisms with exponential backoff
- Deduplication: Automatic removal of duplicate tweets
- Rich Metadata: Captures tweets, user info, engagement metrics, media, hashtags, URLs
- Structured Output: Clean JSON format with proper typing and validation
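The rate-limit handling mentioned above can be sketched as a small retry helper. This is an illustrative sketch, not the scraper's actual internals; `with_backoff` and its parameters are hypothetical names:

```python
import random
import time

def with_backoff(fn, retries=5, base=2.0, cap=60.0):
    """Call fn(), retrying on failure with exponential backoff and jitter."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt (capped), with +/-50% random jitter
            # so concurrent workers don't retry in lockstep.
            delay = min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

The jitter matters: without it, every retry after a rate-limit response lands at the same instant and triggers the limit again.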
- Python 3.11+
- Chrome/Chromium browser (installed automatically by Playwright)
- Twitter/X account credentials
- Proxy service for IP rotation
- (Optional) OpenAI API key for AI analysis
git clone https://github.com/proxidize/x-scraper
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install uv
uv sync
# Install Playwright browsers
playwright install chromium
# Copy config template
cp config.ini.template config.ini
# Or build the Docker image
docker build -t x-scraper .
Edit config.ini with your settings:
[TWITTER]
username = your_username
email = your_email@example.com
password = your_password
[PROXY]
use_proxy = true
proxy_url = http://your-proxy:port
[AI]
enable_analysis = true
openai_api_key = sk-your-api-key-here
model = gpt-4o-mini
batch_size = 10
[SCRAPING]
output_directory = ./data
max_tweets_per_session = 1000
scroll_delay_min = 2.0
scroll_delay_max = 5.0
Required fields:
[TWITTER]: username, email, password
[SCRAPING]: output_directory, max_tweets_per_session
[AI]: All fields (for AI analysis)
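To show how these sections could be consumed, here is a sketch that reads the [PROXY] settings with `configparser` and feeds them to Playwright's `chromium.launch`. The helper names `proxy_settings` and `launch_browser` are hypothetical, not the scraper's actual API:

```python
import configparser

def proxy_settings(cfg: configparser.ConfigParser):
    """Map the [PROXY] section onto Playwright's `proxy` launch option."""
    if cfg.getboolean("PROXY", "use_proxy", fallback=False):
        return {"server": cfg.get("PROXY", "proxy_url")}
    return None

def launch_browser(config_path="config.ini"):
    # Imported here so the sketch can be read without Playwright installed.
    from playwright.sync_api import sync_playwright
    cfg = configparser.ConfigParser()
    cfg.read(config_path)
    pw = sync_playwright().start()
    return pw.chromium.launch(headless=True, proxy=proxy_settings(cfg))
```

Setting `use_proxy = false` simply makes `proxy_settings` return None, which Playwright treats as a direct connection.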
Note: Throughout these examples, we use FabrizioRomano (a popular football transfer news reporter) as our test account. This account was chosen during development and testing for its high tweet volume and consistent posting patterns.
python main.py interactive
Launches an interactive menu with guided options for all scraping modes.
python main.py user <username> [OPTIONS]
# Example: Scrape 500 tweets from FabrizioRomano
python main.py user FabrizioRomano --max-tweets 500
# With AI analysis
python main.py user FabrizioRomano --max-tweets 500 --analysis sentiment,topics,summary
Options:
- --max-tweets: Maximum number of tweets to scrape (default: from config)
- --analysis: AI analysis types (sentiment, topics, summary, all)
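When --analysis is enabled, tweets are sent to OpenAI in batches (the batch_size key in config.ini defaults to 10). A minimal sketch of that batching, assuming the official openai package; `classify_sentiment` is a hypothetical helper and the prompt wording is illustrative:

```python
def batched(items, size=10):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def classify_sentiment(tweet_texts, model="gpt-4o-mini"):
    """Hypothetical helper: label a batch of tweets in one chat completion."""
    from openai import OpenAI  # reads OPENAI_API_KEY from the environment
    client = OpenAI()
    prompt = ("Label each tweet as positive, neutral, or negative:\n"
              + "\n".join(f"{i + 1}. {t}" for i, t in enumerate(tweet_texts)))
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```

One API call per batch keeps token usage and rate-limit pressure predictable compared with one call per tweet.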
Performance Note:
- A typical timeline scraping session takes approximately 30-40 minutes
- Produces roughly 800-1000 tweets
- For comprehensive coverage beyond this limit, use historical search by date (see below)
python main.py search <query> [OPTIONS]
# Example: Search for "Here we go" tweets
python main.py search "Here we go" --max-tweets 200
# Search with filters
python main.py search "#TransferNews" --max-tweets 500 --analysis all
Options:
- --max-tweets: Maximum number of tweets to scrape
- --analysis: AI analysis types
python main.py search-historical <username> --since YYYY-MM-DD --until YYYY-MM-DD [OPTIONS]
# Example: Scrape FabrizioRomano's tweets from December 2021
python main.py search-historical FabrizioRomano \
--since 2021-12-01 \
--until 2021-12-31 \
--max-tweets 500
# With AI analysis
python main.py search-historical FabrizioRomano \
--since 2023-07-01 \
--until 2023-07-31 \
--analysis sentiment,topics
Options:
- --since: Start date (YYYY-MM-DD)
- --until: End date (YYYY-MM-DD)
- --max-tweets: Maximum tweets per date chunk
- --analysis: AI analysis types
Note: Historical search automatically chunks large date ranges into weekly intervals for better coverage.
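That weekly chunking can be sketched like this (`week_chunks` is an illustrative name, not the scraper's internal function):

```python
from datetime import date, timedelta

def week_chunks(since: date, until: date):
    """Split [since, until] into consecutive chunks of at most 7 days."""
    chunks = []
    start = since
    while start <= until:
        end = min(start + timedelta(days=6), until)
        chunks.append((start, end))
        start = end + timedelta(days=1)
    return chunks
```

For example, December 2021 splits into five chunks, the last covering only Dec 29-31.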
For comprehensive tweet coverage, combine both methods:
- Start with Timeline Scraping: Get recent tweets (the last 800-1000 tweets)
python main.py user FabrizioRomano --max-tweets 1000
- Use Historical Search for Older Tweets: Go beyond timeline limitations
python main.py search-historical FabrizioRomano \
--since 2023-01-01 \
--until 2023-12-31 \
--max-tweets 500
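Combining both runs relies on the deduplication feature: conceptually it is a merge keyed on the tweet id, keeping the first copy seen. A sketch (`merge_runs` is an illustrative name):

```python
def merge_runs(*runs):
    """Merge tweet lists from several runs, keeping the first copy of each id."""
    seen, merged = set(), []
    for run in runs:
        for tweet in run:
            if tweet["id"] not in seen:
                seen.add(tweet["id"])
                merged.append(tweet)
    return merged
```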
{
"username": "FabrizioRomano",
"tweet_count": 303,
"unique_tweet_count": 303,
"tweets": [
{
"id": "1468352445113942019",
"text": "Tweet content here...",
"created_at": "Tue Dec 07 22:51:18 +0000 2021",
"user": {
"id": "330262748",
"followers_count": 26490513,
"verified": true
},
"metrics": {
"retweet_count": 3,
"favorite_count": 401,
"reply_count": 7,
"quote_count": 1
},
"hashtags": ["TransferNews"],
"media": [],
"is_retweet": false,
"is_reply": true
}
],
"date_range": {
"since": "2021-12-01",
"until": "2021-12-31"
}
}
AI analysis output:
{
"total_tweets": 303,
"analysis": {
"sentiment": {
"positive": 156,
"neutral": 120,
"negative": 27
},
"topics": [
{"topic": "Transfer News", "count": 89},
{"topic": "Contract Extensions", "count": 45}
],
"summary": "Analysis of 303 tweets shows..."
}
}
We welcome contributions! Here's how:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Commit your changes
- Push to the branch
- Submit a pull request
This project is for educational and research purposes only.
- Respect Twitter/X's Terms of Service
- Do not use for commercial scraping without proper authorization
- Be mindful of rate limits and API usage
Disclaimer: The authors are not responsible for misuse of this tool. Use responsibly and ethically.
For a detailed walkthrough of how this Twitter/X scraper was built, including challenges faced and solutions implemented, read our comprehensive blog post:
Twitter/X Scraper: How to Scrape Twitter for Free
The blog post covers:
- Why Python and Playwright were chosen
- How Twitter/X's infinite scroll was handled
- Timeline vs. historical search strategies
- Proxy rotation and error handling
- AI integration with OpenAI
For issues, questions, or feature requests, please open an issue on GitHub or contact support@proxidize.com.
Note: This tool is designed for ethical data collection and research purposes. Always comply with Twitter/X's Terms of Service and respect rate limits.