A research-grade scraper for X / Twitter timelines. Account rotation, jittered pacing, cursor-resume, single-file CSV output.
Quickstart · How it works · Cookies guide · Architecture · Ethics · Example: U.S. Congress
When X closed the public Twitter API in 2023, a generation of academic, journalistic, and civic-tech work that depended on free-tier timeline access stopped working overnight. Studies of political polarization, election integrity, public-health communication, crisis response, and disinformation all share the same blocker: they need the ability to walk a public profile's timeline at low cost. There is currently no free or affordable replacement on offer.
x-scraper is one answer. It is a small, single-file Python tool that authenticates as a normal logged-in user, paginates through public timelines, and writes the results to a single CSV with consistent columns. It rotates across as many accounts as you want, sleeps between requests, and checkpoints aggressively so a multi-day run survives interruptions.
It is built for research and journalism, not for growth hacking. Read the ethics doc before you point it at anything sensitive.
- Account rotation pool. Plug in N authenticated accounts. The scraper picks the most-rested one each cycle and enforces a per-account cooldown.
- Jittered pacing. Per-cycle quotas, page delays, and account-swap delays are all randomized so traffic does not look like a perfect machine.
- Cursor-aware resume. State is checkpointed after every page write. Ctrl+C is safe. Re-running picks up where you stopped, including mid-handle.
- Long-form text. Resolves `note_tweet` (up to 25k chars), then `full_text`, then `text`, so long tweets are not truncated.
- Single CSV output. 25 columns covering tweet identity, engagement, and a profile snapshot. Easy to load into pandas, DuckDB, or anything that reads CSV.
- Bring your own accounts. Bundle import (`scripts/import_accounts.py`) for the colon-separated cookie format, or hand-roll a JSON file from browser DevTools. See the cookies guide.
- No surprise dependencies. Pure Python, one library (`twikit`), zero database, zero queue, zero infrastructure.
- Worked example included. `examples/congress/` ships the official 118th and 119th U.S. Congress member metadata so you can reproduce a real research dataset end to end.
This is a research tool, not a product. It works, it has been used to collect about 1.5 M public tweets across roughly 500 handles, and it is intentionally minimal. The scraper makes no attempt to defeat captchas, fingerprinting, or behavioral analysis. If X tightens its detection further, the right move is to retire the affected accounts and refresh the cookie pool, not to add evasion plumbing here.
```bash
# 1. Clone and install
git clone https://github.com/FelixSBuehrm/x-scraper.git
cd x-scraper
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# 2. Add at least one X account to the pool
#    (see docs/cookies.md for the extraction guide)
cp config/accounts/accounts.example.json config/accounts/accounts.json
# edit config/accounts/accounts.json with your handles + auth tokens
# drop one cookie file per account into config/cookies/account_N.json

# 3. Smoke test the cookie pool
python3 scripts/test_auth.py

# 4. Add target handles, one per line, to config/usernames.txt
#    (the repo ships with three placeholder handles you can replace)

# 5. Run the bulk scrape
python3 scripts/auto_rotating_scraper.py
```

The first run produces:

```
output/
├── csv/
│   └── all_users_tweets.csv   # one row per tweet, 25 columns
└── bulk_progress.json         # checkpoint file (resume state)
```
Stop with Ctrl+C at any time. Re-run to resume.
```
config/usernames.txt       -> for each handle:
  config/accounts/*.json   -> for each cycle (until cap or end of timeline):
    pick the most-rested account
    sleep for the random pre-cycle delay
    fetch a page of tweets via twikit
    paginate until the cycle target is met
    append rows to all_users_tweets.csv
    checkpoint progress to bulk_progress.json
    rotate to the next account
```
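The checkpoint step in the loop above can be sketched as an atomic JSON write. This is an illustration, not the shipped implementation; the shape of the state dict (per-handle cursor and row count) is an assumption:

```python
import json
import os
import tempfile

def checkpoint(state: dict, path: str = "output/bulk_progress.json") -> None:
    """Persist resume state atomically.

    Write to a temp file in the same directory, then rename over the target,
    so an interrupt mid-write can never leave a half-written checkpoint.
    """
    os.makedirs(os.path.dirname(path), exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, path)  # atomic on POSIX and Windows

# Hypothetical state shape: handle -> pagination cursor + rows written so far.
state = {"some_handle": {"cursor": "abc123", "tweets_written": 600}}
checkpoint(state)
```

The rename-over trick is why Ctrl+C during a checkpoint is safe: the old file stays intact until the new one is fully on disk.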
A few design choices worth flagging:
- Most-rested-first selection beats round-robin because handles vary in length. Some accounts will burn their cycle quota in 30 seconds, others will hit the end of the timeline early. Picking by idle time keeps the pool balanced without any scheduling logic.
- Cursors are persisted between cycles. If account A makes it 600 tweets deep into a handle and then hits a 429, account B picks up at exactly tweet 601 with the same cursor. No duplicates.
- The CSV is append-only. A row that has been written will not be rewritten. Combined with the cursor, this means you can interrupt and resume safely.
- The whole loop is one file (`scripts/auto_rotating_scraper.py`, ~400 lines). If something is wrong, you can read the entire pipeline top to bottom in ten minutes.
For the full breakdown including failure modes, see docs/architecture.md.
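The most-rested-first pick fits in a few lines. A minimal sketch, assuming each account record carries a `last_used` epoch timestamp (the field name and record shape are illustrative, not the actual data model):

```python
import time

def pick_most_rested(accounts, cooldown_s=15 * 60):
    """Return the account idle the longest, or None if all are cooling down."""
    now = time.time()
    rested = [a for a in accounts if now - a["last_used"] >= cooldown_s]
    if not rested:
        return None  # caller sleeps until the earliest cooldown expires
    # Smallest last_used = longest idle = most rested.
    return min(rested, key=lambda a: a["last_used"])

pool = [
    {"handle": "acct_a", "last_used": time.time() - 3600},  # idle 1 hour
    {"handle": "acct_b", "last_used": time.time() - 1200},  # idle 20 min
]
print(pick_most_rested(pool)["handle"])  # acct_a
```

No persistent scheduler state is needed: the timestamps alone determine the order.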
All tunables live at the top of `scripts/auto_rotating_scraper.py`:
| Setting | Default | What it does |
|---|---|---|
| `TWEETS_PER_USER` | `1000` | Maximum tweets per handle. Scraping also stops at the end of the timeline. |
| `TWEETS_PER_ACCOUNT_MIN` / `MAX` | `480` / `520` | Per-cycle quota is randomized inside this range. |
| `TWEETS_PER_PAGE` | `20` | Tweets per pagination call. Matches the X UI default. |
| `ACCOUNT_COOLDOWN_MINUTES` | `15` | Minimum time before reusing the same account. |
| `PAGE_DELAY_RANGE` | `(0.3, 1.0)` | Sleep between pagination requests, in seconds. |
| `CYCLE_DELAY_RANGE` | `(1, 3)` | Sleep before starting each new cycle. |
| `ACCOUNT_SWAP_DELAY` | `(2, 5)` | Sleep when switching accounts. |
The defaults are tuned for a pool of about ten accounts and a target of 1000 tweets per handle. Smaller pools will spend more time waiting on cooldowns.
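The jitter these tunables drive is straightforward. A sketch with the documented default values (the helper function is illustrative, not the scraper's actual code):

```python
import random
import time

# Values mirror the documented defaults.
TWEETS_PER_ACCOUNT_MIN = 480
TWEETS_PER_ACCOUNT_MAX = 520
PAGE_DELAY_RANGE = (0.3, 1.0)

def jitter_sleep(delay_range):
    """Sleep a uniformly random duration inside delay_range (seconds)."""
    time.sleep(random.uniform(*delay_range))

# A fresh per-cycle quota is drawn every cycle, so no two cycles look identical.
cycle_quota = random.randint(TWEETS_PER_ACCOUNT_MIN, TWEETS_PER_ACCOUNT_MAX)
jitter_sleep(PAGE_DELAY_RANGE)  # e.g. between two pagination calls
```

Drawing every delay from a range rather than a constant is the whole trick: fixed intervals are the easiest machine signature to spot.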
`output/csv/all_users_tweets.csv` has 25 columns: identity and content first, then engagement, then a snapshot of the user's public profile at the time of scrape.
| Group | Columns |
|---|---|
| Identity | Tweet_ID, Posted_Time, Tweet_URL, Tweet_Content |
| Engagement | Replies_Count, Reposts_Count, Likes_Count, Views_Count, Quote_Count, Bookmark_Count |
| Profile snapshot | User_Handle, User_Name, UserID, Follower_Count, Following_Count, Posts_Count, Media_Count, Joined_Date, Location, Professional_Category, Website, Is_Blue_Verified, Can_DM, Account_URL, Birthdate |
`Tweet_Content` resolves to the longest available text: long-form `note_tweet` first, then extended `full_text`, then the legacy 280-character `text`. Newlines are stripped so each tweet stays on one CSV row.
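That resolution order can be sketched as a small helper. The flat dict here is a stand-in for whatever `twikit` returns, so the exact key names and nesting are assumptions:

```python
def resolve_tweet_text(tweet: dict) -> str:
    """Pick the longest-form text available: note_tweet -> full_text -> text.

    Newlines are collapsed to spaces so the tweet stays on one CSV row.
    """
    text = (
        tweet.get("note_tweet")   # long-form, up to ~25k chars
        or tweet.get("full_text") # extended tweet text
        or tweet.get("text")      # legacy 280-char field
        or ""
    )
    return " ".join(text.split())  # collapses newlines and repeated whitespace

print(resolve_tweet_text({"full_text": "line one\nline two", "text": "line one"}))
# line one line two
```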
Loading the result is a one-liner:

```python
import pandas as pd

df = pd.read_csv("output/csv/all_users_tweets.csv")
df.groupby("User_Handle").size().sort_values(ascending=False).head()
```

The `examples/congress/` folder ships the official 118th and 119th Congress member metadata, including bioguide IDs, party, state, chamber, and the public X / Twitter handle for each member. A README in that folder walks through joining scraped tweets back to member metadata so you can answer questions like "what fraction of Senate Democrats tweeted about a given topic last week?"
The full 119th Congress sweep (about 470 active handles, 1000 tweets each) takes around a day and a half on a pool of ten accounts. The example data files are a starter kit you can lift directly into your own research pipeline.
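A sketch of that join, using inline stand-in frames so it runs anywhere. In practice you would read `output/csv/all_users_tweets.csv` and `examples/congress/congress119_socials.csv`; the metadata column names used here (`handle`, `party`, `chamber`) are assumptions, so check the folder's README for the shipped schema:

```python
import pandas as pd

# Stand-ins for the scraped CSV and the member metadata CSV.
tweets = pd.DataFrame({
    "User_Handle": ["SenExample", "RepSample"],
    "Tweet_Content": ["hello", "world"],
})
members = pd.DataFrame({
    "handle": ["senexample", "repsample"],
    "party": ["D", "R"],
    "chamber": ["senate", "house"],
})

# Normalize handle casing before joining; it often differs between sources.
tweets["handle_lc"] = tweets["User_Handle"].str.lower()
members["handle_lc"] = members["handle"].str.lower()

joined = tweets.merge(
    members[["handle_lc", "party", "chamber"]],
    on="handle_lc", how="left",
)
print(joined.groupby(["party", "chamber"]).size())
```

A left join keeps tweets from handles missing in the metadata, which is useful for spotting stale or renamed accounts.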
```
x-scraper/
├── README.md
├── LICENSE
├── CONTRIBUTING.md
├── requirements.txt
├── .gitignore
├── .gitattributes
│
├── scripts/
│   ├── auto_rotating_scraper.py   # main scraper, one file
│   ├── test_auth.py               # cookie pool health check
│   └── import_accounts.py         # import a colon-separated bundle
│
├── config/
│   ├── usernames.txt              # target handles, one per line
│   ├── accounts/
│   │   ├── accounts.example.json  # template for the manifest
│   │   └── accounts.json          # YOUR manifest (gitignored)
│   └── cookies/
│       ├── account_1.example.json # template for a single cookie file
│       └── account_*.json         # YOUR cookie files (gitignored)
│
├── docs/
│   ├── cookies.md                 # how to extract cookies, three methods
│   ├── architecture.md            # deep dive on the loop
│   └── ethics.md                  # responsible-use notes
│
├── examples/
│   └── congress/
│       ├── README.md               # joining tweets back to member metadata
│       ├── example_usernames.txt   # 10 well-known handles to try
│       ├── congress118_socials.csv # 118th Congress handles, 424 rows
│       ├── congress119_socials.csv # 119th Congress handles, 527 rows
│       ├── congress119.csv         # 119th Congress full member metadata
│       └── info.csv                # one row per (member, account) pair
│
└── output/
    └── (created on first run, gitignored)
```
**Where do the accounts come from?** You bring them. The scraper does not care whether they are throwaways you created in private browser profiles, your real accounts (not recommended), or session bundles obtained elsewhere. Some marketplaces resell X session bundles for very small amounts of money; whether using them is appropriate is a policy and ethical question, not a technical one. The cookie file format is the same either way. See docs/cookies.md.
**Will my accounts get suspended?** Possibly. Treat every account in the rotation pool as disposable. The defaults (15-minute cooldowns, randomized quotas, page delays) are conservative, but X's detection is opaque and changes without notice. Do not pool accounts you would mind losing.
**Can it scrape protected accounts?** Only if your authenticated session already follows them. The scraper sees exactly what a logged-in browser would see at the same URL.
**Does it solve captchas?** No. If X serves a captcha, the scraper logs an auth error and rotates to the next account. The pool is the only mitigation.
**Can I run it unattended?** Yes. `nohup python3 scripts/auto_rotating_scraper.py &` works. Tail `output/bulk_progress.json` for live progress, or check `wc -l output/csv/all_users_tweets.csv` for row counts.
**How do I write to something other than CSV?** Replace `append_tweets_to_csv` in `scripts/auto_rotating_scraper.py`. The schema is defined by `extract_tweet_data`, and there is exactly one place that writes rows. The rest of the loop does not care.
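If you went the database route, a drop-in writer might look like this sketch. It assumes `extract_tweet_data` yields dicts keyed by the CSV column names; the function name `append_tweets_to_sqlite` and the cut-down table schema are both illustrative:

```python
import sqlite3

def append_tweets_to_sqlite(rows, db_path="output/tweets.db"):
    """Hypothetical replacement for append_tweets_to_csv: same rows, SQLite sink.

    `rows` is assumed to be a list of dicts keyed by the column names.
    The PRIMARY KEY plus INSERT OR IGNORE makes re-runs idempotent, mirroring
    the append-only CSV guarantee.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS tweets (
               Tweet_ID      TEXT PRIMARY KEY,
               User_Handle   TEXT,
               Posted_Time   TEXT,
               Tweet_Content TEXT,
               Likes_Count   INTEGER
           )"""
    )
    con.executemany(
        """INSERT OR IGNORE INTO tweets
           (Tweet_ID, User_Handle, Posted_Time, Tweet_Content, Likes_Count)
           VALUES (:Tweet_ID, :User_Handle, :Posted_Time, :Tweet_Content, :Likes_Count)""",
        rows,
    )
    con.commit()
    con.close()

append_tweets_to_sqlite([{
    "Tweet_ID": "1", "User_Handle": "acct_a", "Posted_Time": "2025-01-01",
    "Tweet_Content": "hi", "Likes_Count": 3,
}], db_path="tweets_example.db")
```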
**Should this run in Docker?** No. The whole thing is three small scripts and one dependency; a container would add more weight than it removes.
**Why not just pay for the official API?** You can. If your budget covers the X API Pro tier and your data needs fit inside its rate limits, that is the right answer. This project is for the cases where the API tier is the wrong shape: too expensive, too narrow, or unavailable in your jurisdiction.
- Built on `twikit`, the unofficial Python client that makes the cookie-auth flow viable.
- Congress example data was assembled from the Press Gallery 119th Congress handle list and the unitedstates/congress-legislators project.
- Inspired by Alex Litel's congresstweets archive, which kept Congressional tweets accessible for years before the API closed.
See CONTRIBUTING.md. The project is small on purpose, and bug fixes get merged faster than feature additions.
MIT. See LICENSE.