
x-scraper

A research-grade scraper for X / Twitter timelines. Account rotation, jittered pacing, cursor-resume, single-file CSV output.


Quickstart · How it works · Cookies guide · Architecture · Ethics · Example: U.S. Congress


Why this exists

When X closed the public Twitter API in 2023, a generation of academic, journalistic, and civic-tech work that depended on free-tier timeline access stopped working overnight. Studies of political polarization, election integrity, public-health communication, crisis response, and disinformation all share the same blocker: they need the ability to walk a public profile's timeline at low cost. There is currently no free or affordable replacement on offer.

x-scraper is one answer. It is a small, single-file Python tool that authenticates as a normal logged-in user, paginates through public timelines, and writes the results to a single CSV with consistent columns. It rotates across as many accounts as you want, sleeps between requests, and checkpoints aggressively so a multi-day run survives interruptions.

It is built for research and journalism, not for growth hacking. Read the ethics doc before you point it at anything sensitive.

Features

  • Account rotation pool. Plug in N authenticated accounts. The scraper picks the most-rested one each cycle and enforces a per-account cooldown.
  • Jittered pacing. Per-cycle quotas, page delays, and account-swap delays are all randomized so traffic does not look like a perfect machine.
  • Cursor-aware resume. State is checkpointed after every page write. Ctrl+C is safe. Re-running picks up where you stopped, including mid-handle.
  • Long-form text. Resolves note_tweet (up to 25k chars), then full_text, then text, so long tweets are not truncated.
  • Single CSV output. 25 columns covering tweet identity, engagement, and a profile snapshot. Easy to load into pandas, DuckDB, or anything that reads CSV.
  • Bring your own accounts. Bundle import (scripts/import_accounts.py) for the colon-separated cookie format, or hand-roll a JSON file from browser DevTools. See the cookies guide.
  • No surprise dependencies. Pure Python, one library (twikit), zero database, zero queue, zero infrastructure.
  • Worked example included. examples/congress/ ships the official 118th and 119th U.S. Congress member metadata so you can reproduce a real research dataset end to end.

Status

This is a research tool, not a product. It works, it has been used to collect about 1.5 M public tweets across roughly 500 handles, and it is intentionally minimal. The scraper makes no attempt to defeat captchas, fingerprinting, or behavioral analysis. If X tightens its detection further, the right move is to retire the affected accounts and refresh the cookie pool, not to add evasion plumbing here.

Quickstart

# 1. Clone and install
git clone https://github.com/FelixSBuehrm/x-scraper.git
cd x-scraper
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# 2. Add at least one X account to the pool
#    See docs/cookies.md for the extraction guide.
cp config/accounts/accounts.example.json config/accounts/accounts.json
# (edit config/accounts/accounts.json with your handles + auth tokens)
# (drop one cookie file per account into config/cookies/account_N.json)

# 3. Smoke test the cookie pool
python3 scripts/test_auth.py

# 4. Add target handles, one per line, to config/usernames.txt
#    The repo ships with three placeholder handles you can replace.

# 5. Run the bulk scrape
python3 scripts/auto_rotating_scraper.py

The first run produces:

output/
├── csv/
│   └── all_users_tweets.csv     # one row per tweet, 25 columns
└── bulk_progress.json           # checkpoint file (resume state)

Stop with Ctrl+C at any time. Re-run to resume.
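The reason Ctrl+C is safe is that the checkpoint file is never left half-written. A sketch of the standard atomic pattern (the field names inside bulk_progress.json are illustrative here, not the scraper's actual layout):

```python
import json
import os
import tempfile

def save_checkpoint(state, path="bulk_progress.json"):
    """Write atomically: temp file + rename, so a crash never leaves a torn file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX

# Hypothetical state after writing 600 tweets of one handle
save_checkpoint(
    {"handle": "example", "cursor": "abc", "tweets_written": 600},
    "demo_progress.json",
)
```

On resume, the scraper only has to reload this file; any page written after the last checkpoint is simply re-fetched.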

How it works

config/usernames.txt   ->   for each handle:
config/accounts/*.json          for each cycle (until cap or end of timeline):
                                    pick the most-rested account
                                    sleep for the random pre-cycle delay
                                    fetch a page of tweets via twikit
                                    paginate until the cycle target is met
                                    append rows to all_users_tweets.csv
                                    checkpoint progress to bulk_progress.json
                                    rotate to next account

A few design choices worth flagging:

  • Most-rested-first selection beats round-robin because handles vary in length. Some accounts will burn their cycle quota in 30 seconds; others will hit the end of the timeline early. Picking by idle time keeps the pool balanced without extra bookkeeping.
  • Cursors are persisted between cycles. If account A makes it 600 tweets deep into a handle and then hits a 429, account B picks up at exactly tweet 601 with the same cursor. No duplicates.
  • The CSV is append-only. A row that has been written will not be rewritten. Combined with the cursor, this means you can interrupt and resume safely.
  • The whole loop is one file (scripts/auto_rotating_scraper.py, ~400 lines). If something is wrong, you can read the entire pipeline top to bottom in ten minutes.
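The most-rested-first rule fits in a few lines. A minimal sketch (the account dicts and field names here are illustrative, not the scraper's actual data model):

```python
import time

def pick_most_rested(accounts, cooldown_s=15 * 60):
    """Return the longest-idle account, or None if all are inside the cooldown."""
    now = time.time()
    rested = [a for a in accounts if now - a["last_used"] >= cooldown_s]
    if not rested:
        return None
    # Smallest last_used timestamp == longest idle == most rested.
    return min(rested, key=lambda a: a["last_used"])

accounts = [
    {"name": "acct_1", "last_used": time.time() - 1200},  # 20 min idle
    {"name": "acct_2", "last_used": time.time() - 3000},  # 50 min idle
    {"name": "acct_3", "last_used": time.time() - 60},    # 1 min idle: cooling down
]
print(pick_most_rested(accounts)["name"])  # acct_2
```

If every account is cooling down, the loop just sleeps until one becomes eligible.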

For the full breakdown including failure modes, see docs/architecture.md.

Configuration

All tunables live at the top of scripts/auto_rotating_scraper.py:

| Setting | Default | What it does |
| --- | --- | --- |
| TWEETS_PER_USER | 1000 | Maximum tweets per handle. Scraping also stops at the end of the timeline. |
| TWEETS_PER_ACCOUNT_MIN / MAX | 480 / 520 | Per-cycle quota is randomized inside this range. |
| TWEETS_PER_PAGE | 20 | Tweets per pagination call. Matches the X UI default. |
| ACCOUNT_COOLDOWN_MINUTES | 15 | Minimum time before reusing the same account. |
| PAGE_DELAY_RANGE | (0.3, 1.0) | Sleep between pagination requests, in seconds. |
| CYCLE_DELAY_RANGE | (1, 3) | Sleep before starting each new cycle. |
| ACCOUNT_SWAP_DELAY | (2, 5) | Sleep when switching accounts. |

The defaults are tuned for a pool of about ten accounts and a target of 1000 tweets per handle. Smaller pools will spend more time waiting on cooldowns.
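As module-level constants these look roughly like the following (values are the defaults from the table; treat the exact layout as illustrative). The randomized quota is what keeps per-cycle traffic from being perfectly regular:

```python
import random

# Tunables at the top of scripts/auto_rotating_scraper.py (illustrative layout)
TWEETS_PER_USER = 1000            # hard cap per handle
TWEETS_PER_ACCOUNT_MIN = 480      # per-cycle quota, lower bound
TWEETS_PER_ACCOUNT_MAX = 520      # per-cycle quota, upper bound
TWEETS_PER_PAGE = 20              # matches the X UI page size
ACCOUNT_COOLDOWN_MINUTES = 15     # minimum rest before reusing an account
PAGE_DELAY_RANGE = (0.3, 1.0)     # seconds between pagination calls
CYCLE_DELAY_RANGE = (1, 3)        # seconds before each new cycle
ACCOUNT_SWAP_DELAY = (2, 5)       # seconds when switching accounts

# Each cycle draws a fresh quota inside the configured range
quota = random.randint(TWEETS_PER_ACCOUNT_MIN, TWEETS_PER_ACCOUNT_MAX)
print(quota)
```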

Output schema

output/csv/all_users_tweets.csv has 25 columns. Identity and content first, then engagement, then a snapshot of the user's public profile at the time of scrape.

| Group | Columns |
| --- | --- |
| Identity | Tweet_ID, Posted_Time, Tweet_URL, Tweet_Content |
| Engagement | Replies_Count, Reposts_Count, Likes_Count, Views_Count, Quote_Count, Bookmark_Count |
| Profile snapshot | User_Handle, User_Name, UserID, Follower_Count, Following_Count, Posts_Count, Media_Count, Joined_Date, Location, Professional_Category, Website, Is_Blue_Verified, Can_DM, Account_URL, Birthdate |

Tweet_Content resolves to the longest available text: long-form note_tweet first, then extended full_text, then the legacy 280-character text. Newlines are stripped so each tweet stays on one CSV row.
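The fallback chain can be sketched like this (the dict shape is a simplified stand-in for twikit's tweet object, not its real API):

```python
def resolve_tweet_text(tweet: dict) -> str:
    """Longest-form text wins: note_tweet, then full_text, then legacy text."""
    text = (
        tweet.get("note_tweet")
        or tweet.get("full_text")
        or tweet.get("text")
        or ""
    )
    # Collapse newlines and runs of whitespace so each tweet stays on one CSV row.
    return " ".join(text.split())

print(resolve_tweet_text({"full_text": "line one\nline two"}))  # line one line two
```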

Loading the result is a one-liner:

import pandas as pd
df = pd.read_csv("output/csv/all_users_tweets.csv")
df.groupby("User_Handle").size().sort_values(ascending=False).head()

Worked example: U.S. Congress

The examples/congress/ folder ships the official 118th and 119th Congress member metadata, including bioguide IDs, party, state, chamber, and the public X / Twitter handle for each member. There is a README in that folder that walks through joining scraped tweets back to member metadata so you can answer questions like "what fraction of Senate Democrats tweeted about a given topic last week?"

The full 119th Congress sweep (about 470 active handles, 1000 tweets each) takes around a day and a half on a pool of ten accounts. The example data files are a starter kit you can lift directly into your own research pipeline.
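A join of that shape is a single pandas merge. The sketch below uses tiny in-memory frames; User_Handle comes from the output schema above, while the member-metadata column names (handle, party, chamber) are assumptions about the example CSVs, so check the congress README for the real ones:

```python
import pandas as pd

# Stand-ins for output/csv/all_users_tweets.csv and congress119_socials.csv
tweets = pd.DataFrame({
    "User_Handle": ["SenFoo", "SenFoo", "RepBar"],
    "Tweet_Content": ["on the farm bill", "hello", "on the farm bill"],
})
members = pd.DataFrame({
    "handle": ["SenFoo", "RepBar"],      # assumed column names
    "party": ["D", "R"],
    "chamber": ["Senate", "House"],
})

# Attach party/chamber to every tweet, then compute a topic share
joined = tweets.merge(members, left_on="User_Handle", right_on="handle")
senate_d = joined[(joined["chamber"] == "Senate") & (joined["party"] == "D")]
share = senate_d["Tweet_Content"].str.contains("farm bill").mean()
print(share)  # fraction of Senate-Democrat tweets mentioning the topic
```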

Project layout

x-scraper/
├── README.md
├── LICENSE
├── CONTRIBUTING.md
├── requirements.txt
├── .gitignore
├── .gitattributes
│
├── scripts/
│   ├── auto_rotating_scraper.py    # main scraper, one file
│   ├── test_auth.py                # cookie pool health check
│   └── import_accounts.py          # import a colon-separated bundle
│
├── config/
│   ├── usernames.txt               # target handles, one per line
│   ├── accounts/
│   │   ├── accounts.example.json   # template for the manifest
│   │   └── accounts.json           # YOUR manifest (gitignored)
│   └── cookies/
│       ├── account_1.example.json  # template for a single cookie file
│       └── account_*.json          # YOUR cookie files (gitignored)
│
├── docs/
│   ├── cookies.md                  # how to extract cookies, three methods
│   ├── architecture.md             # deep dive on the loop
│   └── ethics.md                   # responsible-use notes
│
├── examples/
│   └── congress/
│       ├── README.md               # joining tweets back to member metadata
│       ├── example_usernames.txt   # 10 well-known handles to try
│       ├── congress118_socials.csv # 118th Congress handles, 424 rows
│       ├── congress119_socials.csv # 119th Congress handles, 527 rows
│       ├── congress119.csv         # 119th Congress full member metadata
│       └── info.csv                # one row per (member, account) pair
│
└── output/
    └── (created on first run, gitignored)

FAQ

Where do the accounts in the rotation pool come from?

You bring them. The scraper does not care whether the accounts are your own throwaway accounts you created in private browser profiles, your real accounts (not recommended), or session bundles obtained elsewhere. Some marketplaces resell X session bundles for very small amounts of money. Whether using them is appropriate is a policy and ethical question, not a technical one. The cookie file format is the same either way. See docs/cookies.md.

Will my accounts get banned?

Possibly. Treat every account in the rotation pool as disposable. The defaults (15-minute cooldowns, randomized quotas, page delays) are conservative, but X's detection is opaque and changes without notice. Do not pool accounts you care about losing.

Can I scrape protected / private accounts?

Only if your authenticated session already follows them. The scraper sees exactly what a logged-in browser would see at the same URL.

Does this defeat captchas or device fingerprinting?

No. If X serves a captcha, the scraper logs an auth error and rotates to the next account. The pool is the only mitigation.

Can I run it in the background?

Yes. nohup python3 scripts/auto_rotating_scraper.py & works. Tail output/bulk_progress.json for live progress, or check wc -l output/csv/all_users_tweets.csv for row counts.

How do I add a new output format (Parquet, JSONL, database)?

Replace append_tweets_to_csv in scripts/auto_rotating_scraper.py. The schema is defined by extract_tweet_data and there is exactly one place to write rows. The rest of the loop does not care.
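For instance, a JSONL drop-in might look like this (append_tweets_to_csv and extract_tweet_data are named in the script; the signature and row shape here are assumptions):

```python
import json
from pathlib import Path

def append_tweets_to_jsonl(rows, path="output/jsonl/all_users_tweets.jsonl"):
    """Hypothetical replacement for append_tweets_to_csv: one JSON object per line."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("a", encoding="utf-8") as f:
        for row in rows:  # each row: a dict as built by extract_tweet_data
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

append_tweets_to_jsonl([{"Tweet_ID": "1", "Tweet_Content": "hello"}], "demo.jsonl")
```

Because the writer is append-only, the interrupt-and-resume guarantees carry over unchanged.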

Is there a Docker image?

No. The whole thing is three small scripts and one dependency. A container would add more weight than it removes.

Why not just use the official API tier?

You can. If your budget covers the X API Pro tier and your data needs fit inside the rate limits, that is the right answer. This project is for the cases where the API tier is the wrong shape: too expensive, too narrow, or unavailable for your jurisdiction.

Contributing

See CONTRIBUTING.md. The project is small on purpose, and bug fixes get merged faster than feature additions.

License

MIT. See LICENSE.


Use it for research. Be proportional. Read docs/ethics.md first.
