Reddit-Post-Scraper

This code retrieves the URLs of the ~600 million posts made to Reddit. As of Jan 2019, these URLs can be gathered in about 24 hours.

I wrote this code when I was trying to find historical posts in a subreddit
and ran into Reddit's limit of only being able to page back through the most
recent 1000 posts.

Heavily inspired by this blog post by Andy Balaam:
Making 100 million requests with Python aiohttp
https://www.artificialworlds.net/blog/2017/06/12/making-100-million-requests-with-python-aiohttp/

Technically, this code does not require a Reddit account, and does not use
OAuth or any API tokens. It just extracts information contained in Reddit's
response headers.
I most definitely exceeded the request rate limit while gathering the data,
but I was never blocked by Reddit.
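The core idea can be sketched as follows. This is a hypothetical illustration, not the repo's actual code: it assumes that Reddit post IDs are sequential base-36 integers, so candidate short links can be generated simply by counting, and that requesting a short link with redirects disabled yields the canonical post URL in the `Location` response header.

```python
import string

# Base-36 digits in Reddit's order: 0-9 then a-z.
ALPHABET = string.digits + string.ascii_lowercase

def to_base36(n: int) -> str:
    """Encode a non-negative integer as a base-36 post ID (e.g. 36 -> '10')."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def short_url(post_id: str) -> str:
    """Build the redd.it short link for a given post ID."""
    return f"https://redd.it/{post_id}"

# A real scraper would then issue requests with redirects disabled and read
# the canonical URL from the headers, e.g. with aiohttp (not executed here):
#   resp = await session.get(short_url(pid), allow_redirects=False)
#   canonical = resp.headers.get("Location")

print(short_url(to_base36(1234567)))  # → https://redd.it/qglj
```

Generating IDs locally like this is what makes the brute-force sweep possible: no listing endpoint or pagination token is ever needed.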

If this script is run directly, it collects the data.
It can also be imported into another script, in which case the function
ControlLoop()
can be used on its own.

It takes ~35 GB to store the URLs of all ~600 million posts.
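A quick back-of-envelope check of that figure, using the two numbers stated above (both approximate):

```python
# Rough average storage cost per URL, from the README's own estimates.
total_bytes = 35 * 1024**3   # ~35 GB of stored URLs
posts = 600_000_000          # ~600 million posts
per_post = total_bytes / posts
print(f"~{per_post:.0f} bytes per URL on average")  # → ~63 bytes per URL on average
```

Around 60 bytes per entry is consistent with storing a full `https://www.reddit.com/r/<sub>/comments/<id>/<slug>/` path as plain text.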

A single file is shown in the JSON folder as an example.
###
Code Written by:
Kyle Shepherd
KyleAnthonyShepherd.gmail.com
Jan 25, 2019
###
