Reddit-Post-Scraper

This code retrieves the URLs of the ~600 million posts made to Reddit. As of Jan 2019, these URLs can be gathered in about 24 hours.

I wrote this code when I was trying to find historical posts in a subreddit
and ran into Reddit's limit of only being able to page back through the most
recent 1000 posts.

Heavily inspired by this blog post by Andy Balaam:
Making 100 million requests with Python aiohttp
https://www.artificialworlds.net/blog/2017/06/12/making-100-million-requests-with-python-aiohttp/

Technically, this code does not require a Reddit account, and does not use
OAuth or any API tokens. It just extracts information contained in Reddit's
response headers.
I most definitely exceeded the request rate limit while gathering the data,
but I was never blocked by Reddit.
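The core idea can be sketched as follows. This is a hypothetical illustration, not the repo's actual code: it assumes that Reddit post IDs are sequential base-36 integers, so candidate short links can be generated simply by counting, and that requesting a short link with redirects disabled yields the canonical post URL in the `Location` response header.

```python
import string

# Base-36 digits in Reddit's order: 0-9 then a-z.
ALPHABET = string.digits + string.ascii_lowercase

def to_base36(n: int) -> str:
    """Encode a non-negative integer as a base-36 post ID (e.g. 36 -> '10')."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def short_url(post_id: str) -> str:
    """Build the redd.it short link for a given post ID."""
    return f"https://redd.it/{post_id}"

# A real scraper would then issue requests with redirects disabled and read
# the canonical URL from the headers, e.g. with aiohttp (not executed here):
#   resp = await session.get(short_url(pid), allow_redirects=False)
#   canonical = resp.headers.get("Location")

print(short_url(to_base36(1234567)))  # → https://redd.it/qglj
```

Generating IDs locally like this is what makes the brute-force sweep possible: no listing endpoint or pagination token is ever needed.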

If this script is run directly, it collects the data.
It can also be imported into another script, in which case the function
ControlLoop()
can be used on its own.

It takes ~35 GB to store the URLs of all ~600 million posts.
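A quick back-of-envelope check of that figure, using the two numbers stated above (both approximate):

```python
# Rough average storage cost per URL, from the README's own estimates.
total_bytes = 35 * 1024**3   # ~35 GB of stored URLs
posts = 600_000_000          # ~600 million posts
per_post = total_bytes / posts
print(f"~{per_post:.0f} bytes per URL on average")  # → ~63 bytes per URL on average
```

Around 60 bytes per entry is consistent with storing a full `https://www.reddit.com/r/<sub>/comments/<id>/<slug>/` path as plain text.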

A single file is shown in the JSON folder as an example.
###
Code Written by:
Kyle Shepherd
KyleAnthonyShepherd.gmail.com
Jan 25, 2019
###
