A project for converting URLs into DML and images for the Glimpse dataset.
Requirements:

- Python 2.7.x
- pip
- virtualenv
- autoenv
- ChromeDriver
- Redis
Setup:

- Download the ChromeDriver executable and place it somewhere memorable (I chose `/usr/local/lib/chromedriver`).

- Run the setup script to create some gitignored directories and your `.env` file:

  ```shell
  $ python setup.py
  ```
- Add the following to your `.env` file:

  ```shell
  export REDIS_URL="<YOUR_REDIS_URL>"
  export CHROMEDRIVER="/path/to/chromedriver"
  ```
- Activate your environment variables:

  ```shell
  $ source ~/.bashrc
  $ cd .
  ```
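If autoenv is working, `cd`-ing into the project directory should export both variables. A quick sanity check (the variable names match the `.env` above; this helper is not part of the project):

```python
import os

def missing_env_vars(required=('REDIS_URL', 'CHROMEDRIVER')):
    """Return the names of any required variables absent from the environment."""
    return [name for name in required if not os.environ.get(name)]

missing = missing_env_vars()
if missing:
    print('Not set: ' + ', '.join(missing))
```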
- Create a new virtual environment and activate it:

  ```shell
  $ virtualenv venv && source venv/bin/activate
  ```
- Install the Python library requirements:

  ```shell
  $ pip install -r requirements.txt
  ```
- Populate your Redis instance with a hash stored at the key `'urls'`, whose fields are the URLs and whose values are `'0'` (`UNFETCHED`):

  ```python
  from src.helpers.cache import get_redis
  from src.statuses.urls import UNFETCHED

  redis = get_redis()
  urls = ['http://google.com/', ...]  # list of urls you want to scrape
  url_hash = {u: UNFETCHED for u in urls}
  redis.hmset('urls', url_hash)
  ```
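The `'urls'` hash doubles as progress bookkeeping: every URL starts at `UNFETCHED` (`'0'`) and is flipped once its page has been fetched. Here is a Redis-free sketch of that idea; the `FETCHED` value and the helper names are illustrative, not part of this project:

```python
# Sketch of the status bookkeeping the 'urls' hash implements.
UNFETCHED = '0'  # value stated in this README
FETCHED = '1'    # hypothetical "done" marker, not the project's actual value

def mark_fetched(url_hash, url):
    """Record that one URL has been scraped."""
    url_hash[url] = FETCHED

def unfetched(url_hash):
    """Return the URLs still waiting to be scraped."""
    return [u for u, status in url_hash.items() if status == UNFETCHED]

urls = ['http://google.com/', 'http://example.com/']
url_hash = {u: UNFETCHED for u in urls}
mark_fetched(url_hash, 'http://google.com/')
print(unfetched(url_hash))
```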
- Scrape HTML from the URLs into local HTML files with inline-styled CSS:

  ```shell
  $ python scrape.py
  ```
- Manually evaluate the HTML files to fix/remove outliers (optional):

  ```shell
  $ python eval.py
  ```

  This starts a Flask server. From there, you can open any local HTML file in your browser and evaluate it with the following keys:

  - `D` --> moves the HTML file to the discard directory
  - `S` --> saves the current working HTML file in place (good for touching up files)
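The two keybindings amount to a small dispatch over file operations. A hypothetical sketch of that routing (the directory name and function are illustrative, not eval.py's actual code):

```python
import os
import shutil

DISCARD_DIR = 'discard'  # hypothetical name for the discard directory

def handle_key(key, html_path):
    """Dispatch one evaluation keypress for one HTML file."""
    if key == 'D':
        # Move the file out of the working set into the discard directory.
        if not os.path.isdir(DISCARD_DIR):
            os.makedirs(DISCARD_DIR)
        shutil.move(html_path, os.path.join(DISCARD_DIR, os.path.basename(html_path)))
        return 'discarded'
    if key == 'S':
        # Leave the (possibly hand-edited) file where it is.
        return 'saved'
    return 'ignored'
```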
- Duplicate HTML files whose `<section>` siblings can be permuted:

  ```shell
  $ python permutate.py
  ```
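The permutation idea can be illustrated with plain strings: every ordering of the reorderable `<section>` siblings yields one new document variant. A simplified sketch (permutate.py presumably operates on real files rather than string lists):

```python
from itertools import permutations

def permute_sections(sections):
    """Yield one HTML string per ordering of the sibling <section> blocks."""
    for ordering in permutations(sections):
        yield ''.join(ordering)

sections = ['<section>A</section>', '<section>B</section>', '<section>C</section>']
variants = list(permute_sections(sections))
print(len(variants))  # 3! = 6 orderings
```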
- Capture screenshots of each HTML file:

  ```shell
  $ python capture.py
  ```
- Translate all HTML files to DML:

  ```shell
  $ python translate.py
  ```
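DML's grammar isn't specified in this README, so as a stand-in, here is the general shape of such a translation pass: walk the HTML tag tree and emit one output token per element. The token names are invented for illustration, not real DML:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

class TokenEmitter(HTMLParser):
    """Emit a made-up open/close token for every element encountered."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append('open-' + tag)

    def handle_endtag(self, tag):
        self.tokens.append('close-' + tag)

emitter = TokenEmitter()
emitter.feed('<section><h1>Title</h1></section>')
print(emitter.tokens)
```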