For analyzing comments submitted to regulations.gov.
Yes, this is a bit messy and could get a whole lot cleaner if we used this all the time. It's basically a one-off, so it's not the cleanest bit of scripting you've ever seen. I'd probably build the whole thing in Python/Pandas if I had to do it again, but csvkit and bash get the job done. Searching is significantly slower than if you used Pandas, however.
- Jupyter Notebook (Highly recommend using virtual environments. `pip install jupyter` if you're already using Python)
- Pandas (`pip install pandas`)
- markegge's get-comments-with-api notebook
- csvkit (`pip install csvkit`)
- jot (included in MacOS, must compile from source on other platforms. Alternately, use another random number generator in line 19 of generate-random.sh.)
- GNU core utilities (included in Linux, must install on MacOS using `brew install coreutils`)
- Run get-comments-with-api from Jupyter Notebook to download the full comment set. (Alternately, export the notebook to a .py file and run that from the command line.) Note that you need an API key from data.gov to download all the comments.
- Copy comments.csv into your working directory.
- Run `sh match-random.sh` to clean comments.csv and pick 1000 random comments from it.
- Run `sh search-comments.sh utah-residents.txt` to find possible comments from Utah residents (output is in utah-residents.csv).
- Run `sh random-from-search.sh 1000 utah-residents.csv` to pick 1000 random comments.
- Import `export-1000-random.csv` and `utah-residents-random.csv` into a spreadsheet (we used Google Docs for simultaneous editing) and code each comment by hand.
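
For anyone who would rather do the random-sampling step in Python, here is a minimal sketch using only the standard library. It is not what the shell scripts in this repo do internally; the function name `sample_rows`, its parameters, and the assumption that the first CSV row is a header are all made up for illustration.

```python
import csv
import random


def sample_rows(in_path, out_path, n=1000, seed=None):
    """Copy the header plus up to n randomly chosen rows from in_path to out_path.

    Hypothetical helper: assumes the first row of the CSV is a header.
    Returns the number of data rows written.
    """
    with open(in_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = list(reader)  # loads the whole file into memory

    # A seeded generator makes the sample repeatable; seed=None gives a fresh sample.
    rng = random.Random(seed)
    picked = rng.sample(rows, min(n, len(rows)))

    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(picked)
    return len(picked)
```

Something like `sample_rows("utah-residents.csv", "utah-residents-random.csv", n=1000)` would then stand in for the shell step, assuming the input file is already cleaned.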
- If you just want to search the comment set for a bunch of terms, first generate a `clean.csv` file: `csvclean -l comments.csv && mv comments_out.csv clean.csv`
- Then put your search terms into a .txt file, one term per line. (csvgrep uses regex, so terms like `liv(e|ed|ing) in utah` will find people who live, lived, or are living in Utah. `, utah (\d*)` finds digits (like a zip code) after comma-space-utah.)
- Run `sh search-comments.sh [myfile.txt]` to search clean.csv for all the terms in your text file. Output will be in `[myfile].csv`.
- This all works for me on MacOS Sierra. It should work fine on Linux, but in line 19 of generate-random.sh, you'll need to change `gshuf` to `shuf`.
- Our analysis of 650,000 comments posted as of 7:00 am MDT Monday, July 10 is available here and here.
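
To try out search terms before running them against the whole comment set, a rough Python equivalent of the term-file search might look like the sketch below. It uses the standard `re` and `csv` modules rather than csvgrep, so regex behavior could differ in edge cases. The column name `comment_text` is hypothetical (check the header of your own clean.csv), and case-insensitive matching is an assumption, not something confirmed from search-comments.sh.

```python
import csv
import re


def load_patterns(terms_path):
    """Compile one regex per non-blank line of the terms file.

    Case-insensitive matching is an assumption made for this sketch.
    """
    with open(terms_path, encoding="utf-8") as fh:
        lines = (line.rstrip("\n") for line in fh)
        return [re.compile(term, re.IGNORECASE) for term in lines if term.strip()]


def search_rows(rows, patterns, column):
    """Return the rows (dicts) whose `column` text matches at least one pattern."""
    return [
        row for row in rows
        if any(p.search(row.get(column) or "") for p in patterns)
    ]


def search_csv(csv_path, terms_path, out_path, column="comment_text"):
    """Write every matching row of csv_path to out_path; return the match count.

    `column` defaults to a hypothetical name; pass the real one for your data.
    """
    patterns = load_patterns(terms_path)
    with open(csv_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        hits = search_rows(reader, patterns, column)
    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(hits)
    return len(hits)
```

Note that `\d*` also matches zero digits, so `, utah (\d*)` will hit any text containing comma-space-utah-space whether or not a zip code follows. Use `\d+` or `\d{5}` in the terms file if you only want comments with digits after the state name.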