TODO
This repository contains the data-processing pipeline for rental listings data. It replaces rental-listings-mapper, rental-listings-cleaner, rental-listings-geolocator, and rental-listings-processor, consolidating them into a single repository that uses one language (Python) and removes the dependencies on Docker/docker-compose, Ruby, and R. It also brings a more sophisticated approach to listing deduplication (via the dedupe library), adds database connections, and introduces new processing/reprocessing functionality.
TODO: Diagram
First, you'll need to set up your environment variables. These can be set using a .env file in the root of this project. A template (.env.template) has been provided as an example (see comments in this file for details on how to configure them). The production values are saved as a secure note in Dashlane.
Create a virtual environment if you haven't already: python -m venv virtualenv_mapper
Enter the virtual environment: source virtualenv_mapper/bin/activate
Then, install the requirements: pip install -r requirements.txt
Before running anything, make sure your environment variables are configured appropriately, especially YEAR and either MONTH or QUARTER.
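For example, a minimal .env for processing a single month might look like the following. Only YEAR, MONTH, and QUARTER are named in this README; any other required variables (database credentials, etc.) are documented in .env.template:

```shell
# Process August 2020 (set MONTH or QUARTER, not both)
YEAR=2020
MONTH=8
# QUARTER=3   # alternatively, process a quarterly aggregate
```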
Then, you can run the entire processing pipeline: python process.py
You can also run each part individually:
Just the mapper: python map.py
Just the cleaner: python cleaner.py
TODO: geolocator?
See also:
- db_sync.py, which can be run using python db_sync.py
- reprocess_all.py, which can be run using python reprocess_all.py
- upload_rldb_to_db.py, which can be run using python upload_rldb_to_db.py
TODO Update this
- data
- TODO
- mapped.csv
- shapefile names are: towns_MA.shp, comm_type.shp, census_tract.shp, 1partner_city_nhoods.shp, 2Bos_neighborhoods.shp, 3MARKET AREAS NEW_region.shp
- output
- All files generated by scripts in this repository (including outputs that are inputs to later stages of the pipeline)
- TODO
This script runs all parts of the data pipeline for the currently-specified time period (configured using environment variables). Note: This still needs to be rewritten to make use of this new monorepo.
This script pulls data from the table that is populated by the scraper module, and maps the data into a format consumable by the cleaner (either as a CSV, writing to the mapped database table, or both).
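Conceptually, the mapping step renames raw scraper columns into the schema the cleaner expects and writes the result out as CSV. The sketch below illustrates that shape only; the field names and the map_row helper are hypothetical, not the actual map.py implementation:

```python
import csv
import io

# Hypothetical mapping from raw scraper column names to the cleaner's schema.
FIELD_MAP = {"listing_price": "price", "listing_title": "title", "post_date": "date"}

def map_row(raw: dict) -> dict:
    """Rename raw scraper columns to the names the cleaner consumes."""
    return {new: raw.get(old) for old, new in FIELD_MAP.items()}

def write_mapped_csv(rows, out):
    """Write mapped rows to a CSV stream consumable by the cleaner."""
    writer = csv.DictWriter(out, fieldnames=list(FIELD_MAP.values()))
    writer.writeheader()
    for raw in rows:
        writer.writerow(map_row(raw))

buf = io.StringIO()
write_mapped_csv(
    [{"listing_price": "2500", "listing_title": "2BR in Cambridge", "post_date": "2020-08-01"}],
    buf,
)
print(buf.getvalue().splitlines()[0])  # price,title,date
```

The same mapped dictionaries could just as easily be inserted into the mapped database table instead of (or in addition to) the CSV.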
This script reads the output of the mapper (either the file or the mapped database table), performs cleanup and enrichment of the data, and writes the output to a file and/or the cleaned database table.
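The cleanup stage can be pictured as normalization plus deduplication. The real cleaner uses the dedupe library for fuzzy matching; the sketch below is a simplified stand-in that only normalizes prices and removes exact duplicates:

```python
def clean_listings(rows):
    """Normalize prices and drop exact-duplicate listings.

    Simplified stand-in for the cleaner: the real pipeline uses the dedupe
    library for fuzzy matching, while this removes exact duplicates on
    (address, price) and coerces price strings to floats.
    """
    seen = set()
    cleaned = []
    for row in rows:
        try:
            price = float(str(row["price"]).replace("$", "").replace(",", ""))
        except (KeyError, ValueError):
            continue  # skip rows with missing or unparseable prices
        key = (row.get("address", "").strip().lower(), price)
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        cleaned.append({**row, "price": price})
    return cleaned

rows = [
    {"address": "1 Main St", "price": "$2,500"},
    {"address": "1 MAIN ST ", "price": "2500"},  # duplicate after normalization
    {"address": "2 Elm St", "price": "oops"},    # dropped: bad price
]
print(len(clean_listings(rows)))  # 1
```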
TODO: pull in and update geolocation code from https://github.com/mapc/rental-listing-geolocator
Utility script for syncing cleaner output files (i.e., <timestamp>_listings_unique.csv) to the database.
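At its core the sync is "read a cleaner output CSV, insert each row, skip duplicates." The sketch below shows that pattern against SQLite for portability; the real script targets the rental-listings Postgres database, and the table and column names here are illustrative:

```python
import csv
import io
import sqlite3

def sync_csv_to_db(conn, csv_text):
    """Load cleaner output rows into a listings table and return the row count.

    INSERT OR IGNORE mirrors the idea of tolerating duplicate primary keys
    when the same file is synced more than once.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS listings (id TEXT PRIMARY KEY, price REAL)")
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        conn.execute(
            "INSERT OR IGNORE INTO listings (id, price) VALUES (?, ?)",
            (row["id"], float(row["price"])),
        )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0]

conn = sqlite3.connect(":memory:")
csv_text = "id,price\na1,2500\na1,2500\nb2,1800\n"
print(sync_csv_to_db(conn, csv_text))  # 2
```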
Utility script which runs process.py against all available rental listing data (i.e., since 01/01/2020), including quarterly and yearly aggregates (in addition to monthly). Note: this will run for a long time.
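Reprocessing everything amounts to enumerating every monthly, quarterly, and yearly period since January 2020 and running the pipeline once per period. A sketch of that enumeration, under the assumption that only completed months/quarters/years are processed (the actual runner invocation is not shown):

```python
from datetime import date

def periods_since(start_year=2020, today=None):
    """Return every (kind, year, value) period from start_year through last month:
    monthly periods, plus quarterly and yearly aggregates once completed."""
    today = today or date.today()
    periods = []
    for year in range(start_year, today.year + 1):
        last_month = 12 if year < today.year else today.month - 1
        for month in range(1, last_month + 1):
            periods.append(("month", year, month))
        for quarter in range(1, last_month // 3 + 1):
            periods.append(("quarter", year, quarter))
        if last_month == 12:
            periods.append(("year", year, None))
    return periods

# Each period would then be run by setting YEAR and MONTH/QUARTER in the
# environment before invoking process.py -- one run per period, which is
# why reprocess_all.py takes a long time.
print(len(periods_since(2020, today=date(2021, 1, 1))))  # 17
```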
Crawls the K: drive (or a local copy of the rental-listings-data-analysis outputs) and uploads each month's rental listings data to pg.mapc.org. Tries to handle the various file encodings used for the CSVs on K:, and generally tries to fail gracefully if they cannot be read or the data can't be written to the DB. Uses a simple in-memory cache to avoid duplicate PK errors when processing multiple months at once.
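The encoding handling can be sketched as a try-each-encoding loop; the specific encoding list below is an assumption for illustration, not necessarily what the script uses:

```python
from typing import Optional

def read_csv_bytes(data: bytes) -> Optional[str]:
    """Try a sequence of encodings and return decoded text, or None if all fail."""
    for encoding in ("utf-8-sig", "utf-8", "cp1252"):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None  # caller logs and skips this file ("fail gracefully")

print(read_csv_bytes("déjà vu".encode("utf-8")))  # déjà vu
```

Returning None (rather than raising) lets the crawler log the unreadable file and move on to the next month, matching the "fail gracefully" behavior described above.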
An overview of the steps involved is on Gitbook, but the specifics for the data processing pipeline are as follows:
- rental-listing-cleaner
- If the municipality has them, get the geometries and names of their neighborhoods. In lieu of neighborhoods, we can also use census tracts (like Arlington does), but the user experience is degraded (the RLDB site will show census tract IDs in the Neighborhood dropdowns instead of human-readable names)
- If neighborhoods (or census tracts) are available, the data processing scripts will need to be updated to make use of them
- If neighborhoods or census tracts are not available, the front-end should still function, but controls related to neighborhood selection will be disabled, and data aggregations at the neighborhood level will not be available
- rental-listings-data-analysis
- The data analysis scripts will need to be updated to create a new directory for the municipality, which should contain at least the unique + cleaned rental listings for that municipality (e.g. CAMBRIDGE/listings_CAMBRIDGE_unique_clean_full_units_2020-8.csv)
- rldb
- The upload_rldb_to_db.py script in the rldb repo should be updated to include the new municipality.
Once all of the code changes have been made, the data will need to be reprocessed by each of the components above, in order, to get the updated data into the rental-listings database. You can confirm the data has been imported successfully by querying the units table in that database (e.g., select count(id) from units where muni='BOSTON'; returns the total number of units added).
Instructions for updating the application code to include the new municipality can be found in the README here.