
MAPC/rldb


Overview

TODO

This repository contains the data-processing pipeline for rental listings data. It replaces rental-listings-mapper, rental-listings-cleaner, rental-listings-geolocator, and rental-listings-processor, consolidating them into a single repository that uses one language (Python) and removes the dependencies on Docker/docker-compose, Ruby, and R. It also brings a more sophisticated approach to listing deduplication (via the dedupe library), and adds database connections and new processing/reprocessing functionality.

TODO: Diagram

Setup

First, you'll need to set up your environment variables. These can be set using a .env file in the root of this project. A template (.env.template) has been provided as an example (see comments in this file for details on how to configure them). The production values are saved as a secure note in Dashlane.
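For reference, a minimal stdlib-only sketch of how a `.env` file of `KEY=VALUE` lines (with `#` comments, as in `.env.template`) can be loaded into the process environment. This is illustrative only; the actual loading mechanism used by the pipeline may differ (e.g., python-dotenv).

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: KEY=VALUE lines; blank lines and
    '#' comments are ignored. Existing environment variables
    are not overwritten."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```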

Create a virtual environment if you haven't already: python -m venv virtualenv_mapper

Enter the virtual environment: source virtualenv_mapper/bin/activate

Then, install the requirements: pip install -r requirements.txt

Running the code

Before running anything, make sure your environment variables are configured appropriately, especially YEAR and either MONTH or QUARTER.

Then, you can run the entire processing pipeline: python process.py
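As a sketch of how the period-selection logic described above might work, the following resolves YEAR plus either MONTH or QUARTER from the environment. The variable names follow `.env.template`; the return shape and function name are illustrative, not the pipeline's actual API.

```python
import os

def current_period():
    """Resolve the processing period from environment variables.
    Requires YEAR plus exactly one of MONTH or QUARTER."""
    year = int(os.environ["YEAR"])
    month = os.environ.get("MONTH")
    quarter = os.environ.get("QUARTER")
    if month:
        return {"year": year, "month": int(month)}
    if quarter:
        return {"year": year, "quarter": int(quarter)}
    raise ValueError("Set either MONTH or QUARTER alongside YEAR")
```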

You can also run each part individually:

  • Just the mapper: python map.py
  • Just the cleaner: python cleaner.py
  • TODO: geolocator?

See also:

Directories

TODO Update this

  • data
    • TODO
    • mapped.csv
    • shapefile names are: towns_MA.shp, comm_type.shp, census_tract.shp, 1partner_city_nhoods.shp, 2Bos_neighborhoods.shp, 3MARKET AREAS NEW_region.shp
  • output
    • All files generated by scripts in this repository (including outputs that are inputs to later stages of the pipeline)
    • TODO

Files

process.py

This script runs all parts of the data pipeline for the currently-specified time period (configured using environment variables). Note: This still needs to be rewritten to make use of this new monorepo.

map.py

This script pulls data from the table that is populated by the scraper module, and maps the data into a format consumable by the cleaner (either as a CSV, writing to the mapped database table, or both).
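Conceptually, the mapping step renames raw scraper columns into the schema the cleaner consumes. The field names below are hypothetical placeholders (the real mapping lives in map.py); this only illustrates the shape of the transformation.

```python
# Hypothetical field mapping: scraper column -> cleaner column.
FIELD_MAP = {
    "listing_title": "title",
    "price_text": "ask",
    "posted": "post_date",
}

def map_row(raw):
    """Rename raw scraper columns to the cleaner-facing schema,
    dropping any columns the cleaner does not consume."""
    return {dst: raw.get(src) for src, dst in FIELD_MAP.items()}
```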

cleaner.py

This script reads the output of the mapper (either the file or the mapped database table), performs cleanup and enrichment of the data, and writes the output to a file and/or the cleaned database table.
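One typical cleanup step of the kind described above is normalizing scraped rent strings into numeric values. This sketch is an assumption about the sort of work cleaner.py does, not its actual code; the real cleaner also handles deduplication and enrichment.

```python
import re

def clean_rent(text):
    """Parse a scraped rent string like '$2,400/mo' into a float,
    returning None for missing or unparseable values."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None
```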

geolocator.py

TODO: pull in and update geolocation code from https://github.com/mapc/rental-listing-geolocator

db_sync.py

Utility script for syncing cleaner output files (i.e., <timestamp>_listings_unique.csv) to the database.
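Discovering those cleaner outputs might look like the following glob over the output directory. It assumes the `<timestamp>` prefix sorts lexicographically (e.g., ISO-8601-style); the directory name and sort behavior are assumptions, not db_sync.py's actual implementation.

```python
from pathlib import Path

def unique_listing_files(output_dir="output"):
    """Find cleaner outputs named <timestamp>_listings_unique.csv,
    sorted so the most recent timestamp comes last."""
    return sorted(Path(output_dir).glob("*_listings_unique.csv"))
```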

reprocess_all.py

Utility script which runs process.py against all available rental listing data (i.e., since 01/01/2020), including quarterly and yearly aggregates (in addition to monthly). Note: this will run for a long time.
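Enumerating every period since 01/01/2020 could be sketched as below. This covers only the monthly enumeration (reprocess_all.py also reruns quarterly and yearly aggregates), and the function name and signature are illustrative assumptions.

```python
from datetime import date

def monthly_periods(start=date(2020, 1, 1), end=None):
    """Yield (year, month) pairs from start through the end month."""
    end = end or date.today()
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        yield (y, m)
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
```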

upload_rldb_to_db.py

Crawls the K: drive (or a local copy of the rental-listings-data-analysis outputs) and uploads each month's rental listings data to pg.mapc.org. Tries to handle the various file encodings used for the CSVs on K:, and generally tries to fail gracefully if they cannot be read or the data can't be written to the DB. Uses a simple in-memory cache to avoid duplicate PK errors when processing multiple months at once.
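The encoding fallback and the in-memory primary-key cache described above could look roughly like this. The specific encoding list and cache structure are assumptions for illustration; upload_rldb_to_db.py's actual choices may differ.

```python
def read_csv_text(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Try several encodings before giving up. latin-1 accepts any
    byte sequence, so it acts as a last resort."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path}")

seen_ids = set()

def should_insert(pk):
    """In-memory cache: skip rows whose primary key was already
    written in this run, avoiding duplicate-PK errors."""
    if pk in seen_ids:
        return False
    seen_ids.add(pk)
    return True
```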

Other Notes

Adding a new member municipality

An overview of the steps involved is on Gitbook, but the specifics for the data processing pipeline are as follows:

  • rental-listing-cleaner
    • If the municipality has them, get the geometries and names of their neighborhoods. In lieu of neighborhoods, we can also use census tracts (like Arlington does), but the user experience is degraded (the RLDB site will show census tract IDs in the Neighborhood dropdowns instead of human-readable names)
      • If neighborhoods (or census tracts) are available, the data processing scripts will need to be updated to make use of them
      • If neighborhoods or census tracts are not available, the front-end should still function, but controls related to neighborhood selection will be disabled, and data aggregations at the neighborhood level will not be available
  • rental-listings-data-analysis
    • The data analysis scripts will need to be updated to create a new directory for the municipality, which should contain at least the unique + cleaned rental listings for that municipality (e.g. CAMBRIDGE/listings_CAMBRIDGE_unique_clean_full_units_2020-8.csv)
  • rldb
    • The upload_rldb_to_db.py script in the rldb repo should be updated to include the new municipality.

Once all of the code changes have been made, the data will need to be reprocessed by each of the repositories above, in order, to get the updated data into the rental-listings database. You can confirm the data has been imported successfully by querying the units table in that database (e.g., select count(id) from units where muni='BOSTON'; for the total number of units added).

Instructions for updating the application code to include the new municipality can be found in the README here.

About

Data processing pipeline for RLDB
