Data pipeline

Eamon Caddigan edited this page Apr 17, 2016 · 2 revisions

Automating the classification process will require a new data pipeline. Beyond classification itself, the challenges include:

  • Matching donors across multiple records/campaigns/etc.
  • Deduping entries across amended submissions.
  • Matching entities to campaigns.

While we continue to focus on municipal elections, manually matching entities to campaigns remains feasible, but this will need to change in the long term.

Matching donors across records is a much bigger challenge. Donors can be identified by name and address, but these fields are plagued by typos and inconsistent formatting (e.g., "firstname lastname"; "lastname, firstname"; "firstname middleinitial lastname, honorific"). I recently found libpostal, an external library that does an excellent job of normalizing addresses, and have installed it on the server.
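As a rough illustration of the name side of this problem, here is a minimal sketch that reduces the name formats listed above to a single "firstname lastname" key. The function names, the honorific list, and the decision to drop middle initials are all assumptions made for illustration, not part of the existing pipeline. The address step is a plain whitespace/case cleanup standing in for the normalization that libpostal would perform.

```python
import re

# Hypothetical list of honorifics and suffixes to ignore; extend as needed.
HONORIFICS = {"mr", "mrs", "ms", "dr", "jr", "sr", "ii", "iii", "esq"}


def normalize_name(raw: str) -> str:
    """Reduce a donor name to a lowercase 'firstname lastname' key."""
    text = raw.lower().replace(".", "")
    # Split on commas and drop parts that are only an honorific ("..., Jr.").
    parts = [p.strip() for p in text.split(",") if p.strip()]
    parts = [p for p in parts if p not in HONORIFICS]
    if len(parts) >= 2:
        # "lastname, firstname" -> "firstname lastname"
        text = " ".join(parts[1:] + parts[:1])
    else:
        text = parts[0] if parts else ""
    # Drop honorifics that appear inline ("Dr John Smith").
    tokens = [t for t in re.split(r"\s+", text) if t and t not in HONORIFICS]
    # Drop single-letter middle initials, keeping the first and last tokens.
    if len(tokens) > 2:
        middle = [t for t in tokens[1:-1] if len(t) > 1]
        tokens = [tokens[0]] + middle + [tokens[-1]]
    return " ".join(tokens)


def donor_key(name: str, address: str) -> tuple:
    """Build a matching key from a name and an address.

    The address cleanup here is only a placeholder (assumption); in the
    real pipeline the address would be normalized with libpostal first.
    """
    clean_address = " ".join(address.lower().split())
    return (normalize_name(name), clean_address)
```

With this sketch, "John Smith", "Smith, John", and "John Q. Smith, Jr." all map to the same name key, so records that differ only in name formatting would share a `donor_key`. It does nothing about typos, which would need fuzzy matching on top.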