Data pipeline
Eamon Caddigan edited this page Apr 17, 2016 · 2 revisions
Automating the classification process will require a new data pipeline. Besides the classification process, some of the challenges include:
- Matching donors across multiple records/campaigns/etc.
- Deduping entries across amended submissions.
- Matching entities to campaigns.
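One way to approach the deduping item is to treat each amended submission as superseding the earlier versions of the same filing. The sketch below assumes each record is a dict carrying a `filing_id` and an `amendment_number`; both field names are hypothetical and would need to be mapped to whatever the real data provides.

```python
def latest_amendments(records):
    """Keep only the most recent amendment of each filing.

    Assumes (hypothetically) that every record has a "filing_id" shared by
    all versions of a filing and an integer "amendment_number" that grows
    with each amendment.
    """
    latest = {}
    for record in records:
        key = record["filing_id"]
        current = latest.get(key)
        # Replace the stored version only when this one is a later amendment.
        if current is None or record["amendment_number"] > current["amendment_number"]:
            latest[key] = record
    return list(latest.values())


# Illustrative data only, not taken from any real filing.
filings = [
    {"filing_id": "A1", "amendment_number": 0, "amount": 100},
    {"filing_id": "A1", "amendment_number": 1, "amount": 150},
    {"filing_id": "B2", "amendment_number": 0, "amount": 50},
]
deduped = latest_amendments(filings)
```

This drops superseded versions wholesale; it does not catch the same contribution being reported in two unrelated filings.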
Because we are still focusing on municipal elections, manually matching entities to campaigns remains feasible, but this will need to be automated in the long term.
Matching donors across records is a much bigger challenge. Donors can be identified by name and address, but both are plagued by typos and inconsistent formatting (e.g., "firstname lastname"; "lastname, firstname"; "firstname middleinitial lastname, honorific"). I recently found libpostal, an external library that does an excellent job of normalizing addresses, and have installed it on the server.
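For the name half of the problem, a rough sketch of how the formats listed above could be collapsed into a single matching key is shown here. The function name `normalize_name` and the `HONORIFICS` set are illustrative assumptions, not part of the existing pipeline, and the approach deliberately ignores typos, which would need fuzzy matching on top.

```python
# Hypothetical, incomplete list of tokens to ignore when building a key.
HONORIFICS = {"mr", "mrs", "ms", "dr", "jr", "sr", "esq"}


def normalize_name(raw):
    """Reduce a donor name to a (first, last) tuple in lowercase.

    Handles "first last", "last, first", and "first middle last, honorific".
    Middle names and initials are dropped.
    """
    cleaned = raw.lower().replace(".", " ")
    segments = []
    for segment in cleaned.split(","):
        # Drop honorifics and suffixes wherever they appear.
        tokens = [t for t in segment.split() if t not in HONORIFICS]
        if tokens:
            segments.append(tokens)
    if not segments:
        return ("", "")
    if len(segments) >= 2:
        # A remaining comma means "last, first [middle]": put the given names first.
        tokens = segments[1] + segments[0]
    else:
        tokens = segments[0]
    return (tokens[0], tokens[-1])


variants = ["Jane Doe", "Doe, Jane", "Jane Q. Doe, Esq.", "Doe, Jane Q."]
keys = {normalize_name(v) for v in variants}
```

All four variants reduce to the same key, which could then be combined with a normalized address to group donor records. Multi-word surnames and single-word names are not handled well by this sketch.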