Description
I put this in Slack too (link), but am experimenting with workflow stuff. Writing up the task we agreed on today:
We are defining highest-quality duplicate according to the following criteria: a) clearest text: color is best, black and white next, halftone or photocopied least preferred; b) metadata and email body are all on the same page; c) email has complete metadata and follows the most standard format; d) fewest spelling errors in the OCR rendering.
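As a rough sketch of how criteria (a)-(d) could be applied programmatically, the following ranks duplicates by a tuple sort key. The field names (`clarity`, `same_page`, `metadata_complete`, `ocr_errors`) and clarity categories are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical scoring sketch for criteria (a)-(d); field names are assumptions.

# (a) Lower rank = clearer text: color best, halftone/photocopy least preferred.
CLARITY_RANK = {"color": 0, "black_and_white": 1, "halftone": 2, "photocopy": 2}

def score(dup):
    """Return a sort key for one duplicate; the smallest key wins."""
    return (
        CLARITY_RANK[dup["clarity"]],          # (a) clearest text
        0 if dup["same_page"] else 1,          # (b) metadata + body on one page
        0 if dup["metadata_complete"] else 1,  # (c) complete, standard metadata
        dup["ocr_errors"],                     # (d) fewest OCR spelling errors
    )

def preferred(duplicates):
    """Pick the 'favorite twin' from a set of duplicates."""
    return min(duplicates, key=score)
```

Because the key is a tuple, earlier criteria strictly dominate later ones; a weighted sum (discussed below) would instead let a strong showing on one criterion offset a weak one elsewhere.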
Hannah will experiment with a couple of categories to determine which iteration of a Type A (or B) duplicate is preferred. Possibilities discussed: the better-quality DEQs with fewer duplicates; (for Type B) duplicates appearing in shorter bookmarks; timestamps with the most metadata; fewer spelling errors (a proxy for OCR quality). Hannah will pass the proposed major categories by Louise.
Hannah will generate a sample spreadsheet of some preferred iterations of duplicates along with their non-preferred versions and give the list to Terry. If we are deciding between two algorithms with different results, this may be two lists.
For each duplicate set, Terry will hand-check whether the "preferred" result generated through the algorithm is actually the best-quality iteration of the duplicated material (i.e., should this favorite twin be the favorite twin) and return the results, marking whether each posited best version is actually the best ("yes" in another column) or, if not, which version(s) of the duplicate is clearer/better instead.
Then we will discuss results. If things are "good enough" (i.e., this system pulls the actual best version 90+% of the time), we will repeat the same process for the other type of duplicate and feed the results to Matthew for the database.
If we are at less than 90% "yes" matches, we will generate some potential changes: different categories to prioritize, or different weights of categories in the algorithm (i.e., this component is most important). Then Hannah will pull a new set of results and Terry will check for improvement.
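The 90% check and the re-weighting step above could be sketched as follows. The `yes`/`no` column values, the weights, and the field names are all illustrative assumptions about how the spreadsheet and algorithm might be structured:

```python
# Sketch of the "yes"-rate check and the weighted re-scoring fallback;
# weights and field names are assumptions, not agreed-on values.

def yes_rate(checks):
    """Fraction of duplicate sets where the hand-check confirmed the pick."""
    return sum(1 for c in checks if c == "yes") / len(checks)

# If yes_rate(...) < 0.90, adjust these weights and re-score.
WEIGHTS = {"clarity": 4.0, "same_page": 2.0, "metadata_complete": 2.0, "ocr_errors": 1.0}

def weighted_score(dup):
    """Lower is better; components mirror criteria (a)-(d)."""
    return (
        WEIGHTS["clarity"] * dup["clarity_rank"]
        + WEIGHTS["same_page"] * (0 if dup["same_page"] else 1)
        + WEIGHTS["metadata_complete"] * (0 if dup["metadata_complete"] else 1)
        + WEIGHTS["ocr_errors"] * dup["ocr_errors"]
    )
```

Unlike a strict-priority sort, a weighted sum lets the team tune how much each criterion matters (e.g., making clarity dominate) and re-run the comparison against Terry's hand-checks.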
Our goal is to finish at least one of the two duplicate-evaluating processes within a month (end of August) to feed to the database.