Description
We need a script that generates a seed data file for the Library of Congress Maphub instance.
Each "map record" in the seed data file includes:
- pointers to the map image file URIs
- selected metadata fields (title, description, subject, creator)
We already have the maps in place, along with scripts that download metadata from the LoC's GMD collection (see the scripts directory). The new script has to read the map identifiers, iterate over the harvested metadata records, identify the records that match a map (based on the map identifier), and output a Maphub map record for each match.
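The matching step could be sketched as follows. This is only an illustration under assumptions that the issue does not confirm: that each image file's name (without extension) is the map identifier, and that each harvested record has already been parsed into a dict with an "identifier" key. The function names are hypothetical.

```python
from pathlib import Path


def map_ids_from_images(map_dir):
    """Collect map identifiers from a directory of image files.

    Assumption: the file stem (name without extension) is the identifier.
    """
    return {p.stem for p in Path(map_dir).iterdir() if p.is_file()}


def match_records(map_ids, records):
    """Yield (identifier, record) for every record whose identifier is a known map.

    Assumption: each record is a dict with an "identifier" key.
    Records without a matching identifier are skipped silently.
    """
    for record in records:
        identifier = record.get("identifier")
        if identifier in map_ids:
            yield identifier, record
```

Keeping the identifiers in a set makes each lookup cheap, so the harvested records only need to be traversed once.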
The challenging part of this script is selecting the appropriate metadata fields from the OAI-PMH records: we want only those that carry "relevant" semantics about the map. Some data cleansing steps (whitespace, special characters, etc.) might also be necessary. In the end, the metadata needs to be indexed by Apache Solr / Lucene.
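A minimal sketch of field selection and whitespace cleansing, assuming the harvested records use unqualified Dublin Core elements (the oai_dc format that OAI-PMH repositories are required to support). If the GMD records were harvested in another format, the element names and namespace would differ. The function names and the list-valued output are illustrative choices, not part of the issue.

```python
import re
import xml.etree.ElementTree as ET

# Dublin Core element namespace used by oai_dc records (assumed format).
DC_NS = "http://purl.org/dc/elements/1.1/"

# The fields named in this issue; everything else in the record is ignored.
FIELDS = ("title", "description", "subject", "creator")


def clean(text):
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text or "").strip()


def extract_fields(xml_text):
    """Return {field: [values]} for the selected fields of one record.

    Empty values are dropped, and fields with no values are omitted.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for field in FIELDS:
        values = [clean(el.text) for el in root.iter(f"{{{DC_NS}}}{field}")]
        values = [v for v in values if v]
        if values:
            result[field] = values
    return result
```

Values are kept as lists because a Dublin Core record may repeat an element such as subject. Further cleansing of special characters would go into clean once the problematic cases are known.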
The result should be a script, generate-loc-seeddata, which takes a directory of map image files, a directory of XML files (= the metadata records), and a set of identifiers (probably a TXT file) as input and generates an output file, loc-seeddata.yaml.
Possible execution:
generate-loc-seeddata maps/ metadata/*.xml
generate-loc-seeddata -n 10 maps/ metadata/*.xml (for only 10 maps)
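The command line shown above could be parsed roughly like this. It is a sketch only: the argument names, the COUNT placeholder, and the limit helper are hypothetical. It also leaves out the identifier file, because the example invocations do not show how that file would be passed.

```python
import argparse
from itertools import islice


def build_parser():
    """Build a parser for: generate-loc-seeddata [-n COUNT] maps/ metadata/*.xml"""
    parser = argparse.ArgumentParser(prog="generate-loc-seeddata")
    parser.add_argument("-n", type=int, default=None, metavar="COUNT",
                        help="emit at most COUNT map records")
    parser.add_argument("maps_dir", help="directory of map image files")
    parser.add_argument("metadata_files", nargs="+",
                        help="harvested XML metadata records")
    return parser


def limit(records, count):
    """Return the first `count` records, or all of them when count is None."""
    return records if count is None else islice(records, count)
```

Because the shell expands metadata/*.xml before the script runs, the parser receives the metadata records as a list of individual file paths.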