seed-data generation script #1

@behas

We need a script that generates a seed data file for the Library of Congress Maphub instance.

Each "map record" in the seed data file includes:

  • pointers to the map image file URIs
  • selected metadata fields (title, description, subject, creator)

The maps are already in place, as are scripts for downloading metadata from the LoC's GMD collection (see the scripts directory). The new script has to read the map identifiers, iterate over the harvested metadata records, identify the records that match a map identifier, and output a Maphub map record for each match.
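The matching step could be sketched as follows. This is only an illustration under assumptions not stated in this issue: that each harvested record carries Dublin Core `dc:identifier` elements, and that the map identifier equals the last path segment of one of those values. The function name `find_map_id` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Set

# Dublin Core element namespace, as used in oai_dc records.
DC = "{http://purl.org/dc/elements/1.1/}"


def find_map_id(record_xml: str, map_ids: Set[str]) -> Optional[str]:
    """Return the map id referenced by a metadata record, or None.

    Assumption: the id is the last path segment of a dc:identifier value.
    """
    root = ET.fromstring(record_xml)
    for element in root.iter(DC + "identifier"):
        value = (element.text or "").strip()
        # Compare whole segments so a short id cannot match a longer one.
        candidate = value.rstrip("/").rsplit("/", 1)[-1]
        if candidate in map_ids:
            return candidate
    return None
```

Comparing whole path segments rather than substrings avoids one identifier matching another that merely starts with the same characters.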

The challenging part of this script is selecting the appropriate metadata fields from the OAI-PMH records. We want only those that carry "relevant" semantics about the map. Some data cleansing steps (whitespace, special characters, etc.) might also be necessary. In the end, the metadata needs to be indexed by Apache Solr / Lucene.
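A minimal sketch of field selection and cleansing, assuming the records use Dublin Core elements and that the four fields listed above map directly to `dc:title`, `dc:description`, `dc:subject`, and `dc:creator`. The names `clean` and `select_fields` are hypothetical, and which fields count as "relevant" remains an open question.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"
# Assumed whitelist, taken from the field list in this issue.
FIELDS = ("title", "description", "subject", "creator")


def clean(text):
    """Normalize Unicode, drop control characters, collapse whitespace."""
    text = unicodedata.normalize("NFC", text or "")
    # Control characters (including tabs and newlines) become spaces.
    text = "".join(" " if unicodedata.category(ch) == "Cc" else ch
                   for ch in text)
    return re.sub(r"\s+", " ", text).strip()


def select_fields(record_xml):
    """Return {field: [cleaned values]} for the whitelisted fields only."""
    root = ET.fromstring(record_xml)
    selected = {}
    for name in FIELDS:
        values = [clean(el.text) for el in root.iter(DC + name)]
        values = [v for v in values if v]  # skip empty elements
        if values:
            selected[name] = values
    return selected
```

Fields are kept as lists because a record may repeat an element, for example several subjects.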

The result should be a script, generate-loc-seeddata, which takes as input the directory of map image files, a directory of XML files (the metadata records), and a set of identifiers (probably a TXT file), and generates an output file, loc-seeddata.yaml.
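One way to emit a record into loc-seeddata.yaml without extra dependencies is sketched below. The key names (`identifier`, `title`, and so on) and the helper `to_yaml_record` are hypothetical; the actual record layout Maphub expects is not specified in this issue. The sketch relies on the fact that JSON strings and arrays are also valid YAML flow scalars and sequences.

```python
import json


def to_yaml_record(record):
    """Render one dict as an item of a YAML sequence.

    Values are written with json.dumps, which yields quoted strings and
    flow-style lists that YAML parsers accept.
    """
    lines = []
    for index, (key, value) in enumerate(record.items()):
        prefix = "- " if index == 0 else "  "
        rendered = json.dumps(value, ensure_ascii=False)
        lines.append(f"{prefix}{key}: {rendered}")
    return "\n".join(lines) + "\n"
```

Quoting every value sidesteps YAML's implicit typing, so a title such as "1862" or "no" stays a string.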

Possible execution:

generate-loc-seeddata maps/ metadata/*.xml

generate-loc-seeddata -n 10 maps/ metadata/*.xml (processes only 10 maps)
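The command line shown above could be parsed roughly like this. The positional arguments and `-n` follow the examples in this issue; the `--ids` and `-o` options are assumptions added for illustration, since the examples do not show how the identifier file or the output path would be passed.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="generate-loc-seeddata",
        description="Generate a Maphub seed data file from LoC metadata.",
    )
    parser.add_argument("maps_dir", help="directory of map image files")
    parser.add_argument("metadata", nargs="+",
                        help="harvested XML metadata records")
    parser.add_argument("-n", type=int, default=None, metavar="COUNT",
                        help="limit output to COUNT maps")
    # Hypothetical options, not part of the original proposal.
    parser.add_argument("--ids", default=None,
                        help="TXT file with one map identifier per line")
    parser.add_argument("-o", "--output", default="loc-seeddata.yaml",
                        help="output file")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

The shell expands `metadata/*.xml` before the script runs, which is why the metadata argument accepts one or more file names.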