Industry-Wide Search for Structured Data.
This workflow retrieves a subset of performing arts organizations from Wikidata, crawls each organization's website with a spider, and computes a structured data score for each site. It was designed to identify potential sources of event data for ETL into Artsdata.
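As a rough illustration of what a structured data score could measure, the sketch below counts schema.org Event entities embedded as JSON-LD in a single crawled page. The scoring rule and the names `JsonLdExtractor`, `iter_nodes`, and `count_event_nodes` are assumptions made for this example; the formula the workflow actually uses is not described here.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def iter_nodes(value):
    """Yield every dict nested anywhere inside parsed JSON-LD."""
    if isinstance(value, dict):
        yield value
        for child in value.values():
            yield from iter_nodes(child)
    elif isinstance(value, list):
        for item in value:
            yield from iter_nodes(item)


def count_event_nodes(html):
    """Hypothetical score: number of JSON-LD nodes whose @type ends in 'Event'."""
    parser = JsonLdExtractor()
    parser.feed(html)
    count = 0
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD instead of failing the whole page
        for node in iter_nodes(data):
            types = node.get("@type", [])
            if isinstance(types, str):
                types = [types]
            elif not isinstance(types, list):
                types = []
            if any(isinstance(t, str) and t.endswith("Event") for t in types):
                count += 1
    return count
```

Matching on the `Event` suffix is a simplification that catches subtypes such as `MusicEvent` and `TheaterEvent` without loading the schema.org vocabulary. A production scorer would likely also look at Microdata and RDFa markup and at the completeness of each event's properties.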
Crawled websites can be consulted in the Website Spider Crawls report. To load a source from that report:
- Go to the Website Spider Crawls report;
- Filter sources by “structured_data_score” to identify sources that are good enough to load;
- Under the “View events” column, click on “Load using databus”;
- Click “Push latest”. The graph takes 10 to 15 seconds to appear in the list of data sources;
- Proceed with data quality validation and nested entity reconciliation activities;
- If the data is good enough to turn on auto-minting and to run on a schedule, add a GitHub workflow to the Artsdata-Orion repo.
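The filtering step in the list above is done in the report's interface, but the same selection can be sketched in code for anyone working from an export of the report. The record layout, the `min_score` threshold, and the example URLs are assumptions for illustration only; the only name taken from the report is the `structured_data_score` column.

```python
def sources_worth_loading(sources, min_score=1):
    """Return sources whose structured_data_score meets the threshold, best first."""
    eligible = [
        source for source in sources
        if source.get("structured_data_score", 0) >= min_score
    ]
    return sorted(
        eligible,
        key=lambda source: source["structured_data_score"],
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical rows, standing in for an export of the report.
    crawled = [
        {"url": "https://example.org/theatre", "structured_data_score": 12},
        {"url": "https://example.org/dance", "structured_data_score": 0},
        {"url": "https://example.org/orchestra", "structured_data_score": 4},
    ]
    for source in sources_worth_loading(crawled, min_score=1):
        print(source["url"], source["structured_data_score"])
```

The right threshold depends on how the score is defined, so treat `min_score` as a starting point to tune against the sources that pass data quality validation.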