… reconciliation and RDF conversion.
… and ingestion process
Pull Request Overview
This PR introduces a complete overhaul of the SIMSSA DB ingestion pipeline, replacing legacy SQL-based scripts with a cleaner, more maintainable Python-based approach using CSV exports and pandas operations.
Key changes:
- New export and merge scripts using pandas for data processing instead of complex SQL queries
- OpenRefine configuration files for entity reconciliation workflows
- RDF conversion configuration to integrate with the shared conversion pipeline
- Comprehensive documentation updates explaining the database structure and ingestion process
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| simssa/src/export_all_tables.py | Exports all PostgreSQL tables to categorized CSV files |
| simssa/src/merge.py | Merges and processes raw CSVs into consolidated entity files |
| simssa/src/flattening/SQL_query.py | Removed legacy SQL-based flattening script |
| simssa/src/flattening/restructure.py | Removed legacy pandas restructuring script |
| simssa/openrefine/history/history_work.json | Reconciliation workflow for musical works |
| simssa/openrefine/history/history_person.json | Reconciliation workflow for persons |
| simssa/openrefine/export/export_work.json | Export configuration for reconciled work data |
| simssa/openrefine/export/export_person.json | Export configuration for reconciled person data |
| shared/rdf_config/simssadb.toml | RDF conversion configuration for SIMSSA DB entities |
| simssa/README.md | Comprehensive ingestion documentation with database schema overview |
)
# Clean work_title by removing brackets and quotes
work_df["work_title"] = work_df["work_title"].str.replace(
    r"[\[\]'\"’]", "", regex=True
The character class in this regex mixes straight and curly quotes. The pattern [\[\]'\"’] contains a straight single quote, an escaped double quote (\"), and a curly right single quotation mark (’, U+2019), and the curly quote appears unintentional. If both straight and curly quotes should be removed, document that in the comment. If only straight quotes are intended, remove the curly quote character.
Suggested change:
-    r"[\[\]'\"’]", "", regex=True
+    r"[\[\]'\"\"]", "", regex=True
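If the intent is to strip both straight and curly quotes, one way to make that explicit is to write the curly characters as `\u` escapes, so they cannot be mistaken for straight quotes when reading the source. The sketch below is an assumption about the intended behavior, not the code in merge.py; `TITLE_NOISE` and `clean_title` are hypothetical names.

```python
import re

# Assumption: brackets plus straight AND curly quotes should all be removed.
# \u2018 and \u2019 are the curly single quotes, \u201C and \u201D the curly double quotes.
TITLE_NOISE = re.compile(r"[\[\]'\"\u2018\u2019\u201C\u201D]")


def clean_title(title: str) -> str:
    """Strip square brackets and quote characters from a work title."""
    return TITLE_NOISE.sub("", title)


if __name__ == "__main__":
    print(clean_title("['Ave Maria']"))
```

The same pattern string can be passed to `Series.str.replace(..., regex=True)`, since pandas hands regex patterns to Python's `re` module.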
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…alake into simssadb-ingestion
"columns": [
  {
    "name": "person_id",
    "reconSettings": {
      "output": "entity-name",
      "blankUnmatchedCells": false,
      "linkToEntityPages": true
    },
    "dateSettings": {
      "format": "iso-8601",
      "useLocalTimeZone": false,
      "omitTime": false
    }
  },
  {
    "name": "person_name",
    "reconSettings": {
      "output": "entity-id",
      "blankUnmatchedCells": false,
      "linkToEntityPages": true
    },
    "dateSettings": {
      "format": "iso-8601",
      "useLocalTimeZone": false,
      "omitTime": false
    }
  },
This file is labeled as the OpenRefine export settings for work.csv, but the configured columns are person_id, person_name, birth_year, etc., which looks like the person schema (and doesn’t include work_id, work_title, genre_name, etc.). This will cause the wrong columns to be exported for works; update the export settings to match the merged work.csv columns and desired reconciled outputs.
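A small consistency check could catch this kind of mix-up before an export is run. The sketch below assumes the export settings expose a top-level "columns" list whose entries carry a "name" key, as in the excerpt above. The function names and the sample header are hypothetical and not part of the PR.

```python
import json


def exported_column_names(export_settings: dict) -> list:
    """Return the column names listed in an export-settings dict."""
    return [col["name"] for col in export_settings.get("columns", [])]


def missing_columns(export_settings: dict, csv_header: list) -> list:
    """Return configured export columns that are absent from the CSV header."""
    header = set(csv_header)
    return [name for name in exported_column_names(export_settings) if name not in header]


if __name__ == "__main__":
    # Hypothetical inputs: person-style settings checked against a work-style header.
    settings = json.loads('{"columns": [{"name": "person_id"}, {"name": "person_name"}]}')
    work_header = ["work_id", "work_title", "genre_name"]
    print(missing_columns(settings, work_header))
```

A non-empty result means the settings file refers to columns that the target CSV does not have.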
Corrected grammatical errors and improved clarity in README.
# Connect to database
conn = psycopg2.connect(**DB_PARAMS)
cur = conn.cursor()
You should move these two lines inside the try block below. If psycopg2.connect() raises an exception, conn is never assigned, and the finally block will then crash with a NameError on top of the original error.
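One way to apply this suggestion is to initialise both names to None before the try block and guard the cleanup. The sketch below is not the code in export_all_tables.py: `run_export` and its parameters are hypothetical, and `connect` stands in for any DB-API style factory such as `psycopg2.connect`. The demo uses the standard-library `sqlite3` module so it runs without a PostgreSQL server.

```python
import sqlite3


def run_export(connect, db_params, export):
    """Open a connection, run export(cur), and always clean up.

    If connect() itself fails, the original exception propagates
    unchanged, because the finally block only closes what was opened.
    """
    conn = None
    cur = None
    try:
        conn = connect(**db_params)
        cur = conn.cursor()
        return export(cur)
    finally:
        if cur is not None:
            cur.close()
        if conn is not None:
            conn.close()


if __name__ == "__main__":
    # Stand-in for psycopg2.connect(**DB_PARAMS), using an in-memory SQLite database.
    value = run_export(
        sqlite3.connect,
        {"database": ":memory:"},
        lambda cur: cur.execute("SELECT 1").fetchone()[0],
    )
    print(value)
```

With this shape, a failed connection surfaces as the driver's own error rather than a NameError raised during cleanup.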
# The original table stores the role of the contributor as either AUTHOR or COMPOSER
contribution_pivoted = (
    contribution_df.pivot_table(
        index="work_id", columns="role", values="person_id", aggfunc="first"
    )
    .rename(columns={"AUTHOR": "author_id", "COMPOSER": "composer_id"})
    .reset_index()
)
Does SimssaDB have works with multiple composers or authors? If so, aggfunc="first" would silently drop them. Could also add a log warning if duplicates are found, just in case.
Hm, I actually forgot to consider that case. I'll implement something that duplicates the rows and test it out.
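A possible shape for the row-duplicating approach is sketched below. It is an illustration rather than the implementation that ended up in merge.py: `pivot_contributions` is a hypothetical name, and emitting one row per (author, composer) combination is an assumption about the desired output. The column names follow the excerpt above.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def pivot_contributions(contribution_df: pd.DataFrame) -> pd.DataFrame:
    """Return work_id, author_id, composer_id without dropping contributors."""
    # Warn when a work has more than one person in the same role.
    dup_mask = contribution_df.duplicated(subset=["work_id", "role"], keep=False)
    if dup_mask.any():
        n_works = contribution_df.loc[dup_mask, "work_id"].nunique()
        logger.warning("%d work(s) have several people in the same role", n_works)

    is_author = contribution_df["role"] == "AUTHOR"
    is_composer = contribution_df["role"] == "COMPOSER"
    authors = contribution_df.loc[is_author, ["work_id", "person_id"]].rename(
        columns={"person_id": "author_id"}
    )
    composers = contribution_df.loc[is_composer, ["work_id", "person_id"]].rename(
        columns={"person_id": "composer_id"}
    )
    # The outer merge keeps works that lack one role and repeats the work
    # for every author/composer combination instead of keeping only the first.
    return authors.merge(composers, on="work_id", how="outer")


if __name__ == "__main__":
    # Hypothetical sample: work 1 has two composers, work 2 has no author.
    demo = pd.DataFrame(
        {
            "work_id": [1, 1, 1, 2],
            "role": ["COMPOSER", "COMPOSER", "AUTHOR", "COMPOSER"],
            "person_id": [10, 11, 20, 12],
        }
    )
    print(pivot_contributions(demo))
```

Downstream steps that assume one row per work_id would need to account for the repeated rows.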
…conversion config Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ightly more precise. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This PR introduces the scripts, configuration files, and documentation necessary to ingest SIMSSA DB data.
Scripts (simssa/src/)
- export_all_tables.py
- merge.py
Deleted scripts
Both deleted scripts are fully replaced by the new scripts, which are more concise and easier to maintain.
OpenRefine Configuration Files (simssa/openrefine/)
Export Settings
- export/export_person.json
- export/export_work.json
Configure proper export formats for person.csv and work.csv (the only files needing reconciliation).
History / Reconciliation Procedures
- history/history_person.json
- history/history_work.json
Allow users to automatically reapply the same reconciliation steps.
RDF Conversion Config
- shared/rdf_config/simssadb.toml
… shared/rdfconv/convert.py).
Documentation
- simssa/README.md