Apply schema edits from post-review: Remove fields and update terminology#120
realmarcin wants to merge 3 commits into main
Conversation
Implements comprehensive RO-Crate JSON-LD to D4D YAML transformation with:
- d4d-rocrate skill for interactive transformations
- Transformation scripts with 95.2% D4D schema coverage (83/87 fields)
- Multi-file RO-Crate merging with intelligent conflict resolution
- Informativeness scoring to prioritize RO-Crate sources
- Field-level prioritization (policy, technical, descriptive)
- Automated discovery and batch processing
- D4D schema validation with detailed error reporting

Architecture:
- mapping_loader.py: TSV mapping parser (83 field mappings)
- rocrate_parser.py: RO-Crate JSON-LD structure parser
- d4d_builder.py: D4D YAML builder with transformations
- validator.py: LinkML schema validator
- rocrate_merger.py: Multi-file merge orchestrator
- informativeness_scorer.py: Source ranking by D4D value
- field_prioritizer.py: Conflict resolution rules
- auto_process_rocrates.py: Automated batch processor

Features:
- 3 merge strategies: merge (default), concatenate, hybrid
- Provenance tracking for merged fields
- Detailed merge reports with statistics
- Test files and examples included

Documentation:
- Complete implementation guide in CLAUDE.md
- Multi-RO-Crate merge methodology in notes/
- Makefile targets for common workflows

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Removed classes and attributes:
- ParticipantPrivacy class (D4D_Human module)
- HumanSubjectCompensation class (D4D_Human module)
- participant_privacy attribute (Dataset class)
- participant_compensation attribute (Dataset class)

Updated terminology:
- VulnerablePopulations: Changed "vulnerable" to "at-risk" in descriptions
- Class description: "at-risk populations" instead of "vulnerable populations"
- Attribute descriptions: Updated all references to use "at-risk"
- Dataset.vulnerable_populations: Updated description to use "at-risk populations"

Changes applied to:
- src/data_sheets_schema/schema/D4D_Human.yaml (module)
- src/data_sheets_schema/schema/data_sheets_schema.yaml (main schema)
- src/data_sheets_schema/schema/data_sheets_schema_all.yaml (full merged schema)
- src/data_sheets_schema/datamodel/data_sheets_schema.py (generated Python model)
- project/ (generated artifacts: JSON-LD, JSON Schema, OWL)

All tests pass successfully.

Fixes #113

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Create D4D Slim, a reduced-complexity schema optimized for RO-Crate JSON-LD to D4D YAML transformations. Uses Complete-Class approach (≥50% coverage).

Statistics:
- Full D4D: 74 classes, 680 attributes
- D4D Slim: 5 classes, 237 attributes
- Reduction: 93% fewer classes, 65% fewer attributes
- RO-Crate coverage: 40% of full schema (272 mapped attributes)

Included Classes (≥50% RO-Crate mapping coverage):
1. Dataset (87.9% - 91 attributes)
2. DataSubset (86.0% - 93 attributes)
3. DatasetCollection (83.3% - 24 attributes)
4. Information (82.6% - 23 attributes)
5. Software (50.0% - 6 attributes)

Excluded: 69 detail classes from all modules
- Motivation (7), Composition (15), Collection (7)
- Preprocessing (7), Uses (7), Distribution (3)
- Maintenance (6), Ethics (5), Human (3)
- Data Governance (3), Variables (1), FormatDialect (1)

Design Principles:
- Complete-Class approach (all attrs from ≥50% classes, not cherry-picked)
- Simplified types (complex objects → strings/arrays)
- Coverage documentation (% and unmapped attrs marked on every class)
- Progressive enhancement path (start simple, add detail later)

Critical Gaps in D4D Slim:
- Workflow/Process documentation (11 classes)
- Quality/Validation structures (5 classes)
- Ethical/Compliance details (6 classes)
- Technical metadata (3 classes)
- Distribution/Access specifications (7 classes)

Workarounds: Use simplified string fields for basic documentation

Files:
- src/data_sheets_schema/schema/D4D_Slim.yaml (schema definition)
- notes/D4D_SLIM_ANALYSIS.md (1,505-line coverage analysis)
- project/slim/D4D_Slim.py (Python datamodel)
- project/slim/jsonschema/D4D_Slim.schema.json (JSON Schema)
- project/slim/jsonld/D4D_Slim.jsonld (JSON-LD context)
- project/slim/README.md (usage guide)
- CLAUDE.md (updated with D4D Slim section)

Use Cases:
✓ RO-Crate transformations
✓ Quick dataset documentation
✓ Prototyping catalog systems
✓ Minimal documentation requirements
✓ Flat data structure preference

Migration Path: D4D Slim → Full D4D
- Start with Slim for rapid documentation
- Identify gaps using unmapped attribute comments
- Progressively enhance with detail classes
- Replace string arrays with structured objects

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update: D4D Slim Schema Added

This PR now includes a second major feature: the D4D Slim schema for RO-Crate transformations.

D4D Slim Overview

Created a streamlined schema optimized for RO-Crate JSON-LD to D4D YAML transformations:

Statistics:

Included Classes (≥50% RO-Crate coverage):
Excluded: 69 detail classes from all modules

Design Approach
Use Cases

Perfect for:
Use Full D4D for:
Files Added

Testing

✅ Schema validation: PASSED (6 style warnings, acceptable)

Documentation
This PR now contains:
Pull request overview
This PR updates the core D4D schema per issue #113 by removing human-subject privacy/compensation fields and updating “vulnerable” terminology to “at-risk”, while also introducing a new “D4D Slim” schema and a RO-Crate → D4D transformation/merge toolchain with generated test artifacts.
Changes:
- Remove participant_privacy/participant_compensation from the main schema and delete the underlying ParticipantPrivacy/HumanSubjectCompensation classes.
- Update human-subjects terminology from "vulnerable" → "at-risk" in descriptions/labels.
- Add a new D4D Slim schema plus RO-Crate transformation scripts, Makefile targets, docs, and sample outputs/reports.
Reviewed changes
Copilot reviewed 34 out of 36 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| src/data_sheets_schema/schema/data_sheets_schema.yaml | Removes dataset slots and updates “at-risk populations” wording. |
| src/data_sheets_schema/schema/D4D_Human.yaml | Deletes two human-subject classes; updates “at-risk” language in VulnerablePopulations. |
| src/data_sheets_schema/schema/D4D_Slim.yaml | Adds a new slim LinkML schema intended for RO-Crate-friendly transformations. |
| src/data_sheets_schema/datamodel/data_sheets_schema.py | Regenerates Python datamodel to reflect schema removals and terminology update. |
| project/jsonschema/data_sheets_schema.schema.json | Regenerates JSON Schema to remove slots/defs and update wording. |
| project/jsonld/data_sheets_schema.jsonld | Regenerates JSON-LD artifact to remove slots/classes and update wording. |
| project/slim/README.md | Documents the new D4D Slim schema, scope, and usage. |
| notes/ROCRATE_IMPLEMENTATION_SUMMARY.md | Adds implementation summary for RO-Crate → D4D tooling. |
| notes/MULTI_ROCRATE_MERGE_SUMMARY.md | Adds implementation summary for multi-RO-Crate merge strategy. |
| data/test/transformation_report.txt | Adds example transformation report output. |
| data/test/minimal_d4d_validation_errors.txt | Adds example validation error output for minimal transformation. |
| data/test/minimal_d4d.yaml | Adds example generated D4D YAML output. |
| data/test/minimal-ro-crate.json | Adds a minimal RO-Crate fixture for testing. |
| data/test/CM4AI_merge_test_validation_errors.txt | Adds validation error output from merge test. |
| data/test/CM4AI_merge_test_merge_report.txt | Adds merge report output from merge test. |
| data/test/CM4AI_merge_test.yaml | Adds example merged D4D YAML output. |
| Makefile | Adds RO-Crate transform/merge targets and help text. |
| CLAUDE.md | Documents D4D Slim and RO-Crate transformation workflows. |
| .claude/agents/scripts/validator.py | Adds a Python wrapper around linkml-validate with parsing/suggestions. |
| .claude/agents/scripts/rocrate_to_d4d.py | Adds main CLI orchestrator for single-file transform and multi-file merge. |
| .claude/agents/scripts/rocrate_parser.py | Adds RO-Crate JSON-LD parsing and property flattening utilities. |
| .claude/agents/scripts/rocrate_merger.py | Adds merge logic and reporting for multi-RO-Crate inputs. |
| .claude/agents/scripts/mapping_loader.py | Adds TSV mapping loader for RO-Crate → D4D field mapping. |
| .claude/agents/scripts/informativeness_scorer.py | Adds scoring/ranking for selecting a primary RO-Crate source. |
| .claude/agents/scripts/field_prioritizer.py | Adds field merge strategies and conflict resolution rules. |
| .claude/agents/scripts/d4d_builder.py | Adds value transformation logic for mapped RO-Crate fields. |
| .claude/agents/scripts/auto_process_rocrates.py | Adds auto-discovery + processing strategy runner. |
| .claude/agents/scripts/README.md | Documents the RO-Crate transformation scripts and usage. |
| .claude/agents/d4d-rocrate.md | Adds a “skill” doc describing RO-Crate → D4D usage and troubleshooting. |
# Count fields this source contributed
contributed_fields = sum(
    1 for field, sources in self.provenance.items()
    if name in sources or (i == 0 and "primary" in sources)
)

marker = "(PRIMARY)" if i == 0 else ""
report.append(f"{i+1}. {name} {marker}")
generate_merge_report() assumes the primary source is index 0 and also checks for a literal "primary" marker in provenance, but merge_field() replaces that marker with the actual primary_name. As a result, the per-source contribution counts and the (PRIMARY) marker can be wrong (or always 0 for primary) when primary_index != 0 or when provenance stores normalized names. Track primary_index (or primary source name) on the merger instance and compute contributions/markers using the same source ids used in provenance.
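The fix suggested above can be sketched as follows. This is a minimal illustration, not the repository's actual code: MergeTracker, record(), and report_lines() are hypothetical names standing in for the merger's real methods. The point is that contribution counts and the (PRIMARY) marker are derived from the same source identifiers that provenance stores, rather than from index 0.

```python
class MergeTracker:
    """Sketch: track the primary source by name, not by position."""

    def __init__(self, source_names, primary_name):
        self.source_names = source_names
        self.primary_name = primary_name   # set once, at merge time
        self.provenance = {}               # field -> set of contributing source names

    def record(self, field, source_name):
        # Store the same identifier everywhere; no "primary" placeholder string.
        self.provenance.setdefault(field, set()).add(source_name)

    def contribution_counts(self):
        # Count against the identifiers actually stored in provenance,
        # so the primary source is counted correctly at any index.
        return {
            name: sum(1 for sources in self.provenance.values() if name in sources)
            for name in self.source_names
        }

    def report_lines(self):
        counts = self.contribution_counts()
        return [
            (f"{i + 1}. {name} "
             f"{'(PRIMARY)' if name == self.primary_name else ''}").rstrip()
            + f" - {counts[name]} fields"
            for i, name in enumerate(self.source_names)
        ]
```

With this shape, reordering sources or choosing a non-first primary cannot desynchronize the report from provenance.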
if rocrate_paths:
    # Multi-file merge mode
    f.write(f"# Primary source: {rocrate_paths[0].name}\n")
    if len(rocrate_paths) > 1:
        f.write("# Additional sources:\n")
        for path in rocrate_paths[1:]:
            f.write(f"#   - {path.name}\n")
    f.write(f"# Merged: {datetime.now().isoformat()}\n")
save_d4d_yaml() labels rocrate_paths[0] as the primary source in the header, but merge mode allows selecting a different primary via --primary (or ranking may reorder sources). This can produce incorrect headers and confusion when auditing provenance. Pass the chosen primary path explicitly (or reorder rocrate_paths so the primary is first) to ensure the output header matches the merge behavior.
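One way to apply the reordering option mentioned above: before writing the header, move the chosen primary to the front of the list. This helper is a hypothetical sketch, not code from the PR.

```python
from pathlib import Path


def order_sources(rocrate_paths, primary_path):
    """Return the source paths with the chosen primary first,
    so the `# Primary source:` header written later stays truthful."""
    paths = [Path(p) for p in rocrate_paths]
    primary = Path(primary_path)
    return [primary] + [p for p in paths if p != primary]
```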
# Common error patterns
patterns = [
    # Missing required field: "... is a required field"
    r"'(.+?)' is a required field",
    # Type mismatch: "... Expected type ..."
    r"(.+?): Expected type (.+?), got (.+)",
    # Invalid value: "... is not a valid ..."
    r"'(.+?)' is not a valid (.+)",
    # Enum constraint: "... not in permissible values"
    r"'(.+?)' not in permissible values \[(.+?)\]",
]
parse_validation_errors() patterns don’t match the jsonschema-style messages emitted by linkml-validate (e.g., "'id' is a required property" and "is not of type ... in /field" as seen in the checked-in *_validation_errors.txt files). This means errors won’t be parsed/classified and fix suggestions will often be empty. Extend the regex patterns to cover the actual output formats (required property, type errors with JSON pointer paths, date-time format errors, etc.).
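A sketch of the extended patterns, built from the jsonschema-style messages quoted above ("'id' is a required property", "is not of type ... in /field", "is not a 'date-time' in /field"). The exact message shapes are assumptions based on the checked-in fixtures, so real output should be re-checked before relying on these regexes.

```python
import re

# Hypothetical patterns for jsonschema-style linkml-validate messages.
PATTERNS = [
    ("required", re.compile(r"'(?P<field>[^']+)' is a required property")),
    ("type_error", re.compile(
        r"(?P<value>.+?) is not of type '(?P<expected>[^']+)' in /(?P<field>\S+)")),
    ("format_error", re.compile(
        r"(?P<value>.+?) is not a '(?P<format>[^']+)' in /(?P<field>\S+)")),
]


def classify(message):
    """Return (error_type, captured_groups) for the first matching pattern."""
    for error_type, pattern in PATTERNS:
        match = pattern.search(message)
        if match:
            return error_type, match.groupdict()
    return "unknown", {}
```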
elif error_type == 'format_error':
    if 'date' in field.lower():
        suggestions.append(
            f"Fix date format for '{field}'. Use YYYY-MM-DD format."
        )
    elif 'url' in field.lower() or 'uri' in field.lower():
suggest_fixes() recommends "Use YYYY-MM-DD" for date-related format errors, but the current schema validation output includes "is not a 'date-time'" (ISO 8601 datetime). This guidance is incorrect for those cases and will keep validation failing. Instead, derive the expected format from the error message (e.g., date vs date-time) and suggest an ISO 8601 date-time when the validator indicates 'date-time'.
# Test RO-Crate transformation with minimal example
test-rocrate-transform:
	@echo "Testing RO-Crate transformation with minimal example..."
	@mkdir -p data/test
	$(RUN) python .claude/agents/scripts/rocrate_to_d4d.py \
		--input data/test/minimal-ro-crate.json \
		--output data/test/minimal_d4d.yaml \
		--mapping "$(ROCRATE_MAPPING)" \
		--schema $(ROCRATE_SCHEMA) \
		--validate
The test-rocrate-transform target runs the transformer with --validate but not --strict, and rocrate_to_d4d.py returns success even when validation fails. This makes the Make target name "test" misleading because CI/local runs can appear green while producing invalid YAML. Make the target fail on validation errors (e.g., pass --strict or have the script exit nonzero when --validate fails) so it functions as a real test.
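The script-side half of this fix could be a small exit-status helper; a hypothetical sketch, not the PR's code, showing the intent that validation failures propagate to make/CI:

```python
import sys


def finish(validation_errors, strict=False):
    """Report validation results and pick the process exit status.

    Hypothetical helper: with --validate the script would call this
    instead of always exiting 0, so targets named test-* really fail."""
    if validation_errors:
        for err in validation_errors:
            print(f"VALIDATION ERROR: {err}", file=sys.stderr)
        if strict:
            sys.exit(1)  # nonzero exit lets make/CI detect the failure
        return 1
    return 0
```

The Make target would then pass --strict (or check the return code) so a red validation run stops the build.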
if source_names is None:
    source_names = [
        Path(parser.rocrate_path).name.replace('-ro-crate-metadata.json', '')
        for parser in rocrate_parsers
    ]
In merge_rocrates(), defaulting source_names to a suffix-stripped filename makes provenance sources (e.g., "release") differ from the names used elsewhere (full filenames). This mismatch causes downstream reporting (and any consumer comparing names to file paths) to show zero contributions. Use consistent source identifiers end-to-end (either always full filenames, or always normalized), and pass the same source_names into save_merge_report()/generate_merge_report().
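One way to get the consistency asked for above is a single normalization function used by the merger, the report writer, and the provenance store alike. The helper name is hypothetical; the suffix it strips is the one shown in the diff.

```python
from pathlib import Path

# The one suffix this codebase strips, per the diff above.
_METADATA_SUFFIX = "-ro-crate-metadata.json"


def source_id(path):
    """Return the canonical identifier for an RO-Crate source.
    Every component (merge, provenance, reports) calls this, so
    names can never drift between them."""
    name = Path(path).name
    if name.endswith(_METADATA_SUFFIX):
        return name[: -len(_METADATA_SUFFIX)]
    return name
```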
parser.add_argument(
    '--strict',
    action='store_true',
    help='Fail on missing required D4D fields'
)
The --strict flag help text says it will "Fail on missing required D4D fields", but the implementation only checks for ['title', 'description'] and does not reflect the schema’s required slots (e.g., id in the current schema). Either update the help text to reflect the minimal check, or implement strictness by deriving required fields from the LinkML schema/validation output.
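The schema-driven variant of --strict could look like this sketch: the required slot list would come from the LinkML schema (for example via linkml_runtime's SchemaView, filtering induced slots on their required flag), which is an assumption about how the check would be wired up; the comparison itself is shown self-contained.

```python
def missing_required(d4d_data, required_slots):
    """Sketch for --strict: compare the generated D4D dict against the
    schema's actual required slots instead of a hard-coded
    ['title', 'description'] pair. Empty values count as missing."""
    return [slot for slot in required_slots if not d4d_data.get(slot)]
```

Under --strict, a nonempty return value would abort the run and list the offending slots.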
# Default paths for RO-Crate transformation
ROCRATE_MAPPING = data/ro-crate_mapping/D4D - RO-Crate - RAI Mappings.xlsx - Class Alignment.tsv
ROCRATE_SCHEMA = $(SOURCE_SCHEMA_ALL)
ROCRATE_SCHEMA is set to the full schema ($(SOURCE_SCHEMA_ALL)), but the RO-Crate transformer currently emits simplified primitives for several fields (e.g., creators as strings, many list-typed slots as scalars), which leads to consistent validation failures against the full schema (as shown by the checked-in validation error fixtures). Consider defaulting these Make targets to the slim schema (src/data_sheets_schema/schema/D4D_Slim.yaml) or updating the transformer to produce full-schema-compliant objects/types before validating against SOURCE_SCHEMA_ALL.
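The first option (defaulting to the slim schema) would be a one-line Makefile change; a sketch only, assuming the schema path named in this review:

```make
# Hypothetical default: validate transformer output against D4D Slim,
# which matches the simplified values the transformer currently emits.
ROCRATE_SCHEMA = src/data_sheets_schema/schema/D4D_Slim.yaml
```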
@echo "════════════════════════════════════════════════════════════════"
@echo "  D4D Pipeline: RO-Crate Transformation"
@echo "════════════════════════════════════════════════════════════════"
@echo "make rocrate-to-d4d -- transform single RO-Crate to D4D YAML"
@echo "  (usage: INPUT=rocrate.json OUTPUT=d4d.yaml)"
@echo "make merge-rocrates -- merge multiple RO-Crates into comprehensive D4D"
@echo "  (usage: INPUTS=\"file1.json file2.json\" OUTPUT=d4d.yaml)"
@echo "make auto-process-rocrates -- auto-discover and process all RO-Crates in directory"
@echo "  (usage: DIR=data/ro-crate/PROJECT OUTPUT=d4d.yaml)"
@echo "make merge-cm4ai-rocrates -- merge all CM4AI RO-Crates (release + 2 sub-crates)"
@echo "make test-rocrate-transform -- test single-file transformation"
@echo "make test-rocrate-merge -- test multi-file merge (CM4AI top 2)"
The PR title/description and linked issue (#113) describe only schema edits (removing participant_privacy/participant_compensation and terminology updates), but this change set also adds a new RO-Crate transformation pipeline (new Make targets + scripts) and a new D4D Slim schema. If these additions are intentional, the PR description/scope should be updated; otherwise, consider splitting the RO-Crate/Slim work into a separate PR to keep the schema-change PR focused and easier to review/release.
🆕 Additional Changes: ECO Evidence Type Integration

This PR now also implements #117 - adding ECO (Evidence & Conclusion Ontology) types to distinguish human vs. machine annotation.

What was added:

New Enum:
New Fields (added to both
Example Files (all validated ✅):
Documentation: New "ECO Evidence Type Classification" section in

Key Features:
✅ Backward Compatible - All fields are optional

This PR now addresses both #113 and #117.

Fixes #117
🆕 Additional Changes: Pipeline Ordering (Issue #119)

This PR now also implements #119 - adding implied order of operations for data manipulation steps.

Changes:
Typical Order:
Key Features:
Fixes #119
Summary
This PR implements the schema changes requested in #113 following post-review feedback.
Changes
🗑️ Removed Classes (from D4D_Human.yaml)

- ParticipantPrivacy - Privacy protections and anonymization procedures class
- HumanSubjectCompensation - Compensation/incentives information class

🗑️ Removed Attributes (from data_sheets_schema.yaml)

- participant_privacy - Removed from Dataset class (referenced ParticipantPrivacy)
- participant_compensation - Removed from Dataset class (referenced HumanSubjectCompensation)

✏️ Updated Terminology
Changed "vulnerable" to "at-risk" throughout:
- VulnerablePopulations class description → "Information about protections for at-risk populations"
- vulnerable_groups_included attribute → "Are any at-risk populations included"
- special_protections attribute → "What additional protections were implemented for at-risk populations"
- vulnerable_populations Dataset attribute → "Information about protections for at-risk populations"

Files Modified
- src/data_sheets_schema/schema/D4D_Human.yaml (source module)
- src/data_sheets_schema/schema/data_sheets_schema.yaml (main schema)
- src/data_sheets_schema/schema/data_sheets_schema_all.yaml (generated full schema)
- src/data_sheets_schema/datamodel/data_sheets_schema.py (generated Python datamodel)
- project/jsonld/, project/jsonschema/, project/owl/ (generated artifacts)

Impact
Testing
- make test-schema - PASSED
- make lint-modules - PASSED
- make test-python - PASSED (6 tests, 3 skipped)
- make regen-all - SUCCESS

Verification
Fixes #113
Review checklist: