Scrape All Available PDF Metadata #35

Currently, the embedding pipeline only scrapes a minimal subset of the metadata available in PDFs. We may as well retrieve all of the available data:

<img width="1020" height="1034" alt="Image" src="https://github.com/user-attachments/assets/511fdd17-5f21-4980-9b00-548a68fdb775" />

This should just be a matter of updating the SQLite metadata table definition, updating the embedding pipeline to write the additional fields into metadata.json, and updating the generate_index_metadata.py script to insert them.
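As a rough illustration of the table-side change, here is a minimal sketch that takes whatever key/value pairs a PDF library returns and stores all of them, adding columns as new keys show up. Everything in it is an assumption rather than the project's actual code: the table name `metadata`, the `doc_id` key column, and the helper names `normalize_key` and `store_pdf_metadata` are hypothetical, and the input dict is assumed to look like a PDF document-information dictionary (keys such as `/Author` or `/CreationDate`, as returned for example by pypdf's `PdfReader(path).metadata`).

```python
import re
import sqlite3


def normalize_key(key: str) -> str:
    """Turn a PDF info key such as '/CreationDate' into 'creation_date'."""
    name = key.lstrip("/")
    # Insert an underscore at lower-to-upper case boundaries (CreationDate -> Creation_Date).
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Replace anything that is not a letter, digit, or underscore.
    name = re.sub(r"[^0-9A-Za-z_]", "_", name)
    return name.lower()


def store_pdf_metadata(conn: sqlite3.Connection, doc_id: str, raw: dict) -> None:
    """Insert every available metadata field for one PDF (hypothetical schema)."""
    # Stringify values so dates, names, and numbers all fit TEXT columns.
    fields = {normalize_key(k): str(v) for k, v in raw.items() if v is not None}
    # Avoid clashing with the hypothetical key column.
    fields.pop("doc_id", None)

    conn.execute("CREATE TABLE IF NOT EXISTS metadata (doc_id TEXT PRIMARY KEY)")

    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(metadata)")}
    for column in fields:
        if column not in existing:
            # Column names are restricted to [0-9a-z_] above, so quoting is safe.
            conn.execute(f'ALTER TABLE metadata ADD COLUMN "{column}" TEXT')

    columns = ["doc_id", *fields]
    quoted = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(
        f"INSERT OR REPLACE INTO metadata ({quoted}) VALUES ({placeholders})",
        [doc_id, *fields.values()],
    )
    conn.commit()
```

Growing the table dynamically avoids hard-coding a field list, at the cost of a schema that depends on the PDFs seen so far. If a fixed schema is preferred, the same normalized keys could instead be written under a single JSON column or into metadata.json unchanged.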
