Skip to content

Conversation

@qiu-tiandev
Copy link

@qiu-tiandev qiu-tiandev commented Nov 20, 2025

Added get_pdf_title.py to try to get the pdf title from the following order:

  1. title metadata
  2. first line in the first 2 pages if less than 30% of its content being numbers (prevent dates)
  3. First readable line in the first 2 pages
  4. First readable line in the entire doc
  5. Unititled_SHA1HASH
    From issue Generate Better PDF Titles #32
    @kylebd99

@kylebd99 kylebd99 self-requested a review December 3, 2025 21:00
@kylebd99
Copy link
Collaborator

kylebd99 commented Dec 3, 2025

Hey! Sorry for the slow response on this, and thank you for working on it!

This looks like a great start for getting better titles, but I'd like to move it around a bit in the code base. In particular, could you generate the pdf title in the function create_metadata_jsons_worker in pdf_to_embed.py? That way it will run when we create the metadata information, and we can store the title within the metadata index. Also, it would be great if you could include a couple of basic tests for the function. For example, create a couple of simple PDFs that demonstrate each of the title inference paths, then make a test file that performs that inference.

Thanks again for the help!

@qiu-tiandev
Copy link
Author

Hi,
I will work on it ASAP. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants