Added script with multiple approaches to get pdf title name #50

qiu-tiandev · 2025-11-20T12:02:02Z

Added get_pdf_title.py, which tries to get the PDF title by attempting the following sources in order:

1. Title metadata
2. First line in the first 2 pages in which less than 30% of the content is digits (to avoid picking up dates)
3. First readable line in the first 2 pages
4. First readable line in the entire document
5. Untitled_SHA1HASH as a last resort
From issue Generate Better PDF Titles #32
@kylebd99
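
For reviewers, a minimal sketch of the fallback order described above. It is not the code in get_pdf_title.py: it works on page text that has already been extracted, so it needs no PDF library. The names `infer_title`, `digit_ratio`, `is_readable`, and `first_line` are hypothetical. It also assumes that "readable" means "contains at least one letter", that step 2 takes the first line passing the digit check, and that the hash is taken over the raw file bytes.

```python
import hashlib
from typing import Callable, Optional, Sequence

MAX_DIGIT_RATIO = 0.30  # the 30% threshold from the description


def digit_ratio(line: str) -> float:
    """Fraction of non-whitespace characters that are digits."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 1.0
    return sum(c.isdigit() for c in chars) / len(chars)


def is_readable(line: str) -> bool:
    """Assumption: a line is readable if it contains at least one letter."""
    return any(c.isalpha() for c in line)


def first_line(pages: Sequence[str],
               accept: Callable[[str], bool]) -> Optional[str]:
    """Return the first non-empty, stripped line that `accept` approves."""
    for page in pages:
        for raw in page.splitlines():
            line = raw.strip()
            if line and accept(line):
                return line
    return None


def infer_title(metadata_title: Optional[str],
                pages: Sequence[str],
                pdf_bytes: bytes) -> str:
    # 1. Title metadata, if present and non-blank.
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()

    head = pages[:2]
    candidates = (
        # 2. First line in the first 2 pages that is not mostly digits.
        first_line(head, lambda l: digit_ratio(l) < MAX_DIGIT_RATIO),
        # 3. First readable line in the first 2 pages.
        first_line(head, is_readable),
        # 4. First readable line in the entire document.
        first_line(pages, is_readable),
    )
    for candidate in candidates:
        if candidate:
            return candidate

    # 5. Last resort: a placeholder name built from the SHA-1 of the file.
    return "Untitled_" + hashlib.sha1(pdf_bytes).hexdigest()
```

For example, `infer_title(None, ["2025-11-20\nAnnual Report"], b"")` skips the date line, where 8 of 10 characters are digits, and returns `"Annual Report"`.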

kylebd99 · 2025-12-03T21:04:21Z

Hey! Sorry for the slow response on this, and thank you for working on it!

This looks like a great start for getting better titles, but I'd like to move it around a bit in the code base. In particular, could you generate the pdf title in the function create_metadata_jsons_worker in pdf_to_embed.py? That way it will run when we create the metadata information, and we can store the title within the metadata index. Also, it would be great if you could include a couple of basic tests for the function. For example, create a couple of simple PDFs that demonstrate each of the title inference paths, then make a test file that performs that inference.

Thanks again for the help!

qiu-tiandev · 2025-12-04T06:16:39Z

Hi,
I will work on it ASAP. Thank you!

Added script with multiple approaches to get pdf title name

00ea5c7

kylebd99 self-requested a review December 3, 2025 21:00
