TocHeaders: byte order marks in TOC prevent recognition of the document structure #308

@soelderer

I'm extracting this paper to markdown like this:

import pymupdf
import pymupdf4llm

doc = pymupdf.open(absolute_path)
headers = pymupdf4llm.TocHeaders(doc)
text = pymupdf4llm.to_markdown(doc, hdr_info=headers)

I noticed that the TocHeaders start with UTF-8 byte order marks: '\ufeffEffects of open-label placebos across populations and outcomes: an updated systematic review and meta-analysis of randomized controlled trials'. This prevents recognising the document structure, because title.startswith(text) fails.

For a quick fix, you could just strip the BOM in get_header_id: #309
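To illustrate the idea, here is a minimal sketch of stripping the BOM before the prefix comparison. It does not reproduce the actual code of get_header_id or of #309; the helper names strip_bom and title_matches are hypothetical, and the only assumption carried over from the report is that the match boils down to title.startswith(text).

```python
BOM = "\ufeff"  # U+FEFF, the byte order mark seen at the start of the TOC title


def strip_bom(title: str) -> str:
    """Drop any leading BOM characters, then trim surrounding whitespace."""
    return title.lstrip(BOM).strip()


def title_matches(toc_title: str, line_text: str) -> bool:
    """Hypothetical stand-in for the startswith check described above."""
    line_text = line_text.strip()
    if not line_text:
        # An empty string is a prefix of everything, so reject it explicitly.
        return False
    return strip_bom(toc_title).startswith(line_text)


raw_title = "\ufeffEffects of open-label placebos across populations and outcomes"
page_line = "Effects of open-label placebos"

print(raw_title.startswith(page_line))      # comparison on the raw TOC title
print(title_matches(raw_title, page_line))  # comparison after stripping the BOM
```

The same normalisation could presumably be applied once when the TOC is read instead of on every comparison; which place is better depends on how TocHeaders stores its titles.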
