TocHeaders: byte order marks in TOC prevent recognition of the document structure

I'm extracting [this paper](https://www.nature.com/articles/s41598-025-14895-z) to markdown like this:

```
import pymupdf
import pymupdf4llm

doc = pymupdf.open(absolute_path)  # absolute_path: path to the downloaded PDF
headers = pymupdf4llm.TocHeaders(doc)
text = pymupdf4llm.to_markdown(doc, hdr_info=headers)
```

I noticed that the TOC titles held by TocHeaders start with a byte order mark (BOM, `\ufeff`): `'\ufeffEffects of open-label placebos across populations and outcomes: an updated systematic review and meta-analysis of randomized controlled trials'`. This prevents recognising the document structure, because the `title.startswith(text)` check fails when the title begins with a BOM.

For a quick fix, you could just strip the BOM in `get_header_id`: #309
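
To illustrate the failure and the kind of stripping I mean, here is a minimal standalone sketch. It does not touch pymupdf4llm internals: `strip_bom` is a hypothetical helper name, the title is shortened, and `page_text` is a made-up stand-in for the text extracted from the page.

```python
BOM = "\ufeff"  # U+FEFF, the byte order mark seen at the start of the TOC titles


def strip_bom(title: str) -> str:
    """Drop leading BOM characters, then surrounding whitespace (hypothetical helper)."""
    # str.strip() alone does not remove U+FEFF, so strip it explicitly first.
    return title.lstrip(BOM).strip()


# Shortened version of the TOC title from this issue.
toc_title = "\ufeffEffects of open-label placebos across populations and outcomes"
# Stand-in for the header text found on the page (assumption for illustration).
page_text = "Effects of open-label placebos"

print(toc_title.startswith(page_text))             # False: the BOM breaks the match
print(strip_bom(toc_title).startswith(page_text))  # True once the BOM is removed
```

Applying something like this to the titles before the `startswith` comparison should be enough for this PDF, though I have not checked whether other TOCs carry a BOM elsewhere in the string.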
