'pdf' field missing in DocLayNet dataset on HF #13

@Ulipenitz

Hi @piegu!

Thanks for processing the DocLayNet dataset into smaller portions. It really helps with fast experimentation!

It was especially useful to have the byte stream of the PDFs in the dataset, so one does not have to download all those files and build a script to align the dataset with them.

This is a notebook where the field still existed:
https://github.com/piegu/language-models/blob/master/processing_DocLayNet_dataset_to_be_used_by_layout_models_of_HF_hub.ipynb

Is there a reason for removing this field?
It would be really great to have it back!

Thanks & all the best!
