Unable to write images to metadata #15

@meltedhead

I have my setup as:

elements = partition_pdf(
    filename=pdf_path,
    strategy="hi_res",
    chunking_strategy="by_title",
    include_orig_elements=True,
    extract_images_in_pdf=True,
    extract_image_block_types=["Image", "Table"],
    extract_image_block_output_dir=str(self.dirs["images"]),  # Save images to disk
    extract_image_block_to_payload=False,  # Ensure base64 is not used
    include_page_breaks=True,
    languages=self.ocr_languages,
    infer_table_structure=True,
    max_characters=500,
    new_after_n_chars=500,
    overlap=0,
)

elements_dict = convert_to_dict(elements)

This extracts Tables and CompositeElements to a dict, but when I write that dict to JSON I don't see any metadata that identifies the extracted images, including the table and figure images. My PDF always reports zero images, yet the tables and figures are written to disk as image files that are never referenced in the metadata. Why is this?

When I set this to True:

extract_image_block_to_payload=True,  # Embed images as base64 in the metadata

I still don't see anything in the metadata that would let me easily identify a figure or table image and embed it as an image. What am I doing wrong?
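
To narrow this down, here is a minimal sketch of how the elements could be inspected before they are converted to a dict. It is not a confirmed diagnosis. It assumes that, with chunking_strategy="by_title" and include_orig_elements=True, the image fields sit on the pre-chunk elements under metadata.orig_elements rather than on the chunk itself, and that those fields are named image_path and image_base64. The helper collect_image_refs and the demo objects are hypothetical stand-ins, so the snippet runs without the library installed.

```python
from types import SimpleNamespace


def collect_image_refs(chunks):
    """Return one record per original element that carries image data.

    Assumption: each chunk exposes metadata.orig_elements, and each original
    element exposes category plus metadata.page_number, metadata.image_path
    and metadata.image_base64. Missing attributes are treated as None.
    """
    refs = []
    for chunk_index, chunk in enumerate(chunks):
        chunk_meta = getattr(chunk, "metadata", None)
        originals = getattr(chunk_meta, "orig_elements", None) or []
        for orig in originals:
            meta = getattr(orig, "metadata", None)
            image_path = getattr(meta, "image_path", None)
            image_base64 = getattr(meta, "image_base64", None)
            if image_path is None and image_base64 is None:
                continue  # this original element has no image attached
            refs.append(
                {
                    "chunk_index": chunk_index,
                    "category": getattr(orig, "category", None),
                    "page_number": getattr(meta, "page_number", None),
                    "image_path": image_path,
                    "has_base64": image_base64 is not None,
                }
            )
    return refs


def _element(category, **metadata):
    # Hypothetical stand-in for a parsed element, for demonstration only.
    return SimpleNamespace(category=category, metadata=SimpleNamespace(**metadata))


demo_chunks = [
    # Chunk built from a title and a table whose image was saved to disk.
    _element(
        "CompositeElement",
        orig_elements=[
            _element("Title", page_number=1),
            _element("Table", page_number=1, image_path="images/table-1-1.jpg"),
        ],
    ),
    # Chunk without any original elements attached.
    _element("CompositeElement", orig_elements=None),
    # Chunk built from a figure whose image is carried as base64.
    _element(
        "CompositeElement",
        orig_elements=[_element("Image", page_number=2, image_base64="aGVsbG8=")],
    ),
]

demo_refs = collect_image_refs(demo_chunks)
for ref in demo_refs:
    print(ref)
```

If a call like collect_image_refs(elements) on the real output returns an empty list, the image fields are probably not reaching the original elements at all. If it returns records, they are more likely being lost at the chunking or serialization step.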
