Open
Description
I have my setup as:
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import convert_to_dict

elements = partition_pdf(
    filename=pdf_path,
    strategy="hi_res",
    chunking_strategy="by_title",
    include_orig_elements=True,
    extract_images_in_pdf=True,
    extract_image_block_types=["Image", "Table"],
    extract_image_block_output_dir=str(self.dirs["images"]),  # Save images to disk
    extract_image_block_to_payload=False,  # Ensure base64 is not used
    include_page_breaks=True,
    languages=self.ocr_languages,
    infer_table_structure=True,
    max_characters=500,
    new_after_n_chars=500,
    overlap=0,
)
elements_dict = convert_to_dict(elements)
This extracts Tables and CompositeElements to a dict, but when I write it to JSON I don't see any metadata that identifies the images that are output, including the table and figure images. My PDF always shows zero images, yet it contains tables and figures that are written to disk as images and that I can't find in the metadata. Why is this?
When I set this to True:
extract_image_block_to_payload=True,  # Embed images as base64 in the payload
then I still don't see anything in the metadata that would let me easily identify a figure or table image and embed it as an image. What am I doing wrong?
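To narrow down where the image references end up, it may help to scan the dict output directly. The sketch below is a hypothetical helper, not part of the library. It assumes image references are stored under the `image_path` or `image_base64` metadata keys, and that with chunking enabled they may sit on the pre-chunk elements kept in `orig_elements` rather than on the chunk itself. It also assumes the serialized `orig_elements` value is a base64-encoded, zlib-compressed JSON list; that encoding should be checked against the installed version.

```python
import base64
import json
import zlib
from typing import Any, Dict, List

# Metadata keys assumed to carry image references.
IMAGE_KEYS = ("image_path", "image_base64")


def decode_orig_elements(blob: str) -> List[Dict[str, Any]]:
    """Decode a serialized orig_elements string (assumed: base64 + zlib + JSON)."""
    return json.loads(zlib.decompress(base64.b64decode(blob)).decode("utf-8"))


def find_image_refs(element_dicts: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collect image references from each chunk and from its original elements."""
    hits = []
    for chunk in element_dicts:
        metadata = chunk.get("metadata", {})
        candidates = [chunk]
        blob = metadata.get("orig_elements")
        if isinstance(blob, str):
            # Chunked output: also look at the elements the chunk was built from.
            candidates.extend(decode_orig_elements(blob))
        for el in candidates:
            el_meta = el.get("metadata", {})
            refs = {k: el_meta[k] for k in IMAGE_KEYS if el_meta.get(k)}
            if refs:
                hits.append(
                    {"chunk_id": chunk.get("element_id"), "type": el.get("type"), **refs}
                )
    return hits
```

Calling `find_image_refs(elements_dict)` returns one entry per element that carries an image reference, tagged with the id of the chunk it belongs to. An empty result would suggest the references are not present in the serialized output at all.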