How to extract vectorized figures? #108

@Vincent-Stragier

Hello,

I'm trying to automate the extraction of figures from articles to easily integrate them in my reports, but I am not able to extract the vectorized figures.

For the rasterized figures, I can use the following script, although it does not extract the inline images yet:

"""Extract all images from a PDF file."""
import argparse
import os

from pdfreader import SimplePDFViewer


def images_from_viewer(viewer) -> list:
    """Return all images from a PDF viewer.

    Args:
        viewer (SimplePDFViewer): A PDF viewer.

    Returns:
        list: A list of images.
    """
    images = []
    page_count = len(list(viewer.doc.pages()))

    for index, canvas in enumerate(viewer):
        print(f"On page {index + 1}/{page_count}", end="\r")
        page_images = canvas.images
        # print(f'Found {len(page_images)} images on page {index + 1}')

        for page_image in page_images.values():
            images.append(page_image.to_Pillow())

    print()

    return images


def save_images(images: list, path: str) -> None:
    """Save images to a path.

    Args:
        images (list): A list of images.
        path (str): A path to save images to.
    """
    for index, image in enumerate(images):
        image.save(f"{path}_{index}.png", format="png")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)

    parser.add_argument("pdf_path", help="Path to PDF file")
    parser.add_argument("image_path", help="Path to save images to")

    args = parser.parse_args()

    pdf_path = args.pdf_path
    image_path = args.image_path

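    # Ensure that the parent directory of the image path exists.
    # dirname() is empty when image_path has no directory component,
    # and os.makedirs("") would raise, so only create it when needed.
    parent_dir = os.path.dirname(image_path)
    if parent_dir:
        os.makedirs(parent_dir, exist_ok=True)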

    with open(pdf_path, "rb") as file:
        simple_viewer = SimplePDFViewer(file)
        extracted_images = images_from_viewer(simple_viewer)
        save_images(extracted_images, image_path)
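
For the inline images, a minimal sketch could look like the following. It assumes that each canvas exposes `inline_images` as a list whose entries offer `to_Pillow()`, in the same way as the entries of `canvas.images`; the function name `images_from_canvases` is only illustrative. It does not address vectorized figures.

```python
"""Collect XObject and inline images from pdfreader-style page canvases."""


def images_from_canvases(canvases) -> list:
    """Return Pillow images for every image XObject and inline image.

    Assumption: each canvas exposes ``images`` (a dict of image XObjects)
    and ``inline_images`` (a list), and every entry has ``to_Pillow()``.
    """
    images = []

    for canvas in canvases:
        # Image XObjects are keyed by their resource name.
        for page_image in canvas.images.values():
            images.append(page_image.to_Pillow())

        # Inline images are kept in a plain list on the same canvas.
        for inline_image in canvas.inline_images:
            images.append(inline_image.to_Pillow())

    return images
```

Since iterating over a `SimplePDFViewer` yields one canvas per page (as in the script above), the viewer itself could be passed in, for example `images_from_canvases(simple_viewer)`.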

Any idea how I could also extract the vectorized figures from a document like this one?
