Open
Labels
enhancement: New feature or request
Description
Hello,
I'm trying to automate the extraction of figures from articles so I can easily integrate them into my reports, but I am not able to extract the vectorized figures.
For the rasterized figures, I can use the following script, although even then it does not extract the inline images:
"""Extract all images from a PDF file."""
import argparse
import os

from pdfreader import SimplePDFViewer


def images_from_viewer(viewer) -> list:
    """Return all images from a PDF viewer.

    Args:
        viewer (SimplePDFViewer): A PDF viewer.

    Returns:
        list: A list of images.
    """
    images = []
    page_count = len(list(viewer.doc.pages()))
    for index, canvas in enumerate(viewer):
        print(f"On page {index + 1}/{page_count}", end="\r")
        page_images = canvas.images
        # print(f'Found {len(page_images)} images on page {index + 1}')
        for page_image in page_images.values():
            images.append(page_image.to_Pillow())
    print()
    return images


def save_images(images: list, path: str) -> None:
    """Save images to a path.

    Args:
        images (list): A list of images.
        path (str): A path prefix to save images to.
    """
    for index, image in enumerate(images):
        image.save(f"{path}_{index}.png", format="png")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("pdf_path", help="Path to PDF file")
    parser.add_argument("image_path", help="Path to save images to")
    args = parser.parse_args()
    pdf_path = args.pdf_path
    image_path = args.image_path
    # Ensure that the parent directory of the image path exists.
    # Skip this when the path has no directory part, because
    # os.makedirs("") raises FileNotFoundError.
    parent_dir = os.path.dirname(image_path)
    if parent_dir:
        os.makedirs(parent_dir, exist_ok=True)
    with open(pdf_path, "rb") as file:
        simple_viewer = SimplePDFViewer(file)
        extracted_images = images_from_viewer(simple_viewer)
    save_images(extracted_images, image_path)

Any idea on how I could also extract the figures from a document like this one?
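Regarding the inline images that the script misses: a minimal sketch, assuming the page canvas exposes them as a plain list under `canvas.inline_images` and that each entry offers the same `to_Pillow()` method as the entries of `canvas.images` (worth checking against the installed pdfreader version). The helper name `pillow_images_from_canvas` is hypothetical. This only covers raster images and does not address the vectorized figures.

```python
def pillow_images_from_canvas(canvas) -> list:
    """Collect XObject images and inline images from one page canvas.

    Assumption: `canvas.images` is a dict of image objects and
    `canvas.inline_images` is a list; both kinds offer `to_Pillow()`.
    """
    images = []
    # Images stored as XObjects, as in the script above.
    for xobject_image in canvas.images.values():
        images.append(xobject_image.to_Pillow())
    # Inline images embedded directly in the page content stream.
    # getattr keeps the helper usable if the attribute is missing.
    for inline_image in getattr(canvas, "inline_images", []):
        images.append(inline_image.to_Pillow())
    return images
```

Inside `images_from_viewer`, the inner loop could then become `images.extend(pillow_images_from_canvas(canvas))`.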