This repository was archived by the owner on Apr 15, 2024. It is now read-only.

Equivalent of pdfminer.high_level.extract_text from pdfminer.six, but using pdfminer package #318

@Lucas-C

Hi!

I have moved from pdfminer.six to this pdfminer package,
and I needed an equivalent of pdfminer.high_level.extract_text().

In case it is useful to others making the same migration, here is the extract_text() function I used:

from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

def extract_text(pdf_file, password="", page_numbers=None, maxpages=0, caching=True, laparams=None):
    """
    Equivalent of pdfminer.high_level.extract_text from pdfminer.six, but with pdfminer package.
    Inspired by https://github.com/euske/pdfminer/blob/master/tools/pdf2txt.py
    """
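    # pdfminer.six falls back to default layout parameters when none are given;
    # without layout analysis the converter emits characters without line breaks.
    if laparams is None:
        laparams = LAParams()
    outfp = StringIO()
    # The resource manager caches shared resources such as fonts across pages.
    rsrcmgr = PDFResourceManager(caching=caching)
    # TextConverter writes the text of each processed page to outfp.
    device = TextConverter(rsrcmgr, outfp, laparams=laparams)
    try:
        with open(pdf_file, 'rb') as fp:
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            pages = PDFPage.get_pages(fp, page_numbers, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True)
            for page in pages:
                interpreter.process_page(page)
    finally:
        # Close the device even if a page fails to parse.
        device.close()
    return outfp.getvalue()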
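
For anyone calling this with page_numbers, here is a minimal sketch of building that argument from a human-friendly page spec. It assumes page_numbers holds zero-based page indices (pdf2txt.py appears to subtract one from each value given to -p). The name to_page_numbers is a hypothetical helper for illustration, not part of pdfminer or pdfminer.six.

```python
def to_page_numbers(spec):
    """Convert a 1-based page spec such as "1,3-5" into a set of 0-based indices.

    Hypothetical helper, not part of pdfminer. Assumes the page_numbers
    argument of extract_text() expects zero-based page indices.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if "-" in part:
            first, last = part.split("-", 1)
            first, last = int(first), int(last)
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError("invalid page range: %r" % part)
        # Shift from 1-based inclusive pages to 0-based indices.
        pages.update(range(first - 1, last))
    return pages
```

The result can then be passed along, for example extract_text("report.pdf", page_numbers=to_page_numbers("1,3-5")), where report.pdf is a placeholder file name.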
