This repository was archived by the owner on Apr 15, 2024. It is now read-only.
Can't extract text objects #319
Open
Description
Hi,
When using pdfminer.six to extract text elements from PDF files, I found that it fails in some cases.
Pdf files:
2022 Mar quarterly report_ Ali.pdf
SIA_AR_2021.pdf
Description:
- File 1: no text can be extracted. However, text extraction works after converting the original PDF file to a printed PDF.
- File 2: only part of the text can be extracted.
Code used:
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser


def get_page_layout(
    filename,
    line_overlap=0.5,
    char_margin=1.0,
    line_margin=0.5,
    word_margin=0.1,
    boxes_flow=0.5,
    detect_vertical=True,
    all_texts=True,
):
    """Returns a PDFMiner LTPage object and page dimension of a single
    page pdf. To get the definitions of kwargs, see
    https://pdfminersix.rtfd.io/en/latest/reference/composable.html.

    Parameters
    ----------
    filename : string
        Path to pdf file.
    line_overlap : float
    char_margin : float
    line_margin : float
    word_margin : float
    boxes_flow : float
    detect_vertical : bool
    all_texts : bool

    Returns
    -------
    layout : object
        PDFMiner LTPage object.
    dim : tuple
        Dimension of pdf page in the form (width, height).
    """
    with open(filename, "rb") as f:
        parser = PDFParser(f)
        document = PDFDocument(parser)
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed(
                f"Text extraction is not allowed: {filename}"
            )
        laparams = LAParams(
            line_overlap=line_overlap,
            char_margin=char_margin,
            line_margin=line_margin,
            word_margin=word_margin,
            boxes_flow=boxes_flow,
            detect_vertical=detect_vertical,
            all_texts=all_texts,
        )
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # The input is expected to be a single-page pdf, so the layout
        # of the last (only) processed page is the one returned.
        for page_num, page in enumerate(PDFPage.create_pages(document)):
            interpreter.process_page(page)
            layout = device.get_result()
            width = layout.bbox[2]
            height = layout.bbox[3]
            dim = (width, height)
        return layout, dim
from pdfminer.layout import LTChar, LTImage, LTTextLineHorizontal, LTTextLineVertical


def get_text_objects(layout, ltype="char", t=None):
    """Recursively parses pdf layout to get a list of
    PDFMiner text objects.

    Parameters
    ----------
    layout : object
        PDFMiner LTPage object.
    ltype : string
        Specify 'char', 'image', 'horizontal_text' or 'vertical_text'
        to get LTChar, LTImage, LTTextLineHorizontal and
        LTTextLineVertical objects respectively.
    t : list

    Returns
    -------
    t : list
        List of PDFMiner text objects.
    """
    if ltype == "char":
        LTObject = LTChar
    elif ltype == "image":
        LTObject = LTImage
    elif ltype == "horizontal_text":
        LTObject = LTTextLineHorizontal
    elif ltype == "vertical_text":
        LTObject = LTTextLineVertical
    else:
        # Without this branch an unknown ltype leaves LTObject unbound.
        raise ValueError(f"Unsupported ltype: {ltype!r}")
    if t is None:
        t = []
    try:
        for obj in layout._objs:
            if isinstance(obj, LTObject):
                t.append(obj)
            else:
                # Descend into containers; leaf objects have no _objs
                # and raise AttributeError, which is ignored below.
                t += get_text_objects(obj, ltype=ltype)
    except AttributeError:
        pass
    return t
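To narrow down where the text goes missing, it might help to tally which object types actually appear in the layout tree before filtering by type. The following is only a sketch under stated assumptions: `count_layout_types` is a hypothetical helper, not part of pdfminer.six, and the `Page`, `Figure` and `Char` classes are stand-ins so the snippet runs without a PDF. It only assumes, like the code above, that containers keep their children in `_objs`.

```python
from collections import Counter


def count_layout_types(obj, counts=None):
    """Tally the class name of every node in a layout tree (hypothetical helper)."""
    if counts is None:
        counts = Counter()
    counts[type(obj).__name__] += 1
    # Containers are assumed to keep children in `_objs`; leaves have none.
    for child in getattr(obj, "_objs", ()):
        count_layout_types(child, counts)
    return counts


# Stand-in classes (not pdfminer's) so the sketch is runnable on its own.
class Node:
    def __init__(self, *children):
        self._objs = list(children)


class Page(Node):
    pass


class Figure(Node):
    pass


class Char:
    pass


demo = Page(Figure(Char(), Char()), Char())
print(count_layout_types(demo))
```

Run against the real `layout` returned by `get_page_layout`, a tally with no character or text-line entries would suggest the page carries no text objects that the layout analysis can see, while a tally that contains them would point at the filtering step instead.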