This project contains Python scripts that read tables from PDF files (mostly catalogs) and convert them into the selected format, either Excel or HTML.
├── extracted_files/
│ ├── index.html
│ └── extracted_tables_with_headers.xlsx
├── result_images/
│ ├── xlsx_result.png
│ └── html_result.png
├── catalog.pdf
├── pdfToExcel.py
├── pdfToHtml.py
├── README.md
└── style.css
- Clone the repository or download the files to your local machine.
- Ensure you have Python installed (version 3.6 or higher).
- Install the required packages by running the command for the parser you want to use.
  For the HTML parser:
  pip install pdfplumber pandas beautifulsoup4
  For the Excel parser (the fitz module is provided by the PyMuPDF package, and pandas needs openpyxl to write .xlsx files):
  pip install PyMuPDF pandas openpyxl
- pdfToExcel.py uses fitz to open the PDF file and iterate over its pages. With the built-in find_tables method, it looks for table structures and extracts the data from the table cells. Once the table data is extracted, it looks for the table's title by finding the first text element above the table.
- After the data is extracted, it is inserted into an Excel sheet using pandas.
- The Excel file is saved as extracted_tables_with_headers.xlsx.
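
The Excel steps above can be sketched roughly as follows. This is a minimal illustration and not the actual contents of pdfToExcel.py: the function names find_title and pdf_tables_to_excel are hypothetical, and "the first text element above the table" is assumed here to mean the text block whose bottom edge is closest to the top of the table.

```python
def find_title(text_blocks, table_top):
    """Return the text of the nearest text block that ends above table_top.

    text_blocks uses the tuple layout returned by PyMuPDF's
    page.get_text("blocks"): (x0, y0, x1, y1, text, block_no, block_type).
    Returns "" when no text block lies above the table.
    """
    above = [
        b for b in text_blocks
        if b[6] == 0 and b[3] <= table_top and b[4].strip()  # text blocks only
    ]
    if not above:
        return ""
    nearest = max(above, key=lambda b: b[3])  # largest bottom edge = closest
    return nearest[4].strip()


def pdf_tables_to_excel(pdf_path, xlsx_path):
    """Write every table found in the PDF, preceded by its title, to one sheet."""
    # Third-party imports are local so find_title can be used on its own.
    import fitz  # provided by the PyMuPDF package
    import pandas as pd

    rows = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            blocks = page.get_text("blocks")
            for table in page.find_tables().tables:
                rows.append([find_title(blocks, table.bbox[1])])  # title row
                rows.extend(table.extract())  # one list of cell values per row
                rows.append([""])  # spacer row between tables
    pd.DataFrame(rows).to_excel(xlsx_path, index=False, header=False)
```

Assuming the layout shown in the project tree, it could be called as pdf_tables_to_excel("catalog.pdf", "extracted_files/extracted_tables_with_headers.xlsx").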
- pdfToHtml.py opens the PDF using pdfplumber and iterates over its data.
- It creates an HTML structure and appends the PDF data to it inside HTML table tags.
- The HTML file is saved as index.html.
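
The HTML steps above might look something like the sketch below. It is an illustration under assumptions rather than the code in pdfToHtml.py: the function names table_to_html and pdf_to_html are hypothetical, and the markup is built with the standard library's html.escape instead of BeautifulSoup to keep the example short.

```python
from html import escape


def table_to_html(rows):
    """Render one extracted table (a list of rows of cells) as an HTML table."""
    lines = ["<table>"]
    for row in rows:
        # pdfplumber reports empty cells as None, so fall back to "".
        cells = "".join(f"<td>{escape(cell or '')}</td>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


def pdf_to_html(pdf_path, html_path):
    """Collect every table in the PDF into a single HTML file."""
    # Local import so table_to_html works without pdfplumber installed.
    import pdfplumber

    parts = ["<!DOCTYPE html>", "<html>", "<body>"]
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for rows in page.extract_tables():
                parts.append(table_to_html(rows))
    parts += ["</body>", "</html>"]
    with open(html_path, "w", encoding="utf-8") as f:
        f.write("\n".join(parts))
```

Escaping each cell keeps characters such as < and & in the catalog text from breaking the generated markup.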

