Catalog Table Parser

This project contains Python scripts that read tables from PDF files (mostly catalogs) and convert them into the selected format, either Excel or HTML.

Excel preview

HTML preview

File Structure

├── extracted_files/
│ ├── index.html
│ └── extracted_tables_with_headers.xlsx
├── result_images/
│ ├── xlsx_result.png
│ └── html_result.png
├── catalog.pdf
├── pdfToExcel.py
├── pdfToHtml.py
├── README.md
└── style.css

Setup

Clone the repository or download the files to your local machine.
Ensure you have Python installed (version 3.6 or higher).
Install the required packages:

Packages

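Run the command that matches the parser you want to use.

For the HTML parser:
```
pip install pdfplumber pandas beautifulsoup4
```
For the Excel parser (the fitz module is provided by the PyMuPDF package, and the PyPI package named fitz is unrelated; pandas also needs openpyxl to write .xlsx files):
```
pip install pymupdf pandas openpyxl
```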

Code Explanation

pdfToExcel.py

This script uses fitz (PyMuPDF) to open the PDF file and iterate over its pages. On each page, the built-in find_tables method detects table structures, and we extract the data from every table cell. Once a table is extracted, we look for its title by taking the first text element above the table.
The extracted data is then written to an Excel workbook using pandas.
The workbook is saved as extracted_tables_with_headers.xlsx.
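
The steps above could be sketched roughly as follows. This is a minimal illustration and not the actual contents of pdfToExcel.py: the function names find_title_above, extract_tables and save_to_excel are hypothetical, the title is assumed to be the closest text block that ends above the table, and all tables are assumed to go onto a single sheet. It relies on PyMuPDF 1.23 or newer (for find_tables) and on pandas with openpyxl.

```python
def find_title_above(text_blocks, table_top):
    """Return the text of the closest block that ends above table_top, or ""."""
    # text_blocks: (x0, y0, x1, y1, text) tuples; y grows downwards on the page.
    above = [b for b in text_blocks if b[3] <= table_top and b[4].strip()]
    if not above:
        return ""
    # The closest block is the one whose bottom edge (y1) is largest.
    return max(above, key=lambda b: b[3])[4].strip()


def extract_tables(pdf_path):
    """Yield (title, rows) for every table PyMuPDF detects in the PDF."""
    import fitz  # PyMuPDF, imported lazily so the helper above works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Keep text blocks only (block type 0) and trim them to five fields.
            blocks = [b[:5] for b in page.get_text("blocks") if b[6] == 0]
            for table in page.find_tables().tables:
                table_top = table.bbox[1]
                yield find_title_above(blocks, table_top), table.extract()


def save_to_excel(tables, out_path="extracted_tables_with_headers.xlsx"):
    """Write each title followed by its rows, one table below the other."""
    import pandas as pd

    tables = list(tables)
    if not tables:
        return  # nothing to write; avoids saving an empty workbook
    with pd.ExcelWriter(out_path) as writer:
        row = 0
        for title, rows in tables:
            pd.DataFrame([[title]]).to_excel(
                writer, startrow=row, index=False, header=False)
            pd.DataFrame(rows).to_excel(
                writer, startrow=row + 1, index=False, header=False)
            row += len(rows) + 3  # title row, table rows, two blank rows
```

With the sample file from this repository, a call such as save_to_excel(extract_tables("catalog.pdf")) would be the assumed entry point.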

pdfToHtml.py

We open the PDF with pdfplumber and iterate over its data.
We build an HTML structure and append the PDF data to it inside HTML table tags.
The resulting HTML file is saved as index.html.
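
A rough sketch of this flow is shown below. It is an illustration under assumptions, not the actual pdfToHtml.py: the names rows_to_html_table and pdf_to_html are hypothetical, and the markup is built with plain string formatting from the standard library rather than with BeautifulSoup, which the real script may use instead.

```python
from html import escape


def rows_to_html_table(rows):
    """Render rows (lists of cell values, possibly None) as an HTML table."""
    lines = ["<table>"]
    for row in rows:
        cells = []
        for cell in row:
            # pdfplumber reports empty cells as None; render them as empty text.
            text = "" if cell is None else escape(str(cell))
            cells.append(f"<td>{text}</td>")
        lines.append("  <tr>" + "".join(cells) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


def pdf_to_html(pdf_path, out_path="index.html"):
    """Extract every table from the PDF and write them into one HTML file."""
    import pdfplumber  # imported lazily so the helper above works without it

    tables_html = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_tables() returns one list of rows per detected table.
            for rows in page.extract_tables():
                tables_html.append(rows_to_html_table(rows))

    html = (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        '  <meta charset="utf-8">\n'
        "</head>\n<body>\n"
        + "\n".join(tables_html)
        + "\n</body>\n</html>\n"
    )
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(html)
```

Escaping the cell text keeps characters such as < or & in catalog entries from breaking the generated markup.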

bokorarmin/catalog-table-parser

Folders and files

Latest commit

History

Repository files navigation

Catalog Table Parser

excel preview

Html preview

File Structure

Setup

Packages

Code Explanation

pdfToExcel.py

pdfToHtml.py

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages