pdf-gpt

Question PDF documents using AI

This project uses OpenAI and Langchain to allow users to ask questions about groups of pdf documents. It was specifically developed to return information from financial disclosure forms of U.S. congress members in a pandas dataframe. Users should find code ammendable to a wide array of use cases.

The project is maintained by Citizen Codex.

Areas we look to improve in the future:

Chunk text according to assets info. Don't allow page breaks to split up information about an individual asset
The AI struggles to interpret both the word "None" and the number 0. It likes to look for the next number
Implement a SQL database to write and manipulate data

Notes of inconsistency

The names of assets on page break may be truncated, e.g. simply saying SP
Asset types may not be able to be pulled for truncated asset names (only 200-300 of 15,000 assets)
The AI sometimes will attach an income value to an asset with no value

How to repo works:

Create a .env to store your API key
All scripts rely on functions from utils.py
The script gen_pdf_urls.py uses metadata about financial disclosure submissions to create urls pointing towards the PDfs containing the data. This script matches each disclosue form with political information about a given congress member (e.g. their party, age, etc.).
The scripts beginning with extract_ comb through and extract the relevant pages from pdfs accessed via their urls.
The scripts assets.py and liabilities.py send the extracted text to ChatGPT to be returned as structured json data.
The script processing.py cleans the results of ChatGPT analyis and calculates the networth of congress members. This script also combines the automatically processed data with the manually processed data.

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
manual_data		manual_data
output		output
pdf_urls		pdf_urls
.gitignore		.gitignore
README.md		README.md
asset_types.xlsx		asset_types.xlsx
assets.py		assets.py
extract_asset_pages.py		extract_asset_pages.py
extract_liability_pages.py		extract_liability_pages.py
gen_pdf_urls.py		gen_pdf_urls.py
liabilities.py		liabilities.py
min_to_max_values.xlsx		min_to_max_values.xlsx
processing.py		processing.py
requirements.txt		requirements.txt
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pdf-gpt

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

pdf-gpt

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages