@haragam22
Contributor

@haragam22 haragam22 commented Oct 19, 2025

Summary

This pull request adds a Python script that reads every .pdf file in a folder and extracts its text into a corresponding .txt file using the PyPDF2 library. It resolves issue #81.

Changes Made

  • Added main.py for reading and processing PDF files.
  • Created a pdfs/ folder for input PDFs.
  • Created an output/ folder for saving extracted text files.
  • Added a requirements.txt file listing dependencies.

How It Works

  1. The script scans the pdfs/ folder for .pdf files.
  2. For each PDF, it extracts text using PyPDF2.PdfReader.
  3. It writes the extracted text to a matching .txt file in the output/ folder.
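
The steps above can be sketched roughly as follows. This is not the merged main.py, only a minimal illustration of the same flow under a few assumptions: the function names extract_pdf_text and convert_folder and the extractor parameter are hypothetical, and the folder names default to the pdfs/ and output/ folders described in this PR. PyPDF2 is imported lazily inside the extraction function so the folder-walking logic can be exercised without it.

```python
from pathlib import Path


def extract_pdf_text(pdf_path: Path) -> str:
    """Return the concatenated text of every page in one PDF."""
    # Imported here so the folder-walking logic below does not require PyPDF2.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() may return nothing for image-only (scanned) pages,
    # so fall back to an empty string for those.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def convert_folder(input_dir="pdfs", output_dir="output", extractor=extract_pdf_text):
    """Write one .txt file per .pdf found in input_dir; return the written paths."""
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(in_dir.glob("*.pdf")):
        # report.pdf in the input folder becomes report.txt in the output folder.
        txt_path = out_dir / (pdf_path.stem + ".txt")
        txt_path.write_text(extractor(pdf_path), encoding="utf-8")
        written.append(txt_path)
    return written


if __name__ == "__main__":
    for path in convert_folder():
        print(f"Wrote {path}")
```

Passing the extractor as a parameter is a design choice made for this sketch: it keeps the file handling separate from the PDF parsing, so an OCR-based extractor could later be swapped in for scanned PDFs without touching the loop.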

Commands Used

git checkout -b feature/pdf-text-extraction
git add .
git commit -m "Add PDF text extraction script using PyPDF2"
git push -u origin feature/pdf-text-extraction

Notes

  • Works best for text-based PDFs.
  • Future improvement: integrate OCR for scanned PDFs.

@devmalik7
Owner

@haragam22, great work, merging it now.

@devmalik7 devmalik7 merged commit 9f99f72 into devmalik7:main Oct 23, 2025