WIP: Backend PDF Parsing #152

Draft
harshkhandeparkar wants to merge 1 commit into main from harsh/new-pdf-parsing

Conversation

@harshkhandeparkar
Member

Fixes #146

Description

WIP: Added a testing-sandbox binary that takes a PDF, renders its pages to images, runs Tesseract OCR on each page, and then detects question papers (QPs) in the recognized text.
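
The per-page flow described above (render, OCR, detect) could be sketched roughly as follows. This is only an illustration, not the code in this PR: the regex heuristics, the `detect_qp` and `parse_pdf` names, and the course-code shape are all assumptions, and rendering/OCR are passed in as callables so the sketch stays independent of any particular PDF or Tesseract binding.

```python
import re
from typing import Callable, Iterable, List, Optional

# Hypothetical heuristic: a page that names an exam type is treated as a QP page.
EXAM_PATTERN = re.compile(r"(mid|end)[\s-]*(semester|sem|term)", re.IGNORECASE)
# Hypothetical course-code shape: two capital letters followed by five digits.
COURSE_CODE_PATTERN = re.compile(r"\b[A-Z]{2}\d{5}\b")


def detect_qp(page_text: str) -> Optional[dict]:
    """Return detected details if the OCR text looks like a QP, else None."""
    exam = EXAM_PATTERN.search(page_text)
    if exam is None:
        return None
    code = COURSE_CODE_PATTERN.search(page_text)
    return {
        "exam": exam.group(1).lower() + "sem",
        "course_code": code.group(0) if code else None,
    }


def parse_pdf(
    pages: Iterable[object], ocr: Callable[[object], str]
) -> List[Optional[dict]]:
    """Run the supplied OCR callable on each rendered page, then detect a QP."""
    return [detect_qp(ocr(page)) for page in pages]
```

In a real run, `pages` would be the rendered page images and `ocr` a wrapper around Tesseract; for a quick check, plain strings and an identity function are enough, e.g. `parse_pdf(["End Semester Examination CS10001"], ocr=lambda p: p)`.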

Planned features

  • Detect multiple QPs in one PDF
  • Move OCR off the frontend, where it takes too long, especially on mobile phones
    • The upload endpoint should create DB entries and save all uploaded files to a directory
    • The PDF parser watches the uploaded-papers directory for changes and parses new files one by one
    • If duplicate papers are found, add them as unapproved and show them on the admin dashboard
    • If the detected paper is not a duplicate, approve it directly (unless important details are missing, in which case also add it to the admin dashboard)
  • FUTURE: Report paper feature for incorrect detections
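
The planned approval rules above can be written down as a small decision function. The sketch below is an assumption-laden illustration, not the planned implementation: `PaperDetails`, `Decision`, `triage`, and `pending_uploads` are hypothetical names, the set of "important details" is guessed to be course code, year, and exam type, and directory changes are found by simple polling rather than a file-system watcher.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class PaperDetails:
    # Assumed set of "important details"; the real schema may differ.
    course_code: Optional[str]
    year: Optional[int]
    exam: Optional[str]

    def key(self) -> Tuple[Optional[str], Optional[int], Optional[str]]:
        return (self.course_code, self.year, self.exam)

    def is_complete(self) -> bool:
        return None not in self.key()


@dataclass(frozen=True)
class Decision:
    approved: bool
    needs_admin_review: bool
    reason: str


def triage(paper: PaperDetails, existing_keys: Set[tuple]) -> Decision:
    """Apply the planned rules: auto-approve only complete, non-duplicate papers."""
    if not paper.is_complete():
        return Decision(False, True, "missing details")
    if paper.key() in existing_keys:
        return Decision(False, True, "duplicate")
    return Decision(True, False, "auto-approved")


def pending_uploads(upload_dir: Path, seen: Set[str]) -> List[Path]:
    """Return PDFs in upload_dir whose file names have not been parsed yet."""
    return sorted(p for p in upload_dir.glob("*.pdf") if p.name not in seen)
```

A parser loop would call `pending_uploads` periodically, parse each returned file, and pass the detected details to `triage`; anything with `needs_admin_review` set would be surfaced on the admin dashboard.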

@vercel

vercel Bot commented Apr 19, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
iqps | Ready | Preview, Comment | Apr 19, 2026 5:02am

@harshkhandeparkar harshkhandeparkar marked this pull request as draft April 19, 2026 05:02

Development

Successfully merging this pull request may close these issues.

Possible PDF-Related QoL Improvements

1 participant