A comprehensive resume parsing system that extracts structured data from PDF resumes using LangChain and OpenAI, and stores the data in a SQLite database categorized by user.
- 📄 PDF Text Extraction: Extract text from PDF resume files using PyPDF2
- 🤖 AI-Powered Parsing: Use LangChain with OpenAI to intelligently parse resume content
- 🗃️ Structured Data Storage: Store parsed data in SQLite database with user categorization
- 👤 User Management: Automatically categorize resumes by user name (inferred or specified)
- 📊 Comprehensive Data Extraction: Extract education, experience, skills, projects, achievements, and more
- Clone or set up the project
- Install required dependencies:
    pip install -r requirements.txt
- Set up your API credentials in a .env file (the variable names below target OpenRouter's OpenAI-compatible endpoint):
    OPENROUTER_API_KEY=your_api_key_here
    OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
    YOUR_SITE_URL=your_site_url
    YOUR_SITE_NAME=your_site_name

Place your PDF resume files in the resume/ folder:
resume/
├── arpit_solanki_resume.pdf
├── john_doe_cv.pdf
└── jane_smith_resume.pdf

    python main.py

This will process all PDF files in the resume/ folder and automatically infer user names from the file content or filename.
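The filename half of that inference could look roughly like the sketch below. It is only an illustration: the function name `infer_user_name` and the `FILLER_WORDS` set are hypothetical and are not taken from the project's code, which may use different rules.

```python
import re
from pathlib import Path

# Hypothetical filler words to drop from a filename; not taken from the project.
FILLER_WORDS = {"resume", "cv"}


def infer_user_name(pdf_path: str) -> str:
    """Guess a display name from a resume filename such as 'john_doe_cv.pdf'."""
    stem = Path(pdf_path).stem  # filename without directory or extension
    # Split on underscores, hyphens, dots, and whitespace.
    parts = [p for p in re.split(r"[_\-.\s]+", stem) if p]
    # Keep only the parts that are not filler words like "resume" or "cv".
    name_parts = [p for p in parts if p.lower() not in FILLER_WORDS]
    # Fall back to a placeholder when nothing name-like is left.
    return " ".join(p.capitalize() for p in name_parts) or "Unknown"


print(infer_user_name("resume/john_doe_cv.pdf"))  # prints: John Doe
```

A filename such as resume.pdf carries no name at all, which is why the sketch returns a placeholder; in that case passing the name explicitly is the safer choice.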
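To associate the processed resume(s) with a specific user:

    python main.py --user "Arpit Solanki"

To process a single PDF file for a specific user:

    python main.py --file "resume.pdf" --user "John Doe"

To list all users and their resumes in the database:

    python main.py --list

Command-line options:
- --user, -u: Specify user name to associate with resume(s)
- --file, -f: Process a specific PDF file
- --list, -l: List all users and their resumes in the database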
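For readers who want to see how these three options fit together, here is a minimal sketch using the standard library's argparse. The flag names come from the list above; the function name `build_arg_parser` and the description string are assumptions, and main.py may define its arguments differently.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Build a parser for the three documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Parse PDF resumes into a SQLite database."
    )
    parser.add_argument("--user", "-u",
                        help="User name to associate with resume(s)")
    parser.add_argument("--file", "-f",
                        help="Process a specific PDF file")
    parser.add_argument("--list", "-l", action="store_true",
                        help="List all users and their resumes in the database")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is repeatable.
args = build_arg_parser().parse_args(["--file", "resume.pdf", "--user", "John Doe"])
print(args.file, args.user, args.list)  # prints: resume.pdf John Doe False
```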
langchain/
├── resume/ # Folder for PDF resume files
├── main.py # Main script to run the resume parser
├── pdf_parser.py # PDF text extraction module
├── resume_parser.py # LangChain-based resume parsing module
├── database.py # Database models and management
├── model.py # Original LangChain model configuration
├── requirements.txt # Python dependencies
├── .env # Environment variables (API keys)
└── README.md # This file

Users table:
- id: Primary key
- name: User name
- created_at: Timestamp

Resume records (one row per parsed resume):
- id: Primary key
- user_id: Foreign key to users table
- file_name: Original PDF filename
- full_name: Extracted full name
- email: Email address
- phone: Phone number
- location: Current location
- linkedin: LinkedIn profile URL
- github: GitHub profile URL
- summary: Professional summary
- education: JSON string of education details
- experience: JSON string of work experience
- technical_skills: JSON string of technical skills
- soft_skills: JSON string of soft skills
- projects: JSON string of projects
- achievements: JSON string of achievements
- certifications: JSON string of certifications
- raw_text: Original extracted text
- created_at: Timestamp
- updated_at: Last update timestamp
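
To make the "JSON string" columns concrete, the sketch below creates a cut-down version of this schema with the standard library's sqlite3 module and round-trips one education entry. The table name `resumes`, the column types, and the subset of columns shown are assumptions; only the `users` table name and the column names appear in the description above, and the project's database.py may differ.

```python
import json
import sqlite3

# Assumed DDL: "resumes" is a hypothetical table name and only a subset of
# the documented columns is included to keep the sketch short.
SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY,
    name TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS resumes (
    id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    file_name TEXT,
    full_name TEXT,
    email TEXT,
    education TEXT,
    raw_text TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""

conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.executescript(SCHEMA)

# Insert a user, then a resume whose education list is stored as a JSON string.
user_id = conn.execute(
    "INSERT INTO users (name) VALUES (?)", ("John Doe",)
).lastrowid
education = [{"institution": "Example University", "degree": "BSc"}]
conn.execute(
    "INSERT INTO resumes (user_id, file_name, education) VALUES (?, ?, ?)",
    (user_id, "john_doe_cv.pdf", json.dumps(education)),
)

# Reading the column back requires json.loads to recover the list.
row = conn.execute(
    "SELECT education FROM resumes WHERE user_id = ?", (user_id,)
).fetchone()
print(json.loads(row[0]))
```

Storing lists as JSON text keeps the schema to two tables, at the cost of not being able to filter on individual education or skill entries with plain SQL.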
The system extracts the following structured data from resumes:
- Personal information
  - Full name
  - Email address
  - Phone number
  - Current location
  - LinkedIn profile
  - GitHub profile
- Professional summary
  - Brief professional summary or objective
- Education
  - Institution name
  - Degree/qualification
  - Field of study
  - Graduation year
  - GPA (if mentioned)
  - Location
- Work experience
  - Company name
  - Position/job title
  - Duration of employment
  - Location
  - Key responsibilities and achievements
- Skills
  - Technical skills (programming languages, frameworks, tools)
  - Soft skills
- Projects
  - Project name
  - Description
  - Technologies used
  - Duration
  - Project links
- Achievements
  - Achievement title
  - Description
  - Year achieved
  - Awarding organization
- Certifications
  - Certification name
  - Issuing organization
  - Issue date
  - Expiry date
  - Credential ID
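
One way to picture the shape of this data is as a pair of dataclasses, sketched below for the education fields only. The class and attribute names (`Education`, `ParsedResume`, `field_of_study`, and so on) are hypothetical and chosen to mirror the list above; the structure that resume_parser.py actually returns is not shown in this README and may well be a plain dictionary.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Education:
    """One education entry; optional fields stay None when not mentioned."""
    institution: str
    degree: str
    field_of_study: Optional[str] = None
    graduation_year: Optional[str] = None
    gpa: Optional[str] = None
    location: Optional[str] = None


@dataclass
class ParsedResume:
    """Illustrative subset of the extracted fields."""
    full_name: Optional[str] = None
    email: Optional[str] = None
    education: List[Education] = field(default_factory=list)
    technical_skills: List[str] = field(default_factory=list)

    def education_json(self) -> str:
        # Serialize the entries the way a JSON-string database column expects.
        return json.dumps([asdict(entry) for entry in self.education])


resume = ParsedResume(
    full_name="Jane Smith",
    education=[Education("Example University", "BSc", graduation_year="2020")],
)
print(resume.education_json())
```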
from pdf_parser import PDFParser
from resume_parser import ResumeParser
from database import DatabaseManager
# Initialize components
pdf_parser = PDFParser()
resume_parser = ResumeParser()
db_manager = DatabaseManager()
# Extract text from a PDF
text = pdf_parser.extract_text_from_pdf("resume/arpit_resume.pdf")
# Parse the resume
parsed_data = resume_parser.parse_resume(text)
# Save to database
resume_record = db_manager.save_resume_data("Arpit Solanki", "arpit_resume.pdf", parsed_data)

The system includes comprehensive error handling:
- Failed PDF text extraction
- AI parsing failures (falls back to regex-based extraction)
- Database connection issues
- File not found errors
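
The regex fallback mentioned above could be sketched as follows. This is an assumption about the general approach, not the project's code: the patterns, the function names `regex_fallback` and `parse_with_fallback`, and the choice to recover only email and phone are all illustrative.

```python
import re

# Deliberately simple patterns; real resumes will need more tolerant ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def regex_fallback(text: str) -> dict:
    """Recover basic contact fields without calling the AI parser."""
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
    }


def parse_with_fallback(text: str, ai_parse) -> dict:
    """Try the AI parser first; use the regex fallback if it raises."""
    try:
        return ai_parse(text)
    except Exception:
        # Catching broadly is a simplification for this sketch.
        return regex_fallback(text)
```

Here `ai_parse` stands for any callable that takes the resume text and returns a dictionary, for example a wrapper around `resume_parser.parse_resume`.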
🚀 Starting Resume Processing Pipeline
=====================================
🔧 Initializing components...
📄 Found 1 PDF file(s): ['arpit_resume.pdf']
============================================================
Processing: arpit_resume.pdf
============================================================
✅ Successfully extracted text (2847 characters)
🔄 Parsing resume with AI...
👤 User identified as: Arpit Solanki
💾 Saving to database...
✅ Successfully saved resume data for Arpit Solanki
📝 Resume ID: 1
📊 Extracted Data Summary:
• Full Name: Arpit Solanki
• Email: arpitsolanki6825@gmail.com
• Phone: +91-8279824227
• Location: N/A
• Education entries: 1
• Work experience entries: 1
• Projects: 2
• Technical skills: 8
🎯 Processing Complete!
========================================
✅ Successfully processed: 1
❌ Failed to process: 0
📊 Total files: 1
💾 Data saved to: resume_database.db
🔍 You can query the database to retrieve user resume data
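
As a rough illustration of querying the saved data, the sketch below joins users to their resumes with the standard library's sqlite3 module. The table name `resumes` is an assumption (only the `users` table is named in this README), so check database.py for the real names before running it against resume_database.db.

```python
import sqlite3


def resumes_for_user(conn: sqlite3.Connection, user_name: str) -> list:
    """Return (id, file_name, full_name, email) rows for one user."""
    # "resumes" is an assumed table name; user_id links it to users.id.
    query = """
        SELECT r.id, r.file_name, r.full_name, r.email
        FROM resumes AS r
        JOIN users AS u ON u.id = r.user_id
        WHERE u.name = ?
        ORDER BY r.id
    """
    return conn.execute(query, (user_name,)).fetchall()
```

With a connection opened via `sqlite3.connect("resume_database.db")`, calling `resumes_for_user(conn, "Arpit Solanki")` would return one tuple per stored resume, provided the assumed table and column names match.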
Feel free to contribute by:
- Adding support for more file formats (DOC, DOCX)
- Improving the AI parsing prompts
- Adding more structured data fields
- Enhancing error handling
- Adding a web interface
This project is open source. Feel free to use and modify as needed.