""" README for Hugging Face Contributor Email Extractor """
A tool to extract contributor emails from Hugging Face repositories by first getting names from git logs and then finding emails through web searches, particularly from academic papers.
- Extract contributor names from Hugging Face repository git logs
- Search for contributor emails using Google Scholar and general web searches
- Download and parse PDFs to extract email addresses
- Prioritize academic emails over generic ones
- Modern web interface with React, Vite, Tailwind CSS, and shadcn UI
- FastAPI backend with background processing
- PostgreSQL database integration
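The email extraction and academic prioritization listed above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `extract_emails` and `prioritize` are hypothetical, and treating `.edu` and `.ac` domains (optionally followed by a country code) as academic is an assumed heuristic.

```python
import re

# Simple pattern for illustration; the full address grammar is broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# ASSUMPTION: ".edu", ".ac", ".edu.<cc>" and ".ac.<cc>" count as academic.
ACADEMIC_RE = re.compile(r"\.(edu|ac)(\.[a-z]{2})?$", re.IGNORECASE)


def extract_emails(text: str) -> list[str]:
    """Return unique addresses found in text, lowercased, in first-seen order."""
    seen: dict[str, None] = {}
    for match in EMAIL_RE.findall(text):
        seen.setdefault(match.lower(), None)
    return list(seen)


def prioritize(emails: list[str]) -> list[str]:
    """Stable sort that places academic-looking addresses first."""
    return sorted(emails, key=lambda e: 0 if ACADEMIC_RE.search(e) else 1)
```

Text pulled from a parsed PDF could be passed to `extract_emails`, and the result to `prioritize`, so that an institutional address is preferred when a contributor has several candidates.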
The application consists of two main components:
- Server (FastAPI)
  - RESTful API for repository processing
  - Background task processing for email extraction
  - Database integration for storing results
  - Comprehensive error handling and validation
- Web (React + Vite)
  - Modern UI with Tailwind CSS and shadcn UI components
  - Real-time status updates with polling
  - Responsive design for all device sizes
Prerequisites:
- Python 3.10+
- Node.js 20+
- PostgreSQL database
To set up the server:
- Clone the repository
- Install Python dependencies:
      cd server
      pip install requests beautifulsoup4 fastapi uvicorn psycopg2-binary anthropic openai pdf2image pytesseract PyPDF2
- Configure environment variables in server/config.py
- Start the server:
      cd server
      uvicorn backend:app --host 0.0.0.0 --port 8000
To set up the web interface:
- Navigate to the web directory
- Install Node.js dependencies:
      cd web
      npm install
- Start the development server:
      npm run dev
To use the application:
- Enter a Hugging Face repository path (e.g., deepseek-ai/DeepSeek-V3-0324) in the web interface
- Click "Extract Emails" to start the extraction process
- The application will:
  - Clone the repository
  - Extract contributor names from git logs
  - Search for emails using Google Scholar and web searches
  - Download and parse PDFs to extract email addresses
  - Display the results in the web interface
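The step that extracts contributor names from git logs can be sketched as below. This is an illustration under stated assumptions, not the project's implementation: the `Contributor` class and both function names are hypothetical, and the `Name|YYYY-MM-DD` line format is simply what `git log --format=%an|%ad --date=short` produces. The sketch assumes the repository has already been cloned to a local directory.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class Contributor:
    name: str
    commit_count: int
    first_commit_date: str  # ISO date, e.g. "2024-01-31"
    last_commit_date: str


def parse_git_log(log_text: str) -> list[Contributor]:
    """Aggregate lines of the form 'Author Name|YYYY-MM-DD' per author."""
    stats: dict[str, Contributor] = {}
    for line in log_text.splitlines():
        # Split on the last "|" so names containing "|" stay intact.
        name, sep, date = line.rpartition("|")
        if not sep or not name.strip():
            continue  # skip blank or malformed lines
        name, date = name.strip(), date.strip()
        entry = stats.get(name)
        if entry is None:
            stats[name] = Contributor(name, 1, date, date)
        else:
            entry.commit_count += 1
            # ISO dates compare correctly as plain strings.
            entry.first_commit_date = min(entry.first_commit_date, date)
            entry.last_commit_date = max(entry.last_commit_date, date)
    # Most active contributors first.
    return sorted(stats.values(), key=lambda c: -c.commit_count)


def read_git_log(repo_dir: str) -> str:
    """Run git log in an already-cloned repository and return its output."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "--format=%an|%ad", "--date=short"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Keeping the parsing separate from the `git` call means the aggregation can be exercised on plain strings, for example `parse_git_log(read_git_log("path/to/clone"))` in production and a literal log snippet in tests.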
API endpoints:
- POST /extract: Start email extraction for a repository
- GET /status/{repo_path}: Get extraction status for a repository
The application uses two tables:
- manus_hf_repositories: Stores repository information
  - id: Primary key
  - repo_path: Repository path
  - created_at: Creation timestamp
- manus_hf_contributors: Stores contributor information
  - id: Primary key
  - repo_id: Foreign key to repositories table
  - name: Contributor name
  - email: Contributor email
  - commit_count: Number of commits
  - first_commit_date: Date of first commit
  - last_commit_date: Date of last commit
  - created_at: Creation timestamp
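The two tables could be created with DDL along the lines of the sketch below. The table and column names come from the descriptions above; every column type, constraint, and default is an assumption, and `apply_schema` is a hypothetical helper that accepts any DB-API connection (for example one returned by `psycopg2.connect`).

```python
# ASSUMPTION: all types, NOT NULL/UNIQUE constraints, and defaults are
# illustrative guesses; only the table and column names are documented.
SCHEMA_STATEMENTS = [
    """
    CREATE TABLE IF NOT EXISTS manus_hf_repositories (
        id SERIAL PRIMARY KEY,
        repo_path TEXT NOT NULL UNIQUE,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
    """,
    """
    CREATE TABLE IF NOT EXISTS manus_hf_contributors (
        id SERIAL PRIMARY KEY,
        repo_id INTEGER NOT NULL REFERENCES manus_hf_repositories(id),
        name TEXT NOT NULL,
        email TEXT,
        commit_count INTEGER,
        first_commit_date DATE,
        last_commit_date DATE,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
    """,
]


def apply_schema(conn) -> None:
    """Execute each CREATE TABLE statement on a DB-API connection and commit."""
    cur = conn.cursor()
    for statement in SCHEMA_STATEMENTS:
        cur.execute(statement)
    conn.commit()
```

The statements are executed one at a time because not every DB-API driver accepts several statements in a single `execute` call.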
This project is licensed under the MIT License.