AI-powered tool for processing SOC1 Type II audit reports and automatically filling management review Excel templates. Find here
- PDF Extraction: Uses
pdfplumberto extract text and tables from SOC1 Type II PDF reports - AI-Powered Mapping: Uses Google Gemini AI (free tier) to intelligently map extracted content to Excel template fields
- Excel Generation: Automatically fills management review templates with extracted data
- Gap Analysis: Analyzes the extraction for completeness and provides recommendations
- User Feedback System: Collects user feedback to help you continuously improve extraction quality
- Python 3.10+
- Node.js 18+
- Google API key (https://aistudio.google.com/apikey)
cd backend
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
# Set your Google API key (free) in the .env file
GOOGLE_API_KEY="your-google-api-key"
# Run the server
uvicorn main:app --reload --port 8000cd frontend
npm install
npm run devOpen http://localhost:5173 to use the UI.
-
Upload Files:
- Upload a SOC1 Type II report (PDF format)
- Upload a blank management review template (Excel .xlsx format)
-
Processing:
- The system extracts text and tables from the PDF
- AI analyzes the content to identify controls, test results, and findings
- The Excel template is populated with the extracted information
-
Download:
- Once processing completes, download the filled management review
- Review the analysis summary for key findings and recommendations
GET /api/health- Health check endpointPOST /api/upload- Upload files and start processing- Accepts:
type_ii_report(PDF),management_review(Excel) - Returns:
job_idfor status polling
- Accepts:
GET /api/status/{job_id}- Check processing status and get analysis summaryGET /api/download/{job_id}- Download the filled Excel filePOST /api/feedback/{job_id}- Submit user feedback on extraction qualityGET /api/feedback/stats- Get aggregated feedback statistics (admin)POST /api/cleanup-uploads- Clear temporary upload files (maintenance)POST /api/cleanup-old-files- Remove output files older than 24 hours (maintenance)
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Frontend │────▶│ FastAPI │────▶│ Gemini AI │
│ (React) │ │ Backend │ │ (Free Tier) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ pdfplumber │ │ openpyxl │
│ (PDF Extract) │ │ (read-only) │
└─────────────────┘ └─────────────────┘
│
┌───────┴───────┐
│ xlsxwriter │
│ (Excel Write) │
└───────────────┘
The Excel read/write pipeline is split into two libraries to stay well under a 2 GB container limit:
| Phase | Library | Mode | Memory |
|---|---|---|---|
| Read template | openpyxl | read_only=True (streaming) |
~20 MB |
| Parse layout & styles | stdlib zipfile + ElementTree.iterparse |
Streaming XML | ~5 MB |
| Write output | xlsxwriter | Forward-only rows | ~10 MB |
All original formatting is preserved through streaming XML parsing:
- Fonts (family, size, bold, italic, color)
- Cell backgrounds (colors and patterns)
- Borders (all sides with correct styles)
- Number formats (dates, decimals, etc.)
- Alignment (horizontal, vertical, text wrapping)
- Column widths and row heights
- Merged cells
The styles are parsed directly from xl/styles.xml in the XLSX ZIP, so the full workbook object model is never loaded into memory. AI-extracted data with low/medium confidence gets highlighted background colors (red/yellow) that override the original template colors.
fastapi- Web frameworkuvicorn- ASGI serverpdfplumber- PDF text and table extractionopenpyxl- Excel template reading (read-only streaming)xlsxwriter- Excel output writing (memory-efficient)google-genai- Google Gemini AI clientpython-dotenv- Environment variable management
- React 18
- TypeScript
- Vite
| Variable | Description | Required |
|---|---|---|
GOOGLE_API_KEY |
Google AI API key (free tier available) | Yes |
VITE_API_URL |
Backend API URL (frontend only) | No* |
*If not set, defaults to https://soc1-management-review-generator.onrender.com/api
Get your free Google API key at: https://aistudio.google.com/apikey
- Push your code to GitHub
- Go to render.com and sign in with GitHub
- Create a new Web Service:
- Select your repository
- Set Root Directory:
backend - Runtime: Python 3.11
- Build Command:
pip install -r requirements.txt - Start Command:
uvicorn main:app --host 0.0.0.0 --port 8000
- Add environment variable in Settings:
GOOGLE_API_KEY: Your free API key from https://aistudio.google.com/apikey
- Deploy! Render will provide a URL like
https://your-app.onrender.com
- Go to vercel.com and sign in with GitHub
- Import your repository
- Configure:
- Framework: Vite
- Root Directory:
frontend - Build Command:
npm run build - Output Directory:
dist
- Add environment variable:
VITE_API_URL:https://your-backend.onrender.com/api(from Render deployment)
- Deploy!
The backend CORS is configured to accept requests from:
http://localhost:5173(local development)https://soc1-agent.vercel.app/(production)
To update for your frontend domain, edit backend/main.py line 20 and add your frontend URL to the allow_origins list:
allow_origins=[
"http://localhost:5173",
"https://your-frontend-domain.vercel.app"
]After downloading results, users can optionally provide feedback (star rating, issue categories, comments). This helps you identify and fix common extraction issues.
Analyze feedback:
cd backend
python analyze_feedback.py # View summary statistics
python analyze_feedback.py --detailed # See all feedback entries
python analyze_feedback.py --export # Export to CSVFeedback data stored in:
backend/feedback/feedback_log.json- All feedback entriesbackend/feedback/{job_id}_corrected.xlsx- User-corrected files (when provided)
Use feedback to improve:
- Run analysis tool to identify common issues
- Review corrected files to see what the AI missed
- Update AI prompts in
agent.pywith better instructions/examples - Test with previous problem cases
- Deploy improvements
You can also use the agent directly in Python:
from agent import process_soc1_sync
result = process_soc1_sync(
type_ii_path="path/to/soc1-report.pdf",
management_review_path="path/to/template.xlsx",
output_dir="path/to/output",
)
print(f"Filled template saved to: {result['output_path']}")
print(f"Analysis: {result['analysis']}")