"Rethink Reading. Rediscover Knowledge."
Welcome to our submission for the Adobe India Hackathon 2025.
Our project transforms static PDFs into dynamic, intelligent documents that can understand structure, surface insights, and connect dots across knowledge sources, all offline.
The hackathon consists of two technical rounds:
| Round | Focus |
|---|---|
| 🟦 Round 1A | Extract structured document outlines (Title, H1-H3) |
| 🟩 Round 1B | Surface sections relevant to a specific persona |
| Feature | Status |
|---|---|
| Extract Title, H1-H3 Headings | ✅ Implemented |
| Font-size & Layout-based Logic | ✅ Implemented |
| Structured JSON Output | ✅ Yes |
| Fully Offline (No Web Access) | ✅ Yes |
| CPU-Only Execution | ✅ Yes |
| Docker Support | ✅ Yes |
| Sample Input/Output Provided | ✅ Yes |
| Multilingual PDF Support | Planned |
Automatically extract the title and the H1, H2, and H3 headings, with their page numbers, from any PDF (≤ 50 pages), and output valid JSON per Adobe's spec.
```json
{
  "title": "Understanding AI",
  "outline": [
    { "level": "H1", "text": "Introduction", "page": 1 },
    { "level": "H2", "text": "What is AI?", "page": 2 },
    { "level": "H3", "text": "History of AI", "page": 3 }
  ]
}
```

```
Round1A/
├── app/
│   ├── input/                # Input PDFs placed here (Docker-mounted)
│   └── output/               # Output JSONs written here (Docker-mounted)
├── src/
│   ├── main.py               # Entrypoint for processing all PDFs
│   ├── extractor.py          # Extracts title, H1, H2, H3 using font sizes
│   ├── utils.py              # Helpers for reading PDF, writing JSON
│   └── config.py             # Thresholds/configs for heading detection
├── requirements.txt          # pdfplumber + PyMuPDF
├── Dockerfile                # CPU-only, offline, AMD64-compliant
├── generate_dummy_pdf.py     # (Optional) Generate test PDFs with headings
├── sample.pdf                # (Optional) A test input PDF
└── sample.json               # (Optional) Expected output for sample.pdf
```
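A quick shape check can help catch malformed output before submission. The sketch below is not part of the repository: `validate_outline` is a hypothetical helper, and its checks mirror only the sample JSON shown earlier (a string `title`, and `outline` entries with `level`, `text`, and `page`), not Adobe's full spec.

```python
from typing import Any

ALLOWED_LEVELS = {"H1", "H2", "H3"}


def validate_outline(doc: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right.

    Assumption: the required fields are exactly those in the sample output.
    """
    problems: list[str] = []
    if not isinstance(doc.get("title"), str):
        problems.append("title must be a string")
    outline = doc.get("outline")
    if not isinstance(outline, list):
        problems.append("outline must be a list")
        return problems
    for i, entry in enumerate(outline):
        if not isinstance(entry, dict):
            problems.append(f"outline[{i}] must be an object")
            continue
        if entry.get("level") not in ALLOWED_LEVELS:
            problems.append(f"outline[{i}].level must be H1, H2, or H3")
        text = entry.get("text")
        if not isinstance(text, str) or not text:
            problems.append(f"outline[{i}].text must be a non-empty string")
        page = entry.get("page")
        # bool is a subclass of int, so exclude it explicitly.
        if not isinstance(page, int) or isinstance(page, bool) or page < 0:
            problems.append(f"outline[{i}].page must be a non-negative integer")
    return problems
```

Calling `validate_outline(json.load(f))` on each file in `app/output/` would be one way to use it; returning a list of messages rather than raising makes it easy to report every problem in a file at once.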
| Tool/Library | Use Case |
|---|---|
| pdfplumber, PyMuPDF | Parsing PDF text + layout |
| sentence-transformers | Semantic relevance ranking (R1B) |
| scikit-learn, numpy | Similarity scoring, vector ops |
| transformers (optional) | Summarization (R1B, optional) |
| Docker | CPU-only, offline deployment |
| Python 3.10+ | Primary language |
All tools meet the offline, lightweight, and CPU-only constraints.
- Place `.pdf` files in the `app/input/` folder.
- Run the program: `python src/main.py`
- It will:
  - ✅ Extract title and headings (H1-H3)
  - ✅ Save output JSON in `app/output/` with the same filename
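The steps above amount to a simple batch loop. The following is a minimal sketch of that loop, not the actual contents of `src/main.py`: `process_folder` and the `extract` callback are hypothetical names, and the callback stands in for whatever the real extractor returns.

```python
import json
from pathlib import Path
from typing import Callable


def process_folder(
    input_dir: Path,
    output_dir: Path,
    extract: Callable[[Path], dict],
) -> list[Path]:
    """Run `extract` on every PDF in input_dir and write <name>.json files."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        outline = extract(pdf_path)  # hypothetical extractor callback
        # Same filename as the input, with a .json extension.
        out_path = output_dir / f"{pdf_path.stem}.json"
        out_path.write_text(
            json.dumps(outline, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        written.append(out_path)
    return written
```

Passing the extractor in as a callback keeps the loop testable without a real PDF parser, and `ensure_ascii=False` leaves any non-ASCII heading text readable in the output file.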
```
pdfplumber==0.10.2
PyMuPDF==1.23.1
```

Install locally with:

```
pip install -r requirements.txt
```

Our system simulates how a human visually parses a document rather than only scanning font sizes. Here's how we do it:
| # | Heuristic Rule | Signal Type |
|---|---|---|
| 1 | Title must appear in top 15-25% of page 0, based on Y-position | Layout + Visual |
| 2 | Font sizes are ranked dynamically per document (largest = H1) | Font Heuristic |
| 3 | Headings must be ≤ 3 lines and ≤ 120 characters | Content Filter |
| 4 | ALL CAPS or bold text ≠ heading unless layout supports it | Visual + Context |
| 5 | Boost semantic phrases like "Goals", "Summary", "Appendix" | NLP + Semantics |
| 6 | Skip content inside tables, forms, QA blocks | Layout Heuristic |
| 7 | Preserve section numbers like `2.1 Mission`; never split/truncate | Content Rule |
| 8 | Merge multi-line headings only if alignment and spacing match | Visual Merge |
| 9 | Prefer blocks with white space padding above and below | Structural Cue |
| 10 | Skip paragraph-like blocks that appear bold but aren't section-defining | Noise Filter |
| 11 | Repeated patterns (e.g., "Step 1", "Phase X") hint heading structure | Pattern Learning |
| 12 | Promote early headings when no clear H1 exists, to prevent outline starvation | Recovery Logic |
| 13 | Preserve all symbols/punctuation: no normalization (`Goals:`, not `Goals`) | Output Policy |
| 14 | Indented headings are allowed if visually distinct & top-aligned | Layout Analysis |
| 15 | Output must read like a Table of Contents, not just text with sizes | UX-Oriented Rule |
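Rules 2 and 3 can be illustrated in a few lines. This is a sketch under stated assumptions, not the code in `extractor.py`: the `Block` type and `assign_levels` function are hypothetical, the body font is assumed to be the size that covers the most characters, and only sizes larger than the body font are treated as heading candidates.

```python
from collections import Counter
from dataclasses import dataclass

MAX_CHARS = 120  # rule 3
MAX_LINES = 3    # rule 3


@dataclass
class Block:
    """Hypothetical text block: its text, font size, and page number."""
    text: str
    size: float
    page: int


def assign_levels(blocks: list[Block]) -> list[dict]:
    """Map the three largest above-body font sizes to H1, H2, H3 (rule 2)."""
    if not blocks:
        return []
    # Assumption: the body font is the size covering the most characters.
    chars_per_size: Counter = Counter()
    for b in blocks:
        chars_per_size[round(b.size, 1)] += len(b.text)
    body_size = chars_per_size.most_common(1)[0][0]
    # Rank the remaining larger sizes per document, largest first.
    heading_sizes = sorted(
        {round(b.size, 1) for b in blocks if round(b.size, 1) > body_size},
        reverse=True,
    )[:3]
    level_of = {size: f"H{i + 1}" for i, size in enumerate(heading_sizes)}
    outline = []
    for b in blocks:
        level = level_of.get(round(b.size, 1))
        if level is None:
            continue
        # Content filter: skip blocks too long to be a heading.
        if len(b.text) > MAX_CHARS or b.text.count("\n") + 1 > MAX_LINES:
            continue
        # Text is kept verbatim, punctuation included (rule 13).
        outline.append({"level": level, "text": b.text, "page": b.page})
    return outline
```

Ranking sizes per document, rather than using fixed point-size thresholds, is what lets the same logic handle a report set in 10 pt and a flyer set in 16 pt.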
Rules 16-35:

| # | Heuristic Rule | Signal Type |
|---|---|---|
| 16 | Ignore headers/footers repeated across pages | Layout Filter |
| 17 | Remove text with high frequency + small font across pages | Noise Control |
| 18 | Penalize left/right page margin-aligned content | Layout Heuristic |
| 19 | Titles with no sibling block nearby are considered isolated → boost score | Position Scoring |
| 20 | Headings often follow white space | White Space Rule |
| 21 | Pages with no detected headings: fallback to top font chunks | Recovery Logic |
| 22 | Prefer phrases with verbs/nouns over adjectives | NLP Patterning |
| 23 | Visually centered blocks on page 1 → strong title candidates | Title Heuristic |
| 24 | Avoid text with large line-height | Visual Check |
| 25 | Penalize headings with multiple font styles in one line | Mixed Font Check |
| 26 | Limit each heading level to ≤ 30% of total blocks | Balance Check |
| 27 | Stop at 50 pages even if file is larger | Constraint Rule |
| 28 | Avoid headings ending with ellipses/colons (unless list intro) | Punctuation Rule |
| 29 | Heading must be larger or bolder than adjacent text blocks | Contrast Rule |
| 30 | Emphasize blocks that appear only once across document | Rarity Boost |
| 31 | Prefer headings that appear top-to-bottom sequentially | Logical Flow |
| 32 | Allow H2s inside H1s if indentation + size are justified | Nested Rule |
| 33 | Penalize headings shorter than 3 characters | Min-Length Guard |
| 34 | Promote aligned blocks with white space above & followed by body text | Composite Cue |
| 35 | Use weighted ensemble of heuristics + layout scoring | Final Scoring |
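As an illustration of rule 16, the sketch below drops lines that recur at the top or bottom of most pages. It is one possible approach, not the project's implementation: `drop_repeated_edges`, the `edge` window, and the 60% threshold are all assumptions chosen for the example.

```python
import math
import re
from collections import defaultdict


def _normalize(line: str) -> str:
    # Collapse digits so "Page 3" and "Page 4" count as the same pattern.
    return re.sub(r"\d+", "#", line.strip().lower())


def _edge_indices(n_lines: int, edge: int) -> set[int]:
    # Indices of the first and last `edge` lines on a page.
    return set(range(min(edge, n_lines))) | set(range(max(0, n_lines - edge), n_lines))


def drop_repeated_edges(
    pages: list[list[str]], edge: int = 1, min_fraction: float = 0.6
) -> list[list[str]]:
    """Remove top/bottom lines that repeat on at least min_fraction of pages."""
    seen_on: dict[str, set[int]] = defaultdict(set)
    for page_no, lines in enumerate(pages):
        for i in _edge_indices(len(lines), edge):
            seen_on[_normalize(lines[i])].add(page_no)
    # Require at least two pages so single-page documents are left untouched.
    threshold = max(2, math.ceil(len(pages) * min_fraction))
    repeated = {key for key, where in seen_on.items() if len(where) >= threshold}
    cleaned = []
    for lines in pages:
        edges = _edge_indices(len(lines), edge)
        cleaned.append(
            [
                line
                for i, line in enumerate(lines)
                if not (i in edges and _normalize(line) in repeated)
            ]
        )
    return cleaned
```

Only page edges are considered because digit normalization would otherwise treat mid-page patterns such as "Step 1" and "Step 2" as repeats, which would conflict with rule 11.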
Given:
- A user persona
- A task
- A set of PDFs
Surface and rank the most relevant sections, and optionally summarize them.
```json
{
  "metadata": {
    "persona": "Undergraduate Chemistry Student",
    "job": "Prepare for reaction kinetics exam",
    "documents": ["doc1.pdf", "doc2.pdf"],
    "timestamp": "2025-07-16T18:30:00Z"
  },
  "sections": [
    {
      "document": "doc1.pdf",
      "page": 4,
      "section_title": "Reaction Mechanisms",
      "importance_rank": 1
    }
  ],
  "subsections": [
    {
      "document": "doc1.pdf",
      "page": 4,
      "refined_text": "The SN1 reaction involves a two-step mechanism..."
    }
  ]
}
```

| File | Description |
|---|---|
| `parser.py` | Splits PDF into logical chunks |
| `ranker.py` | Ranks sections via semantic similarity |
| `summarizer.py` | Summarizes sections (optional) |
| `main.py` | Pipeline orchestrator |
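To make the ranking step concrete, here is a dependency-free sketch of scoring sections against the persona and job. It is not the code in `ranker.py`: a plain bag-of-words cosine similarity is swapped in for the sentence-transformers embeddings the project lists, and `rank_sections` plus the `text` field on each input section are hypothetical.

```python
import math
import re
from collections import Counter


def _vector(text: str) -> Counter:
    # Bag-of-words stand-in for a sentence embedding (assumption for this sketch).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_sections(persona: str, job: str, sections: list[dict]) -> list[dict]:
    """Order sections by similarity to the persona + job and number them from 1."""
    query = _vector(f"{persona} {job}")
    scored = sorted(
        sections,
        key=lambda s: _cosine(query, _vector(f"{s['section_title']} {s['text']}")),
        reverse=True,
    )
    return [
        {
            "document": s["document"],
            "page": s["page"],
            "section_title": s["section_title"],
            "importance_rank": rank,
        }
        for rank, s in enumerate(scored, start=1)
    ]
```

Swapping `_vector` and `_cosine` for embedding calls would leave the ranking and output shaping unchanged, which is why the similarity function is kept separate from `rank_sections`.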
| Requirement | ✅ Met |
|---|---|
| CPU-only | ✅ |
| Offline execution | ✅ |
| ≤ 1GB model | ✅ |
| Runtime ≤ 60s | ✅ |
| Valid JSON Output | ✅ |
```
docker build --platform linux/amd64 -t round1a-extractor .
docker run --rm \
  -v $(pwd)/app/input:/app/input \
  -v $(pwd)/app/output:/app/output \
  --network none round1a-extractor
```

| Member | Role |
|---|---|
| Jahnavi | Lead Developer |
| Sahithi | Document Intelligence Engineer |
July 2025 · Adobe India Hackathon: Connecting the Dots

This project is licensed under the MIT License.

"We don't just extract, we understand. We don't just read, we connect." - Team DCODERZ