jahnavi-j9/adobe-hackathon-Round1A


🎯 Adobe India Hackathon 2025 – Connecting the Dots

"Rethink Reading. Rediscover Knowledge."

Welcome to our submission for the Adobe India Hackathon 2025.
Our project transforms static PDFs into dynamic, intelligent documents that can understand structure, surface insights, and connect the dots across knowledge sources, all offline.


🧠 Hackathon Overview

The hackathon consists of two technical rounds:

| Round | Focus |
| --- | --- |
| 🟦 Round 1A | Extract structured document outlines (Title, H1–H3) |
| 🟩 Round 1B | Surface sections relevant to a specific persona |

✨ Features Summary

| Feature | Status |
| --- | --- |
| Extract Title, H1–H3 Headings | ✅ Implemented |
| Font-size & Layout-based Logic | ✅ Implemented |
| Structured JSON Output | ✅ Yes |
| Fully Offline (No Web Access) | ✅ Yes |
| CPU-Only Execution | ✅ Yes |
| Docker Support | ✅ Yes |
| Sample Input/Output Provided | ✅ Yes |
| Multilingual PDF Support | ⚙️ Planned |

🟦 Round 1A – PDF Outline Extractor

📌 Objective

Automatically extract the title and the H1, H2, and H3 headings, together with their page numbers, from any PDF of up to 50 pages, and output valid JSON that follows Adobe's spec.

✅ Sample Output Format

{
  "title": "Understanding AI",
  "outline": [
    { "level": "H1", "text": "Introduction", "page": 1 },
    { "level": "H2", "text": "What is AI?", "page": 2 },
    { "level": "H3", "text": "History of AI", "page": 3 }
  ]
}
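
A quick shape check can catch malformed output before submission. The sketch below validates only the fields visible in the sample above; it is not Adobe's official schema, and the function name `validate_outline` is a hypothetical helper introduced here for illustration.

```python
VALID_LEVELS = {"H1", "H2", "H3"}


def validate_outline(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape matches the sample."""
    problems = []
    if not isinstance(doc.get("title"), str):
        problems.append("'title' must be a string")
    outline = doc.get("outline")
    if not isinstance(outline, list):
        return problems + ["'outline' must be a list"]
    for i, entry in enumerate(outline):
        if not isinstance(entry, dict):
            problems.append(f"outline[{i}] must be an object")
            continue
        if entry.get("level") not in VALID_LEVELS:
            problems.append(f"outline[{i}].level must be one of H1, H2, H3")
        if not isinstance(entry.get("text"), str) or not entry["text"].strip():
            problems.append(f"outline[{i}].text must be a non-empty string")
        page = entry.get("page")
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(page, bool) or not isinstance(page, int) or page < 0:
            problems.append(f"outline[{i}].page must be a non-negative integer")
    return problems


sample = {
    "title": "Understanding AI",
    "outline": [
        {"level": "H1", "text": "Introduction", "page": 1},
        {"level": "H2", "text": "What is AI?", "page": 2},
        {"level": "H3", "text": "History of AI", "page": 3},
    ],
}
print(validate_outline(sample) or "outline shape looks valid")
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a generated file at once.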

📂 Folder Structure

Round1A/
├── app/
│   ├── input/                   # Input PDFs placed here (Docker-mounted)
│   └── output/                  # Output JSONs written here (Docker-mounted)
├── src/
│   ├── main.py                  # Entrypoint for processing all PDFs
│   ├── extractor.py             # Extracts title, H1, H2, H3 using font sizes
│   ├── utils.py                 # Helpers for reading PDF, writing JSON
│   └── config.py                # Thresholds/configs for heading detection
├── requirements.txt             # pdfplumber + PyMuPDF
├── Dockerfile                   # CPU-only, offline, AMD64-compliant
├── generate_dummy_pdf.py        # (Optional) Generate test PDFs with headings
├── sample.pdf                   # (Optional) A test input PDF
└── sample.json                  # (Optional) Expected output for sample.pdf

🧰 Tech Stack

| Tool/Library | Use Case |
| --- | --- |
| pdfplumber, PyMuPDF | Parsing PDF text + layout |
| sentence-transformers | Semantic relevance ranking (R1B) |
| scikit-learn, numpy | Similarity scoring, vector ops |
| transformers (optional) | Summarization (R1B, optional) |
| Docker | CPU-only, offline deployment |
| Python 3.10+ | Primary language |

📌 All tools meet the offline, lightweight, and CPU-only constraints.


⚙️ How It Works (Round 1A)

  1. Place .pdf files in the app/input/ folder.

  2. Run the program:

python src/main.py

  3. The program then:
  • ✅ Extracts the title and headings (H1–H3)
  • ✅ Saves one output JSON per PDF in app/output/, using the same base filename as the input
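
The steps above amount to a loop over the input folder. The following is a minimal sketch of that loop, not the actual contents of src/main.py: `extract_outline` is a hypothetical stand-in for the real extractor, and `process_all` is a name chosen here for illustration.

```python
import json
from pathlib import Path


def extract_outline(pdf_path: Path) -> dict:
    """Hypothetical placeholder for the real extractor; returns an empty outline."""
    return {"title": pdf_path.stem, "outline": []}


def process_all(input_dir: Path, output_dir: Path) -> list[Path]:
    """Write one <name>.json into output_dir for every <name>.pdf in input_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        result = extract_outline(pdf_path)
        out_path = output_dir / f"{pdf_path.stem}.json"
        # ensure_ascii=False keeps symbols and punctuation in headings unchanged.
        out_path.write_text(
            json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        written.append(out_path)
    return written
```

Calling `process_all(Path("app/input"), Path("app/output"))` would mirror the folder layout described in this README.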

📦 Dependencies

pdfplumber==0.10.2  
PyMuPDF==1.23.1

Install locally with:

pip install -r requirements.txt

🧠 Refined Heuristics for Heading Detection

Our system simulates how a human visually parses a document instead of only scanning font sizes. The rules we apply are:

| #️⃣ | Heuristic Rule | Signal Type |
| --- | --- | --- |
| 1 | Title must appear in the top 15–25% of the first page (page 0), based on Y-position | Layout + Visual |
| 2 | Font sizes are ranked dynamically per document (largest = H1) | Font Heuristic |
| 3 | Headings must be ≤ 3 lines and ≤ 120 characters | Content Filter |
| 4 | ALL CAPS or bold text ≠ heading unless layout supports it | Visual + Context |
| 5 | Boost semantic phrases like "Goals", "Summary", "Appendix" | NLP + Semantics |
| 6 | Skip content inside tables, forms, QA blocks | Layout Heuristic |
| 7 | Preserve section numbers like 2.1 Mission; never split or truncate | Content Rule |
| 8 | Merge multi-line headings only if alignment and spacing match | Visual Merge |
| 9 | Prefer blocks with white space padding above and below | Structural Cue |
| 10 | Skip paragraph-like blocks that appear bold but aren't section-defining | Noise Filter |
| 11 | Repeated patterns (e.g., "Step 1", "Phase X") hint heading structure | Pattern Learning |
| 12 | Promote early headings when no clear H1 exists, to prevent outline starvation | Recovery Logic |
| 13 | Preserve all symbols/punctuation: no normalization (Goals:, not Goals) | Output Policy |
| 14 | Indented headings are allowed if visually distinct & top-aligned | Layout Analysis |
| 15 | Output must read like a Table of Contents, not just text with sizes | UX-Oriented Rule |
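
Rules 2 and 3 above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in extractor.py: it assumes each text span is a dict with a `size` key, treats the most frequent font size as body text, and uses the hypothetical names `rank_font_sizes` and `passes_content_filter`.

```python
from collections import Counter


def rank_font_sizes(spans: list[dict]) -> dict[float, str]:
    """Map the largest non-body font sizes in one document to H1, H2, H3.

    Assumption: the most frequent size is body text, and only sizes
    strictly larger than it are eligible as heading sizes.
    """
    sizes = [round(span["size"], 1) for span in spans]
    if not sizes:
        return {}
    body_size = Counter(sizes).most_common(1)[0][0]
    heading_sizes = sorted({s for s in sizes if s > body_size}, reverse=True)
    return {size: f"H{i}" for i, size in enumerate(heading_sizes[:3], start=1)}


def passes_content_filter(text: str, line_count: int) -> bool:
    """Rule 3: a heading spans at most 3 lines and at most 120 characters."""
    stripped = text.strip()
    return line_count <= 3 and 0 < len(stripped) <= 120


spans = [
    {"text": "Introduction", "size": 18.0},
    {"text": "Body text one", "size": 11.0},
    {"text": "What is AI?", "size": 14.0},
    {"text": "Body text two", "size": 11.0},
    {"text": "History of AI", "size": 12.0},
    {"text": "Body text three", "size": 11.0},
]
levels = rank_font_sizes(spans)
for span in spans:
    level = levels.get(round(span["size"], 1))
    if level and passes_content_filter(span["text"], line_count=1):
        print(level, span["text"])
```

Ranking sizes per document, instead of using fixed point-size thresholds, lets the same logic handle PDFs whose body text is set at different sizes.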
🔍 Rules 16–35

| #️⃣ | Heuristic Rule | Signal Type |
| --- | --- | --- |
| 16 | Ignore headers/footers repeated across pages | Layout Filter |
| 17 | Remove text with high frequency + small font across pages | Noise Control |
| 18 | Penalize left/right page margin-aligned content | Layout Heuristic |
| 19 | Titles with no sibling block nearby are considered isolated → boost score | Position Scoring |
| 20 | Headings often follow white space | White Space Rule |
| 21 | Pages with no detected headings: fallback to top font chunks | Recovery Logic |
| 22 | Prefer phrases with verbs/nouns over adjectives | NLP Patterning |
| 23 | Visually centered blocks on page 1 → strong title candidates | Title Heuristic |
| 24 | Avoid text with large line-height | Visual Check |
| 25 | Penalize headings with multiple font styles in one line | Mixed Font Check |
| 26 | Limit each heading level to ≤ 30% of total blocks | Balance Check |
| 27 | Stop at 50 pages even if file is larger | Constraint Rule |
| 28 | Avoid headings ending with ellipses/colons (unless list intro) | Punctuation Rule |
| 29 | Heading must be larger or bolder than adjacent text blocks | Contrast Rule |
| 30 | Emphasize blocks that appear only once across document | Rarity Boost |
| 31 | Prefer headings that appear top-to-bottom sequentially | Logical Flow |
| 32 | Allow H2s inside H1s if indentation + size are justified | Nested Rule |
| 33 | Penalize headings shorter than 3 characters | Min-Length Guard |
| 34 | Promote aligned blocks with white space above & followed by body text | Composite Cue |
| 35 | Use weighted ensemble of heuristics + layout scoring | Final Scoring |
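
To make rules 16, 33, and 35 concrete, here is one way a repeated-line filter and a weighted score could fit together. The weights, the threshold, the `Block` fields, and all function names are assumptions made for this sketch; this README does not state the real values.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    font_rank: float    # 1.0 for the largest font in the document, 0.0 for body size
    space_above: float  # normalized white space above the block, 0.0 to 1.0
    is_bold: bool


# Hypothetical weights and threshold, chosen only for illustration.
WEIGHTS = {"font": 0.5, "space": 0.3, "bold": 0.2}
THRESHOLD = 0.5


def find_repeated_lines(pages: list[list[str]], min_fraction: float = 0.6) -> set[str]:
    """Rule 16: text that recurs on most pages is likely a running header or footer."""
    if len(pages) < 2:
        return set()
    counts = Counter()
    for lines in pages:
        # A set per page so a line repeated within one page counts once.
        counts.update({line.strip() for line in lines if line.strip()})
    return {text for text, n in counts.items() if n >= min_fraction * len(pages)}


def heading_score(block: Block, repeated: set[str]) -> float:
    """Rule 35: combine individual cues into a single weighted score."""
    if block.text.strip() in repeated:
        return 0.0
    score = (
        WEIGHTS["font"] * block.font_rank
        + WEIGHTS["space"] * block.space_above
        + WEIGHTS["bold"] * float(block.is_bold)
    )
    if len(block.text.strip()) < 3:  # rule 33: penalize very short candidates
        score *= 0.5
    return score


def is_heading(block: Block, repeated: set[str]) -> bool:
    return heading_score(block, repeated) >= THRESHOLD
```

With these assumed weights, a bold paragraph at body size scores well below the threshold, which is the behavior rule 10 asks for.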

🟩 Round 1B – Persona-Driven Document Intelligence

🎯 Objective

Given:

  • A user persona
  • A task
  • A set of PDFs

➡️ Surface and rank the most relevant sections, and optionally summarize them.


📥 Input/Output Example

{
  "metadata": {
    "persona": "Undergraduate Chemistry Student",
    "job": "Prepare for reaction kinetics exam",
    "documents": ["doc1.pdf", "doc2.pdf"],
    "timestamp": "2025-07-16T18:30:00Z"
  },
  "sections": [
    {
      "document": "doc1.pdf",
      "page": 4,
      "section_title": "Reaction Mechanisms",
      "importance_rank": 1
    }
  ],
  "subsections": [
    {
      "document": "doc1.pdf",
      "page": 4,
      "refined_text": "The SN1 reaction involves a two-step mechanism..."
    }
  ]
}

⚙️ Planned Modules (Round 1B)

| File | Description |
| --- | --- |
| parser.py | Splits PDF into logical chunks |
| ranker.py | Ranks sections via semantic similarity |
| summarizer.py | Summarizes sections (optional) |
| main.py | Pipeline orchestrator |
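
The ranking step planned for ranker.py can be illustrated without any model download. The sketch below swaps in a plain bag-of-words cosine similarity for the sentence-transformers embeddings named in the tech stack, so it shows the ranking flow but not the semantic quality. The `text` input field and the function names are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase word counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_sections(persona: str, job: str, sections: list[dict]) -> list[dict]:
    """Order sections by similarity to persona + job and assign importance_rank."""
    query = tokenize(f"{persona} {job}")
    scored = sorted(
        sections,
        key=lambda s: cosine(query, tokenize(f"{s['section_title']} {s['text']}")),
        reverse=True,
    )
    return [
        {
            "document": s["document"],
            "page": s["page"],
            "section_title": s["section_title"],
            "importance_rank": rank,
        }
        for rank, s in enumerate(scored, start=1)
    ]


sections = [
    {"document": "doc2.pdf", "page": 9, "section_title": "Lab Safety",
     "text": "Wear goggles and gloves in the lab"},
    {"document": "doc1.pdf", "page": 4, "section_title": "Reaction Mechanisms",
     "text": "Reaction kinetics and rate laws for the exam"},
]
ranked = rank_sections(
    "Undergraduate Chemistry Student", "Prepare for reaction kinetics exam", sections
)
print(ranked[0]["section_title"])
```

Replacing `tokenize` and `cosine` with embedding vectors and a vector similarity would leave `rank_sections` and the output shape unchanged.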

✅ Round 1B Constraints Coverage

| Requirement | Met |
| --- | --- |
| CPU-only | ✅ |
| Offline execution | ✅ |
| Model size ≤ 1 GB | ✅ |
| Runtime ≤ 60 s | ✅ |
| Valid JSON Output | ✅ |

🐳 Docker Setup (Round 1A)

docker build --platform linux/amd64 -t round1a-extractor .

docker run --rm \
  -v "$(pwd)/app/input:/app/input" \
  -v "$(pwd)/app/output:/app/output" \
  --network none round1a-extractor

👥 Team DCODERZ

| Member | Role |
| --- | --- |
| Jahnavi | Lead Developer |
| Sahithi | Document Intelligence Engineer |

📅 July 2025 🏁 Adobe India Hackathon – Connecting the Dots


📎 License

This project is licensed under the MIT License.


"We don't just extract; we understand. We don't just read; we connect." – Team DCODERZ

About

An intelligent, offline-first PDF understanding system built for the Adobe India Hackathon 2025. Extracts structured outlines and semantically ranks sections based on persona-driven tasks, fully Dockerized and scoring-ready.
