Serli/pdf-to-markdown-service

 
 

📄 PDF to Markdown Service

This project is a simple Node.js API that converts a remote PDF document into Markdown text.

🚀 Features

  • Exposes an HTTP API with two endpoints:
    • GET /convert?url=<PDF_URL>
    • GET /<encoded_PDF_URL> (the PDF URL is percent-encoded and passed as the path, e.g. https://monsuperservice.com/https%3A%2F%2Fexample.com%2Ffile.pdf)
  • Extracts text from PDF using pdfjs-dist
  • Converts headings, lists, and paragraphs to Markdown
  • Handles multi-page documents (inserts --- as a page separator)
  • CORS enabled (open)
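
To illustrate how the two endpoint forms listed above can map to the same target PDF URL, here is a minimal sketch. It is not the service's actual routing code: the function name resolvePdfUrl is hypothetical, and restricting targets to http/https is an assumption.

```javascript
// Hypothetical helper: extract the target PDF URL from an incoming request URL.
// Handles both forms from the feature list:
//   /convert?url=<PDF_URL>   and   /<encoded_PDF_URL>
function resolvePdfUrl(requestUrl) {
  try {
    // The base is only needed so that a path-only request URL can be parsed.
    const parsed = new URL(requestUrl, "http://localhost");

    // searchParams.get() already percent-decodes the query value;
    // the path form has to be decoded explicitly.
    const candidate =
      parsed.pathname === "/convert"
        ? parsed.searchParams.get("url")
        : decodeURIComponent(parsed.pathname.slice(1));

    if (!candidate) return null;

    // Assumption: only absolute http(s) URLs are accepted as targets.
    const target = new URL(candidate);
    const isHttp = target.protocol === "http:" || target.protocol === "https:";
    return isHttp ? target.href : null;
  } catch {
    // Malformed percent-encoding or a non-absolute URL ends up here.
    return null;
  }
}
```

Returning null for anything that is not an absolute http(s) URL lets a caller answer with a 400 response instead of attempting a download.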

📦 Installation

git clone <repo_url>
cd pdf-to-markdown-service
npm install

▶️ Usage

Start the server:

npm start

By default, the server listens on http://localhost:3000.

Example requests

# Convert via query parameter
curl -L "http://localhost:3000/convert?url=https%3A%2F%2Fwww.w3.org%2FWAI%2FER%2Ftests%2Fxhtml%2Ftestfiles%2Fresources%2Fpdf%2Fdummy.pdf"

# Convert via encoded URL in path
curl -L "http://localhost:3000/https%3A%2F%2Fwww.w3.org%2FWAI%2FER%2Ftests%2Fxhtml%2Ftestfiles%2Fresources%2Fpdf%2Fdummy.pdf"
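
The same requests can be made from JavaScript. The sketch below assumes the server is running at the default address and that the response body is the Markdown text; the helper names (buildQueryUrl, buildPathUrl, convertPdf) are hypothetical. It relies on the global fetch available in Node.js >= 20.

```javascript
// Default address from the Usage section; adjust if the server runs elsewhere.
const BASE_URL = "http://localhost:3000";

// Build the request URL for GET /convert?url=<PDF_URL>.
function buildQueryUrl(pdfUrl, base = BASE_URL) {
  return `${base}/convert?url=${encodeURIComponent(pdfUrl)}`;
}

// Build the request URL for GET /<encoded_PDF_URL>.
function buildPathUrl(pdfUrl, base = BASE_URL) {
  return `${base}/${encodeURIComponent(pdfUrl)}`;
}

// Fetch the converted document. Assumption: the body is plain Markdown text.
async function convertPdf(pdfUrl) {
  const response = await fetch(buildQueryUrl(pdfUrl));
  if (!response.ok) {
    throw new Error(`Conversion failed with HTTP ${response.status}`);
  }
  return response.text();
}
```

encodeURIComponent is what produces the %3A and %2F sequences seen in the curl examples, so the PDF URL survives intact as a single query value or path segment.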

🛠 Requirements

  • Node.js >= 20
  • Internet access to fetch remote PDFs

⚠️ Limitations

  • Only extracts embedded text (no OCR for scanned PDFs)
  • Complex layouts (tables, multi-column text) may be simplified
  • Heading detection is heuristic (based on relative font size)
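
To show what a font-size-based heading heuristic can look like, and why it is only approximate, here is a minimal sketch. It is not the project's actual algorithm: the input shape ({ text, fontSize } per line), the function names, and the ratio thresholds are all assumptions made for illustration. It also shows pages being joined with the --- separator.

```javascript
// Assumed input: one object per extracted text line, e.g. { text, fontSize }.
// The body font size is estimated as the most frequent size on the page.
function mostCommonFontSize(lines) {
  const counts = new Map();
  for (const { fontSize } of lines) {
    counts.set(fontSize, (counts.get(fontSize) ?? 0) + 1);
  }
  let best = 0;
  let bestCount = -1;
  for (const [size, count] of counts) {
    if (count > bestCount) {
      best = size;
      bestCount = count;
    }
  }
  return best;
}

// Map a line's size relative to the body size onto a Markdown heading prefix.
// The thresholds are hypothetical and would need tuning on real documents.
function headingPrefix(fontSize, bodySize) {
  const ratio = fontSize / bodySize;
  if (ratio >= 1.6) return "# ";
  if (ratio >= 1.3) return "## ";
  if (ratio >= 1.1) return "### ";
  return "";
}

// Convert one page of lines to Markdown, one block per line.
function pageToMarkdown(lines) {
  const bodySize = mostCommonFontSize(lines);
  return lines
    .map(({ text, fontSize }) => headingPrefix(fontSize, bodySize) + text.trim())
    .join("\n\n");
}

// Join per-page Markdown with a horizontal rule as the page separator.
function joinPages(pages) {
  return pages.join("\n\n---\n\n");
}
```

A heuristic of this kind misclassifies pages where large text dominates (for example a title page), since the "body" size is then estimated from the headings themselves.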

Made with ❤️ in Node.js
