PDFOxide is the complete PDF toolkit. One library for extracting, creating, and editing PDFs with a unified API. Built on a Rust core for maximum performance.
```bash
pip install pdf_oxide
```

The `Pdf` class is your main entry point for all PDF operations:
```python
from pdf_oxide import Pdf

# Create from Markdown
pdf = Pdf.from_markdown("# Hello World\n\nThis is a PDF.")
pdf.save("output.pdf")
```

````python
from pdf_oxide import Pdf

pdf = Pdf.from_markdown("""
# Report Title

## Introduction

This is **bold** and *italic* text.

- Item 1
- Item 2
- Item 3

## Code Example

```python
print("Hello, World!")
```
""")
pdf.save("report.pdf")
````
### From HTML
```python
from pdf_oxide import Pdf

pdf = Pdf.from_html("""
<h1>Invoice</h1>
<p>Thank you for your purchase.</p>
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Widget</td><td>$10.00</td></tr>
</table>
""")
pdf.save("invoice.pdf")
```
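When the invoice rows come from data rather than a hand-written string, the HTML can be assembled first and then handed to `Pdf.from_html`. The sketch below uses only the Python standard library. `build_invoice_html` is a hypothetical helper name, not part of pdf_oxide, and the final `Pdf.from_html` call is left as a comment so the snippet runs without the library installed.

```python
from html import escape


def build_invoice_html(title, note, rows):
    """Return an HTML string with a heading, a paragraph, and an Item/Price table.

    `rows` is an iterable of (item, price) string pairs. Values are escaped so
    characters such as & or < in item names do not break the markup.
    """
    body_rows = "\n".join(
        f"<tr><td>{escape(item)}</td><td>{escape(price)}</td></tr>"
        for item, price in rows
    )
    return (
        f"<h1>{escape(title)}</h1>\n"
        f"<p>{escape(note)}</p>\n"
        "<table>\n"
        "<tr><th>Item</th><th>Price</th></tr>\n"
        f"{body_rows}\n"
        "</table>\n"
    )


html = build_invoice_html(
    "Invoice",
    "Thank you for your purchase.",
    [("Widget", "$10.00"), ("Nuts & Bolts", "$2.50")],
)
print(html)

# With pdf_oxide installed, the string can be passed on as in the example above:
# from pdf_oxide import Pdf
# Pdf.from_html(html).save("invoice.pdf")
```

Escaping the cell values matters mostly for user-supplied names and descriptions; fixed strings such as `$10.00` pass through unchanged.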
```python
from pdf_oxide import Pdf

# Create from plain text
pdf = Pdf.from_text("Simple plain text document.\n\nWith paragraphs.")
pdf.save("notes.pdf")
```

```python
from pdf_oxide import Pdf

# Single image
pdf = Pdf.from_image("photo.jpg")
pdf.save("photo.pdf")

# Multiple images (one per page)
album = Pdf.from_images(["page1.jpg", "page2.png", "page3.jpg"])
album.save("album.pdf")
```

```python
from pdf_oxide import PdfDocument

# Open existing PDF (path can be str or pathlib.Path)
doc = PdfDocument("document.pdf")

# Or use as a context manager (a separate name keeps `doc` above usable afterwards)
with PdfDocument("document.pdf") as pdf_doc:
    text = pdf_doc.extract_text(0)
    print(f"Text: {text}")
    markdown = pdf_doc.to_markdown(0)
    print(f"Pages: {pdf_doc.page_count()}")

# Extract text from page 0
text = doc.extract_text(0)
print(f"Text: {text}")

# Convert to Markdown
markdown = doc.to_markdown(0)
print(f"Markdown:\n{markdown}")

# Get page count
print(f"Pages: {doc.page_count()}")
```

For full control over PDF creation, use `PdfBuilder`:
```python
from pdf_oxide import PdfBuilder, PageSize

pdf = (PdfBuilder()
    .title("Annual Report 2025")
    .author("Company Inc.")
    .subject("Financial Summary")
    .page_size(PageSize.A4)
    .margins(72.0, 72.0, 72.0, 72.0)  # 1 inch margins
    .font_size(11.0)
    .from_markdown("# Annual Report\n\n..."))
pdf.save("annual-report.pdf")
```
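The margin values above are in PDF points, where 72 points equal one inch. For layouts specified in inches or millimetres, a small conversion helper avoids hand arithmetic. This is a stdlib-only sketch; `inches_to_points` and `mm_to_points` are hypothetical helper names, not pdf_oxide functions.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def inches_to_points(inches):
    """Convert inches to PDF points (1 in = 72 pt)."""
    return inches * POINTS_PER_INCH


def mm_to_points(mm):
    """Convert millimetres to PDF points (25.4 mm = 1 in = 72 pt)."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


one_inch = inches_to_points(1.0)
twenty_mm = mm_to_points(20.0)
print(f"1 in = {one_inch} pt, 20 mm = {twenty_mm:.2f} pt")
```

The results can be passed straight to the builder, for example `.margins(twenty_mm, twenty_mm, twenty_mm, twenty_mm)` for a uniform 20 mm margin.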
```python
from pdf_oxide import Pdf

pdf = Pdf.from_markdown("# Confidential Document")

# Simple password protection (AES-256)
pdf.save_encrypted("secure.pdf", "user-password", "owner-password")
```

```python
from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
text = doc.extract_text(0)
```

```python
from pdf_oxide import PdfDocument, ConversionOptions

doc = PdfDocument("paper.pdf")
options = ConversionOptions(
    detect_headings=True,
    detect_lists=True,
    embed_images=True,
)
markdown = doc.to_markdown(0, options)
```

```python
from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")

# Extract text from all pages
all_text = doc.extract_text_all()

# Convert entire document to Markdown
all_markdown = doc.to_markdown_all()
```

Convert DOCX, XLSX, and PPTX files to PDF:
```python
from pdf_oxide import OfficeConverter

# Auto-detect format
converter = OfficeConverter()

# Convert Word document
converter.convert("report.docx", "report.pdf")

# Convert Excel spreadsheet
converter.convert("data.xlsx", "data.pdf")

# Convert PowerPoint presentation
converter.convert("slides.pptx", "slides.pdf")
```

```python
from pdf_oxide import PdfDocument

doc = PdfDocument("document.pdf")
images = doc.extract_images(0)
for i, img in enumerate(images):
    img.save(f"image_{i}.png")
```

```python
from pdf_oxide import PdfDocument, ConversionOptions

doc = PdfDocument("paper.pdf")
options = ConversionOptions(embed_images=True)

# Images embedded as base64 data URIs
html = doc.to_html(0, options)
```

For a comprehensive guide covering model selection, configuration reference, resize strategies, and troubleshooting, see the OCR Guide.
The OCR feature requires heavy machine learning dependencies (ONNX Runtime) and is optional.

```bash
# Recommended: Install with OCR support
pip install "pdf_oxide[ocr]"

# Or build from source with OCR
maturin develop --features python,ocr
```

Troubleshooting: If you see `RuntimeError: OCR feature not enabled`, the library was installed without OCR support. Reinstall using the `[ocr]` extra above.

Quick start (download the recommended models):

```bash
./scripts/setup_ocr_models.sh
```

PDFOxide supports PaddleOCR v3, v4, and v5 models. You can mix detection and recognition models from different versions.

| Combination | Detection | Recognition | English Accuracy | Total Size |
|---|---|---|---|---|
| V4 det + V5 rec (recommended) | ch_PP-OCRv4_det | en_PP-OCRv5_mobile_rec | Best | ~12.5 MB |
| V4 det + V4 rec | ch_PP-OCRv4_det | en_PP-OCRv4_rec | Good | ~12.4 MB |
| V5 det + V5 rec | PP-OCRv5_server_det | en_PP-OCRv5_mobile_rec | Good (different errors) | ~96 MB |
| V3 det + V3 rec | en_PP-OCRv3_det | en_PP-OCRv3_rec | Fair | ~11 MB |

The V4 detection + V5 recognition combination gives the best results for English documents: V4 detection reliably segments text lines, while V5 recognition has the highest character-level accuracy.
Manual download:

```bash
# Recommended: V4 detection + V5 recognition
# Make sure the target directory exists before downloading
mkdir -p .models

# Detection (4.7 MB):
curl -L https://huggingface.co/deepghs/paddleocr/resolve/main/det/ch_PP-OCRv4_det/model.onnx -o .models/det.onnx

# Recognition (7.8 MB):
curl -L https://huggingface.co/monkt/paddleocr-onnx/resolve/main/languages/english/rec.onnx -o .models/rec.onnx

# Dictionary (must include space as last entry):
curl -L https://huggingface.co/monkt/paddleocr-onnx/resolve/main/languages/english/dict.txt -o .models/en_dict.txt
echo " " >> .models/en_dict.txt
```
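The `echo` step appends a line unconditionally, so running it a second time on the same file adds a duplicate entry. A small idempotent check in Python avoids that. This is a stdlib-only sketch: `ensure_trailing_space_entry` is a hypothetical helper, and it assumes the dictionary is a UTF-8 text file with one entry per line, which is not verified here against the actual model files.

```python
import tempfile
from pathlib import Path


def ensure_trailing_space_entry(dict_path):
    """Make sure the last entry of the dictionary file is a single space.

    Returns True if the file was modified, False if the entry was already there.
    """
    path = Path(dict_path)
    lines = path.read_text(encoding="utf-8").split("\n")
    # A file ending in a newline yields one empty trailing element; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    if lines and lines[-1] == " ":
        return False
    lines.append(" ")
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return True


# Demonstration on a temporary file rather than the real .models/en_dict.txt
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "en_dict.txt"
    demo.write_text("a\nb\n", encoding="utf-8")
    first = ensure_trailing_space_entry(demo)
    second = ensure_trailing_space_entry(demo)
    final_lines = demo.read_text(encoding="utf-8").split("\n")

print(first, second, final_lines)
```

For the real dictionary, call `ensure_trailing_space_entry(".models/en_dict.txt")` after the download instead of the `echo` line.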
```python
from pdf_oxide import PdfDocument, OcrEngine, OcrConfig

# Create OCR engine (default config works with recommended V4 det + V5 rec models)
engine = OcrEngine(
    det_model_path=".models/det.onnx",
    rec_model_path=".models/rec.onnx",
    dict_path=".models/en_dict.txt",
)

# Extract text using OCR
doc = PdfDocument("scanned.pdf")
text = doc.extract_text_ocr(page=0, engine=engine)
print(text)
```

```python
doc = PdfDocument("scanned.pdf")
for page in range(doc.page_count()):
    text = doc.extract_text_ocr(page=page, engine=engine)
    if text.strip():
        print(f"--- Page {page + 1} ---")
        print(text)
```

If you use the full PP-OCRv5 stack (v5 detection + v5 recognition), pass `use_v5=True` to `OcrConfig`. This preserves the original image resolution, which the larger v5 detection model needs, instead of downscaling to 960px:
```python
config = OcrConfig(use_v5=True)
engine = OcrEngine(
    det_model_path="v5_det.onnx",
    rec_model_path="v5_rec.onnx",
    dict_path="v5_dict.txt",
    config=config,
)
```

Note: The `OcrEngine` is reusable: create it once and pass it to multiple `extract_text_ocr` calls. OCR requires the ONNX Runtime shared library, `libonnxruntime.so` (v1.23+), to be available at runtime (via `LD_LIBRARY_PATH` or a system install).
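Because the engine is reusable, one common pattern for mixed documents is to try native text extraction first and run OCR only on pages that come back empty. The sketch below assumes only the `page_count`, `extract_text`, and `extract_text_ocr` calls shown in this guide. `extract_with_ocr_fallback` is a hypothetical helper, and `_FakeDoc` is a stand-in so the snippet runs without pdf_oxide or OCR models.

```python
def extract_with_ocr_fallback(doc, engine):
    """Return a list of (page_index, text, used_ocr) tuples.

    Native extraction is tried first; OCR runs only when a page yields
    no non-whitespace text.
    """
    results = []
    for page in range(doc.page_count()):
        text = doc.extract_text(page)
        used_ocr = False
        if not text.strip():
            text = doc.extract_text_ocr(page=page, engine=engine)
            used_ocr = True
        results.append((page, text, used_ocr))
    return results


class _FakeDoc:
    """Stand-in for PdfDocument: page 0 has a text layer, page 1 is a scan."""

    def page_count(self):
        return 2

    def extract_text(self, page):
        return "Native text" if page == 0 else "  \n"

    def extract_text_ocr(self, page, engine):
        return f"OCR text for page {page}"


results = extract_with_ocr_fallback(_FakeDoc(), engine=None)
for page, text, used_ocr in results:
    print(page, "ocr" if used_ocr else "native", repr(text))
```

With the real library, pass a `PdfDocument` and the `OcrEngine` created earlier in place of the stand-ins. Skipping OCR on pages that already have a text layer keeps the slower recognition step for the pages that need it.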
Beyond plain text, PDFOxide can extract structured content from pages:
```python
from pdf_oxide import PdfDocument

doc = PdfDocument("document.pdf")

# 1. Scoped extraction from a specific area (v0.3.14)
# Area: (x, y, width, height) in points
header = doc.within(0, (0, 700, 612, 92)).extract_text()

# 2. Text spans with font info, position, and style
spans = doc.extract_spans(0)
for span in spans:
    print(f"{span.text} - {span.font_name} {span.font_size}pt")

# 3. Word-level extraction (v0.3.14)
words = doc.extract_words(0)
for w in words:
    print(f"Word: {w.text} at {w.bbox}")
    # Access character metadata for the word
    # print(w.chars[0].font_name)

# Optional: override the adaptive word gap threshold (in PDF points).
# Smaller values split more aggressively; useful for dense forms.
words = doc.extract_words(0, word_gap_threshold=2.5)

# 4. Line-level extraction (v0.3.14)
lines = doc.extract_text_lines(0)
for line in lines:
    print(f"Line: {line.text}")

# Optional: override word and/or line gap thresholds (in PDF points).
lines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0)

# 5. Inspect computed layout params before overriding
params = doc.page_layout_params(0)
print(f"Adaptive word gap: {params.word_gap_threshold:.1f}pt")
print(f"Adaptive line gap: {params.line_gap_threshold:.1f}pt")

# 6. Pre-tuned extraction profiles for different document types
from pdf_oxide import ExtractionProfile

profile = ExtractionProfile.form()
print(f"Profile: {profile.name}, word_margin_ratio={profile.word_margin_ratio}")

# Pass a profile to extraction methods to control how raw text is parsed
words = doc.extract_words(0, profile=ExtractionProfile.form())
lines = doc.extract_text_lines(0, profile=ExtractionProfile.academic())

# Combine profile with threshold overrides (profile controls span parsing,
# thresholds control word/line clustering)
words = doc.extract_words(0, word_gap_threshold=1.5, profile=ExtractionProfile.aggressive())

# 7. Image metadata
images = doc.extract_images(0)
for img in images:
    print(f"{img['width']}x{img['height']} {img['color_space']}")

# 8. Bookmarks / table of contents
outline = doc.get_outline()  # None if no outline
if outline:
    for item in outline:
        print(f"{item['title']} -> page {item.get('page')}")

# 9. Vector paths (lines, curves, shapes)
paths = doc.extract_paths(0)
for path in paths:
    print(f"bbox={path['bbox']}, stroke={path.get('stroke_color')}")
```

PDFOxide can extract, read, fill, and export PDF form field data (AcroForm fields).
```python
from pdf_oxide import PdfDocument

doc = PdfDocument("tax-form.pdf")
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name} ({f.field_type}) = {f.value}")
```

Each `FormField` has:

- `name` - fully qualified field name (e.g. `"topmostSubform[0].Page1[0].f1_01[0]"`)
- `field_type` - `"text"`, `"button"`, `"choice"`, `"signature"`
- `value` - current value (`str`, `bool`, or `None`)
- `flags` - field flags (read-only, required, etc.)
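For downstream processing it is often convenient to turn the field list into a plain dictionary, or to find text fields that are still blank. The sketch below relies only on the `name`, `field_type`, and `value` attributes listed above. `form_values` and `blank_text_fields` are hypothetical helpers, and `SimpleNamespace` objects stand in for real `FormField` instances so the snippet runs on its own.

```python
from types import SimpleNamespace


def form_values(fields):
    """Map each field's fully qualified name to its current value."""
    return {f.name: f.value for f in fields}


def blank_text_fields(fields):
    """Return names of text fields whose value is None or an empty string."""
    return [
        f.name
        for f in fields
        if f.field_type == "text" and (f.value is None or f.value == "")
    ]


# Stand-ins for the objects returned by doc.get_form_fields()
fields = [
    SimpleNamespace(name="employee_name", field_type="text", value="Jane Doe"),
    SimpleNamespace(name="wages", field_type="text", value=None),
    SimpleNamespace(name="retirement_plan", field_type="button", value=True),
]

values = form_values(fields)
missing = blank_text_fields(fields)
print(values)
print("Still blank:", missing)
```

With a real document, pass the result of `doc.get_form_fields()` to either helper.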
```python
doc = PdfDocument("w2.pdf")

# Read a field value
ssn = doc.get_form_field_value("topmostSubform[0].CopyA[0].f1_01[0]")
print(f"SSN: {ssn}")

# Fill fields
doc.set_form_field_value("employee_name", "Jane Doe")
doc.set_form_field_value("wages", "85000.00")
doc.set_form_field_value("retirement_plan", True)  # checkbox

# Save (values are persisted via incremental save)
doc.save("filled_w2.pdf")
```

Filled form field values appear inline in `extract_text` and `to_markdown`:
```python
doc = PdfDocument("filled_w2.pdf")

# Form values appear inline in extracted text
text = doc.extract_text(0)
print(text)  # "Jane Doe" appears where the name field is

# to_markdown includes form fields by default
md = doc.to_markdown(0, include_form_fields=True)

# Exclude form field values
md_clean = doc.to_markdown(0, include_form_fields=False)
```

```python
doc = PdfDocument("filled-form.pdf")

# Export as FDF
doc.export_form_data("form_data.fdf")

# Export as XFDF
doc.export_form_data("form_data.xfdf", format="xfdf")
```

- Reuse document objects - Opening a PDF has overhead, so reuse the object for multiple operations
- Use specific page extraction - `extract_text(page_num)` is faster than `extract_text_all()` if you only need some pages
- Disable features you don't need - Use `ConversionOptions` to skip heading detection, image extraction, etc.
```python
from pdf_oxide import PdfDocument, ConversionOptions

doc = PdfDocument("large.pdf")

# Fast extraction - minimal processing
options = ConversionOptions(
    detect_headings=False,
    detect_lists=False,
    embed_images=False,
)
text = doc.to_markdown(0, options)
```
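The first two tips combine naturally: open the document once and pull only the pages you need. A minimal sketch, assuming just the `page_count` and `extract_text` calls used throughout this guide. `extract_selected_pages` is a hypothetical helper, and `_FakeDoc` is a stand-in so the snippet runs without pdf_oxide.

```python
def extract_selected_pages(doc, pages):
    """Extract text for the given zero-based page indices from one open document.

    Indices outside the document are skipped rather than raising.
    """
    total = doc.page_count()
    return {p: doc.extract_text(p) for p in pages if 0 <= p < total}


class _FakeDoc:
    """Stand-in for PdfDocument that also counts extract_text calls."""

    def __init__(self, page_texts):
        self._page_texts = page_texts
        self.calls = 0

    def page_count(self):
        return len(self._page_texts)

    def extract_text(self, page):
        self.calls += 1
        return self._page_texts[page]


doc = _FakeDoc(["cover", "contents", "chapter 1", "chapter 2"])
selected = extract_selected_pages(doc, [0, 2, 9])
print(selected)
print(f"extract_text called {doc.calls} times")
```

Only the requested in-range pages are touched, so the cost scales with the selection rather than with the size of the document.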
```python
from pdf_oxide import PdfDocument, PdfError

try:
    doc = PdfDocument("document.pdf")
    text = doc.extract_text(0)
except PdfError as e:
    print(f"PDF error: {e}")
except FileNotFoundError:
    print("File not found")
```

See the `examples/` directory for complete working examples.
Extract text from all PDFs in a folder:
```python
from pdf_oxide import PdfDocument
from pathlib import Path

for pdf_path in Path("documents").glob("*.pdf"):
    # PdfDocument accepts pathlib.Path directly
    with PdfDocument(pdf_path) as doc:
        text = doc.extract_text_all()

    output_path = pdf_path.with_suffix(".txt")
    output_path.write_text(text)
    print(f"Extracted: {pdf_path.name}")
```

Batch convert Markdown to PDF:
```python
from pdf_oxide import Pdf
from pathlib import Path

for md_path in Path("notes").glob("*.md"):
    content = md_path.read_text()
    pdf = Pdf.from_markdown(content)
    output_path = md_path.with_suffix(".pdf")
    pdf.save(str(output_path))
    print(f"Created: {output_path.name}")
```

- API Reference - Full API documentation
- PDF Creation Guide - Advanced creation options
- GitHub Issues - Report bugs or request features