Getting Started with PDFOxide (Python)

PDFOxide is the complete PDF toolkit. One library for extracting, creating, and editing PDFs with a unified API. Built on a Rust core for maximum performance.

Installation

pip install pdf_oxide

Quick Start - The Unified `Pdf` API

The Pdf class is your main entry point for all PDF operations:

from pdf_oxide import Pdf

# Create from Markdown
pdf = Pdf.from_markdown("# Hello World\n\nThis is a PDF.")
pdf.save("output.pdf")

Creating PDFs

From Markdown

from pdf_oxide import Pdf

pdf = Pdf.from_markdown("""
# Report Title

## Introduction

This is **bold** and *italic* text.

- Item 1
- Item 2
- Item 3

## Code Example

```python
print("Hello, World!")

""") pdf.save("report.pdf")


### From HTML

```python
from pdf_oxide import Pdf

pdf = Pdf.from_html("""
<h1>Invoice</h1>
<p>Thank you for your purchase.</p>
<table>
    <tr><th>Item</th><th>Price</th></tr>
    <tr><td>Widget</td><td>$10.00</td></tr>
</table>
""")
pdf.save("invoice.pdf")

From Plain Text

from pdf_oxide import Pdf

pdf = Pdf.from_text("Simple plain text document.\n\nWith paragraphs.")
pdf.save("notes.pdf")

From Images

from pdf_oxide import Pdf

# Single image
pdf = Pdf.from_image("photo.jpg")
pdf.save("photo.pdf")

# Multiple images (one per page)
album = Pdf.from_images(["page1.jpg", "page2.png", "page3.jpg"])
album.save("album.pdf")

Opening and Reading PDFs

from pdf_oxide import PdfDocument

# Open existing PDF (path can be str or pathlib.Path)
doc = PdfDocument("document.pdf")

# Or use as a context manager
with PdfDocument("document.pdf") as doc:
    text = doc.extract_text(0)
    print(f"Text: {text}")
    markdown = doc.to_markdown(0)
    print(f"Pages: {doc.page_count()}")

# Extract text from page 0
text = doc.extract_text(0)
print(f"Text: {text}")

# Convert to Markdown
markdown = doc.to_markdown(0)
print(f"Markdown:\n{markdown}")

# Get page count
print(f"Pages: {doc.page_count()}")

Builder Pattern for Advanced Creation

For full control over PDF creation, use PdfBuilder:

from pdf_oxide import PdfBuilder, PageSize

pdf = (PdfBuilder()
    .title("Annual Report 2025")
    .author("Company Inc.")
    .subject("Financial Summary")
    .page_size(PageSize.A4)
    .margins(72.0, 72.0, 72.0, 72.0)  # 1 inch margins
    .font_size(11.0)
    .from_markdown("# Annual Report\n\n..."))

pdf.save("annual-report.pdf")

Encryption and Security

Password Protection

from pdf_oxide import Pdf

pdf = Pdf.from_markdown("# Confidential Document")

# Simple password protection (AES-256)
pdf.save_encrypted("secure.pdf", "user-password", "owner-password")

Text Extraction Options

Basic Extraction

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
text = doc.extract_text(0)

With Options

from pdf_oxide import PdfDocument, ConversionOptions

doc = PdfDocument("paper.pdf")
options = ConversionOptions(
    detect_headings=True,
    detect_lists=True,
    embed_images=True
)
markdown = doc.to_markdown(0, options)

Extract All Pages

from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")

# Extract text from all pages
all_text = doc.extract_text_all()

# Convert entire document to Markdown
all_markdown = doc.to_markdown_all()

Office Document Conversion

Convert DOCX, XLSX, and PPTX files to PDF:

from pdf_oxide import OfficeConverter

# Auto-detect format
converter = OfficeConverter()

# Convert Word document
converter.convert("report.docx", "report.pdf")

# Convert Excel spreadsheet
converter.convert("data.xlsx", "data.pdf")

# Convert PowerPoint presentation
converter.convert("slides.pptx", "slides.pdf")

Working with Images

Extract Images from PDF

from pdf_oxide import PdfDocument

doc = PdfDocument("document.pdf")
images = doc.extract_images(0)

for i, img in enumerate(images):
    img.save(f"image_{i}.png")

Embed Images in Output

from pdf_oxide import PdfDocument, ConversionOptions

doc = PdfDocument("paper.pdf")
options = ConversionOptions(embed_images=True)

# Images embedded as base64 data URIs
html = doc.to_html(0, options)

OCR - Extracting Text from Scanned PDFs

For a comprehensive guide covering model selection, configuration reference, resize strategies, and troubleshooting, see the OCR Guide.

Setup

The OCR feature requires heavy machine learning dependencies (ONNX Runtime) and is optional.

# Recommended: Install with OCR support
pip install pdf_oxide[ocr]

# Or build from source with OCR
maturin develop --features python,ocr

Troubleshooting: If you see RuntimeError: OCR feature not enabled, it means the library was installed without OCR support. Re-install using the [ocr] extra above.

Quick start — download the recommended models:

./scripts/setup_ocr_models.sh

Model Selection Guide

PDFOxide supports PaddleOCR v3, v4, and v5 models. You can mix detection and recognition models from different versions.

Combination	Detection	Recognition	English Accuracy	Total Size
V4 det + V5 rec (recommended)	ch_PP-OCRv4_det	en_PP-OCRv5_mobile_rec	Best	~12.5 MB
V4 det + V4 rec	ch_PP-OCRv4_det	en_PP-OCRv4_rec	Good	~12.4 MB
V5 det + V5 rec	PP-OCRv5_server_det	en_PP-OCRv5_mobile_rec	Good (different errors)	~96 MB
V3 det + V3 rec	en_PP-OCRv3_det	en_PP-OCRv3_rec	Fair	~11 MB

The V4 detection + V5 recognition combination gives the best results for English documents: V4 detection reliably segments text lines, while V5 recognition has the highest character-level accuracy.

Manual download:

# Recommended: V4 detection + V5 recognition
# Detection (4.7 MB):
curl -L https://huggingface.co/deepghs/paddleocr/resolve/main/det/ch_PP-OCRv4_det/model.onnx -o .models/det.onnx

# Recognition (7.8 MB):
curl -L https://huggingface.co/monkt/paddleocr-onnx/resolve/main/languages/english/rec.onnx -o .models/rec.onnx

# Dictionary (must include space as last entry):
curl -L https://huggingface.co/monkt/paddleocr-onnx/resolve/main/languages/english/dict.txt -o .models/en_dict.txt
echo " " >> .models/en_dict.txt

Basic OCR Usage

from pdf_oxide import PdfDocument, OcrEngine, OcrConfig

# Create OCR engine (default config works with recommended V4 det + V5 rec models)
engine = OcrEngine(
    det_model_path=".models/det.onnx",
    rec_model_path=".models/rec.onnx",
    dict_path=".models/en_dict.txt",
)

# Extract text using OCR
doc = PdfDocument("scanned.pdf")
text = doc.extract_text_ocr(page=0, engine=engine)
print(text)

Processing Multiple Pages

doc = PdfDocument("scanned.pdf")
for page in range(doc.page_count()):
    text = doc.extract_text_ocr(page=page, engine=engine)
    if text.strip():
        print(f"--- Page {page + 1} ---")
        print(text)

Using PP-OCRv5 Detection

If you use the full PP-OCRv5 stack (v5 detection + v5 recognition), pass use_v5=True to OcrConfig. This preserves the original image resolution instead of downscaling to 960px, which the larger v5 detection model needs:

config = OcrConfig(use_v5=True)
engine = OcrEngine(
    det_model_path="v5_det.onnx",
    rec_model_path="v5_rec.onnx",
    dict_path="v5_dict.txt",
    config=config,
)

Note: The OcrEngine is reusable — create it once and pass it to multiple extract_text_ocr calls. ONNX Runtime requires libonnxruntime.so (v1.23+) to be available at runtime (via LD_LIBRARY_PATH or system install).

Structured Extraction

Beyond plain text, PDFOxide can extract structured content from pages:

from pdf_oxide import PdfDocument

doc = PdfDocument("document.pdf")

# 1. Scoped extraction from a specific area (v0.3.14)
# Area: (x, y, width, height) in points
header = doc.within(0, (0, 700, 612, 92)).extract_text()

# 2. Text spans with font info, position, and style
spans = doc.extract_spans(0)
for span in spans:
    print(f"{span.text} — {span.font_name} {span.font_size}pt")

# 3. Word-level extraction (v0.3.14)
words = doc.extract_words(0)
for w in words:
    print(f"Word: {w.text} at {w.bbox}")
    # Access character metadata for the word
    # print(w.chars[0].font_name)

# Optional: override the adaptive word gap threshold (in PDF points).
# Smaller values split more aggressively; useful for dense forms.
words = doc.extract_words(0, word_gap_threshold=2.5)

# 4. Line-level extraction (v0.3.14)
lines = doc.extract_text_lines(0)
for line in lines:
    print(f"Line: {line.text}")

# Optional: override word and/or line gap thresholds (in PDF points).
lines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0)

# 5. Inspect computed layout params before overriding
params = doc.page_layout_params(0)
print(f"Adaptive word gap: {params.word_gap_threshold:.1f}pt")
print(f"Adaptive line gap: {params.line_gap_threshold:.1f}pt")

# 6. Pre-tuned extraction profiles for different document types
from pdf_oxide import ExtractionProfile
profile = ExtractionProfile.form()
print(f"Profile: {profile.name}, word_margin_ratio={profile.word_margin_ratio}")

# Pass a profile to extraction methods to control how raw text is parsed
words = doc.extract_words(0, profile=ExtractionProfile.form())
lines = doc.extract_text_lines(0, profile=ExtractionProfile.academic())

# Combine profile with threshold overrides (profile controls span parsing,
# thresholds control word/line clustering)
words = doc.extract_words(0, word_gap_threshold=1.5, profile=ExtractionProfile.aggressive())

# 7. Image metadata
images = doc.extract_images(0)
for img in images:
    print(f"{img['width']}x{img['height']} {img['color_space']}")

# 8. Bookmarks / table of contents
outline = doc.get_outline()  # None if no outline
if outline:
    for item in outline:
        print(f"{item['title']} -> page {item.get('page')}")

# 9. Vector paths (lines, curves, shapes)
paths = doc.extract_paths(0)
for path in paths:
    print(f"bbox={path['bbox']}, stroke={path.get('stroke_color')}")

Working with Form Fields

PDFOxide can extract, read, fill, and export PDF form field data (AcroForm fields).

List All Form Fields

from pdf_oxide import PdfDocument

doc = PdfDocument("tax-form.pdf")
fields = doc.get_form_fields()

for f in fields:
    print(f"{f.name} ({f.field_type}) = {f.value}")

Each FormField has:

name — fully qualified field name (e.g. "topmostSubform[0].Page1[0].f1_01[0]")
field_type — "text", "button", "choice", "signature"
value — current value (str, bool, or None)
flags — field flags (read-only, required, etc.)

Read and Set Field Values

doc = PdfDocument("w2.pdf")

# Read a field value
ssn = doc.get_form_field_value("topmostSubform[0].CopyA[0].f1_01[0]")
print(f"SSN: {ssn}")

# Fill fields
doc.set_form_field_value("employee_name", "Jane Doe")
doc.set_form_field_value("wages", "85000.00")
doc.set_form_field_value("retirement_plan", True)  # checkbox

# Save (values are persisted via incremental save)
doc.save("filled_w2.pdf")

Extract Text with Form Field Values

Filled form field values appear inline in extract_text and to_markdown:

doc = PdfDocument("filled_w2.pdf")

# Form values appear inline in extracted text
text = doc.extract_text(0)
print(text)  # "Jane Doe" appears where the name field is

# to_markdown includes form fields by default
md = doc.to_markdown(0, include_form_fields=True)

# Exclude form field values
md_clean = doc.to_markdown(0, include_form_fields=False)

Export Form Data

doc = PdfDocument("filled-form.pdf")

# Export as FDF
doc.export_form_data("form_data.fdf")

# Export as XFDF
doc.export_form_data("form_data.xfdf", format="xfdf")

Performance Tips

Reuse document objects - Opening a PDF has overhead, reuse the object for multiple operations
Use specific page extraction - extract_text(page_num) is faster than extract_text_all() if you only need some pages
Disable features you don't need - Use ConversionOptions to skip heading detection, image extraction, etc.

from pdf_oxide import PdfDocument, ConversionOptions

doc = PdfDocument("large.pdf")

# Fast extraction - minimal processing
options = ConversionOptions(
    detect_headings=False,
    detect_lists=False,
    embed_images=False
)
text = doc.to_markdown(0, options)

Error Handling

from pdf_oxide import PdfDocument, PdfError

try:
    doc = PdfDocument("document.pdf")
    text = doc.extract_text(0)
except PdfError as e:
    print(f"PDF error: {e}")
except FileNotFoundError:
    print("File not found")

Examples

See the examples/ directory for complete working examples.

Quick Script Examples

Extract text from all PDFs in a folder:

from pdf_oxide import PdfDocument
from pathlib import Path

for pdf_path in Path("documents").glob("*.pdf"):
    # PdfDocument accepts pathlib.Path directly
    with PdfDocument(pdf_path) as doc:
        text = doc.extract_text_all()

    output_path = pdf_path.with_suffix(".txt")
    output_path.write_text(text)
    print(f"Extracted: {pdf_path.name}")

Batch convert Markdown to PDF:

from pdf_oxide import Pdf
from pathlib import Path

for md_path in Path("notes").glob("*.md"):
    content = md_path.read_text()
    pdf = Pdf.from_markdown(content)

    output_path = md_path.with_suffix(".pdf")
    pdf.save(str(output_path))
    print(f"Created: {output_path.name}")

Next Steps

API Reference - Full API documentation
PDF Creation Guide - Advanced creation options
GitHub Issues - Report bugs or request features

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Getting Started with PDFOxide (Python)

Installation

Quick Start - The Unified `Pdf` API

Creating PDFs

From Markdown

From Plain Text

From Images

Opening and Reading PDFs

Builder Pattern for Advanced Creation

Encryption and Security

Password Protection

Text Extraction Options

Basic Extraction

With Options

Extract All Pages

Office Document Conversion

Working with Images

Extract Images from PDF

Embed Images in Output

OCR - Extracting Text from Scanned PDFs

Setup

Model Selection Guide

Basic OCR Usage

Processing Multiple Pages

Using PP-OCRv5 Detection

Structured Extraction

Working with Form Fields

List All Form Fields

Read and Set Field Values

Extract Text with Form Field Values

Export Form Data

Performance Tips

Error Handling

Examples

Quick Script Examples

Next Steps

FilesExpand file tree

getting-started-python.md

Latest commit

History

getting-started-python.md

File metadata and controls

Getting Started with PDFOxide (Python)

Installation

Quick Start - The Unified Pdf API

Creating PDFs

From Markdown

From Plain Text

From Images

Opening and Reading PDFs

Builder Pattern for Advanced Creation

Encryption and Security

Password Protection

Text Extraction Options

Basic Extraction

With Options

Extract All Pages

Office Document Conversion

Working with Images

Extract Images from PDF

Embed Images in Output

OCR - Extracting Text from Scanned PDFs

Setup

Model Selection Guide

Basic OCR Usage

Processing Multiple Pages

Using PP-OCRv5 Detection

Structured Extraction

Working with Form Fields

List All Form Fields

Read and Set Field Values

Extract Text with Form Field Values

Export Form Data

Performance Tips

Error Handling

Examples

Quick Script Examples

Next Steps

Quick Start - The Unified `Pdf` API