Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
355 changes: 355 additions & 0 deletions docs/design/byok-pdf/byok-pdf-spike.md

Large diffs are not rendered by default.

245 changes: 245 additions & 0 deletions docs/design/byok-pdf/byok-pdf.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,245 @@
# Feature design: BYOK PDF support

| | |
|--------------------|-------------------------------------------|
| **Date** | 2026-04-27 |
| **Component** | rag-content (primary), lightspeed-stack (docs only) |
| **Authors** | Maxim Svistunov |
| **Feature** | [LCORE-1471](https://issues.redhat.com/browse/LCORE-1471) |
| **Spike** | [LCORE-1471](https://issues.redhat.com/browse/LCORE-1471) — see [byok-pdf-spike.md](byok-pdf-spike.md) |
| **Precedent** | [LCORE-1035](https://issues.redhat.com/browse/LCORE-1035) — HTML support (PR `7f688b0`) |

## What

Add native PDF input support to the BYOK content production tool (`rag-content`). After this feature, customers can drop `.pdf` files into the input directory of `custom_processor.py` alongside `.md`, `.txt`, and `.html` files, and get a functioning vector store without manual pre-conversion.

## Why

Today, customers with PDF content must convert it to Markdown themselves before feeding it to `rag-content`. The BYOK guide currently instructs them to use `docling` as a separate pre-processing step. This is friction — and ironic, because `rag-content` *already* depends on `docling` (it ships with HTML support). Wiring the existing dependency through to PDF removes the manual step at no cost in third-party deps.

## Requirements

- **R1**: `python -m lightspeed_rag_content.pdf convert -i input.pdf -o output.md` converts a single PDF to Markdown.
- **R2**: `python -m lightspeed_rag_content.pdf batch -i ./pdfs/ -o ./md/` converts a directory of PDFs.
- **R3**: `custom_processor.py` with `-f` pointing to a directory containing PDFs produces a vector store; PDFs are routed through `MarkdownNodeParser` after docling export.
- **R4**: No new entries in `pyproject.toml` (docling is already a dependency).
- **R5**: OCR is not invoked. Scanned/image-only PDFs are out of scope; their conversion may yield empty or near-empty Markdown without erroring.
- **R6**: The `lightspeed-stack/docs/byok_guide.md` no longer instructs users to pre-convert PDFs.

## Use Cases

- **U1**: As a BYOK customer with product documentation PDFs, I want to feed the PDFs directly to `rag-content` so that I don't have to maintain a separate conversion step in my pipeline.
- **U2**: As an LCS operator, I want a vector store generated from a PDF to behave indistinguishably from one generated from Markdown when queried via `lightspeed-stack`, so that input-format choice does not affect retrieval semantics.

## Architecture

### Overview

```text
input/
├── doc.md ─┐
├── note.txt ─┼─→ SimpleDirectoryReader
├── page.html ┤ │
└── manual.pdf┘ │
file_extractor lookup by ext:
.html ─→ HTMLReader (existing)
.pdf ─→ PDFReader (new) ←── this feature
.md, .txt ─→ default text reader
Document(text=markdown)
MarkdownNodeParser (existing path,
extended to recognize doc_type="pdf")
embedding + vector store
```

PDF support reuses the entire downstream pipeline. The only new code is the reader/CLI, plus a one-token addition (`"pdf"`) to two `doc_type` checks in `document_processor.py`.

### Reader

`PDFReader(BaseReader)` mirrors `HTMLReader` line-for-line with two differences:

1. The docling `DocumentConverter` is constructed with `allowed_formats=[InputFormat.PDF]` (not HTML).
2. The converter receives a `PdfFormatOption(pipeline_options=...)` argument with explicit pipeline knobs.

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

class PDFReader(BaseReader):
def __init__(self) -> None:
opts = PdfPipelineOptions()
opts.do_ocr = False
opts.do_table_structure = True
opts.table_structure_options.mode = "accurate"
self.converter = DocumentConverter(
allowed_formats=[InputFormat.PDF],
format_options={
InputFormat.PDF: PdfFormatOption(pipeline_options=opts),
},
)

def load_data(self, file: Path, extra_info=None, **kwargs):
# identical body to HTMLReader.load_data — see html_reader.py
...
```

### Pipeline configuration

| Option | Value | Why |
|--------|-------|-----|
| `do_ocr` | `False` | Out of scope (R5) |
| `do_table_structure` | `True` | Tables are common in customer docs |
| `table_structure_options.mode` | `"accurate"` | Offline indexing tolerates the perf cost |
| `do_picture_classification` | `False` (default) | Vector search does not use images |
| `do_picture_description` | `False` (default) | Heavy VLM call, no value |
| `generate_page_images` | `False` (default) | Wasted I/O |

These are baked into `PDFReader.__init__`. **No CLI flags expose them in v1.** If customer feedback calls for tuning, add flags later.

### Chunking

PDFs go through the same chunking path as HTML and Markdown. The change is in `document_processor.py`:

```python
# document_processor.py:75 — _LlamaStackDB
if config.doc_type in ("markdown", "html"): # before
if config.doc_type in ("markdown", "html", "pdf"): # after

# document_processor.py:87 — _BaseDB
if config.doc_type in ("markdown", "html"): # before
if config.doc_type in ("markdown", "html", "pdf"): # after
```
Comment on lines +110 to +118
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Replace hardcoded document_processor.py:75/87 line refs with stable anchors

These line numbers point into the sibling rag-content repo and are not gated by this PR. They will drift the moment any unrelated edit touches document_processor.py above line 75 (imports, a new helper, etc.), and the design doc — which lives on indefinitely — will then point at the wrong lines. Reference the symbols instead (_LlamaStackDB.__init__ and _BaseDB.__init__, plus the if config.doc_type in (...) predicate), so the agentic-tool instructions in byok-pdf-spike.md and the implementation ticket stay accurate without re-validation.

Also applies to: 162-186

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/byok-pdf/byok-pdf.md` around lines 110 - 118, Replace the brittle
line-number references to document_processor.py with stable symbol anchors:
update the docs to point to _LlamaStackDB.__init__ and _BaseDB.__init__ (and
explicitly call out the predicate "if config.doc_type in (\"markdown\",
\"html\", \"pdf\")") instead of "document_processor.py:75/87"; also update the
other occurrence mentioned (around lines 162-186) to reference the same
symbols/predicate so the byok-pdf-spike.md agentic-tool instructions and
implementation ticket remain correct even if the file shifts.

🧹 Nitpick | 🔵 Trivial

Extract the doc_type tuple to a module-level constant rather than duplicating it twice

The proposed change repeats ("markdown", "html", "pdf") at two call sites in document_processor.py. Open Questions (line 216) explicitly anticipates DOCX/RTF/EPUB readers landing later — each new format would then require editing two tuples in lockstep, with no compiler help if one is forgotten. Suggest the implementation ticket land a single MARKDOWN_COMPATIBLE_DOC_TYPES: Final[tuple[str, ...]] = ("markdown", "html", "pdf") (or similar) used at both sites. Cheap DRY win that pays off across the planned roadmap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/byok-pdf/byok-pdf.md` around lines 110 - 118, The two duplicated
doc_type tuples used in document_processor.py should be extracted to a single
module-level constant to avoid drift; add a constant like
MARKDOWN_COMPATIBLE_DOC_TYPES: Final[tuple[str, ...]] = ("markdown", "html",
"pdf") at the top of the module (import Final from typing), then replace the
inline tuples in the _LlamaStackDB and _BaseDB checks (the places referencing
config.doc_type) to use MARKDOWN_COMPATIBLE_DOC_TYPES instead.


`MarkdownNodeParser` operates on the markdown that docling exports. It splits on heading boundaries, which works because docling produces well-formed `# ` / `## ` / `### ` headings for body content. Heading-text degradation in some PDFs (see "Known limitations") corrupts the heading *text* but does not break splitting.

### CLI module (`pdf/__main__.py`)

Mirror `html/__main__.py` exactly. Two subcommands:

- `convert -i input.pdf [-o output.md]` — single file
- `batch -i ./pdf_dir [-o ./md_dir]` — directory walk over `**/*.pdf`

Reuse `lightspeed_rag_content.utils.add_input_file_argument` and `run_cli_command`.

### Configuration

No YAML configuration changes. The reader's defaults are hard-coded; the CLI takes paths only; `custom_processor.py` already accepts the input directory as `-f`.

### API changes

No HTTP API changes. This feature is entirely offline (rag-content is a build-time tool).

### Error handling

Mirror HTMLReader exactly:

- `FileNotFoundError` if the input path does not exist.
- `RuntimeError` (with the underlying exception chained via `from exc`) if docling conversion fails.
- The CLI catches both, logs, and exits non-zero.

A scanned (image-only) PDF will produce empty or near-empty Markdown. This is not an error — the document parses fine; it just contains no extractable text. Customers will see a valid but empty-ish vector store entry. Document this in the rag-content README.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Specify a warning log when conversion yields empty/near-empty Markdown

Treating a scanned PDF as "not an error" is reasonable, but silently producing a near-empty vector-store entry is a debugging nightmare for BYOK customers — they will index a PDF, run a query, get nothing back, and have no signal that the PDF lacked extractable text. Recommend specifying that PDFReader.load_data emits a logger.warning(...) when the exported Markdown is below some small threshold (e.g., empty after whitespace strip, or under N characters), so the failure mode is observable in custom_processor.py output. Cheap to add and aligns with the coding guideline of using clear log levels.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/byok-pdf/byok-pdf.md` at line 147, PDFReader.load_data currently
lets scanned (image-only) PDFs produce empty/near-empty Markdown silently;
update PDFReader.load_data to detect when the exported Markdown is empty after
strip or under a small configurable threshold (e.g., N characters) and emit
logger.warning with contextual info (filename/document id and resulting length
or "empty after strip") so the condition is visible in custom_processor.py logs;
add a configurable constant/argument for the threshold and ensure the warning
message is clear and only logged at warning level.


### Migration / backwards compatibility

None. This is purely additive: existing pipelines that don't pass PDFs are unaffected. No schema, no config, no API changes.

## Implementation Suggestions

### Key files and insertion points

| Repo | File | What to do |
|------|------|------------|
| rag-content | `src/lightspeed_rag_content/pdf/__init__.py` | New — minimal package init mirroring `html/__init__.py` |
| rag-content | `src/lightspeed_rag_content/pdf/__main__.py` | New — CLI mirroring `html/__main__.py` |
| rag-content | `src/lightspeed_rag_content/pdf/pdf_reader.py` | New — `PDFReader(BaseReader)` mirroring `html_reader.py` |
| rag-content | `src/lightspeed_rag_content/document_processor.py` | Edit lines 75 and 87 — add `"pdf"` to the `doc_type in (...)` tuples |
| rag-content | `tests/pdf/__init__.py` | New |
| rag-content | `tests/pdf/test_pdf_reader.py` | New — mirror `tests/html/test_html_reader.py` |
| rag-content | `tests/pdf/test__main__.py` | New — mirror `tests/html/test__main__.py` |
| rag-content | `tests/pdf/fixture.pdf` | New — small text-extractable test PDF (< 50 KB) |
| rag-content | `README.md` | Edit — add PDF to the supported input formats list |
| lightspeed-stack | `docs/byok_guide.md` | Edit lines ~106-118 — list PDF as directly supported, drop the docling pre-conversion example |

### Insertion point detail

`document_processor.py:75` (inside `_LlamaStackDB.__init__`):

```python
if config.doc_type in ("markdown", "html"):
Settings.node_parser = MarkdownNodeParser()
```

`document_processor.py:87` (inside `_BaseDB.__init__`):

```python
if config.doc_type in ("markdown", "html"):
Settings.node_parser = MarkdownNodeParser()
```

Both lines need `"pdf"` added to the tuple. No other branches in the file route by `doc_type`.

### Config pattern

N/A — this feature has no Python config classes (it's a CLI tool, not a service).

### Test patterns

Mirror the HTML test layout exactly:

- `tests/pdf/test_pdf_reader.py`:
- Test that `PDFReader().load_data(valid_path)` returns a list with one `Document`.
- Test that the returned `Document.text` is non-empty for a real fixture PDF.
- Test that `FileNotFoundError` is raised for a non-existent path.
- Test that `RuntimeError` is raised when docling raises (mock the converter).
- Test that `extra_info` is preserved in `Document.metadata`.
- `tests/pdf/test__main__.py`:
- Test argument parsing for `convert` and `batch`.
- Test that `convert` writes output to the inferred path when `-o` is omitted.
- Test that `batch` walks subdirectories and preserves structure.
- Test that errors exit non-zero.

Use the existing `mocker` patterns from `tests/html/`. The PDF fixture should be small (< 50 KB) and committed to git — generate one from a known Markdown source so test assertions can match exact strings.

## Open Questions for Future Work

- **OCR support**: Scanned PDFs require docling's OCR engines (tesseract, easyocr, rapidocr). File a follow-up JIRA when there is customer demand. Implementation is mostly a flag toggle; the cost is in build/runtime size.
- **Hybrid (page-aware) chunking**: If retrieval quality on real customer PDFs is poor, switch from `MarkdownNodeParser` to docling's hybrid chunker. Requires a new branch in `_BaseDB.__init__`.
- **Heading-cleanup post-processor**: Confluence-export PDFs with letter-spaced headings yield `H e a d i n g` text. A small post-processor that collapses single-character runs in headings would mitigate. Not in scope for v1.
- **Shared `DoclingReader` base class**: If `PDFReader` and `HTMLReader` end up sharing significant logic beyond imports, refactor into a base. Not required by this feature.
- **DOCX, RTF, EPUB**: docling supports these. Add as separate readers when customer demand justifies.

## Known limitations

These are intentional v1 trade-offs documented for the rag-content README and the BYOK guide:

- **Scanned PDFs**: produce empty or near-empty Markdown. Use a separate OCR step today; native OCR support tracked as a follow-up.
- **Letter-spaced display fonts**: typical of Confluence "Export to PDF" output. Headings may extract with spaces between letters (`H e a d i n g`). Body text is unaffected. The `## ` heading prefix is intact, so chunking still happens at heading boundaries; only the heading *text* is corrupted, which slightly degrades retrieval if a query mentions the heading literally.
- **Performance**: ~30-90 seconds per small/medium PDF on CPU after model warm-up. Acceptable for offline indexing; not suitable for interactive use.

## Changelog

| Date | Change | Reason |
|------------|-----------------|---------------------------------|
| 2026-04-27 | Initial version | Spike deliverable for LCORE-1471 |

## Appendix A: PoC evidence

See [`poc-results/`](poc-results/) in the spike PR for the full PoC report, conversion logs, and converted Markdown samples. The PoC validated the recommendations above on two real PDFs and surfaced the heading-degradation limitation documented in "Known limitations".

## Appendix B: HTML precedent

The HTML implementation under [LCORE-1035](https://issues.redhat.com/browse/LCORE-1035) (commit `7f688b0`, 2026-01-15) established every pattern used here:

- `BaseReader` with docling-backed conversion
- `__main__.py` CLI structure with `convert` and `batch` subcommands
- `tests/html/` layout
- `document_processor.py` `doc_type` branching

Refer to the HTML files (`src/lightspeed_rag_content/html/`, `tests/html/`) when implementing this feature. Differences are limited to the `InputFormat` enum value and the addition of `PdfFormatOption(pipeline_options=...)` in the `DocumentConverter` constructor.
Loading
Loading