LCORE-1471 spike: BYOK PDF support by max-svistunov · Pull Request #1598 · lightspeed-core/lightspeed-stack

max-svistunov · 2026-04-27T13:36:37Z

fmax## LCORE-1471 spike: BYOK PDF support

Spike deliverable for LCORE-1471 — adding native PDF support to the BYOK content production tool (rag-content). HTML support already shipped under LCORE-1035 (Jan 2026), so this spike is intentionally lightweight: scaffold-and-mirror the existing HTML pattern rather than re-design from scratch.

What's in this PR

Design docs (docs/design/byok-pdf/):

byok-pdf-spike.md (use the "Outline" button) — decisions, alternatives, proposed JIRAs, PoC results
byok-pdf.md (use the "Outline" button) — feature spec — changeable based on final decisions, kept in repo long-term

PoC code (will be removed before merge per spike workflow step 10):

docs/design/byok-pdf/poc/pdf_reader.py — 60-line script mirroring HTMLReader for PDF input
docs/design/byok-pdf/poc/sample_jira_*.pdf — two real-world PDFs used as test corpus

PoC validation results (will be removed before merge):

docs/design/byok-pdf/poc-results/01-poc-report.txt — methodology, findings, implications
docs/design/byok-pdf/poc-results/02-conversion-log.txt — exact commands and timings
docs/design/byok-pdf/poc-results/03-sample-jira-1311.md — clean conversion output
docs/design/byok-pdf/poc-results/04-sample-jira-836.md — heading degradation visible

Main findings

No new dependencies needed. docling is already in pyproject.toml (added by LCORE-1035). The PDF path uses the same DocumentConverter API with InputFormat.PDF and PdfFormatOption(pipeline_options=...).
Body text and tables convert cleanly. PoC produced well-formed Markdown with proper GitHub-flavored tables. The existing MarkdownNodeParser chunks the output without modification — only a one-token addition ("pdf") to two doc_type checks in document_processor.py.
Heading degradation on letter-spaced fonts is the one real caveat. Confluence "Export to PDF" output produces headings like H e a d i n g. Body unaffected, chunking unaffected, but heading retrieval is degraded. Documented as a known v1 limitation; a heading-cleanup post-processor is noted as a follow-up.
Performance is acceptable for offline indexing. ~30–90 s per small/medium PDF on CPU after model warm-up. Cold start is ~5 min for the first invocation.

For reviewers

@tisnik @sbunciak

Strategic decisions — please confirm or override (each has an options table and recommendation in the linked section):

Decision 1: Library choice (docling vs alternatives) — recommend docling
Decision 2: OCR for scanned PDFs — recommend out of scope for v1
Decision 3: Repository placement — recommend rag-content + stack docs update

@tisnik

Technical decisions — implementation-level:

Decision 4: Pipeline configuration knobs — recommend hard-coded defaults, no CLI flags in v1
Decision 5: Chunking strategy — recommend reuse MarkdownNodeParser
Decision 6: Code organization — recommend new pdf/ package mirroring html/
Decision 7: Test coverage scope — recommend unit/integration in JIRA GitHub templates #2, e2e in JIRA Project file #3

Proposed JIRAs — review the Proposed JIRAs section; 4 sub-tickets under LCORE-1471:

rag-content: Implement PDF support
rag-content: Unit and integration tests
rag-content: End-to-end test (PDF → vector store → stack query)
lightspeed-stack: Update BYOK guide for native PDF support

Doc structure note: The decision and JIRA sections of the spike doc are where your input is needed. They link to background sections later in the doc — read those if you need more context on a specific point, but it is optional.

Type of change

New feature
Documentation Update

Tools used to create PR

Assisted-by: Claude Opus 4.7 (1M context)
Generated by: Claude Opus 4.7 (1M context)

Related Tickets & Documents

Related Issue # LCORE-1035 (HTML precedent)
Closes # LCORE-1471 (when JIRAs are filed and accepted; the spike itself is the deliverable)

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests. (N/A — spike only; tests are part of proposed JIRA GitHub templates #2)

Testing

The PoC was validated end-to-end against two real-world PDFs. Reproduction steps:

# 1. From the rag-content sibling repo (where docling is installed)
cd ~/repos/lightspeed/rag-content

# 2. Convert sample PDF #1 (clean Atlassian Cloud PDF)
uv run python ~/repos/lightspeed/stack/docs/design/byok-pdf/poc/pdf_reader.py \
    ~/repos/lightspeed/stack/docs/design/byok-pdf/poc/sample_jira_1311.pdf \
    /tmp/out_1311.md

# Expected output:
#   converted .../sample_jira_1311.pdf -> /tmp/out_1311.md
#     elapsed: 332s (cold) or ~30s (warm)
#     output:  ~7600 chars, ~290 lines

# 3. Convert sample PDF #2 (Confluence-style PDF, surfaces heading degradation)
uv run python ~/repos/lightspeed/stack/docs/design/byok-pdf/poc/pdf_reader.py \
    ~/repos/lightspeed/stack/docs/design/byok-pdf/poc/sample_jira_836.pdf \
    /tmp/out_836.md

# Expected output:
#   converted .../sample_jira_836.pdf -> /tmp/out_836.md
#     elapsed: ~70s (warm)
#     output:  ~3100 chars, ~165 lines

The committed poc-results/03-*.md and poc-results/04-*.md are the canonical conversion outputs from the spike runs — diff against them to confirm reproducibility.

Closes LCORE-1471
Related: LCORE-1035

Summary by CodeRabbit

Documentation
- Added design specifications and proof-of-concept documentation for planned native PDF support in the BYOK content production tool
- Includes conversion quality metrics, performance benchmarks, and detailed implementation approach for PDF ingestion workflows

Add a minimal PoC script that mirrors the production HTMLReader from src/lightspeed_rag_content/html/html_reader.py but configures docling for InputFormat.PDF. The script validates that docling's PDF pipeline produces usable Markdown for the BYOK ingestion path with the recommended defaults: - do_ocr=False (text-extractable PDFs only; OCR is explicit non-goal of LCORE-1471) - do_table_structure=True (cheap quality win; tables are common in customer documentation) - table_structure_options.mode='accurate' - do_picture_classification=False, do_picture_description=False - generate_page_images=False The PoC was run against two real-world PDFs (locally-available Lightspeed JIRA exports — Atlassian Cloud and Confluence-style): sample_jira_1311.pdf 217 KB -> 7,608 chars / 288 lines sample_jira_836.pdf 372 KB -> 3,084 chars / 165 lines Run command (from rag-content's venv where docling is installed): cd ~/repos/lightspeed/rag-content uv run python <stack>/docs/design/byok-pdf/poc/pdf_reader.py \ <input.pdf> [<output.md>] The PoC is throwaway code. The production implementation will mirror the lightspeed_rag_content/html/ package one-for-one, replacing InputFormat.HTML with InputFormat.PDF and adding the PdfFormatOption(pipeline_options=...) constructor argument. This commit only adds the PoC script and sample inputs; the converted output, evidence files, and design docs land in a follow-up commit.

Add the spike doc (decisions up front, background below, 4 proposed JIRAs) and the spec doc (R1..R6 requirements, architecture, key files and insertion points, known limitations) under docs/design/byok-pdf/. The spike is lightweight by design: HTML support shipped under LCORE-1035 (commit 7f688b0, 2026-01-15), so the architectural pattern, docling dependency, BaseReader plumbing, CLI shape, and test layout are all already established. PDF support is a scaffold-and-mirror job plus a one-line addition to document_processor.py's doc_type branches. Decisions captured for confirmation (each with options table and recommendation in the spike doc): D1: Library -- docling (already a dependency) D2: OCR for scanned PDFs -- out of scope; track as follow-up D3: Repo placement -- rag-content (impl) + lightspeed-stack (BYOK guide update only) D4: Pipeline knobs -- hard-coded sensible defaults; no CLI flags in v1 (mirrors HTMLReader) D5: Chunking strategy -- reuse MarkdownNodeParser; add "pdf" to document_processor.py:75 and :87 D6: Code organization -- new pdf/ package mirroring html/ D7: Test coverage -- unit/integration in JIRA #2, e2e in lightspeed-core#3 Four sub-JIRAs proposed under LCORE-1471 (parseable by dev-tools/file-jiras.sh): 1. rag-content: Implement PDF support 2. rag-content: Unit and integration tests 3. rag-content: End-to-end test (PDF -> vector store -> stack query) 4. lightspeed-stack: Update BYOK guide for native PDF support PoC evidence under poc-results/: 01-poc-report.txt Methodology, findings, implications 02-conversion-log.txt Exact commands and timings 03-sample-jira-1311.md Clean conversion (Atlassian Cloud PDF) 04-sample-jira-836.md Body clean, headings degraded (Confluence PDF, letter-spaced display font) Honest PoC findings worth surfacing: - No new dependencies are needed (docling is already in pyproject.toml). - Body text and tables convert cleanly to Markdown. - MarkdownNodeParser handles the output -- no parallel chunking pipeline. - Letter-spaced display fonts (typical of Confluence "Export to PDF") produce noisy heading text; documented as a v1 known limitation. - Cold model load is ~5 minutes on CPU; warm conversions ~30-90 s for small/medium PDFs. Acceptable for offline indexing. Per howto-run-a-spike.md step 10, poc/ and poc-results/ will be removed before merge; spike doc and spec doc remain in the repo.

coderabbitai · 2026-04-27T13:36:54Z

Walkthrough

This PR adds design documentation and proof-of-concept implementation for enabling native PDF input support in the BYOK content production tool. It includes detailed feature and spike specifications, PoC results with conversion analysis, sample Jira issue documentation, and a standalone PDF-to-Markdown converter script using docling.

Changes

Cohort / File(s)	Summary
Design Specifications `docs/design/byok-pdf/byok-pdf-spike.md`, `docs/design/byok-pdf/byok-pdf.md`	Complete spike and feature design specs for PDF input support in BYOK tool, including architecture recommendations, integration points with existing document processors, scope constraints (text-extractable PDFs only, no OCR by default), and module structure mirroring HTML handler patterns.
PoC Results & Analysis `docs/design/byok-pdf/poc-results/01-poc-report.txt`, `docs/design/byok-pdf/poc-results/02-conversion-log.txt`	PoC validation documentation capturing conversion quality findings, performance metrics, runtime environment details, and conversion logs with timestamped pipeline stages.
Reference Documentation `docs/design/byok-pdf/poc-results/03-sample-jira-1311.md`, `docs/design/byok-pdf/poc-results/04-sample-jira-836.md`	Sample Jira issue documentation for context summarization (LCORE-1311) and configuration consolidation (LCORE-836) features.
PoC Implementation `docs/design/byok-pdf/poc/pdf_reader.py`	Standalone PDF-to-Markdown converter script demonstrating `docling` integration with `PdfPipelineOptions`, featuring `make_converter()` and `convert()` functions and CLI entry point for batch conversion validation.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the spike work: adding native PDF support to the BYOK content production tool (rag-content), as reflected in the design docs, PoC script, and validation results.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

✨ Simplify code

Create PR with simplified code

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/design/byok-pdf/byok-pdf-spike.md`:
- Around line 252-257: The "PoC results" section in the spike doc contains
inline links to artifacts that are slated for deletion (notably the
poc-results/* artifacts and poc/pdf_reader.py referenced in
02-conversion-log.txt); pick one of the two cleanup options now and implement
it: either remove the PoC results section and replace it with a short summary
paragraph that references the implementation PR for evidence, or keep a minimal
poc-results/ with only a findings summary so the links remain valid; update the
spike doc's "PoC results" section accordingly and add a note in the cleanup
checklist recording the chosen approach so the final PR preserves or removes
links consistently.
- Line 178: The docs/design/byok-pdf/byok-pdf-spike.md references
../../local-stack-testing.md which does not exist; either add the missing
docs/local-stack-testing.md containing the local-stack testing pattern or update
the reference in byok-pdf-spike.md to point to an existing document (search the
repo for equivalent testing docs and replace the link), and ensure the link is
correct and relative so it resolves from byok-pdf-spike.md.

In `@docs/design/byok-pdf/byok-pdf.md`:
- Around line 110-118: Replace the brittle line-number references to
document_processor.py with stable symbol anchors: update the docs to point to
_LlamaStackDB.__init__ and _BaseDB.__init__ (and explicitly call out the
predicate "if config.doc_type in (\"markdown\", \"html\", \"pdf\")") instead of
"document_processor.py:75/87"; also update the other occurrence mentioned
(around lines 162-186) to reference the same symbols/predicate so the
byok-pdf-spike.md agentic-tool instructions and implementation ticket remain
correct even if the file shifts.
- Around line 110-118: The two duplicated doc_type tuples used in
document_processor.py should be extracted to a single module-level constant to
avoid drift; add a constant like MARKDOWN_COMPATIBLE_DOC_TYPES: Final[tuple[str,
...]] = ("markdown", "html", "pdf") at the top of the module (import Final from
typing), then replace the inline tuples in the _LlamaStackDB and _BaseDB checks
(the places referencing config.doc_type) to use MARKDOWN_COMPATIBLE_DOC_TYPES
instead.
- Line 147: PDFReader.load_data currently lets scanned (image-only) PDFs produce
empty/near-empty Markdown silently; update PDFReader.load_data to detect when
the exported Markdown is empty after strip or under a small configurable
threshold (e.g., N characters) and emit logger.warning with contextual info
(filename/document id and resulting length or "empty after strip") so the
condition is visible in custom_processor.py logs; add a configurable
constant/argument for the threshold and ensure the warning message is clear and
only logged at warning level.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4691a22a-baf7-4e8c-995e-29d5cdee34cb

📥 Commits

Reviewing files that changed from the base of the PR and between c985fd7 and 250881e.

⛔ Files ignored due to path filters (2)

docs/design/byok-pdf/poc/sample_jira_1311.pdf is excluded by !**/*.pdf
docs/design/byok-pdf/poc/sample_jira_836.pdf is excluded by !**/*.pdf

📒 Files selected for processing (7)

docs/design/byok-pdf/byok-pdf-spike.md
docs/design/byok-pdf/byok-pdf.md
docs/design/byok-pdf/poc-results/01-poc-report.txt
docs/design/byok-pdf/poc-results/02-conversion-log.txt
docs/design/byok-pdf/poc-results/03-sample-jira-1311.md
docs/design/byok-pdf/poc-results/04-sample-jira-836.md
docs/design/byok-pdf/poc/pdf_reader.py

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)

GitHub Check: build-pr
GitHub Check: mypy
GitHub Check: unit_tests (3.13)
GitHub Check: Pylinter
GitHub Check: unit_tests (3.12)
GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
GitHub Check: E2E Tests for Lightspeed Evaluation job
GitHub Check: E2E: library mode / ci / group 3
GitHub Check: E2E: server mode / ci / group 3
GitHub Check: E2E: library mode / ci / group 1
GitHub Check: E2E: server mode / ci / group 1
GitHub Check: E2E: library mode / ci / group 2
GitHub Check: E2E: server mode / ci / group 2

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Import FastAPI dependencies with: from fastapi import APIRouter, HTTPException, Request, status, Depends
Import Llama Stack client with: from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
Type aliases defined at module level for clarity
Use Final[type] as type hint for all constants
All functions require docstrings with brief descriptions
Complete type annotations for parameters and return types in functions
Use typing_extensions.Self for model validators in Pydantic models
Use modern union type syntax str | int instead of Union[str, int]
Use Optional[Type] for optional type hints
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead
Use async def for I/O operations and external API calls
Handle APIConnectionError from Llama Stack in error handling
Use standard log levels with clear purposes: debug, info, warning, error
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Use ABC for abstract base classes with @abstractmethod decorators
Use @model_validator and @field_validator for Pydantic model validation
Complete type annotations for all class attributes; use specific types, not Any
Follow Google Python docstring conventions with Parameters, Returns, Raises, and Attributes sections

Files:

docs/design/byok-pdf/poc/pdf_reader.py

🪛 LanguageTool

docs/design/byok-pdf/poc-results/04-sample-jira-836.md

[grammar] ~25-~25: Ensure spelling is correct
Context: ...--> ## Streaml i ne l i ghtspeed-stack conf i g <!-- ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

[grammar] ~37-~37: Ensure spelling is correct
Context: ... Bus i ness Descr i pt i on ## Feature Overv i ew Currently, the Lightspeed stack requires...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

[style] ~84-~84: This phrase is redundant. Consider writing “eliminated”.
Context: ...create or modify a separate run.yaml is completely eliminated. - The stack deploys and operates corre...

(COMPLETELY_ANNIHILATE)

[style] ~164-~164: Some style guides suggest that commas should set off the year in a month-day-year date.
Context: ...ha Created October 15, 2025 at 1:34 PM Updated 3 days ago <!-- ima...

(MISSING_COMMA_AFTER_YEAR)

docs/design/byok-pdf/poc-results/01-poc-report.txt

[uncategorized] ~86-~86: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ... WELL. sample_jira_1311 contains two markdown tables (Component/Behavior and Area/...

(MARKDOWN_NNP)

[uncategorized] ~87-~87: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...Both came out as proper GitHub-flavored markdown tables with correct headers and cell...

(MARKDOWN_NNP)

[grammar] ~94-~94: Ensure spelling is correct
Context: ...ace between, producing headings like "Streaml i ne l i ghtspeed-stack conf i g" and "Acceptance cr...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

docs/design/byok-pdf/byok-pdf.md

[style] ~200-~200: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... is raised for a non-existent path. - Test that RuntimeError is raised when docl...