LCORE-1471 spike: BYOK PDF support#1598
LCORE-1471 spike: BYOK PDF support#1598max-svistunov wants to merge 2 commits intolightspeed-core:mainfrom
Conversation
Add a minimal PoC script that mirrors the production HTMLReader from
src/lightspeed_rag_content/html/html_reader.py but configures docling
for InputFormat.PDF. The script validates that docling's PDF pipeline
produces usable Markdown for the BYOK ingestion path with the
recommended defaults:
- do_ocr=False (text-extractable PDFs only; OCR is
explicit non-goal of LCORE-1471)
- do_table_structure=True (cheap quality win; tables are common
in customer documentation)
- table_structure_options.mode='accurate'
- do_picture_classification=False, do_picture_description=False
- generate_page_images=False
The PoC was run against two real-world PDFs (locally-available
Lightspeed JIRA exports — Atlassian Cloud and Confluence-style):
sample_jira_1311.pdf 217 KB -> 7,608 chars / 288 lines
sample_jira_836.pdf 372 KB -> 3,084 chars / 165 lines
Run command (from rag-content's venv where docling is installed):
cd ~/repos/lightspeed/rag-content
uv run python <stack>/docs/design/byok-pdf/poc/pdf_reader.py \
<input.pdf> [<output.md>]
The PoC is throwaway code. The production implementation will mirror
the lightspeed_rag_content/html/ package one-for-one, replacing
InputFormat.HTML with InputFormat.PDF and adding the
PdfFormatOption(pipeline_options=...) constructor argument.
This commit only adds the PoC script and sample inputs; the converted
output, evidence files, and design docs land in a follow-up commit.
Add the spike doc (decisions up front, background below, 4 proposed
JIRAs) and the spec doc (R1..R6 requirements, architecture, key files
and insertion points, known limitations) under docs/design/byok-pdf/.
The spike is lightweight by design: HTML support shipped under
LCORE-1035 (commit 7f688b0, 2026-01-15), so the architectural pattern,
docling dependency, BaseReader plumbing, CLI shape, and test layout
are all already established. PDF support is a scaffold-and-mirror job
plus a one-line addition to document_processor.py's doc_type branches.
Decisions captured for confirmation (each with options table and
recommendation in the spike doc):
D1: Library -- docling (already a dependency)
D2: OCR for scanned PDFs -- out of scope; track as follow-up
D3: Repo placement -- rag-content (impl) + lightspeed-stack
(BYOK guide update only)
D4: Pipeline knobs -- hard-coded sensible defaults; no CLI
flags in v1 (mirrors HTMLReader)
D5: Chunking strategy -- reuse MarkdownNodeParser; add "pdf" to
document_processor.py:75 and :87
D6: Code organization -- new pdf/ package mirroring html/
D7: Test coverage -- unit/integration in JIRA #2, e2e in lightspeed-core#3
Four sub-JIRAs proposed under LCORE-1471 (parseable by
dev-tools/file-jiras.sh):
1. rag-content: Implement PDF support
2. rag-content: Unit and integration tests
3. rag-content: End-to-end test (PDF -> vector store -> stack query)
4. lightspeed-stack: Update BYOK guide for native PDF support
PoC evidence under poc-results/:
01-poc-report.txt Methodology, findings, implications
02-conversion-log.txt Exact commands and timings
03-sample-jira-1311.md Clean conversion (Atlassian Cloud PDF)
04-sample-jira-836.md Body clean, headings degraded
(Confluence PDF, letter-spaced display font)
Honest PoC findings worth surfacing:
- No new dependencies are needed (docling is already in pyproject.toml).
- Body text and tables convert cleanly to Markdown.
- MarkdownNodeParser handles the output -- no parallel chunking pipeline.
- Letter-spaced display fonts (typical of Confluence "Export to PDF")
produce noisy heading text; documented as a v1 known limitation.
- Cold model load is ~5 minutes on CPU; warm conversions ~30-90 s for
small/medium PDFs. Acceptable for offline indexing.
Per howto-run-a-spike.md step 10, poc/ and poc-results/ will be
removed before merge; spike doc and spec doc remain in the repo.
WalkthroughThis PR adds design documentation and proof-of-concept implementation for enabling native PDF input support in the BYOK content production tool. It includes detailed feature and spike specifications, PoC results with conversion analysis, sample Jira issue documentation, and a standalone PDF-to-Markdown converter script using Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/design/byok-pdf/byok-pdf-spike.md`:
- Around line 252-257: The "PoC results" section in the spike doc contains
inline links to artifacts that are slated for deletion (notably the
poc-results/* artifacts and poc/pdf_reader.py referenced in
02-conversion-log.txt); pick one of the two cleanup options now and implement
it: either remove the PoC results section and replace it with a short summary
paragraph that references the implementation PR for evidence, or keep a minimal
poc-results/ with only a findings summary so the links remain valid; update the
spike doc's "PoC results" section accordingly and add a note in the cleanup
checklist recording the chosen approach so the final PR preserves or removes
links consistently.
- Line 178: The docs/design/byok-pdf/byok-pdf-spike.md references
../../local-stack-testing.md which does not exist; either add the missing
docs/local-stack-testing.md containing the local-stack testing pattern or update
the reference in byok-pdf-spike.md to point to an existing document (search the
repo for equivalent testing docs and replace the link), and ensure the link is
correct and relative so it resolves from byok-pdf-spike.md.
In `@docs/design/byok-pdf/byok-pdf.md`:
- Around line 110-118: Replace the brittle line-number references to
document_processor.py with stable symbol anchors: update the docs to point to
_LlamaStackDB.__init__ and _BaseDB.__init__ (and explicitly call out the
predicate "if config.doc_type in (\"markdown\", \"html\", \"pdf\")") instead of
"document_processor.py:75/87"; also update the other occurrence mentioned
(around lines 162-186) to reference the same symbols/predicate so the
byok-pdf-spike.md agentic-tool instructions and implementation ticket remain
correct even if the file shifts.
- Around line 110-118: The two duplicated doc_type tuples used in
document_processor.py should be extracted to a single module-level constant to
avoid drift; add a constant like MARKDOWN_COMPATIBLE_DOC_TYPES: Final[tuple[str,
...]] = ("markdown", "html", "pdf") at the top of the module (import Final from
typing), then replace the inline tuples in the _LlamaStackDB and _BaseDB checks
(the places referencing config.doc_type) to use MARKDOWN_COMPATIBLE_DOC_TYPES
instead.
- Line 147: PDFReader.load_data currently lets scanned (image-only) PDFs produce
empty/near-empty Markdown silently; update PDFReader.load_data to detect when
the exported Markdown is empty after strip or under a small configurable
threshold (e.g., N characters) and emit logger.warning with contextual info
(filename/document id and resulting length or "empty after strip") so the
condition is visible in custom_processor.py logs; add a configurable
constant/argument for the threshold and ensure the warning message is clear and
only logged at warning level.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4691a22a-baf7-4e8c-995e-29d5cdee34cb
⛔ Files ignored due to path filters (2)
docs/design/byok-pdf/poc/sample_jira_1311.pdfis excluded by!**/*.pdfdocs/design/byok-pdf/poc/sample_jira_836.pdfis excluded by!**/*.pdf
📒 Files selected for processing (7)
docs/design/byok-pdf/byok-pdf-spike.mddocs/design/byok-pdf/byok-pdf.mddocs/design/byok-pdf/poc-results/01-poc-report.txtdocs/design/byok-pdf/poc-results/02-conversion-log.txtdocs/design/byok-pdf/poc-results/03-sample-jira-1311.mddocs/design/byok-pdf/poc-results/04-sample-jira-836.mddocs/design/byok-pdf/poc/pdf_reader.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: build-pr
- GitHub Check: mypy
- GitHub Check: unit_tests (3.13)
- GitHub Check: Pylinter
- GitHub Check: unit_tests (3.12)
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
- GitHub Check: E2E Tests for Lightspeed Evaluation job
- GitHub Check: E2E: library mode / ci / group 3
- GitHub Check: E2E: server mode / ci / group 3
- GitHub Check: E2E: library mode / ci / group 1
- GitHub Check: E2E: server mode / ci / group 1
- GitHub Check: E2E: library mode / ci / group 2
- GitHub Check: E2E: server mode / ci / group 2
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use absolute imports for internal modules:from authentication import get_auth_dependency
Import FastAPI dependencies with:from fastapi import APIRouter, HTTPException, Request, status, Depends
Import Llama Stack client with:from llama_stack_client import AsyncLlamaStackClient
Checkconstants.pyfor shared constants before defining new ones
All modules start with descriptive docstrings explaining purpose
Uselogger = get_logger(__name__)fromlog.pyfor module logging
Type aliases defined at module level for clarity
Use Final[type] as type hint for all constants
All functions require docstrings with brief descriptions
Complete type annotations for parameters and return types in functions
Usetyping_extensions.Selffor model validators in Pydantic models
Use modern union type syntaxstr | intinstead ofUnion[str, int]
UseOptional[Type]for optional type hints
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead
Useasync deffor I/O operations and external API calls
HandleAPIConnectionErrorfrom Llama Stack in error handling
Use standard log levels with clear purposes: debug, info, warning, error
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Use ABC for abstract base classes with@abstractmethoddecorators
Use@model_validatorand@field_validatorfor Pydantic model validation
Complete type annotations for all class attributes; use specific types, notAny
Follow Google Python docstring conventions with Parameters, Returns, Raises, and Attributes sections
Files:
docs/design/byok-pdf/poc/pdf_reader.py
🪛 LanguageTool
docs/design/byok-pdf/poc-results/04-sample-jira-836.md
[grammar] ~25-~25: Ensure spelling is correct
Context: ...--> ## Streaml i ne l i ghtspeed-stack conf i g <!-- ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~37-~37: Ensure spelling is correct
Context: ... Bus i ness Descr i pt i on ## Feature Overv i ew Currently, the Lightspeed stack requires...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[style] ~84-~84: This phrase is redundant. Consider writing “eliminated”.
Context: ...create or modify a separate run.yaml is completely eliminated. - The stack deploys and operates corre...
(COMPLETELY_ANNIHILATE)
[style] ~164-~164: Some style guides suggest that commas should set off the year in a month-day-year date.
Context: ...ha Created October 15, 2025 at 1:34 PM Updated 3 days ago <!-- ima...
(MISSING_COMMA_AFTER_YEAR)
docs/design/byok-pdf/poc-results/01-poc-report.txt
[uncategorized] ~86-~86: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ... WELL. sample_jira_1311 contains two markdown tables (Component/Behavior and Area/...
(MARKDOWN_NNP)
[uncategorized] ~87-~87: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...Both came out as proper GitHub-flavored markdown tables with correct headers and cell...
(MARKDOWN_NNP)
[grammar] ~94-~94: Ensure spelling is correct
Context: ...ace between, producing headings like "Streaml i ne l i ghtspeed-stack conf i g" and "Acceptance cr...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
docs/design/byok-pdf/byok-pdf.md
[style] ~200-~200: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... is raised for a non-existent path. - Test that RuntimeError is raised when docl...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~201-~201: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ocling raises (mock the converter). - Test that extra_info is preserved in `Docu...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~206-~206: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...irectories and preserves structure. - Test that errors exit non-zero. Use the exi...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
docs/design/byok-pdf/byok-pdf-spike.md
[style] ~241-~241: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ree function; production is a class). - No extra_info/metadata passthrough (prod...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~242-~242: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...andles extra_info like HTMLReader). - No batch mode. - No tests. ### Results |...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~243-~243: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...o` like HTMLReader). - No batch mode. - No tests. ### Results | PDF | Size | Wal...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
🪛 markdownlint-cli2 (0.22.1)
docs/design/byok-pdf/poc-results/04-sample-jira-836.md
[warning] 3-3: First line in a file should be a top-level heading
(MD041, first-line-heading, first-line-h1)
docs/design/byok-pdf/poc-results/03-sample-jira-1311.md
[warning] 1-1: First line in a file should be a top-level heading
(MD041, first-line-heading, first-line-h1)
[warning] 116-116: Ordered list item prefix
Expected: 4; Actual: 2; Style: 1/2/3
(MD029, ol-prefix)
[warning] 129-129: Ordered list item prefix
Expected: 4; Actual: 2; Style: 1/2/3
(MD029, ol-prefix)
[warning] 288-288: Files should end with a single newline character
(MD047, single-trailing-newline)
🔇 Additional comments (2)
docs/design/byok-pdf/poc/pdf_reader.py (1)
1-73: LGTM — throwaway PoC matches design and PoC reportThe script faithfully implements the pipeline configuration documented in
byok-pdf.mdand01-poc-report.txt(no OCR, accurate table mode, no picture/page-image work) and is correctly scoped as throwaway code per the docstring. Per PR objectives, this file is deleted before merge, so the usual nits (unusedLOGat line 21, no error handling aroundconverter.convert, slightly off-by-one line count) aren't worth chasing here.Note: Line 36 assigns string
"accurate"totable_structure_options.mode. Docling's Pydantic coercion permits this, but the idiomatic form is the enum memberTableFormerMode.ACCURATE. When mirroring this into the productionpdf_reader.pyin the implementation ticket, use the enum member directly.docs/design/byok-pdf/byok-pdf.md (1)
75-91: Verify implementation against HTMLReader precedent forsuper().__init__()requirement. The design doc correctly instructs implementers to mirrorHTMLReaderline-for-line, but this verification step should occur during implementation ofsrc/lightspeed_rag_content/pdf/pdf_reader.pyin the rag-content repository. Inspect the actualhtml_reader.pyto confirm whetherBaseReaderfrom llama-index requires an explicitsuper().__init__()call; if so, ensure thePDFReader.__init__includes it.
|
|
||
| **Scope**: | ||
|
|
||
| - Reuse the local-stack-testing pattern from [docs/local-stack-testing.md](../../local-stack-testing.md). |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the referenced doc exists at the expected resolved path.
fd -t f 'local-stack-testing.md'Repository: lightspeed-core/lightspeed-stack
Length of output: 58
🏁 Script executed:
# Search more broadly for the file with potential variations
fd -t f -i 'local.*stack.*testing'Repository: lightspeed-core/lightspeed-stack
Length of output: 58
🏁 Script executed:
# Read the actual file to verify the reference at line 178
head -n 185 docs/design/byok-pdf/byok-pdf-spike.md | tail -n 15Repository: lightspeed-core/lightspeed-stack
Length of output: 1000
The referenced file local-stack-testing.md does not exist at the expected location or anywhere else in the repository.
The relative path ../../local-stack-testing.md from docs/design/byok-pdf/byok-pdf-spike.md correctly resolves to docs/local-stack-testing.md, but that file is not present. Either create the file with the testing pattern or correct the reference to an existing location. This needs to be resolved before the e2e test scope becomes part of the agentic tool instructions or JIRA description, as users will encounter a broken reference.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/design/byok-pdf/byok-pdf-spike.md` at line 178, The
docs/design/byok-pdf/byok-pdf-spike.md references ../../local-stack-testing.md
which does not exist; either add the missing docs/local-stack-testing.md
containing the local-stack testing pattern or update the reference in
byok-pdf-spike.md to point to an existing document (search the repo for
equivalent testing docs and replace the link), and ensure the link is correct
and relative so it resolves from byok-pdf-spike.md.
| Detailed findings, log excerpts, and converted Markdown are in [`poc-results/`](poc-results/): | ||
|
|
||
| - [`01-poc-report.txt`](poc-results/01-poc-report.txt) — methodology, findings, implications | ||
| - [`02-conversion-log.txt`](poc-results/02-conversion-log.txt) — exact commands and timings | ||
| - [`03-sample-jira-1311.md`](poc-results/03-sample-jira-1311.md) — converted output (clean) | ||
| - [`04-sample-jira-836.md`](poc-results/04-sample-jira-836.md) — converted output (heading degradation visible) |
There was a problem hiding this comment.
Spike doc will contain dangling links after PoC cleanup
The spike doc survives merge but poc-results/01..04 and poc/pdf_reader.py are explicitly slated for deletion (per 02-conversion-log.txt lines 65-67 and PR objectives). After cleanup, lines 254-257 here (and the link at line 252 to poc-results/) become broken intra-repo links in a long-lived doc. Two reasonable options for the cleanup PR / final state:
- Remove the "PoC results" section and these inline links once the artifacts are deleted, leaving a paragraph that summarizes findings and points at the implementation PR for evidence.
- Or, keep a minimal
poc-results/containing only this spike's findings summary (no script, no sample PDFs) so the links remain valid.
Worth deciding now and noting the chosen approach in the cleanup checklist so the link rot doesn't slip through.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/design/byok-pdf/byok-pdf-spike.md` around lines 252 - 257, The "PoC
results" section in the spike doc contains inline links to artifacts that are
slated for deletion (notably the poc-results/* artifacts and poc/pdf_reader.py
referenced in 02-conversion-log.txt); pick one of the two cleanup options now
and implement it: either remove the PoC results section and replace it with a
short summary paragraph that references the implementation PR for evidence, or
keep a minimal poc-results/ with only a findings summary so the links remain
valid; update the spike doc's "PoC results" section accordingly and add a note
in the cleanup checklist recording the chosen approach so the final PR preserves
or removes links consistently.
| ```python | ||
| # document_processor.py:75 — _LlamaStackDB | ||
| if config.doc_type in ("markdown", "html"): # before | ||
| if config.doc_type in ("markdown", "html", "pdf"): # after | ||
|
|
||
| # document_processor.py:87 — _BaseDB | ||
| if config.doc_type in ("markdown", "html"): # before | ||
| if config.doc_type in ("markdown", "html", "pdf"): # after | ||
| ``` |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Replace hardcoded document_processor.py:75/87 line refs with stable anchors
These line numbers point into the sibling rag-content repo and are not gated by this PR. They will drift the moment any unrelated edit touches document_processor.py above line 75 (imports, a new helper, etc.), and the design doc — which lives on indefinitely — will then point at the wrong lines. Reference the symbols instead (_LlamaStackDB.__init__ and _BaseDB.__init__, plus the if config.doc_type in (...) predicate), so the agentic-tool instructions in byok-pdf-spike.md and the implementation ticket stay accurate without re-validation.
Also applies to: 162-186
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/design/byok-pdf/byok-pdf.md` around lines 110 - 118, Replace the brittle
line-number references to document_processor.py with stable symbol anchors:
update the docs to point to _LlamaStackDB.__init__ and _BaseDB.__init__ (and
explicitly call out the predicate "if config.doc_type in (\"markdown\",
\"html\", \"pdf\")") instead of "document_processor.py:75/87"; also update the
other occurrence mentioned (around lines 162-186) to reference the same
symbols/predicate so the byok-pdf-spike.md agentic-tool instructions and
implementation ticket remain correct even if the file shifts.
🧹 Nitpick | 🔵 Trivial
Extract the doc_type tuple to a module-level constant rather than duplicating it twice
The proposed change repeats ("markdown", "html", "pdf") at two call sites in document_processor.py. Open Questions (line 216) explicitly anticipates DOCX/RTF/EPUB readers landing later — each new format would then require editing two tuples in lockstep, with no compiler help if one is forgotten. Suggest the implementation ticket land a single MARKDOWN_COMPATIBLE_DOC_TYPES: Final[tuple[str, ...]] = ("markdown", "html", "pdf") (or similar) used at both sites. Cheap DRY win that pays off across the planned roadmap.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/design/byok-pdf/byok-pdf.md` around lines 110 - 118, The two duplicated
doc_type tuples used in document_processor.py should be extracted to a single
module-level constant to avoid drift; add a constant like
MARKDOWN_COMPATIBLE_DOC_TYPES: Final[tuple[str, ...]] = ("markdown", "html",
"pdf") at the top of the module (import Final from typing), then replace the
inline tuples in the _LlamaStackDB and _BaseDB checks (the places referencing
config.doc_type) to use MARKDOWN_COMPATIBLE_DOC_TYPES instead.
| - `RuntimeError` (with the underlying exception chained via `from exc`) if docling conversion fails. | ||
| - The CLI catches both, logs, and exits non-zero. | ||
|
|
||
| A scanned (image-only) PDF will produce empty or near-empty Markdown. This is not an error — the document parses fine; it just contains no extractable text. Customers will see a valid but empty-ish vector store entry. Document this in the rag-content README. |
There was a problem hiding this comment.
Specify a warning log when conversion yields empty/near-empty Markdown
Treating a scanned PDF as "not an error" is reasonable, but silently producing a near-empty vector-store entry is a debugging nightmare for BYOK customers — they will index a PDF, run a query, get nothing back, and have no signal that the PDF lacked extractable text. Recommend specifying that PDFReader.load_data emits a logger.warning(...) when the exported Markdown is below some small threshold (e.g., empty after whitespace strip, or under N characters), so the failure mode is observable in custom_processor.py output. Cheap to add and aligns with the coding guideline of using clear log levels.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/design/byok-pdf/byok-pdf.md` at line 147, PDFReader.load_data currently
lets scanned (image-only) PDFs produce empty/near-empty Markdown silently;
update PDFReader.load_data to detect when the exported Markdown is empty after
strip or under a small configurable threshold (e.g., N characters) and emit
logger.warning with contextual info (filename/document id and resulting length
or "empty after strip") so the condition is visible in custom_processor.py logs;
add a configurable constant/argument for the threshold and ensure the warning
message is clear and only logged at warning level.
|
LGTM @max-svistunov agree with your recommendations |
fmax## LCORE-1471 spike: BYOK PDF support
Spike deliverable for LCORE-1471 — adding native PDF support to the BYOK content production tool (
rag-content). HTML support already shipped under LCORE-1035 (Jan 2026), so this spike is intentionally lightweight: scaffold-and-mirror the existing HTML pattern rather than re-design from scratch.What's in this PR
Design docs (
docs/design/byok-pdf/):byok-pdf-spike.md(use the "Outline" button) — decisions, alternatives, proposed JIRAs, PoC resultsbyok-pdf.md(use the "Outline" button) — feature spec — changeable based on final decisions, kept in repo long-termPoC code (will be removed before merge per spike workflow step 10):
docs/design/byok-pdf/poc/pdf_reader.py— 60-line script mirroringHTMLReaderfor PDF inputdocs/design/byok-pdf/poc/sample_jira_*.pdf— two real-world PDFs used as test corpusPoC validation results (will be removed before merge):
docs/design/byok-pdf/poc-results/01-poc-report.txt— methodology, findings, implicationsdocs/design/byok-pdf/poc-results/02-conversion-log.txt— exact commands and timingsdocs/design/byok-pdf/poc-results/03-sample-jira-1311.md— clean conversion outputdocs/design/byok-pdf/poc-results/04-sample-jira-836.md— heading degradation visibleMain findings
doclingis already inpyproject.toml(added by LCORE-1035). The PDF path uses the sameDocumentConverterAPI withInputFormat.PDFandPdfFormatOption(pipeline_options=...).MarkdownNodeParserchunks the output without modification — only a one-token addition ("pdf") to twodoc_typechecks indocument_processor.py.H e a d i n g. Body unaffected, chunking unaffected, but heading retrieval is degraded. Documented as a known v1 limitation; a heading-cleanup post-processor is noted as a follow-up.For reviewers
@tisnik @sbunciak
Strategic decisions — please confirm or override (each has an options table and recommendation in the linked section):
@tisnik
Technical decisions — implementation-level:
MarkdownNodeParserpdf/package mirroringhtml/Proposed JIRAs — review the Proposed JIRAs section; 4 sub-tickets under LCORE-1471:
Doc structure note: The decision and JIRA sections of the spike doc are where your input is needed. They link to background sections later in the doc — read those if you need more context on a specific point, but it is optional.
Type of change
Tools used to create PR
Related Tickets & Documents
Checklist before requesting a review
Testing
The PoC was validated end-to-end against two real-world PDFs. Reproduction steps:
The committed
poc-results/03-*.mdandpoc-results/04-*.mdare the canonical conversion outputs from the spike runs — diff against them to confirm reproducibility.Closes LCORE-1471
Related: LCORE-1035
Summary by CodeRabbit