Skip to content

LCORE-1471 spike: BYOK PDF support#1598

Open
max-svistunov wants to merge 2 commits intolightspeed-core:mainfrom
max-svistunov:lcore-1471-spike-byok-pdf
Open

LCORE-1471 spike: BYOK PDF support#1598
max-svistunov wants to merge 2 commits intolightspeed-core:mainfrom
max-svistunov:lcore-1471-spike-byok-pdf

Conversation

@max-svistunov
Copy link
Copy Markdown
Contributor

@max-svistunov max-svistunov commented Apr 27, 2026

fmax## LCORE-1471 spike: BYOK PDF support

Spike deliverable for LCORE-1471 — adding native PDF support to the BYOK content production tool (rag-content). HTML support already shipped under LCORE-1035 (Jan 2026), so this spike is intentionally lightweight: scaffold-and-mirror the existing HTML pattern rather than re-design from scratch.

What's in this PR

Design docs (docs/design/byok-pdf/):

  • byok-pdf-spike.md (use the "Outline" button) — decisions, alternatives, proposed JIRAs, PoC results
  • byok-pdf.md (use the "Outline" button) — feature spec — changeable based on final decisions, kept in repo long-term

PoC code (will be removed before merge per spike workflow step 10):

  • docs/design/byok-pdf/poc/pdf_reader.py — 60-line script mirroring HTMLReader for PDF input
  • docs/design/byok-pdf/poc/sample_jira_*.pdf — two real-world PDFs used as test corpus

PoC validation results (will be removed before merge):

  • docs/design/byok-pdf/poc-results/01-poc-report.txt — methodology, findings, implications
  • docs/design/byok-pdf/poc-results/02-conversion-log.txt — exact commands and timings
  • docs/design/byok-pdf/poc-results/03-sample-jira-1311.md — clean conversion output
  • docs/design/byok-pdf/poc-results/04-sample-jira-836.md — heading degradation visible

Main findings

  1. No new dependencies needed. docling is already in pyproject.toml (added by LCORE-1035). The PDF path uses the same DocumentConverter API with InputFormat.PDF and PdfFormatOption(pipeline_options=...).
  2. Body text and tables convert cleanly. PoC produced well-formed Markdown with proper GitHub-flavored tables. The existing MarkdownNodeParser chunks the output without modification — only a one-token addition ("pdf") to two doc_type checks in document_processor.py.
  3. Heading degradation on letter-spaced fonts is the one real caveat. Confluence "Export to PDF" output produces headings like H e a d i n g. Body unaffected, chunking unaffected, but heading retrieval is degraded. Documented as a known v1 limitation; a heading-cleanup post-processor is noted as a follow-up.
  4. Performance is acceptable for offline indexing. ~30–90 s per small/medium PDF on CPU after model warm-up. Cold start is ~5 min for the first invocation.

For reviewers

@tisnik @sbunciak

Strategic decisions — please confirm or override (each has an options table and recommendation in the linked section):

@tisnik

Technical decisions — implementation-level:

Proposed JIRAs — review the Proposed JIRAs section; 4 sub-tickets under LCORE-1471:

  1. rag-content: Implement PDF support
  2. rag-content: Unit and integration tests
  3. rag-content: End-to-end test (PDF → vector store → stack query)
  4. lightspeed-stack: Update BYOK guide for native PDF support

Doc structure note: The decision and JIRA sections of the spike doc are where your input is needed. They link to background sections later in the doc — read those if you need more context on a specific point, but it is optional.


Type of change

  • New feature
  • Documentation Update

Tools used to create PR

  • Assisted-by: Claude Opus 4.7 (1M context)
  • Generated by: Claude Opus 4.7 (1M context)

Related Tickets & Documents

  • Related Issue # LCORE-1035 (HTML precedent)
  • Closes # LCORE-1471 (when JIRAs are filed and accepted; the spike itself is the deliverable)

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests. (N/A — spike only; tests are part of proposed JIRA GitHub templates #2)

Testing

The PoC was validated end-to-end against two real-world PDFs. Reproduction steps:

# 1. From the rag-content sibling repo (where docling is installed)
cd ~/repos/lightspeed/rag-content

# 2. Convert sample PDF #1 (clean Atlassian Cloud PDF)
uv run python ~/repos/lightspeed/stack/docs/design/byok-pdf/poc/pdf_reader.py \
    ~/repos/lightspeed/stack/docs/design/byok-pdf/poc/sample_jira_1311.pdf \
    /tmp/out_1311.md

# Expected output:
#   converted .../sample_jira_1311.pdf -> /tmp/out_1311.md
#     elapsed: 332s (cold) or ~30s (warm)
#     output:  ~7600 chars, ~290 lines

# 3. Convert sample PDF #2 (Confluence-style PDF, surfaces heading degradation)
uv run python ~/repos/lightspeed/stack/docs/design/byok-pdf/poc/pdf_reader.py \
    ~/repos/lightspeed/stack/docs/design/byok-pdf/poc/sample_jira_836.pdf \
    /tmp/out_836.md

# Expected output:
#   converted .../sample_jira_836.pdf -> /tmp/out_836.md
#     elapsed: ~70s (warm)
#     output:  ~3100 chars, ~165 lines

The committed poc-results/03-*.md and poc-results/04-*.md are the canonical conversion outputs from the spike runs — diff against them to confirm reproducibility.

Closes LCORE-1471
Related: LCORE-1035

Summary by CodeRabbit

  • Documentation
    • Added design specifications and proof-of-concept documentation for planned native PDF support in the BYOK content production tool
    • Includes conversion quality metrics, performance benchmarks, and detailed implementation approach for PDF ingestion workflows

Add a minimal PoC script that mirrors the production HTMLReader from
src/lightspeed_rag_content/html/html_reader.py but configures docling
for InputFormat.PDF. The script validates that docling's PDF pipeline
produces usable Markdown for the BYOK ingestion path with the
recommended defaults:

- do_ocr=False                  (text-extractable PDFs only; OCR is
                                 explicit non-goal of LCORE-1471)
- do_table_structure=True       (cheap quality win; tables are common
                                 in customer documentation)
- table_structure_options.mode='accurate'
- do_picture_classification=False, do_picture_description=False
- generate_page_images=False

The PoC was run against two real-world PDFs (locally-available
Lightspeed JIRA exports — Atlassian Cloud and Confluence-style):

  sample_jira_1311.pdf  217 KB  ->  7,608 chars / 288 lines
  sample_jira_836.pdf   372 KB  ->  3,084 chars / 165 lines

Run command (from rag-content's venv where docling is installed):

  cd ~/repos/lightspeed/rag-content
  uv run python <stack>/docs/design/byok-pdf/poc/pdf_reader.py \
      <input.pdf> [<output.md>]

The PoC is throwaway code. The production implementation will mirror
the lightspeed_rag_content/html/ package one-for-one, replacing
InputFormat.HTML with InputFormat.PDF and adding the
PdfFormatOption(pipeline_options=...) constructor argument.

This commit only adds the PoC script and sample inputs; the converted
output, evidence files, and design docs land in a follow-up commit.
Add the spike doc (decisions up front, background below, 4 proposed
JIRAs) and the spec doc (R1..R6 requirements, architecture, key files
and insertion points, known limitations) under docs/design/byok-pdf/.

The spike is lightweight by design: HTML support shipped under
LCORE-1035 (commit 7f688b0, 2026-01-15), so the architectural pattern,
docling dependency, BaseReader plumbing, CLI shape, and test layout
are all already established. PDF support is a scaffold-and-mirror job
plus a one-line addition to document_processor.py's doc_type branches.

Decisions captured for confirmation (each with options table and
recommendation in the spike doc):

  D1: Library                -- docling (already a dependency)
  D2: OCR for scanned PDFs   -- out of scope; track as follow-up
  D3: Repo placement          -- rag-content (impl) + lightspeed-stack
                                  (BYOK guide update only)
  D4: Pipeline knobs          -- hard-coded sensible defaults; no CLI
                                  flags in v1 (mirrors HTMLReader)
  D5: Chunking strategy       -- reuse MarkdownNodeParser; add "pdf" to
                                  document_processor.py:75 and :87
  D6: Code organization       -- new pdf/ package mirroring html/
  D7: Test coverage           -- unit/integration in JIRA #2, e2e in lightspeed-core#3

Four sub-JIRAs proposed under LCORE-1471 (parseable by
dev-tools/file-jiras.sh):

  1. rag-content: Implement PDF support
  2. rag-content: Unit and integration tests
  3. rag-content: End-to-end test (PDF -> vector store -> stack query)
  4. lightspeed-stack: Update BYOK guide for native PDF support

PoC evidence under poc-results/:

  01-poc-report.txt    Methodology, findings, implications
  02-conversion-log.txt  Exact commands and timings
  03-sample-jira-1311.md  Clean conversion (Atlassian Cloud PDF)
  04-sample-jira-836.md   Body clean, headings degraded
                          (Confluence PDF, letter-spaced display font)

Honest PoC findings worth surfacing:

- No new dependencies are needed (docling is already in pyproject.toml).
- Body text and tables convert cleanly to Markdown.
- MarkdownNodeParser handles the output -- no parallel chunking pipeline.
- Letter-spaced display fonts (typical of Confluence "Export to PDF")
  produce noisy heading text; documented as a v1 known limitation.
- Cold model load is ~5 minutes on CPU; warm conversions ~30-90 s for
  small/medium PDFs. Acceptable for offline indexing.

Per howto-run-a-spike.md step 10, poc/ and poc-results/ will be
removed before merge; spike doc and spec doc remain in the repo.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 27, 2026

Walkthrough

This PR adds design documentation and proof-of-concept implementation for enabling native PDF input support in the BYOK content production tool. It includes detailed feature and spike specifications, PoC results with conversion analysis, sample Jira issue documentation, and a standalone PDF-to-Markdown converter script using docling.

Changes

Cohort / File(s) Summary
Design Specifications
docs/design/byok-pdf/byok-pdf-spike.md, docs/design/byok-pdf/byok-pdf.md
Complete spike and feature design specs for PDF input support in BYOK tool, including architecture recommendations, integration points with existing document processors, scope constraints (text-extractable PDFs only, no OCR by default), and module structure mirroring HTML handler patterns.
PoC Results & Analysis
docs/design/byok-pdf/poc-results/01-poc-report.txt, docs/design/byok-pdf/poc-results/02-conversion-log.txt
PoC validation documentation capturing conversion quality findings, performance metrics, runtime environment details, and conversion logs with timestamped pipeline stages.
Reference Documentation
docs/design/byok-pdf/poc-results/03-sample-jira-1311.md, docs/design/byok-pdf/poc-results/04-sample-jira-836.md
Sample Jira issue documentation for context summarization (LCORE-1311) and configuration consolidation (LCORE-836) features.
PoC Implementation
docs/design/byok-pdf/poc/pdf_reader.py
Standalone PDF-to-Markdown converter script demonstrating docling integration with PdfPipelineOptions, featuring make_converter() and convert() functions and CLI entry point for batch conversion validation.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the spike work: adding native PDF support to the BYOK content production tool (rag-content), as reflected in the design docs, PoC script, and validation results.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
✨ Simplify code
  • Create PR with simplified code

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/design/byok-pdf/byok-pdf-spike.md`:
- Around line 252-257: The "PoC results" section in the spike doc contains
inline links to artifacts that are slated for deletion (notably the
poc-results/* artifacts and poc/pdf_reader.py referenced in
02-conversion-log.txt); pick one of the two cleanup options now and implement
it: either remove the PoC results section and replace it with a short summary
paragraph that references the implementation PR for evidence, or keep a minimal
poc-results/ with only a findings summary so the links remain valid; update the
spike doc's "PoC results" section accordingly and add a note in the cleanup
checklist recording the chosen approach so the final PR preserves or removes
links consistently.
- Line 178: The docs/design/byok-pdf/byok-pdf-spike.md references
../../local-stack-testing.md which does not exist; either add the missing
docs/local-stack-testing.md containing the local-stack testing pattern or update
the reference in byok-pdf-spike.md to point to an existing document (search the
repo for equivalent testing docs and replace the link), and ensure the link is
correct and relative so it resolves from byok-pdf-spike.md.

In `@docs/design/byok-pdf/byok-pdf.md`:
- Around line 110-118: Replace the brittle line-number references to
document_processor.py with stable symbol anchors: update the docs to point to
_LlamaStackDB.__init__ and _BaseDB.__init__ (and explicitly call out the
predicate "if config.doc_type in (\"markdown\", \"html\", \"pdf\")") instead of
"document_processor.py:75/87"; also update the other occurrence mentioned
(around lines 162-186) to reference the same symbols/predicate so the
byok-pdf-spike.md agentic-tool instructions and implementation ticket remain
correct even if the file shifts.
- Around line 110-118: The two duplicated doc_type tuples used in
document_processor.py should be extracted to a single module-level constant to
avoid drift; add a constant like MARKDOWN_COMPATIBLE_DOC_TYPES: Final[tuple[str,
...]] = ("markdown", "html", "pdf") at the top of the module (import Final from
typing), then replace the inline tuples in the _LlamaStackDB and _BaseDB checks
(the places referencing config.doc_type) to use MARKDOWN_COMPATIBLE_DOC_TYPES
instead.
- Line 147: PDFReader.load_data currently lets scanned (image-only) PDFs produce
empty/near-empty Markdown silently; update PDFReader.load_data to detect when
the exported Markdown is empty after strip or under a small configurable
threshold (e.g., N characters) and emit logger.warning with contextual info
(filename/document id and resulting length or "empty after strip") so the
condition is visible in custom_processor.py logs; add a configurable
constant/argument for the threshold and ensure the warning message is clear and
only logged at warning level.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4691a22a-baf7-4e8c-995e-29d5cdee34cb

📥 Commits

Reviewing files that changed from the base of the PR and between c985fd7 and 250881e.

⛔ Files ignored due to path filters (2)
  • docs/design/byok-pdf/poc/sample_jira_1311.pdf is excluded by !**/*.pdf
  • docs/design/byok-pdf/poc/sample_jira_836.pdf is excluded by !**/*.pdf
📒 Files selected for processing (7)
  • docs/design/byok-pdf/byok-pdf-spike.md
  • docs/design/byok-pdf/byok-pdf.md
  • docs/design/byok-pdf/poc-results/01-poc-report.txt
  • docs/design/byok-pdf/poc-results/02-conversion-log.txt
  • docs/design/byok-pdf/poc-results/03-sample-jira-1311.md
  • docs/design/byok-pdf/poc-results/04-sample-jira-836.md
  • docs/design/byok-pdf/poc/pdf_reader.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: build-pr
  • GitHub Check: mypy
  • GitHub Check: unit_tests (3.13)
  • GitHub Check: Pylinter
  • GitHub Check: unit_tests (3.12)
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
  • GitHub Check: E2E Tests for Lightspeed Evaluation job
  • GitHub Check: E2E: library mode / ci / group 3
  • GitHub Check: E2E: server mode / ci / group 3
  • GitHub Check: E2E: library mode / ci / group 1
  • GitHub Check: E2E: server mode / ci / group 1
  • GitHub Check: E2E: library mode / ci / group 2
  • GitHub Check: E2E: server mode / ci / group 2
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Import FastAPI dependencies with: from fastapi import APIRouter, HTTPException, Request, status, Depends
Import Llama Stack client with: from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
Type aliases defined at module level for clarity
Use Final[type] as type hint for all constants
All functions require docstrings with brief descriptions
Complete type annotations for parameters and return types in functions
Use typing_extensions.Self for model validators in Pydantic models
Use modern union type syntax str | int instead of Union[str, int]
Use Optional[Type] for optional type hints
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead
Use async def for I/O operations and external API calls
Handle APIConnectionError from Llama Stack in error handling
Use standard log levels with clear purposes: debug, info, warning, error
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Use ABC for abstract base classes with @abstractmethod decorators
Use @model_validator and @field_validator for Pydantic model validation
Complete type annotations for all class attributes; use specific types, not Any
Follow Google Python docstring conventions with Parameters, Returns, Raises, and Attributes sections

Files:

  • docs/design/byok-pdf/poc/pdf_reader.py
🪛 LanguageTool
docs/design/byok-pdf/poc-results/04-sample-jira-836.md

[grammar] ~25-~25: Ensure spelling is correct
Context: ...--> ## Streaml i ne l i ghtspeed-stack conf i g <!-- ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~37-~37: Ensure spelling is correct
Context: ... Bus i ness Descr i pt i on ## Feature Overv i ew Currently, the Lightspeed stack requires...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[style] ~84-~84: This phrase is redundant. Consider writing “eliminated”.
Context: ...create or modify a separate run.yaml is completely eliminated. - The stack deploys and operates corre...

(COMPLETELY_ANNIHILATE)


[style] ~164-~164: Some style guides suggest that commas should set off the year in a month-day-year date.
Context: ...ha Created October 15, 2025 at 1:34 PM Updated 3 days ago <!-- ima...

(MISSING_COMMA_AFTER_YEAR)

docs/design/byok-pdf/poc-results/01-poc-report.txt

[uncategorized] ~86-~86: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ... WELL. sample_jira_1311 contains two markdown tables (Component/Behavior and Area/...

(MARKDOWN_NNP)


[uncategorized] ~87-~87: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...Both came out as proper GitHub-flavored markdown tables with correct headers and cell...

(MARKDOWN_NNP)


[grammar] ~94-~94: Ensure spelling is correct
Context: ...ace between, producing headings like "Streaml i ne l i ghtspeed-stack conf i g" and "Acceptance cr...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

docs/design/byok-pdf/byok-pdf.md

[style] ~200-~200: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... is raised for a non-existent path. - Test that RuntimeError is raised when docl...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~201-~201: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ocling raises (mock the converter). - Test that extra_info is preserved in `Docu...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~206-~206: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...irectories and preserves structure. - Test that errors exit non-zero. Use the exi...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

docs/design/byok-pdf/byok-pdf-spike.md

[style] ~241-~241: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ree function; production is a class). - No extra_info/metadata passthrough (prod...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~242-~242: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...andles extra_info like HTMLReader). - No batch mode. - No tests. ### Results |...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~243-~243: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...o` like HTMLReader). - No batch mode. - No tests. ### Results | PDF | Size | Wal...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 markdownlint-cli2 (0.22.1)
docs/design/byok-pdf/poc-results/04-sample-jira-836.md

[warning] 3-3: First line in a file should be a top-level heading

(MD041, first-line-heading, first-line-h1)

docs/design/byok-pdf/poc-results/03-sample-jira-1311.md

[warning] 1-1: First line in a file should be a top-level heading

(MD041, first-line-heading, first-line-h1)


[warning] 116-116: Ordered list item prefix
Expected: 4; Actual: 2; Style: 1/2/3

(MD029, ol-prefix)


[warning] 129-129: Ordered list item prefix
Expected: 4; Actual: 2; Style: 1/2/3

(MD029, ol-prefix)


[warning] 288-288: Files should end with a single newline character

(MD047, single-trailing-newline)

🔇 Additional comments (2)
docs/design/byok-pdf/poc/pdf_reader.py (1)

1-73: LGTM — throwaway PoC matches design and PoC report

The script faithfully implements the pipeline configuration documented in byok-pdf.md and 01-poc-report.txt (no OCR, accurate table mode, no picture/page-image work) and is correctly scoped as throwaway code per the docstring. Per PR objectives, this file is deleted before merge, so the usual nits (unused LOG at line 21, no error handling around converter.convert, slightly off-by-one line count) aren't worth chasing here.

Note: Line 36 assigns string "accurate" to table_structure_options.mode. Docling's Pydantic coercion permits this, but the idiomatic form is the enum member TableFormerMode.ACCURATE. When mirroring this into the production pdf_reader.py in the implementation ticket, use the enum member directly.

docs/design/byok-pdf/byok-pdf.md (1)

75-91: Verify implementation against HTMLReader precedent for super().__init__() requirement. The design doc correctly instructs implementers to mirror HTMLReader line-for-line, but this verification step should occur during implementation of src/lightspeed_rag_content/pdf/pdf_reader.py in the rag-content repository. Inspect the actual html_reader.py to confirm whether BaseReader from llama-index requires an explicit super().__init__() call; if so, ensure the PDFReader.__init__ includes it.


**Scope**:

- Reuse the local-stack-testing pattern from [docs/local-stack-testing.md](../../local-stack-testing.md).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the referenced doc exists at the expected resolved path.
fd -t f 'local-stack-testing.md'

Repository: lightspeed-core/lightspeed-stack

Length of output: 58


🏁 Script executed:

# Search more broadly for the file with potential variations
fd -t f -i 'local.*stack.*testing'

Repository: lightspeed-core/lightspeed-stack

Length of output: 58


🏁 Script executed:

# Read the actual file to verify the reference at line 178
head -n 185 docs/design/byok-pdf/byok-pdf-spike.md | tail -n 15

Repository: lightspeed-core/lightspeed-stack

Length of output: 1000


The referenced file local-stack-testing.md does not exist at the expected location or anywhere else in the repository.

The relative path ../../local-stack-testing.md from docs/design/byok-pdf/byok-pdf-spike.md correctly resolves to docs/local-stack-testing.md, but that file is not present. Either create the file with the testing pattern or correct the reference to an existing location. This needs to be resolved before the e2e test scope becomes part of the agentic tool instructions or JIRA description, as users will encounter a broken reference.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/byok-pdf/byok-pdf-spike.md` at line 178, The
docs/design/byok-pdf/byok-pdf-spike.md references ../../local-stack-testing.md
which does not exist; either add the missing docs/local-stack-testing.md
containing the local-stack testing pattern or update the reference in
byok-pdf-spike.md to point to an existing document (search the repo for
equivalent testing docs and replace the link), and ensure the link is correct
and relative so it resolves from byok-pdf-spike.md.

Comment on lines +252 to +257
Detailed findings, log excerpts, and converted Markdown are in [`poc-results/`](poc-results/):

- [`01-poc-report.txt`](poc-results/01-poc-report.txt) — methodology, findings, implications
- [`02-conversion-log.txt`](poc-results/02-conversion-log.txt) — exact commands and timings
- [`03-sample-jira-1311.md`](poc-results/03-sample-jira-1311.md) — converted output (clean)
- [`04-sample-jira-836.md`](poc-results/04-sample-jira-836.md) — converted output (heading degradation visible)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Spike doc will contain dangling links after PoC cleanup

The spike doc survives merge but poc-results/01..04 and poc/pdf_reader.py are explicitly slated for deletion (per 02-conversion-log.txt lines 65-67 and PR objectives). After cleanup, lines 254-257 here (and the link at line 252 to poc-results/) become broken intra-repo links in a long-lived doc. Two reasonable options for the cleanup PR / final state:

  • Remove the "PoC results" section and these inline links once the artifacts are deleted, leaving a paragraph that summarizes findings and points at the implementation PR for evidence.
  • Or, keep a minimal poc-results/ containing only this spike's findings summary (no script, no sample PDFs) so the links remain valid.

Worth deciding now and noting the chosen approach in the cleanup checklist so the link rot doesn't slip through.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/byok-pdf/byok-pdf-spike.md` around lines 252 - 257, The "PoC
results" section in the spike doc contains inline links to artifacts that are
slated for deletion (notably the poc-results/* artifacts and poc/pdf_reader.py
referenced in 02-conversion-log.txt); pick one of the two cleanup options now
and implement it: either remove the PoC results section and replace it with a
short summary paragraph that references the implementation PR for evidence, or
keep a minimal poc-results/ with only a findings summary so the links remain
valid; update the spike doc's "PoC results" section accordingly and add a note
in the cleanup checklist recording the chosen approach so the final PR preserves
or removes links consistently.

Comment on lines +110 to +118
```python
# document_processor.py:75 — _LlamaStackDB
if config.doc_type in ("markdown", "html"): # before
if config.doc_type in ("markdown", "html", "pdf"): # after

# document_processor.py:87 — _BaseDB
if config.doc_type in ("markdown", "html"): # before
if config.doc_type in ("markdown", "html", "pdf"): # after
```
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Replace hardcoded document_processor.py:75/87 line refs with stable anchors

These line numbers point into the sibling rag-content repo and are not gated by this PR. They will drift the moment any unrelated edit touches document_processor.py above line 75 (imports, a new helper, etc.), and the design doc — which lives on indefinitely — will then point at the wrong lines. Reference the symbols instead (_LlamaStackDB.__init__ and _BaseDB.__init__, plus the if config.doc_type in (...) predicate), so the agentic-tool instructions in byok-pdf-spike.md and the implementation ticket stay accurate without re-validation.

Also applies to: 162-186

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/byok-pdf/byok-pdf.md` around lines 110 - 118, Replace the brittle
line-number references to document_processor.py with stable symbol anchors:
update the docs to point to _LlamaStackDB.__init__ and _BaseDB.__init__ (and
explicitly call out the predicate "if config.doc_type in (\"markdown\",
\"html\", \"pdf\")") instead of "document_processor.py:75/87"; also update the
other occurrence mentioned (around lines 162-186) to reference the same
symbols/predicate so the byok-pdf-spike.md agentic-tool instructions and
implementation ticket remain correct even if the file shifts.

🧹 Nitpick | 🔵 Trivial

Extract the doc_type tuple to a module-level constant rather than duplicating it twice

The proposed change repeats ("markdown", "html", "pdf") at two call sites in document_processor.py. Open Questions (line 216) explicitly anticipates DOCX/RTF/EPUB readers landing later — each new format would then require editing two tuples in lockstep, with no compiler help if one is forgotten. Suggest the implementation ticket land a single MARKDOWN_COMPATIBLE_DOC_TYPES: Final[tuple[str, ...]] = ("markdown", "html", "pdf") (or similar) used at both sites. Cheap DRY win that pays off across the planned roadmap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/byok-pdf/byok-pdf.md` around lines 110 - 118, The two duplicated
doc_type tuples used in document_processor.py should be extracted to a single
module-level constant to avoid drift; add a constant like
MARKDOWN_COMPATIBLE_DOC_TYPES: Final[tuple[str, ...]] = ("markdown", "html",
"pdf") at the top of the module (import Final from typing), then replace the
inline tuples in the _LlamaStackDB and _BaseDB checks (the places referencing
config.doc_type) to use MARKDOWN_COMPATIBLE_DOC_TYPES instead.

- `RuntimeError` (with the underlying exception chained via `from exc`) if docling conversion fails.
- The CLI catches both, logs, and exits non-zero.

A scanned (image-only) PDF will produce empty or near-empty Markdown. This is not an error — the document parses fine; it just contains no extractable text. Customers will see a valid but empty-ish vector store entry. Document this in the rag-content README.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Specify a warning log when conversion yields empty/near-empty Markdown

Treating a scanned PDF as "not an error" is reasonable, but silently producing a near-empty vector-store entry is a debugging nightmare for BYOK customers — they will index a PDF, run a query, get nothing back, and have no signal that the PDF lacked extractable text. Recommend specifying that PDFReader.load_data emits a logger.warning(...) when the exported Markdown is below some small threshold (e.g., empty after whitespace strip, or under N characters), so the failure mode is observable in custom_processor.py output. Cheap to add and aligns with the coding guideline of using clear log levels.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/byok-pdf/byok-pdf.md` at line 147, PDFReader.load_data currently
lets scanned (image-only) PDFs produce empty/near-empty Markdown silently;
update PDFReader.load_data to detect when the exported Markdown is empty after
strip or under a small configurable threshold (e.g., N characters) and emit
logger.warning with contextual info (filename/document id and resulting length
or "empty after strip") so the condition is visible in custom_processor.py logs;
add a configurable constant/argument for the threshold and ensure the warning
message is clear and only logged at warning level.

Copy link
Copy Markdown
Contributor

@tisnik tisnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Contributor

@jrobertboos jrobertboos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sbunciak
Copy link
Copy Markdown
Contributor

LGTM @max-svistunov agree with your recommendations

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants