diff --git a/docs/design/byok-pdf/byok-pdf-spike.md b/docs/design/byok-pdf/byok-pdf-spike.md new file mode 100644 index 000000000..9e8d79c1c --- /dev/null +++ b/docs/design/byok-pdf/byok-pdf-spike.md @@ -0,0 +1,355 @@ +# Spike for LCORE-1471: BYOK PDF support + +## Overview + +This document is the deliverable for [LCORE-1471](https://issues.redhat.com/browse/LCORE-1471). It proposes the design for adding PDF support to the BYOK content production tool (`rag-content`), with a recommendation and a proof-of-concept validation. + +**The problem**: The BYOK pipeline only accepts Markdown and plain-text input. Customers typically have content in PDF or HTML and must convert it themselves before indexing. HTML support shipped under [LCORE-1035](https://issues.redhat.com/browse/LCORE-1035) (Jan 2026); PDF support is still missing. + +**The recommendation**: Add a `PDFReader` to `rag-content` mirroring the existing `HTMLReader`, configured to use the `docling` library that is already a dependency. No new third-party dependencies. Reuse the existing `MarkdownNodeParser` for chunking — `docling` exports clean Markdown for body content. Scope is text-extractable PDFs only; OCR is deferred. + +**PoC validation**: A 60-line PoC script converted two real-world PDFs (Lightspeed JIRA exports, 217 KB and 372 KB) using docling's PDF pipeline with the recommended defaults. Body text quality is high; tables are preserved as Markdown tables. Headings degrade on letter-spaced Confluence-export PDFs (cosmetic noise that survives into the markdown but is stripped during chunking). See [PoC results](#poc-results). + +## Decisions for @maxrubyonrails (and product owner of choice) + +These determine scope and approach. + +### Decision 1: Library choice for PDF → Markdown conversion + +| Option | Description | New deps? 
| Quality | +|--------|-------------|-----------|---------| +| A | `docling` (already a dep) | None | High; ML-based layout + table detection | +| B | `pymupdf4llm` | New dep | Medium; faster but weaker on tables/structure | +| C | `marker` (datalab-to / marker) | New dep + heavy | Highest; but huge model download, GPU-friendly | +| D | `pypdf` + custom heuristics | New dep | Low; we'd be reinventing layout parsing | + +**Recommendation**: **A** (docling). It is *already* a dependency (added by [LCORE-1035](#html-precedent-lcore-1035) for HTML), the `BaseReader` plumbing exists, and PoC quality is good. Confidence: 95%. + +### Decision 2: OCR for scanned PDFs + +| Option | Description | +|--------|-------------| +| A | Out of scope for v1 — text-extractable PDFs only | +| B | Add `--ocr` opt-in flag, default off | +| C | Enable OCR by default | + +**Recommendation**: **A**. Already discussed and confirmed during scope clarification. OCR adds tesseract / easyocr as runtime deps, multiplies conversion time, and the typical BYOK customer ships text PDFs. Track as a follow-up JIRA. Confidence: 90%. + +### Decision 3: Repository placement + +| Option | Description | +|--------|-------------| +| A | Implementation in `rag-content`, docs update in `lightspeed-stack` | +| B | All in `rag-content` (skip stack docs update) | +| C | All in `lightspeed-stack` (move BYOK pipeline) | + +**Recommendation**: **A**. The "existing tool for producing BYOK vector store" *is* `rag-content`. `lightspeed-stack/docs/byok_guide.md` line 106-118 currently tells customers "PDFs must be converted to markdown first" — that needs to change to reflect native support. Confidence: 95%. + +## Technical decisions for @maxrubyonrails + +Architecture- and implementation-level decisions. + +### Decision 4: Pipeline configuration knobs + +`docling`'s `PdfPipelineOptions` exposes ~15 toggles. We need to pick defaults and decide which (if any) become CLI flags. 
+ +| Knob | Recommended default | Expose as CLI flag? | Rationale | +|------|---------------------|---------------------|-----------| +| `do_ocr` | `False` | No | Decision 2: OCR out of scope | +| `do_table_structure` | `True` | No | Tables are common; cheap quality win | +| `table_structure_options.mode` | `accurate` | No | Accuracy over speed for offline indexing | +| `do_picture_classification` | `False` | No | Vector search doesn't use pictures | +| `do_picture_description` | `False` | No | Heavy (VLM call), no vector-search value | +| `generate_page_images` | `False` | No | Wasted I/O | + +**Recommendation**: ship the defaults above; **no CLI flags in v1**. This mirrors `HTMLReader`, which exposes nothing. Add flags later if customer feedback requires. Confidence: 80%. + +### Decision 5: Chunking strategy + +After docling exports to Markdown, how is the content chunked into vector-store nodes? + +| Option | Description | +|--------|-------------| +| A | Reuse the existing `MarkdownNodeParser`. Add `"pdf"` to the `doc_type` branches at `document_processor.py:75,87`. | +| B | Use docling's hybrid chunker (PDF-aware, page-boundary-aware). Different node parser path. | + +**Recommendation**: **A**. docling already exports clean Markdown for body content (PoC evidence: tables, lists, paragraphs all preserved). The `MarkdownNodeParser` we use for HTML and Markdown is well-tested and handles the output. **B** adds a parallel chunking pipeline and complicates `document_processor.py`. If retrieval quality on PDFs is poor in practice, **B** is a clean follow-up. Confidence: 85%. + +### Decision 6: Code organization + +| Option | Description | +|--------|-------------| +| A | New `src/lightspeed_rag_content/pdf/` package (mirrors `html/`). Standalone `__main__.py`, separate `pdf_reader.py`. | +| B | Add PDF as a third format inside `html/` (rename module to `docling/`). | +| C | Refactor `HTMLReader` and `PDFReader` into a shared `DoclingReader` base. 
| + +**Recommendation**: **A**. Mirrors the established pattern. **B** mixes concerns and renames a public module. **C** is a defensible cleanup but it's not in scope for LCORE-1471 — file it as a follow-up if both readers prove to share enough non-trivial logic to justify a base class. Confidence: 80%. + +### Decision 7: Test coverage scope + +| Option | Description | +|--------|-------------| +| A | Unit tests for the reader + CLI; integration test that builds a small Faiss index from a PDF and runs `query_rag.py`. | +| B | A only — no e2e through `lightspeed-stack`. | +| C | Full e2e — generate vector DB from PDF in rag-content, deploy via stack, run a real query against the stack endpoint. | + +**Recommendation**: **A** for the test JIRA, **C** as a separate e2e JIRA. The `tests/html/` precedent gives us the unit-test pattern; e2e is real work that needs the full local stack running. Confidence: 80%. + +## Proposed JIRAs + +Four sub-JIRAs under [LCORE-1471](https://issues.redhat.com/browse/LCORE-1471). Each `agentic tool instruction` points to the spec doc, not this spike doc. + + + +### LCORE-????: Implement PDF support in rag-content + +**Description**: Add a `PDFReader` to `rag-content` mirroring the existing `HTMLReader`. Use `docling` (already a dependency) configured for `InputFormat.PDF`. Wire it into `document_processor.py` so PDFs are recognized and parsed via `MarkdownNodeParser`. Update the `rag-content` README to list PDF as supported. + +**Scope**: + +- New package `src/lightspeed_rag_content/pdf/` with `__init__.py`, `__main__.py`, `pdf_reader.py`. +- `PDFReader(BaseReader)` exposing `load_data(file: Path) -> list[Document]`. +- `convert_pdf_file_to_markdown` and `convert_pdf_string_to_markdown` convenience helpers (mirror HTML). +- CLI subcommands `convert` and `batch` (mirror `html/__main__.py`). +- Update `document_processor.py` line 75 and line 87: `doc_type in ("markdown", "html", "pdf")`. 
+- Update `rag-content/README.md`: list PDF as a directly supported input format. +- Pass `uv run make format && uv run make verify` (or rag-content equivalent). +- No new entries in `pyproject.toml` — docling is already there. + +**Acceptance criteria**: + +- `python -m lightspeed_rag_content.pdf convert -i sample.pdf -o sample.md` succeeds. +- `python -m lightspeed_rag_content.pdf batch -i ./pdfs/ -o ./md/` converts a directory. +- Running `custom_processor.py` with `-f` pointing to a directory of PDFs produces a vector store the same way HTML and Markdown do. +- `rag-content/README.md` includes PDF in the supported-formats list. + +**Agentic tool instruction**: + +```text +Read the "Architecture" and "Implementation" sections in +docs/design/byok-pdf/byok-pdf.md (in the lightspeed-stack repo). +Key files in rag-content: + src/lightspeed_rag_content/html/ (precedent — mirror this) + src/lightspeed_rag_content/document_processor.py:75,87 + README.md +Mirror html/ into pdf/ with InputFormat.PDF and the pipeline options +listed in the spec doc's "Pipeline configuration" section. +``` + + + +### LCORE-????: Unit and integration tests for PDF support + +**Description**: Add unit tests for `PDFReader` and the CLI module mirroring the HTML test layout. Add an integration test that builds a small Faiss vector store from a real PDF and runs a query that returns expected content. + +**Scope**: + +- Create `tests/pdf/` directory with `__init__.py`. +- `tests/pdf/test_pdf_reader.py` mirroring `tests/html/test_html_reader.py` (load, error paths, missing file, conversion failure). +- `tests/pdf/test__main__.py` mirroring `tests/html/test__main__.py` (CLI argument parsing, convert and batch subcommands). +- Integration test: feed a small text PDF to `DocumentProcessor`, verify the resulting Faiss index contains a chunk whose text matches a known string from the source PDF. +- Commit a small (< 50 KB) text-extractable test PDF for fixtures. 
+ +**Acceptance criteria**: + +- `pytest tests/pdf/` passes. +- Test coverage for `pdf/` matches the existing `html/` coverage threshold. +- Integration test runs in under 60 seconds on CI (cold model load excluded — pre-cache models in CI image). + +**Agentic tool instruction**: + +```text +Read the "Testing" section in docs/design/byok-pdf/byok-pdf.md. +Key files in rag-content: + tests/html/test_html_reader.py (mirror this) + tests/html/test__main__.py (mirror this) + tests/conftest.py +Use docling's mock-friendly seam from the HTML tests. +``` + + + +### LCORE-????: End-to-end test — PDF-built vector store consumed by lightspeed-stack + +**Description**: Verify that a vector store generated from a PDF (via the new `pdf` module) is consumed correctly by `lightspeed-stack` end-to-end: the stack starts up, the BYOK source is registered, and a query that should retrieve content from the PDF actually returns it. + +**Scope**: + +- Reuse the local-stack-testing pattern from [docs/local-stack-testing.md](../../local-stack-testing.md). +- Add an e2e feature file under `tests/e2e/features/` (BDD style). +- Step definitions that (1) generate a vector store from a sample PDF, (2) start the stack pointed at it, (3) issue a query, (4) assert retrieved content matches PDF source. +- Add the new feature to `tests/e2e/test_list.txt`. + +**Acceptance criteria**: + +- The e2e feature passes locally with the full stack (Llama Stack + MCP Mock + lightspeed-stack). +- The feature is added to CI's e2e suite if/when CI supports the rag-content cross-repo dependency. + +**Agentic tool instruction**: + +```text +Read the "End-to-end validation" section in docs/design/byok-pdf/byok-pdf.md. 
+Key files: + docs/local-stack-testing.md + tests/e2e/features/ (existing BDD features for pattern) + tests/integration/endpoints/test_query_byok_integration.py (similar pattern) +Generate the vector store ahead of stack startup using the rag-content +custom_processor.py invocation documented in the spec doc. +``` + + + +### LCORE-????: Update lightspeed-stack BYOK guide for native PDF support + +**Description**: Update `docs/byok_guide.md` to reflect that PDF is a directly supported input format, removing the "convert PDFs to Markdown first" instruction. + +**Scope**: + +- Edit `docs/byok_guide.md`: + - Line ~106 (`Directly supported`): add PDF. + - Line ~107 (`Requires conversion`): remove PDF; clarify which formats still require conversion. + - Line ~114-118 (Step 1): remove the docling-as-pre-conversion example, replace with a note that PDFs can be passed directly to `custom_processor.py`. +- Sanity-check no other parts of `docs/` give stale conversion advice (search for `docling` and `convert.*PDF`). + +**Acceptance criteria**: + +- `docs/byok_guide.md` no longer says PDFs require pre-conversion. +- A pointer to the rag-content README's PDF section is included. + +**Agentic tool instruction**: + +```text +Read the "Documentation impact" section in docs/design/byok-pdf/byok-pdf.md. +Key files: + docs/byok_guide.md (lines 106-118) + examples/lightspeed-stack-byok-okp-rag.yaml (no change, but verify still + accurate after PDF support) +``` + +## PoC results + +A working PoC script (`poc/pdf_reader.py`) implements `PDFReader` in 60 lines and was run against two real-world PDFs. + +### What the PoC does + +The PoC mirrors the production `HTMLReader` but configures docling for PDF (`InputFormat.PDF`, `PdfPipelineOptions` with the recommended defaults). It does *not* integrate with `document_processor.py` or implement a CLI — it is the minimum code needed to validate that docling's PDF pipeline produces usable Markdown. 
+ +**Important**: The PoC diverges from the production design in these ways: + +- No CLI module (the production design has `__main__.py` with `convert` and `batch` subcommands). +- No `BaseReader` interface compliance (the PoC is a free function; production is a class). +- No `extra_info`/metadata passthrough (production handles `extra_info` like HTMLReader). +- No batch mode. +- No tests. + +### Results + +| PDF | Size | Wall-clock | Output | Quality | +|-----|------|-----------|--------|---------| +| sample_jira_1311.pdf | 217 KB | 332 s (incl. ~290 s model load) | 7,608 chars / 288 lines | High — clean headings, body, tables | +| sample_jira_836.pdf | 372 KB | ~70 s (warm) | 3,084 chars / 165 lines | Body clean; **headings degraded** (letter-spaced font) | + +Detailed findings, log excerpts, and converted Markdown are in [`poc-results/`](poc-results/): + +- [`01-poc-report.txt`](poc-results/01-poc-report.txt) — methodology, findings, implications +- [`02-conversion-log.txt`](poc-results/02-conversion-log.txt) — exact commands and timings +- [`03-sample-jira-1311.md`](poc-results/03-sample-jira-1311.md) — converted output (clean) +- [`04-sample-jira-836.md`](poc-results/04-sample-jira-836.md) — converted output (heading degradation visible) + +**Key takeaways the PoC proved**: + +1. No new dependencies are needed (docling already covers PDF). +2. Recommended pipeline defaults produce clean Markdown for body content. +3. Tables are preserved as proper Markdown tables. +4. `MarkdownNodeParser` will work for chunking — no parallel pipeline needed. + +**Honest limitation surfaced**: PDFs with letter-spaced display fonts (typical of Confluence "Export to PDF" output) produce noisy headings. This is a docling extraction limitation, not something `PdfPipelineOptions` controls. Document as a known caveat in the spec doc; no production fix in v1. 
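One possible shape for the follow-up fix (explicitly not in v1): a post-processor that collapses single-character runs, applied to heading lines only so body prose is never touched. A sketch; the regex and its edge cases are illustrative:

```python
import re

# A run of 3+ word characters separated by single spaces, e.g. "H e a d i n g".
_SPACED_RUN = re.compile(r"\b(?:\w ){2,}\w\b")


def clean_headings(markdown: str) -> str:
    """Collapse letter-spaced runs on Markdown heading lines only."""
    out = []
    for line in markdown.splitlines():
        if line.lstrip().startswith("#"):
            line = _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), line)
        out.append(line)
    return "\n".join(out)
```

Restricting the pass to heading lines keeps legitimate single-letter words in body text safe, but a heading that genuinely contains single-letter words (e.g. "Plan A B C") would wrongly collapse, which is one reason this stays a follow-up rather than a v1 default.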
+ +## Background sections + +### Current state of `rag-content` + +The BYOK content tool lives at https://github.com/lightspeed-core/rag-content (sibling repo). Relevant structure: + +``` +rag-content/ +├── pyproject.toml # already has docling>=2.68.0 +├── README.md # documents Markdown / text / HTML inputs +└── src/lightspeed_rag_content/ + ├── document_processor.py # main pipeline; doc_type branches at L75, L87 + ├── metadata_processor.py + ├── utils.py # CLI helpers (add_input_file_argument, etc.) + ├── asciidoc/ # AsciiDoc converter (Ruby-based, separate path) + └── html/ # HTML support — added by LCORE-1035 + ├── __init__.py + ├── __main__.py # `convert` and `batch` CLI + └── html_reader.py # 165 LoC, BaseReader, uses docling +``` + +The `document_processor.py` file already routes by `doc_type`: + +```python +# src/lightspeed_rag_content/document_processor.py:75 +if config.doc_type in ("markdown", "html"): + Settings.node_parser = MarkdownNodeParser() +``` + +This is the only line that needs to grow `"pdf"` for chunking to work. + +### HTML precedent (LCORE-1035) + +PR `7f688b0` ("Add HTML support for BYOK", 2026-01-15) introduced the docling integration. Stats: + +``` +pyproject.toml | 2 + +scripts/query_rag.py | 2 +- +src/lightspeed_rag_content/document_processor.py | 5 +- +src/lightspeed_rag_content/html/__init__.py | 19 + +src/lightspeed_rag_content/html/__main__.py | 153 +++++ +src/lightspeed_rag_content/html/html_reader.py | 163 +++++ +src/lightspeed_rag_content/utils.py | 3 +- +tests/html/__init__.py | 15 + +tests/html/test_html_reader.py | 147 +++++ +uv.lock | 799 ++++++++++++++++++++++- +``` + +PDF support follows the same shape minus the `uv.lock` blast (no new deps). Estimated PR size: ~400 LoC (vs LCORE-1035's 470 LoC excluding `uv.lock`). + +### Why docling and not an alternative + +The decision was effectively pre-made by LCORE-1035: docling is already vendored. 
But here is a brief comparison for completeness: + +| Library | Strengths | Weaknesses for our use case | +|---------|-----------|------------------------------| +| **docling** | Already dep; layout + table ML; multi-format (PDF/HTML/DOCX); active dev | CPU-only is slow; large model downloads (already paid for HTML) | +| pymupdf4llm | Fast; small dep | Weaker on tables; no integrated CLI we already use | +| marker | Best quality on academic PDFs | Huge model download; GPU-friendly; overkill | +| pypdf + heuristics | Tiny dep | We'd reinvent layout parsing badly | + +There is no compelling reason to add a second PDF library when docling is already paid for. + +### Why MarkdownNodeParser and not docling's hybrid chunker + +The PoC output for body content is clean Markdown — paragraph breaks where you'd expect, headings as `## `, tables as `|...|`. `MarkdownNodeParser` is what `MarkdownReader` uses internally and what we already use for HTML (which goes through the same docling → markdown → MarkdownNodeParser path). Using a different chunker for PDFs would create two parallel pipelines for what is essentially the same intermediate representation. + +Docling's hybrid chunker is page-aware and could improve retrieval if PDF pagination conveys semantic structure (it usually doesn't for the BYOK use case — customer content is typically continuous prose). If we see retrieval quality issues on real customer PDFs, switching to the hybrid chunker is a self-contained follow-up. + +### Documentation impact + +In `lightspeed-stack`, `docs/byok_guide.md` currently states (line 106-107): + +``` +- **Directly supported**: Markdown (.md) and plain text (.txt) files +- **Requires conversion**: PDFs, AsciiDoc, HTML, and other formats must be + converted to markdown or TXT +``` + +This claim is already stale (HTML is supported via LCORE-1035). LCORE-1471 should make it correct, listing PDF *and* HTML as directly supported and removing the docling-as-pre-conversion example in Step 1. 
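The stale-advice sweep can be a single recursive grep. A sketch, demonstrated on a scratch directory standing in for the real `docs/` tree (against the repo it would be run from the `lightspeed-stack` root as `grep -rnE 'docling|convert.*PDF' docs/`):

```shell
# Scratch stand-in for docs/ containing one stale instruction.
mkdir -p /tmp/byok-doc-sweep
printf 'PDFs must be converted to markdown first (use docling)\n' \
    > /tmp/byok-doc-sweep/byok_guide.md

# Flag files that still give pre-conversion advice.
grep -rnE 'docling|convert.*PDF' /tmp/byok-doc-sweep
```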
+ +### Out of scope (follow-ups) + +- **OCR for scanned PDFs** — docling supports OCR via tesseract or easyocr. Track as a separate feature. +- **DOCX, RTF, EPUB** — docling supports these; if customer demand emerges, add as separate readers. +- **Refactor HTMLReader + PDFReader into a shared base** — defensible but not required by LCORE-1471. +- **Hybrid (page-aware) chunking** — switch from MarkdownNodeParser if retrieval quality is poor on real PDFs. +- **Heading-cleanup post-processor** — collapse single-character runs in headings extracted from letter-spaced fonts. diff --git a/docs/design/byok-pdf/byok-pdf.md b/docs/design/byok-pdf/byok-pdf.md new file mode 100644 index 000000000..18966ca6a --- /dev/null +++ b/docs/design/byok-pdf/byok-pdf.md @@ -0,0 +1,245 @@ +# Feature design: BYOK PDF support + +| | | +|--------------------|-------------------------------------------| +| **Date** | 2026-04-27 | +| **Component** | rag-content (primary), lightspeed-stack (docs only) | +| **Authors** | Maxim Svistunov | +| **Feature** | [LCORE-1471](https://issues.redhat.com/browse/LCORE-1471) | +| **Spike** | [LCORE-1471](https://issues.redhat.com/browse/LCORE-1471) — see [byok-pdf-spike.md](byok-pdf-spike.md) | +| **Precedent** | [LCORE-1035](https://issues.redhat.com/browse/LCORE-1035) — HTML support (PR `7f688b0`) | + +## What + +Add native PDF input support to the BYOK content production tool (`rag-content`). After this feature, customers can drop `.pdf` files into the input directory of `custom_processor.py` alongside `.md`, `.txt`, and `.html` files, and get a functioning vector store without manual pre-conversion. + +## Why + +Today, customers with PDF content must convert it to Markdown themselves before feeding it to `rag-content`. The BYOK guide currently instructs them to use `docling` as a separate pre-processing step. This is friction — and ironic, because `rag-content` *already* depends on `docling` (it ships with HTML support). 
Wiring the existing dependency through to PDF removes the manual step at no cost in third-party deps. + +## Requirements + +- **R1**: `python -m lightspeed_rag_content.pdf convert -i input.pdf -o output.md` converts a single PDF to Markdown. +- **R2**: `python -m lightspeed_rag_content.pdf batch -i ./pdfs/ -o ./md/` converts a directory of PDFs. +- **R3**: `custom_processor.py` with `-f` pointing to a directory containing PDFs produces a vector store; PDFs are routed through `MarkdownNodeParser` after docling export. +- **R4**: No new entries in `pyproject.toml` (docling is already a dependency). +- **R5**: OCR is not invoked. Scanned/image-only PDFs are out of scope; their conversion may yield empty or near-empty Markdown without erroring. +- **R6**: The `lightspeed-stack/docs/byok_guide.md` no longer instructs users to pre-convert PDFs. + +## Use Cases + +- **U1**: As a BYOK customer with product documentation PDFs, I want to feed the PDFs directly to `rag-content` so that I don't have to maintain a separate conversion step in my pipeline. +- **U2**: As an LCS operator, I want a vector store generated from a PDF to behave indistinguishably from one generated from Markdown when queried via `lightspeed-stack`, so that input-format choice does not affect retrieval semantics. + +## Architecture + +### Overview + +```text + input/ + ├── doc.md ─┐ + ├── note.txt ─┼─→ SimpleDirectoryReader + ├── page.html ┤ │ + └── manual.pdf┘ │ + ▼ + file_extractor lookup by ext: + .html ─→ HTMLReader (existing) + .pdf ─→ PDFReader (new) ←── this feature + .md, .txt ─→ default text reader + │ + ▼ + Document(text=markdown) + │ + ▼ + MarkdownNodeParser (existing path, + extended to recognize doc_type="pdf") + │ + ▼ + embedding + vector store +``` + +PDF support reuses the entire downstream pipeline. The only new code is the reader/CLI, plus a one-token addition (`"pdf"`) to two `doc_type` checks in `document_processor.py`. 
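The dispatch in the diagram reduces to a suffix-to-reader mapping. A pure-Python stand-in for illustration (the real pipeline passes such a mapping to `SimpleDirectoryReader` as `file_extractor`; the string labels below stand in for the actual reader classes):

```python
from pathlib import Path

# Stand-in routing table mirroring the diagram; ".pdf" is the one new entry.
FILE_EXTRACTORS = {
    ".html": "HTMLReader",  # existing (LCORE-1035)
    ".pdf": "PDFReader",    # new in this feature
}


def route(path: str) -> str:
    """Return which reader handles a given input file."""
    return FILE_EXTRACTORS.get(Path(path).suffix.lower(), "default text reader")
```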
+ +### Reader + +`PDFReader(BaseReader)` mirrors `HTMLReader` line-for-line with two differences: + +1. The docling `DocumentConverter` is constructed with `allowed_formats=[InputFormat.PDF]` (not HTML). +2. The converter receives a `PdfFormatOption(pipeline_options=...)` argument with explicit pipeline knobs. + +```python +from docling.datamodel.base_models import InputFormat +from docling.datamodel.pipeline_options import PdfPipelineOptions +from docling.document_converter import DocumentConverter, PdfFormatOption + +class PDFReader(BaseReader): + def __init__(self) -> None: + opts = PdfPipelineOptions() + opts.do_ocr = False + opts.do_table_structure = True + opts.table_structure_options.mode = "accurate" + self.converter = DocumentConverter( + allowed_formats=[InputFormat.PDF], + format_options={ + InputFormat.PDF: PdfFormatOption(pipeline_options=opts), + }, + ) + + def load_data(self, file: Path, extra_info=None, **kwargs): + # identical body to HTMLReader.load_data — see html_reader.py + ... +``` + +### Pipeline configuration + +| Option | Value | Why | +|--------|-------|-----| +| `do_ocr` | `False` | Out of scope (R5) | +| `do_table_structure` | `True` | Tables are common in customer docs | +| `table_structure_options.mode` | `"accurate"` | Offline indexing tolerates the perf cost | +| `do_picture_classification` | `False` (default) | Vector search does not use images | +| `do_picture_description` | `False` (default) | Heavy VLM call, no value | +| `generate_page_images` | `False` (default) | Wasted I/O | + +These are baked into `PDFReader.__init__`. **No CLI flags expose them in v1.** If customer feedback calls for tuning, add flags later. + +### Chunking + +PDFs go through the same chunking path as HTML and Markdown. 
The change is in `document_processor.py`: + +```python +# document_processor.py:75 — _LlamaStackDB +if config.doc_type in ("markdown", "html"): # before +if config.doc_type in ("markdown", "html", "pdf"): # after + +# document_processor.py:87 — _BaseDB +if config.doc_type in ("markdown", "html"): # before +if config.doc_type in ("markdown", "html", "pdf"): # after +``` + +`MarkdownNodeParser` operates on the markdown that docling exports. It splits on heading boundaries, which works because docling produces well-formed `# ` / `## ` / `### ` headings for body content. Heading-text degradation in some PDFs (see "Known limitations") corrupts the heading *text* but does not break splitting. + +### CLI module (`pdf/__main__.py`) + +Mirror `html/__main__.py` exactly. Two subcommands: + +- `convert -i input.pdf [-o output.md]` — single file +- `batch -i ./pdf_dir [-o ./md_dir]` — directory walk over `**/*.pdf` + +Reuse `lightspeed_rag_content.utils.add_input_file_argument` and `run_cli_command`. + +### Configuration + +No YAML configuration changes. The reader's defaults are hard-coded; the CLI takes paths only; `custom_processor.py` already accepts the input directory as `-f`. + +### API changes + +No HTTP API changes. This feature is entirely offline (rag-content is a build-time tool). + +### Error handling + +Mirror HTMLReader exactly: + +- `FileNotFoundError` if the input path does not exist. +- `RuntimeError` (with the underlying exception chained via `from exc`) if docling conversion fails. +- The CLI catches both, logs, and exits non-zero. + +A scanned (image-only) PDF will produce empty or near-empty Markdown. This is not an error — the document parses fine; it just contains no extractable text. Customers will see a valid but empty-ish vector store entry. Document this in the rag-content README. + +### Migration / backwards compatibility + +None. This is purely additive: existing pipelines that don't pass PDFs are unaffected. No schema, no config, no API changes. 
+ +## Implementation Suggestions + +### Key files and insertion points + +| Repo | File | What to do | +|------|------|------------| +| rag-content | `src/lightspeed_rag_content/pdf/__init__.py` | New — minimal package init mirroring `html/__init__.py` | +| rag-content | `src/lightspeed_rag_content/pdf/__main__.py` | New — CLI mirroring `html/__main__.py` | +| rag-content | `src/lightspeed_rag_content/pdf/pdf_reader.py` | New — `PDFReader(BaseReader)` mirroring `html_reader.py` | +| rag-content | `src/lightspeed_rag_content/document_processor.py` | Edit lines 75 and 87 — add `"pdf"` to the `doc_type in (...)` tuples | +| rag-content | `tests/pdf/__init__.py` | New | +| rag-content | `tests/pdf/test_pdf_reader.py` | New — mirror `tests/html/test_html_reader.py` | +| rag-content | `tests/pdf/test__main__.py` | New — mirror `tests/html/test__main__.py` | +| rag-content | `tests/pdf/fixture.pdf` | New — small text-extractable test PDF (< 50 KB) | +| rag-content | `README.md` | Edit — add PDF to the supported input formats list | +| lightspeed-stack | `docs/byok_guide.md` | Edit lines ~106-118 — list PDF as directly supported, drop the docling pre-conversion example | + +### Insertion point detail + +`document_processor.py:75` (inside `_LlamaStackDB.__init__`): + +```python +if config.doc_type in ("markdown", "html"): + Settings.node_parser = MarkdownNodeParser() +``` + +`document_processor.py:87` (inside `_BaseDB.__init__`): + +```python +if config.doc_type in ("markdown", "html"): + Settings.node_parser = MarkdownNodeParser() +``` + +Both lines need `"pdf"` added to the tuple. No other branches in the file route by `doc_type`. + +### Config pattern + +N/A — this feature has no Python config classes (it's a CLI tool, not a service). + +### Test patterns + +Mirror the HTML test layout exactly: + +- `tests/pdf/test_pdf_reader.py`: + - Test that `PDFReader().load_data(valid_path)` returns a list with one `Document`. 
+ - Test that the returned `Document.text` is non-empty for a real fixture PDF. + - Test that `FileNotFoundError` is raised for a non-existent path. + - Test that `RuntimeError` is raised when docling raises (mock the converter). + - Test that `extra_info` is preserved in `Document.metadata`. +- `tests/pdf/test__main__.py`: + - Test argument parsing for `convert` and `batch`. + - Test that `convert` writes output to the inferred path when `-o` is omitted. + - Test that `batch` walks subdirectories and preserves structure. + - Test that errors exit non-zero. + +Use the existing `mocker` patterns from `tests/html/`. The PDF fixture should be small (< 50 KB) and committed to git — generate one from a known Markdown source so test assertions can match exact strings. + +## Open Questions for Future Work + +- **OCR support**: Scanned PDFs require docling's OCR engines (tesseract, easyocr, rapidocr). File a follow-up JIRA when there is customer demand. Implementation is mostly a flag toggle; the cost is in build/runtime size. +- **Hybrid (page-aware) chunking**: If retrieval quality on real customer PDFs is poor, switch from `MarkdownNodeParser` to docling's hybrid chunker. Requires a new branch in `_BaseDB.__init__`. +- **Heading-cleanup post-processor**: Confluence-export PDFs with letter-spaced headings yield `H e a d i n g` text. A small post-processor that collapses single-character runs in headings would mitigate. Not in scope for v1. +- **Shared `DoclingReader` base class**: If `PDFReader` and `HTMLReader` end up sharing significant logic beyond imports, refactor into a base. Not required by this feature. +- **DOCX, RTF, EPUB**: docling supports these. Add as separate readers when customer demand justifies. + +## Known limitations + +These are intentional v1 trade-offs documented for the rag-content README and the BYOK guide: + +- **Scanned PDFs**: produce empty or near-empty Markdown. Use a separate OCR step today; native OCR support tracked as a follow-up. 
+- **Letter-spaced display fonts**: typical of Confluence "Export to PDF" output. Headings may extract with spaces between letters (`H e a d i n g`). Body text is unaffected. The `## ` heading prefix is intact, so chunking still happens at heading boundaries; only the heading *text* is corrupted, which slightly degrades retrieval if a query mentions the heading literally. +- **Performance**: ~30-90 seconds per small/medium PDF on CPU after model warm-up. Acceptable for offline indexing; not suitable for interactive use. + +## Changelog + +| Date | Change | Reason | +|------------|-----------------|---------------------------------| +| 2026-04-27 | Initial version | Spike deliverable for LCORE-1471 | + +## Appendix A: PoC evidence + +See [`poc-results/`](poc-results/) in the spike PR for the full PoC report, conversion logs, and converted Markdown samples. The PoC validated the recommendations above on two real PDFs and surfaced the heading-degradation limitation documented in "Known limitations". + +## Appendix B: HTML precedent + +The HTML implementation under [LCORE-1035](https://issues.redhat.com/browse/LCORE-1035) (commit `7f688b0`, 2026-01-15) established every pattern used here: + +- `BaseReader` with docling-backed conversion +- `__main__.py` CLI structure with `convert` and `batch` subcommands +- `tests/html/` layout +- `document_processor.py` `doc_type` branching + +Refer to the HTML files (`src/lightspeed_rag_content/html/`, `tests/html/`) when implementing this feature. Differences are limited to the `InputFormat` enum value and the addition of `PdfFormatOption(pipeline_options=...)` in the `DocumentConverter` constructor. 
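A footnote to the letter-spacing entry in "Known limitations": the claim that chunking survives corrupted heading *text* (because the `## ` prefix stays intact) can be checked with a simplified stand-in for heading-boundary splitting (pure Python; not the real `MarkdownNodeParser`):

```python
def split_at_headings(markdown: str) -> list[str]:
    """Naive stand-in: open a new chunk at every line that begins with '#'."""
    chunks: list[list[str]] = []
    for line in markdown.splitlines():
        if line.startswith("#") or not chunks:
            chunks.append([])
        chunks[-1].append(line)
    return ["\n".join(chunk) for chunk in chunks]
```

Even with degraded heading text the boundaries are still found, so what is lost is only literal query matches against the heading words themselves.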
diff --git a/docs/design/byok-pdf/poc-results/01-poc-report.txt b/docs/design/byok-pdf/poc-results/01-poc-report.txt
new file mode 100644
index 000000000..5ae147bf2
--- /dev/null
+++ b/docs/design/byok-pdf/poc-results/01-poc-report.txt
@@ -0,0 +1,153 @@
+PoC report — LCORE-1471 BYOK PDF support
+=========================================
+
+Glossary
+--------
+
+docling
+    The IBM-developed document conversion library, already a dependency of
+    rag-content (added under LCORE-1035 for HTML support). It uses ML models
+    (layout detection, table structure recognition) to extract structured
+    content from PDFs and other formats.
+
+rag-content
+    Sibling repo (https://github.com/lightspeed-core/rag-content) that
+    produces the BYOK vector stores consumed by lightspeed-stack.
+
+PdfPipelineOptions
+    docling configuration object controlling OCR, table extraction, image
+    handling, etc. for PDF input.
+
+PoC design
+----------
+
+Goal: validate that docling's PDF → Markdown conversion produces output
+of high enough quality to feed into the existing markdown-based RAG
+pipeline (`MarkdownNodeParser`) without per-PDF tuning.
+
+Method: a 60-line PoC script (poc/pdf_reader.py) that mirrors the
+production HTMLReader from src/lightspeed_rag_content/html/html_reader.py
+but configures docling for PDF.
Defaults chosen:
+
+    do_ocr = False                            # LCORE-1471 scope: text PDFs
+    do_table_structure = True                 # quality win, low cost
+    table_structure_options.mode = accurate   # accuracy over speed
+    do_picture_classification = False         # vector search doesn't use
+    do_picture_description = False            # vector search doesn't use
+    generate_page_images = False              # wasted I/O
+
+Corpus: two locally-available Lightspeed JIRA PDFs (Red Hat domain
+documents; the original plan to use public Red Hat PDFs was dropped
+because access.redhat.com returned login redirects for direct PDF
+downloads):
+
+    sample_jira_1311.pdf   217 KB, ~5 pages, JIRA-export from Atlassian Cloud
+    sample_jira_836.pdf    372 KB, ~6 pages, Confluence-style export
+
+Run command (from rag-content venv where docling is installed):
+
+    cd ~/repos/lightspeed/rag-content
+    uv run python ~/repos/lightspeed/stack/docs/design/byok-pdf/poc/pdf_reader.py \
+        ~/repos/lightspeed/stack/docs/design/byok-pdf/poc/sample_jira_1311.pdf \
+        ~/repos/lightspeed/stack/docs/design/byok-pdf/poc-results/03-sample-jira-1311.md
+
+Results
+-------
+
+Conversion succeeded for both PDFs. Output is well-formed Markdown that
+parses cleanly (verified by visual inspection — see 03/04 files).
+
+Timings (CPU, no GPU):
+
+    Run                            Wall-clock   Notes
+    -----------------------------  ----------   ---------------------------
+    First conversion (1311)        332 s        ~290 s spent loading docling
+                                                ML models on first invocation
+    Second conversion (836)        ~70 s        models warm; 372 KB input
+
+    Steady-state per PDF: ~30-90 s for small/medium PDFs on CPU
+
+Output sizes:
+
+    PDF                             Lines   Chars   Markdown structure
+    ------------------------------  -----   ------  -----------------------
+    sample_jira_1311.pdf (217 KB)   288     7,608   Clean headings, tables
+    sample_jira_836.pdf (372 KB)    165     3,084   Body clean, headings noisy
+
+Findings
+--------
+
+1. Body paragraph quality is HIGH.
+   Both PDFs produced clean, well-formed paragraph text. No hyphenation
+   artifacts, no broken sentences, no character-level corruption in body
+   prose.
This is the bulk of the content fed to the embedding model, so
+   this is the most important result.
+
+2. Table extraction WORKS WELL.
+   sample_jira_1311 contains two tables (Component/Behavior and
+   Area/Impact). Both came out as proper GitHub-flavored markdown tables
+   with correct headers and cell boundaries. Confirms that
+   table_structure_options.mode='accurate' earns its keep.
+
+3. Heading extraction degrades on letter-spaced fonts.
+   sample_jira_836 (Confluence PDF export) uses a font with wide letter
+   spacing for headings. docling extracts each character with a space
+   between, producing headings like "Streaml i ne l i ghtspeed-stack
+   conf i g" and "Acceptance cr i ter i a".
+
+   Impact on RAG: MarkdownNodeParser still recognizes these as headings
+   (the `## ` prefix is intact) but the heading TEXT is corrupted. This
+   degrades retrieval if a query mentions the heading text. Body content
+   under the heading is unaffected.
+
+   This is a docling-side limitation, not something we control with
+   PdfPipelineOptions. A workaround would be a post-processor that
+   collapses single-character runs in headings, but that's out of scope
+   for v1.
+
+4. Image-placeholder noise.
+   docling emits `<!-- image -->` HTML comments wherever it detects an
+   image (icons, avatars, logos). sample_jira_836 has 12 such comments
+   from Confluence chrome. These survive into the final markdown.
+
+   Impact on RAG: comments are stripped by MarkdownNodeParser before
+   embedding, so this is cosmetic noise in the intermediate file only.
+
+5. Implementation effort is small.
+   The PoC is a 60-line script. The production code path is essentially
+   the same shape as html_reader.py (165 lines including docstrings) plus
+   a CLI module (~150 lines) plus tests. Total estimate for the
+   implementation ticket: ~400 LoC, mirroring the LCORE-1035 HTML PR
+   which was 1,269 lines including the uv.lock bump.
+
+Implications for the spike
+--------------------------
+
+1. NO new dependencies needed.
+ docling is already in pyproject.toml (added by LCORE-1035). The PDF + path uses a different InputFormat enum value but the same library and + the same DocumentConverter API. uv.lock should not change. + +2. Default pipeline knobs are sufficient. + Recommendation: ship with the defaults documented above and DO NOT + expose CLI flags in v1. Add flags later if customer feedback requires. + +3. Chunking strategy: reuse MarkdownNodeParser (no new node parser). + Since docling exports clean markdown for body text, the existing + document_processor.py branch `if config.doc_type in ("markdown", + "html"): Settings.node_parser = MarkdownNodeParser()` just needs + `"pdf"` added. + +4. Heading degradation is a known limitation, not a blocker. + Document it in the spec doc as a known caveat. Customers with + Confluence PDFs should prefer exporting from the source format + (markdown/HTML/AsciiDoc) where available. + +5. OCR confirmation: out of scope is the right call. + The PoC PDFs are text-extractable so we never exercised OCR. Adding + `do_ocr=True` would require tesseract or easyocr as runtime deps and + would multiply conversion time. Defer to a follow-up JIRA. + +6. Performance is acceptable but not snappy. + ~30-90 s per PDF on CPU after warm-up is fine for offline indexing + (the BYOK use case) but would be unusable for interactive workflows. + This sets expectations for the documentation. 
diff --git a/docs/design/byok-pdf/poc-results/02-conversion-log.txt b/docs/design/byok-pdf/poc-results/02-conversion-log.txt new file mode 100644 index 000000000..c8bd3a5d3 --- /dev/null +++ b/docs/design/byok-pdf/poc-results/02-conversion-log.txt @@ -0,0 +1,67 @@ +Conversion log — LCORE-1471 PoC runs +===================================== + +Environment +----------- + + Host: Linux 6.8.9-100.fc38.x86_64 (no GPU) + Python: 3.12 via uv + docling: >=2.68.0 (from rag-content/pyproject.toml) + Date: 2026-04-27 + +Run 1: sample_jira_1311.pdf (217 KB) +------------------------------------- + +Command: + + cd ~/repos/lightspeed/rag-content + uv run python ~/repos/lightspeed/stack/docs/design/byok-pdf/poc/pdf_reader.py \ + ~/repos/lightspeed/stack/docs/design/byok-pdf/poc/sample_jira_1311.pdf \ + ~/repos/lightspeed/stack/docs/design/byok-pdf/poc-results/03-sample-jira-1311.md + +Selected log lines (timestamps preserved): + + 2026-04-27 15:06:18 INFO detected formats: [] + 2026-04-27 15:06:18 INFO Going to convert document batch... + 2026-04-27 15:06:18 INFO Initializing pipeline for StandardPdfPipeline ... + 2026-04-27 15:06:18 INFO Loading plugin 'docling_defaults' + 2026-04-27 15:06:18 INFO Registered picture descriptions: ['vlm', 'api'] + 2026-04-27 15:06:18 INFO Registered ocr engines: ['auto', 'easyocr', ...] + 2026-04-27 15:06:18 INFO Registered layout engines: ['docling_layout_default', ...] + 2026-04-27 15:06:18 INFO Accelerator device: 'cpu' + 2026-04-27 15:07:44 INFO Loading plugin 'docling_defaults' + 2026-04-27 15:07:44 INFO Registered table structure engines: ['docling_tableformer'] + 2026-04-27 15:11:06 INFO Accelerator device: 'cpu' ← models loaded + 2026-04-27 15:11:09 INFO Processing document sample_jira_1311.pdf + 2026-04-27 15:11:50 INFO Finished converting document sample_jira_1311.pdf + in 332.01 sec. 
+ converted .../sample_jira_1311.pdf -> .../03-sample-jira-1311.md
+    elapsed: 332.10s
+    output: 7608 chars, 288 lines
+
+Breakdown:
+    Model load: ~290 s (15:06:18 → 15:11:08)
+    Conversion: ~ 41 s (15:11:09 → 15:11:50)
+
+Run 2: sample_jira_836.pdf (372 KB)
+------------------------------------
+
+Same command pattern, second invocation. This was a NEW process, so the
+per-process model-load cost was paid again, but the HuggingFace model
+files were already on disk from run 1, so the network fetch was skipped.
+Warm runs within the same process cut the load phase further, to
+roughly 30 s.
+
+Total wall-clock: ~70 s (15:13:13 finish, started ~15:12:00).
+Output: 3,084 chars, 165 lines.
+
+Reproducibility note
+--------------------
+
+Both PDFs are committed in poc/. The PoC script is committed as
+poc/pdf_reader.py. To reproduce, install rag-content's deps (which pull
+in docling) and run the commands above.
+
+After the PoC PR merges and the implementation lands in rag-content, the
+PoC script and these PDFs will be deleted per the howto-run-a-spike.md
+step 10 cleanup checklist.
diff --git a/docs/design/byok-pdf/poc-results/03-sample-jira-1311.md b/docs/design/byok-pdf/poc-results/03-sample-jira-1311.md new file mode 100644 index 000000000..7ef51b640 --- /dev/null +++ b/docs/design/byok-pdf/poc-results/03-sample-jira-1311.md @@ -0,0 +1,288 @@ +## Red Hat Jira and Confluence Data Centre will be unavailable from 5 PM EDT on March 13 to 12 AM EDT on March 16 due to the migration to Atlassian Cloud + + + +## Lightspeed Core LCORE-1311 + +## Conversation History Summarization (Compaction) + +## Details + +Type: + +Feature + +Resolution: + +Unresolved + +Priority: + +Undefined + +Fix Version/s: + +Q2CY26 + +Affects Version/s: + +None + +Component/s: + +None + +Labels: + +ols-lcore-migration + +Blocked: + +False + +Blocked Reason: + +None  + +Ready: + +False + +SFDC Cases Links: + +SFDC Cases Open: + +0 + +SFDC Cases Counter: + +0 + +Intelligence + +Requested: + +False + +Market: + +Unclassified + +## Description + + + + + +## LCORE Feature Request: Conversation History Summarization + +## Summary + +Implement intelligent conversation history summarization to maintain context when conversations approach the model's context window limit, preventing loss of important context while enabling long-running conversations. + +## Background + +Related ticket: OLS-2500 - [Review] : Summarize conversation when context limit is reached + +## User Story + +As an LCS user, I want the conversation history to be intelligently summarized when it exceeds the context window limit, so that the assistant maintains an accurate understanding of previous interactions without losing important context. 
+ +## Problem Statement + +Currently, neither lightspeed-stack nor Llama Stack implements context management for long conversations: + +- When context window is exceeded, requests fail with HTTP 413 (Prompt Too Long) +- No automatic truncation or summarization exists +- The truncated field in responses is hardcoded to False with TODO comments + +## Technical Analysis + +Current State + +| Component | Behavior | +|----------------------|-----------------------------------------------------------------------------------------------------| +| lightspeed-stack | Returns PromptTooLongResponse (413) on overflow. truncated field not implemented. | +| Llama Stack | Loads ALL conversation history via Conversations API + ResponsesStore. No truncation/summarization. | +| OpenAI Responses API | Has truncation parameter (auto/disabled) AND compact endpoint for summarization | + +OpenAI API Spec (Reference Implementation) + +OpenAI's Responses API includes two relevant features: + +1. truncation parameter on create response: +2. auto : Drops items from beginning of conversation to fit context window +3. disabled (default): Fails with 400 error if context exceeded +2. compact endpoint ( POST /v1/responses/compact ): +5. Runs compaction/summarization pass over conversation +6. Returns compacted response object with encrypted/opaque items + +## Neither feature is currently implemented in Llama Stack. + +Llama Stack Storage Architecture + +Llama Stack uses two storage mechanisms for conversation history: + +1. Conversations API ( conversations\_api.list\_items() ) +2. Stores ConversationItem objects (high-level: messages, tool calls) +3. Used for listing/managing conversations +2. ResponsesStore ( responses\_store.get\_conversation\_messages() ) +5. Stores raw OpenAIMessageParam list (what actually goes to LLM) +6. Stored in conversation\_messages table +7. Source of truth for building LLM context + +Both mechanisms store full history without any summarization or truncation. 
+ +## Impact Analysis + +| Area | Impact | +|--------------------|---------------------------------------------------------------------| +| Transcripts | No impact - captures individual Q/A pairs, not conversation history | +| Conversation Cache | TBD - needs schema changes to store summary metadata | +| Llama Stack | Depends on implementation location | + +Transcripts Detail + +Transcripts store individual Q/A pairs per turn. The truncated field already exists to flag when truncation occurred. If summarization happens, transcripts would still capture the original query/response for that turn - this is the intended behavior. + +Conversation Cache Detail + +If implemented in lightspeed-stack, the conversation cache would need schema changes. + +## Design Considerations + +Key Principle: Decouple Storage from LLM Context + +## Storage and LLM context should be decoupled: + +- Full history preserved (for UI/audit/replay) +- Summary used only when building LLM context +- User sees all messages; AI gets summarized context for older turns + +This follows the pattern used by tools like Cursor/Claude - you can scroll up and see your full conversation, but the AI may only have summarized context of older turns. + +Summarization Trigger + +Summarization should be token-based, not turn-based (turn sizes vary significantly). + +Incremental Summarization + +Summary should be computed once, stored, and reused - not recomputed on every request. + +## Open Questions + +1. Should this be implemented in Llama Stack (where conversation history is managed) or lightspeed-stack? +2. If Llama Stack, what's the timeline/feasibility for upstream contribution? +3. How should summary storage interact with existing conversation cache schema? +4. What summarization prompt/strategy provides best context preservation? +5. How to handle model-specific context window sizes (configurable vs auto-detected)? + +## Acceptance Criteria + +1. 
When conversation history approaches context window limit, older messages are summarized (not just truncated) +2. Summary preserves clear attribution of actions (user vs assistant) +3. Summarization threshold is configurable or auto-determined based on model context window +4. Summarization is incremental (summary updated, not recomputed from scratch) +5. Full conversation history remains accessible (UI/audit) - only LLM context uses summary +6. Assistant correctly recalls and references prior context after summarization + +## References + +- OLS-2500: Original OLS ticket +- OpenAI Responses API +- OpenAI Compact endpoint +- OpenAI Conversation State Guide + +## Attachments + + + + + + + + + + + +Live updates + +No child issues. + +Assignee: + +Assign to me + +Reporter: + +Votes: + +- Vote for this issue 0 + +Watchers: + +- Start watching this issue 3 + +Drop files to attach, or browse. + +## is blocked by + +## Issue Links + +LCORE-1314 [Spike] Finalize scope of the feature request + +IN PROGRESS + +- Easy Agile Planning Poker + +Quick tour + +Vote + +## Child issues + +## Activity + +## Newest first + +- You can now pin up to five comments to highlight important information. Pinned comments will appear above all other comments, so they're easy to find.  + +Got it + +Learn more about pinned comments + +- added a comment - 2026/02/11 4:29 PM  Anxhela Coba + +Is the goal of the conversation history summarization task more about preventing missing information? In fine-tuning when we compress models from larger to more domain-specific they can be prone to catastrophic forgetting. I wonder if we can explore for this.. techniques like SVD ( singular vector decomposition) for this task. + +ex. 
if we can take the embedding of the history we can apply SVD or some similar technique to help "summarize" + +Pin + +- added a comment - 2026/02/09 4:07 PM + +Note: if it is implemented on the provider/openai API level - then this feature is locked for providers conforming to these particular endpoints. + +-  Ondrej Metelka + +Pin + +## People + +Unassigned + +Ondrej Metelka  + + + +## Dates + +Created: + +2026/02/09 3:45 PM + +Updated: + +2026/03/10 3:13 PM + +## Agile \ No newline at end of file diff --git a/docs/design/byok-pdf/poc-results/04-sample-jira-836.md b/docs/design/byok-pdf/poc-results/04-sample-jira-836.md new file mode 100644 index 000000000..2bd6489e1 --- /dev/null +++ b/docs/design/byok-pdf/poc-results/04-sample-jira-836.md @@ -0,0 +1,166 @@ + + +Spaces + +/ + + + +/ Lightspeed Core + +/ + +Add parent + +LCORE-836 + + + + + + + + + +## Streaml i ne l i ghtspeed-stack conf i g + + + + + +## Key deta i ls + +F i eld Tab Pr i or i t i zat i on Release Test i ng SFDC Portfol i o Solut i ons Bus i ness + +Descr i pt i on + +## Feature Overv i ew + +Currently, the Lightspeed stack requires operators to manage two separate configuration files: + +- run.yaml : For the underlying Llama Stack (e.g., inference endpoints, tools, vector stores). +- lightspeed-stack.yaml : For Lightspeed Core settings (e.g., authentication, data collection, server settings). + +This dual-file system increases complexity and the potential for misconfiguration. This feature proposes merging all required settings into the primary lightspeed-stack.yaml. Lightspeed Core will be responsible for managing the necessary Llama Stack configuration from this single, unified source of truth. + +The goal is to simplify the deployment and management experience for all downstream Lightspeed teams by providing a single, coherent configuration file. + +Welcome to Atlassian Cloud! To report issues, raise a support ticket here. 
+ +Create + + + + + +To Do + + + +Improve Feature + + + + + + + + + + + + + + + + + +We should evaluate the option of having the possibility of overriding the llama-stack configuration even in this single-config file model, to support some edge cases where Lightspeed teams need to configure the underlying llama-stack directly. + +## Acceptance cr i ter i a + +- All necessary configurations required for Lightspeed assistants to operate previously set in run.yaml can now be defined within the lightspeed-stack.yaml file. +- Lightspeed Core correctly parses the unified configuration and applies the appropriate settings to the underlying Llama Stack services at runtime. +- The need for downstream teams to manually create or modify a separate run.yaml is completely eliminated. +- The stack deploys and operates correctly using only the single lightspeed-stack.yaml file. +- All relevant documentation is updated to reflect the new, single-file configuration process. + +## Env i ronment + +Add text + +Blocked Reason None + +Release Note Text + +None + +G i t Pull Request + +Add text + +Contr i but i ng Groups + +Add groups + +Blocked + +False + +Ready + +False + +Need Info From + +Add people + +Release Note Type + +Add opt i on + +S i ze + +Add opt i on + +Release Note Status Add opt i on + +Sync Fa i lure Flag Add labels + +Or i g i nal story po i nts + +Add number + +Bugz i lla Bug + +Add URL + +Start date + + + + + +## [LCORE-836] Streamline lightspeed-stack config - Red Hat Issue Tracker + + + +Parent + +Add parent + +- Automation + +Rule executions + + + +- Atlassian project + +Link to sha + + + +Created October 15, 2025 at 1:34 PM Updated 3 days ago + + \ No newline at end of file diff --git a/docs/design/byok-pdf/poc/pdf_reader.py b/docs/design/byok-pdf/poc/pdf_reader.py new file mode 100644 index 000000000..061735311 --- /dev/null +++ b/docs/design/byok-pdf/poc/pdf_reader.py @@ -0,0 +1,73 @@ +"""PoC PDF Reader for the LCORE-1471 spike. 
+
+Mirrors the production HTMLReader in `lightspeed_rag_content/html/html_reader.py`
+but configures docling for PDF input. This is throwaway PoC code -- it will be
+adapted into the real `lightspeed_rag_content/pdf/` package by the implementation
+ticket; do not import it from anywhere else.
+
+Run:
+    uv run python pdf_reader.py <input.pdf> [<output.md>]
+"""
+
+import logging
+import sys
+import time
+from pathlib import Path
+
+from docling.datamodel.base_models import InputFormat
+from docling.datamodel.pipeline_options import PdfPipelineOptions
+from docling.document_converter import DocumentConverter, PdfFormatOption
+
+LOG = logging.getLogger(__name__)
+
+
+def make_converter() -> DocumentConverter:
+    """Construct a docling DocumentConverter configured for PDF.
+
+    Defaults chosen for the BYOK use case:
+      do_ocr=False -- text-extractable PDFs only (LCORE-1471 scope)
+      do_table_structure=True -- tables are common in customer docs; cheap quality win
+      table_structure_options.mode='accurate' -- accuracy over speed
+      do_picture_*=False -- vector search does not need images
+    """
+    pipeline_options = PdfPipelineOptions()
+    pipeline_options.do_ocr = False
+    pipeline_options.do_table_structure = True
+    pipeline_options.table_structure_options.mode = "accurate"
+
+    return DocumentConverter(
+        allowed_formats=[InputFormat.PDF],
+        format_options={
+            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
+        },
+    )
+
+
+def convert(pdf_path: Path) -> tuple[str, float]:
+    """Convert a PDF file to Markdown.
Returns (markdown, seconds)."""
+    converter = make_converter()
+    t0 = time.monotonic()
+    result = converter.convert(str(pdf_path))
+    markdown = result.document.export_to_markdown()
+    elapsed = time.monotonic() - t0
+    return markdown, elapsed
+
+
+if __name__ == "__main__":
+    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
+
+    if len(sys.argv) < 2:
+        print(f"usage: {sys.argv[0]} <input.pdf> [<output.md>]", file=sys.stderr)
+        sys.exit(2)
+
+    in_path = Path(sys.argv[1])
+    out_path = Path(sys.argv[2]) if len(sys.argv) > 2 else in_path.with_suffix(".md")
+
+    markdown, elapsed = convert(in_path)
+    out_path.write_text(markdown, encoding="utf-8")
+
+    chars = len(markdown)
+    lines = markdown.count("\n") + 1
+    print(f"converted {in_path} -> {out_path}")
+    print(f"  elapsed: {elapsed:.2f}s")
+    print(f"  output: {chars} chars, {lines} lines")
diff --git a/docs/design/byok-pdf/poc/sample_jira_1311.pdf b/docs/design/byok-pdf/poc/sample_jira_1311.pdf
new file mode 100644
index 000000000..59736abe6
Binary files /dev/null and b/docs/design/byok-pdf/poc/sample_jira_1311.pdf differ
diff --git a/docs/design/byok-pdf/poc/sample_jira_836.pdf b/docs/design/byok-pdf/poc/sample_jira_836.pdf
new file mode 100644
index 000000000..b616eb935
Binary files /dev/null and b/docs/design/byok-pdf/poc/sample_jira_836.pdf differ