From 21c16aa908fbccb5c6dc68f3656bd79ba6221da2 Mon Sep 17 00:00:00 2001 From: Joshua Wilson Date: Mon, 20 Apr 2026 02:25:08 +0000 Subject: [PATCH] OLS-2882 add specification files for lightspeed-rag-content Two-layer spec structure under .ai/spec/: - what/ (5 files): system-overview, content-sources, embedding-pipeline, byok, container-build - how/ (5 files): project-structure, plaintext-pipeline, html-pipeline, lsc-library, container-build Co-Authored-By: Claude Opus 4.6 (1M context) --- .ai/spec/README.md | 41 ++++ .ai/spec/how/README.md | 25 +++ .ai/spec/how/container-build.md | 216 +++++++++++++++++++++ .ai/spec/how/html-pipeline.md | 163 ++++++++++++++++ .ai/spec/how/lsc-library.md | 279 ++++++++++++++++++++++++++++ .ai/spec/how/plaintext-pipeline.md | 120 ++++++++++++ .ai/spec/how/project-structure.md | 116 ++++++++++++ .ai/spec/what/README.md | 24 +++ .ai/spec/what/byok.md | 64 +++++++ .ai/spec/what/container-build.md | 89 +++++++++ .ai/spec/what/content-sources.md | 78 ++++++++ .ai/spec/what/embedding-pipeline.md | 89 +++++++++ .ai/spec/what/system-overview.md | 61 ++++++ 13 files changed, 1365 insertions(+) create mode 100644 .ai/spec/README.md create mode 100644 .ai/spec/how/README.md create mode 100644 .ai/spec/how/container-build.md create mode 100644 .ai/spec/how/html-pipeline.md create mode 100644 .ai/spec/how/lsc-library.md create mode 100644 .ai/spec/how/plaintext-pipeline.md create mode 100644 .ai/spec/how/project-structure.md create mode 100644 .ai/spec/what/README.md create mode 100644 .ai/spec/what/byok.md create mode 100644 .ai/spec/what/container-build.md create mode 100644 .ai/spec/what/content-sources.md create mode 100644 .ai/spec/what/embedding-pipeline.md create mode 100644 .ai/spec/what/system-overview.md diff --git a/.ai/spec/README.md b/.ai/spec/README.md new file mode 100644 index 000000000..88504048e --- /dev/null +++ b/.ai/spec/README.md @@ -0,0 +1,41 @@ +# OpenShift LightSpeed RAG Content -- Specifications + +These specs define the requirements, behaviors, and architecture for the lightspeed-rag-content project. They are organized into two layers: + +- **[`what/`](what/README.md)** -- Behavioral rules: WHAT the system must do and WHY. Technology-neutral, testable assertions. Use these to understand requirements, fix bugs, or rebuild components. +- **[`how/`](how/README.md)** -- Architecture specs: HOW the current implementation is structured. Module boundaries, data flow, design patterns. Use these to navigate, modify, and extend the codebase. + +## Scope + +These specs cover the **lightspeed-rag-content** project only -- the offline pipeline that produces pre-built vector indexes and packages them as container images. The lightspeed-service (which consumes these artifacts at runtime), the operator, and the console plugin are separate projects. + +## Audience + +AI agents (Claude). Specs optimize for precision, unambiguous rules, and machine-parseable structure. + +## Quick Start + +| I want to... 
| Read | +|--------------|------| +| Understand what this project does | `what/system-overview.md` | +| Understand content sources and acquisition | `what/content-sources.md` | +| Understand the embedding pipeline rules | `what/embedding-pipeline.md` | +| Understand BYOK (customer content) | `what/byok.md` | +| Understand the container build process | `what/container-build.md` | +| Navigate the codebase | `how/project-structure.md` | +| Modify the plaintext pipeline | `how/plaintext-pipeline.md` | +| Modify the HTML pipeline | `how/html-pipeline.md` | +| Modify the lsc library | `how/lsc-library.md` | +| Modify the container build or CI | `how/container-build.md` | +| See what's planned | Look for `[PLANNED: OLS-XXXX]` in `what/` specs | + +## Conventions + +- `[PLANNED: OLS-XXXX]` markers in `what/` specs indicate existing rules about to change due to open Jira work. +- "Planned Changes" sections list new capabilities not yet in code. +- User-configurable values are referenced by CLI argument name or environment variable name. +- Internal constants are stated as behavioral rules without numeric values; `how/` specs may include specific values. + +## Relationship to lightspeed-service + +This project produces artifacts consumed by lightspeed-service. The service's `what/rag.md` spec describes how it loads and queries these indexes at runtime. This project's specs describe how the indexes are built. The integration contract is documented in `what/system-overview.md`. diff --git a/.ai/spec/how/README.md b/.ai/spec/how/README.md new file mode 100644 index 000000000..6e5ee1052 --- /dev/null +++ b/.ai/spec/how/README.md @@ -0,0 +1,25 @@ +# Architecture Specifications (how/) + +These specs describe HOW the RAG content pipeline is structured -- module boundaries, data flow, design patterns, key abstractions, and implementation decisions. They are grounded in the current Python codebase and should be updated when the code changes. + +## Spec Index + +| Spec | Description | +|------|-------------| +| [project-structure.md](project-structure.md) | Directory layout, module map, dependency management, key relationships | +| [plaintext-pipeline.md](plaintext-pipeline.md) | `scripts/generate_embeddings.py` -- the production pipeline used by the Containerfile | +| [html-pipeline.md](html-pipeline.md) | `scripts/html_embeddings/` + `scripts/html_chunking/` -- HTML-based pipeline with semantic chunking | +| [lsc-library.md](lsc-library.md) | `lsc/src/lightspeed_rag_content/` -- installable library with multi-backend support | +| [container-build.md](container-build.md) | Containerfiles, Makefile targets, Konflux/Tekton pipelines, dependency management | + +## When to Read These + +- **Navigating the codebase**: Start with `project-structure.md` to understand where things live. +- **Modifying a pipeline**: Read the relevant pipeline spec to understand the current architecture before making changes. +- **Adding a new vector store backend**: Read `lsc-library.md` for the `_BaseDB` extension pattern. +- **Debugging**: The data flow sections trace the exact path documents take through the pipeline. +- **Changing the build**: Read `container-build.md` for Containerfile stages and Konflux pipeline structure. + +## Relationship to what/ Specs + +The [`what/` specs](../what/README.md) define behavioral contracts (technology-neutral). These `how/` specs describe the implementation that fulfills those contracts. 
When the two diverge, the `what/` spec is the source of truth for correct behavior, and the `how/` spec should be updated to reflect the current code. diff --git a/.ai/spec/how/container-build.md b/.ai/spec/how/container-build.md new file mode 100644 index 000000000..c84548376 --- /dev/null +++ b/.ai/spec/how/container-build.md @@ -0,0 +1,216 @@ +# Container Build -- Architecture + +This spec documents the Containerfiles, Makefile targets, and Konflux/Tekton pipeline configurations that build and publish the project's container images. + +## Module Map + +| Path | Purpose | +|---|---| +| `Containerfile` | Main RAG content image -- multi-stage build (builder → minimal) | +| `byok/Containerfile.tool` | BYOK tool image -- buildah + Python + model + script | +| `byok/Containerfile.output` | BYOK output image template -- vectors only, built inside tool container | +| `Makefile` | Developer-facing build automation | +| `.tekton/lightspeed-ocp-rag-push.yaml` | Konflux push pipeline for main RAG image | +| `.tekton/lightspeed-ocp-rag-pull-request.yaml` | Konflux PR pipeline for main RAG image | +| `.tekton/lightspeed-rag-tool-push.yaml` | Konflux push pipeline for BYOK tool image | +| `.tekton/lightspeed-rag-tool-pull-request.yaml` | Konflux PR pipeline for BYOK tool image | +| `.tekton/own-app-lightspeed-rag-content-push.yaml` | Alternative build variant push pipeline | +| `.tekton/own-app-lightspeed-rag-content-pull-request.yaml` | Alternative build variant PR pipeline | +| `.tekton/integration-tests/lightspeed-rag-content-image-verification.yaml` | Integration test -- validates image contents | +| `pyproject.toml` | PDM project metadata, dependency groups, linting config | +| `requirements.cpu.txt` / `requirements.gpu.txt` | Exported pip dependencies with hashes | +| `pdm.lock.cpu` / `pdm.lock.gpu` | PDM lockfiles per compute flavor | +| `rpms.in.yaml` / `rpms.lock.yaml` | RPM dependency spec + lockfile for Cachi2 | +| `artifacts.lock.yaml` | Pinned model.safetensors URL + SHA256 | +| `renovate.json` | Dependency update automation config | + +## Main Containerfile -- Build Stages + +### Stage 1: Base image selection + +Two named stages define the base images. The `FLAVOR` build arg (default: `cpu`) selects which one is used: + +```dockerfile +FROM registry.access.redhat.com/ubi9/python-311 as cpu-base +FROM nvcr.io/nvidia/cuda:12.9.1-devel-ubi9 as gpu-base +FROM ${FLAVOR}-base as lightspeed-rag-builder +``` + +The GPU base installs additional system packages: `python3.11`, `python3.11-pip`, `libcudnn9`, `libnccl`, `libcusparselt0`. + +### Stage 2: Builder (`lightspeed-rag-builder`) + +``` +USER 0, WORKDIR /workdir + +1. pip install requirements.gpu.txt +2. Symlink NLTK data: + ln -s .../site-packages/llama_index/core/_static/nltk_cache /root/nltk_data +3. COPY ocp-product-docs-plaintext, runbooks, embeddings_model +4. Acquire model.safetensors: + HERMETIC=true → cp /cachi2/output/deps/generic/model.safetensors + HERMETIC=false → curl from HuggingFace (pinned commit SHA) +5. GPU validation (FLAVOR=gpu only): + python3.11 -c "import torch; print(torch.version.cuda); print(torch.cuda.is_available())" + (requires LD_LIBRARY_PATH=/usr/local/cuda-12/compat) +6. COPY scripts/generate_embeddings.py +7. 
For each OCP_VERSION in $(ls -1 ocp-product-docs-plaintext): + python3.11 generate_embeddings.py \ + -f ocp-product-docs-plaintext/${VERSION} \ + -r runbooks/alerts \ + -md embeddings_model \ + -mn ${EMBEDDING_MODEL} \ + -o vector_db/ocp_product_docs/${VERSION} \ + -i ocp-product-docs-$(echo $VERSION | sed 's/\./_/g') \ + -v ${VERSION} \ + -hb $HERMETIC +8. Create latest symlink: + LATEST=$(ls -1 vector_db/ocp_product_docs/ | sort -V | tail -n 1) + ln -s ${LATEST} vector_db/ocp_product_docs/latest +``` + +### Stage 3: Final image + +```dockerfile +FROM registry.access.redhat.com/ubi9/ubi-minimal@sha256:{digest} +COPY --from=lightspeed-rag-builder /workdir/vector_db/ocp_product_docs /rag/vector_db/ocp_product_docs +COPY --from=lightspeed-rag-builder /workdir/embeddings_model /rag/embeddings_model +RUN mkdir /licenses +COPY LICENSE /licenses/ +# Enterprise contract labels (com.redhat.component, cpe, vendor, etc.) +USER 65532:65532 +``` + +The `ubi-minimal` image is pinned by SHA256 digest. Digest updates are managed by automated Konflux/Mintmaker PRs. + +## BYOK Containerfile.tool + +``` +FROM ubi9/ubi:latest + ├── dnf install buildah python3.11 python3.11-pip + ├── pip install requirements.cpu.txt (--no-deps) + ├── COPY embeddings_model + ├── Acquire model.safetensors (same HERMETIC logic as main Containerfile) + ├── COPY byok/generate_embeddings_tool.py, byok/Containerfile.output + ├── Enterprise contract labels + ├── Set environment: + │ _BUILDAH_STARTED_IN_USERNS="" + │ BUILDAH_ISOLATION=chroot + │ OUT_IMAGE_TAG, BYOK_TOOL_IMAGE, UBI_BASE_IMAGE, LOG_LEVEL, VECTOR_DB_INDEX + └── CMD: buildah build \ + --build-arg BYOK_TOOL_IMAGE=$BYOK_TOOL_IMAGE \ + --build-arg UBI_BASE_IMAGE=$UBI_BASE_IMAGE \ + --env VECTOR_DB_INDEX=$VECTOR_DB_INDEX \ + -t $OUT_IMAGE_TAG -f Containerfile.output \ + -v /markdown:/markdown:Z . \ + && buildah push $OUT_IMAGE_TAG docker-archive:/output/$OUT_IMAGE_TAG.tar +``` + +## BYOK Containerfile.output + +``` +FROM ${BYOK_TOOL_IMAGE} as tool + USER 0, WORKDIR /workdir + RUN python3.11 generate_embeddings_tool.py \ + -i /markdown -emd embeddings_model \ + -emn sentence-transformers/all-mpnet-base-v2 \ + -o vector_db -id $VECTOR_DB_INDEX + +FROM ${UBI_BASE_IMAGE} + COPY --from=tool /workdir/vector_db /rag/vector_db +``` + +## Makefile Targets + +| Target | Command | Purpose | +|---|---|---| +| `install-tools` | `pip3.11 install pdm` | Install PDM if not present | +| `pdm-lock-check` | `pdm lock --check --group {cpu,gpu}` | Validate both lockfiles | +| `install-deps` | `pdm sync --group $(TORCH_GROUP) --lockfile pdm.lock.$(TORCH_GROUP)` | Install runtime deps | +| `install-deps-test` | `pdm sync --dev --group $(TORCH_GROUP) ...` | Install dev deps | +| `update-deps` | `pdm update --update-all ... && pdm export ...` | Update + regenerate requirements.*.txt | +| `check-types` | `mypy --explicit-package-bases scripts` | Type checking | +| `format` | `black scripts && ruff check scripts --fix` | Code formatting | +| `verify` | `black --check scripts && ruff check scripts` | Lint verification | +| `update-docs` | Loop: `get_ocp_plaintext_docs.sh $V` + `get_runbooks.sh` | Refresh committed content | +| `update-model` | `python scripts/download_embeddings_model.py` | Download embedding model | +| `build-image` | `podman build -t rag-content .` | Local container build | +| `model-safetensors` | `wget model.safetensors` if not present | Download model binary | + +`FLAVOR` variable (default: `cpu`) maps to `TORCH_GROUP` which selects the lockfile and requirements file. 
The `verify` and `format` targets apply `--per-file-ignores=scripts/*:S101` to allow assert statements in scripts. + +## Konflux Pipeline Structure + +All six pipelines are Tekton PipelineRun definitions that follow the same pattern: + +### Prefetch dependencies + +Cachi2 prefetches three dependency types: +- **pip**: From `requirements.{cpu|gpu}.txt` with hashes. +- **rpm**: From `rpms.lock.yaml`. +- **generic**: From `artifacts.lock.yaml` (model.safetensors URL + SHA256). + +### Build + +Uses `buildah` task with: +- `hermetic=true` -- network-isolated build. +- Build args: `FLAVOR=gpu`, `HERMETIC=true`. +- The prefetched dependencies are injected into the build context. + +### Post-build + +- **Source image**: Created for artifact provenance tracking. +- **Label check**: Validates enterprise contract labels. +- **Integration test** (push pipelines only): Runs `lightspeed-rag-content-image-verification.yaml`. + +### Integration test + +`lightspeed-rag-content-image-verification.yaml` is a Tekton Task that: +1. Mounts the built image. +2. Checks for `/rag/vector_db/{version}/index_store.json` for at least one OCP version. +3. Checks for `/rag/embeddings_model/config.json`. +4. Fails if either path is missing. + +## Dependency Management Flow + +``` +pyproject.toml +├── [project.dependencies] Core deps (llama-index, faiss, etc.) +├── [project.optional-dependencies] +│ cpu = [torch @ https://...cpu...] CPU PyTorch wheel (pinned URL + hash) +│ gpu = [torch==2.6.0] GPU PyTorch from PyPI +└── [tool.pdm.dev-dependencies] + dev = [black, mypy, ruff, types-requests] + + │ + ▼ +pdm lock → pdm.lock.cpu / pdm.lock.gpu + │ + ▼ +pdm export → requirements.cpu.txt / requirements.gpu.txt + (with --hashes for pip install verification) + +rpms.in.yaml → rpms.lock.yaml + (Cachi2 RPM resolution for container build) + +artifacts.lock.yaml + (model.safetensors URL + SHA256 for Cachi2 generic artifact) + +renovate.json: + - Python package auto-updates: DISABLED + - Konflux references: auto-updated +``` + +## Implementation Notes + +- The main Containerfile always installs `requirements.gpu.txt` regardless of `FLAVOR`. The `FLAVOR` arg only affects the base image selection (CPU vs GPU). This means the CPU builder installs GPU-compatible torch, which works but is larger than necessary. + +- The `--no-deps` flag is used in the BYOK tool's `pip install` but NOT in the main Containerfile. This prevents pip from pulling transitive dependencies that might conflict with the locked set. + +- `generate_packages_to_prefetch.py` (in `lsc/scripts/`) is a complex script for Cachi2 hermetic build preparation. It: copies the project stub, removes torch from pyproject.toml, runs `pip-compile` to generate requirements.txt, removes torch + nvidia packages, separately downloads the CPU torch wheel from PyPI, computes its hash, and generates `requirements-build.txt`. This script is not invoked during the container build itself -- it is a developer tool for maintaining the Cachi2 prefetch inputs. + +- The NLTK data symlink (`ln -s .../nltk_cache /root/nltk_data`) is required because LlamaIndex's sentence tokenizer depends on NLTK's `punkt` tokenizer data. The data is bundled with the llama-index-core package but needs to be discoverable at the default NLTK data path. + +- GPU builds set `LD_LIBRARY_PATH=/usr/local/cuda-12/compat` for CUDA library discovery. This is needed both during the torch validation step and during the embedding generation loop. 
+ +- The `ubi-minimal` final image digest is periodically updated by Konflux/Mintmaker automation, which submits PRs to update the `@sha256:...` pinning. diff --git a/.ai/spec/how/html-pipeline.md b/.ai/spec/how/html-pipeline.md new file mode 100644 index 000000000..1f92bf88e --- /dev/null +++ b/.ai/spec/how/html-pipeline.md @@ -0,0 +1,163 @@ +# HTML Pipeline -- Architecture + +This spec documents the HTML-based embedding pipeline in `scripts/html_embeddings/` and the semantic chunking library in `scripts/html_chunking/`. This pipeline downloads HTML documentation from the Red Hat portal, strips non-content markup, performs semantic HTML chunking that preserves document structure and anchor IDs, and generates FAISS vector indexes. + +## Module Map + +| Path | Key Symbols | +|---|---| +| `scripts/html_embeddings/generate_embeddings.py` | `main()`, `setup_environment()`, `run_download_step()`, `run_strip_step()`, `run_chunk_step()`, `run_runbooks_step()`, `run_embedding_step()`, `load_chunks_as_nodes()` | +| `scripts/html_embeddings/download_docs.py` | `download_documentation()` | +| `scripts/html_embeddings/strip_html.py` | `strip_html_content()` | +| `scripts/html_embeddings/chunk_html.py` | `chunk_html_documents()`, `chunk_single_html_file()`, `extract_metadata_from_path()`, `validate_chunks()`, `get_chunking_stats()` | +| `scripts/html_embeddings/process_runbooks.py` | `process_runbooks()` | +| `scripts/html_embeddings/utils.py` | `setup_logging()`, `create_directory_structure()`, `validate_dependencies()`, `sanitize_directory_path()` | +| `scripts/html_chunking/chunker.py` | `chunk_html()`, `ChunkingOptions`, `Chunk`, `find_first_anchor()`, `get_document_title()`, `_split_element_by_children()`, `_split_element_by_children_no_grouping()`, `_split_table()`, `_split_list()`, `_split_code()`, `_split_definition_list()`, `_linear_split()`, `_get_anchored_url()` | +| `scripts/html_chunking/tokenizer.py` | `count_html_tokens()` | + +## Data Flow -- 5-Step Pipeline + +``` +1. DOWNLOAD + Input: --doc-url | --doc-url-slug | --config-file + Output: cache/{slug}/{version}/downloads/*.html + Action: Fetch HTML pages from docs.redhat.com portal. + Supports single URL, slug+version, or batch config file. + +2. STRIP + Input: cache/{slug}/{version}/downloads/*.html + Output: cache/{slug}/{version}/stripped/*.html + Action: Remove non-content HTML (navigation, header, footer, + scripts, styles). Preserve document body structure. + +3. CHUNK + Input: cache/{slug}/{version}/stripped/*.html + Output: cache/{slug}/{version}/chunks/{doc_name}/*_chunk_NNNN.json + Action: Semantic HTML chunking → individual JSON chunk files. + Each chunk carries metadata (docs_url, title, section_title, + chunk_index, token_count). + +4. RUNBOOKS + Input: --runbooks-dir (default: ./runbooks) + Output: cache/{slug}/{version}/chunks/*.json (flat, at base level) + Action: Convert Markdown runbooks to JSON chunk files. + Stored flat (not in doc-specific subdirectories). + +5. EMBED + Input: cache/{slug}/{version}/chunks/**/*.json + Output: --output-dir containing FAISS index + metadata.json + Action: Load JSON chunks as TextNode objects, create + VectorStoreIndex, persist to output directory. +``` + +## Key Abstractions + +### HTML Chunking Library (`scripts/html_chunking/`) + +The chunker operates on parsed HTML DOM trees via BeautifulSoup. The algorithm is recursive and structure-aware. 
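As a rough illustration before the detailed rules below, a sketch of that recursive flow (illustrative only -- the traversal details and the `count_html_tokens` call signature are assumptions, not the actual chunker implementation):

```python
# Sketch only: approximates the structure-aware splitting described in this
# section. It is not the real chunk_html() implementation.
from bs4 import BeautifulSoup
from tokenizer import count_html_tokens  # scripts/html_chunking/tokenizer.py

def chunk_html_sketch(html_content: str, max_token_limit: int) -> list[str]:
    # Short-circuit: a document that already fits is returned as a single chunk.
    if count_html_tokens(html_content) <= max_token_limit:
        return [html_content]

    soup = BeautifulSoup(html_content, "html.parser")
    root = soup.body or soup
    chunks: list[str] = []
    current: list[str] = []

    for child in root.children:
        piece = str(child)
        if current and count_html_tokens("".join(current) + piece) > max_token_limit:
            # Flush the accumulated chunk before starting a new one.
            chunks.append("".join(current))
            current = []
        # A real pass would recurse here when a single child is itself oversized
        # (tables, lists, <pre> blocks, definition lists).
        current.append(piece)

    if current:
        chunks.append("".join(current))
    return chunks
```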
+ +**Entry point**: `chunk_html(html_content, source_url, max_token_limit, count_tag_tokens) -> list[Chunk]` + +**Short-circuit**: If the entire document fits within `max_token_limit`, it is returned as a single chunk. + +**Primary splitter**: `_split_element_by_children(element, options)` iterates over direct children of an HTML element, accumulating them into a chunk until the token limit is exceeded. Special grouping rules: +- **Sections with IDs**: Processed recursively after flushing the current chunk. Section context (ID) is tracked and wrapped around chunk content. +- **Headings** (h1-h6): Grouped with the following sibling element to keep heading + first content together. +- **Paragraphs ending with `:`**: Grouped with the following table, list, or definition list to keep introductory text + content together. +- **Oversized children**: Recursively split via `_split_element_by_children_no_grouping()`. + +**Secondary splitter**: `_split_element_by_children_no_grouping(element, options)` accumulates children without grouping heuristics. Delegates to specialized splitters for structured elements: +- `_split_table(table, options)` -- Splits by rows, preserving `` header in every chunk. Oversized rows are split by cells. +- `_split_list(list_element, options)` -- Splits `
<ul>`/`<ol>` by `<li>` items, preserving list wrapper tags and attributes. Oversized items are recursively split.
+- `_split_code(pre_element, options)` -- Splits `<pre>` blocks by lines, preserving wrapper tag and attributes.
+- `_split_definition_list(div_element, options)` -- Splits `<dl>` by `<dt>`/`<dd>
      ` pairs, preserving wrapper structure. + +**Fallback**: `_linear_split(html_content, options)` -- last-resort character-based splitting using a 3.5 characters-per-token ratio estimate. Emits a warning when used. + +### Anchor and Section Tracking + +After chunking, a stateful post-processing pass assigns metadata: +- `last_seen_anchor` -- Most recent HTML element ID encountered across chunks. Persists across chunk boundaries so chunks without IDs inherit the previous anchor. +- `last_heading_text` -- Most recent heading text, used for `section_title` metadata. +- `chapter_anchor` -- Set when an `h2` heading is encountered. Used to build two-level anchored URLs. + +### URL Construction + +`_get_anchored_url(source_url, my_id, parent_id)` builds deep-linked URLs: +- Replaces `/html-single/` with `/html/` in the source URL and strips trailing `/`. +- If no anchor ID: returns the source URL as-is. +- With anchor ID: `{source_url}/{parent_id}#{my_id}` (if parent_id present) or `{source_url}/{my_id}`. + +### Chunk Intermediate Format + +Each chunk is persisted as a JSON file: +```json +{ + "id": "{doc_id}_chunk_0001", + "content": "", + "metadata": { + "docs_url": "https://docs.redhat.com/.../monitoring#my-section", + "title": "Monitoring", + "section_title": "Configuring alerting rules", + "chunk_index": 1, + "total_chunks": 42, + "token_count": 350, + "source_file": "monitoring/index.html", + "doc_name": "monitoring", + "doc_id": "monitoring", + "version": "4.18", + "doc_type": "openshift_container_platform_documentation" + } +} +``` + +### Input Source Resolution + +Three mutually exclusive input methods: +- `--doc-url` -- A full URL to a documentation page. Slug and version are parsed from the URL path. Supports both Red Hat documentation format (`/documentation/{lang}/{slug}/{version}/...`) and arbitrary URLs. +- `--doc-url-slug` + `--doc-url-version` -- Product slug (e.g., `openshift_container_platform`) with explicit version. If the slug looks like a URL, it is parsed to extract the actual slug. +- `--config-file` -- YAML or JSON file with a `products` list, each containing `slug`/`version` or `url`. Enables batch processing of multiple products in one invocation. + +### Pipeline Step Functions + +Each step follows a consistent pattern: +```python +def run_{step}_step(args, paths, [product,] logger) -> bool: +``` +Returns `True` on success, `False` on failure. The main loop checks return values and either continues (with `--continue-on-error`) or aborts. + +## Integration Points + +### html_chunking → html_embeddings + +`chunk_html.py` imports from the chunking library via `sys.path.insert`: +```python +sys.path.insert(0, str(Path(__file__).parent.parent / "html_chunking")) +from chunker import chunk_html, Chunk +from tokenizer import count_html_tokens +``` +This is not a proper package import -- it relies on filesystem adjacency. + +### chunks → embeddings + +`load_chunks_as_nodes(chunks_dir, logger)` reads all `*.json` files (excluding `*_summary.json`) from the chunks directory, constructs `TextNode` objects from each chunk's `content`, `metadata`, and `id`, and returns them for indexing. + +## Implementation Notes + +- The HTML pipeline operates on HTML content throughout -- HTML tags are preserved in chunks and counted toward the token budget by default. The `--no-count-tag-tokens` flag excludes HTML tags from token counting. 
+ +- The pipeline writes intermediate results to a cache directory structure (`cache/{slug}/{version}/{step}/`), enabling `--use-cached-downloads` to skip re-fetching. Each step can be run independently if its inputs exist. + +- `--continue-on-error` allows the pipeline to proceed past failed steps using data from previous runs. This is useful for iterative development. + +- The embedding step does NOT apply the whitespace filter used by the plaintext pipeline. All loaded chunks are embedded. + +- Runbook chunks are stored as flat JSON files in the base chunks directory (not in doc-specific subdirectories). This is how `run_embedding_step` distinguishes them from doc chunks when `--specific-doc` is used: doc chunks are in subdirectories, runbooks are at the base level. + +- `chunk_html_documents()` writes a `chunking_summary.json` alongside the chunks with statistics: total files, processed files, total chunks, and chunking parameters. + +- `validate_chunks()` provides optional post-hoc validation with a 10% tolerance on chunk token size. Undersized chunks are those below 10% of max_token_limit. + +- The `extract_metadata_from_path()` function derives `doc_name`, `doc_id`, `version`, and `doc_type` from the file's relative path within the cache directory. Version is extracted via regex matching path components against `^\d+\.\d+(\.\d+)?$`. + +- Logging outputs to both stdout and a `html_embeddings.log` file. diff --git a/.ai/spec/how/lsc-library.md b/.ai/spec/how/lsc-library.md new file mode 100644 index 000000000..521800a36 --- /dev/null +++ b/.ai/spec/how/lsc-library.md @@ -0,0 +1,279 @@ +# lsc Library -- Architecture + +This spec documents the `lsc/src/lightspeed_rag_content/` installable Python library -- the most recent and most capable pipeline implementation, supporting multiple vector store backends (FAISS, PostgreSQL, llama-stack faiss, llama-stack sqlite-vec). + +## Module Map + +| Path | Key Symbols | +|---|---| +| `lsc/src/lightspeed_rag_content/document_processor.py` | `DocumentProcessor`, `_Config`, `_BaseDB`, `_LlamaIndexDB`, `_LlamaStackDB` | +| `lsc/src/lightspeed_rag_content/metadata_processor.py` | `MetadataProcessor` (abstract base) | +| `lsc/src/lightspeed_rag_content/okp.py` | `OKPMetadataProcessor`, `parse_metadata()`, `yield_files_related_to_projects()`, `is_file_related_to_projects()`, `metadata_has_url_and_title()` | +| `lsc/src/lightspeed_rag_content/utils.py` | `get_common_arg_parser()` | +| `lsc/src/lightspeed_rag_content/asciidoc/asciidoctor_converter.py` | `AsciidoctorConverter` | + +## Class Hierarchy + +``` +DocumentProcessor Public API. Owns a _Config and a _BaseDB subclass. + │ + ├── _Config Attribute-bag for keyword args (dynamic __getattr__). + │ + └── _BaseDB Abstract base. Initializes LlamaIndex Settings. + │ Provides _got_whitespace(), _filter_out_invalid_nodes(), + │ _split_and_filter(). + │ + ├── _LlamaIndexDB FAISS and PostgreSQL backends via LlamaIndex. + │ Accumulates nodes in _good_nodes list. + │ save() creates VectorStoreIndex and writes metadata.json. + │ + └── _LlamaStackDB llama-stack faiss and sqlite-vec backends. + Accumulates documents/chunks in self.documents list. + save() writes YAML config, runs async llama-stack client, + updates config with vector_store_id. +``` + +## Data Flow -- DocumentProcessor Lifecycle + +``` +1. __init__(chunk_size, chunk_overlap, model_name, embeddings_model_dir, + vector_store_type, manual_chunking, doc_type, ...) 
+ ├── Create _Config bag from kwargs + ├── Validate config via _check_config(): + │ - Warn if manual_chunking=False with faiss (not supported) + │ - Warn if table_name set for non-postgres backend + ├── Set HF_HOME = embeddings_model_dir, TRANSFORMERS_OFFLINE = "1" + └── Instantiate DB backend via _get_db(): + "faiss" or "postgres" → _LlamaIndexDB(config) + "llamastack-faiss/sqlite-vec" → _LlamaStackDB(config) + +2. process(docs_dir, metadata, required_exts, file_extractor, + unreachable_action, ignore_list) + ├── Create SimpleDirectoryReader with metadata.populate as file_metadata + ├── Load documents: reader.load_data(num_workers=config.num_workers) + ├── If unreachable_action != "warn": + │ ├── Separate ignore-listed docs from checkable docs + │ ├── Filter checkable docs by url_reachable == True + │ ├── If unreachable found: + │ │ "fail" → raise RuntimeError + │ │ "drop" → keep only reachable + ignored docs + │ └── Merge reachable + ignored back into docs + ├── db.add_docs(docs) → split, filter, accumulate + └── Increment _num_embedded_files by len(docs) + + (process() can be called multiple times to accumulate docs from + multiple sources before a single save()) + +3. save(index, output_dir) + ├── Calculate exec_time = int(time.time() - start_time) + └── db.save(index, output_dir, _num_embedded_files, exec_time) +``` + +## _LlamaIndexDB Internals + +### Initialization + +```python +def __init__(self, config): + super().__init__(config) # Configures Settings: chunk_size, chunk_overlap, + # embed_model, llm=None, node_parser + + # Compute embedding dimension dynamically + config.embedding_dimension = len( + Settings.embed_model.get_text_embedding("random text") + ) + + # Create vector store + if config.vector_store_type == "faiss": + faiss_index = faiss.IndexFlatIP(config.embedding_dimension) + vector_store = FaissVectorStore(faiss_index=faiss_index) + + elif config.vector_store_type == "postgres": + # All Postgres config from environment variables: + # POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_HOST, + # POSTGRES_PORT, POSTGRES_DATABASE + vector_store = PGVectorStore.from_params(...) + + self.storage_context = StorageContext.from_defaults(vector_store=vector_store) + self._good_nodes = [] +``` + +### Document Processing + +`add_docs(docs)` splits documents via `Settings.text_splitter`, filters out nodes without whitespace (via `_filter_out_invalid_nodes`), and appends valid nodes to `_good_nodes`. 
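A minimal sketch of that split-and-filter step, assuming the `Settings.text_splitter` wiring described in the plaintext-pipeline spec (illustrative, not the exact `_BaseDB` method bodies):

```python
# Sketch of the split-and-filter behavior described above. Assumes the
# LlamaIndex Settings were already configured by __init__.
from llama_index.core import Settings
from llama_index.core.schema import TextNode

def split_and_filter(documents) -> list[TextNode]:
    nodes = Settings.text_splitter.get_nodes_from_documents(documents)
    good_nodes = []
    for node in nodes:
        # Keep only TextNodes containing at least one whitespace character;
        # this drops degenerate chunks such as a single long token.
        if isinstance(node, TextNode) and any(ch.isspace() for ch in node.text):
            good_nodes.append(node)
    return good_nodes
```

`add_docs()` then extends `_good_nodes` with the result, so repeated `process()` calls accumulate nodes before a single `save()`.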
+ +### Persistence + +`save()` creates a `VectorStoreIndex` from accumulated `_good_nodes`, sets the index ID, persists the storage context, and writes `metadata.json` with: +- `execution-time`, `llm` ("None"), `embedding-model`, `index-id` +- `vector-db` ("faiss.IndexFlatIP" or "PGVectorStore") +- `embedding-dimension`, `chunk`, `overlap`, `total-embedded-files` + +## _LlamaStackDB Internals + +### Initialization + +```python +def __init__(self, config): + super().__init__(config) # Settings configuration + + # Resolve model path (absolute for directories, name for repos) + if os.path.exists(config.embeddings_model_dir): + self.model_name_or_dir = os.path.realpath(config.embeddings_model_dir) + else: + self.model_name_or_dir = config.model_name + + # Compute embedding dimension via SentenceTransformer + model = SentenceTransformer(self.model_name_or_dir) + config.embedding_dimension = model.get_sentence_embedding_dimension() + + # Database filename: "faiss_store.db" or "sqlitevec_store.db" + self.db_filename = config.vector_store_type[11:] + "_store.db" + + # Create temp dir, set LLAMA_STACK_CONFIG_DIR to prevent + # using host's ~/.llama content + self.tmp_dir = tempfile.TemporaryDirectory(prefix="ls-rag-") + os.environ["LLAMA_STACK_CONFIG_DIR"] = self.tmp_dir.name + + # Deferred imports of llama_stack (not in pyproject.toml deps) + import llama_stack_api + from llama_stack.core.library_client import AsyncLlamaStackAsLibraryClient +``` + +### Document Accumulation + +`add_docs(docs)` behaves differently based on `manual_chunking`: + +**Manual chunking** (`manual_chunking=True`, the default): +- Splits documents via `_split_and_filter()` (same as _LlamaIndexDB). +- Converts each node to a dict with: `content`, `metadata` (including `document_id`), `chunk_metadata` (document_id, chunk_id, source), and `chunk_id`. + +**Auto chunking** (`manual_chunking=False`): +- Wraps each document as a `RAGDocument` object with: `document_id`, `content`, `mime_type="text/plain"`, `metadata`. +- No splitting is performed -- llama-stack handles chunking at upload time. + +### Persistence + +`save()` orchestrates the llama-stack workflow: +1. Create output directory. +2. Compute paths: db_file, files_metadata_db_file, cfg_file. +3. Write YAML config from `TEMPLATE` via `write_yaml_config()`. +4. Run `asyncio.run(_run_llama_stack(cfg_file, index))`. +5. Update YAML with the created vector_store_id via `_update_yaml_config()`. + +### Manual Chunking Path (`_insert_prechunked_documents`) + +``` +1. Create vector store: + client.vector_stores.create(name=index, provider_id=index, + embedding_model="sentence-transformers/{model}", + embedding_dimension=dim) + +2. Group chunks by source document ID → doc_groups dict + +3. Upload placeholder files per source document (concurrent): + For each doc group: + - Create empty BytesIO file + - client.files.create(file=file_obj, purpose="assistants") + - Update all chunks in group with uploaded file's ID + All uploads via asyncio.gather() + +4. Compute embeddings per chunk: + For each chunk: + - client.embeddings.create(input=content, model=embedding_model) + - Build chunk dict with content, metadata, embedding, etc. + +5. Batch insert: + client.vector_io.insert(vector_store_id=vs.id, chunks=all_chunks) +``` + +### Auto Chunking Path (`_upload_and_process_files`) + +``` +For each RAGDocument (sequential, one at a time): + 1. Upload file: client.files.create(file=BytesIO(content), purpose="assistants") + 2. 
Attach to vector store: client.vector_stores.files.create( + vector_store_id, file_id, attributes=metadata, + chunking_strategy={type: "static", max_chunk_size_tokens, chunk_overlap_tokens}) + 3. Poll for completion (up to 5 minutes, 0.5s interval): + client.vector_stores.files.retrieve() until status in + ("completed", "failed", "cancelled") + 4. On failure: retry up to 3 times with 1s backoff + 5. After all retries exhausted: add to failed_docs list + +If any files failed: raise RuntimeError +``` + +### YAML Config Template + +The `TEMPLATE` class variable generates a complete llama-stack configuration file: +- **APIs**: files, tool_runtime, vector_io, inference +- **Providers**: `sentence-transformers` for inference, `localfs` for files, `rag-runtime` for tool_runtime, configurable provider for vector_io (faiss or sqlite-vec) +- **Storage**: SQLite backends for metadata, KV store, and SQL store. The KV store path (`kv_db_path`) points to the output database file. +- **Registered model**: Embedding model with dimension and provider mapping. + +After vector store creation, `_update_yaml_config()` replaces `vector_stores: []` with the actual vector store section containing dimension, model, provider ID, and store ID. + +## MetadataProcessor Contract + +```python +class MetadataProcessor(ABC): + def __init__(self, hermetic_build: bool) + + def populate(self, file_path: str) -> dict: + # Returns {"docs_url": str, "title": str, "url_reachable": bool} + # Calls url_function() for URL, get_file_title() for title, + # ping_url() for reachability (skipped if hermetic_build=True) + + def get_file_title(self, file_path: str) -> str: + # Reads first line, strips "\n" and leading "# " + + def ping_url(self, url: str, retries: int = 3) -> bool: + # HTTP GET with 30s timeout, retries on failure or non-200 + + @abstractmethod + def url_function(self, file_path: str) -> str: + # Subclass implements: derive source URL from file path +``` + +### OKPMetadataProcessor + +Subclass for OKP/errata files. Overrides both `url_function` and `get_file_title`: +- `url_function()`: Parses TOML frontmatter, returns `metadata["extra"]["reference_url"]`. +- `get_file_title()`: Parses TOML frontmatter, returns `metadata["title"]`. + +Supporting functions: +- `parse_metadata(filepath)` -- Reads file, extracts TOML between `+++` markers via regex, parses with `tomllib`. +- `yield_files_related_to_projects(directory, projects)` -- Yields paths of `.md` files in directory whose `portal_product_names` match any of the given project names (case-insensitive substring match). +- `is_file_related_to_projects(metadata, projects)` -- Returns `True` if any project name is a substring of any product name in the metadata. +- `metadata_has_url_and_title(metadata)` -- Returns `True` if `reference_url` exists in `extra` and `title` is non-empty. + +## AsciidoctorConverter + +Wraps the `asciidoctor` CLI binary: +- Default target format: `"text"`, using the custom Ruby converter at `ruby_asciidoc/asciidoc_text_converter.rb`. +- Built-in asciidoctor formats (`html5`, `xhtml5`, `manpage`) use no custom converter file. +- Attribute files (YAML) are read and converted to `-a key=value` CLI arguments. +- `convert(source_file, destination_file)` runs `subprocess.run(command, check=True, capture_output=True)`. Creates destination directories if needed. +- Constructor validates that `asciidoctor` is on PATH via `shutil.which()`. +- Custom converter files are located via `importlib.resources.files()` relative to the package. 
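A hedged sketch of the resulting CLI invocation (the `-r`/`-b`/`-a`/`-o` flags are standard asciidoctor options; the exact argument assembly in `AsciidoctorConverter` may differ):

```python
# Illustrative only: how a text-format conversion could be assembled from the
# behavior described above (validate binary, build -a attributes, run the CLI).
import shutil
import subprocess
from pathlib import Path

def convert(source: Path, destination: Path, attributes: dict[str, str]) -> None:
    if shutil.which("asciidoctor") is None:
        raise RuntimeError("asciidoctor binary not found on PATH")
    destination.parent.mkdir(parents=True, exist_ok=True)
    command = [
        "asciidoctor",
        "-r", "ruby_asciidoc/asciidoc_text_converter.rb",  # custom converter
        "-b", "text",                                       # target format
    ]
    for key, value in attributes.items():                   # YAML attribute file entries
        command.extend(["-a", f"{key}={value}"])
    command.extend(["-o", str(destination), str(source)])
    subprocess.run(command, check=True, capture_output=True)
```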
+ +## Implementation Notes + +- `_Config` uses a private `__attributes` dict with `__getattr__`/`__setattr__` overrides rather than `self.__dict__` updates. The mangled name `_Config__attributes` is handled specially in `__setattr__` via `super().__setattr__()`. This pattern avoids polluting the instance namespace and helps with type checking. + +- `_BaseDB.__init__` sets `Settings.node_parser = MarkdownNodeParser()` when `doc_type` is `"markdown"` or `"html"`. This is because LlamaIndex's `HTMLReader` converts HTML to Markdown internally, so both content types benefit from Markdown-aware node parsing. + +- `_LlamaStackDB` has a documented limitation: it can only work once per instance. The YAML config file and database files in the temp directory are not cleaned up between runs. Creating a second index would require a new `DocumentProcessor` instance. + +- `DocumentProcessor._check_config()` warns but does not fail when `manual_chunking=False` is passed with FAISS (`auto_chunking` only works with llama-stack), or when `table_name` is set for non-Postgres backends. + +- The `process()` method can be called multiple times before `save()`. This is the intended pattern for combining docs from multiple sources (e.g., OCP docs + runbooks) into a single index. + +- `_LlamaStackDB` uses deferred imports (`import llama_stack_api` inside `__init__`) because llama-stack packages are not in the project's core `pyproject.toml` dependencies. They are optional, only needed when using llama-stack backends. + +- The `TEMPLATE` uses double-brace `{{}}` for literal braces in the YAML output since it uses Python's `str.format()` for interpolation. + +- PostgreSQL configuration for `_LlamaIndexDB` is read entirely from environment variables (`POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_DATABASE`), not from `_Config`. This means Postgres config cannot be passed via CLI arguments. + +- The `_LlamaStackDB` placeholder file upload pattern (uploading empty files for citation metadata) exists because llama-stack's citation system needs a `file_id` to link chunks back to source documents. The actual content is in the chunks, not the files. diff --git a/.ai/spec/how/plaintext-pipeline.md b/.ai/spec/how/plaintext-pipeline.md new file mode 100644 index 000000000..ec719732f --- /dev/null +++ b/.ai/spec/how/plaintext-pipeline.md @@ -0,0 +1,120 @@ +# Plaintext Pipeline -- Architecture + +This spec documents the implementation of `scripts/generate_embeddings.py` -- the production pipeline invoked by the main Containerfile to generate FAISS vector indexes from pre-converted plaintext OCP documentation and Markdown runbooks. 
+ +## Module Map + +| Path | Key Symbols | +|---|---| +| `scripts/generate_embeddings.py` | `ocp_file_metadata_func()`, `runbook_file_metadata_func()`, `file_metadata_func()`, `ping_url()`, `get_file_title()`, `got_whitespace()`, `str2bool()` | + +## Data Flow + +``` +CLI args parsed + │ + ├── Set module-level globals: + │ OCP_DOCS_VERSION, HERMETIC_BUILD, + │ EMBEDDINGS_ROOT_DIR (abs path to docs folder), + │ RUNBOOKS_ROOT_DIR (abs path to runbooks folder) + │ + ├── Configure LlamaIndex Settings: + │ Settings.chunk_size = args.chunk (default 380) + │ Settings.chunk_overlap = args.overlap (default 0) + │ Settings.embed_model = HuggingFaceEmbedding(model_name=args.model_dir) + │ Settings.llm = resolve_llm(None) + │ + ├── Compute embedding dimension: len(embed_model.get_text_embedding("random text")) + │ + ├── Create FAISS index: faiss.IndexFlatIP(embedding_dimension) + │ └── Wrap in FaissVectorStore → StorageContext + │ + ├── Load OCP docs: + │ SimpleDirectoryReader(args.folder, recursive=True, + │ file_metadata=ocp_file_metadata_func) + │ └── ocp_file_metadata_func builds docs_url: + │ OCP_DOCS_ROOT_URL + version + relative_path(.txt→.html) + │ + ├── Split into nodes: + │ Settings.text_splitter.get_nodes_from_documents(documents) + │ + ├── Filter: keep only TextNode with whitespace in text + │ → good_nodes list + │ + ├── Load runbooks: + │ SimpleDirectoryReader(args.runbooks, recursive=True, + │ required_exts=[".md"], + │ file_extractor={".md": FlatReader()}, + │ file_metadata=runbook_file_metadata_func) + │ └── runbook_file_metadata_func builds docs_url: + │ RUNBOOKS_ROOT_URL + relative_path + │ + ├── Split runbook nodes, append to good_nodes + │ (no whitespace filter applied to runbook nodes) + │ + ├── Create VectorStoreIndex(good_nodes, storage_context) + │ → triggers embedding generation for all nodes + │ + ├── Set index ID: index.set_index_id(args.index) + │ + ├── Persist: index.storage_context.persist(persist_dir=PERSIST_FOLDER) + │ → writes docstore.json, index_store.json, graph_store.json, vector_store.json + │ + └── Write metadata.json with execution metrics +``` + +## Key Abstractions + +### Metadata functions + +`file_metadata_func(file_path, docs_url_func)` is the generic metadata populator. It: +1. Calls `docs_url_func(file_path)` to derive the URL. +2. Calls `get_file_title(file_path)` to extract the title from the first line. +3. Optionally calls `ping_url(docs_url)` if `HERMETIC_BUILD` is false. +4. Returns `{"docs_url": docs_url, "title": title}`. + +`ocp_file_metadata_func(file_path)` and `runbook_file_metadata_func(file_path)` are thin wrappers that pass a URL-builder lambda to `file_metadata_func`. + +### URL construction + +- **OCP docs**: `https://docs.openshift.com/container-platform/{OCP_DOCS_VERSION}/{relative_path}.html` + - `relative_path` = `file_path` with `EMBEDDINGS_ROOT_DIR` prefix stripped and `.txt` suffix replaced with `.html`. +- **Runbooks**: `https://github.com/openshift/runbooks/blob/master/alerts/{relative_path}` + - `relative_path` = `file_path` with `RUNBOOKS_ROOT_DIR` prefix stripped. + +### Node filtering + +`got_whitespace(text)` iterates character-by-character and returns `True` if any character is whitespace. Only `TextNode` instances passing this check are included in `good_nodes`. This filters out degenerate chunks (e.g., single tokens without spaces). + +## Implementation Notes + +- `UNREACHABLE_DOCS` is a module-level global counter incremented via the `global` keyword. Not thread-safe, but the script is single-threaded. 
+ +- OCP docs and runbooks are loaded separately but merged into a single `good_nodes` list before index creation, producing one combined FAISS index per OCP version. + +- Runbook nodes bypass the whitespace filter -- they are appended directly after splitting via `good_nodes.extend(runbook_nodes)`. + +- `FlatReader()` is used for `.md` files to prevent LlamaIndex's default Markdown parser from interpreting structure. This treats Markdown as raw plaintext, preserving headings and formatting characters in the chunk text. + +- The output directory is sanitized via `os.path.normpath("/" + args.output).lstrip("/")` to prevent path traversal (fix for OLS-823). + +- `str2bool(value)` exists because Python's `bool("False")` returns `True`. The function handles string representations of booleans for the `--hermetic-build` argument, accepting: `true/false`, `yes/no`, `on/off`, `1/0`, `t/f`, `y/n`. + +- The script runs as `__main__` only -- it has no importable API. All logic is in the `if __name__ == "__main__":` block. + +- Metadata recorded in `metadata.json` includes `"llm": "None"` (string, not null) because no LLM is used during embedding generation. + +- The Containerfile invokes this script once per OCP version in a shell loop: + ``` + for OCP_VERSION in $(ls -1 ocp-product-docs-plaintext); do + python3.11 generate_embeddings.py \ + -f ocp-product-docs-plaintext/${OCP_VERSION} \ + -r runbooks/alerts \ + -md embeddings_model \ + -mn ${EMBEDDING_MODEL} \ + -o vector_db/ocp_product_docs/${OCP_VERSION} \ + -i ocp-product-docs-$(echo $OCP_VERSION | sed 's/\./_/g') \ + -v ${OCP_VERSION} \ + -hb $HERMETIC + done + ``` diff --git a/.ai/spec/how/project-structure.md b/.ai/spec/how/project-structure.md new file mode 100644 index 000000000..5cd6947ac --- /dev/null +++ b/.ai/spec/how/project-structure.md @@ -0,0 +1,116 @@ +# Project Structure -- Architecture + +OpenShift LightSpeed RAG Content is organized into four areas: `lsc/` (installable Python library), `scripts/` (standalone pipeline scripts and utilities), `byok/` (BYOK tooling), and root-level build/config files. The project has no runtime component -- all code runs during build or development. + +## Module Map + +### `lsc/src/lightspeed_rag_content/` -- Installable Python library + +| Path | Purpose | +|---|---| +| `document_processor.py` | `DocumentProcessor` class -- orchestrates document loading, chunking, embedding, and vector store persistence. Delegates to `_LlamaIndexDB` (FAISS/PostgreSQL) or `_LlamaStackDB` (llama-stack faiss/sqlite-vec). 836 lines. | +| `metadata_processor.py` | `MetadataProcessor` abstract base -- defines the `populate()` callback for LlamaIndex's `file_metadata` parameter. Subclasses implement `url_function()` to derive URLs from file paths. 100 lines. | +| `okp.py` | `OKPMetadataProcessor` -- parses TOML frontmatter from OKP/errata files. Helpers: `parse_metadata()`, `yield_files_related_to_projects()`, `is_file_related_to_projects()`, `metadata_has_url_and_title()`. 153 lines. | +| `utils.py` | `get_common_arg_parser()` -- shared CLI argument definitions (folder, model-dir, chunk, overlap, output, index, workers, vector-store-type, auto-chunking). 72 lines. | +| `asciidoc/asciidoctor_converter.py` | `AsciidoctorConverter` -- wraps the `asciidoctor` CLI for AsciiDoc format conversion. Supports custom Ruby converter extensions and per-version YAML attribute files. 160 lines. | +| `asciidoc/ruby_asciidoc/` | Ruby converter extensions for asciidoctor. 
`asciidoc_text_converter.rb` implements the custom text output format. | +| `asciidoc/__main__.py` | CLI entry point for AsciiDoc conversion. | + +### `scripts/` -- Standalone pipeline scripts + +| Path | Purpose | +|---|---| +| `generate_embeddings.py` | **Plaintext pipeline** -- loads OCP docs + runbooks, generates FAISS index. The script invoked by the production Containerfile. 248 lines. | +| `html_chunking/chunker.py` | Semantic HTML chunker -- splits HTML by DOM structure (sections, tables, lists, code blocks, definition lists). Generates anchor-aware metadata. 408 lines. | +| `html_chunking/tokenizer.py` | `count_html_tokens()` -- token counting for HTML content with optional tag token counting. | +| `html_chunking/parser.py` | HTML parsing utilities for the chunking library. | +| `html_chunking/test_chunker.py` | Unit tests for HTML chunking logic. | +| `html_embeddings/generate_embeddings.py` | **HTML pipeline** orchestrator -- 5-step pipeline: download, strip, chunk, runbooks, embed. Supports batch processing via config file. 659 lines. | +| `html_embeddings/download_docs.py` | `download_documentation()` -- fetches HTML docs from Red Hat documentation portal. | +| `html_embeddings/strip_html.py` | `strip_html_content()` -- removes non-content HTML (navigation, headers, footers, scripts, styles). | +| `html_embeddings/chunk_html.py` | `chunk_html_documents()` -- bridges the HTML chunking library to the embeddings pipeline. Manages per-document output directories and metadata extraction. 470 lines. | +| `html_embeddings/process_runbooks.py` | `process_runbooks()` -- converts Markdown runbooks to JSON chunk files for the HTML pipeline. | +| `html_embeddings/utils.py` | `setup_logging()`, `create_directory_structure()`, `validate_dependencies()`, `sanitize_directory_path()`. 264 lines. | +| `html_embeddings/test_html_embeddings.py` | Unit tests for the HTML embeddings pipeline. | +| `asciidoctor-text/convert-it-all.py` | Bulk AsciiDoc-to-plaintext conversion using topic maps. Reads `_topic_map.yml`, filters by distribution, and converts each referenced `.adoc` file. | +| `asciidoctor-text/text-converter.rb` | Ruby text format converter extension for asciidoctor. | +| `get_ocp_plaintext_docs.sh` | Clones openshift-docs for a given version, runs AsciiDoc conversion, applies exclusions from `config/exclude.conf`. | +| `get_runbooks.sh` | Sparse-checkout of `alerts/` directory from openshift/runbooks repo. Removes README files, empty dirs, deprecated dirs. | +| `query_rag.py` | Debug utility -- loads a persisted FAISS index and retrieves top-k similar nodes for a query. | +| `distance.py` | Debug utility -- computes cosine + euclidean distance between two text embeddings. | +| `iterate_docstore.py` | Debug utility -- dumps all nodes from a vector DB's docstore.json. | +| `download_embeddings_model.py` | Downloads the embedding model from HuggingFace via `snapshot_download()`. Removes unneeded files (pytorch_model.bin, onnx/, openvino/). | +| `generate_packages_to_prefetch.py` | Generates Cachi2-compatible requirements files for hermetic builds. Complex: strips torch, handles CPU wheel separately, computes hashes. | +| `verify_rag_image_test.py` | Integration test -- verifies container image has `/rag/vector_db/{version}/index_store.json` and `/rag/embeddings_model/config.json`. | + +### `byok/` -- BYOK tooling + +| Path | Purpose | +|---|---| +| `generate_embeddings_tool.py` | BYOK embedding generator -- simplified pipeline for customer Markdown. 
Uses `FlatReader` and YAML frontmatter parsing. 128 lines. | +| `Containerfile.tool` | BYOK tool container definition (buildah + Python + model + script). | +| `Containerfile.output` | BYOK output container template (vectors only, built inside the tool container). | +| `README.md` | BYOK usage documentation: environment variables, frontmatter format, examples. | + +### `config/` -- Content configuration + +| Path | Purpose | +|---|---| +| `exclude.conf` | Newline-delimited list of relative file paths to exclude from OCP docs after AsciiDoc conversion. | + +### `ocp-product-docs-plaintext/` -- Committed OCP documentation + +Contains plaintext-converted OCP documentation organized by version (`4.16/` through `4.22/`). Each version directory preserves the category structure from openshift-docs (e.g., `applications/`, `architecture/`, `authentication/`, `backup_and_restore/`, etc.). + +### `runbooks/` -- Committed alert runbooks + +Contains Markdown runbooks organized under `alerts/` with operator-specific subdirectories (e.g., `cluster-etcd-operator/`, `cluster-dns-operator/`, `openshift-virtualization-operator/`). + +### `embeddings_model/` -- Sentence-transformer model + +Contains the `sentence-transformers/all-mpnet-base-v2` model files (`config.json`, `tokenizer.json`, `vocab.txt`, `1_Pooling/` config). The `model.safetensors` binary is not committed -- it is downloaded at build time or fetched from Cachi2. + +### Root-level build and config files + +| Path | Purpose | +|---|---| +| `Containerfile` | Main RAG content image -- multi-stage build (builder, minimal). | +| `Makefile` | Developer-facing build automation (install-deps, update-docs, build-image, format, verify, etc.). | +| `pyproject.toml` | PDM project metadata. Dependencies, optional groups (cpu/gpu), ruff/mypy config. | +| `pdm.lock.cpu` / `pdm.lock.gpu` | PDM lockfiles per compute flavor. | +| `requirements.cpu.txt` / `requirements.gpu.txt` | Exported pip dependencies with hashes, generated by `pdm export`. | +| `requirements-build.txt` | Build-time pip dependencies for Cachi2. | +| `rpms.in.yaml` / `rpms.lock.yaml` | RPM dependency spec + lockfile for Cachi2 hermetic builds. | +| `artifacts.lock.yaml` | Pinned `model.safetensors` URL + SHA256 checksum. | +| `renovate.json` | Renovate bot config -- Python package auto-updates disabled. | +| `.gitleaks.toml` | Secret scanning configuration. | +| `.syft.yaml` | SBOM generation configuration. | +| `ubi.repo` / `cuda.repo` | DNF repository files for use inside container builds. | +| `OWNERS` | GitHub ownership: approvers list. | +| `CLAUDE.md` / `AGENTS.md` | Development guide for AI agents. | + +## Dependency Management + +**Package manager**: PDM (not pip/poetry). Two lockfiles exist per compute flavor: +- `pdm.lock.cpu` -- pins PyTorch CPU variant from `download.pytorch.org/whl/cpu`. +- `pdm.lock.gpu` -- pins standard PyTorch from PyPI. + +**Makefile `FLAVOR` variable** (default: `cpu`): Maps to `TORCH_GROUP` which selects the lockfile and requirements file for all PDM operations. + +**Core dependencies**: `llama-index-core`, `llama-index-vector-stores-faiss`, `llama-index-embeddings-huggingface`, `llama-index-readers-file`, `faiss-cpu`, `torch`, `huggingface-hub`, `accelerate`, `python-frontmatter`, `beautifulsoup4`, `aiohttp`, `PyYAML`, `urllib3`. + +**Dev dependencies**: `black`, `mypy`, `ruff`, `types-requests`. + +**Optional (not in pyproject.toml)**: `llama-stack-api`, `llama-stack-core`, `pgvector`, `psycopg` -- used by the lsc library for non-FAISS backends. 
+ +## Key Relationships + +1. **Production Containerfile uses `scripts/generate_embeddings.py`**, not the lsc library. The lsc library is for downstream consumers and alternative backends. + +2. **HTML pipeline is standalone** -- it does not share code with the plaintext pipeline or the lsc library. It has its own download, strip, chunk, and embed steps. + +3. **Content acquisition scripts exist in both `scripts/` and `lsc/scripts/`**. The `lsc/` copies are the maintained versions; the `scripts/` copies are the originals used by the Containerfile and Makefile. + +4. **The lsc library is an installable package** (`lsc/` contains its own `pyproject.toml` structure via the `src/` layout), but it is not published to PyPI. It is imported directly by downstream projects. + +5. **Test infrastructure is minimal**: `scripts/html_chunking/test_chunker.py` and `scripts/html_embeddings/test_html_embeddings.py` use `unittest`. `scripts/verify_rag_image_test.py` verifies container image contents. No pytest configuration exists. diff --git a/.ai/spec/what/README.md b/.ai/spec/what/README.md new file mode 100644 index 000000000..9d1a02bc1 --- /dev/null +++ b/.ai/spec/what/README.md @@ -0,0 +1,24 @@ +# Behavioral Specifications (what/) + +These specs define WHAT the RAG content pipeline must do -- testable behavioral rules, configuration surface, constraints, and planned changes. They are technology-neutral where possible and survive a complete rewrite in a different framework. + +## Spec Index + +| Spec | Description | +|------|-------------| +| [system-overview.md](system-overview.md) | Project purpose, boundaries, integration contract with lightspeed-service | +| [content-sources.md](content-sources.md) | OCP docs, runbooks, OKP content -- acquisition, versioning, metadata, exclusions | +| [embedding-pipeline.md](embedding-pipeline.md) | Shared behavioral rules for all pipeline variants: chunking, embedding, vector store output, index organization | +| [byok.md](byok.md) | Bring Your Own Knowledge -- customer content import via tool container | +| [container-build.md](container-build.md) | Container images, hermetic builds, CI/CD pipelines, dependency management | + +## How to Use These Specs + +- **Fixing a bug**: Read the relevant spec to understand correct behavior, then compare against the code. +- **Adding a feature**: Check if the spec covers the requirement. Update the spec before implementing. +- **Refactoring**: Use the specs as acceptance criteria. The implementation can change freely as long as it meets the behavioral rules. +- **Understanding planned work**: Look for `[PLANNED: OLS-XXXX]` markers inline and "Planned Changes" sections. + +## Relationship to how/ Specs + +These `what/` specs define the behavioral contract. The [`how/` specs](../how/README.md) describe the current implementation architecture. Read `what/` to understand requirements, read `how/` to understand the codebase structure. diff --git a/.ai/spec/what/byok.md b/.ai/spec/what/byok.md new file mode 100644 index 000000000..77007fca7 --- /dev/null +++ b/.ai/spec/what/byok.md @@ -0,0 +1,64 @@ +# BYOK (Bring Your Own Knowledge) + +BYOK enables customers to create custom RAG indexes from their own documentation, so that OpenShift LightSpeed responses incorporate organization-specific knowledge alongside standard product documentation. + +## Behavioral Rules + +1. BYOK accepts customer-supplied Markdown files as input, mounted at `/markdown` in the tool container. + +2. 
Markdown files may optionally contain YAML frontmatter (delimited by `---`) with `title` and `url` fields. If present, these override the default metadata extraction.
+
+3. If no YAML frontmatter is present, `title` is extracted from the first line of the file (stripping any leading `# ` prefix), and `docs_url` defaults to the file path.
+
+4. Only `.md` files are processed. Files are read recursively from the input directory using `FlatReader` (raw text ingestion -- no Markdown structure parsing).
+
+5. The BYOK tool uses the same embedding model (`sentence-transformers/all-mpnet-base-v2`), default chunk size (380 tokens), default chunk overlap (0), and FAISS output format (`IndexFlatIP`) as the main pipeline. Customer indexes are interchangeable with product indexes from lightspeed-service's perspective.
+
+6. The BYOK tool produces a container image as an OCI tar archive at `/output/{tag}.tar`. The output image contains the vector DB at `/rag/vector_db/`.
+
+7. The tool container uses `buildah` internally to build the output image. It runs a nested container build using `Containerfile.output`, which:
+   - Uses the tool image itself as the builder stage to run the embedding generation script.
+   - Copies the generated vectors into a minimal UBI base image.
+
+8. Node filtering (the check applied in other pipelines that drops chunks containing no whitespace) is NOT applied in BYOK. All chunks from the text splitter are included in the index.
+
+9. The output image contains only the vector DB, NOT the embedding model. The service must provide the model separately (typically from the main RAG content image).
+
+## Configuration Surface -- Environment Variables (Tool Container)
+
+| Variable | Default | Purpose |
+|---|---|---|
+| `OUT_IMAGE_TAG` | `byok-image` | Tag for the output container image |
+| `VECTOR_DB_INDEX` | `vector_db_index` | Index ID string |
+| `BYOK_TOOL_IMAGE` | `registry.redhat.io/.../lightspeed-rag-tool-rhel9:latest` | Base image for the tool stage |
+| `UBI_BASE_IMAGE` | `registry.access.redhat.com/ubi9/ubi:latest` | Base image for the output container |
+| `LOG_LEVEL` | `info` | buildah log level |
+
+## Configuration Surface -- CLI Arguments (`generate_embeddings_tool.py`)
+
+| Argument | Default | Purpose |
+|---|---|---|
+| `-i` / `--input-dir` | (required) | Input directory with Markdown content |
+| `-emd` / `--embedding-model-dir` | `embeddings_model` | Embedding model directory |
+| `-emn` / `--embedding-model-name` | (required) | HuggingFace repo ID |
+| `-cs` / `--chunk-size` | `380` | Chunk size in tokens |
+| `-co` / `--chunk-overlap` | `0` | Chunk overlap in tokens |
+| `-o` / `--output-dir` | (required) | Vector DB output directory |
+| `-id` / `--index-id` | (required) | Index ID string |
+
+## Constraints
+
+1. The tool container requires `buildah` and uses `BUILDAH_ISOLATION=chroot` for rootless container building.
+
+2. The tool container runs as root (USER 0) because buildah requires privilege for image building.
+
+3. BYOK uses CPU-only dependencies (`requirements.cpu.txt`). GPU acceleration is not supported for BYOK.
+
+4. The output directory path is sanitized via `os.path.normpath("/" + path).lstrip("/")` to prevent path traversal.
+
+5. The customer must mount their Markdown content at `/markdown` and their output directory at `/output` when running the tool container.
+
+## Planned Changes
+
+- [PLANNED: OCPSTRAT-1494 Phase 2] Seamless one-click import from Git repositories and Confluence, replacing the manual container-based workflow.
+- [PLANNED: OLS-1872] Internal web source integration for BYOK. diff --git a/.ai/spec/what/container-build.md b/.ai/spec/what/container-build.md new file mode 100644 index 000000000..32da50e3c --- /dev/null +++ b/.ai/spec/what/container-build.md @@ -0,0 +1,89 @@ +# Container Build + +This spec defines the rules for building container images, hermetic build support, and CI/CD pipeline behavior. + +## Behavioral Rules -- Main RAG Content Image + +1. The build is a multi-stage process: a builder stage generates all vector indexes, then a minimal final stage copies only the output artifacts. + +2. The builder stage supports two base images selected by the `FLAVOR` build arg: + - `cpu`: UBI9 Python 3.11 (`registry.access.redhat.com/ubi9/python-311`). + - `gpu`: NVIDIA CUDA 12.9.1 devel on UBI9 (`nvcr.io/nvidia/cuda:12.9.1-devel-ubi9`), with additional system packages installed via dnf. + +3. The builder stage iterates over all version directories in `ocp-product-docs-plaintext/` and generates one FAISS index per version using the plaintext pipeline script. Each version's index includes both OCP docs and runbooks. + +4. After all indexes are generated, a `latest` symlink is created pointing to the highest version directory (determined by version-aware sorting). + +5. The final image uses `ubi9/ubi-minimal` (pinned by digest) as base and contains only: + - `/rag/vector_db/ocp_product_docs/` -- all version index directories plus the `latest` symlink. + - `/rag/embeddings_model/` -- the sentence-transformer model. + - `/licenses/LICENSE` -- Apache 2.0 license for enterprise contract compliance. + +6. The final image runs as non-root user (UID 65532, GID 65532). + +7. The embedding model's `model.safetensors` file is sourced based on the `HERMETIC` build arg: + - `HERMETIC=false`: Downloaded from HuggingFace at build time (URL pinned to a specific commit hash). + - `HERMETIC=true`: Copied from the Cachi2 prefetch cache at `/cachi2/output/deps/generic/model.safetensors`. + +8. Container labels must satisfy Red Hat enterprise contract requirements: `com.redhat.component`, `cpe`, `description`, `distribution-scope`, `io.k8s.description`, `io.k8s.display-name`, `io.openshift.tags`, `name`, `release`, `url`, `vendor`, `version`, `summary`. + +## Behavioral Rules -- BYOK Tool Image + +9. The BYOK tool image is built from `byok/Containerfile.tool`. It contains: `buildah`, Python 3.11, CPU Python dependencies, the embedding model, the BYOK embedding script (`generate_embeddings_tool.py`), and the output Containerfile template (`Containerfile.output`). + +10. The tool image's CMD runs `buildah build` to produce the customer's RAG image from Markdown content mounted at `/markdown`, then pushes the result as a tar archive to `/output/`. + +## Behavioral Rules -- Hermetic Builds + +11. Hermetic builds (`HERMETIC=true`) operate without network access during the container build step. + +12. All Python packages are pre-fetched via Cachi2 and installed from the prefetch cache. The requirements files include package hashes for verification. + +13. The embedding model binary (`model.safetensors`) is fetched as a generic artifact via Cachi2 with a pinned SHA256 hash defined in `artifacts.lock.yaml`. + +14. RPM packages are resolved and locked via `rpms.in.yaml` (input specification) and `rpms.lock.yaml` (locked versions). + +15. URL reachability validation is skipped during hermetic builds (the `--hermetic-build true` flag is passed to the embedding script). + +## Behavioral Rules -- CI/CD (Konflux/Tekton) + +16. 
Six pipelines exist as Tekton PipelineRun definitions:
+   - Push and pull-request variants for the main RAG content image.
+   - Push and pull-request variants for the BYOK tool image.
+   - Push and pull-request variants for an alternative build configuration.
+
+17. Push pipelines trigger on merge to `main`. Pull-request pipelines trigger on PRs.
+
+18. All pipelines use hermetic builds with Cachi2 prefetch for pip packages, RPMs, and generic artifacts.
+
+19. The main RAG image builds with `FLAVOR=gpu` in CI.
+
+20. An integration test verifies the built image contains the expected paths: an `index_store.json` file under `/rag/vector_db/` for at least one OCP version, and `config.json` under `/rag/embeddings_model/`.
+
+## Configuration Surface
+
+| Parameter | Type | Default | Purpose |
+|---|---|---|---|
+| `FLAVOR` | Build arg | `cpu` | Base image selection: `cpu` or `gpu` |
+| `HERMETIC` | Build arg | `false` | Enable hermetic build mode |
+| `EMBEDDING_MODEL` | Build arg | `sentence-transformers/all-mpnet-base-v2` | HuggingFace repo ID |
+| `artifacts.lock.yaml` | File | -- | Pinned `model.safetensors` URL + SHA256 |
+| `rpms.in.yaml` | File | -- | RPM dependency specifications |
+| `rpms.lock.yaml` | File | -- | Locked RPM versions |
+| `requirements.cpu.txt` | File | -- | Exported pip dependencies with hashes (CPU) |
+| `requirements.gpu.txt` | File | -- | Exported pip dependencies with hashes (GPU) |
+| `pdm.lock.cpu` | File | -- | PDM lockfile (CPU) |
+| `pdm.lock.gpu` | File | -- | PDM lockfile (GPU) |
+| `renovate.json` | File | -- | Dependency update automation config |
+
+## Constraints
+
+1. The NLTK data directory must be symlinked after pip install to make tokenization data available.
+
+2. GPU builds require the CUDA compatibility library path to be set for library discovery.
+
+3. Python dependencies are managed by PDM with separate lockfiles per compute flavor. The `pdm export` command generates the `requirements.*.txt` files used by pip inside the container build.
+
+4. The `ubi-minimal` final image is pinned by digest, not tag. Digest updates are managed by automated Konflux/Mintmaker PRs.
+
+5. Python package auto-updates via Renovate are disabled (configured in `renovate.json`). Dependency updates are manual and require lockfile regeneration.
diff --git a/.ai/spec/what/content-sources.md b/.ai/spec/what/content-sources.md
new file mode 100644
index 000000000..efbbc3c25
--- /dev/null
+++ b/.ai/spec/what/content-sources.md
@@ -0,0 +1,78 @@
+# Content Sources
+
+This spec defines the rules for acquiring, processing, and organizing input content before it enters the embedding pipeline.
+
+## Behavioral Rules -- OCP Product Documentation
+
+1. The source is the `openshift/openshift-docs` GitHub repository. Each OCP version is cloned from its corresponding `enterprise-{version}` branch (e.g., `enterprise-4.18`).
+
+2. AsciiDoc source files are converted to plaintext using the `asciidoctor` CLI with a custom Ruby text converter extension. Per-version attribute files provide version-specific AsciiDoc substitution values.
+
+3. The topic map file (`_topic_maps/_topic_map.yml`) with the distro filter `openshift-enterprise` determines which AsciiDoc files to convert. Only files referenced in the topic map for the `openshift-enterprise` distribution are included.
+
+4. Converted plaintext files are stored in `ocp-product-docs-plaintext/{version}/` with the directory hierarchy from the source repository preserved.
+
+5. 
Files listed in the exclusion configuration must be removed after conversion. The exclusion list is a newline-delimited file of relative paths within the version directory. + +6. All currently supported OCP versions must have a corresponding directory. Each directory name is a version string (e.g., `4.16`, `4.17`, `4.18`, `4.19`, `4.20`, `4.21`, `4.22`). + +7. The plaintext docs and runbooks are committed to the repository and kept up to date. Content acquisition scripts exist for refreshing them, but the committed content is what the production build uses. + +## Behavioral Rules -- Runbooks + +8. The source is the `openshift/runbooks` GitHub repository, `master` branch, restricted to the `alerts/` directory only. + +9. Acquisition uses sparse checkout with blob filtering (`--filter=blob:none`) to avoid cloning the full repository history. + +10. Only Markdown (`.md`) files are retained. `README.md` files, empty directories, and `deprecated/` directories are removed after checkout. + +11. Runbooks are stored in `runbooks/alerts/` with operator-specific subdirectories preserved (e.g., `runbooks/alerts/cluster-etcd-operator/`). + +## Behavioral Rules -- OKP (Errata) Content + +12. OKP files are Markdown documents with TOML frontmatter delimited by `+++` markers. + +13. OKP files are filtered by project name via case-insensitive substring matching against the `portal_product_names` list in the TOML metadata's `extra` section. A project name like "OpenStack" matches any product name containing that substring (e.g., "Red Hat OpenStack Platform"). + +14. OKP files must have both a non-empty `reference_url` in `extra` and a non-empty `title` in the TOML metadata to be included. Files missing either field are skipped with a warning. + +## Behavioral Rules -- Document Metadata + +15. Every document processed for embedding must carry at minimum two metadata fields: `docs_url` (source URL) and `title` (document title). The lsc library additionally includes `url_reachable` (boolean indicating URL validity) to support the `unreachable_action` filtering rules below. The plaintext pipeline does not include `url_reachable` in metadata. + +16. `title` is extracted from the first line of the document file, stripping any leading `# ` Markdown heading prefix. For OKP files, `title` comes from the TOML frontmatter instead. + +17. `docs_url` is derived from the file path using content-type-specific URL construction rules: + - **OCP docs**: `https://docs.openshift.com/container-platform/{version}/{relative_path}.html` -- the `.txt` extension is replaced with `.html`, and the embeddings root directory prefix is stripped. + - **Runbooks**: `https://github.com/openshift/runbooks/blob/master/alerts/{relative_path}` -- the runbooks root directory prefix is stripped. + - **BYOK content**: Defaults to the file path. If YAML frontmatter with a `url` field is present, that URL is used instead. + - **OKP content**: The `reference_url` from the TOML frontmatter `extra` section. + +18. URL reachability is validated via HTTP GET with retries. The lsc library's `MetadataProcessor` uses configurable retries (default 3) with a 30-second timeout. The plaintext pipeline's `ping_url` uses a single attempt with a 30-second timeout. Validation is skipped during hermetic builds in both pipelines. + +19. In the lsc library, unreachable URLs can be handled three ways, controlled by the `unreachable_action` parameter: + - `warn` -- log a warning and include the document in the index (default). + - `fail` -- raise an error and abort. 
+ - `drop` -- exclude the document from the index. + The plaintext pipeline always uses the `warn` behavior -- it logs unreachable URLs and counts them but never drops or fails. + +20. In the lsc library, an ignore list allows specific document titles to bypass URL validation. Documents with titles in the ignore list are included in the index regardless of their `url_reachable` status. + +## Configuration Surface + +- `config/exclude.conf` -- newline-delimited list of relative file paths to exclude from OCP docs after conversion. +- `ocp-product-docs-plaintext/{version}/` directories -- each directory name is an OCP version string. Adding or removing directories changes which versions are indexed. +- `scripts/asciidoctor-text/{version}/attributes.yaml` (and `lsc/scripts/asciidoctor-text/{version}/attributes.yaml`) -- per-version AsciiDoc attribute files providing version-specific substitution values. +- `--hermetic-build` / `-hb` CLI flag -- disables URL reachability validation. +- `--folder` / `-f` CLI flag -- specifies the input document directory. +- `--runbooks` / `-r` CLI flag -- specifies the runbooks directory. + +## Constraints + +1. AsciiDoc conversion requires `asciidoctor` and `ruby` to be installed on the system. These are not Python dependencies and must be available in the build environment. + +2. Content acquisition scripts require network access to clone from GitHub. The production container build uses pre-committed content and does not run these scripts -- they are developer/maintenance tools. + +3. The OCP documentation repository is large. Single-branch cloning (`--single-branch`) is required to keep clone times manageable. + +4. Runbook acquisition uses sparse checkout to avoid downloading irrelevant content from the runbooks repository. diff --git a/.ai/spec/what/embedding-pipeline.md b/.ai/spec/what/embedding-pipeline.md new file mode 100644 index 000000000..854317436 --- /dev/null +++ b/.ai/spec/what/embedding-pipeline.md @@ -0,0 +1,89 @@ +# Embedding Pipeline + +This spec defines the shared behavioral rules that all pipeline implementations (plaintext, HTML, lsc library) must satisfy. Individual pipeline architectures are documented in the corresponding `how/` specs. + +## Behavioral Rules -- Chunking + +1. Documents must be split into chunks before embedding. The default chunk size is configurable via CLI with a default of 380 tokens and 0 overlap. + +2. Plaintext documents use LlamaIndex's default sentence splitter (`SentenceSplitter`), which splits on sentence boundaries and respects the configured chunk size and overlap. + +3. Markdown and HTML documents may use LlamaIndex's `MarkdownNodeParser` for section-aware splitting. This parser splits on Markdown heading boundaries, producing one node per section. + +4. HTML documents may alternatively use a semantic chunking algorithm that respects HTML DOM structure -- splitting at section, table, list, code block, and definition list boundaries -- and preserves HTML anchor IDs for deep-link metadata generation. + +5. Chunks consisting entirely of non-whitespace characters (containing no spaces, tabs, or newlines) must be filtered out as invalid. This catches degenerate chunks produced from non-textual content. + +## Behavioral Rules -- Embedding + +6. Embeddings are generated using a HuggingFace-compatible sentence-transformer model loaded from a local filesystem directory. + +7. The default embedding model is `sentence-transformers/all-mpnet-base-v2`, producing 768-dimensional vectors. 
The model must be redistributable under an Apache 2.0 compatible license. [PLANNED: OLS-1729 -- fine-tuned embedding models] + +8. The embedding dimension is determined dynamically at initialization by encoding a probe string through the model, not hardcoded. This ensures correctness regardless of which model is loaded. + +9. The environment variables `HF_HOME` and `TRANSFORMERS_OFFLINE=1` must be set before loading the model to prevent runtime model downloads and direct HuggingFace Hub to the local model directory. + +10. The LLM setting must be explicitly set to `None` (via `resolve_llm(None)`) to prevent LlamaIndex from attempting to load a language model, which is not needed for embedding generation. + +## Behavioral Rules -- Vector Store Output + +11. The primary vector store backend is FAISS using `IndexFlatIP` (inner product similarity). `IndexFlatIP` requires normalized vectors; the sentence-transformers model produces normalized embeddings by default. + +12. Alternative backends are supported by the lsc library: PostgreSQL via `PGVectorStore`, llama-stack with faiss, and llama-stack with sqlite-vec. + +13. Each index is identified by an index ID string. For OCP product docs, the convention is `ocp-product-docs-{version}` with dots replaced by underscores (e.g., `ocp-product-docs-4_19`). + +14. FAISS indexes persisted via LlamaIndex produce a directory containing: `docstore.json`, `index_store.json`, `graph_store.json`, `vector_store.json`. + +15. A `metadata.json` file must be written alongside the index containing at minimum: execution time, embedding model name, index ID, vector DB type, embedding dimension, chunk size, chunk overlap, and total embedded files count. + +16. llama-stack backends produce a `llama-stack.yaml` configuration file and a provider-specific database file (`faiss_store.db` or `sqlitevec_store.db`) instead of the LlamaIndex JSON files. + +## Behavioral Rules -- Index Organization + +17. Each OCP version gets its own index in a separate directory: `vector_db/ocp_product_docs/{version}/`. + +18. A `latest` symlink must be created pointing to the highest version directory, determined by version-aware sorting. + +19. Runbook embeddings are merged into each OCP version index -- they are combined with OCP doc nodes into a single index, not stored separately. + +## Behavioral Rules -- Metadata per Chunk + +20. Each chunk stored in the vector index must carry metadata including at minimum: `docs_url` (source URL) and `title` (document title). + +21. For HTML pipeline chunks, additional metadata is carried: `section_title` (nearest heading text), `chunk_index` (position within the source document), `total_chunks` (total chunks from that document), `token_count` (tokens in the chunk), and `source_file` (relative path to the source file). + +22. For llama-stack backends, `document_id` must also be present in chunk metadata to support the citation linking mechanism. 
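+
+For orientation, the sketch below shows what the metadata attached to a single chunk might look like once rules 20-22 are applied. It is illustrative only: the field names follow the rules above, but the values (and the specific document) are invented for the example.
+
+```python
+# Hypothetical metadata for one HTML-pipeline chunk destined for a llama-stack backend.
+chunk_metadata = {
+    # Minimum fields carried by every chunk (rule 20).
+    "docs_url": "https://docs.openshift.com/container-platform/4.19/nodes/nodes-viewing.html",
+    "title": "Viewing and listing the nodes in your cluster",
+    # Additional fields carried by HTML-pipeline chunks (rule 21).
+    "section_title": "Listing pods on a node",
+    "chunk_index": 3,
+    "total_chunks": 12,
+    "token_count": 361,
+    "source_file": "nodes/nodes-viewing.html",
+    # Required in addition for llama-stack backends (rule 22).
+    "document_id": "nodes/nodes-viewing",
+}
+```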
+ +## Configuration Surface + +| CLI Argument | Default | Purpose | +|---|---|---| +| `--folder` / `-f` | (required) | Input document directory | +| `--model-dir` / `-md` | `embeddings_model` | Path to the HuggingFace-compatible embedding model directory | +| `--model-name` / `-mn` | (none) | HuggingFace repo ID of the embedding model (for metadata recording) | +| `--chunk` / `-c` | `380` | Chunk size in tokens | +| `--overlap` / `-l` | `0` | Chunk overlap in tokens | +| `--output` / `-o` | (required) | Vector DB output directory | +| `--index` / `-i` | (required) | Index ID string | +| `--vector-store-type` | `faiss` | Backend: `faiss`, `postgres`, `llamastack-faiss`, `llamastack-sqlite-vec` | +| `--auto-chunking` | `False` | Delegate chunking to llama-stack runtime (lsc library only) | +| `--hermetic-build` / `-hb` | `False` | Skip URL reachability checks | +| `--workers` / `-w` | `None` | Number of workers for parallel document loading | + +## Constraints + +1. The embedding model used to build an index must be identical to the model used by lightspeed-service for query embedding. A model mismatch produces meaningless similarity scores. + +2. FAISS indexes are loaded read-only by lightspeed-service. They must never be modified at runtime. + +3. The `--auto-chunking` flag only applies to llama-stack backends. FAISS always uses manual (pre-split) chunking. + +4. PostgreSQL backend configuration is read from environment variables (`POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_DATABASE`), not CLI arguments. + +## Planned Changes + +- [PLANNED: OLS-1729] Embedding model fine-tuning -- generate domain-specific vocabulary from OCP docs and augment the embedding model's vocabulary corpus. +- [PLANNED: OLS-2294] Add metadata generation stage -- add a dedicated metadata generation/enrichment step to the pipeline. +- [PLANNED: OLS-2903] OKP-based RAG -- integrate OKP (OpenShift Knowledge Platform) errata content. diff --git a/.ai/spec/what/system-overview.md b/.ai/spec/what/system-overview.md new file mode 100644 index 000000000..3307610b8 --- /dev/null +++ b/.ai/spec/what/system-overview.md @@ -0,0 +1,61 @@ +# System Overview + +OpenShift LightSpeed RAG Content is a build-time artifact producer for the OpenShift LightSpeed AI assistant. It converts OpenShift product documentation and operational runbooks into pre-built FAISS vector indexes, packages them alongside the embedding model into container images, and publishes those images for consumption by the lightspeed-service at runtime. + +## Behavioral Rules + +1. The project produces pre-built FAISS vector indexes from three content sources: OCP product documentation, OpenShift alert runbooks, and customer-supplied Markdown (BYOK). The indexes are packaged as container images consumed by lightspeed-service. + +2. The project is a build-time artifact producer. All computation -- document conversion, chunking, embedding generation, and vector store creation -- happens during the container image build or via offline scripts. The project never runs at runtime. + +3. Two container image artifacts are produced: + - **Main RAG content image**: Contains all OCP version vector indexes, the `latest` symlink, and the embedding model. Consumed by lightspeed-service as a volume mount. + - **BYOK tool image**: Contains buildah, the embedding toolchain, and the embedding model. Used by customers to build custom RAG images from their own Markdown content. + +4. 
Three pipeline implementations exist for generating vector indexes, each producing FAISS-compatible output: + - **Plaintext pipeline** (`scripts/generate_embeddings.py`): The production pipeline used by the main Containerfile. Processes pre-converted plaintext OCP docs and Markdown runbooks. + - **HTML pipeline** (`scripts/html_embeddings/`): Downloads HTML documentation from the Red Hat portal, strips non-content markup, performs semantic HTML chunking, and generates embeddings. + - **lsc library** (`lsc/src/lightspeed_rag_content/`): An installable Python library supporting multiple vector store backends (FAISS, PostgreSQL, llama-stack faiss, llama-stack sqlite-vec). + +5. The embedding model must be redistributable under an Apache 2.0 compatible license. + +6. The project supports two compute flavors: CPU and GPU. The CPU flavor uses a PyTorch build from pytorch.org without CUDA. The GPU flavor uses standard PyTorch with NVIDIA CUDA 12.9. + +## Integration Contract + +### Consumed by lightspeed-service + +The main RAG content image is mounted as a volume by lightspeed-service (typically via the OpenShift LightSpeed operator). The service reads: + +- `/rag/vector_db/ocp_product_docs/{version}/` -- persisted FAISS vector store files (`docstore.json`, `index_store.json`, `graph_store.json`, `vector_store.json`, `metadata.json`). +- `/rag/vector_db/ocp_product_docs/latest` -- symlink to the highest OCP version directory. +- `/rag/embeddings_model/` -- the HuggingFace-compatible sentence-transformer model used to embed user queries at runtime. + +The service loads these indexes read-only at startup via LlamaIndex's `StorageContext.from_defaults()` and uses the same embedding model to encode queries for vector similarity search. + +### Integration invariant + +The embedding model used to generate the indexes must be identical to the model used by lightspeed-service for query embedding. A mismatch produces meaningless similarity scores. The model identity is currently enforced by shipping the model inside the RAG content image and configuring the service to use it via `ols_config.reference_content.embeddings_model_path`. + +### Operator integration + +The OpenShift LightSpeed operator configures RAG content references via the CRD. Each index entry specifies `product_docs_index_path`, `product_docs_index_id`, and `product_docs_origin`. The operator mounts the RAG content image and maps these paths. [PLANNED: OLS-1812 -- per-index embedding model path in CRD] + +## Constraints + +1. Python 3.11 is required (`requires-python = "==3.11.*"`). + +2. Both CPU and GPU compute flavors must be supported. The Containerfile selects the base image via the `FLAVOR` build arg. + +3. Hermetic builds (no network access during build) must be supported for Konflux/Cachi2 CI. All dependencies -- Python packages, RPMs, and the embedding model binary -- must be prefetchable. + +4. The project uses PDM for dependency management with separate lockfiles per compute flavor (`pdm.lock.cpu`, `pdm.lock.gpu`). + +## Planned Changes + +- [PLANNED: OLS-2294] Add metadata generation stage to the pipeline. +- [PLANNED: OLS-1729] Embedding model fine-tuning -- use fine-tuned models for better domain-specific retrieval accuracy. +- [PLANNED: OLS-2903] OKP-based RAG -- propose plans to use OKP (OpenShift Knowledge Platform) content with OLS. +- [PLANNED: OLS-2704] RAG as a service / MCP -- externalize RAG retrieval behind an MCP interface. +- [PLANNED: OCPSTRAT-1495] Include OCP KCS (Knowledge-Centered Service) content in OLS. 
+- [PLANNED: OCPSTRAT-1492] Include OCP layered product knowledge (CNV, ACM, RHOSO) in OLS.
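+
+## Appendix -- Illustrative Index Loading (Non-Normative)
+
+To make the integration contract above concrete, the sketch below shows how a consumer such as lightspeed-service might load one of the produced indexes read-only. It assumes LlamaIndex's standard FAISS persistence APIs and the container paths from the integration contract; the service's actual loading code may differ.
+
+```python
+"""Non-normative sketch: load a pre-built FAISS index read-only for retrieval."""
+
+from llama_index.core import Settings, StorageContext, load_index_from_storage
+from llama_index.core.llms.utils import resolve_llm
+from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+from llama_index.vector_stores.faiss import FaissVectorStore
+
+# Paths follow the integration contract; in a real deployment they are mounted
+# from the RAG content image and configured via the operator/service config.
+EMBEDDINGS_MODEL_DIR = "/rag/embeddings_model"
+INDEX_DIR = "/rag/vector_db/ocp_product_docs/latest"
+
+# The query-time embedding model must be the same model the index was built with.
+Settings.embed_model = HuggingFaceEmbedding(model_name=EMBEDDINGS_MODEL_DIR)
+# No language model is needed for retrieval-only use.
+Settings.llm = resolve_llm(None)
+
+vector_store = FaissVectorStore.from_persist_dir(INDEX_DIR)
+storage_context = StorageContext.from_defaults(
+    vector_store=vector_store, persist_dir=INDEX_DIR
+)
+index = load_index_from_storage(storage_context)
+
+retriever = index.as_retriever(similarity_top_k=3)
+for node_with_score in retriever.retrieve("How do I add a node to my cluster?"):
+    print(node_with_score.node.metadata.get("docs_url"), node_with_score.score)
+```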