diff --git a/.ai/spec/README.md b/.ai/spec/README.md new file mode 100644 index 000000000..88504048e --- /dev/null +++ b/.ai/spec/README.md @@ -0,0 +1,41 @@ +# OpenShift LightSpeed RAG Content -- Specifications + +These specs define the requirements, behaviors, and architecture for the lightspeed-rag-content project. They are organized into two layers: + +- **[`what/`](what/README.md)** -- Behavioral rules: WHAT the system must do and WHY. Technology-neutral, testable assertions. Use these to understand requirements, fix bugs, or rebuild components. +- **[`how/`](how/README.md)** -- Architecture specs: HOW the current implementation is structured. Module boundaries, data flow, design patterns. Use these to navigate, modify, and extend the codebase. + +## Scope + +These specs cover the **lightspeed-rag-content** project only -- the offline pipeline that produces pre-built vector indexes and packages them as container images. The lightspeed-service (which consumes these artifacts at runtime), the operator, and the console plugin are separate projects. + +## Audience + +AI agents (Claude). Specs optimize for precision, unambiguous rules, and machine-parseable structure. + +## Quick Start + +| I want to... | Read | +|--------------|------| +| Understand what this project does | `what/system-overview.md` | +| Understand content sources and acquisition | `what/content-sources.md` | +| Understand the embedding pipeline rules | `what/embedding-pipeline.md` | +| Understand BYOK (customer content) | `what/byok.md` | +| Understand the container build process | `what/container-build.md` | +| Navigate the codebase | `how/project-structure.md` | +| Modify the plaintext pipeline | `how/plaintext-pipeline.md` | +| Modify the HTML pipeline | `how/html-pipeline.md` | +| Modify the lsc library | `how/lsc-library.md` | +| Modify the container build or CI | `how/container-build.md` | +| See what's planned | Look for `[PLANNED: OLS-XXXX]` in `what/` specs | + +## Conventions + +- `[PLANNED: OLS-XXXX]` markers in `what/` specs indicate existing rules about to change due to open Jira work. +- "Planned Changes" sections list new capabilities not yet in code. +- User-configurable values are referenced by CLI argument name or environment variable name. +- Internal constants are stated as behavioral rules without numeric values; `how/` specs may include specific values. + +## Relationship to lightspeed-service + +This project produces artifacts consumed by lightspeed-service. The service's `what/rag.md` spec describes how it loads and queries these indexes at runtime. This project's specs describe how the indexes are built. The integration contract is documented in `what/system-overview.md`. diff --git a/.ai/spec/how/README.md b/.ai/spec/how/README.md new file mode 100644 index 000000000..6e5ee1052 --- /dev/null +++ b/.ai/spec/how/README.md @@ -0,0 +1,25 @@ +# Architecture Specifications (how/) + +These specs describe HOW the RAG content pipeline is structured -- module boundaries, data flow, design patterns, key abstractions, and implementation decisions. They are grounded in the current Python codebase and should be updated when the code changes. 
+ +## Spec Index + +| Spec | Description | +|------|-------------| +| [project-structure.md](project-structure.md) | Directory layout, module map, dependency management, key relationships | +| [plaintext-pipeline.md](plaintext-pipeline.md) | `scripts/generate_embeddings.py` -- the production pipeline used by the Containerfile | +| [html-pipeline.md](html-pipeline.md) | `scripts/html_embeddings/` + `scripts/html_chunking/` -- HTML-based pipeline with semantic chunking | +| [lsc-library.md](lsc-library.md) | `lsc/src/lightspeed_rag_content/` -- installable library with multi-backend support | +| [container-build.md](container-build.md) | Containerfiles, Makefile targets, Konflux/Tekton pipelines, dependency management | + +## When to Read These + +- **Navigating the codebase**: Start with `project-structure.md` to understand where things live. +- **Modifying a pipeline**: Read the relevant pipeline spec to understand the current architecture before making changes. +- **Adding a new vector store backend**: Read `lsc-library.md` for the `_BaseDB` extension pattern. +- **Debugging**: The data flow sections trace the exact path documents take through the pipeline. +- **Changing the build**: Read `container-build.md` for Containerfile stages and Konflux pipeline structure. + +## Relationship to what/ Specs + +The [`what/` specs](../what/README.md) define behavioral contracts (technology-neutral). These `how/` specs describe the implementation that fulfills those contracts. When the two diverge, the `what/` spec is the source of truth for correct behavior, and the `how/` spec should be updated to reflect the current code. diff --git a/.ai/spec/how/container-build.md b/.ai/spec/how/container-build.md new file mode 100644 index 000000000..c84548376 --- /dev/null +++ b/.ai/spec/how/container-build.md @@ -0,0 +1,216 @@ +# Container Build -- Architecture + +This spec documents the Containerfiles, Makefile targets, and Konflux/Tekton pipeline configurations that build and publish the project's container images. 
+
+## Module Map
+
+| Path | Purpose |
+|---|---|
+| `Containerfile` | Main RAG content image -- multi-stage build (builder → minimal) |
+| `byok/Containerfile.tool` | BYOK tool image -- buildah + Python + model + script |
+| `byok/Containerfile.output` | BYOK output image template -- vectors only, built inside tool container |
+| `Makefile` | Developer-facing build automation |
+| `.tekton/lightspeed-ocp-rag-push.yaml` | Konflux push pipeline for main RAG image |
+| `.tekton/lightspeed-ocp-rag-pull-request.yaml` | Konflux PR pipeline for main RAG image |
+| `.tekton/lightspeed-rag-tool-push.yaml` | Konflux push pipeline for BYOK tool image |
+| `.tekton/lightspeed-rag-tool-pull-request.yaml` | Konflux PR pipeline for BYOK tool image |
+| `.tekton/own-app-lightspeed-rag-content-push.yaml` | Alternative build variant push pipeline |
+| `.tekton/own-app-lightspeed-rag-content-pull-request.yaml` | Alternative build variant PR pipeline |
+| `.tekton/integration-tests/lightspeed-rag-content-image-verification.yaml` | Integration test -- validates image contents |
+| `pyproject.toml` | PDM project metadata, dependency groups, linting config |
+| `requirements.cpu.txt` / `requirements.gpu.txt` | Exported pip dependencies with hashes |
+| `pdm.lock.cpu` / `pdm.lock.gpu` | PDM lockfiles per compute flavor |
+| `rpms.in.yaml` / `rpms.lock.yaml` | RPM dependency spec + lockfile for Cachi2 |
+| `artifacts.lock.yaml` | Pinned model.safetensors URL + SHA256 |
+| `renovate.json` | Dependency update automation config |
+
+## Main Containerfile -- Build Stages
+
+### Stage 1: Base image selection
+
+Two named stages define the base images. The `FLAVOR` build arg (default: `cpu`) selects which one is used:
+
+```dockerfile
+FROM registry.access.redhat.com/ubi9/python-311 as cpu-base
+FROM nvcr.io/nvidia/cuda:12.9.1-devel-ubi9 as gpu-base
+FROM ${FLAVOR}-base as lightspeed-rag-builder
+```
+
+The GPU base installs additional system packages: `python3.11`, `python3.11-pip`, `libcudnn9`, `libnccl`, `libcusparselt0`.
+
+### Stage 2: Builder (`lightspeed-rag-builder`)
+
+```
+USER 0, WORKDIR /workdir
+
+1. pip install requirements.gpu.txt
+2. Symlink NLTK data:
+   ln -s .../site-packages/llama_index/core/_static/nltk_cache /root/nltk_data
+3. COPY ocp-product-docs-plaintext, runbooks, embeddings_model
+4. Acquire model.safetensors:
+   HERMETIC=true  → cp /cachi2/output/deps/generic/model.safetensors
+   HERMETIC=false → curl from HuggingFace (pinned commit SHA)
+5. GPU validation (FLAVOR=gpu only):
+   python3.11 -c "import torch; print(torch.version.cuda); print(torch.cuda.is_available())"
+   (requires LD_LIBRARY_PATH=/usr/local/cuda-12/compat)
+6. COPY scripts/generate_embeddings.py
+7. For each VERSION in $(ls -1 ocp-product-docs-plaintext):
+   python3.11 generate_embeddings.py \
+     -f ocp-product-docs-plaintext/${VERSION} \
+     -r runbooks/alerts \
+     -md embeddings_model \
+     -mn ${EMBEDDING_MODEL} \
+     -o vector_db/ocp_product_docs/${VERSION} \
+     -i ocp-product-docs-$(echo $VERSION | sed 's/\./_/g') \
+     -v ${VERSION} \
+     -hb $HERMETIC
+8. Create latest symlink:
+   LATEST=$(ls -1 vector_db/ocp_product_docs/ | sort -V | tail -n 1)
+   ln -s ${LATEST} vector_db/ocp_product_docs/latest
+```
+
+### Stage 3: Final image
+
+```dockerfile
+FROM registry.access.redhat.com/ubi9/ubi-minimal@sha256:{digest}
+COPY --from=lightspeed-rag-builder /workdir/vector_db/ocp_product_docs /rag/vector_db/ocp_product_docs
+COPY --from=lightspeed-rag-builder /workdir/embeddings_model /rag/embeddings_model
+RUN mkdir /licenses
+COPY LICENSE /licenses/
+# Enterprise contract labels (com.redhat.component, cpe, vendor, etc.)
+USER 65532:65532
+```
+
+The `ubi-minimal` image is pinned by SHA256 digest. Digest updates are managed by automated Konflux/Mintmaker PRs.
+
+## BYOK Containerfile.tool
+
+```
+FROM ubi9/ubi:latest
+ ├── dnf install buildah python3.11 python3.11-pip
+ ├── pip install requirements.cpu.txt (--no-deps)
+ ├── COPY embeddings_model
+ ├── Acquire model.safetensors (same HERMETIC logic as main Containerfile)
+ ├── COPY byok/generate_embeddings_tool.py, byok/Containerfile.output
+ ├── Enterprise contract labels
+ ├── Set environment:
+ │     _BUILDAH_STARTED_IN_USERNS=""
+ │     BUILDAH_ISOLATION=chroot
+ │     OUT_IMAGE_TAG, BYOK_TOOL_IMAGE, UBI_BASE_IMAGE, LOG_LEVEL, VECTOR_DB_INDEX
+ └── CMD: buildah build \
+       --build-arg BYOK_TOOL_IMAGE=$BYOK_TOOL_IMAGE \
+       --build-arg UBI_BASE_IMAGE=$UBI_BASE_IMAGE \
+       --env VECTOR_DB_INDEX=$VECTOR_DB_INDEX \
+       -t $OUT_IMAGE_TAG -f Containerfile.output \
+       -v /markdown:/markdown:Z . \
+     && buildah push $OUT_IMAGE_TAG docker-archive:/output/$OUT_IMAGE_TAG.tar
+```
+
+## BYOK Containerfile.output
+
+```
+FROM ${BYOK_TOOL_IMAGE} as tool
+  USER 0, WORKDIR /workdir
+  RUN python3.11 generate_embeddings_tool.py \
+        -i /markdown -emd embeddings_model \
+        -emn sentence-transformers/all-mpnet-base-v2 \
+        -o vector_db -id $VECTOR_DB_INDEX
+
+FROM ${UBI_BASE_IMAGE}
+  COPY --from=tool /workdir/vector_db /rag/vector_db
+```
+
+## Makefile Targets
+
+| Target | Command | Purpose |
+|---|---|---|
+| `install-tools` | `pip3.11 install pdm` | Install PDM if not present |
+| `pdm-lock-check` | `pdm lock --check --group {cpu,gpu}` | Validate both lockfiles |
+| `install-deps` | `pdm sync --group $(TORCH_GROUP) --lockfile pdm.lock.$(TORCH_GROUP)` | Install runtime deps |
+| `install-deps-test` | `pdm sync --dev --group $(TORCH_GROUP) ...` | Install dev deps |
+| `update-deps` | `pdm update --update-all ... && pdm export ...` | Update + regenerate requirements.*.txt |
+| `check-types` | `mypy --explicit-package-bases scripts` | Type checking |
+| `format` | `black scripts && ruff check scripts --fix` | Code formatting |
+| `verify` | `black --check scripts && ruff check scripts` | Lint verification |
+| `update-docs` | Loop: `get_ocp_plaintext_docs.sh $V` + `get_runbooks.sh` | Refresh committed content |
+| `update-model` | `python scripts/download_embeddings_model.py` | Download embedding model |
+| `build-image` | `podman build -t rag-content .` | Local container build |
+| `model-safetensors` | `wget model.safetensors` if not present | Download model binary |
+
+The `FLAVOR` variable (default: `cpu`) maps to `TORCH_GROUP`, which selects the lockfile and requirements file. The `verify` and `format` targets apply `--per-file-ignores=scripts/*:S101` to allow assert statements in scripts.
+
+## Konflux Pipeline Structure
+
+All six pipelines are Tekton PipelineRun definitions that follow the same pattern:
+
+### Prefetch dependencies
+
+Cachi2 prefetches three dependency types:
+- **pip**: From `requirements.{cpu|gpu}.txt` with hashes.
+- **rpm**: From `rpms.lock.yaml`.
+- **generic**: From `artifacts.lock.yaml` (model.safetensors URL + SHA256).
+
+### Build
+
+Uses the `buildah` task with:
+- `hermetic=true` -- network-isolated build.
+- Build args: `FLAVOR=gpu`, `HERMETIC=true`.
+- The prefetched dependencies are injected into the build context.
+
+### Post-build
+
+- **Source image**: Created for artifact provenance tracking.
+- **Label check**: Validates enterprise contract labels.
+- **Integration test** (push pipelines only): Runs `lightspeed-rag-content-image-verification.yaml`.
+
+### Integration test
+
+`lightspeed-rag-content-image-verification.yaml` is a Tekton Task that:
+1. Mounts the built image.
+2. Checks for `/rag/vector_db/{version}/index_store.json` for at least one OCP version.
+3. Checks for `/rag/embeddings_model/config.json`.
+4. Fails if either path is missing.
+
+## Dependency Management Flow
+
+```
+pyproject.toml
+├── [project.dependencies]               Core deps (llama-index, faiss, etc.)
+├── [project.optional-dependencies]
+│     cpu = [torch @ https://...cpu...]  CPU PyTorch wheel (pinned URL + hash)
+│     gpu = [torch==2.6.0]               GPU PyTorch from PyPI
+└── [tool.pdm.dev-dependencies]
+      dev = [black, mypy, ruff, types-requests]
+
+        │
+        ▼
+pdm lock → pdm.lock.cpu / pdm.lock.gpu
+        │
+        ▼
+pdm export → requirements.cpu.txt / requirements.gpu.txt
+             (with --hashes for pip install verification)
+
+rpms.in.yaml → rpms.lock.yaml
+               (Cachi2 RPM resolution for container build)
+
+artifacts.lock.yaml
+               (model.safetensors URL + SHA256 for Cachi2 generic artifact)
+
+renovate.json:
+  - Python package auto-updates: DISABLED
+  - Konflux references: auto-updated
+```
+
+## Implementation Notes
+
+- The main Containerfile always installs `requirements.gpu.txt` regardless of `FLAVOR`. The `FLAVOR` arg only affects the base image selection (CPU vs GPU). This means the CPU builder installs GPU-compatible torch, which works but is larger than necessary.
+
+- The `--no-deps` flag is used in the BYOK tool's `pip install` but NOT in the main Containerfile. In the BYOK tool it prevents pip from pulling transitive dependencies that might conflict with the locked set.
+
+- `generate_packages_to_prefetch.py` (in `lsc/scripts/`) is a complex script for Cachi2 hermetic build preparation. It copies the project stub, removes torch from `pyproject.toml`, runs `pip-compile` to generate `requirements.txt`, strips the torch and nvidia packages from the result, downloads the CPU torch wheel from PyPI separately, computes its hash, and generates `requirements-build.txt`. This script is not invoked during the container build itself -- it is a developer tool for maintaining the Cachi2 prefetch inputs.
+
+- The NLTK data symlink (`ln -s .../nltk_cache /root/nltk_data`) is required because LlamaIndex's sentence tokenizer depends on NLTK's `punkt` tokenizer data. The data is bundled with the llama-index-core package but needs to be discoverable at the default NLTK data path.
+
+- GPU builds set `LD_LIBRARY_PATH=/usr/local/cuda-12/compat` for CUDA library discovery. This is needed both during the torch validation step and during the embedding generation loop.
+
+- The `ubi-minimal` final image digest is periodically updated by Konflux/Mintmaker automation, which submits PRs to update the `@sha256:...` pinning.
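+
+To make the `artifacts.lock.yaml` pinning concrete, here is a minimal verification sketch. It assumes the Cachi2 generic-artifact layout (`artifacts:` entries with `download_url` and `checksum` fields); those field names and the local model path are assumptions, not values confirmed from this repository:
+
+```python
+# Sketch: verify a local model.safetensors against the checksum pinned in
+# artifacts.lock.yaml. Field names and paths below are assumptions.
+import hashlib
+from pathlib import Path
+
+import yaml
+
+
+def verify_artifact(lockfile: str, local_file: str) -> bool:
+    lock = yaml.safe_load(Path(lockfile).read_text())
+    entry = lock["artifacts"][0]  # the single pinned artifact: model.safetensors
+    algo, _, expected = entry["checksum"].partition(":")  # e.g. "sha256:<hex digest>"
+    digest = hashlib.new(algo)
+    with open(local_file, "rb") as f:
+        for block in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB blocks
+            digest.update(block)
+    return digest.hexdigest() == expected
+
+
+if __name__ == "__main__":
+    print(verify_artifact("artifacts.lock.yaml", "embeddings_model/model.safetensors"))
+```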
diff --git a/.ai/spec/how/html-pipeline.md b/.ai/spec/how/html-pipeline.md new file mode 100644 index 000000000..1f92bf88e --- /dev/null +++ b/.ai/spec/how/html-pipeline.md @@ -0,0 +1,163 @@ +# HTML Pipeline -- Architecture + +This spec documents the HTML-based embedding pipeline in `scripts/html_embeddings/` and the semantic chunking library in `scripts/html_chunking/`. This pipeline downloads HTML documentation from the Red Hat portal, strips non-content markup, performs semantic HTML chunking that preserves document structure and anchor IDs, and generates FAISS vector indexes. + +## Module Map + +| Path | Key Symbols | +|---|---| +| `scripts/html_embeddings/generate_embeddings.py` | `main()`, `setup_environment()`, `run_download_step()`, `run_strip_step()`, `run_chunk_step()`, `run_runbooks_step()`, `run_embedding_step()`, `load_chunks_as_nodes()` | +| `scripts/html_embeddings/download_docs.py` | `download_documentation()` | +| `scripts/html_embeddings/strip_html.py` | `strip_html_content()` | +| `scripts/html_embeddings/chunk_html.py` | `chunk_html_documents()`, `chunk_single_html_file()`, `extract_metadata_from_path()`, `validate_chunks()`, `get_chunking_stats()` | +| `scripts/html_embeddings/process_runbooks.py` | `process_runbooks()` | +| `scripts/html_embeddings/utils.py` | `setup_logging()`, `create_directory_structure()`, `validate_dependencies()`, `sanitize_directory_path()` | +| `scripts/html_chunking/chunker.py` | `chunk_html()`, `ChunkingOptions`, `Chunk`, `find_first_anchor()`, `get_document_title()`, `_split_element_by_children()`, `_split_element_by_children_no_grouping()`, `_split_table()`, `_split_list()`, `_split_code()`, `_split_definition_list()`, `_linear_split()`, `_get_anchored_url()` | +| `scripts/html_chunking/tokenizer.py` | `count_html_tokens()` | + +## Data Flow -- 5-Step Pipeline + +``` +1. DOWNLOAD + Input: --doc-url | --doc-url-slug | --config-file + Output: cache/{slug}/{version}/downloads/*.html + Action: Fetch HTML pages from docs.redhat.com portal. + Supports single URL, slug+version, or batch config file. + +2. STRIP + Input: cache/{slug}/{version}/downloads/*.html + Output: cache/{slug}/{version}/stripped/*.html + Action: Remove non-content HTML (navigation, header, footer, + scripts, styles). Preserve document body structure. + +3. CHUNK + Input: cache/{slug}/{version}/stripped/*.html + Output: cache/{slug}/{version}/chunks/{doc_name}/*_chunk_NNNN.json + Action: Semantic HTML chunking → individual JSON chunk files. + Each chunk carries metadata (docs_url, title, section_title, + chunk_index, token_count). + +4. RUNBOOKS + Input: --runbooks-dir (default: ./runbooks) + Output: cache/{slug}/{version}/chunks/*.json (flat, at base level) + Action: Convert Markdown runbooks to JSON chunk files. + Stored flat (not in doc-specific subdirectories). + +5. EMBED + Input: cache/{slug}/{version}/chunks/**/*.json + Output: --output-dir containing FAISS index + metadata.json + Action: Load JSON chunks as TextNode objects, create + VectorStoreIndex, persist to output directory. +``` + +## Key Abstractions + +### HTML Chunking Library (`scripts/html_chunking/`) + +The chunker operates on parsed HTML DOM trees via BeautifulSoup. The algorithm is recursive and structure-aware. + +**Entry point**: `chunk_html(html_content, source_url, max_token_limit, count_tag_tokens) -> list[Chunk]` + +**Short-circuit**: If the entire document fits within `max_token_limit`, it is returned as a single chunk. 
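+
+For orientation, a minimal calling sketch built from the signature above -- the file path, URL, and token limit are illustrative values, and the import assumes `scripts/` is on `sys.path`:
+
+```python
+# Sketch: invoke the chunker on one stripped file, as the CHUNK step would.
+# Path, URL, and max_token_limit are illustrative, not repo constants.
+from pathlib import Path
+
+from html_chunking.chunker import chunk_html  # assumes scripts/ is on sys.path
+
+html = Path("cache/openshift/4.18/stripped/about.html").read_text()
+chunks = chunk_html(
+    html,
+    source_url="https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html-single/about/index",
+    max_token_limit=380,
+    count_tag_tokens=True,
+)
+# Short-circuit case: a document that fits the limit yields exactly one chunk.
+print(f"{len(chunks)} chunk(s)")
+```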
+
+**Primary splitter**: `_split_element_by_children(element, options)` iterates over direct children of an HTML element, accumulating them into a chunk until the token limit is exceeded. Special grouping rules:
+- **Sections with IDs**: Processed recursively after flushing the current chunk. Section context (ID) is tracked and wrapped around chunk content.
+- **Headings** (h1-h6): Grouped with the following sibling element to keep heading + first content together.
+- **Paragraphs ending with `:`**: Grouped with the following table, list, or definition list to keep introductory text + content together.
+- **Oversized children**: Recursively split via `_split_element_by_children_no_grouping()`.
+
+**Secondary splitter**: `_split_element_by_children_no_grouping(element, options)` accumulates children without grouping heuristics. Delegates to specialized splitters for structured elements:
+- `_split_table(table, options)` -- Splits by rows, preserving the `<thead>` header in every chunk. Oversized rows are split by cells.
+- `_split_list(list_element, options)` -- Splits `<ul>`/`<ol>` lists