openshift · joshuawilson · Apr 20, 2026
diff --git a/.ai/spec/README.md b/.ai/spec/README.md
@@ -0,0 +1,41 @@
+# OpenShift LightSpeed RAG Content -- Specifications
+
+These specs define the requirements, behaviors, and architecture for the lightspeed-rag-content project. They are organized into two layers:
+
+- **[`what/`](what/README.md)** -- Behavioral rules: WHAT the system must do and WHY. Technology-neutral, testable assertions. Use these to understand requirements, fix bugs, or rebuild components.
+- **[`how/`](how/README.md)** -- Architecture specs: HOW the current implementation is structured. Module boundaries, data flow, design patterns. Use these to navigate, modify, and extend the codebase.
+
+## Scope
+
+These specs cover the **lightspeed-rag-content** project only -- the offline pipeline that produces pre-built vector indexes and packages them as container images. The lightspeed-service (which consumes these artifacts at runtime), the operator, and the console plugin are separate projects.
+
+## Audience
+
+AI agents (Claude). Specs optimize for precision, unambiguous rules, and machine-parseable structure.
+
+## Quick Start
+
+| I want to... | Read |
+|--------------|------|
+| Understand what this project does | `what/system-overview.md` |
+| Understand content sources and acquisition | `what/content-sources.md` |
+| Understand the embedding pipeline rules | `what/embedding-pipeline.md` |
+| Understand BYOK (customer content) | `what/byok.md` |
+| Understand the container build process | `what/container-build.md` |
+| Navigate the codebase | `how/project-structure.md` |
+| Modify the plaintext pipeline | `how/plaintext-pipeline.md` |
+| Modify the HTML pipeline | `how/html-pipeline.md` |
+| Modify the lsc library | `how/lsc-library.md` |
+| Modify the container build or CI | `how/container-build.md` |
+| See what's planned | Look for `[PLANNED: OLS-XXXX]` in `what/` specs |
+
+## Conventions
+
+- `[PLANNED: OLS-XXXX]` markers in `what/` specs indicate existing rules about to change due to open Jira work.
+- "Planned Changes" sections list new capabilities not yet in code.
+- User-configurable values are referenced by CLI argument name or environment variable name.
+- In `what/` specs, internal constants are stated as behavioral rules without numeric values; `how/` specs may include the specific values.
+
+## Relationship to lightspeed-service
+
+This project produces artifacts consumed by lightspeed-service. The service's `what/rag.md` spec describes how it loads and queries these indexes at runtime. This project's specs describe how the indexes are built. The integration contract is documented in `what/system-overview.md`.
diff --git a/.ai/spec/how/README.md b/.ai/spec/how/README.md
@@ -0,0 +1,25 @@
+# Architecture Specifications (how/)
+
+These specs describe HOW the RAG content pipeline is structured -- module boundaries, data flow, design patterns, key abstractions, and implementation decisions. They are grounded in the current Python codebase and should be updated when the code changes.
+
+## Spec Index
+
+| Spec | Description |
+|------|-------------|
+| [project-structure.md](project-structure.md) | Directory layout, module map, dependency management, key relationships |
+| [plaintext-pipeline.md](plaintext-pipeline.md) | `scripts/generate_embeddings.py` -- the production pipeline used by the Containerfile |
+| [html-pipeline.md](html-pipeline.md) | `scripts/html_embeddings/` + `scripts/html_chunking/` -- HTML-based pipeline with semantic chunking |
+| [lsc-library.md](lsc-library.md) | `lsc/src/lightspeed_rag_content/` -- installable library with multi-backend support |
+| [container-build.md](container-build.md) | Containerfiles, Makefile targets, Konflux/Tekton pipelines, dependency management |
+
+## When to Read These
+
+- **Navigating the codebase**: Start with `project-structure.md` to understand where things live.
+- **Modifying a pipeline**: Read the relevant pipeline spec to understand the current architecture before making changes.
+- **Adding a new vector store backend**: Read `lsc-library.md` for the `_BaseDB` extension pattern.
+- **Debugging**: The data flow sections trace the exact path documents take through the pipeline.
+- **Changing the build**: Read `container-build.md` for Containerfile stages and Konflux pipeline structure.
+
+## Relationship to what/ Specs
+
+The [`what/` specs](../what/README.md) define behavioral contracts (technology-neutral). These `how/` specs describe the implementation that fulfills those contracts. When the two diverge, the `what/` spec is the source of truth for correct behavior; the `how/` spec describes the code as it currently is and should be updated whenever the code changes.
diff --git a/.ai/spec/how/container-build.md b/.ai/spec/how/container-build.md
@@ -0,0 +1,216 @@
+# Container Build -- Architecture
+
+This spec documents the Containerfiles, Makefile targets, and Konflux/Tekton pipeline configurations that build and publish the project's container images.
+
+## Module Map
+
+| Path | Purpose |
+|---|---|
+| `Containerfile` | Main RAG content image -- multi-stage build (builder → minimal) |
+| `byok/Containerfile.tool` | BYOK tool image -- buildah + Python + model + script |
+| `byok/Containerfile.output` | BYOK output image template -- vectors only, built inside tool container |
+| `Makefile` | Developer-facing build automation |
+| `.tekton/lightspeed-ocp-rag-push.yaml` | Konflux push pipeline for main RAG image |
+| `.tekton/lightspeed-ocp-rag-pull-request.yaml` | Konflux PR pipeline for main RAG image |
+| `.tekton/lightspeed-rag-tool-push.yaml` | Konflux push pipeline for BYOK tool image |
+| `.tekton/lightspeed-rag-tool-pull-request.yaml` | Konflux PR pipeline for BYOK tool image |
+| `.tekton/own-app-lightspeed-rag-content-push.yaml` | Alternative build variant push pipeline |
+| `.tekton/own-app-lightspeed-rag-content-pull-request.yaml` | Alternative build variant PR pipeline |
+| `.tekton/integration-tests/lightspeed-rag-content-image-verification.yaml` | Integration test -- validates image contents |
+| `pyproject.toml` | PDM project metadata, dependency groups, linting config |
+| `requirements.cpu.txt` / `requirements.gpu.txt` | Exported pip dependencies with hashes |
+| `pdm.lock.cpu` / `pdm.lock.gpu` | PDM lockfiles per compute flavor |
+| `rpms.in.yaml` / `rpms.lock.yaml` | RPM dependency spec + lockfile for Cachi2 |
+| `artifacts.lock.yaml` | Pinned model.safetensors URL + SHA256 |
+| `renovate.json` | Dependency update automation config |
+
+## Main Containerfile -- Build Stages
+
+### Stage 1: Base image selection
+
+Two named stages define the base images. The `FLAVOR` build arg (default: `cpu`) selects which one is used:
+
+```dockerfile
+FROM registry.access.redhat.com/ubi9/python-311 as cpu-base
+FROM nvcr.io/nvidia/cuda:12.9.1-devel-ubi9 as gpu-base
+FROM ${FLAVOR}-base as lightspeed-rag-builder
+```
+
+The GPU base installs additional system packages: `python3.11`, `python3.11-pip`, `libcudnn9`, `libnccl`, `libcusparselt0`.
+
+### Stage 2: Builder (`lightspeed-rag-builder`)
+
+```
+USER 0, WORKDIR /workdir
+
+1. pip install -r requirements.gpu.txt
+2. Symlink NLTK data:
+     ln -s .../site-packages/llama_index/core/_static/nltk_cache /root/nltk_data
+3. COPY ocp-product-docs-plaintext, runbooks, embeddings_model
+4. Acquire model.safetensors:
+     HERMETIC=true  → cp /cachi2/output/deps/generic/model.safetensors
+     HERMETIC=false → curl from HuggingFace (pinned commit SHA)
+5. GPU validation (FLAVOR=gpu only):
+     python3.11 -c "import torch; print(torch.version.cuda); print(torch.cuda.is_available())"
+     (requires LD_LIBRARY_PATH=/usr/local/cuda-12/compat)
+6. COPY scripts/generate_embeddings.py
+7. For each OCP_VERSION in $(ls -1 ocp-product-docs-plaintext):
+     python3.11 generate_embeddings.py \
+       -f ocp-product-docs-plaintext/${OCP_VERSION} \
+       -r runbooks/alerts \
+       -md embeddings_model \
+       -mn ${EMBEDDING_MODEL} \
+       -o vector_db/ocp_product_docs/${OCP_VERSION} \
+       -i ocp-product-docs-$(echo $OCP_VERSION | sed 's/\./_/g') \
+       -v ${OCP_VERSION} \
+       -hb $HERMETIC
+8. Create latest symlink:
+     LATEST=$(ls -1 vector_db/ocp_product_docs/ | sort -V | tail -n 1)
+     ln -s ${LATEST} vector_db/ocp_product_docs/latest
+```
+
+### Stage 3: Final image
+
+```dockerfile
+FROM registry.access.redhat.com/ubi9/ubi-minimal@sha256:{digest}
+COPY --from=lightspeed-rag-builder /workdir/vector_db/ocp_product_docs /rag/vector_db/ocp_product_docs
+COPY --from=lightspeed-rag-builder /workdir/embeddings_model /rag/embeddings_model
+RUN mkdir /licenses
+COPY LICENSE /licenses/
+# Enterprise contract labels (com.redhat.component, cpe, vendor, etc.)
+USER 65532:65532
+```
+
+The `ubi-minimal` image is pinned by SHA256 digest. Digest updates are managed by automated Konflux/Mintmaker PRs.
+
+## BYOK Containerfile.tool
+
+```
+FROM ubi9/ubi:latest
+  ├── dnf install buildah python3.11 python3.11-pip
+  ├── pip install --no-deps -r requirements.cpu.txt
+  ├── COPY embeddings_model
+  ├── Acquire model.safetensors (same HERMETIC logic as main Containerfile)
+  ├── COPY byok/generate_embeddings_tool.py, byok/Containerfile.output
+  ├── Enterprise contract labels
+  ├── Set environment:
+  │     _BUILDAH_STARTED_IN_USERNS=""
+  │     BUILDAH_ISOLATION=chroot
+  │     OUT_IMAGE_TAG, BYOK_TOOL_IMAGE, UBI_BASE_IMAGE, LOG_LEVEL, VECTOR_DB_INDEX
+  └── CMD: buildah build \
+        --build-arg BYOK_TOOL_IMAGE=$BYOK_TOOL_IMAGE \
+        --build-arg UBI_BASE_IMAGE=$UBI_BASE_IMAGE \
+        --env VECTOR_DB_INDEX=$VECTOR_DB_INDEX \
+        -t $OUT_IMAGE_TAG -f Containerfile.output \
+        -v /markdown:/markdown:Z . \
+     && buildah push $OUT_IMAGE_TAG docker-archive:/output/$OUT_IMAGE_TAG.tar
+```
+
+## BYOK Containerfile.output
+
+```
+FROM ${BYOK_TOOL_IMAGE} as tool
+  USER 0, WORKDIR /workdir
+  RUN python3.11 generate_embeddings_tool.py \
+      -i /markdown -emd embeddings_model \
+      -emn sentence-transformers/all-mpnet-base-v2 \
+      -o vector_db -id $VECTOR_DB_INDEX
+
+FROM ${UBI_BASE_IMAGE}
+  COPY --from=tool /workdir/vector_db /rag/vector_db
+```
+
+## Makefile Targets
+
+| Target | Command | Purpose |
+|---|---|---|
+| `install-tools` | `pip3.11 install pdm` | Install PDM if not present |
+| `pdm-lock-check` | `pdm lock --check --group {cpu,gpu}` | Validate both lockfiles |
+| `install-deps` | `pdm sync --group $(TORCH_GROUP) --lockfile pdm.lock.$(TORCH_GROUP)` | Install runtime deps |
+| `install-deps-test` | `pdm sync --dev --group $(TORCH_GROUP) ...` | Install dev deps |
+| `update-deps` | `pdm update --update-all ... && pdm export ...` | Update + regenerate requirements.*.txt |
+| `check-types` | `mypy --explicit-package-bases scripts` | Type checking |
+| `format` | `black scripts && ruff check scripts --fix` | Code formatting |
+| `verify` | `black --check scripts && ruff check scripts` | Lint verification |
+| `update-docs` | Loop: `get_ocp_plaintext_docs.sh $V` + `get_runbooks.sh` | Refresh committed content |
+| `update-model` | `python scripts/download_embeddings_model.py` | Download embedding model |
+| `build-image` | `podman build -t rag-content .` | Local container build |
+| `model-safetensors` | `wget model.safetensors` if not present | Download model binary |
+
+The `FLAVOR` variable (default: `cpu`) maps to `TORCH_GROUP`, which selects the lockfile and the requirements file. The `verify` and `format` targets pass `--per-file-ignores=scripts/*:S101` to `ruff` so that assert statements are allowed in scripts.
+
+## Konflux Pipeline Structure
+
+All six pipelines are Tekton PipelineRun definitions that follow the same pattern:
+
+### Prefetch dependencies
+
+Cachi2 prefetches three dependency types:
+- **pip**: From `requirements.{cpu|gpu}.txt` with hashes.
+- **rpm**: From `rpms.lock.yaml`.
+- **generic**: From `artifacts.lock.yaml` (model.safetensors URL + SHA256).
+
+### Build
+
+Uses the `buildah` task with:
+- `hermetic=true` -- network-isolated build.
+- Build args: `FLAVOR=gpu`, `HERMETIC=true`.
+- The prefetched dependencies are injected into the build context.
+
+### Post-build
+
+- **Source image**: Created for artifact provenance tracking.
+- **Label check**: Validates enterprise contract labels.
+- **Integration test** (push pipelines only): Runs `lightspeed-rag-content-image-verification.yaml`.
+
+### Integration test
+
+`lightspeed-rag-content-image-verification.yaml` is a Tekton Task that:
+1. Mounts the built image.
+2. Checks for `/rag/vector_db/ocp_product_docs/{version}/index_store.json` for at least one OCP version.
+3. Checks for `/rag/embeddings_model/config.json`.
+4. Fails if either path is missing.
+
+## Dependency Management Flow
+
+```
+pyproject.toml
+├── [project.dependencies]           Core deps (llama-index, faiss, etc.)
+├── [project.optional-dependencies]
+│     cpu = [torch @ https://...cpu...]   CPU PyTorch wheel (pinned URL + hash)
+│     gpu = [torch==2.6.0]               GPU PyTorch from PyPI
+└── [tool.pdm.dev-dependencies]
+      dev = [black, mypy, ruff, types-requests]
+
+     │
+     ▼
+pdm lock → pdm.lock.cpu / pdm.lock.gpu
+     │
+     ▼
+pdm export → requirements.cpu.txt / requirements.gpu.txt
+     (with --hashes for pip install verification)
+
+rpms.in.yaml → rpms.lock.yaml
+     (Cachi2 RPM resolution for container build)
+
+artifacts.lock.yaml
+     (model.safetensors URL + SHA256 for Cachi2 generic artifact)
+
+renovate.json:
+  - Python package auto-updates: DISABLED
+  - Konflux references: auto-updated
+```
+
+## Implementation Notes
+
+- The main Containerfile always installs `requirements.gpu.txt` regardless of `FLAVOR`. The `FLAVOR` arg only affects the base image selection (CPU vs GPU). This means the CPU builder installs GPU-compatible torch, which works but is larger than necessary.
+
+- The BYOK tool's `pip install` uses `--no-deps`; the main Containerfile's does not. With `--no-deps`, pip installs only the packages listed in the locked requirements file and does not pull transitive dependencies that might conflict with the locked set.
+
+- `generate_packages_to_prefetch.py` (in `lsc/scripts/`) prepares inputs for Cachi2 hermetic builds. It copies the project stub, removes torch from `pyproject.toml`, runs `pip-compile` to generate `requirements.txt`, removes the torch and nvidia packages, downloads the CPU torch wheel separately from PyPI, computes its hash, and generates `requirements-build.txt`. The script is not invoked during the container build itself; it is a developer tool for maintaining the Cachi2 prefetch inputs.
+
+- The NLTK data symlink (`ln -s .../nltk_cache /root/nltk_data`) is required because LlamaIndex's sentence tokenizer depends on NLTK's `punkt` tokenizer data. The data is bundled with the llama-index-core package but needs to be discoverable at the default NLTK data path.
+
+- GPU builds set `LD_LIBRARY_PATH=/usr/local/cuda-12/compat` for CUDA library discovery. This is needed both during the torch validation step and during the embedding generation loop.
+
+- The `ubi-minimal` final image digest is periodically updated by Konflux/Mintmaker automation, which submits PRs to update the `@sha256:...` pinning.
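Editor's note on builder steps 7 and 8 in `.ai/spec/how/container-build.md` above: the index ID passed via `-i` replaces dots with underscores, and the `latest` symlink is chosen with `sort -V | tail -n 1`. The sketch below restates both rules in Python to show why a version-aware sort is needed. The function names are hypothetical, the version strings are made-up examples, and the code assumes directory names are purely dotted integers (`sort -V` accepts more forms than this).

```python
def version_key(version: str) -> tuple[int, ...]:
    """Split a dotted version such as "4.16" into integers for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def index_id(version: str) -> str:
    """Build the index ID the way the sed expression does: dots become underscores."""
    return "ocp-product-docs-" + version.replace(".", "_")


def latest_version(versions: list[str]) -> str:
    """Pick the numerically highest version, comparable to `sort -V | tail -n 1`."""
    return max(versions, key=version_key)


if __name__ == "__main__":
    versions = ["4.9", "4.10", "4.16"]  # hypothetical directory names
    print(latest_version(versions))  # 4.16
    print(sorted(versions)[-1])      # 4.9 (plain string sort picks the wrong one)
    print(index_id("4.16"))          # ocp-product-docs-4_16
```

A plain lexicographic sort would rank `4.9` above `4.16`, so the `latest` symlink would point at the wrong directory; comparing integer tuples avoids that.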