This repository demonstrates and compares two powerful approaches for knowledge injection in large language models:
- Retrieval-Augmented Generation (RAG)
- Cache-Augmented Generation (CAG)
The experiments use a quantized Llama 3.1 8B Instruct model and analyze the effectiveness and efficiency of each approach for Ukrainian pop-culture question answering.

Requirements:
- Python 3.10
- Poetry (for environment and dependency management)
- Up to 16GB VRAM (GPU highly recommended for Llama 3.1 8B Instruct)
- Hugging Face account with an accepted access request for the meta-llama/Llama-3.1-8B-Instruct model
Setup:

- Clone this repository:

  ```bash
  git clone https://github.com/Alex2135/RAG_vs_CAG_analysis
  cd RAG_vs_CAG_analysis
  ```
- Set up the Poetry virtual environment:

  ```bash
  poetry config virtualenvs.in-project true
  poetry env use python3.10
  source .venv/bin/activate
  poetry install
  ```
- Request access to the Llama 3.1 8B Instruct model:
  - Visit https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
  - Click “Access request” and wait for approval
- Log in to Hugging Face from the notebook:
  - Insert your HF token when prompted (from https://huggingface.co/settings/tokens)
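If you prefer to log in manually, a minimal login cell looks like the sketch below (assuming the `huggingface_hub` package is available in the environment):

```python
# Minimal Hugging Face login sketch (assumes huggingface_hub is installed).
from huggingface_hub import notebook_login

# Opens an input prompt inside Jupyter; paste the token from
# https://huggingface.co/settings/tokens when asked.
notebook_login()
```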
Open in Jupyter Lab (recommended for step-by-step analysis):
```bash
poetry run jupyter lab
```

The notebook:

- Loads a quantized Llama 3.1 8B Instruct model and tokenizer
- Prepares pop culture context (a biography of Stepan Giga) for the experiments
- Implements three knowledge-injection strategies (illustrative sketches follow below):
  - Direct context injection: the entire biography is passed in the prompt
  - RAG: FAISS + MiniLM-based retrieval of relevant context chunks for each question
  - CAG: one-time KV-caching of the knowledge, enabling efficient follow-up questions
- Compares the number of input tokens required by each method
- Visualizes the results (absolute and relative token usage) using matplotlib (see the token-comparison sketch below)
The experiment demonstrates the trade-offs between classic context-stuffing, RAG, and CAG in terms of token efficiency and suitability for large LLMs like Llama 3.
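For reference, the model-loading step typically looks like the sketch below. The 4-bit `bitsandbytes` configuration is an assumption about how the quantization is done; the notebook defines the exact settings it uses.

```python
# Sketch: load Llama 3.1 8B Instruct with 4-bit quantization (assumed bitsandbytes config).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/stability
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",                      # place layers on the available GPU
)
```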
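The RAG strategy embeds the biography in chunks and retrieves only the chunks relevant to each question. A minimal sketch with `sentence-transformers` (MiniLM) and FAISS follows; the chunking scheme, the `biography_text` variable, and `k = 3` are illustrative assumptions, not the notebook's exact settings.

```python
# Sketch: MiniLM embeddings + FAISS retrieval of relevant context chunks.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Split the biography into chunks (naive paragraph split; the notebook may chunk differently).
# `biography_text` is assumed to hold the Stepan Giga biography as a string.
chunks = [p.strip() for p in biography_text.split("\n\n") if p.strip()]

# Build a FAISS index over normalized chunk embeddings (inner product == cosine similarity).
embeddings = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

def retrieve(question: str, k: int = 3) -> list[str]:
    """Return the k chunks most similar to the question."""
    q_emb = embedder.encode([question], normalize_embeddings=True)
    _, idx = index.search(np.asarray(q_emb, dtype="float32"), k)
    return [chunks[i] for i in idx[0]]

# Only the retrieved chunks go into the prompt, instead of the whole biography.
context = "\n".join(retrieve("Where was Stepan Giga born?"))
```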
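The CAG strategy runs the model over the knowledge once, keeps the resulting key/value cache, and reuses it for every follow-up question, so only the question tokens are processed per query. A rough sketch of the idea, reusing the `model`, `tokenizer`, and `biography_text` names from the sketches above (the exact cache handling depends on the installed `transformers` version):

```python
# Sketch: one-time KV-caching of the knowledge prompt, reused for follow-up questions.
import copy
import torch

knowledge_prompt = "Answer questions about Stepan Giga using this biography:\n" + biography_text
knowledge_inputs = tokenizer(knowledge_prompt, return_tensors="pt").to(model.device)

# Forward pass over the knowledge once; keep the resulting key/value cache.
with torch.no_grad():
    knowledge_cache = model(**knowledge_inputs, use_cache=True).past_key_values

def answer_with_cache(question: str, max_new_tokens: int = 100) -> str:
    # Each call reuses a copy of the precomputed cache, so only the question tokens
    # (not the whole biography) need to be processed again.
    question_ids = tokenizer(
        "\nQuestion: " + question + "\nAnswer:", return_tensors="pt"
    ).input_ids.to(model.device)
    full_ids = torch.cat([knowledge_inputs.input_ids, question_ids], dim=-1)
    out = model.generate(
        full_ids,
        past_key_values=copy.deepcopy(knowledge_cache),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(out[0, full_ids.shape[-1]:], skip_special_tokens=True)
```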
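Finally, the token comparison is straightforward: count the input tokens each method would send for the same question and plot them. A sketch, reusing names from the sketches above (the actual numbers come from the notebook's runs):

```python
# Sketch: compare input-token counts per method and plot them with matplotlib.
import matplotlib.pyplot as plt

def count_tokens(text: str) -> int:
    return len(tokenizer(text).input_ids)

question = "Where was Stepan Giga born?"
token_counts = {
    "Direct context": count_tokens(biography_text + question),        # full biography every time
    "RAG": count_tokens("\n".join(retrieve(question)) + question),    # only retrieved chunks
    "CAG": count_tokens(question),                                    # knowledge already cached
}

plt.bar(list(token_counts.keys()), list(token_counts.values()))
plt.ylabel("Input tokens per question")
plt.title("Token usage by knowledge-injection method")
plt.show()
```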