SILVIA — Retrieval-Augmented Generation (RAG) System

SILVIA is a modular Retrieval-Augmented Generation (RAG) system designed to build domain-specific chatbots using PDF documents as external knowledge sources.


🚀 Jumpstart

To launch SILVIA:

python SILVIA.py

Follow the CLI instructions.

You can configure the RAG chatbot using:

config/parameters.yaml

By default, this configuration builds a RAG system using the PDF documents stored in:

data/Accurate

📁 Project Structure

config/

Configuration folder containing parameters for a test RAG named "Accurate".


data/

This folder contains all documents and derived artifacts used for prompt augmentation.

data/Accurate/

At the moment, it contains the source PDF documents used for prompt augmentation.

After ingestion and indexing, this folder will also contain:

  • Chunk database

    • Currently stored as .json
    • Format example:
      [
        {
          "chunk_id": "AI_act.pdf_p1_c0",
          "text": "<chunk text>",
          "metadata": {
            "file": "AI_act.pdf",
            "page": 1
          }
        }
      ]
  • FAISS index

    • faiss_index.idx
    • Stores vector embeddings for similarity search
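For quick inspection, the chunk database can be loaded with the standard library alone. This is a minimal sketch: the chunks.json schema is the one shown above, while the helper name load_chunks is illustrative and not part of the repository.

```python
import json

def load_chunks(path):
    """Load the chunk database and index its entries by chunk_id."""
    with open(path, encoding="utf-8") as f:
        chunks = json.load(f)
    return {c["chunk_id"]: c for c in chunks}

# Example using the schema shown above:
sample = [
    {
        "chunk_id": "AI_act.pdf_p1_c0",
        "text": "<chunk text>",
        "metadata": {"file": "AI_act.pdf", "page": 1},
    }
]
by_id = {c["chunk_id"]: c for c in sample}
page = by_id["AI_act.pdf_p1_c0"]["metadata"]["page"]  # the source page of that chunk
```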

Logs_and_output/

Stores outputs produced during pipeline testing.

Includes:

  • Query results

    • Retrieved chunks for each query, along with similarity scores
  • Augmented prompts

    • Prompts resulting from chunk retrieval and prompt assembly
  • Images

    • Histograms of:
      • Chunk length distribution for the tested database
      • Query distance distribution from retrieved entries

ingest/

Utilities for creating a chunk index from a collection of PDF files.

Includes:

  • pdf_loader.py
    Searches for all .pdf files and loads their content.

  • text_cleaner.py
    Applies the following cleaning steps:

    • Replace null characters (\x00) with spaces
    • Collapse consecutive spaces and tabs into a single space
    • Reduce sequences of 3+ newlines to at most 2
    • Strip leading and trailing spaces and newlines
    • Return the cleaned text
  • chunker.py
    Implements a simple chunking strategy:

    • Fixed-length chunks
    • Fixed overlap between consecutive chunks

    Chunking options (defined in the config file):

    • chunk_size
    • overlap
  • build_index.py
    Manages creation of the chunk database (chunks.json)
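The cleaning and chunking steps above can be sketched as follows. This is an illustration of the documented behavior, not the actual code: the function names clean_text and chunk_text are assumptions and may differ from text_cleaner.py and chunker.py.

```python
import re

def clean_text(text):
    """Apply the cleaning steps listed above, in order."""
    text = text.replace("\x00", " ")        # replace null characters with spaces
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces/tabs into one space
    text = re.sub(r"\n{3,}", "\n\n", text)  # reduce 3+ newlines to at most 2
    return text.strip()                     # strip leading/trailing whitespace

def chunk_text(text, chunk_size, overlap):
    """Fixed-length chunks with a fixed overlap between consecutive chunks."""
    step = chunk_size - overlap  # assumes overlap < chunk_size
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

With chunk_size=4 and overlap=2, "abcdefghij" yields "abcd", "cdef", "efgh", "ghij": each chunk shares its last two characters with the next.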


embedding/

Utilities for embedding creation and vector database management.

Includes:

  • build_db.py
    Creates a FAISS index:

    • data/<RAG_name>/faiss_index.idx
    • Associated metadata file
  • query_db.py
    Implements the FAISSRetriever class, which:

    • Handles similarity queries
    • Can store retrieved chunks and scores during testing
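The retrieval logic behind FAISSRetriever can be mirrored with plain NumPy for illustration. Only the two metric names (CosSim, L2) come from this README; the function below is a simplified sketch, not the repository's FAISS-backed implementation.

```python
import numpy as np

def retrieve(query_vec, db_vecs, k=3, similarity="CosSim"):
    """Return (indices, scores) of the k nearest database vectors.

    similarity="CosSim" ranks by cosine similarity (higher is better);
    similarity="L2" ranks by Euclidean distance (lower is better),
    matching the two metrics supported in parameters.yaml.
    """
    if similarity == "CosSim":
        q = query_vec / np.linalg.norm(query_vec)
        d = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
        scores = d @ q
        order = np.argsort(-scores)[:k]   # descending similarity
    else:  # "L2"
        scores = np.linalg.norm(db_vecs - query_vec, axis=1)
        order = np.argsort(scores)[:k]    # ascending distance
    return order, scores[order]
```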

prompting/

Utilities for assembling the final prompt provided to the language model.
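A minimal sketch of this assembly step, assuming chunks in the JSON format shown earlier and the [chunk_N] citation style described under citation_style below; the function name build_prompt is illustrative, not the repository's API.

```python
def build_prompt(question, chunks, preamble=""):
    """Assemble an augmented prompt: preamble, cited context, then the question."""
    context = "\n\n".join(
        f"[chunk_{i}] {c['text']}" for i, c in enumerate(chunks)
    )
    parts = [preamble, "Context:", context, f"Question: {question}"]
    return "\n\n".join(p for p in parts if p)

prompt = build_prompt(
    "What does the AI Act regulate?",
    [{"text": "The AI Act establishes harmonised rules..."}],
    preamble="Answer using only the context below.",
)
```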


llm/

Wrappers around Large Language Models.
Currently implemented:

  • OpenAI models

utils/

General-purpose utility functions.


⚙️ Configuration File (parameters.yaml)

Name

RAG_name

Unique identifier for the RAG setup.

IMPORTANT
RAG_name must match the name of the folder inside data/ containing the documents used for prompt augmentation.


Ingest Options

layout:
  • true: preserve original PDF layout
  • false: extract text linearly
chunk_size:

Approximate number of characters per chunk.

overlap:

Number of overlapping characters between consecutive chunks.


Embedding and Retrieval

model:

Supported models:

  • BAAI/bge-base-en-v1.5
  • all-MiniLM-L6-v2
similarity:

Similarity metric:

  • CosSim → cosine similarity
  • L2 → Euclidean distance
retrieval_size:

Number of chunks retrieved for prompt augmentation.


Prompt Parameters

All prompt fields are inserted verbatim into the final prompt passed to the LLM.

tone:

Tone of the assistant’s responses (e.g. formal, friendly, natural).

verbosity:

Level of detail in responses (e.g. concise, detailed).

preambolo:

System-level instructions prepended to every query ("preambolo" is Italian for preamble).

citation_style:

Citation format for retrieved chunks (e.g. [chunk_3]).


OpenAI Model Parameters

api_key:

OpenAI API key (required to use online models).

model:

LLM model (e.g. gpt-4o-mini, gpt-4.1, gpt-3.5-turbo).

temperature:

Controls randomness in generation.

max_tokens:

Maximum number of tokens in the output.

top_p:

Nucleus sampling parameter.

frequency_penalty:

Penalizes repeated tokens.

presence_penalty:

Encourages new content.

timeout:

Maximum wait time (seconds) for the API response.


Logging

verbose:

Enable verbose logging for debugging and monitoring.
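Putting the options above together, a parameters.yaml might look like the fragment below. Only the key names and the example values quoted in the sections above come from this README; the grouping, nesting, and concrete numbers are illustrative assumptions, so check the shipped config/parameters.yaml for the authoritative layout.

```yaml
RAG_name: Accurate                  # must match the folder name under data/

ingest:
  layout: false                     # extract text linearly
  chunk_size: 1000                  # approximate characters per chunk
  overlap: 200                      # overlapping characters between chunks

embedding:
  model: BAAI/bge-base-en-v1.5      # or all-MiniLM-L6-v2
  similarity: CosSim                # or L2
  retrieval_size: 5

prompt:
  tone: formal
  verbosity: concise
  preambolo: "Answer using only the provided context."
  citation_style: "[chunk_3]"

openai:
  api_key: "sk-..."                 # never commit a real key
  model: gpt-4o-mini
  temperature: 0.2
  max_tokens: 512
  top_p: 1.0
  frequency_penalty: 0.0
  presence_penalty: 0.0
  timeout: 30

logging:
  verbose: true
```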

About

A RAG model in honor of the greatest of all, SILVIA
