SILVIA — Retrieval-Augmented Generation (RAG) System

SILVIA is a modular Retrieval-Augmented Generation (RAG) system designed to build domain-specific chatbots using PDF documents as external knowledge sources.


🚀 Jumpstart

To launch SILVIA:

python SILVIA.py

Follow the CLI instructions.

You can configure the RAG chatbot using:

config/parameters.yaml

By default, this configuration builds a RAG system using the PDF documents stored in:

data/Accurate

📁 Project Structure

config/

Configuration folder containing parameters for a test RAG named "Accurate".


data/

This folder contains all documents and derived artifacts used for prompt augmentation.

data/Accurate/

At the moment, it contains the source PDF documents used for prompt augmentation.

After ingestion and indexing, this folder will also contain:

  • Chunk database

    • Currently stored as .json
    • Format example:
      [
        {
          "chunk_id": "AI_act.pdf_p1_c0",
          "text": "<chunk text>",
          "metadata": {
            "file": "AI_act.pdf",
            "page": 1
          }
        }
      ]
  • FAISS index

    • faiss_index.idx
    • Stores vector embeddings for similarity search
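For quick inspection, the chunk database can be loaded with the standard library alone. This is a minimal sketch: the chunks.json schema is the one shown above, while the helper name load_chunks is illustrative and not part of the repository.

```python
import json

def load_chunks(path):
    """Load the chunk database and index its entries by chunk_id."""
    with open(path, encoding="utf-8") as f:
        chunks = json.load(f)
    return {c["chunk_id"]: c for c in chunks}

# Example using the schema shown above:
sample = [
    {
        "chunk_id": "AI_act.pdf_p1_c0",
        "text": "<chunk text>",
        "metadata": {"file": "AI_act.pdf", "page": 1},
    }
]
by_id = {c["chunk_id"]: c for c in sample}
page = by_id["AI_act.pdf_p1_c0"]["metadata"]["page"]  # the source page of that chunk
```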

Logs_and_output/

Stores outputs produced during pipeline testing.

Includes:

  • Query results

    • Retrieved chunks for each query, along with similarity scores
  • Augmented prompts

    • Prompts resulting from chunk retrieval and prompt assembly
  • Images

    • Histograms of:
      • Chunk length distribution for the tested database
      • Query distance distribution from retrieved entries

ingest/

Utilities for creating a chunk index from a collection of PDF files.

Includes:

  • pdf_loader.py
    Searches for all .pdf files and loads their content.

  • text_cleaner.py
    Applies the following cleaning steps:

    • Replace null characters (\x00) with spaces
    • Collapse consecutive spaces and tabs into a single space
    • Reduce sequences of 3+ newlines to at most 2
    • Strip leading and trailing spaces and newlines
    • Return the cleaned text
  • chunker.py
    Implements a simple chunking strategy:

    • Fixed-length chunks
    • Fixed overlap between consecutive chunks

    Chunking options (defined in the config file):

    • chunk_size
    • overlap
  • build_index.py
    Manages creation of the chunk database (chunks.json)
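The cleaning and chunking steps above can be sketched as follows. This is an illustration of the documented behavior, not the actual code: the function names clean_text and chunk_text are assumptions and may differ from text_cleaner.py and chunker.py.

```python
import re

def clean_text(text):
    """Apply the cleaning steps listed above, in order."""
    text = text.replace("\x00", " ")        # replace null characters with spaces
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces/tabs into one space
    text = re.sub(r"\n{3,}", "\n\n", text)  # reduce 3+ newlines to at most 2
    return text.strip()                     # strip leading/trailing whitespace

def chunk_text(text, chunk_size, overlap):
    """Fixed-length chunks with a fixed overlap between consecutive chunks."""
    step = chunk_size - overlap  # assumes overlap < chunk_size
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

With chunk_size=4 and overlap=2, "abcdefghij" yields "abcd", "cdef", "efgh", "ghij": each chunk shares its last two characters with the next.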


embedding/

Utilities for embedding creation and vector database management.

Includes:

  • build_db.py
    Creates a FAISS index:

    • data/<RAG_name>/faiss_index.idx
    • Associated metadata file
  • query_db.py
    Implements the FAISSRetriever class, which:

    • Handles similarity queries
    • Can store retrieved chunks and scores during testing
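The retrieval logic behind FAISSRetriever can be mirrored with plain NumPy for illustration. Only the two metric names (CosSim, L2) come from this README; the function below is a simplified sketch, not the repository's FAISS-backed implementation.

```python
import numpy as np

def retrieve(query_vec, db_vecs, k=3, similarity="CosSim"):
    """Return (indices, scores) of the k nearest database vectors.

    similarity="CosSim" ranks by cosine similarity (higher is better);
    similarity="L2" ranks by Euclidean distance (lower is better),
    matching the two metrics supported in parameters.yaml.
    """
    if similarity == "CosSim":
        q = query_vec / np.linalg.norm(query_vec)
        d = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
        scores = d @ q
        order = np.argsort(-scores)[:k]   # descending similarity
    else:  # "L2"
        scores = np.linalg.norm(db_vecs - query_vec, axis=1)
        order = np.argsort(scores)[:k]    # ascending distance
    return order, scores[order]
```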

prompting/

Utilities for assembling the final prompt provided to the language model.
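A minimal sketch of this assembly step, assuming chunks in the JSON format shown earlier and the [chunk_N] citation style described under citation_style below; the function name build_prompt is illustrative, not the repository's API.

```python
def build_prompt(question, chunks, preamble=""):
    """Assemble an augmented prompt: preamble, cited context, then the question."""
    context = "\n\n".join(
        f"[chunk_{i}] {c['text']}" for i, c in enumerate(chunks)
    )
    parts = [preamble, "Context:", context, f"Question: {question}"]
    return "\n\n".join(p for p in parts if p)

prompt = build_prompt(
    "What does the AI Act regulate?",
    [{"text": "The AI Act establishes harmonised rules..."}],
    preamble="Answer using only the context below.",
)
```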


llm/

Wrappers around Large Language Models.
Currently implemented:

  • OpenAI models

utils/

General-purpose utility functions.


⚙️ Configuration File (parameters.yaml)

Name

RAG_name

Unique identifier for the RAG setup.

IMPORTANT
RAG_name must match the name of the folder inside data/ containing the documents used for prompt augmentation.


Ingest Options

layout:
  • true: preserve original PDF layout
  • false: extract text linearly
chunk_size:

Approximate number of characters per chunk.

overlap:

Number of overlapping characters between consecutive chunks.


Embedding and Retrieval

model:

Supported models:

  • BAAI/bge-base-en-v1.5
  • all-MiniLM-L6-v2
similarity:

Similarity metric:

  • CosSim → cosine similarity
  • L2 → Euclidean distance
retrieval_size:

Number of chunks retrieved for prompt augmentation.


Prompt Parameters

All prompt fields are inserted verbatim into the final prompt passed to the LLM.

tone:

Tone of the assistant’s responses (e.g. formal, friendly, natural).

verbosity:

Level of detail in responses (e.g. concise, detailed).

preambolo:

System-level instructions prepended to every query ("preambolo" is Italian for preamble).

citation_style:

Citation format for retrieved chunks (e.g. [chunk_3]).


OpenAI Model Parameters

api_key:

OpenAI API key (required to use online models).

model:

LLM model (e.g. gpt-4o-mini, gpt-4.1, gpt-3.5-turbo).

temperature:

Controls randomness in generation.

max_tokens:

Maximum number of tokens in the output.

top_p:

Nucleus sampling parameter.

frequency_penalty:

Penalizes repeated tokens.

presence_penalty:

Encourages new content.

timeout:

Maximum wait time (seconds) for the API response.


Logging

verbose:

Enable verbose logging for debugging and monitoring.
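Putting the options above together, a parameters.yaml might look like the fragment below. Only the key names and the example values quoted in the sections above come from this README; the grouping, nesting, and concrete numbers are illustrative assumptions, so check the shipped config/parameters.yaml for the authoritative layout.

```yaml
RAG_name: Accurate                  # must match the folder name under data/

ingest:
  layout: false                     # extract text linearly
  chunk_size: 1000                  # approximate characters per chunk
  overlap: 200                      # overlapping characters between chunks

embedding:
  model: BAAI/bge-base-en-v1.5      # or all-MiniLM-L6-v2
  similarity: CosSim                # or L2
  retrieval_size: 5

prompt:
  tone: formal
  verbosity: concise
  preambolo: "Answer using only the provided context."
  citation_style: "[chunk_3]"

openai:
  api_key: "sk-..."                 # never commit a real key
  model: gpt-4o-mini
  temperature: 0.2
  max_tokens: 512
  top_p: 1.0
  frequency_penalty: 0.0
  presence_penalty: 0.0
  timeout: 30

logging:
  verbose: true
```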

About

A RAG model in honor of the greatest of all, SILVIA
