TrustVar: A Dynamic Framework for Trustworthiness Evaluation and Task Variation Analysis in Large Language Models
📖 Start with Documentation Overview for quick understanding of structure and navigation.
TrustVar is a framework built on our previous LLM trustworthiness testing system. Where we previously focused on how LLMs handle tasks, we now rethink the evaluation procedure itself: TrustVar investigates the quality of the tasks, not just model behavior.
Unlike traditional frameworks that test models through tasks, TrustVar tests tasks through models. We analyze tasks as research objects, measuring their ambiguity, sensitivity, and structure, then examine how these parameters influence model behavior.
- Task Variation Generation: Automatically creates families of task reformulations
- Model Robustness Testing: Evaluates model stability under formulation changes
- Task Sensitivity Index (TSI): Measures how strongly formulations affect model success
- Multi-language Support: English and Russian tasks with extensible architecture
- Interactive Pipeline: Unified system for data loading, task generation, variation, model evaluation, and visual analysis
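As a hedged illustration of the Task Sensitivity Index, TSI can be thought of as the spread of a model's success rates across a family of reformulations of the same task. The sketch below shows that idea only; the exact metric used by TrustVar may differ.

```python
# Sketch: a Task Sensitivity Index (TSI) as the spread of success rates
# across reformulations of one task. The exact TrustVar definition may
# differ; this illustrates the concept only.

def task_sensitivity_index(success_rates: list[float]) -> float:
    """Return the range of per-variation success rates: 0.0 means the
    model is insensitive to rewording, 1.0 means maximally sensitive."""
    if not success_rates:
        raise ValueError("need at least one variation")
    return max(success_rates) - min(success_rates)

# One task, four reformulations, per-variation accuracy of some model:
rates = [0.875, 0.75, 0.375, 0.5]
print(task_sensitivity_index(rates))  # 0.5
```

A task whose reformulations all yield similar success rates would score near zero, flagging it as robust to wording changes.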
- Project Architecture
- Project Structure
- Quick Start
- Documentation
- System Components
- API
- Metrics
- Deployment
- Development
- Support
- MongoDB — Primary database for storing tasks, results, and metrics
- Langchain Backend — Server-side service for processing requests and interacting with language models
- Streamlit Frontend — Modern web interface for monitoring and management
- Task Runners — Set of specialized task processors
- Ollama — Service for local language model execution
- Task Creation → MongoDB (collection `tasks`)
- Task Processing → Task Processor → MongoDB (collections `queue_*`)
- Inference Execution → Runner → Langchain Backend → LLM
- Metrics Collection → Metrics Runner → MongoDB
- Visualization → Streamlit Frontend → MongoDB
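The flow above can be sketched with hypothetical document shapes. Note that the field names and the `queue_<model>` naming below are assumptions for illustration, not the actual TrustVar schema.

```python
# Sketch of documents that might flow between the stages above.
# Collection and field names are assumptions for illustration only.

def make_task(task_id: str, prompt: str, language: str) -> dict:
    """Document written to the `tasks` collection at task creation."""
    return {"_id": task_id, "prompt": prompt, "language": language}

def enqueue(task: dict, model: str) -> dict:
    """Document a task processor might place into a queue_* collection
    for a runner to pick up and send to the Langchain Backend."""
    return {"task_id": task["_id"], "prompt": task["prompt"],
            "model": model, "status": "pending"}

task = make_task("t1", "Is this statement a stereotype?", "en")
job = enqueue(task, "llama3")
print(job["status"])  # pending
```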
TrustVar/
├── docs/ # Project documentation
│ ├── README.md # Main documentation
│ ├── components.md # Component documentation
│ ├── deployment.md # Deployment guide
│ ├── api.md # API documentation
│ ├── metrics.md # Metrics documentation
│ └── Screenshot 2025-07-24 at 13.35.16.png # Architecture diagram
├── utils/ # Utilities and common components
│ ├── constants.py # Constants and settings
│ ├── db_client.py # MongoDB client
│ ├── src.py # Helper functions
│ ├── sync_task.py # Task synchronization
│ └── __init__.py
├── runners/ # Task processors
│ ├── run.py # Main task processor
│ ├── run_metrics.py # Metrics processor
│ ├── run_regexp.py # Data extraction from responses
│ ├── run_rta_queuer.py # RtA task processor
│ ├── task_processor.py # Task processor
│ └── README.md # Processors documentation
├── monitoring/ # Web monitoring interface
│ ├── app_main.py # Main Streamlit application
│ ├── config.yaml # Authentication configuration
│ ├── dataset_management.py # Dataset management
│ ├── metrics.py # Metrics display
│ ├── prompts_tasks.py # Task creation
│ ├── tasks.py # Task visualization
│ └── src.py # Helper functions
├── langchain_back/ # Backend service
├── pyproject.toml # Poetry configuration
├── docker-compose.yml # Docker Compose configuration
├── Dockerfile # Docker image
└── README.md # Main README
- Docker and Docker Compose
- Python 3.11+ (for local development)
- Poetry (for dependency management)
- Clone the repository:

  ```shell
  git clone <repository-url>
  cd TrustVar
  ```
- Create a `.env` file with environment variables:

  ```shell
  MONGO_INITDB_ROOT_USERNAME=admin
  MONGO_INITDB_ROOT_PASSWORD=password
  MONGO_INITDB_ROOT_PORT=27017
  YANDEX_API_KEY=your_yandex_key
  OPENAI_KEY=your_openai_key
  API_URL=http://localhost:45321/generate
  OLLAMA_BASE_URL=http://localhost:12345
  CURRENT_UID=1000
  CURRENT_GID=1000
  ```
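The services presumably read these variables at startup. A minimal sketch of assembling a MongoDB connection string from them follows; the variable names come from the `.env` above, but the URI format and the `localhost` host are assumptions.

```python
import os

# Build a MongoDB URI from the .env variables above. Defaults mirror the
# sample values; the exact URI format used by TrustVar is an assumption.
def mongo_uri() -> str:
    user = os.environ.get("MONGO_INITDB_ROOT_USERNAME", "admin")
    password = os.environ.get("MONGO_INITDB_ROOT_PASSWORD", "password")
    port = os.environ.get("MONGO_INITDB_ROOT_PORT", "27017")
    return f"mongodb://{user}:{password}@localhost:{port}"

print(mongo_uri())
```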
- Launch all services:

  ```shell
  docker-compose up -d
  ```
- Download datasets and auxiliary information:

  After running `docker-compose up`, download the datasets and auxiliary information from our Google Drive and upload them to MongoDB. The drive contains:
- Accuracy_Groups.json - Accuracy metrics grouped by categories
- Accuracy.json - Main accuracy dataset
- Correlation.json - Correlation metrics
- IncludeExclude.json - Include/Exclude analysis data
- RtAR.json - Refuse to Answer metrics
- TFNR.json - True False Negative Rate metrics
- jailbreak.json - Jailbreak detection tasks
- ood_detection.json - Out-of-distribution detection
- privacy_assessment.json - Privacy assessment tasks
- stereotypes_detection_3.json - Stereotype detection
- tasks.json - Task definitions
- And many more specialized datasets...
  Instructions:

  - Download all JSON files from the Google Drive
  - Place them in the `data/datasets/` directory of your TrustVar installation
  - Run the upload script to populate MongoDB:

    ```shell
    cd data
    python upload.py
    ```

  - Restart the services if necessary:

    ```shell
    docker-compose restart
    ```
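As a hedged sketch of what an upload step like this might do: read each JSON file from the dataset directory and group its records under a collection named after the file. The actual `data/upload.py` may work differently; the one-collection-per-file convention here is an assumption.

```python
import json
from pathlib import Path

# Sketch of a dataset upload step: one collection per JSON file, named
# after the file stem. The actual data/upload.py may differ.
def load_datasets(dataset_dir: str) -> dict[str, list]:
    """Map collection name -> list of documents parsed from JSON files."""
    collections = {}
    for path in sorted(Path(dataset_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        # Normalize a single JSON object into a one-element document list.
        collections[path.stem] = data if isinstance(data, list) else [data]
    return collections

# The result could then be written with pymongo, e.g.:
# for name, docs in load_datasets("data/datasets").items():
#     db[name].insert_many(docs)
```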
- Open the web interface:

  - Monitoring: http://localhost:27366 (or http://83.143.66.61:27366 for remote access)
  - MongoDB Express: http://localhost:8081

  Authentication credentials:

  - Username: `user`
  - Password: `resu123`
- Install dependencies:

  ```shell
  poetry install
  ```

- Activate the virtual environment:

  ```shell
  poetry shell
  ```

- Launch individual components:

  ```shell
  # Backend
  python langchain_back/main.py

  # Frontend
  streamlit run monitoring/app_main.py --server.port 27366

  # Runners
  python -m runners.run
  python -m runners.run_metrics
  python -m runners.task_processor
  ```
Detailed documentation is available in the docs/ folder:
- Main Documentation - System overview and architecture
- System Components - Detailed description of all components
- Setup - Setup and installation guide
- Deployment - Deployment guide
- API - API documentation
- Metrics - Description of supported metrics
Primary database for storing tasks, results, and metrics.
Server-side service for processing requests to language models.
Modern web interface for monitoring and management.
Set of specialized task processors:
- `run.py` - Main task processor
- `run_metrics.py` - Metrics processor
- `run_regexp.py` - Data extraction from responses
- `run_rta_queuer.py` - RtA task processor
- `task_processor.py` - Task processor
Service for local language model execution.
The system provides a REST API for interacting with language models:
- POST /generate - Response generation
- GET /health - Health check
- GET /models - Model list
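A minimal client sketch for the `POST /generate` endpoint above. The payload and header fields ("model", "prompt") are assumptions; consult docs/api.md for the actual schema.

```python
import json
import urllib.request

# Sketch of preparing a request to the REST API above. The payload
# fields ("model", "prompt") are assumptions; see docs/api.md.
def build_generate_request(api_url: str, model: str, prompt: str):
    payload = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_generate_request("http://localhost:45321/generate",
                             "llama3", "Hello")
print(req.get_method(), req.full_url)  # POST http://localhost:45321/generate
```

Sending it would be `urllib.request.urlopen(req)` once the backend is running.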
Detailed API documentation: docs/api.md
Supported metric types:
- Accuracy - Response accuracy
- RtA (Refuse to Answer) - Analysis of answer refusals
- Correlation - Correlation with reference answers
- Include/Exclude - Analysis of element inclusion/exclusion
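As a hedged illustration, Accuracy and RtA over a batch of responses might be computed as below. The refusal detection is a naive substring sketch, not the project's actual implementation in `runners/run_metrics.py`.

```python
# Naive sketches of two of the metric types above; the real metrics
# runner (runners/run_metrics.py) likely uses more robust logic.

REFUSAL_MARKERS = ("i cannot", "i can't", "i refuse")

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Share of exact (case-insensitive) matches with the reference."""
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)

def refuse_to_answer_rate(responses: list[str]) -> float:
    """Share of responses containing a refusal marker (RtA)."""
    refusals = sum(any(m in r.lower() for m in REFUSAL_MARKERS)
                   for r in responses)
    return refusals / len(responses)

print(accuracy(["Yes", "no"], ["yes", "yes"]))                       # 0.5
print(refuse_to_answer_rate(["I cannot help with that.", "Sure."]))  # 0.5
```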
Detailed metrics description: docs/metrics.md
```shell
docker-compose up -d
```

Detailed Kubernetes deployment guide: docs/deployment.md
- Modular architecture with clear separation of responsibilities
- Docker containerization for simplified deployment
- Poetry for dependency management
- Streamlit for modern web interface
- Create a new module in `runners/`
- Add configuration in `utils/constants.py`
- Update the web interface in `monitoring/`
- Add the model to the `MODELS` list in `utils/constants.py`
- Configure the corresponding provider in the Langchain Backend
- Update the documentation
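A sketch of what the first step might look like in `utils/constants.py`. The list structure and field names below are assumptions; the actual shape of `MODELS` in TrustVar may differ.

```python
# Hypothetical shape of the MODELS registry in utils/constants.py.
# The actual structure used by TrustVar may differ.
MODELS = [
    {"name": "llama3", "provider": "ollama"},
    {"name": "gpt-4o", "provider": "openai"},
]

# Registering a new model:
MODELS.append({"name": "yandexgpt", "provider": "yandex"})
print([m["name"] for m in MODELS])  # ['llama3', 'gpt-4o', 'yandexgpt']
```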
For support:
- Check the documentation in the `docs/` folder
- Study the component logs
- Create an issue in the project repository
This project is licensed under the MIT License.
