Skip to content
KUMAR SIVARAJAN edited this page Oct 13, 2025 · 2 revisions

Welcome to the TrustScoreEval wiki!

A trust score for an LLM is a quantitative assessment of its trustworthiness, measuring its reliability, safety, and performance across multiple dimensions. Since trust is not a single characteristic, evaluation involves a multifaceted framework that combines automated metrics, benchmarks, and human feedback.

Key dimensions of an LLM trust score

A holistic trust score evaluates a model on several critical aspects:

Accuracy and factual consistency: Measures whether the model's output is factually correct. High scores indicate that the model avoids "hallucinations"—generating confident but false or unsubstantiated information.

Safety and harmlessness: Evaluates the model's ability to avoid generating harmful, toxic, or biased content. Benchmarks like ForbiddenQuestions and datasets from red-teaming exercises are used to test a model's safety guardrails.

Fairness and bias: Quantifies any biases related to gender, race, or socioeconomic status that the model may have learned from its training data. Evaluation ensures the model does not unfairly discriminate against certain demographic groups.

Relevance: Assesses how well the response directly and concisely addresses the user's prompt. A high relevance score means the model does not go off-topic.

Coherence and fluency: Scores the logical flow, consistency, and readability of the generated text. A fluent and coherent response is natural and easy to understand.

Robustness: Measures the model's resilience to different variations in input. This includes testing against adversarial attacks, such as "jailbreaking" prompts, which are designed to circumvent safety mechanisms.

Reliability and consistency: Tracks whether the model can produce similar responses to the same input over time. Consistency is particularly important for applications where deterministic outputs are critical.

Explainability and transparency: While often difficult for LLMs, evaluation in this area focuses on ensuring the model can provide understandable justifications for its outputs, building user confidence.

Evaluation frameworks and methods

Evaluating trustworthiness requires a combination of automated testing and human oversight.

# Automated evaluation

Benchmarks: Standardized tests are used to compare LLM capabilities.

Examples include:

HELM: A Holistic Evaluation of Language Models that covers multiple facets like accuracy, robustness, and safety.

TruthfulQA: Specifically designed to measure truthfulness and avoid hallucinations.

DecodingTrust: Provides a comprehensive safety evaluation framework across eight perspectives, including toxicity, stereotypes, and privacy.

Metrics: Tools apply various metrics to score performance:

Reference-based: Classic metrics like ROUGE and BLEU compare the LLM's output to a human-written reference, though they can miss semantic nuances.

Reference-free: Advanced metrics, such as embedding similarity and model-based scoring, assess content without a predefined answer, often leveraging another LLM.

LLM-as-a-Judge: A powerful LLM is used to evaluate the output of another model against a predefined rubric. This method is fast and scalable, but it may introduce bias or miss subtleties that a human would catch.

Human-in-the-loop (HITL) evaluation

Expert reviews: Subject-matter experts manually review outputs, which is vital for high-stakes tasks in fields like medicine or finance where nuance and context are critical.

User feedback: In production, mechanisms like "thumbs-up/thumbs-down" ratings provide continuous user feedback to monitor for performance shifts.

Red teaming: Human teams purposefully attempt to find weaknesses in the model by crafting adversarial prompts to elicit harmful or incorrect responses.

Tools for trust score evaluation

Numerous platforms and libraries are available to help automate and manage the evaluation process.

RAGAS: An open-source framework for evaluating Retrieval-Augmented Generation (RAG) systems, focusing on metrics like faithfulness and answer relevance.

DeepEval: A Python library that allows for unit-testing LLM outputs in a CI/CD pipeline and comes with pre-built metrics for testing truthfulness and bias.

LangSmith: An evaluation platform that integrates with LangChain, offering logging, version control, and human feedback queues to help debug and test LLM applications.

Fiddler AI: A platform that generates "Trust Scores" across various dimensions (e.g., safety, toxicity, relevance) to monitor and govern LLM applications at scale.

Trust score evaluation for RAG applications

For RAG systems, trustworthiness is especially dependent on how the LLM uses the provided context to generate answers.

Faithfulness: Measures if the response is factually consistent with the retrieved context. This is crucial for avoiding hallucinations.

Contextual precision and recall: These metrics evaluate the quality of the retrieval component. They check if the system retrieves relevant documents (recall) while avoiding irrelevant ones (precision).

Answer relevance: Checks if the generated answer is relevant to the original user query, based on the retrieved context.

Trust-Score (RAG): Research has introduced a holistic metric specifically for RAG setups that combines metrics for retrieval and generation to quantify hallucinations.