
[SC-12193] Enable custom test result description output structure #425

Merged
juanmleng merged 5 commits into main from juan/sc-12193/enable-custom-test-result-description-output-structure
Sep 19, 2025

Conversation

@juanmleng
Contributor

@juanmleng juanmleng commented Sep 15, 2025

Pull Request Description

What

This PR adds comprehensive context management capabilities to run_test() by introducing a new context parameter that accepts a dictionary with three optional keys:

  • test_description: Allows users to override a test's built-in docstring with custom documentation
  • additional_context: Provides business context, thresholds, decision criteria, and any background information
  • instructions: Controls output formatting and presentation style

vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "model": vm_model,
        "dataset": vm_test_ds,
    },
    context={
        "test_description": "...",
        "instructions": "...",
        "additional_context": "...",
    },
)

Why

Users now have complete control over all aspects of context that drive LLM test description generation.

test_description Parameter:
Addresses the need for domain-specific and regulatory-compliant test documentation. By default, ValidMind tests include technical documentation about their statistical methodology, but for generic tests like histograms or descriptive statistics, this mechanical explanation often provides less value than understanding what variables or features are being analyzed and their business significance. This parameter allows users to replace generic methodology descriptions with meaningful explanations of the data being examined, override ValidMind's built-in documentation when different terminology or structure is preferred, and ensure regulatory compliance by using specific required language for test definitions.
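
For example, a minimal sketch of overriding the built-in docstring for a generic descriptive-statistics test (the test ID and feature names here are illustrative, not taken from this PR):

vm.tests.run_test(
    "validmind.data_validation.DescriptiveStatistics",  # illustrative generic test
    inputs={"dataset": vm_test_ds},
    context={
        "test_description": (
            "Summary statistics for the retail banking churn dataset. "
            "tenure_months and monthly_charges are the key features reviewed "
            "by the model risk team; values are reported in months and USD."
        ),
    },
)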

additional_context Parameter:
Allows users to provide any background information necessary for intelligent interpretation of test results. Rather than generating generic assessments, this parameter enables the LLM to understand the specific context that matters for each use case. Users can include business rules like performance thresholds and decision criteria, organizational context such as risk tolerances and regulatory requirements, real-time information like current dates or risk indicators, stakeholder priorities, model purpose and operational constraints, or anything else that helps the LLM interpret results within their specific business context. The parameter is deliberately a catch-all: users supply whatever background details are most relevant to their situation.
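
As a hedged sketch (the thresholds and wording below are illustrative, mirroring the example further down), business rules and point-in-time information can be passed as free-form text:

from datetime import date

additional_context = f"""
As of {date.today().isoformat()}:
- Model purpose: prioritize retention offers for retail banking customers.
- Decision criteria: AUC > 0.85 required for approval; recall > 0.50 needed
  for the retention program to be economically viable.
- Risk appetite: missed churners are costlier than false alarms.
"""

vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={"model": vm_model, "dataset": vm_test_ds},
    context={"additional_context": additional_context},
)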

instructions Parameter:
The instructions parameter gives users complete control over how test descriptions are formatted. This addresses the reality that different tests, document sections, or audiences may need different formats: concise summaries with clear recommendations, detailed methodology discussions, or the compliance-focused language that regulatory audiences expect. Users can create structured templates that ensure consistent organizational reporting standards, combine hardcoded mandatory text (such as policy references and disclaimers) with dynamic LLM analysis, and implement sophisticated formatting requirements that maintain professional presentation while leveraging AI-generated insights.
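
A sketch of a structured template that mixes mandatory boilerplate with LLM-generated analysis (the section headings and policy reference are illustrative):

instructions = """
Structure the description exactly as follows:
1. Executive Summary - two sentences, ending with an explicit go/no-go recommendation.
2. Key Metrics - a markdown table of the reported metrics.
3. Risk Rating - Low, Medium, or High, with a one-sentence justification.
Close with the verbatim disclaimer: "Prepared under Model Risk Policy MRP-001."
"""

vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={"model": vm_model, "dataset": vm_test_ds},
    context={"instructions": instructions},
)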

How to test

All three parameters working together

vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={"model": vm_model, "dataset": vm_test_ds},
    context={
        "test_description": "Customer churn assessment performance for retail banking applications. Class 0=retention, Class 1=churn",
        "additional_context": "AUC >0.85=APPROVE, Recall >50%=viable retention program", 
        "instructions": "Format as executive summary with clear go/no-go decision and risk rating"
    }
)

Backwards compatibility - environment variable still works (temporary)

os.environ["VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = "Format as executive summary"
vm.tests.run_test(...)  # Uses environment variable (will be deprecated)

Parameter overrides environment variable when both provided

os.environ["VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = "Old format"
vm.tests.run_test(..., context={"instructions": "New format"})  # Uses "New format"

What needs special review?

We need to decide on the backwards compatibility period. When are we removing support for the environment variable?

Dependencies, breaking changes, and deployment notes

  • The notebook notebooks/how_to/add_context_to_llm_descriptions.ipynb has been removed and replaced by notebooks/how_to/custom_test_result_descriptions.ipynb
  • No breaking changes. However, users should migrate from environment variables to parameters, as environment variable support will probably be removed in future releases.

Dependencies:

https://github.com/validmind/backend/pull/1963

Release notes

This release introduces comprehensive context management for test descriptions through a new context parameter in run_test() that accepts a dictionary with three optional keys:

  • test_description for overriding test documentation with domain-specific content
  • additional_context for providing business context, real-time information or any other background information relevant for the analysis of test results
  • instructions for controlling output formatting and style

These parameters give users complete control over how LLM-generated test descriptions are created. The existing environment variable approach (VALIDMIND_LLM_DESCRIPTIONS_CONTEXT) remains fully supported for backwards compatibility, with the instructions parameter taking precedence when both are provided. Users are encouraged to migrate to the new parameter-based approach, as environment variable support will be deprecated in future releases.

Checklist

  • What and why
  • Screenshots or videos (Frontend)
  • How to test
  • What needs special review
  • Dependencies, breaking changes, and deployment notes
  • Labels applied
  • PR linked to Shortcut
  • Unit tests added (Backend)
  • Tested locally
  • Documentation updated (if required)
  • Environment variable additions/changes documented (if required)

@juanmleng juanmleng self-assigned this Sep 15, 2025
@juanmleng juanmleng added the enhancement New feature or request label Sep 15, 2025
Contributor

@johnwalz97 johnwalz97 left a comment


lgtm

Contributor

@AnilSorathiya AnilSorathiya left a comment


For consistency purposes: we are using user_instructions as an input in the run_task function, while instructions has been used as a param, and the names differ as well.
Just wondering, can we bring consistency?

    result = vm.experimental.agents.run_task(
        task="code_explainer",
        input={
            "source_code": source_code,
            "user_instructions": user_instructions
        }
    )
    result.log(content_id=content_id)
    return result

@juanmleng
Contributor Author

Actually, I was thinking that perhaps we could consider consolidating the context parameters into a single dictionary, similar to inputs. This would logically group all context elements that drive the description customization:

Current approach:
run_test(doc="...", instructions="...", knowledge="...")

Alternative approach:
run_test(context={"doc": "...", "format_instructions": "...", "knowledge": "..."})

Regarding the naming of the parameters, I am open to suggestions. Perhaps instructions is too vague and format_instructions would be more targeted? But I'm happy to hear other suggestions.

With respect to the code explainer, perhaps we could follow a similar approach, where we separate "what to process" (inputs) vs "how to process it" (context).

@AnilSorathiya @johnwalz97 @cachafla @kristof87 Any thoughts?

@AnilSorathiya
Contributor

run_test(context={"doc": "...", "format_instructions": "...", "knowledge": "..."})

I like the idea of keeping the LLM parameters/inputs in a separate dictionary. It allows us to add more parameters without changing the run_test function signature, and it gives a clear separation between the test's inputs/params and the LLM inputs/params.
For parameter naming, in my opinion the word knowledge doesn't fit well in the LLM world; I would say it's better to use context or additional_context. Also, format_instructions reads to me as if it is only for formatting the LLM output.

How about the following:

  • doc -> test_info
  • format_instructions -> user_instructions (more general)
  • .... -> system_instructions (allows additional system instructions to be added to the system prompt)
  • knowledge -> additional_context or context

The run_test signature would then be:
vm.tests.run_test(
    test_id="validmind.model_validation.ragas.Faithfulness",
    inputs={"dataset": vm_dataset},
    params={
        "abc": "xyz",
    },
    llm_inputs={
        "test_info": "xyz",
        "system_instructions": "xyz",
        "user_instructions": "xyz",
        "additional_context": "xyz",
    },
)

This is up for discussion so please put forward your opinions as well @cachafla, @juanmleng and @johnwalz97.

Contributor

@cachafla cachafla left a comment


Awesome

@cachafla cachafla self-requested a review September 17, 2025 15:55
@cachafla
Contributor

Ah, removing my approval since I didn't see @AnilSorathiya's suggestion. Reading...

@validbeck
Collaborator

Can we change the filename to customize_test_result_descriptions.ipynb to match our style guide conventions? I will put the notebook on the list for more in-depth editing as well. :)

@kristof87

Great addition!


@kristof87 kristof87 left a comment


Looks great!

Contributor

@cachafla cachafla left a comment


Awesome! 🙌

@juanmleng
Contributor Author

Can we change the filename to customize_test_result_descriptions.ipynb to match our style guide conventions? I will put the notebook on the list for more in-depth editing as well. :)

Good point @validbeck. Done

@github-actions
Contributor

PR Summary

This PR introduces enhanced customization options for LLM-generated test descriptions in the ValidMind Library. The changes allow users to supply additional context through new parameters such as instructions and additional_context alongside an optional override for the test’s built-in documentation (test_description).

Key functional changes include:

  • Removal of two notebooks from one location and addition of a new notebook to showcase customized test result descriptions. These notebooks provide detailed examples for how to override default LLM descriptions by leveraging context parameters such as business rules, template structures, and mixed static/dynamic content.

  • Updates to the version numbers in the pyproject.toml and __version__.py files (from 2.9.4 to 2.9.5), which are isolated from the core logic modifications.

  • Modifications to the function generate_description in validmind/ai/test_descriptions.py to accept two new optional parameters: instructions and additional_context. These parameters are then forwarded to the LLM API call, allowing for more tailored output.

  • Changes in multiple functions across the library (e.g., get_result_description) to prioritize user-supplied context over the default global context and to ensure that custom instructions override environment settings where provided.

  • Enhancements in the test runner implementation (validmind/tests/run.py) to validate and extract context keys from a provided dictionary. The new helper function _validate_context verifies that only allowed keys are included and that their values are strings.
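
A minimal sketch of what that validation might look like (the allowed keys come from this PR; the actual implementation in validmind/tests/run.py may differ in details such as error types):

ALLOWED_CONTEXT_KEYS = {"test_description", "additional_context", "instructions"}

def _validate_context(context):
    """Check that `context` is a dict mapping allowed keys to string values."""
    if context is None:
        return {}
    if not isinstance(context, dict):
        raise ValueError("context must be a dictionary")
    invalid_keys = set(context) - ALLOWED_CONTEXT_KEYS
    if invalid_keys:
        raise ValueError(f"Invalid context keys: {sorted(invalid_keys)}")
    for key, value in context.items():
        if not isinstance(value, str):
            raise ValueError(f"Context value for '{key}' must be a string")
    return context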

Overall, the PR improves the flexibility of how test outputs are explained by the LLM and provides users an easier way to incorporate business logic and formatting requirements directly into the test result description generation workflow.

Test Suggestions

  • Write unit tests to verify that the new parameters instructions and additional_context are correctly passed through the call stack from the run_test function to generate_description.
  • Test the _validate_context function with valid and invalid context dictionaries (e.g., wrong key names, non-string values); see the sketch after this list.
  • Develop integration tests to run sample tests that use all three context parameters (test_description, instructions, additional_context) ensuring that the LLM (or its mock) outputs descriptions that incorporate these customizations.
  • Ensure backward compatibility by verifying that if no custom context is provided, the system falls back to the existing global settings.
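
Building on the _validate_context suggestion above, a hedged pytest sketch (it assumes the helper lives in validmind/tests/run.py as described and raises ValueError on invalid input; adjust to the actual behavior):

import pytest

from validmind.tests.run import _validate_context  # path as described in this PR

def test_validate_context_accepts_allowed_string_keys():
    # All three documented keys with string values should pass validation.
    _validate_context(
        {
            "test_description": "doc",
            "instructions": "fmt",
            "additional_context": "bg",
        }
    )

@pytest.mark.parametrize(
    "bad_context",
    [
        {"unknown_key": "value"},  # key outside the allowed set
        {"instructions": 123},     # non-string value
    ],
)
def test_validate_context_rejects_invalid_input(bad_context):
    with pytest.raises(ValueError):
        _validate_context(bad_context)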

@juanmleng juanmleng merged commit 4005b18 into main Sep 19, 2025
17 checks passed
@juanmleng juanmleng deleted the juan/sc-12193/enable-custom-test-result-description-output-structure branch September 19, 2025 08:19