
[SC-12193] Enable custom test result description output structure #425

Merged
juanmleng merged 5 commits into main from juan/sc-12193/enable-custom-test-result-description-output-structure
Sep 19, 2025

Conversation

@juanmleng
Contributor

@juanmleng juanmleng commented Sep 15, 2025

Pull Request Description

What

This PR adds comprehensive context management capabilities to run_test() by introducing a new context parameter that accepts a dictionary with three optional keys:

  • test_description: Allows users to override a test's built-in docstring with custom documentation
  • additional_context: Provides business context, thresholds, decision criteria, and any background information
  • instructions: Controls output formatting and presentation style

vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "model": vm_model,
        "dataset": vm_test_ds,
    },
    context={
        "test_description": "...",
        "instructions": "...",
        "additional_context": "...",
    },
)

Why

Users now have complete control over all aspects of context that drive LLM test description generation.

test_description Parameter:
Addresses the need for domain-specific and regulatory-compliant test documentation. By default, ValidMind tests include technical documentation about their statistical methodology, but for generic tests like histograms or descriptive statistics, this mechanical explanation often provides less value than understanding what variables or features are being analyzed and their business significance. This parameter allows users to replace generic methodology descriptions with meaningful explanations of the data being examined, override ValidMind's built-in documentation when different terminology or structure is preferred, and ensure regulatory compliance by using specific required language for test definitions.
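
For example, a minimal sketch of overriding the built-in docstring for a generic descriptive-statistics test (the test ID and feature names here are illustrative, not taken from this PR):

vm.tests.run_test(
    "validmind.data_validation.DescriptiveStatistics",  # illustrative generic test
    inputs={"dataset": vm_test_ds},
    context={
        "test_description": (
            "Summary statistics for the retail banking churn dataset. "
            "tenure_months and monthly_charges are the key features reviewed "
            "by the model risk team; values are reported in months and USD."
        ),
    },
)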

additional_context Parameter:
Allows users to provide any background information necessary for intelligent interpretation of test results. Rather than generating generic assessments, this parameter enables the LLM to understand the specific context that matters for each use case. Users can include business rules like performance thresholds and decision criteria, organizational context such as risk tolerances and regulatory requirements, real-time information like current dates or risk indicators, stakeholder priorities, model purpose and operational constraints, or anything else that helps the LLM interpret results within their specific business context. The parameter is deliberately a catch-all: users supply whatever background details are most relevant to their situation.
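
As a hedged sketch (the thresholds and wording below are illustrative, mirroring the example further down), business rules and point-in-time information can be passed as free-form text:

from datetime import date

additional_context = f"""
As of {date.today().isoformat()}:
- Model purpose: prioritize retention offers for retail banking customers.
- Decision criteria: AUC > 0.85 required for approval; recall > 0.50 needed
  for the retention program to be economically viable.
- Risk appetite: missed churners are costlier than false alarms.
"""

vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={"model": vm_model, "dataset": vm_test_ds},
    context={"additional_context": additional_context},
)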

instructions Parameter:
The instructions parameter gives users complete control over how test descriptions are formatted. This addresses the reality that different tests, document sections, or audiences may need different formats: concise summaries with clear recommendations, detailed methodology discussions, or the compliance-focused language that regulatory audiences expect. Users can create structured templates that ensure consistent organizational reporting standards, combine hardcoded mandatory text (such as policy references and disclaimers) with dynamic LLM analysis, and implement sophisticated formatting requirements that maintain professional presentation while leveraging AI-generated insights.
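
A sketch of a structured template that mixes mandatory boilerplate with LLM-generated analysis (the section headings and policy reference are illustrative):

instructions = """
Structure the description exactly as follows:
1. Executive Summary - two sentences, ending with an explicit go/no-go recommendation.
2. Key Metrics - a markdown table of the reported metrics.
3. Risk Rating - Low, Medium, or High, with a one-sentence justification.
Close with the verbatim disclaimer: "Prepared under Model Risk Policy MRP-001."
"""

vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={"model": vm_model, "dataset": vm_test_ds},
    context={"instructions": instructions},
)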

How to test

All three parameters working together

vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={"model": vm_model, "dataset": vm_test_ds},
    context={
        "test_description": "Customer churn assessment performance for retail banking applications. Class 0=retention, Class 1=churn",
        "additional_context": "AUC >0.85=APPROVE, Recall >50%=viable retention program", 
        "instructions": "Format as executive summary with clear go/no-go decision and risk rating"
    }
)

Backwards compatibility - environment variable still works (temporary)

os.environ["VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = "Format as executive summary"
vm.tests.run_test(...)  # Uses environment variable (will be deprecated)

Parameter overrides environment variable when both provided

os.environ["VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = "Old format"
vm.tests.run_test(..., context={"instructions": "New format"})  # Uses "New format"

What needs special review?

We need to decide on the backwards compatibility period. When are we removing support for the environment variable?

Dependencies, breaking changes, and deployment notes

  • The notebook notebooks/how_to/add_context_to_llm_descriptions.ipynb has been removed and replaced by notebooks/how_to/custom_test_result_descriptions.ipynb
  • No breaking changes. However, users should migrate from environment variables to parameters, as environment variable support will probably be removed in future releases.

Dependencies:

https://github.com/validmind/backend/pull/1963

Release notes

This release introduces comprehensive context management for test descriptions through a new context parameter in run_test() that accepts a dictionary with three optional keys:

  • test_description for overriding test documentation with domain-specific content
  • additional_context for providing business context, real-time information or any other background information relevant for the analysis of test results
  • instructions for controlling output formatting and style

These parameters give users complete control over how LLM-generated test descriptions are created. The existing environment variable approach (VALIDMIND_LLM_DESCRIPTIONS_CONTEXT) remains fully supported for backwards compatibility, with the instructions parameter taking precedence when both are provided. Users are encouraged to migrate to the new parameter-based approach, as environment variable support will be deprecated in future releases.

Checklist

  • What and why
  • Screenshots or videos (Frontend)
  • How to test
  • What needs special review
  • Dependencies, breaking changes, and deployment notes
  • Labels applied
  • PR linked to Shortcut
  • Unit tests added (Backend)
  • Tested locally
  • Documentation updated (if required)
  • Environment variable additions/changes documented (if required)

@juanmleng juanmleng self-assigned this Sep 15, 2025
@juanmleng juanmleng added the enhancement New feature or request label Sep 15, 2025
Contributor

@johnwalz97 johnwalz97 left a comment


lgtm

Contributor

@AnilSorathiya AnilSorathiya left a comment


For consistency purposes: we are using user_instructions as an input in the run_task function, while instructions has been used as a param, and the names differ as well.
Just wondering, can we bring consistency?

    result = vm.experimental.agents.run_task(
        task="code_explainer",
        input={
            "source_code": source_code,
            "user_instructions": user_instructions
        }
    )
    result.log(content_id=content_id)
    return result

@juanmleng
Contributor Author

Actually, I was thinking that perhaps we could consider consolidating the context parameters into a single dictionary, similar to inputs. This would logically group all context elements that drive the description customization:

Current approach:
run_test(doc="...", instructions="...", knowledge="...")

Alternative approach:
run_test(context={"doc": "...", "format_instructions": "...", "knowledge": "..."})

Regarding the naming of the parameters, I am open to suggestions. Perhaps instructions is too vague and format_instructions would be more targeted? But I'm happy to hear other suggestions.

With respect to the code explainer, perhaps we could follow a similar approach, where we separate "what to process" (inputs) vs "how to process it" (context).

@AnilSorathiya @johnwalz97 @cachafla @kristof87 Any thoughts?

@AnilSorathiya
Contributor

run_test(context={"doc": "...", "format_instructions": "...", "knowledge": "..."})

I like the idea of keeping the LLM parameters/inputs in a separate dictionary. It allows us to add more parameters without changing the run_test function signature, and it gives a clear separation between the test's inputs/params and the LLM inputs/params.
For parameter naming, in my opinion the word knowledge doesn't fit well in the LLM world; I would say it's better to use context or additional_context. Also, format_instructions reads to me as if it is only for formatting the LLM output.

How about the following:

  • doc -> test_info
  • format_instructions -> user_instructions (more general)
  • .... -> system_instructions (allows additional system instructions to be added to the system prompt)
  • knowledge -> additional_context or context

The run_test signature would then be:
vm.tests.run_test(
    test_id="validmind.model_validation.ragas.Faithfulness",
    inputs={"dataset": vm_dataset},
    params={
        "abc": "xyz",
    },
    llm_inputs={
        "test_info": "xyz",
        "system_instructions": "xyz",
        "user_instructions": "xyz",
        "additional_context": "xyz",
    },
)

This is up for discussion so please put forward your opinions as well @cachafla, @juanmleng and @johnwalz97.

Contributor

@cachafla cachafla left a comment


Awesome

@cachafla cachafla self-requested a review September 17, 2025 15:55
@cachafla
Contributor

Ah, removing my approval since I didn't see @AnilSorathiya's suggestion. Reading...

@validbeck
Collaborator

Can we change the filename to customize_test_result_descriptions.ipynb to match our style guide conventions? I will put the notebook on the list for more in-depth editing as well. :)

@kristof87

Great addition!


@kristof87 kristof87 left a comment


Looks great!

Contributor

@cachafla cachafla left a comment


Awesome! 🙌

@juanmleng
Contributor Author

Can we change the filename to customize_test_result_descriptions.ipynb to match our style guide conventions? I will put the notebook on the list for more in-depth editing as well. :)

Good point @validbeck. Done

@github-actions
Contributor

PR Summary

This PR introduces enhanced customization options for LLM-generated test descriptions in the ValidMind Library. The changes allow users to supply additional context through new parameters such as instructions and additional_context alongside an optional override for the test’s built-in documentation (test_description).

Key functional changes include:

  • Removal of two notebooks from one location and addition of a new notebook to showcase customized test result descriptions. These notebooks provide detailed examples for how to override default LLM descriptions by leveraging context parameters such as business rules, template structures, and mixed static/dynamic content.

  • Updates to the version numbers in the pyproject.toml and __version__.py files (from 2.9.4 to 2.9.5), which are isolated from the core logic modifications.

  • Modifications to the function generate_description in validmind/ai/test_descriptions.py to accept two new optional parameters: instructions and additional_context. These parameters are then forwarded to the LLM API call, allowing for more tailored output.

  • Changes in multiple functions across the library (e.g., get_result_description) to prioritize user-supplied context over the default global context and to ensure that custom instructions override environment settings where provided.

  • Enhancements in the test runner implementation (validmind/tests/run.py) to validate and extract context keys from a provided dictionary. The new helper function _validate_context verifies that only allowed keys are included and that their values are strings.
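
A minimal sketch of what that validation might look like (the allowed keys come from this PR; the actual implementation in validmind/tests/run.py may differ in details such as error types):

ALLOWED_CONTEXT_KEYS = {"test_description", "additional_context", "instructions"}

def _validate_context(context):
    """Check that `context` is a dict mapping allowed keys to string values."""
    if context is None:
        return {}
    if not isinstance(context, dict):
        raise ValueError("context must be a dictionary")
    invalid_keys = set(context) - ALLOWED_CONTEXT_KEYS
    if invalid_keys:
        raise ValueError(f"Invalid context keys: {sorted(invalid_keys)}")
    for key, value in context.items():
        if not isinstance(value, str):
            raise ValueError(f"Context value for '{key}' must be a string")
    return context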

Overall, the PR improves the flexibility of how test outputs are explained by the LLM and provides users an easier way to incorporate business logic and formatting requirements directly into the test result description generation workflow.

Test Suggestions

  • Write unit tests to verify that the new parameters instructions and additional_context are correctly passed through the call stack from the run_test function to generate_description.
  • Test the _validate_context function with valid and invalid context dictionaries (e.g., wrong key names, non-string values); see the sketch after this list.
  • Develop integration tests to run sample tests that use all three context parameters (test_description, instructions, additional_context) ensuring that the LLM (or its mock) outputs descriptions that incorporate these customizations.
  • Ensure backward compatibility by verifying that if no custom context is provided, the system falls back to the existing global settings.
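
Building on the _validate_context suggestion above, a hedged pytest sketch (it assumes the helper lives in validmind/tests/run.py as described and raises ValueError on invalid input; adjust to the actual behavior):

import pytest

from validmind.tests.run import _validate_context  # path as described in this PR

def test_validate_context_accepts_allowed_string_keys():
    # All three documented keys with string values should pass validation.
    _validate_context(
        {
            "test_description": "doc",
            "instructions": "fmt",
            "additional_context": "bg",
        }
    )

@pytest.mark.parametrize(
    "bad_context",
    [
        {"unknown_key": "value"},  # key outside the allowed set
        {"instructions": 123},     # non-string value
    ],
)
def test_validate_context_rejects_invalid_input(bad_context):
    with pytest.raises(ValueError):
        _validate_context(bad_context)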

@juanmleng juanmleng merged commit 4005b18 into main Sep 19, 2025
17 checks passed
@juanmleng juanmleng deleted the juan/sc-12193/enable-custom-test-result-description-output-structure branch September 19, 2025 08:19