
Conversation

AmineRaouane (Member)

Summary

This PR introduces an AI client for interacting with large language models (LLMs) and adds a new heuristic analyzer to detect inconsistencies between code and its docstrings.

Description of changes

  • Added an AIClient class to enable seamless integration with LLMs.
  • Implemented a new heuristic analyzer, MatchingDocstringsAnalyzer, that detects inconsistencies between code and its docstrings (a rough shape sketch follows this list).
  • Integrated the new heuristic into the heuristics.py registry.
  • Updated detect_malicious_metadata_check.py to include and run this new heuristic.
  • Added unit tests to verify correct detection of missing or mismatched docstrings.
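
For orientation, here is a minimal sketch of the shape such an analyzer could take. The HeuristicResult values, the enabled flag, and the invoke() signature are assumptions for illustration only and may not match the PR's actual code.

from enum import Enum


class HeuristicResult(Enum):
    """Assumed result type for illustration; Macaron's real heuristic machinery may differ."""

    PASS = "pass"
    FAIL = "fail"
    SKIP = "skip"


class MatchingDocstringsAnalyzer:
    """Flag code components whose docstrings do not match their implementation."""

    def __init__(self, ai_client) -> None:
        # ai_client is the new AIClient wrapper around an LLM endpoint.
        self.client = ai_client

    def analyze(self, source_code: str) -> HeuristicResult:
        if not getattr(self.client, "enabled", False):
            # Skip the heuristic entirely when the AI client is disabled.
            return HeuristicResult.SKIP
        verdict = self.client.invoke(user_prompt=source_code)  # assumed invoke() signature
        return HeuristicResult.FAIL if verdict.get("inconsistent") else HeuristicResult.PASS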

Related issues

None

Checklist

  • I have reviewed the contribution guide.
  • My PR title and commits follow the Conventional Commits convention.
  • My commits include the "Signed-off-by" line.
  • I have signed my commits following the instructions provided by GitHub. Note that we run GitHub's commit verification tool to check the commit signatures. A green verified label should appear next to all of your commits on GitHub.
  • I have updated the relevant documentation, if applicable.
  • I have tested my changes and verified they work as expected.

@oracle-contributor-agreement bot added the "OCA Verified" label (All contributors have signed the Oracle Contributor Agreement) on Jul 11, 2025
@behnazh-w requested a review from art1f1c3R on August 15, 2025 01:24

SYSTEM_PROMPT = """
You are a security expert analyzing a PyPI package. Determine if the package description is secure.
you must score between 0 and 100 based on the following criteria:
@art1f1c3R (Member), Aug 15, 2025:

Can you tell me more about the types of responses models give to this prompt? My main question is how the model assigns a score from 0 to 100. It is asked to look at several aspects (high-level description summary, benefit, how to install, how to use) and their consistency with each other. What happens if one of those sections is missing? Is that "less consistent"? Do any of the section comparisons get weighted more heavily than the others (e.g. "how to use" vs "high-level description summary" has more weight than "how to use" vs "benefit")? If a description uses clear headings, is it going to be scored higher than one that does not, when both include equivalent information? If a package has a pretty bare-bones description with essentially no information on any of these headings, is it labelled inconsistent?

AmineRaouane (Member, Author):

Well, I didn't specify more in the system prompt, as you can see; the model decides the overall score on its own. If needed, I can add examples of best and worst packages in the prompt to guide it better.

Right now, the model isn’t explicitly told to weigh one section over another; it just evaluates the consistency across all provided aspects (high-level description, benefit, how to install, how to use) and gives a score from 0 to 100. If one of those sections is missing, the model will likely consider it “less consistent” overall, which should lower the score.

The returned format looks like this:
{ "score": ... , "reason": ... }

Member:

Do you have any examples from existing benign packages? How consistent are the LLM results when we perform multiple runs on the same package (i.e., reproducibility and consistency in the LLM's results)?

pytest.skip("AI client disabled - skipping test")


def test_analyze_consistent_description_pass(
Member:

So it looks like we patch and hardcode the return values for analyzer.client.invoke. Isn't pass/fail mostly determined by that result though?

AmineRaouane (Member, Author):

Yeah, that's exactly why I set it up this way: I wanted something fixed across all models, so patching and hardcoding the return values for analyzer.client.invoke ensures consistency. The pass/fail is indeed mostly determined by that result, and this approach keeps the outcome stable so it isn't affected by model variance.
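
As a self-contained illustration of that pattern (the analyzer below is a stand-in rather than the real MatchingDocstringsAnalyzer, and the invoke() signature and pass threshold are assumptions):

from unittest.mock import MagicMock, patch


class _FakeAnalyzer:
    """Stand-in used only to illustrate the patch-and-assert pattern."""

    def __init__(self) -> None:
        self.client = MagicMock()  # plays the role of the AI client

    def analyze(self, code: str) -> bool:
        reply = self.client.invoke(user_prompt=code)  # assumed invoke() signature
        return reply["score"] >= 50  # assumed pass threshold


def test_patched_invoke_pins_the_verdict() -> None:
    analyzer = _FakeAnalyzer()
    fixed_reply = {"score": 90, "reason": "consistent"}
    with patch.object(analyzer.client, "invoke", return_value=fixed_reply) as mock_invoke:
        # The verdict is fully determined by the hardcoded reply, independent of any model backend.
        assert analyzer.analyze("def add(a, b): ...") is True
        mock_invoke.assert_called_once()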

Member:

So what is being tested here? If we hardcode the return values from the analysis with mock_result, won't those assert statements always be true? What part of the functionality are we ensuring is behaving correctly?

mock_invoke.assert_called_once()


def test_analyze_inconsistent_description_fail(
Member:

Same here.

"The inconsistent code, or null."
}
/no_think
Contributor:

Could you please explain what /no_think means in this context and why it's used in this prompt?

AmineRaouane (Member, Author):

/no_think here is used to suppress the model's "thinking"/reasoning steps. This way, it skips generating internal reasoning chains and directly produces the output, which reduces response time.
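
For context, here is a rough sketch of how such a soft switch is typically appended to the system prompt when calling an OpenAI-compatible endpoint. The endpoint URL and model name are placeholders; some reasoning-capable models (e.g. the Qwen3 family) recognize /think and /no_think in the prompt.

from openai import OpenAI

# Placeholder endpoint and model name; any OpenAI-compatible server would look similar.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

system_prompt = (
    "You are a code master who can detect the inconsistency of the code with the docstrings.\n"
    "Return only a JSON object with the fields described above.\n"
    "/no_think"  # ask the model to skip its internal reasoning phase and answer directly
)

response = client.chat.completions.create(
    model="qwen3-8b",  # placeholder
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": 'def add(a, b):\n    """Multiply a and b."""\n    return a + b'},
    ],
)
print(response.choices[0].message.content)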

Member:

How does this affect the quality of the response the LLM produces? How large is the tradeoff between response time and the accuracy and correctness of the LLM's response?

"""Check whether the docstrings and the code components are consistent."""

SYSTEM_PROMPT = """
You are a code master who can detect the inconsistency of the code with the docstrings that describes its components.
Member:

Do you have any results from running this on existing PyPI packages, and how the LLM performed?


logger: logging.Logger = logging.getLogger(__name__)

T = TypeVar("T", bound=BaseModel)
Member:

Is the T type used in any other places? I can't seem to find any usage, what is the use case for this?
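
For reference, the usual point of a BaseModel-bound TypeVar is to let an invoke-style call return whatever pydantic model the caller requests while keeping the return type precise. A minimal sketch, with the hard-coded reply standing in for the real LLM call and the function signature purely illustrative:

from typing import TypeVar

from pydantic import BaseModel

T = TypeVar("T", bound=BaseModel)


def invoke(user_prompt: str, structured_output: type[T]) -> T:
    """Illustrative only: call the LLM and parse its JSON reply into the caller-supplied model."""
    raw_reply = '{"score": 90, "reason": "consistent"}'  # stand-in for the real LLM call
    return structured_output.model_validate_json(raw_reply)


class Verdict(BaseModel):
    score: int
    reason: str


verdict = invoke("analyze this package", Verdict)  # type checkers infer verdict: Verdict
print(verdict.score)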

"problog >= 2.2.6,<3.0.0",
"cryptography >=44.0.0,<45.0.0",
"semgrep == 1.113.0",
"pydantic >= 2.11.5,<2.12.0",
Member:

Is pydantic currently used? I can see it being used for the T type in openai_client.py, but I do not see it used elsewhere. The src/macaron/ai/README.md states it is passed as an argument to invoke, but I can't find anywhere that this is actually done.

pyproject.toml Outdated
"cryptography >=44.0.0,<45.0.0",
"semgrep == 1.113.0",
"pydantic >= 2.11.5,<2.12.0",
"gradio_client == 1.4.3",
Member:

gradio_client appears to be unused.

@AmineRaouane force-pushed the matching-docstring branch 2 times, most recently from 165bbb6 to d9b52b1 on September 6, 2025 01:27
…code–docstring consistency

Signed-off-by: Amine <amine.raouane@enim.ac.ma>