feat: add AI module for LLM interaction and a heuristic for checking code–docstring consistency #1121
Conversation
SYSTEM_PROMPT = """
You are a security expert analyzing a PyPI package. Determine if the package description is secure.
you must score between 0 and 100 based on the following criteria:
Can you tell me more about the types of responses models give to this prompt? My main questions are how the model assigns a score from 0 to 100. It is asked to look at several aspects (High-level description summary, benefit, how to install, how to use) and their consistency with each other. What happens if one of those sections is missing? Is that "less consistent"? Do any of the section comparisons get weighted more than the other (e.g. "how to use" vs "high-level description summary" has more weight than "how to use" vs "benefit")? If a description uses clear headings, is it going to be scored higher than one that does not, when both include equivalent information? If a package has a pretty bare-bones description with essentially no information on any of these headings, is it labelled inconsistent?
Well, I didn't specify more than that in the system prompt; as you can see, the model decides the overall score on its own. If needed, I can add examples of best and worst packages in the prompt to guide it better.
Right now, the model isn’t explicitly told to weigh one section over another; it just evaluates the consistency across all provided aspects (high-level description, benefit, how to install, how to use) and gives a score from 0 to 100. If one of those sections is missing, the model will likely consider it “less consistent” overall, which should lower the score.
The returned format looks like this:
{ "score": ... , "reason": ... }
Do you have any examples from existing benign packages? How consistent are the LLM results when we perform multiple runs on the same package (i.e., reproducibility and consistency in the LLM's results)?
pytest.skip("AI client disabled - skipping test")

def test_analyze_consistent_description_pass(
So it looks like we patch and hardcode the return values for analyzer.client.invoke. Isn't pass/fail mostly determined by that result though?
Yeah, that's exactly why I set it up this way: I wanted something fixed across all models, so patching and hardcoding the return values for analyzer.client.invoke ensures consistency. The pass/fail is indeed mostly determined by that result, and this approach keeps it stable so the outcome isn't affected by model variance.
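To make the pattern under discussion concrete, here is a minimal, self-contained sketch of that mocking setup (the analyze wrapper, threshold, and stand-in objects are illustrative, not the PR's actual test code):

```python
from types import SimpleNamespace
from unittest.mock import patch

# Stand-in objects so the sketch runs on its own; in the real tests these
# would come from fixtures and the actual analyzer class.
client = SimpleNamespace(invoke=lambda prompt: None)
analyzer = SimpleNamespace(client=client)

def analyze(description: str) -> bool:
    """Hypothetical pass/fail wrapper around the client call."""
    result = analyzer.client.invoke(description)
    return result["score"] >= 50  # illustrative threshold

def test_analyze_consistent_description_pass():
    mock_result = {"score": 90, "reason": "Sections are consistent."}
    # Hardcode the LLM reply so the outcome does not depend on model variance.
    with patch.object(analyzer.client, "invoke", return_value=mock_result) as mock_invoke:
        assert analyze("some description")
        mock_invoke.assert_called_once()
```

With the reply hardcoded like this, the assertions mainly exercise the wiring and thresholding around the client rather than the model itself, which is what the follow-up question below is getting at.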
So what is being tested here? If we hardcode the return values from the analysis with mock_result, won't those assert statements always be true? What part of the functionality are we ensuring is behaving correctly?
mock_invoke.assert_called_once()

def test_analyze_inconsistent_description_fail(
Same here.
"The inconsistent code, or null." | ||
} | ||
/no_think |
Could you please explain what /no_think means in this context and why it's used in this prompt?
/no_think here is used to prevent "thinking" or reasoning steps from the model. This way, it skips generating internal reasoning chains and directly produces the output, which reduces response time.
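For readers unfamiliar with the directive: /no_think is a soft switch recognized by some models (Qwen3, for example) that suppresses the thinking phase for that request; models that do not support it simply see it as extra prompt text. A rough sketch of how it is typically appended (the prompt content here is illustrative):

```python
BASE_PROMPT = "Check whether the docstrings match the code below and reply in JSON."

# Qwen-style soft switches: /no_think skips the reasoning phase (faster),
# /think allows it (slower, possibly more accurate).
fast_prompt = BASE_PROMPT + "\n/no_think"
careful_prompt = BASE_PROMPT + "\n/think"
```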
How does this affect the quality of the response the LLM produces? How large is the tradeoff between response time and the accuracy and correctness of the LLM's response?
"""Check whether the docstrings and the code components are consistent.""" | ||
|
||
SYSTEM_PROMPT = """ | ||
You are a code master who can detect the inconsistency of the code with the docstrings that describes its components. |
Do you have any results from running this on existing PyPI packages, and if so, how did the LLM perform?
logger: logging.Logger = logging.getLogger(__name__)

T = TypeVar("T", bound=BaseModel)
Is the T type used in any other places? I can't seem to find any usage; what is the use case for this?
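For context, binding a TypeVar to BaseModel is the usual way to make a structured-output call generic over the caller's pydantic schema. A minimal sketch of that pattern, assuming the invoke signature hinted at in the README (all names below are illustrative, not the PR's actual implementation):

```python
from typing import TypeVar
from pydantic import BaseModel

T = TypeVar("T", bound=BaseModel)

def invoke(prompt: str, structured_output: type[T]) -> T:
    """Hypothetical structured-output call: parse the raw LLM reply into the caller's schema."""
    raw = '{"consistent": false, "inconsistent_code": "def add(a, b): ..."}'  # stand-in for the LLM reply
    # Because T is bound to BaseModel, the return type follows the schema the caller passes in.
    return structured_output.model_validate_json(raw)

class DocstringReport(BaseModel):
    consistent: bool
    inconsistent_code: str | None = None

report = invoke("Check the docstrings.", structured_output=DocstringReport)  # typed as DocstringReport
```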
"problog >= 2.2.6,<3.0.0", | ||
"cryptography >=44.0.0,<45.0.0", | ||
"semgrep == 1.113.0", | ||
"pydantic >= 2.11.5,<2.12.0", |
Is pydantic currently used? I can see it being used for the T type in openai_client.py, but I do not see it used elsewhere. The src/macaron/ai/README.md states it is passed as an argument to invoke, but I don't seem to see anywhere that is performed?
pyproject.toml
"cryptography >=44.0.0,<45.0.0", | ||
"semgrep == 1.113.0", | ||
"pydantic >= 2.11.5,<2.12.0", | ||
"gradio_client == 1.4.3", |
gradio_client appears to be unused.
Summary
This PR introduces an AI client for interacting with large language models (LLMs) and adds a new heuristic analyzer to detect inconsistencies between code and its docstrings.
Description of changes
- Added the AIClient class to enable seamless integration with LLMs.
- Added MatchingDocstringsAnalyzer, which detects inconsistencies between code and its docstrings.
- Updated the heuristics.py registry.
- Updated detect_malicious_metadata_check.py to include and run this new heuristic (a rough sketch of the overall flow is shown below).
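A standalone sketch of that flow (hypothetical names, not Macaron's actual heuristic interface): the analyzer sends the source with its docstrings to the AI client and maps the structured reply to a pass/fail result.

```python
from pydantic import BaseModel

class ConsistencyReport(BaseModel):
    consistent: bool
    inconsistent_code: str | None = None  # "The inconsistent code, or null."

def docstring_heuristic(client, source_code: str) -> bool:
    """Hypothetical heuristic: ask the LLM whether code and docstrings agree."""
    prompt = f"Do the docstrings describe the code below correctly?\n{source_code}\n/no_think"
    report = client.invoke(prompt, structured_output=ConsistencyReport)
    # False means the heuristic flags the package for further inspection.
    return report.consistent
```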
Related issues
None
Checklist
The verified label should appear next to all of your commits on GitHub.