feat: add AI module for LLM interaction and a heuristic for checking code–docstring consistency #1121
Conversation
SYSTEM_PROMPT = """
You are a security expert analyzing a PyPI package. Determine if the package description is secure.
you must score between 0 and 100 based on the following criteria:
Can you tell me more about the types of responses models give to this prompt? My main questions are how the model assigns a score from 0 to 100. It is asked to look at several aspects (High-level description summary, benefit, how to install, how to use) and their consistency with each other. What happens if one of those sections is missing? Is that "less consistent"? Do any of the section comparisons get weighted more than the other (e.g. "how to use" vs "high-level description summary" has more weight than "how to use" vs "benefit")? If a description uses clear headings, is it going to be scored higher than one that does not, when both include equivalent information? If a package has a pretty bare-bones description with essentially no information on any of these headings, is it labelled inconsistent?
Well, I didn't specify more than that in the system prompt; as you can see, the model decides the overall score on its own. If needed, I can add examples of best and worst packages in the prompt to guide it better.
Right now, the model isn’t explicitly told to weigh one section over another; it just evaluates the consistency across all provided aspects (high-level description, benefit, how to install, how to use) and gives a score from 0 to 100. If one of those sections is missing, the model will likely consider it “less consistent” overall, which should lower the score.
The returned format looks like this:
{ "score": ... , "reason": ... }
Do you have any examples from existing benign packages? How consistent are the LLM results when we perform multiple runs on the same package (i.e., reproducibility and consistency in the LLM's results)?
pytest.skip("AI client disabled - skipping test")

def test_analyze_consistent_description_pass(
So it looks like we patch and hardcode the return values for analyzer.client.invoke. Isn't pass/fail mostly determined by that result though?
Yeah, that's exactly why I set it up this way: I wanted something fixed across all models, so patching and hardcoding the return values for analyzer.client.invoke ensures consistency. The pass/fail is indeed mostly determined by that result, and this approach keeps it stable so the outcome isn't affected by model variance.
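To make the pattern under discussion concrete, here is a minimal, self-contained sketch of that mocking setup (the analyze wrapper, threshold, and stand-in objects are illustrative, not the PR's actual test code):

```python
from types import SimpleNamespace
from unittest.mock import patch

# Stand-in objects so the sketch runs on its own; in the real tests these
# would come from fixtures and the actual analyzer class.
client = SimpleNamespace(invoke=lambda prompt: None)
analyzer = SimpleNamespace(client=client)

def analyze(description: str) -> bool:
    """Hypothetical pass/fail wrapper around the client call."""
    result = analyzer.client.invoke(description)
    return result["score"] >= 50  # illustrative threshold

def test_analyze_consistent_description_pass():
    mock_result = {"score": 90, "reason": "Sections are consistent."}
    # Hardcode the LLM reply so the outcome does not depend on model variance.
    with patch.object(analyzer.client, "invoke", return_value=mock_result) as mock_invoke:
        assert analyze("some description")
        mock_invoke.assert_called_once()
```

With the reply hardcoded like this, the assertions mainly exercise the wiring and thresholding around the client rather than the model itself, which is what the follow-up question below is getting at.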
So what is being tested here? If we hardcode the return values from the analysis with mock_result, won't those assert statements always be true? What part of the functionality are we ensuring is behaving correctly?
mock_invoke.assert_called_once()

def test_analyze_inconsistent_description_fail(
Same here.
"The inconsistent code, or null." | ||
} | ||
/no_think |
Could you please explain what /no_think means in this context and why it's used in this prompt?
/no_think here is used to prevent "thinking" or reasoning steps from the model. This way, it skips generating internal reasoning chains and directly produces the output, which reduces response time.
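For readers unfamiliar with the directive: /no_think is a soft switch recognized by some models (Qwen3, for example) that suppresses the thinking phase for that request; models that do not support it simply see it as extra prompt text. A rough sketch of how it is typically appended (the prompt content here is illustrative):

```python
BASE_PROMPT = "Check whether the docstrings match the code below and reply in JSON."

# Qwen-style soft switches: /no_think skips the reasoning phase (faster),
# /think allows it (slower, possibly more accurate).
fast_prompt = BASE_PROMPT + "\n/no_think"
careful_prompt = BASE_PROMPT + "\n/think"
```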
How does this affect the quality of the response the LLM produces? How large is the tradeoff between response time and the accuracy and correctness of the LLM's response?
"""Check whether the docstrings and the code components are consistent.""" | ||
|
||
SYSTEM_PROMPT = """ | ||
You are a code master who can detect the inconsistency of the code with the docstrings that describes its components. |
Do you have any results from running this on existing PyPI packages, and if so, how did the LLM perform?
logger: logging.Logger = logging.getLogger(__name__)

T = TypeVar("T", bound=BaseModel)
Is the T type used in any other places? I can't seem to find any usage; what is the use case for this?
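For context, binding a TypeVar to BaseModel is the usual way to make a structured-output call generic over the caller's pydantic schema. A minimal sketch of that pattern, assuming the invoke signature hinted at in the README (all names below are illustrative, not the PR's actual implementation):

```python
from typing import TypeVar
from pydantic import BaseModel

T = TypeVar("T", bound=BaseModel)

def invoke(prompt: str, structured_output: type[T]) -> T:
    """Hypothetical structured-output call: parse the raw LLM reply into the caller's schema."""
    raw = '{"consistent": false, "inconsistent_code": "def add(a, b): ..."}'  # stand-in for the LLM reply
    # Because T is bound to BaseModel, the return type follows the schema the caller passes in.
    return structured_output.model_validate_json(raw)

class DocstringReport(BaseModel):
    consistent: bool
    inconsistent_code: str | None = None

report = invoke("Check the docstrings.", structured_output=DocstringReport)  # typed as DocstringReport
```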
"problog >= 2.2.6,<3.0.0", | ||
"cryptography >=44.0.0,<45.0.0", | ||
"semgrep == 1.113.0", | ||
"pydantic >= 2.11.5,<2.12.0", |
Is pydantic currently used? I can see it being used for the T type in openai_client.py, but I do not see it used elsewhere. The src/macaron/ai/README.md states it is passed as an argument to invoke, but I don't seem to see anywhere that is performed?
pyproject.toml
"cryptography >=44.0.0,<45.0.0", | ||
"semgrep == 1.113.0", | ||
"pydantic >= 2.11.5,<2.12.0", | ||
"gradio_client == 1.4.3", |
gradio_client appears to be unused.
Summary
This PR introduces an AI client for interacting with large language models (LLMs) and adds a new heuristic analyzer to detect inconsistencies between code and its docstrings.
Description of changes
- Added the AIClient class to enable seamless integration with LLMs.
- Added MatchingDocstringsAnalyzer, which detects inconsistencies between code and its docstrings.
- Updated the heuristics.py registry.
- Updated detect_malicious_metadata_check.py to include and run this new heuristic (a rough sketch of the overall flow is shown below).
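A standalone sketch of that flow (hypothetical names, not Macaron's actual heuristic interface): the analyzer sends the source with its docstrings to the AI client and maps the structured reply to a pass/fail result.

```python
from pydantic import BaseModel

class ConsistencyReport(BaseModel):
    consistent: bool
    inconsistent_code: str | None = None  # "The inconsistent code, or null."

def docstring_heuristic(client, source_code: str) -> bool:
    """Hypothetical heuristic: ask the LLM whether code and docstrings agree."""
    prompt = f"Do the docstrings describe the code below correctly?\n{source_code}\n/no_think"
    report = client.invoke(prompt, structured_output=ConsistencyReport)
    # False means the heuristic flags the package for further inspection.
    return report.consistent
```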
Related issues
None
Checklist
The verified label should appear next to all of your commits on GitHub.