feat: add AI module for LLM interaction and a heuristic for checking code–docstring consistency #1121
base: main
Conversation
…code–docstring consistency Signed-off-by: Amine <amine.raouane@enim.ac.ma>
SYSTEM_PROMPT = """
You are a security expert analyzing a PyPI package. Determine if the package description is secure.
you must score between 0 and 100 based on the following criteria:
Can you tell me more about the types of responses models give to this prompt? My main question is how the model assigns a score from 0 to 100. It is asked to look at several aspects (high-level description summary, benefit, how to install, how to use) and their consistency with each other. What happens if one of those sections is missing? Is that "less consistent"? Are any of the section comparisons weighted more heavily than others (e.g. does "how to use" vs. "high-level description summary" carry more weight than "how to use" vs. "benefit")? If a description uses clear headings, will it be scored higher than one that does not, when both include equivalent information? And if a package has a bare-bones description with essentially no information under any of these headings, is it labelled inconsistent?
Well, I didn't specify more than that in the system prompt, as you can see; the model decides the overall score on its own. If needed, I can add examples of best and worst packages to the prompt to guide it better.
Right now, the model isn't explicitly told to weigh one section over another; it just evaluates the consistency across all provided aspects (high-level description, benefit, how to install, how to use) and gives a score from 0 to 100. If one of those sections is missing, the model will likely consider the description "less consistent" overall, which should lower the score.
The returned format looks like this:
{ "score": ... , "reason": ... }
pytest.skip("AI client disabled - skipping test")

def test_analyze_consistent_description_pass(
So it looks like we patch and hardcode the return values for analyzer.client.invoke. Isn't pass/fail mostly determined by that result, though?
Yeah, that's exactly why I set it up this way: I wanted something fixed across all models, so patching and hardcoding the return value of analyzer.client.invoke ensures consistency. The pass/fail is indeed mostly determined by that result, and this approach keeps it stable so the outcome isn't affected by model variance.
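As a concrete illustration of that setup, the sketch below shows the general shape of such a test; the analyzer stand-in, the canned reply, and the threshold are simplified placeholders rather than the exact fixtures in this PR.

```python
from unittest.mock import MagicMock, patch

def test_analyze_consistent_description_pass_sketch() -> None:
    """Pin the LLM reply so the heuristic's verdict is deterministic."""
    analyzer = MagicMock()  # stand-in for the real analyzer fixture in the PR
    canned_reply = {"score": 90, "reason": "description matches the code"}

    # Patch the client's invoke() so no real model is called; every run of the
    # test sees the same reply regardless of which backend is configured.
    with patch.object(analyzer.client, "invoke", return_value=canned_reply) as mock_invoke:
        reply = analyzer.client.invoke("package description text")

    mock_invoke.assert_called_once()
    assert reply["score"] >= 70  # hypothetical pass threshold
```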
mock_invoke.assert_called_once()

def test_analyze_inconsistent_description_fail(
Same here.
"The inconsistent code, or null." | ||
} | ||
|
||
/no_think |
Could you please explain what /no_think means in this context and why it's used in this prompt?
/no_think here is used to prevent the model from producing "thinking" or reasoning steps. This way, it skips generating internal reasoning chains and directly produces the output, which reduces response time.
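For context, /no_think is a soft switch honored by some chat models (for example the Qwen3 family) when it appears in the prompt; models without that feature just see it as plain text. A minimal sketch of how a prompt might carry it; the template and function below are illustrative, not the PR's code.

```python
USER_PROMPT_TEMPLATE = """\
Compare the following code and docstring and report any inconsistency.

{code_snippet}

/no_think
"""

def build_prompt(code_snippet: str) -> str:
    # The trailing "/no_think" asks supporting models to skip the internal
    # reasoning ("thinking") block and answer directly, reducing latency.
    return USER_PROMPT_TEMPLATE.format(code_snippet=code_snippet)

prompt = build_prompt("def add(a, b):\n    '''Subtract b from a.'''\n    return a + b")
```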
"""Check whether the docstrings and the code components are consistent.""" | ||
|
||
SYSTEM_PROMPT = """ | ||
You are a code master who can detect the inconsistency of the code with the docstrings that describes its components. |
Do you have any results from running this on existing PyPI packages, and if so, how did the LLM perform?
Summary
This PR introduces an AI client for interacting with large language models (LLMs) and adds a new heuristic analyzer to detect inconsistencies between code and its docstrings.
Description of changes
- Introduced the AIClient class to enable seamless integration with LLMs.
- Added a new heuristic analyzer, MatchingDocstringsAnalyzer, that detects inconsistencies between code and its docstrings.
- Updated the heuristics.py registry.
- Updated detect_malicious_metadata_check.py to include and run this new heuristic.
Related issues
None
Checklist
- The verified label should appear next to all of your commits on GitHub.