Commit bfa4a33

chore: change heuristic names and rules (#1176)
Following the addition of the MINIMAL_CONTENT, UNSECURE_DESCRIPTION, and STUB_NAME heuristics, some refactoring to their use in the ProbLog rules has been done to ensure no false negatives occur.

Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
1 parent 8b3c4b5 commit bfa4a33

9 files changed: +120 -117 lines changed
src/macaron/malware_analyzer/README.md

Lines changed: 17 additions & 9 deletions
@@ -65,27 +65,35 @@ When a heuristic fails, with `HeuristicResult.FAIL`, then that is an indicator b
 > ```
 > The script will download the top 5000 PyPI packages and update the resource file automatically.

-11. **Fake Email**
+11. **Similar Projects**
+    - **Description**: Checks whether the maintainer(s) of the package have released other packages with close structural similarity.
+    - **Rule**: Return `HeuristicResult.FAIL` upon finding the first similar package. Return `HeuristicResult.PASS` if no similar packages are found.
+    - **Dependency**: None
+
+12. **Fake Email**
     - **Description**: Checks if the package maintainer or author has a suspicious or invalid email.
     - **Rule**: Return `HeuristicResult.FAIL` if the email is invalid; otherwise, return `HeuristicResult.PASS`.
     - **Dependency**: None.

-
-12. **Minimal Content**
-    - **Description**: Checks if the package has a small number of files.
-    - **Rule**: Return `HeuristicResult.FAIL` if the number of files is strictly less than FILES_THRESHOLD; otherwise, return `HeuristicResult.PASS`.
+13. **Type Stub File**
+    - **Description**: Checks if the package has a small number of `.pyi` stub files.
+    - **Rule**: Return `HeuristicResult.FAIL` if the number of `.pyi` files is strictly less than FILES_THRESHOLD; otherwise, return `HeuristicResult.PASS`.
     - **Dependency**: None.

-13. **Unsecure Description**
-    - **Description**: Checks if the package description is unsecure, such as not having a descriptive keywords that indicates its a stub package .
-    - **Rule**: Return `HeuristicResult.FAIL` if no descriptive word is found in the package description or summary ; otherwise, return `HeuristicResult.PASS`.
+14. **Package Description Intent**
+    - **Description**: Checks if the package description contains keywords indicating it is a stub package or dependency confusion prevention placeholder.
+    - **Rule**: Return `HeuristicResult.FAIL` if no keyword is found in the package description or summary; otherwise, return `HeuristicResult.PASS`.
     - **Dependency**: None.

+15. **Stub Name**
+    - **Description**: Checks if the package name contains the `"stub"` keyword, indicating that it is likely intended to be a stub package and not downloaded.
+    - **Rule**: Return `HeuristicResult.PASS` if the keyword `"stub"` is found in the package name; otherwise, return `HeuristicResult.FAIL`.
+
 ### Source Code Analysis with Semgrep
 **PyPI Source Code Analyzer**
 - **Description**: Uses Semgrep, with default rules written in `src/macaron/resources/pypi_malware_rules` and custom rules available by supplying a path to `custom_semgrep_rules` in `defaults.ini`, to scan the package `.tar` source code.
 - **Rule**: If any Semgrep rule is triggered, the heuristic fails with `HeuristicResult.FAIL` and subsequently fails the package with `CheckResultType.FAILED`. If no rule is triggered, the heuristic passes with `HeuristicResult.PASS` and the `CheckResultType` result from the combination of all other heuristics is maintained.
-- **Dependency**: Will be run if the Source Code Repo fails. This dependency can be bypassed by suppying `--force-analyze-source` in the CLI.
+- **Dependency**: Will be run if the Source Code Repo fails. This dependency can be bypassed by supplying `--force-analyze-source` in the CLI.

 This feature is currently a work in progress, and supports detection of code obfuscation techniques and remote exfiltration behaviors. It uses Semgrep OSS for detection. `defaults.ini` may be used to provide custom rules and exclude them:
 - `disabled_default_rulesets`: supply to this a comma separated list of the names of default Semgrep rule files (excluding the `.yaml` extension) to disable all rule IDs in that file.
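
The Stub Name rule in the README hunk above has the opposite polarity to what the name might suggest: a name that contains `"stub"` passes. A minimal sketch of that rule, assuming a local stand-in for `HeuristicResult` (its member values are not shown in this commit) and a hypothetical helper name; lowercasing the name before matching is also an assumption:

```python
from enum import Enum


class HeuristicResult(str, Enum):
    """Local stand-in for Macaron's HeuristicResult; the member values are assumed."""

    PASS = "PASS"
    FAIL = "FAIL"


def stub_name_result(package_name: str) -> HeuristicResult:
    """Hypothetical helper mirroring the README rule for the Stub Name heuristic."""
    # PASS: the name advertises a stub package. FAIL: it does not.
    if "stub" in package_name.lower():
        return HeuristicResult.PASS
    return HeuristicResult.FAIL
```

Read this way, `failed(stub_name)` in a ProbLog rule means "the name does not claim to be a stub package".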

src/macaron/malware_analyzer/pypi_heuristics/heuristics.py

Lines changed: 4 additions & 4 deletions
@@ -49,11 +49,11 @@ class Heuristics(str, Enum):
     #: Indicates that the package has a similar structure to other packages maintained by the same user.
     SIMILAR_PROJECTS = "similar_projects"

-    #: Indicates that the package has minimal content.
-    MINIMAL_CONTENT = "minimal_content"
+    #: Indicates that the package has minimal .pyi type stub files.
+    TYPE_STUB_FILE = "type_stub_file"

-    #: Indicates that the package's description is unsecure, such as not having a descriptive keywords.
-    UNSECURE_DESCRIPTION = "unsecure_description"
+    #: Indicates from the package's description it is intended to be used as a stub or placeholder package.
+    PACKAGE_DESCRIPTION_INTENT = "package_description_intent"

     #: Indicates that the package contains stub files.
     STUB_NAME = "stub_name"
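
Because `Heuristics` mixes in `str`, each member's `.value` is the plain identifier that the ProbLog rules interpolate. The sketch below shows this with a cut-down copy of the enum holding only the three members visible in this hunk; the `rule_body` string is illustrative and is not the project's full rule text:

```python
from enum import Enum


class Heuristics(str, Enum):
    """Cut-down copy of the enum: only the members visible in this hunk."""

    TYPE_STUB_FILE = "type_stub_file"
    PACKAGE_DESCRIPTION_INTENT = "package_description_intent"
    STUB_NAME = "stub_name"


# Interpolating .value yields the bare atom names used inside the rule text.
rule_body = (
    f"failed({Heuristics.TYPE_STUB_FILE.value}),\n"
    f"failed({Heuristics.PACKAGE_DESCRIPTION_INTENT.value})."
)
```

A rename of an enum member therefore has to be matched by a rename of its value wherever rule text or stored results refer to the old string.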

src/macaron/malware_analyzer/pypi_heuristics/metadata/unsecure_description.py renamed to src/macaron/malware_analyzer/pypi_heuristics/metadata/package_description_intent.py

Lines changed: 7 additions & 7 deletions
@@ -1,7 +1,7 @@
 # Copyright (c) 2024 - 2025, Oracle and/or its affiliates. All rights reserved.
 # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.

-"""This analyzer checks if a PyPI package has unsecure description."""
+"""This analyzer checks if a PyPI package is a stub or placeholder package, using its description and summary."""

 import logging
 import re
@@ -15,17 +15,17 @@
 logger: logging.Logger = logging.getLogger(__name__)


-class UnsecureDescriptionAnalyzer(BaseHeuristicAnalyzer):
-    """Check whether the package's description is unsecure."""
+class PackageDescriptionIntentAnalyzer(BaseHeuristicAnalyzer):
+    """Package description contains keywords indicating it is a stub package or dependency confusion prevention placeholder."""

     SECURE_DESCRIPTION_REGEX = re.compile(
-        r"\b(?:internal|private|stub|placeholder|dependency confusion|security|namespace protection|reserved|harmless|prevent)\b",
+        r"\b(?:stub|placeholder|dependency confusion|security|namespace protection|reserved|prevent)\b",
         re.IGNORECASE,
     )

     def __init__(self) -> None:
         super().__init__(
-            name="unsecure_description_analyzer", heuristic=Heuristics.UNSECURE_DESCRIPTION, depends_on=None
+            name="package_description_intent", heuristic=Heuristics.PACKAGE_DESCRIPTION_INTENT, depends_on=None
         )

     def analyze(self, pypi_package_json: PyPIPackageJsonAsset) -> tuple[HeuristicResult, dict[str, JsonType]]:
@@ -52,5 +52,5 @@ def analyze(self, pypi_package_json: PyPIPackageJsonAsset) -> tuple[HeuristicRes
         summary = json_extract(package_json, ["info", "summary"], str)
         data = f"{description} {summary}"
         if self.SECURE_DESCRIPTION_REGEX.search(data):
-            return HeuristicResult.PASS, {"message": "Package description is secure"}
-        return HeuristicResult.FAIL, {"message": "Package description is unsecure"}
+            return HeuristicResult.PASS, {"message": "Package description indicates a stub or placeholder package."}
+        return HeuristicResult.FAIL, {"message": "Package description does not indicate a stub or placeholder package."}
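
The effect of dropping `internal`, `private`, and `harmless` from the keyword list can be checked in isolation. The sketch below copies the updated pattern from this hunk; `indicates_stub` is a hypothetical helper that stands in for the `analyze` method, taking plain strings instead of the `json_extract` results:

```python
import re

# Pattern copied from the updated SECURE_DESCRIPTION_REGEX in this diff.
STUB_INTENT = re.compile(
    r"\b(?:stub|placeholder|dependency confusion|security|namespace protection|reserved|prevent)\b",
    re.IGNORECASE,
)


def indicates_stub(description: str = "", summary: str = "") -> bool:
    """Hypothetical helper: True corresponds to HeuristicResult.PASS."""
    # Same concatenation as the analyzer: description and summary joined by a space.
    data = f"{description} {summary}"
    return STUB_INTENT.search(data) is not None
```

Two consequences follow from the pattern itself. A description such as "A harmless package to prevent typosquatting attacks" still matches, through `prevent`, even though `harmless` is no longer a keyword. Because of the `\b` word boundaries, inflected forms such as "prevents" or "stubs" do not match.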

src/macaron/malware_analyzer/pypi_heuristics/metadata/similar_projects.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,8 @@ def __init__(self) -> None:
2424
super().__init__(
2525
name="similar_project_analyzer",
2626
heuristic=Heuristics.SIMILAR_PROJECTS,
27+
# TODO: these dependencies are used as this heuristic currently downloads many package sourcecode
28+
# tarballs. Refactoring this heuristic to run more efficiently means this should have depends_on=None.
2729
depends_on=[
2830
(Heuristics.EMPTY_PROJECT_LINK, HeuristicResult.FAIL),
2931
(Heuristics.ONE_RELEASE, HeuristicResult.FAIL),

src/macaron/malware_analyzer/pypi_heuristics/metadata/minimal_content.py renamed to src/macaron/malware_analyzer/pypi_heuristics/metadata/type_stub_file.py

Lines changed: 5 additions & 5 deletions
@@ -1,7 +1,7 @@
 # Copyright (c) 2024 - 2025, Oracle and/or its affiliates. All rights reserved.
 # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.

-"""This analyzer checks if a PyPI package has minimal content."""
+"""This analyzer checks if a PyPI package has minimal .pyi stub content."""

 import logging
 import os
@@ -15,15 +15,15 @@
 logger: logging.Logger = logging.getLogger(__name__)


-class MinimalContentAnalyzer(BaseHeuristicAnalyzer):
-    """Check whether the package has minimal content."""
+class TypeStubFileAnalyzer(BaseHeuristicAnalyzer):
+    """Check whether the package has minimal .pyi stub content."""

     FILES_THRESHOLD = 10

     def __init__(self) -> None:
         super().__init__(
-            name="minimal_content_analyzer",
-            heuristic=Heuristics.MINIMAL_CONTENT,
+            name="type_stub_file",
+            heuristic=Heuristics.TYPE_STUB_FILE,
             depends_on=None,
         )
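
The body of `analyze` is not part of this hunk, so the following is only a sketch of the rule as the README states it: count `.pyi` files and fail when the count is strictly below `FILES_THRESHOLD`. The helper names are hypothetical, and walking the unpacked source tree with `os.walk` is an assumption based on the module importing `os`:

```python
import os

FILES_THRESHOLD = 10  # value shown in the diff


def count_pyi_files(root: str) -> int:
    """Count files ending in .pyi anywhere under root."""
    count = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        count += sum(1 for name in filenames if name.endswith(".pyi"))
    return count


def has_minimal_stub_content(root: str) -> bool:
    """Hypothetical helper: True corresponds to HeuristicResult.FAIL."""
    # Strictly less than the threshold fails; exactly the threshold passes.
    return count_pyi_files(root) < FILES_THRESHOLD
```

The strict comparison matches the existing test names in this commit, where a package with exactly the threshold number of files is expected to pass.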

src/macaron/slsa_analyzer/checks/detect_malicious_metadata_check.py

Lines changed: 14 additions & 13 deletions
@@ -22,13 +22,15 @@
 from macaron.malware_analyzer.pypi_heuristics.metadata.empty_project_link import EmptyProjectLinkAnalyzer
 from macaron.malware_analyzer.pypi_heuristics.metadata.fake_email import FakeEmailAnalyzer
 from macaron.malware_analyzer.pypi_heuristics.metadata.high_release_frequency import HighReleaseFrequencyAnalyzer
-from macaron.malware_analyzer.pypi_heuristics.metadata.minimal_content import MinimalContentAnalyzer
 from macaron.malware_analyzer.pypi_heuristics.metadata.one_release import OneReleaseAnalyzer
+from macaron.malware_analyzer.pypi_heuristics.metadata.package_description_intent import (
+    PackageDescriptionIntentAnalyzer,
+)
 from macaron.malware_analyzer.pypi_heuristics.metadata.similar_projects import SimilarProjectAnalyzer
 from macaron.malware_analyzer.pypi_heuristics.metadata.source_code_repo import SourceCodeRepoAnalyzer
+from macaron.malware_analyzer.pypi_heuristics.metadata.type_stub_file import TypeStubFileAnalyzer
 from macaron.malware_analyzer.pypi_heuristics.metadata.typosquatting_presence import TyposquattingPresenceAnalyzer
 from macaron.malware_analyzer.pypi_heuristics.metadata.unchanged_release import UnchangedReleaseAnalyzer
-from macaron.malware_analyzer.pypi_heuristics.metadata.unsecure_description import UnsecureDescriptionAnalyzer
 from macaron.malware_analyzer.pypi_heuristics.metadata.wheel_absence import WheelAbsenceAnalyzer
 from macaron.malware_analyzer.pypi_heuristics.sourcecode.pypi_sourcecode_analyzer import PyPISourcecodeAnalyzer
 from macaron.malware_analyzer.pypi_heuristics.sourcecode.suspicious_setup import SuspiciousSetupAnalyzer
@@ -368,8 +370,8 @@ def run_check(self, ctx: AnalyzeContext) -> CheckResultData:
             TyposquattingPresenceAnalyzer,
             FakeEmailAnalyzer,
             SimilarProjectAnalyzer,
-            UnsecureDescriptionAnalyzer,
-            MinimalContentAnalyzer,
+            PackageDescriptionIntentAnalyzer,
+            TypeStubFileAnalyzer,
         ]

         # name used to query the result of all problog rules, so it can be accessed outside the model.
@@ -419,20 +421,17 @@ def run_check(self, ctx: AnalyzeContext) -> CheckResultData:
             failed({Heuristics.CLOSER_RELEASE_JOIN_DATE.value}),
             forceSetup.

-        % Package released with a name similar to a popular package.
+        % Package released recently with little detail, forcing setup.py to run, and suspected of typosquatting.
         {Confidence.HIGH.value}::trigger(malware_high_confidence_4) :-
             quickUndetailed,
             forceSetup,
-            failed({Heuristics.TYPOSQUATTING_PRESENCE.value}),
-            failed({Heuristics.STUB_NAME.value}).
+            failed({Heuristics.TYPOSQUATTING_PRESENCE.value}).

-        % Package released with dependency confusion .
+        % Package forces setup.py to run, has a high version number and is not intended to be a stub package.
         {Confidence.HIGH.value}::trigger(malware_high_confidence_5) :-
             forceSetup,
-            failed({Heuristics.MINIMAL_CONTENT.value}),
             failed({Heuristics.STUB_NAME.value}),
-            failed({Heuristics.ANOMALOUS_VERSION.value}),
-            failed({Heuristics.UNSECURE_DESCRIPTION.value}).
+            failed({Heuristics.ANOMALOUS_VERSION.value}).

         % Package released recently with little detail, with multiple releases as a trust marker, but frequent and with
         % the same code.
@@ -442,12 +441,14 @@ def run_check(self, ctx: AnalyzeContext) -> CheckResultData:
             failed({Heuristics.UNCHANGED_RELEASE.value}),
             passed({Heuristics.SUSPICIOUS_SETUP.value}).

-        % Package released recently with little detail and an anomalous version number for a single-release package.
+        % Package released recently with little detail and an anomalous version number for a single-release package. The
+        % package is not intended to be a stub package.
         {Confidence.MEDIUM.value}::trigger(malware_medium_confidence_2) :-
             quickUndetailed,
             failed({Heuristics.ONE_RELEASE.value}),
             failed({Heuristics.ANOMALOUS_VERSION.value}),
-            failed({Heuristics.UNSECURE_DESCRIPTION.value}).
+            failed({Heuristics.TYPE_STUB_FILE.value}),
+            failed({Heuristics.PACKAGE_DESCRIPTION_INTENT.value}).

         % Package has no links, one release or multiple quick releases, and a suspicious maintainer who recently
         % joined, has a fake email address, and other similarly-structured projects.
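
Each rule body above is a conjunction, so the two rewritten rules can be read as plain boolean functions. The sketch below ignores the confidence weights that ProbLog attaches to each trigger, treats `forceSetup` and `quickUndetailed` as precomputed booleans (their definitions are outside this hunk), and uses `"FAIL"`/`"PASS"` strings as stand-ins for `HeuristicResult`. The keys `"anomalous_version"` and `"one_release"` are assumed enum values; the other keys appear in this commit:

```python
from typing import Mapping

FAIL = "FAIL"  # stand-in for HeuristicResult.FAIL (assumed value)
PASS = "PASS"  # stand-in for HeuristicResult.PASS (assumed value)


def failed(results: Mapping[str, str], heuristic: str) -> bool:
    """Mirror of the failed(...) atom: the named heuristic returned FAIL."""
    return results.get(heuristic) == FAIL


def high_confidence_5(results: Mapping[str, str], force_setup: bool) -> bool:
    """Boolean reading of malware_high_confidence_5 after this commit."""
    return (
        force_setup
        and failed(results, "stub_name")
        and failed(results, "anomalous_version")
    )


def medium_confidence_2(results: Mapping[str, str], quick_undetailed: bool) -> bool:
    """Boolean reading of malware_medium_confidence_2 after this commit."""
    required = (
        "one_release",
        "anomalous_version",
        "type_stub_file",
        "package_description_intent",
    )
    return quick_undetailed and all(failed(results, name) for name in required)
```

Under this reading, a package whose description declares it a placeholder no longer escapes `malware_high_confidence_5`, since that rule now depends only on the name and version heuristics. The same package does escape `malware_medium_confidence_2`.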
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+# Copyright (c) 2024 - 2025, Oracle and/or its affiliates. All rights reserved.
+# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.
+
+"""Tests for the PackageDescriptionIntentAnalyzer heuristic."""
+
+from unittest.mock import MagicMock
+
+import pytest
+
+from macaron.errors import HeuristicAnalyzerValueError
+from macaron.malware_analyzer.pypi_heuristics.heuristics import HeuristicResult
+from macaron.malware_analyzer.pypi_heuristics.metadata.package_description_intent import (
+    PackageDescriptionIntentAnalyzer,
+)
+
+
+@pytest.fixture(name="analyzer")
+def analyzer_() -> PackageDescriptionIntentAnalyzer:
+    """Pytest fixture to create a PackageDescriptionIntentAnalyzer instance."""
+    return PackageDescriptionIntentAnalyzer()
+
+
+def test_no_info(analyzer: PackageDescriptionIntentAnalyzer, pypi_package_json: MagicMock) -> None:
+    """Test the analyzer raises an error when no package info is found."""
+    pypi_package_json.package_json = {}
+    with pytest.raises(HeuristicAnalyzerValueError):
+        analyzer.analyze(pypi_package_json)
+
+
+@pytest.mark.parametrize(
+    ("metadata", "expected_result"),
+    [
+        pytest.param(
+            {"description": "A harmless package to prevent typosquatting attacks"},
+            HeuristicResult.PASS,
+            id="test_harmless_package_description",
+        ),
+        pytest.param(
+            {"summary": "placeholder package to prevent dependency confusion attacks"},
+            HeuristicResult.PASS,
+            id="test_harmless_package_summary",
+        ),
+        pytest.param(
+            {"description": "A regular public package", "summary": "does regular things"},
+            HeuristicResult.FAIL,
+            id="test_no_intention",
+        ),
+    ],
+)
+def test_analyze_scenarios(
+    analyzer: PackageDescriptionIntentAnalyzer,
+    pypi_package_json: MagicMock,
+    metadata: dict,
+    expected_result: HeuristicResult,
+) -> None:
+    """Test the analyzer with various metadata scenarios."""
+    pypi_package_json.package_json = {"info": metadata}
+    result, _ = analyzer.analyze(pypi_package_json)
+    assert result == expected_result

tests/malware_analyzer/pypi/test_minimal_content.py renamed to tests/malware_analyzer/pypi/test_type_stub_file.py

Lines changed: 12 additions & 12 deletions
@@ -1,24 +1,24 @@
 # Copyright (c) 2024 - 2025, Oracle and/or its affiliates. All rights reserved.
 # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.

-"""Tests for the MinimalContentAnalyzer heuristic."""
+"""Tests for the TypeStubFileAnalyzer heuristic."""

 from unittest.mock import MagicMock, patch

 import pytest

 from macaron.errors import SourceCodeError
 from macaron.malware_analyzer.pypi_heuristics.heuristics import HeuristicResult
-from macaron.malware_analyzer.pypi_heuristics.metadata.minimal_content import MinimalContentAnalyzer
+from macaron.malware_analyzer.pypi_heuristics.metadata.type_stub_file import TypeStubFileAnalyzer


 @pytest.fixture(name="analyzer")
-def analyzer_() -> MinimalContentAnalyzer:
-    """Pytest fixture to create a MinimalContentAnalyzer instance."""
-    return MinimalContentAnalyzer()
+def analyzer_() -> TypeStubFileAnalyzer:
+    """Pytest fixture to create a TypeStubFileAnalyzer instance."""
+    return TypeStubFileAnalyzer()


-def test_analyze_sufficient_files_pass(analyzer: MinimalContentAnalyzer, pypi_package_json: MagicMock) -> None:
+def test_analyze_sufficient_files_pass(analyzer: TypeStubFileAnalyzer, pypi_package_json: MagicMock) -> None:
     """Test the analyzer passes when the package has sufficient files."""
     pypi_package_json.download_sourcecode.return_value = True
     pypi_package_json.package_sourcecode_path = "/fake/path"
@@ -30,7 +30,7 @@ def test_analyze_sufficient_files_pass(analyzer: MinimalContentAnalyzer, pypi_pa
     pypi_package_json.download_sourcecode.assert_called_once()


-def test_analyze_exactly_threshold_files_pass(analyzer: MinimalContentAnalyzer, pypi_package_json: MagicMock) -> None:
+def test_analyze_exactly_threshold_files_pass(analyzer: TypeStubFileAnalyzer, pypi_package_json: MagicMock) -> None:
     """Test the analyzer passes when the package has exactly the threshold number of files."""
     pypi_package_json.download_sourcecode.return_value = True
     pypi_package_json.package_sourcecode_path = "/fake/path"
@@ -41,7 +41,7 @@ def test_analyze_exactly_threshold_files_pass(analyzer: MinimalContentAnalyzer,
     assert result == HeuristicResult.PASS


-def test_analyze_insufficient_files_fail(analyzer: MinimalContentAnalyzer, pypi_package_json: MagicMock) -> None:
+def test_analyze_insufficient_files_fail(analyzer: TypeStubFileAnalyzer, pypi_package_json: MagicMock) -> None:
     """Test the analyzer fails when the package has insufficient files."""
     pypi_package_json.download_sourcecode.return_value = True
     pypi_package_json.package_sourcecode_path = "/fake/path"
@@ -52,7 +52,7 @@ def test_analyze_insufficient_files_fail(analyzer: MinimalContentAnalyzer, pypi_
     assert result == HeuristicResult.FAIL


-def test_analyze_no_files_fail(analyzer: MinimalContentAnalyzer, pypi_package_json: MagicMock) -> None:
+def test_analyze_no_files_fail(analyzer: TypeStubFileAnalyzer, pypi_package_json: MagicMock) -> None:
     """Test the analyzer fails when the package has no files."""
     pypi_package_json.download_sourcecode.return_value = True
     pypi_package_json.package_sourcecode_path = "/fake/path"
@@ -63,7 +63,7 @@ def test_analyze_no_files_fail(analyzer: MinimalContentAnalyzer, pypi_package_js
     assert result == HeuristicResult.FAIL


-def test_analyze_download_failed_raises_error(analyzer: MinimalContentAnalyzer, pypi_package_json: MagicMock) -> None:
+def test_analyze_download_failed_raises_error(analyzer: TypeStubFileAnalyzer, pypi_package_json: MagicMock) -> None:
     """Test the analyzer raises SourceCodeError when source code download fails."""
     pypi_package_json.download_sourcecode.return_value = False

@@ -84,8 +84,8 @@ def test_analyze_download_failed_raises_error(analyzer: MinimalContentAnalyzer,
         (15, HeuristicResult.PASS),
     ],
 )
-def test_analyze_various_file_counts(
-    analyzer: MinimalContentAnalyzer,
+def test_analyze_file_counts(
+    analyzer: TypeStubFileAnalyzer,
     pypi_package_json: MagicMock,
     file_count: int,
     expected_result: HeuristicResult,
