feat(sbom): symlink-aware SBOM filesystem graph (fs_tree) #459

Open
willis89pr wants to merge 229 commits into main from fstree

willis89pr (Collaborator) commented Jul 17, 2025

Symlink-aware SBOM filesystem graph; fs_tree lookup across relationship plugins; path utils & tests

Summary

This PR makes relationship resolution symlink-aware and more accurate by introducing a first-class filesystem graph (fs_tree) inside the SBOM model and teaching the .NET/ELF/PE/Java plugins to resolve dependencies via exact path lookups before falling back to legacy heuristics. It also adds small ergonomic helpers (path utils), targeted logging, safer error handling, and a comprehensive test suite.

Motivation

Encoding the install tree and symlink edges in a graph lets us: (1) resolve by canonical path, (2) follow links deterministically, and (3) avoid spurious edges.

What changed

1) SBOM model: new fs_tree and helper APIs

  • Add fs_tree: nx.DiGraph tracking directory hierarchy and symlink edges (type="symlink", optional subtype="file|directory").

  • New path and lookup helpers:

    • _add_software_to_fs_tree() builds path hierarchy and tags nodes with software_uuid.
    • get_software_by_path() normalizes paths and resolves entries via fs_tree with symlink traversal.
    • get_symlink_sources_for_path() performs reverse traversal to find all symlinks pointing to a given target.
    • record_symlink(), _add_symlink_edge(), and expand_pending_dir_symlinks() / expand_pending_file_symlinks() handle immediate and deferred symlink creation.
    • record_hash_node() and get_hash_equivalents() track content-equivalent files via SHA-256 nodes.
    • inject_symlink_metadata() regenerates legacy-style fileNameSymlinks and installPathSymlinks fields from the graph.
  • Extend add_software_entries to merge duplicates, discover symlinks, attach Contains edges, and link identical hashes.

  • Split graph builders: build_rel_graph() for logical relationships; fs_tree is kept separate; filter out Path/symlink edges from to_dict_override().

  • Added docstrings and safety checks across all new helpers.
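
As a rough illustration of the helpers listed above, the sketch below models the filesystem graph with plain dictionaries instead of nx.DiGraph, so it runs without networkx. The class name FsTreeSketch, its fields, and the add_software method are hypothetical; only the method names record_symlink, get_software_by_path, and get_symlink_sources_for_path are taken from this PR, and their real signatures and behavior may differ.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional, Set


class FsTreeSketch:
    """Toy stand-in for the fs_tree graph (hypothetical, not the PR's code)."""

    def __init__(self) -> None:
        self.children: Dict[str, Set[str]] = {}  # directory -> child paths
        self.software_uuid: Dict[str, str] = {}  # leaf path -> software UUID
        self.symlinks: Dict[str, str] = {}       # link path -> target path

    def add_software(self, install_path: str, uuid: str) -> None:
        # Add a parent -> child edge for every path component, then tag the leaf.
        path = PurePosixPath(install_path)
        chain = [path, *path.parents]  # leaf first, root last
        for child, parent in zip(chain, chain[1:]):
            self.children.setdefault(parent.as_posix(), set()).add(child.as_posix())
        self.software_uuid[path.as_posix()] = uuid

    def record_symlink(self, link: str, target: str) -> None:
        self.symlinks[PurePosixPath(link).as_posix()] = PurePosixPath(target).as_posix()

    def _resolve(self, path: str) -> Optional[str]:
        # Repeatedly rewrite the longest prefix that is a recorded symlink.
        current = PurePosixPath(path).as_posix()
        seen: Set[str] = set()
        while current not in seen:
            seen.add(current)
            p = PurePosixPath(current)
            for prefix in [p, *p.parents]:
                key = prefix.as_posix()
                if key in self.symlinks:
                    rest = p.relative_to(prefix)
                    current = (PurePosixPath(self.symlinks[key]) / rest).as_posix()
                    break
            else:
                return current  # no symlink left on the path
        return None  # symlink cycle

    def get_software_by_path(self, path: str) -> Optional[str]:
        resolved = self._resolve(path)
        return self.software_uuid.get(resolved) if resolved else None

    def get_symlink_sources_for_path(self, target: str) -> List[str]:
        # Reverse lookup: every recorded link whose chain ends at the target.
        wanted = PurePosixPath(target).as_posix()
        return sorted(link for link in self.symlinks if self._resolve(link) == wanted)


tree = FsTreeSketch()
tree.add_software("/usr/lib/libfoo.so.1.2", "uuid-foo")
tree.record_symlink("/usr/lib/libfoo.so.1", "/usr/lib/libfoo.so.1.2")
tree.record_symlink("/usr/lib/libfoo.so", "/usr/lib/libfoo.so.1")
tree.record_symlink("/opt/app/lib", "/usr/lib")  # directory symlink
```

In this toy model a lookup through either the file-level chain or the directory symlink lands on the same tagged node, which is the property the plugins rely on when they query by path first.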

2) generate.py: symlink capture during crawl

  • Inject filename- and install-path symlinks into Software entries before adding them.
  • Record hash-based equivalence edges for symlink pairs.
  • After crawl, expand deferred symlinks and inject legacy-style symlink metadata using inject_symlink_metadata().
  • Debug logs added; broad exceptions tightened.
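
The hash-based equivalence mentioned above can be pictured as an index from SHA-256 digest to the paths that share it. The following is a minimal sketch under that assumption; HashIndexSketch and its fields are hypothetical, and the method names record_hash_node and get_hash_equivalents are borrowed from this PR without claiming to match their real signatures.

```python
import hashlib
from collections import defaultdict
from typing import DefaultDict, Dict, Set


class HashIndexSketch:
    """Toy index of content-equivalent paths (hypothetical, not the PR's code)."""

    def __init__(self) -> None:
        self.paths_by_hash: DefaultDict[str, Set[str]] = defaultdict(set)
        self.hash_by_path: Dict[str, str] = {}

    def record_hash_node(self, path: str, sha256: str) -> None:
        # Register the path under its digest and remember the reverse mapping.
        self.paths_by_hash[sha256].add(path)
        self.hash_by_path[path] = sha256

    def get_hash_equivalents(self, path: str) -> Set[str]:
        # Every other path recorded with the same digest.
        digest = self.hash_by_path.get(path)
        if digest is None:
            return set()
        return self.paths_by_hash[digest] - {path}


digest = hashlib.sha256(b"example binary contents").hexdigest()
index = HashIndexSketch()
index.record_hash_node("/usr/bin/tool", digest)
index.record_hash_node("/usr/local/bin/tool", digest)
index.record_hash_node("/usr/bin/other", hashlib.sha256(b"different").hexdigest())
```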

3) Relationship plugins

  • .NET (dotnet_relationship.py):

    • Unified path normalization with normalize_path.
    • Resolve via sbom.get_software_by_path, handle app.config (probing.privatePath, <codeBase href=...>), unmanaged imports via dotnetImplMap.
    • Added culture/version filtering; clarified that Culture is used for filtering, not probing subdirs.
    • Fixed probe directory construction to normalize consistently.
  • ELF (elf_relationship.py):

    • Introduced structured docstrings and debug logs.
    • RPATH/RUNPATH resolution respects DF_1_NODEFLIB.
    • Expanded DST substitution with $ORIGIN and $LIB.
    • Phased resolution: fs_tree → legacy → heuristic.
    • New: standardized final-emission debug logs and clarified when no match is found, e.g. "[ELF][final] {dependent} Uses {name} → UUID={match} [phase]" or "... → no match".
    • New: condensed DF_1_NODEFLIB check and explicitly logs default search paths when the flag is not set.
    • New: replaced ad-hoc print with logger.debug for expanded runpaths.
  • PE (pe_relationship.py):

    • Uses fs_tree → legacy → same-directory match (Phase 2/3).
    • Clarified test case docstring and renamed misleading “symlink heuristic” test to same-directory match.
  • Java (java_relationship.py):

    • Corrected Phase 1 test to align importer path with fs_tree lookups.
    • Corrected Phase 3 heuristic test to require same parent dir + filename, with Phase 2 failing.
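
For the ELF items above, dynamic string token (DST) expansion of an RPATH/RUNPATH entry might look roughly like the sketch below. The function name expand_dst_sketch is hypothetical and is not the PR's substitute_all_dst. It also assumes that $LIB should be tried as both lib and lib64, since the target architecture may not be known during analysis, and it collapses ".." lexically, which ignores symlinked directories.

```python
import posixpath
from typing import Iterable, List


def expand_dst_sketch(
    entry: str, binary_path: str, lib_names: Iterable[str] = ("lib", "lib64")
) -> List[str]:
    """Expand $ORIGIN and $LIB in one search-path entry (illustrative only)."""
    # $ORIGIN is the directory containing the binary being analyzed.
    origin = posixpath.dirname(binary_path)
    # Handle the braced spelling first so "${ORIGIN}" is not left half-replaced.
    expanded = entry.replace("${ORIGIN}", origin).replace("$ORIGIN", origin)
    if "$LIB" not in expanded and "${LIB}" not in expanded:
        return [posixpath.normpath(expanded)]
    # Assumption: emit one candidate per possible library directory name.
    return [
        posixpath.normpath(expanded.replace("${LIB}", lib).replace("$LIB", lib))
        for lib in lib_names
    ]
```

Each returned candidate directory can then be joined with the dependency name and passed to a path lookup.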

4) Merge and graph hygiene

  • Skip Path nodes in root computation, merges, and relationship output.

5) New path utilities

  • surfactant/utils/paths.py

    • normalize_path(*parts) → str ensures consistent POSIX normalization across Windows/Unix.
    • basename_posix(path) → str.

6) Tests

  • .NET:

    • Removed redundant manual fs_tree.add_node().
    • Clarified test_dotnet_culture_subdir docstring (filtering only).
    • Added test_dotnet_heuristic_match for Phase 3.
  • ELF:

    • Switched from private _record_symlink to public API; expanded docstrings for clarity.
    • Updated example fixture: includes an explicit consumer (uuid-4-consumer) to exercise system-path fallback; fixture also demonstrates alias mapping for libalias.so.
    • Added test_symlink_heuristic_match_edge to force heuristic after clearing direct symlink edge.
  • PE:

    • Cleaned up commented fs_tree code.
    • Renamed and clarified same-directory test; added explicit path normalization assert.
  • Java:

    • Corrected Phase 1 test to actually hit fs_tree resolution.
    • Corrected Phase 3 test to properly require same-dir heuristic.
  • Added tests/utils/test_paths.py for normalize_path.

  • Added tests/sbomtypes/test_fs_tree.py to validate fs_tree population and lookup.

Risk & compatibility

  • Behavioral improvements only: JSON outputs exclude path/symlink edges.
  • fs_tree is internal (non-serialized).

Performance considerations

  • Linear path-tree construction, O(1) symlink recording.
  • Faster lookups due to direct node queries; legacy/heuristic only if needed.
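
The fall-through order (fs_tree, then legacy, then heuristic) can be sketched as a list of resolver callables tried in sequence. Everything in this example is hypothetical, including the function name resolve_in_phases and the dictionary-backed resolvers; it only illustrates why the later phases run only when the direct lookup misses.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Resolver = Callable[[str], Optional[str]]


def resolve_in_phases(
    dependency: str, phases: Iterable[Tuple[str, Resolver]]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (uuid, phase_name) from the first phase that finds a match."""
    for name, resolver in phases:
        match = resolver(dependency)
        if match is not None:
            return match, name
    return None, None


# Stand-in data sources for the three phases (assumed shapes).
fs_tree_index: Dict[str, str] = {"/usr/lib/libfoo.so": "uuid-foo"}
legacy_index: Dict[str, str] = {"/opt/legacy/libbar.so": "uuid-bar"}
same_dir_names: Dict[str, str] = {"libbaz.so": "uuid-baz"}

phases = [
    ("fs_tree", fs_tree_index.get),
    ("legacy", legacy_index.get),
    ("heuristic", lambda path: same_dir_names.get(path.rsplit("/", 1)[-1])),
]
```

Returning the phase name alongside the match also gives the debug log a natural place to report which path produced the result.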

Logging & DX

  • Concise logger.debug traces for resolution phases.
  • Failures to record symlinks warn, not error.

SBOM model note (post-deserialization)

  • New: after constructing from existing data, SBOM now rebuilds symlink and hash edges during __post_init__ so fs_tree lookups work consistently across load/merge workflows. This scans each installPath (and directory children) to re-register symlinks and hashes into both graphs.
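
The load-time rebuild described above follows a common dataclass pattern: derived indexes are excluded from the constructor and repopulated in __post_init__, so an object built from deserialized data behaves like one built incrementally. The sketch below shows that pattern with hypothetical SoftwareSketch and SbomSketch classes and simple dictionaries in place of the real graphs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SoftwareSketch:
    uuid: str
    install_paths: List[str]
    sha256: str = ""


@dataclass
class SbomSketch:
    software: List[SoftwareSketch] = field(default_factory=list)
    # Derived indexes: never passed in, always rebuilt from `software`.
    path_index: Dict[str, str] = field(default_factory=dict, init=False, repr=False)
    hash_index: Dict[str, List[str]] = field(default_factory=dict, init=False, repr=False)

    def __post_init__(self) -> None:
        # Runs after construction, including construction from loaded data.
        for sw in self.software:
            self._index(sw)

    def add_software(self, sw: SoftwareSketch) -> None:
        self.software.append(sw)
        self._index(sw)

    def _index(self, sw: SoftwareSketch) -> None:
        for path in sw.install_paths:
            self.path_index[path] = sw.uuid
        if sw.sha256:
            self.hash_index.setdefault(sw.sha256, []).append(sw.uuid)


loaded = SbomSketch(software=[SoftwareSketch("uuid-1", ["/usr/bin/tool"], "abc")])
loaded.add_software(SoftwareSketch("uuid-2", ["/usr/local/bin/tool"], "abc"))
```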

Test plan

pytest -q tests/relationships/test_dotnet_relationship.py
pytest -q tests/relationships/test_elf_relationship.py
pytest -q tests/relationships/test_java_relationship.py
pytest -q tests/relationships/test_pe_relationship.py
pytest -q tests/utils/test_paths.py
pytest -q tests/sbomtypes/test_fs_tree.py
pytest -q  # full suite

willis89pr and others added 14 commits July 17, 2025 14:35
- Add `fs_tree: nx.DiGraph` to `SBOM`, excluded from JSON serialization
- Populate `fs_tree` in SBOM constructor via `_add_software_to_fs_tree`, splitting each `installPath` into parent–child edges and tagging leaf nodes with `software_uuid`
- Introduce `SBOM._record_symlink(link, target, subtype)` to record symlink edges in both:
  - the main relationship graph (`MultiDiGraph`) with `type="symlink"`
  - the filesystem graph (`fs_tree`) with `type="symlink"` and optional `subtype` ("file" or "directory")
- Enhance `add_software_entries()` to scan each `installPath` and its immediate children for symlinks, invoking `_record_symlink` for both file- and directory–level symlinks
- Update `generate.py` to inject filename- and install-path symlinks into each `Software` entry before adding to SBOM, so they’re captured by `add_software_entries()`
- Refactor `elf_relationship` plugin to:
  - Prefer `fs_tree`–based `get_software_by_path()` lookups for ELF dependencies
  - Fall back to legacy `installPath` matching, then a directory-based symlink heuristic
  - Emit detailed `logger.debug()` statements (via Loguru) indicating which resolution path was used
  - Improve docstrings around RPATH/RUNPATH, DST substitution, and relationship phases
- Expand DST-handling helpers (`generate_search_paths`, `generate_runpaths`, `substitute_all_dst`) with clearer comments, normalization, and debug traces
- Update `.NET` relationship plugin to use `get_software_by_path` for absolute imports and cleaned-up probing logic
- Add comprehensive unit tests:
  - `tests/sbomtypes/test_fs_tree.py` to verify `fs_tree` population and `get_software_by_path`
  - `tests/relationships/test_elf_relationship.py` covering absolute, relative, system, origin, RPATH, and symlink heuristics
- Minor cleanup: prevent `fs_tree` from being serialized and remove unused whitespace

Add “# pylint: disable=redefined-outer-name” to the top of:
- tests/relationships/test_elf_relationship.py
- tests/sbomtypes/test_fs_tree.py

This silences warnings about pytest fixtures shadowing outer-scope names.

- Documented _add_software_to_fs_tree method with explanation of behavior, arguments, and side effects
- Enhanced safety: ensure final install path node exists before tagging
- Normalized install paths to POSIX format for consistency
- Added type hints for clarity
- No logic changes to other methods; only added minor inline comments and spacing
…ationship

- Introduced `normalize_path` utility in `surfactant.utils.paths` to standardize path handling across components.
- Replaced all raw `PurePosixPath` and `PureWindowsPath` calls with `normalize_path` in:
  - `SBOM` class (`_sbom.py`): install path processing, software lookup, and symlink handling.
  - `dotnet_relationship.py`: resolving absolute paths for dependency resolution.
- Added new utility module `utils.paths` and test suite `test_paths.py` to verify path normalization behavior across various cases.
- Removed redundant single-argument shortcut that bypassed normalization.
- Updated normalize_path() to explicitly replace backslashes in all path parts.
- Ensures consistent POSIX-style output for inputs like "C:\\Program Files\\App".
- Fixes test failures caused by improper handling of Windows-style paths.
…tion

- Replaced all manual `.as_posix()` conversions with `normalize_path(...)` to ensure consistent POSIX-style lookup keys.
- Normalized candidate paths used in `sbom.get_software_by_path()` during .NET relationship resolution.
- Updated codeBase path resolution to use structured path objects instead of prematurely stringifying.
- Refactored `get_dotnet_probedirs()` to normalize all output paths and avoid path handling inconsistencies.
- Added docstring to `get_dotnet_probedirs()` for clarity.

Fixes failing .NET relationship tests caused by inconsistent path formats in `installPath` vs lookup paths.

github-actions bot commented Jul 24, 2025

🧪 SBOM Results (16/16)

❗️ sample_sboms (Link)

Traceback (most recent call last):
  File "/home/runner/work/Surfactant/Surfactant/scripts/regressions.py", line 127, in generate_sbom_string
    ctx.invoke(
  File "/opt/hostedtoolcache/Python/3.10.19/x64/lib/python3.10/site-packages/click/core.py", line 824, in invoke
    return callback(*args, **kwargs)
  File "/home/runner/work/Surfactant/Surfactant/surfactant/cmd/generate.py", line 645, in sbom
    new_sbom.add_software_entries(entries, parent_entry=parent_entry)
  File "/home/runner/work/Surfactant/Surfactant/surfactant/sbomtypes/_sbom.py", line 772, in add_software_entries
    self.add_software(sw)
  File "/home/runner/work/Surfactant/Surfactant/surfactant/sbomtypes/_sbom.py", line 493, in add_software
    self._add_software_to_fs_tree(sw)
  File "/home/runner/work/Surfactant/Surfactant/surfactant/sbomtypes/_sbom.py", line 196, in _add_software_to_fs_tree
    parts = pathlib.PurePosixPath(norm_path).parts
NameError: name 'pathlib' is not defined

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/Surfactant/Surfactant/scripts/regressions.py", line 251, in test_gha
    new_data["sbom"] = generate_sbom_string(
  File "/home/runner/work/Surfactant/Surfactant/scripts/regressions.py", line 133, in generate_sbom_string
    raise RuntimeError(f"Failed to invoke SBOM generation: {e}") from e
RuntimeError: Failed to invoke SBOM generation: name 'pathlib' is not defined

The following samples fail with the same traceback as sample_sboms (NameError: name 'pathlib' is not defined, raised at surfactant/sbomtypes/_sbom.py line 196 in _add_software_to_fs_tree):

❗️ srectest_no1 (Link)
❗️ NET_app_config_test_no1 (Link)
❗️ uimage_files (Link)
❗️ mach_o_dylib_test_no1 (Link)
❗️ rpm_pkg_files (Link)
❗️ cpio_files (Link)
❗️ coff_files (Link)
❗️ Windows_dll_test_no1 (Link)
❗️ a_out_files (Link)
❗️ ELF_shared_obj_test_no1 (Link)
❗️ zstandard (Link)
❗️ cd_iso_files (Link)
❗️ mac_os_dmg (Link)
❗️ java_class_no1 (Link)

❗️ msitest_no1 (Link)

Traceback (most recent call last):
  File "/home/runner/work/Surfactant/Surfactant/scripts/regressions.py", line 127, in generate_sbom_string
    ctx.invoke(
  File "/opt/hostedtoolcache/Python/3.10.19/x64/lib/python3.10/site-packages/click/core.py", line 824, in invoke
    return callback(*args, **kwargs)
  File "/home/runner/work/Surfactant/Surfactant/surfactant/cmd/generate.py", line 645, in sbom
    new_sbom.add_software_entries(entries, parent_entry=parent_entry)
  File "/home/runner/work/Surfactant/Surfactant/surfactant/sbomtypes/_sbom.py", line 772, in add_software_entries
    self.add_software(sw)
  File "/home/runner/work/Surfactant/Surfactant/surfactant/sbomtypes/_sbom.py", line 493, in add_software
    self._add_software_to_fs_tree(sw)
  File "/home/runner/work/Surfactant/Surfactant/surfactant/sbomtypes/_sbom.py", line 196, in _add_software_to_fs_tree
    parts = pathlib.PurePosixPath(norm_path).parts
NameError: name 'pathlib' is not defined

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/Surfactant/Surfactant/scripts/regressions.py", line 251, in test_gha
    new_data["sbom"] = generate_sbom_string(
  File "/home/runner/work/Surfactant/Surfactant/scripts/regressions.py", line 133, in generate_sbom_string
    raise RuntimeError(f"Failed to invoke SBOM generation: {e}") from e
RuntimeError: Failed to invoke SBOM generation: name 'pathlib' is not defined

For commit e2caede (Run 22003120076)
Compared against commit 509252c (Run 21837526485)
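
All three tracebacks above fail at the same statement in `_add_software_to_fs_tree`, which suggests that `_sbom.py` uses `pathlib` without importing it. Below is a minimal sketch of the likely fix, assuming nothing else in the module shadows the name. `path_parts` is a hypothetical helper used only for illustration; it is not a function from the PR.

```python
import pathlib  # the module-level import the NameError indicates is missing


def path_parts(norm_path: str) -> tuple:
    """Split a normalized POSIX-style path into its components."""
    return pathlib.PurePosixPath(norm_path).parts


# Each prefix of the returned tuple corresponds to one directory level,
# which is the shape needed to build parent -> child edges.
print(path_parts("/usr/lib/libfoo.so"))
```

With the import present, the regression runs should get past this point in `add_software_entries`.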

willis89pr and others added 15 commits July 24, 2025 18:33
… heuristic test

- In `example_sbom` fixture, record a symlink from `/opt/alt/lib/libalias.so` to `/opt/alt/lib/libreal.so` for `sw8` to exercise the symlink handling logic
- Add a new parametrized test `test_symlink_heuristic_match_edge` that clears existing fs_tree entries and verifies that the heuristic correctly matches symlinked dependencies when no direct matches exist
…ionally removing fs_tree edge and node

Updated `test_symlink_heuristic_match_edge` to defensively check for the existence of the symlink edge and node
in `fs_tree` before attempting to remove them. This avoids `KeyError` raised by NetworkX when the edge does not exist,
ensuring the test remains stable even if the graph structure changes upstream.

Improves test resilience and correctness by explicitly targeting the intended symlink edge
(`/opt/alt/lib/libalias.so` → `/opt/alt/lib/libreal.so`).
…k and logging

- Updated `get_windows_pe_dependencies()` to use a modern three-phase resolution strategy:
  1. Primary: Exact path match using `sbom.get_software_by_path()` (fs_tree)
  2. Secondary: Legacy string-based matching on `installPath` and `fileName`
  3. Tertiary: Heuristic fallback using shared directories and `fileName` match

- Replaced `find_installed_software()` usage with normalized path lookups.
- Introduced detailed `loguru.debug()` logging to trace each match attempt and outcome.
- Enhanced `establish_relationships()` with structured import phase handling and debug output.
- Improved `has_required_fields()` using a cleaner `any(...)` check with docstring and type hint.
- Added full docstrings to clarify purpose and logic for maintainability.

These changes bring PE relationship handling in line with ELF and .NET plugins, ensuring consistency,
improved symlink resolution, and better match accuracy across Windows-style paths.
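
The three-phase order described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the plugin's code: `resolve_dependency`, the `path_index` dict (a stand-in for `sbom.get_software_by_path()`), and the plain-dict software entries are all hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional


def resolve_dependency(
    dep_name: str,
    probe_dirs: List[str],
    path_index: Dict[str, str],
    entries: List[dict],
) -> Optional[str]:
    """Return the UUID of the entry supplying dep_name, or None."""
    # Phase 1: exact path lookup (stand-in for the fs_tree lookup).
    for directory in probe_dirs:
        candidate = str(PurePosixPath(directory) / dep_name)
        if candidate in path_index:
            return path_index[candidate]

    # Phase 2: legacy string matching on installPath and fileName.
    for entry in entries:
        if dep_name not in entry["fileName"]:
            continue
        for directory in probe_dirs:
            if str(PurePosixPath(directory) / dep_name) in entry["installPath"]:
                return entry["UUID"]

    # Phase 3: heuristic fallback: the file name matches and the entry
    # shares a directory with one of the probe directories.
    normalized_dirs = {str(PurePosixPath(d)) for d in probe_dirs}
    for entry in entries:
        if dep_name not in entry["fileName"]:
            continue
        parents = {str(PurePosixPath(p).parent) for p in entry["installPath"]}
        if parents & normalized_dirs:
            return entry["UUID"]

    return None


entries = [
    {"UUID": "uuid-a", "fileName": ["a.dll"], "installPath": ["C:/App/a.dll"]},
    {"UUID": "uuid-b", "fileName": ["b.dll"], "installPath": ["C:/App/b_v2.dll"]},
]
path_index = {"C:/App/a.dll": "uuid-a"}
print(resolve_dependency("a.dll", ["C:/App"], path_index, entries))
print(resolve_dependency("b.dll", ["C:/App"], path_index, entries))
```

Here `a.dll` resolves in Phase 1, while `b.dll` only resolves in Phase 3 because its install path has a different basename.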
- Introduced test suite for `pe_relationship.py` covering:
  - Primary resolution via `fs_tree` using `get_software_by_path()`
  - Legacy fallback using `installPath` + `fileName` matching
  - Heuristic fallback using same-directory + filename pattern
  - Negative test case for unmatched DLLs
  - Unit test for `has_required_fields()` utility function

- Includes thorough docstrings and inline comments for clarity and maintainability.
- Ensures consistent behavior with ELF/.NET plugin resolution logic.

File added: tests/relationships/test_pe_relationship.py
…d tests

- Replaced legacy class-based resolution with dynamic 3-phase import matching:
  1. Exact path resolution via sbom.get_software_by_path() (fs_tree)
  2. Legacy fallback via installPath + fileName match
  3. Heuristic fallback via shared directory and filename

- Removed static _ExportDict and global class-to-UUID mapping
- Added detailed logging and comments for maintainability
- Introduced helper `class_to_path()` for FQCN to class file path

test:
- Added pytest suite covering all resolution phases:
  - fs_tree match
  - legacy installPath fallback
  - heuristic directory-based fallback
  - negative case with no match

New file: tests/relationships/test_java_relationship.py
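
The body of `class_to_path()` is not shown in this thread, and a later commit removes it as unused. As a hedged illustration of what "FQCN to class file path" usually means, a hypothetical implementation might look like this:

```python
def class_to_path(fqcn: str) -> str:
    """Map a fully qualified class name to a relative .class file path.

    Hypothetical sketch: package separators become directory separators;
    inner-class markers such as '$' are left untouched.
    """
    return fqcn.replace(".", "/") + ".class"


print(class_to_path("com.example.Foo"))
```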
@willis89pr
Collaborator Author

@nightlark
All recent comments need to be addressed.

Relationships:
Dotnet - Complete
ELF - Complete
PE - Reviewing whether Phase 2 matches legacy behavior; will double-check pytests afterward
Java - Need to review whether Phase 2 matches legacy behavior, then double-check pytests

willis89pr and others added 10 commits February 2, 2026 09:16
…tware; update tests

- Make PE Phase 2 a true legacy fallback by delegating to windows_utils.find_installed_software()
- Remove self-edge suppression to match legacy relationship emission behavior
- Clarify PE resolver docstrings around fs_tree symlink traversal and legacy probing
- Update tests to force Phase 2 execution and add directory-case mismatch regression
- Remove dotnet_relationship_legacy module
…M-scoped

- Tighten has_required_fields() to only accept dict metadata containing "javaClasses" to avoid type errors when non-dict metadata objects are present.
- Add SBOM-scoped caching for the export→supplier lookup table using a weakref to the SBOM instance, preventing cross-run/test state leakage while avoiding repeated rebuilds.
- Refactor Phase 2 fallback to mirror legacy export-dict behavior more closely by iterating javaClasses → javaImports directly (instead of building a set of imports).
- Leave Phase 1 (fs_tree path resolution) as an explicit TODO placeholder and document the intended resolution order (fs_tree first, legacy fallback second).
- Improve debug logging around legacy fallback resolution and final relationship emission.
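
The SBOM-scoped caching described above can be sketched as below. This is an assumption-laden illustration, not the plugin's `_ExportDict`: `SBOM` is a stand-in class, `ExportCache` and its `builds` counter are hypothetical, and the `javaExports` key on plain-dict entries is a simplification of the real metadata layout.

```python
import weakref
from typing import Dict, List, Optional


class SBOM:
    """Hypothetical stand-in for the real SBOM class."""

    def __init__(self, software: List[dict]):
        self.software = software


class ExportCache:
    """Cache an export -> supplier UUID table for one SBOM instance."""

    def __init__(self) -> None:
        self._sbom_ref: Optional[weakref.ref] = None
        self._exports: Dict[str, str] = {}
        self.builds = 0  # counts rebuilds, for demonstration only

    def get(self, sbom: SBOM) -> Dict[str, str]:
        # Dereference the weakref; it yields None once the SBOM is collected.
        cached = self._sbom_ref() if self._sbom_ref is not None else None
        if cached is not sbom:
            # Different (or collected) SBOM: rebuild instead of reusing state.
            self._exports = {
                cls: sw["UUID"]
                for sw in sbom.software
                for cls in sw.get("javaExports", [])
            }
            self._sbom_ref = weakref.ref(sbom)
            self.builds += 1
        return self._exports


cache = ExportCache()
first = SBOM([{"UUID": "uuid-1", "javaExports": ["com.example.Foo"]}])
second = SBOM([{"UUID": "uuid-2", "javaExports": ["com.example.Bar"]}])
cache.get(first)
cache.get(first)   # served from the cache
cache.get(second)  # triggers a rebuild for the new SBOM
```

Holding only a weak reference means the cache never keeps an SBOM alive, and comparing by identity avoids stale hits when a new SBOM reuses a freed object's `id()`.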
Annotate _ExportDict._sbom_ref as Optional[weakref.ref[SBOM]] so pylint recognizes the weakref as callable when checking the cached SBOM instance.
- Suppress pylint false-positive on weakref call in SBOM-scoped export cache
- Fix legacy debug tag to use [Java] instead of [PE]
- Remove unused class_to_path helper (Phase 1 placeholder remains)
…se 2 test

Remove unused java_class_path/test_sbom fixtures and the Phase 1 fs_tree test (Phase 1 intentionally not implemented). Rename the Phase 2 test to reflect the legacy export-dict fallback behavior.
willis89pr requested a review from nightlark February 3, 2026 21:31
Comment on lines +955 to +958
3. **Gathered Filename Aliases:** additional names from `sw.fileName`
that were injected during the gather phase but are not canonical
basenames of the install paths (e.g., bash-completion stubs like
"runuser" for "su").
Collaborator

Not sure when this last class would ever occur. If "runuser" is a small script just calling "su", then the hash won't match "su" and it will get its own "runuser" software entry (ideally at some point recognizing that it executes "su" so a "Runs" relationship can be created).

Comment on lines +1026 to +1032
primary_basenames = {PurePosixPath(p).name for p in (sw.installPath or [])}
file_name_extras = set(sw.fileName or []) - primary_basenames
if file_name_extras:
    file_symlinks |= file_name_extras
    logger.debug(
        f"[fs_tree] Added gathered filename aliases for {sw.UUID}: {sorted(file_name_extras)}"
    )
Collaborator

IIRC, the way we handle adding install paths, a file name entry for the basename is always added as well, so ideally this case would never be reached.
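
The set difference in the snippet under review can be exercised in isolation to see when it yields anything. A minimal sketch, assuming plain lists in place of `sw.installPath` and `sw.fileName`; `gathered_filename_aliases` is a hypothetical wrapper, not a function from the PR:

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional, Set


def gathered_filename_aliases(
    install_paths: Optional[Iterable[str]],
    file_names: Optional[Iterable[str]],
) -> Set[str]:
    """Return file names that are not the basename of any install path."""
    primary_basenames = {PurePosixPath(p).name for p in (install_paths or [])}
    return set(file_names or []) - primary_basenames


# Basename already recorded as a file name: nothing extra is found.
print(gathered_filename_aliases(["/usr/bin/su"], ["su"]))
# An extra name with no matching install path is reported as an alias.
print(gathered_filename_aliases(["/usr/bin/su"], ["su", "runuser"]))
```

If a basename entry is always added alongside each install path, only names injected by some other route end up in the result.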

# Skip path/symlink edges during merge as well
if str(rel_type).lower() == "symlink":
    continue
if sbom_m.graph.nodes.get(src, {}).get("type") == "Path":
Collaborator

Should we make all of the node types lower case for consistency ("path" and "hash" instead of "Path" and "Hash")?

Copilot AI left a comment

Pull request overview

This PR introduces a symlink-aware filesystem graph (fs_tree) inside the SBOM model and updates multiple relationship plugins to resolve dependencies via path-based lookups (with symlink traversal) before falling back to legacy matching. It also adds path normalization utilities and a broad set of new/updated tests around filesystem graph behavior and relationship resolution.

Changes:

  • Add SBOM fs_tree (directory hierarchy + symlink/hash edges) plus lookup/recording helpers (get_software_by_path, record_symlink, pending symlink expansion, legacy symlink metadata injection).
  • Update relationship plugins (.NET/ELF/PE/Java) to prefer fs_tree lookups (with logging and fallbacks), and update merge/generate flows for path/symlink handling.
  • Add new path utilities (normalize_path, basename_posix) and new tests validating path normalization, fs_tree population/lookup, and relationship resolution.

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| surfactant/sbomtypes/_sbom.py | Adds fs_tree, symlink/hash recording + traversal helpers, pending symlink expansion, and filters filesystem edges from serialized relationships. |
| surfactant/cmd/generate.py | Records symlinks/hashes during crawl and injects legacy symlink metadata derived from fs_tree. |
| surfactant/cmd/merge.py | Filters out Path nodes from root computation/system relationship attachment. |
| surfactant/utils/paths.py | Adds path normalization/basename helpers used across plugins and SBOM graph code. |
| surfactant/relationships/dotnet_relationship.py | Moves to fs_tree-first probing with legacy fallbacks; adds structured debug logging. |
| surfactant/relationships/elf_relationship.py | Adds fs_tree-first matching and clearer runpath/default-path logic with debug logging. |
| surfactant/relationships/pe_relationship.py | Adds fs_tree-first resolution (case-insensitive) with legacy fallback and debug logging. |
| surfactant/relationships/java_relationship.py | Makes export-dict caching SBOM-aware (weakref) and adds structured logging (fs_tree phase still TODO). |
| surfactant/relationships/_internal/windows_utils.py | Adds shared .NET probe-dir construction helper (get_dotnet_probedirs). |
| surfactant/output/cytrics_writer.py | Adds debug log when writing SBOM output. |
| tests/sbomtypes/test_fs_tree.py | New tests validating fs_tree construction, lookup, symlink traversal, and serialization filtering. |
| tests/utils/test_paths.py | New tests for normalize_path behavior and edge cases. |
| tests/relationships/test_dotnet_relationship.py | New .NET relationship tests covering multiple resolution paths. |
| tests/relationships/test_elf_relationship.py | New ELF relationship tests covering multiple scenarios (currently includes debug prints). |
| tests/relationships/test_pe_relationship.py | New PE relationship tests for fs_tree + fallbacks. |
| tests/relationships/test_java_relationship.py | New Java relationship tests for legacy export matching. |
| tests/symlink/test_resolve_links.py | Removes old symlink-resolution test (superseded by fs_tree behavior/tests). |
| tests/relationships/test_java.py | Removes old Java relationship test (replaced by new java_relationship tests). |
| .gitignore | Minor formatting change. |

Comment on lines +100 to +117
# Debug prints are helpful during bring-up, but can be noisy in CI.
# Keep them for now; if logs are cluttered, consider replacing with logger.debug or removing.
print(f"==== RUNNING: {label} ====")
sbom, case_map = example_sbom

# Retrieve the consumer under test and the expected supplier UUID
sw, expected_uuid = case_map[label]

# Pull the ELF metadata for this software (may include elfDependencies, elfRunpath/Rpath, etc.)
metadata = sw.metadata[0] if sw.metadata else {}
print("Dependency paths:", metadata.get("elfDependencies", []))
print("fs_tree nodes:", list(sbom.fs_tree.nodes))

# Optional trace: show how raw dependency strings normalize to POSIX and what fs_tree returns
for dep in metadata.get("elfDependencies", []):
    norm = pathlib.PurePosixPath(dep).as_posix()
    print(f"Trying lookup: {norm} ->", sbom.get_software_by_path(norm))

Copilot AI Feb 9, 2026

This test includes multiple print(...) statements (and comments suggesting to keep them) which will pollute CI output and make failures harder to read. Please remove these prints or convert them to logger.debug (or pytest's caplog) so test output stays clean by default.

Comment on lines +21 to +23
def test_trailing_slash_is_preserved():
    """Strip trailing slashes from non-root POSIX paths."""
    assert normalize_path("C:/App/") == "C:/App" # PosixPath strips trailing slashes
Copilot AI Feb 9, 2026

The test name test_trailing_slash_is_preserved contradicts the assertion (normalize_path("C:/App/") == "C:/App"), i.e., the trailing slash is not preserved. Rename the test (and/or adjust the docstring) so the name reflects the intended behavior being asserted.

Copilot generated this review using guidance from repository custom instructions.
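
One way to act on the rename request is sketched below. The real `normalize_path` lives in `surfactant/utils/paths.py` and its body is not shown in this thread, so the version here is a hypothetical stand-in that only reproduces the trailing-slash behavior the assertion relies on; the new test name is likewise just a suggestion.

```python
from pathlib import PurePosixPath


def normalize_path(path: str) -> str:
    """Hypothetical stand-in: convert to forward slashes and POSIX form."""
    return PurePosixPath(path.replace("\\", "/")).as_posix()


def test_trailing_slash_is_stripped():
    """Trailing slashes are stripped from non-root paths."""
    assert normalize_path("C:/App/") == "C:/App"


test_trailing_slash_is_stripped()
```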
Comment on lines 809 to 831
# Symlink capture under each installPath ---
for raw in sw.installPath or []:
    p = pathlib.Path(raw)

    # If the installPath itself is a symlink (file or dir)
    if p.is_symlink():
        real = p.resolve()
        subtype = "file" if not p.is_dir() else "directory"
        logger.debug(f"Found installPath symlink: {p} → {real} (subtype={subtype})")
        # Call the helper to record this symlink in fs_tree
        self._record_symlink(str(p), str(real), subtype=subtype)

    # If it's a directory, scan immediate children for symlinks
    if p.is_dir():
        for child in p.iterdir():
            if child.is_symlink():
                real = child.resolve()
                subtype = "file" if not child.is_dir() else "directory"
                logger.debug(
                    f"Found child symlink: {child} → {real} (subtype={subtype})"
                )
                self._record_symlink(str(child), str(real), subtype=subtype)

Copilot AI Feb 9, 2026

SBOM.add_software_entries() is calling pathlib.Path(raw).is_symlink()/is_dir()/iterdir() on values from Software.installPath. installPath represents the intended install location, not necessarily a real path on the machine running Surfactant, so this will (a) produce nondeterministic results depending on host filesystem contents, (b) potentially scan outside the extraction root, and (c) slow down SBOM construction/merge. Symlink discovery should be driven by the crawl/extractor inputs (e.g., the symlink maps built during generate and recorded via record_symlink) or by paths relative to the extraction directory, not by probing host paths in SBOM model code.

Suggested change
-# Symlink capture under each installPath ---
-for raw in sw.installPath or []:
-    p = pathlib.Path(raw)
-    # If the installPath itself is a symlink (file or dir)
-    if p.is_symlink():
-        real = p.resolve()
-        subtype = "file" if not p.is_dir() else "directory"
-        logger.debug(f"Found installPath symlink: {p} → {real} (subtype={subtype})")
-        # Call the helper to record this symlink in fs_tree
-        self._record_symlink(str(p), str(real), subtype=subtype)
-    # If it's a directory, scan immediate children for symlinks
-    if p.is_dir():
-        for child in p.iterdir():
-            if child.is_symlink():
-                real = child.resolve()
-                subtype = "file" if not child.is_dir() else "directory"
-                logger.debug(
-                    f"Found child symlink: {child} → {real} (subtype={subtype})"
-                )
-                self._record_symlink(str(child), str(real), subtype=subtype)
+# Symlink capture based on installPath is intentionally *not* performed here.
+# `installPath` represents the intended installation location and may not
+# correspond to real paths on the host running Surfactant. Probing the host
+# filesystem here would lead to nondeterministic behavior and potential
+# traversal outside the extraction root.
+#
+# Symlink information should instead be recorded during crawl/extraction
+# (e.g., via `record_symlink`) and expanded using the dedicated methods
+# such as `expand_pending_dir_symlinks` and `expand_pending_file_symlinks`.
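
The alternative this comment argues for, discovering symlinks while crawling the extraction directory rather than probing `installPath` on the host, might look roughly like the sketch below. It is an illustration only: `collect_symlinks`, `extract_root`, and `install_prefix` are hypothetical names, and the returned tuples are merely shaped so that they could be passed on to something like `record_symlink`.

```python
import os
import posixpath
from typing import List, Tuple


def collect_symlinks(
    extract_root: str, install_prefix: str = "/"
) -> List[Tuple[str, str, str]]:
    """Return (link, target, subtype) for symlinks found under extract_root.

    Paths are reported relative to install_prefix, and links whose targets
    resolve outside the extraction root are skipped.
    """
    root = os.path.realpath(extract_root)

    def to_install(path: str) -> str:
        rel = os.path.relpath(path, root).replace(os.sep, "/")
        return posixpath.join(install_prefix, rel)

    found = []
    # followlinks=False: symlinked directories are listed but not descended.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in sorted(dirnames + filenames):
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                continue
            target = os.path.realpath(full)
            if os.path.commonpath([root, target]) != root:
                continue  # target escapes the extraction root
            subtype = "directory" if os.path.isdir(target) else "file"
            found.append((to_install(full), to_install(target), subtype))
    return found
```

Because every path is derived from the extraction root, the result does not depend on what happens to exist at the same locations on the machine running the scan.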

Comment on lines +162 to +168
# Initialize fs_tree
self.fs_tree = nx.DiGraph()

# Populate from installPaths (if present)
for sw in self.software:
    self._add_software_to_fs_tree(sw)

Copilot AI Feb 9, 2026

__post_init__() currently only populates fs_tree from Software.installPath nodes. Because fs_tree (and symlink edges) are excluded from serialization, SBOMs loaded from JSON will not have any of the symlink edges/synthetic alias paths needed for get_software_by_path() symlink traversal (especially for directory-symlink-expanded child paths). If the intent is for fs_tree lookups to work post-deserialization/merge, __post_init__ needs a rebuild step that reconstructs symlink/hash equivalence from persisted data (e.g., installPathSymlinks metadata and/or existing installPath aliases) without probing the host filesystem.

Suggested change
-    # Initialize fs_tree
-    self.fs_tree = nx.DiGraph()
-    # Populate from installPaths (if present)
-    for sw in self.software:
-        self._add_software_to_fs_tree(sw)
+    # Initialize and rebuild the filesystem tree from any persisted metadata
+    self._rebuild_fs_tree_from_metadata()
+def _rebuild_fs_tree_from_metadata(self) -> None:
+    """
+    Rebuild the in-memory filesystem tree (fs_tree) from persisted SBOM metadata.
+    This method:
+    - Initializes a fresh NetworkX DiGraph for fs_tree.
+    - Re-adds all software install paths via _add_software_to_fs_tree.
+    - Reconstructs symlink and hash-equivalence relationships from any
+      available metadata on Software instances (e.g., installPathSymlinks),
+      without probing the host filesystem.
+    """
+    # Always start from a clean graph to avoid mixing prior state with
+    # deserialized data.
+    self.fs_tree = nx.DiGraph()
+    for sw in self.software:
+        # Primary install paths
+        self._add_software_to_fs_tree(sw)
+        # Optional symlink metadata: expected to be a mapping of
+        # {symlink_path: target_path}. We use getattr to avoid assuming that
+        # all Software instances provide this attribute.
+        symlinks = getattr(sw, "installPathSymlinks", None)
+        if not symlinks or not isinstance(symlinks, dict):
+            continue
+        for link_path, target_path in symlinks.items():
+            if not link_path or not target_path:
+                continue
+            # Normalize to a consistent POSIX-like form
+            norm_link = normalize_path(link_path)
+            norm_target = normalize_path(target_path)
+            # Ensure both the symlink path and its target path have their
+            # directory structure reflected in fs_tree.
+            for norm in (norm_link, norm_target):
+                parts = pathlib.PurePosixPath(norm).parts
+                for i in range(1, len(parts)):
+                    parent = normalize_path(*parts[:i])
+                    child = normalize_path(*parts[: i + 1])
+                    self.fs_tree.add_edge(parent, child)
+                if not self.fs_tree.has_node(norm):
+                    self.fs_tree.add_node(norm)
+            # Mark the symlink relationship explicitly. Direction is from
+            # the symlink path to the real target path.
+            self.fs_tree.add_edge(norm_link, norm_target, is_symlink=True)
+            # Associate the symlink node with the software UUID (for parity
+            # with primary install paths).
+            self.fs_tree.nodes[norm_link]["software_uuid"] = sw.UUID
+            # Ensure hash-equivalence works for the symlink path as well.
+            if getattr(sw, "sha256", None):
+                try:
+                    self.record_hash_node(norm_link, sw.sha256)
+                except Exception as e: # pylint: disable=broad-exception-caught
+                    logger.warning(
+                        f"[fs_tree] Failed to attach hash edge for symlink {norm_link}: {e}"
+                    )

# dotnetAssemblyRef must present, otherwise we have no info on .NET imports
"""
Check whether the metadata includes .NET assembly references.
"""
Copilot AI Feb 9, 2026

has_required_fields() does "dotnetAssemblyRef" in metadata without checking that metadata is a dict; if a plugin emits non-dict metadata entries (or None), this will raise and prevent relationship processing. Align this with other relationship plugins by guarding with isinstance(metadata, dict) before key checks.

Suggested change
-"""
+"""
+if not isinstance(metadata, dict):
+    return False
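
Put together, the guarded check might read as follows. This is a minimal sketch of the suggestion applied to a standalone function; the type hint and the final return expression are assumptions, since the rest of the real function is not shown in this thread.

```python
from typing import Any


def has_required_fields(metadata: Any) -> bool:
    """Check whether the metadata includes .NET assembly references."""
    # Non-dict entries (including None) cannot carry the key, and a bare
    # `in` test on None would raise TypeError.
    if not isinstance(metadata, dict):
        return False
    return "dotnetAssemblyRef" in metadata
```

With the guard in place, a non-dict metadata entry is skipped instead of aborting relationship processing.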

willis89pr and others added 6 commits February 13, 2026 14:38
Replace Unicode arrows (→, ↔), bullet points (•), smart quotes (' " "),
ellipsis (…), dashes (—, ‐), and other non-ASCII characters with their
ASCII equivalents to ensure compatibility with automated documentation
generation tools.

Addresses feedback from @nightlark in PR #459.

Co-Authored-By: Claude (claude-sonnet-4.5) <noreply@anthropic.com>
Remove "SurfActant plugin:" prefix from establish_relationships()
docstrings in dotnet_relationship and pe_relationship modules, as
the plugin nature is already clear from context.

Addresses feedback from @nightlark in PR #459 (r2784541293).

Co-Authored-By: Claude (claude-sonnet-4.5) <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Labels

enhancement New feature or request

2 participants