Skip to content

Implement Diagnostic Fault Library with basic DFM and SOVD interface#5

Open
bburda42dot wants to merge 9 commits intoeclipse-opensovd:mainfrom
bburda42dot:main
Open

Implement Diagnostic Fault Library with basic DFM and SOVD interface#5
bburda42dot wants to merge 9 commits intoeclipse-opensovd:mainfrom
bburda42dot:main

Conversation

@bburda42dot
Copy link

@bburda42dot bburda42dot commented Feb 25, 2026

Summary

Complete implementation of the Diagnostic Fault Library - a Rust library for managing diagnostic fault reporting, processing, and querying in Software-Defined Vehicles. Replaces the initial scaffold (src/lib.rs, api.rs, catalog.rs, etc.) with a production-grade multi-crate workspace aligned with the S-CORE module template.

What changed

Architecture - multi-crate workspace

  • Reorganized from a single flat crate into three workspace crates:
    • common - shared types: FaultId, FaultRecord, FaultCatalog, DebounceMode, IPC service types, compliance tags
    • fault_lib - reporter-side API: Reporter with debounce filtering, enabling-condition guards, IpcWorker with retry queue (exponential backoff), LogHook observability, FaultManagerSink
    • dfm_lib - Diagnostic Fault Manager: FaultRecordProcessor, AgingManager, SovdFaultManager with KVS-backed storage, EnablingConditionRegistry, OperationCycle provider abstraction
  • Added xtask crate for developer automation
  • Deleted original scaffold files (src/lib.rs, src/api.rs, src/model.rs, src/catalog.rs, src/config.rs, src/ids.rs, src/sink.rs, src/utils.rs)

Features

  • Reporter-side debounce filtering - CountWithinWindow, HoldTime, EdgeWithCooldown, CountThreshold modes
  • Enabling conditions - E2E flow: reporters register conditions, DFM evaluates before processing
  • IPC worker - iceoryx2-based transport with bounded channel, backpressure, and retry queue with exponential backoff
  • Fault aging & reset - policy-driven aging evaluation with operation cycle integration
  • SOVD fault API - typed status, counters, ISO 8601 timestamps, full FaultId variant support (Numeric/Text/Uuid)
  • Graceful shutdown - deadlock prevention via cooperative shutdown mechanism
  • Memory safety - replaced Box::leak with Cow<str>, bounded channels

Safety & quality

  • #[deny(clippy::unwrap_used)] enforced in runtime code - all todo!(), expect(), and unwrap() replaced with proper error handling
  • Raw TODO comments replaced with documented error paths
  • Comprehensive test suite: unit tests inlined in source, integration tests (tests/integration/) covering lifecycle transitions, multi-catalog scenarios, persistent storage, and report-query flows
  • Miri-compatible for memory safety validation

Project structure alignment

  • .bazelrc, MODULE.bazel, BUILD files for Bazel 8 support
  • .vscode/settings.json and extensions.json for development environment
  • .ruff.toml, .yamlfmt, rustfmt.toml for formatting consistency
  • Updated README.md with architecture overview, getting started, and examples
  • Issue and PR templates added

Checklist

  • I have tested my changes locally
  • I have added or updated documentation
  • I have linked related issues or discussions
  • I have added or updated tests

Related

This work is continuation of #4

Notes for Reviewers

Code is quite large, so it is better to review commit by commit. I split them into categories: "common", "fault-lib", "dfm" etc.

Migrate from single-crate layout to multi-crate workspace with
Bazel 8.3 + Cargo dual build system. Add xtask runner for common
development commands.
IPC-safe types (IpcDuration, IpcTimestamp), fault descriptors,
catalog configuration, debounce/enabling condition config,
query protocol definitions, and iceoryx2 service types.
Fault reporter API, IPC worker with exponential backoff retry,
fault catalog validation, enabling condition management, and
FaultManagerSink for iceoryx2 transport.
SOVD-compliant fault manager with KVS persistent storage, aging
manager, operation cycle tracking, fault record processor, and
query server with iceoryx2 IPC transport.
E2E tests covering lifecycle transitions, debounce/aging/cycles,
persistent storage, concurrent access, boundary values, error
paths, multi-catalog, JSON catalog loading, IPC query/clear,
and report-and-query flow.
Workflows: build/test, clippy lint, rustfmt, miri, coverage,
copyright header check, cargo audit (pinned to SHA), Bazel
format check. All workflows set permissions: contents: read.
…rence

Architecture overview, fault catalog/reporter/DFM sequence
diagrams, library architecture drawing, Sphinx docs scaffold,
and HVAC component design reference example.
@bburda42dot bburda42dot requested a review from a team as a code owner February 25, 2026 14:52
@vinodreddy-g
Copy link

vinodreddy-g commented Feb 26, 2026

@bburda42dot Just wanted to know,Why was this PR not started on top of the Initial commit in #4 from Qorix and started from scratch and moved all the files here , when it says continuation from #4?

@bburda42dot
Copy link
Author

@bburda42dot Just wanted to know,Why was this PR not started on top of the Initial commit in #4 from Qorix and started from scratch and moved all the files here , when it says continuation from #4?

@vinodreddy-g I did start on top of Qorix's initial commit from #4 - this PR is a direct continuation of that work. On top of the original ~4.9k lines, I added 63 commits (21k+ lines added, ~800 removed) with significant changes and improvements.

The resulting 64-commit history was hard to review as-is, so before opening this PR I squashed them all into a cleaner, logically grouped commit history specifically to enable commit-by-commit review. That squash is why the git history may look like it was started from scratch, but the code lineage traces directly back to #4.

If proper attribution is important to you, feel free to point out which parts of the current code originate from the original PR and I can add Co-Authored-By to the relevant commits.

@vinodreddy-g
Copy link

vinodreddy-g commented Feb 26, 2026

@bburda42dot Just wanted to know,Why was this PR not started on top of the Initial commit in #4 from Qorix and started from scratch and moved all the files here , when it says continuation from #4?

@vinodreddy-g I did start on top of Qorix's initial commit from #4 - this PR is a direct continuation of that work. On top of the original ~4.9k lines, I added 63 commits (21k+ lines added, ~800 removed) with significant changes and improvements.

The resulting 64-commit history was hard to review as-is, so before opening this PR I squashed them all into a cleaner, logically grouped commit history specifically to enable commit-by-commit review. That squash is why the git history may look like it was started from scratch, but the code lineage traces directly back to #4.

If proper attribution is important to you, feel free to point out which parts of the current code originate from the original PR and I can add Co-Authored-By to the relevant commits.

@bburda42dot ok so you split/changed the initial commit for easy review and added a lot of changes offcourse.
we had some design decisions/assumed some things in our initial implementation , wanted to know if these were discussed or aligned in some discussions in opensovd in last weeks like:
1 -The handling of fault catalog shown in fault_catalog.svg
2 - The interfaces between the lib and DFM shown in Registering new fault in the system.svg etc
3 -Behaviour of if fault doesn't exist in - new_fault.svg
4 - we currently had used iceoryx2 mostly directly and should it be moved to mw::com api to connect to S-core as next steps.

Could you update also the design changes/add in the svg/puml files to follow the new changes easily from #4 .

@FScholPer
Copy link

FScholPer commented Feb 26, 2026

To 4. we should start with what we have now (iceoryx2) later we can evaluate the migration to mw::com. For the artifacts potential next step(not now) could be using sphinx needs

@bburda42dot
Copy link
Author

ok so you split/changed the initial commit for easy review [...] wanted to know if these were discussed or aligned [...]

@vinodreddy-g Thanks for the detailed questions. These changes weren't discussed in OpenSOVD architecture meetings - they follow from the design doc requirements and the code review feedback on #4. Happy to discuss any of them in the next Architecture meeting if needed.

I've updated all diagrams in the latest force-push, so you can follow the design changes visually. Here's the breakdown:

1. Fault catalog (fault_catalog.svg)

Core idea is the same - builder pattern, SHA-256 hash verification with DFM, decentral catalogs. Main change: the original diagram had an UpdateFaultCatalogRequest path (hash mismatch -> push catalog to DFM). The current version is simpler - hash mismatch = Err(CatalogVerification), no catalog sync over IPC. Both sides load the same catalog files; a mismatch is a deployment error caught early. (The #4 code had check_fault_catalog() returning Result<bool> but the result was discarded in FaultApi::new(), so mismatches were silently ignored.)

FaultCatalog/FaultCatalogBuilder moved to the common crate (as flagged in review - dfm_lib depended on fault_lib just for catalog types).

2. Interfaces between lib and DFM (Registering new fault in the system.svg)

That diagram showed FaultMonitor + explicit register_fault() + get_fault() returning Some<Fault>/None. The #4 code was already heading in a different direction - it used Reporter (not FaultMonitor) and had no register_fault() IPC call. I kept the approach the code actually took: Reporter as per-fault handle, catalog hash handshake as implicit registration, iceoryx2 pub-sub services.

I removed Registering new fault in the system.svg since it was the only diagram without a .puml source and didn't match the implementation. new_fault.svg already covers the full registration + reporting flow.

On DFM side: I added DfmTransport trait (in dfm_lib/src/transport.rs) to abstract the IPC transport - the original FaultLibCommunicator used iceoryx2 directly. The FaultSinkApi trait on the reporter side was already there from #4 - I kept it.

3. Fault doesn't exist in catalog (new_fault.svg)

The #4 code had Reporter::new() returning Option<Self> - missing fault IDs returned None. I changed it to Result<Self, ReporterError> with a dedicated FaultIdNotFound(FaultId) variant, so the caller gets a clear error with the fault ID. Same fail-fast intent, more informative. The new_fault.puml/.svg are rewritten to match.

4. iceoryx2 vs mw::com

Agreed with @FScholPer - iceoryx2 for now, evaluate mw::com migration later. The transport is now isolated behind traits on both sides (FaultSinkApi + DfmTransport), so swapping means implementing those two traits with no business logic changes.

5. Diagram updates

All diagrams are now up to date in docs/puml/:

Diagram Status
fault_catalog.puml/.svg Updated - current init + hash verification flow
new_fault.puml/.svg Updated - ReporterApi::new() with Err(FaultIdNotFound) path
new_enable_condition.puml/.svg New - EC registration and report_status() flow
enable_condition_ntf.puml/.svg New - cross-app EC status notifications via DFM
local_enable_condition_ntf.puml/.svg New - local (same-app) EC notifications, fast path without IPC
query_clear.puml/.svg New - DFM query/clear IPC protocol
Registering new fault in the system.svg Removed - outdated, new_fault.svg covers this
lib_arch.drawio/.svg Updated - renamed FaultMgrClient to FaultManagerSink, fixed typo

Copy link

@lh-sag lh-sag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have stopped reviewing for now.
@bburda42dot will you wait until you got enough feedback or do you want to fix the mentioned issues asap?

@@ -0,0 +1,11 @@
---
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You override the organization wide templates defined here https://github.com/eclipse-opensovd/.github
Is this intended? I would rather have a consistent view.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't know about it, took an inspiration form S-CORE module template. Will fix it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, removed the repo-level issue templates entirely.

- name: Install cargo-llvm-cov
uses: taiki-e/install-action@cargo-llvm-cov

- name: Cache cargo registry & target
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please use https://github.com/Swatinem/rust-cache instead of manually rolling out rust caching?

You might also consider https://github.com/actions-rust-lang/setup-rust-toolchain which picks up the rust version from the rust-toolchain avoiding to maintain multiple versions across CI and local builds. And comes with builtin caching support.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, replaced dtolnay/rust-toolchain + actions/cache with actions-rust-lang/setup-rust-toolchain@v1 in all workflow files.

@@ -0,0 +1,37 @@
# *******************************************************************************
# Copyright (c) 2025 Contributors to the Eclipse Foundation
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are references to 2025 in the copyright header. Please use 2026 since this is a new project.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, some files were from #4 so I did not change the year, but I will fix it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, I couldn't change Bazel BUILD file's copyright year, because the s-core bazel copyright check was not passing with 2026 set when I ran it on the CI.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, setting 2026 in Bazel's files won't work because score_cr_checker is not allowing to do so.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed all non-Bazel files to 2026.

@@ -0,0 +1,221 @@
/*
* Copyright (c) 2026 The Contributors to Eclipse OpenSOVD (see CONTRIBUTORS)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have a mixture of various copyright banner.

I would prefer this one everywhere:

PDX-FileCopyrightText: Copyright (c) 2026 Contributors to the Eclipse Foundation
SPDX-License-Identifier: Apache-2.0

But please double check.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix all copyrights in new files to follow correct one and use 2026 year.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, unified the header to the standard format used across all .rs files.

@bburda42dot
Copy link
Author

bburda42dot commented Mar 2, 2026

I have stopped reviewing for now. @bburda42dot will you wait until you got enough feedback or do you want to fix the mentioned issues asap?

@lh-sag I think that we will end up with some follow up issues anyway, but it would be great to get enough feedback before merge. The only thing is "how much feedback is enough?" :)

bburda42dot added a commit to bburda42dot/fault-lib that referenced this pull request Mar 2, 2026
- Remove repo-level issue templates (.github/ISSUE_TEMPLATE/) to use
  org-wide templates from eclipse-opensovd/.github
- Replace dtolnay/rust-toolchain + actions/cache with
  actions-rust-lang/setup-rust-toolchain@v1 (reads rust-toolchain.toml,
  built-in caching) in all 5 workflow files
- Fix copyright year to 2026 in lint.yml, .gitignore, docs/conf.py,
  docs/index.rst, docs/design/design.md
- Unify copyright banner in hvac_component_design_reference.rs to match
  the xtask-enforced format used in all .rs files
bburda42dot added a commit to bburda42dot/fault-lib that referenced this pull request Mar 2, 2026
- Remove repo-level issue templates (.github/ISSUE_TEMPLATE/) to use
  org-wide templates from eclipse-opensovd/.github
- Replace dtolnay/rust-toolchain + actions/cache with
  actions-rust-lang/setup-rust-toolchain@v1 (reads rust-toolchain.toml,
  built-in caching) in all 5 workflow files
- Fix copyright year to 2026 in lint.yml, .gitignore, docs/conf.py,
  docs/index.rst, docs/design/design.md
- Unify copyright banner in hvac_component_design_reference.rs to match
  the xtask-enforced format used in all .rs files
bburda42dot added a commit to bburda42dot/fault-lib that referenced this pull request Mar 2, 2026
- Remove repo-level issue templates (.github/ISSUE_TEMPLATE/) to use
  org-wide templates from eclipse-opensovd/.github
- Replace dtolnay/rust-toolchain + actions/cache with
  actions-rust-lang/setup-rust-toolchain@v1 (reads rust-toolchain.toml,
  built-in caching) in all 5 workflow files
- Fix copyright year to 2026 in lint.yml, .gitignore, docs/conf.py,
  docs/index.rst, docs/design/design.md
- Unify copyright banner in hvac_component_design_reference.rs to match
  the xtask-enforced format used in all .rs files
- Remove repo-level issue templates (.github/ISSUE_TEMPLATE/) to use
  org-wide templates from eclipse-opensovd/.github
- Replace dtolnay/rust-toolchain + actions/cache with
  actions-rust-lang/setup-rust-toolchain@v1 (reads rust-toolchain.toml,
  built-in caching) in all 5 workflow files
- Fix copyright year to 2026 in lint.yml, .gitignore, docs/conf.py,
  docs/index.rst, docs/design/design.md
- Unify copyright banner in hvac_component_design_reference.rs to match
  the xtask-enforced format used in all .rs files
@lh-sag
Copy link

lh-sag commented Mar 2, 2026

I have stopped reviewing for now. @bburda42dot will you wait until you got enough feedback or do you want to fix the mentioned issues asap?

@lh-sag I think that we will end up with some follow up issues anyway, but it would be great to get enough feedback before merge. The only thing is "how much feedback is enough?" :)

We can decide next weeks architecture meeting how to proceed.

I will try to provide some feedback incrementally, whenever I have some spare time.

uses: actions-rust-lang/setup-rust-toolchain@v1

- name: Build all crates
run: cargo build --workspace
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add --locked to all cargo commands which support it?

This will speed up the build pipeline and allows for better reproducibility.

# SPDX-License-Identifier: Apache-2.0
# *******************************************************************************

name: Copyright Check
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: lint, coverage and copyright are available in https://github.com/eclipse-opensovd/cicd-workflows which should be re-used and extended if possible, so we don't re-invent pipelines for all repos.
It would be great to add missing things to the cicd repo instead, so all opensovd projects can profit.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just saw that this is using bazel for copyright check. While I do get why bazel is used in the fault lib, I still believe it makes sense to align the CI approaches in opensovd.

env:
COVERAGE_THRESHOLD: 90

jobs:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: coverage isn't available in https://github.com/eclipse-opensovd/cicd-workflows yet. Would be great to move it there, so it can be shared among all repos.

@@ -0,0 +1,344 @@
// Copyright (c) 2026 Contributors to the Eclipse Foundation
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: the copyright header differs from cda and proxy. Imho it would be a good idea to align them. (Would come for free when re-using ci/cd repo)

unsafe_code = "forbid"

[workspace.lints.clippy]
todo = "deny"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Again ci nag, in the architecture group a couple of week ago we aligned on common linting rules: https://github.com/eclipse-opensovd/cicd-workflows/tree/main/shared-lints

I.e. pedantic and index_slicing are missing here.
Adding todo to the shared rules makes sense imho.
The std_instead_of. can be kept here as they mostly make sense for the fault lib.

Cargo.toml Outdated
env_logger = "0.11.8"
iceoryx2 = { git = "https://github.com/eclipse-iceoryx/iceoryx2.git", rev = "eba5da4b8d8cb03bccf1394d88a05e31f58838dc" }
iceoryx2-bb-container = { git = "https://github.com/eclipse-iceoryx/iceoryx2.git", rev = "eba5da4b8d8cb03bccf1394d88a05e31f58838dc" }
log = "0.4.22"
Copy link

@alexmohr alexmohr Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: not aligned yet, but does it make sense to using 'tracing' everywhere instead of logging, for example integration into dlt is only available via 'tracing' at the moment, which is probably interesting for fault lib as well.

rustfmt.toml Outdated
# check configuration fields here: https://rust-lang.github.io/rustfmt/?version=v1.6.0&search=


tab_spaces = 4
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Would be cool make this common as well. i.e. cda is only using max_width = 100 and no tab_spaces. Maybe worth discussing in the next architecture round.

Also new line at EOF missing. (Last CI nag I promise: The ci/cd workflows also check for that)


create IpcWorker
FaultLib -> IpcWorker : spawn thread + create\niceoryx2 publisher ("dfm/event") +\nEC notification subscriber
FaultLib -> FaultLib : create iceoryx2 subscriber\n("dfm/event/hash/response")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bburda42dot So currently, fault lib fully relies on iceoryx2, is that correct? But when integrating OpenSOVD into S-CORE, we would have to use S-CORE's LoLa implementation instead. Has that already been considered in the current design?

Copy link
Author

@bburda42dot bburda42dot Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stmuench No. fault-lib does not fully rely on Iceoryx2. The design already anticipates alternative IPC backends.

There are three abstraction traits that decouple the core logic from the transport layer:

  1. FaultSinkApi (reporter/app side: Abstracts how fault events are sent to the DFM. The production implementation uses iceoryx2, but any backend can implement this trait. The docs even explicitly note: “Implementations can be S-CORE IPC.”
  2. DfmTransport (DFM run-loop side): Abstracts how the DFM receives events and publishes responses. The default is Iceoryx2Transport, but the trait is designed for pluggable backends (e.g., “implementations can use shared-memory IPC, in-process channels, or any other messaging backend”).
  3. DfmQueryApi (query/clear side): Abstracts SOVD fault query/clear, with both in-process and IPC implementations.

This means that integrating S-CORE’s LoLa would mean providing new implementations of these three traits. The core fault-management logic (catalogs, debouncing, aging, SOVD state machine) would not need to change.

… feedback

- Reformat codebase with max_width=100 (aligned with CDA)
- Enable clippy pedantic lints from shared-lints baseline
- Add clippy.toml (too-many-lines-threshold=130, allow-unwrap-in-tests)
- Fix clippy warnings: wildcard imports, redundant closures, casts, literals
- Migrate log/env_logger to tracing/tracing-subscriber
- Replace lint.yml, format.yml, copyright.yml with pr-checks.yml
  using cicd-workflows rust-lint-and-format-action (dual nightly: pinned strict + latest advisory)
- Add pre-commit.yaml using cicd-workflows pre-commit-action
- Migrate build_test.yml, coverage.yml, miri.yml to cicd-workflows reusable workflows
- Add --locked to all CI cargo commands
- All workflows reference eclipse-opensovd/cicd-workflows@main
- Remove repo-level PR templates (use org-wide from eclipse-opensovd/.github)
@bburda42dot bburda42dot changed the title Implement Diagnostic Fault Library with basic DFM, SOVD interface, and CI infrastructure Implement Diagnostic Fault Library with basic DFM and SOVD interface Mar 4, 2026

jobs:
build-and-test:
uses: eclipse-opensovd/cicd-workflows/.github/workflows/build-and-test.yml@main
Copy link

@alexmohr alexmohr Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: I would recommend pinning this to a SHA instead, to prevent breaking builds here accidentally when something incompatible in merged to main in cicd (applies to all @main )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants