Package Summary

Version 0.6.0
License Apache-2.0
Build type AMENT_CMAKE
Use RECOMMENDED

Repository Summary

Checkout URI https://github.com/selfpatch/ros2_medkit.git
VCS Type git
VCS Version main
Last Updated 2026-07-03
Dev Status DEVELOPED
Released RELEASED

Package Description

Central fault manager node for ros2_medkit fault management system

Maintainers

  • bburda

Authors

No additional authors.

ros2_medkit_fault_manager

Central fault manager node for the ros2_medkit fault management system.

Overview

The FaultManager node provides a central point for fault aggregation and lifecycle management. It receives fault reports from multiple sources, aggregates them by fault_code, and provides query and clearing interfaces.

Quick Start

By default, faults are confirmed immediately when reported; no additional configuration is needed.

# Start the fault manager
ros2 launch ros2_medkit_fault_manager fault_manager.launch.py

# Report a fault - it's immediately CONFIRMED
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
  "{fault_code: 'MOTOR_OVERHEAT', event_type: 0, severity: 2, description: 'Motor temp exceeded', source_id: '/motor_node'}"

# Query faults
ros2 service call /fault_manager/list_faults ros2_medkit_msgs/srv/ListFaults \
  "{statuses: ['CONFIRMED']}"

# Clear a fault (cascade-clears correlated symptoms by default)
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
  "{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: false}"

# Clear without touching correlated symptoms
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
  "{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: true}"

Note: The skip_correlation_auto_clear request field was added after 0.4.0. Adding a request field changes the service type hash, so callers built against ros2_medkit_msgs 0.4.0 or earlier must be rebuilt before they can call fault_manager again.

Services

| Service | Type | Description |
| --- | --- | --- |
| ~/report_fault | ros2_medkit_msgs/srv/ReportFault | Report a fault occurrence |
| ~/list_faults | ros2_medkit_msgs/srv/ListFaults | Query faults with filtering |
| ~/clear_fault | ros2_medkit_msgs/srv/ClearFault | Clear/acknowledge a fault |
| ~/get_snapshots | ros2_medkit_msgs/srv/GetSnapshots | Get topic snapshots for a fault |

Features

  • Multi-source aggregation: Reports of the same fault_code from different sources are merged into a single fault
  • Occurrence tracking: Counts total reports and tracks all reporting sources
  • Severity escalation: Fault severity is updated if a higher severity is reported
  • Persistent storage: SQLite backend ensures faults survive node restarts
  • Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
  • Snapshot capture: Captures topic data when faults are confirmed for debugging (snapshots are deleted when fault is cleared)
  • Freeze-frame retention: One compact JSON freeze-frame per fault code, retained across clear_fault (see below)
  • Fault correlation (optional): Root cause analysis with symptom muting and auto-clear
  • Tamper-evident audit log (optional): Append-only, hash-chained record of fault state transitions for verifiable history
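
The first three features (multi-source aggregation, occurrence tracking, severity escalation) can be illustrated with a small, ROS-free Python model. This is only a sketch of the described behaviour, not the node's implementation: `FaultRecord`, `report`, and the `/thermal_monitor` source are hypothetical names, and it assumes that a larger numeric severity means a more severe fault.

```python
from dataclasses import dataclass, field


@dataclass
class FaultRecord:
    """Hypothetical in-memory stand-in for one aggregated fault."""
    fault_code: str
    severity: int
    occurrence_count: int = 0
    sources: set = field(default_factory=set)


def report(faults: dict, fault_code: str, severity: int, source_id: str) -> FaultRecord:
    """Merge one report into the single record kept for its fault_code."""
    record = faults.setdefault(fault_code, FaultRecord(fault_code, severity))
    record.occurrence_count += 1                       # occurrence tracking
    record.sources.add(source_id)                      # remember every reporter
    record.severity = max(record.severity, severity)   # escalate, never downgrade
    return record


faults = {}
report(faults, "MOTOR_OVERHEAT", 1, "/motor_node")
record = report(faults, "MOTOR_OVERHEAT", 2, "/thermal_monitor")
print(record.occurrence_count, sorted(record.sources), record.severity)
# prints: 2 ['/motor_node', '/thermal_monitor'] 2
```

Two reports from different sources leave one entry in `faults`, with both sources recorded and the higher severity kept.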

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| storage_type | string | "sqlite" | Storage backend: "sqlite" or "memory" |
| database_path | string | "/var/lib/ros2_medkit/faults.db" | Path to SQLite database file |
| confirmation_threshold | int | -1 | Counter value at which faults are confirmed |
| healing_enabled | bool | false | Enable automatic healing via PASSED events |
| healing_threshold | int | 3 | Counter value at which faults are healed |
| auto_confirm_after_sec | double | 0.0 | Auto-confirm PREFAILED faults after timeout (0 = disabled) |
| entity_thresholds.config_file | string | "" | Path to YAML file with per-entity debounce threshold overrides |
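
To see how confirmation_threshold, healing_enabled, and healing_threshold interact, here is a minimal ROS-free sketch of a debounce counter. It follows the changelog's description that FAILED events decrement the counter and PASSED events increment it. Everything else is an assumption made for illustration: the `Debounce` class, a counter that starts at 0, the `<=` / `>=` comparisons, `PASSED = 1`, and the `HEALED` status name. It does not model auto_confirm_after_sec or the CRITICAL severity bypass.

```python
from dataclasses import dataclass

FAILED = 0  # the Quick Start reports a failure with event_type: 0
PASSED = 1  # assumed value, not confirmed by this README


@dataclass
class Debounce:
    """Hypothetical per-fault debounce state."""
    confirmation_threshold: int = -1
    healing_enabled: bool = False
    healing_threshold: int = 3
    counter: int = 0            # assumed starting value
    status: str = "PREFAILED"

    def on_event(self, event_type: int) -> str:
        if event_type == FAILED:
            self.counter -= 1
            if self.counter <= self.confirmation_threshold:
                self.status = "CONFIRMED"
        else:
            self.counter += 1
            if self.healing_enabled and self.counter >= self.healing_threshold:
                self.status = "HEALED"  # assumed status name
        return self.status


immediate = Debounce()                           # defaults from the table above
first = immediate.on_event(FAILED)               # counter -1 reaches the threshold

strict = Debounce(confirmation_threshold=-3)
history = [strict.on_event(FAILED) for _ in range(3)]
print(first, history)
# prints: CONFIRMED ['PREFAILED', 'PREFAILED', 'CONFIRMED']
```

With the default threshold of -1 a single FAILED report confirms the fault, which matches the Quick Start; a threshold of -3 needs three FAILED reports in this model.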

Snapshot Parameters

Snapshots capture topic data when faults are confirmed for post-mortem debugging.

Each confirm also writes a freeze-frame: a single compact JSON object mapping every captured topic to its value at confirmation time, keyed by fault code. It differs from per-topic snapshots in two ways: snapshots are deleted when the fault is cleared, while the freeze-frame is retained across clear_fault (once the snapshots are gone, ~/get_fault serves the retained frame so the confirmed-state record stays available after acknowledgement); and a re-confirm that captures nothing (e.g. source publishers down) never overwrites an existing non-empty frame. A fault code with no configured capture set gets no freeze-frame row; a configured capture that samples nothing on its first run records an empty {} frame. Freeze-frame storage is bounded by the number of distinct fault codes (one row per code, replaced in place) and rows are never evicted.
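
The freeze-frame write rules above can be restated as a short ROS-free sketch. The function name `write_freeze_frame`, the dict used as storage, and the topic name `/motor/temp` are hypothetical; only the rules themselves come from the paragraph above.

```python
import json


def write_freeze_frame(frames: dict, fault_code: str, topics: list, captured: dict) -> None:
    """Apply the freeze-frame rules to a dict mapping fault_code -> JSON text."""
    if not topics:
        return  # no configured capture set: no freeze-frame row at all
    if not captured and frames.get(fault_code, "{}") != "{}":
        return  # an empty re-capture never overwrites a non-empty frame
    # One row per fault code, replaced in place, stored as compact JSON.
    frames[fault_code] = json.dumps(captured, separators=(",", ":"), sort_keys=True)


frames = {}
write_freeze_frame(frames, "NO_CAPTURE_SET", [], {})
write_freeze_frame(frames, "MOTOR_OVERHEAT", ["/motor/temp"], {})
empty_first_run = frames["MOTOR_OVERHEAT"]
write_freeze_frame(frames, "MOTOR_OVERHEAT", ["/motor/temp"], {"/motor/temp": {"celsius": 91}})
write_freeze_frame(frames, "MOTOR_OVERHEAT", ["/motor/temp"], {})  # publishers down
print(empty_first_run, frames)
# prints: {} {'MOTOR_OVERHEAT': '{"/motor/temp":{"celsius":91}}'}
```

Clearing a fault would delete its snapshots but leave the `frames` entry untouched, which is why the sketch has no delete path.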

Under a fault storm, captures are bounded by a worker pool (capture_pool_size) draining a bounded queue (capture_queue_depth); excess captures are dropped per capture_queue_full_policy and logged (throttled). The pool is shared and is created when snapshots or rosbag is enabled, so these parameters bound both. capture_pool_size parallelizes freeze-frame snapshot capture only - rosbag is single-writer and records one fault at a time regardless of pool size.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| snapshots.enabled | bool | true | Enable/disable snapshot capture |
| snapshots.background_capture | bool | false | Use background subscriptions (caches latest message) vs on-demand capture |
| snapshots.timeout_sec | double | 1.0 | Timeout waiting for topic message (on-demand mode) |
| snapshots.max_message_size | int | 65536 | Maximum message size in bytes (larger messages skipped) |
| snapshots.default_topics | string[] | [] | Topics to capture for all faults |
| snapshots.config_file | string | "" | Path to YAML config for fault_specific and patterns |
| snapshots.recapture_cooldown_sec | double | 60.0 | Min seconds between captures for the same fault code. |
| snapshots.max_per_fault | int | 10 | Max snapshots retained per fault. |
| snapshots.capture_pool_size | int | 2 | Max concurrent capture threads under a fault storm (>= 1). Parallelizes snapshot capture only; rosbag stays single-writer. |
| snapshots.capture_queue_depth | int | 16 | Max pending captures before the full-queue policy applies (>= 1). |
| snapshots.capture_queue_full_policy | string | reject_newest | Policy when the queue is full: reject_newest or drop_oldest. |
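
The difference between the two capture_queue_full_policy values is easiest to see on a tiny queue. The sketch below is a ROS-free model only: `CaptureQueue` and its attributes are hypothetical names, it ignores the worker pool that drains the queue, and it assumes drop_oldest evicts the longest-waiting request to make room for the new one.

```python
from collections import deque


class CaptureQueue:
    """Hypothetical bounded queue of pending capture requests."""

    def __init__(self, depth: int = 16, full_policy: str = "reject_newest"):
        if depth < 1:
            raise ValueError("depth must be >= 1")
        if full_policy not in ("reject_newest", "drop_oldest"):
            raise ValueError(f"unknown policy: {full_policy}")
        self.depth = depth
        self.full_policy = full_policy
        self.pending = deque()
        self.dropped = []  # what a real node would log (throttled)

    def submit(self, fault_code: str) -> bool:
        """Enqueue a capture; return False if this request itself was rejected."""
        if len(self.pending) < self.depth:
            self.pending.append(fault_code)
            return True
        if self.full_policy == "reject_newest":
            self.dropped.append(fault_code)
            return False
        # drop_oldest: evict the longest-waiting request to make room
        self.dropped.append(self.pending.popleft())
        self.pending.append(fault_code)
        return True


reject = CaptureQueue(depth=2, full_policy="reject_newest")
evict = CaptureQueue(depth=2, full_policy="drop_oldest")
for code in ("A", "B", "C"):
    reject.submit(code)
    evict.submit(code)
print(list(reject.pending), reject.dropped, list(evict.pending), evict.dropped)
# prints: ['A', 'B'] ['C'] ['B', 'C'] ['A']
```

reject_newest favours the faults that arrived first in a storm, while drop_oldest favours the most recent ones.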

Topic Resolution Priority:

  1. fault_specific - Exact match for fault code (configured via YAML config file)
  2. patterns - Regex pattern match (configured via YAML config file)
  3. default_topics - Fallback for all faults
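
The priority order above can be sketched as a small lookup function. This is an illustration only: the `resolve_topics` name, the dict shapes, and the topic names are hypothetical, and it assumes that a pattern must match the whole fault code and that the first matching pattern wins.

```python
import re


def resolve_topics(fault_code: str, fault_specific: dict, patterns: dict, default_topics: list) -> list:
    """Pick the topics to capture for a fault code, highest priority first."""
    if fault_code in fault_specific:              # 1. exact match
        return fault_specific[fault_code]
    for pattern, topics in patterns.items():      # 2. regex match, in config order
        if re.fullmatch(pattern, fault_code):
            return topics
    return default_topics                         # 3. fallback


fault_specific = {"MOTOR_OVERHEAT": ["/motor/temp"]}
patterns = {r"MOTOR_.*": ["/motor/status"]}
default_topics = ["/diagnostics"]

for code in ("MOTOR_OVERHEAT", "MOTOR_STALL", "BATTERY_LOW"):
    print(code, resolve_topics(code, fault_specific, patterns, default_topics))
# prints:
# MOTOR_OVERHEAT ['/motor/temp']
# MOTOR_STALL ['/motor/status']
# BATTERY_LOW ['/diagnostics']
```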

Example YAML config file (snapshots.yaml):

```yaml
fault_specific:
```

File truncated at 100 lines; see the full file.

CHANGELOG

Changelog for package ros2_medkit_fault_manager

Forthcoming

  • Optional append-only, hash-chained audit log of fault state transitions: each transition appends one immutable row (record_hash = sha256(prev_hash + canonical(event)) via OpenSSL EVP SHA-256) with a persisted chain head, a verify routine, a read API, and retention that seals a segment anchor before pruning. Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited. verify reads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering. BEFORE UPDATE / BEFORE DELETE triggers reject out-of-band edits as defense-in-depth. The chain is unkeyed and stored in a single writable file, so verify detects edits/deletions that did not recompute the chain (casual or accidental tampering); it is not a defence against an attacker who can rewrite the whole file. Off by default (#483)
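
The hash-chain idea in this entry can be sketched in a few lines of Python. This is not the package's code: it uses `hashlib` instead of OpenSSL EVP, and the all-zero genesis hash, the JSON canonical form, the row layout, and the event fields are assumptions chosen for illustration. Only the formula record_hash = sha256(prev_hash + canonical(event)) comes from the entry above.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash for the first row


def canonical(event: dict) -> str:
    """Assumed canonical form: compact JSON with sorted keys."""
    return json.dumps(event, sort_keys=True, separators=(",", ":"))


def record_hash(prev_hash: str, event: dict) -> str:
    return hashlib.sha256((prev_hash + canonical(event)).encode("utf-8")).hexdigest()


def append(chain: list, event: dict) -> str:
    """Append one row and return the new chain head."""
    prev = chain[-1]["record_hash"] if chain else GENESIS
    row = {"event": event, "prev_hash": prev, "record_hash": record_hash(prev, event)}
    chain.append(row)
    return row["record_hash"]


def verify(chain: list, head: str) -> bool:
    """Recompute every link and compare the result with the stored head."""
    prev = GENESIS
    for row in chain:
        if row["prev_hash"] != prev or row["record_hash"] != record_hash(prev, row["event"]):
            return False
        prev = row["record_hash"]
    return prev == head


chain = []
append(chain, {"fault_code": "MOTOR_OVERHEAT", "from": "PREFAILED", "to": "CONFIRMED"})
head = append(chain, {"fault_code": "MOTOR_OVERHEAT", "from": "CONFIRMED", "to": "CLEARED"})
intact = verify(chain, head)

truncated = verify(chain[:-1], head)       # newest row deleted, head left behind
chain[0]["event"]["to"] = "CLEARED"        # edit that does not recompute the chain
edited = verify(chain, head)
print(intact, truncated, edited)
# prints: True False False
```

Because the chain is unkeyed, anyone who can rewrite every row and the head can rebuild a chain that verifies, which is the limitation the entry spells out.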

0.6.0 (2026-06-22)

  • Bounded concurrent snapshot capture under fault storms with a CaptureThreadPool and configurable capture pool / queue / overflow-policy parameters. The rosbag leg is serialized and the cooldown map is bounded, so a burst of simultaneous faults can no longer exhaust capture threads or grow memory without limit (#456)
  • Entity-scoped rosbag capture by default (#431)
  • Made rosbag capture enablement crash-safe (#430)
  • Contributors: @bburda, @mfaferek93

0.5.0 (2026-06-08)

  • ClearFault honors the new skip_correlation_auto_clear request flag so per-entity fault clears can opt out of cascade-clearing correlated symptom fault codes (#395)
  • Three-layer protection against unbounded snapshot growth (bounded buffers plus pruning)
  • Concurrency and lifetime hardening: serialize concurrent subscription creation in SnapshotCapture, join capture threads in the FaultManagerNode destructor, and defense-in-depth shutdown guards to prevent teardown crashes across distros
  • Aggregation security hardening and improved test coverage
  • Build: adopt the centralized ROS2MedkitWarnings and ROS2MedkitSanitizers cmake modules and bugprone / special-member-functions clang-tidy checks
  • Contributors: @bburda

0.4.0 (2026-03-20)

  • Per-entity confirmation and healing thresholds via manifest configuration (#269)
  • Default rosbag storage format changed from sqlite3 to mcap
  • Support for namespaced fault manager nodes - gateway resolves service/topic names when the fault manager runs in a custom namespace
  • Build: use shared cmake modules from ros2_medkit_cmake package
  • Build: centralized clang-tidy configuration
  • Contributors: @bburda

0.3.0 (2026-02-27)

  • Accurate HIGHEST_SEVERITY reassignment and stale fault_to_cluster_ cleanup (#221)
  • Clean up pending_clusters_ when fault cleared before min_count (#211)
  • Multi-distro CI support for ROS 2 Humble, Jazzy, and Rolling (#219, #242)
  • Contributors: @bburda, @eclipse0922

0.2.0 (2026-02-07)

  • Initial rosdistro release
  • Central fault management node with ROS 2 services:
    • ReportFault - report FAILED/PASSED events with debounce filtering
    • GetFaults - query faults with filtering by severity, status, correlation
    • ClearFault - clear/acknowledge faults
  • Debounce filtering with configurable thresholds:
    • FAILED events decrement counter, PASSED events increment
    • Configurable confirmation_threshold (default: -1, immediate)
    • Optional healing support (healing_enabled, healing_threshold)
    • Time-based auto-confirmation (auto_confirm_after_sec)
    • CRITICAL severity bypasses debounce
  • Dual storage backends:
    • SQLite persistent storage with WAL mode (default)
    • In-memory storage for testing/lightweight deployments
  • Snapshot capture on fault confirmation:
    • Topic data captured as JSON with configurable topic resolution
    • Priority: fault_specific > patterns > default_topics
    • Stored in SQLite with indexed fault_code lookup
    • Auto-cleanup on fault clear

File truncated at 100 lines; see the full file.

Package Summary

Version 0.6.0
License Apache-2.0
Build type AMENT_CMAKE
Use RECOMMENDED

Repository Summary

Checkout URI https://github.com/selfpatch/ros2_medkit.git
VCS Type git
VCS Version main
Last Updated 2026-07-03
Dev Status DEVELOPED
Released RELEASED
Contributing Help Wanted (-)
Good First Issues (-)
Pull Requests to Review (-)

Package Description

Central fault manager node for ros2_medkit fault management system

Maintainers

  • bburda

Authors

No additional authors.

ros2_medkit_fault_manager

Central fault manager node for the ros2_medkit fault management system.

Overview

The FaultManager node provides a central point for fault aggregation and lifecycle management. It receives fault reports from multiple sources, aggregates them by fault_code, and provides query and clearing interfaces.

Quick Start

By default, faults are confirmed immediately when reported - no additional configuration needed.

# Start the fault manager
ros2 launch ros2_medkit_fault_manager fault_manager.launch.py

# Report a fault - it's immediately CONFIRMED
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
  "{fault_code: 'MOTOR_OVERHEAT', event_type: 0, severity: 2, description: 'Motor temp exceeded', source_id: '/motor_node'}"

# Query faults
ros2 service call /fault_manager/list_faults ros2_medkit_msgs/srv/ListFaults \
  "{statuses: ['CONFIRMED']}"

# Clear a fault (cascade-clears correlated symptoms by default)
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
  "{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: false}"

# Clear without touching correlated symptoms
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
  "{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: true}"

Note: The skip_correlation_auto_clear request field was added post-0.4.0. Adding a request field changes the service type hash, so callers built against ros2_medkit_msgs 0.4.0 or earlier must rebuild to keep talking to fault_manager.

Services

Service Type Description
~/report_fault ros2_medkit_msgs/srv/ReportFault Report a fault occurrence
~/list_faults ros2_medkit_msgs/srv/ListFaults Query faults with filtering
~/clear_fault ros2_medkit_msgs/srv/ClearFault Clear/acknowledge a fault
~/get_snapshots ros2_medkit_msgs/srv/GetSnapshots Get topic snapshots for a fault

Features

  • Multi-source aggregation: Same fault_code from different sources creates a single fault
  • Occurrence tracking: Counts total reports and tracks all reporting sources
  • Severity escalation: Fault severity is updated if a higher severity is reported
  • Persistent storage: SQLite backend ensures faults survive node restarts
  • Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
  • Snapshot capture: Captures topic data when faults are confirmed for debugging (snapshots are deleted when fault is cleared)
  • Freeze-frame retention: One compact JSON freeze-frame per fault code, retained across clear_fault (see below)
  • Fault correlation (optional): Root cause analysis with symptom muting and auto-clear
  • Tamper-evident audit log (optional): Append-only, hash-chained record of fault state transitions for verifiable history

Parameters

Parameter Type Default Description
storage_type string "sqlite" Storage backend: "sqlite" or "memory"
database_path string "/var/lib/ros2_medkit/faults.db" Path to SQLite database file
confirmation_threshold int -1 Counter value at which faults are confirmed
healing_enabled bool false Enable automatic healing via PASSED events
healing_threshold int 3 Counter value at which faults are healed
auto_confirm_after_sec double 0.0 Auto-confirm PREFAILED faults after timeout (0 = disabled)
entity_thresholds.config_file string "" Path to YAML file with per-entity debounce threshold overrides

Snapshot Parameters

Snapshots capture topic data when faults are confirmed for post-mortem debugging.

Each confirm also writes a freeze-frame: a single compact JSON object mapping every captured topic to its value at confirmation time, keyed by fault code. It differs from per-topic snapshots in two ways: snapshots are deleted when the fault is cleared, while the freeze-frame is retained across clear_fault (once the snapshots are gone, ~/get_fault serves the retained frame so the confirmed-state record stays available after acknowledgement); and a re-confirm that captures nothing (e.g. source publishers down) never overwrites an existing non-empty frame. A fault code with no configured capture set gets no freeze-frame row; a configured capture that samples nothing on its first run records an empty {} frame. Freeze-frame storage is bounded by the number of distinct fault codes (one row per code, replaced in place) and rows are never evicted.

Under a fault storm, captures are bounded by a worker pool (capture_pool_size) draining a bounded queue (capture_queue_depth); excess captures are dropped per capture_queue_full_policy and logged (throttled). The pool is shared and is created when snapshots or rosbag is enabled, so these parameters bound both. capture_pool_size parallelizes freeze-frame snapshot capture only - rosbag is single-writer and records one fault at a time regardless of pool size.

Parameter Type Default Description
snapshots.enabled bool true Enable/disable snapshot capture
snapshots.background_capture bool false Use background subscriptions (caches latest message) vs on-demand capture
snapshots.timeout_sec double 1.0 Timeout waiting for topic message (on-demand mode)
snapshots.max_message_size int 65536 Maximum message size in bytes (larger messages skipped)
snapshots.default_topics string[] [] Topics to capture for all faults
snapshots.config_file string "" Path to YAML config for fault_specific and patterns
snapshots.recapture_cooldown_sec double 60.0 Min seconds between captures for the same fault code.
snapshots.max_per_fault int 10 Max snapshots retained per fault.
snapshots.capture_pool_size int 2 Max concurrent capture threads under a fault storm (>= 1). Parallelizes snapshot capture only; rosbag stays single-writer.
snapshots.capture_queue_depth int 16 Max pending captures before the full-queue policy applies (>= 1).
snapshots.capture_queue_full_policy string reject_newest Policy when the queue is full: reject_newest or drop_oldest.

Topic Resolution Priority:

  1. fault_specific - Exact match for fault code (configured via YAML config file)
  2. patterns - Regex pattern match (configured via YAML config file)
  3. default_topics - Fallback for all faults

Example YAML config file (snapshots.yaml):

```yaml fault_specific:

File truncated at 100 lines see the full file

CHANGELOG

Changelog for package ros2_medkit_fault_manager

Forthcoming

  • Optional append-only, hash-chained audit log of fault state transitions: each transition appends one immutable row (record_hash = sha256(prev_hash + canonical(event)) via OpenSSL EVP SHA-256) with a persisted chain head, a verify routine, a read API, and retention that seals a segment anchor before pruning. Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited. verify reads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering. BEFORE UPDATE / BEFORE DELETE triggers reject out-of-band edits as defense-in-depth. The chain is unkeyed and stored in a single writable file, so verify detects edits/deletions that did not recompute the chain (casual or accidental tampering); it is not a defence against an attacker who can rewrite the whole file. Off by default (#483)

0.6.0 (2026-06-22)

  • Bounded concurrent snapshot capture under fault storms with a CaptureThreadPool and configurable capture pool / queue / overflow-policy parameters. The rosbag leg is serialized and the cooldown map is bounded, so a burst of simultaneous faults can no longer exhaust capture threads or grow memory without limit (#456)
  • Entity-scoped rosbag capture by default (#431)
  • Made rosbag capture enablement crash-safe (#430)
  • Contributors: \@bburda, \@mfaferek93

0.5.0 (2026-06-08)

  • ClearFault honors the new skip_correlation_auto_clear request flag so per-entity fault clears can opt out of cascade-clearing correlated symptom fault codes (#395)
  • Three-layer protection against unbounded snapshot growth (bounded buffers plus pruning)
  • Concurrency and lifetime hardening: serialize concurrent subscription creation in SnapshotCapture, join capture threads in the FaultManagerNode destructor, and defense-in-depth shutdown guards to prevent teardown crashes across distros
  • Aggregation security hardening and improved test coverage
  • Build: adopt the centralized ROS2MedkitWarnings and ROS2MedkitSanitizers cmake modules and bugprone / special-member-functions clang-tidy checks
  • Contributors: \@bburda

0.4.0 (2026-03-20)

  • Per-entity confirmation and healing thresholds via manifest configuration (#269)
  • Default rosbag storage format changed from sqlite3 to mcap
  • Support for namespaced fault manager nodes - gateway resolves service/topic names when the fault manager runs in a custom namespace
  • Build: use shared cmake modules from ros2_medkit_cmake package
  • Build: centralized clang-tidy configuration
  • Contributors: \@bburda

0.3.0 (2026-02-27)

  • Accurate HIGHEST_SEVERITY reassignment and stale fault_to_cluster_ cleanup (#221)
  • Clean up pending_clusters_ when fault cleared before min_count (#211)
  • Multi-distro CI support for ROS 2 Humble, Jazzy, and Rolling (#219, #242)
  • Contributors: \@bburda, \@eclipse0922

0.2.0 (2026-02-07)

  • Initial rosdistro release
  • Central fault management node with ROS 2 services:
    • ReportFault - report FAILED/PASSED events with debounce filtering
    • GetFaults - query faults with filtering by severity, status, correlation
    • ClearFault - clear/acknowledge faults
  • Debounce filtering with configurable thresholds:
    • FAILED events decrement counter, PASSED events increment
    • Configurable confirmation_threshold (default: -1, immediate)
    • Optional healing support (healing_enabled, healing_threshold)
    • Time-based auto-confirmation (auto_confirm_after_sec)
    • CRITICAL severity bypasses debounce
  • Dual storage backends:
    • SQLite persistent storage with WAL mode (default)
    • In-memory storage for testing/lightweight deployments
  • Snapshot capture on fault confirmation:
    • Topic data captured as JSON with configurable topic resolution
    • Priority: fault_specific > patterns > default_topics
    • Stored in SQLite with indexed fault_code lookup
    • Auto-cleanup on fault clear

File truncated at 100 lines see the full file

Launch files

No launch files found

Messages

No message files found.

Services

No service files found

Plugins

No plugins found.

Services

| Service | Type | Description |
|---|---|---|
| ~/report_fault | ros2_medkit_msgs/srv/ReportFault | Report a fault occurrence |
| ~/list_faults | ros2_medkit_msgs/srv/ListFaults | Query faults with filtering |
| ~/clear_fault | ros2_medkit_msgs/srv/ClearFault | Clear/acknowledge a fault |
| ~/get_snapshots | ros2_medkit_msgs/srv/GetSnapshots | Get topic snapshots for a fault |

Features

  • Multi-source aggregation: Same fault_code from different sources creates a single fault
  • Occurrence tracking: Counts total reports and tracks all reporting sources
  • Severity escalation: Fault severity is updated if a higher severity is reported
  • Persistent storage: SQLite backend ensures faults survive node restarts
  • Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
  • Snapshot capture: Captures topic data at the moment a fault is confirmed, for later debugging (snapshots are deleted when the fault is cleared)
  • Freeze-frame retention: One compact JSON freeze-frame per fault code, retained across clear_fault (see below)
  • Fault correlation (optional): Root cause analysis with symptom muting and auto-clear
  • Tamper-evident audit log (optional): Append-only, hash-chained record of fault state transitions for verifiable history
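
The first three features (aggregation by fault_code, occurrence tracking, severity escalation) can be pictured with a small Python sketch. It is illustrative only: the `AggregatedFault` record and the `report` helper are hypothetical names, not the node's API, and the sketch assumes that a larger severity number means a more severe fault.

```python
from dataclasses import dataclass, field


@dataclass
class AggregatedFault:
    """Hypothetical in-memory record; not the node's real storage schema."""
    fault_code: str
    severity: int
    occurrence_count: int = 0
    sources: set[str] = field(default_factory=set)


def report(faults: dict[str, AggregatedFault], fault_code: str,
           severity: int, source_id: str) -> AggregatedFault:
    """Fold one report into the single fault entry for its fault_code."""
    # Multi-source aggregation: one entry per fault_code, whoever reports it.
    fault = faults.setdefault(fault_code, AggregatedFault(fault_code, severity))
    # Occurrence tracking: count every report and remember each source.
    fault.occurrence_count += 1
    fault.sources.add(source_id)
    # Severity escalation: severity only ever moves up
    # (assumption: higher number = more severe).
    fault.severity = max(fault.severity, severity)
    return fault
```

Reporting MOTOR_OVERHEAT from two different nodes leaves one entry with two sources, and a later lower-severity report does not lower the stored severity.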

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| storage_type | string | "sqlite" | Storage backend: "sqlite" or "memory" |
| database_path | string | "/var/lib/ros2_medkit/faults.db" | Path to SQLite database file |
| confirmation_threshold | int | -1 | Counter value at which faults are confirmed |
| healing_enabled | bool | false | Enable automatic healing via PASSED events |
| healing_threshold | int | 3 | Counter value at which faults are healed |
| auto_confirm_after_sec | double | 0.0 | Auto-confirm PREFAILED faults after timeout (0 = disabled) |
| entity_thresholds.config_file | string | "" | Path to YAML file with per-entity debounce threshold overrides |
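
How the debounce parameters interact is easier to see in code. The Python sketch below is a simplified model, not the node's implementation. It follows the changelog's description (FAILED events decrement the counter, PASSED events increment it, CRITICAL severity bypasses debounce) and the defaults from the table above. The `DebounceState` class, the `apply_event` helper, the "HEALED" status label, and the absence of counter clamping are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DebounceState:
    """Hypothetical per-fault debounce state."""
    counter: int = 0
    status: str = "PREFAILED"


def apply_event(state: DebounceState, passed: bool, *, critical: bool = False,
                confirmation_threshold: int = -1, healing_enabled: bool = False,
                healing_threshold: int = 3) -> str:
    """Apply one PASSED (passed=True) or FAILED (passed=False) event."""
    if passed:
        state.counter += 1  # PASSED events increment the counter
        if healing_enabled and state.counter >= healing_threshold:
            state.status = "HEALED"  # assumed label for a healed fault
    else:
        state.counter -= 1  # FAILED events decrement the counter
        # CRITICAL faults skip the counter check entirely.
        if critical or state.counter <= confirmation_threshold:
            state.status = "CONFIRMED"
    return state.status
```

With the default confirmation_threshold of -1 the first FAILED event already reaches the threshold, which matches the "confirmed immediately" default; a threshold of -3 needs a net three FAILED events.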

Snapshot Parameters

Snapshots capture topic data at the moment a fault is confirmed, for post-mortem debugging.

Each confirmation also writes a freeze-frame: a single compact JSON object, keyed by fault code, that maps every captured topic to its value at confirmation time. It differs from per-topic snapshots in two ways. First, snapshots are deleted when the fault is cleared, while the freeze-frame is retained across clear_fault; once the snapshots are gone, ~/get_fault serves the retained frame, so the confirmed-state record stays available after acknowledgement. Second, a re-confirmation that captures nothing (e.g. because the source publishers are down) never overwrites an existing non-empty frame. A fault code with no configured capture set gets no freeze-frame row; a configured capture that samples nothing on its first run records an empty {} frame. Freeze-frame storage is bounded by the number of distinct fault codes (one row per code, replaced in place), and rows are never evicted.
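
The freeze-frame rules above can be modelled in a few lines of Python. This is a sketch under stated assumptions rather than the actual storage code: plain dicts stand in for the database tables, and `write_freeze_frame` and `clear_fault` are hypothetical helper names.

```python
import json
from typing import Optional


def write_freeze_frame(frames: dict[str, str], fault_code: str,
                       captured: Optional[dict]) -> None:
    """Store one compact JSON frame per fault code.

    captured is None when no capture set is configured for the fault code,
    and an empty dict when a configured capture sampled nothing.
    """
    if captured is None:
        return  # no configured capture set: no freeze-frame row at all
    existing = frames.get(fault_code)
    if not captured and existing is not None and json.loads(existing):
        return  # an empty re-capture never overwrites a non-empty frame
    # One row per fault code, replaced in place.
    frames[fault_code] = json.dumps(captured, separators=(",", ":"))


def clear_fault(snapshots: dict[str, list], frames: dict[str, str],
                fault_code: str) -> None:
    """Clearing deletes the snapshots but leaves the freeze-frame in place."""
    snapshots.pop(fault_code, None)
```

The `frames` argument is accepted but untouched in `clear_fault` on purpose, to make the retention rule explicit.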

Under a fault storm, captures are bounded by a worker pool (capture_pool_size) draining a bounded queue (capture_queue_depth); excess captures are dropped according to capture_queue_full_policy, and the drops are logged (throttled). The pool is shared and is created when either snapshots or rosbag capture is enabled, so these parameters bound both. However, capture_pool_size parallelizes freeze-frame snapshot capture only: rosbag is single-writer and records one fault at a time regardless of pool size.

| Parameter | Type | Default | Description |
|---|---|---|---|
| snapshots.enabled | bool | true | Enable/disable snapshot capture |
| snapshots.background_capture | bool | false | Use background subscriptions (caches latest message) vs on-demand capture |
| snapshots.timeout_sec | double | 1.0 | Timeout waiting for topic message (on-demand mode) |
| snapshots.max_message_size | int | 65536 | Maximum message size in bytes (larger messages skipped) |
| snapshots.default_topics | string[] | [] | Topics to capture for all faults |
| snapshots.config_file | string | "" | Path to YAML config for fault_specific and patterns |
| snapshots.recapture_cooldown_sec | double | 60.0 | Min seconds between captures for the same fault code. |
| snapshots.max_per_fault | int | 10 | Max snapshots retained per fault. |
| snapshots.capture_pool_size | int | 2 | Max concurrent capture threads under a fault storm (>= 1). Parallelizes snapshot capture only; rosbag stays single-writer. |
| snapshots.capture_queue_depth | int | 16 | Max pending captures before the full-queue policy applies (>= 1). |
| snapshots.capture_queue_full_policy | string | reject_newest | Policy when the queue is full: reject_newest or drop_oldest. |
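
To make the two overflow policies concrete, here is a minimal Python model of a bounded pending-capture queue. It is a sketch only: `BoundedCaptureQueue` is a hypothetical class, the worker pool that drains the queue is not modelled, and no locking is shown, whereas a real multi-threaded pool would need it.

```python
from collections import deque


class BoundedCaptureQueue:
    """Hypothetical model of capture_queue_depth and capture_queue_full_policy."""

    POLICIES = ("reject_newest", "drop_oldest")

    def __init__(self, depth: int = 16, full_policy: str = "reject_newest"):
        if depth < 1:
            raise ValueError("depth must be >= 1")
        if full_policy not in self.POLICIES:
            raise ValueError(f"unknown policy: {full_policy}")
        self.depth = depth
        self.full_policy = full_policy
        self.pending = deque()  # fault codes waiting for a capture worker
        self.dropped = 0        # captures lost to overflow

    def submit(self, fault_code: str) -> bool:
        """Queue a capture request; return True if fault_code was queued."""
        if len(self.pending) < self.depth:
            self.pending.append(fault_code)
            return True
        self.dropped += 1
        if self.full_policy == "reject_newest":
            return False            # keep the queue as is, drop the new request
        self.pending.popleft()      # drop_oldest: evict the oldest request
        self.pending.append(fault_code)
        return True
```

With reject_newest the earliest faults of a storm keep their captures; with drop_oldest the most recent ones do. Which is preferable depends on whether the onset or the latest state of the storm matters more for diagnosis.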

Topic Resolution Priority:

  1. fault_specific - Exact match for fault code (configured via YAML config file)
  2. patterns - Regex pattern match (configured via YAML config file)
  3. default_topics - Fallback for all faults
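
The priority order above amounts to a three-step lookup, sketched here in Python. The `resolve_topics` function and the dict shapes are hypothetical, since the real YAML schema is not shown in full, and whether patterns must match the whole fault code or only part of it is an assumption (full match is used here).

```python
import re


def resolve_topics(fault_code: str,
                   fault_specific: dict[str, list[str]],
                   patterns: dict[str, list[str]],
                   default_topics: list[str]) -> list[str]:
    """Pick the topics to capture for fault_code, highest priority first."""
    # 1. fault_specific: exact match on the fault code.
    if fault_code in fault_specific:
        return fault_specific[fault_code]
    # 2. patterns: first regex that matches (assumption: full match, in order).
    for pattern, topics in patterns.items():
        if re.fullmatch(pattern, fault_code):
            return topics
    # 3. default_topics: fallback for every other fault.
    return default_topics
```

For example, with a fault_specific entry for MOTOR_OVERHEAT and a pattern `MOTOR_.*`, MOTOR_OVERHEAT uses its own topic list, MOTOR_STALL uses the pattern's list, and any other code falls back to default_topics.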

Example YAML config file (snapshots.yaml):

```yaml
fault_specific:
```

File truncated at 100 lines see the full file

CHANGELOG

Changelog for package ros2_medkit_fault_manager

Forthcoming

  • Optional append-only, hash-chained audit log of fault state transitions. Off by default (#483)
    • Each transition appends one immutable row (record_hash = sha256(prev_hash + canonical(event)) via OpenSSL EVP SHA-256) with a persisted chain head, a verify routine, a read API, and retention that seals a segment anchor before pruning
    • Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited
    • verify reads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering
    • BEFORE UPDATE / BEFORE DELETE triggers reject out-of-band edits as defense-in-depth
    • The chain is unkeyed and stored in a single writable file, so verify detects edits/deletions that did not recompute the chain (casual or accidental tampering); it is not a defense against an attacker who can rewrite the whole file
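
The chaining rule record_hash = sha256(prev_hash + canonical(event)) can be illustrated with a short Python sketch. It is not the package's implementation, which uses OpenSSL and SQLite. The all-zero genesis value, the sorted-key JSON canonical form, and the `append` / `verify` helpers are assumptions chosen for illustration, and retention with segment anchors is not modelled.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash for the very first row


def canonical(event: dict) -> bytes:
    """Assumed canonical form: compact JSON with sorted keys."""
    return json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")


def append(chain: list[dict], event: dict) -> str:
    """Append one row and return the new chain head."""
    prev_hash = chain[-1]["record_hash"] if chain else GENESIS
    record_hash = hashlib.sha256(prev_hash.encode("ascii") + canonical(event)).hexdigest()
    chain.append({"event": event, "prev_hash": prev_hash, "record_hash": record_hash})
    return record_hash


def verify(chain: list[dict], head: str) -> bool:
    """Recompute every hash and compare the result with the stored head."""
    prev_hash = GENESIS
    for row in chain:
        expected = hashlib.sha256(prev_hash.encode("ascii") + canonical(row["event"])).hexdigest()
        if row["prev_hash"] != prev_hash or row["record_hash"] != expected:
            return False  # a row was edited without recomputing the chain
        prev_hash = expected
    return prev_hash == head  # a mismatch means rows were dropped from the end
```

Because the chain is unkeyed, anyone who can rewrite every row and the head can produce a chain that verifies, which is the limitation the changelog entry calls out.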

Package Summary

Version 0.6.0
License Apache-2.0
Build type AMENT_CMAKE
Use RECOMMENDED

Repository Summary

Checkout URI https://github.com/selfpatch/ros2_medkit.git
VCS Type git
VCS Version main
Last Updated 2026-07-03
Dev Status DEVELOPED
Released RELEASED
Contributing Help Wanted (-)
Good First Issues (-)
Pull Requests to Review (-)

Package Description

Central fault manager node for ros2_medkit fault management system

Maintainers

  • bburda

Authors

No additional authors.

ros2_medkit_fault_manager

Central fault manager node for the ros2_medkit fault management system.

Overview

The FaultManager node provides a central point for fault aggregation and lifecycle management. It receives fault reports from multiple sources, aggregates them by fault_code, and provides query and clearing interfaces.

Quick Start

By default, faults are confirmed immediately when reported - no additional configuration needed.

# Start the fault manager
ros2 launch ros2_medkit_fault_manager fault_manager.launch.py

# Report a fault - it's immediately CONFIRMED
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
  "{fault_code: 'MOTOR_OVERHEAT', event_type: 0, severity: 2, description: 'Motor temp exceeded', source_id: '/motor_node'}"

# Query faults
ros2 service call /fault_manager/list_faults ros2_medkit_msgs/srv/ListFaults \
  "{statuses: ['CONFIRMED']}"

# Clear a fault (cascade-clears correlated symptoms by default)
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
  "{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: false}"

# Clear without touching correlated symptoms
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
  "{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: true}"

Note: The skip_correlation_auto_clear request field was added post-0.4.0. Adding a request field changes the service type hash, so callers built against ros2_medkit_msgs 0.4.0 or earlier must rebuild to keep talking to fault_manager.

Services

Service Type Description
~/report_fault ros2_medkit_msgs/srv/ReportFault Report a fault occurrence
~/list_faults ros2_medkit_msgs/srv/ListFaults Query faults with filtering
~/clear_fault ros2_medkit_msgs/srv/ClearFault Clear/acknowledge a fault
~/get_snapshots ros2_medkit_msgs/srv/GetSnapshots Get topic snapshots for a fault

Features

  • Multi-source aggregation: Same fault_code from different sources creates a single fault
  • Occurrence tracking: Counts total reports and tracks all reporting sources
  • Severity escalation: Fault severity is updated if a higher severity is reported
  • Persistent storage: SQLite backend ensures faults survive node restarts
  • Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
  • Snapshot capture: Captures topic data when faults are confirmed for debugging (snapshots are deleted when fault is cleared)
  • Freeze-frame retention: One compact JSON freeze-frame per fault code, retained across clear_fault (see below)
  • Fault correlation (optional): Root cause analysis with symptom muting and auto-clear
  • Tamper-evident audit log (optional): Append-only, hash-chained record of fault state transitions for verifiable history

Parameters

Parameter Type Default Description
storage_type string "sqlite" Storage backend: "sqlite" or "memory"
database_path string "/var/lib/ros2_medkit/faults.db" Path to SQLite database file
confirmation_threshold int -1 Counter value at which faults are confirmed
healing_enabled bool false Enable automatic healing via PASSED events
healing_threshold int 3 Counter value at which faults are healed
auto_confirm_after_sec double 0.0 Auto-confirm PREFAILED faults after timeout (0 = disabled)
entity_thresholds.config_file string "" Path to YAML file with per-entity debounce threshold overrides

Snapshot Parameters

Snapshots capture topic data when faults are confirmed for post-mortem debugging.

Each confirm also writes a freeze-frame: a single compact JSON object mapping every captured topic to its value at confirmation time, keyed by fault code. It differs from per-topic snapshots in two ways: snapshots are deleted when the fault is cleared, while the freeze-frame is retained across clear_fault (once the snapshots are gone, ~/get_fault serves the retained frame so the confirmed-state record stays available after acknowledgement); and a re-confirm that captures nothing (e.g. source publishers down) never overwrites an existing non-empty frame. A fault code with no configured capture set gets no freeze-frame row; a configured capture that samples nothing on its first run records an empty {} frame. Freeze-frame storage is bounded by the number of distinct fault codes (one row per code, replaced in place) and rows are never evicted.

Under a fault storm, captures are bounded by a worker pool (capture_pool_size) draining a bounded queue (capture_queue_depth); excess captures are dropped per capture_queue_full_policy and logged (throttled). The pool is shared and is created when snapshots or rosbag is enabled, so these parameters bound both. capture_pool_size parallelizes freeze-frame snapshot capture only - rosbag is single-writer and records one fault at a time regardless of pool size.

Parameter Type Default Description
snapshots.enabled bool true Enable/disable snapshot capture
snapshots.background_capture bool false Use background subscriptions (caches latest message) vs on-demand capture
snapshots.timeout_sec double 1.0 Timeout waiting for topic message (on-demand mode)
snapshots.max_message_size int 65536 Maximum message size in bytes (larger messages skipped)
snapshots.default_topics string[] [] Topics to capture for all faults
snapshots.config_file string "" Path to YAML config for fault_specific and patterns
snapshots.recapture_cooldown_sec double 60.0 Min seconds between captures for the same fault code.
snapshots.max_per_fault int 10 Max snapshots retained per fault.
snapshots.capture_pool_size int 2 Max concurrent capture threads under a fault storm (>= 1). Parallelizes snapshot capture only; rosbag stays single-writer.
snapshots.capture_queue_depth int 16 Max pending captures before the full-queue policy applies (>= 1).
snapshots.capture_queue_full_policy string reject_newest Policy when the queue is full: reject_newest or drop_oldest.

Topic Resolution Priority:

  1. fault_specific - Exact match for fault code (configured via YAML config file)
  2. patterns - Regex pattern match (configured via YAML config file)
  3. default_topics - Fallback for all faults

Example YAML config file (snapshots.yaml):

```yaml fault_specific:

File truncated at 100 lines see the full file

CHANGELOG

Changelog for package ros2_medkit_fault_manager

Forthcoming

  • Optional append-only, hash-chained audit log of fault state transitions: each transition appends one immutable row (record_hash = sha256(prev_hash + canonical(event)) via OpenSSL EVP SHA-256) with a persisted chain head, a verify routine, a read API, and retention that seals a segment anchor before pruning. Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited. verify reads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering. BEFORE UPDATE / BEFORE DELETE triggers reject out-of-band edits as defense-in-depth. The chain is unkeyed and stored in a single writable file, so verify detects edits/deletions that did not recompute the chain (casual or accidental tampering); it is not a defence against an attacker who can rewrite the whole file. Off by default (#483)

0.6.0 (2026-06-22)

  • Bounded concurrent snapshot capture under fault storms with a CaptureThreadPool and configurable capture pool / queue / overflow-policy parameters. The rosbag leg is serialized and the cooldown map is bounded, so a burst of simultaneous faults can no longer exhaust capture threads or grow memory without limit (#456)
  • Entity-scoped rosbag capture by default (#431)
  • Made rosbag capture enablement crash-safe (#430)
  • Contributors: \@bburda, \@mfaferek93

0.5.0 (2026-06-08)

  • ClearFault honors the new skip_correlation_auto_clear request flag so per-entity fault clears can opt out of cascade-clearing correlated symptom fault codes (#395)
  • Three-layer protection against unbounded snapshot growth (bounded buffers plus pruning)
  • Concurrency and lifetime hardening: serialize concurrent subscription creation in SnapshotCapture, join capture threads in the FaultManagerNode destructor, and defense-in-depth shutdown guards to prevent teardown crashes across distros
  • Aggregation security hardening and improved test coverage
  • Build: adopt the centralized ROS2MedkitWarnings and ROS2MedkitSanitizers cmake modules and bugprone / special-member-functions clang-tidy checks
  • Contributors: @bburda

0.4.0 (2026-03-20)

  • Per-entity confirmation and healing thresholds via manifest configuration (#269)
  • Default rosbag storage format changed from sqlite3 to mcap
  • Support for namespaced fault manager nodes - gateway resolves service/topic names when the fault manager runs in a custom namespace
  • Build: use shared cmake modules from ros2_medkit_cmake package
  • Build: centralized clang-tidy configuration
  • Contributors: @bburda

0.3.0 (2026-02-27)

  • Accurate HIGHEST_SEVERITY reassignment and stale fault_to_cluster_ cleanup (#221)
  • Clean up pending_clusters_ when fault cleared before min_count (#211)
  • Multi-distro CI support for ROS 2 Humble, Jazzy, and Rolling (#219, #242)
  • Contributors: @bburda, @eclipse0922

0.2.0 (2026-02-07)

  • Initial rosdistro release
  • Central fault management node with ROS 2 services:
    • ReportFault - report FAILED/PASSED events with debounce filtering
    • GetFaults - query faults with filtering by severity, status, correlation
    • ClearFault - clear/acknowledge faults
  • Debounce filtering with configurable thresholds:
    • FAILED events decrement counter, PASSED events increment
    • Configurable confirmation_threshold (default: -1, immediate)
    • Optional healing support (healing_enabled, healing_threshold)
    • Time-based auto-confirmation (auto_confirm_after_sec)
    • CRITICAL severity bypasses debounce
  • Dual storage backends:
    • SQLite persistent storage with WAL mode (default)
    • In-memory storage for testing/lightweight deployments
  • Snapshot capture on fault confirmation:
    • Topic data captured as JSON with configurable topic resolution
    • Priority: fault_specific > patterns > default_topics
    • Stored in SQLite with indexed fault_code lookup
    • Auto-cleanup on fault clear

File truncated at 100 lines see the full file


Services

| Service | Type | Description |
|---|---|---|
| ~/report_fault | ros2_medkit_msgs/srv/ReportFault | Report a fault occurrence |
| ~/list_faults | ros2_medkit_msgs/srv/ListFaults | Query faults with filtering |
| ~/clear_fault | ros2_medkit_msgs/srv/ClearFault | Clear/acknowledge a fault |
| ~/get_snapshots | ros2_medkit_msgs/srv/GetSnapshots | Get topic snapshots for a fault |

Features

  • Multi-source aggregation: Reports with the same fault_code from different sources are merged into a single fault
  • Occurrence tracking: Counts total reports and tracks all reporting sources
  • Severity escalation: Fault severity is updated if a higher severity is reported
  • Persistent storage: SQLite backend ensures faults survive node restarts
  • Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
  • Snapshot capture: Captures topic data for debugging when a fault is confirmed (snapshots are deleted when the fault is cleared)
  • Freeze-frame retention: One compact JSON freeze-frame per fault code, retained across clear_fault (see below)
  • Fault correlation (optional): Root cause analysis with symptom muting and auto-clear
  • Tamper-evident audit log (optional): Append-only, hash-chained record of fault state transitions for verifiable history
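The first three features (aggregation by fault_code, occurrence tracking, severity escalation) can be sketched in a few lines of Python. This is an illustrative model only, not the node's code: the class and field names are hypothetical, and it assumes that a larger integer means a higher severity.

```python
from dataclasses import dataclass, field


@dataclass
class AggregatedFault:
    # Hypothetical record; field names are illustrative, not the message definition.
    fault_code: str
    severity: int
    occurrence_count: int = 0
    sources: set = field(default_factory=set)


class FaultAggregator:
    def __init__(self):
        self.faults = {}  # fault_code -> AggregatedFault

    def report(self, fault_code: str, severity: int, source_id: str) -> AggregatedFault:
        fault = self.faults.get(fault_code)
        if fault is None:
            # First report for this code creates the single aggregated fault.
            fault = AggregatedFault(fault_code, severity)
            self.faults[fault_code] = fault
        fault.occurrence_count += 1      # occurrence tracking
        fault.sources.add(source_id)     # remember every reporting source
        # Severity escalation: assumes larger numbers are more severe.
        fault.severity = max(fault.severity, severity)
        return fault
```

Reporting MOTOR_OVERHEAT from /motor_node and then from a second, hypothetical source leaves one entry with occurrence_count 2 and both sources recorded.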

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| storage_type | string | "sqlite" | Storage backend: "sqlite" or "memory" |
| database_path | string | "/var/lib/ros2_medkit/faults.db" | Path to SQLite database file |
| confirmation_threshold | int | -1 | Counter value at which faults are confirmed |
| healing_enabled | bool | false | Enable automatic healing via PASSED events |
| healing_threshold | int | 3 | Counter value at which faults are healed |
| auto_confirm_after_sec | double | 0.0 | Auto-confirm PREFAILED faults after timeout (0 = disabled) |
| entity_thresholds.config_file | string | "" | Path to YAML file with per-entity debounce threshold overrides |
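How the three debounce parameters interact can be sketched as a small counter model in Python. It follows the changelog's description (FAILED events decrement the counter, PASSED events increment it, CRITICAL severity bypasses debounce), but it is only a sketch under assumptions: the state names NONE and HEALED, the <= / >= comparisons, and the absence of counter clamping are not confirmed by this document, and the auto_confirm_after_sec timer and per-entity overrides are left out.

```python
from dataclasses import dataclass

FAILED, PASSED = "FAILED", "PASSED"


@dataclass
class DebounceConfig:
    confirmation_threshold: int = -1  # default: the first FAILED confirms
    healing_enabled: bool = False
    healing_threshold: int = 3


class DebounceCounter:
    def __init__(self, config: DebounceConfig):
        self.config = config
        self.counter = 0
        self.status = "NONE"  # hypothetical initial state name

    def on_event(self, event: str, critical: bool = False) -> str:
        if event == FAILED:
            self.counter -= 1  # FAILED events decrement the counter
            if critical or self.counter <= self.config.confirmation_threshold:
                self.status = "CONFIRMED"
            elif self.status != "CONFIRMED":
                self.status = "PREFAILED"
        elif event == PASSED:
            self.counter += 1  # PASSED events increment the counter
            if (self.config.healing_enabled
                    and self.counter >= self.config.healing_threshold):
                self.status = "HEALED"  # hypothetical state name
        return self.status
```

With the defaults, a single FAILED event takes the counter to -1 and confirms the fault, which matches the immediate confirmation described in Quick Start. Setting confirmation_threshold to -3 in this model would require three FAILED events, or one CRITICAL report.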

Snapshot Parameters

When a fault is confirmed, snapshots capture topic data for post-mortem debugging.

Each confirmation also writes a freeze-frame: a single compact JSON object, keyed by fault code, that maps every captured topic to its value at confirmation time. It differs from per-topic snapshots in two ways:

  • Retention: snapshots are deleted when the fault is cleared, while the freeze-frame is retained across clear_fault. Once the snapshots are gone, ~/get_fault serves the retained frame, so the confirmed-state record stays available after acknowledgement.
  • Overwrite protection: a re-confirmation that captures nothing (e.g. source publishers down) never overwrites an existing non-empty frame.

A fault code with no configured capture set gets no freeze-frame row; a configured capture that samples nothing on its first run records an empty {} frame. Freeze-frame storage is bounded by the number of distinct fault codes (one row per code, replaced in place), and rows are never evicted.
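The freeze-frame rules above can be condensed into a short Python model. It is a sketch rather than the actual storage code: FreezeFrameStore and its method names are hypothetical, and a plain dict stands in for the one-row-per-code table.

```python
class FreezeFrameStore:
    """Hypothetical model of the freeze-frame rules; not the real storage layer."""

    def __init__(self):
        self.frames = {}  # fault_code -> {topic: value}, one entry per code

    def on_confirm(self, fault_code, captured, capture_configured=True):
        if not capture_configured:
            return  # no configured capture set: no freeze-frame row at all
        existing = self.frames.get(fault_code)
        if not captured and existing:
            return  # an empty re-capture never overwrites a non-empty frame
        # First run (even if empty) or a non-empty capture: replace in place.
        self.frames[fault_code] = dict(captured)

    def on_clear(self, fault_code):
        # Snapshots would be deleted here; the freeze-frame is retained.
        pass
```

Since entries are only ever replaced under their own fault code and never removed, the number of stored frames cannot exceed the number of distinct fault codes.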

Under a fault storm, captures are bounded by a worker pool (capture_pool_size) that drains a bounded queue (capture_queue_depth); excess captures are dropped according to capture_queue_full_policy and logged (throttled). The pool is shared and is created when either snapshots or rosbag is enabled, so these parameters bound both. Note that capture_pool_size parallelizes freeze-frame snapshot capture only: rosbag is single-writer and records one fault at a time regardless of pool size.
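A minimal Python sketch of the bounded queue and its two overflow policies follows. The real node uses a C++ CaptureThreadPool; the class below is hypothetical and models only the queue, not the worker threads. The behavior of reject_newest and drop_oldest is inferred from the policy names and is an assumption.

```python
from collections import deque


class BoundedCaptureQueue:
    """Hypothetical model of the pending-capture queue; workers are omitted."""

    def __init__(self, depth=16, full_policy="reject_newest"):
        if depth < 1:
            raise ValueError("depth must be >= 1")
        if full_policy not in ("reject_newest", "drop_oldest"):
            raise ValueError("unknown policy: " + full_policy)
        self.depth = depth
        self.full_policy = full_policy
        self.pending = deque()
        self.dropped = 0  # the real node would log drops (throttled)

    def submit(self, fault_code) -> bool:
        """Return True if fault_code was queued, False if it was rejected."""
        if len(self.pending) >= self.depth:
            self.dropped += 1
            if self.full_policy == "reject_newest":
                return False        # keep the queue as is, refuse the newcomer
            self.pending.popleft()  # drop_oldest: evict the oldest pending capture
        self.pending.append(fault_code)
        return True
```

Under these assumptions, reject_newest favors the faults that arrived first in a storm, while drop_oldest favors the most recent ones. In both cases the queue never holds more than capture_queue_depth entries.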

| Parameter | Type | Default | Description |
|---|---|---|---|
| snapshots.enabled | bool | true | Enable/disable snapshot capture |
| snapshots.background_capture | bool | false | Use background subscriptions (caches latest message) vs on-demand capture |
| snapshots.timeout_sec | double | 1.0 | Timeout waiting for topic message (on-demand mode) |
| snapshots.max_message_size | int | 65536 | Maximum message size in bytes (larger messages skipped) |
| snapshots.default_topics | string[] | [] | Topics to capture for all faults |
| snapshots.config_file | string | "" | Path to YAML config for fault_specific and patterns |
| snapshots.recapture_cooldown_sec | double | 60.0 | Min seconds between captures for the same fault code. |
| snapshots.max_per_fault | int | 10 | Max snapshots retained per fault. |
| snapshots.capture_pool_size | int | 2 | Max concurrent capture threads under a fault storm (>= 1). Parallelizes snapshot capture only; rosbag stays single-writer. |
| snapshots.capture_queue_depth | int | 16 | Max pending captures before the full-queue policy applies (>= 1). |
| snapshots.capture_queue_full_policy | string | reject_newest | Policy when the queue is full: reject_newest or drop_oldest. |
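The effect of snapshots.recapture_cooldown_sec and snapshots.max_per_fault can be pictured with the Python sketch below. It rests on assumptions: the class name is hypothetical, timestamps are passed in explicitly, and evicting the oldest snapshot first is a guess, since the table only states that at most max_per_fault snapshots are retained.

```python
class SnapshotRetention:
    """Hypothetical model of the per-fault cooldown and retention cap."""

    def __init__(self, recapture_cooldown_sec=60.0, max_per_fault=10):
        self.cooldown = recapture_cooldown_sec
        self.max_per_fault = max_per_fault
        self.last_capture = {}  # fault_code -> time of the last capture (seconds)
        self.snapshots = {}     # fault_code -> snapshots, oldest first

    def try_capture(self, fault_code, snapshot, now) -> bool:
        last = self.last_capture.get(fault_code)
        if last is not None and now - last < self.cooldown:
            return False  # still inside the cooldown window: skip this capture
        self.last_capture[fault_code] = now
        kept = self.snapshots.setdefault(fault_code, [])
        kept.append(snapshot)
        while len(kept) > self.max_per_fault:
            kept.pop(0)  # assumed eviction order: oldest first
        return True
```

Together the two limits cap both how often a flapping fault triggers a capture and how much history it can accumulate.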

Topic Resolution Priority:

  1. fault_specific - Exact match for fault code (configured via YAML config file)
  2. patterns - Regex pattern match (configured via YAML config file)
  3. default_topics - Fallback for all faults
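The three-step priority above can be written as a small lookup function. The sketch below is in Python and makes several assumptions: the function name, the dict shapes for fault_specific and patterns, full-string regex matching, and "first matching pattern wins" are illustrative choices, not confirmed behavior. The fault codes and topic names in the usage note are made up.

```python
import re


def resolve_topics(fault_code, fault_specific, patterns, default_topics):
    """Return the topics to capture for fault_code (hypothetical helper)."""
    # 1. fault_specific: exact match on the fault code
    if fault_code in fault_specific:
        return fault_specific[fault_code]
    # 2. patterns: regex match (assumed: full match, first pattern wins)
    for pattern, topics in patterns.items():
        if re.fullmatch(pattern, fault_code):
            return topics
    # 3. default_topics: fallback for all faults
    return default_topics
```

For example, with fault_specific = {"MOTOR_OVERHEAT": ["/motor/temperature"]} and patterns = {"MOTOR_.*": ["/motor/status"]}, MOTOR_OVERHEAT resolves to the exact entry, MOTOR_STALL falls through to the pattern, and any other code gets default_topics.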

Example YAML config file (snapshots.yaml):

```yaml
fault_specific:
  # File truncated at 100 lines; see the full file.
```






  • Initial rosdistro release
  • Central fault management node with ROS 2 services:
    • ReportFault - report FAILED/PASSED events with debounce filtering
    • GetFaults - query faults with filtering by severity, status, correlation
    • ClearFault - clear/acknowledge faults
  • Debounce filtering with configurable thresholds:
    • FAILED events decrement counter, PASSED events increment
    • Configurable confirmation_threshold (default: -1, immediate)
    • Optional healing support (healing_enabled, healing_threshold)
    • Time-based auto-confirmation (auto_confirm_after_sec)
    • CRITICAL severity bypasses debounce
  • Dual storage backends:
    • SQLite persistent storage with WAL mode (default)
    • In-memory storage for testing/lightweight deployments
  • Snapshot capture on fault confirmation:
    • Topic data captured as JSON with configurable topic resolution
    • Priority: fault_specific > patterns > default_topics
    • Stored in SQLite with indexed fault_code lookup
    • Auto-cleanup on fault clear

File truncated at 100 lines see the full file

Launch files

No launch files found

Messages

No message files found.

Services

No service files found

Plugins

No plugins found.

Recent questions tagged ros2_medkit_fault_manager at Robotics Stack Exchange

No version for distro dashing showing humble. Known supported distros are highlighted in the buttons above.

Package Summary

Version 0.6.0
License Apache-2.0
Build type AMENT_CMAKE
Use RECOMMENDED

Repository Summary

Checkout URI https://github.com/selfpatch/ros2_medkit.git
VCS Type git
VCS Version main
Last Updated 2026-07-03
Dev Status DEVELOPED
Released RELEASED
Contributing Help Wanted (-)
Good First Issues (-)
Pull Requests to Review (-)

Package Description

Central fault manager node for ros2_medkit fault management system

Maintainers

  • bburda

Authors

No additional authors.

ros2_medkit_fault_manager

Central fault manager node for the ros2_medkit fault management system.

Overview

The FaultManager node provides a central point for fault aggregation and lifecycle management. It receives fault reports from multiple sources, aggregates them by fault_code, and provides query and clearing interfaces.

Quick Start

By default (confirmation_threshold = -1), a fault is confirmed as soon as it is first reported; no additional configuration is needed.

# Start the fault manager
ros2 launch ros2_medkit_fault_manager fault_manager.launch.py

# Report a fault - it's immediately CONFIRMED
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
  "{fault_code: 'MOTOR_OVERHEAT', event_type: 0, severity: 2, description: 'Motor temp exceeded', source_id: '/motor_node'}"

# Query faults
ros2 service call /fault_manager/list_faults ros2_medkit_msgs/srv/ListFaults \
  "{statuses: ['CONFIRMED']}"

# Clear a fault (cascade-clears correlated symptoms by default)
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
  "{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: false}"

# Clear without touching correlated symptoms
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
  "{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: true}"

Note: The skip_correlation_auto_clear request field was added post-0.4.0. Adding a request field changes the service type hash, so callers built against ros2_medkit_msgs 0.4.0 or earlier must rebuild to keep talking to fault_manager.

Services

| Service | Type | Description |
| --- | --- | --- |
| ~/report_fault | ros2_medkit_msgs/srv/ReportFault | Report a fault occurrence |
| ~/list_faults | ros2_medkit_msgs/srv/ListFaults | Query faults with filtering |
| ~/clear_fault | ros2_medkit_msgs/srv/ClearFault | Clear/acknowledge a fault |
| ~/get_snapshots | ros2_medkit_msgs/srv/GetSnapshots | Get topic snapshots for a fault |

Features

  • Multi-source aggregation: Same fault_code from different sources creates a single fault
  • Occurrence tracking: Counts total reports and tracks all reporting sources
  • Severity escalation: Fault severity is updated if a higher severity is reported
  • Persistent storage: SQLite backend ensures faults survive node restarts
  • Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
  • Snapshot capture: Captures topic data for debugging when a fault is confirmed (snapshots are deleted when the fault is cleared)
  • Freeze-frame retention: One compact JSON freeze-frame per fault code, retained across clear_fault (see below)
  • Fault correlation (optional): Root cause analysis with symptom muting and auto-clear
  • Tamper-evident audit log (optional): Append-only, hash-chained record of fault state transitions for verifiable history
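The first three features (one record per fault_code, occurrence tracking, severity escalation) can be pictured with a small in-memory model. This is only an illustrative sketch, not the node's implementation: the class and field names are hypothetical, and it assumes that a larger integer means a higher severity and that a lower-severity report never downgrades an existing fault.

```python
from dataclasses import dataclass, field


@dataclass
class AggregatedFault:
    """One record per fault_code, shared by every reporting source."""
    fault_code: str
    severity: int
    occurrence_count: int = 0
    sources: set = field(default_factory=set)


class FaultAggregator:
    """Hypothetical model of aggregation by fault_code."""

    def __init__(self):
        self._faults = {}  # fault_code -> AggregatedFault

    def report(self, fault_code, severity, source_id):
        fault = self._faults.get(fault_code)
        if fault is None:
            fault = AggregatedFault(fault_code, severity)
            self._faults[fault_code] = fault
        fault.occurrence_count += 1
        fault.sources.add(source_id)
        # Assumption: severity only escalates, it is never lowered by a report.
        fault.severity = max(fault.severity, severity)
        return fault


agg = FaultAggregator()
agg.report("MOTOR_OVERHEAT", 1, "/motor_node")
fault = agg.report("MOTOR_OVERHEAT", 2, "/thermal_monitor")
print(fault.occurrence_count, fault.severity, sorted(fault.sources))
```

Two reports from different sources end up in a single record with an occurrence count of 2 and the higher of the two severities.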

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| storage_type | string | "sqlite" | Storage backend: "sqlite" or "memory" |
| database_path | string | "/var/lib/ros2_medkit/faults.db" | Path to SQLite database file |
| confirmation_threshold | int | -1 | Counter value at which faults are confirmed |
| healing_enabled | bool | false | Enable automatic healing via PASSED events |
| healing_threshold | int | 3 | Counter value at which faults are healed |
| auto_confirm_after_sec | double | 0.0 | Auto-confirm PREFAILED faults after timeout (0 = disabled) |
| entity_thresholds.config_file | string | "" | Path to YAML file with per-entity debounce threshold overrides |
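To make the counter semantics concrete, here is a minimal sketch of how confirmation_threshold, healing_enabled and healing_threshold could interact, based on the changelog's description (FAILED events decrement the counter, PASSED events increment it, CRITICAL severity bypasses debounce). The function, the starting counter value of 0 and the HEALED status name are assumptions made for illustration; time-based auto-confirmation, counter clamping and re-occurrence after healing are deliberately left out.

```python
from dataclasses import dataclass

FAILED, PASSED = "FAILED", "PASSED"


@dataclass
class DebounceState:
    counter: int = 0           # assumed to start at 0
    status: str = "PREFAILED"  # "HEALED" below is an assumed status name


def apply_event(state, event, *, confirmation_threshold=-1,
                healing_enabled=False, healing_threshold=3, critical=False):
    """Update the counter for one event and return the resulting status."""
    if event == FAILED:
        state.counter -= 1
        # CRITICAL reports skip the counter check entirely.
        if critical or state.counter <= confirmation_threshold:
            state.status = "CONFIRMED"
    elif event == PASSED:
        state.counter += 1
        if healing_enabled and state.counter >= healing_threshold:
            state.status = "HEALED"
    else:
        raise ValueError(f"unknown event: {event!r}")
    return state.status


# Default threshold (-1): the first FAILED event confirms the fault.
print(apply_event(DebounceState(), FAILED))

# Threshold -3: three FAILED events are needed.
slow = DebounceState()
statuses = [apply_event(slow, FAILED, confirmation_threshold=-3) for _ in range(3)]
print(statuses)
```

With the default threshold the first call already yields CONFIRMED, which matches the immediate-confirmation behavior described in Quick Start.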

Snapshot Parameters

When a fault is confirmed, snapshots capture topic data for post-mortem debugging.

Each confirmation also writes a freeze-frame: a single compact JSON object, keyed by fault code, that maps every captured topic to its value at confirmation time. It differs from per-topic snapshots in two ways:

  • Snapshots are deleted when the fault is cleared, while the freeze-frame is retained across clear_fault. Once the snapshots are gone, ~/get_fault serves the retained frame, so the confirmed-state record stays available after acknowledgement.
  • A re-confirmation that captures nothing (e.g. because the source publishers are down) never overwrites an existing non-empty frame.

A fault code with no configured capture set gets no freeze-frame row; a configured capture that samples nothing on its first run records an empty {} frame. Freeze-frame storage is bounded by the number of distinct fault codes (one row per code, replaced in place), and rows are never evicted.
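The freeze-frame rules above are easy to misread, so the following sketch models them with a plain dictionary. It is a hypothetical illustration only: the class, method names and the example topic are invented here, and the real storage is a database row per fault code rather than an in-memory dict.

```python
class FreezeFrameStore:
    """One frame per fault code, replaced in place and never evicted."""

    def __init__(self):
        self._frames = {}  # fault_code -> {topic: value}

    def on_confirm(self, fault_code, captured, capture_configured=True):
        if not capture_configured:
            return  # no configured capture set: no row is written
        existing = self._frames.get(fault_code)
        if not captured and existing:
            return  # an empty re-capture never overwrites a non-empty frame
        self._frames[fault_code] = dict(captured)

    def on_clear(self, fault_code):
        # Per-topic snapshots would be deleted here; the frame is kept.
        pass

    def get(self, fault_code):
        return self._frames.get(fault_code)


store = FreezeFrameStore()
store.on_confirm("MOTOR_OVERHEAT", {"/motor/temp": 92.5})  # hypothetical topic
store.on_clear("MOTOR_OVERHEAT")
store.on_confirm("MOTOR_OVERHEAT", {})  # publishers down on re-confirm
print(store.get("MOTOR_OVERHEAT"))

store.on_confirm("FIRST_RUN_EMPTY", {})
print(store.get("FIRST_RUN_EMPTY"))
```

The first lookup still returns the original frame after the clear and the empty re-capture; the second shows that a first capture with no samples records an empty frame rather than no row.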

Under a fault storm, captures are bounded by a worker pool (capture_pool_size) draining a bounded queue (capture_queue_depth); excess captures are dropped per capture_queue_full_policy and logged (throttled). The pool is shared and is created when snapshots or rosbag is enabled, so these parameters bound both. capture_pool_size parallelizes freeze-frame snapshot capture only - rosbag is single-writer and records one fault at a time regardless of pool size.
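The difference between the two full-queue policies can be sketched with a bounded deque. This is an assumption-laden model, not the CaptureThreadPool itself: the class name and methods are hypothetical, worker threads are omitted, and it assumes drop_oldest evicts the longest-waiting request to make room for the new one.

```python
from collections import deque


class BoundedCaptureQueue:
    """Hypothetical model of capture_queue_depth and capture_queue_full_policy."""

    def __init__(self, depth=16, full_policy="reject_newest"):
        if depth < 1:
            raise ValueError("depth must be >= 1")
        if full_policy not in ("reject_newest", "drop_oldest"):
            raise ValueError(f"unknown policy: {full_policy!r}")
        self._depth = depth
        self._policy = full_policy
        self._pending = deque()
        self.dropped = 0  # number of capture requests lost to overflow

    def submit(self, fault_code):
        """Queue a capture request; return True if it was accepted."""
        if len(self._pending) < self._depth:
            self._pending.append(fault_code)
            return True
        self.dropped += 1
        if self._policy == "reject_newest":
            return False  # the incoming request is discarded
        self._pending.popleft()  # drop_oldest: evict the longest-waiting request
        self._pending.append(fault_code)
        return True

    def pending(self):
        return list(self._pending)


q = BoundedCaptureQueue(depth=2, full_policy="drop_oldest")
for code in ("A", "B", "C"):
    q.submit(code)
print(q.pending(), q.dropped)
```

With reject_newest the queue would instead keep A and B and refuse C; either way exactly one request is lost once the queue is full.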

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| snapshots.enabled | bool | true | Enable/disable snapshot capture |
| snapshots.background_capture | bool | false | Use background subscriptions (caches latest message) vs on-demand capture |
| snapshots.timeout_sec | double | 1.0 | Timeout waiting for topic message (on-demand mode) |
| snapshots.max_message_size | int | 65536 | Maximum message size in bytes (larger messages skipped) |
| snapshots.default_topics | string[] | [] | Topics to capture for all faults |
| snapshots.config_file | string | "" | Path to YAML config for fault_specific and patterns |
| snapshots.recapture_cooldown_sec | double | 60.0 | Min seconds between captures for the same fault code |
| snapshots.max_per_fault | int | 10 | Max snapshots retained per fault |
| snapshots.capture_pool_size | int | 2 | Max concurrent capture threads under a fault storm (>= 1). Parallelizes snapshot capture only; rosbag stays single-writer |
| snapshots.capture_queue_depth | int | 16 | Max pending captures before the full-queue policy applies (>= 1) |
| snapshots.capture_queue_full_policy | string | "reject_newest" | Policy when the queue is full: reject_newest or drop_oldest |
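As a rough illustration of how recapture_cooldown_sec and max_per_fault limit snapshot growth, the sketch below tracks the last capture time per fault code and caps the retained list. The class is hypothetical, time is passed in explicitly to keep the example deterministic, and pruning the oldest snapshots first is an assumption; the table only states that at most max_per_fault snapshots are retained.

```python
class SnapshotRetention:
    """Hypothetical model of the cooldown and per-fault retention cap."""

    def __init__(self, recapture_cooldown_sec=60.0, max_per_fault=10):
        self._cooldown = recapture_cooldown_sec
        self._max = max_per_fault
        self._last_capture = {}  # fault_code -> time of last capture (seconds)
        self._snapshots = {}     # fault_code -> snapshots, oldest first

    def try_capture(self, fault_code, snapshot, now):
        last = self._last_capture.get(fault_code)
        if last is not None and now - last < self._cooldown:
            return False  # still inside the cooldown window
        self._last_capture[fault_code] = now
        kept = self._snapshots.setdefault(fault_code, [])
        kept.append(snapshot)
        # Assumed pruning rule: discard the oldest entries beyond the cap.
        if len(kept) > self._max:
            del kept[:len(kept) - self._max]
        return True

    def snapshots(self, fault_code):
        return list(self._snapshots.get(fault_code, []))


r = SnapshotRetention(recapture_cooldown_sec=60.0, max_per_fault=2)
results = [
    r.try_capture("MOTOR_OVERHEAT", "s1", now=0.0),
    r.try_capture("MOTOR_OVERHEAT", "s2", now=30.0),   # inside cooldown
    r.try_capture("MOTOR_OVERHEAT", "s3", now=60.0),
    r.try_capture("MOTOR_OVERHEAT", "s4", now=120.0),
]
print(results, r.snapshots("MOTOR_OVERHEAT"))
```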

Topic Resolution Priority:

  1. fault_specific - Exact match for fault code (configured via YAML config file)
  2. patterns - Regex pattern match (configured via YAML config file)
  3. default_topics - Fallback for all faults
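The priority order above can be expressed as a short lookup function. This is a sketch under stated assumptions: the function name, the dictionary layout and the topic names are hypothetical (the YAML schema is truncated in this listing), and whether the real implementation uses a full or partial regex match, or picks the first of several matching patterns, is not stated, so a full match and first-match-wins are assumed here.

```python
import re


def resolve_topics(fault_code, fault_specific, patterns, default_topics):
    """Return the topics to capture for fault_code, by priority."""
    # 1. Exact match on the fault code.
    if fault_code in fault_specific:
        return fault_specific[fault_code]
    # 2. First regex pattern that matches (full match is an assumption).
    for pattern, topics in patterns.items():
        if re.fullmatch(pattern, fault_code):
            return topics
    # 3. Fallback shared by all faults.
    return default_topics


# Hypothetical configuration for illustration only.
fault_specific = {"MOTOR_OVERHEAT": ["/motor/temperature"]}
patterns = {"MOTOR_.*": ["/motor/status"]}
default_topics = ["/diagnostics"]

for code in ("MOTOR_OVERHEAT", "MOTOR_STALL", "BATTERY_LOW"):
    print(code, resolve_topics(code, fault_specific, patterns, default_topics))
```

MOTOR_OVERHEAT matches both the exact entry and the pattern, and the exact entry wins; MOTOR_STALL falls through to the pattern, and BATTERY_LOW to the defaults.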

Example YAML config file (snapshots.yaml):

```yaml
fault_specific:
```

File truncated at 100 lines; see the full file.

CHANGELOG

Changelog for package ros2_medkit_fault_manager

Forthcoming

  • Optional append-only, hash-chained audit log of fault state transitions. Off by default (#483)
    • Each transition appends one immutable row (record_hash = sha256(prev_hash + canonical(event)) via OpenSSL EVP SHA-256), with a persisted chain head, a verify routine, a read API, and retention that seals a segment anchor before pruning
    • Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited
    • verify reads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering
    • BEFORE UPDATE / BEFORE DELETE triggers reject out-of-band edits as defense-in-depth
    • The chain is unkeyed and stored in a single writable file, so verify detects edits and deletions that did not recompute the chain (casual or accidental tampering); it is not a defense against an attacker who can rewrite the whole file
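The hash-chain formula in this entry, record_hash = sha256(prev_hash + canonical(event)), can be tried out with Python's hashlib. The sketch below is not the package's implementation (which uses OpenSSL EVP and a database): the canonical form (sorted-key JSON), the all-zero genesis value, the event fields and the CLEARED status name are assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for prev_hash


def canonical(event):
    # Assumed canonical form: JSON with sorted keys and no extra whitespace.
    return json.dumps(event, sort_keys=True, separators=(",", ":"))


def record_hash(prev_hash, event):
    data = (prev_hash + canonical(event)).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def append(chain, event):
    prev = chain[-1]["record_hash"] if chain else GENESIS
    chain.append({"event": event, "prev_hash": prev,
                  "record_hash": record_hash(prev, event)})


def verify(chain, head):
    """Recompute every hash and compare the last one with the stored head."""
    prev = GENESIS
    for row in chain:
        expected = record_hash(prev, row["event"])
        if row["prev_hash"] != prev or row["record_hash"] != expected:
            return False
        prev = row["record_hash"]
    return prev == head


chain = []
append(chain, {"fault_code": "MOTOR_OVERHEAT", "from": "PREFAILED", "to": "CONFIRMED"})
append(chain, {"fault_code": "MOTOR_OVERHEAT", "from": "CONFIRMED", "to": "CLEARED"})
head = chain[-1]["record_hash"]

print(verify(chain, head))       # intact chain
print(verify(chain[:-1], head))  # newest row deleted, head kept
```

Because the chain is unkeyed, anyone who can rewrite every row and the head can produce a chain that verifies, which is the limitation the entry itself points out.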

0.6.0 (2026-06-22)

  • Bounded concurrent snapshot capture under fault storms with a CaptureThreadPool and configurable capture pool / queue / overflow-policy parameters. The rosbag leg is serialized and the cooldown map is bounded, so a burst of simultaneous faults can no longer exhaust capture threads or grow memory without limit (#456)
  • Entity-scoped rosbag capture by default (#431)
  • Made rosbag capture enablement crash-safe (#430)
  • Contributors: @bburda, @mfaferek93

0.5.0 (2026-06-08)

  • ClearFault honors the new skip_correlation_auto_clear request flag so per-entity fault clears can opt out of cascade-clearing correlated symptom fault codes (#395)
  • Three-layer protection against unbounded snapshot growth (bounded buffers plus pruning)
  • Concurrency and lifetime hardening: serialize concurrent subscription creation in SnapshotCapture, join capture threads in the FaultManagerNode destructor, and defense-in-depth shutdown guards to prevent teardown crashes across distros
  • Aggregation security hardening and improved test coverage
  • Build: adopt the centralized ROS2MedkitWarnings and ROS2MedkitSanitizers cmake modules and bugprone / special-member-functions clang-tidy checks
  • Contributors: @bburda

0.4.0 (2026-03-20)

  • Per-entity confirmation and healing thresholds via manifest configuration (#269)
  • Default rosbag storage format changed from sqlite3 to mcap
  • Support for namespaced fault manager nodes - gateway resolves service/topic names when the fault manager runs in a custom namespace
  • Build: use shared cmake modules from ros2_medkit_cmake package
  • Build: centralized clang-tidy configuration
  • Contributors: @bburda

0.3.0 (2026-02-27)

  • Accurate HIGHEST_SEVERITY reassignment and stale fault_to_cluster_ cleanup (#221)
  • Clean up pending_clusters_ when fault cleared before min_count (#211)
  • Multi-distro CI support for ROS 2 Humble, Jazzy, and Rolling (#219, #242)
  • Contributors: @bburda, @eclipse0922

0.2.0 (2026-02-07)

  • Initial rosdistro release
  • Central fault management node with ROS 2 services:
    • ReportFault - report FAILED/PASSED events with debounce filtering
    • GetFaults - query faults with filtering by severity, status, correlation
    • ClearFault - clear/acknowledge faults
  • Debounce filtering with configurable thresholds:
    • FAILED events decrement counter, PASSED events increment
    • Configurable confirmation_threshold (default: -1, immediate)
    • Optional healing support (healing_enabled, healing_threshold)
    • Time-based auto-confirmation (auto_confirm_after_sec)
    • CRITICAL severity bypasses debounce
  • Dual storage backends:
    • SQLite persistent storage with WAL mode (default)
    • In-memory storage for testing/lightweight deployments
  • Snapshot capture on fault confirmation:
    • Topic data captured as JSON with configurable topic resolution
    • Priority: fault_specific > patterns > default_topics
    • Stored in SQLite with indexed fault_code lookup
    • Auto-cleanup on fault clear

File truncated at 100 lines see the full file

Launch files

No launch files found

Messages

No message files found.

Services

No service files found

Plugins

No plugins found.

No version for distro foxy showing humble. Known supported distros are highlighted in the buttons above.

Package Summary

Version 0.6.0
License Apache-2.0
Build type AMENT_CMAKE
Use RECOMMENDED

Repository Summary

Checkout URI https://github.com/selfpatch/ros2_medkit.git
VCS Type git
VCS Version main
Last Updated 2026-07-03
Dev Status DEVELOPED
Released RELEASED
Contributing Help Wanted (-)
Good First Issues (-)
Pull Requests to Review (-)

Package Description

Central fault manager node for ros2_medkit fault management system

Maintainers

  • bburda

Authors

No additional authors.

ros2_medkit_fault_manager

Central fault manager node for the ros2_medkit fault management system.

Overview

The FaultManager node provides a central point for fault aggregation and lifecycle management. It receives fault reports from multiple sources, aggregates them by fault_code, and provides query and clearing interfaces.

Quick Start

By default, faults are confirmed immediately when reported - no additional configuration needed.

# Start the fault manager
ros2 launch ros2_medkit_fault_manager fault_manager.launch.py

# Report a fault - it's immediately CONFIRMED
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
  "{fault_code: 'MOTOR_OVERHEAT', event_type: 0, severity: 2, description: 'Motor temp exceeded', source_id: '/motor_node'}"

# Query faults
ros2 service call /fault_manager/list_faults ros2_medkit_msgs/srv/ListFaults \
  "{statuses: ['CONFIRMED']}"

# Clear a fault (cascade-clears correlated symptoms by default)
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
  "{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: false}"

# Clear without touching correlated symptoms
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
  "{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: true}"

Note: The skip_correlation_auto_clear request field was added post-0.4.0. Adding a request field changes the service type hash, so callers built against ros2_medkit_msgs 0.4.0 or earlier must rebuild to keep talking to fault_manager.

Services

Service Type Description
~/report_fault ros2_medkit_msgs/srv/ReportFault Report a fault occurrence
~/list_faults ros2_medkit_msgs/srv/ListFaults Query faults with filtering
~/clear_fault ros2_medkit_msgs/srv/ClearFault Clear/acknowledge a fault
~/get_snapshots ros2_medkit_msgs/srv/GetSnapshots Get topic snapshots for a fault

Features

  • Multi-source aggregation: Same fault_code from different sources creates a single fault
  • Occurrence tracking: Counts total reports and tracks all reporting sources
  • Severity escalation: Fault severity is updated if a higher severity is reported
  • Persistent storage: SQLite backend ensures faults survive node restarts
  • Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
  • Snapshot capture: Captures topic data when faults are confirmed for debugging (snapshots are deleted when fault is cleared)
  • Freeze-frame retention: One compact JSON freeze-frame per fault code, retained across clear_fault (see below)
  • Fault correlation (optional): Root cause analysis with symptom muting and auto-clear
  • Tamper-evident audit log (optional): Append-only, hash-chained record of fault state transitions for verifiable history

Parameters

Parameter Type Default Description
storage_type string "sqlite" Storage backend: "sqlite" or "memory"
database_path string "/var/lib/ros2_medkit/faults.db" Path to SQLite database file
confirmation_threshold int -1 Counter value at which faults are confirmed
healing_enabled bool false Enable automatic healing via PASSED events
healing_threshold int 3 Counter value at which faults are healed
auto_confirm_after_sec double 0.0 Auto-confirm PREFAILED faults after timeout (0 = disabled)
entity_thresholds.config_file string "" Path to YAML file with per-entity debounce threshold overrides

Snapshot Parameters

Snapshots capture topic data when faults are confirmed for post-mortem debugging.

Each confirm also writes a freeze-frame: a single compact JSON object mapping every captured topic to its value at confirmation time, keyed by fault code. It differs from per-topic snapshots in two ways: snapshots are deleted when the fault is cleared, while the freeze-frame is retained across clear_fault (once the snapshots are gone, ~/get_fault serves the retained frame so the confirmed-state record stays available after acknowledgement); and a re-confirm that captures nothing (e.g. source publishers down) never overwrites an existing non-empty frame. A fault code with no configured capture set gets no freeze-frame row; a configured capture that samples nothing on its first run records an empty {} frame. Freeze-frame storage is bounded by the number of distinct fault codes (one row per code, replaced in place) and rows are never evicted.

Under a fault storm, captures are bounded by a worker pool (capture_pool_size) draining a bounded queue (capture_queue_depth); excess captures are dropped per capture_queue_full_policy and logged (throttled). The pool is shared and is created when snapshots or rosbag is enabled, so these parameters bound both. capture_pool_size parallelizes freeze-frame snapshot capture only - rosbag is single-writer and records one fault at a time regardless of pool size.

Parameter Type Default Description
snapshots.enabled bool true Enable/disable snapshot capture
snapshots.background_capture bool false Use background subscriptions (caches latest message) vs on-demand capture
snapshots.timeout_sec double 1.0 Timeout waiting for topic message (on-demand mode)
snapshots.max_message_size int 65536 Maximum message size in bytes (larger messages skipped)
snapshots.default_topics string[] [] Topics to capture for all faults
snapshots.config_file string "" Path to YAML config for fault_specific and patterns
snapshots.recapture_cooldown_sec double 60.0 Min seconds between captures for the same fault code.
snapshots.max_per_fault int 10 Max snapshots retained per fault.
snapshots.capture_pool_size int 2 Max concurrent capture threads under a fault storm (>= 1). Parallelizes snapshot capture only; rosbag stays single-writer.
snapshots.capture_queue_depth int 16 Max pending captures before the full-queue policy applies (>= 1).
snapshots.capture_queue_full_policy string reject_newest Policy when the queue is full: reject_newest or drop_oldest.

Topic Resolution Priority:

  1. fault_specific - Exact match for fault code (configured via YAML config file)
  2. patterns - Regex pattern match (configured via YAML config file)
  3. default_topics - Fallback for all faults
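
The priority order above can be sketched as a small lookup function. This is an illustration under assumptions: the function name and the dict shapes are hypothetical, patterns are tried in insertion order with a full-string match, and the topic names in the usage note are made up. The real YAML schema is not shown here.

```python
import re
from typing import Dict, List


def resolve_topics(
    fault_code: str,
    fault_specific: Dict[str, List[str]],
    patterns: Dict[str, List[str]],
    default_topics: List[str],
) -> List[str]:
    """Pick the topic list for a fault code using the documented priority."""
    # 1. Exact match on the fault code wins.
    if fault_code in fault_specific:
        return fault_specific[fault_code]
    # 2. Otherwise the first regex that matches the whole code (assumed).
    for pattern, topics in patterns.items():
        if re.fullmatch(pattern, fault_code):
            return topics
    # 3. Otherwise fall back to the defaults.
    return default_topics
```

For example, with fault_specific = {"MOTOR_OVERHEAT": ["/motor/temperature"]} and patterns = {"MOTOR_.*": ["/motor/status"]}, MOTOR_OVERHEAT resolves to the exact-match list, MOTOR_STALL to the pattern list, and any other code to default_topics.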

Example YAML config file (snapshots.yaml):

```yaml
fault_specific:
```

(File truncated at 100 lines; see the full file.)

CHANGELOG

Changelog for package ros2_medkit_fault_manager

Forthcoming

  • Optional append-only, hash-chained audit log of fault state transitions: each transition appends one immutable row (record_hash = sha256(prev_hash + canonical(event)) via OpenSSL EVP SHA-256) with a persisted chain head, a verify routine, a read API, and retention that seals a segment anchor before pruning. Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited. verify reads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering. BEFORE UPDATE / BEFORE DELETE triggers reject out-of-band edits as defense-in-depth. The chain is unkeyed and stored in a single writable file, so verify detects edits/deletions that did not recompute the chain (casual or accidental tampering); it is not a defense against an attacker who can rewrite the whole file. Off by default (#483)
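
The hash-chain rule in the entry above, record_hash = sha256(prev_hash + canonical(event)), can be illustrated with a short Python sketch. It is not the package's implementation: the genesis value, the canonical JSON form, the row layout, and the event fields are assumptions, and the real log lives in SQLite with triggers and retention that are not modeled here.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # assumed prev_hash for the first row


def canonical(event: Dict[str, object]) -> str:
    """Assumed canonical form: compact JSON with sorted keys."""
    return json.dumps(event, sort_keys=True, separators=(",", ":"))


def record_hash(prev_hash: str, event: Dict[str, object]) -> str:
    return hashlib.sha256((prev_hash + canonical(event)).encode("utf-8")).hexdigest()


def append_event(log: List[dict], event: Dict[str, object]) -> str:
    """Append one row and return the new chain head."""
    prev = log[-1]["record_hash"] if log else GENESIS
    row = {"event": event, "prev_hash": prev, "record_hash": record_hash(prev, event)}
    log.append(row)
    return row["record_hash"]


def verify_chain(log: List[dict], head: str) -> bool:
    """Recompute every link and compare the result with the persisted head."""
    prev = GENESIS
    for row in log:
        if row["prev_hash"] != prev:
            return False
        if row["record_hash"] != record_hash(prev, row["event"]):
            return False
        prev = row["record_hash"]
    return prev == head
```

Because the chain is unkeyed, this check only catches edits or deletions that did not recompute the hashes and the head, which matches the limitation stated in the entry.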

0.6.0 (2026-06-22)

  • Bounded concurrent snapshot capture under fault storms with a CaptureThreadPool and configurable capture pool / queue / overflow-policy parameters. The rosbag leg is serialized and the cooldown map is bounded, so a burst of simultaneous faults can no longer exhaust capture threads or grow memory without limit (#456)
  • Entity-scoped rosbag capture by default (#431)
  • Made rosbag capture enablement crash-safe (#430)
  • Contributors: @bburda, @mfaferek93

0.5.0 (2026-06-08)

  • ClearFault honors the new skip_correlation_auto_clear request flag so per-entity fault clears can opt out of cascade-clearing correlated symptom fault codes (#395)
  • Three-layer protection against unbounded snapshot growth (bounded buffers plus pruning)
  • Concurrency and lifetime hardening: serialize concurrent subscription creation in SnapshotCapture, join capture threads in the FaultManagerNode destructor, and defense-in-depth shutdown guards to prevent teardown crashes across distros
  • Aggregation security hardening and improved test coverage
  • Build: adopt the centralized ROS2MedkitWarnings and ROS2MedkitSanitizers cmake modules and bugprone / special-member-functions clang-tidy checks
  • Contributors: @bburda

0.4.0 (2026-03-20)

  • Per-entity confirmation and healing thresholds via manifest configuration (#269)
  • Default rosbag storage format changed from sqlite3 to mcap
  • Support for namespaced fault manager nodes - gateway resolves service/topic names when the fault manager runs in a custom namespace
  • Build: use shared cmake modules from ros2_medkit_cmake package
  • Build: centralized clang-tidy configuration
  • Contributors: @bburda

0.3.0 (2026-02-27)

  • Accurate HIGHEST_SEVERITY reassignment and stale fault_to_cluster_ cleanup (#221)
  • Clean up pending_clusters_ when fault cleared before min_count (#211)
  • Multi-distro CI support for ROS 2 Humble, Jazzy, and Rolling (#219, #242)
  • Contributors: @bburda, @eclipse0922

0.2.0 (2026-02-07)

  • Initial rosdistro release
  • Central fault management node with ROS 2 services:
    • ReportFault - report FAILED/PASSED events with debounce filtering
    • GetFaults - query faults with filtering by severity, status, correlation
    • ClearFault - clear/acknowledge faults
  • Debounce filtering with configurable thresholds:
    • FAILED events decrement counter, PASSED events increment
    • Configurable confirmation_threshold (default: -1, immediate)
    • Optional healing support (healing_enabled, healing_threshold)
    • Time-based auto-confirmation (auto_confirm_after_sec)
    • CRITICAL severity bypasses debounce
  • Dual storage backends:
    • SQLite persistent storage with WAL mode (default)
    • In-memory storage for testing/lightweight deployments
  • Snapshot capture on fault confirmation:
    • Topic data captured as JSON with configurable topic resolution
    • Priority: fault_specific > patterns > default_topics
    • Stored in SQLite with indexed fault_code lookup
    • Auto-cleanup on fault clear

File truncated at 100 lines see the full file

Launch files

No launch files found

Messages

No message files found.

Services

No service files found

Plugins

No plugins found.




Package Summary

Version 0.6.0
License Apache-2.0
Build type AMENT_CMAKE
Use RECOMMENDED

Repository Summary

Checkout URI https://github.com/selfpatch/ros2_medkit.git
VCS Type git
VCS Version main
Last Updated 2026-07-03
Dev Status DEVELOPED
Released RELEASED
Contributing Help Wanted (-)
Good First Issues (-)
Pull Requests to Review (-)

Package Description

Central fault manager node for ros2_medkit fault management system

Maintainers

  • bburda

Authors

No additional authors.

ros2_medkit_fault_manager

Central fault manager node for the ros2_medkit fault management system.

Overview

The FaultManager node provides a central point for fault aggregation and lifecycle management. It receives fault reports from multiple sources, aggregates them by fault_code, and provides query and clearing interfaces.

Quick Start

By default, faults are confirmed immediately when reported - no additional configuration needed.

# Start the fault manager
ros2 launch ros2_medkit_fault_manager fault_manager.launch.py

# Report a fault - it's immediately CONFIRMED
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
  "{fault_code: 'MOTOR_OVERHEAT', event_type: 0, severity: 2, description: 'Motor temp exceeded', source_id: '/motor_node'}"

# Query faults
ros2 service call /fault_manager/list_faults ros2_medkit_msgs/srv/ListFaults \
  "{statuses: ['CONFIRMED']}"

# Clear a fault (cascade-clears correlated symptoms by default)
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
  "{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: false}"

# Clear without touching correlated symptoms
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
  "{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: true}"

Note: The skip_correlation_auto_clear request field was added post-0.4.0. Adding a request field changes the service type hash, so callers built against ros2_medkit_msgs 0.4.0 or earlier must rebuild to keep talking to fault_manager.

Services

Service Type Description
~/report_fault ros2_medkit_msgs/srv/ReportFault Report a fault occurrence
~/list_faults ros2_medkit_msgs/srv/ListFaults Query faults with filtering
~/clear_fault ros2_medkit_msgs/srv/ClearFault Clear/acknowledge a fault
~/get_snapshots ros2_medkit_msgs/srv/GetSnapshots Get topic snapshots for a fault

Features

  • Multi-source aggregation: Same fault_code from different sources creates a single fault
  • Occurrence tracking: Counts total reports and tracks all reporting sources
  • Severity escalation: Fault severity is updated if a higher severity is reported
  • Persistent storage: SQLite backend ensures faults survive node restarts
  • Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
  • Snapshot capture: Captures topic data when faults are confirmed for debugging (snapshots are deleted when fault is cleared)
  • Freeze-frame retention: One compact JSON freeze-frame per fault code, retained across clear_fault (see below)
  • Fault correlation (optional): Root cause analysis with symptom muting and auto-clear
  • Tamper-evident audit log (optional): Append-only, hash-chained record of fault state transitions for verifiable history

Parameters

Parameter Type Default Description
storage_type string "sqlite" Storage backend: "sqlite" or "memory"
database_path string "/var/lib/ros2_medkit/faults.db" Path to SQLite database file
confirmation_threshold int -1 Counter value at which faults are confirmed
healing_enabled bool false Enable automatic healing via PASSED events
healing_threshold int 3 Counter value at which faults are healed
auto_confirm_after_sec double 0.0 Auto-confirm PREFAILED faults after timeout (0 = disabled)
entity_thresholds.config_file string "" Path to YAML file with per-entity debounce threshold overrides

Snapshot Parameters

Snapshots capture topic data when faults are confirmed for post-mortem debugging.

Each confirm also writes a freeze-frame: a single compact JSON object mapping every captured topic to its value at confirmation time, keyed by fault code. It differs from per-topic snapshots in two ways: snapshots are deleted when the fault is cleared, while the freeze-frame is retained across clear_fault (once the snapshots are gone, ~/get_fault serves the retained frame so the confirmed-state record stays available after acknowledgement); and a re-confirm that captures nothing (e.g. source publishers down) never overwrites an existing non-empty frame. A fault code with no configured capture set gets no freeze-frame row; a configured capture that samples nothing on its first run records an empty {} frame. Freeze-frame storage is bounded by the number of distinct fault codes (one row per code, replaced in place) and rows are never evicted.

Under a fault storm, captures are bounded by a worker pool (capture_pool_size) draining a bounded queue (capture_queue_depth); excess captures are dropped per capture_queue_full_policy and logged (throttled). The pool is shared and is created when snapshots or rosbag is enabled, so these parameters bound both. capture_pool_size parallelizes freeze-frame snapshot capture only - rosbag is single-writer and records one fault at a time regardless of pool size.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| snapshots.enabled | bool | true | Enable/disable snapshot capture |
| snapshots.background_capture | bool | false | Use background subscriptions (caches latest message) vs on-demand capture |
| snapshots.timeout_sec | double | 1.0 | Timeout waiting for topic message (on-demand mode) |
| snapshots.max_message_size | int | 65536 | Maximum message size in bytes (larger messages skipped) |
| snapshots.default_topics | string[] | [] | Topics to capture for all faults |
| snapshots.config_file | string | "" | Path to YAML config for fault_specific and patterns |
| snapshots.recapture_cooldown_sec | double | 60.0 | Min seconds between captures for the same fault code. |
| snapshots.max_per_fault | int | 10 | Max snapshots retained per fault. |
| snapshots.capture_pool_size | int | 2 | Max concurrent capture threads under a fault storm (>= 1). Parallelizes snapshot capture only; rosbag stays single-writer. |
| snapshots.capture_queue_depth | int | 16 | Max pending captures before the full-queue policy applies (>= 1). |
| snapshots.capture_queue_full_policy | string | reject_newest | Policy when the queue is full: reject_newest or drop_oldest. |
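
As a rough illustration of how snapshots.recapture_cooldown_sec and snapshots.max_per_fault could interact, the Python sketch below skips captures that arrive inside the cooldown window and keeps only the newest entries per fault code. The class name is hypothetical, and pruning the oldest snapshot first is an assumption; the README does not state which snapshots are discarded when the limit is reached.

```python
from collections import defaultdict, deque
from typing import Dict, List


class SnapshotRetention:
    """Illustrative cooldown and per-fault retention limit."""

    def __init__(self, cooldown_sec: float = 60.0, max_per_fault: int = 10) -> None:
        self.cooldown_sec = cooldown_sec
        self._last_capture: Dict[str, float] = {}
        # deque(maxlen=...) silently discards the oldest entry when full
        # (assumed pruning order).
        self._snapshots = defaultdict(lambda: deque(maxlen=max_per_fault))

    def try_capture(self, fault_code: str, now: float, snapshot: str) -> bool:
        """Store `snapshot` unless the fault code is still cooling down."""
        last = self._last_capture.get(fault_code)
        if last is not None and now - last < self.cooldown_sec:
            return False
        self._last_capture[fault_code] = now
        self._snapshots[fault_code].append(snapshot)
        return True

    def snapshots(self, fault_code: str) -> List[str]:
        return list(self._snapshots.get(fault_code, ()))
```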

Topic Resolution Priority:

  1. fault_specific - Exact match for fault code (configured via YAML config file)
  2. patterns - Regex pattern match (configured via YAML config file)
  3. default_topics - Fallback for all faults
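
The priority order above can be sketched as a lookup function. This is a hedged illustration: the function name and the dict shapes are hypothetical (the YAML schema is truncated in this page), and using a full regex match with the first matching pattern winning is an assumption about behaviour the README does not spell out.

```python
import re
from typing import Dict, List


def resolve_topics(
    fault_code: str,
    fault_specific: Dict[str, List[str]],
    patterns: Dict[str, List[str]],
    default_topics: List[str],
) -> List[str]:
    """Return the topics to capture for `fault_code`, highest priority first."""
    # 1. Exact match on the fault code.
    if fault_code in fault_specific:
        return fault_specific[fault_code]
    # 2. First regex pattern that matches the whole fault code (assumed semantics).
    for pattern, topics in patterns.items():
        if re.fullmatch(pattern, fault_code):
            return topics
    # 3. Fallback for all faults.
    return default_topics
```

For example, with a `MOTOR_.*` pattern configured, `MOTOR_STALL` would resolve through the pattern entry, while `MOTOR_OVERHEAT` would still use its own fault_specific entry if one exists.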

Example YAML config file (snapshots.yaml):

```yaml
fault_specific:
```

File truncated at 100 lines; see the full file.

CHANGELOG

Changelog for package ros2_medkit_fault_manager

Forthcoming

  • Optional append-only, hash-chained audit log of fault state transitions: each transition appends one immutable row (record_hash = sha256(prev_hash + canonical(event)) via OpenSSL EVP SHA-256) with a persisted chain head, a verify routine, a read API, and retention that seals a segment anchor before pruning. Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited. verify reads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering. BEFORE UPDATE / BEFORE DELETE triggers reject out-of-band edits as defense-in-depth. The chain is unkeyed and stored in a single writable file, so verify detects edits/deletions that did not recompute the chain (casual or accidental tampering); it is not a defence against an attacker who can rewrite the whole file. Off by default (#483)
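
The hash-chain formula in the entry above, record_hash = sha256(prev_hash + canonical(event)), can be sketched in a few lines of Python. This is an illustration under assumptions, not the package's implementation: the all-zero genesis hash, the sorted-key JSON canonical form, hex-string concatenation, and the list-of-dicts row layout are all hypothetical choices made for the sketch.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # hypothetical starting value for the first row's prev_hash


def canonical(event: dict) -> str:
    """Deterministic serialization (assumed: sorted keys, compact JSON)."""
    return json.dumps(event, sort_keys=True, separators=(",", ":"))


def record_hash(prev_hash: str, event: dict) -> str:
    """sha256(prev_hash + canonical(event)), hex-encoded."""
    data = (prev_hash + canonical(event)).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def append(chain: List[Dict], event: dict) -> str:
    """Append one row and return the new chain head."""
    prev = chain[-1]["record_hash"] if chain else GENESIS
    row = {"event": event, "prev_hash": prev, "record_hash": record_hash(prev, event)}
    chain.append(row)
    return row["record_hash"]


def verify(chain: List[Dict], head: str) -> bool:
    """Recompute every link and compare the result with the persisted head."""
    prev = GENESIS
    for row in chain:
        if row["prev_hash"] != prev:
            return False
        if row["record_hash"] != record_hash(prev, row["event"]):
            return False
        prev = row["record_hash"]
    return prev == head
```

Checking against a separately persisted head is what lets verification notice a deleted newest row. As the entry itself notes, the chain is unkeyed, so anyone who can rewrite every row and the head can produce a chain that verifies.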

0.6.0 (2026-06-22)

  • Bounded concurrent snapshot capture under fault storms with a CaptureThreadPool and configurable capture pool / queue / overflow-policy parameters. The rosbag leg is serialized and the cooldown map is bounded, so a burst of simultaneous faults can no longer exhaust capture threads or grow memory without limit (#456)
  • Entity-scoped rosbag capture by default (#431)
  • Made rosbag capture enablement crash-safe (#430)
  • Contributors: @bburda, @mfaferek93

0.5.0 (2026-06-08)

  • ClearFault honors the new skip_correlation_auto_clear request flag so per-entity fault clears can opt out of cascade-clearing correlated symptom fault codes (#395)
  • Three-layer protection against unbounded snapshot growth (bounded buffers plus pruning)
  • Concurrency and lifetime hardening: serialize concurrent subscription creation in SnapshotCapture, join capture threads in the FaultManagerNode destructor, and defense-in-depth shutdown guards to prevent teardown crashes across distros
  • Aggregation security hardening and improved test coverage
  • Build: adopt the centralized ROS2MedkitWarnings and ROS2MedkitSanitizers cmake modules and bugprone / special-member-functions clang-tidy checks
  • Contributors: @bburda

0.4.0 (2026-03-20)

  • Per-entity confirmation and healing thresholds via manifest configuration (#269)
  • Default rosbag storage format changed from sqlite3 to mcap
  • Support for namespaced fault manager nodes - gateway resolves service/topic names when the fault manager runs in a custom namespace
  • Build: use shared cmake modules from ros2_medkit_cmake package
  • Build: centralized clang-tidy configuration
  • Contributors: @bburda

0.3.0 (2026-02-27)

  • Accurate HIGHEST_SEVERITY reassignment and stale fault_to_cluster_ cleanup (#221)
  • Clean up pending_clusters_ when fault cleared before min_count (#211)
  • Multi-distro CI support for ROS 2 Humble, Jazzy, and Rolling (#219, #242)
  • Contributors: @bburda, @eclipse0922

0.2.0 (2026-02-07)

  • Initial rosdistro release
  • Central fault management node with ROS 2 services:
    • ReportFault - report FAILED/PASSED events with debounce filtering
    • GetFaults - query faults with filtering by severity, status, correlation
    • ClearFault - clear/acknowledge faults
  • Debounce filtering with configurable thresholds:
    • FAILED events decrement counter, PASSED events increment
    • Configurable confirmation_threshold (default: -1, immediate)
    • Optional healing support (healing_enabled, healing_threshold)
    • Time-based auto-confirmation (auto_confirm_after_sec)
    • CRITICAL severity bypasses debounce
  • Dual storage backends:
    • SQLite persistent storage with WAL mode (default)
    • In-memory storage for testing/lightweight deployments
  • Snapshot capture on fault confirmation:
    • Topic data captured as JSON with configurable topic resolution
    • Priority: fault_specific > patterns > default_topics
    • Stored in SQLite with indexed fault_code lookup
    • Auto-cleanup on fault clear

File truncated at 100 lines see the full file

Launch files

No launch files found

Messages

No message files found.

Services

No service files found

Plugins

No plugins found.




Package Summary

Version 0.6.0
License Apache-2.0
Build type AMENT_CMAKE
Use RECOMMENDED

Repository Summary

Checkout URI https://github.com/selfpatch/ros2_medkit.git
VCS Type git
VCS Version main
Last Updated 2026-07-03
Dev Status DEVELOPED
Released RELEASED
Contributing Help Wanted (-)
Good First Issues (-)
Pull Requests to Review (-)

Package Description

Central fault manager node for ros2_medkit fault management system

Maintainers

  • bburda

Authors

No additional authors.

ros2_medkit_fault_manager

Central fault manager node for the ros2_medkit fault management system.

Overview

The FaultManager node provides a central point for fault aggregation and lifecycle management. It receives fault reports from multiple sources, aggregates them by fault_code, and provides query and clearing interfaces.

Quick Start

By default, faults are confirmed immediately when reported - no additional configuration needed.

# Start the fault manager
ros2 launch ros2_medkit_fault_manager fault_manager.launch.py

# Report a fault - it's immediately CONFIRMED
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
  "{fault_code: 'MOTOR_OVERHEAT', event_type: 0, severity: 2, description: 'Motor temp exceeded', source_id: '/motor_node'}"

# Query faults
ros2 service call /fault_manager/list_faults ros2_medkit_msgs/srv/ListFaults \
  "{statuses: ['CONFIRMED']}"

# Clear a fault (cascade-clears correlated symptoms by default)
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
  "{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: false}"

# Clear without touching correlated symptoms
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
  "{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: true}"

Note: The skip_correlation_auto_clear request field was added post-0.4.0. Adding a request field changes the service type hash, so callers built against ros2_medkit_msgs 0.4.0 or earlier must rebuild to keep talking to fault_manager.

Services

Service Type Description
~/report_fault ros2_medkit_msgs/srv/ReportFault Report a fault occurrence
~/list_faults ros2_medkit_msgs/srv/ListFaults Query faults with filtering
~/clear_fault ros2_medkit_msgs/srv/ClearFault Clear/acknowledge a fault
~/get_snapshots ros2_medkit_msgs/srv/GetSnapshots Get topic snapshots for a fault

Features

  • Multi-source aggregation: Same fault_code from different sources creates a single fault
  • Occurrence tracking: Counts total reports and tracks all reporting sources
  • Severity escalation: Fault severity is updated if a higher severity is reported
  • Persistent storage: SQLite backend ensures faults survive node restarts
  • Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
  • Snapshot capture: Captures topic data when faults are confirmed for debugging (snapshots are deleted when fault is cleared)
  • Freeze-frame retention: One compact JSON freeze-frame per fault code, retained across clear_fault (see below)
  • Fault correlation (optional): Root cause analysis with symptom muting and auto-clear
  • Tamper-evident audit log (optional): Append-only, hash-chained record of fault state transitions for verifiable history

Parameters

Parameter Type Default Description
storage_type string "sqlite" Storage backend: "sqlite" or "memory"
database_path string "/var/lib/ros2_medkit/faults.db" Path to SQLite database file
confirmation_threshold int -1 Counter value at which faults are confirmed
healing_enabled bool false Enable automatic healing via PASSED events
healing_threshold int 3 Counter value at which faults are healed
auto_confirm_after_sec double 0.0 Auto-confirm PREFAILED faults after timeout (0 = disabled)
entity_thresholds.config_file string "" Path to YAML file with per-entity debounce threshold overrides

Snapshot Parameters

Snapshots capture topic data when faults are confirmed for post-mortem debugging.

Each confirm also writes a freeze-frame: a single compact JSON object mapping every captured topic to its value at confirmation time, keyed by fault code. It differs from per-topic snapshots in two ways: snapshots are deleted when the fault is cleared, while the freeze-frame is retained across clear_fault (once the snapshots are gone, ~/get_fault serves the retained frame so the confirmed-state record stays available after acknowledgement); and a re-confirm that captures nothing (e.g. source publishers down) never overwrites an existing non-empty frame. A fault code with no configured capture set gets no freeze-frame row; a configured capture that samples nothing on its first run records an empty {} frame. Freeze-frame storage is bounded by the number of distinct fault codes (one row per code, replaced in place) and rows are never evicted.
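The retention rules in the paragraph above can be restated as a small Python sketch. The function names and the dictionary-based storage are hypothetical; the real node stores frames in its database, and only the rules themselves come from the description above.

```python
def store_freeze_frame(frames: dict, fault_code: str, capture):
    """Apply the freeze-frame write rules for one confirmation.

    capture is None when no capture set is configured for the fault code,
    otherwise a dict mapping topic name to captured value (possibly empty).
    """
    if capture is None:
        return  # no configured capture set: no freeze-frame row
    if not capture and frames.get(fault_code):
        return  # an empty re-capture never overwrites a non-empty frame
    frames[fault_code] = dict(capture)  # one row per code, replaced in place


def clear_fault(snapshots: dict, frames: dict, fault_code: str):
    """Clearing deletes the per-topic snapshots but keeps the freeze-frame."""
    snapshots.pop(fault_code, None)


frames, snapshots = {}, {"MOTOR_OVERHEAT": ["snapshot-1"]}
store_freeze_frame(frames, "MOTOR_OVERHEAT", {"/motor/temp": 91.5})
store_freeze_frame(frames, "MOTOR_OVERHEAT", {})   # publishers down on re-confirm
clear_fault(snapshots, frames, "MOTOR_OVERHEAT")
```

At the end of this run the snapshots are gone while `frames["MOTOR_OVERHEAT"]` still holds the value captured at the first confirmation.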

Under a fault storm, captures are bounded by a worker pool (capture_pool_size) draining a bounded queue (capture_queue_depth); excess captures are dropped per capture_queue_full_policy and logged (throttled). The pool is shared and is created when snapshots or rosbag is enabled, so these parameters bound both. capture_pool_size parallelizes freeze-frame snapshot capture only - rosbag is single-writer and records one fault at a time regardless of pool size.

| Parameter | Type | Default | Description |
|---|---|---|---|
| snapshots.enabled | bool | true | Enable/disable snapshot capture |
| snapshots.background_capture | bool | false | Use background subscriptions (caches latest message) vs on-demand capture |
| snapshots.timeout_sec | double | 1.0 | Timeout waiting for topic message (on-demand mode) |
| snapshots.max_message_size | int | 65536 | Maximum message size in bytes (larger messages skipped) |
| snapshots.default_topics | string[] | [] | Topics to capture for all faults |
| snapshots.config_file | string | "" | Path to YAML config for fault_specific and patterns |
| snapshots.recapture_cooldown_sec | double | 60.0 | Min seconds between captures for the same fault code. |
| snapshots.max_per_fault | int | 10 | Max snapshots retained per fault. |
| snapshots.capture_pool_size | int | 2 | Max concurrent capture threads under a fault storm (>= 1). Parallelizes snapshot capture only; rosbag stays single-writer. |
| snapshots.capture_queue_depth | int | 16 | Max pending captures before the full-queue policy applies (>= 1). |
| snapshots.capture_queue_full_policy | string | reject_newest | Policy when the queue is full: reject_newest or drop_oldest. |
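The difference between the two full-queue policies is easiest to see in a toy model. The Python sketch below is an illustration only: the `CaptureQueue` class is hypothetical, it models just the pending queue (no worker threads), and it assumes that drop_oldest evicts the longest-waiting entry to make room for the new one.

```python
from collections import deque


class CaptureQueue:
    """Hypothetical bounded queue of pending captures."""

    POLICIES = ("reject_newest", "drop_oldest")

    def __init__(self, depth: int = 16, policy: str = "reject_newest"):
        if depth < 1:
            raise ValueError("depth must be >= 1")
        if policy not in self.POLICIES:
            raise ValueError(f"unknown policy: {policy}")
        self.depth = depth
        self.policy = policy
        self.pending = deque()
        self.dropped = []  # captures lost to the policy (the node logs these)

    def submit(self, fault_code: str) -> bool:
        """Return True if the capture for fault_code was queued."""
        if len(self.pending) < self.depth:
            self.pending.append(fault_code)
            return True
        if self.policy == "reject_newest":
            self.dropped.append(fault_code)  # the incoming capture is dropped
            return False
        self.dropped.append(self.pending.popleft())  # evict the oldest entry
        self.pending.append(fault_code)
        return True


reject = CaptureQueue(depth=2, policy="reject_newest")
reject_results = [reject.submit(code) for code in ("A", "B", "C")]

evict = CaptureQueue(depth=2, policy="drop_oldest")
evict_results = [evict.submit(code) for code in ("A", "B", "C")]
```

With a depth of 2, reject_newest keeps A and B and drops C, while drop_oldest keeps B and C and drops A. Which is preferable depends on whether the earliest or the latest fault in a storm is more useful for diagnosis.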

Topic Resolution Priority:

  1. fault_specific - Exact match for fault code (configured via YAML config file)
  2. patterns - Regex pattern match (configured via YAML config file)
  3. default_topics - Fallback for all faults
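The priority order above can be sketched as a lookup function in Python. The dictionary shapes are assumptions (the YAML schema is not shown here), as is the choice to require the regex to match the whole fault code and to take the first matching pattern.

```python
import re


def resolve_topics(fault_code, fault_specific, patterns, default_topics):
    """Pick the topics to capture for fault_code, highest priority first."""
    # 1. Exact match on the fault code.
    if fault_code in fault_specific:
        return fault_specific[fault_code]
    # 2. First regex pattern that matches (assumed: full match, config order).
    for pattern, topics in patterns.items():
        if re.fullmatch(pattern, fault_code):
            return topics
    # 3. Fallback for all faults.
    return default_topics


# Hypothetical configuration values for illustration.
fault_specific = {"MOTOR_OVERHEAT": ["/motor/temperature"]}
patterns = {r"MOTOR_.*": ["/motor/status"]}
default_topics = ["/diagnostics"]

exact = resolve_topics("MOTOR_OVERHEAT", fault_specific, patterns, default_topics)
by_pattern = resolve_topics("MOTOR_STALL", fault_specific, patterns, default_topics)
fallback = resolve_topics("BATTERY_LOW", fault_specific, patterns, default_topics)
```

`MOTOR_OVERHEAT` also matches the `MOTOR_.*` pattern, but the exact entry wins because it is checked first.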

Example YAML config file (snapshots.yaml):

```yaml
fault_specific:
```

File truncated at 100 lines; see the full file.

CHANGELOG

Changelog for package ros2_medkit_fault_manager

Forthcoming

  • Optional append-only, hash-chained audit log of fault state transitions: each transition appends one immutable row (record_hash = sha256(prev_hash + canonical(event)) via OpenSSL EVP SHA-256) with a persisted chain head, a verify routine, a read API, and retention that seals a segment anchor before pruning. Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited. verify reads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering. BEFORE UPDATE / BEFORE DELETE triggers reject out-of-band edits as defense-in-depth. The chain is unkeyed and stored in a single writable file, so verify detects edits/deletions that did not recompute the chain (casual or accidental tampering); it is not a defence against an attacker who can rewrite the whole file. Off by default (#483)
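The chaining rule record_hash = sha256(prev_hash + canonical(event)) can be sketched with Python's standard library. This is an illustration under assumptions, not the package's code (which uses OpenSSL EVP SHA-256): the sorted-key JSON canonical form, the all-zero genesis hash, and the row layout are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash for the first row


def canonical(event: dict) -> str:
    """Assumed canonical form: compact JSON with sorted keys."""
    return json.dumps(event, sort_keys=True, separators=(",", ":"))


def record_hash(prev_hash: str, event: dict) -> str:
    return hashlib.sha256((prev_hash + canonical(event)).encode("utf-8")).hexdigest()


def append(chain: list, event: dict) -> str:
    """Append one row and return the new chain head."""
    prev = chain[-1]["record_hash"] if chain else GENESIS
    row = {"event": event, "prev_hash": prev, "record_hash": record_hash(prev, event)}
    chain.append(row)
    return row["record_hash"]


def verify(chain: list, head) -> bool:
    """Recompute every link and compare the result with the stored head."""
    prev = GENESIS
    for row in chain:
        if row["prev_hash"] != prev or row["record_hash"] != record_hash(prev, row["event"]):
            return False
        prev = row["record_hash"]
    return prev == head


chain = []
append(chain, {"fault_code": "MOTOR_OVERHEAT", "to": "PREFAILED"})
head = append(chain, {"fault_code": "MOTOR_OVERHEAT", "to": "CONFIRMED"})
intact = verify(chain, head)
```

Because the chain is unkeyed, anyone who can rewrite every row and the head can produce a chain that verifies, which is the limitation the changelog entry calls out. The sketch detects only edits or deletions that leave the hashes or the head inconsistent.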

0.6.0 (2026-06-22)

  • Bounded concurrent snapshot capture under fault storms with a CaptureThreadPool and configurable capture pool / queue / overflow-policy parameters. The rosbag leg is serialized and the cooldown map is bounded, so a burst of simultaneous faults can no longer exhaust capture threads or grow memory without limit (#456)
  • Entity-scoped rosbag capture by default (#431)
  • Made rosbag capture enablement crash-safe (#430)
  • Contributors: @bburda, @mfaferek93

0.5.0 (2026-06-08)

  • ClearFault honors the new skip_correlation_auto_clear request flag so per-entity fault clears can opt out of cascade-clearing correlated symptom fault codes (#395)
  • Three-layer protection against unbounded snapshot growth (bounded buffers plus pruning)
  • Concurrency and lifetime hardening: serialize concurrent subscription creation in SnapshotCapture, join capture threads in the FaultManagerNode destructor, and defense-in-depth shutdown guards to prevent teardown crashes across distros
  • Aggregation security hardening and improved test coverage
  • Build: adopt the centralized ROS2MedkitWarnings and ROS2MedkitSanitizers cmake modules and bugprone / special-member-functions clang-tidy checks
  • Contributors: @bburda

0.4.0 (2026-03-20)

  • Per-entity confirmation and healing thresholds via manifest configuration (#269)
  • Default rosbag storage format changed from sqlite3 to mcap
  • Support for namespaced fault manager nodes - gateway resolves service/topic names when the fault manager runs in a custom namespace
  • Build: use shared cmake modules from ros2_medkit_cmake package
  • Build: centralized clang-tidy configuration
  • Contributors: @bburda

0.3.0 (2026-02-27)

  • Accurate HIGHEST_SEVERITY reassignment and stale fault_to_cluster_ cleanup (#221)
  • Clean up pending_clusters_ when fault cleared before min_count (#211)
  • Multi-distro CI support for ROS 2 Humble, Jazzy, and Rolling (#219, #242)
  • Contributors: @bburda, @eclipse0922

0.2.0 (2026-02-07)

  • Initial rosdistro release
  • Central fault management node with ROS 2 services:
    • ReportFault - report FAILED/PASSED events with debounce filtering
    • GetFaults - query faults with filtering by severity, status, correlation
    • ClearFault - clear/acknowledge faults
  • Debounce filtering with configurable thresholds:
    • FAILED events decrement counter, PASSED events increment
    • Configurable confirmation_threshold (default: -1, immediate)
    • Optional healing support (healing_enabled, healing_threshold)
    • Time-based auto-confirmation (auto_confirm_after_sec)
    • CRITICAL severity bypasses debounce
  • Dual storage backends:
    • SQLite persistent storage with WAL mode (default)
    • In-memory storage for testing/lightweight deployments
  • Snapshot capture on fault confirmation:
    • Topic data captured as JSON with configurable topic resolution
    • Priority: fault_specific > patterns > default_topics
    • Stored in SQLite with indexed fault_code lookup
    • Auto-cleanup on fault clear

File truncated at 100 lines see the full file
