Package Summary
| Version | 0.6.0 |
| License | Apache-2.0 |
| Build type | AMENT_CMAKE |
| Use | RECOMMENDED |
Repository Summary
| Checkout URI | https://github.com/selfpatch/ros2_medkit.git |
| VCS Type | git |
| VCS Version | main |
| Last Updated | 2026-07-03 |
| Dev Status | DEVELOPED |
| Released | RELEASED |
Package Description
Maintainers
- bburda
Authors
ros2_medkit_fault_manager
Central fault manager node for the ros2_medkit fault management system.
Overview
The FaultManager node provides a central point for fault aggregation and lifecycle management.
It receives fault reports from multiple sources, aggregates them by fault_code, and provides
query and clearing interfaces.
Quick Start
By default, faults are confirmed immediately when reported - no additional configuration needed.
```bash
# Start the fault manager
ros2 launch ros2_medkit_fault_manager fault_manager.launch.py

# Report a fault - it's immediately CONFIRMED
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
  "{fault_code: 'MOTOR_OVERHEAT', event_type: 0, severity: 2, description: 'Motor temp exceeded', source_id: '/motor_node'}"

# Query faults
ros2 service call /fault_manager/list_faults ros2_medkit_msgs/srv/ListFaults \
  "{statuses: ['CONFIRMED']}"

# Clear a fault (cascade-clears correlated symptoms by default)
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
  "{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: false}"

# Clear without touching correlated symptoms
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
  "{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: true}"
```
Note: The `skip_correlation_auto_clear` request field was added post-0.4.0. Adding a request field changes the service type hash, so callers built against `ros2_medkit_msgs` 0.4.0 or earlier must rebuild to keep talking to `fault_manager`.
Services
| Service | Type | Description |
|---|---|---|
| `~/report_fault` | `ros2_medkit_msgs/srv/ReportFault` | Report a fault occurrence |
| `~/list_faults` | `ros2_medkit_msgs/srv/ListFaults` | Query faults with filtering |
| `~/clear_fault` | `ros2_medkit_msgs/srv/ClearFault` | Clear/acknowledge a fault |
| `~/get_snapshots` | `ros2_medkit_msgs/srv/GetSnapshots` | Get topic snapshots for a fault |
Features
- Multi-source aggregation: Same `fault_code` from different sources creates a single fault
- Occurrence tracking: Counts total reports and tracks all reporting sources
- Severity escalation: Fault severity is updated if a higher severity is reported
- Persistent storage: SQLite backend ensures faults survive node restarts
- Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
- Snapshot capture: Captures topic data when faults are confirmed for debugging (snapshots are deleted when fault is cleared)
- Freeze-frame retention: One compact JSON freeze-frame per fault code, retained across `clear_fault` (see below)
- Fault correlation (optional): Root cause analysis with symptom muting and auto-clear
- Tamper-evident audit log (optional): Append-only, hash-chained record of fault state transitions for verifiable history
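The first three features (aggregation by `fault_code`, occurrence tracking, and severity escalation) can be illustrated with a small in-memory model. This is only a sketch: the `AggregatedFault` class, its field names, and the assumption that a larger integer means a more severe fault are hypothetical and are not taken from the package's storage schema.

```python
from dataclasses import dataclass, field


@dataclass
class AggregatedFault:
    """Hypothetical in-memory record; not the package's storage schema."""
    fault_code: str
    severity: int
    occurrence_count: int = 0
    reporting_sources: set = field(default_factory=set)


def report(faults: dict, fault_code: str, severity: int, source_id: str) -> AggregatedFault:
    """Merge one report into the single fault keyed by fault_code."""
    fault = faults.setdefault(fault_code, AggregatedFault(fault_code, severity))
    fault.occurrence_count += 1                     # occurrence tracking
    fault.reporting_sources.add(source_id)          # remember every reporting source
    fault.severity = max(fault.severity, severity)  # update only if higher (assumes larger = worse)
    return fault


faults = {}
report(faults, "MOTOR_OVERHEAT", 1, "/motor_node")
fault = report(faults, "MOTOR_OVERHEAT", 2, "/thermal_monitor")
print(len(faults), fault.occurrence_count, fault.severity, sorted(fault.reporting_sources))
# prints: 1 2 2 ['/motor_node', '/thermal_monitor']
```

Two reports from different sources end up in one fault entry, with the higher severity kept.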
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `storage_type` | string | `"sqlite"` | Storage backend: `"sqlite"` or `"memory"` |
| `database_path` | string | `"/var/lib/ros2_medkit/faults.db"` | Path to SQLite database file |
| `confirmation_threshold` | int | `-1` | Counter value at which faults are confirmed |
| `healing_enabled` | bool | `false` | Enable automatic healing via PASSED events |
| `healing_threshold` | int | `3` | Counter value at which faults are healed |
| `auto_confirm_after_sec` | double | `0.0` | Auto-confirm PREFAILED faults after timeout (0 = disabled) |
| `entity_thresholds.config_file` | string | `""` | Path to YAML file with per-entity debounce threshold overrides |
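To see how `confirmation_threshold`, `healing_enabled`, and `healing_threshold` interact, here is a minimal sketch of a counter-based debouncer. It assumes the counter starts at 0, that FAILED events decrement it and PASSED events increment it, and that critical reports bypass debounce. The `Debouncer` class, the `critical` flag, and the `HEALED` status name are hypothetical; the real node may clamp the counter or use different status names.

```python
from dataclasses import dataclass

FAILED, PASSED = "FAILED", "PASSED"  # event kinds; ReportFault's numeric codes are not assumed here


@dataclass
class Debouncer:
    confirmation_threshold: int = -1  # documented default
    healing_enabled: bool = False     # documented default
    healing_threshold: int = 3        # documented default
    counter: int = 0
    status: str = "PREFAILED"         # reported but not yet confirmed

    def on_event(self, event: str, critical: bool = False) -> str:
        if event == FAILED:
            self.counter -= 1
            if critical or self.counter <= self.confirmation_threshold:
                self.status = "CONFIRMED"
        elif event == PASSED:
            self.counter += 1
            if self.healing_enabled and self.counter >= self.healing_threshold:
                self.status = "HEALED"  # hypothetical status name
        return self.status


default = Debouncer()
print(default.on_event(FAILED))
# prints: CONFIRMED  (the first FAILED moves the counter to -1, the default threshold)

strict = Debouncer(confirmation_threshold=-3)
print([strict.on_event(FAILED) for _ in range(3)])
# prints: ['PREFAILED', 'PREFAILED', 'CONFIRMED']
```

Under these assumptions the default threshold of `-1` gives immediate confirmation, which matches the Quick Start behavior, while a threshold of `-3` requires three FAILED events.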
Snapshot Parameters
Snapshots capture topic data when faults are confirmed for post-mortem debugging.
Each confirm also writes a freeze-frame: a single compact JSON object mapping every captured topic to its value at confirmation time, keyed by fault code. It differs from per-topic snapshots in two ways: snapshots are deleted when the fault is cleared, while the freeze-frame is retained across clear_fault (once the snapshots are gone, ~/get_fault serves the retained frame so the confirmed-state record stays available after acknowledgement); and a re-confirm that captures nothing (e.g. source publishers down) never overwrites an existing non-empty frame. A fault code with no configured capture set gets no freeze-frame row; a configured capture that samples nothing on its first run records an empty {} frame. Freeze-frame storage is bounded by the number of distinct fault codes (one row per code, replaced in place) and rows are never evicted.
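The freeze-frame write rules above can be summarized in a short sketch. A plain dict stands in for the freeze-frame table, and the function name `store_freeze_frame`, the use of `None` for "no configured capture set", and the topic name are all hypothetical.

```python
from typing import Optional


def store_freeze_frame(frames: dict, fault_code: str, captured: Optional[dict]) -> None:
    """Apply the retention rules to a dict acting as the freeze-frame table.

    captured is None when no capture set is configured for the fault code,
    and a (possibly empty) topic -> value mapping otherwise.
    """
    if captured is None:
        return                     # no configured capture set: no row at all
    if not captured and frames.get(fault_code):
        return                     # an empty re-capture never overwrites a non-empty frame
    frames[fault_code] = captured  # one row per fault code, replaced in place


frames = {}
store_freeze_frame(frames, "MOTOR_OVERHEAT", {"/motor/temp": 97.5})  # first confirm
store_freeze_frame(frames, "MOTOR_OVERHEAT", {})   # re-confirm while publishers are down
store_freeze_frame(frames, "NEW_CODE", {})         # first run samples nothing
store_freeze_frame(frames, "UNCONFIGURED", None)   # nothing configured for this code
print(frames)
# prints: {'MOTOR_OVERHEAT': {'/motor/temp': 97.5}, 'NEW_CODE': {}}
```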
Under a fault storm, captures are bounded by a worker pool (capture_pool_size) draining a bounded queue (capture_queue_depth); excess captures are dropped per capture_queue_full_policy and logged (throttled). The pool is shared and is created when snapshots or rosbag is enabled, so these parameters bound both. capture_pool_size parallelizes freeze-frame snapshot capture only - rosbag is single-writer and records one fault at a time regardless of pool size.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `snapshots.enabled` | bool | `true` | Enable/disable snapshot capture |
| `snapshots.background_capture` | bool | `false` | Use background subscriptions (caches latest message) vs on-demand capture |
| `snapshots.timeout_sec` | double | `1.0` | Timeout waiting for topic message (on-demand mode) |
| `snapshots.max_message_size` | int | `65536` | Maximum message size in bytes (larger messages skipped) |
| `snapshots.default_topics` | string[] | `[]` | Topics to capture for all faults |
| `snapshots.config_file` | string | `""` | Path to YAML config for `fault_specific` and `patterns` |
| `snapshots.recapture_cooldown_sec` | double | `60.0` | Min seconds between captures for the same fault code. |
| `snapshots.max_per_fault` | int | `10` | Max snapshots retained per fault. |
| `snapshots.capture_pool_size` | int | `2` | Max concurrent capture threads under a fault storm (>= 1). Parallelizes snapshot capture only; rosbag stays single-writer. |
| `snapshots.capture_queue_depth` | int | `16` | Max pending captures before the full-queue policy applies (>= 1). |
| `snapshots.capture_queue_full_policy` | string | `reject_newest` | Policy when the queue is full: `reject_newest` or `drop_oldest`. |
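The difference between the two `capture_queue_full_policy` values is easiest to see on a tiny queue. The sketch below is a hypothetical stand-in (the `BoundedCaptureQueue` class and its methods are invented for illustration and are not the package's `CaptureThreadPool`); it models only the queue, not the worker threads.

```python
from collections import deque


class BoundedCaptureQueue:
    """Hypothetical stand-in for the bounded capture queue."""

    def __init__(self, depth: int = 16, full_policy: str = "reject_newest"):
        if depth < 1:
            raise ValueError("depth must be >= 1")
        if full_policy not in ("reject_newest", "drop_oldest"):
            raise ValueError(f"unknown policy: {full_policy}")
        self.depth = depth
        self.full_policy = full_policy
        self.pending = deque()
        self.dropped = []

    def submit(self, fault_code: str) -> bool:
        """Queue a capture request; return False if this request was dropped."""
        if len(self.pending) < self.depth:
            self.pending.append(fault_code)
            return True
        if self.full_policy == "reject_newest":
            self.dropped.append(fault_code)          # keep the queue, drop the newcomer
            return False
        self.dropped.append(self.pending.popleft())  # drop_oldest: evict the head
        self.pending.append(fault_code)
        return True


for policy in ("reject_newest", "drop_oldest"):
    queue = BoundedCaptureQueue(depth=2, full_policy=policy)
    for code in ("A", "B", "C"):
        queue.submit(code)
    print(policy, list(queue.pending), queue.dropped)
# prints: reject_newest ['A', 'B'] ['C']
# prints: drop_oldest ['B', 'C'] ['A']
```

With `reject_newest` the earliest faults in a storm keep their captures; with `drop_oldest` the most recent ones do.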
Topic Resolution Priority:
1. `fault_specific` - Exact match for fault code (configured via YAML config file)
2. `patterns` - Regex pattern match (configured via YAML config file)
3. `default_topics` - Fallback for all faults
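The priority order above could be sketched as a simple lookup. Several details here are assumptions made for illustration: the dict shapes used for `fault_specific` and `patterns`, full-string regex matching, "first matching pattern wins", and the example topic names. The package's actual YAML layout and matching rules may differ.

```python
import re


def resolve_topics(fault_code, fault_specific, patterns, default_topics):
    """Return the topics to capture for fault_code, highest priority first."""
    if fault_code in fault_specific:          # 1. exact match on the fault code
        return fault_specific[fault_code]
    for pattern, topics in patterns.items():  # 2. first regex that matches the whole code
        if re.fullmatch(pattern, fault_code):
            return topics
    return default_topics                     # 3. fallback for all faults


# Hypothetical configuration values
fault_specific = {"MOTOR_OVERHEAT": ["/motor/temperature"]}
patterns = {"MOTOR_.*": ["/motor/state"]}
default_topics = ["/diagnostics"]

for code in ("MOTOR_OVERHEAT", "MOTOR_STALL", "LIDAR_TIMEOUT"):
    print(code, resolve_topics(code, fault_specific, patterns, default_topics))
# prints: MOTOR_OVERHEAT ['/motor/temperature']
# prints: MOTOR_STALL ['/motor/state']
# prints: LIDAR_TIMEOUT ['/diagnostics']
```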
Example YAML config file (snapshots.yaml):
```yaml
fault_specific:
```
File truncated at 100 lines; see the full file.
Changelog for package ros2_medkit_fault_manager
Forthcoming
- Optional append-only, hash-chained audit log of fault state transitions: each transition appends one immutable row (`record_hash = sha256(prev_hash + canonical(event))` via OpenSSL EVP SHA-256) with a persisted chain head, a `verify` routine, a read API, and retention that seals a segment anchor before pruning. Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited. `verify` reads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering. `BEFORE UPDATE`/`BEFORE DELETE` triggers reject out-of-band edits as defense-in-depth. The chain is unkeyed and stored in a single writable file, so `verify` detects edits/deletions that did not recompute the chain (casual or accidental tampering); it is not a defence against an attacker who can rewrite the whole file. Off by default (#483)
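As a rough illustration of the `record_hash = sha256(prev_hash + canonical(event))` chaining, the sketch below builds and verifies a chain in memory. It uses Python's `hashlib` rather than OpenSSL EVP, and the all-zero genesis value, the sorted-key JSON canonical form, the string concatenation, and the row layout are assumptions; retention, segment anchors, and database triggers are not modeled.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash for the first row


def canonical(event: dict) -> str:
    """Assumed canonical form: compact JSON with sorted keys."""
    return json.dumps(event, sort_keys=True, separators=(",", ":"))


def link(prev_hash: str, event: dict) -> str:
    return hashlib.sha256((prev_hash + canonical(event)).encode()).hexdigest()


def append(chain: list, event: dict) -> str:
    """Append one row and return the new chain head."""
    prev_hash = chain[-1]["record_hash"] if chain else GENESIS
    chain.append({"event": event, "prev_hash": prev_hash,
                  "record_hash": link(prev_hash, event)})
    return chain[-1]["record_hash"]


def verify(chain: list, head: str) -> bool:
    """Recompute every link, then compare with the separately stored head."""
    prev_hash = GENESIS
    for row in chain:
        if row["prev_hash"] != prev_hash or row["record_hash"] != link(prev_hash, row["event"]):
            return False
        prev_hash = row["record_hash"]
    return prev_hash == head


chain = []
append(chain, {"fault_code": "MOTOR_OVERHEAT", "to": "PREFAILED"})
head = append(chain, {"fault_code": "MOTOR_OVERHEAT", "to": "CONFIRMED"})
print(verify(chain, head))       # prints: True
print(verify(chain[:-1], head))  # prints: False (newest row deleted, head unchanged)
chain[0]["event"]["to"] = "CLEARED"
print(verify(chain, head))       # prints: False (row edited without recomputing the chain)
```

Because the chain is unkeyed, anyone who can rewrite every row and the head can still produce a chain that passes this check, which is the limitation the changelog entry describes.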
0.6.0 (2026-06-22)
- Bounded concurrent snapshot capture under fault storms with a `CaptureThreadPool` and configurable capture pool / queue / overflow-policy parameters. The rosbag leg is serialized and the cooldown map is bounded, so a burst of simultaneous faults can no longer exhaust capture threads or grow memory without limit (#456)
- Entity-scoped rosbag capture by default (#431)
- Made rosbag capture enablement crash-safe (#430)
- Contributors: @bburda, @mfaferek93
0.5.0 (2026-06-08)
- `ClearFault` honors the new `skip_correlation_auto_clear` request flag so per-entity fault clears can opt out of cascade-clearing correlated symptom fault codes (#395)
- Three-layer protection against unbounded snapshot growth (bounded buffers plus pruning)
- Concurrency and lifetime hardening: serialize concurrent subscription creation in `SnapshotCapture`, join capture threads in the `FaultManagerNode` destructor, and defense-in-depth shutdown guards to prevent teardown crashes across distros
- Aggregation security hardening and improved test coverage
- Build: adopt the centralized `ROS2MedkitWarnings` and `ROS2MedkitSanitizers` cmake modules and `bugprone/special-member-functions` clang-tidy checks
- Contributors: @bburda
0.4.0 (2026-03-20)
- Per-entity confirmation and healing thresholds via manifest configuration (#269)
- Default rosbag storage format changed from `sqlite3` to `mcap`
- Support for namespaced fault manager nodes - gateway resolves service/topic names when the fault manager runs in a custom namespace
- Build: use shared cmake modules from `ros2_medkit_cmake` package
- Build: centralized clang-tidy configuration
- Contributors: @bburda
0.3.0 (2026-02-27)
- Accurate HIGHEST_SEVERITY reassignment and stale `fault_to_cluster_` cleanup (#221)
- Clean up `pending_clusters_` when fault cleared before `min_count` (#211)
- Multi-distro CI support for ROS 2 Humble, Jazzy, and Rolling (#219, #242)
- Contributors: @bburda, @eclipse0922
0.2.0 (2026-02-07)
- Initial rosdistro release
- Central fault management node with ROS 2 services:
  - ReportFault - report FAILED/PASSED events with debounce filtering
  - GetFaults - query faults with filtering by severity, status, correlation
  - ClearFault - clear/acknowledge faults
- Debounce filtering with configurable thresholds:
  - FAILED events decrement counter, PASSED events increment
  - Configurable confirmation_threshold (default: -1, immediate)
  - Optional healing support (healing_enabled, healing_threshold)
  - Time-based auto-confirmation (auto_confirm_after_sec)
  - CRITICAL severity bypasses debounce
- Dual storage backends:
  - SQLite persistent storage with WAL mode (default)
  - In-memory storage for testing/lightweight deployments
- Snapshot capture on fault confirmation:
  - Topic data captured as JSON with configurable topic resolution
  - Priority: fault_specific > patterns > default_topics
  - Stored in SQLite with indexed fault_code lookup
  - Auto-cleanup on fault clear
File truncated at 100 lines see the full file
Package Summary
| Version | 0.6.0 |
| License | Apache-2.0 |
| Build type | AMENT_CMAKE |
| Use | RECOMMENDED |
Repository Summary
| Checkout URI | https://github.com/selfpatch/ros2_medkit.git |
| VCS Type | git |
| VCS Version | main |
| Last Updated | 2026-07-03 |
| Dev Status | DEVELOPED |
| Released | RELEASED |
| Contributing |
Help Wanted (-)
Good First Issues (-) Pull Requests to Review (-) |
Package Description
Maintainers
- bburda
Authors
ros2_medkit_fault_manager
Central fault manager node for the ros2_medkit fault management system.
Overview
The FaultManager node provides a central point for fault aggregation and lifecycle management.
It receives fault reports from multiple sources, aggregates them by fault_code, and provides
query and clearing interfaces.
Quick Start
By default, faults are confirmed immediately when reported - no additional configuration needed.
# Start the fault manager
ros2 launch ros2_medkit_fault_manager fault_manager.launch.py
# Report a fault - it's immediately CONFIRMED
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
"{fault_code: 'MOTOR_OVERHEAT', event_type: 0, severity: 2, description: 'Motor temp exceeded', source_id: '/motor_node'}"
# Query faults
ros2 service call /fault_manager/list_faults ros2_medkit_msgs/srv/ListFaults \
"{statuses: ['CONFIRMED']}"
# Clear a fault (cascade-clears correlated symptoms by default)
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
"{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: false}"
# Clear without touching correlated symptoms
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
"{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: true}"
Note: The
skip_correlation_auto_clearrequest field was added post-0.4.0. Adding a request field changes the service type hash, so callers built againstros2_medkit_msgs0.4.0 or earlier must rebuild to keep talking tofault_manager.
Services
| Service | Type | Description |
|---|---|---|
~/report_fault |
ros2_medkit_msgs/srv/ReportFault |
Report a fault occurrence |
~/list_faults |
ros2_medkit_msgs/srv/ListFaults |
Query faults with filtering |
~/clear_fault |
ros2_medkit_msgs/srv/ClearFault |
Clear/acknowledge a fault |
~/get_snapshots |
ros2_medkit_msgs/srv/GetSnapshots |
Get topic snapshots for a fault |
Features
-
Multi-source aggregation: Same
fault_codefrom different sources creates a single fault - Occurrence tracking: Counts total reports and tracks all reporting sources
- Severity escalation: Fault severity is updated if a higher severity is reported
- Persistent storage: SQLite backend ensures faults survive node restarts
- Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
- Snapshot capture: Captures topic data when faults are confirmed for debugging (snapshots are deleted when fault is cleared)
-
Freeze-frame retention: One compact JSON freeze-frame per fault code, retained across
clear_fault(see below) - Fault correlation (optional): Root cause analysis with symptom muting and auto-clear
- Tamper-evident audit log (optional): Append-only, hash-chained record of fault state transitions for verifiable history
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
storage_type |
string | "sqlite" |
Storage backend: "sqlite" or "memory"
|
database_path |
string | "/var/lib/ros2_medkit/faults.db" |
Path to SQLite database file |
confirmation_threshold |
int | -1 |
Counter value at which faults are confirmed |
healing_enabled |
bool | false |
Enable automatic healing via PASSED events |
healing_threshold |
int | 3 |
Counter value at which faults are healed |
auto_confirm_after_sec |
double | 0.0 |
Auto-confirm PREFAILED faults after timeout (0 = disabled) |
entity_thresholds.config_file |
string | "" |
Path to YAML file with per-entity debounce threshold overrides |
Snapshot Parameters
Snapshots capture topic data when faults are confirmed for post-mortem debugging.
Each confirm also writes a freeze-frame: a single compact JSON object mapping every captured topic to its value at confirmation time, keyed by fault code. It differs from per-topic snapshots in two ways: snapshots are deleted when the fault is cleared, while the freeze-frame is retained across clear_fault (once the snapshots are gone, ~/get_fault serves the retained frame so the confirmed-state record stays available after acknowledgement); and a re-confirm that captures nothing (e.g. source publishers down) never overwrites an existing non-empty frame. A fault code with no configured capture set gets no freeze-frame row; a configured capture that samples nothing on its first run records an empty {} frame. Freeze-frame storage is bounded by the number of distinct fault codes (one row per code, replaced in place) and rows are never evicted.
Under a fault storm, captures are bounded by a worker pool (capture_pool_size) draining a bounded queue (capture_queue_depth); excess captures are dropped per capture_queue_full_policy and logged (throttled). The pool is shared and is created when snapshots or rosbag is enabled, so these parameters bound both. capture_pool_size parallelizes freeze-frame snapshot capture only - rosbag is single-writer and records one fault at a time regardless of pool size.
| Parameter | Type | Default | Description |
|---|---|---|---|
snapshots.enabled |
bool | true |
Enable/disable snapshot capture |
snapshots.background_capture |
bool | false |
Use background subscriptions (caches latest message) vs on-demand capture |
snapshots.timeout_sec |
double | 1.0 |
Timeout waiting for topic message (on-demand mode) |
snapshots.max_message_size |
int | 65536 |
Maximum message size in bytes (larger messages skipped) |
snapshots.default_topics |
string[] | [] |
Topics to capture for all faults |
snapshots.config_file |
string | "" |
Path to YAML config for fault_specific and patterns
|
snapshots.recapture_cooldown_sec |
double | 60.0 |
Min seconds between captures for the same fault code. |
snapshots.max_per_fault |
int | 10 |
Max snapshots retained per fault. |
snapshots.capture_pool_size |
int | 2 |
Max concurrent capture threads under a fault storm (>= 1). Parallelizes snapshot capture only; rosbag stays single-writer. |
snapshots.capture_queue_depth |
int | 16 |
Max pending captures before the full-queue policy applies (>= 1). |
snapshots.capture_queue_full_policy |
string | reject_newest |
Policy when the queue is full: reject_newest or drop_oldest. |
Topic Resolution Priority:
-
fault_specific- Exact match for fault code (configured via YAML config file) -
patterns- Regex pattern match (configured via YAML config file) -
default_topics- Fallback for all faults
Example YAML config file (snapshots.yaml):
```yaml fault_specific:
File truncated at 100 lines see the full file
Changelog for package ros2_medkit_fault_manager
Forthcoming
- Optional append-only, hash-chained audit log of fault state
transitions: each transition appends one immutable row
(
record_hash = sha256(prev_hash + canonical(event))via OpenSSL EVP SHA-256) with a persisted chain head, averifyroutine, a read API, and retention that seals a segment anchor before pruning. Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited.verifyreads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering.BEFORE UPDATE/BEFORE DELETEtriggers reject out-of-band edits as defense-in-depth. The chain is unkeyed and stored in a single writable file, soverifydetects edits/deletions that did not recompute the chain (casual or accidental tampering); it is not a defence against an attacker who can rewrite the whole file. Off by default (#483)
0.6.0 (2026-06-22)
- Bounded concurrent snapshot capture under fault storms with a
CaptureThreadPooland configurable capture pool / queue / overflow-policy parameters. The rosbag leg is serialized and the cooldown map is bounded, so a burst of simultaneous faults can no longer exhaust capture threads or grow memory without limit (#456) - Entity-scoped rosbag capture by default (#431)
- Made rosbag capture enablement crash-safe (#430)
- Contributors: \@bburda, \@mfaferek93
0.5.0 (2026-06-08)
-
ClearFaulthonors the newskip_correlation_auto_clearrequest flag so per-entity fault clears can opt out of cascade-clearing correlated symptom fault codes (#395) - Three-layer protection against unbounded snapshot growth (bounded buffers plus pruning)
- Concurrency and lifetime hardening: serialize concurrent
subscription creation in
SnapshotCapture, join capture threads in theFaultManagerNodedestructor, and defense-in-depth shutdown guards to prevent teardown crashes across distros - Aggregation security hardening and improved test coverage
- Build: adopt the centralized
ROS2MedkitWarningsandROS2MedkitSanitizerscmake modules andbugprone/special-member-functionsclang-tidy checks - Contributors: \@bburda
0.4.0 (2026-03-20)
- Per-entity confirmation and healing thresholds via manifest configuration (#269)
- Default rosbag storage format changed from
sqlite3tomcap - Support for namespaced fault manager nodes - gateway resolves service/topic names when the fault manager runs in a custom namespace
- Build: use shared cmake modules from
ros2_medkit_cmakepackage - Build: centralized clang-tidy configuration
- Contributors: \@bburda
0.3.0 (2026-02-27)
- Accurate HIGHEST_SEVERITY reassignment and stale
fault_to_cluster_cleanup (#221) - Clean up
pending_clusters_when fault cleared beforemin_count(#211) - Multi-distro CI support for ROS 2 Humble, Jazzy, and Rolling (#219, #242)
- Contributors: \@bburda, \@eclipse0922
0.2.0 (2026-02-07)
- Initial rosdistro release
- Central fault management node with ROS 2 services:
- ReportFault - report FAILED/PASSED events with debounce filtering
- GetFaults - query faults with filtering by severity, status, correlation
- ClearFault - clear/acknowledge faults
- Debounce filtering with configurable thresholds:
- FAILED events decrement counter, PASSED events increment
- Configurable confirmation_threshold (default: -1, immediate)
- Optional healing support (healing_enabled, healing_threshold)
- Time-based auto-confirmation (auto_confirm_after_sec)
- CRITICAL severity bypasses debounce
- Dual storage backends:
- SQLite persistent storage with WAL mode (default)
- In-memory storage for testing/lightweight deployments
- Snapshot capture on fault confirmation:
- Topic data captured as JSON with configurable topic resolution
- Priority: fault_specific > patterns > default_topics
- Stored in SQLite with indexed fault_code lookup
- Auto-cleanup on fault clear
File truncated at 100 lines see the full file
Package Dependencies
System Dependencies
Launch files
Messages
Services
Plugins
Recent questions tagged ros2_medkit_fault_manager at Robotics Stack Exchange
Package Summary
| Version | 0.6.0 |
| License | Apache-2.0 |
| Build type | AMENT_CMAKE |
| Use | RECOMMENDED |
Repository Summary
| Checkout URI | https://github.com/selfpatch/ros2_medkit.git |
| VCS Type | git |
| VCS Version | main |
| Last Updated | 2026-07-03 |
| Dev Status | DEVELOPED |
| Released | RELEASED |
| Contributing |
Help Wanted (-)
Good First Issues (-) Pull Requests to Review (-) |
Package Description
Maintainers
- bburda
Authors
ros2_medkit_fault_manager
Central fault manager node for the ros2_medkit fault management system.
Overview
The FaultManager node provides a central point for fault aggregation and lifecycle management.
It receives fault reports from multiple sources, aggregates them by fault_code, and provides
query and clearing interfaces.
Quick Start
By default, faults are confirmed immediately when reported - no additional configuration needed.
# Start the fault manager
ros2 launch ros2_medkit_fault_manager fault_manager.launch.py
# Report a fault - it's immediately CONFIRMED
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
"{fault_code: 'MOTOR_OVERHEAT', event_type: 0, severity: 2, description: 'Motor temp exceeded', source_id: '/motor_node'}"
# Query faults
ros2 service call /fault_manager/list_faults ros2_medkit_msgs/srv/ListFaults \
"{statuses: ['CONFIRMED']}"
# Clear a fault (cascade-clears correlated symptoms by default)
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
"{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: false}"
# Clear without touching correlated symptoms
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
"{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: true}"
Note: The
skip_correlation_auto_clearrequest field was added post-0.4.0. Adding a request field changes the service type hash, so callers built againstros2_medkit_msgs0.4.0 or earlier must rebuild to keep talking tofault_manager.
Services
| Service | Type | Description |
|---|---|---|
~/report_fault |
ros2_medkit_msgs/srv/ReportFault |
Report a fault occurrence |
~/list_faults |
ros2_medkit_msgs/srv/ListFaults |
Query faults with filtering |
~/clear_fault |
ros2_medkit_msgs/srv/ClearFault |
Clear/acknowledge a fault |
~/get_snapshots |
ros2_medkit_msgs/srv/GetSnapshots |
Get topic snapshots for a fault |
Features
-
Multi-source aggregation: Same
fault_codefrom different sources creates a single fault - Occurrence tracking: Counts total reports and tracks all reporting sources
- Severity escalation: Fault severity is updated if a higher severity is reported
- Persistent storage: SQLite backend ensures faults survive node restarts
- Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
- Snapshot capture: Captures topic data when faults are confirmed for debugging (snapshots are deleted when fault is cleared)
-
Freeze-frame retention: One compact JSON freeze-frame per fault code, retained across
clear_fault(see below) - Fault correlation (optional): Root cause analysis with symptom muting and auto-clear
- Tamper-evident audit log (optional): Append-only, hash-chained record of fault state transitions for verifiable history
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
storage_type |
string | "sqlite" |
Storage backend: "sqlite" or "memory"
|
database_path |
string | "/var/lib/ros2_medkit/faults.db" |
Path to SQLite database file |
confirmation_threshold |
int | -1 |
Counter value at which faults are confirmed |
healing_enabled |
bool | false |
Enable automatic healing via PASSED events |
healing_threshold |
int | 3 |
Counter value at which faults are healed |
auto_confirm_after_sec |
double | 0.0 |
Auto-confirm PREFAILED faults after timeout (0 = disabled) |
entity_thresholds.config_file |
string | "" |
Path to YAML file with per-entity debounce threshold overrides |
Snapshot Parameters
Snapshots capture topic data when faults are confirmed for post-mortem debugging.
Each confirm also writes a freeze-frame: a single compact JSON object mapping every captured topic to its value at confirmation time, keyed by fault code. It differs from per-topic snapshots in two ways: snapshots are deleted when the fault is cleared, while the freeze-frame is retained across clear_fault (once the snapshots are gone, ~/get_fault serves the retained frame so the confirmed-state record stays available after acknowledgement); and a re-confirm that captures nothing (e.g. source publishers down) never overwrites an existing non-empty frame. A fault code with no configured capture set gets no freeze-frame row; a configured capture that samples nothing on its first run records an empty {} frame. Freeze-frame storage is bounded by the number of distinct fault codes (one row per code, replaced in place) and rows are never evicted.
Under a fault storm, captures are bounded by a worker pool (capture_pool_size) draining a bounded queue (capture_queue_depth); excess captures are dropped per capture_queue_full_policy and logged (throttled). The pool is shared and is created when snapshots or rosbag is enabled, so these parameters bound both. capture_pool_size parallelizes freeze-frame snapshot capture only - rosbag is single-writer and records one fault at a time regardless of pool size.
| Parameter | Type | Default | Description |
|---|---|---|---|
snapshots.enabled |
bool | true |
Enable/disable snapshot capture |
snapshots.background_capture |
bool | false |
Use background subscriptions (caches latest message) vs on-demand capture |
snapshots.timeout_sec |
double | 1.0 |
Timeout waiting for topic message (on-demand mode) |
snapshots.max_message_size |
int | 65536 |
Maximum message size in bytes (larger messages skipped) |
snapshots.default_topics |
string[] | [] |
Topics to capture for all faults |
snapshots.config_file |
string | "" |
Path to YAML config for fault_specific and patterns
|
snapshots.recapture_cooldown_sec |
double | 60.0 |
Min seconds between captures for the same fault code. |
snapshots.max_per_fault |
int | 10 |
Max snapshots retained per fault. |
snapshots.capture_pool_size |
int | 2 |
Max concurrent capture threads under a fault storm (>= 1). Parallelizes snapshot capture only; rosbag stays single-writer. |
snapshots.capture_queue_depth |
int | 16 |
Max pending captures before the full-queue policy applies (>= 1). |
snapshots.capture_queue_full_policy |
string | reject_newest |
Policy when the queue is full: reject_newest or drop_oldest. |
Topic Resolution Priority:
-
fault_specific- Exact match for fault code (configured via YAML config file) -
patterns- Regex pattern match (configured via YAML config file) -
default_topics- Fallback for all faults
Example YAML config file (snapshots.yaml):
```yaml fault_specific:
File truncated at 100 lines see the full file
Changelog for package ros2_medkit_fault_manager
Forthcoming
- Optional append-only, hash-chained audit log of fault state
transitions: each transition appends one immutable row
(
record_hash = sha256(prev_hash + canonical(event))via OpenSSL EVP SHA-256) with a persisted chain head, averifyroutine, a read API, and retention that seals a segment anchor before pruning. Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited.verifyreads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering.BEFORE UPDATE/BEFORE DELETEtriggers reject out-of-band edits as defense-in-depth. The chain is unkeyed and stored in a single writable file, soverifydetects edits/deletions that did not recompute the chain (casual or accidental tampering); it is not a defence against an attacker who can rewrite the whole file. Off by default (#483)
0.6.0 (2026-06-22)
- Bounded concurrent snapshot capture under fault storms with a
CaptureThreadPooland configurable capture pool / queue / overflow-policy parameters. The rosbag leg is serialized and the cooldown map is bounded, so a burst of simultaneous faults can no longer exhaust capture threads or grow memory without limit (#456) - Entity-scoped rosbag capture by default (#431)
- Made rosbag capture enablement crash-safe (#430)
- Contributors: \@bburda, \@mfaferek93
0.5.0 (2026-06-08)
-
ClearFaulthonors the newskip_correlation_auto_clearrequest flag so per-entity fault clears can opt out of cascade-clearing correlated symptom fault codes (#395) - Three-layer protection against unbounded snapshot growth (bounded buffers plus pruning)
- Concurrency and lifetime hardening: serialize concurrent
subscription creation in
SnapshotCapture, join capture threads in theFaultManagerNodedestructor, and defense-in-depth shutdown guards to prevent teardown crashes across distros - Aggregation security hardening and improved test coverage
- Build: adopt the centralized
ROS2MedkitWarningsandROS2MedkitSanitizerscmake modules andbugprone/special-member-functionsclang-tidy checks - Contributors: \@bburda
0.4.0 (2026-03-20)
- Per-entity confirmation and healing thresholds via manifest configuration (#269)
- Default rosbag storage format changed from
sqlite3tomcap - Support for namespaced fault manager nodes - gateway resolves service/topic names when the fault manager runs in a custom namespace
- Build: use shared cmake modules from
ros2_medkit_cmakepackage - Build: centralized clang-tidy configuration
- Contributors: \@bburda
0.3.0 (2026-02-27)
- Accurate HIGHEST_SEVERITY reassignment and stale
fault_to_cluster_cleanup (#221) - Clean up
pending_clusters_when fault cleared beforemin_count(#211) - Multi-distro CI support for ROS 2 Humble, Jazzy, and Rolling (#219, #242)
- Contributors: \@bburda, \@eclipse0922
0.2.0 (2026-02-07)
- Initial rosdistro release
- Central fault management node with ROS 2 services:
  - ReportFault - report FAILED/PASSED events with debounce filtering
  - GetFaults - query faults with filtering by severity, status, correlation
  - ClearFault - clear/acknowledge faults
- Debounce filtering with configurable thresholds:
  - FAILED events decrement counter, PASSED events increment
  - Configurable confirmation_threshold (default: -1, immediate)
  - Optional healing support (healing_enabled, healing_threshold)
  - Time-based auto-confirmation (auto_confirm_after_sec)
  - CRITICAL severity bypasses debounce
- Dual storage backends:
  - SQLite persistent storage with WAL mode (default)
  - In-memory storage for testing/lightweight deployments
- Snapshot capture on fault confirmation:
  - Topic data captured as JSON with configurable topic resolution
  - Priority: fault_specific > patterns > default_topics
  - Stored in SQLite with indexed fault_code lookup
  - Auto-cleanup on fault clear
Services
| Service | Type | Description |
|---|---|---|
| `~/report_fault` | `ros2_medkit_msgs/srv/ReportFault` | Report a fault occurrence |
| `~/list_faults` | `ros2_medkit_msgs/srv/ListFaults` | Query faults with filtering |
| `~/clear_fault` | `ros2_medkit_msgs/srv/ClearFault` | Clear/acknowledge a fault |
| `~/get_snapshots` | `ros2_medkit_msgs/srv/GetSnapshots` | Get topic snapshots for a fault |
Features
- Multi-source aggregation: Same `fault_code` from different sources creates a single fault
- Occurrence tracking: Counts total reports and tracks all reporting sources
- Severity escalation: Fault severity is updated if a higher severity is reported
- Persistent storage: SQLite backend ensures faults survive node restarts
- Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
- Snapshot capture: Captures topic data when faults are confirmed for debugging (snapshots are deleted when the fault is cleared)
- Freeze-frame retention: One compact JSON freeze-frame per fault code, retained across `clear_fault` (see below)
- Fault correlation (optional): Root cause analysis with symptom muting and auto-clear
- Tamper-evident audit log (optional): Append-only, hash-chained record of fault state transitions for verifiable history
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `storage_type` | string | `"sqlite"` | Storage backend: `"sqlite"` or `"memory"` |
| `database_path` | string | `"/var/lib/ros2_medkit/faults.db"` | Path to SQLite database file |
| `confirmation_threshold` | int | `-1` | Counter value at which faults are confirmed |
| `healing_enabled` | bool | `false` | Enable automatic healing via PASSED events |
| `healing_threshold` | int | `3` | Counter value at which faults are healed |
| `auto_confirm_after_sec` | double | `0.0` | Auto-confirm PREFAILED faults after timeout (0 = disabled) |
| `entity_thresholds.config_file` | string | `""` | Path to YAML file with per-entity debounce threshold overrides |
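The debounce parameters above can be pictured as a signed counter per fault code. The following is a minimal Python sketch, not the node's actual implementation: it assumes the counter starts at 0, that a fault is confirmed once the counter is at or below `confirmation_threshold`, and that it heals once the counter is at or above `healing_threshold`. The class name, the initial `PREFAILED` status, and the `HEALED` status name are hypothetical; the CRITICAL bypass and `auto_confirm_after_sec` are left out.

```python
from dataclasses import dataclass


@dataclass
class DebounceCounter:
    """Illustrative counter-based debounce for a single fault code."""

    confirmation_threshold: int = -1  # documented default: first FAILED confirms
    healing_enabled: bool = False     # documented default
    healing_threshold: int = 3        # documented default
    counter: int = 0                  # assumption: counter starts at 0
    status: str = "PREFAILED"         # assumption: initial status name

    def on_failed(self) -> str:
        # FAILED events decrement the counter.
        self.counter -= 1
        if self.counter <= self.confirmation_threshold:
            self.status = "CONFIRMED"
        return self.status

    def on_passed(self) -> str:
        # PASSED events increment the counter.
        self.counter += 1
        if self.healing_enabled and self.counter >= self.healing_threshold:
            self.status = "HEALED"  # assumption: name of the healed status
        return self.status


# Default configuration: the first FAILED report confirms the fault.
immediate = DebounceCounter()
first_report = immediate.on_failed()

# With a threshold of -3, three FAILED reports are needed.
debounced = DebounceCounter(confirmation_threshold=-3)
history = [debounced.on_failed() for _ in range(3)]
```

With the default `confirmation_threshold` of -1 a single FAILED event is enough, which matches the "confirmed immediately" behaviour described in the Quick Start.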
Snapshot Parameters
Snapshots capture topic data when faults are confirmed for post-mortem debugging.
Each confirm also writes a freeze-frame: a single compact JSON object mapping every captured topic to its value at confirmation time, keyed by fault code. It differs from per-topic snapshots in two ways: snapshots are deleted when the fault is cleared, while the freeze-frame is retained across clear_fault (once the snapshots are gone, ~/get_fault serves the retained frame so the confirmed-state record stays available after acknowledgement); and a re-confirm that captures nothing (e.g. source publishers down) never overwrites an existing non-empty frame. A fault code with no configured capture set gets no freeze-frame row; a configured capture that samples nothing on its first run records an empty {} frame. Freeze-frame storage is bounded by the number of distinct fault codes (one row per code, replaced in place) and rows are never evicted.
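The freeze-frame write rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the in-memory `frames` dict standing in for the storage table, and the `/motor/temp` topic are all hypothetical, and `captured=None` is used here to mean "no capture set is configured for this fault code".

```python
from typing import Dict, Optional

Frame = Dict[str, object]


def store_freeze_frame(
    frames: Dict[str, Frame],
    fault_code: str,
    captured: Optional[Frame],
) -> None:
    """Apply the freeze-frame write rule when a fault is confirmed."""
    if captured is None:
        # No capture set configured for this code: no row is written.
        return
    existing = frames.get(fault_code)
    if not captured and existing:
        # An empty re-capture never overwrites an existing non-empty frame.
        return
    # One row per fault code, replaced in place (an empty first capture stores {}).
    frames[fault_code] = dict(captured)


frames: Dict[str, Frame] = {}
store_freeze_frame(frames, "MOTOR_OVERHEAT", {"/motor/temp": 97.5})
store_freeze_frame(frames, "MOTOR_OVERHEAT", {})  # e.g. publishers are down
```

After the second call the stored frame for `MOTOR_OVERHEAT` still holds the first capture, because the empty re-capture is ignored.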
Under a fault storm, captures are bounded by a worker pool (capture_pool_size) draining a bounded queue (capture_queue_depth); excess captures are dropped per capture_queue_full_policy and logged (throttled). The pool is shared and is created when snapshots or rosbag is enabled, so these parameters bound both. capture_pool_size parallelizes freeze-frame snapshot capture only - rosbag is single-writer and records one fault at a time regardless of pool size.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `snapshots.enabled` | bool | `true` | Enable/disable snapshot capture |
| `snapshots.background_capture` | bool | `false` | Use background subscriptions (caches latest message) vs on-demand capture |
| `snapshots.timeout_sec` | double | `1.0` | Timeout waiting for topic message (on-demand mode) |
| `snapshots.max_message_size` | int | `65536` | Maximum message size in bytes (larger messages skipped) |
| `snapshots.default_topics` | string[] | `[]` | Topics to capture for all faults |
| `snapshots.config_file` | string | `""` | Path to YAML config for `fault_specific` and `patterns` |
| `snapshots.recapture_cooldown_sec` | double | `60.0` | Min seconds between captures for the same fault code. |
| `snapshots.max_per_fault` | int | `10` | Max snapshots retained per fault. |
| `snapshots.capture_pool_size` | int | `2` | Max concurrent capture threads under a fault storm (>= 1). Parallelizes snapshot capture only; rosbag stays single-writer. |
| `snapshots.capture_queue_depth` | int | `16` | Max pending captures before the full-queue policy applies (>= 1). |
| `snapshots.capture_queue_full_policy` | string | `reject_newest` | Policy when the queue is full: `reject_newest` or `drop_oldest`. |
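To make the two `capture_queue_full_policy` values concrete, here is a small single-threaded Python sketch of a bounded pending-capture queue. It is an assumption-laden illustration rather than the `CaptureThreadPool` itself: the class and method names are hypothetical, and a real worker pool would also need locking and worker threads.

```python
from collections import deque
from typing import Deque, List


class BoundedCaptureQueue:
    """Illustrative bounded queue of pending captures, keyed by fault code."""

    def __init__(self, depth: int = 16, policy: str = "reject_newest") -> None:
        if depth < 1:
            raise ValueError("depth must be >= 1")
        if policy not in ("reject_newest", "drop_oldest"):
            raise ValueError(f"unknown policy: {policy}")
        self._depth = depth
        self._policy = policy
        self._pending: Deque[str] = deque()
        self.dropped = 0  # number of captures lost to the full-queue policy

    def submit(self, fault_code: str) -> bool:
        """Queue a capture; return False if the new request was rejected."""
        if len(self._pending) >= self._depth:
            self.dropped += 1
            if self._policy == "reject_newest":
                return False  # keep what is queued, refuse the newcomer
            self._pending.popleft()  # drop_oldest: evict the oldest pending capture
        self._pending.append(fault_code)
        return True

    def pending(self) -> List[str]:
        return list(self._pending)


reject = BoundedCaptureQueue(depth=2, policy="reject_newest")
reject_results = [reject.submit(code) for code in ("A", "B", "C")]

drop = BoundedCaptureQueue(depth=2, policy="drop_oldest")
drop_results = [drop.submit(code) for code in ("A", "B", "C")]
```

With a depth of 2 and three submissions, `reject_newest` leaves `A` and `B` queued and refuses `C`, while `drop_oldest` accepts `C` and discards `A`. In both cases exactly one capture is dropped.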
Topic Resolution Priority:
- `fault_specific` - Exact match for fault code (configured via YAML config file)
- `patterns` - Regex pattern match (configured via YAML config file)
- `default_topics` - Fallback for all faults
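The priority order above can be sketched as a lookup function. This Python snippet is a hedged illustration only: the function name and the data shapes (a dict for `fault_specific`, a dict of regex to topic list for `patterns`) are assumptions, as are the choice of a full-string regex match and "first matching pattern wins". The topic names are made up.

```python
import re
from typing import Dict, List


def resolve_topics(
    fault_code: str,
    fault_specific: Dict[str, List[str]],
    patterns: Dict[str, List[str]],
    default_topics: List[str],
) -> List[str]:
    """Pick the topics to capture for a fault code, highest priority first."""
    # 1. Exact match on the fault code.
    if fault_code in fault_specific:
        return fault_specific[fault_code]
    # 2. First regex pattern that matches (assumption: full-string match,
    #    evaluated in configuration order).
    for pattern, topics in patterns.items():
        if re.fullmatch(pattern, fault_code):
            return topics
    # 3. Fallback for all faults.
    return default_topics


specific = {"MOTOR_OVERHEAT": ["/motor/temp"]}
by_pattern = {r"MOTOR_.*": ["/motor/state"]}
fallback = ["/diagnostics"]

exact = resolve_topics("MOTOR_OVERHEAT", specific, by_pattern, fallback)
matched = resolve_topics("MOTOR_STALL", specific, by_pattern, fallback)
default = resolve_topics("LIDAR_TIMEOUT", specific, by_pattern, fallback)
```

Here `MOTOR_OVERHEAT` resolves through its exact entry even though the `MOTOR_.*` pattern would also match it.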
Example YAML config file (snapshots.yaml):
```yaml
fault_specific:
  # ... (example truncated here; see the full README in the repository)
```
Changelog for package ros2_medkit_fault_manager
Forthcoming
- Optional append-only, hash-chained audit log of fault state transitions: each transition appends one immutable row (`record_hash = sha256(prev_hash + canonical(event))` via OpenSSL EVP SHA-256) with a persisted chain head, a `verify` routine, a read API, and retention that seals a segment anchor before pruning. Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited. `verify` reads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering. `BEFORE UPDATE` / `BEFORE DELETE` triggers reject out-of-band edits as defense-in-depth. The chain is unkeyed and stored in a single writable file, so `verify` detects edits/deletions that did not recompute the chain (casual or accidental tampering); it is not a defence against an attacker who can rewrite the whole file. Off by default (#483)
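The chaining rule `record_hash = sha256(prev_hash + canonical(event))` can be illustrated with a short Python sketch. It is not the package's implementation, which is described as using OpenSSL EVP SHA-256 and SQLite. The canonical form (sorted-key compact JSON), the all-zero genesis hash, the hex-string concatenation, and the event fields are all assumptions made for the example. Segment anchors and retention are not modelled.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # assumption: prev_hash used for the first row


def canonical(event: Dict[str, object]) -> str:
    # Assumption: canonical form is compact JSON with sorted keys.
    return json.dumps(event, sort_keys=True, separators=(",", ":"))


def record_hash(prev_hash: str, event: Dict[str, object]) -> str:
    data = (prev_hash + canonical(event)).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def append(chain: List[dict], event: Dict[str, object]) -> str:
    """Append one row and return the new chain head to persist."""
    prev = chain[-1]["record_hash"] if chain else GENESIS
    row = {"event": event, "prev_hash": prev, "record_hash": record_hash(prev, event)}
    chain.append(row)
    return row["record_hash"]


def verify(chain: List[dict], head: str) -> bool:
    """Recompute every link and compare the result with the stored head."""
    prev = GENESIS
    for row in chain:
        if row["prev_hash"] != prev:
            return False
        if row["record_hash"] != record_hash(prev, row["event"]):
            return False
        prev = row["record_hash"]
    return prev == head


chain: List[dict] = []
append(chain, {"fault_code": "MOTOR_OVERHEAT", "from": "PREFAILED", "to": "CONFIRMED"})
head = append(chain, {"fault_code": "MOTOR_OVERHEAT", "from": "CONFIRMED", "to": "CLEARED"})
intact = verify(chain, head)
```

Because the head is checked separately from the rows, dropping the newest row while keeping the old head fails verification. As the entry notes, an unkeyed chain like this cannot detect someone who rewrites every row and the head consistently.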
Package Summary
| Version | 0.6.0 |
| License | Apache-2.0 |
| Build type | AMENT_CMAKE |
| Use | RECOMMENDED |
Repository Summary
| Checkout URI | https://github.com/selfpatch/ros2_medkit.git |
| VCS Type | git |
| VCS Version | main |
| Last Updated | 2026-07-03 |
| Dev Status | DEVELOPED |
| Released | RELEASED |
| Contributing |
Help Wanted (-)
Good First Issues (-) Pull Requests to Review (-) |
Package Description
Maintainers
- bburda
Authors
ros2_medkit_fault_manager
Central fault manager node for the ros2_medkit fault management system.
Overview
The FaultManager node provides a central point for fault aggregation and lifecycle management.
It receives fault reports from multiple sources, aggregates them by fault_code, and provides
query and clearing interfaces.
Quick Start
By default, faults are confirmed immediately when reported - no additional configuration needed.
# Start the fault manager
ros2 launch ros2_medkit_fault_manager fault_manager.launch.py
# Report a fault - it's immediately CONFIRMED
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
"{fault_code: 'MOTOR_OVERHEAT', event_type: 0, severity: 2, description: 'Motor temp exceeded', source_id: '/motor_node'}"
# Query faults
ros2 service call /fault_manager/list_faults ros2_medkit_msgs/srv/ListFaults \
"{statuses: ['CONFIRMED']}"
# Clear a fault (cascade-clears correlated symptoms by default)
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
"{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: false}"
# Clear without touching correlated symptoms
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
"{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: true}"
Note: The
skip_correlation_auto_clearrequest field was added post-0.4.0. Adding a request field changes the service type hash, so callers built againstros2_medkit_msgs0.4.0 or earlier must rebuild to keep talking tofault_manager.
Services
| Service | Type | Description |
|---|---|---|
~/report_fault |
ros2_medkit_msgs/srv/ReportFault |
Report a fault occurrence |
~/list_faults |
ros2_medkit_msgs/srv/ListFaults |
Query faults with filtering |
~/clear_fault |
ros2_medkit_msgs/srv/ClearFault |
Clear/acknowledge a fault |
~/get_snapshots |
ros2_medkit_msgs/srv/GetSnapshots |
Get topic snapshots for a fault |
Features
-
Multi-source aggregation: Same
fault_codefrom different sources creates a single fault - Occurrence tracking: Counts total reports and tracks all reporting sources
- Severity escalation: Fault severity is updated if a higher severity is reported
- Persistent storage: SQLite backend ensures faults survive node restarts
- Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
- Snapshot capture: Captures topic data when faults are confirmed for debugging (snapshots are deleted when fault is cleared)
-
Freeze-frame retention: One compact JSON freeze-frame per fault code, retained across
clear_fault(see below) - Fault correlation (optional): Root cause analysis with symptom muting and auto-clear
- Tamper-evident audit log (optional): Append-only, hash-chained record of fault state transitions for verifiable history
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
storage_type |
string | "sqlite" |
Storage backend: "sqlite" or "memory"
|
database_path |
string | "/var/lib/ros2_medkit/faults.db" |
Path to SQLite database file |
confirmation_threshold |
int | -1 |
Counter value at which faults are confirmed |
healing_enabled |
bool | false |
Enable automatic healing via PASSED events |
healing_threshold |
int | 3 |
Counter value at which faults are healed |
auto_confirm_after_sec |
double | 0.0 |
Auto-confirm PREFAILED faults after timeout (0 = disabled) |
entity_thresholds.config_file |
string | "" |
Path to YAML file with per-entity debounce threshold overrides |
Snapshot Parameters
Snapshots capture topic data when faults are confirmed for post-mortem debugging.
Each confirm also writes a freeze-frame: a single compact JSON object mapping every captured topic to its value at confirmation time, keyed by fault code. It differs from per-topic snapshots in two ways: snapshots are deleted when the fault is cleared, while the freeze-frame is retained across clear_fault (once the snapshots are gone, ~/get_fault serves the retained frame so the confirmed-state record stays available after acknowledgement); and a re-confirm that captures nothing (e.g. source publishers down) never overwrites an existing non-empty frame. A fault code with no configured capture set gets no freeze-frame row; a configured capture that samples nothing on its first run records an empty {} frame. Freeze-frame storage is bounded by the number of distinct fault codes (one row per code, replaced in place) and rows are never evicted.
Under a fault storm, captures are bounded by a worker pool (capture_pool_size) draining a bounded queue (capture_queue_depth); excess captures are dropped per capture_queue_full_policy and logged (throttled). The pool is shared and is created when snapshots or rosbag is enabled, so these parameters bound both. capture_pool_size parallelizes freeze-frame snapshot capture only - rosbag is single-writer and records one fault at a time regardless of pool size.
| Parameter | Type | Default | Description |
|---|---|---|---|
snapshots.enabled |
bool | true |
Enable/disable snapshot capture |
snapshots.background_capture |
bool | false |
Use background subscriptions (caches latest message) vs on-demand capture |
snapshots.timeout_sec |
double | 1.0 |
Timeout waiting for topic message (on-demand mode) |
snapshots.max_message_size |
int | 65536 |
Maximum message size in bytes (larger messages skipped) |
snapshots.default_topics |
string[] | [] |
Topics to capture for all faults |
snapshots.config_file |
string | "" |
Path to YAML config for fault_specific and patterns
|
snapshots.recapture_cooldown_sec |
double | 60.0 |
Min seconds between captures for the same fault code. |
snapshots.max_per_fault |
int | 10 |
Max snapshots retained per fault. |
snapshots.capture_pool_size |
int | 2 |
Max concurrent capture threads under a fault storm (>= 1). Parallelizes snapshot capture only; rosbag stays single-writer. |
snapshots.capture_queue_depth |
int | 16 |
Max pending captures before the full-queue policy applies (>= 1). |
snapshots.capture_queue_full_policy |
string | reject_newest |
Policy when the queue is full: reject_newest or drop_oldest. |
Topic Resolution Priority:
-
fault_specific- Exact match for fault code (configured via YAML config file) -
patterns- Regex pattern match (configured via YAML config file) -
default_topics- Fallback for all faults
Example YAML config file (snapshots.yaml):
```yaml fault_specific:
File truncated at 100 lines see the full file
Changelog for package ros2_medkit_fault_manager
Forthcoming
- Optional append-only, hash-chained audit log of fault state
transitions: each transition appends one immutable row
(
`record_hash = sha256(prev_hash + canonical(event))` via OpenSSL EVP SHA-256) with a persisted chain head, a `verify` routine, a read API, and retention that seals a segment anchor before pruning. Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited. `verify` reads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering. `BEFORE UPDATE`/`BEFORE DELETE` triggers reject out-of-band edits as defense-in-depth. The chain is unkeyed and stored in a single writable file, so `verify` detects edits/deletions that did not recompute the chain (casual or accidental tampering); it is not a defense against an attacker who can rewrite the whole file. Off by default (#483)
0.6.0 (2026-06-22)
- Bounded concurrent snapshot capture under fault storms with a `CaptureThreadPool` and configurable capture pool / queue / overflow-policy parameters. The rosbag leg is serialized and the cooldown map is bounded, so a burst of simultaneous faults can no longer exhaust capture threads or grow memory without limit (#456)
- Entity-scoped rosbag capture by default (#431)
- Made rosbag capture enablement crash-safe (#430)
- Contributors: @bburda, @mfaferek93
0.5.0 (2026-06-08)
- `ClearFault` honors the new `skip_correlation_auto_clear` request flag so per-entity fault clears can opt out of cascade-clearing correlated symptom fault codes (#395)
- Three-layer protection against unbounded snapshot growth (bounded buffers plus pruning)
- Concurrency and lifetime hardening: serialize concurrent subscription creation in `SnapshotCapture`, join capture threads in the `FaultManagerNode` destructor, and defense-in-depth shutdown guards to prevent teardown crashes across distros
- Aggregation security hardening and improved test coverage
- Build: adopt the centralized `ROS2MedkitWarnings` and `ROS2MedkitSanitizers` cmake modules and `bugprone/special-member-functions` clang-tidy checks
- Contributors: @bburda
0.4.0 (2026-03-20)
- Per-entity confirmation and healing thresholds via manifest configuration (#269)
- Default rosbag storage format changed from `sqlite3` to `mcap`
- Support for namespaced fault manager nodes - gateway resolves service/topic names when the fault manager runs in a custom namespace
- Build: use shared cmake modules from `ros2_medkit_cmake` package
- Build: centralized clang-tidy configuration
- Contributors: @bburda
0.3.0 (2026-02-27)
- Accurate HIGHEST_SEVERITY reassignment and stale `fault_to_cluster_` cleanup (#221)
- Clean up `pending_clusters_` when fault cleared before `min_count` (#211)
- Multi-distro CI support for ROS 2 Humble, Jazzy, and Rolling (#219, #242)
- Contributors: @bburda, @eclipse0922
0.2.0 (2026-02-07)
- Initial rosdistro release
- Central fault management node with ROS 2 services:
  - ReportFault - report FAILED/PASSED events with debounce filtering
  - GetFaults - query faults with filtering by severity, status, correlation
  - ClearFault - clear/acknowledge faults
- Debounce filtering with configurable thresholds:
  - FAILED events decrement counter, PASSED events increment
  - Configurable confirmation_threshold (default: -1, immediate)
  - Optional healing support (healing_enabled, healing_threshold)
  - Time-based auto-confirmation (auto_confirm_after_sec)
  - CRITICAL severity bypasses debounce
- Dual storage backends:
  - SQLite persistent storage with WAL mode (default)
  - In-memory storage for testing/lightweight deployments
- Snapshot capture on fault confirmation:
  - Topic data captured as JSON with configurable topic resolution
  - Priority: fault_specific > patterns > default_topics
  - Stored in SQLite with indexed fault_code lookup
  - Auto-cleanup on fault clear
ros2_medkit_fault_manager
Central fault manager node for the ros2_medkit fault management system.
Overview
The FaultManager node provides a central point for fault aggregation and lifecycle management.
It receives fault reports from multiple sources, aggregates them by fault_code, and provides
query and clearing interfaces.
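Aggregation by fault_code, together with the occurrence counting and severity escalation listed under Features, can be pictured with a small model. This is a minimal Python sketch and not the node's implementation: the `Fault` record, its field names, and the `report` helper are hypothetical, and it assumes that a larger integer means a higher severity.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Fault:
    """Hypothetical aggregated fault record, keyed by fault_code."""
    fault_code: str
    severity: int
    occurrence_count: int = 0
    sources: Set[str] = field(default_factory=set)


def report(faults: Dict[str, Fault], fault_code: str, severity: int, source_id: str) -> Fault:
    """Fold one report into the single record kept for its fault_code."""
    fault = faults.setdefault(fault_code, Fault(fault_code, severity))
    fault.occurrence_count += 1          # count every report
    fault.sources.add(source_id)         # remember every reporting source
    # Assumption: severity only ever escalates, never drops on a lower report.
    fault.severity = max(fault.severity, severity)
    return fault
```

Two reports of the same code from different sources leave one entry in `faults`, with both sources recorded and the higher of the two severities.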
Quick Start
By default, faults are confirmed immediately when reported - no additional configuration needed.
# Start the fault manager
ros2 launch ros2_medkit_fault_manager fault_manager.launch.py
# Report a fault - it's immediately CONFIRMED
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
"{fault_code: 'MOTOR_OVERHEAT', event_type: 0, severity: 2, description: 'Motor temp exceeded', source_id: '/motor_node'}"
# Query faults
ros2 service call /fault_manager/list_faults ros2_medkit_msgs/srv/ListFaults \
"{statuses: ['CONFIRMED']}"
# Clear a fault (cascade-clears correlated symptoms by default)
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
"{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: false}"
# Clear without touching correlated symptoms
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
"{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: true}"
Note: The `skip_correlation_auto_clear` request field was added post-0.4.0. Adding a request field changes the service type hash, so callers built against `ros2_medkit_msgs` 0.4.0 or earlier must rebuild to keep talking to `fault_manager`.
Services
| Service | Type | Description |
|---|---|---|
| `~/report_fault` | `ros2_medkit_msgs/srv/ReportFault` | Report a fault occurrence |
| `~/list_faults` | `ros2_medkit_msgs/srv/ListFaults` | Query faults with filtering |
| `~/clear_fault` | `ros2_medkit_msgs/srv/ClearFault` | Clear/acknowledge a fault |
| `~/get_snapshots` | `ros2_medkit_msgs/srv/GetSnapshots` | Get topic snapshots for a fault |
Features
- Multi-source aggregation: Same `fault_code` from different sources creates a single fault
- Occurrence tracking: Counts total reports and tracks all reporting sources
- Severity escalation: Fault severity is updated if a higher severity is reported
- Persistent storage: SQLite backend ensures faults survive node restarts
- Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
- Snapshot capture: Captures topic data when faults are confirmed for debugging (snapshots are deleted when fault is cleared)
- Freeze-frame retention: One compact JSON freeze-frame per fault code, retained across `clear_fault` (see below)
- Fault correlation (optional): Root cause analysis with symptom muting and auto-clear
- Tamper-evident audit log (optional): Append-only, hash-chained record of fault state transitions for verifiable history
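The changelog gives the chaining rule for the audit log as record_hash = sha256(prev_hash + canonical(event)). The Python sketch below illustrates that rule and why a stored chain head matters; it is not the package's code. The canonical form (sorted-key JSON), the all-zero genesis hash, the in-memory list standing in for the database table, and the names `append_event` and `verify_chain` are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # assumed placeholder hash preceding the first record


def canonical(event: Dict) -> str:
    """Assumed canonical form: JSON with sorted keys and no extra whitespace."""
    return json.dumps(event, sort_keys=True, separators=(",", ":"))


def record_hash(prev_hash: str, event: Dict) -> str:
    """record_hash = sha256(prev_hash + canonical(event)), as a hex string."""
    return hashlib.sha256((prev_hash + canonical(event)).encode("utf-8")).hexdigest()


def append_event(chain: List[Dict], event: Dict) -> str:
    """Append one row and return the new chain head."""
    prev = chain[-1]["record_hash"] if chain else GENESIS
    row = {"event": event, "prev_hash": prev, "record_hash": record_hash(prev, event)}
    chain.append(row)
    return row["record_hash"]


def verify_chain(chain: List[Dict], head: str) -> bool:
    """Recompute every link and compare the result with the stored head."""
    prev = GENESIS
    for row in chain:
        if row["prev_hash"] != prev or row["record_hash"] != record_hash(prev, row["event"]):
            return False
        prev = row["record_hash"]
    return prev == head
```

Because the head is checked separately, dropping the newest row fails verification even though the remaining links are still consistent. As the changelog notes, an unkeyed chain like this only catches edits that did not recompute the hashes.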
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `storage_type` | string | `"sqlite"` | Storage backend: `"sqlite"` or `"memory"` |
| `database_path` | string | `"/var/lib/ros2_medkit/faults.db"` | Path to SQLite database file |
| `confirmation_threshold` | int | `-1` | Counter value at which faults are confirmed |
| `healing_enabled` | bool | `false` | Enable automatic healing via PASSED events |
| `healing_threshold` | int | `3` | Counter value at which faults are healed |
| `auto_confirm_after_sec` | double | `0.0` | Auto-confirm PREFAILED faults after timeout (0 = disabled) |
| `entity_thresholds.config_file` | string | `""` | Path to YAML file with per-entity debounce threshold overrides |
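How the threshold parameters interact is easier to see in a small model. The sketch below follows the changelog's description (FAILED events decrement the counter, PASSED events increment it, CRITICAL severity bypasses debounce) and uses the defaults from the table. It is an illustrative Python model only: the `Debounce` class, the `HEALED` status label, the unclamped counter, and the exact comparison operators are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Debounce:
    """Hypothetical per-fault debounce state using the documented defaults."""
    confirmation_threshold: int = -1
    healing_enabled: bool = False
    healing_threshold: int = 3
    counter: int = 0
    status: str = "PREFAILED"

    def report(self, failed: bool, critical: bool = False) -> str:
        if failed:
            self.counter -= 1  # FAILED events decrement the counter
            if critical or self.counter <= self.confirmation_threshold:
                self.status = "CONFIRMED"  # CRITICAL skips the debounce
            elif self.status != "CONFIRMED":
                self.status = "PREFAILED"
        else:
            self.counter += 1  # PASSED events increment the counter
            if self.healing_enabled and self.counter >= self.healing_threshold:
                self.status = "HEALED"  # assumed label for a healed fault
        return self.status
```

With the default `confirmation_threshold` of -1 the first FAILED report already reaches the threshold, which matches the immediate confirmation described in Quick Start. A value of -3 would require three FAILED reports unless one of them is CRITICAL.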
Snapshot Parameters
Snapshots capture topic data when faults are confirmed for post-mortem debugging.
Each confirm also writes a freeze-frame: a single compact JSON object mapping every captured topic to its value at confirmation time, keyed by fault code. It differs from per-topic snapshots in two ways: snapshots are deleted when the fault is cleared, while the freeze-frame is retained across clear_fault (once the snapshots are gone, ~/get_fault serves the retained frame so the confirmed-state record stays available after acknowledgement); and a re-confirm that captures nothing (e.g. source publishers down) never overwrites an existing non-empty frame. A fault code with no configured capture set gets no freeze-frame row; a configured capture that samples nothing on its first run records an empty {} frame. Freeze-frame storage is bounded by the number of distinct fault codes (one row per code, replaced in place) and rows are never evicted.
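The freeze-frame write rules in the paragraph above can be condensed into one function. This is a minimal Python sketch under stated assumptions: a plain dict stands in for the storage table, `store_freeze_frame` is a hypothetical name, and `captured=None` is used here to mean that the fault code has no configured capture set.

```python
from typing import Any, Dict, Optional


def store_freeze_frame(frames: Dict[str, Dict[str, Any]],
                       fault_code: str,
                       captured: Optional[Dict[str, Any]]) -> None:
    """Apply the documented write rules for one confirmation."""
    if captured is None:
        return  # no configured capture set: no freeze-frame row at all
    existing = frames.get(fault_code)
    if not captured and existing:
        return  # an empty re-capture never overwrites a non-empty frame
    # First capture (even if empty) or a non-empty re-capture:
    # one row per fault code, replaced in place.
    frames[fault_code] = captured
```

Clearing a fault does not touch `frames` in this model, which mirrors the retention of the freeze-frame across clear_fault.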
Under a fault storm, captures are bounded by a worker pool (capture_pool_size) draining a bounded queue (capture_queue_depth); excess captures are dropped per capture_queue_full_policy and logged (throttled). The pool is shared and is created when snapshots or rosbag is enabled, so these parameters bound both. capture_pool_size parallelizes freeze-frame snapshot capture only - rosbag is single-writer and records one fault at a time regardless of pool size.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `snapshots.enabled` | bool | `true` | Enable/disable snapshot capture |
| `snapshots.background_capture` | bool | `false` | Use background subscriptions (caches latest message) vs on-demand capture |
| `snapshots.timeout_sec` | double | `1.0` | Timeout waiting for topic message (on-demand mode) |
| `snapshots.max_message_size` | int | `65536` | Maximum message size in bytes (larger messages skipped) |
| `snapshots.default_topics` | string[] | `[]` | Topics to capture for all faults |
| `snapshots.config_file` | string | `""` | Path to YAML config for `fault_specific` and `patterns` |
| `snapshots.recapture_cooldown_sec` | double | `60.0` | Min seconds between captures for the same fault code. |
| `snapshots.max_per_fault` | int | `10` | Max snapshots retained per fault. |
| `snapshots.capture_pool_size` | int | `2` | Max concurrent capture threads under a fault storm (>= 1). Parallelizes snapshot capture only; rosbag stays single-writer. |
| `snapshots.capture_queue_depth` | int | `16` | Max pending captures before the full-queue policy applies (>= 1). |
| `snapshots.capture_queue_full_policy` | string | `reject_newest` | Policy when the queue is full: `reject_newest` or `drop_oldest`. |
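The two values of capture_queue_full_policy can be illustrated with a bounded queue. The sketch below is a Python model rather than the node's `CaptureThreadPool`: the `CaptureQueue` class is hypothetical, and the behavior of each policy is inferred from its name (`reject_newest` refuses the incoming capture, `drop_oldest` evicts the longest-waiting one).

```python
from collections import deque


class CaptureQueue:
    """Hypothetical bounded queue of pending captures."""

    POLICIES = ("reject_newest", "drop_oldest")

    def __init__(self, depth: int = 16, policy: str = "reject_newest") -> None:
        if depth < 1:
            raise ValueError("capture_queue_depth must be >= 1")
        if policy not in self.POLICIES:
            raise ValueError(f"unknown policy: {policy}")
        self.depth = depth
        self.policy = policy
        self.pending = deque()
        self.dropped = 0  # captures lost to overflow, useful for throttled logging

    def submit(self, fault_code: str) -> bool:
        """Queue a capture; return False if the incoming capture was rejected."""
        if len(self.pending) < self.depth:
            self.pending.append(fault_code)
            return True
        self.dropped += 1
        if self.policy == "drop_oldest":
            self.pending.popleft()           # evict the longest-waiting capture
            self.pending.append(fault_code)
            return True
        return False                         # reject_newest: keep the queue as-is
```

Under either policy the queue never holds more than `depth` entries, so a burst of faults costs at most one dropped capture per excess report instead of unbounded memory.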
Topic Resolution Priority:
- `fault_specific` - Exact match for fault code (configured via YAML config file)
- `patterns` - Regex pattern match (configured via YAML config file)
- `default_topics` - Fallback for all faults
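The priority order above amounts to a three-step lookup. The following Python sketch shows one way to express it; `resolve_topics` and its argument shapes are hypothetical, and it assumes that a pattern must match the whole fault code and that the first matching pattern wins.

```python
import re
from typing import Dict, List


def resolve_topics(fault_code: str,
                   fault_specific: Dict[str, List[str]],
                   patterns: Dict[str, List[str]],
                   default_topics: List[str]) -> List[str]:
    """Pick the topics to capture: fault_specific > patterns > default_topics."""
    if fault_code in fault_specific:          # 1. exact match on the fault code
        return fault_specific[fault_code]
    for pattern, topics in patterns.items():  # 2. first regex that matches
        if re.fullmatch(pattern, fault_code):
            return topics
    return default_topics                     # 3. fallback for all faults
```

For example, with an exact entry for `MOTOR_OVERHEAT` and a pattern `MOTOR_.*`, the exact entry is used for `MOTOR_OVERHEAT`, the pattern's topics for any other `MOTOR_` code, and `default_topics` for everything else.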
Example YAML config file (snapshots.yaml):
```yaml
fault_specific:
  # (example truncated in the source at 100 lines; see the full file)
```
Changelog for package ros2_medkit_fault_manager
Forthcoming
- Optional append-only, hash-chained audit log of fault state
transitions: each transition appends one immutable row
(
record_hash = sha256(prev_hash + canonical(event))via OpenSSL EVP SHA-256) with a persisted chain head, averifyroutine, a read API, and retention that seals a segment anchor before pruning. Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited.verifyreads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering.BEFORE UPDATE/BEFORE DELETEtriggers reject out-of-band edits as defense-in-depth. The chain is unkeyed and stored in a single writable file, soverifydetects edits/deletions that did not recompute the chain (casual or accidental tampering); it is not a defence against an attacker who can rewrite the whole file. Off by default (#483)
0.6.0 (2026-06-22)
- Bounded concurrent snapshot capture under fault storms with a
CaptureThreadPooland configurable capture pool / queue / overflow-policy parameters. The rosbag leg is serialized and the cooldown map is bounded, so a burst of simultaneous faults can no longer exhaust capture threads or grow memory without limit (#456) - Entity-scoped rosbag capture by default (#431)
- Made rosbag capture enablement crash-safe (#430)
- Contributors: \@bburda, \@mfaferek93
0.5.0 (2026-06-08)
- `ClearFault` honors the new `skip_correlation_auto_clear` request flag so per-entity fault clears can opt out of cascade-clearing correlated symptom fault codes (#395)
- Three-layer protection against unbounded snapshot growth (bounded buffers plus pruning)
- Concurrency and lifetime hardening: serialize concurrent subscription creation in `SnapshotCapture`, join capture threads in the `FaultManagerNode` destructor, and defense-in-depth shutdown guards to prevent teardown crashes across distros
- Aggregation security hardening and improved test coverage
- Build: adopt the centralized `ROS2MedkitWarnings` and `ROS2MedkitSanitizers` cmake modules and `bugprone`/`special-member-functions` clang-tidy checks
- Contributors: @bburda
0.4.0 (2026-03-20)
- Per-entity confirmation and healing thresholds via manifest configuration (#269)
- Default rosbag storage format changed from `sqlite3` to `mcap`
- Support for namespaced fault manager nodes - gateway resolves service/topic names when the fault manager runs in a custom namespace
- Build: use shared cmake modules from `ros2_medkit_cmake` package
- Build: centralized clang-tidy configuration
- Contributors: @bburda
0.3.0 (2026-02-27)
- Accurate HIGHEST_SEVERITY reassignment and stale `fault_to_cluster_` cleanup (#221)
- Clean up `pending_clusters_` when fault cleared before `min_count` (#211)
- Multi-distro CI support for ROS 2 Humble, Jazzy, and Rolling (#219, #242)
- Contributors: @bburda, @eclipse0922
0.2.0 (2026-02-07)
- Initial rosdistro release
- Central fault management node with ROS 2 services:
  - ReportFault - report FAILED/PASSED events with debounce filtering
  - GetFaults - query faults with filtering by severity, status, correlation
  - ClearFault - clear/acknowledge faults
- Debounce filtering with configurable thresholds:
  - FAILED events decrement counter, PASSED events increment
  - Configurable confirmation_threshold (default: -1, immediate)
  - Optional healing support (healing_enabled, healing_threshold)
  - Time-based auto-confirmation (auto_confirm_after_sec)
  - CRITICAL severity bypasses debounce
- Dual storage backends:
  - SQLite persistent storage with WAL mode (default)
  - In-memory storage for testing/lightweight deployments
- Snapshot capture on fault confirmation:
  - Topic data captured as JSON with configurable topic resolution
  - Priority: fault_specific > patterns > default_topics
  - Stored in SQLite with indexed fault_code lookup
  - Auto-cleanup on fault clear
File truncated at 100 lines see the full file
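The counter-based debounce described in the 0.2.0 entry (FAILED decrements, PASSED increments, confirmation at `confirmation_threshold`, optional healing at `healing_threshold`, CRITICAL bypass) can be sketched as follows. This is a simplified model rather than the node's code: the class name, the starting counter of 0, the absence of counter clamping, and the `NONE` and `HEALED` status labels are assumptions; only `PREFAILED` and `CONFIRMED` are named in this document.

```python
from dataclasses import dataclass


@dataclass
class DebounceCounter:
    confirmation_threshold: int = -1  # default: confirm on the first FAILED
    healing_enabled: bool = False
    healing_threshold: int = 3
    counter: int = 0
    status: str = "NONE"  # hypothetical label for "nothing reported yet"

    def report(self, failed, critical=False):
        """Apply one FAILED (failed=True) or PASSED event; return the status."""
        if failed:
            self.counter -= 1
            if critical or self.counter <= self.confirmation_threshold:
                # CRITICAL severity bypasses the debounce counter.
                self.status = "CONFIRMED"
            elif self.status != "CONFIRMED":
                self.status = "PREFAILED"
        else:
            self.counter += 1
            if self.healing_enabled and self.counter >= self.healing_threshold:
                self.status = "HEALED"  # hypothetical label
        return self.status


# Default settings: a single FAILED report confirms the fault.
print(DebounceCounter().report(failed=True))

# A stricter threshold needs three FAILED reports before confirmation.
strict = DebounceCounter(confirmation_threshold=-3)
print([strict.report(failed=True) for _ in range(3)])
```

With the default threshold of -1 the first FAILED event already reaches the threshold, which matches the "immediate" behaviour noted in the changelog.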
Package Summary
| Version | 0.6.0 |
| License | Apache-2.0 |
| Build type | AMENT_CMAKE |
| Use | RECOMMENDED |
Repository Summary
| Checkout URI | https://github.com/selfpatch/ros2_medkit.git |
| VCS Type | git |
| VCS Version | main |
| Last Updated | 2026-07-03 |
| Dev Status | DEVELOPED |
| Released | RELEASED |
| Contributing |
Help Wanted (-)
Good First Issues (-) Pull Requests to Review (-) |
Package Description
Maintainers
- bburda
Authors
ros2_medkit_fault_manager
Central fault manager node for the ros2_medkit fault management system.
Overview
The FaultManager node provides a central point for fault aggregation and lifecycle management.
It receives fault reports from multiple sources, aggregates them by fault_code, and provides
query and clearing interfaces.
Quick Start
By default, faults are confirmed immediately when reported - no additional configuration needed.
# Start the fault manager
ros2 launch ros2_medkit_fault_manager fault_manager.launch.py
# Report a fault - it's immediately CONFIRMED
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
"{fault_code: 'MOTOR_OVERHEAT', event_type: 0, severity: 2, description: 'Motor temp exceeded', source_id: '/motor_node'}"
# Query faults
ros2 service call /fault_manager/list_faults ros2_medkit_msgs/srv/ListFaults \
"{statuses: ['CONFIRMED']}"
# Clear a fault (cascade-clears correlated symptoms by default)
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
"{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: false}"
# Clear without touching correlated symptoms
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
"{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: true}"
Note: The
skip_correlation_auto_clearrequest field was added post-0.4.0. Adding a request field changes the service type hash, so callers built againstros2_medkit_msgs0.4.0 or earlier must rebuild to keep talking tofault_manager.
Services
| Service | Type | Description |
|---|---|---|
~/report_fault |
ros2_medkit_msgs/srv/ReportFault |
Report a fault occurrence |
~/list_faults |
ros2_medkit_msgs/srv/ListFaults |
Query faults with filtering |
~/clear_fault |
ros2_medkit_msgs/srv/ClearFault |
Clear/acknowledge a fault |
~/get_snapshots |
ros2_medkit_msgs/srv/GetSnapshots |
Get topic snapshots for a fault |
Features
-
Multi-source aggregation: Same
fault_codefrom different sources creates a single fault - Occurrence tracking: Counts total reports and tracks all reporting sources
- Severity escalation: Fault severity is updated if a higher severity is reported
- Persistent storage: SQLite backend ensures faults survive node restarts
- Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
- Snapshot capture: Captures topic data when faults are confirmed for debugging (snapshots are deleted when fault is cleared)
-
Freeze-frame retention: One compact JSON freeze-frame per fault code, retained across
clear_fault(see below) - Fault correlation (optional): Root cause analysis with symptom muting and auto-clear
- Tamper-evident audit log (optional): Append-only, hash-chained record of fault state transitions for verifiable history
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
storage_type |
string | "sqlite" |
Storage backend: "sqlite" or "memory"
|
database_path |
string | "/var/lib/ros2_medkit/faults.db" |
Path to SQLite database file |
confirmation_threshold |
int | -1 |
Counter value at which faults are confirmed |
healing_enabled |
bool | false |
Enable automatic healing via PASSED events |
healing_threshold |
int | 3 |
Counter value at which faults are healed |
auto_confirm_after_sec |
double | 0.0 |
Auto-confirm PREFAILED faults after timeout (0 = disabled) |
entity_thresholds.config_file |
string | "" |
Path to YAML file with per-entity debounce threshold overrides |
Snapshot Parameters
Snapshots capture topic data when faults are confirmed for post-mortem debugging.
Each confirm also writes a freeze-frame: a single compact JSON object mapping every captured topic to its value at confirmation time, keyed by fault code. It differs from per-topic snapshots in two ways: snapshots are deleted when the fault is cleared, while the freeze-frame is retained across clear_fault (once the snapshots are gone, ~/get_fault serves the retained frame so the confirmed-state record stays available after acknowledgement); and a re-confirm that captures nothing (e.g. source publishers down) never overwrites an existing non-empty frame. A fault code with no configured capture set gets no freeze-frame row; a configured capture that samples nothing on its first run records an empty {} frame. Freeze-frame storage is bounded by the number of distinct fault codes (one row per code, replaced in place) and rows are never evicted.
Under a fault storm, captures are bounded by a worker pool (capture_pool_size) draining a bounded queue (capture_queue_depth); excess captures are dropped per capture_queue_full_policy and logged (throttled). The pool is shared and is created when snapshots or rosbag is enabled, so these parameters bound both. capture_pool_size parallelizes freeze-frame snapshot capture only - rosbag is single-writer and records one fault at a time regardless of pool size.
| Parameter | Type | Default | Description |
|---|---|---|---|
snapshots.enabled |
bool | true |
Enable/disable snapshot capture |
snapshots.background_capture |
bool | false |
Use background subscriptions (caches latest message) vs on-demand capture |
snapshots.timeout_sec |
double | 1.0 |
Timeout waiting for topic message (on-demand mode) |
snapshots.max_message_size |
int | 65536 |
Maximum message size in bytes (larger messages skipped) |
snapshots.default_topics |
string[] | [] |
Topics to capture for all faults |
snapshots.config_file |
string | "" |
Path to YAML config for fault_specific and patterns
|
snapshots.recapture_cooldown_sec |
double | 60.0 |
Min seconds between captures for the same fault code. |
snapshots.max_per_fault |
int | 10 |
Max snapshots retained per fault. |
snapshots.capture_pool_size |
int | 2 |
Max concurrent capture threads under a fault storm (>= 1). Parallelizes snapshot capture only; rosbag stays single-writer. |
snapshots.capture_queue_depth |
int | 16 |
Max pending captures before the full-queue policy applies (>= 1). |
snapshots.capture_queue_full_policy |
string | reject_newest |
Policy when the queue is full: reject_newest or drop_oldest. |
Topic Resolution Priority:
-
fault_specific- Exact match for fault code (configured via YAML config file) -
patterns- Regex pattern match (configured via YAML config file) -
default_topics- Fallback for all faults
Example YAML config file (snapshots.yaml):
```yaml fault_specific:
File truncated at 100 lines see the full file
Changelog for package ros2_medkit_fault_manager
Forthcoming
- Optional append-only, hash-chained audit log of fault state
transitions: each transition appends one immutable row
(
record_hash = sha256(prev_hash + canonical(event))via OpenSSL EVP SHA-256) with a persisted chain head, averifyroutine, a read API, and retention that seals a segment anchor before pruning. Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited.verifyreads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering.BEFORE UPDATE/BEFORE DELETEtriggers reject out-of-band edits as defense-in-depth. The chain is unkeyed and stored in a single writable file, soverifydetects edits/deletions that did not recompute the chain (casual or accidental tampering); it is not a defence against an attacker who can rewrite the whole file. Off by default (#483)
0.6.0 (2026-06-22)
- Bounded concurrent snapshot capture under fault storms with a `CaptureThreadPool` and configurable capture pool / queue / overflow-policy parameters. The rosbag leg is serialized and the cooldown map is bounded, so a burst of simultaneous faults can no longer exhaust capture threads or grow memory without limit (#456)
- Entity-scoped rosbag capture by default (#431)
- Made rosbag capture enablement crash-safe (#430)
- Contributors: @bburda, @mfaferek93
0.5.0 (2026-06-08)
- `ClearFault` honors the new `skip_correlation_auto_clear` request flag so per-entity fault clears can opt out of cascade-clearing correlated symptom fault codes (#395)
- Three-layer protection against unbounded snapshot growth (bounded buffers plus pruning)
- Concurrency and lifetime hardening: serialize concurrent subscription creation in `SnapshotCapture`, join capture threads in the `FaultManagerNode` destructor, and defense-in-depth shutdown guards to prevent teardown crashes across distros
- Aggregation security hardening and improved test coverage
- Build: adopt the centralized `ROS2MedkitWarnings` and `ROS2MedkitSanitizers` cmake modules and `bugprone`/`special-member-functions` clang-tidy checks
- Contributors: @bburda
0.4.0 (2026-03-20)
- Per-entity confirmation and healing thresholds via manifest configuration (#269)
- Default rosbag storage format changed from `sqlite3` to `mcap`
- Support for namespaced fault manager nodes - gateway resolves service/topic names when the fault manager runs in a custom namespace
- Build: use shared cmake modules from `ros2_medkit_cmake` package
- Build: centralized clang-tidy configuration
- Contributors: @bburda
0.3.0 (2026-02-27)
- Accurate HIGHEST_SEVERITY reassignment and stale `fault_to_cluster_` cleanup (#221)
- Clean up `pending_clusters_` when fault cleared before `min_count` (#211)
- Multi-distro CI support for ROS 2 Humble, Jazzy, and Rolling (#219, #242)
- Contributors: @bburda, @eclipse0922
0.2.0 (2026-02-07)
- Initial rosdistro release
- Central fault management node with ROS 2 services:
  - ReportFault - report FAILED/PASSED events with debounce filtering
  - GetFaults - query faults with filtering by severity, status, correlation
  - ClearFault - clear/acknowledge faults
- Debounce filtering with configurable thresholds:
  - FAILED events decrement counter, PASSED events increment
  - Configurable confirmation_threshold (default: -1, immediate)
  - Optional healing support (healing_enabled, healing_threshold)
  - Time-based auto-confirmation (auto_confirm_after_sec)
  - CRITICAL severity bypasses debounce
- Dual storage backends:
  - SQLite persistent storage with WAL mode (default)
  - In-memory storage for testing/lightweight deployments
- Snapshot capture on fault confirmation:
  - Topic data captured as JSON with configurable topic resolution
  - Priority: fault_specific > patterns > default_topics
  - Stored in SQLite with indexed fault_code lookup
  - Auto-cleanup on fault clear
File truncated at 100 lines see the full file
ros2_medkit_fault_manager
Central fault manager node for the ros2_medkit fault management system.
Overview
The FaultManager node provides a central point for fault aggregation and lifecycle management.
It receives fault reports from multiple sources, aggregates them by fault_code, and provides
query and clearing interfaces.
Quick Start
By default, faults are confirmed immediately when reported - no additional configuration needed.
# Start the fault manager
ros2 launch ros2_medkit_fault_manager fault_manager.launch.py
# Report a fault - it's immediately CONFIRMED
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
"{fault_code: 'MOTOR_OVERHEAT', event_type: 0, severity: 2, description: 'Motor temp exceeded', source_id: '/motor_node'}"
# Query faults
ros2 service call /fault_manager/list_faults ros2_medkit_msgs/srv/ListFaults \
"{statuses: ['CONFIRMED']}"
# Clear a fault (cascade-clears correlated symptoms by default)
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
"{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: false}"
# Clear without touching correlated symptoms
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
"{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: true}"
Note: The `skip_correlation_auto_clear` request field was added post-0.4.0. Adding a request field changes the service type hash, so callers built against `ros2_medkit_msgs` 0.4.0 or earlier must rebuild to keep talking to `fault_manager`.
Services
| Service | Type | Description |
|---|---|---|
| `~/report_fault` | `ros2_medkit_msgs/srv/ReportFault` | Report a fault occurrence |
| `~/list_faults` | `ros2_medkit_msgs/srv/ListFaults` | Query faults with filtering |
| `~/clear_fault` | `ros2_medkit_msgs/srv/ClearFault` | Clear/acknowledge a fault |
| `~/get_snapshots` | `ros2_medkit_msgs/srv/GetSnapshots` | Get topic snapshots for a fault |
Features
- Multi-source aggregation: Same `fault_code` from different sources creates a single fault
- Occurrence tracking: Counts total reports and tracks all reporting sources
- Severity escalation: Fault severity is updated if a higher severity is reported
- Persistent storage: SQLite backend ensures faults survive node restarts
- Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
- Snapshot capture: Captures topic data when faults are confirmed for debugging (snapshots are deleted when the fault is cleared)
- Freeze-frame retention: One compact JSON freeze-frame per fault code, retained across `clear_fault` (see below)
- Fault correlation (optional): Root cause analysis with symptom muting and auto-clear
- Tamper-evident audit log (optional): Append-only, hash-chained record of fault state transitions for verifiable history
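The aggregation, occurrence-tracking, and severity-escalation features can be pictured with a small sketch. This is not the node's implementation: the `Fault` and `FaultAggregator` names are hypothetical, and it assumes that a larger `severity` integer means a more severe fault.

```python
from dataclasses import dataclass, field


@dataclass
class Fault:
    """One aggregated fault, keyed by fault_code (hypothetical model)."""
    fault_code: str
    severity: int
    occurrence_count: int = 0
    sources: set = field(default_factory=set)


class FaultAggregator:
    """Merges reports that share a fault_code into a single Fault."""

    def __init__(self):
        self._faults = {}

    def report(self, fault_code, severity, source_id):
        fault = self._faults.get(fault_code)
        if fault is None:
            fault = Fault(fault_code, severity)
            self._faults[fault_code] = fault
        # Occurrence tracking: count every report and remember who sent it.
        fault.occurrence_count += 1
        fault.sources.add(source_id)
        # Severity escalation: only ever raise the stored severity
        # (assumes higher numbers are more severe).
        fault.severity = max(fault.severity, severity)
        return fault


aggregator = FaultAggregator()
aggregator.report("MOTOR_OVERHEAT", 1, "/motor_node")
merged = aggregator.report("MOTOR_OVERHEAT", 2, "/thermal_monitor")
lower = aggregator.report("MOTOR_OVERHEAT", 1, "/motor_node")
print(merged.occurrence_count, sorted(merged.sources), merged.severity)
```

Both reports land on the same `Fault` object, so a later, lower-severity report increments the count without lowering the stored severity.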
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `storage_type` | string | `"sqlite"` | Storage backend: `"sqlite"` or `"memory"` |
| `database_path` | string | `"/var/lib/ros2_medkit/faults.db"` | Path to SQLite database file |
| `confirmation_threshold` | int | `-1` | Counter value at which faults are confirmed |
| `healing_enabled` | bool | `false` | Enable automatic healing via PASSED events |
| `healing_threshold` | int | `3` | Counter value at which faults are healed |
| `auto_confirm_after_sec` | double | `0.0` | Auto-confirm PREFAILED faults after timeout (0 = disabled) |
| `entity_thresholds.config_file` | string | `""` | Path to YAML file with per-entity debounce threshold overrides |
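A minimal sketch of how the counter-based thresholds above might interact, assuming the counter starts at 0, FAILED events decrement it, and PASSED events increment it (the direction is taken from the changelog). The `DebounceCounter` class, the `NONE` and `HEALED` status names, and the exact comparison operators are assumptions for illustration; time-based auto-confirmation and the CRITICAL bypass are left out.

```python
from dataclasses import dataclass

FAILED, PASSED = "FAILED", "PASSED"


@dataclass
class DebounceCounter:
    """Hypothetical per-fault debounce state."""
    confirmation_threshold: int = -1
    healing_enabled: bool = False
    healing_threshold: int = 3
    counter: int = 0
    status: str = "NONE"  # placeholder for "nothing reported yet"

    def on_event(self, event):
        if event == FAILED:
            self.counter -= 1
            # Confirm once the counter has fallen to the threshold.
            if self.counter <= self.confirmation_threshold:
                self.status = "CONFIRMED"
            else:
                self.status = "PREFAILED"
        elif event == PASSED:
            self.counter += 1
            # Healing only applies when explicitly enabled.
            if self.healing_enabled and self.counter >= self.healing_threshold:
                self.status = "HEALED"
        return self.status


# Default threshold of -1: the first FAILED event confirms immediately.
immediate = DebounceCounter()
first = immediate.on_event(FAILED)

# Threshold of -3: three FAILED events are needed.
debounced = DebounceCounter(confirmation_threshold=-3)
history = [debounced.on_event(FAILED) for _ in range(3)]
print(first, history)
```

With the default `confirmation_threshold` of `-1` the sketch reproduces the "confirmed immediately" behaviour from the Quick Start; a more negative threshold delays confirmation until enough FAILED events accumulate.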
Snapshot Parameters
Snapshots capture topic data when faults are confirmed for post-mortem debugging.
Each confirm also writes a freeze-frame: a single compact JSON object mapping every captured topic to its value at confirmation time, keyed by fault code. It differs from per-topic snapshots in two ways: snapshots are deleted when the fault is cleared, while the freeze-frame is retained across clear_fault (once the snapshots are gone, ~/get_fault serves the retained frame so the confirmed-state record stays available after acknowledgement); and a re-confirm that captures nothing (e.g. source publishers down) never overwrites an existing non-empty frame. A fault code with no configured capture set gets no freeze-frame row; a configured capture that samples nothing on its first run records an empty {} frame. Freeze-frame storage is bounded by the number of distinct fault codes (one row per code, replaced in place) and rows are never evicted.
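The freeze-frame rules in the paragraph above can be summarized in a short sketch. `FreezeFrameStore` and its method names are hypothetical, and an in-memory dict stands in for the database row; the sketch only illustrates the retention and overwrite rules, not the real storage layer.

```python
import json


class FreezeFrameStore:
    """Hypothetical one-row-per-fault-code freeze-frame store."""

    def __init__(self):
        self._frames = {}  # fault_code -> compact JSON string

    def on_confirm(self, fault_code, capture_set, captured):
        """capture_set: configured topics; captured: topic -> sampled value."""
        if not capture_set:
            return  # no configured capture set: no freeze-frame row at all
        existing = self._frames.get(fault_code)
        if not captured and existing not in (None, "{}"):
            return  # an empty re-confirm never overwrites a non-empty frame
        # One row per code, replaced in place.
        self._frames[fault_code] = json.dumps(
            captured, separators=(",", ":"), sort_keys=True
        )

    def on_clear(self, fault_code):
        """Snapshots would be deleted here; the freeze-frame is retained."""

    def get(self, fault_code):
        return self._frames.get(fault_code)


store = FreezeFrameStore()
store.on_confirm("NO_CAPTURE", [], {})
store.on_confirm("MOTOR_OVERHEAT", ["/motor/temp"], {})             # first run, nothing sampled
empty_frame = store.get("MOTOR_OVERHEAT")
store.on_confirm("MOTOR_OVERHEAT", ["/motor/temp"], {"/motor/temp": 91.5})
store.on_confirm("MOTOR_OVERHEAT", ["/motor/temp"], {})             # publishers down
store.on_clear("MOTOR_OVERHEAT")
print(store.get("NO_CAPTURE"), empty_frame, store.get("MOTOR_OVERHEAT"))
```

Because rows are keyed by fault code and replaced in place, the store never holds more entries than there are distinct fault codes with a configured capture set.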
Under a fault storm, captures are bounded by a worker pool (capture_pool_size) draining a bounded queue (capture_queue_depth); excess captures are dropped per capture_queue_full_policy and logged (throttled). The pool is shared and is created when snapshots or rosbag is enabled, so these parameters bound both. capture_pool_size parallelizes freeze-frame snapshot capture only - rosbag is single-writer and records one fault at a time regardless of pool size.
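To make the two full-queue policies concrete, here is a single-threaded sketch of a bounded pending queue. `BoundedCaptureQueue` is a hypothetical name, the worker pool that drains the queue is omitted, and the exact drop semantics of the real node are assumed from the policy names.

```python
from collections import deque

POLICIES = ("reject_newest", "drop_oldest")


class BoundedCaptureQueue:
    """Hypothetical bounded queue of pending captures."""

    def __init__(self, depth=16, full_policy="reject_newest"):
        if depth < 1:
            raise ValueError("depth must be >= 1")
        if full_policy not in POLICIES:
            raise ValueError(f"unknown policy: {full_policy}")
        self.depth = depth
        self.full_policy = full_policy
        self._pending = deque()

    def submit(self, fault_code):
        """Enqueue a capture; return the fault code that was dropped, if any."""
        if len(self._pending) < self.depth:
            self._pending.append(fault_code)
            return None
        if self.full_policy == "reject_newest":
            return fault_code  # the incoming capture is discarded
        dropped = self._pending.popleft()  # drop_oldest: evict the head
        self._pending.append(fault_code)
        return dropped

    def pending(self):
        return list(self._pending)


reject = BoundedCaptureQueue(depth=2, full_policy="reject_newest")
reject_dropped = [reject.submit(code) for code in ("A", "B", "C")]

evict = BoundedCaptureQueue(depth=2, full_policy="drop_oldest")
evict_dropped = [evict.submit(code) for code in ("A", "B", "C")]
print(reject_dropped, reject.pending(), evict_dropped, evict.pending())
```

`reject_newest` favours the captures that were queued first, while `drop_oldest` favours the most recent faults; which is preferable depends on whether the start or the end of a fault storm is more useful for diagnosis.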
| Parameter | Type | Default | Description |
|---|---|---|---|
| `snapshots.enabled` | bool | `true` | Enable/disable snapshot capture |
| `snapshots.background_capture` | bool | `false` | Use background subscriptions (caches latest message) vs on-demand capture |
| `snapshots.timeout_sec` | double | `1.0` | Timeout waiting for topic message (on-demand mode) |
| `snapshots.max_message_size` | int | `65536` | Maximum message size in bytes (larger messages skipped) |
| `snapshots.default_topics` | string[] | `[]` | Topics to capture for all faults |
| `snapshots.config_file` | string | `""` | Path to YAML config for `fault_specific` and `patterns` |
| `snapshots.recapture_cooldown_sec` | double | `60.0` | Min seconds between captures for the same fault code. |
| `snapshots.max_per_fault` | int | `10` | Max snapshots retained per fault. |
| `snapshots.capture_pool_size` | int | `2` | Max concurrent capture threads under a fault storm (>= 1). Parallelizes snapshot capture only; rosbag stays single-writer. |
| `snapshots.capture_queue_depth` | int | `16` | Max pending captures before the full-queue policy applies (>= 1). |
| `snapshots.capture_queue_full_policy` | string | `reject_newest` | Policy when the queue is full: `reject_newest` or `drop_oldest`. |
Topic Resolution Priority (highest first):
- `fault_specific` - Exact match for fault code (configured via YAML config file)
- `patterns` - Regex pattern match (configured via YAML config file)
- `default_topics` - Fallback for all faults
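The priority order can be sketched as a small lookup function. The data shapes are assumptions, since the example config is truncated in this copy: `fault_specific` is modelled as a dict from fault code to topic list, `patterns` as an ordered list of `(regex, topics)` pairs, and a pattern is assumed to match the whole fault code.

```python
import re


def resolve_topics(fault_code, fault_specific, patterns, default_topics):
    """Return the topics to capture for fault_code (hypothetical helper)."""
    # 1. Exact match on the fault code wins.
    if fault_code in fault_specific:
        return fault_specific[fault_code]
    # 2. Otherwise the first matching regex pattern is used.
    for pattern, topics in patterns:
        if re.fullmatch(pattern, fault_code):
            return topics
    # 3. Fall back to the topics captured for all faults.
    return default_topics


fault_specific = {"MOTOR_OVERHEAT": ["/motor/temperature"]}
patterns = [(r"MOTOR_.*", ["/motor/status"])]
default_topics = ["/diagnostics"]

exact = resolve_topics("MOTOR_OVERHEAT", fault_specific, patterns, default_topics)
by_pattern = resolve_topics("MOTOR_STALL", fault_specific, patterns, default_topics)
fallback = resolve_topics("BATTERY_LOW", fault_specific, patterns, default_topics)
print(exact, by_pattern, fallback)
```

The topic names here are illustrative only. Note that an exact entry shadows any pattern that would also match, which is why `MOTOR_OVERHEAT` does not pick up `/motor/status`.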
Example YAML config file (snapshots.yaml):
```yaml
fault_specific:
```
File truncated at 100 lines; see the full file.
Changelog for package ros2_medkit_fault_manager
Forthcoming
- Optional append-only, hash-chained audit log of fault state transitions: each transition appends one immutable row (`record_hash = sha256(prev_hash + canonical(event))` via OpenSSL EVP SHA-256) with a persisted chain head, a `verify` routine, a read API, and retention that seals a segment anchor before pruning. Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited. `verify` reads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering. `BEFORE UPDATE`/`BEFORE DELETE` triggers reject out-of-band edits as defense-in-depth. The chain is unkeyed and stored in a single writable file, so `verify` detects edits/deletions that did not recompute the chain (casual or accidental tampering); it is not a defence against an attacker who can rewrite the whole file. Off by default (#483)
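The chaining rule `record_hash = sha256(prev_hash + canonical(event))` can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the package's code: it uses `hashlib` instead of OpenSSL EVP, takes sorted-key compact JSON as the canonical form, uses an all-zero genesis hash, and keeps the chain in a plain list. The helper names and the event fields are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed starting value for prev_hash


def canonical(event):
    """Assumed canonical form: compact JSON with sorted keys."""
    return json.dumps(event, sort_keys=True, separators=(",", ":"))


def record_hash(prev_hash, event):
    """record_hash = sha256(prev_hash + canonical(event))"""
    return hashlib.sha256((prev_hash + canonical(event)).encode("utf-8")).hexdigest()


def append(chain, event):
    """Append one row and return the new chain head."""
    prev_hash = chain[-1]["record_hash"] if chain else GENESIS_HASH
    row = {"event": event, "prev_hash": prev_hash,
           "record_hash": record_hash(prev_hash, event)}
    chain.append(row)
    return row["record_hash"]


def verify(chain, head):
    """Recompute every link and compare the result with the stored head."""
    prev_hash = GENESIS_HASH
    for row in chain:
        if row["prev_hash"] != prev_hash:
            return False
        if row["record_hash"] != record_hash(prev_hash, row["event"]):
            return False
        prev_hash = row["record_hash"]
    return prev_hash == head


if __name__ == "__main__":
    chain = []
    head = append(chain, {"fault_code": "MOTOR_OVERHEAT", "to": "PREFAILED"})
    head = append(chain, {"fault_code": "MOTOR_OVERHEAT", "to": "CONFIRMED"})
    print(verify(chain, head))       # intact chain
    print(verify(chain[:-1], head))  # newest row deleted, head kept
```

Because the chain is unkeyed, anyone who can rewrite every row and the head can produce a chain that still verifies, which is the limitation the entry above spells out.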
0.6.0 (2026-06-22)
- Bounded concurrent snapshot capture under fault storms with a `CaptureThreadPool` and configurable capture pool / queue / overflow-policy parameters. The rosbag leg is serialized and the cooldown map is bounded, so a burst of simultaneous faults can no longer exhaust capture threads or grow memory without limit (#456)
- Entity-scoped rosbag capture by default (#431)
- Made rosbag capture enablement crash-safe (#430)
- Contributors: @bburda, @mfaferek93
0.5.0 (2026-06-08)
- `ClearFault` honors the new `skip_correlation_auto_clear` request flag so per-entity fault clears can opt out of cascade-clearing correlated symptom fault codes (#395)
- Three-layer protection against unbounded snapshot growth (bounded buffers plus pruning)
- Concurrency and lifetime hardening: serialize concurrent subscription creation in `SnapshotCapture`, join capture threads in the `FaultManagerNode` destructor, and defense-in-depth shutdown guards to prevent teardown crashes across distros
- Aggregation security hardening and improved test coverage
- Build: adopt the centralized `ROS2MedkitWarnings` and `ROS2MedkitSanitizers` cmake modules and `bugprone/special-member-functions` clang-tidy checks
- Contributors: @bburda
0.4.0 (2026-03-20)
- Per-entity confirmation and healing thresholds via manifest configuration (#269)
- Default rosbag storage format changed from `sqlite3` to `mcap`
- Support for namespaced fault manager nodes - gateway resolves service/topic names when the fault manager runs in a custom namespace
- Build: use shared cmake modules from `ros2_medkit_cmake` package
- Build: centralized clang-tidy configuration
- Contributors: @bburda
0.3.0 (2026-02-27)
- Accurate HIGHEST_SEVERITY reassignment and stale `fault_to_cluster_` cleanup (#221)
- Clean up `pending_clusters_` when fault cleared before `min_count` (#211)
- Multi-distro CI support for ROS 2 Humble, Jazzy, and Rolling (#219, #242)
- Contributors: @bburda, @eclipse0922
0.2.0 (2026-02-07)
- Initial rosdistro release
- Central fault management node with ROS 2 services:
  - ReportFault - report FAILED/PASSED events with debounce filtering
  - GetFaults - query faults with filtering by severity, status, correlation
  - ClearFault - clear/acknowledge faults
- Debounce filtering with configurable thresholds:
  - FAILED events decrement counter, PASSED events increment
  - Configurable confirmation_threshold (default: -1, immediate)
  - Optional healing support (healing_enabled, healing_threshold)
  - Time-based auto-confirmation (auto_confirm_after_sec)
  - CRITICAL severity bypasses debounce
- Dual storage backends:
  - SQLite persistent storage with WAL mode (default)
  - In-memory storage for testing/lightweight deployments
- Snapshot capture on fault confirmation:
  - Topic data captured as JSON with configurable topic resolution
  - Priority: fault_specific > patterns > default_topics
  - Stored in SQLite with indexed fault_code lookup
  - Auto-cleanup on fault clear
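The counter-based debounce rules listed above (FAILED decrements, PASSED increments, a negative confirmation threshold, optional healing, CRITICAL bypass) can be sketched as a small Python state holder. This is a hedged illustration only: the class `DebounceCounter`, the `critical` flag, and the `NONE` and `HEALED` status labels are hypothetical, and details such as counter clamping or reset after healing are not modelled because the text does not specify them.

```python
class DebounceCounter:
    """Illustrative per-fault debounce counter (not the node's implementation)."""

    def __init__(self, confirmation_threshold=-1, healing_enabled=False,
                 healing_threshold=3):
        self.confirmation_threshold = confirmation_threshold
        self.healing_enabled = healing_enabled
        self.healing_threshold = healing_threshold
        self.counter = 0
        self.status = "NONE"  # hypothetical label for "nothing reported yet"

    def report(self, event, critical=False):
        """Apply one FAILED or PASSED event and return the resulting status."""
        if event == "FAILED":
            self.counter -= 1
            # CRITICAL severity bypasses debounce; otherwise wait for the threshold.
            if critical or self.counter <= self.confirmation_threshold:
                self.status = "CONFIRMED"
            elif self.status != "CONFIRMED":
                self.status = "PREFAILED"
        elif event == "PASSED":
            self.counter += 1
            if self.healing_enabled and self.counter >= self.healing_threshold:
                self.status = "HEALED"  # hypothetical status label
        else:
            raise ValueError(f"unknown event type: {event}")
        return self.status


if __name__ == "__main__":
    immediate = DebounceCounter()  # default threshold -1
    print(immediate.report("FAILED"))

    debounced = DebounceCounter(confirmation_threshold=-3)
    print([debounced.report("FAILED") for _ in range(3)])
```

With the default threshold of -1, a single FAILED event drives the counter to -1 and confirms the fault, which matches the "immediate" behaviour described in the list.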
File truncated at 100 lines see the full file
Package Summary
| Version | 0.6.0 |
| License | Apache-2.0 |
| Build type | AMENT_CMAKE |
| Use | RECOMMENDED |
Repository Summary
| Checkout URI | https://github.com/selfpatch/ros2_medkit.git |
| VCS Type | git |
| VCS Version | main |
| Last Updated | 2026-07-03 |
| Dev Status | DEVELOPED |
| Released | RELEASED |
| Contributing |
Help Wanted (-)
Good First Issues (-) Pull Requests to Review (-) |
Package Description
Maintainers
- bburda
Authors
ros2_medkit_fault_manager
Central fault manager node for the ros2_medkit fault management system.
Overview
The FaultManager node provides a central point for fault aggregation and lifecycle management.
It receives fault reports from multiple sources, aggregates them by fault_code, and provides
query and clearing interfaces.
Quick Start
By default, faults are confirmed immediately when reported - no additional configuration needed.
# Start the fault manager
ros2 launch ros2_medkit_fault_manager fault_manager.launch.py
# Report a fault - it's immediately CONFIRMED
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
"{fault_code: 'MOTOR_OVERHEAT', event_type: 0, severity: 2, description: 'Motor temp exceeded', source_id: '/motor_node'}"
# Query faults
ros2 service call /fault_manager/list_faults ros2_medkit_msgs/srv/ListFaults \
"{statuses: ['CONFIRMED']}"
# Clear a fault (cascade-clears correlated symptoms by default)
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
"{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: false}"
# Clear without touching correlated symptoms
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
"{fault_code: 'MOTOR_OVERHEAT', skip_correlation_auto_clear: true}"
Note: The
skip_correlation_auto_clearrequest field was added post-0.4.0. Adding a request field changes the service type hash, so callers built againstros2_medkit_msgs0.4.0 or earlier must rebuild to keep talking tofault_manager.
Services
| Service | Type | Description |
|---|---|---|
~/report_fault |
ros2_medkit_msgs/srv/ReportFault |
Report a fault occurrence |
~/list_faults |
ros2_medkit_msgs/srv/ListFaults |
Query faults with filtering |
~/clear_fault |
ros2_medkit_msgs/srv/ClearFault |
Clear/acknowledge a fault |
~/get_snapshots |
ros2_medkit_msgs/srv/GetSnapshots |
Get topic snapshots for a fault |
Features
-
Multi-source aggregation: Same
fault_codefrom different sources creates a single fault - Occurrence tracking: Counts total reports and tracks all reporting sources
- Severity escalation: Fault severity is updated if a higher severity is reported
- Persistent storage: SQLite backend ensures faults survive node restarts
- Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
- Snapshot capture: Captures topic data when faults are confirmed for debugging (snapshots are deleted when fault is cleared)
-
Freeze-frame retention: One compact JSON freeze-frame per fault code, retained across
clear_fault(see below) - Fault correlation (optional): Root cause analysis with symptom muting and auto-clear
- Tamper-evident audit log (optional): Append-only, hash-chained record of fault state transitions for verifiable history
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
storage_type |
string | "sqlite" |
Storage backend: "sqlite" or "memory"
|
database_path |
string | "/var/lib/ros2_medkit/faults.db" |
Path to SQLite database file |
confirmation_threshold |
int | -1 |
Counter value at which faults are confirmed |
healing_enabled |
bool | false |
Enable automatic healing via PASSED events |
healing_threshold |
int | 3 |
Counter value at which faults are healed |
auto_confirm_after_sec |
double | 0.0 |
Auto-confirm PREFAILED faults after timeout (0 = disabled) |
entity_thresholds.config_file |
string | "" |
Path to YAML file with per-entity debounce threshold overrides |
Snapshot Parameters
Snapshots capture topic data when faults are confirmed for post-mortem debugging.
Each confirm also writes a freeze-frame: a single compact JSON object mapping every captured topic to its value at confirmation time, keyed by fault code. It differs from per-topic snapshots in two ways: snapshots are deleted when the fault is cleared, while the freeze-frame is retained across clear_fault (once the snapshots are gone, ~/get_fault serves the retained frame so the confirmed-state record stays available after acknowledgement); and a re-confirm that captures nothing (e.g. source publishers down) never overwrites an existing non-empty frame. A fault code with no configured capture set gets no freeze-frame row; a configured capture that samples nothing on its first run records an empty {} frame. Freeze-frame storage is bounded by the number of distinct fault codes (one row per code, replaced in place) and rows are never evicted.
Under a fault storm, captures are bounded by a worker pool (capture_pool_size) draining a bounded queue (capture_queue_depth); excess captures are dropped per capture_queue_full_policy and logged (throttled). The pool is shared and is created when snapshots or rosbag is enabled, so these parameters bound both. capture_pool_size parallelizes freeze-frame snapshot capture only - rosbag is single-writer and records one fault at a time regardless of pool size.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `snapshots.enabled` | bool | `true` | Enable/disable snapshot capture |
| `snapshots.background_capture` | bool | `false` | Use background subscriptions (caches latest message) vs on-demand capture |
| `snapshots.timeout_sec` | double | `1.0` | Timeout waiting for topic message (on-demand mode) |
| `snapshots.max_message_size` | int | `65536` | Maximum message size in bytes (larger messages skipped) |
| `snapshots.default_topics` | string[] | `[]` | Topics to capture for all faults |
| `snapshots.config_file` | string | `""` | Path to YAML config for `fault_specific` and `patterns` |
| `snapshots.recapture_cooldown_sec` | double | `60.0` | Min seconds between captures for the same fault code. |
| `snapshots.max_per_fault` | int | `10` | Max snapshots retained per fault. |
| `snapshots.capture_pool_size` | int | `2` | Max concurrent capture threads under a fault storm (>= 1). Parallelizes snapshot capture only; rosbag stays single-writer. |
| `snapshots.capture_queue_depth` | int | `16` | Max pending captures before the full-queue policy applies (>= 1). |
| `snapshots.capture_queue_full_policy` | string | `reject_newest` | Policy when the queue is full: `reject_newest` or `drop_oldest`. |
Topic Resolution Priority (highest first):
- `fault_specific` - Exact match for fault code (configured via YAML config file)
- `patterns` - Regex pattern match (configured via YAML config file)
- `default_topics` - Fallback for all faults
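The priority order can be sketched as a short lookup function. This is an assumption-laden illustration: the function name `resolve_topics`, the in-memory shapes of `fault_specific` and `patterns`, the choice of a full-string regex match, and the rule that the first matching pattern wins are not taken from the package, and the fault codes and topics are made up.

```python
import re


def resolve_topics(fault_code, fault_specific, patterns, default_topics):
    """Pick the topics to capture for a fault code, highest-priority source first."""
    # 1. Exact match on the fault code.
    if fault_code in fault_specific:
        return list(fault_specific[fault_code])
    # 2. First regex pattern that matches the whole fault code (assumed semantics).
    for pattern, topics in patterns:
        if re.fullmatch(pattern, fault_code):
            return list(topics)
    # 3. Fallback for every other fault.
    return list(default_topics)


fault_specific = {"MOTOR_OVERHEAT": ["/motor/temperature"]}
patterns = [(r"MOTOR_.*", ["/motor/status"])]
default_topics = ["/diagnostics"]

print(resolve_topics("MOTOR_OVERHEAT", fault_specific, patterns, default_topics))
print(resolve_topics("MOTOR_STALL", fault_specific, patterns, default_topics))
print(resolve_topics("BATTERY_LOW", fault_specific, patterns, default_topics))
```

Note that an exact `fault_specific` entry shadows any pattern that would also match the same code.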
Example YAML config file (snapshots.yaml):
```yaml
fault_specific:
```
File truncated at 100 lines; see the full file.
Changelog for package ros2_medkit_fault_manager
Forthcoming
- Optional append-only, hash-chained audit log of fault state transitions: each transition appends one immutable row (`record_hash = sha256(prev_hash + canonical(event))` via OpenSSL EVP SHA-256) with a persisted chain head, a `verify` routine, a read API, and retention that seals a segment anchor before pruning. Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited. `verify` reads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering. `BEFORE UPDATE`/`BEFORE DELETE` triggers reject out-of-band edits as defense-in-depth. The chain is unkeyed and stored in a single writable file, so `verify` detects edits/deletions that did not recompute the chain (casual or accidental tampering); it is not a defence against an attacker who can rewrite the whole file. Off by default (#483)
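The chaining formula in this entry can be illustrated with a few lines of Python. This is a sketch of the general technique only: the all-zero genesis value, the sorted-key JSON canonicalization, the row layout, and the event fields are assumptions, and the package itself is described as using OpenSSL EVP SHA-256 rather than `hashlib`.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash for the first row


def canonical(event):
    """Deterministic serialization so the same event always hashes the same way."""
    return json.dumps(event, sort_keys=True, separators=(",", ":"))


def record_hash(prev_hash, event):
    """record_hash = sha256(prev_hash + canonical(event))."""
    return hashlib.sha256((prev_hash + canonical(event)).encode("utf-8")).hexdigest()


def append(chain, event):
    """Append one row and return the new chain head."""
    prev = chain[-1]["record_hash"] if chain else GENESIS
    chain.append({"event": event, "prev_hash": prev, "record_hash": record_hash(prev, event)})
    return chain[-1]["record_hash"]


def verify(chain, head):
    """Recompute every link and compare the final hash with the stored head."""
    prev = GENESIS
    for row in chain:
        if row["prev_hash"] != prev or row["record_hash"] != record_hash(prev, row["event"]):
            return False
        prev = row["record_hash"]
    return prev == head


chain = []
head = append(chain, {"fault_code": "MOTOR_OVERHEAT", "from": "PREFAILED", "to": "CONFIRMED"})
head = append(chain, {"fault_code": "MOTOR_OVERHEAT", "from": "CONFIRMED", "to": "CLEARED"})
print(verify(chain, head))       # intact chain
print(verify(chain[:-1], head))  # newest row deleted, head kept
```

Comparing against a separately stored head is what lets a truncated chain be flagged. As the entry notes, the chain is unkeyed, so anyone able to rewrite every row and the head can still produce a chain that verifies.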
0.6.0 (2026-06-22)
- Bounded concurrent snapshot capture under fault storms with a `CaptureThreadPool` and configurable capture pool / queue / overflow-policy parameters. The rosbag leg is serialized and the cooldown map is bounded, so a burst of simultaneous faults can no longer exhaust capture threads or grow memory without limit (#456)
- Entity-scoped rosbag capture by default (#431)
- Made rosbag capture enablement crash-safe (#430)
- Contributors: @bburda, @mfaferek93
0.5.0 (2026-06-08)
- `ClearFault` honors the new `skip_correlation_auto_clear` request flag so per-entity fault clears can opt out of cascade-clearing correlated symptom fault codes (#395)
- Three-layer protection against unbounded snapshot growth (bounded buffers plus pruning)
- Concurrency and lifetime hardening: serialize concurrent subscription creation in `SnapshotCapture`, join capture threads in the `FaultManagerNode` destructor, and defense-in-depth shutdown guards to prevent teardown crashes across distros
- Aggregation security hardening and improved test coverage
- Build: adopt the centralized `ROS2MedkitWarnings` and `ROS2MedkitSanitizers` cmake modules and `bugprone`/`special-member-functions` clang-tidy checks
- Contributors: @bburda
0.4.0 (2026-03-20)
- Per-entity confirmation and healing thresholds via manifest configuration (#269)
- Default rosbag storage format changed from `sqlite3` to `mcap`
- Support for namespaced fault manager nodes - gateway resolves service/topic names when the fault manager runs in a custom namespace
- Build: use shared cmake modules from `ros2_medkit_cmake` package
- Build: centralized clang-tidy configuration
- Contributors: @bburda
0.3.0 (2026-02-27)
- Accurate HIGHEST_SEVERITY reassignment and stale `fault_to_cluster_` cleanup (#221)
- Clean up `pending_clusters_` when fault cleared before `min_count` (#211)
- Multi-distro CI support for ROS 2 Humble, Jazzy, and Rolling (#219, #242)
- Contributors: @bburda, @eclipse0922
0.2.0 (2026-02-07)
- Initial rosdistro release
- Central fault management node with ROS 2 services:
  - ReportFault - report FAILED/PASSED events with debounce filtering
  - GetFaults - query faults with filtering by severity, status, correlation
  - ClearFault - clear/acknowledge faults
- Debounce filtering with configurable thresholds:
  - FAILED events decrement counter, PASSED events increment
  - Configurable confirmation_threshold (default: -1, immediate)
  - Optional healing support (healing_enabled, healing_threshold)
  - Time-based auto-confirmation (auto_confirm_after_sec)
  - CRITICAL severity bypasses debounce
- Dual storage backends:
  - SQLite persistent storage with WAL mode (default)
  - In-memory storage for testing/lightweight deployments
- Snapshot capture on fault confirmation:
  - Topic data captured as JSON with configurable topic resolution
  - Priority: fault_specific > patterns > default_topics
  - Stored in SQLite with indexed fault_code lookup
  - Auto-cleanup on fault clear
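The debounce rules listed under 0.2.0 can be sketched as a signed counter. This is a rough model under stated assumptions, not the node's implementation: the class `DebounceCounter`, the `HEALED` status name, the comparison directions, and the absence of counter clamping are guesses, and time-based auto-confirmation is left out.

```python
class DebounceCounter:
    """Hypothetical counter-based debounce for a single fault code."""

    def __init__(self, confirmation_threshold=-1, healing_enabled=False, healing_threshold=3):
        self.confirmation_threshold = confirmation_threshold
        self.healing_enabled = healing_enabled
        self.healing_threshold = healing_threshold
        self.counter = 0
        self.status = None  # no fault reported yet

    def report(self, event, critical=False):
        """Apply one FAILED or PASSED event and return the resulting status."""
        if event == "FAILED":
            self.counter -= 1  # FAILED events decrement the counter
            if critical or self.counter <= self.confirmation_threshold:
                self.status = "CONFIRMED"  # CRITICAL severity bypasses debounce
            elif self.status != "CONFIRMED":
                self.status = "PREFAILED"
        elif event == "PASSED":
            self.counter += 1  # PASSED events increment the counter
            if (self.healing_enabled and self.status is not None
                    and self.counter >= self.healing_threshold):
                self.status = "HEALED"  # assumed status name
        else:
            raise ValueError("event must be 'FAILED' or 'PASSED'")
        return self.status


immediate = DebounceCounter()        # threshold -1: first FAILED confirms
print(immediate.report("FAILED"))

debounced = DebounceCounter(confirmation_threshold=-3)
print([debounced.report("FAILED") for _ in range(3)])
```

With the default threshold of -1 the first FAILED report already reaches the threshold, which matches the "immediate" behaviour noted in the list.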
File truncated at 100 lines see the full file