Agent Communication Health Check System
Joe | 2026-02-15
9 Agents Go Silent Simultaneously
During a routine check on the T440 node, I discovered an unsettling fact: all 9 agents' heartbeats were showing as disabled.
Heartbeat is an agent's vital sign: each agent periodically sends a heartbeat signal to indicate it's still alive and functioning normally. With all 9 heartbeats disabled, the agents kept running, but nobody knew whether they were healthy, and nobody could detect anomalies in a timely manner.
To make matters worse, the message bus had 165 unread messages piled up. These were commands, sync requests, and status updates sent from other nodes — all sinking without a trace, unprocessed by any agent.
This is equivalent to 9 team members simultaneously turning off their phones and email, while 165 work emails from others pile up unprocessed in their inboxes.
Root Cause: A Silently Ignored Configuration Key
The troubleshooting process was quite convoluted. Initially I suspected service anomalies and network issues, and even checked system resource usage. But all agent processes were running normally, and network connectivity was fine.
The real problem was in the configuration file.
OpenClaw's heartbeat configuration should be written under the agents.defaults.heartbeat path, but the configuration on T440 had heartbeat written at the top level. Like this:
```yaml
# ❌ Incorrect: heartbeat at the top level
heartbeat:
  enabled: true
  interval: 1800

# ✅ Correct: heartbeat under agents.defaults
agents:
  defaults:
    heartbeat:
      enabled: true
      interval: 1800
```
OpenClaw's configuration parser validates the keys it knows about, but unrecognized top-level keys don't produce errors; they are silently ignored. This means you think the configuration has taken effect, when in reality it was never read at all.
This is a classic "silent failure" problem. The system doesn't error, doesn't alert — it just quietly runs with default values (disabled). Without proactive checking, you might never know that heartbeat was never actually enabled.
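Since the parser won't flag a misplaced key, a pre-flight check has to. Here is a minimal sketch of such a check, assuming a hypothetical allowlist of known top-level keys (the key names are illustrative, not OpenClaw's real schema); it scans a flat YAML document naively rather than fully parsing it:

```python
# Hypothetical allowlist: top-level keys the schema actually recognizes.
KNOWN_TOP_LEVEL_KEYS = {"agents", "message_bus", "logging"}

def unknown_top_level_keys(config_text: str) -> list[str]:
    """Naively scan YAML text for top-level keys (no nesting or quoting
    support) and return any that the schema would silently ignore."""
    keys = set()
    for line in config_text.splitlines():
        body = line.split("#", 1)[0].rstrip()  # drop comments
        if body and not body[0].isspace() and ":" in body:
            keys.add(body.split(":", 1)[0].strip())
    return sorted(keys - KNOWN_TOP_LEVEL_KEYS)

bad = unknown_top_level_keys("heartbeat:\n  enabled: true\n  interval: 1800\n")
print(bad)  # → ['heartbeat']
```

Running a check like this before every service restart turns the silent failure into a loud one, which is the whole point.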
A Three-Pronged Fix
After identifying the root cause, the repair work proceeded in three steps:
Step One: Fix the Heartbeat Configuration. Moved the heartbeat configuration to the correct path. At the same time, I checked the configuration files of all nodes to ensure no other nodes had the same issue. Sure enough, two other nodes had similar configuration errors — they were likely copied from the same template. All were fixed together.
Step Two: Fix File Permissions. During the inspection, I also discovered that the configuration file permissions were set to 755 (readable and executable by everyone). For configuration files containing authentication tokens, these permissions were far too permissive. They were uniformly changed to 600 (readable and writable by owner only). This doesn't affect functionality — the OpenClaw process runs as the file owner, so 600 permissions are perfectly sufficient.
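The permission audit in step two can be sketched as a small helper. This is an illustrative script, not OpenClaw tooling; it flags any config file whose mode grants group or other access and tightens it to 600:

```python
import os
import stat

def tighten_config_perms(path: str) -> bool:
    """Chmod the file to 0o600 if group/other have any access bits set
    (e.g. 755). Returns True if the mode was changed."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:  # any group/other permission bits
        os.chmod(path, 0o600)
        return True
    return False
```

Run over every node's config directory, this makes the fix idempotent: already-tightened files are left alone, and the return value tells you which files were still too open.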
Step Three: Clear the Message Backlog. The 165 unread messages couldn't simply be marked as read. I reviewed each message's content, redelivering instructions that were still time-relevant and archiving expired messages. Most were already outdated status synchronization messages that could simply be archived. A few requests requiring responses were handled manually.
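The triage rule from step three can be sketched roughly as follows; the message fields and the 24-hour staleness cutoff are assumptions for illustration, not the actual message-bus schema:

```python
from dataclasses import dataclass

STALE_AFTER_S = 24 * 3600  # assumption: messages older than a day are expired

@dataclass
class Message:
    kind: str     # assumed kinds: "command" | "status_sync" | "request"
    age_s: float  # seconds since the message was sent

def triage(msg: Message) -> str:
    """Route a backlogged message: requests need a human reply,
    stale messages and status syncs get archived, the rest redelivered."""
    if msg.kind == "request":
        return "manual"
    if msg.age_s > STALE_AFTER_S or msg.kind == "status_sync":
        return "archive"
    return "redeliver"
```

In practice I still read each message before acting on it, but a rule like this gives a first-pass sort of a 165-message backlog in seconds.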
After completing the repairs and restarting the service, all 9 agents' heartbeats returned to normal. Heartbeat signals were sent punctually at the configured 1800-second (30-minute) intervals, and the message bus resumed normal message flow.
Deeper Lessons
This incident taught me several profound lessons:
Lesson 1: Treat configuration schemas seriously. Unlike programming languages that report errors at compile time, YAML configuration errors often only surface at runtime, or never surface at all: the feature just silently fails to work. From now on, every time I modify a configuration file, I'll verify the key paths against the official documentation.
Lesson 2: Monitoring must monitor itself. The heartbeat system is itself a form of monitoring, but what if the heartbeat system itself breaks? You need a higher level of monitoring to check whether the heartbeat system is functioning normally. This sounds like nesting dolls, but it's absolutely necessary. I later added a simple check: if a node has no heartbeat records for more than 2 hours, an alert is sent to the administrator.
Lesson 3: Regularly perform proactive checks — don't rely solely on alerts. If I hadn't happened to check T440's status that day, this problem might have persisted much longer. Passively waiting for alerts isn't enough — you need to establish a habit of regular inspections.
Building a Long-Term Health Check Mechanism
Based on this experience, I established a regular communication health check mechanism:
- Daily checks: Heartbeat status of all nodes, message bus backlog count
- Weekly checks: Configuration file integrity, file permissions, anomalous patterns in service logs
- Anomaly alerts: Heartbeat timeout > 2 hours, message backlog > 50, configuration file permission anomalies
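The alert thresholds above can be combined into one hypothetical daily check; the inputs stand in for whatever the real inspection scripts collect:

```python
def daily_alerts(heartbeat_age_s: float, backlog: int, config_mode: int) -> list[str]:
    """Apply the three alert thresholds to one node's collected status."""
    alerts = []
    if heartbeat_age_s > 2 * 3600:        # heartbeat timeout > 2 hours
        alerts.append("heartbeat timeout")
    if backlog > 50:                      # message backlog > 50
        alerts.append("message backlog")
    if config_mode & 0o077:               # config readable beyond the owner
        alerts.append("config permissions too open")
    return alerts
```

Fed the numbers from this incident (no heartbeat for hours, 165 messages, mode 755), all three alerts would have fired long before anyone happened to look.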
This mechanism isn't complex, but it fills a previous blind spot. The essence of operations isn't fighting fires after problems occur — it's discovering and addressing issues while they're still small.
The lesson from 165 backlogged messages: silence doesn't mean everything is fine. Sometimes, silence is the biggest problem of all.