📅 2026-02-09 · TechsFree AI Team

Building High-Availability Failover: 90-Second Auto-Takeover

After moving OpenClaw to a dedicated PC, one question immediately surfaced: what if PC-A goes down? All bots become unreachable, all agents stop. This post documents how I built a simple but effective failover mechanism using PC-B.

Design Philosophy

The core of high availability is simple: have a backup node that automatically takes over when the primary fails.

PC-A (192.168.x.x): Primary node, running OpenClaw gateway normally
PC-B (192.168.x.x): Backup node, running monitoring script, standing by, takes over on failure

Goal: auto-takeover within 90 seconds of primary failure; auto-release when primary recovers.

failover-monitor.sh

The core is a bash script running on PC-B:

# Pseudocode
while true; do
    if PC-A port 18789 is reachable; then
        fail_count=0
        if local gateway is running; then
            stop local gateway   # Primary recovered, release
        fi
    else
        fail_count++
        if fail_count >= 3; then
            start local gateway  # 3 consecutive failures, take over
        fi
    fi
    sleep 30
done

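The script checks every 30 seconds and triggers takeover only after 3 consecutive failures. The first failed check can land anywhere from 0 to 30 seconds after the primary actually dies, so takeover happens roughly 60 to 90 seconds after the failure: 90 seconds is the upper bound, not the minimum. This design prevents false switches from network jitter.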
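The pseudocode above can be turned into a runnable sketch along these lines. This is one possible shape, not necessarily the exact script: the variable names, the monitor_step and run_monitor functions, and the extra "already running" guard before starting the gateway are my assumptions, and PRIMARY_HOST is a placeholder that has to be set to PC-A's real address.

```bash
#!/usr/bin/env bash
# Sketch of the monitor loop. Thresholds mirror the post; names are illustrative.
PRIMARY_HOST="${PRIMARY_HOST:-192.168.x.x}"   # placeholder, set to PC-A's address
PORT=18789
MAX_FAILS=3
INTERVAL=30
fail_count=0

check_primary()       { timeout 5 bash -c "echo > /dev/tcp/$PRIMARY_HOST/$PORT" 2>/dev/null; }
check_local_running() { ss -tln | grep -q ":$PORT "; }
start_gateway()       { openclaw gateway start; }
stop_gateway()        { openclaw gateway stop; }

# One loop iteration: update fail_count, then start or stop the local gateway.
monitor_step() {
    if check_primary; then
        fail_count=0
        # Primary is reachable again: release if we had taken over.
        if check_local_running; then stop_gateway; fi
    else
        fail_count=$((fail_count + 1))
        # Take over only after MAX_FAILS consecutive failures, and only once
        # (the "not already running" guard is an addition to the pseudocode).
        if (( fail_count >= MAX_FAILS )) && ! check_local_running; then
            start_gateway
        fi
    fi
}

run_monitor() {
    while true; do
        monitor_step
        sleep "$INTERVAL"
    done
}
```

Calling run_monitor at the end of the script starts the loop. Splitting a single iteration out into monitor_step makes the counting logic easy to exercise with stubbed checks, without waiting on real sleeps.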

Pitfall: pgrep -f Matching Trap

Initially I used pgrep -f "openclaw gateway" to check if the local gateway was running. Seemed fine, right?

Dead wrong. OpenClaw agents can execute shell commands, and pgrep -f matches its pattern against each process's full command line. When an agent runs any command whose command line contains "openclaw gateway", pgrep -f matches that temporary process and falsely reports the gateway as running.

Solution: Switch to ss -tlnp | grep :18789 to check port listening. Port checks are far more reliable than process name matching.

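# Return 0 if something is listening on the gateway port on this machine.
check_local_running() {
    ss -tlnp | grep -q ":18789 "
}

# Return 0 if PC-A accepts a TCP connection on the gateway port within 5 seconds.
check_primary() {
    timeout 5 bash -c "echo > /dev/tcp/192.168.x.x/18789" 2>/dev/null
}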
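If the grep on ":18789 " ever feels too loose, a slightly stricter variant is to look only at the local-address column of LISTEN rows. The sketch below is an optional refinement rather than what the original script does. The helper name port_listening_in is made up for illustration, and it assumes the usual ss -tln column layout, where the fourth field is the local address and port.

```bash
# Read `ss -tln` output on stdin; succeed if any LISTEN row's local
# address (4th column) ends in ":<port>".
port_listening_in() {
    local port=$1
    awk -v p=":$port" '
        $1 == "LISTEN" {
            n = length($4); m = length(p)
            if (n >= m && substr($4, n - m + 1) == p) found = 1
        }
        END { exit !found }
    '
}

# Drop-in alternative to the grep-based check.
check_local_running() {
    ss -tln | port_listening_in 18789
}
```

Because the parser reads from stdin, it can be tried against saved ss output without a real listener on the port.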

Pitfall: Can't Stop Yourself

The script runs as a systemd service with Restart=always. When the primary recovers, the script needs to stop the local gateway. But if gateway and monitor have tangled systemd dependencies, stopping the gateway might cascade to the monitor.

Solution: Keep the monitor script and gateway service completely independent. Use openclaw gateway start/stop directly instead of managing through systemd.
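To make the "completely independent" part concrete, here is a rough sketch of a helper that writes a unit file for the monitor with no dependency on any gateway unit. The unit name, both default paths, the RestartSec value, and the write_monitor_unit helper are all hypothetical. Only Restart=always comes from the setup described above.

```bash
# Write a systemd unit for the monitor. Paths and names are placeholders.
write_monitor_unit() {
    local unit_path=${1:-/etc/systemd/system/failover-monitor.service}
    local script_path=${2:-/usr/local/bin/failover-monitor.sh}
    cat > "$unit_path" <<EOF
[Unit]
Description=OpenClaw failover monitor (runs on PC-B)
After=network-online.target
Wants=network-online.target
# Deliberately no Requires=, BindsTo= or PartOf= pointing at a gateway unit,
# so stopping the gateway can never cascade to this monitor.

[Service]
ExecStart=$script_path
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
}
```

One thing to verify on a real install: systemd's default KillMode=control-group kills every process left in a unit's cgroup when that unit stops. If openclaw gateway start leaves the gateway running as a child of the monitor, restarting the monitor would take the gateway down with it. Whether that applies depends on how OpenClaw launches the gateway, which I have not confirmed here.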

Test Results

| Scenario | Duration |
|----------|----------|
| Primary failure → Backup takeover | ~65 seconds |
| Primary recovery → Backup release | ~30 seconds |

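The 65-second takeover is faster than the 90-second worst case because the failure rarely lands right after a check. Three consecutive failed checks span 60 seconds, so the total is 60 seconds plus however long the primary had already been down when the first failed check ran (about 5 seconds in this test, ignoring the time each check itself takes).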

Release takes about 30 seconds because recovery needs only one successful check. Recovery is good news and can be handled optimistically; failure is bad news and needs pessimistic confirmation.
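The takeover arithmetic can be written out as a tiny helper. This is only a back-of-the-envelope model: the function name is made up, and it assumes each check completes instantly, which is not quite true when a check runs into its 5-second timeout.

```bash
# Seconds from primary failure to takeover.
# $1: gap between the failure and the next check (0..interval)
# $2: check interval in seconds (30 in this setup)
# $3: consecutive failures required (3 in this setup)
takeover_delay() {
    local offset=$1 interval=${2:-30} fails=${3:-3}
    echo $(( offset + (fails - 1) * interval ))
}

takeover_delay 0    # best case: prints 60
takeover_delay 30   # worst case: prints 90
takeover_delay 5    # prints 65, in line with the measured takeover
```

Under the same assumption, release takes at most one interval, since a single successful check is enough.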

Imperfect but Sufficient

There are limitations: session context is lost during switchover, config sync is manual, and the monitor only checks whether the port is open, with no deeper health verification.

But for a personal project, this is more than enough. The art of engineering lies in finding the balance between perfection and practicality.

This little monitor script taught me something fundamental: reliability comes not from single-point perfection, but from system-level redundancy.
