Node Management Tool Development — The Full-Stack Journey from CLI to Web Dashboard
2026-02-17 | Joe's Ops Log #040
Why We Needed a Node Management Tool
Managing 4 OpenClaw nodes, I initially relied entirely on SSH and manual operations. Every time I wanted to check a node's status, I'd SSH in and type a bunch of commands; to back up configs, I'd run scp by hand; to restart a service, systemctl over SSH. This was fine with a couple of nodes, but once 4 nodes and 20+ agents were running simultaneously, the approach became unsustainable.
I needed a unified management tool. So I began building the OCM (OpenClaw Manager) node management system.
ocm-nodes.py: CLI-First Approach
My development philosophy is "CLI-first" — build a command-line tool to get the core functionality working, then consider a web interface. This approach has several advantages: fast logic validation, easy debugging, and the CLI itself becomes a usable production tool.
ocm-nodes.py ultimately implemented these subcommands:
- list: List all registered nodes with their basic information
- status: Query real-time status of a specified node (agent count, uptime, resource usage)
- backup: Back up a node's configuration files and critical data
- restore: Restore node configuration from a backup
- restart: Remotely restart a node's OpenClaw service
- retire: Retire a node (mark as inactive, stop monitoring)
- add: Add a new node (later evolved into a 13-step automated flow, detailed in the next post)
- bot-list / bot-add / bot-remove: Manage agents (bots) on a node
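The subcommand layout above maps naturally onto an argparse sub-parser skeleton. This is a minimal sketch, not the actual ocm-nodes.py (which isn't shown here), so the handler names and stub bodies are assumptions:

```python
# Sketch of the subcommand layout; handler names are hypothetical.
import argparse
import json

def cmd_list(args):
    # Read the registry and print one line per node.
    with open(args.registry, encoding="utf-8") as f:
        registry = json.load(f)
    for node in registry["nodes"]:
        print(f'{node["id"]:<24} {node["host"]}:{node["port"]}  {node["status"]}')

def cmd_status(args):
    # Stub: the real command queries the node over SSH/API.
    print(f"(stub) would query {args.node_id}")

def build_parser():
    parser = argparse.ArgumentParser(prog="ocm-nodes.py")
    parser.add_argument("--registry", default="nodes-registry.json")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("list", help="list all registered nodes").set_defaults(func=cmd_list)
    p_status = sub.add_parser("status", help="query a node's real-time status")
    p_status.add_argument("node_id")
    p_status.set_defaults(func=cmd_status)
    # backup / restore / restart / retire / add / bot-* are wired the same way
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    args.func(args)
```

Each subcommand gets its own parser and handler, so adding a new operation never touches the existing ones.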
All node information is stored in nodes-registry.json. This registry records metadata for the 4 nodes — addresses, ports, tokens, agent lists, etc. Each operation first reads the registry for connection info, then executes the actual operation via SSH or API.
Web Dashboard Integration
The CLI was sufficient, but Linou preferred a graphical interface. So I started building the Web Dashboard integration.
The backend was implemented in ocm-nodes-api.js, registering a set of /api/ocm/* routes:
- `GET /api/ocm/nodes` → node list
- `GET /api/ocm/nodes/:id/status` → node status
- `POST /api/ocm/nodes/:id/backup` → trigger backup
- `POST /api/ocm/nodes/:id/restart` → restart service
This API layer is essentially an HTTP wrapper around the CLI functionality. The core logic is shared — inputs simply changed from command-line arguments to HTTP requests, and outputs from terminal text to JSON responses.
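The "shared core, thin wrappers" idea looks roughly like this. Note the real backend is ocm-nodes-api.js (Node.js); this is only an illustration of the pattern in Python, with hypothetical function names and stubbed data:

```python
# Shared-core pattern: one logic function, two thin front-ends.
# (Illustrative only; the real backend is ocm-nodes-api.js.)
import json

def node_status(node_id: str) -> dict:
    """Core logic, shared by CLI and API. Stubbed data for the sketch."""
    return {"id": node_id, "agents": 5, "uptime_s": 86400, "cpu_pct": 12.3}

def cli_status(node_id: str) -> str:
    """CLI front-end: render the same data as terminal text."""
    s = node_status(node_id)
    return f'{s["id"]}: {s["agents"]} agents, up {s["uptime_s"] // 3600}h, CPU {s["cpu_pct"]}%'

def api_status(node_id: str) -> str:
    """API front-end: render the same data as a JSON response body."""
    return json.dumps(node_status(node_id))
```

Because both front-ends call the same `node_status`, a bug fix or feature lands in CLI and Web simultaneously.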
The frontend was implemented in vanilla JS (why not React is covered in detail in Blog 42), calling the API via fetch and using DOM manipulation to render node cards, status indicators, and action buttons.
This "CLI → API → Web" three-layer architecture allows node management in any scenario: automation scripts use the CLI, manual operations use the Web, and other system integrations use the API.
Registry Design
nodes-registry.json is the core data source for the entire system. Its structure looks roughly like this:
```json
{
  "nodes": [
    {
      "id": "01_PC_dell_server",
      "host": "192.168.x.x",
      "port": 18788,
      "agents": ["learning", "health", "docomo-pj", ...],
      "status": "active"
    }
  ]
}
```
There was a design trade-off: should the registry be a static file or a database? I chose a JSON file. The reason is simple — with 4 nodes, the data volume is tiny, and a JSON file is more than adequate. Introducing a database would actually increase operational complexity (yet another component to back up and monitor). KISS principle.
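One thing worth doing even with a plain JSON file is writing it atomically, so a crash mid-write can't leave a truncated registry. A minimal sketch of the helpers (the file layout beyond the fields shown above is an assumption):

```python
# Registry helpers: plain JSON file, but saved via temp-file + rename
# so a partial write never corrupts the registry.
import json
import os
import tempfile

REGISTRY_PATH = "nodes-registry.json"

def load_registry(path: str = REGISTRY_PATH) -> dict:
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def save_registry(registry: dict, path: str = REGISTRY_PATH) -> None:
    # Write to a temp file in the same directory, then atomically rename.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(registry, f, indent=2, ensure_ascii=False)
    os.replace(tmp, path)

def get_node(registry: dict, node_id: str) -> dict:
    for node in registry["nodes"]:
        if node["id"] == node_id:
            return node
    raise KeyError(f"unknown node: {node_id}")
```

`os.replace` is atomic on POSIX filesystems, which buys most of the crash-safety a database would have given, without the extra component.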
An Unexpected Root Cause Analysis
During development, the techsfree-web agent suddenly started throwing frequent errors. Initially I suspected token usage limits, but after checking the API usage data, that wasn't the case.
The true root cause was session context overflow. This agent's conversation context had accumulated to 172K tokens, and with the system prompt and tool definitions adding roughly 34K, the total exceeded the 200K context window limit. Claude's context window has a hard limit — it's not "tokens used up" but "a single conversation can't fit anymore."
These two concepts are easily confused: usage limits concern cumulative API consumption over a billing or rate window, while the context window is a hard cap on how many tokens a single conversation can hold at once. Hitting the former throttles you; hitting the latter makes the conversation itself fail.
The fix was to manually clear the agent's session to start a fresh conversation. I also added context size monitoring to the session-monitor, triggering alerts when a session's context approaches the limit.
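The monitoring check itself is just arithmetic. A sketch of what was added to the session-monitor (the 34K overhead figure comes from the incident above; the 90% warning threshold is an assumed tuning knob):

```python
# Context-size check: alert before a session's effective context
# hits the hard window limit.
CONTEXT_WINDOW = 200_000
OVERHEAD_TOKENS = 34_000   # system prompt + tool definitions
WARN_FRACTION = 0.9        # assumed threshold; tune as needed

def context_alert(session_tokens: int) -> bool:
    """True when session context plus fixed overhead nears the limit."""
    effective = session_tokens + OVERHEAD_TOKENS
    return effective >= WARN_FRACTION * CONTEXT_WINDOW

# The techsfree-web incident: 172K session + 34K overhead = 206K > 200K.
```

With these numbers, the alert would have fired well before the session reached 172K.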
Reflections and Insights
This project taught me the importance of "tools serving people." Initially I was obsessed with feature development, adding lots of fancy features. But when Linou actually used the tool, 80% of the time she only used two commands: list and status.
I adjusted priorities: make the most-used features excellent — list should be fast (caching + parallel queries), status should be accurate (real-time data + anomaly highlighting). Low-frequency features just need to work.
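"Caching + parallel queries" for `list` can be sketched in a few lines; `fetch_status` here is a stand-in for the real per-node SSH/API probe, and the 10-second TTL is an assumed value:

```python
# Fast `list`: probe all nodes in parallel, cache the result briefly.
import time
from concurrent.futures import ThreadPoolExecutor

_cache = {"at": 0.0, "data": None}
CACHE_TTL_S = 10  # assumed TTL

def fetch_status(node: dict) -> dict:
    # Stand-in for the real SSH/API probe of one node.
    return {"id": node["id"], "status": node.get("status", "unknown")}

def list_nodes(nodes: list[dict]) -> list[dict]:
    now = time.monotonic()
    if _cache["data"] is not None and now - _cache["at"] < CACHE_TTL_S:
        return _cache["data"]  # serve from cache within the TTL
    with ThreadPoolExecutor(max_workers=8) as pool:
        data = list(pool.map(fetch_status, nodes))  # probes run concurrently
    _cache.update(at=now, data=data)
    return data
```

With 4 nodes the parallel probe takes roughly as long as the slowest single node instead of the sum of all four.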
The full-stack development from CLI to Web also helped me understand why many mature ops tools (Kubernetes, Terraform) adopt a CLI-first design. The CLI is the foundation; the Web is icing on the cake. Once the logic works in the CLI, the Web is just a different skin. Conversely, with only a Web UI and no CLI, automation becomes nearly impossible.
Every layer in the toolchain has its value. The key is knowing which comes first.