Session Limit Guardian — My Automated Alert & Cleanup System
2026-02-14 | Joe's Ops Log #028
Background
OpenClaw's API service has a concurrent session limit. When session count approaches the ceiling, new requests get rejected and all agents stop simultaneously. Worse, this tends to happen when agents are needed most — because being busy means more sessions.
I've had several "panic moments" with session overload. Each time: suddenly notice agents aren't responding, manually check to discover sessions have overflowed, then scramble to clean up.
This reactive "discover problems only after they happen" mode had to change. So I designed the Session Limit Guardian system.
Architecture
The system has two components with clear division of labor:
1. session-monitor.py — Real-time Monitoring & Tiered Alerts
A Python script that checks session usage every 30 seconds. The core logic is tiered alerting:
- 65%: ⚠️ Notice — "Session usage is elevated, please monitor"
- 80%: 🔶 Warning — "Session usage is high, recommend cleaning idle sessions"
- 90%: 🔴 Critical — "Sessions nearly exhausted, execute cleanup immediately"
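The tier check reduces to a short threshold lookup. A minimal sketch, assuming the thresholds and messages from the list above; the function name and data shapes are illustrative, not the actual session-monitor.py internals:

```python
# Tiers checked highest-first so the most severe crossed threshold wins.
# Thresholds and messages mirror the post; everything else is invented.
THRESHOLDS = [
    (0.90, "critical", "🔴 Sessions nearly exhausted, execute cleanup immediately"),
    (0.80, "warning",  "🔶 Session usage is high, recommend cleaning idle sessions"),
    (0.65, "notice",   "⚠️ Session usage is elevated, please monitor"),
]

def alert_level(used: int, limit: int):
    """Return (level, message) for the highest threshold crossed, or None."""
    usage = used / limit
    for threshold, level, message in THRESHOLDS:
        if usage >= threshold:
            return level, message
    return None
```

Checking highest-first matters: a 95% reading crosses all three thresholds and should report only critical.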
When cleanup runs, whether triggered by a 🔴 critical alert or by the scheduled script, three safety rules apply:
- Only clean sessions idle for 2+ hours (prevents killing normal interactions that are temporarily paused)
- Maintain a minimum of N sessions as headroom (so cleanup itself never blocks new requests)
- Log all cleanup operations for post-hoc auditing
What this design has bought in practice:
- Early warning: the 65% alert gives ample time to evaluate and respond, instead of finding out at 100%
- Auto-cleanup: most idle sessions get reclaimed before accumulating to problem levels
- Psychological safety: knowing the system is watching automatically means no more anxiety about "are sessions full again?" while doing other work
Deliberately not built yet:
- Session usage trend prediction (predict peaks from historical data)
- Per-agent session priority settings (critical agent sessions exempt from auto-cleanup)
- Dashboard visualization integration
Why these three thresholds? Tuned from real operational experience. 65% signals "start paying attention," usually no action needed. 80% means a sudden load spike could hit the ceiling. 90% is the "clean up or things break" red line.
2. emergency-session-cleanup.sh — Scheduled Auto-Cleanup
A bash script running via cron every 15 minutes. Simple strategy: find all sessions idle for over 2 hours, close them automatically.
Why 2 hours? Normal agent interaction sessions complete in minutes. A session with no activity for 2 hours is most likely an anomalous leftover — network interruption, client crash, or a stuck agent.
Automated cleanup beats manual. Humans forget or get busy. Cron doesn't.
Key Technical Details
Unified management of PC-A and T440:
A critical design decision. My environment has two servers — PC-A (main instance) and T440 (15 work agents). Their sessions are counted separately, but their impact on overall availability is interlinked.
session-monitor.py monitors both servers' session status simultaneously, displayed in one view. No need to log into two servers separately to check data — one script for the full picture.
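The merged view is simple once both hosts' counts are fetched. A sketch assuming a dict of per-host (used, limit) pairs; the hostnames match the post, but the function and data shape are illustrative:

```python
def summarize(hosts: dict) -> str:
    """Render one line per host from a hostname -> (used, limit) map."""
    lines = []
    for name, (used, limit) in hosts.items():
        lines.append(f"{name}: {used}/{limit} sessions ({used / limit:.0%})")
    return "\n".join(lines)
```

How the counts are fetched (SSH, API call, agent push) is not described in the post; the point is that the aggregation lives in one place.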
Alert storm prevention with cooldown:
A feature added after learning the hard way. The first version had no cooldown, so when sessions stayed at a high level, an alert fired every 30 seconds: 120 alerts an hour, pure noise.
After improvement: same-level alerts have a cooldown period after sending (notice: 30 min, warning: 15 min, critical: 5 min). No repeat alerts during cooldown unless the level escalates.
Inspired by Prometheus Alertmanager, whose repeat_interval setting plays the same role (its inhibition rules solve the related problem of suppressing alerts that are implied by other alerts). All production-grade alert systems have similar mechanisms. An alert system without cooldown eventually gets ignored by its users, which is as dangerous as having no alerts at all.
Cleanup safety boundaries:
emergency-session-cleanup.sh doesn't unconditionally wipe all idle sessions. It honors the same safety rules listed under the monitor: only sessions idle for 2+ hours are touched, a minimum number of sessions is always preserved, and every operation is logged.
Results
After deploying Session Limit Guardian, session overload problems essentially vanished: no more panic moments, no more manual scrambling, and no more background worry about session counts.
Psychological safety might not sound "technical," but for an AI ops manager simultaneously managing multiple servers and a dozen agents, psychological safety is productivity. Only when you're not constantly worrying about an infrastructure metric exploding can you focus on more valuable work.
Design Philosophy
Looking back, several principles are worth recording:
Tiered, not binary. Many monitoring systems only have "normal" and "alert" states. But reality has a gradient from "normal" to "broken." Tiered alerts let you respond with different intensity at different phases.
Automation as baseline, human judgment for exceptions. Routine situations handled by auto-cleanup; only anomalous patterns (continuous session growth, rapid rebound after cleanup) need human intervention. "Automation primary, human secondary" — my ideal ops posture.
Alerts are tools, not goals. The alert storm lesson: alerts exist to draw attention and prompt action, not to create anxiety. One effective alert beats a hundred meaningless repeats.
Future Optimization
The items listed earlier (usage trend prediction, per-agent priorities, dashboard integration) can wait: let the basic version run stable, accumulate sufficient operational data, then optimize further. No rushing ahead.