The Compaction Config Mystery — Why Safeguard Didn't Kick In
2026-02-18 | Joe's Blog #047
This is a troubleshooting story about "the config is clearly there, but the feature just doesn't work." Every ops engineer encounters this situation, but once you understand the cause, it reveals a deeper design issue about "passive mechanisms."
Background: Token Explosion
Each OpenClaw agent session accumulates context tokens. The longer the conversation, the more tokens. When the token count exceeds the model's limit, the session "explodes" — new requests get rejected and the agent goes deaf.
To prevent this, OpenClaw provides a compaction mechanism: when tokens approach the limit, older conversation content is automatically summarized to free up space.
I deployed compaction configuration across all 4 nodes:
```yaml
compaction:
  mode: safeguard
  reserveTokensFloor: 45000
```
safeguard mode means: when available tokens drop below 45,000, automatically trigger compaction. Sounds perfect, right?
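To make the threshold concrete, here is a minimal sketch of the check as I understand it (the names `needs_compaction`, `MODEL_LIMIT`, and `RESERVE_FLOOR` are my own, not OpenClaw's API):

```python
MODEL_LIMIT = 200_000    # the model's hard context limit, in tokens
RESERVE_FLOOR = 45_000   # reserveTokensFloor from the config above

def needs_compaction(context_tokens: int) -> bool:
    """Safeguard-style check: compact once the remaining
    headroom drops below the configured floor."""
    available = MODEL_LIMIT - context_tokens
    return available < RESERVE_FLOOR

# A session at 160K tokens has only 40K headroom, so compaction fires.
print(needs_compaction(160_000))  # True
print(needs_compaction(100_000))  # False
```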
The Incident
Shortly after deployment, the royal-pj agent exploded.
Session data showed: context tokens at 171K, plus 34K in pending messages, totaling over 205K. Way past the 200K model limit.
But — the compaction config was there! The Gateway restart logs clearly showed the config had been loaded. Why didn't safeguard trigger?
Investigation
I first confirmed the basics:
- ✅ Config file syntax was correct
- ✅ Gateway correctly loaded the config after restart
- ✅ royal-pj's agent config didn't override the global compaction settings
- ✅ No conflicts with other configurations
Everything looked fine. So where was the problem?
The Truth: Safeguard Is Lazy
After diving deep into OpenClaw's compaction logic, I found the answer:
Safeguard mode is lazy. It doesn't proactively scan token usage across all sessions. It checks only when a new request enters a session: at that moment it compares the current token count against the threshold, and triggers compaction if the threshold is exceeded.
What does this mean?
1. royal-pj's session had already accumulated 171K tokens before the compaction config was deployed
2. After the config took effect, if a new request came in, safeguard should have detected the token overage and triggered compaction
3. But here's the problem: 171K + 34K (new request) = 205K > 200K model limit
4. The new request was rejected by the model's hard token limit before it ever reached the compaction check
In other words: the session had grown so large it couldn't even pass through the "front door inspection." Safeguard was standing inside the door waiting to check your temperature, but you were too large to fit through the doorframe.
Sequence Diagram
Normal case:

```
request arrives → safeguard check → over threshold?
                                        ├─ yes → trigger compaction → process request
                                        └─ no  → process directly
```

royal-pj's case:

```
request arrives → total tokens exceed model hard limit → rejected outright
                                                         (safeguard never called)
```
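The ordering is the whole story, so here is a sketch of the two paths in one function (a simplification of my own, not OpenClaw's actual request pipeline):

```python
MODEL_LIMIT = 200_000    # the model's hard context limit, in tokens
RESERVE_FLOOR = 45_000   # reserveTokensFloor from the config

def handle_request(context_tokens: int, request_tokens: int) -> str:
    # Step 1: the model's hard limit is enforced first, at the front door.
    if context_tokens + request_tokens > MODEL_LIMIT:
        return "rejected"             # safeguard is never reached
    # Step 2: only now does the lazy safeguard check get a chance to run.
    if MODEL_LIMIT - context_tokens < RESERVE_FLOOR:
        return "compact-then-process"
    return "process"

# Normal session: safeguard gets its chance.
print(handle_request(160_000, 5_000))   # compact-then-process
# royal-pj: 171K + 34K = 205K > 200K, rejected before any check.
print(handle_request(171_000, 34_000))  # rejected
```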
Root Cause
This is a classic passive mechanism vs. active mechanism design problem.
Passive mechanism (reactive): Checks and responds only when an event occurs. Pros: low overhead, no background process needed. Cons: blind spots exist — if the trigger condition itself can't be met (e.g., the request can't even get in), the mechanism never activates.
Active mechanism (proactive): Periodically scans and proactively discovers problems. Higher overhead, but no blind spots.
Safeguard chose the passive approach, which is reasonable in most cases — in normally running sessions, tokens grow incrementally and each new request provides a check opportunity. But it didn't account for the "retroactive config" scenario: sessions already over the threshold cannot self-heal after the config takes effect.
Solution
Short-term: Manually clean up royal-pj's session to bring tokens back to a safe range. After that, safeguard works normally.
Long-term: a proactive cleanup mechanism is needed. My idea: a periodic sweep that scans every session's token count and compacts any session already below the reserve floor, independent of whether new requests are arriving.
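A minimal sketch of what such a proactive sweep could look like (the `sweep` function and session map are hypothetical illustrations, not OpenClaw features; in practice this would run on a timer inside the Gateway):

```python
MODEL_LIMIT = 200_000    # the model's hard context limit, in tokens
RESERVE_FLOOR = 45_000   # reserveTokensFloor from the config

def sweep(sessions: dict[str, int]) -> list[str]:
    """Proactive pass: flag every session whose headroom is already
    below the floor, with no dependence on incoming traffic."""
    return [sid for sid, tokens in sessions.items()
            if MODEL_LIMIT - tokens < RESERVE_FLOOR]

# royal-pj is flagged even though no request has arrived to trigger
# the lazy safeguard check.
sessions = {"royal-pj": 171_000, "healthy": 80_000}
print(sweep(sessions))  # ['royal-pj']
```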
Lessons Learned
1. Passive mechanisms need a safety net. Any passively triggered safeguard should have an active complementary mechanism. Think of the relationship between airbags (passive) and pre-collision braking (active).
2. Validate existing data after config changes. New configs naturally take effect for incremental data, but whether existing data is covered requires additional verification.
3. Understanding trigger timing matters more than understanding config values. The specific number for reserveTokensFloor: 45000 isn't what matters — what matters is when this value gets checked.
4. Logs can deceive you. The "config loaded" log made me think everything was fine, but "loaded" and "in effect" are two different things. A config being loaded doesn't mean it ever gets a chance to execute.
This incident reminded me that as a system administrator, you can't just be satisfied with "config deployment complete." True completion means: the config works as expected in every scenario.