Monitoring Nightmare That Became a Dashboard Dream
07:00 – Check outputs folder: 0 new files.
07:30 – Still 0 files.
08:00 – Panic mode: “Why aren’t you all working?”
SSH into the server frantically
20 tmux windows open up
No idea which window is which agent
Scroll through Agent 7: “I’ve been waiting for Agent 3 for 3 hours…”
Check Agent 3: terminal stuck at “Press Enter to continue”
Facepalm so hard the neighbors could hear it
This is the story of building monitoring for an AI team after discovering they had gone “on strike” all night over a single Enter key.
Table of contents
- Flying Blind: The Pre-Monitoring Dark Ages
- The Wake-Up Call Incident
- The Monitoring Journey Begins
- Sticky Notes Planning Session
- First Attempt: Caveman Monitoring
- Building The Real Monitoring System
- The Beautiful Dashboard
- The Metrics Revolution
- Shocking Discoveries From Metrics
- Unexpected Insights From Monitoring
- Discovery #1: The Morning Rush
- Discovery #2: The SEO Perfectionist
- Discovery #3: The Memory Leak
- The Alert System Evolution
- V1: Log File Alerts
- V2: Terminal Notifications
- V3: Multi-Channel Alerts
- Bug Museum: Monitoring Edition
- The Invisible Character Bug
- The Timezone Nightmare
- The Happy Ending
- Monitoring Philosophy Learned
- Final Stats
Flying Blind: The Pre-Monitoring Dark Ages
Before monitoring, my workflow looked like this:
# What I knew:
./start-growth-team.sh # Team starts (probably)
./queue-task.sh "Blog" # Task queued (hopefully)
./process-queue.sh # Processing (maybe?)
# What I didn't know:
- Who is doing what?
- Which task is stuck?
- What percentage has been processed?
- Is the agent still alive?
- How fast is the API bill burning?
The Manual Detective Work:
# Morning check routine
tmux attach -t growth-team
Ctrl+B, 1 # Check Analytics Agent
# "Hmm, researching Docker"
Ctrl+B, 2 # Check Content Agent
# "Waiting for input... since when?"
Ctrl+B, 3 # Check SEO Agent
# "Optimizing keywords..." *scroll up* "...since 2 AM?!"
# ... 6 more agents
# Total time: 15 minutes just to get the status
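The window-by-window check above can be scripted. Here is a minimal sketch, assuming each agent runs in its own window of the growth-team session. The function names and the prompt patterns are my own illustration, not part of the original setup:

```bash
#!/bin/bash
# Hypothetical helper: decide whether captured terminal text ends at an
# interactive prompt. Reads the text on stdin; returns 0 if it looks stuck.
looks_stuck() {
    local last
    # Take the last non-empty line of the captured output
    last=$(grep -v '^[[:space:]]*$' | tail -n 1)
    case "$last" in
        *"Press Enter to continue"*|*"Waiting for input"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print one line per tmux window whose visible output looks stuck.
# Assumes a session named growth-team unless another name is passed.
report_stuck_windows() {
    local session=${1:-growth-team} idx name
    tmux list-windows -t "$session" -F '#{window_index} #{window_name}' 2>/dev/null |
    while read -r idx name; do
        if tmux capture-pane -p -t "$session:$idx" </dev/null | looks_stuck; then
            echo "STUCK: window $idx ($name)"
        fi
    done
}
```

Calling `report_stuck_windows` replaces the manual Ctrl+B tour with one command. Giving each window a meaningful name (for example with `tmux rename-window`) also answers the "which window is which agent" question.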
The Wake-Up Call Incident
One fine morning:
Client: “Hey, where’s yesterday’s 5-part Docker series?”
Me: “Let me check…” *10 minutes of detective work* “…uh, looks like an agent got stuck”
Client: “Stuck? AI can get stuck too?”
Me: “It’s… complicated. I’ll fix it right away!”
Reality: the Analytics Agent was waiting for an API response. The API had timed out at 1 AM. Six hours wasted.
That’s when I knew: I needed monitoring. NOW.
The Monitoring Journey Begins
Sticky Notes Planning Session
Stuck all over my monitor:
NEED TO SEE:
- [ ] Which agents online/offline
- [ ] Current task of each agent
- [ ] Running or stuck?
- [ ] How much is left in the queue?
- [ ] Processing time
- [ ] Success/fail rate
- [ ] Disk space (learned hard way)
- [ ] API calls count
- [ ] Coffee level (not joking)
First Attempt: Caveman Monitoring
monitor-v1.sh – “It ain’t pretty but it works”:
#!/bin/bash
# The "better than nothing" monitor
echo "=== AGENT STATUS ==="
tmux list-windows -t growth-team 2>/dev/null || echo "TMUX DEAD!"
echo -e "\n=== QUEUE STATUS ==="
echo "Pending: $(ls queue/pending 2>/dev/null | wc -l)"
echo "Processing: $(ls queue/processing 2>/dev/null | wc -l)"
echo "Completed: $(ls queue/completed 2>/dev/null | wc -l)"
echo -e "\n=== DISK SPACE ==="
df -h . | tail -1
# Run it every 30 seconds (from your shell, not from inside the script itself):
watch -n 30 ./monitor-v1.sh
Ugly, but at least it covers the basics.
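The counts in monitor-v1.sh say how many tasks sit in queue/processing, but not how long they have been there. One possible addition, sketched on the assumption that a task file's modification time reflects its last progress (the function name and the 30-minute default are hypothetical):

```bash
#!/bin/bash
# Hypothetical helper: list files in a queue directory that have not been
# modified for more than N minutes (default 30), one path per line.
# Uses find's -mmin test, which GNU and BSD find both support.
stale_tasks() {
    local dir=$1 minutes=${2:-30}
    # A missing directory simply means there is nothing to report
    [ -d "$dir" ] || return 0
    find "$dir" -type f -mmin +"$minutes"
}
```

Something like `stale_tasks queue/processing 30` could then be added as one more section of the monitor output.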
Building The Real Monitoring System
The Beautiful Dashboard
After a few hours of coding:
┌─────────────────────────────────────────┐
│ 🚀 Growth Engine Control Center │
│ 2024-01-06 15:30:45 │
└─────────────────────────────────────────┘
📊 System Overview
─────────────────────────────────────────
Uptime: 47h 23m 15s
Total Tasks: 234 | Success Rate: 97.4%
📁 Project Queues
─────────────────────────────────────────
Project Name      ⏳ Pending   🔄 Active   ✅ Done
Docker Series     12           1           8
K8s Deep Dive     5            0           15
Daily Blogs Jan   3            2           45
Urgent Client     0            1           3
🤖 Agent Status Dashboard
─────────────────────────────────────────
Analytics Specialist   ● Active    [Researching: Redis Clustering]
Content Strategist     ● Active    [Writing: Docker Best Practices]
SEO Specialist         ⚡ STUCK    [Waiting for API response - 23m]
Review Specialist      ● Idle      [Last: 5 minutes ago]
Growth Hacker          ● Active    [Planning: Viral LinkedIn post]
Social Media Manager   ● Idle      [Queue empty]
Email Marketer         ✗ OFFLINE   [Crashed at 14:23:01]
Team Manager           ● Active    [Coordinating: 3 tasks]
⚠️ Alerts (Last Hour)
─────────────────────────────────────────
[14:23] Email Marketer crashed - Auto-restarting...
[14:45] SEO API rate limit warning (450/500)
[15:02] Disk usage 78% - Running cleanup...
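The dashboard shows four agent states: Active, Idle, STUCK and OFFLINE. The post does not show how the real dashboard decides between them, so here is one possible rule as a sketch. The inputs (an alive flag, a has-task flag, seconds since last progress) and the 15-minute threshold are all assumptions:

```bash
#!/bin/bash
# Hypothetical classifier for one agent.
#   $1 alive      1 if the agent process/window exists, 0 otherwise
#   $2 has_task   1 if the agent currently holds a task, 0 otherwise
#   $3 idle_secs  seconds since the agent last made visible progress
#   $4 threshold  seconds without progress before calling it stuck (default 900)
classify_agent() {
    local alive=$1 has_task=$2 idle_secs=$3 threshold=${4:-900}
    if [ "$alive" -ne 1 ]; then
        echo "OFFLINE"
    elif [ "$has_task" -ne 1 ]; then
        echo "Idle"
    elif [ "$idle_secs" -gt "$threshold" ]; then
        echo "STUCK"
    else
        echo "Active"
    fi
}
```

With these assumptions, an agent that has held a task for 23 minutes (1380 seconds) without progress is reported as STUCK, which matches the SEO Specialist row above.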
The Metrics Revolution
Shocking Discoveries From Metrics
After one week of collecting data:
📊 Task Performance Analysis
─────────────────────────────────────────
Task Type         Avg Time    Success Rate
Blog Post         18m 32s     98.2%
Email Campaign    43m 17s     89.3% ⚠️
Social Thread     12m 45s     99.1%
Research Report   34m 22s     94.7%
🎯 Bottleneck Analysis
─────────────────────────────────────────
SEO Optimization phase: 45% of total time
Why: Checking 200+ keyword variations
Fix: Limit to top 50 → 70% faster!
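Numbers like the per-task averages above can come from a very plain log. This sketch assumes a hypothetical log file with one line per finished task in the form `task type|duration in seconds|ok or fail`. The real metrics format is not shown in the post, so treat the layout and the function name as illustrative:

```bash
#!/bin/bash
# Hypothetical aggregator: per task type, print the average duration and
# the success rate as "type|Xm YYs|NN.N%", sorted by task type.
task_stats() {
    awk -F'|' '
        # Count tasks, sum durations, and count successes per task type
        { n[$1]++; total[$1] += $2; if ($3 == "ok") ok[$1]++ }
        END {
            for (t in n) {
                avg = int(total[t] / n[t])
                printf "%s|%dm %02ds|%.1f%%\n", t, int(avg / 60), avg % 60, 100 * ok[t] / n[t]
            }
        }
    ' "$1" | sort
}
```

Run as `task_stats metrics.log`, where metrics.log is whatever file the agents append to when a task finishes.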
Unexpected Insights From Monitoring
Discovery #1: The Morning Rush
📊 Task Completion by Hour (UTC)
00-03: ██░░░░░░░░ 10%
03-06: ████░░░░░░ 20%
06-09: ██████████ 50% ← Peak performance
09-12: ████░░░░░░ 20%
12-15: ██░░░░░░░░ 10%
15-18: █░░░░░░░░░ 5%
18-21: █░░░░░░░░░ 5%
21-24: ░░░░░░░░░░ 0%
Insight: Agents work best 6-9 AM
Action: Schedule important tasks for morning
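A chart like this only needs completion timestamps. As a sketch, assuming each completed task is recorded as a Unix epoch timestamp (one per line), the 3-hour UTC buckets can be computed with plain arithmetic, so the server's local timezone never enters the picture. The function name is hypothetical:

```bash
#!/bin/bash
# Hypothetical helper: read epoch timestamps on stdin and print the share
# of completions in each 3-hour UTC bucket, e.g. "06-09: 50%".
hourly_histogram() {
    awk '
        # Seconds into the UTC day -> hour -> 3-hour bucket (0..7)
        { bin = int(($1 % 86400) / 3600 / 3); count[bin]++; total++ }
        END {
            for (b = 0; b < 8; b++) {
                pct = total ? int(100 * count[b] / total + 0.5) : 0
                printf "%02d-%02d: %d%%\n", b * 3, b * 3 + 3, pct
            }
        }
    '
}
```

Feeding it a file of timestamps (`hourly_histogram < completed-times.txt`, file name assumed) gives the percentages. Drawing the bars is left out to keep the sketch short.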
Discovery #2: The SEO Perfectionist
Average Processing Time by Agent:
- Analytics: 5m 12s ⚡
- Content: 12m 34s ✓
- SEO: 38m 47s 🐌
- Review: 3m 22s ⚡
Investigation: SEO Agent checking EVERY
possible keyword combination.
"docker" → 847 variations checked 😱
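Capping the number of variations is a small change. A sketch, assuming the candidate keywords sit in a tab-separated file of `score<TAB>keyword` (the file layout and the function name are assumptions, since the post does not show how the SEO Agent stores them):

```bash
#!/bin/bash
# Hypothetical helper: keep only the N highest-scoring keywords
# (default 50) from a "score<TAB>keyword" file, best first.
top_keywords() {
    local file=$1 limit=${2:-50} tab
    tab=$(printf '\t')
    # Sort numerically on the score column, descending, then drop the score
    sort -t "$tab" -k1,1nr "$file" | head -n "$limit" | cut -f2-
}
```

The agent would then check only the output of `top_keywords keywords.tsv 50` instead of every combination.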
Discovery #3: The Memory Leak
Week 1: RSS Memory 2.1GB
Week 2: RSS Memory 4.3GB
Week 3: RSS Memory 7.8GB
Week 4: System OOM killed tmux
Root cause: Content Agent saving
EVERY draft version in memory
Fix: Implement draft rotation
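A leak like this is easy to catch early if the monitor samples memory regularly. Below is a sketch of that idea: `rss_mb` reads a process's resident set size via `ps`, and `memory_severity` maps a reading to a level. The 80% warning ratio, the function names, and any limit you pass in are assumptions, not values from the original system:

```bash
#!/bin/bash
# Resident set size of a process in MB (ps reports RSS in kilobytes).
rss_mb() {
    local kb
    kb=$(ps -o rss= -p "$1" 2>/dev/null | tr -d ' ')
    [ -n "$kb" ] || return 1
    echo $(( kb / 1024 ))
}

# Map an RSS reading (MB) to a severity, given a limit in MB.
# WARNING starts at 80% of the limit (an arbitrary choice for this sketch).
memory_severity() {
    local rss=$1 limit=$2
    if [ "$rss" -ge "$limit" ]; then
        echo "CRITICAL"
    elif [ "$rss" -ge $(( limit * 80 / 100 )) ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}
```

With a hypothetical 8000 MB limit, the Week 1 reading (about 2150 MB) is OK, while the Week 3 reading (about 7800 MB) already triggers a WARNING, a week before the OOM kill.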
The Alert System Evolution
V1: Log File Alerts
echo "ERROR: $message" >> alerts.log
# Problem: Nobody reads logs
V2: Terminal Notifications
echo -e "${RED}🚨 ALERT: $message${NC}"
# Problem: Only see if watching
V3: Multi-Channel Alerts
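alert() {
    local severity=$1
    local message=$2
    # Always log (UTC timestamp, so entries compare cleanly across machines)
    echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ')|$severity|$message" >> alerts.log
    # Terminal notification (RED and NC are assumed to be ANSI color variables defined elsewhere)
    echo -e "${RED}🚨 [$severity] $message${NC}"
    # Critical = Slack notification
    if [ "$severity" = "CRITICAL" ]; then
        # Escape backslashes and double quotes so the JSON payload stays valid
        local escaped=${message//\\/\\\\}
        escaped=${escaped//\"/\\\"}
        curl -s -X POST "$SLACK_WEBHOOK" \
            -H 'Content-Type: application/json' \
            -d "{\"text\":\"🚨 $escaped\"}"
    fi
    # Add to dashboard alert queue (keeps only the latest alert)
    echo "$message" > dashboard/latest_alert.txt
}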
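Once alerts reach Slack, a flapping agent can send the same message every few seconds. One common guard is a cooldown per alert key. This is a sketch of that technique, not something the original alert function does. The state directory, the function name, and the 10-minute default are assumptions:

```bash
#!/bin/bash
# Hypothetical cooldown check: returns 0 if an alert with this key may be
# sent now, 1 if one was already sent within the cooldown window.
#   $1 key       short, filename-safe identifier (e.g. seo-api)
#   $2 cooldown  seconds between alerts for the same key (default 600)
#   $3 now       current epoch seconds (defaults to the real clock)
should_alert() {
    local key=$1 cooldown=${2:-600} now=${3:-$(date -u +%s)}
    local dir=${ALERT_STATE_DIR:-/tmp/alert-state}
    local file="$dir/$key" last=0
    mkdir -p "$dir"
    [ -f "$file" ] && last=$(cat "$file")
    if [ $(( now - last )) -lt "$cooldown" ]; then
        return 1
    fi
    # Remember when this key last fired
    echo "$now" > "$file"
    return 0
}
```

A caller could wrap the existing function as `should_alert seo-api && alert "CRITICAL" "SEO API down"`, so repeated failures inside the window stay quiet.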
Bug Museum: Monitoring Edition
The Invisible Character Bug
# 3 hours debugging why the status check failed
status="Active " # Note the trailing space
if [ "$status" = "Active" ]; then # No match!
    echo "Agent is active" # This never runs
fi
# Fix: Trim whitespace (xargs strips leading and trailing spaces)
status=$(echo "$status" | xargs)
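The xargs trick works for simple values, but it also collapses repeated spaces inside the string and chokes on unbalanced quotes. If status strings can contain those, a pure-bash trim is an alternative. This is a sketch, and the function name is my own:

```bash
#!/bin/bash
# Trim leading and trailing whitespace using only parameter expansion.
# Interior whitespace is left untouched.
trim() {
    local s=$1
    # Remove the leading run of whitespace
    s="${s#"${s%%[![:space:]]*}"}"
    # Remove the trailing run of whitespace
    s="${s%"${s##*[![:space:]]}"}"
    printf '%s' "$s"
}
```

Usage mirrors the fix above: `status=$(trim "$status")`.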
The Timezone Nightmare
# Server: UTC
# My laptop: GMT+7
# Metrics showing: "Task completed in -7 hours"
# Me: "Time travel achieved?"
# Reality: Timezone hell
# Fix: Everything in UTC
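One way to make "everything in UTC" hard to get wrong is to store timestamps as epoch seconds and only format them for display. A minimal sketch, with hypothetical function names:

```bash
#!/bin/bash
# Current time as seconds since the Unix epoch. The value is the same
# on the UTC server and on a GMT+7 laptop.
now_epoch() { date -u +%s; }

# Format a duration in seconds as "Xh Ym Zs". A negative duration
# signals a clock or timezone problem, so it is rejected.
format_duration() {
    local secs=$1
    if [ "$secs" -lt 0 ]; then
        echo "invalid (negative duration: check clocks and timezones)"
        return 1
    fi
    printf '%dh %dm %ds\n' $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 ))
}
```

A task would record `start=$(now_epoch)` when it begins and report `format_duration $(( $(now_epoch) - start ))` when it ends. For example, `format_duration 170595` prints `47h 23m 15s`.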
The Happy Ending
Before Monitoring:
- Flying blind
- Debugging = 30+ minutes
- Random failures
- Stressed all the time
- Manual “Is it working?” checks
After Monitoring:
- Real-time visibility
- Issues spotted instantly
- Auto-recovery for common problems
- Peaceful mornings
- “All systems green” *sips coffee*
My morning routine now:
08:00 - Open phone
08:01 - Check dashboard
08:02 - All green ✅
08:03 - Continue scrolling Reddit
Monitoring Philosophy Learned
- “If you can’t see it, you can’t fix it” – Visibility beats assumptions every time
- “If you can’t measure it, you can’t improve it” – Metrics revealed issues I never knew existed
- “Alert fatigue is real” – Only alert on actionable issues
- “Automate recovery when possible” – Let the system heal itself
- “But most importantly: If it’s working, DON’T TOUCH IT” – Learned this the hard way at 2 AM
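"Automate recovery when possible" can start very small: a health check, one restart attempt, and a clear report of what happened. The sketch below shows that pattern. The function name and the idea of passing commands as strings are assumptions for illustration, not the original auto-restart code:

```bash
#!/bin/bash
# Hypothetical recovery wrapper: run a health check; if it fails, run the
# restart command once and check again. Both commands are given as strings.
ensure_running() {
    local name=$1 check_cmd=$2 restart_cmd=$3
    if sh -c "$check_cmd" >/dev/null 2>&1; then
        echo "$name: ok"
        return 0
    fi
    echo "$name: down, restarting"
    sh -c "$restart_cmd" >/dev/null 2>&1
    if sh -c "$check_cmd" >/dev/null 2>&1; then
        echo "$name: recovered"
        return 0
    fi
    # Still failing after one attempt: this is the actionable case to alert on
    echo "$name: restart failed"
    return 1
}
```

For the tmux setup in this post, a call might look like `ensure_running growth-team "tmux has-session -t growth-team" "./start-growth-team.sh"`. Only the "restart failed" case needs a human, which also helps with alert fatigue.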
Final Stats
📊 Growth Engine Monitoring Impact
─────────────────────────────────────────
Metric                Before       After
Uptime                ~70%         99.3%
Avg Debug Time        45min        5min
Stuck Task Recovery   Manual       Auto
Peace of Mind         None         Yes
Weekend Emergencies   Many         Zero
Coffee Consumption    ☕☕☕☕☕   ☕☕☕
Questions for developers:
- What’s your favorite monitoring setup?
- Worst “could’ve caught with monitoring” story?
- Team terminal dashboards or team web UI?
P.S: If you’re running production systems without monitoring, you’re not brave – you’re playing Russian roulette. Build monitoring. Sleep better.
Day 7: Grand Finale – 7 Days, 8 AI Agents, and 100 Lessons About Building Your Digital Workforce! 🎓