Comprehensive monitoring dashboard with real-time metrics, proactive alerting, historical trend analysis, and resource utilization forecasting for Azure VMs.
All monitoring commands use tag-based VM discovery, ensuring consistent behavior across `azlin w`, `azlin ps`, `azlin top`, and the monitoring dashboard.
Key Features:
- Tag-based discovery (primary): Discovers VMs with the `azlin-managed=true` tag
- Name-prefix fallback (backward compatibility): Falls back to VMs with the "azlin-" prefix
- Custom name support: Works with any VM name format, including compound names like "hostname:session"
- Consistent behavior: Same discovery logic as `azlin list`
Supported VM Name Formats:
- Standard: `azlin-vm-1234567890`
- Custom: `myproject`
- Compound: `myhost:dev,api-server:prod`
For complete VM discovery documentation, troubleshooting, and migration guide, see VM Discovery for Monitoring Commands.
# Launch real-time monitoring dashboard
azlin monitor dashboard
# View configured alerts
azlin monitor alert list
# Check resource forecast
azlin monitor forecast --at-risk-only

Live monitoring dashboard showing CPU, memory, disk, and network metrics for all VMs:
# Launch with default 60-second refresh
azlin monitor dashboard
# Faster refresh (30 seconds)
azlin monitor dashboard --refresh-interval 30
# Filter by resource group
azlin monitor dashboard --resource-group my-dev-vms

Dashboard Display:
╭─ VM Monitoring Dashboard ─ Updated: 2025-12-01 20:30:15 ─────────────╮
│ VM Name CPU% Memory% Disk R/W (MB/s) Network I/O (MB/s) │
├────────────────────────────────────────────────────────────────────────┤
│ dev-vm-01 45.2 62.1 12.3 / 8.5 1.2 / 0.8 │
│ dev-vm-02 78.9 89.2 45.1 / 23.4 5.3 / 3.2 │
│ dev-vm-03 12.4 34.5 3.2 / 1.8 0.4 / 0.2 │
╰────────────────────────────────────────────────────────────────────────╯
Press 'q' to quit | 'r' to refresh | '+' faster | '-' slower
Color Coding:
- 🟢 Green (<70%): Normal
- 🟡 Yellow (70-85%): Elevated
- 🔴 Red (>85%): High
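The color bands translate into a simple lookup; a sketch assuming the bands above (the `utilization_color` function name is hypothetical):

```python
def utilization_color(percent: float) -> str:
    """Map a utilization percentage to the dashboard's color bands:
    green below 70%, yellow from 70-85%, red above 85%."""
    if percent > 85:
        return "red"      # High
    if percent >= 70:
        return "yellow"   # Elevated
    return "green"        # Normal

print(utilization_color(78.9))  # yellow
```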
Keyboard Shortcuts:
- `q`: Quit dashboard
- `r`: Refresh immediately
- `+`: Increase refresh rate
- `-`: Decrease refresh rate
Configure alert rules that trigger notifications when thresholds are breached:
# List all alert rules
azlin monitor alert list
# Add new alert
azlin monitor alert add high_cpu \
--metric cpu_percent \
--threshold 80 \
--severity warning
# Enable/disable alerts
azlin monitor alert enable high_cpu
azlin monitor alert disable high_cpu
# Delete alert
azlin monitor alert delete high_cpu

Default Alert Rules:
- `high_cpu`: CPU >80% (warning)
- `critical_cpu`: CPU >95% (critical)
- `high_memory`: Memory >85% (warning)
- `critical_memory`: Memory >95% (critical)
- `disk_space`: Disk >90% (warning)
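A default rule such as high_cpu reduces to a threshold comparison each collection cycle. A minimal Python sketch (the `AlertRule` class and `breached` method are illustrative, not azlin's actual implementation):

```python
import operator
from dataclasses import dataclass

# Comparison operators an alert rule may use.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

@dataclass
class AlertRule:
    name: str
    metric: str
    threshold: float
    comparison: str = ">"
    severity: str = "warning"
    enabled: bool = True

    def breached(self, sample: dict) -> bool:
        """True when the sampled metric crosses the rule's threshold."""
        if not self.enabled or self.metric not in sample:
            return False
        return OPS[self.comparison](sample[self.metric], self.threshold)

rule = AlertRule("high_cpu", "cpu_percent", 80.0)
print(rule.breached({"cpu_percent": 85.2}))  # True
```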
Notification Channels:
- Email (SMTP)
- Slack (webhook)
- Generic webhook (custom integrations)
Query and export historical metrics:
# View last 7 days for a VM
azlin monitor history dev-vm-01
# View last 30 days
azlin monitor history dev-vm-01 --days 30
# Query specific metric
azlin monitor history dev-vm-01 --metric cpu_percent --days 14
# Export to CSV
azlin monitor history dev-vm-01 --days 30 --export metrics.csv

Historical Data Retention:
- 7 days: Raw metrics (1-minute intervals)
- 30 days: Hourly aggregated
- 90 days: Daily aggregated
Predict future resource utilization using trend analysis:
# Forecast all VMs (30 days ahead)
azlin monitor forecast
# Forecast specific VM (7 days ahead)
azlin monitor forecast --vm-name dev-vm-01 --days 7
# Show only at-risk VMs
azlin monitor forecast --at-risk-only

Forecast Output:
Resource Utilization Forecast (30 days)
VM: dev-vm-01
├─ CPU: Current: 45% → 7d: 52% → 30d: 68% (📈 Increasing)
├─ Memory: Current: 62% → 7d: 64% → 30d: 70% (➡️ Stable)
└─ Disk: Current: 35% → 7d: 38% → 30d: 48% (📈 Increasing)
VM: dev-vm-02 ⚠️ AT RISK
├─ CPU: Current: 78% → 7d: 85% → 30d: 95% (🔴 Critical)
├─ Memory: Current: 89% → 7d: 92% → 30d: 98% (🔴 Critical)
└─ Disk: Current: 72% → 7d: 78% → 30d: 91% (🔴 Warning)
└─ Days until disk limit: 23 days
Trend Classifications:
- 📈 Increasing: >0.5% per day growth
- 📉 Decreasing: >0.5% per day decline
- ➡️ Stable: -0.5% to +0.5% per day
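The classification thresholds above translate directly into code; a sketch (the `classify_trend` function name is hypothetical):

```python
def classify_trend(slope_pct_per_day: float) -> str:
    """Classify a fitted slope (percentage points per day) into the
    trend buckets used in forecast output: >0.5 increasing,
    <-0.5 decreasing, otherwise stable."""
    if slope_pct_per_day > 0.5:
        return "increasing"
    if slope_pct_per_day < -0.5:
        return "decreasing"
    return "stable"

print(classify_trend(0.77))  # increasing
```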
Alert rules are stored in `~/.azlin/alert_rules.yaml`:
rules:
  - name: high_cpu
    metric: cpu_percent
    threshold: 80.0
    comparison: ">"
    severity: warning
    enabled: true
    notification_channels: [email]

  - name: critical_memory
    metric: memory_percent
    threshold: 95.0
    comparison: ">"
    severity: critical
    enabled: true
    notification_channels: [email, slack]

notification_config:
  email:
    enabled: true
    smtp_host: smtp.gmail.com
    smtp_port: 587
    from_address: alerts@example.com
    to_addresses:
      - admin@example.com
    # Password stored securely in system keyring
    # Set with: azlin monitor alert config-email
  slack:
    enabled: false
    webhook_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
  webhook:
    enabled: false
    url: https://example.com/alerts
    auth_type: bearer  # none, bearer, basic
    auth_token: ${WEBHOOK_TOKEN}  # From environment variable

# Configure email notifications
azlin monitor alert config-email
# You'll be prompted for:
# - SMTP host (e.g., smtp.gmail.com)
# - SMTP port (e.g., 587)
# - From address
# - Password (stored securely in system keyring)
# - Recipient addresses

Gmail Setup:
- Enable 2-factor authentication on your Google account
- Generate an App Password: https://myaccount.google.com/apppasswords
- Use the app password when configuring email
# 1. Create Slack webhook:
# https://api.slack.com/messaging/webhooks
# 2. Add webhook to config
azlin monitor alert config-slack --webhook-url https://hooks.slack.com/services/YOUR/WEBHOOK/URL
# 3. Test notification
azlin monitor alert test-slack

# Configure generic webhook
azlin monitor alert config-webhook \
--url https://example.com/alerts \
--auth-type bearer \
--auth-token $WEBHOOK_TOKEN
# Webhook payload format (JSON):
{
"alert": "high_cpu",
"vm_name": "dev-vm-01",
"metric": "cpu_percent",
"actual_value": 85.2,
"threshold": 80.0,
"severity": "warning",
"timestamp": "2025-12-01T20:30:15Z",
"message": "CPU usage on dev-vm-01 is 85.2% (threshold: 80%)"
}

Metrics Collection:
- Source: Azure Monitor REST API
- Frequency: 1-5 minutes (configurable)
- Method: Parallel collection using ThreadPoolExecutor
- Timeout: 30 seconds per VM
- Graceful Degradation: Continues if individual VMs fail
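The parallel-collection behavior described above can be sketched as follows (function names and the worker count are assumptions; azlin's internals may differ):

```python
from concurrent.futures import ThreadPoolExecutor

def collect_all(vms, fetch, timeout=30):
    """Fetch metrics for every VM in parallel; a failing or slow VM is
    recorded as an error so the rest of the cycle still completes."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=10) as pool:
        futures = {vm: pool.submit(fetch, vm) for vm in vms}
        for vm, fut in futures.items():
            try:
                # Per-VM timeout: one stuck VM does not block the cycle.
                results[vm] = fut.result(timeout=timeout)
            except Exception as exc:
                errors[vm] = repr(exc)  # graceful degradation
    return results, errors

metrics, failures = collect_all(
    ["dev-vm-01", "dev-vm-02"], lambda vm: {"cpu_percent": 42.0})
print(sorted(metrics))  # ['dev-vm-01', 'dev-vm-02']
```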
Metrics Storage:
- Database: SQLite at `~/.azlin/metrics.db`
- Retention:
  - Raw metrics: 7 days
  - Hourly aggregated: 30 days
  - Daily aggregated: 90 days
- Automatic Cleanup: Runs on each collection cycle
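A sketch of the retention cleanup that runs each collection cycle, using an invented single-table schema purely for illustration (azlin's real schema may differ):

```python
import sqlite3

def prune_raw(db: sqlite3.Connection, now: int, raw_days: int = 7) -> int:
    """Delete raw samples older than the raw-metrics retention window."""
    cutoff = now - raw_days * 86400  # window in seconds
    cur = db.execute("DELETE FROM metrics_raw WHERE ts < ?", (cutoff,))
    db.commit()
    return cur.rowcount

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE metrics_raw (vm TEXT, metric TEXT, value REAL, ts INTEGER)")
now = 10 * 86400
db.executemany("INSERT INTO metrics_raw VALUES (?, ?, ?, ?)",
               [("dev-vm-01", "cpu_percent", 45.2, now - 1 * 86400),   # kept
                ("dev-vm-01", "cpu_percent", 51.0, now - 9 * 86400)])  # pruned
print(prune_raw(db, now))  # 1
```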
Alert Evaluation:
- Frequency: Every collection cycle (1-5 minutes)
- Suppression: No re-alerts for same VM+rule within 15 minutes
- Retry: 3 attempts with exponential backoff for failed notifications
Forecasting:
- Method: Simple linear regression
- Data: Hourly aggregated metrics (minimum 7 days)
- Accuracy: Typically within 15% for 7-day predictions
- Limitations: Works best for linear trends, not sudden changes
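Simple linear regression here means an ordinary least-squares fit of utilization against time, extrapolated forward from the last sample. A self-contained sketch (the `forecast` helper is illustrative, not azlin's actual code):

```python
def forecast(samples: list[tuple[float, float]], horizon_days: float) -> float:
    """Least-squares fit of utilization (%) against time (days),
    extrapolated `horizon_days` past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var                     # %/day trend
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + horizon_days)

# Seven days of daily CPU averages trending up ~1%/day:
history = [(d, 45.0 + d) for d in range(7)]
print(round(forecast(history, 7), 1))  # 58.0
```

Because the fit is linear, a sudden workload change or seasonal cycle will throw predictions off, which matches the stated limitations.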
| Metric | Description | Unit | Source |
|---|---|---|---|
| `cpu_percent` | Average CPU utilization | Percentage (0-100) | Azure Monitor |
| `memory_percent` | Memory utilization | Percentage (0-100) | Azure Monitor |
| `disk_read_bytes` | Disk read throughput | Bytes/second | Azure Monitor |
| `disk_write_bytes` | Disk write throughput | Bytes/second | Azure Monitor |
| `network_in_bytes` | Network ingress | Bytes/second | Azure Monitor |
| `network_out_bytes` | Network egress | Bytes/second | Azure Monitor |
| Severity | Description | Use Case |
|---|---|---|
| `info` | Informational | Non-critical notifications |
| `warning` | Warning | Resource usage elevated, may need attention |
| `critical` | Critical | Immediate attention required |
Symptoms: Dashboard shows stale data or no data
Solutions:
# 1. Check Azure CLI authentication
az account show
# 2. Verify VM access
az vm list --output table
# 3. Check metrics database
ls -lh ~/.azlin/metrics.db
# 4. View collection logs
azlin monitor debug --last-collection

Symptoms: No alert notifications despite high resource usage
Solutions:
# 1. Verify alert rules are enabled
azlin monitor alert list
# 2. Check notification config
azlin monitor alert test-email
azlin monitor alert test-slack
# 3. View alert history
azlin monitor alert history --days 7
# 4. Check suppression status
azlin monitor alert suppression-status

Symptoms: Query returns no data or gaps in data
Solutions:
# 1. Check database size and location
ls -lh ~/.azlin/metrics.db
# 2. Verify collection is running
azlin monitor status
# 3. Check retention settings
azlin monitor config show
# 4. Manual data integrity check
azlin monitor verify-database

Symptoms: Predictions don't match actual usage
Common Causes:
- Insufficient historical data (need minimum 7 days)
- Recent usage pattern changes
- Non-linear growth (forecast assumes linear trends)
- Seasonal variations not accounted for
Solutions:
- Wait for more data (7-14 days minimum)
- Use shorter forecast periods (7 days instead of 30)
- Combine with manual review of trends
Dashboard:
- Launch Time: <5 seconds (typical)
- Memory Usage: ~50 MB (10 VMs)
- CPU Usage: <5% during refresh
- Network: ~100 KB per refresh cycle

Metrics Collection:
- Throughput: 10-50 VMs/second (parallel)
- API Calls: 6 calls per VM per collection
- Rate Limits: Azure Monitor: 12,000 requests/hour

Database:
- Database Size: ~1 MB per VM per month (raw metrics)
- Query Time: <100ms for 30-day range
- Aggregation Time: ~5 seconds for 90 days of data
- Uses Azure CLI authentication (`az login`)
- No custom credentials stored
- Leverages Azure RBAC for VM access
- SMTP passwords stored in system keyring (not config files)
- Webhook tokens via environment variables
- No plain-text credentials in config
- Metrics database has restricted permissions (0600)
- Error messages sanitized to prevent information disclosure
- No PII or sensitive data collected
# Azure RBAC role required
az role assignment create \
--assignee user@example.com \
--role "Monitoring Reader" \
--scope /subscriptions/{subscription-id}

Monitor VMs during deployment and alert on issues:
# GitHub Actions example
- name: Monitor deployment VMs
  run: |
    # Start monitoring
    azlin monitor dashboard --refresh-interval 30 &
    MONITOR_PID=$!

    # Deploy application
    ./deploy.sh

    # Check for alerts during deployment
    azlin monitor alert history --last 30min

    # Stop monitoring
    kill $MONITOR_PID

Export metrics to Grafana for advanced visualization:
# Export last 30 days to CSV
for vm in $(az vm list --query "[].name" -o tsv); do
azlin monitor history $vm --days 30 --export "${vm}_metrics.csv"
done
# Import to Grafana using CSV datasource

Forward critical alerts to PagerDuty:
# In alert_rules.yaml
notification_config:
  webhook:
    enabled: true
    url: https://events.pagerduty.com/v2/enqueue
    auth_type: bearer
    auth_token: ${PAGERDUTY_TOKEN}
# Webhook payload automatically formatted for PagerDuty Events API

**How often does the dashboard refresh?**
By default, every 60 seconds; configurable in seconds via `--refresh-interval`.

**Can I monitor VMs in multiple subscriptions?**
Yes, but you need to switch subscriptions using `az account set` before launching the dashboard.

**What happens if a VM becomes unreachable?**
The collector gracefully skips unreachable VMs and displays an error in the dashboard. Other VMs continue to be monitored.

**How accurate are the forecasts?**
Typically within 15% for 7-day predictions. Accuracy improves with more historical data (14-30 days).

**Can I add custom alert rules?**
Yes, edit `~/.azlin/alert_rules.yaml` or use `azlin monitor alert add` to create custom rules.

**How much does monitoring cost?**
Minimal. Azure Monitor API calls are free for basic metrics, and storage is negligible (<10 MB per VM per year).

**Can I run the dashboard in the background?**
Yes, use `screen` or `tmux` to keep it running:

screen -S azlin-monitor
azlin monitor dashboard
# Press Ctrl+A, D to detach
# screen -r azlin-monitor to reattach

See Also:
- Azure Monitor Metrics
- azlin distributed top - Live process monitoring
- azlin cost tracking - Cost estimation and tracking