Files
hmac-file-server/CLUSTER_RESTART_GUIDE.md

7.6 KiB

HMAC File Server Cluster Restart Guide

Overview

When HMAC File Server restarts in a distributed cluster environment, client applications on other VMs may need to be restarted or reconfigured to resume upload functionality. This guide addresses common scenarios and solutions.

🚨 Common Client Restart Scenarios

1. XMPP Server Integration (XEP-0363)

Scenario:

VM1: ejabberd/Prosody XMPP Server
VM2: HMAC File Server (restarted)

Why clients may need restart:

  • XMPP servers cache upload slot endpoints
  • HTTP client connection pools become stale
  • Upload session tokens may be invalidated
  • DNS/service discovery cache issues

Solutions:

# On XMPP Server VM:
systemctl reload ejabberd       # Reload config without full restart
# OR
systemctl restart ejabberd     # Full restart if reload doesn't work

# For Prosody:
systemctl reload prosody
# OR
prosodyctl reload

2. Application Servers with HTTP Clients

Scenario:

VM1,VM2,VM3: Web Applications
VM4: HMAC File Server (restarted)

Why clients may need restart:

  • HTTP client libraries maintain connection pools
  • Upload tokens cached in application memory
  • Circuit breakers may be in "open" state
  • Load balancer health checks failing

Solutions:

# Restart application servers:
systemctl restart your-app-server

# Or graceful reload if supported:
systemctl reload your-app-server
kill -USR1 $(pgrep -f your-app)

3. Load Balancer / Proxy Issues

Scenario:

VM1: nginx/haproxy Load Balancer
VM2: HMAC File Server (restarted)

Why restart needed:

  • Upstream connection pooling
  • Health check failures
  • Backend marking as "down"

Solutions:

# nginx:
nginx -s reload

# haproxy:
systemctl reload haproxy
# OR disable/enable backend:
echo "disable server hmac-backend/server1" | socat stdio /var/run/haproxy.sock
echo "enable server hmac-backend/server1" | socat stdio /var/run/haproxy.sock

🔧 Enhanced Configuration for Cluster Resilience

Server-Level Settings

[server]
# Graceful shutdown to allow client reconnections
graceful_shutdown_timeout = "300s"
connection_drain_timeout = "120s"
restart_grace_period = "60s"

# Connection management
max_idle_conns_per_host = 5
idle_conn_timeout = "90s"
client_timeout = "300s"

Upload Session Persistence

[uploads]
# Enable session persistence across restarts
session_persistence = true
session_recovery_timeout = "300s"
client_reconnect_window = "120s"
upload_slot_ttl = "3600s"
retry_failed_uploads = true
max_upload_retries = 3

Redis-Based Session Sharing

[redis]
redisenabled = true
redisaddr = "redis-cluster:6379"
# Store upload sessions in Redis for cluster-wide persistence
redishealthcheckinterval = "30s"

🚀 Automated Restart Coordination

1. Service Discovery Integration

#!/bin/bash
# restart-coordination.sh
# Notify cluster components of HMAC server restart

HMAC_SERVER_VM="vm2"
XMPP_SERVERS=("vm1" "vm3")
APP_SERVERS=("vm4" "vm5" "vm6")

echo "HMAC File Server restart initiated on $HMAC_SERVER_VM"

# Wait for HMAC server to be ready
while ! curl -s http://$HMAC_SERVER_VM:8080/health > /dev/null; do
    echo "Waiting for HMAC server to be ready..."
    sleep 5
done

echo "HMAC server is ready. Notifying cluster components..."

# Restart XMPP servers
for server in "${XMPP_SERVERS[@]}"; do
    echo "Reloading XMPP server on $server"
    ssh $server "systemctl reload ejabberd || systemctl restart ejabberd"
done

# Restart application servers
for server in "${APP_SERVERS[@]}"; do
    echo "Restarting application server on $server"
    ssh $server "systemctl restart your-app-server"
done

echo "Cluster restart coordination completed"

2. Health Check Integration

#!/bin/bash
# hmac-health-check.sh
# Advanced health check that validates upload functionality

HMAC_URL="http://localhost:8080"
SECRET="f6g4ldPvQM7O2UTFeBEUUj33VrXypDAcsDt0yqKrLiOr5oQW"

# Test basic connectivity
if ! curl -s -f "$HMAC_URL/health" > /dev/null; then
    echo "CRITICAL: HMAC server not responding"
    exit 2
fi

# Test upload endpoint availability
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$HMAC_URL/upload")
if [ "$HTTP_CODE" != "405" ] && [ "$HTTP_CODE" != "401" ]; then
    echo "WARNING: Upload endpoint returning unexpected code: $HTTP_CODE"
    exit 1
fi

echo "OK: HMAC File Server is healthy"
exit 0

3. Consul/etcd Integration

#!/bin/bash
# consul-hmac-restart.sh
# Integrate with Consul for service discovery

# Deregister service before restart
curl -X PUT "http://consul:8500/v1/agent/service/deregister/hmac-file-server"

# Restart HMAC server
systemctl restart hmac-file-server

# Wait for service to be ready
while ! ./hmac-health-check.sh; do
    sleep 5
done

# Re-register service
curl -X PUT "http://consul:8500/v1/agent/service/register" \
  -d '{
    "ID": "hmac-file-server",
    "Name": "hmac-file-server",
    "Address": "vm2",
    "Port": 8080,
    "Check": {
      "HTTP": "http://vm2:8080/health",
      "Interval": "30s"
    }
  }'

# Trigger dependent service reloads via Consul watches
consul event -name="hmac-restarted" "HMAC File Server restarted on vm2"

🔍 Troubleshooting Client Issues

Symptoms of Client Restart Needed:

  • Upload requests returning "connection refused"
  • Timeouts on upload attempts
  • "Service temporarily unavailable" errors
  • Cached upload slots returning 404/410

Diagnosis Commands:

# Check client connection pools
ss -tuln | grep :8080

# Test upload endpoint from client VM
curl -I http://hmac-server:8080/upload

# Check client application logs
journalctl -u your-app-server -f

# Verify DNS resolution
nslookup hmac-server
dig hmac-server

Quick Fixes:

# Clear client-side DNS cache
systemctl restart systemd-resolved

# Reset client HTTP connections
ss -K dst hmac-server

# Force application reconnection
systemctl restart your-app-server

🎯 Best Practices for Production

1. Rolling Restarts

  • Use multiple HMAC server instances behind load balancer
  • Restart one instance at a time
  • Monitor client reconnection success

2. Health Check Integration

  • Implement deep health checks that test upload functionality
  • Use health check results in load balancer decisions
  • Monitor client-side connection success rates

3. Session Persistence

  • Use Redis cluster for session sharing
  • Implement upload session recovery
  • Provide client reconnection grace periods

4. Monitoring and Alerts

# Monitor upload success rates
watch -n 30 'curl -s http://hmac-server:9090/metrics | grep upload_success_total'

# Monitor client connections
watch -n 10 'ss -tuln | grep :8080 | wc -l'

# Monitor Redis session store
redis-cli info keyspace

📋 Restart Checklist

Before HMAC Server Restart:

  • Identify all client VMs and applications
  • Verify Redis cluster health
  • Check current upload queue status
  • Notify operations team

During Restart:

  • Execute graceful shutdown
  • Monitor client reconnection attempts
  • Verify upload session recovery
  • Check Redis session persistence

After Restart:

  • Restart/reload client applications as needed
  • Verify upload functionality from all client VMs
  • Monitor error rates and connection counts
  • Update monitoring dashboards

Client Applications to Restart:

  • XMPP servers (ejabberd, Prosody)
  • Web application servers
  • API gateway services
  • Load balancers (if upstream issues)
  • Monitoring agents

This comprehensive approach ensures minimal disruption during HMAC File Server restarts in distributed environments.