Files

Alexander Renz 7b1e773465 🔥 Tremora del Terra: ultimate hmac-file-server fix – final push before the drop 💾🔐

2025-07-18 12:35:53 +00:00

7.6 KiB

Raw Blame History

HMAC File Server Cluster Restart Guide

Overview

When HMAC File Server restarts in a distributed cluster environment, client applications on other VMs may need to be restarted or reconfigured to resume upload functionality. This guide addresses common scenarios and solutions.

🚨 Common Client Restart Scenarios

1. XMPP Server Integration (XEP-0363)

Scenario:

VM1: ejabberd/Prosody XMPP Server
VM2: HMAC File Server (restarted)

Why clients may need restart:

XMPP servers cache upload slot endpoints
HTTP client connection pools become stale
Upload session tokens may be invalidated
DNS/service discovery cache issues

Solutions:

# On XMPP Server VM:
systemctl reload ejabberd       # Reload config without full restart
# OR
systemctl restart ejabberd     # Full restart if reload doesn't work

# For Prosody:
systemctl reload prosody
# OR
prosodyctl reload

2. Application Servers with HTTP Clients

Scenario:

VM1,VM2,VM3: Web Applications
VM4: HMAC File Server (restarted)

Why clients may need restart:

HTTP client libraries maintain connection pools
Upload tokens cached in application memory
Circuit breakers may be in "open" state
Load balancer health checks failing

Solutions:

# Restart application servers:
systemctl restart your-app-server

# Or graceful reload if supported:
systemctl reload your-app-server
kill -USR1 $(pgrep -f your-app)

3. Load Balancer / Proxy Issues

Scenario:

VM1: nginx/haproxy Load Balancer
VM2: HMAC File Server (restarted)

Why restart needed:

Upstream connection pooling
Health check failures
Backend marking as "down"

Solutions:

# nginx:
nginx -s reload

# haproxy:
systemctl reload haproxy
# OR disable/enable backend:
echo "disable server hmac-backend/server1" | socat stdio /var/run/haproxy.sock
echo "enable server hmac-backend/server1" | socat stdio /var/run/haproxy.sock

🔧 Enhanced Configuration for Cluster Resilience

Server-Level Settings

[server]
# Graceful shutdown to allow client reconnections
graceful_shutdown_timeout = "300s"
connection_drain_timeout = "120s"
restart_grace_period = "60s"

# Connection management
max_idle_conns_per_host = 5
idle_conn_timeout = "90s"
client_timeout = "300s"

Upload Session Persistence

[uploads]
# Enable session persistence across restarts
session_persistence = true
session_recovery_timeout = "300s"
client_reconnect_window = "120s"
upload_slot_ttl = "3600s"
retry_failed_uploads = true
max_upload_retries = 3

[redis]
redisenabled = true
redisaddr = "redis-cluster:6379"
# Store upload sessions in Redis for cluster-wide persistence
redishealthcheckinterval = "30s"

🚀 Automated Restart Coordination

1. Service Discovery Integration

#!/bin/bash
# restart-coordination.sh
# Notify cluster components of HMAC server restart

HMAC_SERVER_VM="vm2"
XMPP_SERVERS=("vm1" "vm3")
APP_SERVERS=("vm4" "vm5" "vm6")

echo "HMAC File Server restart initiated on $HMAC_SERVER_VM"

# Wait for HMAC server to be ready
while ! curl -s http://$HMAC_SERVER_VM:8080/health > /dev/null; do
    echo "Waiting for HMAC server to be ready..."
    sleep 5
done

echo "HMAC server is ready. Notifying cluster components..."

# Restart XMPP servers
for server in "${XMPP_SERVERS[@]}"; do
    echo "Reloading XMPP server on $server"
    ssh $server "systemctl reload ejabberd || systemctl restart ejabberd"
done

# Restart application servers
for server in "${APP_SERVERS[@]}"; do
    echo "Restarting application server on $server"
    ssh $server "systemctl restart your-app-server"
done

echo "Cluster restart coordination completed"

2. Health Check Integration

#!/bin/bash
# hmac-health-check.sh
# Advanced health check that validates upload functionality

HMAC_URL="http://localhost:8080"
SECRET="f6g4ldPvQM7O2UTFeBEUUj33VrXypDAcsDt0yqKrLiOr5oQW"

# Test basic connectivity
if ! curl -s -f "$HMAC_URL/health" > /dev/null; then
    echo "CRITICAL: HMAC server not responding"
    exit 2
fi

# Test upload endpoint availability
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$HMAC_URL/upload")
if [ "$HTTP_CODE" != "405" ] && [ "$HTTP_CODE" != "401" ]; then
    echo "WARNING: Upload endpoint returning unexpected code: $HTTP_CODE"
    exit 1
fi

echo "OK: HMAC File Server is healthy"
exit 0

3. Consul/etcd Integration

#!/bin/bash
# consul-hmac-restart.sh
# Integrate with Consul for service discovery

# Deregister service before restart
curl -X PUT "http://consul:8500/v1/agent/service/deregister/hmac-file-server"

# Restart HMAC server
systemctl restart hmac-file-server

# Wait for service to be ready
while ! ./hmac-health-check.sh; do
    sleep 5
done

# Re-register service
curl -X PUT "http://consul:8500/v1/agent/service/register" \
  -d '{
    "ID": "hmac-file-server",
    "Name": "hmac-file-server",
    "Address": "vm2",
    "Port": 8080,
    "Check": {
      "HTTP": "http://vm2:8080/health",
      "Interval": "30s"
    }
  }'

# Trigger dependent service reloads via Consul watches
consul event -name="hmac-restarted" "HMAC File Server restarted on vm2"

🔍 Troubleshooting Client Issues

Symptoms of Client Restart Needed:

Upload requests returning "connection refused"
Timeouts on upload attempts
"Service temporarily unavailable" errors
Cached upload slots returning 404/410

Diagnosis Commands:

# Check client connection pools
ss -tuln | grep :8080

# Test upload endpoint from client VM
curl -I http://hmac-server:8080/upload

# Check client application logs
journalctl -u your-app-server -f

# Verify DNS resolution
nslookup hmac-server
dig hmac-server

Quick Fixes:

# Clear client-side DNS cache
systemctl restart systemd-resolved

# Reset client HTTP connections
ss -K dst hmac-server

# Force application reconnection
systemctl restart your-app-server

🎯 Best Practices for Production

1. Rolling Restarts

Use multiple HMAC server instances behind load balancer
Restart one instance at a time
Monitor client reconnection success

2. Health Check Integration

Implement deep health checks that test upload functionality
Use health check results in load balancer decisions
Monitor client-side connection success rates

3. Session Persistence

Use Redis cluster for session sharing
Implement upload session recovery
Provide client reconnection grace periods

4. Monitoring and Alerts

# Monitor upload success rates
watch -n 30 'curl -s http://hmac-server:9090/metrics | grep upload_success_total'

# Monitor client connections
watch -n 10 'ss -tuln | grep :8080 | wc -l'

# Monitor Redis session store
redis-cli info keyspace

📋 Restart Checklist

Before HMAC Server Restart:

Identify all client VMs and applications
Verify Redis cluster health
Check current upload queue status
Notify operations team

During Restart:

Execute graceful shutdown
Monitor client reconnection attempts
Verify upload session recovery
Check Redis session persistence

After Restart:

Restart/reload client applications as needed
Verify upload functionality from all client VMs
Monitor error rates and connection counts
Update monitoring dashboards

Client Applications to Restart:

XMPP servers (ejabberd, Prosody)
Web application servers
API gateway services
Load balancers (if upstream issues)
Monitoring agents

This comprehensive approach ensures minimal disruption during HMAC File Server restarts in distributed environments.

7.6 KiB Raw Blame History

HMAC File Server Cluster Restart Guide

Overview

🚨 Common Client Restart Scenarios

1. XMPP Server Integration (XEP-0363)

2. Application Servers with HTTP Clients

3. Load Balancer / Proxy Issues

🔧 Enhanced Configuration for Cluster Resilience

Server-Level Settings

Upload Session Persistence

Redis-Based Session Sharing

🚀 Automated Restart Coordination

1. Service Discovery Integration

2. Health Check Integration

3. Consul/etcd Integration

🔍 Troubleshooting Client Issues

Symptoms of Client Restart Needed:

Diagnosis Commands:

Quick Fixes:

🎯 Best Practices for Production

1. Rolling Restarts

2. Health Check Integration

3. Session Persistence

4. Monitoring and Alerts

📋 Restart Checklist

Before HMAC Server Restart:

During Restart:

After Restart:

Client Applications to Restart:

7.6 KiB

Raw Blame History