Queue Resilience Configuration Guide

Overview

HMAC File Server 3.2 Ultimate Fixed includes queue resilience features that handle timeouts gracefully and keep the service available under slow, unstable, or overloaded network conditions.

Enhanced Configuration Sections

1. Server-Level Timeout Resilience

[server]
# Enhanced timeout resilience settings
graceful_shutdown_timeout = "300s"    # Time to wait for active uploads during shutdown
request_timeout = "7200s"             # Maximum time for any single request (2 hours)
keep_alive_timeout = "300s"           # HTTP keep-alive timeout
connection_drain_timeout = "180s"     # Time to drain connections during shutdown
upload_stall_timeout = "600s"         # Timeout if upload stalls (no data received)
download_stall_timeout = "300s"       # Timeout if download stalls
retry_after_timeout = "60s"           # Retry-After header when rejecting due to overload
max_concurrent_uploads = 100          # Maximum concurrent upload operations
upload_rate_limit = "10MB/s"          # Per-connection upload rate limit
connection_pool_size = 200            # Maximum connection pool size

Key Benefits:

  • Graceful Degradation: Server doesn't abruptly terminate active uploads during shutdown
  • Stall Detection: Automatically detects and handles stalled uploads/downloads
  • Connection Management: Limits concurrent operations to prevent resource exhaustion
  • Rate Limiting: Prevents individual connections from overwhelming the server
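The interaction between `max_concurrent_uploads` and `retry_after_timeout` can be illustrated with a minimal sketch. It is not the server's implementation: the `UploadLimiter` class is hypothetical, and the use of HTTP status 503 for overload rejections is an assumption (the guide only states that a Retry-After header is sent).

```python
class UploadLimiter:
    """Hypothetical admission control: cap concurrent uploads, hint when to retry."""

    def __init__(self, max_concurrent_uploads=100, retry_after_seconds=60):
        self.max_concurrent = max_concurrent_uploads
        self.retry_after = retry_after_seconds
        self.active = 0

    def try_acquire(self):
        """Return (status, headers) for a new upload attempt."""
        if self.active >= self.max_concurrent:
            # Assumed behaviour: reject with 503 and a Retry-After hint in seconds.
            return 503, {"Retry-After": str(self.retry_after)}
        self.active += 1
        return 200, {}

    def release(self):
        """Call when an upload finishes or is aborted."""
        if self.active > 0:
            self.active -= 1
```

With the defaults shown in the `[server]` section, the 101st simultaneous upload would be rejected and told to retry after 60 seconds.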

2. Enhanced Worker Configuration

[workers]
# Enhanced queue robustness settings
queue_timeout = "300s"               # Maximum time a job can wait in queue
queue_drain_timeout = "120s"         # Time to wait for queue drain during shutdown
worker_health_check = "30s"          # How often to check worker health
max_queue_retries = 3                # Max retries for failed queue operations
priority_queue_enabled = true        # Enable priority queuing for different file sizes
large_file_queue_size = 20           # Separate queue for files > 100MB
small_file_queue_size = 100          # Queue for files < 10MB
queue_backpressure_threshold = 0.8   # Queue usage fraction (0.8 = 80%) at which backpressure starts
circuit_breaker_enabled = true       # Enable circuit breaker for queue failures
circuit_breaker_threshold = 10       # Failures before opening circuit
circuit_breaker_timeout = "60s"      # Time before retrying after circuit opens

Key Benefits:

  • Priority Queuing: Large files don't block small file uploads
  • Health Monitoring: Workers are continuously monitored for failures
  • Circuit Breaking: Automatic failure detection and recovery
  • Backpressure Control: Gradual slowdown instead of hard failures
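The circuit breaker settings describe a familiar pattern: after `circuit_breaker_threshold` failures the circuit opens, and after `circuit_breaker_timeout` a trial request is allowed again. The sketch below shows that pattern in general terms. The `CircuitBreaker` class and its method names are hypothetical, and whether the server counts consecutive or total failures is an assumption (consecutive is used here, matching the sample log message later in this guide).

```python
import time


class CircuitBreaker:
    """Generic circuit breaker sketch; not the server's actual code."""

    def __init__(self, threshold=10, timeout=60.0, clock=time.monotonic):
        self.threshold = threshold   # failures before the circuit opens
        self.timeout = timeout       # seconds to wait before a trial request
        self.clock = clock           # injectable for testing
        self.failures = 0
        self.opened_at = None        # None means the circuit is closed

    def allow(self):
        """Return True if a request may proceed."""
        if self.opened_at is None:
            return True
        # Half-open: permit a trial request once the timeout has elapsed.
        return self.clock() - self.opened_at >= self.timeout

    def record_success(self):
        """A success closes the circuit and resets the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        """Count a failure; open (or re-open) the circuit at the threshold."""
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

A failed trial request re-opens the circuit for another full timeout, which keeps a struggling queue from being hammered while it recovers.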

3. Advanced Queue Resilience

[queue_resilience]
enabled = true
# Timeout handling
queue_operation_timeout = "30s"      # Max time for queue operations
queue_full_behavior = "reject_oldest" # How to handle full queues
spillover_to_disk = true             # Use disk when memory queue is full
spillover_directory = "/tmp/hmac-queue-spillover"
spillover_max_size = "1GB"           # Max disk spillover size

# Queue persistence and recovery
persistent_queue = true              # Persist queue state
queue_recovery_enabled = true        # Recover queue state on restart
max_recovery_age = "24h"             # Max age of items to recover

# Health monitoring
queue_health_check_interval = "15s"  # Queue health check frequency
dead_letter_queue_enabled = true     # Failed items queue
dead_letter_max_retries = 5          # Max retries before dead letter
dead_letter_retention = "7d"         # Dead letter retention time

# Load balancing and prioritization
priority_levels = 3                  # Number of priority levels
priority_aging_enabled = true        # Age items to higher priority
priority_aging_threshold = "300s"    # Time before aging up
load_balancing_strategy = "least_connections"

# Memory management
queue_memory_limit = "500MB"         # Max memory for queues
queue_gc_interval = "60s"            # Garbage collection interval
emergency_mode_threshold = 0.95      # Emergency mode trigger

Key Benefits:

  • Disk Spillover: Queued uploads overflow to disk (up to spillover_max_size) instead of being dropped when the memory queue is full
  • Queue Recovery: Resume operations after server restarts
  • Dead Letter Queuing: Handle persistently failing uploads
  • Priority Aging: Prevent starvation of lower-priority items
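Priority aging can be sketched as a small function. This is an illustration under stated assumptions, not the server's algorithm: level 0 is taken to be the highest priority, and an item is assumed to be promoted one level for every full `priority_aging_threshold` it has waited. The function name `effective_priority` is hypothetical.

```python
def effective_priority(base_level, waited_seconds, aging_threshold=300.0):
    """Return the priority level after aging (0 = highest, an assumption).

    One promotion is granted per full aging_threshold spent waiting,
    and the result never goes above the highest level.
    """
    promotions = int(waited_seconds // aging_threshold)
    return max(0, base_level - promotions)
```

With `priority_levels = 3` and a 300-second threshold, a lowest-priority item (level 2) would reach the top level after 600 seconds in the queue, so it cannot be starved indefinitely by newer high-priority work.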

4. Comprehensive Timeout Configuration

[timeouts]
# Basic timeouts (existing)
readtimeout = "4800s"
writetimeout = "4800s"
idletimeout = "4800s"

# Enhanced timeout resilience
handshake_timeout = "30s"           # TLS handshake timeout
header_timeout = "60s"              # HTTP header read timeout
body_timeout = "7200s"              # HTTP body read timeout
dial_timeout = "30s"                # Connection dial timeout
keep_alive_probe_interval = "30s"   # TCP keep-alive probe interval
keep_alive_probe_count = 9          # Keep-alive probes before giving up

# Adaptive timeouts based on file size
small_file_timeout = "60s"          # Files < 10MB
medium_file_timeout = "600s"        # Files 10MB-100MB
large_file_timeout = "3600s"        # Files 100MB-1GB
huge_file_timeout = "7200s"         # Files > 1GB

# Retry and backoff settings
retry_base_delay = "1s"             # Base delay between retries
retry_max_delay = "60s"             # Maximum delay between retries
retry_multiplier = 2.0              # Exponential backoff multiplier
max_retry_attempts = 5              # Maximum retry attempts

Key Benefits:

  • Adaptive Timeouts: Different timeouts based on file size
  • Connection Resilience: TCP keep-alive probes detect connections that have silently dropped
  • Exponential Backoff: Intelligent retry timing reduces server load
  • Granular Control: Fine-tuned timeouts for different operations
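To make the retry settings concrete, the following sketch computes the delay before each retry from `retry_base_delay`, `retry_multiplier`, `retry_max_delay`, and `max_retry_attempts`. It assumes plain exponential backoff capped at the maximum, with no jitter, and it assumes `max_retry_attempts` counts retries rather than total attempts; the function name `backoff_delays` is hypothetical.

```python
def backoff_delays(base=1.0, multiplier=2.0, max_delay=60.0, attempts=5):
    """Return the delay in seconds before each retry, capped at max_delay."""
    return [min(base * multiplier ** i, max_delay) for i in range(attempts)]


print(backoff_delays())                              # [1.0, 2.0, 4.0, 8.0, 16.0]
print(backoff_delays(max_delay=120.0, attempts=8))   # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 120.0]
```

With the defaults above, the cap of 60 seconds is never reached in five retries. It only comes into play with more attempts, as in the second call, which uses the mobile-network values suggested under Best Practices.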

Timeout Scenario Handling

1. Network Interruption Scenarios

Mobile Network Switching:

  • Keep-alive probes detect network changes
  • Chunked uploads can resume after network restoration
  • Upload sessions persist through network interruptions
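How quickly keep-alive probes notice a dead connection follows from `keep_alive_probe_interval` and `keep_alive_probe_count`. The arithmetic below is a rough estimate: it assumes probes are sent at a fixed interval and the connection is declared dead once all of them go unanswered. The `idle_before_probing` parameter is hypothetical and stands for any idle time the operating system waits before the first probe.

```python
def keepalive_detection_seconds(probe_interval, probe_count, idle_before_probing=0):
    """Estimate seconds until an unresponsive peer is declared dead."""
    return idle_before_probing + probe_interval * probe_count


print(keepalive_detection_seconds(30, 9))  # 270
print(keepalive_detection_seconds(15, 9))  # 135
```

Under these assumptions the default settings detect a vanished peer after roughly 270 seconds, and halving the probe interval to 15 seconds halves that to about 135 seconds.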

Slow Network Conditions:

  • Adaptive timeouts prevent premature termination
  • Rate limiting prevents network saturation
  • Progress monitoring detects actual stalls vs. slow transfers
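The difference between a stalled transfer and a slow one can be shown with a short sketch. A transfer counts as stalled only when no bytes at all arrive for `upload_stall_timeout`; a slow trickle keeps resetting the timer. The `StallDetector` class is hypothetical and only illustrates the idea.

```python
import time


class StallDetector:
    """Flag a transfer as stalled when no data arrives for stall_timeout seconds."""

    def __init__(self, stall_timeout=600.0, clock=time.monotonic):
        self.stall_timeout = stall_timeout
        self.clock = clock                # injectable for testing
        self.last_progress = clock()

    def on_bytes(self, n):
        """Record that n bytes were received; any progress resets the timer."""
        if n > 0:
            self.last_progress = self.clock()

    def is_stalled(self):
        """True once the time since the last received byte reaches the timeout."""
        return self.clock() - self.last_progress >= self.stall_timeout
```

Because only the time since the last received byte matters, an upload that delivers a few bytes every few minutes is treated as slow, not stalled, and is left to the size-based timeouts instead.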

2. Server Overload Scenarios

High Load Conditions:

  • Circuit breaker prevents cascade failures
  • Backpressure slows down new requests gracefully
  • Priority queuing ensures critical uploads continue
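One way to picture gradual backpressure is a delay that ramps up once queue usage passes `queue_backpressure_threshold`. The linear ramp, the `max_delay` parameter, and the function name `backpressure_delay` are all assumptions made for illustration; the guide does not specify how the server slows requests down.

```python
def backpressure_delay(queue_len, capacity, threshold=0.8, max_delay=1.0):
    """Return an artificial delay in seconds for a new request.

    No delay below the threshold; above it the delay grows linearly
    and reaches max_delay when the queue is completely full.
    """
    usage = queue_len / capacity
    if usage <= threshold:
        return 0.0
    overload = min(usage, 1.0) - threshold
    return max_delay * overload / (1.0 - threshold)
```

In this sketch a queue at 90% usage delays new requests by about half of `max_delay`, so clients slow down progressively before any request has to be rejected outright.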

Memory Pressure:

  • Disk spillover prevents memory exhaustion
  • Queue garbage collection manages memory usage
  • Emergency mode provides last-resort protection

3. Application Restart Scenarios

Graceful Shutdown:

  • Active uploads get time to complete
  • Queue state is persisted before shutdown
  • Connections are drained properly

Recovery After Restart:

  • Queue state is restored from persistence
  • Upload sessions are recovered
  • Dead letter items are reprocessed

Monitoring and Observability

Queue Health Metrics

The enhanced configuration exposes the following metrics:

  • Queue Length: Current items in each queue
  • Queue Processing Time: Time items spend in queue
  • Worker Health: Individual worker status and performance
  • Circuit Breaker State: Open/closed status and failure counts
  • Spillover Usage: Disk spillover utilization
  • Dead Letter Queue: Failed item counts and reasons

Log Messages

Enhanced logging provides visibility into queue operations:

INFO: Queue backpressure activated (80% full)
WARN: Circuit breaker opened for upload queue (10 consecutive failures)
INFO: Spillover activated: 50MB written to disk
ERROR: Dead letter queue: Upload failed after 5 retries
INFO: Queue recovery: Restored 23 items from persistence

Best Practices

1. Configuration Tuning

For High-Volume Servers:

uploadqueuesize = 200
large_file_queue_size = 50           # [workers]
small_file_queue_size = 500          # [workers]
max_concurrent_uploads = 200         # [server]
queue_memory_limit = "1GB"           # [queue_resilience]

For Memory-Constrained Environments:

uploadqueuesize = 50
spillover_to_disk = true             # [queue_resilience]
queue_memory_limit = "200MB"         # [queue_resilience]
emergency_mode_threshold = 0.85      # [queue_resilience]

For Mobile/Unreliable Networks:

keep_alive_probe_interval = "15s"    # [timeouts]
upload_stall_timeout = "300s"        # [server]
max_retry_attempts = 8               # [timeouts]
retry_max_delay = "120s"             # [timeouts]

2. Monitoring Setup

Essential Metrics to Monitor:

  • Queue length trends
  • Worker health status
  • Circuit breaker activations
  • Spillover usage
  • Dead letter queue growth

Alert Thresholds:

  • Queue length > 80% capacity
  • Circuit breaker open for > 5 minutes
  • Dead letter queue growth > 10 items/hour
  • Spillover usage > 50% of limit
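The alert thresholds above can be expressed as a single check over whatever metrics the monitoring system collects. The function `check_alerts`, its parameter names, and the alert labels are hypothetical; only the threshold values come from this guide.

```python
def check_alerts(queue_len, queue_capacity, breaker_open_seconds,
                 dead_letters_per_hour, spillover_bytes, spillover_limit_bytes):
    """Return the names of all alert conditions that currently hold."""
    alerts = []
    if queue_len / queue_capacity > 0.8:               # queue length > 80% capacity
        alerts.append("queue_length")
    if breaker_open_seconds > 5 * 60:                  # breaker open for > 5 minutes
        alerts.append("circuit_breaker_open")
    if dead_letters_per_hour > 10:                     # DLQ growth > 10 items/hour
        alerts.append("dead_letter_growth")
    if spillover_bytes / spillover_limit_bytes > 0.5:  # spillover > 50% of limit
        alerts.append("spillover_usage")
    return alerts
```

The function assumes non-zero capacities and returns an empty list when everything is within limits, so it can be called on every scrape interval and wired to any alerting backend.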

3. Troubleshooting

Common Issues and Solutions:

Frequent Timeouts:

  • Check network stability
  • Increase the adaptive timeout for the affected file-size class
  • Use more aggressive keep-alive settings (for example, a shorter keep_alive_probe_interval)
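When diagnosing frequent timeouts, it helps to know which size-based timeout applies to a given file. The sketch below maps a file size to the default values from the `[timeouts]` section. The boundary handling is an assumption (a file of exactly 10MB counts as medium, exactly 100MB as large, exactly 1GB as large), as is the use of binary units; the function name `timeout_for_size` is hypothetical.

```python
MB = 1024 * 1024   # assumption: binary megabytes
GB = 1024 * MB


def timeout_for_size(size_bytes):
    """Return the adaptive timeout in seconds for a file of the given size."""
    if size_bytes < 10 * MB:
        return 60      # small_file_timeout
    if size_bytes < 100 * MB:
        return 600     # medium_file_timeout
    if size_bytes <= GB:
        return 3600    # large_file_timeout
    return 7200        # huge_file_timeout
```

For example, a 250MB upload that keeps timing out falls into the large-file class, so `large_file_timeout` is the value to raise.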

Queue Backlogs:

  • Monitor worker health
  • Check for resource constraints
  • Consider increasing worker count

Memory Issues:

  • Enable disk spillover
  • Reduce queue memory limit
  • Increase garbage collection frequency

Implementation Notes

The enhanced queue resilience features are designed to be:

  1. Backward Compatible: Existing configurations continue to work
  2. Opt-in: Features can be enabled individually
  3. Performance Conscious: Minimal overhead when not actively needed
  4. Configurable: All aspects can be tuned for specific environments

Together, these settings make HMAC File Server more robust against timeouts, overload, and restarts, while adding minimal overhead when the features are not actively needed.