# Queue Resilience Configuration Guide

## Overview

HMAC File Server 3.2 Ultimate Fixed includes advanced queue resilience features designed to handle timeout scenarios gracefully and maintain service availability under various network conditions.

## Enhanced Configuration Sections

### 1. Server-Level Timeout Resilience

```toml
[server]
# Enhanced timeout resilience settings
graceful_shutdown_timeout = "300s"   # Time to wait for active uploads during shutdown
request_timeout = "7200s"            # Maximum time for any single request (2 hours)
keep_alive_timeout = "300s"          # HTTP keep-alive timeout
connection_drain_timeout = "180s"    # Time to drain connections during shutdown
upload_stall_timeout = "600s"        # Timeout if upload stalls (no data received)
download_stall_timeout = "300s"      # Timeout if download stalls
retry_after_timeout = "60s"          # Retry-After header when rejecting due to overload
max_concurrent_uploads = 100         # Maximum concurrent upload operations
upload_rate_limit = "10MB/s"         # Per-connection upload rate limit
connection_pool_size = 200           # Maximum connection pool size
```

**Key Benefits:**
- **Graceful Degradation**: Server doesn't abruptly terminate active uploads during shutdown
- **Stall Detection**: Automatically detects and handles stalled uploads/downloads
- **Connection Management**: Limits concurrent operations to prevent resource exhaustion
- **Rate Limiting**: Prevents individual connections from overwhelming the server

### 2. Enhanced Worker Configuration

```toml
[workers]
# Enhanced queue robustness settings
queue_timeout = "300s"               # Maximum time a job can wait in queue
queue_drain_timeout = "120s"         # Time to wait for queue drain during shutdown
worker_health_check = "30s"          # How often to check worker health
max_queue_retries = 3                # Max retries for failed queue operations
priority_queue_enabled = true        # Enable priority queuing for different file sizes
large_file_queue_size = 20           # Separate queue for files > 100MB
small_file_queue_size = 100          # Queue for files < 10MB
queue_backpressure_threshold = 0.8   # Queue usage % to start backpressure
circuit_breaker_enabled = true       # Enable circuit breaker for queue failures
circuit_breaker_threshold = 10       # Failures before opening circuit
circuit_breaker_timeout = "60s"      # Time before retrying after circuit opens
```

**Key Benefits:**
- **Priority Queuing**: Large files don't block small file uploads
- **Health Monitoring**: Workers are continuously monitored for failures
- **Circuit Breaking**: Automatic failure detection and recovery
- **Backpressure Control**: Gradual slowdown instead of hard failures

### 3. Advanced Queue Resilience

```toml
[queue_resilience]
enabled = true

# Timeout handling
queue_operation_timeout = "30s"        # Max time for queue operations
queue_full_behavior = "reject_oldest"  # How to handle full queues
spillover_to_disk = true               # Use disk when memory queue is full
spillover_directory = "/tmp/hmac-queue-spillover"
spillover_max_size = "1GB"             # Max disk spillover size

# Queue persistence and recovery
persistent_queue = true                # Persist queue state
queue_recovery_enabled = true          # Recover queue state on restart
max_recovery_age = "24h"               # Max age of items to recover

# Health monitoring
queue_health_check_interval = "15s"    # Queue health check frequency
dead_letter_queue_enabled = true       # Failed items queue
dead_letter_max_retries = 5            # Max retries before dead letter
dead_letter_retention = "7d"           # Dead letter retention time

# Load balancing and prioritization
priority_levels = 3                    # Number of priority levels
priority_aging_enabled = true          # Age items to higher priority
priority_aging_threshold = "300s"      # Time before aging up
load_balancing_strategy = "least_connections"

# Memory management
queue_memory_limit = "500MB"           # Max memory for queues
queue_gc_interval = "60s"              # Garbage collection interval
emergency_mode_threshold = 0.95        # Emergency mode trigger
```

**Key Benefits:**
- **Disk Spillover**: Never lose uploads due to memory constraints
- **Queue Recovery**: Resume operations after server restarts
- **Dead Letter Queuing**: Handle persistently failing uploads
- **Priority Aging**: Prevent starvation of lower-priority items

### 4. Comprehensive Timeout Configuration

```toml
[timeouts]
# Basic timeouts (existing)
readtimeout = "4800s"
writetimeout = "4800s"
idletimeout = "4800s"

# Enhanced timeout resilience
handshake_timeout = "30s"            # TLS handshake timeout
header_timeout = "60s"               # HTTP header read timeout
body_timeout = "7200s"               # HTTP body read timeout
dial_timeout = "30s"                 # Connection dial timeout
keep_alive_probe_interval = "30s"    # TCP keep-alive probe interval
keep_alive_probe_count = 9           # Keep-alive probes before giving up

# Adaptive timeouts based on file size
small_file_timeout = "60s"           # Files < 10MB
medium_file_timeout = "600s"         # Files 10MB-100MB
large_file_timeout = "3600s"         # Files 100MB-1GB
huge_file_timeout = "7200s"          # Files > 1GB

# Retry and backoff settings
retry_base_delay = "1s"              # Base delay between retries
retry_max_delay = "60s"              # Maximum delay between retries
retry_multiplier = 2.0               # Exponential backoff multiplier
max_retry_attempts = 5               # Maximum retry attempts
```

**Key Benefits:**
- **Adaptive Timeouts**: Different timeouts based on file size
- **Connection Resilience**: TCP keep-alive prevents silent failures
- **Exponential Backoff**: Intelligent retry timing reduces server load
- **Granular Control**: Fine-tuned timeouts for different operations

## Timeout Scenario Handling

### 1. Network Interruption Scenarios

**Mobile Network Switching:**
- Keep-alive probes detect network changes
- Chunked uploads can resume after network restoration
- Upload sessions persist through network interruptions

**Slow Network Conditions:**
- Adaptive timeouts prevent premature termination
- Rate limiting prevents network saturation
- Progress monitoring detects actual stalls vs. slow transfers

### 2. Server Overload Scenarios

**High Load Conditions:**
- Circuit breaker prevents cascade failures
- Backpressure slows down new requests gracefully
- Priority queuing ensures critical uploads continue

**Memory Pressure:**
- Disk spillover prevents memory exhaustion
- Queue garbage collection manages memory usage
- Emergency mode provides last-resort protection

### 3. Application Restart Scenarios

**Graceful Shutdown:**
- Active uploads get time to complete
- Queue state is persisted before shutdown
- Connections are drained properly

**Recovery After Restart:**
- Queue state is restored from persistence
- Upload sessions are recovered
- Dead letter items are reprocessed

## Monitoring and Observability

### Queue Health Metrics

The enhanced configuration provides comprehensive metrics:

- **Queue Length**: Current items in each queue
- **Queue Processing Time**: Time items spend in queue
- **Worker Health**: Individual worker status and performance
- **Circuit Breaker State**: Open/closed status and failure counts
- **Spillover Usage**: Disk spillover utilization
- **Dead Letter Queue**: Failed item counts and reasons

### Log Messages

Enhanced logging provides visibility into queue operations:

```
INFO: Queue backpressure activated (80% full)
WARN: Circuit breaker opened for upload queue (10 consecutive failures)
INFO: Spillover activated: 50MB written to disk
ERROR: Dead letter queue: Upload failed after 5 retries
INFO: Queue recovery: Restored 23 items from persistence
```

## Best Practices

### 1. Configuration Tuning

**For High-Volume Servers:**

```toml
uploadqueuesize = 200
large_file_queue_size = 50
small_file_queue_size = 500
max_concurrent_uploads = 200
queue_memory_limit = "1GB"
```

**For Memory-Constrained Environments:**

```toml
uploadqueuesize = 50
spillover_to_disk = true
queue_memory_limit = "200MB"
emergency_mode_threshold = 0.85
```

**For Mobile/Unreliable Networks:**

```toml
keep_alive_probe_interval = "15s"
upload_stall_timeout = "300s"
max_retry_attempts = 8
retry_max_delay = "120s"
```

### 2. Monitoring Setup

**Essential Metrics to Monitor:**
- Queue length trends
- Worker health status
- Circuit breaker activations
- Spillover usage
- Dead letter queue growth

**Alert Thresholds:**
- Queue length > 80% capacity
- Circuit breaker open for > 5 minutes
- Dead letter queue growth > 10 items/hour
- Spillover usage > 50% of limit

### 3. Troubleshooting

**Common Issues and Solutions:**

**Frequent Timeouts:**
- Check network stability
- Increase adaptive timeouts for file size
- Enable more aggressive keep-alive settings

**Queue Backlogs:**
- Monitor worker health
- Check for resource constraints
- Consider increasing worker count

**Memory Issues:**
- Enable disk spillover
- Reduce queue memory limit
- Increase garbage collection frequency

## Implementation Notes

The enhanced queue resilience features are designed to be:

1. **Backward Compatible**: Existing configurations continue to work
2. **Opt-in**: Features can be enabled individually
3. **Performance Conscious**: Minimal overhead when not actively needed
4. **Configurable**: All aspects can be tuned for specific environments

These enhancements make HMAC File Server significantly more robust in handling timeout scenarios while maintaining high performance and reliability.
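
### Appendix: Worked Example of the Timeout and Retry Values

To make the adaptive timeout and retry settings from the `[timeouts]` section more concrete, the following Python sketch shows one plausible way those values could combine. It is not taken from the server's source code. The function names `pick_timeout` and `backoff_delays` are hypothetical, and the sketch assumes binary units (1MB = 1024 * 1024 bytes), treats a size exactly on a boundary as belonging to the larger tier (except that exactly 1GB stays in the "large" tier, since the huge tier is documented as "> 1GB"), and assumes a plain capped exponential backoff without jitter.

```python
MB = 1024 * 1024   # assumption: binary megabytes
GB = 1024 * MB

# Values copied from the [timeouts] section of this guide, in seconds.
SMALL_FILE_TIMEOUT = 60
MEDIUM_FILE_TIMEOUT = 600
LARGE_FILE_TIMEOUT = 3600
HUGE_FILE_TIMEOUT = 7200


def pick_timeout(size_bytes: int) -> int:
    """Return the adaptive timeout (seconds) for a file of the given size.

    Hypothetical helper: the boundary handling is an assumption, because the
    guide only states the ranges, not which side each boundary belongs to.
    """
    if size_bytes < 10 * MB:
        return SMALL_FILE_TIMEOUT
    if size_bytes < 100 * MB:
        return MEDIUM_FILE_TIMEOUT
    if size_bytes <= 1 * GB:
        return LARGE_FILE_TIMEOUT
    return HUGE_FILE_TIMEOUT


def backoff_delays(base: float = 1.0, multiplier: float = 2.0,
                   max_delay: float = 60.0, attempts: int = 5) -> list[float]:
    """Return the delay (seconds) before each retry attempt.

    Defaults mirror retry_base_delay, retry_multiplier, retry_max_delay and
    max_retry_attempts. Each delay is the previous one times the multiplier,
    capped at max_delay.
    """
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(min(delay, max_delay))
        delay *= multiplier
    return delays


if __name__ == "__main__":
    print(pick_timeout(250 * MB))                       # large-file tier
    print(backoff_delays())                             # default profile
    print(backoff_delays(max_delay=120.0, attempts=8))  # mobile profile
```

Under these assumptions, the default settings give retry delays of 1, 2, 4, 8 and 16 seconds, so the 60-second cap is never reached within five attempts. With the mobile profile (`max_retry_attempts = 8`, `retry_max_delay = "120s"`), the cap applies only to the final attempt, which would otherwise wait 128 seconds.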