# Queue Resilience Configuration Guide
## Overview
HMAC File Server 3.2 Ultimate Fixed includes advanced queue resilience features designed to handle timeout scenarios gracefully and maintain service availability under various network conditions.
## Enhanced Configuration Sections
### 1. Server-Level Timeout Resilience
```toml
[server]
# Enhanced timeout resilience settings
graceful_shutdown_timeout = "300s" # Time to wait for active uploads during shutdown
request_timeout = "7200s" # Maximum time for any single request (2 hours)
keep_alive_timeout = "300s" # HTTP keep-alive timeout
connection_drain_timeout = "180s" # Time to drain connections during shutdown
upload_stall_timeout = "600s" # Timeout if upload stalls (no data received)
download_stall_timeout = "300s" # Timeout if download stalls
retry_after_timeout = "60s" # Retry-After header when rejecting due to overload
max_concurrent_uploads = 100 # Maximum concurrent upload operations
upload_rate_limit = "10MB/s" # Per-connection upload rate limit
connection_pool_size = 200 # Maximum connection pool size
```
**Key Benefits:**
- **Graceful Degradation**: Server doesn't abruptly terminate active uploads during shutdown
- **Stall Detection**: Automatically detects and handles stalled uploads/downloads
- **Connection Management**: Limits concurrent operations to prevent resource exhaustion
- **Rate Limiting**: Prevents individual connections from overwhelming the server
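
As an illustration of how `max_concurrent_uploads` and `retry_after_timeout` could interact, here is a minimal Python sketch. It is not the server's actual implementation: the `UploadLimiter` class is hypothetical, and the 503 status code is an assumption (the guide only says a Retry-After header is sent when rejecting due to overload).

```python
import threading


class UploadLimiter:
    """Caps concurrent uploads; excess callers get a Retry-After hint (hypothetical helper)."""

    def __init__(self, max_concurrent_uploads=100, retry_after_seconds=60):
        self._slots = threading.BoundedSemaphore(max_concurrent_uploads)
        self.retry_after_seconds = retry_after_seconds

    def try_acquire(self):
        """Return (status, headers). A 200 means a slot was taken and must be released."""
        if self._slots.acquire(blocking=False):
            return 200, {}
        # Overloaded: reject immediately and tell the client when to retry.
        # The 503 status is an assumption, not taken from the guide.
        return 503, {"Retry-After": str(self.retry_after_seconds)}

    def release(self):
        """Give the slot back once the upload finishes or fails."""
        self._slots.release()
```

Rejecting immediately rather than blocking keeps overloaded requests from piling up behind the limit.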
### 2. Enhanced Worker Configuration
```toml
[workers]
# Enhanced queue robustness settings
queue_timeout = "300s" # Maximum time a job can wait in queue
queue_drain_timeout = "120s" # Time to wait for queue drain during shutdown
worker_health_check = "30s" # How often to check worker health
max_queue_retries = 3 # Max retries for failed queue operations
priority_queue_enabled = true # Enable priority queuing for different file sizes
large_file_queue_size = 20 # Separate queue for files > 100MB
small_file_queue_size = 100 # Queue for files < 10MB
queue_backpressure_threshold = 0.8 # Queue usage fraction (0.8 = 80%) at which backpressure starts
circuit_breaker_enabled = true # Enable circuit breaker for queue failures
circuit_breaker_threshold = 10 # Failures before opening circuit
circuit_breaker_timeout = "60s" # Time before retrying after circuit opens
```
**Key Benefits:**
- **Priority Queuing**: Large files don't block small file uploads
- **Health Monitoring**: Workers are continuously monitored for failures
- **Circuit Breaking**: Automatic failure detection and recovery
- **Backpressure Control**: Gradual slowdown instead of hard failures
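
The circuit breaker settings follow the usual closed/open/half-open pattern. The sketch below shows one way `circuit_breaker_threshold` and `circuit_breaker_timeout` might drive that state machine. The `CircuitBreaker` class and its method names are hypothetical, and the injected `clock` exists only to make the example testable.

```python
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows a trial call after `timeout` seconds."""

    def __init__(self, threshold=10, timeout=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.timeout = timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow(self):
        """Return True if a queue operation may be attempted now."""
        if self._opened_at is None:
            return True
        # Open: permit a trial call only once the timeout has elapsed (half-open).
        return self._clock() - self._opened_at >= self.timeout

    def record_success(self):
        """A success closes the circuit and resets the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        """Count a failure; at or past the threshold, (re)open the circuit."""
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = self._clock()
```

A failed trial call in the half-open state reopens the circuit for another full timeout, so a still-broken queue is probed at most once per `circuit_breaker_timeout`.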
### 3. Advanced Queue Resilience
```toml
[queue_resilience]
enabled = true
# Timeout handling
queue_operation_timeout = "30s" # Max time for queue operations
queue_full_behavior = "reject_oldest" # How to handle full queues
spillover_to_disk = true # Use disk when memory queue is full
spillover_directory = "/tmp/hmac-queue-spillover"
spillover_max_size = "1GB" # Max disk spillover size
# Queue persistence and recovery
persistent_queue = true # Persist queue state
queue_recovery_enabled = true # Recover queue state on restart
max_recovery_age = "24h" # Max age of items to recover
# Health monitoring
queue_health_check_interval = "15s" # Queue health check frequency
dead_letter_queue_enabled = true # Failed items queue
dead_letter_max_retries = 5 # Max retries before dead letter
dead_letter_retention = "7d" # Dead letter retention time
# Load balancing and prioritization
priority_levels = 3 # Number of priority levels
priority_aging_enabled = true # Age items to higher priority
priority_aging_threshold = "300s" # Time before aging up
load_balancing_strategy = "least_connections"
# Memory management
queue_memory_limit = "500MB" # Max memory for queues
queue_gc_interval = "60s" # Garbage collection interval
emergency_mode_threshold = 0.95 # Emergency mode trigger
```
**Key Benefits:**
- **Disk Spillover**: Queued uploads overflow to disk (up to `spillover_max_size`) instead of being dropped when the memory queue fills
- **Queue Recovery**: Resume operations after server restarts
- **Dead Letter Queuing**: Handle persistently failing uploads
- **Priority Aging**: Prevent starvation of lower-priority items
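
Priority aging can be pictured as promoting an item one level each time it has waited a full `priority_aging_threshold`. The following sketch assumes level 0 is the highest priority and that items are plain dictionaries with `priority` and `enqueued_at` fields. Both are illustrative choices, not details confirmed by the server.

```python
def effective_priority(base_priority, enqueued_at, now, aging_threshold=300.0):
    """Promote an item one level per full aging_threshold waited (0 = highest, assumed)."""
    waited = max(0.0, now - enqueued_at)
    promotions = int(waited // aging_threshold)
    return max(0, base_priority - promotions)


def next_item(items, now, aging_threshold=300.0):
    """Pick the item with the best effective priority; ties go to the oldest item."""
    return min(
        items,
        key=lambda it: (
            effective_priority(it["priority"], it["enqueued_at"], now, aging_threshold),
            it["enqueued_at"],
        ),
    )
```

With three levels and a 300 s threshold, a lowest-priority item reaches the top level after 600 s of waiting, which bounds how long it can be starved by newer high-priority work.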
### 4. Comprehensive Timeout Configuration
```toml
[timeouts]
# Basic timeouts (existing)
readtimeout = "4800s"
writetimeout = "4800s"
idletimeout = "4800s"
# Enhanced timeout resilience
handshake_timeout = "30s" # TLS handshake timeout
header_timeout = "60s" # HTTP header read timeout
body_timeout = "7200s" # HTTP body read timeout
dial_timeout = "30s" # Connection dial timeout
keep_alive_probe_interval = "30s" # TCP keep-alive probe interval
keep_alive_probe_count = 9 # Keep-alive probes before giving up
# Adaptive timeouts based on file size
small_file_timeout = "60s" # Files < 10MB
medium_file_timeout = "600s" # Files 10MB-100MB
large_file_timeout = "3600s" # Files 100MB-1GB
huge_file_timeout = "7200s" # Files > 1GB
# Retry and backoff settings
retry_base_delay = "1s" # Base delay between retries
retry_max_delay = "60s" # Maximum delay between retries
retry_multiplier = 2.0 # Exponential backoff multiplier
max_retry_attempts = 5 # Maximum retry attempts
```
**Key Benefits:**
- **Adaptive Timeouts**: Different timeouts based on file size
- **Connection Resilience**: TCP keep-alive probes detect dead connections instead of letting them hang silently
- **Exponential Backoff**: Intelligent retry timing reduces server load
- **Granular Control**: Fine-tuned timeouts for different operations
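
To make the size tiers and retry settings concrete, here is a small sketch of how a timeout could be selected and how the backoff schedule works out. The function names are hypothetical, the handling of exact tier boundaries is an assumption (the guide does not say which tier a file of exactly 10MB falls into), and MB is taken as 1024 × 1024 bytes.

```python
MB = 1024 * 1024
GB = 1024 * MB


def timeout_for_size(size_bytes):
    """Return the timeout in seconds for a file, using the tiers from [timeouts]."""
    if size_bytes < 10 * MB:
        return 60      # small_file_timeout
    if size_bytes < 100 * MB:
        return 600     # medium_file_timeout
    if size_bytes < 1 * GB:
        return 3600    # large_file_timeout
    return 7200        # huge_file_timeout


def backoff_delays(base=1.0, multiplier=2.0, max_delay=60.0, attempts=5):
    """Exponential backoff: base * multiplier**i, capped at max_delay."""
    return [min(max_delay, base * multiplier ** i) for i in range(attempts)]
```

With the defaults above, `backoff_delays()` yields 1, 2, 4, 8 and 16 seconds. The 60 s cap only takes effect from the seventh attempt onward, so it matters when `max_retry_attempts` is raised. This sketch adds no jitter; whether the server randomizes delays is not stated in the guide.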
## Timeout Scenario Handling
### 1. Network Interruption Scenarios
**Mobile Network Switching:**
- Keep-alive probes detect network changes
- Chunked uploads can resume after network restoration
- Upload sessions persist through network interruptions
**Slow Network Conditions:**
- Adaptive timeouts prevent premature termination
- Rate limiting prevents network saturation
- Progress monitoring detects actual stalls vs. slow transfers
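
The difference between a stalled transfer and a slow one is whether any bytes have arrived within `upload_stall_timeout`. A rough sketch of that check follows. The `StallDetector` class is hypothetical, and the clock is injectable purely for testing.

```python
import time


class StallDetector:
    """Flags a transfer as stalled when no bytes arrive for stall_timeout seconds."""

    def __init__(self, stall_timeout=600.0, clock=time.monotonic):
        self.stall_timeout = stall_timeout
        self._clock = clock
        self._last_progress = clock()

    def on_bytes(self, count):
        """Call after each read; only a non-empty read counts as progress."""
        if count > 0:
            self._last_progress = self._clock()

    def is_stalled(self):
        """True once the time since the last progress reaches the stall timeout."""
        return self._clock() - self._last_progress >= self.stall_timeout
```

A slow upload that delivers even a few bytes per minute keeps resetting the timer, so only the overall request and body timeouts limit it.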
### 2. Server Overload Scenarios
**High Load Conditions:**
- Circuit breaker prevents cascade failures
- Backpressure slows down new requests gracefully
- Priority queuing keeps small uploads moving while large files wait in their own queue
**Memory Pressure:**
- Disk spillover prevents memory exhaustion
- Queue garbage collection manages memory usage
- Emergency mode provides last-resort protection
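
One way to picture gradual backpressure is a delay that grows linearly between the backpressure threshold and the point where requests are rejected outright. The sketch below is illustrative only. It applies both thresholds to a single queue's fill level, which is an assumption (the guide lists `emergency_mode_threshold` under memory management), and `max_delay` is a hypothetical parameter.

```python
def admission_decision(queue_len, capacity, backpressure_threshold=0.8,
                       emergency_threshold=0.95, max_delay=5.0):
    """Return (action, delay_seconds) for a new request given current queue usage."""
    usage = queue_len / capacity
    if usage >= emergency_threshold:
        # Last-resort protection: stop accepting new work.
        return "reject", 0.0
    if usage < backpressure_threshold:
        return "accept", 0.0
    # Between the thresholds: scale the delay linearly from 0 up to max_delay.
    span = emergency_threshold - backpressure_threshold
    delay = max_delay * (usage - backpressure_threshold) / span
    return "delay", delay
```

For example, a queue at 90% usage sits two thirds of the way between 0.8 and 0.95, so new requests would be delayed by about 3.3 seconds under these defaults.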
### 3. Application Restart Scenarios
**Graceful Shutdown:**
- Active uploads get time to complete
- Queue state is persisted before shutdown
- Connections are drained properly
**Recovery After Restart:**
- Queue state is restored from persistence
- Upload sessions are recovered
- Dead letter items are reprocessed
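
Persistence and recovery can be sketched as writing the queue to disk before shutdown and, on restart, reloading only the items younger than `max_recovery_age`. The JSON file format, the function names, and the `enqueued_at` field below are assumptions made for illustration; the guide does not describe the on-disk format.

```python
import json
import os
import tempfile


def persist_queue(path, items):
    """Write queue state atomically: write a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(items, f)
    os.replace(tmp_path, path)


def recover_queue(path, now, max_recovery_age=24 * 3600):
    """Load persisted items, dropping any older than max_recovery_age seconds."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        items = json.load(f)
    return [it for it in items if now - it["enqueued_at"] <= max_recovery_age]
```

Writing to a temporary file and renaming it means a crash during shutdown leaves either the old state or the new state on disk, never a half-written file.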
## Monitoring and Observability
### Queue Health Metrics
The enhanced configuration provides comprehensive metrics:
- **Queue Length**: Current items in each queue
- **Queue Processing Time**: Time items spend in queue
- **Worker Health**: Individual worker status and performance
- **Circuit Breaker State**: Open/closed status and failure counts
- **Spillover Usage**: Disk spillover utilization
- **Dead Letter Queue**: Failed item counts and reasons
### Log Messages
Enhanced logging provides visibility into queue operations:
```
INFO: Queue backpressure activated (80% full)
WARN: Circuit breaker opened for upload queue (10 consecutive failures)
INFO: Spillover activated: 50MB written to disk
ERROR: Dead letter queue: Upload failed after 5 retries
INFO: Queue recovery: Restored 23 items from persistence
```
## Best Practices
### 1. Configuration Tuning
**For High-Volume Servers:**
```toml
uploadqueuesize = 200
large_file_queue_size = 50     # in [workers]
small_file_queue_size = 500    # in [workers]
max_concurrent_uploads = 200   # in [server]
queue_memory_limit = "1GB"     # in [queue_resilience]
```
**For Memory-Constrained Environments:**
```toml
uploadqueuesize = 50
spillover_to_disk = true          # in [queue_resilience]
queue_memory_limit = "200MB"      # in [queue_resilience]
emergency_mode_threshold = 0.85   # in [queue_resilience]
```
**For Mobile/Unreliable Networks:**
```toml
keep_alive_probe_interval = "15s"  # in [timeouts]
upload_stall_timeout = "300s"      # in [server]
max_retry_attempts = 8             # in [timeouts]
retry_max_delay = "120s"           # in [timeouts]
```
### 2. Monitoring Setup
**Essential Metrics to Monitor:**
- Queue length trends
- Worker health status
- Circuit breaker activations
- Spillover usage
- Dead letter queue growth
**Alert Thresholds:**
- Queue length > 80% capacity
- Circuit breaker open for > 5 minutes
- Dead letter queue growth > 10 items/hour
- Spillover usage > 50% of limit
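
The alert thresholds above translate directly into a small check that a monitoring job could run against collected metrics. The metric names in this sketch are hypothetical; the guide does not specify how the server exposes these values.

```python
def evaluate_alerts(m):
    """Return the names of alerts that fire for a metrics snapshot (dict of numbers)."""
    alerts = []
    if m["queue_length"] > 0.8 * m["queue_capacity"]:
        alerts.append("queue_length")
    if m["circuit_open_seconds"] > 5 * 60:
        alerts.append("circuit_breaker")
    if m["dead_letter_added_last_hour"] > 10:
        alerts.append("dead_letter_growth")
    if m["spillover_bytes"] > 0.5 * m["spillover_limit_bytes"]:
        alerts.append("spillover")
    return alerts
```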
### 3. Troubleshooting
**Common Issues and Solutions:**
**Frequent Timeouts:**
- Check network stability
- Increase adaptive timeouts for file size
- Enable more aggressive keep-alive settings
**Queue Backlogs:**
- Monitor worker health
- Check for resource constraints
- Consider increasing worker count
**Memory Issues:**
- Enable disk spillover
- Reduce queue memory limit
- Increase garbage collection frequency
## Implementation Notes
The enhanced queue resilience features are designed to be:
1. **Backward Compatible**: Existing configurations continue to work
2. **Opt-in**: Features can be enabled individually
3. **Performance Conscious**: Minimal overhead when not actively needed
4. **Configurable**: All aspects can be tuned for specific environments
Together, these settings let HMAC File Server degrade gradually under timeouts, overload, and restarts instead of failing abruptly.