🔥 Tremora del Terra: ultimate hmac-file-server fix – final push before the drop 💾🔐
@@ -1,268 +0,0 @@
# Queue Resilience Configuration Guide

## Overview

HMAC File Server 3.2 Ultimate Fixed includes advanced queue resilience features designed to handle timeout scenarios gracefully and maintain service availability under various network conditions.

## Enhanced Configuration Sections

### 1. Server-Level Timeout Resilience

```toml
[server]
# Enhanced timeout resilience settings
graceful_shutdown_timeout = "300s" # Time to wait for active uploads during shutdown
request_timeout = "7200s" # Maximum time for any single request (2 hours)
keep_alive_timeout = "300s" # HTTP keep-alive timeout
connection_drain_timeout = "180s" # Time to drain connections during shutdown
upload_stall_timeout = "600s" # Timeout if upload stalls (no data received)
download_stall_timeout = "300s" # Timeout if download stalls
retry_after_timeout = "60s" # Retry-After header when rejecting due to overload
max_concurrent_uploads = 100 # Maximum concurrent upload operations
upload_rate_limit = "10MB/s" # Per-connection upload rate limit
connection_pool_size = 200 # Maximum connection pool size
```

**Key Benefits:**
- **Graceful Degradation**: Server doesn't abruptly terminate active uploads during shutdown
- **Stall Detection**: Automatically detects and handles stalled uploads/downloads
- **Connection Management**: Limits concurrent operations to prevent resource exhaustion
- **Rate Limiting**: Prevents individual connections from overwhelming the server
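
Stall detection of the kind `upload_stall_timeout` describes comes down to remembering when data last arrived. The following is a minimal sketch in Python, not the server's actual implementation: the `StallDetector` class, its method names, and the injectable `clock` parameter are all hypothetical and exist only for illustration.

```python
import time


class StallDetector:
    """Reports a stall once no data has arrived for `timeout_s` seconds.

    Hypothetical helper for illustration; not part of HMAC File Server.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_progress = clock()

    def record_progress(self, nbytes):
        # Only received bytes reset the timer; an empty read is not progress.
        if nbytes > 0:
            self._last_progress = self._clock()

    def is_stalled(self):
        # True once the silence has lasted at least the configured timeout.
        return self._clock() - self._last_progress >= self.timeout_s
```

A transfer loop would call `record_progress(len(chunk))` after every read and abort once `is_stalled()` returns true. Because any received data resets the timer, a slow but steady upload is not mistaken for a stalled one.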

### 2. Enhanced Worker Configuration

```toml
[workers]
# Enhanced queue robustness settings
queue_timeout = "300s" # Maximum time a job can wait in queue
queue_drain_timeout = "120s" # Time to wait for queue drain during shutdown
worker_health_check = "30s" # How often to check worker health
max_queue_retries = 3 # Max retries for failed queue operations
priority_queue_enabled = true # Enable priority queuing for different file sizes
large_file_queue_size = 20 # Separate queue for files > 100MB
small_file_queue_size = 100 # Queue for files < 10MB
queue_backpressure_threshold = 0.8 # Queue usage fraction (0.8 = 80%) at which backpressure starts
circuit_breaker_enabled = true # Enable circuit breaker for queue failures
circuit_breaker_threshold = 10 # Failures before opening circuit
circuit_breaker_timeout = "60s" # Time before retrying after circuit opens
```

**Key Benefits:**
- **Priority Queuing**: Large files don't block small file uploads
- **Health Monitoring**: Workers are continuously monitored for failures
- **Circuit Breaking**: Automatic failure detection and recovery
- **Backpressure Control**: Gradual slowdown instead of hard failures
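
To make the circuit-breaking behaviour concrete, here is a minimal sketch of a breaker driven by the two values `circuit_breaker_threshold` and `circuit_breaker_timeout`. It assumes the threshold counts consecutive failures and that a single trial request is allowed once the timeout has elapsed; the `CircuitBreaker` class and its methods are hypothetical and do not describe the server's internals.

```python
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive failures, retries after `timeout_s`.

    Hypothetical illustration only; not HMAC File Server's implementation.
    """

    def __init__(self, threshold=10, timeout_s=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.timeout_s = timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow(self):
        """Return True if a request may proceed."""
        if self._opened_at is None:
            return True
        # Open circuit: let a trial request through once the timeout has passed.
        return self._clock() - self._opened_at >= self.timeout_s

    def record_success(self):
        # Any success closes the circuit and clears the failure streak.
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.threshold:
            # (Re)open the circuit and restart the timeout window.
            self._opened_at = self._clock()
```

With the defaults shown in the configuration, the tenth consecutive failure opens the circuit, requests are refused for 60 seconds, and a failed trial request starts a new 60-second window.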

### 3. Advanced Queue Resilience

```toml
[queue_resilience]
enabled = true
# Timeout handling
queue_operation_timeout = "30s" # Max time for queue operations
queue_full_behavior = "reject_oldest" # How to handle full queues
spillover_to_disk = true # Use disk when memory queue is full
spillover_directory = "/tmp/hmac-queue-spillover"
spillover_max_size = "1GB" # Max disk spillover size

# Queue persistence and recovery
persistent_queue = true # Persist queue state
queue_recovery_enabled = true # Recover queue state on restart
max_recovery_age = "24h" # Max age of items to recover

# Health monitoring
queue_health_check_interval = "15s" # Queue health check frequency
dead_letter_queue_enabled = true # Failed items queue
dead_letter_max_retries = 5 # Max retries before dead letter
dead_letter_retention = "7d" # Dead letter retention time

# Load balancing and prioritization
priority_levels = 3 # Number of priority levels
priority_aging_enabled = true # Age items to higher priority
priority_aging_threshold = "300s" # Time before aging up
load_balancing_strategy = "least_connections"

# Memory management
queue_memory_limit = "500MB" # Max memory for queues
queue_gc_interval = "60s" # Garbage collection interval
emergency_mode_threshold = 0.95 # Emergency mode trigger
```

**Key Benefits:**
- **Disk Spillover**: Queued uploads spill to disk (up to `spillover_max_size`) instead of being lost when the in-memory queue fills
- **Queue Recovery**: Resume operations after server restarts
- **Dead Letter Queuing**: Handle persistently failing uploads
- **Priority Aging**: Prevent starvation of lower-priority items
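
Priority aging can be sketched as a small calculation over an item's waiting time. The example below assumes that priority 0 is the highest of the `priority_levels` levels and that an item moves up one level for every full `priority_aging_threshold` it has waited; both points are assumptions, and `effective_priority` is a hypothetical helper rather than a server function.

```python
def effective_priority(base_priority, waited_s, aging_threshold_s=300.0, levels=3):
    """Return an item's priority after aging (0 = highest, by assumption).

    The item is promoted one level per full `aging_threshold_s` spent waiting
    and never rises above level 0.
    """
    if not 0 <= base_priority < levels:
        raise ValueError("base_priority must be in range(levels)")
    if waited_s < 0:
        raise ValueError("waited_s must be non-negative")
    promotions = int(waited_s // aging_threshold_s)
    return max(0, base_priority - promotions)


# A lowest-priority item (level 2) reaches the top level after 600 seconds.
print(effective_priority(2, 0))    # 2
print(effective_priority(2, 300))  # 1
print(effective_priority(2, 600))  # 0
```

Under these assumptions no item waits longer than `(levels - 1) * priority_aging_threshold` before competing at the highest level, which is what prevents starvation.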

### 4. Comprehensive Timeout Configuration

```toml
[timeouts]
# Basic timeouts (existing)
readtimeout = "4800s"
writetimeout = "4800s"
idletimeout = "4800s"

# Enhanced timeout resilience
handshake_timeout = "30s" # TLS handshake timeout
header_timeout = "60s" # HTTP header read timeout
body_timeout = "7200s" # HTTP body read timeout
dial_timeout = "30s" # Connection dial timeout
keep_alive_probe_interval = "30s" # TCP keep-alive probe interval
keep_alive_probe_count = 9 # Keep-alive probes before giving up

# Adaptive timeouts based on file size
small_file_timeout = "60s" # Files < 10MB
medium_file_timeout = "600s" # Files 10MB-100MB
large_file_timeout = "3600s" # Files 100MB-1GB
huge_file_timeout = "7200s" # Files > 1GB

# Retry and backoff settings
retry_base_delay = "1s" # Base delay between retries
retry_max_delay = "60s" # Maximum delay between retries
retry_multiplier = 2.0 # Exponential backoff multiplier
max_retry_attempts = 5 # Maximum retry attempts
```
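
The adaptive timeouts above map a file size to one of four limits. A minimal sketch of that lookup follows; it assumes binary units (1MB = 1024 * 1024 bytes) and assigns the exact boundary sizes to the larger class below 1GB, neither of which the configuration comments specify. The function `timeout_for_size` is hypothetical.

```python
MB = 1024 * 1024  # assumption: binary megabytes
GB = 1024 * MB


def timeout_for_size(size_bytes):
    """Return the timeout in seconds for a file of `size_bytes` bytes.

    Values mirror the [timeouts] example; boundary handling is an assumption.
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < 10 * MB:
        return 60      # small_file_timeout
    if size_bytes < 100 * MB:
        return 600     # medium_file_timeout
    if size_bytes <= GB:
        return 3600    # large_file_timeout
    return 7200        # huge_file_timeout


print(timeout_for_size(5 * MB))    # 60
print(timeout_for_size(50 * MB))   # 600
print(timeout_for_size(500 * MB))  # 3600
print(timeout_for_size(2 * GB))    # 7200
```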

**Key Benefits:**
- **Adaptive Timeouts**: Different timeouts based on file size
- **Connection Resilience**: TCP keep-alive prevents silent failures
- **Exponential Backoff**: Intelligent retry timing reduces server load
- **Granular Control**: Fine-tuned timeouts for different operations
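
The retry settings define a capped exponential schedule: each delay is the base delay multiplied by the multiplier once per previous attempt, limited by the maximum delay. The sketch below computes that schedule under the assumption that no jitter is applied, which the guide does not state; `backoff_delays` is a hypothetical helper.

```python
def backoff_delays(base_s=1.0, multiplier=2.0, max_delay_s=60.0, attempts=5):
    """Return the delay in seconds before each retry attempt.

    delay(i) = min(base_s * multiplier**i, max_delay_s), for i = 0..attempts-1.
    Jitter is deliberately left out of this sketch.
    """
    return [min(base_s * multiplier ** i, max_delay_s) for i in range(attempts)]


# Defaults from the [timeouts] example.
print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0]

# Values suggested for mobile/unreliable networks: 8 attempts, 120s cap.
print(backoff_delays(max_delay_s=120.0, attempts=8))
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 120.0]
```

With the default five attempts the cap of 60 seconds is never reached; it only comes into play from the seventh attempt onward.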

## Timeout Scenario Handling

### 1. Network Interruption Scenarios

**Mobile Network Switching:**
- Keep-alive probes detect network changes
- Chunked uploads can resume after network restoration
- Upload sessions persist through network interruptions

**Slow Network Conditions:**
- Adaptive timeouts prevent premature termination
- Rate limiting prevents network saturation
- Progress monitoring detects actual stalls vs. slow transfers

### 2. Server Overload Scenarios

**High Load Conditions:**
- Circuit breaker prevents cascade failures
- Backpressure slows down new requests gracefully
- Priority queuing ensures critical uploads continue

**Memory Pressure:**
- Disk spillover prevents memory exhaustion
- Queue garbage collection manages memory usage
- Emergency mode provides last-resort protection
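
The overload handling above can be read as a progression from normal operation through backpressure to emergency mode. The sketch below classifies a usage level against `queue_backpressure_threshold` and `emergency_mode_threshold`. It assumes both thresholds are fractions of the same capacity measure, which the configuration does not confirm, and the state names and the `queue_state` function are hypothetical.

```python
def queue_state(used, capacity, backpressure_threshold=0.8, emergency_threshold=0.95):
    """Classify usage as "normal", "backpressure", or "emergency".

    Assumes both thresholds apply to the same used/capacity ratio.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    usage = used / capacity
    if usage >= emergency_threshold:
        return "emergency"
    if usage >= backpressure_threshold:
        return "backpressure"
    return "normal"


print(queue_state(50, 100))  # normal
print(queue_state(85, 100))  # backpressure
print(queue_state(96, 100))  # emergency
```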

### 3. Application Restart Scenarios

**Graceful Shutdown:**
- Active uploads get time to complete
- Queue state is persisted before shutdown
- Connections are drained properly

**Recovery After Restart:**
- Queue state is restored from persistence
- Upload sessions are recovered
- Dead letter items are reprocessed

## Monitoring and Observability

### Queue Health Metrics

The enhanced configuration provides comprehensive metrics:

- **Queue Length**: Current items in each queue
- **Queue Processing Time**: Time items spend in queue
- **Worker Health**: Individual worker status and performance
- **Circuit Breaker State**: Open/closed status and failure counts
- **Spillover Usage**: Disk spillover utilization
- **Dead Letter Queue**: Failed item counts and reasons

### Log Messages

Enhanced logging provides visibility into queue operations:

```
INFO: Queue backpressure activated (80% full)
WARN: Circuit breaker opened for upload queue (10 consecutive failures)
INFO: Spillover activated: 50MB written to disk
ERROR: Dead letter queue: Upload failed after 5 retries
INFO: Queue recovery: Restored 23 items from persistence
```

## Best Practices

### 1. Configuration Tuning

**For High-Volume Servers:**
```toml
[server]
max_concurrent_uploads = 200

[workers]
uploadqueuesize = 200
large_file_queue_size = 50
small_file_queue_size = 500

[queue_resilience]
queue_memory_limit = "1GB"
```

**For Memory-Constrained Environments:**
```toml
[workers]
uploadqueuesize = 50

[queue_resilience]
spillover_to_disk = true
queue_memory_limit = "200MB"
emergency_mode_threshold = 0.85
```

**For Mobile/Unreliable Networks:**
```toml
[server]
upload_stall_timeout = "300s"

[timeouts]
keep_alive_probe_interval = "15s"
max_retry_attempts = 8
retry_max_delay = "120s"
```

### 2. Monitoring Setup

**Essential Metrics to Monitor:**
- Queue length trends
- Worker health status
- Circuit breaker activations
- Spillover usage
- Dead letter queue growth

**Alert Thresholds:**
- Queue length > 80% capacity
- Circuit breaker open for > 5 minutes
- Dead letter queue growth > 10 items/hour
- Spillover usage > 50% of limit
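
The alert thresholds above translate directly into a small check that a monitoring script could run against collected metrics. This is a sketch only: the function name, its parameters, and the alert labels are hypothetical, and how the metrics are obtained from the server is outside its scope.

```python
def active_alerts(queue_len, queue_capacity, breaker_open_s,
                  dead_letters_per_hour, spillover_used, spillover_limit):
    """Return the names of the alert thresholds that are currently exceeded."""
    alerts = []
    if queue_len > 0.8 * queue_capacity:        # queue length > 80% capacity
        alerts.append("queue_length")
    if breaker_open_s > 5 * 60:                 # circuit open for > 5 minutes
        alerts.append("circuit_breaker_open")
    if dead_letters_per_hour > 10:              # growth > 10 items/hour
        alerts.append("dead_letter_growth")
    if spillover_used > 0.5 * spillover_limit:  # spillover > 50% of limit
        alerts.append("spillover_usage")
    return alerts


# Sizes may be in any unit as long as used and limit share it (here: MB).
print(active_alerts(90, 100, 0, 2, 100, 1024))  # ['queue_length']
```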

### 3. Troubleshooting

**Common Issues and Solutions:**

**Frequent Timeouts:**
- Check network stability
- Increase the adaptive timeout for the affected file-size class (for example `large_file_timeout`)
- Use more aggressive keep-alive settings (a shorter `keep_alive_probe_interval`)

**Queue Backlogs:**
- Monitor worker health
- Check for resource constraints
- Consider increasing worker count

**Memory Issues:**
- Enable disk spillover (`spillover_to_disk = true`)
- Reduce `queue_memory_limit`
- Increase garbage collection frequency (a shorter `queue_gc_interval`)

## Implementation Notes

The enhanced queue resilience features are designed to be:

1. **Backward Compatible**: Existing configurations continue to work
2. **Opt-in**: Features can be enabled individually
3. **Performance Conscious**: Minimal overhead when not actively needed
4. **Configurable**: All aspects can be tuned for specific environments

These enhancements make HMAC File Server significantly more robust in handling timeout scenarios while maintaining high performance and reliability.