hmac-file-server/QUEUE_RESILIENCE_GUIDE.md

# Queue Resilience Configuration Guide

## Overview

HMAC File Server 3.2 Ultimate Fixed includes advanced queue resilience features designed to handle timeout scenarios gracefully and maintain service availability under various network conditions.

## Enhanced Configuration Sections

### 1. Server-Level Timeout Resilience

```toml
[server]
# Enhanced timeout resilience settings
graceful_shutdown_timeout = "300s"    # Time to wait for active uploads during shutdown
request_timeout = "7200s"             # Maximum time for any single request (2 hours)
keep_alive_timeout = "300s"           # HTTP keep-alive timeout
connection_drain_timeout = "180s"     # Time to drain connections during shutdown
upload_stall_timeout = "600s"         # Timeout if upload stalls (no data received)
download_stall_timeout = "300s"       # Timeout if download stalls
retry_after_timeout = "60s"           # Retry-After header when rejecting due to overload
max_concurrent_uploads = 100          # Maximum concurrent upload operations
upload_rate_limit = "10MB/s"          # Per-connection upload rate limit
connection_pool_size = 200            # Maximum connection pool size
```

**Key Benefits:**
- **Graceful Degradation**: Server doesn't abruptly terminate active uploads during shutdown
- **Stall Detection**: Automatically detects and handles stalled uploads/downloads
- **Connection Management**: Limits concurrent operations to prevent resource exhaustion
- **Rate Limiting**: Prevents individual connections from overwhelming the server

### 2. Enhanced Worker Configuration

```toml
[workers]
# Enhanced queue robustness settings
queue_timeout = "300s"               # Maximum time a job can wait in queue
queue_drain_timeout = "120s"         # Time to wait for queue drain during shutdown
worker_health_check = "30s"          # How often to check worker health
max_queue_retries = 3                # Max retries for failed queue operations
priority_queue_enabled = true        # Enable priority queuing for different file sizes
large_file_queue_size = 20           # Separate queue for files > 100MB
small_file_queue_size = 100          # Queue for files < 10MB
queue_backpressure_threshold = 0.8   # Queue usage % to start backpressure
circuit_breaker_enabled = true       # Enable circuit breaker for queue failures
circuit_breaker_threshold = 10       # Failures before opening circuit
circuit_breaker_timeout = "60s"      # Time before retrying after circuit opens
```

**Key Benefits:**
- **Priority Queuing**: Large files don't block small file uploads
- **Health Monitoring**: Workers are continuously monitored for failures
- **Circuit Breaking**: Automatic failure detection and recovery
- **Backpressure Control**: Gradual slowdown instead of hard failures

### 3. Advanced Queue Resilience

```toml
[queue_resilience]
enabled = true
# Timeout handling
queue_operation_timeout = "30s"      # Max time for queue operations
queue_full_behavior = "reject_oldest" # How to handle full queues
spillover_to_disk = true             # Use disk when memory queue is full
spillover_directory = "/tmp/hmac-queue-spillover"
spillover_max_size = "1GB"           # Max disk spillover size

# Queue persistence and recovery
persistent_queue = true              # Persist queue state
queue_recovery_enabled = true        # Recover queue state on restart
max_recovery_age = "24h"             # Max age of items to recover

# Health monitoring
queue_health_check_interval = "15s"  # Queue health check frequency
dead_letter_queue_enabled = true     # Failed items queue
dead_letter_max_retries = 5          # Max retries before dead letter
dead_letter_retention = "7d"         # Dead letter retention time

# Load balancing and prioritization
priority_levels = 3                  # Number of priority levels
priority_aging_enabled = true        # Age items to higher priority
priority_aging_threshold = "300s"    # Time before aging up
load_balancing_strategy = "least_connections"

# Memory management
queue_memory_limit = "500MB"         # Max memory for queues
queue_gc_interval = "60s"            # Garbage collection interval
emergency_mode_threshold = 0.95      # Emergency mode trigger
```

**Key Benefits:**
- **Disk Spillover**: Never lose uploads due to memory constraints
- **Queue Recovery**: Resume operations after server restarts
- **Dead Letter Queuing**: Handle persistently failing uploads
- **Priority Aging**: Prevent starvation of lower-priority items

### 4. Comprehensive Timeout Configuration

```toml
[timeouts]
# Basic timeouts (existing)
readtimeout = "4800s"
writetimeout = "4800s"
idletimeout = "4800s"

# Enhanced timeout resilience
handshake_timeout = "30s"           # TLS handshake timeout
header_timeout = "60s"              # HTTP header read timeout
body_timeout = "7200s"              # HTTP body read timeout
dial_timeout = "30s"                # Connection dial timeout
keep_alive_probe_interval = "30s"   # TCP keep-alive probe interval
keep_alive_probe_count = 9          # Keep-alive probes before giving up

# Adaptive timeouts based on file size
small_file_timeout = "60s"          # Files < 10MB
medium_file_timeout = "600s"        # Files 10MB-100MB
large_file_timeout = "3600s"        # Files 100MB-1GB
huge_file_timeout = "7200s"         # Files > 1GB

# Retry and backoff settings
retry_base_delay = "1s"             # Base delay between retries
retry_max_delay = "60s"             # Maximum delay between retries
retry_multiplier = 2.0              # Exponential backoff multiplier
max_retry_attempts = 5              # Maximum retry attempts
```

**Key Benefits:**
- **Adaptive Timeouts**: Different timeouts based on file size
- **Connection Resilience**: TCP keep-alive prevents silent failures
- **Exponential Backoff**: Intelligent retry timing reduces server load
- **Granular Control**: Fine-tuned timeouts for different operations

## Timeout Scenario Handling

### 1. Network Interruption Scenarios

**Mobile Network Switching:**
- Keep-alive probes detect network changes
- Chunked uploads can resume after network restoration
- Upload sessions persist through network interruptions

**Slow Network Conditions:**
- Adaptive timeouts prevent premature termination
- Rate limiting prevents network saturation
- Progress monitoring detects actual stalls vs. slow transfers

### 2. Server Overload Scenarios

**High Load Conditions:**
- Circuit breaker prevents cascade failures
- Backpressure slows down new requests gracefully
- Priority queuing ensures critical uploads continue

**Memory Pressure:**
- Disk spillover prevents memory exhaustion
- Queue garbage collection manages memory usage
- Emergency mode provides last-resort protection

### 3. Application Restart Scenarios

**Graceful Shutdown:**
- Active uploads get time to complete
- Queue state is persisted before shutdown
- Connections are drained properly

**Recovery After Restart:**
- Queue state is restored from persistence
- Upload sessions are recovered
- Dead letter items are reprocessed

## Monitoring and Observability

### Queue Health Metrics

The enhanced configuration provides comprehensive metrics:

- **Queue Length**: Current items in each queue
- **Queue Processing Time**: Time items spend in queue
- **Worker Health**: Individual worker status and performance
- **Circuit Breaker State**: Open/closed status and failure counts
- **Spillover Usage**: Disk spillover utilization
- **Dead Letter Queue**: Failed item counts and reasons

### Log Messages

Enhanced logging provides visibility into queue operations:

```
INFO: Queue backpressure activated (80% full)
WARN: Circuit breaker opened for upload queue (10 consecutive failures)
INFO: Spillover activated: 50MB written to disk
ERROR: Dead letter queue: Upload failed after 5 retries
INFO: Queue recovery: Restored 23 items from persistence
```

## Best Practices

### 1. Configuration Tuning

**For High-Volume Servers:**
```toml
uploadqueuesize = 200
large_file_queue_size = 50
small_file_queue_size = 500
max_concurrent_uploads = 200
queue_memory_limit = "1GB"
```

**For Memory-Constrained Environments:**
```toml
uploadqueuesize = 50
spillover_to_disk = true
queue_memory_limit = "200MB"
emergency_mode_threshold = 0.85
```

**For Mobile/Unreliable Networks:**
```toml
keep_alive_probe_interval = "15s"
upload_stall_timeout = "300s"
max_retry_attempts = 8
retry_max_delay = "120s"
```

### 2. Monitoring Setup

**Essential Metrics to Monitor:**
- Queue length trends
- Worker health status
- Circuit breaker activations
- Spillover usage
- Dead letter queue growth

**Alert Thresholds:**
- Queue length > 80% capacity
- Circuit breaker open for > 5 minutes
- Dead letter queue growth > 10 items/hour
- Spillover usage > 50% of limit

### 3. Troubleshooting

**Common Issues and Solutions:**

**Frequent Timeouts:**
- Check network stability
- Increase adaptive timeouts for file size
- Enable more aggressive keep-alive settings

**Queue Backlogs:**
- Monitor worker health
- Check for resource constraints
- Consider increasing worker count

**Memory Issues:**
- Enable disk spillover
- Reduce queue memory limit
- Increase garbage collection frequency

## Implementation Notes

The enhanced queue resilience features are designed to be:

1. **Backward Compatible**: Existing configurations continue to work
2. **Opt-in**: Features can be enabled individually
3. **Performance Conscious**: Minimal overhead when not actively needed
4. **Configurable**: All aspects can be tuned for specific environments

These enhancements make HMAC File Server significantly more robust in handling timeout scenarios while maintaining high performance and reliability.