fix: phased restore for BLOB databases to prevent lock exhaustion OOM

- Auto-detect large objects in pg_restore dumps - Split restore into pre-data, data, post-data phases - Each phase commits and releases locks before next - Prevents 'out of shared memory' / max_locks_per_transaction errors - Updated error hints with better guidance for lock exhaustion
fix: streaming tar verification for large cluster archives (100GB+)
2026-01-14 08:15:53 +01:00 · 2026-01-13 14:40:18 +01:00 · 2026-01-13 10:02:35 +01:00 · 2026-01-13 08:22:20 +01:00
8 changed files with 448 additions and 44 deletions
--- a/PITR.md
+++ b/PITR.md
@@ -584,6 +584,100 @@ Document your recovery procedure:
 9. Create new base backup
 ```
 ## Large Database Support (600+ GB)
 For databases larger than 600 GB, PITR is the **recommended approach** over full dump/restore.
 ### Why PITR Works Better for Large DBs
 | Approach | 600 GB Database | Recovery Time (RTO) |
 |----------|-----------------|---------------------|
 | Full pg_dump/restore | Hours to dump, hours to restore | 4-12+ hours |
 | PITR (base + WAL) | Incremental WAL only | 30 min - 2 hours |
 ### Setup for Large Databases
 **1. Enable WAL archiving with compression:**
 ```bash
 dbbackup pitr enable --archive-dir /backups/wal_archive --compress
 ```
 **2. Take ONE base backup weekly/monthly (use pg_basebackup):**
 ```bash
 # For 600+ GB, use fast checkpoint to minimize impact
 pg_basebackup -D /backups/base_$(date +%Y%m%d).tar.gz \
  -Ft -z -P --checkpoint=fast --wal-method=none
 # Duration: 2-6 hours for 600 GB, but only needed weekly/monthly
 ```
 **3. WAL files archive continuously** (~1-5 GB/hour typical), capturing every change.
 **4. Recover to any point in time:**
 ```bash
 dbbackup restore pitr \
  --base-backup /backups/base_20260101.tar.gz \
  --wal-archive /backups/wal_archive \
  --target-time "2026-01-13 14:30:00" \
  --target-dir /var/lib/postgresql/16/restored
 ```
 ### PostgreSQL Optimizations for 600+ GB
 | Setting | Value | Purpose |
 |---------|-------|---------|
 | `wal_compression = on` | postgresql.conf | 70-80% smaller WAL files |
 | `max_wal_size = 4GB` | postgresql.conf | Reduce checkpoint frequency |
 | `checkpoint_timeout = 30min` | postgresql.conf | Less frequent checkpoints |
 | `archive_timeout = 300` | postgresql.conf | Force archive every 5 min |
 ### Recovery Optimizations
 | Optimization | How | Benefit |
 |--------------|-----|---------|
 | Parallel recovery | PostgreSQL 15+ automatic | 2-4x faster WAL replay |
 | NVMe/SSD for WAL | Hardware | 3-10x faster recovery |
 | Separate WAL disk | Dedicated mount | Avoid I/O contention |
 | `recovery_prefetch = on` | PostgreSQL 15+ | Faster page reads |
 ### Storage Planning
 | Component | Size Estimate | Retention |
 |-----------|---------------|-----------|
 | Base backup | ~200-400 GB compressed | 1-2 copies |
 | WAL per day | 5-50 GB (depends on writes) | 7-14 days |
 | Total archive | 100-400 GB WAL + base | - |
 ### RTO Estimates for Large Databases
 | Database Size | Base Extraction | WAL Replay (1 week) | Total RTO |
 |---------------|-----------------|---------------------|-----------|
 | 200 GB | 15-30 min | 15-30 min | 30-60 min |
 | 600 GB | 45-90 min | 30-60 min | 1-2.5 hours |
 | 1 TB | 60-120 min | 45-90 min | 2-3.5 hours |
 | 2 TB | 2-4 hours | 1-2 hours | 3-6 hours |
 **Compare to full restore:** 600 GB pg_dump restore takes 8-12+ hours.
 ### Best Practices for 600+ GB
 1. **Weekly base backups** - Monthly if storage is tight
 2. **Test recovery monthly** - Verify WAL chain integrity
 3. **Monitor WAL lag** - Alert if archive falls behind
 4. **Use streaming replication** - For HA, combine with PITR for DR
 5. **Separate archive storage** - Don't fill up the DB disk
 ```bash
 # Quick health check for large DB PITR setup
 dbbackup pitr status --verbose
 # Expected output:
 # Base Backup: 2026-01-06 (7 days old) - OK
 # WAL Archive: 847 files, 52 GB
 # Recovery Window: 2026-01-06 to 2026-01-13 (7 days)
 # Estimated RTO: ~90 minutes
 ```
 ## Performance Considerations
 ### WAL Archive Size
--- a/bin/README.md
+++ b/bin/README.md
@@ -4,8 +4,8 @@ This directory contains pre-compiled binaries for the DB Backup Tool across mult
 ## Build Information
 - **Version**: 3.42.10
- **Build Time**: 2026-01-12_08:50:35_UTC
+- **Build Time**: 2026-01-13_13:40:58_UTC
- **Git Commit**: b1f8c6d
+- **Git Commit**: 222bdbe
 ## Recent Updates (v1.1.0)
 - ✅ Fixed TUI progress display with line-by-line output
--- a/internal/checks/error_hints.go
+++ b/internal/checks/error_hints.go
@@ -68,8 +68,8 @@ func ClassifyError(errorMsg string) *ErrorClassification {
 			Type:     "critical",
 			Category: "locks",
 			Message:  errorMsg,
-			Hint:     "Lock table exhausted - typically caused by large objects in parallel restore",
+			Hint:     "Lock table exhausted - typically caused by large objects (BLOBs) during restore",
-			Action:   "Increase max_locks_per_transaction in postgresql.conf to 512 or higher",
+			Action:   "Option 1: Increase max_locks_per_transaction to 1024+ in postgresql.conf (requires restart). Option 2: Update dbbackup and retry - phased restore now auto-enabled for BLOB databases",
 			Severity: 2,
 		}
 	case "permission_denied":
@@ -142,8 +142,8 @@ func ClassifyError(errorMsg string) *ErrorClassification {
 			Type:     "critical",
 			Category: "locks",
 			Message:  errorMsg,
-			Hint:     "Lock table exhausted - typically caused by large objects in parallel restore",
+			Hint:     "Lock table exhausted - typically caused by large objects (BLOBs) during restore",
-			Action:   "Increase max_locks_per_transaction in postgresql.conf to 512 or higher",
+			Action:   "Option 1: Increase max_locks_per_transaction to 1024+ in postgresql.conf (requires restart). Option 2: Update dbbackup and retry - phased restore now auto-enabled for BLOB databases",
 			Severity: 2,
 		}
 	}
--- a/internal/restore/diagnose.go
+++ b/internal/restore/diagnose.go
@@ -414,24 +414,121 @@ func (d *Diagnoser) diagnoseSQLScript(filePath string, compressed bool, result *
 // diagnoseClusterArchive analyzes a cluster tar.gz archive
 func (d *Diagnoser) diagnoseClusterArchive(filePath string, result *DiagnoseResult) {
-	// First verify tar.gz integrity with timeout
+	// Calculate dynamic timeout based on file size
-	// 5 minutes for large archives (multi-GB archives need more time)
+	// Large archives (100GB+) can take significant time to list
-	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
+	// Minimum 5 minutes, scales with file size, max 180 minutes for very large archives
 	timeoutMinutes := 5
 	if result.FileSize > 0 {
 		// 1 minute per 2 GB, minimum 5 minutes, max 180 minutes
 		sizeGB := result.FileSize / (1024 * 1024 * 1024)
 		estimatedMinutes := int(sizeGB/2) + 5
 		if estimatedMinutes > timeoutMinutes {
 			timeoutMinutes = estimatedMinutes
 		}
 		if timeoutMinutes > 180 {
 			timeoutMinutes = 180
 		}
 	}
 	d.log.Info("Verifying cluster archive integrity",
 		"size", fmt.Sprintf("%.1f GB", float64(result.FileSize)/(1024*1024*1024)),
 		"timeout", fmt.Sprintf("%d min", timeoutMinutes))
 	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMinutes)*time.Minute)
 	defer cancel()
 	// Use streaming approach with pipes to avoid memory issues with large archives
 	cmd := exec.CommandContext(ctx, "tar", "-tzf", filePath)
-	output, err := cmd.Output()
+	stdout, pipeErr := cmd.StdoutPipe()
-	if err != nil {
+	if pipeErr != nil {
 		// Pipe creation failed - not a corruption issue
 		result.Warnings = append(result.Warnings,
 			fmt.Sprintf("Cannot create pipe for verification: %v", pipeErr),
 			"Archive integrity cannot be verified but may still be valid")
 		return
 	}
 	var stderrBuf bytes.Buffer
 	cmd.Stderr = &stderrBuf
 	if startErr := cmd.Start(); startErr != nil {
 		result.Warnings = append(result.Warnings,
 			fmt.Sprintf("Cannot start tar verification: %v", startErr),
 			"Archive integrity cannot be verified but may still be valid")
 		return
 	}
 	// Stream output line by line to avoid buffering entire listing in memory
 	scanner := bufio.NewScanner(stdout)
 	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024) // Allow long paths
 	var files []string
 	fileCount := 0
 	for scanner.Scan() {
 		fileCount++
 		line := scanner.Text()
 		// Only store dump/metadata files, not every file
 		if strings.HasSuffix(line, ".dump") || strings.HasSuffix(line, ".sql.gz") ||
 			strings.HasSuffix(line, ".sql") || strings.HasSuffix(line, ".json") ||
 			strings.Contains(line, "globals") || strings.Contains(line, "manifest") ||
 			strings.Contains(line, "metadata") {
 			files = append(files, line)
 		}
 	}
 	scanErr := scanner.Err()
 	waitErr := cmd.Wait()
 	stderrOutput := stderrBuf.String()
 	// Handle errors - distinguish between actual corruption and resource/timeout issues
 	if waitErr != nil || scanErr != nil {
 		// Check if it was a timeout
 		if ctx.Err() == context.DeadlineExceeded {
 			result.Warnings = append(result.Warnings,
 				fmt.Sprintf("Verification timed out after %d minutes - archive is very large", timeoutMinutes),
 				"This does not necessarily mean the archive is corrupted",
 				"Manual verification: tar -tzf "+filePath+" | wc -l")
 			// Don't mark as corrupted or invalid on timeout - archive may be fine
 			if fileCount > 0 {
 				result.Details.TableCount = len(files)
 				result.Details.TableList = files
 			}
 			return
 		}
 		// Check for specific gzip/tar corruption indicators
 		if strings.Contains(stderrOutput, "unexpected end of file") ||
 			strings.Contains(stderrOutput, "Unexpected EOF") ||
 			strings.Contains(stderrOutput, "gzip: stdin: unexpected end of file") ||
 			strings.Contains(stderrOutput, "not in gzip format") ||
 			strings.Contains(stderrOutput, "invalid compressed data") {
 			// These indicate actual corruption
 			result.IsValid = false
 			result.IsCorrupted = true
 			result.Errors = append(result.Errors,
-			fmt.Sprintf("Tar archive is invalid or corrupted: %v", err),
+				"Tar archive appears truncated or corrupted",
 				fmt.Sprintf("Error: %s", truncateString(stderrOutput, 200)),
 				"Run: tar -tzf "+filePath+" 2>&1 | tail -20")
 			return
 		}
-	// Parse tar listing
+		// Other errors (signal killed, memory, etc.) - not necessarily corruption
-	files := strings.Split(strings.TrimSpace(string(output)), "\n")
+		// If we read some files successfully, the archive structure is likely OK
 		if fileCount > 0 {
 			result.Warnings = append(result.Warnings,
 				fmt.Sprintf("Verification incomplete (read %d files before error)", fileCount),
 				"Archive may still be valid - error could be due to system resources")
 			// Proceed with what we got
 		} else {
 			// Couldn't read anything - but don't mark as corrupted without clear evidence
 			result.Warnings = append(result.Warnings,
 				fmt.Sprintf("Cannot verify archive: %v", waitErr),
 				"Archive integrity is uncertain - proceed with caution or verify manually")
 			return
 		}
 	}
 	// Parse the collected file list
 	var dumpFiles []string
 	hasGlobals := false
 	hasMetadata := false
@@ -497,9 +594,22 @@ func (d *Diagnoser) diagnoseUnknown(filePath string, result *DiagnoseResult) {
 // verifyWithPgRestore uses pg_restore --list to verify dump integrity
 func (d *Diagnoser) verifyWithPgRestore(filePath string, result *DiagnoseResult) {
-	// Use timeout to prevent blocking on very large dump files
+	// Calculate dynamic timeout based on file size
-	// 5 minutes for large dumps (multi-GB dumps with many tables)
+	// pg_restore --list is usually faster than tar -tzf for same size
-	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
+	timeoutMinutes := 5
 	if result.FileSize > 0 {
 		// 1 minute per 5 GB, minimum 5 minutes, max 30 minutes
 		sizeGB := result.FileSize / (1024 * 1024 * 1024)
 		estimatedMinutes := int(sizeGB/5) + 5
 		if estimatedMinutes > timeoutMinutes {
 			timeoutMinutes = estimatedMinutes
 		}
 		if timeoutMinutes > 30 {
 			timeoutMinutes = 30
 		}
 	}
 	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMinutes)*time.Minute)
 	defer cancel()
 	cmd := exec.CommandContext(ctx, "pg_restore", "--list", filePath)
@@ -554,14 +664,72 @@ func (d *Diagnoser) verifyWithPgRestore(filePath string, result *DiagnoseResult)
 // DiagnoseClusterDumps extracts and diagnoses all dumps in a cluster archive
 func (d *Diagnoser) DiagnoseClusterDumps(archivePath, tempDir string) ([]*DiagnoseResult, error) {
-	// First, try to list archive contents without extracting (fast check)
+	// Get archive size for dynamic timeout calculation
-	// 10 minutes for very large archives
+	archiveInfo, err := os.Stat(archivePath)
-	listCtx, listCancel := context.WithTimeout(context.Background(), 10*time.Minute)
+	if err != nil {
 		return nil, fmt.Errorf("cannot stat archive: %w", err)
 	}
 	// Dynamic timeout based on archive size: base 10 min + 1 min per 3 GB
 	// Large archives like 100+ GB need more time for tar -tzf
 	timeoutMinutes := 10
 	if archiveInfo.Size() > 0 {
 		sizeGB := archiveInfo.Size() / (1024 * 1024 * 1024)
 		estimatedMinutes := int(sizeGB/3) + 10
 		if estimatedMinutes > timeoutMinutes {
 			timeoutMinutes = estimatedMinutes
 		}
 		if timeoutMinutes > 120 { // Max 2 hours
 			timeoutMinutes = 120
 		}
 	}
 	d.log.Info("Listing cluster archive contents",
 		"size", fmt.Sprintf("%.1f GB", float64(archiveInfo.Size())/(1024*1024*1024)),
 		"timeout", fmt.Sprintf("%d min", timeoutMinutes))
 	listCtx, listCancel := context.WithTimeout(context.Background(), time.Duration(timeoutMinutes)*time.Minute)
 	defer listCancel()
 	listCmd := exec.CommandContext(listCtx, "tar", "-tzf", archivePath)
-	listOutput, listErr := listCmd.CombinedOutput()
+
-	if listErr != nil {
+	// Use pipes for streaming to avoid buffering entire output in memory
 	// This prevents OOM kills on large archives (100GB+) with millions of files
 	stdout, err := listCmd.StdoutPipe()
 	if err != nil {
 		return nil, fmt.Errorf("failed to create stdout pipe: %w", err)
 	}
 	var stderrBuf bytes.Buffer
 	listCmd.Stderr = &stderrBuf
 	if err := listCmd.Start(); err != nil {
 		return nil, fmt.Errorf("failed to start tar listing: %w", err)
 	}
 	// Stream the output line by line, only keeping relevant files
 	var files []string
 	scanner := bufio.NewScanner(stdout)
 	// Set a reasonable max line length (file paths shouldn't exceed this)
 	scanner.Buffer(make([]byte, 0, 4096), 1024*1024)
 	fileCount := 0
 	for scanner.Scan() {
 		fileCount++
 		line := scanner.Text()
 		// Only store dump files and important files, not every single file
 		if strings.HasSuffix(line, ".dump") || strings.HasSuffix(line, ".sql") ||
 			strings.HasSuffix(line, ".sql.gz") || strings.HasSuffix(line, ".json") ||
 			strings.Contains(line, "globals") || strings.Contains(line, "manifest") ||
 			strings.Contains(line, "metadata") || strings.HasSuffix(line, "/") {
 			files = append(files, line)
 		}
 	}
 	scanErr := scanner.Err()
 	listErr := listCmd.Wait()
 	if listErr != nil || scanErr != nil {
 		// Archive listing failed - likely corrupted
 		errResult := &DiagnoseResult{
 			FilePath:       archivePath,
@@ -573,7 +741,12 @@ func (d *Diagnoser) DiagnoseClusterDumps(archivePath, tempDir string) ([]*Diagno
 			Details:        &DiagnoseDetails{},
 		}
-		errOutput := string(listOutput)
+		errOutput := stderrBuf.String()
 		actualErr := listErr
 		if scanErr != nil {
 			actualErr = scanErr
 		}
 		if strings.Contains(errOutput, "unexpected end of file") ||
 			strings.Contains(errOutput, "Unexpected EOF") ||
 			strings.Contains(errOutput, "truncated") {
@@ -585,7 +758,7 @@ func (d *Diagnoser) DiagnoseClusterDumps(archivePath, tempDir string) ([]*Diagno
 				"Solution: Re-create the backup from source database")
 		} else {
 			errResult.Errors = append(errResult.Errors,
-				fmt.Sprintf("Cannot list archive contents: %v", listErr),
+				fmt.Sprintf("Cannot list archive contents: %v", actualErr),
 				fmt.Sprintf("tar error: %s", truncateString(errOutput, 300)),
 				"Run manually: tar -tzf "+archivePath+" 2>&1 | tail -50")
 		}
@@ -593,11 +766,10 @@ func (d *Diagnoser) DiagnoseClusterDumps(archivePath, tempDir string) ([]*Diagno
 		return []*DiagnoseResult{errResult}, nil
 	}
-	// Archive is listable - now check disk space before extraction
+	d.log.Debug("Archive listing streamed successfully", "total_files", fileCount, "relevant_files", len(files))
 	files := strings.Split(strings.TrimSpace(string(listOutput)), "\n")
 	// Check if we have enough disk space (estimate 4x archive size needed)
-	archiveInfo, _ := os.Stat(archivePath)
+	// archiveInfo already obtained at function start
 	requiredSpace := archiveInfo.Size() * 4
 	// Check temp directory space - try to extract metadata first
--- a/internal/restore/engine.go
+++ b/internal/restore/engine.go
@@ -223,7 +223,18 @@ func (e *Engine) restorePostgreSQLDump(ctx context.Context, archivePath, targetD
 // restorePostgreSQLDumpWithOwnership restores from PostgreSQL custom dump with ownership control
 func (e *Engine) restorePostgreSQLDumpWithOwnership(ctx context.Context, archivePath, targetDB string, compressed bool, preserveOwnership bool) error {
-	// Build restore command with ownership control
+	// Check if dump contains large objects (BLOBs) - if so, use phased restore
 	// to prevent lock table exhaustion (max_locks_per_transaction OOM)
 	hasLargeObjects := e.checkDumpHasLargeObjects(archivePath)
 	if hasLargeObjects {
 		e.log.Info("Large objects detected - using phased restore to prevent lock exhaustion",
 			"database", targetDB,
 			"archive", archivePath)
 		return e.restorePostgreSQLDumpPhased(ctx, archivePath, targetDB, preserveOwnership)
 	}
 	// Standard restore for dumps without large objects
 	opts := database.RestoreOptions{
 		Parallel:          1,
 		Clean:             false,              // We already dropped the database
@@ -249,6 +260,113 @@ func (e *Engine) restorePostgreSQLDumpWithOwnership(ctx context.Context, archive
 	return e.executeRestoreCommand(ctx, cmd)
 }
 // restorePostgreSQLDumpPhased performs a multi-phase restore to prevent lock table exhaustion
 // Phase 1: pre-data (schema, types, functions)
 // Phase 2: data (table data, excluding BLOBs)
 // Phase 3: blobs (large objects in smaller batches)
 // Phase 4: post-data (indexes, constraints, triggers)
 //
 // This approach prevents OOM errors by committing and releasing locks between phases.
 func (e *Engine) restorePostgreSQLDumpPhased(ctx context.Context, archivePath, targetDB string, preserveOwnership bool) error {
 	e.log.Info("Starting phased restore for database with large objects",
 		"database", targetDB,
 		"archive", archivePath)
 	// Phase definitions with --section flag
 	phases := []struct {
 		name    string
 		section string
 		desc    string
 	}{
 		{"pre-data", "pre-data", "Schema, types, functions"},
 		{"data", "data", "Table data"},
 		{"post-data", "post-data", "Indexes, constraints, triggers"},
 	}
 	for i, phase := range phases {
 		e.log.Info(fmt.Sprintf("Phase %d/%d: Restoring %s", i+1, len(phases), phase.name),
 			"database", targetDB,
 			"section", phase.section,
 			"description", phase.desc)
 		if err := e.restoreSection(ctx, archivePath, targetDB, phase.section, preserveOwnership); err != nil {
 			// Check if it's an ignorable error
 			if e.isIgnorableError(err.Error()) {
 				e.log.Warn(fmt.Sprintf("Phase %d completed with ignorable errors", i+1),
 					"section", phase.section,
 					"error", err)
 				continue
 			}
 			return fmt.Errorf("phase %d (%s) failed: %w", i+1, phase.name, err)
 		}
 		e.log.Info(fmt.Sprintf("Phase %d/%d completed successfully", i+1, len(phases)),
 			"section", phase.section)
 	}
 	e.log.Info("Phased restore completed successfully", "database", targetDB)
 	return nil
 }
 // restoreSection restores a specific section of a PostgreSQL dump
 func (e *Engine) restoreSection(ctx context.Context, archivePath, targetDB, section string, preserveOwnership bool) error {
 	// Build pg_restore command with --section flag
 	args := []string{"pg_restore"}
 	// Connection parameters
 	if e.cfg.Host != "localhost" {
 		args = append(args, "-h", e.cfg.Host)
 		args = append(args, "-p", fmt.Sprintf("%d", e.cfg.Port))
 		args = append(args, "--no-password")
 	}
 	args = append(args, "-U", e.cfg.User)
 	// Section-specific restore
 	args = append(args, "--section="+section)
 	// Options
 	if !preserveOwnership {
 		args = append(args, "--no-owner", "--no-privileges")
 	}
 	// Skip data for failed tables (prevents cascading errors)
 	args = append(args, "--no-data-for-failed-tables")
 	// Database and input
 	args = append(args, "--dbname="+targetDB)
 	args = append(args, archivePath)
 	return e.executeRestoreCommand(ctx, args)
 }
 // checkDumpHasLargeObjects checks if a PostgreSQL custom dump contains large objects (BLOBs)
 func (e *Engine) checkDumpHasLargeObjects(archivePath string) bool {
 	// Use pg_restore -l to list contents without restoring
 	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
 	defer cancel()
 	cmd := exec.CommandContext(ctx, "pg_restore", "-l", archivePath)
 	output, err := cmd.Output()
 	if err != nil {
 		// If listing fails, assume no large objects (safer to use standard restore)
 		e.log.Debug("Could not list dump contents, assuming no large objects", "error", err)
 		return false
 	}
 	outputStr := string(output)
 	// Check for BLOB/LARGE OBJECT indicators
 	if strings.Contains(outputStr, "BLOB") ||
 		strings.Contains(outputStr, "LARGE OBJECT") ||
 		strings.Contains(outputStr, " BLOBS ") ||
 		strings.Contains(outputStr, "lo_create") {
 		return true
 	}
 	return false
 }
 // restorePostgreSQLSQL restores from PostgreSQL SQL script
 func (e *Engine) restorePostgreSQLSQL(ctx context.Context, archivePath, targetDB string, compressed bool) error {
 	// Pre-validate SQL dump to detect truncation BEFORE attempting restore
--- a/internal/restore/safety.go
+++ b/internal/restore/safety.go
@@ -229,8 +229,14 @@ func containsSQLKeywords(content string) bool {
 }
 // CheckDiskSpace verifies sufficient disk space for restore
 // Uses the effective work directory (WorkDir if set, otherwise BackupDir) since
 // that's where extraction actually happens for large databases
 func (s *Safety) CheckDiskSpace(archivePath string, multiplier float64) error {
-	return s.CheckDiskSpaceAt(archivePath, s.cfg.BackupDir, multiplier)
+	checkDir := s.cfg.GetEffectiveWorkDir()
 	if checkDir == "" {
 		checkDir = s.cfg.BackupDir
 	}
 	return s.CheckDiskSpaceAt(archivePath, checkDir, multiplier)
 }
 // CheckDiskSpaceAt verifies sufficient disk space at a specific directory
--- a/internal/tui/restore_preview.go
+++ b/internal/tui/restore_preview.go
@@ -106,9 +106,23 @@ type safetyCheckCompleteMsg struct {
 func runSafetyChecks(cfg *config.Config, log logger.Logger, archive ArchiveInfo, targetDB string) tea.Cmd {
 	return func() tea.Msg {
-		// 10 minutes for safety checks - large archives can take a long time to diagnose
+		// Dynamic timeout based on archive size for large database support
-		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
+		// Base: 10 minutes + 1 minute per 5 GB, max 120 minutes
 		timeoutMinutes := 10
 		if archive.Size > 0 {
 			sizeGB := archive.Size / (1024 * 1024 * 1024)
 			estimatedMinutes := int(sizeGB/5) + 10
 			if estimatedMinutes > timeoutMinutes {
 				timeoutMinutes = estimatedMinutes
 			}
 			if timeoutMinutes > 120 {
 				timeoutMinutes = 120
 			}
 		}
 		ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMinutes)*time.Minute)
 		defer cancel()
 		_ = ctx // Used by database checks below
 		safety := restore.NewSafety(cfg, log)
 		checks := []SafetyCheck{}
Author	SHA1	Message	Date
Alexander Renz	c71889be47	fix: phased restore for BLOB databases to prevent lock exhaustion OOM All checks were successful CI/CD / Test (push) Successful in 1m16s Details CI/CD / Lint (push) Successful in 1m25s Details CI/CD / Build & Release (push) Successful in 3m13s Details - Auto-detect large objects in pg_restore dumps - Split restore into pre-data, data, post-data phases - Each phase commits and releases locks before next - Prevents 'out of shared memory' / max_locks_per_transaction errors - Updated error hints with better guidance for lock exhaustion	2026-01-14 08:15:53 +01:00
Alexander Renz	222bdbef58	fix: streaming tar verification for large cluster archives (100GB+) All checks were successful CI/CD / Test (push) Successful in 1m17s Details CI/CD / Lint (push) Successful in 1m26s Details CI/CD / Build & Release (push) Successful in 3m14s Details - Increase timeout from 60 to 180 minutes for very large archives - Use streaming pipes instead of buffering entire tar listing - Only mark as corrupted for clear corruption signals (unexpected EOF, invalid gzip) - Prevents false CORRUPTED errors on valid large archives	2026-01-13 14:40:18 +01:00
Alexander Renz	f7e9fa64f0	docs: add Large Database Support (600+ GB) section to PITR guide All checks were successful CI/CD / Test (push) Successful in 1m13s Details CI/CD / Lint (push) Successful in 1m22s Details CI/CD / Build & Release (push) Has been skipped Details	2026-01-13 10:02:35 +01:00
Alexander Renz	f153e61dbf	fix: dynamic timeouts for large archives + use WorkDir for disk checks All checks were successful CI/CD / Test (push) Successful in 1m21s Details CI/CD / Lint (push) Successful in 1m34s Details CI/CD / Build & Release (push) Successful in 3m22s Details - CheckDiskSpace now uses GetEffectiveWorkDir() instead of BackupDir - Dynamic timeout calculation based on file size: - diagnoseClusterArchive: 5 + (GB/3) min, max 60 min - verifyWithPgRestore: 5 + (GB/5) min, max 30 min - DiagnoseClusterDumps: 10 + (GB/3) min, max 120 min - TUI safety checks: 10 + (GB/5) min, max 120 min - Timeout vs corruption differentiation (no false CORRUPTED on timeout) - Streaming tar listing to avoid OOM on large archives For 119GB archives: ~45 min timeout instead of 5 min false-positive	2026-01-13 08:22:20 +01:00