Three high-value improvements for cluster restore:
1. Weighted progress by database size
- Progress now shows percentage by data volume, not just count
- Phase 3/3: Databases (2/7) - 45.2% by size
- Gives more accurate ETA for clusters with varied DB sizes
2. Pre-extraction disk space check
- Checks workdir has 3x archive size before extraction
- Prevents partial extraction failures when disk fills mid-way
- Clear error message with required vs available GB
3. --parallel-dbs flag for concurrent restores
- dbbackup restore cluster archive.tar.gz --parallel-dbs=4
- Overrides CLUSTER_PARALLELISM config setting
- Set to 1 for sequential restore (safest for large objects)
Ensures PostgreSQL fully closes connections before starting next
restore, preventing potential connection pool exhaustion during
rapid sequential cluster restores.
The test was flaky because it used crypto/rand for random data,
causing non-deterministic chunk boundaries. With small sample sizes
(100KB / 8KB avg = ~12 chunks), variance was high - sometimes only
42.9% overlap instead of expected >50%.
Fixed by using math/rand with seed 42 for reproducible test results.
Now consistently achieves 91.7% overlap (11/12 chunks).
Bubbletea v1.3+ sends InterruptMsg for SIGINT signals instead of
KeyMsg with 'ctrl+c', causing cancellation to not work properly.
- Add tea.InterruptMsg handling to restore_exec.go
- Add tea.InterruptMsg handling to backup_exec.go
- Add tea.InterruptMsg handling to menu.go
- Call cleanup.KillOrphanedProcesses on all interrupt paths
- No zombie pg_dump/pg_restore/gzip processes left behind
Fixes Ctrl+C not working during cluster restore/backup operations.
v3.42.50
Critical PostgreSQL-specific fixes identified by database expert review:
1. **Port always passed for localhost** (pg_dump, pg_restore, pg_dumpall, psql)
- Previously, port was only passed for non-localhost connections
- If user has PostgreSQL on non-standard port (e.g., 5433), commands
would connect to wrong instance or fail
- Now always passes -p PORT to all PostgreSQL tools
2. **CREATE DATABASE with encoding/locale preservation**
- Now creates databases with explicit ENCODING 'UTF8'
- Detects server's LC_COLLATE and uses it for new databases
- Prevents encoding mismatch errors during restore
- Falls back to simple CREATE if encoding fails (older PG versions)
3. **DROP DATABASE WITH (FORCE) for PostgreSQL 13+**
- Uses new WITH (FORCE) option to atomically terminate connections
- Prevents race condition where new connections are established
- Falls back to standard DROP for PostgreSQL < 13
- Also revokes CONNECT privilege before drop attempt
4. **Improved globals restore error handling**
- Distinguishes between FATAL errors (real problems) and regular
ERROR messages (like 'role already exists' which is expected)
- Only fails on FATAL errors or psql command failures
- Logs error count summary for visibility
5. **Better error classification in restore logs**
- Separate log levels for FATAL vs ERROR
- Debug-level logging for 'already exists' errors (expected)
- Error count tracking to avoid log spam
These fixes improve reliability for enterprise PostgreSQL deployments
with non-standard configurations and existing data.
Critical fixes for enterprise environments where dbbackup runs as
postgres user via 'su postgres' without sudo access:
1. canRestartPostgreSQL(): New function that detects if we can restart
PostgreSQL. Returns false immediately if running as postgres user
without sudo access, avoiding wasted time and potential hangs.
2. tryRestartPostgreSQL(): Now calls canRestartPostgreSQL() first to
skip restart attempts in restricted environments.
3. Changed restart warning from ERROR to WARN level - it's expected
behavior in enterprise environments, not an error.
4. Context cancellation check: Goroutines now check ctx.Err() before
starting and properly count cancelled databases as failures.
5. Goroutine accounting: After wg.Wait(), verify all databases were
accounted for (success + fail = total). Catches goroutine crashes
or deadlocks.
6. Port argument fix: Always pass -p port to psql for localhost
restores, fixing non-standard port configurations.
This should fix the issue where cluster restore showed success but
0 databases were actually restored when running on enterprise systems.
CRITICAL FIXES:
- Add check for successCount == 0 to properly fail when no databases restored
- Fix tryRestartPostgreSQL to use non-interactive sudo (-n flag)
- Add 10-second timeout per restart attempt to prevent blocking
- Try pg_ctl directly for postgres user (no sudo needed)
- Set stdin to nil to prevent sudo from waiting for password input
This fixes the issue where cluster restore showed success but no databases
were actually restored due to sudo blocking on password prompts.
- Remove sudo cat attempt for reading pg_hba.conf
- Prevents password prompts when running as postgres via 'su postgres'
- Auth detection now relies on connection attempts when file is unreadable
- Fixes issue where returning to menu after restore triggers sudo prompt
- Add box-style headers for success/failure states
- Display comprehensive summary with archive info, type, database count
- Show timing section with total time, throughput, and average per-DB stats
- Use consistent styling and formatting across all result views
- Improve visual hierarchy with section separators
- Add DatabaseProgressWithTimingCallback for timing-aware progress reporting
- Track elapsed time and average duration per database during restore phase
- Display ETA based on completed database restore times
- Show restore phase elapsed time in progress bar
- Enhance cluster restore progress bar with [elapsed / ETA: remaining] format
- Fixed critical bug where ALTER SYSTEM + pg_reload_conf() was used
but max_locks_per_transaction requires a full PostgreSQL restart
- Added automatic restart attempt (systemctl, service, pg_ctl)
- Added loud warnings if restart fails with manual fix instructions
- Updated preflight checks to warn about low max_locks_per_transaction
- This was causing 'out of shared memory' errors on BLOB-heavy restores
- Add ProgressCallback and DatabaseProgressCallback to backup engine
- Add SetProgressCallback() and SetDatabaseProgressCallback() methods
- Wire database progress callback in backup_exec.go
- Show progress bar (Database: [████░░░] X/Y) during cluster backups
- Hide spinner when progress bar is displayed
- Use shared progress state pattern for thread-safe TUI updates
- Add nil checks before logger calls in diagnose.go (8 places)
- Add nil checks before logger calls in safety.go (2 places)
- Fixes crash when pressing 'v' to verify backup in interactive menu
- New internal/fs package for testable filesystem operations
- In-memory filesystem support for unit testing without disk I/O
- Swappable global FS: SetFS(afero.NewMemMapFs())
- Wrapper functions: ReadFile, WriteFile, Mkdir, Walk, Glob, etc.
- Testing helpers: WithMemFs(), SetupTestDir()
- Comprehensive test suite demonstrating usage
- Upgraded afero from v1.10.0 to v1.15.0
- Windows-compatible colors via native console API
- Color helper functions: Success(), Error(), Warning(), Info()
- Text styling: Header(), Dim(), Bold(), Green(), Red(), Yellow(), Cyan()
- Logger CleanFormatter uses fatih/color instead of raw ANSI
- All progress indicators use colored [OK]/[FAIL] status
- Automatic color detection for non-TTY environments
- Visual progress bars for cloud uploads/downloads
- Byte transfer display, speed, ETA prediction
- Color-coded Unicode block progress
- Checksum verification with progress bar for large files
- Spinner for indeterminate operations (unknown size)
- New types: NewSchollzBar(), NewSchollzBarItems(), NewSchollzSpinner()
- Progress Writer() method for io.Copy integration
- Use hashicorp/go-multierror for cluster restore error collection
- Shows ALL failed databases with full error context (not just count)
- Bullet-pointed output for readability
- Thread-safe error aggregation with dedicated mutex
- Error wrapping with %w for proper error chain preservation
- Added gopsutil/v3 for cross-platform system metrics
* Works on Linux, macOS, Windows, BSD
* Memory detection no longer requires /proc parsing
- Added go-humanize for readable output
* humanize.Bytes() for memory sizes
* humanize.Comma() for large numbers
- Improved preflight display with memory usage percentage
- Linux kernel checks (shmmax/shmall) still use /proc for accuracy
- Linux system checks (read-only from /proc, no auth needed):
* shmmax, shmall kernel limits
* Available RAM check
- PostgreSQL auto-tuning:
* max_locks_per_transaction scaled by BLOB count
* maintenance_work_mem boosted to 2GB for faster indexes
* All settings auto-reset after restore (even on failure)
- Archive analysis:
* Count BLOBs per database (pg_restore -l or zgrep)
* Scale lock boost: 2048 (default) → 4096/8192/16384 based on count
- Nice TUI preflight summary display with ✓/⚠ indicators
- Automatically boost max_locks_per_transaction to 2048 before restore
- Uses ALTER SYSTEM + pg_reload_conf() - no restart needed
- Automatically resets to original value after restore (even on failure)
- Prevents 'out of shared memory' OOM on BLOB-heavy SQL format dumps
- Works transparently - no user intervention required
- Auto-detect large objects in pg_restore dumps
- Split restore into pre-data, data, post-data phases
- Each phase commits and releases locks before next
- Prevents 'out of shared memory' / max_locks_per_transaction errors
- Updated error hints with better guidance for lock exhaustion
- Increase timeout from 60 to 180 minutes for very large archives
- Use streaming pipes instead of buffering entire tar listing
- Only mark as corrupted for clear corruption signals (unexpected EOF, invalid gzip)
- Prevents false CORRUPTED errors on valid large archives
- CheckDiskSpace now uses GetEffectiveWorkDir() instead of BackupDir
- Dynamic timeout calculation based on file size:
- diagnoseClusterArchive: 5 + (GB/3) min, max 60 min
- verifyWithPgRestore: 5 + (GB/5) min, max 30 min
- DiagnoseClusterDumps: 10 + (GB/3) min, max 120 min
- TUI safety checks: 10 + (GB/5) min, max 120 min
- Timeout vs corruption differentiation (no false CORRUPTED on timeout)
- Streaming tar listing to avoid OOM on large archives
For 119GB archives: ~45 min timeout instead of 5 min false-positive
- Changed status query from dbbackup_backup_verified to RPO-based check
- dbbackup_rpo_seconds < 86400 returns SUCCESS when backup < 24h old
- Fixes false FAILED status when verify operations not run
- Includes: status, RPO, backup size, duration, and overview table panels
- verifyArchiveCmd now uses restore.Safety and restore.Diagnoser
- Same validation logic in backup manager verify and restore safety checks
- No more discrepancy between verify showing valid and restore failing
- Service templates now use WorkingDirectory for config loading
- Config is read from .dbbackup.conf in /var/lib/dbbackup
- Updated SYSTEMD.md documentation to match actual CLI
- Removed non-existent --config flag from ExecStart
- Block cluster restore if existing databases found and cleanup not enabled
- User must press 'c' to enable 'Clean All First' before proceeding
- Prevents accidental data conflicts during disaster recovery
- Bug #24: Missing safety gate for cluster restore