Critical PostgreSQL-specific fixes identified by database expert review:
1. **Port always passed for localhost** (pg_dump, pg_restore, pg_dumpall, psql)
- Previously, port was only passed for non-localhost connections
- If user has PostgreSQL on non-standard port (e.g., 5433), commands
would connect to wrong instance or fail
- Now always passes -p PORT to all PostgreSQL tools
2. **CREATE DATABASE with encoding/locale preservation**
- Now creates databases with explicit ENCODING 'UTF8'
- Detects server's LC_COLLATE and uses it for new databases
- Prevents encoding mismatch errors during restore
- Falls back to simple CREATE if encoding fails (older PG versions)
3. **DROP DATABASE WITH (FORCE) for PostgreSQL 13+**
- Uses new WITH (FORCE) option to atomically terminate connections
- Prevents race condition where new connections are established
- Falls back to standard DROP for PostgreSQL < 13
- Also revokes CONNECT privilege before drop attempt
4. **Improved globals restore error handling**
- Distinguishes between FATAL errors (real problems) and regular
ERROR messages (like 'role already exists' which is expected)
- Only fails on FATAL errors or psql command failures
- Logs error count summary for visibility
5. **Better error classification in restore logs**
- Separate log levels for FATAL vs ERROR
- Debug-level logging for 'already exists' errors (expected)
- Error count tracking to avoid log spam
These fixes improve reliability for enterprise PostgreSQL deployments
with non-standard configurations and existing data.
Critical fixes for enterprise environments where dbbackup runs as
postgres user via 'su postgres' without sudo access:
1. canRestartPostgreSQL(): New function that detects if we can restart
PostgreSQL. Returns false immediately if running as postgres user
without sudo access, avoiding wasted time and potential hangs.
2. tryRestartPostgreSQL(): Now calls canRestartPostgreSQL() first to
skip restart attempts in restricted environments.
3. Changed restart warning from ERROR to WARN level - it's expected
behavior in enterprise environments, not an error.
4. Context cancellation check: Goroutines now check ctx.Err() before
starting and properly count cancelled databases as failures.
5. Goroutine accounting: After wg.Wait(), verify all databases were
accounted for (success + fail = total). Catches goroutine crashes
or deadlocks.
6. Port argument fix: Always pass -p port to psql for localhost
restores, fixing non-standard port configurations.
This should fix the issue where cluster restore showed success but
0 databases were actually restored when running on enterprise systems.
CRITICAL FIXES:
- Add check for successCount == 0 to properly fail when no databases restored
- Fix tryRestartPostgreSQL to use non-interactive sudo (-n flag)
- Add 10-second timeout per restart attempt to prevent blocking
- Try pg_ctl directly for postgres user (no sudo needed)
- Set stdin to nil to prevent sudo from waiting for password input
This fixes the issue where cluster restore showed success but no databases
were actually restored due to sudo blocking on password prompts.
- Remove sudo cat attempt for reading pg_hba.conf
- Prevents password prompts when running as postgres via 'su postgres'
- Auth detection now relies on connection attempts when file is unreadable
- Fixes issue where returning to menu after restore triggers sudo prompt
- Add box-style headers for success/failure states
- Display comprehensive summary with archive info, type, database count
- Show timing section with total time, throughput, and average per-DB stats
- Use consistent styling and formatting across all result views
- Improve visual hierarchy with section separators
- Add DatabaseProgressWithTimingCallback for timing-aware progress reporting
- Track elapsed time and average duration per database during restore phase
- Display ETA based on completed database restore times
- Show restore phase elapsed time in progress bar
- Enhance cluster restore progress bar with [elapsed / ETA: remaining] format
- Fixed critical bug where ALTER SYSTEM + pg_reload_conf() was used
but max_locks_per_transaction requires a full PostgreSQL restart
- Added automatic restart attempt (systemctl, service, pg_ctl)
- Added loud warnings if restart fails with manual fix instructions
- Updated preflight checks to warn about low max_locks_per_transaction
- This was causing 'out of shared memory' errors on BLOB-heavy restores
- Add ProgressCallback and DatabaseProgressCallback to backup engine
- Add SetProgressCallback() and SetDatabaseProgressCallback() methods
- Wire database progress callback in backup_exec.go
- Show progress bar (Database: [████░░░] X/Y) during cluster backups
- Hide spinner when progress bar is displayed
- Use shared progress state pattern for thread-safe TUI updates
- Add nil checks before logger calls in diagnose.go (8 places)
- Add nil checks before logger calls in safety.go (2 places)
- Fixes crash when pressing 'v' to verify backup in interactive menu
- New internal/fs package for testable filesystem operations
- In-memory filesystem support for unit testing without disk I/O
- Swappable global FS: SetFS(afero.NewMemMapFs())
- Wrapper functions: ReadFile, WriteFile, Mkdir, Walk, Glob, etc.
- Testing helpers: WithMemFs(), SetupTestDir()
- Comprehensive test suite demonstrating usage
- Upgraded afero from v1.10.0 to v1.15.0
- Windows-compatible colors via native console API
- Color helper functions: Success(), Error(), Warning(), Info()
- Text styling: Header(), Dim(), Bold(), Green(), Red(), Yellow(), Cyan()
- Logger CleanFormatter uses fatih/color instead of raw ANSI
- All progress indicators use colored [OK]/[FAIL] status
- Automatic color detection for non-TTY environments
- Visual progress bars for cloud uploads/downloads
- Byte transfer display, speed, ETA prediction
- Color-coded Unicode block progress
- Checksum verification with progress bar for large files
- Spinner for indeterminate operations (unknown size)
- New types: NewSchollzBar(), NewSchollzBarItems(), NewSchollzSpinner()
- Progress Writer() method for io.Copy integration
- Added gopsutil/v3 for cross-platform system metrics
* Works on Linux, macOS, Windows, BSD
* Memory detection no longer requires /proc parsing
- Added go-humanize for readable output
* humanize.Bytes() for memory sizes
* humanize.Comma() for large numbers
- Improved preflight display with memory usage percentage
- Linux kernel checks (shmmax/shmall) still use /proc for accuracy
- Linux system checks (read-only from /proc, no auth needed):
* shmmax, shmall kernel limits
* Available RAM check
- PostgreSQL auto-tuning:
* max_locks_per_transaction scaled by BLOB count
* maintenance_work_mem boosted to 2GB for faster indexes
* All settings auto-reset after restore (even on failure)
- Archive analysis:
* Count BLOBs per database (pg_restore -l or zgrep)
* Scale lock boost: 2048 (default) → 4096/8192/16384 based on count
- Nice TUI preflight summary display with ✓/⚠ indicators
- Automatically boost max_locks_per_transaction to 2048 before restore
- Uses ALTER SYSTEM + pg_reload_conf() - no restart needed
- Automatically resets to original value after restore (even on failure)
- Prevents 'out of shared memory' OOM on BLOB-heavy SQL format dumps
- Works transparently - no user intervention required
- Auto-detect large objects in pg_restore dumps
- Split restore into pre-data, data, post-data phases
- Each phase commits and releases locks before next
- Prevents 'out of shared memory' / max_locks_per_transaction errors
- Updated error hints with better guidance for lock exhaustion
- Increase timeout from 60 to 180 minutes for very large archives
- Use streaming pipes instead of buffering entire tar listing
- Only mark as corrupted for clear corruption signals (unexpected EOF, invalid gzip)
- Prevents false CORRUPTED errors on valid large archives
- CheckDiskSpace now uses GetEffectiveWorkDir() instead of BackupDir
- Dynamic timeout calculation based on file size:
- diagnoseClusterArchive: 5 + (GB/3) min, max 60 min
- verifyWithPgRestore: 5 + (GB/5) min, max 30 min
- DiagnoseClusterDumps: 10 + (GB/3) min, max 120 min
- TUI safety checks: 10 + (GB/5) min, max 120 min
- Timeout vs corruption differentiation (no false CORRUPTED on timeout)
- Streaming tar listing to avoid OOM on large archives
For 119GB archives: ~45 min timeout instead of 5 min false-positive
- verifyArchiveCmd now uses restore.Safety and restore.Diagnoser
- Same validation logic in backup manager verify and restore safety checks
- No more discrepancy between verify showing valid and restore failing
- Service templates now use WorkingDirectory for config loading
- Config is read from .dbbackup.conf in /var/lib/dbbackup
- Updated SYSTEMD.md documentation to match actual CLI
- Removed non-existent --config flag from ExecStart
CRITICAL FIXES:
- Encryption detection false positive (IsBackupEncrypted returned true for ALL files)
- 12 cmd.Wait() deadlocks fixed with channel-based context handling
- TUI timeout bugs: 60s->10min for safety checks, 15s->60s for DB listing
- diagnose.go timeouts: 60s->5min for tar/pg_restore operations
- Panic recovery added to parallel backup/restore goroutines
- Variable shadowing fix in restore/engine.go
These bugs caused pg_dump backups to fail through TUI for months.
This is a critical bugfix release addressing multiple hardcoded temporary directory paths
that prevented proper use of the WorkDir configuration option.
PROBLEM:
Users configuring WorkDir (e.g., /u01/dba/tmp) for systems with small root filesystems
still experienced failures because critical operations hardcoded /tmp instead of respecting
the configured WorkDir. This made the WorkDir option essentially non-functional.
FIXED LOCATIONS:
1. internal/restore/engine.go:632 - CRITICAL: Used BackupDir instead of WorkDir for extraction
2. cmd/restore.go:354,834 - CLI restore/diagnose commands ignored WorkDir
3. cmd/migrate.go:208,347 - Migration commands hardcoded /tmp
4. internal/migrate/engine.go:120 - Migration engine ignored WorkDir
5. internal/config/config.go:224 - SwapFilePath hardcoded /tmp
6. internal/config/config.go:519 - Backup directory fallback hardcoded /tmp
7. internal/tui/restore_exec.go:161 - Debug logs hardcoded /tmp
8. internal/tui/settings.go:805 - Directory browser default hardcoded /tmp
9. internal/tui/restore_preview.go:474 - Display message hardcoded /tmp
NEW FEATURES:
- Added Config.GetEffectiveWorkDir() helper method
- WorkDir now respects WORK_DIR environment variable
- All temp operations now consistently use configured WorkDir with /tmp fallback
IMPACT:
- Restores on systems with small root disks now work properly with WorkDir configured
- Admins can control disk space usage for all temporary operations
- Debug logs, extraction dirs, swap files all respect WorkDir setting
Version: 3.42.1 (Critical Fix Release)
- Remove invalid --config flag from exporter service template
- Change ReadOnlyPaths to ReadWritePaths for catalog access
- Add copyBinary() to install binary to /usr/local/bin (ProtectHome compat)
- Fix exporter status detection using direct systemctl check
- Add os/exec import for status check
- Add install/uninstall and metrics commands to command table
- Add Systemd Integration section with install examples
- Add Prometheus Metrics section with textfile and HTTP exporter docs
- Update Features list with systemd and metrics highlights
- Rebuild all platform binaries