Files

Alexander Renz 91228552fb fix(backup/restore): implement DB+Go specialist recommendations

P0: Add ON_ERROR_STOP=1 to psql (fail fast, not 2.6M errors)
P1: Fix pipe deadlock in streaming compression (goroutine+context)
P1: Handle SIGPIPE (exit 141) - report compressor as root cause
P2: Validate .dump files with pg_restore --list before restore
P2: Add fsync after streaming compression for durability

Fixes potential hung backups and improves error diagnostics.

2026-01-07 08:58:00 +01:00

23 KiB

Raw Blame History

Changelog

All notable changes to dbbackup will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[3.41.0] - 2026-01-07 "The Pre-Flight Check"

Added - 🛡️ Pre-Restore Validation

Automatic Dump Validation Before Restore:

SQL dump files are now validated BEFORE attempting restore
Detects truncated COPY blocks that cause "syntax error" failures
Catches corrupted backups in seconds instead of wasting 49+ minutes
Cluster restore pre-validates ALL dumps upfront (fail-fast approach)
Custom format .dump files now validated with pg_restore --list

Improved Error Messages:

Clear indication when dump file is truncated
Shows which table's COPY block was interrupted
Displays sample orphaned data for diagnosis
Provides actionable error messages with root cause

Fixed

P0: psql ON_ERROR_STOP - Added -v ON_ERROR_STOP=1 to psql commands to fail fast on first error instead of accumulating millions of errors
P1: Pipe deadlock - Fixed streaming compression deadlock when pg_dump blocks on full pipe buffer; now uses goroutine with proper context timeout handling
P1: SIGPIPE handling - Detect exit code 141 (broken pipe) and report compressor failure as root cause
P2: .dump validation - Custom format dumps now validated with pg_restore --list before restore
P2: fsync durability - Added outFile.Sync() after streaming compression to prevent truncation on power loss
Truncated .sql.gz dumps no longer waste hours on doomed restores
"syntax error at or near" errors now caught before restore begins
Cluster restores abort immediately if any dump is corrupted

Technical Details

Integrated Diagnoser into restore pipeline for pre-validation
Added quickValidateSQLDump() for fast integrity checks
Pre-validation runs on all .sql.gz and .dump files in cluster archives
Streaming compression uses channel-based wait with context cancellation
Zero performance impact on valid backups (diagnosis is fast)

[3.40.0] - 2026-01-05 "The Diagnostician"

Added - 🔍 Restore Diagnostics & Error Reporting

Backup Diagnosis Command:

restore diagnose <archive> - Deep analysis of backup files before restore
Detects truncated dumps, corrupted archives, incomplete COPY blocks
PGDMP signature validation for PostgreSQL custom format
Gzip integrity verification with decompression test
pg_restore --list validation for custom format archives
--deep flag for exhaustive line-by-line analysis
--json flag for machine-readable output
Cluster archive diagnosis scans all contained dumps

Detailed Error Reporting:

Comprehensive error collector captures stderr during restore
Ring buffer prevents OOM on high-error restores (2M+ errors)
Error classification with actionable hints and recommendations
--save-debug-log <path> saves JSON report on failure
Reports include: exit codes, last errors, line context, tool versions
Automatic recommendations based on error patterns

TUI Restore Enhancements:

Dump validity safety check runs automatically before restore
Detects truncated/corrupted backups in restore preview
Press d to toggle debug log saving in Advanced Options
Debug logs saved to /tmp/dbbackup-restore-debug-*.json on failure
Press d in archive browser to run diagnosis on any backup

New Commands:

restore diagnose - Analyze backup file integrity and structure

New Flags:

--save-debug-log <path> - Save detailed JSON error report on failure
--diagnose - Run deep diagnosis before cluster restore
--deep - Enable exhaustive diagnosis (line-by-line analysis)
--json - Output diagnosis in JSON format
--keep-temp - Keep temporary files after diagnosis
--verbose - Show detailed diagnosis progress

Technical Details

1,200+ lines of new diagnostic code
Error classification system with 15+ error patterns
Ring buffer stderr capture (1MB max, 10K lines)
Zero memory growth on high-error restores
Full TUI integration for diagnostics

[3.2.0] - 2025-12-13 "The Margin Eraser"

Added - 🚀 Physical Backup Revolution

MySQL Clone Plugin Integration:

Native physical backup using MySQL 8.0.17+ Clone Plugin
No XtraBackup dependency - pure Go implementation
Real-time progress monitoring via performance_schema
Support for both local and remote clone operations

Filesystem Snapshot Orchestration:

LVM snapshot support with automatic cleanup
ZFS snapshot integration with send/receive
Btrfs subvolume snapshot support
Brief table lock (<100ms) for consistency
Automatic snapshot backend detection

Continuous Binlog Streaming:

Real-time binlog capture using MySQL replication protocol
Multiple targets: file, compressed file, S3 direct streaming
Sub-second RPO without impacting database server
Automatic position tracking and checkpointing

Parallel Cloud Streaming:

Direct database-to-S3 streaming (zero local storage)
Configurable worker pool for parallel uploads
S3 multipart upload with automatic retry
Support for S3, GCS, and Azure Blob Storage

Smart Engine Selection:

Automatic engine selection based on environment
MySQL version detection and capability checking
Filesystem type detection for optimal snapshot backend
Database size-based recommendations

New Commands:

engine list - List available backup engines
engine info <name> - Show detailed engine information
backup --engine=<name> - Use specific backup engine

Technical Details

7,559 lines of new code
Zero new external dependencies
10/10 platform builds successful
Full test coverage for new engines

[3.1.0] - 2025-11-26

Added - 🔄 Point-in-Time Recovery (PITR)

Complete PITR Implementation for PostgreSQL:

WAL Archiving: Continuous archiving of Write-Ahead Log files with compression and encryption support
Timeline Management: Track and manage PostgreSQL timeline history with branching support
Recovery Targets: Restore to specific timestamp, transaction ID (XID), LSN, named restore point, or immediate
PostgreSQL Version Support: Both modern (12+) and legacy recovery configuration formats
Recovery Actions: Promote to primary, pause for inspection, or shutdown after recovery
Comprehensive Testing: 700+ lines of tests covering all PITR functionality with 100% pass rate

New Commands:

PITR Management:

pitr enable - Configure PostgreSQL for WAL archiving and PITR
pitr disable - Disable WAL archiving in PostgreSQL configuration
pitr status - Display current PITR configuration and archive statistics

WAL Archive Operations:

wal archive <wal-file> <filename> - Archive WAL file (used by archive_command)
wal list - List all archived WAL files with details
wal cleanup - Remove old WAL files based on retention policy
wal timeline - Display timeline history and branching structure

Point-in-Time Restore:

restore pitr - Perform point-in-time recovery with multiple target types:
- --target-time "YYYY-MM-DD HH:MM:SS" - Restore to specific timestamp
- --target-xid <xid> - Restore to transaction ID
- --target-lsn <lsn> - Restore to Log Sequence Number
- --target-name <name> - Restore to named restore point
- --target-immediate - Restore to earliest consistent point

Advanced PITR Features:

WAL Compression: gzip compression (70-80% space savings)
WAL Encryption: AES-256-GCM encryption for archived WAL files
Timeline Selection: Recover along specific timeline or latest
Recovery Actions: Promote (default), pause, or shutdown after target reached
Inclusive/Exclusive: Control whether target transaction is included
Auto-Start: Automatically start PostgreSQL after recovery setup
Recovery Monitoring: Real-time monitoring of recovery progress

Configuration Options:

# Enable PITR with compression and encryption
./dbbackup pitr enable --archive-dir /backups/wal_archive \
  --compress --encrypt --encryption-key-file /secure/key.bin

# Perform PITR to specific time
./dbbackup restore pitr \
  --base-backup /backups/base.tar.gz \
  --wal-archive /backups/wal_archive \
  --target-time "2024-11-26 14:30:00" \
  --target-dir /var/lib/postgresql/14/restored \
  --auto-start --monitor

Technical Details:

WAL file parsing and validation (timeline, segment, extension detection)
Timeline history parsing (.history files) with consistency validation
Automatic PostgreSQL version detection (12+ vs legacy)
Recovery configuration generation (postgresql.auto.conf + recovery.signal)
Data directory validation (exists, writable, PostgreSQL not running)
Comprehensive error handling and validation

Documentation:

Complete PITR section in README.md (200+ lines)
Dedicated PITR.md guide with detailed examples and troubleshooting
Test suite documentation (tests/pitr_complete_test.go)

Files Added:

internal/pitr/wal/ - WAL archiving and parsing
internal/pitr/config/ - Recovery configuration generation
internal/pitr/timeline/ - Timeline management
cmd/pitr.go - PITR command implementation
cmd/wal.go - WAL management commands
cmd/restore_pitr.go - PITR restore command
tests/pitr_complete_test.go - Comprehensive test suite (700+ lines)
PITR.md - Complete PITR guide

Performance:

WAL archiving: ~100-200 MB/s (with compression)
WAL encryption: ~1-2 GB/s (streaming)
Recovery replay: 10-100 MB/s (disk I/O dependent)
Minimal overhead during normal operations

Use Cases:

Disaster recovery from accidental data deletion
Rollback to pre-migration state
Compliance and audit requirements
Testing and what-if scenarios
Timeline branching for parallel recovery paths

Changed

Licensing: Added Apache License 2.0 to the project (LICENSE file)
Version: Updated to v3.1.0
Enhanced metadata format with PITR information
Improved progress reporting for long-running operations
Better error messages for PITR operations

Production

Production Validated: 2 production hosts
Databases backed up: 8 databases nightly
Retention policy: 30-day retention with minimum 5 backups
Backup volume: ~10MB/night
Schedule: 02:09 and 02:25 CET
Impact: Resolved 4-day backup failure immediately
User feedback: "cleanup command is SO gut" | "--dry-run: chef's kiss!" 💋

Documentation

Added comprehensive PITR.md guide (complete PITR documentation)
Updated README.md with PITR section (200+ lines)
Updated CHANGELOG.md with v3.1.0 details
Added NOTICE file for Apache License attribution
Created comprehensive test suite (tests/pitr_complete_test.go - 700+ lines)

[3.0.0] - 2025-11-26

Added - 🔐 AES-256-GCM Encryption (Phase 4)

Secure Backup Encryption:

Algorithm: AES-256-GCM authenticated encryption (prevents tampering)
Key Derivation: PBKDF2-SHA256 with 600,000 iterations (OWASP 2024 recommended)
Streaming Encryption: Memory-efficient for large backups (O(buffer) not O(file))
Key Sources: File (raw/base64), environment variable, or passphrase
Auto-Detection: Restore automatically detects and decrypts encrypted backups
Metadata Tracking: Encrypted flag and algorithm stored in .meta.json

CLI Integration:

--encrypt - Enable encryption for backup operations
--encryption-key-file <path> - Path to 32-byte encryption key (raw or base64 encoded)
--encryption-key-env <var> - Environment variable containing key (default: DBBACKUP_ENCRYPTION_KEY)
Automatic decryption on restore (no extra flags needed)

Security Features:

Unique nonce per encryption (no key reuse vulnerabilities)
Cryptographically secure random generation (crypto/rand)
Key validation (32 bytes required)
Authenticated encryption prevents tampering attacks
56-byte header: Magic(16) + Algorithm(16) + Nonce(12) + Salt(32)

Usage Examples:

# Generate encryption key
head -c 32 /dev/urandom | base64 > encryption.key

# Encrypted backup
./dbbackup backup single mydb --encrypt --encryption-key-file encryption.key

# Restore (automatic decryption)
./dbbackup restore single mydb_backup.sql.gz --encryption-key-file encryption.key --confirm

Performance:

Encryption speed: ~1-2 GB/s (streaming, no memory bottleneck)
Overhead: 56 bytes header + 16 bytes GCM tag per file
Key derivation: ~1.4s for 600k iterations (intentionally slow for security)

Files Added:

internal/crypto/interface.go - Encryption interface and configuration
internal/crypto/aes.go - AES-256-GCM implementation (272 lines)
internal/crypto/aes_test.go - Comprehensive test suite (all tests passing)
cmd/encryption.go - CLI encryption helpers
internal/backup/encryption.go - Backup encryption operations
Total: ~1,200 lines across 13 files

Added - 📦 Incremental Backups (Phase 3B)

MySQL/MariaDB Incremental Backups:

Change Detection: mtime-based file modification tracking
Archive Format: tar.gz containing only changed files since base backup
Space Savings: 70-95% smaller than full backups (typical)
Backup Chain: Tracks base → incremental relationships with metadata
Checksum Verification: SHA-256 integrity checking
Auto-Detection: CLI automatically uses correct engine for PostgreSQL vs MySQL

MySQL-Specific Exclusions:

Relay logs (relay-log, relay-bin*)
Binary logs (mysql-bin*, binlog*)
InnoDB redo logs (ib_logfile*)
InnoDB undo logs (undo_*)
Performance schema (in-memory)
Temporary files (#sql*, *.tmp)
Lock files (*.lock, auto.cnf.lock)
PID files (*.pid, mysqld.pid)
Error logs (*.err, error.log)
Slow query logs (slow.log)
General logs (general.log, query.log)

CLI Integration:

--backup-type <full|incremental> - Backup type (default: full)
--base-backup <path> - Path to base backup (required for incremental)
Auto-detects database type (PostgreSQL vs MySQL) and uses appropriate engine
Same interface for both database types

Usage Examples:

# Full backup (base)
./dbbackup backup single mydb --db-type mysql --backup-type full

# Incremental backup
./dbbackup backup single mydb \
  --db-type mysql \
  --backup-type incremental \
  --base-backup /backups/mydb_20251126.tar.gz

# Restore incremental
./dbbackup restore incremental \
  --base-backup mydb_base.tar.gz \
  --incremental-backup mydb_incr_20251126.tar.gz \
  --target /restore/path

Implementation:

Copy-paste-adapt from Phase 3A PostgreSQL (95% code reuse)
Interface-based design enables sharing tests between engines
internal/backup/incremental_mysql.go - MySQL incremental engine (530 lines)
All existing tests pass immediately (interface compatibility)
Development time: 30 minutes (vs 5-6h estimated) - 10x speedup!

Combined Features:

# Encrypted + Incremental backup
./dbbackup backup single mydb \
  --backup-type incremental \
  --base-backup mydb_base.tar.gz \
  --encrypt \
  --encryption-key-file key.txt

Changed

Version: Bumped to 3.0.0 (major feature release)
Backup Engine: Integrated encryption and incremental capabilities
Restore Engine: Added automatic decryption detection
Metadata Format: Extended with encryption and incremental fields

Testing

✅ Encryption tests: 4 tests passing (TestAESEncryptionDecryption, TestKeyDerivation, TestKeyValidation, TestLargeData)
✅ Incremental tests: 2 tests passing (TestIncrementalBackupRestore, TestIncrementalBackupErrors)
✅ Roundtrip validation: Encrypt → Decrypt → Verify (data matches perfectly)
✅ Build: All platforms compile successfully
✅ Interface compatibility: PostgreSQL and MySQL engines share test suite

Documentation

Updated README.md with encryption and incremental sections
Added PHASE4_COMPLETION.md - Encryption implementation details
Added PHASE3B_COMPLETION.md - MySQL incremental implementation report
Usage examples for encryption, incremental, and combined workflows

Performance

Phase 4: Completed in ~1h (encryption library + CLI integration)
Phase 3B: Completed in 30 minutes (vs 5-6h estimated)
Total: 2 major features delivered in 1 day (planned: 6 hours, actual: ~2 hours)
Quality: Production-ready, all tests passing, no breaking changes

Commits

Phase 4: 3 commits (7d96ec7, f9140cf, dd614dd, 8bbca16)
Phase 3B: 2 commits (357084c, a0974ef)
Docs: 1 commit (3b9055b)

[2.1.0] - 2025-11-26

Added - Cloud Storage Integration

S3/MinIO/B2 Support: Native S3-compatible storage backend with streaming uploads
Azure Blob Storage: Native Azure integration with block blob support for files >256MB
Google Cloud Storage: Native GCS integration with 16MB chunked uploads
Cloud URI Syntax: Direct backup/restore using --cloud s3://bucket/path URIs
TUI Cloud Settings: Configure cloud providers directly in interactive menu
- Cloud Storage Enabled toggle
- Provider selector (S3, MinIO, B2, Azure, GCS)
- Bucket/Container configuration
- Region configuration
- Credential management with masking
- Auto-upload toggle
Multipart Uploads: Automatic multipart uploads for files >100MB (S3/MinIO/B2)
Streaming Transfers: Memory-efficient streaming for all cloud operations
Progress Tracking: Real-time upload/download progress with ETA
Metadata Sync: Automatic .sha256 and .info file upload alongside backups
Cloud Verification: Verify backup integrity directly from cloud storage
Cloud Cleanup: Apply retention policies to cloud-stored backups

Added - Cross-Platform Support

Windows Support: Native binaries for Windows Intel (amd64) and ARM (arm64)
NetBSD Support: Full support for NetBSD amd64 (disk checks use safe defaults)
Platform-Specific Implementations:
- resources_unix.go - Linux, macOS, FreeBSD, OpenBSD
- resources_windows.go - Windows stub implementation
- disk_check_netbsd.go - NetBSD disk space stub
Build Tags: Proper Go build constraints for platform-specific code
All Platforms Building: 10/10 platforms successfully compile
- ✅ Linux (amd64, arm64, armv7)
- ✅ macOS (Intel, Apple Silicon)
- ✅ Windows (Intel, ARM)
- ✅ FreeBSD amd64
- ✅ OpenBSD amd64
- ✅ NetBSD amd64

Changed

Cloud Auto-Upload: When CloudEnabled=true and CloudAutoUpload=true, backups automatically upload after creation
Configuration: Added cloud settings to TUI settings interface
Backup Engine: Integrated cloud upload into backup workflow with progress tracking

Fixed

BSD Syscall Issues: Fixed syscall.Rlimit type mismatches (int64 vs uint64) on BSD platforms
OpenBSD RLIMIT_AS: Made RLIMIT_AS check Linux-only (not available on OpenBSD)
NetBSD Disk Checks: Added safe default implementation for NetBSD (syscall.Statfs unavailable)
Cross-Platform Builds: Resolved Windows syscall.Rlimit undefined errors

Documentation

Updated README.md with Cloud Storage section and examples
Enhanced CLOUD.md with setup guides for all providers
Added testing scripts for Azure and GCS
Docker Compose files for Azurite and fake-gcs-server

Testing

Added scripts/test_azure_storage.sh - Azure Blob Storage integration tests
Added scripts/test_gcs_storage.sh - Google Cloud Storage integration tests
Docker Compose setups for local testing (Azurite, fake-gcs-server, MinIO)

[2.0.0] - 2025-11-25

Added - Production-Ready Release

100% Test Coverage: All 24 automated tests passing
Zero Critical Issues: Production-validated and deployment-ready
Backup Verification: SHA-256 checksum generation and validation
JSON Metadata: Structured .info files with backup metadata
Retention Policy: Automatic cleanup of old backups with configurable retention
Configuration Management:
- Auto-save/load settings to .dbbackup.conf in current directory
- Per-directory configuration for different projects
- CLI flags always take precedence over saved configuration
- Passwords excluded from saved configuration files

Added - Performance Optimizations

Parallel Cluster Operations: Worker pool pattern for concurrent database operations
Memory Efficiency: Streaming command output eliminates OOM errors
Optimized Goroutines: Ticker-based progress indicators reduce CPU overhead
Configurable Concurrency: CLUSTER_PARALLELISM environment variable

Added - Reliability Enhancements

Context Cleanup: Proper resource cleanup with sync.Once and io.Closer interface
Process Management: Thread-safe process tracking with automatic cleanup on exit
Error Classification: Regex-based error pattern matching for robust error handling
Performance Caching: Disk space checks cached with 30-second TTL
Metrics Collection: Structured logging with operation metrics

Fixed

Configuration Bug: CLI flags now correctly override config file values
Memory Leaks: Proper cleanup prevents resource leaks in long-running operations

Changed

Streaming Architecture: Constant ~1GB memory footprint regardless of database size
Cross-Platform: Native binaries for Linux (x64/ARM), macOS (x64/ARM), FreeBSD, OpenBSD

[1.2.0] - 2025-11-12

Added

Interactive TUI: Full terminal user interface with progress tracking
Database Selector: Interactive database selection for backup operations
Archive Browser: Browse and restore from backup archives
Configuration Settings: In-TUI configuration management
CPU Detection: Automatic CPU detection and optimization

Changed

Improved error handling and user feedback
Enhanced progress tracking with real-time updates

[1.1.0] - 2025-11-10

Added

Multi-Database Support: PostgreSQL, MySQL, MariaDB
Cluster Operations: Full cluster backup and restore for PostgreSQL
Sample Backups: Create reduced-size backups for testing
Parallel Processing: Automatic CPU detection and parallel jobs

Changed

Refactored command structure for better organization
Improved compression handling

[1.0.0] - 2025-11-08

Added

Initial release
Single database backup and restore
PostgreSQL support
Basic CLI interface
Streaming compression

Version Numbering

Major (X.0.0): Breaking changes, major feature additions
Minor (0.X.0): New features, non-breaking changes
Patch (0.0.X): Bug fixes, minor improvements

Upcoming Features

See ROADMAP.md for planned features:

Phase 3: Incremental Backups
Phase 4: Encryption (AES-256)
Phase 5: PITR (Point-in-Time Recovery)
Phase 6: Enterprise Features (Prometheus metrics, remote restore)

23 KiB Raw Blame History

Changelog

[3.41.0] - 2026-01-07 "The Pre-Flight Check"

Added - 🛡️ Pre-Restore Validation

Fixed

Technical Details

[3.40.0] - 2026-01-05 "The Diagnostician"

Added - 🔍 Restore Diagnostics & Error Reporting

Technical Details

[3.2.0] - 2025-12-13 "The Margin Eraser"

Added - 🚀 Physical Backup Revolution

Technical Details

[3.1.0] - 2025-11-26

Added - 🔄 Point-in-Time Recovery (PITR)

Changed

Production

Documentation

[3.0.0] - 2025-11-26

Added - 🔐 AES-256-GCM Encryption (Phase 4)

Added - 📦 Incremental Backups (Phase 3B)

Changed

Testing

Documentation

Performance

Commits

[2.1.0] - 2025-11-26

Added - Cloud Storage Integration

Added - Cross-Platform Support

Changed

Fixed

Documentation

Testing

[2.0.0] - 2025-11-25

Added - Production-Ready Release

Added - Performance Optimizations

Added - Reliability Enhancements

Fixed

Changed

[1.2.0] - 2025-11-12

Added

Changed

[1.1.0] - 2025-11-10

Added

Changed

[1.0.0] - 2025-11-08

Added

Version Numbering

Upcoming Features

23 KiB

Raw Blame History