Compare commits
17 Commits
| SHA1 |
|---|
| ae8c8fde3d |
| 346cb7fb61 |
| 18549584b1 |
| b1d1d57b61 |
| d0e1da1bea |
| 343a8b782d |
| bc5f7c07f4 |
| 821521470f |
| 147b9fc234 |
| 6f3e81a5a6 |
| bf1722c316 |
| a759f4d3db |
| 7cf1d6f85b |
| b305d1342e |
| 5456da7183 |
| f9ff45cf2a |
| 72c06ba5c2 |
CHANGELOG.md (+103)

@@ -5,6 +5,109 @@ All notable changes to dbbackup will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Performance - Cluster Restore Optimization

- **Eliminated duplicate archive extraction in cluster restore** - saves 30-50% time on large restores
  - Previously: archive was extracted twice (once in preflight validation, once in actual restore)
  - Now: archive extracted once and reused for both validation and restore
- **Time savings**:
  - 50 GB cluster: ~3-6 minutes faster
  - 10 GB cluster: ~1-2 minutes faster
  - Small clusters (<5 GB): ~30 seconds faster
- Optimization automatically enabled when the `--diagnose` flag is used
- New `ValidateAndExtractCluster()` performs combined validation + extraction
- `RestoreCluster()` accepts optional `preExtractedPath` parameter to reuse the extracted directory
- Disk space checks intelligently skipped when using a pre-extracted directory
- Maintains backward compatibility - works with and without pre-extraction
- Log output shows the optimization: `"Using pre-extracted cluster directory ... optimization: skipping duplicate extraction"`

### Improved - Archive Validation

- **Enhanced tar.gz validation with stream-based checks**
  - Fast header-only validation (validates gzip + tar structure without full extraction)
  - Checks gzip magic bytes (0x1f 0x8b) and tar header signature
  - Reduces preflight validation time from minutes to seconds on large archives
  - Falls back to full extraction only when necessary (with `--diagnose`)
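A header-only check of this kind is cheap because it touches only the first few hundred bytes of the stream. A minimal, self-contained sketch of the idea in Go (illustrative only; this compare view does not show dbbackup's actual validator):

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"os"
)

// validateTarGzHeader checks the gzip magic bytes (0x1f 0x8b) and the first
// tar header block without extracting anything.
func validateTarGzHeader(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// gzip magic bytes
	magic := make([]byte, 2)
	if _, err := io.ReadFull(f, magic); err != nil {
		return fmt.Errorf("archive too short: %w", err)
	}
	if magic[0] != 0x1f || magic[1] != 0x8b {
		return fmt.Errorf("not a gzip file (magic %x)", magic)
	}

	// Rewind, then decompress just the first 512-byte tar header block.
	if _, err := f.Seek(0, io.SeekStart); err != nil {
		return err
	}
	gz, err := gzip.NewReader(f)
	if err != nil {
		return fmt.Errorf("invalid gzip stream: %w", err)
	}
	defer gz.Close()

	header := make([]byte, 512)
	if _, err := io.ReadFull(gz, header); err != nil {
		return fmt.Errorf("cannot read tar header: %w", err)
	}
	// POSIX tar places the magic "ustar" at offset 257 of the first header.
	if string(header[257:262]) != "ustar" {
		return fmt.Errorf("tar signature not found")
	}
	return nil
}

func main() {
	if err := validateTarGzHeader(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, "validation failed:", err)
		os.Exit(1)
	}
	fmt.Println("gzip and tar headers look valid")
}
```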

## [3.42.74] - 2026-01-20 "Resource Profile System + Critical Ctrl+C Fix"

### Critical Bug Fix

- **Fixed Ctrl+C not working in TUI backup/restore** - context cancellation was broken in TUI mode
  - `executeBackupWithTUIProgress()` and `executeRestoreWithTUIProgress()` created new contexts with `WithCancel(parentCtx)`
  - When the user pressed Ctrl+C, `model.cancel()` was called on the parent context, but execution ran on a separate context
  - Fixed by using the parent context directly instead of creating a new one
  - Ctrl+C/ESC/q now properly propagate cancellation to running operations
  - Users can now interrupt long-running TUI operations
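The fix pattern, reduced to a runnable sketch (the type and function names below are illustrative, not the actual dbbackup code):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// tuiModel owns the single cancellable context shared by the UI and the
// running operation; pressing Ctrl+C/ESC/q calls cancel().
type tuiModel struct {
	ctx    context.Context
	cancel context.CancelFunc
}

func executeBackup(ctx context.Context) error {
	select {
	case <-time.After(5 * time.Second): // stands in for a long-running backup
		return nil
	case <-ctx.Done(): // cancellation propagates here
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	m := &tuiModel{ctx: ctx, cancel: cancel}

	go func() {
		time.Sleep(time.Second)
		m.cancel() // what the TUI does when the user presses Ctrl+C
	}()

	// After the fix the executor runs on m.ctx directly; before, it ran on a
	// separately created context that m.cancel() could not reach.
	fmt.Println(executeBackup(m.ctx)) // prints "context canceled"
}
```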

### Added - Resource Profile System

- **`--profile` flag for restore operations** with three presets:
  - **Conservative** (`--profile=conservative`): single-threaded (`--parallel=1`), minimal memory usage
    - Best for resource-constrained servers, shared hosting, or when "out of shared memory" errors occur
    - Automatically enables `LargeDBMode` for better resource management
  - **Balanced** (default): auto-detect resources, moderate parallelism
    - Good default for most scenarios
  - **Aggressive** (`--profile=aggressive`): maximum parallelism, all available resources
    - Best for dedicated database servers with ample resources
  - **Potato** (`--profile=potato`): easter egg 🥔, same as conservative
- **Profile system applies to both CLI and TUI**:
  - CLI: `dbbackup restore cluster backup.tar.gz --profile=conservative --confirm`
  - TUI: automatically uses the conservative profile for safer interactive operation
- **User overrides supported**: `--jobs` and `--parallel-dbs` flags override profile settings
- **New `internal/config/profile.go`** module:
  - `GetRestoreProfile(name)` - returns profile settings
  - `ApplyProfile(cfg, profile, jobs, parallelDBs)` - applies a profile with overrides
  - `GetProfileDescription(name)` - human-readable descriptions
  - `ListProfiles()` - all available profiles
### Added - PostgreSQL Diagnostic Tools

- **`diagnose_postgres_memory.sh`** - comprehensive memory and resource analysis script:
  - System memory overview with usage percentages and warnings
  - Top 15 memory-consuming processes
  - PostgreSQL-specific memory configuration analysis
  - Current locks and connections monitoring
  - Shared memory segments inspection
  - Disk space and swap usage checks
  - Identifies other resource consumers (Nessus, Elastic Agent, monitoring tools)
  - Smart recommendations based on findings
  - Detects temp file usage (an indicator of low work_mem)
- **`fix_postgres_locks.sh`** - PostgreSQL lock configuration helper:
  - Automatically increases `max_locks_per_transaction` to 4096
  - Shows current configuration before applying changes
  - Calculates total lock capacity
  - Provides restart commands for different PostgreSQL setups
  - References the diagnostic tool for comprehensive analysis

### Added - Documentation

- **`RESTORE_PROFILES.md`** - complete profile guide with real-world scenarios:
  - Profile comparison table
  - When to use each profile
  - Override examples
  - Troubleshooting guide for "out of shared memory" errors
  - Integration with diagnostic tools
- **`email_infra_team.txt`** - admin communication template (German):
  - Analysis results template
  - Problem identification section
  - Three solution variants (temporary, permanent, workaround)
  - Includes diagnostic tool references

### Changed - TUI Improvements

- **TUI mode defaults to the conservative profile** for safer operation
  - Interactive users benefit from stability over speed
  - Prevents resource exhaustion on shared systems
  - Can be overridden with an environment variable: `export RESOURCE_PROFILE=balanced`

### Fixed

- Context cancellation in TUI backup operations (critical)
- Context cancellation in TUI restore operations (critical)
- Better error diagnostics for "out of shared memory" errors
- Improved resource detection and management

### Technical Details

- Profile system respects explicit user flags (`--jobs`, `--parallel-dbs`)
- Conservative profile sets `cfg.LargeDBMode = true` automatically
- TUI profile selection is logged when `Debug` mode is enabled
- All profiles support both single and cluster restore operations

## [3.42.50] - 2026-01-16 "Ctrl+C Signal Handling Fix"

### Fixed - Proper Ctrl+C/SIGINT Handling in TUI
README.md (+15)

@@ -56,7 +56,7 @@ Download from [releases](https://git.uuxo.net/UUXO/dbbackup/releases):

```bash
# Linux x86_64
-wget https://git.uuxo.net/UUXO/dbbackup/releases/download/v3.42.35/dbbackup-linux-amd64
+wget https://git.uuxo.net/UUXO/dbbackup/releases/download/v3.42.74/dbbackup-linux-amd64
chmod +x dbbackup-linux-amd64
sudo mv dbbackup-linux-amd64 /usr/local/bin/dbbackup
```
@@ -239,6 +239,14 @@ When restoring large databases on VMs with limited resources, use the resource p

**Quick shortcuts:** Press `l` to toggle Large DB Mode, `c` for conservative, `p` to show recommendation.

**Troubleshooting Tools:**

For PostgreSQL restore issues ("out of shared memory" errors), diagnostic scripts are available:
- **diagnose_postgres_memory.sh** - Comprehensive system memory, PostgreSQL configuration, and resource analysis
- **fix_postgres_locks.sh** - Automatically increase max_locks_per_transaction to 4096

See [RESTORE_PROFILES.md](RESTORE_PROFILES.md) for detailed troubleshooting guidance.

**Database Status:**
```
Database Status & Health Check
@@ -278,6 +286,9 @@ dbbackup restore single backup.dump --target myapp_db --create --confirm
# Restore cluster
dbbackup restore cluster cluster_backup.tar.gz --confirm

+# Restore with resource profile (for resource-constrained servers)
+dbbackup restore cluster backup.tar.gz --profile=conservative --confirm
+
# Restore with debug logging (saves detailed error report on failure)
dbbackup restore cluster backup.tar.gz --save-debug-log /tmp/restore-debug.json --confirm
@@ -333,6 +344,7 @@ dbbackup backup single mydb --dry-run
| `--backup-dir` | Backup directory | ~/db_backups |
| `--compression` | Compression level (0-9) | 6 |
| `--jobs` | Parallel jobs | 8 |
| `--profile` | Resource profile (conservative/balanced/aggressive) | balanced |
| `--cloud` | Cloud storage URI | - |
| `--encrypt` | Enable encryption | false |
| `--dry-run, -n` | Run preflight checks only | false |
@@ -888,6 +900,7 @@ Workload types:

## Documentation

- [RESTORE_PROFILES.md](RESTORE_PROFILES.md) - Restore resource profiles & troubleshooting
- [SYSTEMD.md](SYSTEMD.md) - Systemd installation & scheduling
- [DOCKER.md](DOCKER.md) - Docker deployment
- [CLOUD.md](CLOUD.md) - Cloud storage configuration
RESTORE_PROFILES.md (new file, +195)

@@ -0,0 +1,195 @@
# Restore Profiles

## Overview

The `--profile` flag allows you to optimize restore operations based on your server's resources and current workload. This is particularly useful when dealing with "out of shared memory" errors or resource-constrained environments.

## Available Profiles

### Conservative Profile (`--profile=conservative`)

**Best for:** Resource-constrained servers, production systems with other running services, or when dealing with "out of shared memory" errors.

**Settings:**
- Single-threaded restore (`--parallel=1`)
- Single-threaded decompression (`--jobs=1`)
- Memory-conservative mode enabled
- Minimal memory footprint

**When to use:**
- Server RAM usage > 70%
- Other critical services running (web servers, monitoring agents)
- "out of shared memory" errors during restore
- Small VMs or shared hosting environments
- Disk I/O is the bottleneck

**Example:**
```bash
dbbackup restore cluster backup.tar.gz --profile=conservative --confirm
```

### Balanced Profile (`--profile=balanced`) - DEFAULT

**Best for:** Most scenarios, general-purpose servers with adequate resources.

**Settings:**
- Auto-detect parallelism based on CPU/RAM
- Moderate resource usage
- Good balance between speed and stability

**When to use:**
- Default choice for most restores
- Dedicated database server with moderate load
- Unknown or variable server conditions

**Example:**
```bash
dbbackup restore cluster backup.tar.gz --confirm
# or explicitly:
dbbackup restore cluster backup.tar.gz --profile=balanced --confirm
```

### Aggressive Profile (`--profile=aggressive`)

**Best for:** Dedicated database servers with ample resources, maintenance windows, performance-critical restores.

**Settings:**
- Maximum parallelism (auto-detect based on CPU cores)
- Maximum resource utilization
- Fastest restore speed

**When to use:**
- Dedicated database server (no other services)
- Server RAM usage < 50%
- Time-critical restores (RTO minimization)
- Maintenance windows with service downtime
- Testing/development environments

**Example:**
```bash
dbbackup restore cluster backup.tar.gz --profile=aggressive --confirm
```

### Potato Profile (`--profile=potato`) 🥔

**Easter egg:** Same as conservative, for servers running on a potato.

## Profile Comparison

| Setting | Conservative | Balanced | Aggressive |
|---------|-------------|----------|-----------|
| Parallel DBs | 1 (sequential) | Auto (2-4) | Auto (all CPUs) |
| Jobs (decompression) | 1 | Auto (2-4) | Auto (all CPUs) |
| Memory Usage | Minimal | Moderate | Maximum |
| Speed | Slowest | Medium | Fastest |
| Stability | Most stable | Stable | Requires resources |

## Overriding Profile Settings

You can override specific profile settings:

```bash
# Use conservative profile but allow 2 parallel jobs for decompression
dbbackup restore cluster backup.tar.gz \
  --profile=conservative \
  --jobs=2 \
  --confirm

# Use aggressive profile but limit to 2 parallel databases
dbbackup restore cluster backup.tar.gz \
  --profile=aggressive \
  --parallel-dbs=2 \
  --confirm
```

## Real-World Scenarios

### Scenario 1: "Out of Shared Memory" Error

**Problem:** PostgreSQL restore fails with `ERROR: out of shared memory`

**Solution:**
```bash
# Step 1: Use conservative profile
dbbackup restore cluster backup.tar.gz --profile=conservative --confirm

# Step 2: If still failing, temporarily stop monitoring agents
sudo systemctl stop nessus-agent elastic-agent
dbbackup restore cluster backup.tar.gz --profile=conservative --confirm
sudo systemctl start nessus-agent elastic-agent

# Step 3: Ask infrastructure team to increase work_mem (see email_infra_team.txt)
```

### Scenario 2: Fast Disaster Recovery

**Goal:** Restore as quickly as possible during a maintenance window

**Solution:**
```bash
# Stop all non-essential services first
sudo systemctl stop nginx php-fpm
dbbackup restore cluster backup.tar.gz --profile=aggressive --confirm
sudo systemctl start nginx php-fpm
```

### Scenario 3: Shared Server with Multiple Services

**Environment:** Web server + database + monitoring all on the same VM

**Solution:**
```bash
# Always use conservative to avoid impacting other services
dbbackup restore cluster backup.tar.gz --profile=conservative --confirm
```

### Scenario 4: Unknown Server Conditions

**Situation:** Restoring to a new server, unsure of resources

**Solution:**
```bash
# Step 1: Run diagnostics first
./diagnose_postgres_memory.sh > diagnosis.log

# Step 2: Choose profile based on memory usage:
# - If memory > 80%: use conservative
# - If memory 50-80%: use balanced (default)
# - If memory < 50%: use aggressive

# Step 3: Start with balanced and adjust if needed
dbbackup restore cluster backup.tar.gz --confirm
```

## Troubleshooting

### Profile Selection Guide

**Use Conservative when:**
- ✅ Memory usage > 70%
- ✅ Other services running
- ✅ Getting "out of shared memory" errors
- ✅ Restore keeps failing
- ✅ Small VM (< 4 GB RAM)
- ✅ High swap usage

**Use Balanced when:**
- ✅ Normal operation
- ✅ Moderate server load
- ✅ Unsure what to use
- ✅ Medium VM (4-16 GB RAM)

**Use Aggressive when:**
- ✅ Dedicated database server
- ✅ Memory usage < 50%
- ✅ No other critical services
- ✅ Need fastest possible restore
- ✅ Large VM (> 16 GB RAM)
- ✅ Maintenance window

## Environment Variables

You can set a default profile:

```bash
export RESOURCE_PROFILE=conservative
dbbackup restore cluster backup.tar.gz --confirm
```

## See Also

- [diagnose_postgres_memory.sh](diagnose_postgres_memory.sh) - Analyze system resources before restore
- [fix_postgres_locks.sh](fix_postgres_locks.sh) - Fix PostgreSQL lock exhaustion
- [email_infra_team.txt](email_infra_team.txt) - Template email for infrastructure team
@@ -4,8 +4,8 @@ This directory contains pre-compiled binaries for the DB Backup Tool across mult

## Build Information
- **Version**: 3.42.50
-- **Build Time**: 2026-01-18_11:39:42_UTC
-- **Git Commit**: 59a717a
+- **Build Time**: 2026-01-18_20:23:55_UTC
+- **Git Commit**: a759f4d

## Recent Updates (v1.1.0)
- ✅ Fixed TUI progress display with line-by-line output
@@ -66,6 +66,15 @@ TUI Automation Flags (for testing and CI/CD):
	cfg.TUIVerbose, _ = cmd.Flags().GetBool("verbose-tui")
	cfg.TUILogFile, _ = cmd.Flags().GetString("tui-log-file")

+	// Set conservative profile as default for TUI mode (safer for interactive users)
+	if cfg.ResourceProfile == "" || cfg.ResourceProfile == "balanced" {
+		cfg.ResourceProfile = "conservative"
+		cfg.LargeDBMode = true
+		if cfg.Debug {
+			log.Info("TUI mode: using conservative profile by default")
+		}
+	}
+
	// Check authentication before starting TUI
	if cfg.IsPostgreSQL() {
		if mismatch, msg := auth.CheckAuthenticationMismatch(cfg); mismatch {
@@ -13,6 +13,7 @@ import (
	"dbbackup/internal/backup"
	"dbbackup/internal/cloud"
+	"dbbackup/internal/config"
	"dbbackup/internal/database"
	"dbbackup/internal/pitr"
	"dbbackup/internal/restore"
@@ -28,7 +29,8 @@ var (
	restoreClean       bool
	restoreCreate      bool
	restoreJobs        int
	restoreParallelDBs int    // Number of parallel database restores
+	restoreProfile     string // Resource profile: conservative, balanced, aggressive
	restoreTarget      string
	restoreVerbose     bool
	restoreNoProgress  bool
@@ -112,6 +114,9 @@ Examples:
  # Restore to different database
  dbbackup restore single mydb.dump.gz --target mydb_test --confirm

+  # Memory-constrained server (single-threaded, minimal memory)
+  dbbackup restore single mydb.dump.gz --profile=conservative --confirm
+
  # Clean target database before restore
  dbbackup restore single mydb.sql.gz --clean --confirm
@@ -144,6 +149,12 @@ Examples:
  # Restore full cluster
  dbbackup restore cluster cluster_backup_20240101_120000.tar.gz --confirm

+  # Memory-constrained server (conservative profile)
+  dbbackup restore cluster cluster_backup.tar.gz --profile=conservative --confirm
+
+  # Maximum performance (dedicated server)
+  dbbackup restore cluster cluster_backup.tar.gz --profile=aggressive --confirm
+
  # Use parallel decompression
  dbbackup restore cluster cluster_backup.tar.gz --jobs 4 --confirm
@@ -277,6 +288,7 @@ func init() {
	restoreSingleCmd.Flags().BoolVar(&restoreClean, "clean", false, "Drop and recreate target database")
	restoreSingleCmd.Flags().BoolVar(&restoreCreate, "create", false, "Create target database if it doesn't exist")
	restoreSingleCmd.Flags().StringVar(&restoreTarget, "target", "", "Target database name (defaults to original)")
+	restoreSingleCmd.Flags().StringVar(&restoreProfile, "profile", "balanced", "Resource profile: conservative (--parallel=1, low memory), balanced, aggressive (max performance)")
	restoreSingleCmd.Flags().BoolVar(&restoreVerbose, "verbose", false, "Show detailed restore progress")
	restoreSingleCmd.Flags().BoolVar(&restoreNoProgress, "no-progress", false, "Disable progress indicators")
	restoreSingleCmd.Flags().StringVar(&restoreEncryptionKeyFile, "encryption-key-file", "", "Path to encryption key file (required for encrypted backups)")
@@ -289,8 +301,9 @@ func init() {
	restoreClusterCmd.Flags().BoolVar(&restoreDryRun, "dry-run", false, "Show what would be done without executing")
	restoreClusterCmd.Flags().BoolVar(&restoreForce, "force", false, "Skip safety checks and confirmations")
	restoreClusterCmd.Flags().BoolVar(&restoreCleanCluster, "clean-cluster", false, "Drop all existing user databases before restore (disaster recovery)")
-	restoreClusterCmd.Flags().IntVar(&restoreJobs, "jobs", 0, "Number of parallel decompression jobs (0 = auto)")
-	restoreClusterCmd.Flags().IntVar(&restoreParallelDBs, "parallel-dbs", 0, "Number of databases to restore in parallel (0 = use config default, 1 = sequential, -1 = auto-detect based on CPU/RAM)")
+	restoreClusterCmd.Flags().StringVar(&restoreProfile, "profile", "balanced", "Resource profile: conservative (--parallel=1, low memory), balanced, aggressive (max performance)")
+	restoreClusterCmd.Flags().IntVar(&restoreJobs, "jobs", 0, "Number of parallel decompression jobs (0 = auto, overrides profile)")
+	restoreClusterCmd.Flags().IntVar(&restoreParallelDBs, "parallel-dbs", 0, "Number of databases to restore in parallel (0 = use profile, 1 = sequential, -1 = auto-detect, overrides profile)")
	restoreClusterCmd.Flags().StringVar(&restoreWorkdir, "workdir", "", "Working directory for extraction (use when system disk is small, e.g. /mnt/storage/restore_tmp)")
	restoreClusterCmd.Flags().BoolVar(&restoreVerbose, "verbose", false, "Show detailed restore progress")
	restoreClusterCmd.Flags().BoolVar(&restoreNoProgress, "no-progress", false, "Disable progress indicators")
@@ -436,6 +449,16 @@ func runRestoreDiagnose(cmd *cobra.Command, args []string) error {
func runRestoreSingle(cmd *cobra.Command, args []string) error {
	archivePath := args[0]

+	// Apply resource profile
+	if err := config.ApplyProfile(cfg, restoreProfile, restoreJobs, 0); err != nil {
+		log.Warn("Invalid profile, using balanced", "error", err)
+		restoreProfile = "balanced"
+		_ = config.ApplyProfile(cfg, restoreProfile, restoreJobs, 0)
+	}
+	if cfg.Debug && restoreProfile != "balanced" {
+		log.Info("Using restore profile", "profile", restoreProfile)
+	}
+
	// Check if this is a cloud URI
	var cleanupFunc func() error
@@ -643,6 +666,16 @@ func runRestoreSingle(cmd *cobra.Command, args []string) error {
func runRestoreCluster(cmd *cobra.Command, args []string) error {
	archivePath := args[0]

+	// Apply resource profile
+	if err := config.ApplyProfile(cfg, restoreProfile, restoreJobs, restoreParallelDBs); err != nil {
+		log.Warn("Invalid profile, using balanced", "error", err)
+		restoreProfile = "balanced"
+		_ = config.ApplyProfile(cfg, restoreProfile, restoreJobs, restoreParallelDBs)
+	}
+	if cfg.Debug || restoreProfile != "balanced" {
+		log.Info("Using restore profile", "profile", restoreProfile, "parallel_dbs", cfg.ClusterParallelism, "jobs", cfg.Jobs)
+	}
+
	// Convert to absolute path
	if !filepath.IsAbs(archivePath) {
		absPath, err := filepath.Abs(archivePath)
@@ -840,22 +873,50 @@ func runRestoreCluster(cmd *cobra.Command, args []string) error {
		log.Info("Database cleanup completed")
	}

-	// Run pre-restore diagnosis if requested
-	if restoreDiagnose {
-		log.Info("[DIAG] Running pre-restore diagnosis...")
+	// OPTIMIZATION: Pre-extract archive once for both diagnosis and restore
+	// This avoids extracting the same tar.gz twice (saves 5-10 min on large clusters)
+	var extractedDir string
+	var extractErr error

-		// Create temp directory for extraction in configured WorkDir
-		workDir := cfg.GetEffectiveWorkDir()
-		diagTempDir, err := os.MkdirTemp(workDir, "dbbackup-diagnose-*")
-		if err != nil {
-			return fmt.Errorf("failed to create temp directory for diagnosis in %s: %w", workDir, err)
+	if restoreDiagnose || restoreConfirm {
+		log.Info("Pre-extracting cluster archive (shared for validation and restore)...")
+		extractedDir, extractErr = safety.ValidateAndExtractCluster(ctx, archivePath)
+		if extractErr != nil {
+			return fmt.Errorf("failed to extract cluster archive: %w", extractErr)
		}
-		defer os.RemoveAll(diagTempDir)
+		defer os.RemoveAll(extractedDir) // Cleanup at end
+		log.Info("Archive extracted successfully", "location", extractedDir)
+	}
+
+	// Run pre-restore diagnosis if requested (using already-extracted directory)
+	if restoreDiagnose {
+		log.Info("[DIAG] Running pre-restore diagnosis on extracted dumps...")

		diagnoser := restore.NewDiagnoser(log, restoreVerbose)
-		results, err := diagnoser.DiagnoseClusterDumps(archivePath, diagTempDir)
+		// Diagnose dumps directly from extracted directory
+		dumpsDir := filepath.Join(extractedDir, "dumps")
+		if _, err := os.Stat(dumpsDir); err != nil {
+			return fmt.Errorf("no dumps directory found in extracted archive: %w", err)
+		}
+
+		entries, err := os.ReadDir(dumpsDir)
		if err != nil {
-			return fmt.Errorf("diagnosis failed: %w", err)
+			return fmt.Errorf("failed to read dumps directory: %w", err)
		}
+
+		// Diagnose each dump file
+		var results []*restore.DiagnoseResult
+		for _, entry := range entries {
+			if entry.IsDir() {
+				continue
+			}
+			dumpPath := filepath.Join(dumpsDir, entry.Name())
+			result, err := diagnoser.DiagnoseFile(dumpPath)
+			if err != nil {
+				log.Warn("Could not diagnose dump", "file", entry.Name(), "error", err)
+				continue
+			}
+			results = append(results, result)
+		}

		// Check for any invalid dumps
@@ -895,7 +956,8 @@ func runRestoreCluster(cmd *cobra.Command, args []string) error {
	startTime := time.Now()
	auditLogger.LogRestoreStart(user, "all_databases", archivePath)

-	if err := engine.RestoreCluster(ctx, archivePath); err != nil {
+	// Pass pre-extracted directory to avoid double extraction
+	if err := engine.RestoreCluster(ctx, archivePath, extractedDir); err != nil {
		auditLogger.LogRestoreFailed(user, "all_databases", err)
		return fmt.Errorf("cluster restore failed: %w", err)
	}
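The body of `safety.ValidateAndExtractCluster()` is not shown in this compare view. A minimal sketch of what such a combined validate-and-extract step could look like, using only the standard library (names and behavior here are illustrative, not the actual dbbackup internals):

```go
package safety // illustrative stand-in for the real internal package

import (
	"archive/tar"
	"compress/gzip"
	"context"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// ValidateAndExtractCluster extracts archivePath once into a temp directory,
// validating the gzip/tar structure as a side effect, and returns the
// directory so both diagnosis and restore can reuse it.
func ValidateAndExtractCluster(ctx context.Context, archivePath string) (string, error) {
	f, err := os.Open(archivePath)
	if err != nil {
		return "", err
	}
	defer f.Close()

	gz, err := gzip.NewReader(f) // fails fast on a bad gzip header
	if err != nil {
		return "", fmt.Errorf("invalid gzip stream: %w", err)
	}
	defer gz.Close()

	dir, err := os.MkdirTemp("", "dbbackup-cluster-*")
	if err != nil {
		return "", err
	}

	tr := tar.NewReader(gz)
	for {
		if err := ctx.Err(); err != nil { // honor cancellation (Ctrl+C)
			os.RemoveAll(dir)
			return "", err
		}
		hdr, err := tr.Next()
		if err == io.EOF {
			return dir, nil
		}
		if err != nil {
			os.RemoveAll(dir)
			return "", fmt.Errorf("corrupt tar archive: %w", err)
		}
		target := filepath.Join(dir, hdr.Name)
		// Reject path traversal ("../") entries.
		if !strings.HasPrefix(target, filepath.Clean(dir)+string(os.PathSeparator)) {
			os.RemoveAll(dir)
			return "", fmt.Errorf("unsafe path in archive: %s", hdr.Name)
		}
		switch hdr.Typeflag {
		case tar.TypeDir:
			if err := os.MkdirAll(target, 0o755); err != nil {
				os.RemoveAll(dir)
				return "", err
			}
		case tar.TypeReg:
			if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
				os.RemoveAll(dir)
				return "", err
			}
			out, err := os.Create(target)
			if err != nil {
				os.RemoveAll(dir)
				return "", err
			}
			if _, err := io.Copy(out, tr); err != nil {
				out.Close()
				os.RemoveAll(dir)
				return "", err
			}
			out.Close()
		}
	}
}
```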
diagnose_postgres_memory.sh (new executable file, +359)

@@ -0,0 +1,359 @@
#!/bin/bash
#
# PostgreSQL Memory and Resource Diagnostic Tool
# Analyzes memory usage, locks, and system resources to identify restore issues
#

set -e

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

echo "════════════════════════════════════════════════════════════"
echo " PostgreSQL Memory & Resource Diagnostics"
echo " $(date '+%Y-%m-%d %H:%M:%S')"
echo "════════════════════════════════════════════════════════════"
echo

# Function to format bytes to human readable
bytes_to_human() {
    local bytes=$1
    if [ "$bytes" -ge 1073741824 ]; then
        echo "$(awk "BEGIN {printf \"%.2f GB\", $bytes/1073741824}")"
    elif [ "$bytes" -ge 1048576 ]; then
        echo "$(awk "BEGIN {printf \"%.2f MB\", $bytes/1048576}")"
    else
        echo "$(awk "BEGIN {printf \"%.2f KB\", $bytes/1024}")"
    fi
}

# 1. SYSTEM MEMORY OVERVIEW
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo -e "${BLUE}📊 SYSTEM MEMORY OVERVIEW${NC}"
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo

if command -v free &> /dev/null; then
    free -h
    echo

    # Calculate percentages
    MEM_TOTAL=$(free -b | awk '/^Mem:/ {print $2}')
    MEM_USED=$(free -b | awk '/^Mem:/ {print $3}')
    MEM_FREE=$(free -b | awk '/^Mem:/ {print $4}')
    MEM_AVAILABLE=$(free -b | awk '/^Mem:/ {print $7}')

    MEM_PERCENT=$(awk "BEGIN {printf \"%.1f\", ($MEM_USED/$MEM_TOTAL)*100}")

    echo "Memory Utilization: ${MEM_PERCENT}%"
    echo "Total:     $(bytes_to_human $MEM_TOTAL)"
    echo "Used:      $(bytes_to_human $MEM_USED)"
    echo "Available: $(bytes_to_human $MEM_AVAILABLE)"

    if (( $(echo "$MEM_PERCENT > 90" | bc -l) )); then
        echo -e "${RED}⚠️  WARNING: Memory usage is critically high (>90%)${NC}"
    elif (( $(echo "$MEM_PERCENT > 70" | bc -l) )); then
        echo -e "${YELLOW}⚠️  CAUTION: Memory usage is high (>70%)${NC}"
    else
        echo -e "${GREEN}✓ Memory usage is acceptable${NC}"
    fi
fi
echo

# 2. TOP MEMORY CONSUMERS
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo -e "${BLUE}🔍 TOP 15 MEMORY CONSUMING PROCESSES${NC}"
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo
ps aux --sort=-%mem | head -16 | awk 'NR==1 {print $0} NR>1 {printf "%-8s %5s%% %7s %s\n", $1, $4, $6/1024"M", $11}'
echo

# 3. POSTGRESQL PROCESSES
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo -e "${BLUE}🐘 POSTGRESQL PROCESSES${NC}"
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo

PG_PROCS=$(ps aux | grep -E "postgres.*:" | grep -v grep || true)
if [ -z "$PG_PROCS" ]; then
    echo "No PostgreSQL processes found"
else
    echo "$PG_PROCS" | awk '{printf "%-8s %5s%% %7s %s\n", $1, $4, $6/1024"M", $11}'
    echo

    # Sum up PostgreSQL memory
    PG_MEM_TOTAL=$(echo "$PG_PROCS" | awk '{sum+=$6} END {print sum/1024}')
    echo "Total PostgreSQL Memory: ${PG_MEM_TOTAL} MB"
fi
echo

# 4. POSTGRESQL CONFIGURATION
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo -e "${BLUE}⚙️  POSTGRESQL MEMORY CONFIGURATION${NC}"
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo

if command -v psql &> /dev/null; then
    PSQL_CMD="psql -t -A -c"

    # Try as postgres user first, then current user
    if sudo -u postgres $PSQL_CMD "SELECT 1" &> /dev/null; then
        PSQL_PREFIX="sudo -u postgres"
    elif $PSQL_CMD "SELECT 1" &> /dev/null; then
        PSQL_PREFIX=""
    else
        echo "❌ Cannot connect to PostgreSQL"
        PSQL_PREFIX="NONE"
    fi

    if [ "$PSQL_PREFIX" != "NONE" ]; then
        echo "Key Memory Settings:"
        echo "────────────────────────────────────────────────────────────"

        # Get all relevant settings (strip timing output)
        SHARED_BUFFERS=$($PSQL_PREFIX psql -t -A -c "SHOW shared_buffers;" 2>/dev/null | head -1 || echo "unknown")
        WORK_MEM=$($PSQL_PREFIX psql -t -A -c "SHOW work_mem;" 2>/dev/null | head -1 || echo "unknown")
        MAINT_WORK_MEM=$($PSQL_PREFIX psql -t -A -c "SHOW maintenance_work_mem;" 2>/dev/null | head -1 || echo "unknown")
        EFFECTIVE_CACHE=$($PSQL_PREFIX psql -t -A -c "SHOW effective_cache_size;" 2>/dev/null | head -1 || echo "unknown")
        MAX_CONNECTIONS=$($PSQL_PREFIX psql -t -A -c "SHOW max_connections;" 2>/dev/null | head -1 || echo "unknown")
        MAX_LOCKS=$($PSQL_PREFIX psql -t -A -c "SHOW max_locks_per_transaction;" 2>/dev/null | head -1 || echo "unknown")
        MAX_PREPARED=$($PSQL_PREFIX psql -t -A -c "SHOW max_prepared_transactions;" 2>/dev/null | head -1 || echo "unknown")

        echo "shared_buffers:            $SHARED_BUFFERS"
        echo "work_mem:                  $WORK_MEM"
        echo "maintenance_work_mem:      $MAINT_WORK_MEM"
        echo "effective_cache_size:      $EFFECTIVE_CACHE"
        echo "max_connections:           $MAX_CONNECTIONS"
        echo "max_locks_per_transaction: $MAX_LOCKS"
        echo "max_prepared_transactions: $MAX_PREPARED"
        echo

        # Calculate lock capacity
        if [ "$MAX_LOCKS" != "unknown" ] && [ "$MAX_CONNECTIONS" != "unknown" ] && [ "$MAX_PREPARED" != "unknown" ]; then
            # Ensure values are numeric
            if [[ "$MAX_LOCKS" =~ ^[0-9]+$ ]] && [[ "$MAX_CONNECTIONS" =~ ^[0-9]+$ ]] && [[ "$MAX_PREPARED" =~ ^[0-9]+$ ]]; then
                LOCK_CAPACITY=$((MAX_LOCKS * (MAX_CONNECTIONS + MAX_PREPARED)))
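                # Worked example (numbers from the analysis in email_infra_team.txt):
                # max_locks_per_transaction=4096, max_connections=100, max_prepared_transactions=0
                # => lock capacity = 4096 * (100 + 0) = 409600 locks.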
echo "Total Lock Capacity: $LOCK_CAPACITY locks"
|
||||
|
||||
if [ "$MAX_LOCKS" -lt 1000 ]; then
|
||||
echo -e "${RED}⚠️ WARNING: max_locks_per_transaction is too low for large restores${NC}"
|
||||
echo -e "${YELLOW} Recommended: 4096 or higher${NC}"
|
||||
fi
|
||||
fi
|
||||
fi
|
||||
echo
|
||||
fi
|
||||
else
|
||||
echo "❌ psql not found"
|
||||
fi
|
||||
|
||||
# 5. CURRENT LOCKS AND CONNECTIONS
|
||||
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
|
||||
echo -e "${BLUE}🔒 CURRENT LOCKS AND CONNECTIONS${NC}"
|
||||
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
|
||||
echo
|
||||
|
||||
if [ "$PSQL_PREFIX" != "NONE" ] && command -v psql &> /dev/null; then
|
||||
# Active connections
|
||||
ACTIVE_CONNS=$($PSQL_PREFIX psql -t -A -c "SELECT count(*) FROM pg_stat_activity;" 2>/dev/null | head -1 || echo "0")
|
||||
echo "Active Connections: $ACTIVE_CONNS / $MAX_CONNECTIONS"
|
||||
echo
|
||||
|
||||
# Lock statistics
|
||||
echo "Current Lock Usage:"
|
||||
echo "────────────────────────────────────────────────────────────"
|
||||
$PSQL_PREFIX psql -c "
|
||||
SELECT
|
||||
mode,
|
||||
COUNT(*) as count
|
||||
FROM pg_locks
|
||||
GROUP BY mode
|
||||
ORDER BY count DESC;
|
||||
" 2>/dev/null || echo "Unable to query locks"
|
||||
echo
|
||||
|
||||
# Total locks
|
||||
TOTAL_LOCKS=$($PSQL_PREFIX psql -t -A -c "SELECT COUNT(*) FROM pg_locks;" 2>/dev/null | head -1 || echo "0")
|
||||
echo "Total Active Locks: $TOTAL_LOCKS"
|
||||
|
||||
if [ ! -z "$LOCK_CAPACITY" ] && [ ! -z "$TOTAL_LOCKS" ] && [[ "$TOTAL_LOCKS" =~ ^[0-9]+$ ]] && [ "$TOTAL_LOCKS" -gt 0 ] 2>/dev/null; then
|
||||
LOCK_PERCENT=$((TOTAL_LOCKS * 100 / LOCK_CAPACITY))
|
||||
echo "Lock Usage: ${LOCK_PERCENT}%"
|
||||
|
||||
if [ "$LOCK_PERCENT" -gt 80 ]; then
|
||||
echo -e "${RED}⚠️ WARNING: Lock table usage is critically high${NC}"
|
||||
elif [ "$LOCK_PERCENT" -gt 60 ]; then
|
||||
echo -e "${YELLOW}⚠️ CAUTION: Lock table usage is elevated${NC}"
|
||||
fi
|
||||
fi
|
||||
echo
|
||||
|
||||
# Blocking queries
|
||||
echo "Blocking Queries:"
|
||||
echo "────────────────────────────────────────────────────────────"
|
||||
$PSQL_PREFIX psql -c "
|
||||
SELECT
|
||||
blocked_locks.pid AS blocked_pid,
|
||||
blocking_locks.pid AS blocking_pid,
|
||||
blocked_activity.usename AS blocked_user,
|
||||
blocking_activity.usename AS blocking_user,
|
||||
blocked_activity.query AS blocked_query
|
||||
FROM pg_catalog.pg_locks blocked_locks
|
||||
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
|
||||
JOIN pg_catalog.pg_locks blocking_locks
|
||||
ON blocking_locks.locktype = blocked_locks.locktype
|
||||
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
|
||||
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
|
||||
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
|
||||
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
|
||||
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
|
||||
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
|
||||
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
|
||||
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
|
||||
AND blocking_locks.pid != blocked_locks.pid
|
||||
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
|
||||
WHERE NOT blocked_locks.granted;
|
||||
" 2>/dev/null || echo "No blocking queries or unable to query"
|
||||
echo
|
||||
fi
|
||||
|
||||
# 6. SHARED MEMORY USAGE
|
||||
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
|
||||
echo -e "${BLUE}💾 SHARED MEMORY SEGMENTS${NC}"
|
||||
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
|
||||
echo
|
||||
|
||||
if command -v ipcs &> /dev/null; then
|
||||
ipcs -m
|
||||
echo
|
||||
|
||||
# Sum up shared memory
|
||||
TOTAL_SHM=$(ipcs -m | awk '/^0x/ {sum+=$5} END {print sum}')
|
||||
if [ ! -z "$TOTAL_SHM" ]; then
|
||||
echo "Total Shared Memory: $(bytes_to_human $TOTAL_SHM)"
|
||||
fi
|
||||
else
|
||||
echo "ipcs command not available"
|
||||
fi
|
||||
echo
|
||||
|
||||
# 7. DISK SPACE (relevant for temp files)
|
||||
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
|
||||
echo -e "${BLUE}💿 DISK SPACE${NC}"
|
||||
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
|
||||
echo
|
||||
|
||||
df -h | grep -E "Filesystem|/$|/var|/tmp|/postgres"
|
||||
echo
|
||||
|
||||
# Check for PostgreSQL temp files
|
||||
if [ "$PSQL_PREFIX" != "NONE" ] && command -v psql &> /dev/null; then
|
||||
TEMP_FILES=$($PSQL_PREFIX psql -t -A -c "SELECT count(*) FROM pg_stat_database WHERE temp_files > 0;" 2>/dev/null | head -1 || echo "0")
|
||||
if [ ! -z "$TEMP_FILES" ] && [ "$TEMP_FILES" -gt 0 ] 2>/dev/null; then
|
||||
echo -e "${YELLOW}⚠️ Databases are using temporary files (work_mem may be too low)${NC}"
|
||||
$PSQL_PREFIX psql -c "SELECT datname, temp_files, pg_size_pretty(temp_bytes) as temp_size FROM pg_stat_database WHERE temp_files > 0;" 2>/dev/null
|
||||
echo
|
||||
fi
|
||||
fi
|
||||
|
||||
# 8. OTHER RESOURCE CONSUMERS
|
||||
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
|
||||
echo -e "${BLUE}🔍 OTHER POTENTIAL MEMORY CONSUMERS${NC}"
|
||||
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
|
||||
echo
|
||||
|
||||
# Check for common memory hogs
|
||||
echo "Checking for common memory-intensive services..."
|
||||
echo
|
||||
|
||||
for service in "mysqld" "mongodb" "redis" "elasticsearch" "java" "docker" "containerd"; do
|
||||
MEM=$(ps aux | grep "$service" | grep -v grep | awk '{sum+=$4} END {printf "%.1f", sum}')
|
||||
if [ ! -z "$MEM" ] && (( $(echo "$MEM > 0" | bc -l) )); then
|
||||
echo " ${service}: ${MEM}%"
|
||||
fi
|
||||
done
|
||||
echo
|
||||
|
||||
# 9. SWAP USAGE
|
||||
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
|
||||
echo -e "${BLUE}🔄 SWAP USAGE${NC}"
|
||||
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
|
||||
echo
|
||||
|
||||
if command -v free &> /dev/null; then
|
||||
SWAP_TOTAL=$(free -b | awk '/^Swap:/ {print $2}')
|
||||
SWAP_USED=$(free -b | awk '/^Swap:/ {print $3}')
|
||||
|
||||
if [ "$SWAP_TOTAL" -gt 0 ]; then
|
||||
SWAP_PERCENT=$(awk "BEGIN {printf \"%.1f\", ($SWAP_USED/$SWAP_TOTAL)*100}")
|
||||
echo "Swap Total: $(bytes_to_human $SWAP_TOTAL)"
|
||||
echo "Swap Used: $(bytes_to_human $SWAP_USED) (${SWAP_PERCENT}%)"
|
||||
|
||||
if (( $(echo "$SWAP_PERCENT > 50" | bc -l) )); then
|
||||
echo -e "${RED}⚠️ WARNING: Heavy swap usage detected - system may be thrashing${NC}"
|
||||
elif (( $(echo "$SWAP_PERCENT > 20" | bc -l) )); then
|
||||
echo -e "${YELLOW}⚠️ CAUTION: System is using swap${NC}"
|
||||
else
|
||||
echo -e "${GREEN}✓ Swap usage is low${NC}"
|
||||
fi
|
||||
else
|
||||
echo "No swap configured"
|
||||
fi
|
||||
fi
|
||||
echo
|
||||
|
||||
# 10. RECOMMENDATIONS
|
||||
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
|
||||
echo -e "${BLUE}💡 RECOMMENDATIONS${NC}"
|
||||
echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
|
||||
echo
|
||||
|
||||
echo "Based on the diagnostics:"
|
||||
echo
|
||||
|
||||
# Memory recommendations
|
||||
if [ ! -z "$MEM_PERCENT" ]; then
|
||||
if (( $(echo "$MEM_PERCENT > 80" | bc -l) )); then
|
||||
echo "1. ⚠️ Memory Pressure:"
|
||||
echo " • System memory is ${MEM_PERCENT}% utilized"
|
||||
echo " • Stop non-essential services before restore"
|
||||
echo " • Consider increasing system RAM"
|
||||
echo " • Use 'dbbackup restore --parallel=1' to reduce memory usage"
|
||||
echo
|
||||
fi
|
||||
fi
|
||||
|
||||
# Lock recommendations
|
||||
if [ "$MAX_LOCKS" != "unknown" ] && [ ! -z "$MAX_LOCKS" ] && [[ "$MAX_LOCKS" =~ ^[0-9]+$ ]]; then
|
||||
if [ "$MAX_LOCKS" -lt 1000 ] 2>/dev/null; then
|
||||
echo "2. ⚠️ Lock Configuration:"
|
||||
echo " • max_locks_per_transaction is too low: $MAX_LOCKS"
|
||||
echo " • Run: ./fix_postgres_locks.sh"
|
||||
echo " • Or manually: ALTER SYSTEM SET max_locks_per_transaction = 4096;"
|
||||
echo " • Then restart PostgreSQL"
|
||||
echo
|
||||
fi
|
||||
fi
|
||||
|
||||
# Other recommendations
|
||||
echo "3. 🔧 Before Large Restores:"
|
||||
echo " • Stop unnecessary services (web servers, cron jobs, etc.)"
|
||||
echo " • Clear PostgreSQL idle connections"
|
||||
echo " • Ensure adequate disk space for temp files"
|
||||
echo " • Consider using --large-db mode for very large databases"
|
||||
echo
|
||||
|
||||
echo "4. 📊 Monitor During Restore:"
|
||||
echo " • Watch: watch -n 2 'ps aux | grep postgres | head -20'"
|
||||
echo " • Locks: watch -n 5 'psql -c \"SELECT COUNT(*) FROM pg_locks;\"'"
|
||||
echo " • Memory: watch -n 2 free -h"
|
||||
echo
|
||||
|
||||
echo "════════════════════════════════════════════════════════════"
|
||||
echo " Report generated: $(date '+%Y-%m-%d %H:%M:%S')"
|
||||
echo " Save this output: $0 > diagnosis_$(date +%Y%m%d_%H%M%S).log"
|
||||
echo "════════════════════════════════════════════════════════════"
|
||||
email_infra_team.txt (new file, +112)

@@ -0,0 +1,112 @@
Subject: PostgreSQL restore error - "out of shared memory" on RST server

Hello Infra team,

we are repeatedly seeing restore failures with "out of shared memory" messages on the RST PostgreSQL server (PostgreSQL 17.4).

═══════════════════════════════════════════════════════════
ANALYSIS (as of 2026-01-20)
═══════════════════════════════════════════════════════════

Server specs:
• RAM: 31 GB (currently 19.6 GB in use = 63.9%)
• PostgreSQL itself only uses ~118 MB for its own processes
• Swap: 4 GB (6.4% used)

Lock configuration:
• max_locks_per_transaction: 4096 ✓ (already correct)
• max_connections: 100
• Lock capacity: 409,600 ✓ (sufficient)

═══════════════════════════════════════════════════════════
PROBLEM IDENTIFICATION
═══════════════════════════════════════════════════════════

1. MEMORY CONSUMERS (non-PostgreSQL):
   • Nessus Agent: ~173 MB
   • Elastic Agent: ~300 MB (multiple components)
   • Icinga: ~24 MB
   • Other monitoring: ~100+ MB

2. WORK_MEM TOO LOW:
   • Currently: 64 MB
   • 4 databases are using temp files (an indicator of too little work_mem):
     - prodkc: 201 MB temp files
     - keycloak: 45 MB temp files
     - d7030: 6 MB temp files
     - pgbench_db: 2 MB temp files

═══════════════════════════════════════════════════════════
RECOMMENDED MEASURES
═══════════════════════════════════════════════════════════

OPTION A - Temporary, for large restores:
-------------------------------------------
1. Stop monitoring agents (frees ~500 MB):
   sudo systemctl stop nessus-agent
   sudo systemctl stop elastic-agent

2. Increase work_mem:
   sudo -u postgres psql -c "ALTER SYSTEM SET work_mem = '256MB';"
   sudo systemctl restart postgresql

3. Run the restore

4. Start the agents again:
   sudo systemctl start nessus-agent
   sudo systemctl start elastic-agent


OPTION B - Permanent solution:
-------------------------------------------
1. Increase work_mem to 256 MB (instead of 64 MB)
2. Optionally increase maintenance_work_mem to 4 GB (instead of 2 GB)
3. If possible: move monitoring to a dedicated server

SQL commands:
   ALTER SYSTEM SET work_mem = '256MB';
   ALTER SYSTEM SET maintenance_work_mem = '4GB';
   -- Then restart PostgreSQL


OPTION C - If no config change is possible:
-------------------------------------------
• Run the restore with --profile=conservative (reduces memory pressure):
  dbbackup restore cluster backup.tar.gz --profile=conservative --confirm

• Or use TUI mode (automatically uses the conservative profile):
  dbbackup interactive

• Disable monitoring during the restore window

═══════════════════════════════════════════════════════════
DETAILED REPORT
═══════════════════════════════════════════════════════════

The full diagnostic report is attached and can be regenerated at any
time with this script:

   /path/to/diagnose_postgres_memory.sh

The script analyzes:
• System memory usage
• PostgreSQL configuration
• Lock usage
• Temp file usage
• Blocking queries
• Shared memory segments

═══════════════════════════════════════════════════════════

Option B (permanently raising work_mem) would be our preference, so that
future large restores run through without manual intervention.

Please let us know which option you will implement, or whether you need
further information.

Thanks & regards
[Your name]

---
Attachment: diagnose_postgres_memory.sh (if not already present)
Error log: /a01/dba/tmp/dbbackup-restore-debug-20260119-221730.json
fix_postgres_locks.sh (new executable file, +140)

@@ -0,0 +1,140 @@
#!/bin/bash
#
# Fix PostgreSQL Lock Table Exhaustion
# Increases max_locks_per_transaction to handle large database restores
#

set -e

echo "════════════════════════════════════════════════════════════"
echo " PostgreSQL Lock Configuration Fix"
echo "════════════════════════════════════════════════════════════"
echo

# Check if running as postgres user or with sudo
if [ "$EUID" -ne 0 ] && [ "$(whoami)" != "postgres" ]; then
    echo "⚠️  This script should be run as:"
    echo "   sudo $0"
    echo "   or as the postgres user"
    echo
    read -p "Continue anyway? (y/N) " -n 1 -r
    echo
    if [[ ! $REPLY =~ ^[Yy]$ ]]; then
        exit 1
    fi
fi

# Detect PostgreSQL version and config
PSQL=$(command -v psql || echo "")
if [ -z "$PSQL" ]; then
    echo "❌ psql not found in PATH"
    exit 1
fi

echo "📊 Current PostgreSQL Configuration:"
echo "────────────────────────────────────────────────────────────"
sudo -u postgres psql -c "SHOW max_locks_per_transaction;" 2>/dev/null || psql -c "SHOW max_locks_per_transaction;" || echo "Unable to query current value"
sudo -u postgres psql -c "SHOW max_connections;" 2>/dev/null || psql -c "SHOW max_connections;" || echo "Unable to query current value"
sudo -u postgres psql -c "SHOW work_mem;" 2>/dev/null || psql -c "SHOW work_mem;" || echo "Unable to query current value"
sudo -u postgres psql -c "SHOW maintenance_work_mem;" 2>/dev/null || psql -c "SHOW maintenance_work_mem;" || echo "Unable to query current value"
echo

# Recommended values
RECOMMENDED_LOCKS=4096
RECOMMENDED_WORK_MEM="256MB"
RECOMMENDED_MAINTENANCE_WORK_MEM="4GB"

echo "🔧 Applying Fixes:"
echo "────────────────────────────────────────────────────────────"
echo "1. Setting max_locks_per_transaction = $RECOMMENDED_LOCKS"
echo "2. Setting work_mem = $RECOMMENDED_WORK_MEM (improves query performance)"
echo "3. Setting maintenance_work_mem = $RECOMMENDED_MAINTENANCE_WORK_MEM (speeds up restore/vacuum)"
echo

# Apply the settings
SUCCESS=0

# Fix 1: max_locks_per_transaction
if sudo -u postgres psql -c "ALTER SYSTEM SET max_locks_per_transaction = $RECOMMENDED_LOCKS;" 2>/dev/null; then
    echo "✅ max_locks_per_transaction updated successfully"
    SUCCESS=$((SUCCESS + 1))
elif psql -c "ALTER SYSTEM SET max_locks_per_transaction = $RECOMMENDED_LOCKS;" 2>/dev/null; then
    echo "✅ max_locks_per_transaction updated successfully"
    SUCCESS=$((SUCCESS + 1))
else
    echo "❌ Failed to update max_locks_per_transaction"
fi

# Fix 2: work_mem
if sudo -u postgres psql -c "ALTER SYSTEM SET work_mem = '$RECOMMENDED_WORK_MEM';" 2>/dev/null; then
    echo "✅ work_mem updated successfully"
    SUCCESS=$((SUCCESS + 1))
elif psql -c "ALTER SYSTEM SET work_mem = '$RECOMMENDED_WORK_MEM';" 2>/dev/null; then
    echo "✅ work_mem updated successfully"
    SUCCESS=$((SUCCESS + 1))
else
    echo "❌ Failed to update work_mem"
fi

# Fix 3: maintenance_work_mem
if sudo -u postgres psql -c "ALTER SYSTEM SET maintenance_work_mem = '$RECOMMENDED_MAINTENANCE_WORK_MEM';" 2>/dev/null; then
    echo "✅ maintenance_work_mem updated successfully"
    SUCCESS=$((SUCCESS + 1))
elif psql -c "ALTER SYSTEM SET maintenance_work_mem = '$RECOMMENDED_MAINTENANCE_WORK_MEM';" 2>/dev/null; then
    echo "✅ maintenance_work_mem updated successfully"
    SUCCESS=$((SUCCESS + 1))
else
    echo "❌ Failed to update maintenance_work_mem"
fi

if [ $SUCCESS -eq 0 ]; then
    echo
    echo "❌ All configuration updates failed"
    echo
    echo "Manual steps:"
    echo "1. Connect to PostgreSQL as superuser:"
    echo "   sudo -u postgres psql"
    echo
    echo "2. Run these commands:"
    echo "   ALTER SYSTEM SET max_locks_per_transaction = $RECOMMENDED_LOCKS;"
    echo "   ALTER SYSTEM SET work_mem = '$RECOMMENDED_WORK_MEM';"
    echo "   ALTER SYSTEM SET maintenance_work_mem = '$RECOMMENDED_MAINTENANCE_WORK_MEM';"
    echo
    exit 1
fi

echo
echo "✅ Applied $SUCCESS out of 3 configuration changes"

echo
echo "⚠️  IMPORTANT: PostgreSQL restart required!"
echo "────────────────────────────────────────────────────────────"
echo
echo "Restart PostgreSQL using one of these commands:"
echo
echo "  • systemd: sudo systemctl restart postgresql"
echo "  • pg_ctl:  sudo -u postgres pg_ctl restart -D /var/lib/postgresql/data"
echo "  • service: sudo service postgresql restart"
echo
echo "📊 Expected capacity after restart:"
echo "────────────────────────────────────────────────────────────"
echo "  Lock capacity: max_locks_per_transaction × (max_connections + max_prepared)"
echo "                 = $RECOMMENDED_LOCKS × (connections + prepared)"
echo
echo "  Work memory:  $RECOMMENDED_WORK_MEM per query operation"
echo "  Maintenance:  $RECOMMENDED_MAINTENANCE_WORK_MEM for restore/vacuum/index"
echo
echo "After restarting, verify with:"
echo "  psql -c 'SHOW max_locks_per_transaction;'"
echo "  psql -c 'SHOW work_mem;'"
echo "  psql -c 'SHOW maintenance_work_mem;'"
echo
echo "💡 Benefits:"
echo "  ✓ Prevents 'out of shared memory' errors during restore"
echo "  ✓ Reduces temp file usage (better performance)"
echo "  ✓ Faster restore, vacuum, and index operations"
echo
echo "🔍 For comprehensive diagnostics, run:"
echo "   ./diagnose_postgres_memory.sh"
echo
echo "════════════════════════════════════════════════════════════"
@@ -445,6 +445,12 @@ func (c *Config) ApplyResourceProfile(profileName string) error {
	// Apply profile settings
	c.ResourceProfile = profile.Name

+	// If LargeDBMode is enabled, apply its modifiers
+	if c.LargeDBMode {
+		profile = cpu.ApplyLargeDBMode(profile)
+	}
+
	c.ClusterParallelism = profile.ClusterParallelism
	c.Jobs = profile.Jobs
	c.DumpJobs = profile.DumpJobs
@@ -30,7 +30,7 @@ type LocalConfig struct {
	// Performance settings
	CPUWorkload     string
	MaxCores        int
	ClusterTimeout  int // Cluster operation timeout in minutes (default: 1440 = 24 hours)
	ResourceProfile string
	LargeDBMode     bool // Enable large database mode (reduces parallelism, increases locks)
internal/config/profile.go (new file, +128)

@@ -0,0 +1,128 @@
package config
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"strings"
|
||||
)
|
||||
|
||||
// RestoreProfile defines resource settings for restore operations
|
||||
type RestoreProfile struct {
|
||||
Name string
|
||||
ParallelDBs int // Number of databases to restore in parallel
|
||||
Jobs int // Parallel decompression jobs
|
||||
DisableProgress bool // Disable progress indicators to reduce overhead
|
||||
MemoryConservative bool // Use memory-conservative settings
|
||||
}
|
||||
|
||||
// GetRestoreProfile returns the profile settings for a given profile name
|
||||
func GetRestoreProfile(profileName string) (*RestoreProfile, error) {
|
||||
profileName = strings.ToLower(strings.TrimSpace(profileName))
|
||||
|
||||
switch profileName {
|
||||
case "conservative":
|
||||
return &RestoreProfile{
|
||||
Name: "conservative",
|
||||
ParallelDBs: 1, // Single-threaded restore
|
||||
Jobs: 1, // Single-threaded decompression
|
||||
DisableProgress: false,
|
||||
MemoryConservative: true,
|
||||
}, nil
|
||||
|
||||
case "balanced", "":
|
||||
return &RestoreProfile{
|
||||
Name: "balanced",
|
||||
ParallelDBs: 0, // Use config default or auto-detect
|
||||
Jobs: 0, // Use config default or auto-detect
|
||||
DisableProgress: false,
|
||||
MemoryConservative: false,
|
||||
}, nil
|
||||
|
||||
case "aggressive", "performance", "max":
|
||||
return &RestoreProfile{
|
||||
Name: "aggressive",
|
||||
ParallelDBs: -1, // Auto-detect based on resources
|
||||
Jobs: -1, // Auto-detect based on CPU
|
||||
DisableProgress: false,
|
||||
MemoryConservative: false,
|
||||
}, nil
|
||||
|
||||
case "potato":
|
||||
// Easter egg: same as conservative but with a fun name
|
||||
return &RestoreProfile{
|
||||
Name: "potato",
|
||||
ParallelDBs: 1,
|
||||
Jobs: 1,
|
||||
DisableProgress: false,
|
||||
MemoryConservative: true,
|
||||
}, nil
|
||||
|
||||
default:
|
||||
return nil, fmt.Errorf("unknown profile: %s (valid: conservative, balanced, aggressive)", profileName)
|
||||
}
|
||||
}
|
||||
|
||||
// ApplyProfile applies profile settings to config, respecting explicit user overrides
|
||||
func ApplyProfile(cfg *Config, profileName string, explicitJobs, explicitParallelDBs int) error {
|
||||
profile, err := GetRestoreProfile(profileName)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
// Show profile being used
|
||||
if cfg.Debug {
|
||||
fmt.Printf("Using restore profile: %s\n", profile.Name)
|
||||
if profile.MemoryConservative {
|
||||
fmt.Println("Memory-conservative mode enabled")
|
||||
}
|
||||
}
|
||||
|
||||
// Apply profile settings only if not explicitly overridden
|
||||
if explicitJobs == 0 && profile.Jobs > 0 {
|
||||
cfg.Jobs = profile.Jobs
|
||||
}
|
||||
|
||||
if explicitParallelDBs == 0 && profile.ParallelDBs != 0 {
|
||||
cfg.ClusterParallelism = profile.ParallelDBs
|
||||
}
|
||||
|
||||
// Store profile name
|
||||
cfg.ResourceProfile = profile.Name
|
||||
|
||||
// Conservative profile implies large DB mode settings
|
||||
if profile.MemoryConservative {
|
||||
cfg.LargeDBMode = true
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
// GetProfileDescription returns a human-readable description of the profile
|
||||
func GetProfileDescription(profileName string) string {
|
||||
profile, err := GetRestoreProfile(profileName)
|
||||
if err != nil {
|
||||
return "Unknown profile"
|
||||
}
|
||||
|
||||
switch profile.Name {
|
||||
case "conservative":
|
||||
return "Conservative: --parallel=1, single-threaded, minimal memory usage. Best for resource-constrained servers or when other services are running."
|
||||
case "potato":
|
||||
return "Potato Mode: Same as conservative, for servers running on a potato 🥔"
|
||||
case "balanced":
|
||||
return "Balanced: Auto-detect resources, moderate parallelism. Good default for most scenarios."
|
||||
case "aggressive":
|
||||
return "Aggressive: Maximum parallelism, all available resources. Best for dedicated database servers with ample resources."
|
||||
default:
|
||||
return profile.Name
|
||||
}
|
||||
}
|
||||
|
||||
// ListProfiles returns a list of all available profiles with descriptions
|
||||
func ListProfiles() map[string]string {
|
||||
return map[string]string{
|
||||
"conservative": GetProfileDescription("conservative"),
|
||||
"balanced": GetProfileDescription("balanced"),
|
||||
"aggressive": GetProfileDescription("aggressive"),
|
||||
"potato": GetProfileDescription("potato"),
|
||||
}
|
||||
}
|
||||
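A minimal usage sketch for the profile API above. The `Config` field names (`Jobs`, `ClusterParallelism`, `Debug`, `LargeDBMode`) come from this diff; the `main` wrapper and the hard-coded profile name are illustrative only, not actual dbbackup CLI code:

```go
package main

import (
	"fmt"
	"log"

	"dbbackup/internal/config"
)

func main() {
	cfg := &config.Config{Debug: true}

	// Passing 0 for explicitJobs/explicitParallelDBs means "no user override",
	// so ApplyProfile is free to apply the profile's values.
	if err := config.ApplyProfile(cfg, "conservative", 0, 0); err != nil {
		log.Fatal(err)
	}

	// conservative implies Jobs=1, ClusterParallelism=1, LargeDBMode=true.
	fmt.Println(cfg.Jobs, cfg.ClusterParallelism, cfg.LargeDBMode)
}
```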
@ -146,7 +146,7 @@ func (d *Dots) Start(message string) {
	fmt.Fprint(d.writer, message)

	go func() {
-		ticker := time.NewTicker(500 * time.Millisecond)
+		ticker := time.NewTicker(100 * time.Millisecond)
		defer ticker.Stop()

		count := 0
@ -821,7 +821,9 @@ func (e *Engine) previewRestore(archivePath, targetDB string, format ArchiveForm
}

// RestoreCluster restores a full cluster from a tar.gz archive
-func (e *Engine) RestoreCluster(ctx context.Context, archivePath string) error {
+// If preExtractedPath is non-empty, uses that directory instead of extracting archivePath.
+// This avoids double extraction when ValidateAndExtractCluster was already called.
+func (e *Engine) RestoreCluster(ctx context.Context, archivePath string, preExtractedPath ...string) error {
	operation := e.log.StartOperation("Cluster Restore")

	// Validate and sanitize archive path
@ -852,22 +854,32 @@ func (e *Engine) RestoreCluster(ctx context.Context, archivePath string) error {
		return fmt.Errorf("not a cluster archive: %s (detected format: %s)", archivePath, format)
	}

-	// Check disk space before starting restore
-	e.log.Info("Checking disk space for restore")
-	archiveInfo, err := os.Stat(archivePath)
-	if err == nil {
-		spaceCheck := checks.CheckDiskSpaceForRestore(e.cfg.BackupDir, archiveInfo.Size())
-
-		if spaceCheck.Critical {
-			operation.Fail("Insufficient disk space")
-			return fmt.Errorf("insufficient disk space for restore: %.1f%% used - need at least 4x archive size", spaceCheck.UsedPercent)
-		}
-
-		if spaceCheck.Warning {
-			e.log.Warn("Low disk space - restore may fail",
-				"available_gb", float64(spaceCheck.AvailableBytes)/(1024*1024*1024),
-				"used_percent", spaceCheck.UsedPercent)
-		}
-	}
+	// Check if we have a pre-extracted directory (optimization to avoid double extraction)
+	// This check must happen BEFORE disk space checks to avoid false failures
+	usingPreExtracted := len(preExtractedPath) > 0 && preExtractedPath[0] != ""
+
+	// Check disk space before starting restore (skip if using pre-extracted directory)
+	var archiveInfo os.FileInfo
+	var err error
+	if !usingPreExtracted {
+		e.log.Info("Checking disk space for restore")
+		archiveInfo, err = os.Stat(archivePath)
+		if err == nil {
+			spaceCheck := checks.CheckDiskSpaceForRestore(e.cfg.BackupDir, archiveInfo.Size())
+
+			if spaceCheck.Critical {
+				operation.Fail("Insufficient disk space")
+				return fmt.Errorf("insufficient disk space for restore: %.1f%% used - need at least 4x archive size", spaceCheck.UsedPercent)
+			}
+
+			if spaceCheck.Warning {
+				e.log.Warn("Low disk space - restore may fail",
+					"available_gb", float64(spaceCheck.AvailableBytes)/(1024*1024*1024),
+					"used_percent", spaceCheck.UsedPercent)
+			}
+		}
+	} else {
+		e.log.Info("Skipping disk space check (using pre-extracted directory)")
+	}

	if e.dryRun {
@ -881,46 +893,56 @@ func (e *Engine) RestoreCluster(ctx context.Context, archivePath string) error {
	workDir := e.cfg.GetEffectiveWorkDir()
	tempDir := filepath.Join(workDir, fmt.Sprintf(".restore_%d", time.Now().Unix()))

-	// Check disk space for extraction (need ~3x archive size: compressed + extracted + working space)
-	if archiveInfo != nil {
-		requiredBytes := uint64(archiveInfo.Size()) * 3
-		extractionCheck := checks.CheckDiskSpace(workDir)
-		if extractionCheck.AvailableBytes < requiredBytes {
-			operation.Fail("Insufficient disk space for extraction")
-			return fmt.Errorf("insufficient disk space for extraction in %s: need %.1f GB, have %.1f GB (archive size: %.1f GB × 3)",
-				workDir,
-				float64(requiredBytes)/(1024*1024*1024),
-				float64(extractionCheck.AvailableBytes)/(1024*1024*1024),
-				float64(archiveInfo.Size())/(1024*1024*1024))
-		}
-		e.log.Info("Disk space check for extraction passed",
-			"workdir", workDir,
-			"required_gb", float64(requiredBytes)/(1024*1024*1024),
-			"available_gb", float64(extractionCheck.AvailableBytes)/(1024*1024*1024))
-	}
-
-	if err := os.MkdirAll(tempDir, 0755); err != nil {
-		operation.Fail("Failed to create temporary directory")
-		return fmt.Errorf("failed to create temp directory in %s: %w", workDir, err)
-	}
-	defer os.RemoveAll(tempDir)
-
-	// Extract archive
-	e.log.Info("Extracting cluster archive", "archive", archivePath, "tempDir", tempDir)
-	if err := e.extractArchive(ctx, archivePath, tempDir); err != nil {
-		operation.Fail("Archive extraction failed")
-		return fmt.Errorf("failed to extract archive: %w", err)
-	}
-
-	// Check context validity after extraction (debugging context cancellation issues)
-	if ctx.Err() != nil {
-		e.log.Error("Context cancelled after extraction - this should not happen",
-			"context_error", ctx.Err(),
-			"extraction_completed", true)
-		operation.Fail("Context cancelled unexpectedly")
-		return fmt.Errorf("context cancelled after extraction completed: %w", ctx.Err())
-	}
-	e.log.Info("Extraction completed, context still valid")
+	// Handle pre-extracted directory or extract archive
+	if usingPreExtracted {
+		tempDir = preExtractedPath[0]
+		// Note: Caller handles cleanup of pre-extracted directory
+		e.log.Info("Using pre-extracted cluster directory",
+			"path", tempDir,
+			"optimization", "skipping duplicate extraction")
+	} else {
+		// Check disk space for extraction (need ~3x archive size: compressed + extracted + working space)
+		if archiveInfo != nil {
+			requiredBytes := uint64(archiveInfo.Size()) * 3
+			extractionCheck := checks.CheckDiskSpace(workDir)
+			if extractionCheck.AvailableBytes < requiredBytes {
+				operation.Fail("Insufficient disk space for extraction")
+				return fmt.Errorf("insufficient disk space for extraction in %s: need %.1f GB, have %.1f GB (archive size: %.1f GB × 3)",
+					workDir,
+					float64(requiredBytes)/(1024*1024*1024),
+					float64(extractionCheck.AvailableBytes)/(1024*1024*1024),
+					float64(archiveInfo.Size())/(1024*1024*1024))
+			}
+			e.log.Info("Disk space check for extraction passed",
+				"workdir", workDir,
+				"required_gb", float64(requiredBytes)/(1024*1024*1024),
+				"available_gb", float64(extractionCheck.AvailableBytes)/(1024*1024*1024))
+		}

+		// Need to extract archive ourselves
+		if err := os.MkdirAll(tempDir, 0755); err != nil {
+			operation.Fail("Failed to create temporary directory")
+			return fmt.Errorf("failed to create temp directory in %s: %w", workDir, err)
+		}
+		defer os.RemoveAll(tempDir)
+
+		// Extract archive
+		e.log.Info("Extracting cluster archive", "archive", archivePath, "tempDir", tempDir)
+		if err := e.extractArchive(ctx, archivePath, tempDir); err != nil {
+			operation.Fail("Archive extraction failed")
+			return fmt.Errorf("failed to extract archive: %w", err)
+		}
+
+		// Check context validity after extraction (debugging context cancellation issues)
+		if ctx.Err() != nil {
+			e.log.Error("Context cancelled after extraction - this should not happen",
+				"context_error", ctx.Err(),
+				"extraction_completed", true)
+			operation.Fail("Context cancelled unexpectedly")
+			return fmt.Errorf("context cancelled after extraction completed: %w", ctx.Err())
+		}
+		e.log.Info("Extraction completed, context still valid")
+	}

	// Check if user has superuser privileges (required for ownership restoration)
	e.progress.Update("Checking privileges...")
@ -1484,9 +1506,9 @@ func (pr *progressReader) Read(p []byte) (n int, err error) {
	n, err = pr.reader.Read(p)
	pr.bytesRead += int64(n)

-	// Throttle progress reporting to every 100ms
+	// Throttle progress reporting to every 50ms for smoother updates
	if pr.reportEvery == 0 {
-		pr.reportEvery = 100 * time.Millisecond
+		pr.reportEvery = 50 * time.Millisecond
	}
	if time.Since(pr.lastReport) > pr.reportEvery {
		if pr.callback != nil {
@ -190,7 +190,7 @@ func (s *Safety) validateSQLScriptGz(path string) error {
	return fmt.Errorf("does not appear to contain SQL content")
}

-// validateTarGz validates tar.gz archive
+// validateTarGz validates tar.gz archive with fast stream-based checks
func (s *Safety) validateTarGz(path string) error {
	file, err := os.Open(path)
	if err != nil {
@ -205,11 +205,40 @@ func (s *Safety) validateTarGz(path string) error {
		return fmt.Errorf("cannot read file header")
	}

-	if buffer[0] == 0x1f && buffer[1] == 0x8b {
-		return nil // Valid gzip header
+	if buffer[0] != 0x1f || buffer[1] != 0x8b {
+		return fmt.Errorf("not a valid gzip file")
	}

-	return fmt.Errorf("not a valid gzip file")
+	// Quick tar structure validation (stream-based, no full extraction)
+	// Reset to start and decompress first few KB to check tar header
+	file.Seek(0, 0)
+	gzReader, err := gzip.NewReader(file)
+	if err != nil {
+		return fmt.Errorf("gzip corruption detected: %w", err)
+	}
+	defer gzReader.Close()
+
+	// Read first tar header to verify it's a valid tar archive
+	headerBuf := make([]byte, 512) // Tar header is 512 bytes
+	n, err = gzReader.Read(headerBuf)
+	if err != nil && err != io.EOF {
+		return fmt.Errorf("failed to read tar header: %w", err)
+	}
+	if n < 512 {
+		return fmt.Errorf("archive too small or corrupted")
+	}
+
+	// Check tar magic ("ustar\0" at offset 257)
+	if len(headerBuf) >= 263 {
+		magic := string(headerBuf[257:262])
+		if magic != "ustar" {
+			s.log.Debug("No tar magic found, but may still be valid tar", "magic", magic)
+			// Don't fail - some tar implementations don't use magic
+		}
+	}
+
+	s.log.Debug("Cluster archive validation passed (stream-based check)")
+	return nil // Valid gzip + tar structure
}

// containsSQLKeywords checks if content contains SQL keywords
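The stream-based check above validates in two steps: gzip magic bytes first, then the first 512-byte tar header. The sketch below condenses the same technique into a self-contained program for experimentation; it mirrors the logic in this diff but is not the dbbackup code itself:

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"os"
)

// quickCheckTarGz validates a .tar.gz in milliseconds without extracting it.
func quickCheckTarGz(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// Step 1: gzip magic bytes (0x1f 0x8b).
	magic := make([]byte, 2)
	if _, err := io.ReadFull(f, magic); err != nil {
		return fmt.Errorf("cannot read file header: %w", err)
	}
	if magic[0] != 0x1f || magic[1] != 0x8b {
		return fmt.Errorf("not a valid gzip file")
	}

	// Step 2: decompress just enough to see the first tar header.
	if _, err := f.Seek(0, io.SeekStart); err != nil {
		return err
	}
	gz, err := gzip.NewReader(f)
	if err != nil {
		return fmt.Errorf("gzip corruption detected: %w", err)
	}
	defer gz.Close()

	header := make([]byte, 512) // tar headers are 512-byte blocks
	if _, err := io.ReadFull(gz, header); err != nil {
		return fmt.Errorf("archive too small or corrupted: %w", err)
	}
	// POSIX tar stores "ustar" at offset 257; some implementations omit it,
	// so treat a mismatch as a warning rather than a failure.
	if string(header[257:262]) != "ustar" {
		fmt.Fprintln(os.Stderr, "warning: no ustar magic; may still be valid tar")
	}
	return nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: quickcheck <archive.tar.gz>")
		os.Exit(2)
	}
	if err := quickCheckTarGz(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("looks like a valid tar.gz")
}
```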
@ -228,6 +257,42 @@ func containsSQLKeywords(content string) bool {
	return false
}

+// ValidateAndExtractCluster performs validation and pre-extraction for cluster restore
+// Returns path to extracted directory (in temp location) to avoid double-extraction
+// Caller must clean up the returned directory with os.RemoveAll() when done
+func (s *Safety) ValidateAndExtractCluster(ctx context.Context, archivePath string) (extractedDir string, err error) {
+	// First validate archive integrity (fast stream check)
+	if err := s.ValidateArchive(archivePath); err != nil {
+		return "", fmt.Errorf("archive validation failed: %w", err)
+	}
+
+	// Create temp directory for extraction in configured WorkDir
+	workDir := s.cfg.GetEffectiveWorkDir()
+	if workDir == "" {
+		workDir = s.cfg.BackupDir
+	}
+
+	tempDir, err := os.MkdirTemp(workDir, "dbbackup-cluster-extract-*")
+	if err != nil {
+		return "", fmt.Errorf("failed to create temp extraction directory in %s: %w", workDir, err)
+	}
+
+	// Extract using tar command (fastest method)
+	s.log.Info("Pre-extracting cluster archive for validation and restore",
+		"archive", archivePath,
+		"dest", tempDir)
+
+	cmd := exec.CommandContext(ctx, "tar", "-xzf", archivePath, "-C", tempDir)
+	output, err := cmd.CombinedOutput()
+	if err != nil {
+		os.RemoveAll(tempDir) // Cleanup on failure
+		return "", fmt.Errorf("extraction failed: %w: %s", err, string(output))
+	}
+
+	s.log.Info("Cluster archive extracted successfully", "location", tempDir)
+	return tempDir, nil
+}
+
// CheckDiskSpace verifies sufficient disk space for restore
// Uses the effective work directory (WorkDir if set, otherwise BackupDir) since
// that's where extraction actually happens for large databases
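Combined with the `RestoreCluster` change earlier in this diff, the intended call sequence is roughly the sketch below. The method names come from the diff; the `restore` package qualifier and the way `safety`/`engine` are obtained are assumptions for illustration:

```go
import (
	"context"
	"fmt"
	"os"
)

// restoreWithPreExtraction extracts the archive once and reuses it for restore.
func restoreWithPreExtraction(ctx context.Context, safety *restore.Safety, engine *restore.Engine, archivePath string) error {
	// One extraction serves both preflight validation and the actual restore.
	extractedDir, err := safety.ValidateAndExtractCluster(ctx, archivePath)
	if err != nil {
		return fmt.Errorf("preflight failed: %w", err)
	}
	// Per the doc comment above, the caller owns cleanup of the directory.
	defer os.RemoveAll(extractedDir)

	// Passing the directory makes RestoreCluster skip its own extraction
	// (and the related disk space checks); omitting it keeps the old behavior.
	return engine.RestoreCluster(ctx, archivePath, extractedDir)
}
```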
@ -13,6 +13,14 @@ import (
	"dbbackup/internal/config"
	"dbbackup/internal/database"
	"dbbackup/internal/logger"
+	"path/filepath"
)

+// Backup phase constants for consistency
+const (
+	backupPhaseGlobals     = 1
+	backupPhaseDatabases   = 2
+	backupPhaseCompressing = 3
+)
+
// BackupExecutionModel handles backup execution with progress
@ -31,27 +39,36 @@ type BackupExecutionModel struct {
	cancelling   bool // True when user has requested cancellation
	err          error
	result       string
+	archivePath  string // Path to created archive (for summary)
+	archiveSize  int64  // Size of created archive (for summary)
	startTime    time.Time
+	elapsed      time.Duration // Final elapsed time
	details      []string
	spinnerFrame int

	// Database count progress (for cluster backup)
	dbTotal      int
	dbDone       int
	dbName       string // Current database being backed up
	overallPhase int    // 1=globals, 2=databases, 3=compressing
	phaseDesc    string // Description of current phase
+	phase2StartTime time.Time     // When phase 2 (databases) started (for realtime ETA)
+	dbPhaseElapsed  time.Duration // Elapsed time since database backup phase started
+	dbAvgPerDB      time.Duration // Average time per database backup
}

// sharedBackupProgressState holds progress state that can be safely accessed from callbacks
type sharedBackupProgressState struct {
	mu           sync.Mutex
	dbTotal      int
	dbDone       int
	dbName       string
	overallPhase int    // 1=globals, 2=databases, 3=compressing
	phaseDesc    string // Description of current phase
	hasUpdate    bool
+	phase2StartTime time.Time     // When phase 2 started (for realtime ETA calculation)
+	dbPhaseElapsed  time.Duration // Elapsed time since database backup phase started
+	dbAvgPerDB      time.Duration // Average time per database backup
}

// Package-level shared progress state for backup operations
@ -72,12 +89,12 @@ func clearCurrentBackupProgress() {
	currentBackupProgressState = nil
}

-func getCurrentBackupProgress() (dbTotal, dbDone int, dbName string, overallPhase int, phaseDesc string, hasUpdate bool) {
+func getCurrentBackupProgress() (dbTotal, dbDone int, dbName string, overallPhase int, phaseDesc string, hasUpdate bool, dbPhaseElapsed, dbAvgPerDB time.Duration, phase2StartTime time.Time) {
	currentBackupProgressMu.Lock()
	defer currentBackupProgressMu.Unlock()

	if currentBackupProgressState == nil {
-		return 0, 0, "", 0, "", false
+		return 0, 0, "", 0, "", false, 0, 0, time.Time{}
	}

	currentBackupProgressState.mu.Lock()
@ -86,9 +103,17 @@ func getCurrentBackupProgress() (dbTotal, dbDone int, dbName string, overallPhas
	hasUpdate = currentBackupProgressState.hasUpdate
	currentBackupProgressState.hasUpdate = false

+	// Calculate realtime phase elapsed if we have a phase 2 start time
+	dbPhaseElapsed = currentBackupProgressState.dbPhaseElapsed
+	if !currentBackupProgressState.phase2StartTime.IsZero() {
+		dbPhaseElapsed = time.Since(currentBackupProgressState.phase2StartTime)
+	}
+
	return currentBackupProgressState.dbTotal, currentBackupProgressState.dbDone,
		currentBackupProgressState.dbName, currentBackupProgressState.overallPhase,
-		currentBackupProgressState.phaseDesc, hasUpdate
+		currentBackupProgressState.phaseDesc, hasUpdate,
+		dbPhaseElapsed, currentBackupProgressState.dbAvgPerDB,
+		currentBackupProgressState.phase2StartTime
}

func NewBackupExecution(cfg *config.Config, log logger.Logger, parent tea.Model, ctx context.Context, backupType, dbName string, ratio int) BackupExecutionModel {
@ -132,17 +157,18 @@ type backupProgressMsg struct {
}

type backupCompleteMsg struct {
	result      string
	err         error
+	archivePath string
+	archiveSize int64
+	elapsed     time.Duration
}

func executeBackupWithTUIProgress(parentCtx context.Context, cfg *config.Config, log logger.Logger, backupType, dbName string, ratio int) tea.Cmd {
	return func() tea.Msg {
-		// NO TIMEOUT for backup operations - a backup takes as long as it takes
-		// Large databases can take many hours
-		// Only manual cancellation (Ctrl+C) should stop the backup
-		ctx, cancel := context.WithCancel(parentCtx)
-		defer cancel()
+		// Use the parent context directly - it's already cancellable from the model
+		// DO NOT create a new context here as it breaks Ctrl+C cancellation
+		ctx := parentCtx

		start := time.Now()
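The fix above depends only on standard context propagation: when the worker shares the model's context instead of wrapping it, the model's `cancel()` is observed directly. A self-contained sketch of that path (illustrative names, not the actual TUI code):

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	// The TUI model owns this context; Ctrl+C/ESC/q invoke cancel().
	parentCtx, cancel := context.WithCancel(context.Background())

	done := make(chan struct{})
	go func() {
		defer close(done)
		// As in executeBackupWithTUIProgress after the fix: ctx := parentCtx,
		// so cancellation from the model reaches the running operation.
		ctx := parentCtx
		<-ctx.Done()
		fmt.Println("operation stopped:", ctx.Err())
	}()

	cancel() // what the model does when the user cancels
	<-done
}
```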
@ -176,9 +202,13 @@ func executeBackupWithTUIProgress(parentCtx context.Context, cfg *config.Config,
			progressState.dbDone = done
			progressState.dbTotal = total
			progressState.dbName = currentDB
-			progressState.overallPhase = 2 // Phase 2: Backing up databases
-			progressState.phaseDesc = fmt.Sprintf("Phase 2/3: Databases (%d/%d)", done, total)
+			progressState.overallPhase = backupPhaseDatabases
+			progressState.phaseDesc = fmt.Sprintf("Phase 2/3: Backing up Databases (%d/%d)", done, total)
			progressState.hasUpdate = true
+			// Set phase 2 start time on first callback (for realtime ETA calculation)
+			if progressState.phase2StartTime.IsZero() {
+				progressState.phase2StartTime = time.Now()
+			}
			progressState.mu.Unlock()
		})
@ -216,8 +246,9 @@ func executeBackupWithTUIProgress(parentCtx context.Context, cfg *config.Config,
		}

		return backupCompleteMsg{
			result:  result,
			err:     nil,
+			elapsed: elapsed,
		}
	}
}
@ -230,13 +261,15 @@ func (m BackupExecutionModel) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
		m.spinnerFrame = (m.spinnerFrame + 1) % len(spinnerFrames)

		// Poll for database progress updates from callbacks
-		dbTotal, dbDone, dbName, overallPhase, phaseDesc, hasUpdate := getCurrentBackupProgress()
+		dbTotal, dbDone, dbName, overallPhase, phaseDesc, hasUpdate, dbPhaseElapsed, dbAvgPerDB, _ := getCurrentBackupProgress()
		if hasUpdate {
			m.dbTotal = dbTotal
			m.dbDone = dbDone
			m.dbName = dbName
			m.overallPhase = overallPhase
			m.phaseDesc = phaseDesc
+			m.dbPhaseElapsed = dbPhaseElapsed
+			m.dbAvgPerDB = dbAvgPerDB
		}

		// Update status based on progress and elapsed time
@ -284,6 +317,7 @@ func (m BackupExecutionModel) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
		m.done = true
		m.err = msg.err
		m.result = msg.result
+		m.elapsed = msg.elapsed
		if m.err == nil {
			m.status = "[OK] Backup completed successfully!"
		} else {
@ -361,14 +395,52 @@ func renderBackupDatabaseProgressBar(done, total int, dbName string, width int)
	return fmt.Sprintf(" Database: [%s] %d/%d", bar, done, total)
}

+// renderBackupDatabaseProgressBarWithTiming renders database backup progress with ETA
+func renderBackupDatabaseProgressBarWithTiming(done, total int, dbPhaseElapsed, dbAvgPerDB time.Duration) string {
+	if total == 0 {
+		return ""
+	}
+
+	// Calculate progress percentage
+	percent := float64(done) / float64(total)
+	if percent > 1.0 {
+		percent = 1.0
+	}
+
+	// Build progress bar
+	barWidth := 50
+	filled := int(float64(barWidth) * percent)
+	if filled > barWidth {
+		filled = barWidth
+	}
+	bar := strings.Repeat("█", filled) + strings.Repeat("░", barWidth-filled)
+
+	// Calculate ETA similar to restore
+	var etaStr string
+	if done > 0 && done < total {
+		avgPerDB := dbPhaseElapsed / time.Duration(done)
+		remaining := total - done
+		eta := avgPerDB * time.Duration(remaining)
+		etaStr = fmt.Sprintf(" | ETA: %s", formatDuration(eta))
+	} else if done == total {
+		etaStr = " | Complete"
+	}
+
+	return fmt.Sprintf(" Databases: [%s] %d/%d | Elapsed: %s%s\n",
+		bar, done, total, formatDuration(dbPhaseElapsed), etaStr)
+}
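To make the ETA arithmetic above concrete: with done = 3 of total = 10 databases and dbPhaseElapsed = 6m, avgPerDB is 6m / 3 = 2m, so the rendered ETA is 2m × 7 remaining = 14m, next to a bar filled to 30%.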
func (m BackupExecutionModel) View() string {
	var s strings.Builder
	s.Grow(512) // Pre-allocate estimated capacity for better performance

	// Clear screen with newlines and render header
	s.WriteString("\n\n")
-	header := titleStyle.Render("[EXEC] Backup Execution")
-	s.WriteString(header)
+	header := "[EXEC] Backing up Database"
+	if m.backupType == "cluster" {
+		header = "[EXEC] Cluster Backup"
+	}
+	s.WriteString(titleStyle.Render(header))
	s.WriteString("\n\n")

	// Backup details - properly aligned
@ -379,7 +451,6 @@ func (m BackupExecutionModel) View() string {
	if m.ratio > 0 {
		s.WriteString(fmt.Sprintf(" %-10s %d\n", "Sample:", m.ratio))
	}
-	s.WriteString(fmt.Sprintf(" %-10s %s\n", "Duration:", time.Since(m.startTime).Round(time.Second)))
	s.WriteString("\n")

	// Status display
||||
|
||||
elapsedSec := int(time.Since(m.startTime).Seconds())
|
||||
|
||||
if m.overallPhase == 2 && m.dbTotal > 0 {
|
||||
if m.overallPhase == backupPhaseDatabases && m.dbTotal > 0 {
|
||||
// Phase 2: Database backups - contributes 15-90%
|
||||
dbPct := int((int64(m.dbDone) * 100) / int64(m.dbTotal))
|
||||
overallProgress = 15 + (dbPct * 75 / 100)
|
||||
phaseLabel = m.phaseDesc
|
||||
} else if m.overallPhase == backupPhaseCompressing {
|
||||
// Phase 3: Compressing archive
|
||||
overallProgress = 92
|
||||
phaseLabel = "Phase 3/3: Compressing Archive"
|
||||
} else if elapsedSec < 5 {
|
||||
// Initial setup
|
||||
overallProgress = 2
|
||||
@ -430,9 +505,9 @@ func (m BackupExecutionModel) View() string {
	}
	s.WriteString("\n")

-	// Database progress bar
-	progressBar := renderBackupDatabaseProgressBar(m.dbDone, m.dbTotal, m.dbName, 50)
-	s.WriteString(progressBar + "\n")
+	// Database progress bar with timing
+	s.WriteString(renderBackupDatabaseProgressBarWithTiming(m.dbDone, m.dbTotal, m.dbPhaseElapsed, m.dbAvgPerDB))
+	s.WriteString("\n")
} else {
	// Intermediate phase (globals)
	spinner := spinnerFrames[m.spinnerFrame]
@ -449,7 +524,10 @@ func (m BackupExecutionModel) View() string {
	}

	if !m.cancelling {
-		s.WriteString("\n [KEY] Press Ctrl+C or ESC to cancel\n")
+		// Elapsed time
+		s.WriteString(fmt.Sprintf("Elapsed: %s\n", formatDuration(time.Since(m.startTime))))
+		s.WriteString("\n")
+		s.WriteString(infoStyle.Render("[KEYS] Press Ctrl+C or ESC to cancel"))
	}
} else {
	// Show completion summary with detailed stats
@ -474,6 +552,14 @@ func (m BackupExecutionModel) View() string {
	s.WriteString(infoStyle.Render(" ─── Summary ───────────────────────────────────────────────"))
	s.WriteString("\n\n")

+	// Archive info (if available)
+	if m.archivePath != "" {
+		s.WriteString(fmt.Sprintf(" Archive: %s\n", filepath.Base(m.archivePath)))
+	}
+	if m.archiveSize > 0 {
+		s.WriteString(fmt.Sprintf(" Archive Size: %s\n", FormatBytes(m.archiveSize)))
+	}
+
	// Backup type specific info
	switch m.backupType {
	case "cluster":
@ -497,12 +583,21 @@ func (m BackupExecutionModel) View() string {
	s.WriteString(infoStyle.Render(" ─── Timing ────────────────────────────────────────────────"))
	s.WriteString("\n\n")

-	elapsed := time.Since(m.startTime)
-	s.WriteString(fmt.Sprintf(" Total Time: %s\n", formatBackupDuration(elapsed)))
+	elapsed := m.elapsed
+	if elapsed == 0 {
+		elapsed = time.Since(m.startTime)
+	}
+	s.WriteString(fmt.Sprintf(" Total Time: %s\n", formatDuration(elapsed)))
+
+	// Calculate and show throughput if we have size info
+	if m.archiveSize > 0 && elapsed.Seconds() > 0 {
+		throughput := float64(m.archiveSize) / elapsed.Seconds()
+		s.WriteString(fmt.Sprintf(" Throughput: %s/s (average)\n", FormatBytes(int64(throughput))))
+	}

	if m.backupType == "cluster" && m.dbTotal > 0 && m.err == nil {
		avgPerDB := elapsed / time.Duration(m.dbTotal)
-		s.WriteString(fmt.Sprintf(" Avg per DB: %s\n", formatBackupDuration(avgPerDB)))
+		s.WriteString(fmt.Sprintf(" Avg per DB: %s\n", formatDuration(avgPerDB)))
	}

	s.WriteString("\n")
@ -513,18 +608,3 @@ func (m BackupExecutionModel) View() string {

	return s.String()
}
-
-// formatBackupDuration formats duration in human readable format
-func formatBackupDuration(d time.Duration) string {
-	if d < time.Minute {
-		return fmt.Sprintf("%.1fs", d.Seconds())
-	}
-	if d < time.Hour {
-		minutes := int(d.Minutes())
-		seconds := int(d.Seconds()) % 60
-		return fmt.Sprintf("%dm %ds", minutes, seconds)
-	}
-	hours := int(d.Hours())
-	minutes := int(d.Minutes()) % 60
-	return fmt.Sprintf("%dh %dm", hours, minutes)
-}
@ -152,8 +152,9 @@ type sharedProgressState struct {
	currentDB string

	// Timing info for database restore phase
	dbPhaseElapsed  time.Duration // Elapsed time since restore phase started
	dbAvgPerDB      time.Duration // Average time per database restore
+	phase3StartTime time.Time     // When phase 3 started (for realtime ETA calculation)

	// Overall phase tracking (1=Extract, 2=Globals, 3=Databases)
	overallPhase int
@ -190,12 +191,12 @@ func clearCurrentRestoreProgress() {
	currentRestoreProgressState = nil
}

-func getCurrentRestoreProgress() (bytesTotal, bytesDone int64, description string, hasUpdate bool, dbTotal, dbDone int, speed float64, dbPhaseElapsed, dbAvgPerDB time.Duration, currentDB string, overallPhase int, extractionDone bool, dbBytesTotal, dbBytesDone int64) {
+func getCurrentRestoreProgress() (bytesTotal, bytesDone int64, description string, hasUpdate bool, dbTotal, dbDone int, speed float64, dbPhaseElapsed, dbAvgPerDB time.Duration, currentDB string, overallPhase int, extractionDone bool, dbBytesTotal, dbBytesDone int64, phase3StartTime time.Time) {
	currentRestoreProgressMu.Lock()
	defer currentRestoreProgressMu.Unlock()

	if currentRestoreProgressState == nil {
-		return 0, 0, "", false, 0, 0, 0, 0, 0, "", 0, false, 0, 0
+		return 0, 0, "", false, 0, 0, 0, 0, 0, "", 0, false, 0, 0, time.Time{}
	}

	currentRestoreProgressState.mu.Lock()
@ -204,13 +205,20 @@ func getCurrentRestoreProgress() (bytesTotal, bytesDone int64, description strin
	// Calculate rolling window speed
	speed = calculateRollingSpeed(currentRestoreProgressState.speedSamples)

+	// Calculate realtime phase elapsed if we have a phase 3 start time
+	dbPhaseElapsed = currentRestoreProgressState.dbPhaseElapsed
+	if !currentRestoreProgressState.phase3StartTime.IsZero() {
+		dbPhaseElapsed = time.Since(currentRestoreProgressState.phase3StartTime)
+	}
+
	return currentRestoreProgressState.bytesTotal, currentRestoreProgressState.bytesDone,
		currentRestoreProgressState.description, currentRestoreProgressState.hasUpdate,
		currentRestoreProgressState.dbTotal, currentRestoreProgressState.dbDone, speed,
-		currentRestoreProgressState.dbPhaseElapsed, currentRestoreProgressState.dbAvgPerDB,
+		dbPhaseElapsed, currentRestoreProgressState.dbAvgPerDB,
		currentRestoreProgressState.currentDB, currentRestoreProgressState.overallPhase,
		currentRestoreProgressState.extractionDone,
-		currentRestoreProgressState.dbBytesTotal, currentRestoreProgressState.dbBytesDone
+		currentRestoreProgressState.dbBytesTotal, currentRestoreProgressState.dbBytesDone,
+		currentRestoreProgressState.phase3StartTime
}

// calculateRollingSpeed calculates speed from recent samples (last 5 seconds)
@ -253,11 +261,9 @@ type restoreProgressChannel chan restoreProgressMsg

func executeRestoreWithTUIProgress(parentCtx context.Context, cfg *config.Config, log logger.Logger, archive ArchiveInfo, targetDB string, cleanFirst, createIfMissing bool, restoreType string, cleanClusterFirst bool, existingDBs []string, saveDebugLog bool) tea.Cmd {
	return func() tea.Msg {
-		// NO TIMEOUT for restore operations - a restore takes as long as it takes
-		// Large databases with large objects can take many hours
-		// Only manual cancellation (Ctrl+C) should stop the restore
-		ctx, cancel := context.WithCancel(parentCtx)
-		defer cancel()
+		// Use the parent context directly - it's already cancellable from the model
+		// DO NOT create a new context here as it breaks Ctrl+C cancellation
+		ctx := parentCtx

		start := time.Now()
@ -357,6 +363,10 @@ func executeRestoreWithTUIProgress(parentCtx context.Context, cfg *config.Config
			progressState.overallPhase = 3
			progressState.extractionDone = true
			progressState.hasUpdate = true
+			// Set phase 3 start time on first callback (for realtime ETA calculation)
+			if progressState.phase3StartTime.IsZero() {
+				progressState.phase3StartTime = time.Now()
+			}
			// Clear byte progress when switching to db progress
			progressState.bytesTotal = 0
			progressState.bytesDone = 0
@ -375,6 +385,10 @@ func executeRestoreWithTUIProgress(parentCtx context.Context, cfg *config.Config
			progressState.dbPhaseElapsed = phaseElapsed
			progressState.dbAvgPerDB = avgPerDB
			progressState.hasUpdate = true
+			// Set phase 3 start time on first callback (for realtime ETA calculation)
+			if progressState.phase3StartTime.IsZero() {
+				progressState.phase3StartTime = time.Now()
+			}
			// Clear byte progress when switching to db progress
			progressState.bytesTotal = 0
			progressState.bytesDone = 0
@ -392,6 +406,10 @@ func executeRestoreWithTUIProgress(parentCtx context.Context, cfg *config.Config
			progressState.overallPhase = 3
			progressState.extractionDone = true
			progressState.hasUpdate = true
+			// Set phase 3 start time on first callback (for realtime ETA calculation)
+			if progressState.phase3StartTime.IsZero() {
+				progressState.phase3StartTime = time.Now()
+			}
		})

	// Store progress state in a package-level variable for the ticker to access
@ -447,7 +465,8 @@ func (m RestoreExecutionModel) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
	m.elapsed = time.Since(m.startTime)

	// Poll shared progress state for real-time updates
-	bytesTotal, bytesDone, description, hasUpdate, dbTotal, dbDone, speed, dbPhaseElapsed, dbAvgPerDB, currentDB, overallPhase, extractionDone, dbBytesTotal, dbBytesDone := getCurrentRestoreProgress()
+	// Note: dbPhaseElapsed is now calculated in realtime inside getCurrentRestoreProgress()
+	bytesTotal, bytesDone, description, hasUpdate, dbTotal, dbDone, speed, dbPhaseElapsed, dbAvgPerDB, currentDB, overallPhase, extractionDone, dbBytesTotal, dbBytesDone, _ := getCurrentRestoreProgress()
	if hasUpdate && bytesTotal > 0 && !extractionDone {
		// Phase 1: Extraction
		m.bytesTotal = bytesTotal
@ -1150,12 +1169,12 @@ func formatRestoreError(errStr string) string {

	// Provide specific recommendations based on error
	if strings.Contains(errStr, "out of shared memory") || strings.Contains(errStr, "max_locks_per_transaction") {
-		s.WriteString(errorStyle.Render(" • Cannot access file: stat : no such file or directory\n"))
+		s.WriteString(errorStyle.Render(" • PostgreSQL lock table exhausted\n"))
		s.WriteString("\n")
		s.WriteString(infoStyle.Render(" ─── [HINT] Recommendations ────────────────────────────────"))
		s.WriteString("\n\n")
-		s.WriteString(" Lock table exhausted. Total capacity = max_locks_per_transaction\n")
-		s.WriteString(" × (max_connections + max_prepared_transactions).\n\n")
+		s.WriteString(" Lock capacity = max_locks_per_transaction\n")
+		s.WriteString(" × (max_connections + max_prepared_transactions)\n\n")
		s.WriteString(" If you reduced VM size or max_connections, you need higher\n")
		s.WriteString(" max_locks_per_transaction to compensate.\n\n")
		s.WriteString(successStyle.Render(" FIX OPTIONS:\n"))
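As a worked example of that formula: with PostgreSQL's defaults (max_locks_per_transaction = 64, max_connections = 100, max_prepared_transactions = 0), the lock table holds 64 × (100 + 0) = 6,400 lockable objects. Raising max_locks_per_transaction to 256 quadruples that capacity without adding connections.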