Database Migrations in Production: Lessons Learned


After managing over 50 production database migrations—from small schema changes to complete database engine switches affecting millions of users—I've learned that the difference between a smooth migration and a career-defining disaster often comes down to preparation and having the right rollback strategy.

In this guide, I'll share the hard-won lessons from migrations that saved companies millions, and the mistakes that taught me why you should never underestimate the complexity of production data.

The $2 Million Migration Mistake

Let me start with the migration that changed how I approach database changes forever. It was a "simple" column addition to a user table with 50 million records. The migration looked innocent enough:

ALTER TABLE users ADD COLUMN preferences JSONB NOT NULL DEFAULT '{}'::jsonb;

What we didn't account for was that, on the PostgreSQL version we were running at the time, adding a column with a default meant rewriting all 50 million rows while holding an ACCESS EXCLUSIVE lock on the table. For 47 minutes, our entire application was down while the rewrite completed. The business impact: $2.1 million in lost revenue, thousands of angry customers, and a very uncomfortable all-hands meeting.

This taught me the first rule of production migrations: there are no simple migrations at scale.
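
Even the "safe" form of that change (a nullable column with no default) still needs a brief ACCESS EXCLUSIVE lock, and that lock request can queue behind a long-running transaction, stalling every query that arrives after it. A defensive pattern worth wrapping around any locking DDL, sketched below with illustrative timeout values: set lock_timeout so the statement fails fast and can be retried instead of silently blocking traffic.

-- Fail fast instead of queuing behind long-running transactions and stalling
-- every query that arrives after the ALTER. Timeout values are illustrative.
BEGIN;
SET LOCAL lock_timeout = '2s';
SET LOCAL statement_timeout = '30s';
ALTER TABLE users ADD COLUMN preferences JSONB;
COMMIT;
-- If it times out, wait and retry rather than letting it sit in the lock queue.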

The Zero-Downtime Migration Framework

Since that disaster, I've developed a framework that has successfully handled migrations for tables with billions of rows without a single second of downtime. Here's the approach:

Phase 1: Shadow Table Strategy

Instead of altering tables directly, create shadow tables and migrate data incrementally:

-- Create the new table structure
CREATE TABLE users_new (
    id BIGSERIAL PRIMARY KEY,
    email VARCHAR(255) NOT NULL UNIQUE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    preferences JSONB DEFAULT '{}'::jsonb,
    -- Copy all existing columns
    name VARCHAR(255),
    status VARCHAR(50) DEFAULT 'active'
);
 
-- Copy existing indexes (adjust for new structure)
-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
-- so run these as separate statements rather than in one wrapped migration
CREATE INDEX CONCURRENTLY idx_users_new_email ON users_new(email);
CREATE INDEX CONCURRENTLY idx_users_new_status ON users_new(status);
CREATE INDEX CONCURRENTLY idx_users_new_created_at ON users_new(created_at);
 
-- Add constraints after data migration
-- ALTER TABLE users_new ADD CONSTRAINT users_new_email_check CHECK (email ~ '^[^@]+@[^@]+\.[^@]+$');
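
If the real table has dozens of columns, restating them by hand invites drift. A hedged alternative sketch using LIKE, so the column list always matches the source table:

-- Alternative: start from the existing definition instead of restating it.
-- INCLUDING CONSTRAINTS copies CHECK and NOT NULL constraints; primary keys,
-- unique constraints, and indexes still have to be added separately (as above).
CREATE TABLE users_new (LIKE users INCLUDING DEFAULTS INCLUDING CONSTRAINTS);
ALTER TABLE users_new ADD COLUMN preferences JSONB DEFAULT '{}'::jsonb;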

Phase 2: Dual-Write Implementation

Implement dual-write logic in your application to write to both tables:

// services/userService.ts
import { DatabaseClient } from './database';
 
interface UserData {
  email: string;
  name: string;
  preferences?: Record<string, any>;
}

// Row shape returned from the users table (assumed for this example)
interface User extends UserData {
  id: number;
  created_at: Date;
  updated_at: Date;
}
 
class UserService {
  private db: DatabaseClient;
  private isDualWriteEnabled: boolean;
 
  constructor(db: DatabaseClient) {
    this.db = db;
    this.isDualWriteEnabled = process.env.MIGRATION_DUAL_WRITE === 'true';
  }
 
  async createUser(userData: UserData): Promise<User> {
    const transaction = await this.db.beginTransaction();
    
    try {
      // Write to old table (primary)
      const user = await transaction.query(`
        INSERT INTO users (email, name, created_at, updated_at)
        VALUES ($1, $2, NOW(), NOW())
        RETURNING *
      `, [userData.email, userData.name]);
 
      // If dual-write is enabled, also write to new table
      if (this.isDualWriteEnabled) {
        await transaction.query(`
          INSERT INTO users_new (id, email, name, preferences, created_at, updated_at)
          VALUES ($1, $2, $3, $4, $5, $6)
          ON CONFLICT (id) DO UPDATE SET
            email = EXCLUDED.email,
            name = EXCLUDED.name,
            preferences = EXCLUDED.preferences,
            updated_at = EXCLUDED.updated_at
        `, [
          user[0].id,
          userData.email,
          userData.name,
          userData.preferences || {},
          user[0].created_at,
          user[0].updated_at
        ]);
      }
 
      await transaction.commit();
      return user[0];
    } catch (error) {
      await transaction.rollback();
      throw error;
    }
  }
 
  async updateUser(id: number, updates: Partial<UserData>): Promise<User> {
    const transaction = await this.db.beginTransaction();
    
    try {
      // Update old table
      const updateFields: string[] = [];
      const values: any[] = [id];
      let paramIndex = 2;
 
      Object.entries(updates).forEach(([key, value]) => {
        if (key !== 'preferences') { // Handle preferences separately for old table
          updateFields.push(`${key} = $${paramIndex}`);
          values.push(value);
          paramIndex++;
        }
      });
 
      updateFields.push(`updated_at = NOW()`);
 
      const oldTableResult = await transaction.query(`
        UPDATE users 
        SET ${updateFields.join(', ')}
        WHERE id = $1
        RETURNING *
      `, values);
 
      // Update new table if dual-write is enabled
      if (this.isDualWriteEnabled) {
        const newUpdateFields: string[] = [];
        const newValues: any[] = [id];
        let newParamIndex = 2;
 
        Object.entries(updates).forEach(([key, value]) => {
          newUpdateFields.push(`${key} = $${newParamIndex}`);
          newValues.push(value);
          newParamIndex++;
        });
 
        newUpdateFields.push(`updated_at = NOW()`);
 
        await transaction.query(`
          UPDATE users_new 
          SET ${newUpdateFields.join(', ')}
          WHERE id = $1
        `, newValues);
      }
 
      await transaction.commit();
      return oldTableResult[0];
    } catch (error) {
      await transaction.rollback();
      throw error;
    }
  }
}
 
export default UserService;
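
Application-level dual writes require touching every code path that writes to users. Where that's impractical, a database-level trigger can mirror writes into the shadow table instead; a minimal sketch (the function and trigger names are mine, and a real version would also handle DELETEs):

-- Mirror INSERTs and UPDATEs on users into users_new at the database layer
-- instead of in application code. EXECUTE FUNCTION requires PostgreSQL 11+;
-- use EXECUTE PROCEDURE on older versions.
CREATE OR REPLACE FUNCTION mirror_users_write() RETURNS trigger AS $$
BEGIN
    INSERT INTO users_new (id, email, name, status, created_at, updated_at)
    VALUES (NEW.id, NEW.email, NEW.name, NEW.status, NEW.created_at, NEW.updated_at)
    ON CONFLICT (id) DO UPDATE SET
        email = EXCLUDED.email,
        name = EXCLUDED.name,
        status = EXCLUDED.status,
        updated_at = EXCLUDED.updated_at;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_mirror_write
AFTER INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION mirror_users_write();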

Phase 3: Background Data Migration

Create a robust background migration process:

// scripts/migrate-users-data.ts
import { DatabaseClient } from '../services/database';
import { createLogger } from '../utils/logger';
 
const logger = createLogger('user-migration');
 
interface MigrationConfig {
  batchSize: number;
  delayMs: number;
  maxRetries: number;
  validateChecksum: boolean;
}
 
export class UserDataMigration {
  private db: DatabaseClient;
  private config: MigrationConfig;
  private stats = {
    processed: 0,
    migrated: 0,
    skipped: 0,
    failed: 0,
    startTime: Date.now()
  };
 
  constructor(db: DatabaseClient, config: MigrationConfig) {
    this.db = db;
    this.config = config;
  }
 
  async migrate(): Promise<void> {
    logger.info('Starting user data migration', { config: this.config });
 
    try {
      await this.createMigrationTrackingTable();
      const totalRecords = await this.getTotalRecordCount();
      logger.info(`Total records to migrate: ${totalRecords}`);
 
      let lastId = await this.getLastMigratedId();
      
      while (true) {
        const batch = await this.getBatch(lastId);
        if (batch.length === 0) {
          break;
        }
 
        await this.migrateBatch(batch);
        lastId = batch[batch.length - 1].id;
 
        // Progress reporting
        this.stats.processed += batch.length;
        if (this.stats.processed % 10000 === 0) {
          await this.reportProgress(totalRecords);
        }
 
        // Throttle to avoid overwhelming the database
        if (this.config.delayMs > 0) {
          await this.sleep(this.config.delayMs);
        }
      }
 
      await this.validateMigration();
      logger.info('Migration completed successfully', this.stats);
    } catch (error) {
      logger.error('Migration failed', { error: error.message, stats: this.stats });
      throw error;
    }
  }
 
  private async createMigrationTrackingTable(): Promise<void> {
    await this.db.query(`
      CREATE TABLE IF NOT EXISTS migration_progress (
        migration_name VARCHAR(255) PRIMARY KEY,
        last_migrated_id BIGINT DEFAULT 0,
        created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
        updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
      )
    `);
 
    await this.db.query(`
      INSERT INTO migration_progress (migration_name, last_migrated_id)
      VALUES ('users_to_users_new', 0)
      ON CONFLICT (migration_name) DO NOTHING
    `);
  }
 
  private async getLastMigratedId(): Promise<number> {
    const result = await this.db.query(`
      SELECT last_migrated_id 
      FROM migration_progress 
      WHERE migration_name = 'users_to_users_new'
    `);
    return result[0]?.last_migrated_id || 0;
  }
 
  private async getBatch(lastId: number): Promise<any[]> {
    const result = await this.db.query(`
      SELECT id, email, name, created_at, updated_at, status
      FROM users 
      WHERE id > $1 
      ORDER BY id 
      LIMIT $2
    `, [lastId, this.config.batchSize]);
 
    return result;
  }
 
  private async migrateBatch(batch: any[]): Promise<void> {
    const transaction = await this.db.beginTransaction();
    
    try {
      for (const user of batch) {
        await this.migrateUser(transaction, user);
      }
 
      // Update progress tracking
      const lastId = batch[batch.length - 1].id;
      await transaction.query(`
        UPDATE migration_progress 
        SET last_migrated_id = $1, updated_at = NOW() 
        WHERE migration_name = 'users_to_users_new'
      `, [lastId]);
 
      await transaction.commit();
      this.stats.migrated += batch.length;
    } catch (error) {
      await transaction.rollback();
      this.stats.failed += batch.length;
      
      // Retry individual records on batch failure
      if (this.config.maxRetries > 0) {
        await this.retryFailedRecords(batch);
      } else {
        throw error;
      }
    }
  }
 
  private async migrateUser(transaction: any, user: any): Promise<void> {
    // Check if user already exists in new table
    const existing = await transaction.query(`
      SELECT id FROM users_new WHERE id = $1
    `, [user.id]);
 
    if (existing.length > 0) {
      this.stats.skipped++;
      return;
    }
 
    // Migrate user with default preferences
    await transaction.query(`
      INSERT INTO users_new (id, email, name, created_at, updated_at, status, preferences)
      VALUES ($1, $2, $3, $4, $5, $6, $7)
    `, [
      user.id,
      user.email,
      user.name,
      user.created_at,
      user.updated_at,
      user.status || 'active',
      '{}' // Default empty preferences
    ]);
  }
 
  private async retryFailedRecords(batch: any[]): Promise<void> {
    logger.warn(`Retrying ${batch.length} failed records individually`);
    
    for (const user of batch) {
      const transaction = await this.db.beginTransaction();
      try {
        await this.migrateUser(transaction, user);
        await transaction.commit();
        this.stats.migrated++;
        this.stats.failed--;
      } catch (error) {
        // Roll back the per-record transaction so the connection isn't left open
        await transaction.rollback();
        logger.error(`Failed to migrate user ${user.id}`, { error: error.message });
      }
    }
  }
 
  private async validateMigration(): Promise<void> {
    if (!this.config.validateChecksum) {
      return;
    }
 
    logger.info('Validating migration with checksum comparison...');
 
    // Compare record counts
    const [oldCount] = await this.db.query('SELECT COUNT(*) as count FROM users');
    const [newCount] = await this.db.query('SELECT COUNT(*) as count FROM users_new');
 
    if (oldCount.count !== newCount.count) {
      throw new Error(`Record count mismatch: old=${oldCount.count}, new=${newCount.count}`);
    }
 
    // Sample validation - compare random records
    const sampleSize = Math.min(1000, oldCount.count);
    const randomUsers = await this.db.query(`
      SELECT * FROM users 
      ORDER BY RANDOM() 
      LIMIT $1
    `, [sampleSize]);
 
    for (const user of randomUsers) {
      const migratedUser = await this.db.query(`
        SELECT * FROM users_new WHERE id = $1
      `, [user.id]);
 
      if (migratedUser.length === 0) {
        throw new Error(`User ${user.id} not found in migrated table`);
      }
 
      // Validate key fields
      const migrated = migratedUser[0];
      if (user.email !== migrated.email || user.name !== migrated.name) {
        throw new Error(`Data mismatch for user ${user.id}`);
      }
    }
 
    logger.info(`Validation complete: ${sampleSize} records validated`);
  }
 
  private async getTotalRecordCount(): Promise<number> {
    const result = await this.db.query('SELECT COUNT(*) as count FROM users');
    return result[0].count;
  }
 
  private async reportProgress(total: number): Promise<void> {
    const percentage = ((this.stats.processed / total) * 100).toFixed(2);
    const elapsed = Date.now() - this.stats.startTime;
    const rate = this.stats.processed / (elapsed / 1000);
 
    logger.info('Migration progress', {
      processed: this.stats.processed,
      migrated: this.stats.migrated,
      failed: this.stats.failed,
      skipped: this.stats.skipped,
      percentage,
      recordsPerSecond: Math.round(rate),
      elapsedMs: elapsed
    });
  }
 
  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}
 
// CLI script
async function main() {
  const db = new DatabaseClient();
  
  const config: MigrationConfig = {
    batchSize: parseInt(process.env.MIGRATION_BATCH_SIZE || '1000'),
    delayMs: parseInt(process.env.MIGRATION_DELAY_MS || '100'),
    maxRetries: parseInt(process.env.MIGRATION_MAX_RETRIES || '3'),
    validateChecksum: process.env.MIGRATION_VALIDATE === 'true'
  };
 
  const migration = new UserDataMigration(db, config);
  await migration.migrate();
}
 
if (require.main === module) {
  main().catch(console.error);
}

Phase 4: Validation and Cutover

Create comprehensive validation before switching over:

// scripts/validate-migration.ts
import { DatabaseClient } from '../services/database';
import { createLogger } from '../utils/logger';
 
const logger = createLogger('migration-validator');
 
export class MigrationValidator {
  private db: DatabaseClient;
 
  constructor(db: DatabaseClient) {
    this.db = db;
  }
 
  async validateComplete(): Promise<boolean> {
    logger.info('Starting comprehensive migration validation...');
 
    try {
      await this.validateRecordCounts();
      await this.validateDataIntegrity();
      await this.validateConstraints();
      await this.validateIndexes();
      await this.validatePerformance();
 
      logger.info('All validation checks passed');
      return true;
    } catch (error) {
      logger.error('Validation failed', { error: error.message });
      return false;
    }
  }
 
  private async validateRecordCounts(): Promise<void> {
    const [oldCount] = await this.db.query('SELECT COUNT(*) as count FROM users');
    const [newCount] = await this.db.query('SELECT COUNT(*) as count FROM users_new');
 
    if (oldCount.count !== newCount.count) {
      throw new Error(`Record count mismatch: old=${oldCount.count}, new=${newCount.count}`);
    }
 
    logger.info(`Record count validation passed: ${oldCount.count} records`);
  }
 
  private async validateDataIntegrity(): Promise<void> {
    // Check for missing records
    const missingRecords = await this.db.query(`
      SELECT COUNT(*) as count 
      FROM users u
      LEFT JOIN users_new un ON u.id = un.id
      WHERE un.id IS NULL
    `);
 
    if (missingRecords[0].count > 0) {
      throw new Error(`${missingRecords[0].count} records missing from new table`);
    }
 
    // Sample data comparison
    const sampleResults = await this.db.query(`
      SELECT 
        COUNT(*) as total,
        COUNT(CASE WHEN u.email = un.email THEN 1 END) as email_match,
        COUNT(CASE WHEN u.name = un.name THEN 1 END) as name_match,
        COUNT(CASE WHEN u.status = un.status THEN 1 END) as status_match
      FROM users u
      JOIN users_new un ON u.id = un.id
      WHERE u.id % 100 = 0  -- Sample every 100th record
    `);
 
    const sample = sampleResults[0];
    if (sample.email_match !== sample.total || 
        sample.name_match !== sample.total || 
        sample.status_match !== sample.total) {
      throw new Error(`Data integrity issues found in sample validation`);
    }
 
    logger.info(`Data integrity validation passed: ${sample.total} sample records checked`);
  }
 
  private async validateConstraints(): Promise<void> {
    // Check unique constraints
    const duplicateEmails = await this.db.query(`
      SELECT email, COUNT(*) as count
      FROM users_new 
      GROUP BY email
      HAVING COUNT(*) > 1
    `);
 
    if (duplicateEmails.length > 0) {
      throw new Error(`Duplicate emails found: ${duplicateEmails.length} violations`);
    }
 
    // Check NOT NULL constraints
    const nullViolations = await this.db.query(`
      SELECT COUNT(*) as count 
      FROM users_new 
      WHERE email IS NULL OR name IS NULL
    `);
 
    if (nullViolations[0].count > 0) {
      throw new Error(`NULL constraint violations: ${nullViolations[0].count} records`);
    }
 
    logger.info('Constraint validation passed');
  }
 
  private async validateIndexes(): Promise<void> {
    // Verify all expected indexes exist
    const expectedIndexes = [
      'idx_users_new_email',
      'idx_users_new_status', 
      'idx_users_new_created_at'
    ];
 
    for (const indexName of expectedIndexes) {
      const indexExists = await this.db.query(`
        SELECT 1 FROM pg_indexes 
        WHERE tablename = 'users_new' AND indexname = $1
      `, [indexName]);
 
      if (indexExists.length === 0) {
        throw new Error(`Index ${indexName} is missing`);
      }
    }
 
    logger.info(`Index validation passed: ${expectedIndexes.length} indexes verified`);
  }
 
  private async validatePerformance(): Promise<void> {
    // Run sample queries and compare performance
    const testQueries = [
      "SELECT * FROM users_new WHERE email = 'test@example.com'",
      "SELECT COUNT(*) FROM users_new WHERE status = 'active'",
      "SELECT * FROM users_new WHERE created_at > NOW() - INTERVAL '1 day'"
    ];
 
    for (const query of testQueries) {
      const startTime = Date.now();
      await this.db.query(query);
      const newTableTime = Date.now() - startTime;

      const startTimeOld = Date.now();
      await this.db.query(query.replace('users_new', 'users'));
      const oldTableTime = Date.now() - startTimeOld;
 
      // New table should perform within 2x of old table
      if (newTableTime > oldTableTime * 2) {
        logger.warn(`Performance regression detected for query: ${query}`, {
          oldTime: oldTableTime,
          newTime: newTableTime
        });
      }
    }
 
    logger.info('Performance validation completed');
  }
}
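
The validator covers the "validation" half of this phase; the cutover itself can be as small as two renames in one transaction. A minimal sketch (a real cutover also has to deal with sequences, foreign keys, views, and grants that reference the old table):

-- Minimal cutover sketch: swap the tables atomically once validation passes
-- and dual writes have been running long enough to trust users_new.
BEGIN;
SET LOCAL lock_timeout = '5s';  -- fail fast rather than queue behind other locks
ALTER TABLE users RENAME TO users_old;
ALTER TABLE users_new RENAME TO users;
COMMIT;
-- Keep users_old until the post-migration validation period ends, then drop it.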

Advanced Migration Patterns

Rolling Schema Updates

For applications that can't afford downtime, use backwards-compatible schema changes:

-- Phase 1: Add new column as nullable
ALTER TABLE users ADD COLUMN preferences JSONB;
 
-- Phase 2: Update application to write to both old and new format
-- (Application deployment)
 
-- Phase 3: Backfill existing data
-- (on large tables, run this in batches; see the sketch after this block)
UPDATE users 
SET preferences = '{}'::jsonb 
WHERE preferences IS NULL;
 
-- Phase 4: Add NOT NULL constraint once the backfill is complete
-- (SET NOT NULL scans the whole table under an exclusive lock; on PostgreSQL 12+
--  you can add a CHECK (preferences IS NOT NULL) NOT VALID constraint, VALIDATE
--  it, and then SET NOT NULL to keep the lock window short)
ALTER TABLE users ALTER COLUMN preferences SET NOT NULL;
 
-- Phase 5: Remove old columns (if any) after confirming new system works
-- ALTER TABLE users DROP COLUMN old_column;
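
As noted above, backfilling a large table in a single statement keeps one transaction open for the entire run and generates a lot of bloat. A batched sketch, assuming id is the primary key (the batch size is illustrative):

-- Batched backfill: each statement touches a bounded number of rows, commits
-- quickly, and lets autovacuum keep up. Repeat until it updates 0 rows.
UPDATE users
SET preferences = '{}'::jsonb
WHERE id IN (
    SELECT id
    FROM users
    WHERE preferences IS NULL
    ORDER BY id
    LIMIT 10000
);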

Blue-Green Database Deployment

For major version upgrades or engine changes:

// services/database-switcher.ts
import { DatabaseClient } from './database';
 
class DatabaseSwitcher {
  private blueDb: DatabaseClient;
  private greenDb: DatabaseClient;
  private currentColor: 'blue' | 'green';
 
  constructor() {
    this.blueDb = new DatabaseClient(process.env.BLUE_DATABASE_URL);
    this.greenDb = new DatabaseClient(process.env.GREEN_DATABASE_URL);
    this.currentColor = process.env.ACTIVE_DATABASE as 'blue' | 'green' || 'blue';
  }
 
  getCurrentDatabase(): DatabaseClient {
    return this.currentColor === 'blue' ? this.blueDb : this.greenDb;
  }
 
  getInactiveDatabase(): DatabaseClient {
    return this.currentColor === 'blue' ? this.greenDb : this.blueDb;
  }
 
  async switchToInactive(): Promise<void> {
    // Pre-switch validation
    await this.validateInactiveDatabase();
    
    // Switch traffic
    this.currentColor = this.currentColor === 'blue' ? 'green' : 'blue';
    
    // Record the new active color (this only affects the current process;
    // a real deployment would flip a shared config value or feature flag)
    process.env.ACTIVE_DATABASE = this.currentColor;
    
    // Post-switch validation
    await this.validateCurrentDatabase();
  }
 
  async validateInactiveDatabase(): Promise<void> {
    const inactiveDb = this.getInactiveDatabase();
    
    // Check connectivity
    await inactiveDb.query('SELECT 1');
    
    // Validate data freshness (should be recent)
    const result = await inactiveDb.query(`
      SELECT MAX(updated_at) as last_update FROM users
    `);
    
    const lastUpdate = new Date(result[0].last_update);
    const now = new Date();
    const staleness = now.getTime() - lastUpdate.getTime();
    
    if (staleness > 5 * 60 * 1000) { // 5 minutes
      throw new Error(`Inactive database is stale by ${staleness}ms`);
    }
  }
 
  async validateCurrentDatabase(): Promise<void> {
    const currentDb = this.getCurrentDatabase();
    
    // Run health checks
    await currentDb.query('SELECT COUNT(*) FROM users');
    
    // Could add application-level health checks here
  }
}
 
export default DatabaseSwitcher;

Monitoring and Alerting

Set up comprehensive monitoring during migrations:

// utils/migration-monitor.ts
import { DatabaseClient } from '../services/database';
import { MetricsClient } from './metrics';
 
class MigrationMonitor {
  private db: DatabaseClient;
  private metrics: MetricsClient;
  private monitoringInterval: NodeJS.Timeout | null = null;
 
  constructor(db: DatabaseClient, metrics: MetricsClient) {
    this.db = db;
    this.metrics = metrics;
  }
 
  startMonitoring(): void {
    this.monitoringInterval = setInterval(async () => {
      try {
        await this.collectMetrics();
      } catch (error) {
        console.error('Monitoring error:', error);
      }
    }, 30000); // Every 30 seconds
  }
 
  stopMonitoring(): void {
    if (this.monitoringInterval) {
      clearInterval(this.monitoringInterval);
      this.monitoringInterval = null;
    }
  }
 
  private async collectMetrics(): Promise<void> {
    // Database performance metrics
    const performance = await this.db.query(`
      SELECT 
        schemaname,
        relname AS tablename,
        n_tup_ins as inserts,
        n_tup_upd as updates,
        n_tup_del as deletes,
        seq_scan as sequential_scans,
        idx_scan as index_scans
      FROM pg_stat_user_tables 
      WHERE relname IN ('users', 'users_new')
    `);
 
    performance.forEach(table => {
      this.metrics.gauge(`db.table.${table.tablename}.inserts`, table.inserts);
      this.metrics.gauge(`db.table.${table.tablename}.updates`, table.updates);
      this.metrics.gauge(`db.table.${table.tablename}.deletes`, table.deletes);
      this.metrics.gauge(`db.table.${table.tablename}.seq_scans`, table.sequential_scans);
      this.metrics.gauge(`db.table.${table.tablename}.idx_scans`, table.index_scans);
    });
 
    // Migration progress
    const progress = await this.db.query(`
      SELECT last_migrated_id FROM migration_progress 
      WHERE migration_name = 'users_to_users_new'
    `);
 
    if (progress.length > 0) {
      this.metrics.gauge('migration.last_migrated_id', progress[0].last_migrated_id);
    }
 
    // Replication lag (if using logical replication)
    const replicationLag = await this.db.query(`
      SELECT 
        slot_name,
        confirmed_flush_lsn,
        pg_current_wal_lsn() as current_lsn,
        pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) as lag_bytes
      FROM pg_replication_slots
    `);
 
    replicationLag.forEach(slot => {
      this.metrics.gauge(`db.replication.${slot.slot_name}.lag_bytes`, slot.lag_bytes || 0);
    });
 
    // Connection counts
    const connections = await this.db.query(`
      SELECT 
        state,
        COUNT(*) as count
      FROM pg_stat_activity 
      WHERE datname = current_database()
      GROUP BY state
    `);
 
    connections.forEach(conn => {
      this.metrics.gauge(`db.connections.${conn.state}`, conn.count);
    });
  }
 
  async alertOnAnomalies(): Promise<void> {
    // Check for migration stalls
    const lastProgress = await this.db.query(`
      SELECT 
        last_migrated_id,
        updated_at,
        EXTRACT(EPOCH FROM (NOW() - updated_at)) as stall_seconds
      FROM migration_progress 
      WHERE migration_name = 'users_to_users_new'
    `);
 
    if (lastProgress.length > 0 && lastProgress[0].stall_seconds > 300) {
      this.metrics.increment('migration.alerts.stall');
      // Send alert to monitoring system
      await this.sendAlert('Migration stalled', {
        seconds: lastProgress[0].stall_seconds,
        lastId: lastProgress[0].last_migrated_id
      });
    }
 
    // Check for high conflict counts (one row per database; sum the counters)
    const errorLogs = await this.db.query(`
      SELECT confl_tablespace + confl_lock + confl_snapshot
           + confl_bufferpin + confl_deadlock AS error_count
      FROM pg_stat_database_conflicts
      WHERE datname = current_database()
    `);
 
    if (errorLogs[0].error_count > 10) {
      await this.sendAlert('High database error rate', {
        errorCount: errorLogs[0].error_count
      });
    }
  }
 
  private async sendAlert(message: string, data: any): Promise<void> {
    // Integration with your alerting system (Slack, PagerDuty, etc.)
    console.error(`ALERT: ${message}`, data);
    
    // Example Slack webhook
    if (process.env.SLACK_WEBHOOK_URL) {
      await fetch(process.env.SLACK_WEBHOOK_URL, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          text: `🚨 Migration Alert: ${message}`,
          attachments: [{
            color: 'danger',
            fields: Object.entries(data).map(([key, value]) => ({
              title: key,
              value: String(value),
              short: true
            }))
          }]
        })
      });
    }
  }
}
 
export default MigrationMonitor;

Testing Strategies

Automated Migration Testing

// tests/migration.test.ts
import { DatabaseClient } from '../services/database';
import { UserDataMigration } from '../scripts/migrate-users-data';
import { MigrationValidator } from '../scripts/validate-migration';
 
describe('User Migration', () => {
  let db: DatabaseClient;
  let migration: UserDataMigration;
  let validator: MigrationValidator;
 
  beforeAll(async () => {
    // Use test database
    db = new DatabaseClient(process.env.TEST_DATABASE_URL);
    migration = new UserDataMigration(db, {
      batchSize: 100,
      delayMs: 0, // No delay in tests
      maxRetries: 3,
      validateChecksum: true
    });
    validator = new MigrationValidator(db);
  });
 
  beforeEach(async () => {
    // Clean slate for each test
    await db.query('TRUNCATE TABLE users, users_new RESTART IDENTITY CASCADE');
    await db.query('DELETE FROM migration_progress');
  });
 
  afterAll(async () => {
    await db.close();
  });
 
  test('should migrate small dataset successfully', async () => {
    // Create test data
    const testUsers = [
      { email: 'user1@test.com', name: 'User One', status: 'active' },
      { email: 'user2@test.com', name: 'User Two', status: 'inactive' },
      { email: 'user3@test.com', name: 'User Three', status: 'active' }
    ];
 
    for (const user of testUsers) {
      await db.query(`
        INSERT INTO users (email, name, status) VALUES ($1, $2, $3)
      `, [user.email, user.name, user.status]);
    }
 
    // Run migration
    await migration.migrate();
 
    // Validate results
    const isValid = await validator.validateComplete();
    expect(isValid).toBe(true);
 
    // Check specific data
    const migratedUsers = await db.query('SELECT * FROM users_new ORDER BY id');
    expect(migratedUsers).toHaveLength(3);
    expect(migratedUsers[0].email).toBe('user1@test.com');
    expect(migratedUsers[0].preferences).toEqual({});
  });
 
  test('should handle large dataset with batching', async () => {
    // Create large test dataset
    const batchSize = 1000;
    const totalRecords = 5000;
    
    for (let i = 0; i < totalRecords; i += batchSize) {
      const values = [];
      const placeholders = [];
      
      for (let j = 0; j < Math.min(batchSize, totalRecords - i); j++) {
        const userIndex = i + j + 1;
        values.push(`user${userIndex}@test.com`, `User ${userIndex}`, 'active');
        placeholders.push(`($${j * 3 + 1}, $${j * 3 + 2}, $${j * 3 + 3})`);
      }
      
      await db.query(`
        INSERT INTO users (email, name, status) 
        VALUES ${placeholders.join(', ')}
      `, values);
    }
 
    // Run migration
    const startTime = Date.now();
    await migration.migrate();
    const duration = Date.now() - startTime;
 
    // Validate
    const [userCount] = await db.query('SELECT COUNT(*) as count FROM users_new');
    expect(userCount.count).toBe(totalRecords);
 
    // Performance expectation (should complete within reasonable time)
    expect(duration).toBeLessThan(30000); // 30 seconds
  });
 
  test('should handle migration resume after interruption', async () => {
    // Create test data
    for (let i = 1; i <= 1000; i++) {
      await db.query(`
        INSERT INTO users (email, name, status) VALUES ($1, $2, $3)
      `, [`user${i}@test.com`, `User ${i}`, 'active']);
    }
 
    // Simulate partial migration
    await db.query(`
      INSERT INTO migration_progress (migration_name, last_migrated_id)
      VALUES ('users_to_users_new', 500)
      ON CONFLICT (migration_name) 
      DO UPDATE SET last_migrated_id = EXCLUDED.last_migrated_id
    `);
 
    // Migrate first 500 records manually to simulate previous run
    await db.query(`
      INSERT INTO users_new (id, email, name, status, preferences, created_at, updated_at)
      SELECT id, email, name, status, '{}'::jsonb, created_at, updated_at
      FROM users WHERE id <= 500
    `);
 
    // Resume migration
    await migration.migrate();
 
    // Validate all records are migrated
    const [totalCount] = await db.query('SELECT COUNT(*) as count FROM users_new');
    expect(totalCount.count).toBe(1000);
 
    // Validate no duplicates
    const [duplicateCount] = await db.query(`
      SELECT COUNT(*) as count FROM (
        SELECT id FROM users_new GROUP BY id HAVING COUNT(*) > 1
      ) duplicates
    `);
    expect(duplicateCount.count).toBe(0);
  });
 
  test('should detect corruption during validation', async () => {
    // Create invalid test scenario
    await db.query(`
      INSERT INTO users (email, name, status) VALUES 
      ('user1@test.com', 'User One', 'active'),
      ('user2@test.com', 'User Two', 'active')
    `);
 
    // Manually corrupt the new table to trigger validation failure
    await db.query(`
      INSERT INTO users_new (id, email, name, status, preferences, created_at, updated_at)
      VALUES (1, 'wrong@test.com', 'Wrong User', 'active', '{}', NOW(), NOW())
    `);
 
    // Validation should detect the corruption
    const isValid = await validator.validateComplete();
    expect(isValid).toBe(false);
  });
});

Real-World Lessons Learned

Lesson 1: Always Test with Production-Like Data

One migration that looked perfect in staging failed spectacularly in production because our test data didn't include the edge cases present in real user data. A few users had null bytes in their names, which caused the migration to crash.

Solution: Always use anonymized production data for migration testing:

-- Create anonymized test data
CREATE TABLE users_test AS 
SELECT 
    id,
    'user' || id || '@test.com' as email,
    'Test User ' || id as name,
    created_at,
    updated_at,
    status,
    -- Keep data structures but anonymize content
    CASE 
        WHEN preferences IS NOT NULL THEN '{"theme": "light"}'::jsonb
        ELSE NULL
    END as preferences
FROM users
WHERE id % 100 = 0; -- Every 100th user for representative sample

Lesson 2: Monitor Everything

During a major migration, we discovered that the application was making 10x more database queries than expected because a poorly optimized query had never been updated for the new schema. The migration technically worked, but it brought the application down.

Solution: Set up comprehensive monitoring before starting:

// Migration monitoring dashboard
const migrationMetrics = {
  // Database metrics
  recordsPerSecond: await calculateMigrationRate(),
  errorRate: await getErrorRate(),
  connectionCount: await getConnectionCount(),
  
  // Application metrics  
  apiLatency: await getAverageApiLatency(),
  errorResponses: await getErrorResponses(),
  userExperience: await getUserExperienceMetrics(),
  
  // System metrics
  cpuUsage: await getCPUUsage(),
  memoryUsage: await getMemoryUsage(),
  diskIO: await getDiskIOMetrics()
};

Lesson 3: Plan for the Worst-Case Scenario

I once had to roll back a migration that was 90% complete because we discovered data corruption in the new table. Having a robust rollback plan saved us from a complete disaster.

Rollback Strategy Template:

-- Immediate rollback plan
BEGIN;
 
-- Step 1: Stop all writes to new table
UPDATE application_config SET maintenance_mode = true;
 
-- Step 2: Switch application back to old table  
UPDATE application_config SET active_database_table = 'users';
 
-- Step 3: Clean up corrupted data if needed
-- DELETE FROM users_new WHERE corruption_condition;
 
-- Step 4: Reset migration progress
UPDATE migration_progress SET last_migrated_id = 0 WHERE migration_name = 'users_to_users_new';
 
COMMIT;

Lesson 4: Communication is Critical

Keep all stakeholders informed throughout the migration:

// Migration status reporter
class MigrationReporter {
  async sendStatusUpdate(phase: string, progress: number, eta?: Date) {
    const message = {
      phase,
      progress: `${progress.toFixed(2)}%`,
      eta: eta?.toISOString(),
      timestamp: new Date().toISOString()
    };
    
    // Notify different stakeholders appropriately
    await this.notifyEngineering(message);
    await this.notifyBusinessStakeholders(message);
    await this.updateStatusPage(message);
  }
  
  async notifyEngineering(message: any) {
    // Detailed technical updates to engineering Slack
  }
  
  async notifyBusinessStakeholders(message: any) {
    // High-level updates to business stakeholders
  }
  
  async updateStatusPage(message: any) {
    // Public status page updates if needed
  }
}

Migration Checklist

Use this checklist for every production migration:

Pre-Migration

  • [ ] Full database backup created and verified
  • [ ] Migration tested on production-like data
  • [ ] Performance impact assessed
  • [ ] Rollback plan documented and tested
  • [ ] Monitoring and alerting configured
  • [ ] Stakeholder communication plan ready
  • [ ] Maintenance window scheduled (if needed)

During Migration

  • [ ] Monitor key metrics continuously
  • [ ] Validate data integrity in batches
  • [ ] Check application performance
  • [ ] Monitor error logs
  • [ ] Keep stakeholders updated

Post-Migration

  • [ ] Comprehensive data validation
  • [ ] Performance benchmarking
  • [ ] Monitor for 24-48 hours
  • [ ] Document lessons learned
  • [ ] Clean up old resources (after validation period)

Database migrations will always carry risk, but with proper planning, testing, and monitoring, you can minimize that risk significantly. The key is to assume something will go wrong and prepare accordingly. Every migration teaches you something new about your system—embrace those lessons and build them into your next migration strategy.

Remember: it's better to take extra time for preparation than to explain a data loss incident to your CEO.