Operations Guide¶
Production operations, monitoring, and maintenance for pgraft clusters.
Overview¶
This section covers everything you need to know to operate pgraft clusters in production, from monitoring to troubleshooting to best practices.
Contents¶
Monitoring¶
Learn how to monitor cluster health, performance, and replication status.
Troubleshooting¶
Solutions for common issues and debugging techniques.
Best Practices¶
Production-tested recommendations for configuration, deployment, and maintenance.
Quick Reference¶
Health Checks¶
Quick health check:
SELECT
pgraft_is_leader() as is_leader,
pgraft_get_term() as term,
pgraft_get_leader() as leader_id,
pgraft_get_worker_state() as worker;
Detailed status:
SELECT * FROM pgraft_get_cluster_status();
SELECT * FROM pgraft_get_nodes();
SELECT * FROM pgraft_log_get_stats();
Common Issues¶
| Issue | Quick Check | Solution |
|---|---|---|
| Worker not running | SELECT pgraft_get_worker_state(); | Check shared_preload_libraries, restart PostgreSQL |
| No leader elected | SELECT pgraft_get_leader(); | Wait 10 seconds, check network connectivity |
| Cannot add node | SELECT pgraft_is_leader(); | Run on the leader node only |
| Frequent elections | SELECT pgraft_get_term(); | Increase election timeout, check network |
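The quick checks above can also be scripted; a minimal sketch, assuming local psql access to the database where pgraft is installed:

#!/bin/bash
# Run the quick checks from the table above in one pass.
# Database name defaults to "postgres"; pass another as the first argument.
DB="${1:-postgres}"
psql -d "$DB" -c "SELECT pgraft_get_worker_state();"   # worker running?
psql -d "$DB" -c "SELECT pgraft_get_leader();"         # leader elected?
psql -d "$DB" -c "SELECT pgraft_is_leader();"          # is this node the leader?
psql -d "$DB" -c "SELECT pgraft_get_term();"           # frequent elections show up as a rising term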
Key Metrics to Monitor¶
Critical (alert if failing):
- Leader exists: pgraft_get_leader() > 0
- Worker running: pgraft_get_worker_state() = 'RUNNING'
- Node reachable: network connectivity

Important (monitor trends):
- Term number (should be stable)
- Log replication lag
- Election frequency
- Disk usage for log storage
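For alerting, the critical checks can be combined into a single probe query; a minimal sketch:

SELECT
    pgraft_get_leader() > 0               AS leader_exists,
    pgraft_get_worker_state() = 'RUNNING' AS worker_running,
    pgraft_get_term()                     AS current_term;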
Emergency Procedures¶
Lost quorum (majority of nodes failed):
1. The cluster becomes read-only
2. Restore the failed nodes or add new ones
3. Once quorum is restored, a leader will be elected
4. The cluster resumes normal operation

Complete cluster failure:
1. Restore nodes from backup
2. Start all PostgreSQL instances
3. Wait 10 seconds for the election
4. Verify that a leader was elected: SELECT pgraft_get_leader();
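One way to script the final verification step, assuming psql access to a surviving node:

#!/bin/bash
# Poll until a leader is elected (pgraft_get_leader() > 0), then report it.
until [ "$(psql -tAc 'SELECT pgraft_get_leader() > 0;')" = "t" ]; do
    echo "waiting for leader election..."
    sleep 2
done
psql -c "SELECT pgraft_get_leader() AS leader_id, pgraft_get_term() AS term;"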
Production Checklist¶
Before Deployment¶
- Cluster size is odd (3, 5, or 7 nodes)
- Nodes distributed across availability zones
- Election timeout tuned for network latency
- Monitoring and alerting configured
- Backup strategy in place
- Disaster recovery plan documented
- Failover scenarios tested
- Network security configured
- Performance baselines established
- Team trained on operations
Daily Operations¶
- Check cluster status
- Verify leader exists
- Check worker state on all nodes
- Review error logs
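These daily checks are easy to consolidate into a cron-friendly script; a rough sketch in which the node hostnames are placeholders:

#!/bin/bash
# Daily check: cluster status, worker state, and leadership on each node.
# Hostnames are placeholders; replace them with your own.
for host in node1 node2 node3; do
    echo "=== $host ==="
    psql -h "$host" -c "SELECT * FROM pgraft_get_cluster_status();"
    psql -h "$host" -c "SELECT pgraft_get_worker_state() AS worker, pgraft_is_leader() AS is_leader;"
done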
Weekly Maintenance¶
- Review monitoring metrics
- Check log file sizes and disk usage
- Verify backups are running
- Review and analyze any alerts
Monthly Tasks¶
- Test disaster recovery procedures
- Review and optimize configuration
- Update documentation
- Performance review and tuning
Monitoring Tools¶
Built-in Functions¶
pgraft provides SQL functions for monitoring:
- pgraft_get_cluster_status() - Overall cluster state
- pgraft_get_nodes() - All cluster members
- pgraft_log_get_stats() - Log statistics
- pgraft_get_worker_state() - Background worker status
- pgraft_is_leader() - Leadership status
External Monitoring¶
Prometheus (if enabled):
Metrics available at http://node:9100/metrics
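The endpoint can be spot-checked directly (replace node with an actual hostname):

curl -s http://node:9100/metrics | head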
PostgreSQL Logs:
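One way to watch for pgraft-related messages, assuming the default logging_collector file layout (adjust the path to your installation):

tail -f "$PGDATA"/log/postgresql-*.log | grep -i pgraft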
Custom Monitoring Script:
#!/bin/bash
# Check cluster health every 60 seconds
while true; do
    psql -c "SELECT * FROM pgraft_get_cluster_status();"
    sleep 60
done
Performance Tuning¶
Network Latency¶
Low latency (<10ms):
Medium latency (10-50ms):
High latency (>50ms):
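The concrete values for these three profiles are not reproduced here. As an illustrative sketch, the settings below assume the parameter names pgraft.heartbeat_interval and pgraft.election_timeout (in milliseconds); verify the exact names against your pgraft version. The guiding rule is that both should grow with round-trip time, and the election timeout should stay several times larger than the heartbeat interval:

# Illustrative values only; parameter names are assumptions.
# Low latency (<10ms)
pgraft.heartbeat_interval = 100
pgraft.election_timeout = 1000

# Medium latency (10-50ms)
pgraft.heartbeat_interval = 250
pgraft.election_timeout = 2500

# High latency (>50ms)
pgraft.heartbeat_interval = 500
pgraft.election_timeout = 5000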
Throughput Optimization¶
# Larger batches for higher throughput
pgraft.batch_size = 200
pgraft.max_batch_delay = 20
# More frequent snapshots for faster recovery
pgraft.snapshot_interval = 5000
pgraft.max_log_entries = 500
Security¶
Network Security¶
- Use private networks or VPN for inter-node communication
- Configure firewalls to allow only necessary ports
- Consider enabling TLS (when available)
Access Control¶
- pgraft functions require superuser privileges
- Don't grant superuser to application users
- Use connection limits and pg_hba.conf appropriately
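For example, pg_hba.conf can keep superuser access local while limiting application roles to their own database; the entries below are illustrative (adjust databases, roles, and address ranges):

# Superuser access only from the local host
host    all      postgres    127.0.0.1/32    scram-sha-256
# Application role limited to its own database from the application subnet
host    appdb    appuser     10.0.0.0/24     scram-sha-256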
Audit Logging¶
Enable PostgreSQL logging for pgraft operations:
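A minimal example using standard PostgreSQL logging settings (tune to your audit requirements):

# postgresql.conf
logging_collector = on
log_min_messages = info
log_line_prefix = '%m [%p] %u@%d '
log_statement = 'ddl'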
Backup and Recovery¶
What to Backup¶
- PostgreSQL data directory - All database data
- pgraft data directory - Raft state and logs
- Configuration files - postgresql.conf
Backup Strategy¶
# PostgreSQL data
pg_basebackup -D /backup/pgdata -Ft -z -P
# pgraft state
tar -czf /backup/pgraft-$(date +%Y%m%d).tar.gz $PGRAFT_DATA_DIR
# Configuration
cp $PGDATA/postgresql.conf /backup/
Recovery Procedure¶
- Stop PostgreSQL
- Restore data from backup
- Restore pgraft state
- Restore configuration
- Start PostgreSQL
- Verify cluster rejoins
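A rough sketch of the procedure, matching the backup layout above; the service name, archive date, and paths are placeholders:

#!/bin/bash
set -e
sudo systemctl stop postgresql                               # 1. Stop PostgreSQL
tar -xzf /backup/pgdata/base.tar.gz -C "$PGDATA"             # 2. Restore data (pg_basebackup tar format)
tar -xzf /backup/pgraft-YYYYMMDD.tar.gz -C /                 # 3. Restore pgraft state
cp /backup/postgresql.conf "$PGDATA/"                        # 4. Restore configuration
sudo systemctl start postgresql                              # 5. Start PostgreSQL
sleep 10
psql -c "SELECT * FROM pgraft_get_cluster_status();"         # 6. Verify the node rejoins the cluster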
Resources¶
- Detailed monitoring: Monitoring Guide
- Problem solving: Troubleshooting Guide
- Production tips: Best Practices
- Understanding internals: Architecture