
Best Practices

Follow these best practices for running pgraft in production.

Cluster Size

Use Odd Number of Nodes

Always use an odd number of nodes (3, 5, or 7) for the best fault tolerance per node:

Nodes   Fault Tolerance    Recommended For
3       1 node failure     Development, small production
5       2 node failures    Production (recommended)
7       3 node failures    High-availability critical systems

Don't Use Even Numbers

  • A 4-node cluster tolerates only 1 failure (same as 3 nodes)
  • A 6-node cluster tolerates only 2 failures (same as 5 nodes)

Even node counts add replication overhead without improving fault tolerance.
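
The arithmetic behind this: Raft needs a majority of floor(N/2) + 1 nodes to elect a leader and commit entries, so a cluster of N nodes tolerates floor((N-1)/2) failures. A quick SQL illustration of why even sizes add nothing:

-- Quorum arithmetic for cluster sizes 3 through 8 (integer division).
-- Note 4 tolerates no more failures than 3, and 6 no more than 5.
SELECT n           AS nodes,
       n / 2 + 1   AS quorum,
       (n - 1) / 2 AS tolerable_failures
FROM generate_series(3, 8) AS n;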

Don't Go Too Large

Avoid more than 7 nodes:

  • More nodes mean more replication overhead
  • Diminishing returns on availability
  • Slower consensus, since each commit must be acknowledged by a larger quorum

If you need more than 7 nodes, consider:

  • Multiple independent clusters
  • Read replicas (outside the Raft cluster)
  • Sharding

Geographic Distribution

Multi-Zone Deployment

For disaster recovery, distribute nodes across zones:

5-node example:

Zone A (2 nodes): Primary data center
Zone B (2 nodes): Secondary data center  
Zone C (1 node): Tiebreaker/DR site

Benefits:

  • Survives the loss of any single zone
  • The tiebreaker prevents split votes between the two main zones
  • Provides geographic disaster recovery

Network Considerations

  • Low latency required: <50ms between nodes
  • Stable network: Packet loss <0.1%
  • Sufficient bandwidth: For log replication

High Latency

If latency >50ms between zones, increase election timeout:

pgraft.election_timeout = 2000  # 2 seconds for WAN

Configuration

Election Timeout

Rules of thumb:

# LAN deployment (low latency <10ms)
pgraft.election_timeout = 1000  # 1 second

# WAN deployment (medium latency 10-50ms)
pgraft.election_timeout = 2000  # 2 seconds

# High latency (>50ms)
pgraft.election_timeout = 5000  # 5 seconds

Key relationship:

election_timeout >= 10 × heartbeat_interval
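
You can sanity-check this relationship on a live node. This sketch assumes pgraft registers its settings as ordinary GUCs readable via current_setting():

-- Returns true when election_timeout is at least 10x heartbeat_interval.
SELECT current_setting('pgraft.election_timeout')::int
       >= 10 * current_setting('pgraft.heartbeat_interval')::int
       AS timeouts_ok;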

Heartbeat Interval

# Default (recommended)
pgraft.heartbeat_interval = 100  # 100ms

# High-throughput systems
pgraft.heartbeat_interval = 50   # 50ms (more network traffic)

# Low-priority systems
pgraft.heartbeat_interval = 200  # 200ms (less overhead)

Snapshot Configuration

# Frequent snapshots (faster recovery, more I/O)
pgraft.snapshot_interval = 5000
pgraft.max_log_entries = 500

# Infrequent snapshots (less I/O, slower recovery)
pgraft.snapshot_interval = 100000
pgraft.max_log_entries = 10000

Trade-offs:

  • Frequent snapshots: faster crash recovery, more I/O
  • Infrequent snapshots: better steady-state performance, slower recovery
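
To review the snapshot settings (and every other pgraft setting) actually in effect, query the catalog; this assumes the extension's GUCs are visible in pg_settings, which holds for settings defined by a loaded extension:

-- List every pgraft setting and its current value.
SELECT name, setting
FROM pg_settings
WHERE name LIKE 'pgraft.%'
ORDER BY name;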

Operations

Adding Nodes

Always add nodes on the leader:

-- 1. Check if leader
SELECT pgraft_is_leader();

-- 2. If not leader, find and connect to leader
SELECT pgraft_get_leader();

-- 3. Add node
SELECT pgraft_add_node(4, '192.168.1.14', 7004);

Best practice: Add one node at a time and wait for it to catch up before adding the next.
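
To make the leader check and the membership change a single step, the calls above can be wrapped in one guarded block; a minimal sketch using only the functions shown in this section:

-- Refuse to add a node unless this session is connected to the leader.
DO $$
BEGIN
    IF NOT pgraft_is_leader() THEN
        RAISE EXCEPTION 'Not the leader; connect to node % and retry',
            pgraft_get_leader();
    END IF;
    PERFORM pgraft_add_node(4, '192.168.1.14', 7004);
END
$$;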

Removing Nodes

Graceful removal:

-- 1. Verify cluster health
SELECT * FROM pgraft_get_cluster_status();

-- 2. Remove node (on leader)
SELECT pgraft_remove_node(4);

-- 3. Shutdown removed node's PostgreSQL
-- On node 4:
pg_ctl stop -D /data/node4

Never remove a node during:

  • An active election
  • A network partition
  • Another in-progress membership change
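
The same leader guard from the previous section applies to removal; a minimal sketch:

-- Only the leader should drive membership changes.
DO $$
BEGIN
    IF NOT pgraft_is_leader() THEN
        RAISE EXCEPTION 'Run pgraft_remove_node() on the leader (node %)',
            pgraft_get_leader();
    END IF;
    PERFORM pgraft_remove_node(4);
END
$$;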

Upgrading

Rolling upgrade procedure:

  1. Upgrade followers first:

    # On follower node:
    pg_ctl stop -D /data/node2
    # Install new pgraft version
    make clean && make && make install
    pg_ctl start -D /data/node2
    

  2. Then upgrade leader:

    # Wait for the upgraded followers to be healthy, then upgrade the leader last.
    # Stopping the leader triggers an election; an upgraded follower takes over.
    pg_ctl stop -D /data/node1
    # Install new pgraft version
    make clean && make && make install
    pg_ctl start -D /data/node1
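
Before stopping any node in this procedure, it is worth confirming its current role from SQL:

-- Run on the node you are about to upgrade.
-- false: it is a follower and safe to stop.
-- true:  it is the leader; upgrade the remaining followers first.
SELECT pgraft_is_leader();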
    

Monitoring

Critical Metrics

Monitor these continuously:

-- Leader exists (should always be true)
SELECT pgraft_get_leader() > 0;

-- Worker running (should always be RUNNING)
SELECT pgraft_get_worker_state() = 'RUNNING';

-- Term stable (should not increase frequently)
SELECT pgraft_get_term();
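
For a monitoring probe, the three checks combine into a single one-row query:

-- One-row health summary: both booleans should be true,
-- and current_term should stay flat between samples.
SELECT pgraft_get_leader() > 0               AS has_leader,
       pgraft_get_worker_state() = 'RUNNING' AS worker_running,
       pgraft_get_term()                     AS current_term;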

Set Up Alerts

Critical alerts:

  • No leader for >10 seconds
  • Worker not running
  • Node unreachable

Warning alerts:

  • Term increased (an election occurred)
  • Log lag >1000 entries
  • Slow replication to a follower
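
One way to implement the "term increased" warning is to sample the term on a schedule and compare consecutive readings. The table below is a hypothetical helper, not part of pgraft, and it assumes pgraft_get_term() returns an integer term:

-- Hypothetical sample table for election detection.
CREATE TABLE IF NOT EXISTS pgraft_term_samples (
    observed_at timestamptz NOT NULL DEFAULT now(),
    term        bigint      NOT NULL
);

-- Run periodically (e.g., from cron or a scheduler).
INSERT INTO pgraft_term_samples (term)
SELECT pgraft_get_term()::bigint;

-- True if an election occurred between the two most recent samples.
SELECT max(term) <> min(term) AS election_occurred
FROM (
    SELECT term
    FROM pgraft_term_samples
    ORDER BY observed_at DESC
    LIMIT 2
) recent;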

See Monitoring for details.

Backup and Recovery

Backup Strategy

1. Backup PostgreSQL data:

pg_basebackup -D /backup/node1 -Ft -z -P

2. Backup pgraft state:

# Back up the directory configured as pgraft.data_dir
tar -czf pgraft-backup.tar.gz "$PGRAFT_DATA_DIR"

3. Backup configuration:

cp $PGDATA/postgresql.conf /backup/

Disaster Recovery

Scenario: Total cluster loss

  1. Restore one node from backup
  2. Initialize new cluster:
    CREATE EXTENSION pgraft;
    SELECT pgraft_init();
    
  3. Add other nodes:
    SELECT pgraft_add_node(2, '192.168.1.12', 7002);
    SELECT pgraft_add_node(3, '192.168.1.13', 7003);
    

Security

Network Security

Firewall rules:

# Allow Raft communication between nodes
# On node 2: accept Raft traffic from node 1
iptables -A INPUT -p tcp --dport 7002 -s node1_ip -j ACCEPT
# On node 3: accept Raft traffic from node 1
iptables -A INPUT -p tcp --dport 7003 -s node1_ip -j ACCEPT

# Repeat on every node for each peer (node 2 → nodes 1 and 3, etc.)

Best practice: Use VPN or private network for inter-node communication.

Access Control

Limit pgraft functions to superuser:

-- pgraft functions already require superuser
-- Don't grant superuser to application users
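
To audit the API surface this policy protects, you can list the pgraft functions from the system catalogs:

-- List every pgraft_* function and its signature.
SELECT proname,
       pg_get_function_identity_arguments(oid) AS arguments
FROM pg_proc
WHERE proname LIKE 'pgraft%'
ORDER BY proname;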

Performance

Hardware Recommendations

Minimum (per node):

  • CPU: 2 cores
  • RAM: 4 GB
  • Disk: SSD with >100 IOPS
  • Network: 100 Mbps

Production (per node):

  • CPU: 4+ cores
  • RAM: 16 GB+
  • Disk: NVMe SSD with >1000 IOPS
  • Network: 1 Gbps+

Disk I/O

Recommendations:

  • Use SSD or NVMe for the pgraft data directory
  • Keep it on a separate disk from the PostgreSQL data if possible
  • Monitor disk I/O wait time

Network

Recommendations:

  • Use a dedicated network for Raft traffic if possible
  • Monitor network latency continuously
  • Apply quality of service (QoS) to the Raft ports

Testing

Test Failover Scenarios

Test 1: Leader failure

# Kill leader process
pg_ctl stop -D /data/leader -m immediate

# Verify new leader elected
psql -h follower1 -c "SELECT pgraft_get_leader();"

# Restart failed node
pg_ctl start -D /data/leader

Test 2: Network partition

# Simulate partition using iptables
iptables -A INPUT -s node2_ip -j DROP
iptables -A OUTPUT -d node2_ip -j DROP

# Verify majority partition continues
# Restore network
iptables -D INPUT -s node2_ip -j DROP
iptables -D OUTPUT -d node2_ip -j DROP

Test 3: Slow follower

# Simulate a slow follower by throttling its network interface
# (tc shapes network devices only; to throttle disk instead, use cgroup v2 io.max)
# Replace eth0 with the follower's actual interface name
tc qdisc add dev eth0 root tbf rate 1mbit burst 32kbit latency 400ms

# Verify leader continues operating
# Remove throttle
tc qdisc del dev eth0 root

Checklist

Before Going to Production

  • Cluster size is odd (3, 5, or 7 nodes)
  • Nodes distributed across availability zones
  • Election timeout tuned for network latency
  • Monitoring and alerting configured
  • Backup strategy in place
  • Disaster recovery plan documented
  • Failover scenarios tested
  • Network security configured
  • Performance baselines established
  • Team trained on operations

Regular Maintenance

Daily:

  • Check cluster status
  • Verify a leader exists
  • Check worker state

Weekly:

  • Review monitoring metrics
  • Check log file sizes
  • Verify backups

Monthly:

  • Test disaster recovery
  • Review and optimize configuration
  • Update documentation

Common Pitfalls

Don't

  1. Use an even number of nodes - Wastes resources without adding fault tolerance
  2. Add nodes during elections - Wait for a stable leader
  3. Ignore monitoring - Set up alerts!
  4. Run on slow disks - Use SSD/NVMe
  5. Deploy across high-latency links without tuning timeouts
  6. Add multiple nodes simultaneously - Add one at a time
  7. Forget to back up pgraft state - Back up both PostgreSQL and pgraft data

Do

  1. Use 3, 5, or 7 nodes for optimal fault tolerance
  2. Monitor continuously - Leader, worker, term, logs
  3. Test failover scenarios regularly
  4. Use fast storage - SSD or better
  5. Distribute geographically for disaster recovery
  6. Tune for your network - Adjust timeouts based on latency
  7. Document your setup - Configuration, topology, procedures
  8. Train your team - Everyone should know how to operate pgraft

Production Deployment Example

Here's a complete production configuration:

# postgresql.conf - Production Node 1

# PostgreSQL basics
port = 5432
max_connections = 200
shared_buffers = 4GB

# pgraft extension
shared_preload_libraries = 'pgraft'

# Core cluster configuration
pgraft.cluster_id = 'production-cluster'
pgraft.node_id = 1
pgraft.address = '10.0.1.11'
pgraft.port = 7001
pgraft.data_dir = '/var/lib/postgresql/pgraft'

# Consensus settings (tuned for datacenter LAN)
pgraft.election_timeout = 1000
pgraft.heartbeat_interval = 100
pgraft.snapshot_interval = 10000
pgraft.max_log_entries = 1000

# Performance
pgraft.batch_size = 100
pgraft.max_batch_delay = 10
pgraft.compaction_threshold = 10000

# Monitoring
pgraft.metrics_enabled = true
pgraft.metrics_port = 9100

Repeat for nodes 2, 3, 4, 5 with different node_id, address, and port values.