Automatic Node Replication in pgraft¶
Quick Answer¶
YES! When you add a node to the leader, it automatically appears on ALL other nodes.
You only need to run ONE command on the leader. The Raft consensus protocol handles everything else automatically.
How It Works¶
The Process¶
1. User runs ONE command on the leader:

       Leader$ SELECT pgraft_add_node(4, '192.168.1.14', 7004);

2. Leader creates a ConfChange entry:

       Entry{Type: AddNode, NodeID: 4, Addr: "192.168.1.14"}

3. Leader appends the entry to its Raft log and persists it to disk:

       log[42] = ConfChange entry

4. Leader replicates the entry to ALL followers (AUTOMATIC):

       Leader → Follower1: AppendEntries(log[42])
       Leader → Follower2: AppendEntries(log[42])
       Leader → Follower3: AppendEntries(log[42])

5. Followers persist the entry and acknowledge:

       Follower1: writes log[42] → ACK
       Follower2: writes log[42] → ACK
       Follower3: writes log[42] → ACK

6. Leader commits the entry once a majority has acknowledged:

       commitIndex = 42

7. ALL nodes apply the configuration change (AUTOMATIC):

       Leader:    nodes[4] = "192.168.1.14:7004" ✓
       Follower1: nodes[4] = "192.168.1.14:7004" ✓
       Follower2: nodes[4] = "192.168.1.14:7004" ✓
       Follower3: nodes[4] = "192.168.1.14:7004" ✓

   ALL NODES UPDATED! NO MANUAL STEPS!
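To make the flow concrete, here is a self-contained toy model in Go (my own illustration, not pgraft source): a membership change is appended at the leader, replicated, committed once a majority holds it, and then applied by every node, so every node ends up with the same membership map.

package main

import "fmt"

// Toy model of a ConfChange flowing through a Raft log.
// The logEntry and node types are invented for this sketch and do not exist in pgraft.
type logEntry struct {
	index  int
	nodeID int
	addr   string
}

type node struct {
	name       string
	log        []logEntry
	membership map[int]string
}

func main() {
	cluster := []*node{
		{name: "leader", membership: map[int]string{1: ":7001", 2: ":7002", 3: ":7003"}},
		{name: "follower1", membership: map[int]string{1: ":7001", 2: ":7002", 3: ":7003"}},
		{name: "follower2", membership: map[int]string{1: ":7001", 2: ":7002", 3: ":7003"}},
	}

	// Steps 2-3: leader creates the ConfChange entry and appends it to its log.
	entry := logEntry{index: 42, nodeID: 4, addr: "192.168.1.14:7004"}
	acks := 0
	for _, n := range cluster {
		// Steps 4-5: the entry is appended on every node; each append counts
		// toward the majority (the leader counts itself).
		n.log = append(n.log, entry)
		acks++
	}

	// Step 6: commit once a majority (here 2 of 3) has the entry.
	if acks >= len(cluster)/2+1 {
		// Step 7: every node applies the committed entry to its membership map.
		for _, n := range cluster {
			n.membership[entry.nodeID] = entry.addr
			fmt.Printf("%s: nodes[%d] = %q\n", n.name, entry.nodeID, n.membership[entry.nodeID])
		}
	}
}

In the real system the acknowledgments arrive over the network and each node persists the entry to disk, but the shape of the computation is the same.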
Example¶
Initial State¶
Cluster: 3 nodes
┌─────────┐  ┌─────────┐  ┌─────────┐
│ Node 1  │  │ Node 2  │  │ Node 3  │
│ Leader  │  │Follower │  │Follower │
│  :7001  │  │  :7002  │  │  :7003  │
└─────────┘  └─────────┘  └─────────┘
Membership on ALL nodes: {1, 2, 3}
User Action (Only on Leader!)¶
-- Run on Node 1 (leader) ONLY
SELECT pgraft_add_node(4, '192.168.1.14', 7004);
-- Result
pgraft_add_node
-----------------
t
(1 row)
-- Do NOT run on Node 2
-- Do NOT run on Node 3
-- That's it! One command!
Result (Automatic)¶
Cluster: 4 nodes
┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
│ Node 1  │  │ Node 2  │  │ Node 3  │  │ Node 4  │
│ Leader  │  │Follower │  │Follower │  │Follower │
│  :7001  │  │  :7002  │  │  :7003  │  │  :7004  │
└─────────┘  └─────────┘  └─────────┘  └─────────┘
Membership on ALL nodes: {1, 2, 3, 4} ← AUTOMATIC!
Verification¶
Check membership on ANY node:
-- On Node 1 (leader)
node1$ SELECT * FROM pgraft_get_nodes();
 node_id |   address    | port
---------+--------------+------
       1 | 127.0.0.1    | 7001
       2 | 127.0.0.1    | 7002
       3 | 127.0.0.1    | 7003
       4 | 192.168.1.14 | 7004   ← New node!

-- On Node 2 (follower - was NOT manually updated)
node2$ SELECT * FROM pgraft_get_nodes();

 node_id |   address    | port
---------+--------------+------
       1 | 127.0.0.1    | 7001
       2 | 127.0.0.1    | 7002
       3 | 127.0.0.1    | 7003
       4 | 192.168.1.14 | 7004   ← SAME! Automatic!

-- On Node 3 (follower - was NOT manually updated)
node3$ SELECT * FROM pgraft_get_nodes();

 node_id |   address    | port
---------+--------------+------
       1 | 127.0.0.1    | 7001
       2 | 127.0.0.1    | 7002
       3 | 127.0.0.1    | 7003
       4 | 192.168.1.14 | 7004   ← SAME! Automatic!
All nodes have identical membership! ✓
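If you want to script this check, a small client can query every node and compare the results. The sketch below is hypothetical: it assumes the standard database/sql package with the lib/pq driver, and the host names and connection options are placeholders you would adjust for your own cluster.

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // PostgreSQL driver (assumed available)
)

func main() {
	// Hypothetical DSNs for the three nodes; adjust hosts, ports and credentials.
	dsns := []string{
		"host=node1 port=5432 dbname=postgres sslmode=disable",
		"host=node2 port=5432 dbname=postgres sslmode=disable",
		"host=node3 port=5432 dbname=postgres sslmode=disable",
	}

	for _, dsn := range dsns {
		db, err := sql.Open("postgres", dsn)
		if err != nil {
			log.Fatal(err)
		}

		rows, err := db.Query("SELECT node_id, address, port FROM pgraft_get_nodes()")
		if err != nil {
			log.Fatal(err)
		}

		fmt.Println("membership reported by:", dsn)
		for rows.Next() {
			var id, port int
			var addr string
			if err := rows.Scan(&id, &addr, &port); err != nil {
				log.Fatal(err)
			}
			fmt.Printf("  node %d at %s:%d\n", id, addr, port)
		}
		rows.Close()
		db.Close()
	}
}

Every node should print the same list; if one does not, it simply has not caught up yet (see Automatic Recovery below).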
What You Do vs. What the System Does¶
What YOU Do¶
-- Step 1: Find the leader
SELECT pgraft_get_leader();
-- Returns: 1
-- Step 2: Connect to the leader (Node 1)
psql -h node1 -p 5432
-- Step 3: Add the new node
SELECT pgraft_add_node(4, '192.168.1.14', 7004);
-- Returns: t
-- DONE! That's all!
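The same three steps can be scripted. The following Go sketch is hypothetical: it assumes the lib/pq driver, placeholder connection strings, and a hosts map from pgraft node IDs to reachable host names that your deployment would provide. It asks any node for the leader ID, connects to that leader, and issues the single pgraft_add_node call.

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // PostgreSQL driver (assumed available)
)

func main() {
	// Hypothetical mapping from pgraft node ID to a reachable PostgreSQL host.
	hosts := map[int64]string{1: "node1", 2: "node2", 3: "node3"}

	// Step 1: ask any node which node is the leader.
	anyNode, err := sql.Open("postgres", "host=node2 port=5432 dbname=postgres sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	var leaderID int64
	if err := anyNode.QueryRow("SELECT pgraft_get_leader()").Scan(&leaderID); err != nil {
		log.Fatal(err)
	}
	anyNode.Close()

	// Step 2: connect to the leader.
	leader, err := sql.Open("postgres",
		fmt.Sprintf("host=%s port=5432 dbname=postgres sslmode=disable", hosts[leaderID]))
	if err != nil {
		log.Fatal(err)
	}
	defer leader.Close()

	// Step 3: add the new node; Raft replicates the change to every node.
	var ok bool
	if err := leader.QueryRow("SELECT pgraft_add_node(4, '192.168.1.14', 7004)").Scan(&ok); err != nil {
		log.Fatal(err)
	}
	fmt.Println("pgraft_add_node returned:", ok)
}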
What the SYSTEM Does (Automatically)¶
- ✓ Creates ConfChange entry
- ✓ Appends to Raft log
- ✓ Replicates to ALL followers
- ✓ Waits for majority acknowledgment
- ✓ Commits the entry
- ✓ Applies on ALL nodes
- ✓ Updates membership everywhere
- ✓ Establishes connections to new node
- ✓ Starts replicating to new node
- ✓ Persists configuration to disk
All automatic. No manual steps.
What You DON'T Need to Do¶
✗ Do NOT SSH to Each Node¶
# DON'T DO THIS (old, manual way):
ssh node1 "echo 'node4: 192.168.1.14:7004' >> config.conf"
ssh node2 "echo 'node4: 192.168.1.14:7004' >> config.conf"
ssh node3 "echo 'node4: 192.168.1.14:7004' >> config.conf"
# WRONG! Don't do this!
✗ Do NOT Edit Config Files Manually¶
# DON'T DO THIS:
vi /etc/pgraft/nodes.conf # Add node4
# WRONG! pgraft doesn't use config files for membership
✗ Do NOT Restart Any Processes¶
# DON'T DO THIS:
systemctl restart postgresql@node1
systemctl restart postgresql@node2
systemctl restart postgresql@node3
# WRONG! No restart needed!
✗ Do NOT Run the Command on Each Node¶
-- DON'T DO THIS:
node1$ SELECT pgraft_add_node(4, ...); -- OK (on leader)
node2$ SELECT pgraft_add_node(4, ...); -- WRONG! Don't do this!
node3$ SELECT pgraft_add_node(4, ...); -- WRONG! Don't do this!
Why This is Safe¶
1. Raft Log Guarantees¶
Guarantee: Once an entry is committed to the Raft log, it's guaranteed to be on all current and future leaders.
Committed entry = Majority have it in persistent storage
                = Cannot be lost
                = Will be on all future leaders
                = Will be applied on all nodes
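The commit rule behind this guarantee is plain arithmetic: the leader tracks the highest log index each node has persisted (its matchIndex) and commits the largest index that a strict majority has reached. A minimal sketch in Go (my own illustration, slightly simplified, not pgraft code):

package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index that a majority of nodes has
// persisted (simplified: real Raft also checks that the entry belongs to the
// leader's current term before committing it).
func commitIndex(matchIndex []int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	// The value at position n/2 (0-based) is held by at least n/2+1 nodes.
	return sorted[len(sorted)/2]
}

func main() {
	// Leader and two followers have persisted log[42]; one follower lags at 41.
	fmt.Println(commitIndex([]int{42, 42, 42, 41})) // 42: on 3 of 4 nodes, a majority
}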
2. Sequential Consistency¶
Guarantee: All nodes apply log entries in the same order.
All nodes see:
    log[42]: Add node 4
    log[43]: Add node 5
    log[44]: Remove node 3

All nodes apply in the same order:
    1. Add node 4
    2. Add node 5
    3. Remove node 3

Result: All nodes have identical membership
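Because the apply step is deterministic and runs over the same ordered log on every node, the membership can only come out one way. A small illustration in Go (the membershipOp type and apply function are invented for this sketch, not pgraft code):

package main

import "fmt"

// membershipOp is one committed configuration change from the log.
type membershipOp struct {
	add    bool
	nodeID int
	addr   string
}

// apply replays committed entries, in log order, onto a membership map.
func apply(log []membershipOp) map[int]string {
	nodes := map[int]string{1: ":7001", 2: ":7002", 3: ":7003"}
	for _, op := range log {
		if op.add {
			nodes[op.nodeID] = op.addr
		} else {
			delete(nodes, op.nodeID)
		}
	}
	return nodes
}

func main() {
	committed := []membershipOp{
		{add: true, nodeID: 4, addr: "192.168.1.14:7004"}, // log[42]
		{add: true, nodeID: 5, addr: "192.168.1.15:7005"}, // log[43]
		{add: false, nodeID: 3},                           // log[44]
	}
	// Every node runs exactly this computation, so every node ends up with {1, 2, 4, 5}.
	fmt.Println(apply(committed))
}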
3. Majority Agreement Required¶
Guarantee: Configuration changes only committed when majority agrees.
5-node cluster:
- Leader proposes: Add node 6
- Leader + 2 followers ACK = 3 nodes
- 3 ≥ ⌈(5+1)/2⌉ = 3 → Majority! ✓
- Entry committed
- Safe to apply
If the network partitions a 5-node cluster into [A,B] | [C,D,E]:
- [A,B]: only 2 nodes, cannot commit (needs 3)
- [C,D,E]: 3 nodes, can commit ✓

Only the majority partition can add nodes; the minority partition cannot
commit any change, so membership never diverges.
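The quorum arithmetic behind this is easy to check directly. A minimal sketch in Go (my own illustration, not pgraft code):

package main

import "fmt"

// quorum returns the number of acknowledgments needed to commit in a
// cluster of size n: a strict majority.
func quorum(n int) int { return n/2 + 1 }

// canCommit reports whether a partition of the given size can still commit
// entries (and therefore add or remove nodes).
func canCommit(partitionSize, clusterSize int) bool {
	return partitionSize >= quorum(clusterSize)
}

func main() {
	const clusterSize = 5
	fmt.Println("quorum(5) =", quorum(clusterSize))               // 3
	fmt.Println("[A,B] can commit:", canCommit(2, clusterSize))   // false
	fmt.Println("[C,D,E] can commit:", canCommit(3, clusterSize)) // true
}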
4. Automatic Recovery¶
Guarantee: If a node was down, it catches up automatically.
Scenario:
T+0s: Node 3 crashes
T+10s: Leader adds node 4 (log[42])
T+20s: Leader adds node 5 (log[43])
T+30s: Node 3 restarts
Node 3 recovery:
1. Contacts leader
2. Leader sees: Node 3 at log[41], I'm at log[43]
3. Leader sends: log[42] (add node 4), log[43] (add node 5)
4. Node 3 applies both entries
5. Node 3 now knows about nodes 4 and 5
AUTOMATIC! No manual steps!
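Catch-up is just the normal log replication path: the leader sends whatever entries the returning follower is missing, and the follower applies them in order. A minimal sketch in Go (the entry type and missingEntries function are invented for this sketch, not pgraft code):

package main

import "fmt"

type entry struct {
	index int
	op    string
}

// missingEntries returns the suffix of the leader's log that a follower whose
// last persisted index is followerLast has not yet seen.
func missingEntries(leaderLog []entry, followerLast int) []entry {
	var out []entry
	for _, e := range leaderLog {
		if e.index > followerLast {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	leaderLog := []entry{
		{41, "some earlier command"},
		{42, "add node 4"},
		{43, "add node 5"},
	}
	// Node 3 crashed after persisting log[41]; on restart it reports index 41.
	for _, e := range missingEntries(leaderLog, 41) {
		fmt.Printf("leader sends log[%d]: %s\n", e.index, e.op)
	}
	// Node 3 applies log[42] and log[43] and now knows about nodes 4 and 5.
}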
Implementation Code¶
How It's Enforced¶
// File: src/pgraft_sql.c (simplified)
#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"

// Exported from the Go library (see below); signatures simplified
extern int pgraft_go_is_leader(void);
extern int pgraft_go_add_peer(int node_id, char *address, int port);

PG_FUNCTION_INFO_V1(pgraft_add_node);

Datum
pgraft_add_node(PG_FUNCTION_ARGS)
{
    int32  node_id = PG_GETARG_INT32(0);
    char  *address = text_to_cstring(PG_GETARG_TEXT_PP(1));
    int32  port = PG_GETARG_INT32(2);
    int    result;

    // Check whether this node is the leader; elog(ERROR) aborts the
    // statement, so membership changes can only be issued on the leader.
    if (pgraft_go_is_leader() != 1)
        elog(ERROR, "Cannot add node - this node is not the leader. "
                    "Node addition must be performed on the leader.");

    // Call the Go library to add the peer (creates a ConfChange);
    // 0 means success, matching the Go function below.
    result = pgraft_go_add_peer(node_id, address, port);

    // Raft log replication happens automatically from here;
    // all nodes will apply this change.
    PG_RETURN_BOOL(result == 0);
}
What Happens in Go¶
// File: src/pgraft_go.go (simplified)
func pgraft_go_add_peer(nodeID int, address string, port int) int {
    // Add to the local address map
    nodeAddr := fmt.Sprintf("%s:%d", address, port)
    nodes[nodeID] = nodeAddr

    // Create the ConfChange entry
    cc := raftpb.ConfChange{
        Type:    raftpb.ConfChangeAddNode,
        NodeID:  uint64(nodeID),
        Context: []byte(nodeAddr),
    }

    // Propose it to Raft (this appends the entry to the leader's log)
    if err := raftNode.ProposeConfChange(ctx, cc); err != nil {
        return -1 // the proposal was not accepted; nothing was replicated
    }

    // Raft handles the rest:
    //   - replicates the entry to followers
    //   - waits for majority acknowledgment
    //   - commits the entry
    //   - all nodes apply the change
    // AUTOMATIC! 0 means success.
    return 0
}
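The proposal above only puts the ConfChange into the log. The apply side runs on every node: each node reads committed entries from the Raft state machine and applies configuration changes as they arrive. pgraft's own loop is not shown here; the sketch below follows the standard pattern for etcd's raft library, whose API the code above uses, and applyMembershipChange is a hypothetical stand-in for updating the nodes map.

// Sketch only; ticks, message sending and snapshot handling are omitted.
// Import paths assume the standalone raft module; older builds use
// go.etcd.io/etcd/raft/v3 instead.
package sketch

import (
	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// applyMembershipChange is a hypothetical stand-in for pgraft's own
// bookkeeping (e.g. nodes[cc.NodeID] = string(cc.Context)).
func applyMembershipChange(cc raftpb.ConfChange) {}

// runApplyLoop shows the usual etcd raft pattern that every node runs:
// persist new entries, apply committed entries in log order, then Advance.
func runApplyLoop(n raft.Node, storage *raft.MemoryStorage) {
	for rd := range n.Ready() {
		// Persist new log entries before acting on them.
		storage.Append(rd.Entries)

		// Every node, leader or follower, applies committed entries in order.
		for _, ent := range rd.CommittedEntries {
			if ent.Type == raftpb.EntryConfChange {
				var cc raftpb.ConfChange
				if err := cc.Unmarshal(ent.Data); err != nil {
					continue
				}
				n.ApplyConfChange(cc)     // update raft's view of the membership
				applyMembershipChange(cc) // update pgraft's view (hypothetical)
			}
		}
		n.Advance()
	}
}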
Comparison: Manual vs. Automatic¶
Manual Configuration (Traditional Systems)¶
Steps Required:
1. SSH to node1: vi /etc/cluster.conf, add node4
2. SSH to node2: vi /etc/cluster.conf, add node4
3. SSH to node3: vi /etc/cluster.conf, add node4
4. Restart node1: systemctl restart service
5. Restart node2: systemctl restart service
6. Restart node3: systemctl restart service
7. Hope all configs match exactly
8. Debug if configs don't match

Problems:
- ✗ Human error likely (typos, wrong IPs)
- ✗ Race conditions during restarts
- ✗ Downtime during restarts
- ✗ Configs can diverge
- ✗ Split-brain possible
- ✗ Complex rollback on error
Automatic Replication (pgraft)¶
Steps Required:
1. SELECT pgraft_add_node(4, '192.168.1.14', 7004);

That's it!

Advantages:
- ✓ No human error (single command)
- ✓ No race conditions (Raft serializes changes)
- ✓ No downtime (hot reconfiguration)
- ✓ Configs cannot diverge (Raft guarantees consistency)
- ✓ Split-brain impossible (quorum required)
- ✓ Automatic rollback on failure
Real-World Example¶
Scenario: Add New Node to Production Cluster¶
Initial Cluster:
Production: 5 nodes
node1 (leader): 192.168.1.11:7001
node2 (follower): 192.168.1.12:7002
node3 (follower): 192.168.1.13:7003
node4 (follower): 192.168.1.14:7004
node5 (follower): 192.168.1.15:7005
Task: Add node6 at 192.168.1.16:7006
Old Way (Manual):
# 1. Update config on ALL 5 nodes
for node in node1 node2 node3 node4 node5; do
ssh $node "echo 'node6: 192.168.1.16:7006' >> /etc/pgraft/nodes.conf"
done
# 2. Restart ALL nodes (risky!)
for node in node1 node2 node3 node4 node5; do
ssh $node "systemctl restart postgresql"
done
# 3. Hope nothing breaks
# 4. Debug if it does
Time: 30+ minutes
Risk: HIGH (cluster downtime, split-brain possible)
pgraft Way (Automatic):
-- 1. Connect to leader
psql -h 192.168.1.11 -p 5432
-- 2. Add node
SELECT pgraft_add_node(6, '192.168.1.16', 7006);
-- Done!
Time: 5 seconds
Risk: ZERO (no downtime, split-brain impossible)
Verification:
-- Check on ANY node
SELECT * FROM pgraft_get_nodes();
 node_id |   address    | port
---------+--------------+------
       1 | 192.168.1.11 | 7001
       2 | 192.168.1.12 | 7002
       3 | 192.168.1.13 | 7003
       4 | 192.168.1.14 | 7004
       5 | 192.168.1.15 | 7005
       6 | 192.168.1.16 | 7006   ← NEW! On ALL nodes!
Summary¶
The Answer: YES! 100% Automatic¶
When you add a node to the leader:
1. ✓ Leader creates ConfChange entry
2. ✓ Leader replicates via Raft log to ALL followers
3. ✓ Majority acknowledges
4. ✓ Entry commits
5. ✓ ALL nodes apply the change
6. ✓ New node appears on ALL nodes
No manual steps. No SSH. No config files. No restarts.
What Makes This Possible¶
Raft Consensus Protocol:
- Log replication guarantees
- Majority agreement required
- Sequential consistency
- Automatic recovery
Benefits¶
- Safety: Split-brain impossible
- Simplicity: One SQL command
- Reliability: No human error
- Availability: No downtime
- Consistency: All nodes identical
FAQ¶
Q: Do I need to run the command on each node?
A: NO! Only on the leader. The rest is automatic.
Q: What if a follower was down when I added a node?
A: It catches up automatically when it restarts. It will read the ConfChange from the Raft log.
Q: What if the leader crashes during the add?
A: If the entry was committed (a majority persisted it), the new leader is guaranteed to have it and will apply it. If it was not yet committed, the change may or may not survive the leader change, so check pgraft_get_nodes() on the new leader and re-run pgraft_add_node if the node is missing.
Q: Can I add multiple nodes at once?
A: Yes, but add them one at a time. Each becomes a separate Raft log entry.
Q: How long does it take for the node to appear on all nodes?
A: ~5ms (time for majority acknowledgment). Essentially instant.
Q: Do I need to configure anything on the follower nodes?
A: NO! Raft handles everything automatically.
BOTTOM LINE: ONE COMMAND ON LEADER = AUTOMATIC UPDATE ON ALL NODES ✓