Core Concepts¶
Understand the fundamental concepts behind pgraft.
Overview¶
This section explains the core concepts and algorithms that make pgraft work. Understanding these concepts will help you operate pgraft clusters effectively and troubleshoot issues.
Key Concepts¶
Architecture¶
Learn how pgraft integrates with PostgreSQL and how the Raft consensus algorithm works.
Automatic Replication¶
Understand how configuration changes automatically replicate across the cluster.
Split-Brain Protection¶
Discover how pgraft provides 100% guaranteed split-brain protection.
The Raft Consensus Algorithm¶
pgraft uses the Raft consensus algorithm to ensure all nodes agree on the cluster state. Raft provides:
Leader Election¶
- One node is elected as leader
- Leader must win votes from a majority of nodes (⌊N/2⌋ + 1)
- Elections occur when leader fails or times out
- Higher term always wins: a node that sees a higher term steps down and adopts it (see the sketch after this list)
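A toy sketch of the two rules above that matter most in practice: a candidate wins only with a quorum of votes, and any node that observes a higher term steps down. This is illustrative Go, not pgraft's source; the function names are made up.

```go
package main

import "fmt"

// wonElection applies the majority rule: a candidate needs floor(N/2) + 1 votes.
func wonElection(clusterSize, votesReceived int) bool {
	return votesReceived >= clusterSize/2+1
}

// observeTerm applies "higher term always wins": a node that sees a higher
// term steps down to follower and adopts that term.
func observeTerm(currentTerm, observedTerm uint64, state string) (uint64, string) {
	if observedTerm > currentTerm {
		return observedTerm, "follower"
	}
	return currentTerm, state
}

func main() {
	fmt.Println(wonElection(5, 3))           // true: 3 of 5 nodes is a majority
	fmt.Println(wonElection(5, 2))           // false: no quorum, no leader
	fmt.Println(observeTerm(7, 9, "leader")) // 9 follower: the stale leader steps down
}
```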
Log Replication¶
- Leader accepts all writes
- Leader replicates entries to followers
- Entries are committed once replicated to a majority (see the commit-index sketch below)
- All nodes eventually converge on the same log
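The commit rule can be made concrete with a small sketch: the leader tracks how far each node's log has been replicated and commits the highest index that a majority has reached. This is a simplified illustration, not pgraft's code; real Raft additionally requires the committed entry to belong to the leader's current term.

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index replicated on a majority of nodes,
// given each node's highest replicated index (the leader counts itself).
func commitIndex(matchIndex []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndex...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	// After sorting ascending, the element at position len-quorum is the
	// largest index that at least a quorum of nodes have reached.
	return sorted[len(sorted)-(len(sorted)/2+1)]
}

func main() {
	// 5 nodes: the leader is at index 9, two followers reached 7, two lag behind.
	fmt.Println(commitIndex([]uint64{9, 7, 7, 5, 3})) // 7: replicated on 3 of 5 nodes
}
```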
Safety Guarantees¶
Election Safety: At most one leader per term
Leader Append-Only: A leader never overwrites or deletes entries in its own log; it only appends
Log Matching: If two logs contain an entry with the same index and term, they are identical in all preceding entries
Leader Completeness: Entries committed in a term appear in the logs of all leaders of later terms
State Machine Safety: All nodes apply same commands in same order
How pgraft Works¶
Component Architecture¶
┌─────────────────────────────────────────┐
│ PostgreSQL Process                      │
├─────────────────────────────────────────┤
│ SQL Interface (C)                       │
│ ├─ pgraft_init()                        │
│ ├─ pgraft_add_node()                    │
│ └─ pgraft_get_cluster_status()          │
├─────────────────────────────────────────┤
│ Background Worker (C)                   │
│ └─ Drives Raft ticks every 100ms        │
├─────────────────────────────────────────┤
│ Go Raft Engine (Go)                     │
│ └─ etcd-io/raft implementation          │
├─────────────────────────────────────────┤
│ Persistent Storage                      │
│ ├─ HardState (term, vote, commit)       │
│ ├─ Log Entries                          │
│ └─ Snapshots                            │
└─────────────────────────────────────────┘
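To make the middle layers more concrete, here is a minimal sketch of a loop that drives an etcd-io/raft node with a 100ms ticker and handles its Ready() events, which is the general shape of what the background worker and Go engine layers do together. This is not pgraft's source: the import path assumes the standalone go.etcd.io/raft/v3 module, storage is kept in memory only to shorten the sketch, and network transport, snapshots, and ConfChange application are omitted.

```go
package main

import (
	"log"
	"time"

	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

func main() {
	// pgraft persists Raft state to disk; MemoryStorage is used here only for brevity.
	storage := raft.NewMemoryStorage()
	cfg := &raft.Config{
		ID:              1,
		ElectionTick:    10, // ~1s election timeout at a 100ms tick
		HeartbeatTick:   1,  // heartbeat every tick (~100ms)
		Storage:         storage,
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
	}
	n := raft.StartNode(cfg, []raft.Peer{{ID: 1}, {ID: 2}, {ID: 3}})
	defer n.Stop()

	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			n.Tick() // drives heartbeats and election timeouts
		case rd := <-n.Ready():
			// Persist HardState and new entries before acting on the rest.
			if !raft.IsEmptyHardState(rd.HardState) {
				storage.SetHardState(rd.HardState)
			}
			storage.Append(rd.Entries)
			for _, m := range rd.Messages {
				log.Printf("would send %s to node %d", m.Type, m.To) // transport omitted
			}
			for _, e := range rd.CommittedEntries {
				if e.Type == raftpb.EntryNormal && len(e.Data) > 0 {
					log.Printf("apply committed entry: %q", e.Data) // state machine apply
				}
			}
			n.Advance()
		}
	}
}
```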
Message Flow¶
- SQL function called (e.g., pgraft_add_node())
- Command queued in shared memory
- Background worker processes queue
- Go layer proposes the command to Raft (see the sketch after these steps)
- Leader replicates to followers
- Majority commits the entry
- All nodes apply the change
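The "propose to Raft" step maps onto a single call in the etcd-io/raft API that pgraft builds on. The function name and command encoding below are illustrative assumptions, not pgraft's actual interface; the shared-memory dequeue happens in the C layer and is omitted here.

```go
// Package sketch is illustrative only, not pgraft source.
package sketch

import (
	"context"
	"time"

	"go.etcd.io/raft/v3"
)

// proposeCommand hands a command (already dequeued from shared memory by the
// background worker) to the Raft node. On the leader this appends a new log
// entry and starts replication; once a majority commits it, every node sees
// it in Ready().CommittedEntries and applies it.
func proposeCommand(n raft.Node, cmd []byte) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return n.Propose(ctx, cmd)
}
```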
Key Properties¶
Consistency¶
All nodes see the same data in the same order. Once a write is acknowledged as committed, it is stored on a majority of nodes and cannot be lost or rolled back as long as a quorum survives.
Availability¶
The cluster remains available as long as a majority of nodes are operational. For a 5-node cluster, up to 2 nodes can fail.
Partition Tolerance¶
During network partitions, only the majority partition can elect a leader and accept writes. Minority partitions become read-only.
Durability¶
All committed data survives node crashes and restarts. HardState, log entries, and snapshots are persisted to disk.
Fault Tolerance¶
| Cluster Size | Nodes Required for Quorum | Tolerated Failures |
|---|---|---|
| 1 node | 1 | 0 |
| 3 nodes | 2 | 1 |
| 5 nodes | 3 | 2 |
| 7 nodes | 4 | 3 |
Note: Even-sized clusters waste resources. A 4-node cluster needs 3 nodes for quorum, so it still tolerates only 1 failure, the same as a 3-node cluster.
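The table and the note both follow from the quorum formula; a short illustrative calculation (not pgraft code) makes the even-size point explicit.

```go
package main

import "fmt"

func main() {
	for n := 1; n <= 7; n++ {
		quorum := n/2 + 1       // nodes required for quorum
		tolerated := n - quorum // failures the cluster can survive
		fmt.Printf("%d node(s): quorum=%d, tolerated failures=%d\n", n, quorum, tolerated)
	}
	// 3 and 4 nodes both tolerate 1 failure; 5 and 6 both tolerate 2.
	// Growing to an even size buys no extra fault tolerance.
}
```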
Performance Characteristics¶
Latency¶
- Leader election: ~1 second (configurable)
- Write latency: ~1 RTT to majority + disk fsync
- Read latency: Local (no consensus required)
Throughput¶
- Bottleneck: Leader's disk write speed
- Optimization: Batch multiple entries per disk write (sketched below)
- Typical: Thousands of writes per second
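A generic sketch of the batching idea, not pgraft's storage code: many already-encoded log entries are written and then made durable with a single fsync, so the per-entry disk cost shrinks as batches grow. Entry framing and checksums are omitted.

```go
// Package sketch is illustrative only, not pgraft source.
package sketch

import "os"

// appendBatch writes a batch of encoded log entries and syncs once for the
// whole batch instead of once per entry.
func appendBatch(f *os.File, entries [][]byte) error {
	for _, e := range entries {
		if _, err := f.Write(e); err != nil {
			return err
		}
	}
	return f.Sync() // one durable fsync amortized over the whole batch
}
```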
Resource Usage¶
- CPU: <1% idle, <5% during elections
- Memory: ~50MB per node
- Disk I/O: Proportional to write rate
- Network: Heartbeats every 100ms + replication traffic
Common Patterns¶
Configuration Changes¶
All configuration changes (adding/removing nodes) go through Raft consensus:
- Leader receives request
- Leader proposes a ConfChange entry (see the sketch after these steps)
- ConfChange replicated to majority
- All nodes apply configuration
- Cluster now uses new configuration
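In the etcd-io/raft API that pgraft builds on, these steps correspond to proposing and later applying a ConfChange entry. The sketch below is a minimal illustration: the function names and the node address carried in Context are assumptions, and handling of duplicate or still-pending changes is omitted.

```go
// Package sketch is illustrative only, not pgraft source.
package sketch

import (
	"context"

	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// addNode proposes a membership change: the leader appends a ConfChange entry
// that is replicated like any other log entry.
func addNode(ctx context.Context, n raft.Node, id uint64, addr string) error {
	cc := raftpb.ConfChange{
		Type:    raftpb.ConfChangeAddNode,
		NodeID:  id,
		Context: []byte(addr), // illustrative payload, not pgraft's encoding
	}
	return n.ProposeConfChange(ctx, cc)
}

// applyConfChange runs on every node once the ConfChange entry is committed
// and shows up in Ready().CommittedEntries; from then on the node uses the
// new membership.
func applyConfChange(n raft.Node, entry raftpb.Entry) (*raftpb.ConfState, error) {
	if entry.Type != raftpb.EntryConfChange {
		return nil, nil // not a membership change
	}
	var cc raftpb.ConfChange
	if err := cc.Unmarshal(entry.Data); err != nil {
		return nil, err
	}
	return n.ApplyConfChange(cc), nil
}
```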
Automatic Failover¶
When leader fails:
- Followers stop receiving heartbeats
- After election timeout, followers start election
- Candidate with up-to-date log wins
- New leader begins sending heartbeats
- Cluster resumes normal operation
Total downtime: typically 1-2 seconds (configurable; see the timing sketch below)
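The failover window falls out of the Raft timing knobs: with a 100ms tick, a heartbeat interval of 1 tick and an election timeout of 10 ticks give roughly the figures above, and etcd-io/raft randomizes each follower's actual timeout between one and two election timeouts to avoid split votes. The values below and the way they might be exposed as settings are assumptions, not pgraft's actual defaults.

```go
// Package sketch is illustrative only, not pgraft source.
package sketch

import "go.etcd.io/raft/v3"

// failoverTiming shows the knobs behind the ~1-2s failover window when the
// background worker ticks the Raft node every 100ms.
func failoverTiming(storage raft.Storage) *raft.Config {
	// etcd-io/raft campaigns after a randomized timeout between ElectionTick
	// and 2*ElectionTick ticks, so a dead leader is usually replaced within
	// about 1-2 seconds at these settings.
	return &raft.Config{
		ID:              2,  // this node's Raft ID (example value)
		HeartbeatTick:   1,  // leader heartbeat every tick (~100ms)
		ElectionTick:    10, // campaign after ~10 missed ticks (~1s)
		Storage:         storage,
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
	}
}
```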
Learn More¶
- Deep dive: Architecture
- How replication works: Automatic Replication
- Safety guarantees: Split-Brain Protection
- Operating clusters: Operations Guide