# pgraft System Architecture

## Overview
pgraft implements a distributed consensus system using the Raft algorithm integrated with PostgreSQL. This document describes the overall system architecture, component interactions, and operational flows.
## System Components

### 1. PostgreSQL Cluster Nodes
Each PostgreSQL instance in the cluster runs the pgraft extension and participates in the Raft consensus protocol.
### 2. Raft Consensus Layer

The core consensus engine, implemented in Go, provides:

- Leader election
- Log replication
- Cluster membership management
- Failure detection and recovery
### 3. Network Communication

TCP-based peer-to-peer communication between cluster nodes carries:

- Raft protocol messages
- Heartbeat signals
- Log replication
- Configuration changes
### 4. Shared Memory Interface

PostgreSQL shared memory is used for:

- The command queue between SQL functions and the background worker
- Cluster state persistence
- Worker status tracking
- Command status monitoring
## High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│ PostgreSQL Cluster │
├─────────────────┬─────────────────┬─────────────────────────────┤
│ Node 1 │ Node 2 │ Node 3 │
│ │ │ │
│ ┌─────────────┐ │ ┌─────────────┐ │ ┌─────────────┐ │
│ │ PostgreSQL │ │ │ PostgreSQL │ │ │ PostgreSQL │ │
│ │ Server │ │ │ Server │ │ │ Server │ │
│ └─────────────┘ │ └─────────────┘ │ └─────────────┘ │
│ │ │ │ │ │ │
│ ┌───────▼───────┼─────────▼───────┼─────────▼───────┐ │
│ │ pgraft │ │ pgraft │ │ pgraft │ │
│ │ Extension │ │ Extension │ │ Extension │ │
│ └───────┬───────┼─────────┬───────┼─────────┬───────┘ │
│ │ │ │ │ │ │
│ ┌───────▼───────┼─────────▼───────┼─────────▼───────┐ │
│ │ Background │ │ Background │ │ Background │ │
│ │ Worker │ │ Worker │ │ Worker │ │
│ └───────┬───────┼─────────┬───────┼─────────┬───────┘ │
│ │ │ │ │ │ │
│ ┌───────▼───────┼─────────▼───────┼─────────▼───────┐ │
│ │ Go Raft │ │ Go Raft │ │ Go Raft │ │
│ │ Library │ │ Library │ │ Library │ │
│ └───────┬───────┼─────────┬───────┼─────────┬───────┘ │
│ │ │ │ │ │ │
└─────────┼───────┼─────────┼───────┼─────────┼───────────────────┘
│ │ │ │ │
└───────┼─────────┼───────┼─────────┘
│ │ │
┌───────▼─────────▼───────▼───────┐
│ Network Layer │
│ (TCP Peer Communication) │
└─────────────────────────────────┘
```

## Component Interaction Flow

### 1. Cluster Initialization

```mermaid
sequenceDiagram
participant U as User
participant N1 as Node 1
participant N2 as Node 2
participant N3 as Node 3
U->>N1: SELECT pgraft_init()
N1->>N1: Start background worker
N1->>N1: Initialize Raft node
N1->>N1: Start network server
U->>N2: SELECT pgraft_add_node('node2:5433')
N1->>N2: Connect to node 2
N2->>N2: Start background worker
N2->>N2: Join cluster
U->>N3: SELECT pgraft_add_node('node3:5433')
N1->>N3: Connect to node 3
N3->>N3: Start background worker
N3->>N3: Join cluster
Note over N1,N3: Cluster formed with 3 nodes
```
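Issued from `psql`, the same flow looks roughly like this. It is a minimal sketch built only from the calls shown in the diagram (`pgraft_init()`, `pgraft_add_node('host:port')`); exact signatures and where each call must be issued should be checked against the function reference.

```sql
-- Node 1: start the background worker, initialize the Raft node,
-- and open the network listener.
SELECT pgraft_init();

-- Grow the cluster using the 'host:port' form from the diagram above.
SELECT pgraft_add_node('node2:5433');
SELECT pgraft_add_node('node3:5433');
```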
### 2. Leader Election Process

```mermaid
sequenceDiagram
participant N1 as Node 1 (Leader)
participant N2 as Node 2 (Follower)
participant N3 as Node 3 (Follower)
loop Heartbeat
N1->>N2: AppendEntries (heartbeat)
N1->>N3: AppendEntries (heartbeat)
N2->>N1: AppendEntries Response
N3->>N1: AppendEntries Response
end
Note over N1: Leader fails
N2->>N2: Election timeout
N2->>N3: RequestVote
N3->>N2: Vote granted
N2->>N2: Become leader
N2->>N3: AppendEntries (heartbeat)
```
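Election behavior is governed by the timing parameters listed under Configuration Parameters below. As a hedged sketch (assuming these GUCs are set per node and that a reload is sufficient; your version may require a restart), the usual Raft rule of thumb is to keep the election timeout several times larger than the heartbeat interval so that a healthy leader is never voted out:

```sql
-- Inspect the current timing (units are milliseconds per the parameter list below).
SHOW pgraft.heartbeat_interval;
SHOW pgraft.election_timeout;

-- Example: 1 s heartbeats with a 5 s election timeout on a slower network.
ALTER SYSTEM SET pgraft.heartbeat_interval = '1000';
ALTER SYSTEM SET pgraft.election_timeout = '5000';
SELECT pg_reload_conf();
```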
### 3. Log Replication

```mermaid
sequenceDiagram
participant U as User
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
U->>L: INSERT/UPDATE/DELETE
L->>L: Append to log
L->>F1: AppendEntries (log entry)
L->>F2: AppendEntries (log entry)
F1->>L: AppendEntries Response
F2->>L: AppendEntries Response
L->>L: Commit entry
L->>F1: AppendEntries (commit)
L->>F2: AppendEntries (commit)
L->>U: Transaction committed
```
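From the client's perspective this replication is transparent: the write is issued against the leader and acknowledged once the entry is committed on a majority. A minimal illustration (the `accounts` table is purely hypothetical and not part of pgraft):

```sql
-- Hypothetical example table, created on the leader.
CREATE TABLE accounts (id int PRIMARY KEY, balance numeric);

-- An ordinary write on the leader: pgraft appends it to the Raft log and
-- acknowledges after a majority of followers confirm the entry.
INSERT INTO accounts (id, balance) VALUES (42, 100.00);

-- On a follower: read-only queries can be served locally
-- (see Throughput under Performance Characteristics).
SELECT balance FROM accounts WHERE id = 42;
```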
## Data Flow Architecture

### 1. Command Processing Flow
```
SQL Function → Command Queue → Background Worker → Go Raft Library → Network
     ↑                                                                  ↓
     └────────── Command Status ← Shared Memory ← Raft State ←──────────┘
```
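In other words, an SQL call such as `pgraft_add_node()` only enqueues a command; the background worker consumes it, drives the Go Raft library, and publishes the result through the status FIFO in shared memory. The sketch below illustrates that shape; `pgraft_get_command_status()` is a hypothetical polling function used only to make the asynchronous flow concrete and may not match the real status interface:

```sql
-- Enqueue a membership change; the call's job is to queue the command,
-- not to wait for Raft to apply it.
SELECT pgraft_add_node('node2:5433');

-- Poll the command-status FIFO. pgraft_get_command_status() is hypothetical,
-- shown only to illustrate the PENDING/PROCESSING/COMPLETED/FAILED lifecycle.
SELECT pgraft_get_command_status();
```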
### 2. Shared Memory Layout

```
┌─────────────────────────────────────────────────────────────┐
│ Shared Memory │
├─────────────────────────────────────────────────────────────┤
│ Worker State │
│ ├─ Status (IDLE/INIT/RUNNING/STOPPING/STOPPED) │
│ ├─ Node ID, Address, Port │
│ ├─ Cluster Name │
│ └─ Initialization Flags │
├─────────────────────────────────────────────────────────────┤
│ Command Queue (Circular Buffer) │
│ ├─ Command Type (INIT/ADD_NODE/REMOVE_NODE/LOG_APPEND) │
│ ├─ Command Data │
│ ├─ Timestamp │
│ └─ Queue Head/Tail Pointers │
├─────────────────────────────────────────────────────────────┤
│ Command Status FIFO │
│ ├─ Command ID │
│ ├─ Status (PENDING/PROCESSING/COMPLETED/FAILED) │
│ ├─ Error Message │
│ └─ Completion Time │
├─────────────────────────────────────────────────────────────┤
│ Cluster State │
│ ├─ Current Leader ID │
│ ├─ Current Term │
│ ├─ Node Membership │
│ └─ Log Statistics │
└─────────────────────────────────────────────────────────────┘
```

## Network Architecture

### 1. Peer-to-Peer Communication
```
┌─────────────┐    TCP    ┌─────────────┐    TCP    ┌─────────────┐
│   Node 1    │◄─────────►│   Node 2    │◄─────────►│   Node 3    │
│             │           │             │           │             │
│ Port: 5433  │           │ Port: 5434  │           │ Port: 5435  │
│ Raft Port:  │           │ Raft Port:  │           │ Raft Port:  │
│    8001     │           │    8002     │           │    8003     │
└─────────────┘           └─────────────┘           └─────────────┘
```
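Each node therefore needs its Raft listener configured with the network GUCs from Configuration Parameters below. A sketch for node 1, using the ports from the diagram (the bind address is an example value, and a restart may be needed for these settings to take effect):

```sql
-- Node 1: PostgreSQL itself listens on 5433; pgraft's Raft traffic uses 8001.
ALTER SYSTEM SET pgraft.listen_address = '10.0.0.1';  -- example address
ALTER SYSTEM SET pgraft.listen_port = '8001';
-- Nodes 2 and 3 would use 8002 and 8003 respectively.
```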
### 2. Message Types
- RequestVote: Candidate requesting votes during elections
- AppendEntries: Leader sending log entries and heartbeats
- InstallSnapshot: Leader sending snapshot to catch up slow followers
- Heartbeat: Regular leader-to-follower communication
## Failure Scenarios and Recovery

### 1. Leader Failure

When the leader stops sending heartbeats, followers reach their election timeout, start a new election, and elect a new leader by majority vote (see the leader election sequence above). Writes are unavailable until the new leader is established.

### 2. Network Partition
```
Full Connectivity → Network Split ──┬─→ Partition A (majority) → Continues Operation
                                    └─→ Partition B (minority) → Stops Accepting Writes
```
### 3. Node Recovery

A restarted or reconnected node rejoins the cluster as a follower; the leader brings it back up to date with AppendEntries, or with InstallSnapshot if the node has fallen too far behind the log.

## Security Considerations

### 1. Network Security
- TCP connections between peers
- Configurable IP addresses and ports
- No built-in encryption (relies on network-level security)
### 2. Access Control
- PostgreSQL's native authentication
- Extension functions require appropriate privileges
- Shared memory access controlled by PostgreSQL
## Performance Characteristics

### 1. Latency
- Leader election: ~1-5 seconds (configurable)
- Log replication: Network RTT + disk I/O
- Heartbeat interval: 1 second (configurable)
### 2. Throughput
- Single leader handles all writes
- Followers can serve read-only queries
- Log replication limited by network bandwidth
### 3. Scalability
- Optimal with 3-5 nodes
- More nodes increase election time
- Network partitions affect availability
## Configuration Parameters

### 1. Network Settings

- `pgraft.listen_address`: IP address to bind
- `pgraft.listen_port`: Port for Raft communication
- `pgraft.peer_timeout`: Network timeout for peer connections
### 2. Raft Parameters

- `pgraft.heartbeat_interval`: Heartbeat frequency (ms)
- `pgraft.election_timeout`: Election timeout range (ms)
- `pgraft.max_log_entries`: Maximum log entries per batch
### 3. Operational Settings

- `pgraft.cluster_name`: Unique cluster identifier
- `pgraft.debug_enabled`: Enable debug logging
- `pgraft.health_period_ms`: Health check frequency
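Assuming these are ordinary PostgreSQL GUCs set per node (in `postgresql.conf` or via `ALTER SYSTEM`), a configuration sketch might look like the following; the values are illustrative, and defaults, units, and whether a reload or restart is needed should be verified for your pgraft version:

```sql
-- Operational identity
ALTER SYSTEM SET pgraft.cluster_name = 'pgraft_demo';

-- Raft timing: keep the election timeout several times the heartbeat interval.
ALTER SYSTEM SET pgraft.heartbeat_interval = '1000';  -- ms
ALTER SYSTEM SET pgraft.election_timeout = '5000';    -- ms
ALTER SYSTEM SET pgraft.max_log_entries = '1000';     -- entries per batch

-- Networking and health checks
ALTER SYSTEM SET pgraft.peer_timeout = '5000';        -- ms (assumed unit)
ALTER SYSTEM SET pgraft.health_period_ms = '1000';

-- Troubleshooting
ALTER SYSTEM SET pgraft.debug_enabled = 'on';

SELECT pg_reload_conf();
```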
## Monitoring and Observability

### 1. Cluster Health
- Leader election status
- Node membership status
- Network connectivity
- Log replication lag
### 2. Performance Metrics
- Command processing latency
- Network message rates
- Memory usage
- Background worker status
### 3. Logging
- Raft protocol events
- Network communication
- Error conditions
- Performance statistics
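Independent of pgraft-specific tooling, the background worker and debug logging can be checked with plain PostgreSQL facilities. A hedged sketch (the name the worker registers under in `pg_stat_activity` is an assumption):

```sql
-- Is the pgraft background worker running on this node?
SELECT pid, backend_type, state
FROM pg_stat_activity
WHERE backend_type ILIKE '%pgraft%';  -- assumed worker name

-- Enable pgraft's debug logging (GUC from the Configuration Parameters section).
ALTER SYSTEM SET pgraft.debug_enabled = 'on';
SELECT pg_reload_conf();
```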
## Deployment Considerations

### 1. Hardware Requirements
- Sufficient RAM for shared memory
- Network bandwidth for replication
- Disk I/O for log persistence
- CPU for consensus processing
### 2. Network Requirements
- Low-latency network between nodes
- Reliable network connectivity
- Sufficient bandwidth for replication
- Firewall configuration for peer ports
### 3. PostgreSQL Configuration
- Shared memory allocation
- Background worker limits
- Connection limits
- Logging configuration
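A hedged sketch of what these points usually translate to at the server level; whether pgraft must appear in `shared_preload_libraries` and how much worker headroom it needs are assumptions to verify against the installation guide (both settings take effect only after a restart):

```sql
-- Load the extension's library at server start so it can register its
-- shared memory segment and background worker (assumed requirement).
ALTER SYSTEM SET shared_preload_libraries = 'pgraft';

-- Leave headroom for pgraft's background worker alongside other workers.
ALTER SYSTEM SET max_worker_processes = '16';
```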
This architecture provides a robust foundation for distributed PostgreSQL clusters with automatic failover, consistent replication, and high availability.