
pgraft System Architecture

Overview

pgraft integrates the Raft consensus algorithm into PostgreSQL, turning a set of PostgreSQL instances into a distributed consensus cluster. This document describes the overall system architecture, component interactions, and operational flows.

System Components

1. PostgreSQL Cluster Nodes

Each PostgreSQL instance in the cluster runs the pgraft extension and participates in the Raft consensus protocol.

2. Raft Consensus Layer

The core consensus engine implemented in Go, providing:

  • Leader election
  • Log replication
  • Cluster membership management
  • Failure detection and recovery

3. Network Communication

TCP-based peer-to-peer communication between cluster nodes for:

  • Raft protocol messages
  • Heartbeat signals
  • Log replication
  • Configuration changes

4. Shared Memory Interface

PostgreSQL shared memory used for:

  • Command queue between SQL and background worker
  • Cluster state persistence
  • Worker status tracking
  • Command status monitoring

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    PostgreSQL Cluster                          │
├─────────────────┬─────────────────┬─────────────────────────────┤
│   Node 1        │   Node 2        │   Node 3                    │
│                 │                 │                             │
│ ┌─────────────┐ │ ┌─────────────┐ │ ┌─────────────┐             │
│ │ PostgreSQL  │ │ │ PostgreSQL  │ │ │ PostgreSQL  │             │
│ │   Server    │ │ │   Server    │ │ │   Server    │             │
│ └─────────────┘ │ └─────────────┘ │ └─────────────┘             │
│         │       │         │       │         │                   │
│ ┌───────▼───────┼─────────▼───────┼─────────▼───────┐           │
│ │   pgraft      │ │   pgraft      │ │   pgraft      │           │
│ │   Extension   │ │   Extension   │ │   Extension   │           │
│ └───────┬───────┼─────────┬───────┼─────────┬───────┘           │
│         │       │         │       │         │                   │
│ ┌───────▼───────┼─────────▼───────┼─────────▼───────┐           │
│ │ Background    │ │ Background    │ │ Background    │           │
│ │   Worker      │ │   Worker      │ │   Worker      │           │
│ └───────┬───────┼─────────┬───────┼─────────┬───────┘           │
│         │       │         │       │         │                   │
│ ┌───────▼───────┼─────────▼───────┼─────────▼───────┐           │
│ │ Go Raft       │ │ Go Raft       │ │ Go Raft       │           │
│ │   Library     │ │   Library     │ │   Library     │           │
│ └───────┬───────┼─────────┬───────┼─────────┬───────┘           │
│         │       │         │       │         │                   │
└─────────┼───────┼─────────┼───────┼─────────┼───────────────────┘
          │       │         │       │         │
          └───────┼─────────┼───────┼─────────┘
                  │         │       │
          ┌───────▼─────────▼───────▼───────┐
          │        Network Layer            │
          │    (TCP Peer Communication)     │
          └─────────────────────────────────┘

Component Interaction Flow

1. Cluster Initialization

sequenceDiagram
    participant U as User
    participant N1 as Node 1
    participant N2 as Node 2
    participant N3 as Node 3

    U->>N1: SELECT pgraft_init()
    N1->>N1: Start background worker
    N1->>N1: Initialize Raft node
    N1->>N1: Start network server

    U->>N2: SELECT pgraft_add_node('node2:5433')
    N1->>N2: Connect to node 2
    N2->>N2: Start background worker
    N2->>N2: Join cluster

    U->>N3: SELECT pgraft_add_node('node3:5433')
    N1->>N3: Connect to node 3
    N3->>N3: Start background worker
    N3->>N3: Join cluster

    Note over N1,N3: Cluster formed with 3 nodes
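
Internally, the "Start background worker" step in the flow above relies on PostgreSQL's background worker facility. The C sketch below shows how an extension typically registers such a worker; the symbol names (start_consensus_worker, pgraft_worker_main) are illustrative assumptions, not pgraft's actual identifiers.

/*
 * Illustrative sketch: registering the consensus background worker, as a
 * typical PostgreSQL extension would do it.
 */
#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"

static bool
start_consensus_worker(void)
{
    BackgroundWorker        worker;
    BackgroundWorkerHandle *handle;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS;       /* worker reads/writes shared memory */
    worker.bgw_start_time = BgWorkerStart_ConsistentState;
    worker.bgw_restart_time = 5;                    /* restart 5 s after a crash */
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "pgraft");
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "pgraft_worker_main");
    snprintf(worker.bgw_name, BGW_MAXLEN, "pgraft consensus worker");
    worker.bgw_notify_pid = MyProcPid;              /* notify this backend when it starts */

    return RegisterDynamicBackgroundWorker(&worker, &handle);
}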

2. Leader Election Process

sequenceDiagram
    participant N1 as Node 1 (Leader)
    participant N2 as Node 2 (Follower)
    participant N3 as Node 3 (Follower)

    loop Heartbeat
        N1->>N2: AppendEntries (heartbeat)
        N1->>N3: AppendEntries (heartbeat)
        N2->>N1: AppendEntries Response
        N3->>N1: AppendEntries Response
    end

    Note over N1: Leader fails
    N2->>N2: Election timeout
    N2->>N3: RequestVote
    N3->>N2: Vote granted
    N2->>N2: Become leader
    N2->>N3: AppendEntries (heartbeat)
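
The election trigger in this diagram is the follower's election timeout. Conceptually, the Raft protocol defines it as sketched below; in pgraft this logic runs inside the Go Raft library, so the C code and its names are illustrative assumptions only.

/*
 * Conceptual follower-side election trigger.  ms_since_heartbeat is reset
 * whenever an AppendEntries (heartbeat) arrives from the leader.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct RaftNode
{
    uint64_t term;                  /* current Raft term */
    int      voted_for;             /* -1 if no vote cast in this term */
    int      id;                    /* this node's ID */
    long     ms_since_heartbeat;    /* time since the leader was last heard */
    long     election_timeout_ms;   /* randomized per node to avoid split votes */
} RaftNode;

/* Called periodically; returns true if this node should start an election. */
static bool
maybe_start_election(RaftNode *n)
{
    if (n->ms_since_heartbeat < n->election_timeout_ms)
        return false;               /* leader is alive, remain a follower */

    n->term += 1;                   /* become a candidate in a new term */
    n->voted_for = n->id;           /* vote for itself */
    n->ms_since_heartbeat = 0;

    /*
     * The candidate then sends RequestVote(term, candidate_id,
     * last_log_index, last_log_term) to every peer; a majority of granted
     * votes makes it the leader for this term, as in the diagram above.
     */
    return true;
}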

3. Log Replication

sequenceDiagram
    participant U as User
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2

    U->>L: INSERT/UPDATE/DELETE
    L->>L: Append to log
    L->>F1: AppendEntries (log entry)
    L->>F2: AppendEntries (log entry)
    F1->>L: AppendEntries Response
    F2->>L: AppendEntries Response
    L->>L: Commit entry
    L->>F1: AppendEntries (commit)
    L->>F2: AppendEntries (commit)
    L->>U: Transaction committed
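
The commit step above follows Raft's majority rule: the leader marks an entry committed once a majority of nodes, itself included, have appended it. The sketch below is conceptual only; pgraft's actual bookkeeping lives in the Go Raft library.

/*
 * Conceptual majority check: entry_index is committed once a majority of
 * the cluster has replicated it.  match_index[i] is the highest log index
 * known to be stored on node i (the leader counts itself).  Raft further
 * requires the entry to belong to the leader's current term before it may
 * be committed by counting replicas.
 */
#include <stdbool.h>
#include <stdint.h>

static bool
is_committed(const uint64_t match_index[], int num_nodes, uint64_t entry_index)
{
    int acks = 0;

    for (int i = 0; i < num_nodes; i++)
        if (match_index[i] >= entry_index)
            acks++;

    return acks > num_nodes / 2;    /* strict majority */
}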

Data Flow Architecture

1. Command Processing Flow

SQL Function → Command Queue → Background Worker → Go Raft Library → Network
     ↑                                                                    ↓
     └─────────── Command Status ← Shared Memory ← Raft State ←──────────┘
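
A sketch of the SQL-to-worker handoff at the start of this pipeline: the SQL function places a command descriptor into a circular buffer in shared memory, and the background worker drains it. All struct, field, and constant names here are illustrative assumptions, not pgraft's actual definitions.

/*
 * Illustrative command queue: a backend running the SQL function enqueues a
 * command descriptor; the background worker dequeues and executes it.
 */
#include "postgres.h"
#include "storage/spin.h"
#include "utils/timestamp.h"

#define CMD_QUEUE_SIZE 64
#define CMD_DATA_LEN   256

typedef struct PgraftCommand
{
    int32       type;                   /* INIT / ADD_NODE / REMOVE_NODE / LOG_APPEND */
    char        data[CMD_DATA_LEN];     /* payload, e.g. a peer address */
    TimestampTz issued_at;              /* when the SQL function queued it */
} PgraftCommand;

typedef struct PgraftCommandQueue
{
    slock_t       mutex;                /* protects head and tail */
    uint32        head;                 /* next slot the worker reads */
    uint32        tail;                 /* next slot a backend writes */
    PgraftCommand slots[CMD_QUEUE_SIZE];
} PgraftCommandQueue;

/* Called from the SQL-facing function; returns false if the queue is full. */
static bool
command_enqueue(PgraftCommandQueue *q, const PgraftCommand *cmd)
{
    bool ok = false;

    SpinLockAcquire(&q->mutex);
    if ((q->tail + 1) % CMD_QUEUE_SIZE != q->head)
    {
        q->slots[q->tail] = *cmd;
        q->tail = (q->tail + 1) % CMD_QUEUE_SIZE;
        ok = true;
    }
    SpinLockRelease(&q->mutex);

    return ok;
}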

2. Shared Memory Layout

┌─────────────────────────────────────────────────────────────┐
│                    Shared Memory                            │
├─────────────────────────────────────────────────────────────┤
│  Worker State                                               │
│  ├─ Status (IDLE/INIT/RUNNING/STOPPING/STOPPED)           │
│  ├─ Node ID, Address, Port                                 │
│  ├─ Cluster Name                                           │
│  └─ Initialization Flags                                   │
├─────────────────────────────────────────────────────────────┤
│  Command Queue (Circular Buffer)                           │
│  ├─ Command Type (INIT/ADD_NODE/REMOVE_NODE/LOG_APPEND)   │
│  ├─ Command Data                                           │
│  ├─ Timestamp                                              │
│  └─ Queue Head/Tail Pointers                               │
├─────────────────────────────────────────────────────────────┤
│  Command Status FIFO                                       │
│  ├─ Command ID                                             │
│  ├─ Status (PENDING/PROCESSING/COMPLETED/FAILED)          │
│  ├─ Error Message                                          │
│  └─ Completion Time                                        │
├─────────────────────────────────────────────────────────────┤
│  Cluster State                                             │
│  ├─ Current Leader ID                                      │
│  ├─ Current Term                                           │
│  ├─ Node Membership                                        │
│  └─ Log Statistics                                         │
└─────────────────────────────────────────────────────────────┘
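
The remaining regions of this layout, expressed as C structures for reference. The command queue itself was sketched in the previous section; as before, the names, types, and sizes are illustrative assumptions rather than pgraft's actual definitions.

/*
 * Illustrative C view of the worker-state, command-status, and cluster-state
 * regions shown above.
 */
#include "postgres.h"
#include "utils/timestamp.h"

#define PGRAFT_NAME_LEN 64
#define PGRAFT_ADDR_LEN 128
#define PGRAFT_ERR_LEN  256

typedef enum PgraftWorkerStatus
{
    WORKER_IDLE,
    WORKER_INIT,
    WORKER_RUNNING,
    WORKER_STOPPING,
    WORKER_STOPPED
} PgraftWorkerStatus;

typedef struct PgraftWorkerState
{
    PgraftWorkerStatus status;          /* lifecycle of the background worker */
    int32       node_id;
    char        address[PGRAFT_ADDR_LEN];
    int32       port;
    char        cluster_name[PGRAFT_NAME_LEN];
    bool        initialized;            /* initialization flag */
} PgraftWorkerState;

typedef struct PgraftCommandResult      /* one entry of the status FIFO */
{
    int64       command_id;
    int32       status;                 /* PENDING / PROCESSING / COMPLETED / FAILED */
    char        error_message[PGRAFT_ERR_LEN];
    TimestampTz completed_at;
} PgraftCommandResult;

typedef struct PgraftClusterState
{
    int32       leader_id;              /* current leader ID */
    uint64      current_term;           /* current Raft term */
    int32       num_nodes;              /* node membership count */
    uint64      last_log_index;         /* log statistics */
    uint64      commit_index;
} PgraftClusterState;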

Network Architecture

1. Peer-to-Peer Communication

┌─────────────┐    TCP    ┌─────────────┐    TCP    ┌─────────────┐
│   Node 1    │◄─────────►│   Node 2    │◄─────────►│   Node 3    │
│             │           │             │           │             │
│ Port: 5433  │           │ Port: 5434  │           │ Port: 5435  │
│ Raft Port:  │           │ Raft Port:  │           │ Raft Port:  │
│    8001     │           │    8002     │           │    8003     │
└─────────────┘           └─────────────┘           └─────────────┘

2. Message Types

  • RequestVote: Candidate requesting votes during elections
  • AppendEntries: Leader sending log entries and heartbeats
  • InstallSnapshot: Leader sending snapshot to catch up slow followers
  • Heartbeat: Periodic AppendEntries with no log entries, sent by the leader to assert leadership
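
The fields carried by these messages are defined by the Raft protocol itself. The sketch below lists them conceptually; pgraft's actual wire format is produced by the Go Raft library, so this is not its on-the-wire layout.

/*
 * Conceptual Raft message types and the fields the protocol attaches to an
 * AppendEntries RPC.  A heartbeat is an AppendEntries with num_entries == 0.
 */
#include <stddef.h>
#include <stdint.h>

typedef enum RaftMessageType
{
    MSG_REQUEST_VOTE,       /* candidate soliciting votes */
    MSG_APPEND_ENTRIES,     /* log replication; empty payload acts as a heartbeat */
    MSG_INSTALL_SNAPSHOT    /* snapshot transfer for followers that are far behind */
} RaftMessageType;

typedef struct RaftAppendEntries
{
    uint64_t    term;            /* leader's current term */
    uint64_t    leader_id;       /* lets followers redirect clients to the leader */
    uint64_t    prev_log_index;  /* index of the entry preceding the new ones */
    uint64_t    prev_log_term;   /* term of that preceding entry */
    uint64_t    leader_commit;   /* leader's commit index */
    size_t      num_entries;     /* zero for a pure heartbeat */
    const void *entries;         /* serialized log entries */
} RaftAppendEntries;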

Failure Scenarios and Recovery

1. Leader Failure

Normal Operation → Leader Fails → Election Timeout → New Election → New Leader

2. Network Partition

Full Connectivity → Network Split
    ├─→ Partition A (majority) → Continues Operation
    └─→ Partition B (minority) → Stops Accepting Writes

3. Node Recovery

Node Down → Node Restarts → Joins Cluster → Catches Up Log → Active Participant

Security Considerations

1. Network Security

  • TCP connections between peers
  • Configurable IP addresses and ports
  • No built-in encryption (relies on network-level security)

2. Access Control

  • PostgreSQL's native authentication
  • Extension functions require appropriate privileges
  • Shared memory access controlled by PostgreSQL

Performance Characteristics

1. Latency

  • Leader election: ~1-5 seconds (configurable)
  • Log replication: Network RTT + disk I/O
  • Heartbeat interval: 1 second (configurable)

2. Throughput

  • Single leader handles all writes
  • Followers can serve read-only queries
  • Log replication limited by network bandwidth

3. Scalability

  • Optimal with 3-5 nodes
  • More nodes increase quorum size, election traffic, and replication overhead
  • Network partitions affect availability

Configuration Parameters

1. Network Settings

  • pgraft.listen_address: IP address to bind
  • pgraft.listen_port: Port for Raft communication
  • pgraft.peer_timeout: Network timeout for peer connections

2. Raft Parameters

  • pgraft.heartbeat_interval: Heartbeat frequency (ms)
  • pgraft.election_timeout: Election timeout range (ms)
  • pgraft.max_log_entries: Maximum log entries per batch

3. Operational Settings

  • pgraft.cluster_name: Unique cluster identifier
  • pgraft.debug_enabled: Enable debug logging
  • pgraft.health_period_ms: Health check frequency
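
These settings would typically be registered as custom GUC variables in the extension's _PG_init(). The sketch below uses the parameter names listed above; the defaults, ranges, and GUC contexts shown are illustrative assumptions, not pgraft's actual values.

/*
 * Sketch of GUC registration for three of the parameters listed above.
 */
#include "postgres.h"
#include "fmgr.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

static int   heartbeat_interval_ms = 1000;
static int   election_timeout_ms = 5000;
static char *cluster_name = NULL;

void _PG_init(void);

void
_PG_init(void)
{
    DefineCustomIntVariable("pgraft.heartbeat_interval",
                            "Heartbeat frequency in milliseconds.",
                            NULL,
                            &heartbeat_interval_ms,
                            1000, 10, 60000,
                            PGC_SIGHUP, 0,
                            NULL, NULL, NULL);

    DefineCustomIntVariable("pgraft.election_timeout",
                            "Election timeout in milliseconds.",
                            NULL,
                            &election_timeout_ms,
                            5000, 100, 600000,
                            PGC_SIGHUP, 0,
                            NULL, NULL, NULL);

    DefineCustomStringVariable("pgraft.cluster_name",
                               "Unique cluster identifier.",
                               NULL,
                               &cluster_name,
                               "pgraft",
                               PGC_POSTMASTER, 0,
                               NULL, NULL, NULL);
}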

Monitoring and Observability

1. Cluster Health

  • Leader election status
  • Node membership status
  • Network connectivity
  • Log replication lag

2. Performance Metrics

  • Command processing latency
  • Network message rates
  • Memory usage
  • Background worker status

3. Logging

  • Raft protocol events
  • Network communication
  • Error conditions
  • Performance statistics

Deployment Considerations

1. Hardware Requirements

  • Sufficient RAM for shared memory
  • Network bandwidth for replication
  • Disk I/O for log persistence
  • CPU for consensus processing

2. Network Requirements

  • Low-latency network between nodes
  • Reliable network connectivity
  • Sufficient bandwidth for replication
  • Firewall configuration for peer ports

3. PostgreSQL Configuration

  • Shared memory allocation
  • Background worker limits
  • Connection limits
  • Logging configuration

This architecture provides a robust foundation for distributed PostgreSQL clusters with automatic failover, consistent replication, and high availability.