Monitoring¶
This page describes how to monitor your pgraft cluster for health, performance, and troubleshooting.
Quick Health Check¶
Run this query on any node to get a quick overview:
SELECT
    pgraft_is_leader() as is_leader,
    pgraft_get_term() as term,
    pgraft_get_leader() as leader_id,
    pgraft_get_worker_state() as worker;
Expected output:
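On a healthy leader node, the result looks something like this (values are illustrative):

 is_leader | term | leader_id | worker
-----------+------+-----------+---------
 t         |    5 |         1 | RUNNING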
Cluster Status¶
Get Detailed Cluster Status¶
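A minimal sketch assembled from the functions used elsewhere on this page (your pgraft version may also ship a dedicated cluster-status function that returns the same fields):

-- Composed status; note that a Candidate state is not visible through pgraft_is_leader()
SELECT
    pgraft_get_term() as term,
    pgraft_get_leader() as leader_id,
    CASE WHEN pgraft_is_leader() THEN 'Leader' ELSE 'Follower' END as state,
    (SELECT count(*) FROM pgraft_get_nodes()) as num_nodes;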
Output includes:
- Current term
- Leader ID
- Node state (Leader/Follower/Candidate)
- Number of nodes in cluster
List All Nodes¶
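SELECT * FROM pgraft_get_nodes();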
Sample output:
 node_id |   address   | port | is_leader
---------+-------------+------+-----------
       1 | 127.0.0.1   | 7001 | t
       2 | 127.0.0.1   | 7002 | f
       3 | 127.0.0.1   | 7003 | f
Worker Status¶
Check Background Worker¶
The background worker is responsible for driving the Raft consensus:
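-- Returns the worker's current state
SELECT pgraft_get_worker_state();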
Possible states:
- RUNNING: Normal operation
- STOPPED: Worker not running
- ERROR: Worker encountered an error
Troubleshoot Worker Issues¶
If worker is not running:
-- 1. Check if extension is loaded
SELECT * FROM pg_extension WHERE extname = 'pgraft';
-- 2. Check shared_preload_libraries
SHOW shared_preload_libraries;
-- Should include 'pgraft'
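If 'pgraft' is missing from shared_preload_libraries, add it and restart PostgreSQL; that setting only takes effect at server start.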
Log Monitoring¶
Get Log Statistics¶
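SELECT * FROM pgraft_log_get_stats();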
Output:
 log_size | last_index | commit_index | last_applied
----------+------------+--------------+--------------
     1000 |       1000 |          995 |          995
What to monitor:
- log_size: Total number of log entries
- last_index: Index of last log entry
- commit_index: Last committed entry
- last_applied: Last applied entry
Lag Detection: If commit_index is significantly behind last_index, followers may be lagging.
Check Replication Status¶
On the leader:
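The function name below is an assumption; list the helpers your build actually provides with \df pgraft_*:

-- Hypothetical function name; verify it exists in your pgraft version
SELECT * FROM pgraft_get_replication_status();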
Shows replication progress for each follower.
Performance Metrics¶
Key Metrics to Monitor¶
Metric | SQL Query | Normal Range | Alert If |
---|---|---|---|
Term stability | SELECT pgraft_get_term(); | Stable | Frequent changes |
Leader stability | SELECT pgraft_get_leader(); | Stable | Frequent changes |
Worker state | SELECT pgraft_get_worker_state(); | RUNNING | Not RUNNING |
Log lag | SELECT * FROM pgraft_log_get_stats(); | commit_index ≈ last_index | Large difference |
Monitoring Script¶
Create a monitoring script that runs periodically:
-- monitoring.sql
\set QUIET on
\pset format unaligned
\pset fieldsep ','
\pset tuples_only on
SELECT
    now() as timestamp,
    pgraft_is_leader() as is_leader,
    pgraft_get_term() as term,
    pgraft_get_leader() as leader_id,
    pgraft_get_worker_state() as worker_state;
\pset tuples_only off
Run it (the database name and output path below are illustrative):
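# Appends one CSV row per invocation, e.g. from cron
psql -d mydb -f monitoring.sql >> pgraft_metrics.csv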
Prometheus Metrics¶
Enable Prometheus metrics in postgresql.conf:
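The setting names below are assumptions (the port matches the default endpoint shown in the next section); check your pgraft version's configuration reference:

# Assumed GUC names; verify against your pgraft documentation
pgraft.metrics_enabled = on
pgraft.metrics_port = 9100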
Available Metrics¶
Metrics are exposed at http://node-address:9100/metrics:
- pgraft_term: Current Raft term
- pgraft_leader_id: Current leader ID
- pgraft_is_leader: 1 if leader, 0 if follower
- pgraft_log_size: Number of log entries
- pgraft_commit_index: Last committed index
- pgraft_applied_index: Last applied index
Grafana Dashboard¶
Sample Prometheus queries for Grafana:
# Leader election frequency
rate(pgraft_term[5m])
# Cluster has leader (should be 1)
max(pgraft_is_leader)
# Log lag
pgraft_log_size - pgraft_commit_index
Log Files¶
PostgreSQL Logs¶
pgraft logs to PostgreSQL's standard log:
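One way to follow pgraft messages (the log path is an assumption; check your log_directory and log_filename settings):

# Log location varies by installation
tail -f "$PGDATA"/log/postgresql-*.log | grep 'pgraft:'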
Log Levels¶
Enable or disable debug logging:
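A sketch assuming a pgraft.debug GUC; the actual setting name may differ in your version:

-- Assumed GUC name; list pgraft settings with:
--   SELECT name FROM pg_settings WHERE name LIKE 'pgraft%';
ALTER SYSTEM SET pgraft.debug = on;   -- enable
SELECT pg_reload_conf();

ALTER SYSTEM SET pgraft.debug = off;  -- disable
SELECT pg_reload_conf();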
Important Log Messages¶
Normal operation:
pgraft: Background worker started
pgraft: Raft node initialized, node_id=1
pgraft: Elected as leader in term 5
pgraft: Heartbeat sent to node 2
Warning signs:
pgraft: Election timeout, starting election
pgraft: Lost leadership, stepping down
pgraft: Failed to replicate to majority
Errors:
pgraft: Cannot add node - this node is not the leader
pgraft: Failed to persist HardState
pgraft: Network connection failed to node 2
Alerting¶
Critical Alerts¶
Set up alerts for these conditions. The expressions below are illustrative sketches built from the metrics and functions above; thresholds are assumptions to tune for your environment:
No Leader:
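# Fires when no node reports leadership
max(pgraft_is_leader) == 0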
Worker Not Running:
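-- SQL check (worker state has no Prometheus metric above); alert when true
SELECT pgraft_get_worker_state() <> 'RUNNING' AS worker_down;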
Frequent Leader Changes:
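# Several term bumps in 10 minutes means repeated elections
increase(pgraft_term[10m]) > 3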
Warning Alerts¶
Log Lag:
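# Commit index lagging far behind the log tail
pgraft_log_size - pgraft_commit_index > 100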
Term Increasing:
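# Any sustained term growth indicates election churn
increase(pgraft_term[1h]) > 0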
Health Check Endpoint¶
Create a health check function for load balancers:
CREATE OR REPLACE FUNCTION pgraft_health_check()
RETURNS json AS $$
DECLARE
    result json;
BEGIN
    SELECT json_build_object(
        'healthy', pgraft_get_worker_state() = 'RUNNING',
        'is_leader', pgraft_is_leader(),
        'leader_id', pgraft_get_leader(),
        'term', pgraft_get_term()
    ) INTO result;
    RETURN result;
END;
$$ LANGUAGE plpgsql;
-- Usage
SELECT pgraft_health_check();
Output (illustrative values from a healthy leader):
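{"healthy" : true, "is_leader" : true, "leader_id" : 1, "term" : 5}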
Monitoring Checklist¶
Daily:
- [ ] Check worker state on all nodes
- [ ] Verify leader is elected
- [ ] Check for errors in logs

Weekly:
- [ ] Review term changes (should be stable)
- [ ] Check log statistics
- [ ] Verify all nodes are in cluster

Monthly:
- [ ] Review performance metrics
- [ ] Check disk usage for log storage
- [ ] Test failover scenarios
Example Monitoring Dashboard¶
Here's a sample SQL script for a monitoring dashboard:
-- pgraft_dashboard.sql
\echo 'Cluster Overview'
\echo '================'
SELECT
    'Leader ID' as metric,
    pgraft_get_leader()::text as value
UNION ALL
SELECT
    'Current Term',
    pgraft_get_term()::text
UNION ALL
SELECT
    'Worker State',
    pgraft_get_worker_state();
\echo ''
\echo 'Cluster Nodes'
\echo '============='
SELECT * FROM pgraft_get_nodes();
\echo ''
\echo 'Log Statistics'
\echo '=============='
SELECT * FROM pgraft_log_get_stats();
\echo ''
\echo 'Node Status'
\echo '==========='
SELECT
    CASE WHEN pgraft_is_leader() THEN 'LEADER' ELSE 'FOLLOWER' END as role,
    pgraft_get_worker_state() as worker;
Run it:
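# Database name is a placeholder; adjust connection options as needed
psql -d mydb -f pgraft_dashboard.sql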