Troubleshooting¶
This page covers common issues and their solutions.
Worker Not Running¶
Symptom¶
Diagnosis¶
Check whether pgraft is listed in shared_preload_libraries:
Solution¶
- Add pgraft to postgresql.conf:
- Restart PostgreSQL:
- Verify:
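The commands for these steps did not survive on this page. As a rough sketch, assuming a local `psql` connection with default settings and `PGDATA` pointing at the data directory, the check can be scripted as follows. The helper name `has_pgraft` is hypothetical and not part of pgraft; `SHOW shared_preload_libraries` and `pg_ctl restart` are standard PostgreSQL.

```shell
# has_pgraft VALUE: succeed if "pgraft" is one of the comma-separated entries
# in a shared_preload_libraries value. (Hypothetical helper, not part of pgraft.)
has_pgraft() {
    printf '%s\n' "$1" | tr ',' '\n' | tr -d " '\"" | grep -qx 'pgraft'
}

# Typical use (needs a running server, so it is left commented out):
#   has_pgraft "$(psql -Atc 'SHOW shared_preload_libraries;')" \
#       || echo "add pgraft to shared_preload_libraries in postgresql.conf"
#   pg_ctl restart -D "$PGDATA"   # or your service manager's restart command
```

Because shared_preload_libraries is only read at server start, a reload is not enough; the restart is required.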
Cannot Add Node¶
Symptom¶
SELECT pgraft_add_node(2, '127.0.0.1', 7002);
-- Error: "Cannot add node - this node is not the leader"
Diagnosis¶
You're trying to add a node on a follower. Only the leader can add nodes.
Solution¶
- Find the leader:
- Connect to the leader node and run the command there:
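One way to script both steps, assuming all nodes run on localhost and that `pgraft_is_leader()` returns a boolean (which `psql -At` prints as `t` or `f`). The function name `find_leader_port` and the port numbers are placeholders for illustration.

```shell
# find_leader_port PORT...: print the first PostgreSQL port whose node reports
# itself as leader. Returns non-zero if none of the nodes is the leader.
find_leader_port() {
    for port in "$@"; do
        # -At prints the bare value; a boolean true is printed as "t"
        if [ "$(psql -p "$port" -Atc 'SELECT pgraft_is_leader();' 2>/dev/null)" = "t" ]; then
            echo "$port"
            return 0
        fi
    done
    return 1
}

# Example (5432-5434 stand in for your nodes' PostgreSQL ports):
#   leader_port=$(find_leader_port 5432 5433 5434) &&
#       psql -p "$leader_port" -c "SELECT pgraft_add_node(2, '127.0.0.1', 7002);"
```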
No Leader Elected¶
Symptom¶
SELECT pgraft_get_leader();
-- Returns: 0 (no leader)
SELECT * FROM pgraft_get_cluster_status();
-- Shows term 0 or very low term
Possible Causes¶
- Cluster just started: Leader election takes ~1 second
- Network issues: Nodes cannot communicate
- No quorum: Insufficient nodes for majority
- Configuration mismatch: Nodes have different cluster_ids
Solution¶
1. Wait for Election¶
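Rather than re-running the query by hand, the wait can be scripted. This is a minimal sketch that polls `pgraft_get_leader()` until it returns something other than 0 (the "no leader" value shown above); the function name `wait_for_leader` and the 10-second default are assumptions.

```shell
# wait_for_leader [TIMEOUT_SECONDS]: poll pgraft_get_leader() once per second
# until it returns a non-zero node ID or the timeout (default 10) expires.
wait_for_leader() {
    timeout=${1:-10}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        leader=$(psql -Atc 'SELECT pgraft_get_leader();' 2>/dev/null)
        if [ -n "$leader" ] && [ "$leader" != "0" ]; then
            echo "leader is node $leader"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "no leader after ${timeout}s" >&2
    return 1
}
```

If this still times out well past the expected election time, move on to the network and configuration checks.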
2. Check Network Connectivity¶
# From Node 1, test connection to Node 2's Raft port
nc -zv 127.0.0.1 7002
# Or use telnet
telnet 127.0.0.1 7002
3. Verify Cluster Configuration¶
-- On each node, check configuration
SHOW pgraft.cluster_id; -- Must be same on all nodes
SHOW pgraft.node_id; -- Must be unique per node
SHOW pgraft.address; -- Node's own address
SHOW pgraft.port; -- Raft port (not PostgreSQL port)
4. Check Logs¶
Look for errors like:
- "Connection refused"
- "Network unreachable"
- "Failed to send message"
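A quick way to scan a log file for these messages is sketched below. The helper name `count_raft_errors` is hypothetical, and the log path in the example is a placeholder; the real location depends on your `log_directory` and `log_filename` settings.

```shell
# count_raft_errors FILE: print how many lines of FILE contain one of the
# connectivity errors listed above.
count_raft_errors() {
    grep -c -E 'Connection refused|Network unreachable|Failed to send message' "$1"
}

# Example (placeholder path):
#   count_raft_errors /var/log/postgresql/postgresql-17-main.log
```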
Frequent Leader Changes¶
Symptom¶
-- Term keeps increasing rapidly
SELECT pgraft_get_term();
-- Returns: 156 (very high)
-- Different leader each time you check
SELECT pgraft_get_leader();
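A single high term value says little on its own; what matters is how fast it grows. The sketch below samples `pgraft_get_term()` twice and prints the difference. The function name `term_delta` is hypothetical, and it assumes a local `psql` connection with default settings.

```shell
# term_delta SECONDS: sample pgraft_get_term() twice, SECONDS apart, and
# print how much the term grew. A stable cluster should print 0.
term_delta() {
    before=$(psql -Atc 'SELECT pgraft_get_term();') || return 1
    sleep "$1"
    after=$(psql -Atc 'SELECT pgraft_get_term();') || return 1
    echo $((after - before))
}

# Example: term growth over one minute
#   term_delta 60
```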
Possible Causes¶
- Network instability: Packet loss or high latency
- Election timeout too low: Nodes time out before receiving heartbeats
- Node overload: Nodes too busy to respond in time
Solution¶
1. Increase Election Timeout¶
Edit postgresql.conf:
Restart PostgreSQL on all nodes.
2. Check Network Latency¶
# Measure latency between nodes
ping -c 10 node2_address
# Check packet loss
ping -c 100 node2_address | grep loss
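To use the loss figure in a script or alert, it helps to extract just the number. A small sketch follows; the helper name `packet_loss` is hypothetical, and it assumes the usual "N% packet loss" summary line printed by common ping implementations.

```shell
# packet_loss: read ping output on stdin and print the loss percentage
# (without the % sign) taken from the summary line.
packet_loss() {
    sed -n 's/.* \([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

# Example (node2_address is the placeholder used above):
#   ping -c 100 node2_address | packet_loss
```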
3. Monitor System Load¶
Node Cannot Join Cluster¶
Symptom¶
A node was added with pgraft_add_node() but does not appear in the cluster:
Diagnosis¶
1. Check if Node is Running¶
2. Check Node Configuration¶
-- On the new node
SHOW pgraft.cluster_id; -- Must match existing cluster
SHOW pgraft.node_id; -- Must match ID used in pgraft_add_node()
SHOW pgraft.address; -- Must match address used in pgraft_add_node()
SHOW pgraft.port; -- Must match port used in pgraft_add_node()
Solution¶
- Ensure node is initialized:
- Verify network connectivity:
- Check firewall rules:
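The connectivity step can be run against every Raft endpoint at once. This sketch reuses the `nc -z` probe shown earlier on this page; the function name `check_raft_ports` and the 2-second timeout are assumptions. A port that fails here while the node is running often points to a firewall rule, which has to be inspected with whatever firewall tooling your system uses.

```shell
# check_raft_ports HOST:PORT...: try a TCP connect to each Raft endpoint and
# report which ones are reachable. Returns non-zero if any probe fails.
check_raft_ports() {
    failed=0
    for peer in "$@"; do
        host=${peer%:*}
        port=${peer##*:}
        if nc -z -w 2 "$host" "$port" 2>/dev/null; then
            echo "ok   $peer"
        else
            echo "FAIL $peer"
            failed=1
        fi
    done
    return "$failed"
}

# Example (addresses follow the examples used on this page):
#   check_raft_ports 127.0.0.1:7001 127.0.0.1:7002 127.0.0.1:7003
```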
Data Directory Errors¶
Symptom¶
Solution¶
- Check directory permissions:
- Check disk space:
- Verify configuration:
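The first two checks can be combined in one helper. This is only a sketch: the name `check_data_dir` is hypothetical, the directory is whatever path your configuration points pgraft at (the setting name is not shown on this page), and the 100 MB threshold is arbitrary. Run it as the operating-system user that runs PostgreSQL so the writability test is meaningful.

```shell
# check_data_dir DIR [MIN_FREE_KB]: verify DIR exists, is writable by the
# current user, and has at least MIN_FREE_KB kilobytes free.
check_data_dir() {
    dir=$1
    min_kb=${2:-102400}   # arbitrary 100 MB default
    [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
    [ -w "$dir" ] || { echo "not writable: $dir"; return 1; }
    # df -Pk prints a header plus one POSIX-format line; column 4 is free KB
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$free_kb" -ge "$min_kb" ] || { echo "low disk space: ${free_kb} KB free"; return 1; }
    echo "ok: $dir"
}

# Example (placeholder path):
#   check_data_dir /path/to/pgraft/data
```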
Extension Won't Load¶
Symptom¶
Solution¶
- Verify extension files are installed:
- Check file permissions:
- Verify architecture compatibility:
- Check for missing dependencies:
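A sketch of the first two checks, using the directories that `pg_config` reports. The function name `list_pgraft_files` is hypothetical, and it assumes the installed files start with `pgraft` (for example `pgraft.so` and `pgraft.control`); adjust the pattern if your build names them differently.

```shell
# list_pgraft_files: list (with permissions) every file whose name starts with
# "pgraft" in the library and extension directories reported by pg_config.
# Returns non-zero if nothing is found.
list_pgraft_files() {
    libdir=$(pg_config --pkglibdir) || return 1
    extdir=$(pg_config --sharedir)/extension
    found=0
    for f in "$libdir"/pgraft* "$extdir"/pgraft*; do
        [ -e "$f" ] || continue   # skip glob patterns that matched nothing
        ls -l "$f"
        found=1
    done
    [ "$found" -eq 1 ]
}

# Follow-up checks on the shared library path printed above:
#   file /path/to/pgraft.so       # architecture should match the server binary
#   ldd /path/to/pgraft.so        # Linux: "not found" marks a missing dependency
#   otool -L /path/to/pgraft.so   # macOS counterpart of ldd
```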
Compilation Errors¶
Symptom¶
Solution¶
1. PostgreSQL Development Headers Missing¶
# Ubuntu/Debian
sudo apt-get install postgresql-server-dev-17
# CentOS/RHEL
sudo yum install postgresql17-devel
# macOS
brew install postgresql@17
2. Go Not Found¶
# Check Go installation
go version
# If not installed:
# Ubuntu/Debian
sudo apt-get install golang-go
# macOS
brew install go
3. pg_config Not in PATH¶
# Find pg_config
which pg_config
# If not found, add to PATH:
export PATH="/usr/local/pgsql/bin:$PATH"
# Or pass PG_CONFIG on the make command line
make PG_CONFIG=/path/to/pg_config
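Before rebuilding, a short preflight check can confirm that the tools mentioned in this section are on PATH. The helper name `missing_tools` is hypothetical.

```shell
# missing_tools NAME...: print each command that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example: report anything the build needs that is missing
#   missing_tools go pg_config make
```

Empty output means every listed tool was found.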
Performance Issues¶
Symptom¶
- High CPU usage
- Slow replication
- Queries taking too long
Diagnosis¶
-- Check log lag
SELECT * FROM pgraft_log_get_stats();
-- Monitor replication
SELECT * FROM pgraft_log_get_replication_status();
Solution¶
1. Tune Batch Settings¶
2. Adjust Snapshot Frequency¶
# More frequent snapshots reduce log size
pgraft.snapshot_interval = 5000
pgraft.max_log_entries = 500
3. Check System Resources¶
Split-Brain Concerns¶
Symptom¶
"I'm worried about split-brain. How do I verify protection?"
Verification¶
- Test minority partition:
# In a 3-node cluster, isolate one node
# On isolated node:
psql -c "SELECT pgraft_is_leader();" # Should be false
psql -c "SELECT pgraft_add_node(4, '127.0.0.1', 7004);" # Should fail
# On majority partition (2 nodes):
psql -c "SELECT pgraft_is_leader();" # One should be true
psql -c "SELECT pgraft_add_node(4, '127.0.0.1', 7004);" # Should succeed
- Monitor term numbers:
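Alongside the term, it is useful to confirm that all nodes name the same leader. The sketch below counts how many different non-zero leader IDs the nodes report; the function name `distinct_leaders` and the ports in the example are placeholders. Nodes may disagree briefly during an election, so treat only persistent disagreement as a warning sign.

```shell
# distinct_leaders: read one pgraft_get_leader() result per line on stdin and
# print how many different non-zero leader IDs were reported.
distinct_leaders() {
    awk '$1 != "" && $1 != "0" { seen[$1] = 1 }
         END { n = 0; for (k in seen) n++; print n }'
}

# Example (5432-5434 stand in for each node's PostgreSQL port):
#   for p in 5432 5433 5434; do
#       psql -p "$p" -Atc 'SELECT pgraft_get_leader();'
#   done | distinct_leaders     # expect 1 in a healthy cluster
```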
See Split-Brain Protection for a detailed explanation.
Debug Mode¶
Enable Debug Logging¶
This will log detailed information about:
- Raft messages
- State transitions
- Log replication
- Network events
View Debug Logs¶
Disable Debug Logging¶
Getting Help¶
If you're still experiencing issues:
- Check logs:
- Gather diagnostic information:
- Enable debug mode and reproduce the issue:
- Report the issue on the GitHub repository with:
  - pgraft version
  - PostgreSQL version
  - Operating system
  - Configuration (relevant sections of postgresql.conf)
  - Error messages and logs
  - Steps to reproduce
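Much of this information can be collected in one pass. The sketch below uses only the settings and functions that appear elsewhere on this page, plus the standard `pg_extension` catalog for the version (this assumes pgraft was installed with CREATE EXTENSION). The function name `gather_diagnostics` is hypothetical. Review the output file before attaching it to a report, since it may contain addresses you do not want to publish.

```shell
# gather_diagnostics OUTFILE: append OS, PostgreSQL, and pgraft status output
# to OUTFILE. Errors from individual queries are captured, not fatal.
gather_diagnostics() {
    out=$1
    {
        echo "== collected: $(date) =="
        echo "== OS =="; uname -a
        echo "== PostgreSQL =="; psql -Atc 'SELECT version();' 2>&1
        for q in \
            "SELECT extversion FROM pg_extension WHERE extname = 'pgraft';" \
            'SHOW shared_preload_libraries;' \
            'SHOW pgraft.cluster_id;' \
            'SHOW pgraft.node_id;' \
            'SELECT pgraft_get_leader();' \
            'SELECT pgraft_get_term();' \
            'SELECT * FROM pgraft_get_cluster_status();'
        do
            echo "== $q =="
            psql -c "$q" 2>&1
        done
    } >> "$out"
}

# Example:
#   gather_diagnostics pgraft-diagnostics.txt
```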