What is an Outage?
In Consul, an outage is a loss of quorum: the server nodes are not able to elect a leader. This can happen when a majority of nodes go offline, or when a network partition prevents them from communicating. We'll deal primarily with nodes going offline here.
For a quorum to be healthy, every node must know its neighbors. The list of neighbors is kept in the Raft peers list, stored in the consul-data-directory/raft/peers.json file, where consul-data-directory is the data directory specified when Consul starts (see below). A quorum is a majority of the servers in that list, so in a typical 3-server cluster at least 2 nodes must be listed and reachable in order to form a quorum and elect a leader on startup.
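The majority rule behind quorum can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical and are not part of Consul, and it assumes the usual Raft majority rule of floor(N/2) + 1 voters out of N.

```python
def quorum_size(num_servers: int) -> int:
    """Smallest majority of a peer set with num_servers members (assumed Raft rule)."""
    if num_servers < 1:
        raise ValueError("a peer set needs at least one server")
    return num_servers // 2 + 1


def failure_tolerance(num_servers: int) -> int:
    """How many servers can be lost while a majority still remains."""
    return num_servers - quorum_size(num_servers)


def has_quorum(num_servers: int, reachable: int) -> bool:
    """True if the reachable servers still form a majority of the peer set."""
    return reachable >= quorum_size(num_servers)


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, "servers: quorum", quorum_size(n), "tolerates", failure_tolerance(n))
```

Under this rule a 3-server cluster needs 2 servers and tolerates the loss of 1, while a 5-server cluster needs 3 and tolerates the loss of 2.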
When machines are not recoverable, outage recovery may involve manually editing the peers list to remove them.
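As a rough illustration of what such a manual edit amounts to, the sketch below removes unrecoverable servers from a peers file. It rests on assumptions that should be checked against your Consul version: that peers.json is a JSON array of "ip:port" strings (newer releases may use a different layout), and that the file is only edited while the Consul servers are stopped. The function name prune_peers and the example addresses are hypothetical.

```python
import json
from pathlib import Path


def prune_peers(peers_file, dead_addresses):
    """Rewrite peers_file without the given addresses and return the survivors.

    Assumes the file holds a JSON array of "ip:port" strings.
    """
    path = Path(peers_file)
    peers = json.loads(path.read_text())
    dead = set(dead_addresses)
    # Keep the original order of the servers that are still recoverable.
    remaining = [peer for peer in peers if peer not in dead]
    path.write_text(json.dumps(remaining))
    return remaining
```

For example, calling prune_peers on a file listing three servers with one dead address would leave the other two in the file, which is still enough to form a majority of the new, smaller peer set.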
Key players:
- Raft is the consensus protocol; it manages leader election and keeps the replicated state consistent.
- Serf manages cluster membership, such as nodes joining and leaving.
- The “peer set” is the set of all members participating in log replication. For Consul's purposes, all server nodes are in the peer set of the local datacenter.