What is an Outage?

An outage is a loss of quorum for Consul, i.e., the servers are unable to elect a leader. This can happen when a majority of servers go offline or cannot communicate with each other because of a network partition. Here we deal primarily with servers going offline.

For a quorum to be healthy, every server must know its peers. The list of peers is kept in the Raft peers list, stored in consul-data-directory/raft/peers.json, where consul-data-directory is the data directory specified when Consul starts (see below). A leader can be elected on startup only when a majority of the N servers in that list, floor(N/2) + 1 of them, are running and able to reach each other.
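The majority rule is easy to get wrong when sizing a cluster, so the short sketch below works out the arithmetic. The helper names are hypothetical and not part of Consul; the sketch only encodes the standard Raft rule that a quorum is a strict majority of the peer set.

```python
def quorum_size(n_servers: int) -> int:
    """Servers that must be up and talking to each other: a strict majority."""
    if n_servers < 1:
        raise ValueError("a peer set needs at least one server")
    return n_servers // 2 + 1


def failure_tolerance(n_servers: int) -> int:
    """How many servers can fail before quorum (and leader election) is lost."""
    return n_servers - quorum_size(n_servers)


def has_quorum(n_servers: int, n_alive: int) -> bool:
    """True if the surviving servers can still elect a leader."""
    return n_alive >= quorum_size(n_servers)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"servers={n} quorum={quorum_size(n)} tolerates={failure_tolerance(n)}")
    # servers=1 quorum=1 tolerates=0
    # servers=2 quorum=2 tolerates=0
    # servers=3 quorum=2 tolerates=1
    # servers=5 quorum=3 tolerates=2
```

Note that two servers tolerate no failures at all: losing either one leaves 1 of 2, which is not a majority. This is why odd-sized peer sets such as 3 or 5 are the usual choice.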

When failed servers cannot be recovered, outage recovery may involve manually editing the peers list to remove them.
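As a rough illustration of that manual edit, the sketch below drops unrecoverable servers from a peers.json file. It is a minimal sketch under stated assumptions, not an official Consul tool: the function name `remove_failed_peers` and the addresses are hypothetical, the Consul server agents are assumed to be stopped while the file is edited, and because the file format has varied across Consul versions, it accepts both plain `"ip:port"` strings and objects carrying an `"address"` field.

```python
import json
import tempfile
from pathlib import Path


def remove_failed_peers(peers_path, failed_addresses):
    """Hypothetical helper: rewrite peers.json without the failed servers.

    Assumes Consul is stopped on this server while the file is edited.
    """
    path = Path(peers_path)
    peers = json.loads(path.read_text())
    failed = set(failed_addresses)

    def address_of(entry):
        # Older files list plain "ip:port" strings; newer ones list
        # objects such as {"id": ..., "address": "ip:port", "non_voter": false}.
        return entry["address"] if isinstance(entry, dict) else entry

    kept = [entry for entry in peers if address_of(entry) not in failed]
    path.write_text(json.dumps(kept, indent=2) + "\n")
    return kept


if __name__ == "__main__":
    # Demo against a throwaway file, never a live data directory.
    with tempfile.TemporaryDirectory() as tmp:
        peers_file = Path(tmp) / "peers.json"
        peers_file.write_text(json.dumps(
            ["10.1.0.1:8300", "10.1.0.2:8300", "10.1.0.3:8300"]))
        print(remove_failed_peers(peers_file, ["10.1.0.3:8300"]))
        # ['10.1.0.1:8300', '10.1.0.2:8300']
```

The same edited list would need to be in place on every surviving server before they are restarted, so that they agree on the smaller peer set.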

Key players:

  • Raft manages leader election and the consistency of the replicated log.
  • Serf manages cluster membership, such as nodes joining and leaving.
  • The “peer set” is the set of all members participating in log replication. For Consul's purposes, all server nodes are in the peer set of the local datacenter.
