MainNet Outage May 20, 2019


#1

The Sovrin Main Network experienced an outage on Monday, 2019-05-20, lasting 48 minutes, from 16:45 to 17:33 UTC. This post describes the incident, how it was addressed, and what we are doing to prevent future incidents and improve our responses.

Incident Summary

  • The outage occurred directly following a regularly scheduled MainNet upgrade. The upgrade was posted at 16:45 UTC by Trustee Drummond Reed.
  • At 17:20 UTC, Evernym network monitoring alerted the Sovrin Foundation that the MainNet nodes were out of consensus. Each node reported being out of consensus.
  • One node, zaValidator, failed to upgrade the indy-node and indy-plenum packages at the time of the MainNet upgrade. Staff are unsure whether this was a cause of, an effect of, or unrelated to the outage.
  • At 17:33 UTC, Trustee Drummond Reed posted a restart transaction for all nodes on the Network, with the assistance of Sovrin Foundation staff and Mike Bailey.
  • The restart brought the nodes back into consensus.
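
The intervals implied by the timestamps above can be worked out with a short sketch. It uses only the three times stated in this summary; the variable and function names are illustrative and do not come from any Sovrin tooling.

```python
from datetime import datetime, timezone


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM time on the incident date (2019-05-20) as UTC."""
    parsed = datetime.strptime(f"2019-05-20 {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


# Timestamps taken from the incident summary.
upgrade_posted = utc("16:45")   # upgrade transaction posted
alert_received = utc("17:20")   # monitoring alert reached the Foundation
restart_posted = utc("17:33")   # restart transaction posted

time_to_detect = minutes_between(upgrade_posted, alert_received)
time_to_restart = minutes_between(alert_received, restart_posted)
total_outage = minutes_between(upgrade_posted, restart_posted)

print(f"upgrade to alert:  {time_to_detect} min")
print(f"alert to restart:  {time_to_restart} min")
print(f"total outage:      {total_outage} min")
```

Most of the outage window falls between the upgrade and the alert, so earlier detection would shorten an incident like this more than a faster restart would.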

Additional notes

  • A few nodes behaved oddly around the time of the upgrade and the outage. These behaviors may be unrelated to the issue, but are worth noting for context and further understanding.
    • zaValidator was unable to upgrade to the correct indy-node and indy-plenum versions with the rest of the nodes.
    • VeridiumIDC lost connectivity a few hours before the upgrade. It was removed from consensus prior to the upgrade.
    • The Danube node intermittently reported that “7 nodes were unreachable” while staff checked the consensus status. Each time, when staff checked again a few seconds later, it reported all nodes as reachable.
  • Initial analysis of the logs has not shown the cause of the problem.
  • Because this upgrade included the addition of the audit ledger, it could not be rolled out incrementally.

Remediation

We are taking the following actions to reduce the likelihood of these types of issues and to improve our response to them:

  • We are collecting logs from the validator nodes covering the time of the incident, and we will analyze them to determine the root cause of the problem and the proper steps to prevent it from happening again.
  • As the Indy Node consensus API has stabilized, it is becoming possible to update nodes one at a time instead of requiring the entire pool to upgrade simultaneously. We will continue our progress toward this goal.

Though we regret this outage, we are pleased that the response plan we put in place in 2018 was effective in allowing us to recover quickly.


#2

Was the network unavailable completely or in read-only mode?


#3

The nodes were responsive, but out of consensus. Thus, the network was in read-only mode.
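
Why a loss of consensus leaves reads working can be illustrated with a simplified model. This is a sketch under stated assumptions, not Indy Plenum's actual logic: it assumes the usual Byzantine fault tolerance thresholds for a pool of n validators (up to f = (n - 1) // 3 faulty nodes tolerated, and n - f nodes in agreement needed to order a write), and that any responsive node can answer a read. The pool size of 13 and all function names are hypothetical.

```python
def max_faulty(n: int) -> int:
    """Faulty nodes tolerated by a pool of n validators (assumed BFT bound)."""
    return (n - 1) // 3


def write_quorum(n: int) -> int:
    """Nodes that must agree before a write is ordered (assumed n - f)."""
    return n - max_faulty(n)


def pool_mode(n: int, responsive: int, in_agreement: int) -> str:
    """Classify the pool as read-write, read-only, or unavailable.

    responsive:   nodes currently answering requests
    in_agreement: nodes currently participating in consensus together
    """
    if responsive == 0:
        return "unavailable"
    if in_agreement >= write_quorum(n):
        return "read-write"
    return "read-only"


# Hypothetical 13-node pool: every node responds, but none are in consensus.
print(pool_mode(13, responsive=13, in_agreement=0))
# After a restart brings all nodes back into consensus.
print(pool_mode(13, responsive=13, in_agreement=13))
```

In this model the pool only becomes fully unavailable when no node responds, which is a different condition from the one described in this incident.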


#4

I think Sovrin needs to be crystal clear in the subject and opening paragraph about what the outage means. At CULedger we are very read heavy, with an estimated 10,000X more reads than writes, so a read-only outage isn’t severe. Using the term “outage” in our space is terrifying, as we immediately think of payment networks and ATMs not being available.

Hopefully, this feedback helps.