Sovrin Main Net Outage, December 2018


#1

The Sovrin Main Network experienced an outage lasting from Saturday 2018-12-08 until Tuesday 2018-12-11. This was our first significant outage of the network. This post describes the incident, how it was addressed, and what we are doing to prevent future incidents and improve our responses.

Incident Summary

  • The pool responded normally until Saturday 2018-12-08 05:24:43 UTC. However, some nodes had reported being unable to contact the primary node (Danube), so a number of INSTANCE_CHANGE messages had been broadcast to suggest a view change. These messages accumulated on all nodes, but they did not come from enough distinct nodes to trigger a view change, because the primary node could still contact the majority of the pool.
  • Logs show that some nodes were restarted during the preceding weeks, probably due to steward maintenance. As a result, their accumulated INSTANCE_CHANGE messages were flushed, leaving them with a lower count than the other nodes.
  • At 05:25:37, Danube lost connection to 16 of 24 nodes, perhaps due to network issues, which led to another round of INSTANCE_CHANGE messages being propagated.
  • As a result, some of the nodes accumulated enough INSTANCE_CHANGE messages to start a view change, but the recently restarted nodes did not. The number of nodes that started the view change was smaller than the number required to complete it.
  • The resulting split left too few nodes to reach consensus on the view change, and too few nodes still in consensus to order transactions (a sketch of the quorum arithmetic follows this list). Our consensus protocol does not handle this state automatically, as it is not clearly defined in the RBFT paper that we use as the basis of our implementation.
  • At 05:32:56, the nodes that had entered the view change recognized that they could not complete it. They made a second attempt, which was also unsuccessful. These nodes then stopped all view change-related activity, and the pool was left out of consensus.
  • Within a few minutes, Evernym’s network monitoring detected that writes could not be completed. Engineers began to triage the problem over the weekend, but were unable to make much progress until Monday morning, when we created INDY-1903 to track progress and exchange logs.
  • We found that the situation was aggravated by a bug that prevented all client communication with nodes that had a view change in progress. This cut off read transactions as well as emergency POOL_RESTART transactions. This is reported in INDY-1896.
  • On Monday 2018-12-10, we diagnosed the problem and decided that the best immediate remediation was to restart the pool. Unfortunately, INDY-1896 prevented the pool restart transaction from succeeding, so Evernym began contacting stewards to ask them to manually restart their nodes.
  • Because the nodes were not restarted in unison, the erroneous state propagated to the restarted nodes and consensus was not restored.
  • Restarting the nodes took them out of the view change state, so a pool restart transaction would now succeed. Evernym attempted to contact a trustee to issue it. Unfortunately, many of the trustees were traveling to participate in the Hyperledger Global Forum, and we were unable to reach them.
  • The pool restart transaction was successfully submitted on Tuesday 2018-12-11, and network consensus was restored by 15:27:00 UTC.
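
For readers who want to see why neither group of nodes could make progress, here is a minimal sketch of the quorum arithmetic. It is not the actual indy-plenum code, and the node counts are illustrative only; it assumes the standard BFT thresholds, where a pool of n nodes tolerates f = (n - 1) / 3 faulty nodes and needs at least n - f nodes both to complete a view change and to order transactions.

```python
# Minimal sketch (not indy-plenum code) of the quorum arithmetic behind the
# stuck state, assuming standard BFT thresholds: a pool of n nodes tolerates
# f = (n - 1) // 3 faults, and at least n - f nodes are needed both to
# complete a view change and to order transactions.

def bft_quorums(n: int):
    f = (n - 1) // 3   # maximum tolerated faulty nodes
    quorum = n - f     # nodes needed to finish a view change or order writes
    return f, quorum

def pool_state(n: int, nodes_in_view_change: int):
    f, quorum = bft_quorums(n)
    nodes_in_old_view = n - nodes_in_view_change
    return {
        "view_change_can_complete": nodes_in_view_change >= quorum,
        "old_view_can_order": nodes_in_old_view >= quorum,
    }

# Illustrative numbers only: a 25-node pool in which roughly half the nodes
# accumulated enough INSTANCE_CHANGE messages to start a view change, while
# the recently restarted nodes (whose INSTANCE_CHANGE messages had been
# flushed) did not.
print(pool_state(n=25, nodes_in_view_change=13))
# {'view_change_can_complete': False, 'old_view_can_order': False}
```

With n = 25 the quorum is 17, so a group of 13 nodes cannot complete the view change and the remaining 12 cannot order transactions, which is why the pool stayed out of consensus until it was restarted.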

Timing

This issue has existed in our implementation of the consensus protocol since the initial deployment of the network. We did not see this issue earlier due to a few factors:

  • The number of nodes in the consensus pool was recently increased, which also increased the odds of hitting this type of issue.
  • We previously tested having the primary node lose connection to the entire pool, but we had not previously tested having the primary node lose connection to a subset of the pool while simultaneously randomly restarting some of the non-primary nodes.

Remediation

We are taking the following actions to reduce the likelihood of these types of issues and to improve our response to them:

  • Now that network adoption is increasing, we will evaluate how to improve our emergency response outside of normal business hours.
  • We are improving collaboration between Evernym, which has assisted in maintaining the network, and the Sovrin Foundation in diagnosing and resolving these sorts of issues. We are also working with network trustees to improve our incident response.
  • We will improve our communication with users of the network by offering regular status updates in #general at chat.sovrin.org, using issues in the public Sovrin Networks project in the Sovrin Jira to collaborate on the fix, and publishing write-ups like this one in the Sovrin Forum. We will socialize these updates using the @sovrin_status Twitter handle.
  • INDY-1896: Fix pool restart transactions and pool reads without consensus. The fix is implemented and in testing. It will be deployed as a fast release to the Sovrin Network in 1.6.80.
  • INDY-1897: Address the underlying protocol deficiency which resulted in the view change being triggered on only a subset of nodes.
  • We will continue to improve our development practices and testing to ensure high quality releases to the Sovrin Network.
  • The view change concern we hit in this incident belongs to a class of problems that we have identified with our current implementation of the consensus protocol. We encourage you to work with us to design and implement a next-generation architecture by joining the discussion in the Sovrin Forum.

We look forward to your feedback.


#2

Richard, great job with this incident report. This is the first major outage of the Sovrin Mainnet so it’s really helpful to have this level of detail.


#3

I should clarify the statement “This was our first significant outage of the network”.

We have had other outages of Sovrin Main Net, some of which were more severe than this one. I meant to say that this was the first outage after users began to rely on it for production use cases.

As adoption of the network increases, we will need to do better at ensuring that people can depend on it. Thank you to everyone who is assisting us in that effort.