BuilderNet Incident Report

The BuilderNet experienced an outage (network unavailable) lasting 7 days. Here are the details:

On June 4th, 2019 at 16:05 UTC, the BuilderNet was upgraded per its monthly schedule. Two nodes did not upgrade correctly. The Sovrin Foundation staff contacted both organizations and helped them upgrade manually. As the second node was being updated, we discovered a bug in how nodes are taken offline. The bug caused all the nodes to get stuck in a loop of restarting Indy-node over and over. This bug took the network offline, making it unavailable.

The Foundation submitted a ticket for this bug and received instructions on the fix. We asked each of the Stewards on the BuilderNet to stop Indy-node manually and sent them a patch to fix the issue. After each node applied the patch, we sent out instructions to restart the services. It should be noted that the main reason the fix took so long was delayed responses from the Stewards. We will be taking measures to close that response gap.

The network became available again around 15:00 UTC on June 11th, 2019. A “hot fix” was then applied to the network to ensure the issue doesn’t occur again. This was completed on all the nodes at 16:45 on June 12th.

@matt, thank you for the details. One of my frustrations with the formal and casual discussions about downtime is that they don’t distinguish between read downtime and write downtime. Many of the cases where the ledger is “down” are cases where it is answering read requests just fine; it’s simply not able to write new transactions.

I think that in this case, the “unavailable” and “offline” periods that you describe were both read and write outages, not just write outages. Is that accurate?

@danielh Yes, in this case since all of the nodes were in a loop of restarting themselves several times per minute, they never came up enough to respond to any read or write requests during the entire 7 days mentioned.
You make a good point, though. For future incidents, we will try to mention whether the downtime affected reads, writes, or both.

Thanks!