At 23:30 UTC on Tuesday, September 12th, the DevOps team was alerted that synthetic transactions were failing across all network proxies in the Hedera testnet. This was confirmed and an incident was declared, as testnet is a production environment with 24/7 monitoring and support.
An investigation determined that the root cause was a shutdown command sent to all testnet consensus nodes via the testnet infrastructure control plane, which forced every node to restart at the same time. This happened even though the consensus nodes are configured not to restart automatically, precisely to avoid accidental corruption of the record stream in a scenario like this one. A simultaneous restart of the whole network was possible here, unlike on Hedera mainnet, because testnet runs entirely on Google Cloud Platform (GCP) and is provisioned centrally with HashiCorp Terraform infrastructure as code. That central control plane was therefore able to shut down every testnet node at once.
The surprise shutdown stems from a node hardware upgrade. Hedera Governing Council members are currently migrating the servers that run Hedera mainnet, because the original Hedera hardware specification is five years old and was recently updated by TechCom with more modern requirements. Mainnet hardware upgrades happen node by node during network upgrades, moving to later-generation CPUs, faster memory, newer kernel versions, and faster disks.
As such, DevOps was reprovisioning the Hedera testnet to meet the same hardware spec. Reprovisioning testnet is an entirely different process: we simply reconfigure the Google Cloud resources to match the spec, and this work is done in phases. In one of these phases, a DevOps engineer changed labels and startup scripts (instance metadata for the GCP instances). Those changes appeared to be non-destructive and were expected to be applied in place; there was no indication in our tooling that the change would restart the instances. Unfortunately, applying this change issued the shutdown command described above, which was sent to all testnet nodes hosted in GCP simultaneously.
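The general lesson of this phase is that a change a control plane presents as "in place" can still stop every instance it manages at once, so the apply step needs a gate that looks at the planned actions and the blast radius, not just at the intent of the change. As an added illustration, not code from Hedera or from this report, the script below reads the JSON form of a Terraform plan (produced with `terraform show -json tfplan`) and refuses to proceed when the plan would delete or replace resources, update attributes assumed to reboot a VM, or touch more than a configured number of instances in one apply. The sample plan embedded in it is constructed for the example; the list of reboot-sensitive attributes and the per-apply limit are assumptions to tune for your own provider and resource types, since the report does not say which attribute triggered the restart.

```python
"""Pre-apply gate for Terraform plan JSON (added example, not Hedera code).

Usage:
    terraform plan -out=tfplan
    terraform show -json tfplan > plan.json
    python plan_gate.py plan.json     # non-zero exit if the plan is refused
"""
import json
import sys
from dataclasses import dataclass

# ASSUMPTION: attributes whose in-place update is expected to stop or reboot
# the VM. The incident report does not say which attribute caused the
# restart, so treat this as a starting list to tune per provider and resource.
REBOOT_SENSITIVE = {
    "google_compute_instance": {"machine_type", "metadata_startup_script"},
}


@dataclass
class Finding:
    address: str   # resource address from the plan, or "*<type>" for fleet-wide findings
    reason: str


def destructive_actions(actions):
    """True for delete (["delete"]) and replace (["delete","create"] / ["create","delete"])."""
    return "delete" in actions


def changed_keys(change):
    """Top-level attributes whose value differs between the before and after state."""
    before = change.get("before") or {}
    after = change.get("after") or {}
    return {k for k in set(before) | set(after) if before.get(k) != after.get(k)}


def review_plan(plan, max_instances_per_apply=1):
    """Return the reasons this plan should not be applied unattended (empty list = OK)."""
    findings = []
    touched_per_type = {}
    for rc in plan.get("resource_changes", []):
        change = rc["change"]
        actions = change["actions"]
        if actions == ["no-op"] or actions == ["read"]:
            continue
        addr, rtype = rc["address"], rc["type"]
        if destructive_actions(actions):
            findings.append(Finding(addr, f"destructive actions {actions}"))
        if "update" in actions:
            hot = changed_keys(change) & REBOOT_SENSITIVE.get(rtype, set())
            if hot:
                findings.append(Finding(addr, f"in-place update of reboot-sensitive {sorted(hot)}"))
        # Blast radius: count instances of reboot-prone types being updated or removed.
        if rtype in REBOOT_SENSITIVE and ("update" in actions or "delete" in actions):
            touched_per_type[rtype] = touched_per_type.get(rtype, 0) + 1
    for rtype, n in sorted(touched_per_type.items()):
        if n > max_instances_per_apply:
            findings.append(Finding(f"*{rtype}",
                                    f"{n} instances touched in one apply (limit {max_instances_per_apply})"))
    return findings


# Constructed sample plan (not a real plan from the testnet): two consensus
# node VMs, one updated in place with a new startup script, one replaced,
# plus an unrelated firewall rule update.
SAMPLE_PLAN = {
    "format_version": "1.2",
    "resource_changes": [
        {"address": "google_compute_instance.node[0]", "type": "google_compute_instance",
         "mode": "managed",
         "change": {"actions": ["update"],
                    "before": {"labels": {"role": "consensus"},
                               "metadata_startup_script": "#!/bin/sh\necho old"},
                    "after": {"labels": {"role": "consensus", "spec": "v2"},
                              "metadata_startup_script": "#!/bin/sh\necho new"}}},
        {"address": "google_compute_instance.node[1]", "type": "google_compute_instance",
         "mode": "managed",
         "change": {"actions": ["delete", "create"],
                    "before": {"labels": {"role": "consensus"}, "machine_type": "n2-standard-8"},
                    "after": {"labels": {"role": "consensus"}, "machine_type": "n2-standard-16"}}},
        {"address": "google_compute_firewall.allow_gossip", "type": "google_compute_firewall",
         "mode": "managed",
         "change": {"actions": ["update"],
                    "before": {"description": "gossip"}, "after": {"description": "gossip (v2)"}}},
    ],
}


if __name__ == "__main__":
    plan = json.load(open(sys.argv[1])) if len(sys.argv) > 1 else SAMPLE_PLAN
    problems = review_plan(plan)
    for f in problems:
        print(f"REFUSE {f.address}: {f.reason}")
    sys.exit(1 if problems else 0)
```

On the sample plan the gate reports three findings: the in-place startup-script change on `node[0]`, the replacement of `node[1]`, and the fact that two consensus instances are touched in a single apply against a limit of one. Wiring this into the apply pipeline turns a fleet-wide restart into something that needs an explicit override, which is the check a practitioner would want in front of any central control plane that can reach every node at once.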
Note that a shutdown like this is not possible on mainnet, since there is no shared infrastructure control plane: every node is operated by an individual council member, on different clouds, hosting providers, and colocation facilities.
Half an hour into the incident, DevOps and Engineering attempted a partial start of the testnet and found that a bug in the node software caused the checksum hash of a single record file to be incorrect when a node restarted with old record files still on disk.
Engineering took two hours to narrow down the exact issue and found a workaround: delete the corrupt record files on the consensus nodes and start the network again. After the stale record files were deleted from disk, replaying from the consensus event streams produced all correct checksums, and no transactions were lost due to the outage.
The network was then successfully started, monitoring and tests confirmed resolution, and the incident was closed. Engineering and DevOps will be conducting a root cause analysis of this issue to better understand how to prevent similar issues from occurring in the future.
In addition, Hedera network developers are working on new safety features to ensure automatic record stream resilience and ledger consistency even in the face of uncontrolled network shutdowns. For example, the 0.41 release (the software version currently on testnet) enabled a new feature called the "preconsensus event stream", which automatically replays transactions after a node crash. Without this feature, the only way to prevent rollback of transactions after a full network crash would have been manual reconstruction of the ledger state; historically, recovery from a failure like this required far more manual intervention.
Hedera is committed to setting the industry standard for transparency, reporting, and business process maturity. We hope that providing this detailed report demonstrates Hedera's commitment to transparency and also highlights the network's decentralized governance: the Hedera Governing Council model ensures that individual council members operate nodes on different providers, making a simultaneous shutdown like this impossible on mainnet.