Summary
A series of data loss events occurred between April 10th and 13th at Open3DLab, following a server cluster update.
The new database management system that was recently put into use has been very reliable, but unfortunately I still had a lot to learn about monitoring it and recovering from failure scenarios. A full technical writeup of what happened is included below.
I'm really sorry for not noticing the issues sooner, and for not being able to prevent the data loss. There is nobody to blame for the data loss but myself. I will attempt to do better in the future.
What this means for you
If you signed up between April 10 and 13
Your account was likely lost in this incident. You will need to sign up for a new account. Apologies for the inconvenience.
If you uploaded anything between April 10 and 13
Your uploads were likely lost in this incident. You will need to recreate your project page, and re-upload any updated files. Apologies for the inconvenience.
If you pledged to the Patreon or Subscribestar page between April 10 and 13
Your pledge might not have synced properly as a result. If you experience issues accessing your benefits on the site, send me a message through the platform you pledged through, and I will work to have it resolved.
If you left a comment between April 10 and 13
Your comment is likely lost in this incident. Although it is debatable whether we should lament the loss of internet comments.
Background
The site runs on Kubernetes, which is a container orchestration platform. Kubernetes manages the servers and spreads the different workloads (database processes, web servers, task runners, worker queues) across them, moving workloads to other servers when a server runs out of resources. Servers are also called Nodes in this context.
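For illustration, here is a minimal sketch of how you can ask Kubernetes which node each workload ended up on, using the official Kubernetes Python client. The namespace is a placeholder, not the site's actual configuration.

```python
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

# Show each server (node) and whether it reports Ready.
for node in v1.list_node().items:
    ready = next((c.status for c in node.status.conditions if c.type == "Ready"), "Unknown")
    print(f"node {node.metadata.name}: Ready={ready}")

# Show each workload (pod) and the node it was scheduled onto.
for pod in v1.list_namespaced_pod("default").items:  # "default" is a placeholder namespace
    print(f"pod {pod.metadata.name} -> node {pod.spec.node_name} ({pod.status.phase})")
```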
The database for the site consists of one primary workload and two replicas, to which the system can automatically switch in case the primary goes down. This database setup is managed by an "operator", which is itself a workload running on these nodes. Each replica holds its own copy of the primary's data, and the primary streams any WAL changes to the replicas, which store them on their own disks.
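To make this concrete, here is a minimal sketch of how replication health can be checked, assuming the database is PostgreSQL (which is where the WAL terminology comes from). The connection details are placeholders, not the site's actual setup.

```python
# Ask the primary which replicas are streaming from it and how far behind they are.
import psycopg2

conn = psycopg2.connect(host="db-primary", dbname="postgres", user="postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT application_name,
               state,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
        FROM pg_stat_replication;
    """)
    rows = cur.fetchall()

if not rows:
    print("WARNING: no replicas are streaming from this primary")
for name, state, lag_bytes in rows:
    print(f"replica {name}: state={state}, replay lag={lag_bytes} bytes")
```

If a query like this returns no rows on the primary, nothing is receiving its WAL stream, which is roughly the situation the cluster silently ended up in after the upgrade.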
A full database backup is made every night. These backups contain the full database at a specific point in time, and are thus quite big. Making them very often requires a lot of resources, so they are usually only made once or twice per day. The risk is that if you back up your database at midnight and lose your data before the next backup, anything written in between is lost.
This is why WAL archiving exists: it is a method to track only the changes made since the last full backup. WAL archives record individual changes and have the advantage of being small, so they can easily be created and backed up continuously. You can then do a full restore from the last full database backup and replay the WAL archives that have accumulated since, to minimise the potential data loss.
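As an example of the kind of check that can catch a broken archive early, here is a minimal sketch that reads PostgreSQL's pg_stat_archiver statistics. Again, this assumes a PostgreSQL database and uses placeholder connection details.

```python
# Check whether WAL archiving is keeping up and whether any archive attempts failed.
import psycopg2

conn = psycopg2.connect(host="db-primary", dbname="postgres", user="postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT archived_count, last_archived_wal, last_archived_time,
               failed_count,   last_failed_wal,   last_failed_time
        FROM pg_stat_archiver;
    """)
    archived, last_wal, last_time, failed, failed_wal, failed_time = cur.fetchone()

print(f"{archived} WAL segments archived, most recent: {last_wal} at {last_time}")
if failed:
    print(f"WARNING: {failed} archive failures, most recent: {failed_wal} at {failed_time}")
```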
Technical explanation
Initially, a scheduled upgrade caused a series of failovers in the database system, which led to the primary and replica databases shutting down without completing the necessary Write-Ahead Logging (WAL) archives. This incomplete syncing resulted in inconsistencies when databases were later elected as primaries. The issue finally escalated when a node failure caused the system to roll back to an older state, effectively erasing several days' worth of data changes. Attempts to restore the system from backups were hindered by synchronization issues and operational failures, leading to a total data loss of approximately two and a half days. The problems were compounded by inadequate monitoring and a delayed response to the discrepancies, culminating in a full restoration from the last stable backup taken prior to the upgrade.
Timeline
April 10 - 16:36 - Kubernetes update
Kubernetes cluster upgrade from version 1.28.2 to 1.29.0 is triggered. This involves upgrading and rebooting the servers one by one, completely automatically. The different workloads running on these servers are automatically shut down and restarted on the other servers so that the disruption is minimal.
April 10 - 16:41
Last successful WAL archive completes on the primary database node.
April 10 - 16:42 - First failover happens as intended and Operator shuts down
The node the primary database is running on is scheduled to shut down. The primary database shuts down. One of the replicas on a different node is promoted to primary by the operator. The last remaining replica is receiving WAL changes from the first replica, and is in sync with it. The operator is also running on this node (the old primary's node), and is thus shut down as well.
April 10 - 16:52 - Second failover doesn't happen, WAL archive fails
The node the new primary (the first replica) is running on is now scheduled to shut down, but the old primary is not back up yet and neither is the operator. The last replica should be promoted to primary, as it is the only database instance that will remain up.
However, this does not happen, because the database operator was started up on the same node that is now being shut down and needs more time to be moved elsewhere. This means the new primary (the first replica) is shut down without failing over to the remaining replica, and without completing its last WAL archive backup.
April 10 - 16:53 - Primary re-elected, WAL replay fails
The database operator and the old primary database start on one of the already-updated nodes. The replica that is currently elected as primary (the first replica) is still waiting to be scheduled on a different node. The operator elects the old primary as the new primary and starts replaying the WAL archive from the backup location. Unfortunately this fails, because the last WAL archive created by the first replica was incomplete. The old primary is thus still slightly behind the first replica.
April 10 - 16:55 - Primary and replica diverge
The first replica comes back up around the same moment the node of the second replica goes down. The operator keeps the old primary as the primary database. However, the operator is unable to resolve the conflicts between the old primary and the first replica, as the two databases have now diverged.
The web server itself is connecting to the primary, and appears to be running fine as far as I can tell. The other nodes are still being cycled, so the site goes down a couple more times while nodes are rescheduled.
April 10 - 17:30 - Upgrade completes - issues remain unnoticed
The Kubernetes upgrade completes. The second replica never comes back up, because it is out of sync with both the primary and the first replica and cannot restore from the WAL archive. I am oblivious to the primary and first replica being out of sync. New data is being written to the primary that is not being synced to either replica. The last full backup is from this day.
April 12 - 18:30 - Second replica failure detected
I notice, completely by accident, that the second replica is failing to start up, while trying to implement better logging for something unrelated to the database.
April 12 - 18:50
I am unable to determine why the second replica isn't coming back up. The db operator shows no useful information other than that it's waiting for all replica pods to be ready.
April 12 - 18:55 - Second replica failure resolved by scaling down cluster
I decide to scale down the cluster from one primary and two replicas to one primary and one replica. The operator executes the change and removes all trace of the second replica. This clears the warnings in my UI and in my status overview. The primary and replica are still out of sync, but I fail to notice this on any of my dashboards.
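In hindsight, a check along these lines could have surfaced the divergence. This is a minimal sketch, assuming PostgreSQL, that compares the timeline IDs of two instances; after a promotion the new primary moves to a new timeline, so differing IDs are a strong hint that the histories have split. The hostnames are placeholders.

```python
import psycopg2

def current_timeline(host):
    # pg_control_checkpoint() exposes the timeline ID recorded in the instance's control file.
    with psycopg2.connect(host=host, dbname="postgres", user="postgres") as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT timeline_id FROM pg_control_checkpoint();")
            return cur.fetchone()[0]

primary_tl = current_timeline("db-primary")
replica_tl = current_timeline("db-replica")

if primary_tl != replica_tl:
    print(f"WARNING: timelines diverged (primary={primary_tl}, replica={replica_tl})")
else:
    print(f"both instances report timeline {primary_tl}")
```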
April 13 - 14:00-ish - Rollback happens
Increased memory pressure on one of the nodes forces the primary to be automatically rescheduled onto a different node. As the primary is shutting down, the replica (which is about three days behind) is elected as primary by the db operator. It attempts to replay the WAL archive from the backups, but fails. The old primary refuses to come back up, because it tries to sync with the replica first, since that is now the primary. Its state has of course diverged from the replica's, so it is now in a similar state to the second replica.
The site appears to be up, but has effectively rolled everything back to the state it was in around April 10th. The replica now resumes writing new WAL archives to the backup storage, further mixing up the different database branches.
April 13 - 16:00-ish - Rollback found, backups from that night restored
I'm notified of inconsistencies related to account signups from a day earlier. Further investigation shows several uploads from the past few days are missing.
I attempt to restore the primary, but the operator is unable to complete the switchover and appears stuck in a mixed state. The operator shuts down the one remaining database process.
Now both primary and replica are no longer starting up. The operator is no longer able to apply automatic conflict resolution. Site is down.
April 13 - 18:00 - 20:30 - Backup restore, postmortem
I bootstrap a new database cluster with the last working backup. It restores the cluster to the last known working state, some time on April 10.
Effective data loss is roughly two and a half days.
I write this postmortem and pick up the pieces of what remains of my weekend.