Summary
A series of data loss events occurred between April 10th and 13th at Open3DLab, following a server cluster update.
The new database management system that was recently put into use has been very reliable, but unfortunately I still had a lot to learn about monitoring it and recovering from failure scenarios. A full technical writeup of what happened is included below.
I'm really sorry for not noticing the issues sooner, and for not being able to prevent the data loss. There is nobody to blame for the data loss but myself. I will attempt to do better in the future.
What this means for you
If you signed up between April 10 and 13
Your account was likely lost in this incident. You will need to sign up for a new account. Apologies for the inconvenience.
If you uploaded anything between April 10 and 13
Your uploads were likely lost in this incident. You will need to recreate your project page, and re-upload any updated files. Apologies for the inconvenience.
If you pledged to the Patreon or Subscribestar page between April 10 and 13
Your pledge might not have synced properly as a result. If you experience issues accessing your benefits on the site, send me a message through the platform you pledged through, and I will work to have it resolved.
If you left a comment between April 10 and 13
Your comment is likely lost in this incident. Although it is debatable whether we should lament the loss of internet comments.
Background
The site runs on Kubernetes, which is a container orchestration platform. Kubernetes manages the servers and spreads the different workloads (database processes, web servers, task runners, worker queues) across them, moving workloads to other servers when a server runs out of resources. Servers are also called Nodes in this context.
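For illustration, here is a minimal sketch of how you can ask Kubernetes which node each workload ended up on, using the official Kubernetes Python client. The namespace is a placeholder, not the site's actual configuration.

```python
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

# Show each server (node) and whether it reports Ready.
for node in v1.list_node().items:
    ready = next((c.status for c in node.status.conditions if c.type == "Ready"), "Unknown")
    print(f"node {node.metadata.name}: Ready={ready}")

# Show each workload (pod) and the node it was scheduled onto.
for pod in v1.list_namespaced_pod("default").items:  # "default" is a placeholder namespace
    print(f"pod {pod.metadata.name} -> node {pod.spec.node_name} ({pod.status.phase})")
```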
The database for the site consists of one primary workload and two replicas, to which the system can automatically switch in case the primary goes down. This database setup is managed by an "operator", which is itself a workload running on these nodes. Each replica holds its own copy of the primary's data, and the primary streams any WAL changes to the replicas, which store them on their own disks.
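To make this concrete, here is a minimal sketch of how replication health can be checked, assuming the database is PostgreSQL (which is where the WAL terminology comes from). The connection details are placeholders, not the site's actual setup.

```python
# Ask the primary which replicas are streaming from it and how far behind they are.
import psycopg2

conn = psycopg2.connect(host="db-primary", dbname="postgres", user="postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT application_name,
               state,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
        FROM pg_stat_replication;
    """)
    rows = cur.fetchall()

if not rows:
    print("WARNING: no replicas are streaming from this primary")
for name, state, lag_bytes in rows:
    print(f"replica {name}: state={state}, replay lag={lag_bytes} bytes")
```

If a query like this returns no rows on the primary, nothing is receiving its WAL stream, which is roughly the situation the cluster silently ended up in after the upgrade.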
A full database backup is made every night. These backups contain the full database at a specific point in time, and are thus quite big. Making them very often requires a lot of resources, so they are usually only made once or twice per day. The risk is that if you back up your database at midnight and lose your data before the next backup, anything written in between is lost.
This is why WAL archiving exists: it is a method to track only the changes made since the last full backup. WAL archives record individual changes and have the advantage of being small, so they can easily be created and backed up continuously. You can then do a full restore from the last full database backup and replay the WAL archives that have accumulated since, to minimise the potential data loss.
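As an example of the kind of check that can catch a broken archive early, here is a minimal sketch that reads PostgreSQL's pg_stat_archiver statistics. Again, this assumes a PostgreSQL database and uses placeholder connection details.

```python
# Check whether WAL archiving is keeping up and whether any archive attempts failed.
import psycopg2

conn = psycopg2.connect(host="db-primary", dbname="postgres", user="postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT archived_count, last_archived_wal, last_archived_time,
               failed_count,   last_failed_wal,   last_failed_time
        FROM pg_stat_archiver;
    """)
    archived, last_wal, last_time, failed, failed_wal, failed_time = cur.fetchone()

print(f"{archived} WAL segments archived, most recent: {last_wal} at {last_time}")
if failed:
    print(f"WARNING: {failed} archive failures, most recent: {failed_wal} at {failed_time}")
```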
Technical explanation
Initially, a scheduled upgrade caused a series of failovers in the database system, which led to the primary and replica databases shutting down without completing the necessary Write-Ahead Logging (WAL) archives. This incomplete syncing resulted in inconsistencies when databases were later elected as primaries. The issue finally escalated when a node failure caused the system to roll back to an older state, effectively erasing several days' worth of data changes. Attempts to restore the system from backups were hindered by synchronization issues and operational failures, leading to a total data loss of approximately two and a half days. The problems were compounded by inadequate monitoring and a delayed response to the discrepancies, culminating in a full restoration from the last stable backup taken prior to the upgrade.
Timeline
April 10 - 16:36 - Kubernetes update
Kubernetes cluster upgrade from version 1.28.2 to 1.29.0 is triggered. This involves upgrading and rebooting the servers one by one, completely automatically. The different workloads running on these servers are automatically shut down and restarted on the other servers so that the disruption is minimal.
April 10 - 16:41
Last successful WAL archive completes on the primary database node.
April 10 - 16:42 - First failover happens as intended and Operator shuts down
The node the primary database is running on is scheduled to shut down. The primary database shuts down. One of the replicas on a different node is promoted to primary by the operator. The last remaining replica is receiving WAL changes from the first replica, and is in sync with it. The operator is also running on this node (the old primary's node), and is thus shut down as well.
April 10 - 16:52 - Second failover doesn't happen, WAL archive fails
The node the new primary (the first replica) is running on is now scheduled to shut down, but the old primary is not back up yet and neither is the operator. The last replica should be promoted to primary, as it is the only database instance that will remain up.
However, this does not happen, because the database operator was started up on the same node that is now being shut down and needs more time to be moved elsewhere. This means the new primary (the first replica) is shut down without failing over to the remaining replica, and without completing its last WAL archive backup.
April 10 - 16:53 - Primary re-elected, WAL replay fails
The database operator and the old primary database start on one of the already-updated nodes. The replica that is currently elected as primary (the first replica) is still waiting to be scheduled on a different node. The operator elects the old primary as the new primary and starts replaying the WAL archive from the backup location. Unfortunately this fails, because the last WAL archive created by the first replica was incomplete. The old primary is thus still slightly behind the first replica.
April 10 - 16:55 - Primary and replica diverge
The first replica comes back up around the same moment the node of the second replica goes down. The operator keeps the old primary as the primary database. However, the operator is unable to resolve the conflicts between the old primary and the first replica, as the two databases have now diverged.
The web server itself is connecting to the primary, and appears to be running fine as far as I can tell. The other nodes are still being cycled, so the site goes down a couple more times while nodes are rescheduled.
April 10 - 17:30 - Upgrade completes - issues remain unnoticed
The Kubernetes upgrade completes. The second replica never comes back up, because it is out of sync with both the primary and the first replica and cannot restore from the WAL archive. I am oblivious to the primary and first replica being out of sync. New data is being written to the primary that is not being synced to either replica. The last full backup is from this day.
April 12 - 18:30 - Second replica failure detected
I notice, completely by accident, that the second replica is failing to start up, while trying to implement better logging for something unrelated to the database.
April 12 - 18:50
I am unable to determine why the second replica isn't coming back up. The db operator shows no useful information other than that it's waiting for all replica pods to be ready.
April 12 - 18:55 - Second replica failure resolved by scaling down cluster
I decide to scale down the cluster from one primary and two replicas to one primary and one replica. The operator executes the change and removes all trace of the second replica. This clears the warnings in my UI and in my status overview. The primary and replica are still out of sync, but I fail to notice this on any of my dashboards.
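In hindsight, a check along these lines could have surfaced the divergence. This is a minimal sketch, assuming PostgreSQL, that compares the timeline IDs of two instances; after a promotion the new primary moves to a new timeline, so differing IDs are a strong hint that the histories have split. The hostnames are placeholders.

```python
import psycopg2

def current_timeline(host):
    # pg_control_checkpoint() exposes the timeline ID recorded in the instance's control file.
    with psycopg2.connect(host=host, dbname="postgres", user="postgres") as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT timeline_id FROM pg_control_checkpoint();")
            return cur.fetchone()[0]

primary_tl = current_timeline("db-primary")
replica_tl = current_timeline("db-replica")

if primary_tl != replica_tl:
    print(f"WARNING: timelines diverged (primary={primary_tl}, replica={replica_tl})")
else:
    print(f"both instances report timeline {primary_tl}")
```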
April 13 - 14:00-ish - Rollback happens
Increased memory pressure on one of the nodes forces the primary to be automatically rescheduled onto a different node. As the primary is shutting down, the replica (which is about three days behind) is elected as primary by the db operator. It attempts to replay the WAL archive from the backups, but fails. The old primary refuses to come back up, because it tries to sync with the replica first, since that is now the primary. Its state has of course diverged from the replica's, so it is now in a similar state to the second replica.
The site appears to be up, but has effectively rolled everything back to the state it was in around April 10th. The replica now resumes writing new WAL archives to the backup storage, further mixing up the different database branches.
April 13 - 16:00-ish - Rollback found, backups from that night restored
I'm notified of inconsistencies related to account signups from a day earlier. Further investigation shows several uploads from the past few days are missing.
I attempt to restore the primary, but the operator is unable to complete the switchover and appears stuck in a mixed state. The operator shuts down the one remaining database process.
Now both primary and replica are no longer starting up. The operator is no longer able to apply automatic conflict resolution. Site is down.
April 13 - 18:00 - 20:30 - Backup restore, postmortem
I bootstrap a new database cluster with the last working backup. It restores the cluster to the last known working state, some time on April 10.
Effective data loss is roughly two and a half days.
I write this postmortem and pick up the pieces of what remains of my weekend.