For this project, where you have 120GB of customer data, and thirty requests a second for ~8k objects (0.25MB/s object reads), you’d seem to be able to 100x the throughput vertically scaling on one machine with a file system and an SSD and never thinking about object storage. Would love to see why the complexity
> The costs have increased: renting an additional dedicated server costs more than storing ~100GB at a managed object storage service. But the improved performance and reliability are worth it.
Part of it is that it follows the object storage model, and part of it is just to lock people into AWS once they start working with it.
Also, none of them implement full S3 API and features.
Ceph certainly implements the full API spec, though it may lag behind some changes. It's mostly a question of engineering time available to the projects to keep up with changes.
We maintain a page that shows our compatibility with S3 API. It's at https://www.tigrisdata.com/docs/api/s3/. The test runner is open source at https://github.com/tigrisdata-community/s3-api-compat-tests
This is some next-level conspiracy theory stuff. What exactly would the alternative have been in 2006? S3 is one of the most commonly implemented object storage APIs around, so if the goal is lock-in, they're really bad at it.
Well, WebDAV (Document Authoring and Versioning) had been around for 8 years when AWS decided they needed a custom API. And what service provider wasn't trying to lock you into a service by providing a custom API (especially pre-GPT) when one existed already? Assuming they made the choice for a business benefit doesn't require anything close to a conspiracy theory.
And it worked as a moat until other companies and open source projects started cloning the API. See also: Microsoft.
And still need redundant backend giving it as API
this is the kind of setup that lets you actually go to bed without checking your phone every 20 minutes.
But a quick grep across versitygw tells me they don't use Sync()/fsync, so not a problem... Any data loss occurring from that is obviously not btrfs fault.
Were your users complaining about reliability and performance? If it cost more, adds more work (backup/restore management), and the users aren't happier then why make the change in the first place?
On a separate note, what tool is the final benchmark screenshot form?
versity does not include any erasure coding or replication...
I mean, I appreciate the openness about the scale, but for context, my home's personal backup managed via restic to S3 is 370GB. Fewer objects, but still, we're not talking a big install here.
This is pretty much like that story of, if it fits on your laptop, it's not big data.
At a previous job years ago, we had a service that was essentially a file server for something like 50TB of tiny files. We only backed it up once a week because just _walking_ the whole filesystem with something like `du` took more than a day. Yes, we should have simply thrown money at the problem and just bought the right solution from an enterprise storage vendor or dumped them all into S3. Unfortunately, these were not options. Blame management.
A close second would have been to rearchitect dependent services to speak S3 instead of a bespoke REST-ish API, deploy something like SeaweedFS, and call it a day. SeaweedFS handles lots of small files gracefully because it doesn't just naively store one object per file on the filesystem like most locally-hosted S3 solutions (including Versity) do. And we'd get replication/redundancy on top of it. Unfortunately, I didn't get buy-in from the other teams maintaining the dependent services ("sorry, we don't have time to refactor our code, guess that makes it a 'you' problem").
What I did instead was descend into madness. Instead of writing each file to disk, all new files were written to a "cache" directory which matched the original filesystem layout of the server. And then every hour, that directory was tarred up and archived. When a read was required, the code would check the cache first. If the file wasn't there, it would figure out which tarball was needed and extract the file from there instead. This only worked because all files had a timestamp embedded in the path. Read performance sucked, but that didn't matter because reads were very rare. But the data absolutely had to be there when needed.
Most importantly, backups took less than an hour for the first time in years.
We never had a catastrophic failure, but yeah, and hour of lost data wouldn't be the end of the world. For regular maintenance we would just coordinate with the QA team to pause testing temporarily. (Testing was only around-the-clock near a product release.)
With apologies to the SRE Book ("hope is not a strategy")... Btrfs is not a strategy.
Thanks for sharing this, I wasn't even aware of Versity S3 from my searches and discussions here. I recently migrated my projects from MinIO to Garage, but this seems like another viable option to consider.
I've been trying to give some containers (LXC/D and OCI) unprivileged access to a network-accessible ZFS filesystem and this might be what I need. Managing UID/GID through bind-mounts from the host to the container (ie NFS on host) has been trickier than I was expecting.
Remember that UID mapping on namespaces is just a facade with an offset and a range typically based on subuid[0] and subgid[1] today.
In the container `cat /proc/self/uid_map` or by looking at the pid from the host `cat /proc/$PID/uid_map` you can tell what those offsets are.
Here you know that PID =0 in the container maps to PID 1000 in the host, with a length of 1The Container PID offset of 1 maps to the host offset of 100000 for a length of 65536
With subuid/subgid you can assign ranges to the user that is instantiating the container, in the flowing I have two users that launch containers.
Assuming you pass in the host UID/GID that is how you can configure a compatible user at the entrypoint.But note that only highly trusted containers should ever really use host bind mounts, it is often much safer to use mount NFS internally.
Host bind mounts of network filesystems, if that is what you are doing, is also fragile as far as dataloss goes. I am an object store fan, but just wanted to give you the above info as it seems hard for people to find.
I would highly encourage you to look into the history of security problems with host bind mounts to see the wack-a-mole that is required with them to see if it fits in with your risk appitite. But if you choose to use them, setting up dedicated uid/gid mappings and setting the external host to the expected effective ID of the container users is a better way than using Idmapped mounts etc...
[0] https://man7.org/linux/man-pages/man5/subuid.5.html [1] https://man7.org/linux/man-pages/man5/subgid.5.html