I've been using ZFS since circa 2008, but it's a tradeoff.
If I want no-worries and I don't want to be surprised by data loss, I use ZFS.
If I want speed with data I can afford to lose, I use something else (ufs, ext4, xfs, the right FS for the job).
ZFS's integrity checking won't do a dang thing for you if you're not paying attention, don't have monitoring, or don't even run its checks. Yes, I over-provision. Yes, I'll make the reliability vs performance tradeoffs (when it makes sense, usually reliability over performance by default though).
The great thing about having options is exactly that. For me, ZFS is the right choice in most cases. For other people it's not. Being able to make an informed choice and not being forced either way is a good thing.
Zealotry/religion has no place in matters like this when there are clear tradeoffs in all the alternatives.
I wouldn't say it's the right choice in 'most cases', but it's my preferred default choice for /etc and user home directories.
The hard case is databases. ZFS has a lot to offer (convenient support at FS level for replication stands out), but it is doing a lot of things that are solved problems in the design of competently designed databases and I never trust this kind of needless complexity. Of course if you really care about performance here, you should be willing to roll up your sleeves and tune the settings of the FS and ZFS is really nice in how it allows its complex features to be switched off.
I can kind of see your point, but I trust ZFS to never lose data, and I trust (in my case) postgres to never lose data, so the only issue is performance. That varies immensely, but I mostly work on data that compresses well, so I can barely afford not to use ZFS with compression: it saves a ton of space and actually improves I/O performance. If you're I/O bound, compressing your data lets you read and write faster than the physical disks can handle, which is still wild to me. Of course, that all depends on trusting all parts of the system; if I thought that ZFS+postgres could ever lose data, or that there was a real risk of it causing an outage (say, memory exhaustion), it'd be a harder trade to make.
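For anyone wanting to try the compression setup described above, the relevant properties are set per dataset. A sketch, where the pool/dataset names are placeholders and the 8K recordsize matching postgres's page size is a common tuning suggestion, not gospel:

```shell
# Hypothetical dataset for a postgres data directory.
zfs create -o compression=lz4 -o recordsize=8k tank/pgdata

# Later, check how well the data actually compresses:
zfs get compressratio tank/pgdata

# lz4 is cheap enough that a healthy compressratio usually means a net I/O win.
```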
> I've been using ZFS since circa 2008, but it's a tradeoff.
Indeed. I was disappointed about the low quality of the article. A good article on why not ZFS would have been an interesting addition, to help users decide.
I've been using ZFS on my home NAS for over a decade and overall it's been a great experience, but as you say ZFS does have some limitations which makes it a poor fit for certain use-cases.
You say that you choose ZFS if you don't want to worry about data loss. My experience is the opposite of yours. The only time I experienced filesystem corruption (other than hardware failure) was when I upgraded to Ubuntu 21.10 on a ZFS installation. Ubuntu 21.10 released with a known bug in the ZFS implementation that caused filesystem corruption. I was disappointed to say the least.
That's absolutely incorrect. The data-loss bug was created by Ubuntu developers when they developed a bad patch for a less-severe upstream OpenZFS issue.
Dedup is silly expensive and is off by default on any sane OS.
The memory cost of dedup explains the FUD about ZFS's memory cost in general. Sure, the ARC will consume free memory, but it also evicts itself properly like any memory hog, and you can tune it.
The 'never in tree' thing is hardly ZFS's fault. ZFS is fully in-tree in FreeBSD.
"there are other choices" is a fine message. I chose to use ZFS for convenience of snapshots as a mechanism to drive backup to a cloned zfs disk I hold offline, as part of my 3-2-1. I also deliberately bought a larger memory device to scale to the burden. At work we use SSD to front for the cost of write, and we get good scale speed backing a DB and large filestores (large for us is still only terabytes, but I know of petabyte instances multi-zvol elsewhere in the world)
iX systems offered us support and we grabbed it with both hands. I have no complaints about maintenance and SLA on this product.
I have migrated zpools between Linux and BSD routinely. It doesn't depend on RAID-card-specific semantic marks, card BIOS-level config, drive order in the frame, or "quirks" in the OS beyond conformance to a flag set. If you upgrade flags you can be stuck, but we checked before upgrading.
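A sketch of that cross-OS migration flow, with `tank` as a placeholder pool name; checking feature flags before ever running `zpool upgrade` is the step that keeps you portable:

```shell
# On the source machine (Linux or FreeBSD):
zpool export tank

# Move the disks, then on the destination:
zpool import            # with no args: lists pools found on attached disks
zpool import tank

# Before upgrading, compare enabled feature flags against what the
# other OS's OpenZFS version supports:
zpool get all tank | grep feature@
```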
I have lost data in JBOD, in UFS, in EXT, in ZFS. Nothing is perfect. I have lost data in soft RAID and in hard RAID. Nothing is perfect.
> I chose to use ZFS for convenience of snapshots as a mechanism to drive backup to a cloned zfs disk I hold offline
I do something similar with Sanoid/Syncoid and Sanoid snapshots are super easy to hook into with Borgmatic so you can have an alternate backup set that has nothing to do with ZFS.
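A minimal sanoid.conf along those lines (dataset names and retention counts here are illustrative, not recommendations):

```shell
# /etc/sanoid/sanoid.conf
[tank/home]
        use_template = production

[template_production]
        hourly = 24
        daily = 30
        monthly = 3
        autosnap = yes
        autoprune = yes
```

Replication to the second pool is then a one-liner on top of those snapshots: `syncoid tank/home backup/home`.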
The very first system I ever installed ZFS on had a consumer grade motherboard and very soon after installation I realized one of the SATA ports would get flaky under load because ZFS kept spitting up errors. The controller didn't report errors and happily wrote garbage to the disk from what I could tell. So, IMO, checksumming isn't worthless and I keep all my important data on ZFS these days.
I also disagree with the layering thing. I know it's the "unix way", but chaining together a half dozen independent systems isn't something I find appealing. At that point I'm the biggest risk to my own data because the odds of me making a mistake are higher than the odds of hitting a bug that affects data integrity. I'd much rather have a single coherent interface to deal with.
It was the same thing with systemd. Everyone complained about it "taking over everything" instead of chaining a bunch of existing uncoordinated systems together, but I can't imagine going back to the old way now that I'm used to systemd. It would be nice to see a systemd-style initiative for desktop Linux.
Spot on on the layering. It is the easiest thing to criticize on ZFS, but it is an ideological point, not a very substantive one. Creating a reliable and performant storage system is possible both with well distinguished layering (LVM2+XFS) and also without it (ZFS). What ZFS lost in ideological purity, it gained in functionality and performance (fast rebuilds and send/receive, in case of corruption, more information on which files are affected).
> File-based RAID offers the promise of having to do less work and avoiding RAIDing the empty space, but in practice it is outweighed significantly by this difference.
Citation needed. I've found ZFS recovers faster and is more usable in degraded mode than an equivalent mdadm raid.
> Buggy
Compared to what? ZFS has a better record of not losing data than anything else, mdadm and ext4 included. Having a large number of bugs in your bug tracker is a poor measure of how buggy your system is.
> Scrubbing simply needs to read every file from the disk so the RAID layer notices and repairs a URE. You can simply put cat /dev/array > /dev/null on cron once a month which is enough for mdadm to notice and repair UREs.
If you do that it will swamp your disks and make that filesystem unusable once a month, and it will take longer than it should to notice UREs. And if you reinstall your OS you will probably forget to set it up again, and you won't notice until you have a disk failure and lose all your data.
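To be fair to mdadm, there is a better mechanism than `cat` that the article skips: the md `sync_action` sysfs interface, which scrubs below the filesystem and can be throttled so the array stays usable. A sketch, where `md0` and the speed cap are placeholders:

```shell
# /etc/cron.d/mdcheck - kick off a scrub on the first of each month
0 2 1 * * root echo check > /sys/block/md0/md/sync_action

# Watch progress, and the count of mismatches found:
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt

# Cap check/rebuild bandwidth (KiB/s) so the array stays responsive:
echo 50000 > /sys/block/md0/md/sync_speed_max
```

Debian-family distros ship a `checkarray` cron job that wraps exactly this interface.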
> Checksumming is usually not worthwhile - the physical disk already has CRC checksums at the SATA level, and if you are paranoid you should also have ECC ram to prevent integrity issues in-memory (applies to ZFS too), and this should be enough. But you can easily get this if you want, either at the block layer with dm-integrity (integritysetup) below your disk or btrfs does it automatically.
Checksumming is essential to having a rebuild process that actually works. With mdadm when you rebuild you will probably get silent corrupt data in a couple of files because anything that went bad since your last scrub (if your scrub setup is even working) will not be detectable.
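The failure mode is easy to demonstrate outside any filesystem: without a checksum, a flipped byte is indistinguishable from valid data. This toy sketch does at file granularity what ZFS does per block:

```shell
# Simulate silent corruption and catch it with a checksum.
tmp=$(mktemp -d)
printf 'important data' > "$tmp/file"
sum_before=$(sha256sum "$tmp/file" | cut -d' ' -f1)

# Flip one byte in place, as a flaky controller might.
printf 'X' | dd of="$tmp/file" bs=1 seek=3 conv=notrunc status=none

sum_after=$(sha256sum "$tmp/file" | cut -d' ' -f1)
[ "$sum_before" != "$sum_after" ] && echo "corruption detected"
```

mdadm's parity can tell you that the two copies disagree, but only a checksum tells you which one is right.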
> btrfs does it automatically.
btrfs doesn't have stable support for raid-like modes, and has exactly the same kind of vertical integration that this author doesn't like about ZFS.
ZFS was the thing that convinced me that layering isn't actually always great and sometimes vertical integration makes sense. The file-level RAID is a lot nicer to work with in practice. The integrated tooling is a lot nicer to work with in practice than having to manage the md, lvm, and filesystem parts separately. I believe the out-of-tree thing might be an issue for Linux, which is part of why I'm much happier running FreeBSD.
> Checksumming is essential to having a rebuild process that actually works. With mdadm when you rebuild you will probably get silent corrupt data in a couple of files because anything that went bad since your last scrub (if your scrub setup is even working) will not be detectable.
I can confirm, it happened to me. A disk was corrupting my files and mdadm had no problem propagating the errors. Later switched to ZFS, and checksum error happen (defective SATA cable IIRC). By the way, the author suggests ECC RAM "if you are paranoid", good suggestion, it goes particularly well with ZFS checksumming, and I experienced faulty RAM too.
I switched to ZFS because I actually lost data with a mdadm/ext4 system, I didn't lose data with ZFS even though I went through broken hardware.
With regards to "File-based RAID offers the promise of having to do less work and avoiding RAIDing the empty space", there has been some work on improving the performance of ZFS scrub and resilver. The original algorithm worked very well for mostly empty disks, as it only had to check written data. But mostly full disks had worse scrub performance than traditional RAID, because the data was checked in tree order, which meant a lot of seeks.
At least that is my understanding.
Anyways, see these talks from the ZFS Developer conference about how scrubbing has been improved:
> ZFS was the thing that convinced me that layering isn't actually always great and sometimes vertical integration makes sense.
The funniest thing is that ZFS is layered. It's just that 99% of people only see the final result, and don't notice the one place the abstraction leaks a little (zpool and zfs commands operate on separate layers with a bit of overlap).
ZFS is composed of separate layer for actual block devices, a layer of object-storage system on top of it called DMU (those two layers overlap a bit in handling data safety), and finally on top of the object storage there is implementation of posix-compatible filesystem called ZPL, an emulation of plain block storage called ZVOL, and there is optional LustreZFS which also operates on that layer.
There's a bit more segmentation when you dig deep into the code (a whole special I/O scheduling and processing system called ZIO, for example, which provides things like encryption and compression), but yes, ZFS as a whole is quite layered - just with different APIs in between than offered by, say, Linux (you could technically build ZPL on top of the now-removed SCSI object storage driver, for example, but it would lack things).
> btrfs doesn't have stable support for raid-like modes, and has exactly the same kind of vertical integration that this author doesn't like about ZFS.
Btrfs raid 1/10 is rock solid and offers similar robustness to ZFS, i.e. >= hardware RAID.
It is a game changer being able to know not only which side of a mirror is correct (checksum), but also which side is the same age as everything else (generation).
I was mind blown that btrfs raid could handle a disk going completely offline for hours (ssd firmware bug) and then reappearing, without any kind of re-sync needed. It didn’t even ‘cheat’ like other raids by marking the disk as offline - it kept using it (automatically fixing any inconsistency) even before I ran a scrub. Thanks to the generation values it could tell that some reads coming from the desynced disk were wrong without needing to read both disks on every read
My understanding is ZFS (or any other raid) would offline the drive in this scenario, if this is true then btrfs raid is arguably better. There is also no threshold where btrfs gives up, like what LinusTechTips experienced with ZFS
Btrfs was launched in 2009, and there is still so much meh about it. I've been using ZFS since Solaris 10 (2006) and all the features have done exactly what they say on the tin.
The degraded mounting is a valid concern. Rebooting while in a degraded state is problematic (from a ‘might lose ssh access’ standpoint, not data loss)
The other concern seems to be the auto mounting of stale disks, which is by design and works a treat
In my experience a much bigger footgun is that btrfs raid makes files with CoW disabled completely unsafe (no sync between disks even when scrubbed), and some distros and programs (systemd-journald) selectively disable CoW by default
Maybe (though even then I've heard talk of bugs), but the article mainly talks about raid5/6-like modes and those are still marked as unsafe AIUI.
> My understanding is ZFS (or any other raid) would offline the drive in this scenario, if this is true then btrfs raid is arguably better.
ZFS can certainly handle a large number of read errors (recording them but remaining running), but if you reach the point where a device node completely disappears then it won't automatically re-add it when it comes back (you have to explicitly "zpool online" or reboot, then the pool will be imported as dirty and recover). I don't know the full details of exactly what ZFS does in every scenario but to my mind having a level of error at which you offline a drive seems pretty reasonable - once a drive is completely broken it's a waste of everyone's effort to keep retrying indefinitely.
I have a small home NAS with 3x 8TB drives in ZFS RAID-Z. Everything was fine then one day my son complained that some video files were pausing in the middle when they used to work.
I checked the zpool status and saw that the system had been running on just 2 disks for probably a month due to write errors on one disk. Of the two remaining, another was having occasional errors that ZFS was correcting.
I pulled the bad drive, zeroed it out three times, and reinserted it. ZFS performed a resilver. After it was done, I pulled the other drive, zeroed it, and added it back.
The drives (SMART and ZFS) are no longer reporting any errors on any of the three drives and I only lost partial data on 4 files (which were replaceable).
Overall, I was surprised how resilient ZFS was and how easy the process was to replace 2 drives in a 3 drive array with minimal data loss.
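The replacement dance described above maps onto a handful of commands; a sketch, where the pool and device names are placeholders and `zpool status -v` is what lists any damaged files:

```shell
zpool status -v tank     # shows the degraded vdev and any files with errors
zpool offline tank sdb   # take the flaky drive out of service
# ...wipe or swap the physical drive, then:
zpool replace tank sdb   # or: zpool replace tank sdb sdd  for a new disk
zpool status tank        # watch resilver progress
zpool clear tank         # reset error counters once healthy again
```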
> Are you scrubbing once a month? If not this will happen again. Linus Media lost 100’s of TB that way.
Anybody running 100 TB+ ZFS arrays should really have been aware of this class of problem before deploying (and doubly so if using consumer drives). ZFS protects from bitrot and hardware failures... but it's not magic. If data is written once and not read again for a long time, you won't know it's been sitting on bad sectors until it's too late, and you may end up with multiple failing drives at once if you lose a disk and try to resilver. Far better to check periodically so you can throw bad disks early.
Watched the video where Linus talks about it. Apparently they store all of the footage as a nice-to-have, and to have a use case to make the large-storage content around.
All that to say yes they lost all that data, no it wasn't backed up. Not critical to the business.
This is tricky, because it's a class of problem that you need to understand exists in order to seek out the relevant information. Once you know what to look for, it's easy to find guidance - even from Oracle themselves [0]
A lot of us have "learned the hard way," as it sounds like Linus himself eventually did. I think this highlights an issue with the "learn through youtube video" approach. An internet celebrity may acquire enough knowledge to do accessible demonstrations, presenting totally valid, useful, and correct information in them, and still miss crucial "unknown unknowns" that they simply hadn't encountered in their own research.
It's hard to know what to recommend for a class of issue you aren't even aware of!
Can someone explain this? I find it a bit scary that you have to do monthly manual work on your NAS or you'll lose data.
Edit: wait, I think I get it... "Scrubbing simply needs to read every file from the disk so the RAID layer notices and repairs a URE". So it's to avoid bit rot? I have to say, as someone who has a few TB of personal data on a NAS, bit rot is a bit scary. Data backup is mainly why I bought the lifetime pCloud+encryption package during the last Black Friday. I wonder how (if?) they avoid bit rot?
That is why I think consumer NAS is an unsolved problem. I have no need for VPN, cloud photos, or a mail server etc. I need simple, reliable network storage that protects against drive failure and bit rot. Right now you need to spend at least $400 on a NAS, with lots of config, before getting it done.
It's less Linux and more whether or not you have a database/NAS-centric distribution, I think.
The problem with installing a cron job by default when installing ZFS is that for a general-purpose OS there isn't a good default for when and how often to run it. And running it at the wrong time might even be a major problem.
Though tbh, having a bad default is probably still better than no default in this case.
> lose data before noticing

Is a bit of an overstatement, as they didn't look for quite a while. They also didn't just fail to do scrubbing; they failed to set up automated health checks and reporting as well.
Turning a non NAS focused Linux distribution into a well working and tuned NAS isn't easy (compared to using a good NAS OS/distribution), but making it somewhat work is easy. Which makes this a pretty common mistake for non-specialized people (i.e. like in their case).
I believe I got default cron files when I installed ZFS on Ubuntu 21.04. It could be that they were created and I had to uncomment one line in a file. Then a scrub on the pools would run once a month. Then I set up email on the server, and every time a scrub finishes with any errors, I get an email.
Quite easy for me to set up, even though it's my first NAS that I built myself and my first time using ZFS. Very surprised that LTT effed that up, to be honest.
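For reference, the email half of that setup is ZED (the ZFS Event Daemon), configured in zed.rc; on Debian/Ubuntu the zfsutils-linux package also ships a monthly scrub cron job. A sketch (the address is obviously a placeholder):

```shell
# /etc/zfs/zed.d/zed.rc - mail on scrub/resilver completion and on errors
ZED_EMAIL_ADDR="you@example.com"
ZED_EMAIL_PROG="mail"
ZED_NOTIFY_VERBOSE=1    # also notify on scrubs that found nothing wrong

# If your distro doesn't ship a scrub job, it's one cron line:
# 0 3 1 * * root /usr/sbin/zpool scrub tank
```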
Let's start out with what the article spends the most time on: licensing.
This is a giant self-own and a very long winded way of saying "Linux will ignore the best implementation of a thing because we/they don't like the license."
That's a choice, not a law of the universe. It's perhaps also a canonical example of spite-induced nose cutting.
Let's see... other things... oh! Suggesting mdadm is better than well... anything... is a giant red flag to me. No thanks.
Dismissing checksums as unnecessary? Okay, now I think they're just trolling. Years of running ZFS teach you that bits flip, a lot... and you have no realistic recourse without block level checksums.
> "Linux will ignore the best implementation of a thing because we/they don't like the license."
> That's a choice, not a law of the universe. It's perhaps also a canonical example of spite-induced nose cutting
It isn't because they don't like the license. The licenses of linux and zfs are incompatible, so zfs is out of tree and that leads to technical problems. No, it isn't a law of the universe, but it is the law of the United States at least, and probably many other countries as well.
> To be fair, they kind of did that to themselves with the GPL
Well at this point, even if they wanted to change the license of Linux, they probably couldn't. Every contributor would have to agree to it, which is unlikely to happen, even if you ignore the logistics of asking everyone.
Not meaningfully; Linux can't change away from GPLv2 (too many copyright owners), and ZFS can never change from CDDL (again, too many owners, including Oracle), and those licenses are incompatible. The only way Linux could make it work is by freezing all the kernel APIs that ZFS uses, and while that is a choice, it would be an extremely expensive choice to give up being able to refactor anything that it touched.
I would agree with insaneirish if it was anybody but Oracle who held the CDDL-licensed code (I think IBM has ZFS-relevant patents, but no CDDL code in the OpenZFS codebase). With anyone else, I'd say the legal risks were overwrought; we're talking about two subtly mismatched copyleft licenses, so the idea that there is damage done by combining the codebases is legally weird, and this kind of scorched-earth legal strategy is insane from a market giant. But as a wise man once said, don't anthropomorphise Oracle.
Except Oracle is not the danger here: the license incompatibility is about GPL, not CDDL. I don’t think Oracle is going to sue anyone for GPL violation. And if it does, it could do that without CDDL in the picture.
Oracle also has a Linux distribution. So, I wouldn’t count them out as suing over a GPL/CDDL issue. I’d love to know who the harmed party would be in a lawsuit like that, but I’m sure that there would be no winners.
The GPL-side license trolling happened way before Oracle was in talks to buy Sun. On Dtrace team (first CDDL licensed code) afaik the expectation was that they would see Dtrace integrated pretty fast in Linux. Of course, it's all anecdata (just like the opposite view about purposeful incompatibility from another Sun employee).
> This is a giant self-own and a very long winded way of saying "Linux will ignore the best implementation of a thing because we/they don't like the license."
It has nothing to do with "dislike." It's illegal to put ZFS into the Linux kernel. Just as illegal as putting a pirate copy of Microsoft's NTFS in there.
This is because of the choice of license by the owners of ZFS code!
You're blaming Linux but Linux license was chosen long before ZFS even existed. ZFS license was specifically chosen for the purpose of keeping code out of Linux.
The incompatibility is on the GPL side, not the CDDL side, and even there it's unclear because of the "derived work" term involved. CDDL was created because GPL would artificially limit inclusion in other projects. Views on its compatibility with GPL were split at Sun: some employees, even legal counsel if one believes claims on this very site, expected CDDL code to be eligible for inclusion in GPLv2 projects like Linux, while others made rather public remarks about purposeful incompatibility, but without anything to support them.
And no, it's not illegal to put ZFS into linux kernel, because GPLv2 applies only on distribution.
No, I'm deadly serious. GPLv2 vs CDDL is a complex situation with somewhat easy results (somewhat because there are divergent opinions - so far those opinions only got clicks in media and nothing in court).
It's hard to determine whether GPLv2 would apply on distribution and on what kind of distribution, even.
But your comparison with a "pirated NTFS driver" was completely out of whack, suggesting that it's illegal to even do it personally (as would be the case with a pirated NTFS driver). GPLv2 applies only on distribution. Always did, always will.
> As an out-of-tree GPL-incompatible module, it is regularly broken by upstream changes on Linux where ZoL was discovered to be abusing GPL symbols, causing long periods of unavailability until a workaround can be found.
No it absolutely wasn't. It was using a symbol that saves the FPU state, and then that symbol was deleted in favor of a GPL symbol that does the same thing. There is only one abuse in the situation, and that is the flat-out lie that triggering an FPU state save intertwines you so deeply with the kernel that it makes your code derivative of it.
The breakage is also irrelevant as an end user if you're sticking to a LTS kernel, or using distro packaging that supports it (if Canonical never ships a kernel+ZFS that don't work, I don't care what upstream does). And if it does break... I guess I just have to roll back to the last root snapshot;D
I'm not confident that ZFS + the latest kernel is safe. If ZFS suits your use case, it's better to use Ubuntu or Proxmox (or FreeBSD!) rather than Arch.
Any blog post that espouses the virtues of btrfs as better alternative to ZFS is going to be immediately recognized as uninformed and a little embarrassing by any operator with a significant degree of experience in both technologies.
As someone who _has_ operated many systems with both filesystems - for a long time - it’s malfeasance to shepherd someone in that direction. btrfs has its uses, but there’s no comparison in terms of project maturity. Sharp edges abound (behavior at high usage, RAID immaturity, the still-extant 5.16 kernel single-core max CPU use bug, etc.)
Similarly to the author, I definitely want to be able to use btrfs. I liked features like being able to add/remove drives, and reflink=auto. But I got burned on its stability (luckily didn't lose any data) and won't look at it again for anything important for a long time.
Still use it on single drive filesystems for inline compression sometimes though.
Well, it's less ZFS and more Linux, but it is a real problem on 32-bit platforms (though there was a lot of work in ZoL to move off vmalloc()).
I switched to ZFS a decade ago. Before that I used mdadm and LVM. Before that I used hardware RAID. I also dabbled in BTRFS but didn't like it. I usually play with any new filesystem that has 5 years maturity and a big userbase, but nowadays my go-to filesystems are ext4 and ZFS.
I use ext4 mainly because it's the "blessed" filesystem of linux. It's simple, does what I want, and I have yet to be bitten by it. That's good enough for general use in situations where I don't care about file integrity.
ZFS marks the first time I've been able to breathe easy, knowing that my backups aren't storing corrupt files because of a raid write hole or hardware silent corruption that I won't discover for months or even years (likely long past any restore point on my backups).
Don't trust the hardware. At all. Hardware fails all the time, often silently, and can't be audited because the code and silicon are closed source so you have absolutely NO indication as to its quality (other than Backblaze reports [1]). You'll never know what demons lurk in those depths, but with paranoid software like ZFS, you don't have to care.
The only time I use non-ZFS filesystems is when I don't care about data integrity on that drive, such as pushbutton rebuildable server boot disks with backed up configurations (NixOS is great here), or cache/scratch drives. If it ever becomes easier to make ZFS boot disks, I'll probably start using them for server boot disks as well, just so that I can be alerted whenever a drive starts to fail.
By the way, the author sounds like he is in love with LVM and the Linux ecosystem and does not understand ZFS at all... while he also recommends btrfs on LVM for the same features.
If you do not know ZFS, do not read it - it will only put false information into your mind. If you know ZFS, then you can go read it and laugh, to improve your mood :)
It looks like you didn't read the article either, because the author already mentioned the very same links in the article.
> Growing a RAIDZ vdev by adding disks is at least coming soon. It is still a WIP as of August 2021 despite a breathless Ars Technica article about it in June.
I'm a long-time ZFS fan, and still for storage of my digital artifacts I don't really trust anything else. For backup servers, where performance isn't really critical, and snapshots are a huge benefit, it is great!
However, ~2 years ago I installed my laptop with encrypted ZFS, and frankly it kind of sucks. I don't know if this is the SIMD thing, or something else, but I can basically count on 1 full CPU running some zfs process at all times. Any apt install takes a stupid long time because it's doing something with snapshots. And I think I've only used the snapshots a couple of times. When I did though, it was handy.
But, I can't complain about my track record with ZFS: Over ~15 years I've never had data lost by it, despite having some horrible things happen to the systems that have run it.
I will fully agree that dedup is effectively useless. I've never had a system that had enough memory to reliably run dedup on any real workload.
I'll disagree that there are other tools better at doing what ZFS does though. In particular, I've had 5 events over the last year where our (normally reliable) PERC storage systems experienced corruption on LVM and RAID-6 arrays. I believe it was related to some issues with the Dell drives and Kafka broker activity on the arrays, but it sure would have been nice to have had ZFS on them instead of hardware RAID+LVM.
That sounds promising. I made the mistake of using de-dupe on ZFS before I understood the RAM requirements and performance implications ... quickly backed out.
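For anyone weighing the same decision, `zdb` can simulate dedup against existing data so you see the DDT cost before enabling anything. A sketch (pool name is a placeholder; the ~320 bytes per unique block is the commonly cited rule of thumb, not an exact figure):

```shell
# Simulate dedup on an existing pool WITHOUT enabling it:
zdb -S tank              # prints a DDT histogram and the projected dedup ratio

# If dedup is already on, inspect the real table:
zpool status -D tank

# Rough RAM estimate: unique blocks x ~320 bytes must fit in ARC,
# or every write turns into random reads of the on-disk DDT.
```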
> I don't know if this is the SIMD thing, or something else, but I can basically count on 1 full CPU running some zfs process at all times.
Based on my (limited) experience it was really bad before the whole SIMD thing got resolved. There was a huge improvement once they worked around that.
For security reasons I recently encrypted some of my ZFS home server (including the OS SSD). Throughput took a severe hit, because the CPU (Pentium G630) does not support AES-NI. However, day-to-day operations (also CPU load over time) remain virtually unchanged.
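For reference, native encryption is a per-dataset property, and whether the CPU has AES-NI is easy to check up front. A sketch (dataset name is a placeholder):

```shell
# Does this CPU have hardware AES? Empty output means slow software crypto.
grep -m1 -o aes /proc/cpuinfo

# Create an encrypted dataset (encryption=on selects the default AES-GCM cipher):
zfs create -o encryption=on -o keyformat=passphrase tank/secure

# Unlock and mount after a reboot:
zfs load-key tank/secure && zfs mount tank/secure
```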
Encryption is unrelated to any snapshotting activity. Sounds like you may have additional hooks in place. ZFS by itself does not integrate in any way with package management or anything else, really.
I don't know if it ended up implemented, but there was talk of patching Linux to re-enable the necessary symbols for ZFS. This would let it use SIMD again, and speed it up to where encryption isn't a performance hog.
It'd explain why you get different results on NixOS.
EDIT: Seems like it was, but has been removed as unnecessary on newer ZFS versions.
zsysd is an Ubuntu daemon for managing ZFS things, not something OpenZFS provides.
So if that is the only process spinning, that is probably specific to them.
(Not that OZFS native encryption doesn't have flaws; I am probably the last person on the planet to pick to argue that point. But I don't know that this is among them.)
I wonder what `sudo perf top` says it's doing (in terms of heaviest symbols). That would probably be the most useful thing to stick on the end of a pitchfork :P
(The command accepts a `-p <pid>` parameter; without that it'll profile the whole system (I make this sound much heavier than it is).)
zsysd is Ubuntu's zsys tool. It's notoriously bad at burning CPU as your number of snapshots increase. It has some pretty serious flaws as well that can result in total data loss on the local drive due to aggressive / incorrect snapshot pruning.
- My primary need for it is on my backup box, which does "rsync --inplace" to keep small changes to large files from creating a completely new copy of that file every backup run. Hardlinks would cause all of those to compete if some previously similar file started getting updated (OS updates or similar creating files that then diverge).
- Each system gets its own ZFS filesystem, so I can snapshot them at the end of the backup, keep different retention times, etc. But I can't hardlink across filesystems, I don't think.
The linked article did mention increasing dedup block size, which might help. If I could even cut down the DDT by half, that would make it more doable.
This removes copy-on-write though, which is why it's a bad idea. Editing a file in one place and having it surprise edit everywhere else is a good way to wind up with a disaster.
The single, actual reason for refusing to not use ZFS on Linux systems in general is IMO the non-mainline state.
Sure, as with other filesystems it has trade-offs and may not match all workloads or use cases, but it also has real maturity, and no native Linux file system has a feature set as thought-out and complete as ZFS's, although btrfs comes closer every release, which is nice to see. The one thing ZFS is missing is rebalancing, though there are ideas out there to solve that. None of this is a reason to single out ZFS and never consider using it.
Alternatively, fighting with a proprietary HW RAID controller risks getting one's data eaten by a simple firmware upgrade, with no way to introspect or salvage anything, because they're just a proprietary mess.
So, while installing ZFS via DKMS or the like works out, in the end it will always be a second-class citizen in the kernel, spending a lot of time fighting upstream changes and working in a parallel universe inside the kernel, and thus spending much more resources and effort.
See for example the ARC, which is allowed to use up to half of memory by default but shows up nowhere in native Linux memory accounting. Yes, specialized tools like arcstat exist, but if ZFS were actually mainlined it could benefit from working with the kernel instead of against it half of the time.
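As a small illustration of that accounting gap: the ARC's counters are exported through procfs rather than the normal memory stats, so tools end up parsing `/proc/spl/kstat/zfs/arcstats` (a real ZoL file) themselves. A hedged sketch in Python, using a hard-coded sample of the file's three-column format instead of reading the real file:

```python
def parse_arcstats(text):
    """Return {name: int} for the numeric kstat rows of arcstats."""
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        parts = line.split()
        if len(parts) == 3:
            name, _kstat_type, data = parts
            stats[name] = int(data)
    return stats

# Abbreviated sample in the real file's layout (values invented):
sample = """\
13 1 0x01 123 33456 7633297260 266763495886
name                            type data
size                            4    8589934592
c_max                           4    16954228736
"""

stats = parse_arcstats(sample)
print(stats["size"] / 2**30)  # current ARC size in GiB -> 8.0
```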
I think that if ZFS were mainlined, and sadly that seems very unlikely to happen anytime soon, a big part of what currently causes users trouble with ZFS on Linux would go away pretty quickly, just from being natively integrated.
The issue with ARC is that you can't really make it cooperate with normal linux VFS, because of how linux works. You can either break with Linux VFS (as ZFSonLinux does) or do a complete rewrite of core foundational element of Linux kernel. (No, you can't reimplement ZFS to be compatible with certain VFS expectations).
There are reasons why both kernel and ZoL teams say they are happy with not mainlining.
> The issue with ARC is that you can't really make it cooperate with normal linux VFS, because of how linux works.
That's only part of the reason, and in the simplest form you could just bypass the VFS page cache and surface the relevant arcstat properties through it.
And besides that, I see no inherent issue in solving that, at least in a much more integrated way than the status quo.
> There are reasons why both kernel and ZoL teams say they are happy with not mainlining.
Yes, but that's the license and not the caching/VFS differences.
As Linus said:
> And honestly, there is no way I can merge any of the ZFS efforts until I get
> an official letter from Oracle that is signed by their main legal counsel or
> preferably by Larry Ellison himself that says that yes, it's ok to do so and
> treat the end result as GPL'd.
>
> Other people think it can be ok to merge ZFS code into the kernel and that
> the module interface makes it ok, and that's their decision. But considering
> Oracle's litigious nature, and the questions over licensing, there's no way I
> can feel safe in ever doing so.
>
> And I'm not at all interested in some "ZFS shim layer" thing either that some
> people seem to think would isolate the two projects. That adds no value to
> our side, and given Oracle's interface copyright suits (see Java), I don't
> think it's any real licensing win either.
>
> [...] the licensing issues just make it a non-starter for me."
Oh, I haven't seen that issue. (Notice "made me reconsider" from ryao).
But pagecache is still something that isn't fixable (you can't disable VFS pagecache, it's too intertwined with the whole I/O system) - I don't see a way to implement ZFS "linux way" considering things like block sizes bigger than one page being antithetical in linux.
It's kind of dishonest to have a section "Out-of-tree and will never be mainlined" that discusses Debian and Red Hat but not Ubuntu, who have bitten the bullet of doing exactly this in-kernel maintenance the article claims doesn't exist. (Or is it ignorance? Did the author not know of this support?).
> Ubuntu ships ZFS as part of the kernel, not even as a separate loadable module. This redistribution of a combined CDDL/GPLv2 work is probably illegal.
I did and I seem to have had a reading comprehension failure and skipped that brief, crystal-clear paragraph. Ouch!
That criticism retracted, although if the author suggests Ubuntu users are under threat here, that seems kind of fanciful to me. And while Oracle are hyperaggressive, I'm struggling to see that even they would think it in their interests to go after Canonical itself.
They bought Sun, partly because they saw a revenue opportunity to go after Java's largest end user deployment on Android devices. I'm shocked they haven't gone after Canonical yet, to be honest.
The only arguments they would have are the same as in the existing SFC vs Canonical court case (SFC sued Canonical on behalf of a few Linux kernel developers).
Switched to ZFS recently for my home server, 6x8TB SATA RAIDZ2 (data) and 4x 500GB SATA SSD mirror (VMs). Above everything, resilience, predictability and peace of mind were the primary motivations. The performance is totally fine. Many argue against ZFS on SATA disks, especially SSDs, due to ZFS's impact on Total Bytes Written (TBW), but if you're not hosting an enterprise with many database writes, SATA disks are perfectly acceptable with ZFS (my observation). After one month, I have accumulated 2 1/2 TB written on the SSD mirror; at this rate the disks would still last about 7 years (this is running 25 Docker services and 5 VMs, including bigger ones such as GitLab).
I have compression and encryption on and the fio benchmarks were all good, similar to what I saw with the Hardware Raid 1.
Regarding monitoring: it is not that difficult. (1) Have ZFS mail status updates to you (e.g. this is the default in Proxmox), (2) run monthly scrubs (cron), (3) monitor TBW with (e.g.) InfluxDB. Wrote a blog post about the last part [1]: "Disk Wear (SSD) - extracted from extended SMART Attributes (Single Stat)".
All of the issues with degrading disks apply equally to other disk setups (e.g. Raid1). However, ZFS gives you better tools to predict and prevent, before disaster happens.
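The "mail me status updates" part can be sketched without Proxmox: `zpool status -x` (a real command) prints exactly "all pools are healthy" when nothing is wrong, so a cron job only needs to mail anything that deviates. A minimal check, with the health strings illustrated by hard-coded samples:

```python
def pools_unhealthy(status_output):
    """True if `zpool status -x` reported anything but a healthy system."""
    return status_output.strip() != "all pools are healthy"

# Healthy case: the exact string zpool prints when all is well.
print(pools_unhealthy("all pools are healthy\n"))      # False

# Degraded case: any real problem produces a full per-pool report instead.
degraded = "  pool: tank\n state: DEGRADED\n"
print(pools_unhealthy(degraded))                       # True
```

Wire it to cron with something like `zpool status -x | mail -E ...` (flags vary by mail implementation) so you only get mail on trouble.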
He is wrong about a few things:
1: checksumming is worthwhile; I have had silent data corruption on both SSDs and HDDs.
2: compression is also worth it: home dir 74G/103G, virtual machines dir 1.3T/2.0T.
3: zfs was never supposed to be fast; data integrity is the target.
4: zfs does not need a manual repair tool; repair is automatic and data at rest is always consistent.
5: in the future x, y, z - yeah, sure.
TL;DW: Data in their massive 1PB server was suffering from bitrot because there was no scheduled scrub to repair the bad data. And they couldn't tell how bad the situation was because without any scrubs happening, the stats on data integrity were inaccurate.
I agree, the scrubs are triggered externally, but they are enabled by default in Debian/Ubuntu via crontab every two weeks (the mdadm consistency check is also triggered from cron).
about the data loss in the video, mistakes easy to spot:
1st: using seagate.
2nd: installed by us and never updated
3rd: insufficient reading of docs before going all in on zfs.
4th: buying more seagate drives ;))
I think they had a way higher chance of losing their data going the usual stack mdadm/lvm/ext4/luks/btrfs, I think mastering those is harder than mastering zfs.
I think for Linus Media Group the main "meta" issue is that they don't have a dedicated member of staff to handle the boring day-to-day IT / sysadmin tasks that you don't make videos about. Building a crazy storage server is content for a video, so it gets done, but somebody needs to make sure it's still working and updated, and you don't make videos about routine maintenance, so it's forgotten.
Although everyone else can learn the important lesson that RAID / ZFS isn't magic: you need to have things set up correctly and monitored. And that RAID isn't a backup[1].
[1] Although if the LMG servers affected are just for data hoarding raw footage that is unlikely to be needed again, it's possible the risk / cost balance pushes away from backups and just relying on RAID, but that's a niche case (and they lost the gamble...).
Yes, their setup likely would have been configured, monitored, and maintained properly if they had an IT guy. But they made the (easy to make) mistake of thinking that having enough tech knowledge means you don't need a proper IT/systems department.
I'm certain at least that Linus knows RAID isn't backup. And I'm sure Linus is going to try hard to get the data back, but it seems to me that this isn't some devastating failure for him.
I bet you they also bought all the same make, model, and batch/vintage drives.
If you are building a storage array, do not do this. Ensure that you are using a variety of drive types (obviously same size and interface technology). Doing so guards against the danger of too many drives going wrong at the same time (within the same time window) causing a failure from which it is impossible to recover.
Glancing through, it seems ZFS is about middle of the pack of the analyzed filesystems (BetrFS, Btrfs, ext4, F2FS, XFS, ZFS) in terms of performance aging caused by change history. Only BetrFS did not show any aging in these tests and was surprisingly stable throughout. I'm curious whether that's changed in recent versions of any of the analyzed file systems; I know ZFS has had a major release recently.
"Git aging workload on btrfs on HDD. The overall slowdown is 20.6x."
Well, this article is fascinating, but maybe the reason why btrfs has built-in defrag functionality is because without it, performance degrades incredibly steeply?
ZFS has a `recordsize` parameter that can mitigate fragmentation if tuned correctly for the workload. This allows zfs to perform pretty competitively on workloads that would cripple btrfs unless one disabled copy-on-write (and therefore checksumming and compression) altogether [1].
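For the database case, that tuning usually amounts to a few dataset properties. A sketch with an illustrative dataset name (Postgres uses 8 KiB pages, so 8K-16K recordsize is the commonly suggested range; set it before loading data, since it only applies to newly written blocks):

```shell
zfs create tank/pgdata
zfs set recordsize=16K tank/pgdata    # align records with DB page I/O
zfs set compression=lz4 tank/pgdata   # cheap compression, often a net win
zfs set atime=off tank/pgdata         # skip access-time writes
```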
I can't usually do that on Btrfs either. Whenever it defragments something that's COW, it makes a new copy with 1 use, and leaves all the other snapshots/reflinks pointing at the old copy.
Seems the author hasn't done more than small experiments with zfs. Compression works very well for logs and even compresses /var/lib/docker (which usually gets quite huge as you keep using containers) to about half.
Saying to be cautious of using zfs because some day there will be an in-tree file system to replace zfs is absolute nonsense.
What happened with btrfs? People waited for a decade and still no wide deployment.
And no one wants to use zfs-fuse; everyone says its performance is a joke, yet that isn't mentioned there.
I find this article overstated. I ran some Postgres databases on ZFS for specialty needs for some time, mostly to take advantage of compression. It can be quite potent for this, though the amplification of CPU usage with a pathological access pattern can be a problem for some workloads.
I did have some problems in early versions that seemed plausibly related to its contiguous memory needs, and later I saw a release quite drastically changed this (0.7), and indeed, it was stable from then on.
I found it more cohesive to work with than other options available at the time, e.g. l2arc for caching. I didn't have need for pool dynamism, but I found it rather easy to use, though I also have experience with lvm and mdadm.
That said, some of the source code I found more incomprehensible than average, and I'm fairly used to reading linux to answer certain questions. Maybe I just needed to study it longer.
Sounds like somebody without practical experience.
E.g. Borg was worthless because data retrieval from a dedicated Hetzner SX6_ host was so slow: getting data back at less than 10MB/s was disastrous for a 10TB+ repo. (Not a bandwidth problem.)
Been there, done most of the suggestions. Most of them are impractical, regardless of what the benchmarks at Phoronix say and how often one jumps from solution to solution for certain aspects.
Still use ZFS, still IMO overall the best package for a lot of data management.
I am sorry, but there is little fact in this post. It starts with the religious crap about licensing, then inaccurate statements about symbols, statements about performance with zero proof, misunderstanding of layering, dumb statements about the memory cache and how the ARC works, some shock about dedup requiring lots of RAM (duh)... etc. It reads as "I feel this way, so here is my cherry-picked set of items (without clear understanding) to back up my view."
I'm not entirely convinced by his arguments: "oh, if you want the features of zfs you can just use this bunch of other stuff and it will come out more performant, stable and less buggy than zfs". Even if that were the case, I don't feel convinced I would be as capable of recovering from a disaster with that many moving parts.
"Unless you have ECC", of course I have ECC, it's for my file storage.
That said, there are valid points and something to be considered, especially the licensing stuff is a bit scary.
The licensing scare is overblown. You only need to worry if you want to distribute software that includes both ZFS and Linux, like Canonical does (and they decided it's worth the small risk). If you just use ZFS and Linux on your machines, there is no licensing issue whatsoever.
Dedupe is a lot less memory-intensive if you have larger files and tune your datasets' `recordsize` to 1MiB instead of the 128KiB default. It makes per-file deduplication less efficient but makes the overall process much more usable since there will be much fewer total blocks for it to keep track of. I use it to maintain my torrent seed directories while also moving/retagging the files into my own separate hierarchy.
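A back-of-the-envelope sketch of why the recordsize matters so much here. The ~320 bytes per in-core DDT entry is a commonly quoted rule of thumb, not an exact figure; the point is that DDT memory scales with block count, so 1MiB records need roughly 1/8 the RAM of 128KiB records:

```python
DDT_ENTRY_BYTES = 320  # rough per-unique-block overhead (rule of thumb)

def ddt_ram_gib(pool_bytes, recordsize):
    """Worst-case in-core DDT size, assuming every block is unique."""
    blocks = pool_bytes // recordsize
    return blocks * DDT_ENTRY_BYTES / 2**30

pool = 10 * 2**40  # a hypothetical 10 TiB pool of large files
print(ddt_ram_gib(pool, 128 * 2**10))  # 128 KiB records -> 25.0 GiB
print(ddt_ram_gib(pool, 2**20))        # 1 MiB records   -> 3.125 GiB
```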
To expand on (2): CDDL was specifically designed to make code easier to include in other projects while keeping certain protections for all involved. The issue is that GPLv2 has a somewhat complex test for whether it applies ("derivative work"), tries to push you to distribute code under GPLv2 terms, and doesn't play ball if some part of the work has requirements that exceed GPLv2's.
The specific incompatibility is, iirc, due to patent litigation protections in CDDL, which do not exist in GPLv2, meaning you can't just redistribute CDDL code under GPLv2 umbrella
I'm a long term ZoL user that's been looking over the fence at btrfs for a long time. I've been tempted to switch since I only use ZFS mirrors (RAID 5 instability isn't a deal killer). If you don't mind me asking, how long have you been using it and have you had any stability issues?
I lost a bit of data circa 2013 on a BTRFS RAID5 as a result of a power loss. I rebuilt the array as RAID1 and it's been fine ever since, despite initially using only ST3000DM001 drives (most of which are now dead). I've added and removed drives more times than I can recall (some dead or dying, some just too small or slow to keep in service), migrated the array piecemeal over to an assortment of SSDs of varying sizes, recovered from mistakes like ejecting the wrong drive while the filesystem was mounted, and switched the metadata to RAID1c3, all with reasonably low downtime and no data loss. There have definitely been times I would have appreciated the space efficiency of a reliable RAID5/6 setup, but on the other hand I've made heavy use of features that ZFS doesn't have and probably never will.
Yeah seems good. I will say, WinBtrfs is unfortunately garbage, but you can just set up a simple Linux VM as an SMB server in Hyper-V with dynamic memory, and for me it uses about 700MB of RAM. Works great.
FYI, WSL now supports mounting physical disks. So if you're running a setup like that, you can take advantage of a VM managed for you, along with not having to run Samba because files are made available in explorer (via 9P).
I tried it twice and got completely hosed drives twice somehow.
Have had zero hosed drives with ZFS. Although with ZFS not being supported in the kernel, it's fairly hard to do a ZFS-on-root setup unless you're using Ubuntu.
Ditto; I used BTRFS on OpenSUSE - y'know, the big distro that's all-in on BTRFS and that's been using it for forever, the one place where I'd expect it to work - and it ate the whole root filesystem twice. The first time I tried to recover (fsck? I forget), the second time I didn't even try and just reinstalled. In hindsight, I'm not sure why I reinstalled on BTRFS again, but it hasn't managed a third time at least:) (Now that I've written that, it will of course break)
btrfs is the default for Fedora Workstation 35 and seems to work well for normal use cases. However, it failed miserably when I tried to download the Monero blockchain, either through raw import or incremental p2p sync. The fix was simple: set the database directory to nocow, recreate the db from a backup, and all was well. But btrfs has that sharp edge users need to worry about whenever they host a large, regularly written file (say, a database) on the filesystem. XFS never had such an issue.
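The fix described above, sketched out (the directory path is illustrative). `chattr +C` only takes effect for files created after the flag is set, which is why the directory must be empty before restoring into it:

```shell
mkdir /var/lib/monero
chattr +C /var/lib/monero    # disable copy-on-write on btrfs
lsattr -d /var/lib/monero    # should now show the 'C' flag
# now restore the database from backup into the nocow directory
```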
Well, ZFS not being in the kernel is the easy reason I don't use zfs.
The reason I don't use btrfs is I don't want to deal with problems I currently don't have. ENOSPC with space available, performance issues when space available, quotas.
yes but nixos requires a degree in nixos. let me know when it's accessible by mere mortal nerds who have enough new tech to stay on top of that "getting a degree in adminning my OS" is too prohibitive
I call it the "easy" answer because I am old and "time is expensive" to me and I'm not rich.
I want to depend on others on what to spend my time on and using in kernel on whether to "jump ship" is a filter I've decided. YMMV. Like I said, I don't use btrfs either.
Fedora CoreOS can roll back the operating system to a previous state on errors. This could kick in as soon as you can't load the kernel module anymore after an update.
> At time of writing there are 387 open issues with Type: Defect label on the ZoL Github and the bulk of them seem to be genuinely important problems, such as logic bugs, panics, assertions, hanging, system crashes, kernel null pointer dereferences, and xfstests failures.
My impression is that zfs is generally considered to be less buggy than btrfs, which is the filesystem that is most often used when directly comparing to zfs. Though I admit I mostly use ZFS on freebsd, not linux (and I think the article is mostly focused on ZoL)
Some compression stats from my local setup. I work with a lot of source code (Android / AOSP), and big output directories (hundreds of gigabytes)
ZFS compressratio:
source: 1.40x (Separate pool)
output: 1.68x-2.49x (Multiple datasets to make management easier)
The output compression allowed me to use a 1TB drive as a 2TB drive effectively, allowing me to store a lot more output and not have to wipe away build output from x to build y.
That alone makes zfs worth it for me (and I know many other Android devs who use it in a similar fashion)
Ubuntu bundling zfs is a pretty good and convenient feature. As for the alternatives, mdadm + ext4/xfs is fine (i.e., does what it says on the tin), but all the other alternatives, especially the ones championed by RedHat - lvm, thin provisioning, stratis and the likes, are messy, fragile and not really used at scale by the general populace. With file systems, and linux in general, you want things that are battle tested by regular people.
Wait, how do you consider lvm "not really used at scale by the general populace"? Throughout the last several years, be it Ubuntu, CentOS, RHEL, etc., most distros tend to default to an LVM-driven partitioning scheme, I believe.
Disclaimer: I work at Red Hat, though nowhere near filesystems.
I mean the fancier things like thin provisioning, snapshots and other features built on top of lvm as a means to provide zfs-like functionality. Vanilla lvm is fine, which is what is used by everyone. I have been bitten twice by lvm-related issues: once because the metadata storage for lvm thin provisioning got full (no idea why the defaults are so low), and once because snapshots by snapper stopped working (i.e., they wouldn't rotate, eventually causing a disk-full situation) because an selinux policy update borked snapper.
I agree with the speed issues, but resilvering due to disk corruption is faster than a traditional RAID rebuild, which is left out, as is the single-redundancy ambiguity problem in RAID that raidz fixes. The checksumming is also fantastic in very specific cases, including where SATA checksumming is insufficient. The old example is a mechanical drive in a laptop which is reading or writing when the laptop gets moved. It kind of applies to mid-level datacenters in earthquake regions too, and also to any kind of shitty cheap laptop drive, and similar.
I also think the article underestimates the value of compression. It's insane for file formats to include it at all; it's absolutely a filesystem property. How much and how hard a file should be compressed isn't known at creation time; it's known by the server currently providing it to users, meaning precompressed files are always either too strongly compressed or not strongly enough, and often use custom shitty compression. It's also great for many kinds of research and engineering workloads where data does not have built-in compression, or which generate intermediate files.
File formats have compression for a very good reason - network transfer of these files. If compression was done only on the FS level, web servers would uncompress files when reading them and then immediately after that compressing them again to send them over the network to clients, thus wasting computing resources. It's better when large files (image and video files) are stored and transferred in the same compressed format.
The licensing and resulting encryption-performance (first time hearing about that TBH) situations are indeed unfortunate.
The one point that I feel is bordering on bad faith is complaining about the pitfalls of deduplication. Basically everyone (including the official docs) will tell you not to use it unless you're an expert with very specific reasons. It's not functionality present in any other filesystem I know of, and OP then agrees that it's not a needed or even desirable feature anyway. It's basically just present for backwards compatibility at this point.
> No disk checking tool
I don’t understand this complaint. Is this not what you get by running scrub with checksumming enabled?
> High memory requirements for ARC
This section is simply wrong or at least very misleading: the main motivation for ARC today is not that ZFS can’t use the page cache but that its own strategy can perform better and more optimally (if tuned properly). Things generally do not get double-cached in both page cache and ARC at all. The memory usage can be tuned with various parameters including min and max sizes (though I do wish these parameters could be set per pool as opposed to on a system level). This section too seems to be in bad faith as a reader who is not deeply familiar with ZFS already will have an incorrect understanding.
> causing long periods of unavailability until a workaround can be found.
Has it really? The one situation I’m aware of is the one they’re mentioning. Which was not experienced any differently than the usual expected delay for things to bubble down to distro repos.
> btrfs
Is simply not ready for casual users yet. It will hopefully get there before too long but at this point I find it irresponsible to recommend it for non-expert users and production workloads.
OpenZFS is not perfect and there are valid reasons not to use it. You don’t have to misrepresent it to find them.
Not mentioned there (only hinted at by "Phoronix benchmarks of ext4 vs zfs in 2019 show that ZFS does win some synthetic benchmarks but badly loses all real-world tests to ext4"): bad performance if you have a raid0 or raid10 on very fast NVMe drives.
For most of my usecases, I typically default to xfs over mdadm raid10 in f2 mode.
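A sketch of that setup; device names are placeholders for your NVMe drives. The `f2` (far, 2 copies) layout gives RAID1-level redundancy with near-RAID0 sequential read speed:

```shell
mdadm --create /dev/md0 --level=10 --layout=f2 \
      --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
mkfs.xfs /dev/md0
```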
Why not use an experimental filesystem with functionally zero real world usage that isn't in the upstream kernel and has exactly one full time developer working on it?
I love the ideas behind bcachefs but suggesting it's anywhere close to being ready for primetime is just silly.
To which, "it isn't stable or anything like prod-ready, and you still have to build it as an out-of-tree module" is a perfectly good answer. Hopefully, the answer will become "yes, bcachefs is the solution" soon, but we aren't there today.
I'm very much hoping this becomes the future of filesystems on linux, but it's very in flux at the moment. I wish some company would hire him and put a bunch of devs on it full time.
Can anyone chime in with their experience using dm-integrity? I looked into it a while back for home RAID and recall seeing some limitations that made me steer clear, possibly related to LUKS, or array expansion (I'm using mdadm RAID6 and I periodically grow my arrays). But that was a while ago, and I don't remember what the issues were. This time I promise to keep notes.
I've been using dm-integrity with mdadm RAID on top, and LVM on top of that. Once it's all initialized an working, it works great.
The biggest downsides I can think of are:
1. You have to do an initial initialization of checksums on the disk, which takes forever.
2. Automatically bringing up the dm-integrity layer of the plain disks at boot time is a bit of a dark art involving udev rules.
If you skip the checksum initialization step and add the device to the RAID immediately, it _mostly_ works, because disk blocks are rewritten by the resync/reshape, but then on reboot it fails because some part of the system tries to read an otherwise unused sector (probably to find metadata or a partition signature of some kind?), and then determines the device is unusable because of read errors (dm-integrity errors show up as read failures to the next layer in the stack).
Thanks for sharing your experience. Do you run it in bitmap mode? If not, how’s the perf hit feel (I’m using spinners)? Does it have any ability to correct errors, or does it just cause the drive to drop when they appear? (I hoped it would write the corrected value back during an mdadm array check.) Does it reduce drive capacity at all? If not, I’m thinking maybe I could convert my array to it, one drive at a time, verrrry slowly.
I'm running in bitmap mode, on spinning drives. It reduces the drive capacity, but not by a noticable amount.
I've set up my RAID not to have a "bad block list", so I _think_ it will try fixing it a few times, and drop the drive if there are too many errors in a row.
Should be easy enough to try with a "test" RAID made out of a bunch of files (or even USB sticks).
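A throwaway rig along those lines: loop devices over sparse files, each wrapped in dm-integrity, assembled into an md array. Paths and sizes are illustrative; the full `integritysetup format` (which wipes the device) is kept deliberately, since skipping it triggers exactly the unreadable-sector failure described upthread:

```shell
devs=""
for i in 0 1 2; do
  truncate -s 1G /tmp/disk$i.img
  dev=$(losetup -f --show /tmp/disk$i.img)
  integritysetup format -q "$dev"       # writes checksums for every sector
  integritysetup open "$dev" int$i
  devs="$devs /dev/mapper/int$i"
done
mdadm --create /dev/md9 --level=5 --raid-devices=3 $devs
```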
Out of interest: what is the state of btrfs compared to ZFS? Last time I checked, btrfs _seemed_ like a mess with many hard-to-reproduce subtle bugs and was basically a no-go for any usage requiring reliability. It looked bad to the point that I wondered whether it only still exists due to the sunk cost fallacy. But neither then nor now was I familiar with either btrfs or zfs, so I might have been wrong.
Whenever I look at zfs I become confused about how it all works. Predicting performance is incredibly difficult, and the rebuild times are horrendous.
That's coming from someone who has been using Ceph for a couple of years now. I know it's apples to oranges, but still. (For one, zfs is non-distributed, whereas Ceph is).
I've run ZFS on Linux, Solaris, and FreeBSD. While I like some of its features, I wouldn't run it again on anything that needs decent write speeds. The performance is really sub-par, and you need a lot of disks/memory to get over that hurdle.
Would Oracle relicensing it be able to change anything? OpenZFS and ZFS split over a decade ago, does the OpenZFS project have the right to relicense everything they've done in that time?
Yeah, Illumos and then OpenZFS very specifically avoided contributor license agreements after Oracle demonstrated the ultimate failure mode of that model (by closing OpenSolaris); if Oracle had had the sense to immediately re-license ZFS to GPLv2 or dual-licensed it could have worked, but it's a decade too late now. (Incidentally, I have never understood why they didn't do exactly this; ZFS was superior to anything Linux had at the time, and Oracle would have benefited enormously from shipping ZFS as a first-class feature on OEL. Of course, the obvious answer is... Oracle[0].)
>Incidentally, I have never understood why they didn't do exactly this; ZFS was superior to anything Linux had at the time, and Oracle would have benefited enormously from shipping ZFS as a first-class feature on OEL
I've always figured it's at least partially because someone was too stubborn to give up their work creating btrfs. Though relicensing it as proprietary is a very Oracle move.
It would change a lot - if ZFS got relicensed as GPL, users would lose the patent protection granted by CDDL, opening the possibility of being sued by Oracle.
I have heard arguments that Oracle would now be the License Steward and could change it to be whatever they liked (GPL 2.0, MIT, etc.) which would flow to the OpenZFS project.
Oracle can create CDDL-1.1 or any other newer version with different terms, but there's no "CDDL 1.0 or newer" license on OpenZFS, nor is Oracle sole copyright holder.
Another issue is that what is specifically a problem with GPLv2 is that CDDL-1.0 provides patent litigation indemnity, losing which would be a big problem.
Pretty much every license out there disclaims any responsibility.
CDDL has provisions that mean that Oracle can't sue you for (hypothetical) patent infringement in ZFS code because Sun gave automatic license to those patents when it released code under CDDL
This article is full of misinformation to the point that I'm not sure there's many redeeming points in it, and if there are, they're drowned out by wrong information:
> Out-of-tree and will never be mainlined
So the article is heavily Linux-focused. Fine. To an end user, having a driver that's not part of the mainline kernel sources is hardly a deal breaker. It still ends up running in kernel space with all the advantages that comes with.
>> Ubuntu ships ZFS as part of the kernel, not even as a separate loadable module. This redistribution of a combined CDDL/GPLv2 work is probably illegal.
Canonical lawyers have obviously disagreed with this assessment. So far so good.
>> Red Hat will not touch this with a bargepole.
Red Hat also doesn't touch btrfs. Or basically anything that's not ext4 and XFS.
>> You could consider trying the fuse ZFS instead of the in-kernel one at least, as a userspace program it is definitely not a combined work.
No, you really should not. zfs-fuse has not been maintained in over a decade, doesn't even remotely come close to supporting the features of modern ZFS, and frankly... it's FUSE. It's slow as molasses.
> Slow performance of encryption
>> ZoL did workaround the Linux symbol issue above by disabling all use of SIMD for encryption, reducing the performance versus an in-tree filesystem.
Only partially true, but the damage is limited to some metadata structures and the bulk of encryption code does use SIMD instructions (eg, the parts that encrypt your file data).
> Rigid
>> This RAID-X0 (stripe of mirrors) structure is rigid, you can’t do 0X (mirror of stripes) instead at all. You can’t stack vdevs in any other configuration.
Hard to see why it would be useful; hardly anyone ever chooses that configuration even in the non-ZFS world. It's the same capacity tradeoff via a different approach.
>> For argument’s sake, let’s assume most small installations would have a pool with only a single RAID-Z2 vdev.
Okay, not an item that can actually be refuted, but the idea that "most small installations" have only a single raidz2 vdev is a stretch. I'd wager there's a heck of a lot more single-disk and two-disk mirror configurations than all the other types combined.
> Can’t add/remove disks to a RAID
Everything here is accurate. Part of it is because of ZFS's original target audience, and part of it is genuinely hard math that hadn't been solved in a way you could actually pull off before the heat death of the universe. Mind that mdadm isn't exactly magic either: it will frequently refuse certain reshape operations on a RAID array, and its documentation isn't exactly upfront about which scenarios those are.
> RAIDZ is slow
raidz has to keep a full stripe consistent across all the disks in the vdev, so a raidz vdev delivers roughly the random-I/O performance of a single disk. This is a major reason people are generally recommended to use mirrors instead of raidz.
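To make the layouts concrete, here's what the two shapes look like on the command line. Device names and the pool name are hypothetical, and zpool create is destructive, so the commands are shown commented out:

```shell
#   zpool create tank mirror sda sdb mirror sdc sdd     # RAID10-style: a stripe of two mirrors
#   zpool create tank raidz2 sda sdb sdc sdd sde sdf    # a single six-disk raidz2 vdev
```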
> File-based RAID is slow
ZFS does not use file-based RAID, it uses block-based RAID. Yes, it knows what blocks are used and will only need to scrub/resilver those blocks.
>> Sequential read/write is a far more performant workload for both HDDs and SSDs.
Yes it is. Which is why ZFS 2.0 introduced sequential scrubs.
>> It’s especially bad if you have a lot of small files.
ZFS isn't file-based; it doesn't matter whether you have a single 2TB file or a million files. It's the same work either way.
> Real-world performance is slow
Comparing to ext4 isn't the most fair thing to do. ext4 is a dumb file system that will give you raw disk performance, every time. If this is the utmost priority, use ext4. ZFS adds compression, checksums, and redundancy to the mix. Extra protection means it's a bit slower.
> Performance degrades faster with low free space
>> It’s recommended to keep a ZFS volume below 80 - 85% usage and even on SSDs. This means you have to buy bigger drives to get the same usable size compared to other filesystems.
Basically every file system performs badly at high utilization, and hitting that point should be taken as a sign to either upgrade the storage or start deleting.
The threshold for where ZFS starts getting painful depends on which anecdote you listen to. I've heard from people running up to 95% utilization on an SSD without feeling the burn.
>> ZFS’s problem is on an entirely different level because it does not have a free-blocks bitmap at all.
ZFS has had the spacemap_histogram feature since 0.6.4 and the improved spacemap_v2 since 0.8.0. If the output of the "zpool list" command has a FRAG value other than a hyphen, your pool is using this feature already.
In order for ZFS to do what it does so well, it has to incorporate all these features that used to be in different layers. The reason that resilvering a ZFS pool is so much faster than the whole-disk RAID solutions of yore? Precisely because it knows exactly what blocks are in-use and what are not.
>> If you use ZFS’s volume management, you can’t have it manage your other drives using ext4, xfs, UFS, ntfs filesystems.
You can create volumes and store ext4, xfs, ufs, ntfs filesystems on top of ZFS just fine.
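For instance, a rough sketch: carve out a zvol and put ext4 on it. The pool name and size are made up, and the zfs/mkfs/mount commands need root and a real pool, so they're shown commented out:

```shell
# zvols show up as block devices under /dev/zvol/<pool>/<dataset>:
zvol_dev() { printf '/dev/zvol/%s' "$1"; }

# On a real system:
#   zfs create -V 20G tank/ext4vol           # a 20 GiB ZFS-backed block device
#   mkfs.ext4 "$(zvol_dev tank/ext4vol)"     # any filesystem you like on top
#   mount "$(zvol_dev tank/ext4vol)" /mnt/legacy

zvol_dev tank/ext4vol    # prints /dev/zvol/tank/ext4vol
```

The ext4 filesystem inside the zvol still benefits from ZFS checksums, snapshots, and replication underneath it.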
>> And likewise you can’t use ZFS’s filesystem with any other volume manager.
You can, but you probably shouldn't.
> Doesn’t support reflink
There's work in progress to add it, but nothing in a release version yet. That doesn't mean it never will have it.
> High memory requirements for dedupe
Indeed, and there are research projects aimed at a new dedup algorithm that drastically reduces the memory requirement. Don't use dedup unless you really, really need it.
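If you're unsure whether your data would even benefit, ZFS can estimate the dedup table before you turn anything on. It reads the whole pool, so it's slow, and "tank" is a placeholder name:

```shell
#   zdb -S tank    # simulates dedup: prints the would-be DDT histogram and ratio
```

If the simulated ratio comes back near 1.00x, dedup would cost you the RAM for nothing.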
> Dedupe is synchronous instead of asynchronous
Dedup is meant to work on live data so it does what you want immediately.
>> (By comparison, btrfs’s deduplication and Windows Server deduplication run as a background process, to reclaim space at off-peak times.)
Sometimes "off-peak times" don't exist, and this just highlights the limitations of the two mentioned technologies: they don't have an online dedup mode. I know at least for btrfs, you have to completely take the file system offline to do a dedup pass after-the-fact (and the end result is the same as the aforementioned reflink).
> High memory requirements for ARC
Basically a long rundown of Linux's own page cache fighting with the ARC. It's actually a fair point, but it's probably overblown. It's nowhere as dire as the section makes it out to be.
>> Even on FreeBSD where ZFS is supposedly better integrated, ZoL still pretends that every OS is Solaris via the Solaris Porting Layer (SPL) and doesn’t use their page cache neither. This design decision makes it a bad citizen on every OS.
FreeBSD's technical implementation of how they ported ZFS doesn't really matter, and this is the second time the article has said "supposedly better integrated" -- it's not supposed, it's literally as well-integrated as ZFS on Solaris is. (Guess what? Solaris still has UFS too; in that respect it's pretty much on par with FreeBSD.)
> Buggy
The flimsiest argument of them all :)
Bug trackers track bugs. Some of them aren't even bugs (such is the nature of user-submitted bug reports). It goes more to show the popularity and widespread use of ZFS than anything else.
I'd be far more concerned about a software project that has no bug reports on display at all.
> No disk checking tool (fsck)
>> Yikes.
There is, it's called scrubbing. See "man zpool-scrub" for details.
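And you don't have to wait for a scrub to know how your pool is doing: "zpool status -x" prints exactly "all pools are healthy" when nothing is wrong, so alerting can key off that one line. A minimal sketch (the pool name "tank" and the mail wiring are assumptions for illustration):

```shell
# Tiny health probe: reads `zpool status -x` output on stdin,
# succeeds only when the all-clear summary line is present.
pool_healthy() {
    grep -q "all pools are healthy"
}

# On a live system you'd wire it up roughly like this (commented out here,
# since it needs a real pool):
#   zpool scrub tank                             # kick off an integrity pass
#   zpool status -x | pool_healthy || echo "pool needs attention"
```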
>> In ZFS you can use zpool clear to roll back to the last good snapshot, that’s better than nothing.
"zpool clear" is an administrative command to wipe away error reports from storage devices. It should only be used when an administrator determines that the problem is not a bad disk.
ZFS pools can be made of many file systems with any snapshots you desire. There is no "the last good snapshot". Maybe zpool checkpoints are what the author is thinking about, but I doubt it.
>> merely rolling back to the last good snapshot as above does not verify the deduplication table (DDT) and this will cause all snapshots to be unmountable
That really should be impossible. I've never even heard of such a thing happening.
>> coupled with the above point (“Buggy”) if ZFS writes bad data to the disk or writes bad metaslabs, this is a showstopper
This is an error that is detected and provided as part of the "zpool status" command (and as mentioned, "zpool clear" can even clear the errors).
>> and so it should have an fsck.zfs tool that does more repair steps than just exit 0.
You could replace it with one that does a scrub, but that can take weeks on some pools :)
> Things to use instead
>> The baseline comparison should just be ext4. Maybe on mdadm.
If you think mdadm+ext4 is comparable to ZFS, you are waaaaay off. Even btrfs can't hold a candle to ZFS and it comes closer than mdadm+ext4.
Here in this section, the author comes across the term "scrubbing" but doesn't really apply it in the way ZFS uses it.
>> Compression is usually not worthwhile
Hard disagree :)
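The arithmetic behind the disagreement is simple. A back-of-envelope sketch, where the 1.8x ratio is an assumption — check your own with "zfs get compressratio <dataset>":

```shell
# If data compresses 1.8:1, every physical MB read from disk yields 1.8 MB of
# logical data, so a disk doing 200 MB/s physically delivers ~360 MB/s to the app.
effective_mbps() {
    # $1 = physical throughput in MB/s, $2 = compressratio
    awk -v p="$1" -v r="$2" 'BEGIN { printf "%.0f\n", p * r }'
}

effective_mbps 200 1.8    # prints 360
```

That's before counting the capacity savings, and lz4 decompression is generally far cheaper than the I/O it avoids.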
>> Checksumming is usually not worthwhile
Disks lie. All the time. The claims of "physical disk already has CRC checksums" was around over 20 years ago when the ZFS project started, and the fact that disks lie or do not do strong protection is a huge reason that ZFS was created in the first place. The problem in 2022 remains the same as it was in 2000.
> Summary
>> you can achieve all the same nice advanced features
You really can't. It's not even close. ZFS is so far ahead of the game, that even if some alternatives (eg, btrfs) offer a few similar features, they don't even approach it.
>> ZFS also has a lot of tuning parameters to set.
Having tuning parameters and requiring them are two different things.
>> In the future we’re waiting to see what stratis
stratis is a dead-on-arrival joke.
>> bcachefs
Probably the only thing that has a shot at competing with ZFS.
You seem very knowledgeable about this. I'm a hobbyist who runs debian with a 6 drive raidz2 array in the basement. Hardware aside, do you have any housekeeping suggestions that will help me keep it running well?
My approach is a login script that tells me the health of my zpool. My crontab has a "0 2 * * 0 /sbin/zpool scrub tank" and 6 variants of "0 2 2 * * /usr/sbin/smartctl --test=long /dev/sda > /dev/null 2>&1" (cron runs jobs with sh, so the bash-only "&>" redirection doesn't do what you'd expect there). I've learned to resilver a dead drive recently, it's on a UPS, and I have automated iterative backups elsewhere for critical data that I've practiced restoring from to verify my solution works. Never had much luck with email alerts unless I want to get all of crontab's emails sent to me.
Aside from the usual cronjobs to scrub my local and backup pools, I do a few extra things:
I'm using https://habilis.net/cronic/ to make sure I don't mess up the email notification part of the cronjob. It's a simple wrapper script that sends an email in a readable format if a cronjob fails.
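For reference, this is roughly what a cronic-wrapped crontab entry looks like (pool name and path are assumptions; cronic mails only when the job exits non-zero or writes to stderr):

```
# m  h  dom mon dow   command
0    2  *   *   0     cronic /sbin/zpool scrub tank
```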
I use Sanoid to create snapshots on my home server, and use Syncoid to push those to a cloud VPS with a beefy network drive as an off-site backup. Both tools are available here: https://github.com/jimsalterjrs/sanoid
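For flavor, a minimal sanoid.conf in the shape the project documents — the dataset name and retention counts here are made up, so adjust to taste:

```
[tank/home]
        use_template = production

[template_production]
        hourly = 24
        daily = 30
        monthly = 6
        autosnap = yes
        autoprune = yes
```

The matching off-site push is then a single (hypothetical) command for cron, e.g. "syncoid tank/home user@vps:backup/home".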
The free tier of https://cronitor.io/ makes sure I'm alerted if a cronjob fails, or fails to run on time. Especially that last bit is interesting: that way I'm sure cronjobs aren't silently failing for days/weeks on end.
I have 4 monitors set up in Cronitor: snapshot creation, zpool status on the local and backup machine, and send/receive with Syncoid. This is how that looks on the Cronitor dashboard: https://img.marceldegraaf.net/v6IpNAyxrZ54vgLqIJpY
Let me know if you want more info or examples, happy to share whatever I can to help :-)
EDIT: feel free to reach out via email as well, my address is in my profile.
Thanks, I hadn't heard of zed. More reliable solution than catching a fault with a login script. I was pretty happy when I found msmtp and imagine I can make them work together.
>Even btrfs can't hold a candle to ZFS and it comes closer than mdadm+ext4.
Could you elaborate a little bit on this point. I've used both btrfs and zfs and they have both been fine, but I'm not using them on a large enough scale to see problems.
It's not so much a problem of scale (though it could be, once we move beyond 16EiB of storage...): it's primarily about the features and functionality that ZFS offers, and the vertical integration of the entire storage stack helps that. Snapshots that make sense, df that works, no surprise out-of-space error conditions, all the fundamental storage layers working as expected, more redundancy types than btrfs can ever dream of (mirror, raidz[123], draid[123]), log and hot-spare devices, all administrative details kept inside the pool where properties can inherit, actual block devices without breaking CoW, per-dataset compression policies, quotas/reservations, case sensitivity, optional UTF-8 enforcement with optional normalization...
The list goes on and on. btrfs makes an attempt to compete with some of it, but after 14 years of development it's still not even close to ZFS's first public release (which itself came only 4-5 years after internal development at Sun began).