I've been using ZFS since circa 2008, but it's a tradeoff.
If I want no-worries and I don't want to be surprised by data loss, I use ZFS.
If I want speed with data I can afford to lose, I use something else (ufs, ext4, xfs, the right FS for the job).
ZFS's integrity checking won't do a dang thing for you if you're not paying attention, don't have monitoring, or don't even run its checks. Yes, I over-provision. Yes, I'll make the reliability vs performance tradeoffs (when it makes sense, usually reliability over performance by default though).
The great thing about having options is exactly that. For me, ZFS is the right choice in most cases. For other people it's not. Being able to make an informed choice and not being forced either way is a good thing.
Zealotry/religion has no place in matters like this when there are clear tradeoffs in all the alternatives.
I wouldn't say it's the right choice in 'most cases', but it's my preferred default choice for /etc and user home directories.
The hard case is databases. ZFS has a lot to offer (convenient support at FS level for replication stands out), but it is doing a lot of things that are solved problems in the design of competently designed databases and I never trust this kind of needless complexity. Of course if you really care about performance here, you should be willing to roll up your sleeves and tune the settings of the FS and ZFS is really nice in how it allows its complex features to be switched off.
I can kind of see your point, but I trust ZFS to never lose data, and I trust (in my case) postgres to never lose data, so the only issue is performance. That varies immensely, but I mostly work on data that compresses well, so I can barely afford not to use ZFS with compression: it saves a ton of space and actually improves I/O performance. If you're I/O bound, compressing your data lets you read and write faster than the physical disks can handle, which is still wild to me. Of course, that all depends on trusting all parts of the system; if I thought that ZFS+postgres could ever lose data, or that there was a real risk of it causing an outage (say, memory exhaustion), it'd be a harder trade to make.
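For anyone wanting to try the compression setup described above, the relevant properties are set per dataset. A sketch, where the pool/dataset names are placeholders and the 8K recordsize matching postgres's page size is a common tuning suggestion, not gospel:

```shell
# Hypothetical dataset for a postgres data directory.
zfs create -o compression=lz4 -o recordsize=8k tank/pgdata

# Later, check how well the data actually compresses:
zfs get compressratio tank/pgdata

# lz4 is cheap enough that a healthy compressratio usually means a net I/O win.
```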
> I've been using ZFS since circa 2008, but it's a tradeoff.
Indeed. I was disappointed about the low quality of the article. A good article on why not ZFS would have been an interesting addition, to help users decide.
I've been using ZFS on my home NAS for over a decade and overall it's been a great experience, but as you say ZFS does have some limitations which makes it a poor fit for certain use-cases.
You say that you choose ZFS if you don't want to worry about data loss. My experience is the opposite of yours. The only time I experienced filesystem corruption (other than hardware failure) was when I upgraded to Ubuntu 21.10 on a ZFS installation. Ubuntu 21.10 released with a known bug in the ZFS implementation that caused filesystem corruption. I was disappointed to say the least.
That's absolutely incorrect. The data-loss bug was created by Ubuntu developers when they developed a bad patch for a less-severe upstream OpenZFS issue.
Dedup is silly expensive and is off by default on any sane OS.
The memory cost of dedup explains the FUD about ZFS's memory cost in general. Sure, the ARC will consume free memory, but it also evicts itself properly like any memory hog, and you can tune it.
The 'never in tree' thing is hardly ZFS's fault. ZFS is fully in-tree in FreeBSD.
"there are other choices" is a fine message. I chose to use ZFS for convenience of snapshots as a mechanism to drive backup to a cloned zfs disk I hold offline, as part of my 3-2-1. I also deliberately bought a larger memory device to scale to the burden. At work we use SSD to front for the cost of write, and we get good scale speed backing a DB and large filestores (large for us is still only terabytes, but I know of petabyte instances multi-zvol elsewhere in the world)
iX systems offered us support and we grabbed it with both hands. I have no complaints about maintenance and SLA on this product.
I have migrated zpools between Linux and BSD routinely. It doesn't depend on RAID-card-specific semantic marks, card BIOS-level config, drive order in the frame, or "quirks" in the OS beyond conformance to a flag set. If you upgrade flags you can be stuck, but we checked before upgrading.
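A sketch of that cross-OS migration flow, with `tank` as a placeholder pool name; checking feature flags before ever running `zpool upgrade` is the step that keeps you portable:

```shell
# On the source machine (Linux or FreeBSD):
zpool export tank

# Move the disks, then on the destination:
zpool import            # with no args: lists pools found on attached disks
zpool import tank

# Before upgrading, compare enabled feature flags against what the
# other OS's OpenZFS version supports:
zpool get all tank | grep feature@
```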
I have lost data in JBOD, in UFS, in EXT, in ZFS. Nothing is perfect. I have lost data in soft RAID and in hard RAID. Nothing is perfect.
> I chose to use ZFS for convenience of snapshots as a mechanism to drive backup to a cloned zfs disk I hold offline
I do something similar with Sanoid/Syncoid and Sanoid snapshots are super easy to hook into with Borgmatic so you can have an alternate backup set that has nothing to do with ZFS.
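A minimal sanoid.conf along those lines (dataset names and retention counts here are illustrative, not recommendations):

```shell
# /etc/sanoid/sanoid.conf
[tank/home]
        use_template = production

[template_production]
        hourly = 24
        daily = 30
        monthly = 3
        autosnap = yes
        autoprune = yes
```

Replication to the second pool is then a one-liner on top of those snapshots: `syncoid tank/home backup/home`.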
The very first system I ever installed ZFS on had a consumer grade motherboard and very soon after installation I realized one of the SATA ports would get flaky under load because ZFS kept spitting up errors. The controller didn't report errors and happily wrote garbage to the disk from what I could tell. So, IMO, checksumming isn't worthless and I keep all my important data on ZFS these days.
I also disagree with the layering thing. I know it's the "unix way", but chaining together a half dozen independent systems isn't something I find appealing. At that point I'm the biggest risk to my own data because the odds of me making a mistake are higher than the odds of hitting a bug that affects data integrity. I'd much rather have a single coherent interface to deal with.
It was the same thing with systemd. Everyone complained about it "taking over everything" instead of chaining a bunch of existing uncoordinated systems together, but I can't imagine going back to the old way now that I'm used to systemd. It would be nice to see a systemd-style initiative for desktop Linux.
Spot on on the layering. It is the easiest thing to criticize on ZFS, but it is an ideological point, not a very substantive one. Creating a reliable and performant storage system is possible both with well distinguished layering (LVM2+XFS) and also without it (ZFS). What ZFS lost in ideological purity, it gained in functionality and performance (fast rebuilds and send/receive, in case of corruption, more information on which files are affected).
> File-based RAID offers the promise of having to do less work and avoiding RAIDing the empty space, but in practice it is outweighed significantly by this difference.
Citation needed. I've found ZFS recovers faster and is more usable in degraded mode than an equivalent mdadm raid.
> Buggy
Compared to what? ZFS has a better record of not losing data than anything else, mdadm and ext4 included. Having a large number of bugs in your bug tracker is a poor measure of how buggy your system is.
> Scrubbing simply needs to read every file from the disk so the RAID layer notices and repairs a URE. You can simply put cat /dev/array > /dev/null on cron once a month which is enough for mdadm to notice and repair UREs.
If you do that it will swamp your disks and make that filesystem unusable once a month, and it will take longer than it should to notice UREs. And if you reinstall your OS you will probably forget to set it up again, and you won't notice until you have a disk failure and lose all your data.
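To be fair to mdadm, there is a better mechanism than `cat` that the article skips: the md `sync_action` sysfs interface, which scrubs below the filesystem and can be throttled so the array stays usable. A sketch, where `md0` and the speed cap are placeholders:

```shell
# /etc/cron.d/mdcheck - kick off a scrub on the first of each month
0 2 1 * * root echo check > /sys/block/md0/md/sync_action

# Watch progress, and the count of mismatches found:
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt

# Cap check/rebuild bandwidth (KiB/s) so the array stays responsive:
echo 50000 > /sys/block/md0/md/sync_speed_max
```

Debian-family distros ship a `checkarray` cron job that wraps exactly this interface.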
> Checksumming is usually not worthwhile - the physical disk already has CRC checksums at the SATA level, and if you are paranoid you should also have ECC ram to prevent integrity issues in-memory (applies to ZFS too), and this should be enough. But you can easily get this if you want, either at the block layer with dm-integrity (integritysetup) below your disk or btrfs does it automatically.
Checksumming is essential to having a rebuild process that actually works. With mdadm when you rebuild you will probably get silent corrupt data in a couple of files because anything that went bad since your last scrub (if your scrub setup is even working) will not be detectable.
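The failure mode is easy to demonstrate outside any filesystem: without a checksum, a flipped byte is indistinguishable from valid data. This toy sketch does at file granularity what ZFS does per block:

```shell
# Simulate silent corruption and catch it with a checksum.
tmp=$(mktemp -d)
printf 'important data' > "$tmp/file"
sum_before=$(sha256sum "$tmp/file" | cut -d' ' -f1)

# Flip one byte in place, as a flaky controller might.
printf 'X' | dd of="$tmp/file" bs=1 seek=3 conv=notrunc status=none

sum_after=$(sha256sum "$tmp/file" | cut -d' ' -f1)
[ "$sum_before" != "$sum_after" ] && echo "corruption detected"
```

mdadm's parity can tell you that the two copies disagree, but only a checksum tells you which one is right.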
> btrfs does it automatically.
btrfs doesn't have stable support for raid-like modes, and has exactly the same kind of vertical integration that this author doesn't like about ZFS.
ZFS was the thing that convinced me that layering isn't actually always great and sometimes vertical integration makes sense. The file-level RAID is a lot nicer to work with in practice. The integrated tooling is a lot nicer to work with in practice than having to manage the md, lvm, and filesystem parts separately. I believe the out-of-tree thing might be an issue for Linux, which is part of why I'm much happier running FreeBSD.
> Checksumming is essential to having a rebuild process that actually works. With mdadm when you rebuild you will probably get silent corrupt data in a couple of files because anything that went bad since your last scrub (if your scrub setup is even working) will not be detectable.
I can confirm, it happened to me. A disk was corrupting my files and mdadm had no problem propagating the errors. Later switched to ZFS, and checksum error happen (defective SATA cable IIRC). By the way, the author suggests ECC RAM "if you are paranoid", good suggestion, it goes particularly well with ZFS checksumming, and I experienced faulty RAM too.
I switched to ZFS because I actually lost data with a mdadm/ext4 system, I didn't lose data with ZFS even though I went through broken hardware.
With regards to "File-based RAID offers the promise of having to do less work and avoiding RAIDing the empty space", there has been some work on improving the performance of ZFS scrub and resilver. The original algorithm worked very well for mostly empty disks, as it only had to check written data. But mostly full disks had worse scrub performance than traditional RAID, because the data was checked in tree order, which meant a lot of seeks.
At least that is my understanding.
Anyways, see these talks from the ZFS Developer conference about how scrubbing has been improved:
> ZFS was the thing that convinced me that layering isn't actually always great and sometimes vertical integration makes sense.
The funniest thing is that ZFS is layered. It's just that 99% of people only see the final result, and don't notice the one place the abstraction leaks a little (zpool and zfs commands operate on separate layers with a bit of overlap).
ZFS is composed of separate layer for actual block devices, a layer of object-storage system on top of it called DMU (those two layers overlap a bit in handling data safety), and finally on top of the object storage there is implementation of posix-compatible filesystem called ZPL, an emulation of plain block storage called ZVOL, and there is optional LustreZFS which also operates on that layer.
There's a bit more segmentation when you dig deep into the code (a whole special I/O scheduling and processing system called ZIO, for example, which provides things like encryption and compression), but yes, ZFS as a whole is quite layered - just with different APIs in between than offered by, say, Linux (you could technically build ZPL on top of the now-removed SCSI object storage driver, for example, but it would lack things).
> btrfs doesn't have stable support for raid-like modes, and has exactly the same kind of vertical integration that this author doesn't like about ZFS.
Btrfs raid 1/10 is rock solid and offers similar robustness to ZFS, i.e. >= hardware RAID.
It is a game changer being able to know not only which side of a mirror is correct (checksum), but also which side is the same age as everything else (generation).
I was mind blown that btrfs raid could handle a disk going completely offline for hours (ssd firmware bug) and then reappearing, without any kind of re-sync needed. It didn’t even ‘cheat’ like other raids by marking the disk as offline - it kept using it (automatically fixing any inconsistency) even before I ran a scrub. Thanks to the generation values it could tell that some reads coming from the desynced disk were wrong without needing to read both disks on every read
My understanding is ZFS (or any other raid) would offline the drive in this scenario, if this is true then btrfs raid is arguably better. There is also no threshold where btrfs gives up, like what LinusTechTips experienced with ZFS
Btrfs was launched in 2009, and there is still so much meh about it. I've been using ZFS since Solaris 10 (2006) and all the features have done exactly what they say on the tin.
The degraded mounting is a valid concern. Rebooting while in a degraded state is problematic (from a ‘might lose ssh access’ standpoint, not data loss)
The other concern seems to be the auto mounting of stale disks, which is by design and works a treat
In my experience a much bigger footgun is that btrfs raid makes files with CoW disabled completely unsafe (no sync between disks even when scrubbed), and some distros and programs (systemd-journald) selectively disable CoW by default
Maybe (though even then I've heard talk of bugs), but the article mainly talks about raid5/6-like modes and those are still marked as unsafe AIUI.
> My understanding is ZFS (or any other raid) would offline the drive in this scenario, if this is true then btrfs raid is arguably better.
ZFS can certainly handle a large number of read errors (recording them but remaining running), but if you reach the point where a device node completely disappears then it won't automatically re-add it when it comes back (you have to explicitly "zpool online" or reboot, then the pool will be imported as dirty and recover). I don't know the full details of exactly what ZFS does in every scenario but to my mind having a level of error at which you offline a drive seems pretty reasonable - once a drive is completely broken it's a waste of everyone's effort to keep retrying indefinitely.
I have a small home NAS with 3x 8TB drives in ZFS RAID-Z. Everything was fine then one day my son complained that some video files were pausing in the middle when they used to work.
I checked the zpool status and saw that the system had been running on just 2 disks for probably a month due to write errors on one disk. Of the two remaining, another was having occasional errors that ZFS was correcting.
I pulled the bad drive, zeroed it out three times, and reinserted it. ZFS performed a resilver. After it was done, I pulled the other drive, zeroed it, and added it back.
The drives (SMART and ZFS) are no longer reporting any errors on any of the three drives and I only lost partial data on 4 files (which were replaceable).
Overall, I was surprised how resilient ZFS was and how easy the process was to replace 2 drives in a 3 drive array with minimal data loss.
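The replacement dance described above maps onto a handful of commands; a sketch, where the pool and device names are placeholders and `zpool status -v` is what lists any damaged files:

```shell
zpool status -v tank     # shows the degraded vdev and any files with errors
zpool offline tank sdb   # take the flaky drive out of service
# ...wipe or swap the physical drive, then:
zpool replace tank sdb   # or: zpool replace tank sdb sdd  for a new disk
zpool status tank        # watch resilver progress
zpool clear tank         # reset error counters once healthy again
```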
> Are you scrubbing once a month? If not this will happen again. Linus Media lost 100’s of TB that way.
Anybody running 100 TB+ ZFS arrays should really have been aware of this class of problem before deploying (and doubly so if using consumer drives). ZFS protects from bitrot and hardware failures... but it's not magic. If data is written once and not read again for a long time, you won't know it's been sitting on bad sectors until it's too late, and you may end up with multiple failing drives at once if you lose a disk and try to resilver. Far better to check periodically so you can throw bad disks early.
Watched the video where Linus talks about it. Apparently they store all of the footage as a nice-to-have, and to have a use case to make the large-storage content around.
All that to say yes they lost all that data, no it wasn't backed up. Not critical to the business.
This is tricky, because it's a class of problem that you need to understand exists in order to seek out the relevant information. Once you know what to look for, it's easy to find guidance - even from Oracle themselves [0]
A lot of us have "learned the hard way," as it sounds like Linus himself eventually did. I think this highlights an issue with the "learn through youtube video" approach. An internet celebrity may acquire enough knowledge to do accessible demonstrations, presenting totally valid, useful, and correct information in them, and still miss crucial "unknown unknowns" that they simply hadn't encountered in their own research.
It's hard to know what to recommend for a class of issue you aren't even aware of!
Can someone explain this? I find it a bit scary that you have to do monthly manual work on your NAS or you'll lose data.
Edit: wait, I think I get it... "Scrubbing simply needs to read every file from the disk so the RAID layer notices and repairs a URE". So it's to avoid bit rot? I have to say, as someone who has a few TB of personal data on a NAS, bit rot is a bit scary. Data backup is mainly why I bought the lifetime pCloud+encryption package during the last Black Friday. I wonder how (if?) they avoid bit rot?
That is why I think consumer NAS is an unsolved problem. I have no need for VPN, cloud photos, or a mail server etc. I need simple, reliable network storage that protects against drive failure and bit rot. Right now you need to spend at least $400 on a NAS, with lots of config, before getting it done.
It's less Linux and more whether or not you have a database/NAS-centric distribution, I think.
The problem with installing a cron job by default when installing ZFS is that for a general-purpose OS there isn't a good default for when and how often to run it. And running it at the wrong time might even be a major problem.
Though tbh, having a bad default is probably still better than no default in this case.
> lose data before noticing

Is a bit of an overstatement, as they didn't look for quite a while. They also didn't just fail to do scrubbing; they failed to set up automated health checks and reporting as well.
Turning a non NAS focused Linux distribution into a well working and tuned NAS isn't easy (compared to using a good NAS OS/distribution), but making it somewhat work is easy. Which makes this a pretty common mistake for non-specialized people (i.e. like in their case).
I believe I got default cron files when I installed ZFS on Ubuntu 21.04. It could be that they were created and I had to uncomment one line in a file. Then a scrub on the pools would run once a month. Then I set up email on the server, and every time a scrub finishes with any errors, I get an email.
Quite easy for me to set up, even though it's my first NAS that I built myself and my first time using ZFS. Very surprised that LTT effed that up, to be honest.
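For reference, the email half of that setup is ZED (the ZFS Event Daemon), configured in zed.rc; on Debian/Ubuntu the zfsutils-linux package also ships a monthly scrub cron job. A sketch (the address is obviously a placeholder):

```shell
# /etc/zfs/zed.d/zed.rc - mail on scrub/resilver completion and on errors
ZED_EMAIL_ADDR="you@example.com"
ZED_EMAIL_PROG="mail"
ZED_NOTIFY_VERBOSE=1    # also notify on scrubs that found nothing wrong

# If your distro doesn't ship a scrub job, it's one cron line:
# 0 3 1 * * root /usr/sbin/zpool scrub tank
```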
Let's start out with what the article spends the most time on: licensing.
This is a giant self-own and a very long winded way of saying "Linux will ignore the best implementation of a thing because we/they don't like the license."
That's a choice, not a law of the universe. It's perhaps also a canonical example of spite-induced nose cutting.
Let's see... other things... oh! Suggesting mdadm is better than well... anything... is a giant red flag to me. No thanks.
Dismissing checksums as unnecessary? Okay, now I think they're just trolling. Years of running ZFS teach you that bits flip, a lot... and you have no realistic recourse without block level checksums.
> "Linux will ignore the best implementation of a thing because we/they don't like the license."
> That's a choice, not a law of the universe. It's perhaps also a canonical example of spite-induced nose cutting
It isn't because they don't like the license. The licenses of linux and zfs are incompatible, so zfs is out of tree and that leads to technical problems. No, it isn't a law of the universe, but it is the law of the United States at least, and probably many other countries as well.
> To be fair, they kind of did that to themselves with the GPL
Well at this point, even if they wanted to change the license of Linux, they probably couldn't. Every contributor would have to agree to it, which is unlikely to happen, even if you ignore the logistics of asking everyone.
Not meaningfully; Linux can't change away from GPLv2 (too many copyright owners), and ZFS can never change from CDDL (again, too many owners, including Oracle), and those licenses are incompatible. The only way Linux could make it work is by freezing all the kernel APIs that ZFS uses, and while that is a choice, it would be an extremely expensive choice to give up being able to refactor anything that it touched.
I would agree with insaneirish if it was anybody but Oracle who held the CDDL-licensed code (I think IBM has ZFS-relevant patents, but no CDDL code in the OpenZFS codebase). With anyone else, I'd say the legal risks were overwrought; we're talking about two subtly mismatched copyleft licenses, so the idea that there is damage done by combining the codebases is legally weird, and this kind of scorched-earth legal strategy is insane from a market giant. But as a wise man once said, don't anthropomorphise Oracle.
Except Oracle is not the danger here: the license incompatibility is about GPL, not CDDL. I don’t think Oracle is going to sue anyone for GPL violation. And if it does, it could do that without CDDL in the picture.
Oracle also has a Linux distribution. So, I wouldn’t count them out as suing over a GPL/CDDL issue. I’d love to know who the harmed party would be in a lawsuit like that, but I’m sure that there would be no winners.
The GPL-side license trolling happened way before Oracle was in talks to buy Sun. On Dtrace team (first CDDL licensed code) afaik the expectation was that they would see Dtrace integrated pretty fast in Linux. Of course, it's all anecdata (just like the opposite view about purposeful incompatibility from another Sun employee).
> This is a giant self-own and a very long winded way of saying "Linux will ignore the best implementation of a thing because we/they don't like the license."
It has nothing to do with "dislike." It's illegal to put ZFS into the Linux kernel. Just as illegal as putting a pirate copy of Microsoft's NTFS in there.
This is because of the choice of license by the owners of ZFS code!
You're blaming Linux but Linux license was chosen long before ZFS even existed. ZFS license was specifically chosen for the purpose of keeping code out of Linux.
The incompatibility is on the GPL side, not the CDDL side, and even there it's unclear because of the "derived work" term involved. CDDL was created because GPL would artificially limit inclusion in other projects. Views on its compatibility with GPL were split at Sun: some employees, even legal counsel if one believes claims on this very site, expected CDDL code to be eligible for inclusion in GPLv2 projects like Linux, while others made rather public remarks about purposeful incompatibility, but without anything to support them.
And no, it's not illegal to put ZFS into linux kernel, because GPLv2 applies only on distribution.
No, I'm deadly serious. GPLv2 vs CDDL is a complex situation with somewhat easy results (somewhat because there are divergent opinions - so far those opinions only got clicks in media and nothing in court).
It's hard to determine whether GPLv2 would apply on distribution and on what kind of distribution, even.
But your comparison with a "pirated NTFS driver" was completely out of whack, suggesting that it's illegal to even do it personally (as would be the case with a pirated NTFS driver). GPLv2 applies only on distribution. Always did, always will.
> As an out-of-tree GPL-incompatible module, it is regularly broken by upstream changes on Linux where ZoL was discovered to be abusing GPL symbols, causing long periods of unavailability until a workaround can be found.
No it absolutely wasn't. It was using a symbol that saves the FPU state, and then that symbol was deleted in favor of a GPL symbol that does the same thing. There is only one abuse in the situation, and that is the flat-out lie that triggering an FPU state save intertwines you so deeply with the kernel that it makes your code derivative of it.
The breakage is also irrelevant as an end user if you're sticking to a LTS kernel, or using distro packaging that supports it (if Canonical never ships a kernel+ZFS that don't work, I don't care what upstream does). And if it does break... I guess I just have to roll back to the last root snapshot;D
I'm not confident that ZFS + the latest kernel is safe. If ZFS suits your use case, it's better to use Ubuntu or Proxmox (or FreeBSD!) rather than Arch.
Any blog post that espouses the virtues of btrfs as better alternative to ZFS is going to be immediately recognized as uninformed and a little embarrassing by any operator with a significant degree of experience in both technologies.
As someone who _has_ operated many systems with both filesystems - for a long time - it’s malfeasance to shepherd someone in that direction. btrfs has its uses, but there’s no comparison in terms of project maturity. Sharp edges abound (behavior at high usage, RAID immaturity, the still-extant 5.16 kernel single-core max CPU use bug, etc.)
Similarly to the author, I definitely want to be able to use btrfs. I liked features like being able to add/remove drives, and reflink=auto. But I got burned on its stability (luckily didn't lose any data) and won't look at it again for anything important for a long time.
Still use it on single drive filesystems for inline compression sometimes though.
Well, it's less ZFS and more Linux, but it is a real problem on 32-bit platforms (though there was a lot of work in ZoL to move off vmalloc()).
I switched to ZFS a decade ago. Before that I used mdadm and LVM. Before that I used hardware RAID. I also dabbled in BTRFS but didn't like it. I usually play with any new filesystem that has 5 years maturity and a big userbase, but nowadays my go-to filesystems are ext4 and ZFS.
I use ext4 mainly because it's the "blessed" filesystem of linux. It's simple, does what I want, and I have yet to be bitten by it. That's good enough for general use in situations where I don't care about file integrity.
ZFS marks the first time I've been able to breathe easy, knowing that my backups aren't storing corrupt files because of a raid write hole or hardware silent corruption that I won't discover for months or even years (likely long past any restore point on my backups).
Don't trust the hardware. At all. Hardware fails all the time, often silently, and can't be audited because the code and silicon are closed source so you have absolutely NO indication as to its quality (other than Backblaze reports [1]). You'll never know what demons lurk in those depths, but with paranoid software like ZFS, you don't have to care.
The only time I use non-ZFS filesystems is when I don't care about data integrity on that drive, such as pushbutton rebuildable server boot disks with backed up configurations (NixOS is great here), or cache/scratch drives. If it ever becomes easier to make ZFS boot disks, I'll probably start using them for server boot disks as well, just so that I can be alerted whenever a drive starts to fail.
By the way, the author sounds like he is in love with LVM and the Linux ecosystem and does not understand ZFS at all... while he also recommends btrfs on LVM for the same features.
If you do not know ZFS, do not read it - it will only put false information into your mind. If you know ZFS, then you can go read it and laugh, to improve your mood :)
It looks like you didn't read the article either, because the author already mentioned the very same links in the article.
> Growing a RAIDZ vdev by adding disks is at least coming soon. It is still a WIP as of August 2021 despite a breathless Ars Technica article about it in June.
I'm a long-time ZFS fan, and still for storage of my digital artifacts I don't really trust anything else. For backup servers, where performance isn't really critical, and snapshots are a huge benefit, it is great!
However, ~2 years ago I installed my laptop with encrypted ZFS, and frankly it kind of sucks. I don't know if this is the SIMD thing, or something else, but I can basically count on 1 full CPU running some zfs process at all times. Any apt install takes a stupid long time because it's doing something with snapshots. And I think I've only used the snapshots a couple of times. When I did though, it was handy.
But, I can't complain about my track record with ZFS: Over ~15 years I've never had data lost by it, despite having some horrible things happen to the systems that have run it.
I will fully agree that dedup is effectively useless. I've never had a system that had enough memory to reliably run dedup on any real workload.
I'll disagree that there are other tools better at doing what ZFS does though. In particular, I've had 5 events over the last year where our (normally reliable) PERC storage systems experienced corruption on LVM and RAID-6 arrays. I believe it was related to some issues with the Dell drives and Kafka broker activity on the arrays, but it sure would have been nice to have had ZFS on them instead of hardware RAID+LVM.
That sounds promising. I made the mistake of using de-dupe on ZFS before I understood the RAM requirements and performance implications ... quickly backed out.
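For anyone weighing the same decision, `zdb` can simulate dedup against existing data so you see the DDT cost before enabling anything. A sketch (pool name is a placeholder; the ~320 bytes per unique block is the commonly cited rule of thumb, not an exact figure):

```shell
# Simulate dedup on an existing pool WITHOUT enabling it:
zdb -S tank              # prints a DDT histogram and the projected dedup ratio

# If dedup is already on, inspect the real table:
zpool status -D tank

# Rough RAM estimate: unique blocks x ~320 bytes must fit in ARC,
# or every write turns into random reads of the on-disk DDT.
```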
> I don't know if this is the SIMD thing, or something else, but I can basically count on 1 full CPU running some zfs process at all times.
Based on my (limited) experience it was really bad before the whole SIMD thing got resolved. There was a huge improvement once they worked around that.
For security reasons I recently encrypted some of my ZFS home server (including the OS SSD). Throughput took a severe hit, because the CPU (Pentium G630) does not support AES-NI. However, day-to-day operations (also CPU load over time) remain virtually unchanged.
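For reference, native encryption is a per-dataset property, and whether the CPU has AES-NI is easy to check up front. A sketch (dataset name is a placeholder):

```shell
# Does this CPU have hardware AES? Empty output means slow software crypto.
grep -m1 -o aes /proc/cpuinfo

# Create an encrypted dataset (encryption=on selects the default AES-GCM cipher):
zfs create -o encryption=on -o keyformat=passphrase tank/secure

# Unlock and mount after a reboot:
zfs load-key tank/secure && zfs mount tank/secure
```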
Encryption is unrelated to any snapshotting activity. Sounds like you may have additional hooks in place. ZFS by itself does not integrate in any way with package management or anything else, really.
I don't know if it ended up implemented, but there was talk of patching Linux to re-enable the necessary symbols for ZFS. This would let it use SIMD again, and speed it up to where encryption isn't a performance hog.
It'd explain why you get different results on NixOS.
EDIT: Seems like it was, but has been removed as unnecessary on newer ZFS versions.
zsysd is an Ubuntu daemon for managing ZFS things, not something OpenZFS provides.
So if that is the only process spinning, that is probably specific to them.
(Not that OZFS native encryption doesn't have flaws; I am probably the last person on the planet to pick to argue that point. But I don't know that this is among them.)
I wonder what `sudo perf top` says it's doing (in terms of heaviest symbols). That would probably be the most useful thing to stick on the end of a pitchfork :P
(The command accepts a `-p <pid>` parameter; without that it'll profile the whole system (I make this sound much heavier than it is).)
zsysd is Ubuntu's zsys tool. It's notoriously bad at burning CPU as your number of snapshots increase. It has some pretty serious flaws as well that can result in total data loss on the local drive due to aggressive / incorrect snapshot pruning.
- My primary need for it is on my backup box, which does "rsync --inplace" to keep small changes to large files from creating a completely new copy of that file every backup run. Hardlinks would cause all of those to compete if some previously similar file started getting updated (OS updates or similar creating files that then diverge).
- Each system gets its own ZFS filesystem, so I can snapshot them at the end of the backup, keep different retention times, etc. But I can't hardlink across filesystems, I don't think.
The linked article did mention increasing dedup block size, which might help. If I could even cut down the DDT by half, that would make it more doable.
This removes copy-on-write though, which is why it's a bad idea. Editing a file in one place and having it surprise edit everywhere else is a good way to wind up with a disaster.
The single, actual reason for refusing to not use ZFS on Linux systems in general is IMO the non-mainline state.
Sure, as with other filesystems it has trade-offs and may not match all workloads or use cases, but it also has real maturity, and no native Linux file system has a feature set as thought-out and complete as ZFS's, although btrfs comes closer every release, which is nice to see. The one thing ZFS is missing is rebalancing, though there are ideas out there to solve that. None of this is a reason to single out ZFS and never consider using it.
Alternatively, fighting with a proprietary HW RAID controller risks getting one's data eaten by a simple firmware upgrade, with no way to introspect or salvage anything, because they're just a proprietary mess.
So, while installing ZFS via DKMS or the like works out, in the end it will always be a second-class citizen in the kernel, spending a lot of time fighting upstream changes and working in a parallel universe inside the kernel, and thus spending much more resources and effort.
See for example the ARC, which is allowed to use up to half of memory by default but shows up nowhere in native Linux memory accounting. Yes, specialized tools like arcstat exist, but if ZFS were actually mainlined it could benefit from working with the kernel instead of against it half of the time.
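As a small illustration of that accounting gap: the ARC's counters are exported through procfs rather than the normal memory stats, so tools end up parsing `/proc/spl/kstat/zfs/arcstats` (a real ZoL file) themselves. A hedged sketch in Python, using a hard-coded sample of the file's three-column format instead of reading the real file:

```python
def parse_arcstats(text):
    """Return {name: int} for the numeric kstat rows of arcstats."""
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        parts = line.split()
        if len(parts) == 3:
            name, _kstat_type, data = parts
            stats[name] = int(data)
    return stats

# Abbreviated sample in the real file's layout (values invented):
sample = """\
13 1 0x01 123 33456 7633297260 266763495886
name                            type data
size                            4    8589934592
c_max                           4    16954228736
"""

stats = parse_arcstats(sample)
print(stats["size"] / 2**30)  # current ARC size in GiB -> 8.0
```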
I think that if ZFS were mainlined, and sadly that seems very unlikely to happen anytime soon, a big part of what currently causes users trouble with ZFS on Linux would go away pretty quickly, just from being natively integrated.
The issue with ARC is that you can't really make it cooperate with normal linux VFS, because of how linux works. You can either break with Linux VFS (as ZFSonLinux does) or do a complete rewrite of core foundational element of Linux kernel. (No, you can't reimplement ZFS to be compatible with certain VFS expectations).
There are reasons why both kernel and ZoL teams say they are happy with not mainlining.
> The issue with ARC is that you can't really make it cooperate with normal linux VFS, because of how linux works.
That's only part of the reason, and in the simplest form you could just bypass the VFS page cache and surface the relevant arcstat properties through it.
And besides that, I see no inherent issue in solving that, at least in a much more integrated way than the status quo.
> There are reasons why both kernel and ZoL teams say they are happy with not mainlining.
Yes, but that's the license and not the caching/VFS differences.
As Linus said:
> And honestly, there is no way I can merge any of the ZFS efforts until I get
> an official letter from Oracle that is signed by their main legal counsel or
> preferably by Larry Ellison himself that says that yes, it's ok to do so and
> treat the end result as GPL'd.
>
> Other people think it can be ok to merge ZFS code into the kernel and that
> the module interface makes it ok, and that's their decision. But considering
> Oracle's litigious nature, and the questions over licensing, there's no way I
> can feel safe in ever doing so.
>
> And I'm not at all interested in some "ZFS shim layer" thing either that some
> people seem to think would isolate the two projects. That adds no value to
> our side, and given Oracle's interface copyright suits (see Java), I don't
> think it's any real licensing win either.
>
> [...] the licensing issues just make it a non-starter for me."
Oh, I haven't seen that issue. (Notice "made me reconsider" from ryao).
But pagecache is still something that isn't fixable (you can't disable VFS pagecache, it's too intertwined with the whole I/O system) - I don't see a way to implement ZFS "linux way" considering things like block sizes bigger than one page being antithetical in linux.
It's kind of dishonest to have a section "Out-of-tree and will never be mainlined" that discusses Debian and Red Hat but not Ubuntu, who have bitten the bullet of doing exactly this in-kernel maintenance the article claims doesn't exist. (Or is it ignorance? Did the author not know of this support?).
> Ubuntu ships ZFS as part of the kernel, not even as a separate loadable module. This redistribution of a combined CDDL/GPLv2 work is probably illegal.
I did and I seem to have had a reading comprehension failure and skipped that brief, crystal-clear paragraph. Ouch!
That criticism retracted, although if the author suggests Ubuntu users are under threat here, that seems kind of fanciful to me. And while Oracle are hyperaggressive, I'm struggling to see that even they would think it in their interests to go after Canonical itself.
They bought Sun, partly because they saw a revenue opportunity to go after Java's largest end user deployment on Android devices. I'm shocked they haven't gone after Canonical yet, to be honest.
The only arguments they would have are the same as in the existing SFC vs Canonical court case (SFC sued Canonical on behalf of a few Linux kernel developers).
Switched to ZFS recently for my home server, 6x8TB SATA RAIDZ2 (data) and 4x 500GB SATA SSD mirror (VMs). Above everything, resilience, predictability and peace of mind were the primary motivations. The performance is totally fine. Many argue against ZFS on SATA disks, especially SSDs, due to ZFS's impact on Total Bytes Written (TBW), but if you're not hosting an enterprise with many database writes, SATA disks are perfectly acceptable with ZFS (my observation). After one month, I have accumulated 2 1/2 TB written on the SSD mirror; at this rate the disks would still last about 7 years (this is running 25 Docker services and 5 VMs, including bigger ones such as GitLab).
I have compression and encryption on and the fio benchmarks were all good, similar to what I saw with the Hardware Raid 1.
Regarding monitoring: it is not that difficult. (1) Have ZFS mail status updates to you (e.g. this is the default in Proxmox), (2) run monthly scrubs (cron), (3) monitor TBW with (e.g.) InfluxDB. Wrote a blog post about the last part [1]: "Disk Wear (SSD) - extracted from extended SMART Attributes (Single Stat)".
All of the issues with degrading disks apply equally to other disk setups (e.g. Raid1). However, ZFS gives you better tools to predict and prevent, before disaster happens.
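The "mail me status updates" part can be sketched without Proxmox: `zpool status -x` (a real command) prints exactly "all pools are healthy" when nothing is wrong, so a cron job only needs to mail anything that deviates. A minimal check, with the health strings illustrated by hard-coded samples:

```python
def pools_unhealthy(status_output):
    """True if `zpool status -x` reported anything but a healthy system."""
    return status_output.strip() != "all pools are healthy"

# Healthy case: the exact string zpool prints when all is well.
print(pools_unhealthy("all pools are healthy\n"))      # False

# Degraded case: any real problem produces a full per-pool report instead.
degraded = "  pool: tank\n state: DEGRADED\n"
print(pools_unhealthy(degraded))                       # True
```

Wire it to cron with something like `zpool status -x | mail -E ...` (flags vary by mail implementation) so you only get mail on trouble.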
He is wrong about a few things:
1: checksumming is worthwhile; I have had silent data corruption on both SSDs and HDDs.
2: compression is also worth it: home dir 74G/103G, virtual machines dir 1.3T/2.0T.
3: zfs was never supposed to be fast; data integrity is the target.
4: zfs does not need a manual repair tool; repair is automatic and data at rest is always consistent.
5: in the future x, y, z - yeah, sure.
TL;DW: Data in their massive 1PB server was suffering from bitrot because there was no scheduled scrub to repair the bad data. And they couldn't tell how bad the situation was because without any scrubs happening, the stats on data integrity were inaccurate.
I agree, the scrubs are triggered externally, but they are enabled by default in Debian/Ubuntu via crontab every two weeks (the mdadm consistency check is also triggered from cron).
about the data loss in the video, mistakes easy to spot:
1st: using seagate.
2nd: installed by us and never updated
3rd: insufficient reading of docs before going all in on zfs.
4th: buying more seagate drives ;))
I think they had a way higher chance of losing their data going the usual stack mdadm/lvm/ext4/luks/btrfs, I think mastering those is harder than mastering zfs.
I think for Linus Media Group the main "meta" issue is that they don't have a dedicated member of staff to handle the boring day-to-day IT / sysadmin tasks that you don't make videos about. Building a crazy storage server is content for a video, so it gets done, but somebody needs to make sure it's still working and updated, and you don't make videos about routine maintenance, so it's forgotten.
Although everyone else can learn the important lesson that RAID / ZFS isn't magic: you need to have things set up correctly and monitored. And that RAID isn't a backup[1].
[1] Although if the LMG servers affected are just for data hoarding raw footage that is unlikely to be needed again, it's possible the risk / cost balance pushes away from backups and just relying on RAID, but that's a niche case (and they lost the gamble...).
Yes, their setup likely would have been configured, monitored, and maintained properly if they had an IT guy. But they made the (easy to make) mistake of thinking that having enough tech knowledge means you don't need a proper IT/systems department.
I'm certain at least that Linus knows RAID isn't backup. And I'm sure Linus is going to try hard to get the data back, but it seems to me that this isn't some devastating failure for him.
I bet you they also bought all the same make, model, and batch/vintage drives.
If you are building a storage array, do not do this. Ensure that you are using a variety of drive types (obviously same size and interface technology). Doing so guards against the danger of too many drives going wrong at the same time (within the same time window) causing a failure from which it is impossible to recover.
Glancing through, it seems ZFS is about middle of the pack of the analyzed filesystems (BetrFS, Btrfs, ext4, F2FS, XFS, ZFS) in terms of performance aging caused by change history. Only BetrFS did not show any aging in these tests and was surprisingly stable throughout. I'm curious whether that's changed in recent versions of any of the analyzed file systems; I know ZFS has had a major release recently.
"Git aging workload on btrfs on HDD. The overall slowdown is 20.6x."
Well, this article is fascinating, but maybe the reason why btrfs has built-in defrag functionality is because without it, performance degrades incredibly steeply?
ZFS has a `recordsize` parameter that can mitigate fragmentation if tuned correctly for the workload. This allows zfs to perform pretty competitively on workloads that would cripple btrfs unless one disabled copy-on-write (and therefore checksumming and compression) altogether [1].
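For the database case, that tuning usually amounts to a few dataset properties. A sketch with an illustrative dataset name (Postgres uses 8 KiB pages, so 8K-16K recordsize is the commonly suggested range; set it before loading data, since it only applies to newly written blocks):

```shell
zfs create tank/pgdata
zfs set recordsize=16K tank/pgdata    # align records with DB page I/O
zfs set compression=lz4 tank/pgdata   # cheap compression, often a net win
zfs set atime=off tank/pgdata         # skip access-time writes
```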
I can't usually do that on Btrfs either. Whenever it defragments something that's COW, it makes a new copy with 1 use, and leaves all the other snapshots/reflinks pointing at the old copy.
Seems the author hasn't done more than small experiments with zfs. Compression works very well for logs and even compresses /var/lib/docker (which usually gets quite huge as you keep using containers) to about half.
Saying to be cautious of using zfs because some day there will be an in-tree file system to replace zfs is absolute nonsense.
What happened with btrfs? People waited for a decade and still no wide deployment.
And no one wants to use zfs-fuse; everyone says its performance is a joke, yet that isn't mentioned there.
I find this article overstated. I ran some Postgres databases on ZFS for specialty needs for some time, mostly to take advantage of compression. It can be quite potent for this, though the amplification of CPU usage with a pathological access pattern can be a problem for some workloads.
I did have some problems in early versions that seemed plausibly related to its contiguous memory needs, and later I saw a release quite drastically changed this (0.7), and indeed, it was stable from then on.
I found it more cohesive to work with than other options available at the time, e.g. l2arc for caching. I didn't have need for pool dynamism, but I found it rather easy to use, though I also have experience with lvm and mdadm.
That said, some of the source code I found more incomprehensible than average, and I'm fairly used to reading linux to answer certain questions. Maybe I just needed to study it longer.
Sounds like somebody without practical experience.
E.g. Borg was worthless because data retrieval from a dedicated Hetzner SX6_ host was so slow: getting data back at less than 10MB/s was disastrous for a 10TB+ repo. (Not a bandwidth problem.)
Been there, done most of the suggestions. Most of them are impractical, regardless of what the benchmarks at Phoronix say and how often one jumps from solution to solution for certain aspects.
Still use ZFS, still IMO overall the best package for a lot of data management.
I am sorry, but there is little fact in this post. It starts with the religious crap about licensing, then inaccurate statements about symbols, statements about performance with zero proof, misunderstanding of layering, dumb statements about the memory cache and how the ARC works, some shock about dedup requiring lots of RAM (duh)... etc. It reads as "I feel this way, so here is my cherry-picked set of items (without clear understanding) to back up my view."
I'm not entirely convinced by his arguments: "oh, if you want the features of zfs you can just use this bunch of other stuff and it will come out more performant, stable and less buggy than zfs". Even if that were the case, I don't feel convinced I would be as capable of recovering from a disaster with that many moving parts.
"Unless you have ECC", of course I have ECC, it's for my file storage.
That said, there are valid points and something to be considered, especially the licensing stuff is a bit scary.
The licensing scare is overblown. You only need to worry if you want to distribute software that includes both ZFS and Linux, like Canonical does (and they decided it's worth the small risk). If you just use ZFS and Linux on your machines, there is no licensing issue whatsoever.
Dedupe is a lot less memory-intensive if you have larger files and tune your datasets' `recordsize` to 1MiB instead of the 128KiB default. It makes per-file deduplication less efficient but makes the overall process much more usable since there will be much fewer total blocks for it to keep track of. I use it to maintain my torrent seed directories while also moving/retagging the files into my own separate hierarchy.
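A back-of-the-envelope sketch of why the recordsize matters so much here. The ~320 bytes per in-core DDT entry is a commonly quoted rule of thumb, not an exact figure; the point is that DDT memory scales with block count, so 1MiB records need roughly 1/8 the RAM of 128KiB records:

```python
DDT_ENTRY_BYTES = 320  # rough per-unique-block overhead (rule of thumb)

def ddt_ram_gib(pool_bytes, recordsize):
    """Worst-case in-core DDT size, assuming every block is unique."""
    blocks = pool_bytes // recordsize
    return blocks * DDT_ENTRY_BYTES / 2**30

pool = 10 * 2**40  # a hypothetical 10 TiB pool of large files
print(ddt_ram_gib(pool, 128 * 2**10))  # 128 KiB records -> 25.0 GiB
print(ddt_ram_gib(pool, 2**20))        # 1 MiB records   -> 3.125 GiB
```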
To expand on (2): CDDL was specifically designed to make code easier to include in other projects while keeping certain protections for all involved. The issue is that GPLv2 has a somewhat complex test for whether it applies ("derivative work"), tries to push you to distribute code under GPLv2 terms, and doesn't play ball if some part of the work has requirements that exceed GPLv2's.
The specific incompatibility is, iirc, due to patent litigation protections in CDDL, which do not exist in GPLv2, meaning you can't just redistribute CDDL code under GPLv2 umbrella
I'm a long term ZoL user that's been looking over the fence at btrfs for a long time. I've been tempted to switch since I only use ZFS mirrors (RAID 5 instability isn't a deal killer). If you don't mind me asking, how long have you been using it and have you had any stability issues?
I lost a bit of data circa 2013 on a BTRFS RAID5 as a result of a power loss. I rebuilt the array as RAID1 and it's been fine ever since, despite initially using only ST3000DM001 drives (most of which are now dead). I've added and removed drives more times than I can recall (some dead or dying, some just too small or slow to keep in service), migrated the array piecemeal over to an assortment of SSDs of varying sizes, recovered from mistakes like ejecting the wrong drive while the filesystem was mounted, and switched the metadata to RAID1c3, all with reasonably low downtime and no data loss. There have definitely been times I would have appreciated the space efficiency of a reliable RAID5/6 setup, but on the other hand I've made heavy use of features that ZFS doesn't have and probably never will.
Yeah seems good. I will say, WinBtrfs is unfortunately garbage, but you can just set up a simple Linux VM as an SMB server in Hyper-V with dynamic memory, and for me it uses about 700MB of RAM. Works great.
FYI, WSL now supports mounting physical disks. So if you're running a setup like that, you can take advantage of a VM managed for you, along with not having to run Samba because files are made available in explorer (via 9P).
I tried it twice and got completely hosed drives twice somehow.
Have had zero hosed drives with ZFS. Although with ZFS not being supported in the kernel, it's fairly hard to do a ZFS-on-root setup unless you're using Ubuntu.
Ditto; I used BTRFS on OpenSUSE - y'know, the big distro that's all-in on BTRFS and that's been using it for forever, the one place where I'd expect it to work - and it ate the whole root filesystem twice. The first time I tried to recover (fsck? I forget), the second time I didn't even try and just reinstalled. In hindsight, I'm not sure why I reinstalled on BTRFS again, but it hasn't managed a third time at least:) (Now that I've written that, it will of course break)
btrfs is the default for Fedora Workstation 35 and seems to work well for normal use cases. However, it failed miserably when I tried to download the Monero blockchain, either through raw import or incremental p2p sync. The fix was simple: set the database directory to nocow, recreate the db from a backup, and all was well. But btrfs has that sharp edge users need to worry about whenever they host a large, regularly written file (say, a database) on the filesystem. XFS never had such an issue.
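The fix described above, sketched out (the directory path is illustrative). `chattr +C` only takes effect for files created after the flag is set, which is why the directory must be empty before restoring into it:

```shell
mkdir /var/lib/monero
chattr +C /var/lib/monero    # disable copy-on-write on btrfs
lsattr -d /var/lib/monero    # should now show the 'C' flag
# now restore the database from backup into the nocow directory
```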
Well, ZFS not being in the kernel is the easy reason I don't use zfs.
The reason I don't use btrfs is I don't want to deal with problems I currently don't have. ENOSPC with space available, performance issues when space available, quotas.
yes but nixos requires a degree in nixos. let me know when it's accessible by mere mortal nerds who have enough new tech to stay on top of that "getting a degree in adminning my OS" is too prohibitive
I call it the "easy" answer because I am old and "time is expensive" to me and I'm not rich.
I want to depend on others on what to spend my time on and using in kernel on whether to "jump ship" is a filter I've decided. YMMV. Like I said, I don't use btrfs either.
Fedora CoreOS can roll back the operating system to a previous state on errors. This could kick in as soon as you can't load the kernel module anymore after an update.
> At time of writing there are 387 open issues with Type: Defect label on the ZoL Github and the bulk of them seem to be genuinely important problems, such as logic bugs, panics, assertions, hanging, system crashes, kernel null pointer dereferences, and xfstests failures.
My impression is that zfs is generally considered to be less buggy than btrfs, which is the filesystem that is most often used when directly comparing to zfs. Though I admit I mostly use ZFS on freebsd, not linux (and I think the article is mostly focused on ZoL)
Some compression stats from my local setup. I work with a lot of source code (Android / AOSP), and big output directories (hundreds of gigabytes)
ZFS compressratio:
source: 1.40x (Separate pool)
output: 1.68x-2.49x (Multiple datasets to make management easier)
The output compression allowed me to use a 1TB drive as a 2TB drive effectively, allowing me to store a lot more output and not have to wipe away build output from x to build y.
That alone makes zfs worth it for me (and I know many other Android devs who use it in a similar fashion)
Ubuntu bundling zfs is a pretty good and convenient feature. As for the alternatives, mdadm + ext4/xfs is fine (i.e., does what it says on the tin), but all the other alternatives, especially the ones championed by RedHat - lvm, thin provisioning, stratis and the likes, are messy, fragile and not really used at scale by the general populace. With file systems, and linux in general, you want things that are battle tested by regular people.
Wait, how do you consider lvm "not really used at scale by the general populace"? Throughout the last several years, be it Ubuntu, CentOS, RHEL, etc., most distros tend to default to an LVM-driven partitioning scheme, I believe.
Disclaimer: I work at Red Hat, though nowhere near filesystems.
I mean the fancier things like thin provisioning, snapshots and other features built on top of lvm as a means to provide zfs-like functionality. Vanilla lvm is fine, which is what is used by everyone. I have been bitten twice by lvm-related issues: once because the metadata storage for lvm thin provisioning got full (no idea why the defaults are so low), and once because snapshots by snapper stopped working (i.e., they wouldn't rotate, eventually causing a disk-full situation) because an selinux policy update borked snapper.
I agree with the speed issues, but resilvering due to disk corruption is faster than a traditional RAID rebuild, which is left out, as is the single-redundancy ambiguity problem in RAID that raidz fixes. The checksumming is also fantastic in very specific cases, including where SATA checksumming is insufficient. The old example is a mechanical drive in a laptop which is reading or writing when the laptop gets moved. It kind of applies to mid-level datacenters in earthquake regions too, and also to any kind of shitty cheap laptop drive, and similar.
I also think the article underestimates the value of compression. It's insane for file formats to include it at all; it's absolutely a filesystem property. How much and how hard a file should be compressed isn't known at creation time; it's known by the server currently providing it to users, meaning precompressed files are always either too strongly compressed or not strongly enough, and often use custom shitty compression. It's also great for many kinds of research and engineering workloads where data does not have built-in compression, or which generate intermediate files.
File formats have compression for a very good reason - network transfer of these files. If compression was done only on the FS level, web servers would uncompress files when reading them and then immediately after that compressing them again to send them over the network to clients, thus wasting computing resources. It's better when large files (image and video files) are stored and transferred in the same compressed format.
The licensing and resulting encryption-performance (first time hearing about that TBH) situations are indeed unfortunate.
The one point that I feel is bordering on bad faith is complaining about the pitfalls of deduplication. Basically everyone (including the official docs) will tell you not to use it unless you're an expert with very specific reasons. It's not functionality present in any other filesystem I know of, and OP then agrees that it's not a needed or even desirable feature anyway. It's basically just present for backwards compatibility at this point.
> No disk checking tool
I don’t understand this complaint. Is this not what you get by running scrub with checksumming enabled?
> High memory requirements for ARC
This section is simply wrong or at least very misleading: the main motivation for ARC today is not that ZFS can’t use the page cache but that its own strategy can perform better and more optimally (if tuned properly). Things generally do not get double-cached in both page cache and ARC at all. The memory usage can be tuned with various parameters including min and max sizes (though I do wish these parameters could be set per pool as opposed to on a system level). This section too seems to be in bad faith as a reader who is not deeply familiar with ZFS already will have an incorrect understanding.
> causing long periods of unavailability until a workaround can be found.
Has it really? The one situation I’m aware of is the one they’re mentioning. Which was not experienced any differently than the usual expected delay for things to bubble down to distro repos.
> btrfs
Is simply not ready for casual users yet. It will hopefully get there before too long but at this point I find it irresponsible to recommend it for non-expert users and production workloads.
OpenZFS is not perfect and there are valid reasons not to use it. You don’t have to misrepresent it to find them.
Not mentioned there (only hinted at by "Phoronix benchmarks of ext4 vs zfs in 2019 show that ZFS does win some synthetic benchmarks but badly loses all real-world tests to ext4"): bad performance if you have a raid0 or raid10 on very fast NVMe drives.
For most of my usecases, I typically default to xfs over mdadm raid10 in f2 mode.
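A sketch of that setup; device names are placeholders for your NVMe drives. The `f2` (far, 2 copies) layout gives RAID1-level redundancy with near-RAID0 sequential read speed:

```shell
mdadm --create /dev/md0 --level=10 --layout=f2 \
      --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
mkfs.xfs /dev/md0
```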
Why not use an experimental filesystem with functionally zero real world usage that isn't in the upstream kernel and has exactly one full time developer working on it?
I love the ideas behind bcachefs but suggesting it's anywhere close to being ready for primetime is just silly.
To which, "it isn't stable or anything like prod-ready, and you still have to build it as an out-of-tree module" is a perfectly good answer. Hopefully, the answer will become "yes, bcachefs is the solution" soon, but we aren't there today.
I'm very much hoping this becomes the future of filesystems on linux, but it's very in flux at the moment. I wish some company would hire him and put a bunch of devs on it full time.
Can anyone chime in with their experience using dm-integrity? I looked into it a while back for home RAID and recall seeing some limitations that made me steer clear, possibly related to LUKS, or array expansion (I'm using mdadm RAID6 and I periodically grow my arrays). But that was a while ago, and I don't remember what the issues were. This time I promise to keep notes.
I've been using dm-integrity with mdadm RAID on top, and LVM on top of that. Once it's all initialized an working, it works great.
The biggest downsides I can think of are:
1. You have to do an initial initialization of checksums on the disk, which takes forever.
2. Automatically bringing up the dm-integrity layer of the plain disks at boot time is a bit of a dark art involving udev rules.
If you skip the checksum initialization step and add the device to the RAID immediately, it _mostly_ works, because disk blocks are rewritten by the resync/reshape, but then on reboot it fails because some part of the system tries to read an otherwise unused sector (probably to find metadata or a partition signature of some kind?), and then determines the device is unusable because of read errors (dm-integrity errors show up as read failures to the next layer in the stack).
Thanks for sharing your experience. Do you run it in bitmap mode? If not, how’s the perf hit feel (I’m using spinners)? Does it have any ability to correct errors, or does it just cause the drive to drop when they appear? (I hoped it would write the corrected value back during an mdadm array check.) Does it reduce drive capacity at all? If not, I’m thinking maybe I could convert my array to it, one drive at a time, verrrry slowly.
I'm running in bitmap mode, on spinning drives. It reduces the drive capacity, but not by a noticable amount.
I've set up my RAID not to have a "bad block list", so I _think_ it will try fixing it a few times, and drop the drive if there are too many errors in a row.
Should be easy enough to try with a "test" RAID made out of a bunch of files (or even USB sticks).
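A throwaway rig along those lines: loop devices over sparse files, each wrapped in dm-integrity, assembled into an md array. Paths and sizes are illustrative; the full `integritysetup format` (which wipes the device) is kept deliberately, since skipping it triggers exactly the unreadable-sector failure described upthread:

```shell
devs=""
for i in 0 1 2; do
  truncate -s 1G /tmp/disk$i.img
  dev=$(losetup -f --show /tmp/disk$i.img)
  integritysetup format -q "$dev"       # writes checksums for every sector
  integritysetup open "$dev" int$i
  devs="$devs /dev/mapper/int$i"
done
mdadm --create /dev/md9 --level=5 --raid-devices=3 $devs
```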
Out of interest: what is the state of btrfs compared to ZFS? Last time I checked, btrfs _seemed_ like a mess with many hard-to-reproduce subtle bugs and was basically a no-go for any usage requiring reliability. It looked bad to the point that I wondered whether it only still exists due to the sunk cost fallacy. But neither then nor now was I familiar with either btrfs or zfs, so I might have been wrong.
Whenever I look at zfs I become confused about how it all works. Predicting performance is incredibly difficult, and the rebuild times are horrendous.
That's coming from someone who has been using Ceph for a couple of years now. I know it's apples to oranges, but still. (For one, zfs is non-distributed, whereas Ceph is).
I've run ZFS on Linux, Solaris, and FreeBSD. While I like some of its features, I wouldn't run it again on anything that needs decent write speeds. The performance is really sub-par, and you need a lot of disks/memory to get over that hurdle.
Would Oracle relicensing it be able to change anything? OpenZFS and ZFS split over a decade ago, does the OpenZFS project have the right to relicense everything they've done in that time?
Yeah, Illumos and then OpenZFS very specifically avoided contributor license agreements after Oracle demonstrated the ultimate failure mode of that model (by closing OpenSolaris); if Oracle had had the sense to immediately re-license ZFS to GPLv2 or dual-licensed it could have worked, but it's a decade too late now. (Incidentally, I have never understood why they didn't do exactly this; ZFS was superior to anything Linux had at the time, and Oracle would have benefited enormously from shipping ZFS as a first-class feature on OEL. Of course, the obvious answer is... Oracle[0].)
>Incidentally, I have never understood why they didn't do exactly this; ZFS was superior to anything Linux had at the time, and Oracle would have benefited enormously from shipping ZFS as a first-class feature on OEL
I've always figured it's at least partially because someone was too stubborn to give up their work creating btrfs. Though relicensing it as proprietary is a very Oracle move.
It would change a lot - if ZFS got relicensed as GPL, users would lose the patent protection granted by CDDL, opening the possibility of being sued by Oracle.
I have heard arguments that Oracle would now be the License Steward and could change it to be whatever they liked (GPL 2.0, MIT, etc.) which would flow to the OpenZFS project.
Oracle can create CDDL-1.1 or any other newer version with different terms, but there's no "CDDL 1.0 or newer" license on OpenZFS, nor is Oracle sole copyright holder.
Another issue is that what is specifically a problem with GPLv2 is that CDDL-1.0 provides patent litigation indemnity, losing which would be a big problem.
Pretty much every license out there disclaims any responsibility.
CDDL has provisions that mean that Oracle can't sue you for (hypothetical) patent infringement in ZFS code because Sun gave automatic license to those patents when it released code under CDDL
This article is full of misinformation to the point that I'm not sure there's many redeeming points in it, and if there are, they're drowned out by wrong information:
> Out-of-tree and will never be mainlined
So the article is heavily Linux-focused. Fine. To an end user, having a driver that's not part of the mainline kernel sources is hardly a deal breaker. It still ends up running in kernel space with all the advantages that comes with.
>> Ubuntu ships ZFS as part of the kernel, not even as a separate loadable module. This redistribution of a combined CDDL/GPLv2 work is probably illegal.
Canonical lawyers have obviously disagreed with this assessment. So far so good.
>> Red Hat will not touch this with a bargepole.
Red Hat also doesn't touch btrfs. Or basically anything that's not ext4 and XFS.
>> You could consider trying the fuse ZFS instead of the in-kernel one at least, as a userspace program it is definitely not a combined work.
No, you really should not. zfs-fuse has not been maintained in over a decade, doesn't even remotely come close to supporting the features of modern ZFS, and frankly... it's FUSE. It's slow as molasses.
> Slow performance of encryption
>> ZoL did workaround the Linux symbol issue above by disabling all use of SIMD for encryption, reducing the performance versus an in-tree filesystem.
Only partially true, but the damage is limited to some metadata structures and the bulk of encryption code does use SIMD instructions (eg, the parts that encrypt your file data).
> Rigid
>> This RAID-X0 (stripe of mirrors) structure is rigid, you can’t do 0X (mirror of stripes) instead at all. You can’t stack vdevs in any other configuration.
Hard to see why it would be useful; hardly anyone ever chooses that configuration even in the non-ZFS world. It's the same capacity tradeoff via a different approach.
>> For argument’s sake, let’s assume most small installations would have a pool with only a single RAID-Z2 vdev.
Okay, not an item that can actually be refuted, but the idea that "most small installations" have only a single raidz2 vdev is a stretch. I'd wager there's a heck of a lot more single-disk and two-disk mirror configurations than all the other types combined.
> Can’t add/remove disks to a RAID
Everything here is accurate. Part of it is because of ZFS's original target audience, and part of it is genuinely hard math that hadn't been solved in a way you could actually pull off before the heat death of the universe. Mind that mdadm isn't exactly magic either: it will frequently refuse certain reshape operations on a RAID array, and its documentation isn't exactly upfront about which scenarios those are.
> RAIDZ is slow
raidz has to keep a full stripe consistent across all the disks in the vdev, so a raidz vdev delivers roughly the random-I/O performance of a single disk. This is a major reason people are generally recommended to use mirrors instead of raidz.
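To make the layouts concrete, here's what the two shapes look like on the command line. Device names and the pool name are hypothetical, and zpool create is destructive, so the commands are shown commented out:

```shell
#   zpool create tank mirror sda sdb mirror sdc sdd     # RAID10-style: a stripe of two mirrors
#   zpool create tank raidz2 sda sdb sdc sdd sde sdf    # a single six-disk raidz2 vdev
```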
> File-based RAID is slow
ZFS does not use file-based RAID, it uses block-based RAID. Yes, it knows what blocks are used and will only need to scrub/resilver those blocks.
>> Sequential read/write is a far more performant workload for both HDDs and SSDs.
Yes it is. Which is why ZFS 2.0 introduced sequential scrubs.
>> It’s especially bad if you have a lot of small files.
ZFS isn't file-based; it doesn't matter whether you have a single 2TB file or a million files. It's the same work either way.
> Real-world performance is slow
Comparing to ext4 isn't the most fair thing to do. ext4 is a dumb file system that will give you raw disk performance, every time. If this is the utmost priority, use ext4. ZFS adds compression, checksums, and redundancy to the mix. Extra protection means it's a bit slower.
> Performance degrades faster with low free space
>> It’s recommended to keep a ZFS volume below 80 - 85% usage and even on SSDs. This means you have to buy bigger drives to get the same usable size compared to other filesystems.
Basically every file system performs badly at high utilization, and hitting that point should be taken as a sign to either upgrade the storage or start deleting.
The threshold for where ZFS starts getting painful depends on which anecdote you listen to. I've heard from people running up to 95% utilization on an SSD without feeling the burn.
>> ZFS’s problem is on an entirely different level because it does not have a free-blocks bitmap at all.
ZFS has had the spacemap_histogram feature since 0.6.4 and the improved spacemap_v2 since 0.8.0. If the output of the "zpool list" command has a FRAG value other than a hyphen, your pool is using this feature already.
In order for ZFS to do what it does so well, it has to incorporate all these features that used to be in different layers. The reason that resilvering a ZFS pool is so much faster than the whole-disk RAID solutions of yore? Precisely because it knows exactly what blocks are in-use and what are not.
>> If you use ZFS’s volume management, you can’t have it manage your other drives using ext4, xfs, UFS, ntfs filesystems.
You can create volumes and store ext4, xfs, ufs, ntfs filesystems on top of ZFS just fine.
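For instance, a rough sketch: carve out a zvol and put ext4 on it. The pool name and size are made up, and the zfs/mkfs/mount commands need root and a real pool, so they're shown commented out:

```shell
# zvols show up as block devices under /dev/zvol/<pool>/<dataset>:
zvol_dev() { printf '/dev/zvol/%s' "$1"; }

# On a real system:
#   zfs create -V 20G tank/ext4vol           # a 20 GiB ZFS-backed block device
#   mkfs.ext4 "$(zvol_dev tank/ext4vol)"     # any filesystem you like on top
#   mount "$(zvol_dev tank/ext4vol)" /mnt/legacy

zvol_dev tank/ext4vol    # prints /dev/zvol/tank/ext4vol
```

The ext4 filesystem inside the zvol still benefits from ZFS checksums, snapshots, and replication underneath it.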
>> And likewise you can’t use ZFS’s filesystem with any other volume manager.
You can, but you probably shouldn't.
> Doesn’t support reflink
There's work in progress to add it, but nothing in a release version yet. That doesn't mean it never will have it.
> High memory requirements for dedupe
Indeed, and there are research projects aimed at a new dedup algorithm that drastically reduces the memory requirement. Don't use dedup unless you really, really need it.
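If you're unsure whether your data would even benefit, ZFS can estimate the dedup table before you turn anything on. It reads the whole pool, so it's slow, and "tank" is a placeholder name:

```shell
#   zdb -S tank    # simulates dedup: prints the would-be DDT histogram and ratio
```

If the simulated ratio comes back near 1.00x, dedup would cost you the RAM for nothing.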
> Dedupe is synchronous instead of asynchronous
Dedup is meant to work on live data so it does what you want immediately.
>> (By comparison, btrfs’s deduplication and Windows Server deduplication run as a background process, to reclaim space at off-peak times.)
Sometimes "off-peak times" don't exist, and this just highlights the limitations of the two mentioned technologies: they don't have an online dedup mode. I know at least for btrfs, you have to completely take the file system offline to do a dedup pass after-the-fact (and the end result is the same as the aforementioned reflink).
> High memory requirements for ARC
Basically a long rundown of Linux's own page cache fighting with the ARC. It's actually a fair point, but it's probably overblown. It's nowhere as dire as the section makes it out to be.
>> Even on FreeBSD where ZFS is supposedly better integrated, ZoL still pretends that every OS is Solaris via the Solaris Porting Layer (SPL) and doesn’t use their page cache neither. This design decision makes it a bad citizen on every OS.
FreeBSD's technical implementation of how they ported ZFS doesn't really matter, and this is the second time the article has said "supposedly better integrated" -- it's not supposed, it's literally as well-integrated as ZFS on Solaris is. (Guess what? Solaris still has UFS too; in that respect it's pretty much on par with FreeBSD.)
> Buggy
The flimsiest argument of them all :)
Bug trackers track bugs. Some of them aren't even bugs (such is the nature of user-submitted bug reports). It goes more to show the popularity and widespread use of ZFS than anything else.
I'd be far more concerned about a software project that has no bug reports on display at all.
> No disk checking tool (fsck)
>> Yikes.
There is, it's called scrubbing. See "man zpool-scrub" for details.
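And you don't have to wait for a scrub to know how your pool is doing: "zpool status -x" prints exactly "all pools are healthy" when nothing is wrong, so alerting can key off that one line. A minimal sketch (the pool name "tank" and the mail wiring are assumptions for illustration):

```shell
# Tiny health probe: reads `zpool status -x` output on stdin,
# succeeds only when the all-clear summary line is present.
pool_healthy() {
    grep -q "all pools are healthy"
}

# On a live system you'd wire it up roughly like this (commented out here,
# since it needs a real pool):
#   zpool scrub tank                             # kick off an integrity pass
#   zpool status -x | pool_healthy || echo "pool needs attention"
```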
>> In ZFS you can use zpool clear to roll back to the last good snapshot, that’s better than nothing.
"zpool clear" is an administrative command to wipe away error reports from storage devices. It should only be used when an administrator determines that the problem is not a bad disk.
ZFS pools can be made of many file systems with any snapshots you desire. There is no "the last good snapshot". Maybe zpool checkpoints are what the author is thinking about, but I doubt it.
>> merely rolling back to the last good snapshot as above does not verify the deduplication table (DDT) and this will cause all snapshots to be unmountable
That really should be impossible. I've never even heard of such a thing happening.
>> coupled with the above point (“Buggy”) if ZFS writes bad data to the disk or writes bad metaslabs, this is a showstopper
This is an error that is detected and provided as part of the "zpool status" command (and as mentioned, "zpool clear" can even clear the errors).
>> and so it should have an fsck.zfs tool that does more repair steps than just exit 0.
You could replace it with one that does a scrub, but that can take weeks on some pools :)
> Things to use instead
>> The baseline comparison should just be ext4. Maybe on mdadm.
If you think mdadm+ext4 is comparable to ZFS, you are waaaaay off. Even btrfs can't hold a candle to ZFS and it comes closer than mdadm+ext4.
Here in this section, the author comes across the term "scrubbing" but doesn't really apply it in the way ZFS uses it.
>> Compression is usually not worthwhile
Hard disagree :)
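The arithmetic behind the disagreement is simple. A back-of-envelope sketch, where the 1.8x ratio is an assumption — check your own with "zfs get compressratio <dataset>":

```shell
# If data compresses 1.8:1, every physical MB read from disk yields 1.8 MB of
# logical data, so a disk doing 200 MB/s physically delivers ~360 MB/s to the app.
effective_mbps() {
    # $1 = physical throughput in MB/s, $2 = compressratio
    awk -v p="$1" -v r="$2" 'BEGIN { printf "%.0f\n", p * r }'
}

effective_mbps 200 1.8    # prints 360
```

That's before counting the capacity savings, and lz4 decompression is generally far cheaper than the I/O it avoids.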
>> Checksumming is usually not worthwhile
Disks lie. All the time. The claims of "physical disk already has CRC checksums" was around over 20 years ago when the ZFS project started, and the fact that disks lie or do not do strong protection is a huge reason that ZFS was created in the first place. The problem in 2022 remains the same as it was in 2000.
> Summary
>> you can achieve all the same nice advanced features
You really can't. It's not even close. ZFS is so far ahead of the game, that even if some alternatives (eg, btrfs) offer a few similar features, they don't even approach it.
>> ZFS also has a lot of tuning parameters to set.
Having tuning parameters and requiring them are two different things.
>> In the future we’re waiting to see what stratis
stratis is a dead-on-arrival joke.
>> bcachefs
Probably the only thing that has a shot at competing with ZFS.
You seem very knowledgeable about this. I'm a hobbyist who runs debian with a 6 drive raidz2 array in the basement. Hardware aside, do you have any housekeeping suggestions that will help me keep it running well?
My approach is a login script that tells me the health of my zpool. My crontab has a "0 2 * * 0 /sbin/zpool scrub tank" and 6 variants of "0 2 2 * * /usr/sbin/smartctl --test=long /dev/sda > /dev/null 2>&1" (cron runs jobs with sh, so the bash-only "&>" redirection doesn't do what you'd expect there). I've learned to resilver a dead drive recently, it's on a UPS, and I have automated iterative backups elsewhere for critical data that I've practiced restoring from to verify my solution works. Never had much luck with email alerts unless I want to get all of crontab's emails sent to me.
Aside from the usual cronjobs to scrub my local and backup pools, I do a few extra things:
I'm using https://habilis.net/cronic/ to make sure I don't mess up the email notification part of the cronjob. It's a simple wrapper script that sends an email in a readable format if a cronjob fails.
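For reference, this is roughly what a cronic-wrapped crontab entry looks like (pool name and path are assumptions; cronic mails only when the job exits non-zero or writes to stderr):

```
# m  h  dom mon dow   command
0    2  *   *   0     cronic /sbin/zpool scrub tank
```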
I use Sanoid to create snapshots on my home server, and use Syncoid to push those to a cloud VPS with a beefy network drive as an off-site backup. Both tools are available here: https://github.com/jimsalterjrs/sanoid
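For flavor, a minimal sanoid.conf in the shape the project documents — the dataset name and retention counts here are made up, so adjust to taste:

```
[tank/home]
        use_template = production

[template_production]
        hourly = 24
        daily = 30
        monthly = 6
        autosnap = yes
        autoprune = yes
```

The matching off-site push is then a single (hypothetical) command for cron, e.g. "syncoid tank/home user@vps:backup/home".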
The free tier of https://cronitor.io/ makes sure I'm alerted if a cronjob fails, or fails to run on time. Especially that last bit is interesting: that way I'm sure cronjobs aren't silently failing for days/weeks on end.
I have 4 monitors set up in Cronitor: snapshot creation, zpool status on the local and backup machine, and send/receive with Syncoid. This is how that looks on the Cronitor dashboard: https://img.marceldegraaf.net/v6IpNAyxrZ54vgLqIJpY
Let me know if you want more info or examples, happy to share whatever I can to help :-)
EDIT: feel free to reach out via email as well, my address is in my profile.
Thanks, I hadn't heard of zed. More reliable solution than catching a fault with a login script. I was pretty happy when I found msmtp and imagine I can make them work together.
>Even btrfs can't hold a candle to ZFS and it comes closer than mdadm+ext4.
Could you elaborate a little bit on this point. I've used both btrfs and zfs and they have both been fine, but I'm not using them on a large enough scale to see problems.
It's not so much a problem of scale (though it could be, once we move beyond 16EiB of storage...): it's primarily about the features and functionality that ZFS offers, and the vertical integration of the entire storage stack helps that. Snapshots that make sense, df that works, no surprise out-of-space error conditions, all the fundamental storage layers working as expected, more redundancy types than btrfs can ever dream of (mirror, raidz[123], draid[123]), log and hot-spare devices, all administrative details kept inside the pool where properties can inherit, actual block devices without breaking CoW, per-dataset compression policies, quotas/reservations, case sensitivity, optional UTF-8 enforcement with optional normalization...
The list goes on and on. btrfs makes an attempt to compete with some of it, but after 14 years of development it's still not even close to ZFS's first public release (which itself came only 4-5 years after internal development at Sun began).