I have a small home NAS with 3x 8TB drives in ZFS RAID-Z. Everything was fine until one day my son complained that some video files were pausing in the middle when they used to play without problems.
I checked zpool status and saw that the system had been running on just 2 disks for probably a month due to write errors on one disk. Of the two remaining, another was having occasional errors that ZFS was correcting.
I pulled the bad drive, zeroed it out three times, and reinserted it. ZFS performed a resilver. After it was done, I pulled the other drive, zeroed it, and added it back.
SMART and ZFS are no longer reporting errors on any of the three drives, and I only lost partial data on 4 files (which were replaceable).
Overall, I was surprised how resilient ZFS was and how easy the process was to replace 2 drives in a 3 drive array with minimal data loss.
> Are you scrubbing once a month? If not this will happen again. Linus Media lost 100’s of TB that way.
Anybody running 100 TB+ ZFS arrays should really have been aware of this class of problem before deploying (and doubly so if using consumer drives). ZFS protects from bitrot and hardware failures... but it's not magic. If data is written once and not read again for a long time, you won't know it's been sitting on bad sectors until it's too late, and you may end up with multiple failing drives at once if you lose a disk and try to resilver. Far better to check periodically so you can throw bad disks early.
Watched the video where Linus talks about it. Apparently they store all of the footage as a nice-to-have, and to have a use case to build their large-storage content around.
All that to say: yes, they lost all that data, and no, it wasn't backed up. It wasn't critical to the business.
This is tricky, because it's a class of problem that you need to understand exists in order to seek out the relevant information. Once you know what to look for, it's easy to find guidance - even from Oracle themselves [0]
A lot of us have "learned the hard way," as it sounds like Linus himself eventually did. I think this highlights an issue with the "learn through youtube video" approach. An internet celebrity may acquire enough knowledge to do accessible demonstrations, presenting totally valid, useful, and correct information in them, and still miss crucial "unknown unknowns" that they simply hadn't encountered in their own research.
It's hard to know what to recommend for a class of issue you aren't even aware of!
Can someone explain this? I find it a bit scary that you have to do monthly manual work on your NAS or you'll lose data.
Edit: wait, I think I get it... "Scrubbing simply needs to read every file from the disk so the RAID layer notices and repairs a URE". So it's to avoid bit rot? I have to say, as someone who has a few TB of personal data on a NAS, bit rot is a bit scary. Data backup is mainly why I bought the lifetime pCloud+encryption package during the last Black Friday. I wonder how (if?) they avoid bit rot?
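To make the scrub idea concrete, here is a toy sketch in Python. It is not how ZFS is implemented: it assumes a simple two-way mirror rather than RAID-Z parity, and the names `checksum` and `scrub` are made up for illustration. The principle is the same, though: read every block, compare it against a stored checksum, and rewrite any bad copy from a good one while a good one still exists.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Checksum stored alongside each block (SHA-256 here, purely illustrative)."""
    return hashlib.sha256(data).hexdigest()

def scrub(disks, expected_sums):
    """Read every block on every disk and repair bad copies from a good one.

    disks: list of lists of bytes, one inner list per mirrored disk.
    expected_sums: the checksum recorded for each block at write time.
    Returns (repaired, unrecoverable) as lists of block indexes.
    """
    repaired, unrecoverable = [], []
    for i, expected in enumerate(expected_sums):
        # Find any copy of this block that still matches its checksum.
        good = next((d[i] for d in disks if checksum(d[i]) == expected), None)
        if good is None:
            # Every copy is bad: nothing left to repair from.
            unrecoverable.append(i)
            continue
        fixed = False
        for d in disks:
            if checksum(d[i]) != expected:
                d[i] = good  # overwrite the rotten copy with the good one
                fixed = True
        if fixed:
            repaired.append(i)
    return repaired, unrecoverable

blocks = [b"video-part-0", b"video-part-1", b"video-part-2"]
sums = [checksum(b) for b in blocks]
disk_a, disk_b = list(blocks), list(blocks)
disk_a[1] = b"video-pXrt-1"   # silent corruption on one disk only
print(scrub([disk_a, disk_b], sums))
```

If the second copy of a block also goes bad before anyone reads it, that block lands in the unrecoverable list. That is the whole argument for scrubbing on a schedule instead of waiting until a resilver forces every block to be read.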
That is why I think consumer NAS is an unsolved problem. I have no need for VPN, cloud photos, a mail server, etc. I need simple, reliable network storage that protects against drive failure and bit rot. Right now you need to spend at least $400 on a NAS and do lots of configuration to get that.
It's less about Linux and more about whether or not you have a database/NAS-centric distribution, I think.
The problem with installing a cron job by default when installing ZFS is that for a general-purpose OS there is no good default for when and how often to run it. And running it at the wrong time might even be a major problem.
Though, tbh, having a bad default is probably still better than no default in this case.
> lose data before noticing
Is a bit of an overstatement, as they didn't look for quite a while.
Not only did they fail to scrub, they also failed to set up
automated health checks and reporting.
Turning a non-NAS-focused Linux distribution into a well-working and tuned NAS isn't easy (compared to using a good NAS OS/distribution), but making it somewhat work is easy. Which makes this a pretty common mistake for non-specialized people (as in their case).
I believe I got default cron files when I installed ZFS on Ubuntu 21.04. It could be that they were created and I had to uncomment one line in a file. Then a scrub on the pools would run once a month. Then I set up email on the server, and every time a scrub finishes with any errors, I get an email.
Quite easy for me to set up, even though it's my first NAS that I built myself and my first time using ZFS. Very surprised that LTT effed that up, to be honest.
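For anyone who wants a health check on top of the scheduled scrub, a minimal sketch in Python might look like the following. It relies on `zpool status -x`, which prints only pools with problems. The assumption to verify on your own system is that it prints exactly "all pools are healthy" when there is nothing to report. The function names are made up for illustration, and this is not the script Ubuntu ships.

```python
import subprocess
from typing import Optional

# Assumed output of `zpool status -x` when nothing is wrong; check your version.
HEALTHY_MESSAGE = "all pools are healthy"

def pools_need_attention(status_output: str) -> bool:
    """True if the output is anything other than the healthy message."""
    return status_output.strip() != HEALTHY_MESSAGE

def check_pools() -> Optional[str]:
    """Run `zpool status -x`; return its output if a pool needs attention."""
    result = subprocess.run(
        ["zpool", "status", "-x"],
        capture_output=True,
        text=True,
        check=False,  # inspect the text ourselves instead of raising
    )
    output = result.stdout + result.stderr
    return output if pools_need_attention(output) else None
```

Calling `check_pools()` from a small script run daily by cron, and printing whatever it returns, is enough for cron to mail the report if mail delivery is configured on the box. Anything unexpected (including an error from `zpool` itself) counts as "needs attention", so the check fails loudly rather than silently.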