
I have a small home NAS with 3x 8TB drives in ZFS RAID-Z. Everything was fine until one day my son complained that some video files were pausing in the middle when they used to play fine.

I checked zpool status and saw that the pool had been running on just 2 disks for probably a month due to write errors on one disk. Of the two remaining disks, one was having occasional errors that ZFS was correcting.

I pulled the bad drive, zeroed it out three times, and reinserted it. ZFS performed a resilver. After it was done, I pulled the other drive, zeroed it, and added it back.

SMART and ZFS are no longer reporting errors on any of the three drives, and I only lost partial data on 4 files (which were replaceable).

Overall, I was surprised how resilient ZFS was and how easy it was to replace 2 drives in a 3-drive array with minimal data loss.



Are you scrubbing once a month? If not this will happen again. Linus Media lost 100’s of TB that way.

Seagate drives by chance?


> Are you scrubbing once a month? If not this will happen again. Linus Media lost 100’s of TB that way.

Anybody running 100 TB+ ZFS arrays should really have been aware of this class of problem before deploying (and doubly so if using consumer drives). ZFS protects from bitrot and hardware failures... but it's not magic. If data is written once and not read again for a long time, you won't know it's been sitting on bad sectors until it's too late, and you may end up with multiple failing drives at once if you lose a disk and try to resilver. Far better to check periodically so you can throw out bad disks early.

Hopefully they had a good backup strategy!


Watched the video where Linus talks about it. Apparently they store all of the footage as a nice-to-have, and to have a use case to build their large-storage content around.

All that to say: yes, they lost all that data, and no, it wasn't backed up. It's not critical to the business.


I have a 100TB+ ZFS-based NAS and have a poor understanding of this.

I have yet to see anyone make a really good set of resources, to either watch or read, that are not far too technically deep.


This is tricky, because it's a class of problem that you need to understand exists in order to seek out the relevant information. Once you know what to look for, it's easy to find guidance - even from Oracle themselves [0]

A lot of us have "learned the hard way," as it sounds like Linus himself eventually did. I think this highlights an issue with the "learn through youtube video" approach. An internet celebrity may acquire enough knowledge to do accessible demonstrations, presenting totally valid, useful, and correct information in them, and still miss crucial "unknown unknowns" that they simply hadn't encountered in their own research.

It's hard to know what to recommend for a class of issue you aren't even aware of!

[0] https://blogs.oracle.com/oracle-systems/post/disk-scrub-why-...


Can someone explain this? I find it a bit scary that you have to do monthly manual work on your NAS or you'll lose data.

Edit: wait, I think I get it... "Scrubbing simply needs to read every file from the disk so the RAID layer notices and repairs a URE". So it's to avoid bit rot? I have to say, as someone who has a few TB of personal data on a NAS, bit rot is a bit scary. Data backup is mainly why I bought the lifetime pCloud+encryption package during the last Black Friday. I wonder how (if?) they avoid bit rot?
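
The quoted description of scrubbing can be illustrated with a toy model. This is only a sketch under simplifying assumptions: two mirrored "disks" held as Python lists and SHA-256 checksums kept separately. The names `checksum`, `scrub`, `disk_a`, and `disk_b` are made up for illustration, and real ZFS works on on-disk blocks and rebuilds from mirrors or RAID-Z parity rather than from in-memory copies.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum recorded when a block is written (toy stand-in)."""
    return hashlib.sha256(data).hexdigest()


def scrub(copies: list[list[bytes]], checksums: list[str]) -> int:
    """Read every block on every copy and repair bad ones from a good copy.

    Returns the number of repaired blocks. Raises OSError if a block has
    no intact copy left, which is the "too late" case.
    """
    repaired = 0
    for i, expected in enumerate(checksums):
        # Find any copy of block i that still matches its checksum.
        good = next((c[i] for c in copies if checksum(c[i]) == expected), None)
        if good is None:
            raise OSError(f"block {i}: no intact copy left, data is lost")
        # Overwrite every copy that fails verification.
        for c in copies:
            if checksum(c[i]) != expected:
                c[i] = good
                repaired += 1
    return repaired


blocks = [b"video-part-1", b"video-part-2"]
sums = [checksum(b) for b in blocks]
disk_a = list(blocks)
disk_b = list(blocks)
disk_b[1] = b"video-part-X"  # silent corruption on one disk

print(scrub([disk_a, disk_b], sums))  # prints 1
```

The point of running this regularly is the `good is None` branch: corruption on one copy is repairable only while another copy is still intact.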


That is why I think consumer NAS is an unsolved problem. I have no need for VPN, cloud photos, a mail server, etc. I need simple, reliable network storage that protects against drive failure and bit rot. Right now you need to spend at least $400 on a NAS and do lots of configuration before getting there.


> Edit: wait, I think I get it... "Scrubbing simply needs to read every file from the disk so the RAID layer notices and repairs a URE".

If you don't know about ZFS scrubbing, but are using ZFS, you may wish to spend some time researching ZFS some more.

> I find it a bit scary that you have to do monthly manual work on your NAS or you'll lose data.

Most distros / ZFS packages set up a cron job and e-mail out any errors. You do receive error reports from your NAS, right?


Just add zpool scrub to your crontab.
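
A minimal sketch of what that crontab entry could look like, generated from Python so the fields are spelled out. The pool name `tank`, the `/sbin/zpool` path, and the 03:00-on-the-1st schedule are assumptions for illustration, and `scrub_cron_line` is a made-up helper; check `which zpool` and pick a quiet time on your own system.

```python
def scrub_cron_line(pool: str, day_of_month: int = 1, hour: int = 3,
                    zpool_path: str = "/sbin/zpool") -> str:
    """Return a crontab line that starts a scrub of `pool` once a month."""
    if not 1 <= day_of_month <= 28:
        raise ValueError("pick a day between 1 and 28 so it exists in every month")
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    # crontab fields: minute hour day-of-month month day-of-week command
    return f"0 {hour} {day_of_month} * * {zpool_path} scrub {pool}"


print(scrub_cron_line("tank"))  # prints: 0 3 1 * * /sbin/zpool scrub tank
```

The printed line goes into root's crontab (`crontab -e` as root). Note that `zpool scrub` only starts the scrub and returns; the results show up later in `zpool status`.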


Make sure the error output of your cron jobs is sent out to a mailbox that you read.
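
One way to get that, sketched under assumptions: a small check script that stays silent when everything is fine and prints only when something is wrong, so cron's usual mail-on-output behaviour sends a message only when there is a problem. It relies on `zpool status -x` printing "all pools are healthy" when nothing needs attention; the function names here are made up for illustration.

```python
import subprocess
import sys

HEALTHY_MESSAGE = "all pools are healthy"


def pools_need_attention(status_output: str) -> bool:
    """True if `zpool status -x` output reports anything but healthy pools."""
    return HEALTHY_MESSAGE not in status_output


def check(run=subprocess.run) -> int:
    """Run `zpool status -x`; print details and return 1 if there is a problem."""
    result = run(["zpool", "status", "-x"], capture_output=True, text=True)
    if result.returncode != 0 or pools_need_attention(result.stdout):
        # Any output from a cron job is what gets mailed out.
        print(result.stdout + result.stderr, file=sys.stderr)
        return 1
    return 0
```

To use it from cron, call `check()` from a script run daily and set `MAILTO` in the crontab to an address you actually read.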


I recommend receiving events from zed, because scrubs are not the only thing that detects errors. It also supports Pushbullet.


> So it's to avoid bit rot?

Yes, and it's also trivial to automate in more or less all situations.

The gotcha is that you have to do it/know about it...


I know that TrueNAS/FreeNAS add a scrub by default. Possibly ZFS on Linux does now too.

LinusTechTips recently did a video about how they installed ZFS on Linux and didn't have a scrub cron. They started to lose data before noticing.

https://youtu.be/Npu7jkJk5nM


It's less about Linux and more about whether or not you have a database/NAS-centric distribution, I think.

The problem with installing a cron job by default when installing ZFS is that for a general-purpose OS there isn't a good default for when and how often to run it. And running it at the wrong time might even be a major problem.

Though tbh, having a bad default is probably still better than no default in this case.

> lose data before noticing

That's a bit of an overstatement, as they didn't look for quite a while. They not only failed to do scrubbing, they also failed to set up automated health checks and reporting.

Turning a non-NAS-focused Linux distribution into a well-working, tuned NAS isn't easy (compared to using a good NAS OS/distribution), but making it somewhat work is easy. That makes this a pretty common mistake for non-specialists (as in their case).


I believe I got default cron files when I installed ZFS on Ubuntu 21.04. It could be that they were created and I had to uncomment one line in a file. Then a scrub on the pools would run once a month. Then I set up email on the server, and every time a scrub finishes with any errors, I get an email.

Quite easy for me to set up, even though it's the first NAS I built myself and my first time using ZFS. Very surprised that LTT effed that up, to be honest.


Scrubbing was turned on by default and the scrubbing was catching checksum errors.

And yes, Seagate SMR drives (obviously not ideal but I'm cheap)


That is quite an odd failure. Have you by any chance done some research on why it happened?



