Looking for thoughts/opinions

I have a 5 disc raidz1 array. The volumes are accumulating CKSUM errors - fairly evenly distributed over the discs. I’ve been lazy and let this progress to the point where there are permanent errors in files.

# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 748K in 06:17:19 with 1 errors on Sun Jul 14 06:41:22 2024
config:

        NAME                                 STATE     READ WRITE CKSUM
        tank                                 ONLINE       0     0     0
          raidz1-0                           ONLINE       0     0     0
            ata-ST8000VN004-2M2101_WSD13YBW  ONLINE       0     0     6
            ata-ST8000VN004-2M2101_WSD13YE4  ONLINE       0     0     7
            ata-ST8000VN004-2M2101_WSD1454G  ONLINE       0     0     8
            ata-ST8000VN004-2M2101_WSD1454W  ONLINE       0     0     6
            ata-ST8000VN004-2M2101_WSD14563  ONLINE       0     0     7

errors: Permanent errors have been detected in the following files:

        /you/do/not/need/this/level of detail.txt

I’ve done some research and believe (hope) that the cause of these errors is the “domestic” onboard SATA controllers I’m using and I have ordered a LSI SAS3008 9300-8i HBA as an upgrade.

I know I can fix the permanent error by deleting and restoring it and then running a scrub. But, I’m torn - should I scrub now and risk stressing it more on the crappy SATA controllers, or wait until I get the new HBA (in a few weeks - free cheap, slow, shipping)?

  • spitfire@lemmy.world
    link
    fedilink
    English
    arrow-up
    0
    ·
    3 months ago

    I’d shut it down before it corrupts even more, replace HBA when it arrives and run a scrub to see what’s the damage

    • Great Blue Heron@lemmy.caOP
      link
      fedilink
      English
      arrow-up
      0
      ·
      3 months ago

      I know that’s the correct response. But, it’s been running like this for many months, maybe even years - as I said in the post, I’ve been lazy. There’s nothing on it that can’t easily be restored, or replaced, and shutting it down would be a PITA.

  • Max-P@lemmy.max-p.me
    link
    fedilink
    English
    arrow-up
    0
    ·
    edit-2
    3 months ago

    I have the same issue. For what it’s worth it’s still running just fine after 3 years apart from the occasional corrupted file after scrub, which thankfully that pool is mostly games and media I can just redownload. Error rate is always the same, and a corrupted file when the controller fucks up. Weirdly my SSD pool on the same controller seems fine, but it also completes scrub in less than an hour vs the HDD array.

    You’ll be fine waiting a week if you want to be sure.

    I wish there was an option to retry a few times instead of giving up, as the controller will give the correct data if tried again. It seems to happen when the controller is under heavy load for an extended period of time (ie 18h of scrubbing), it only does it close to the end.

    I’ve seen some tunables to make the scrub slower, it might help reduce the strain enough to not cause the errors.