Discussion:
gvinum raid5 vs. ZFS raidz
Scott Bennett
2014-07-29 08:27:12 UTC
I want to set up a couple of software-based RAID devices across
identically sized partitions on several disks. At first I thought that
gvinum's raid5 would be the way to go, but now that I have finally found
and read some information about raidz, I am unsure which to choose. My
current, and possibly wrong, understanding about the two methods' most
important features (to me, at least) can be summarized as follows.

raid5:
    Has parity checking, but any parity errors identified are assumed to
    be errors in the parity blocks themselves, so if errors occur in data
    blocks, the errors can be detected, but not unless a "checkparity"
    operation is done.  Errors in parity blocks can be fixed by a
    "rebuildparity" operation, but a "rebuildparity" will merely cement
    errors in data blocks by creating parity blocks to match/confirm the
    erroneous data.  (This also appears to be the case for graid3
    devices.)
raidz:
    Has parity checking *and* frequently spaced checksums that are
    checked when data blocks are read, so errors are detected relatively
    quickly.  Checksums enable identification of where the error exists
    and automatic repair of erroneous bytes in either data blocks or
    parity blocks.

raid5:
    Can be expanded by the addition of more spindles via a "gvinum grow"
    operation.
raidz:
    Can only be expanded by replacing all components with larger
    components.  The number of component devices cannot be changed, so
    the percentage of space tied up in parity cannot be changed.

raid5:
    Does not support migration to any other RAID levels or their
    equivalents.  (N.B. The exception to this limitation seems to be to
    create a mirror of a raid5 device, effectively migrating to a
    RAID5+1 configuration.)
raidz:
    Does not support migration between raidz levels, even by adding
    drives to support the increased space required for parity blocks.

raid5:
    Does not support additional parity dimensions a la RAID6.
raidz:
    Supports one (raidz2) or two (raidz3) additional parity dimensions
    if one or two extra components are designated for that purpose when
    the raidz device is created.

raid5:
    Fast performance because each block is on a separate spindle from
    the previous and next blocks.
raidz:
    Slower performance because each block is spread across all spindles
    a la RAID3, so many simultaneous I/O operations are required for
    each block.
-----------------------
I hoped to start with a minimal number of components and eventually
add more components to increase the space available in the raid5 or raidz
devices. Increasing their sizes that way would also increase the total
percentage of space in the devices devoted to data rather than parity, as
well as improving the performance enhancement of the striping. For various
reasons, having to replace all component spindles with larger-capacity
components is not a viable method of increasing the size of the raid5 or
raidz devices in my case. That would appear to rule out raidz.
OTOH, the very large-capacity drives available in the last two or
three years appear not to be very reliable(*) compared to older drives of
1 TB or smaller capacities. gvinum's raid5 appears not to offer good
protection against, nor any repair of, damaged data blocks.
Ideally, one ought to be able to create a minimal device with the
space equivalent of one device devoted to parity, whose space and/or
dimensions of parity could be increased later by the addition of more
spindles. Given that there appears to be no software available under
FreeBSD to support that ideal, I am currently stumped as to which available
way to go. I would appreciate anyone with actual experience with gvinum's
raid5 or ZFS raidz (preferably with both) correcting any errors in my
understanding as described above and also offering suggestions as to how
best to resolve my quandary. Thanks to three failed external drives and
apparently not fully reliable replacements, compounded by a bad ports
update two or three months ago, I have no functioning X11 and no space
set up any longer in which to build ports to fix the X11 problem, so I
really want to get the disk situation settled ASAP. Trying to keep track
of everything using only syscons and window(1) is wearing my patience
awfully thin.

(*) [Last year I got two defective 3 TB drives in a row from Seagate.
I ended up settling for a 2 TB Seagate that is still running fine AFAIK.
While that process was going on, I bought three 2 TB Seagate drives in
external cases with USB 3.0 interfaces, two of which failed outright
after about 12 months and have been replaced with two refurbished drives
under warranty. While waiting for those replacements to arrive, I bought
a 2 TB Samsung drive in an external case with a USB 3.0 interface. I
discovered by chance that copying very large files to these drives is an
error-prone process. A roughly 1.1 TB file on the one surviving external
Seagate drive from last year's purchase of three, when copied to the
Samsung drive, showed no I/O errors during the copy operation. However,
a comparison check using "cmp -l -z originalfile copyoforiginal" shows
quite a few places where the contents don't match. The same procedure
applied to one of the refurbished Seagates gives similar results, although
the locations and numbers of differing bytes are different from those on
the Samsung drive. The same procedure applied to the other refurbished
drive resulted in a good copy the first time, but a later repetition ended
up with a copied file that differed from the original by a single bit in
each of two widely separated places in the files. These problems have
raised the priority of a self-healing RAID device in my mind.
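A minimal sketch of that copy-and-verify procedure (the paths below are
only placeholders, not my actual mount points):

    # copy, then compare byte-by-byte and by whole-file checksum
    cp /mnt/seagate/bigfile /mnt/samsung/bigfile
    cmp -l -z /mnt/seagate/bigfile /mnt/samsung/bigfile  # lists each differing byte
    sha256 /mnt/seagate/bigfile /mnt/samsung/bigfile     # quick second opinion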
I have to say that these are new experiences to me. The disk drives,
controllers, etc. that I grew up with all had parity checking in the hardware,
including the data encoded on the disks, so single-bit errors anywhere in
the process showed up as hardware I/O errors instantly. If the errors were
not eliminated during a limited number of retries, they ended up as permanent
I/O errors that a human would have to resolve at some point.
FWIW, I also discovered that I cannot run two such multi-hour-long
copy operations in parallel using two separate pairs of drives. Running
them together seems to go okay for a while, but eventually always results
in a panic. This is on 9.2-STABLE (r264339). I know that that is not up
to date, but I can't do anything about that until my disk hardware situation
is settled.]

Thanks in advance for any help, information, advice, etc. I'm running
out of hair to tear out at this point. :-(

Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet: bennett at sdf.org *xor* bennett at freeshell.org *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good *
* objection to the introduction of that bane of all free governments *
* -- a standing army." *
* -- Gov. John Hancock, New York Journal, 28 January 1790 *
**********************************************************************
Paul Kraus
2014-07-29 16:01:36 UTC
Post by Scott Bennett
I want to set up a couple of software-based RAID devices across
identically sized partitions on several disks. At first I thought that
gvinum's raid5 would be the way to go, but now that I have finally found
and read some information about raidz, I am unsure which to choose. My
current, and possibly wrong, understanding about the two methods' most
important features (to me, at least) can be summarized as follows.
Disclaimer, I have experience with ZFS but not your other alternative.
Post by Scott Bennett
raid5: Has parity checking, but any parity errors identified are
assumed to be ...
raidz: Has parity checking *and* frequently spaced checksums ...
ZFS checksums all data for errors. If there is redundancy (mirror, raid, copies > 1) ZFS will transparently repair damaged data (but increment the “checksum” error count so you can know via the zpool status command that you *are* hitting errors).
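A minimal sketch of driving a scrub and reading those error counters (the pool name "tank" is just a placeholder):

    zpool scrub tank       # walk every block, verify checksums, repair from redundancy
    zpool status -v tank   # per-device READ/WRITE/CKSUM counters and any affected files
    zpool clear tank       # reset the counters once the underlying cause is fixed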

<snip>
Post by Scott Bennett
raid5: Can be expanded by the addition of more spindles via a
"gvinum grow" operation.
raidz: Can only be expanded by replacing all components with larger
components.  The number ...
All ZFS devices are derived from what are called top level vdevs (virtual devices). The data is striped across all of the top level vdevs. Each vdev may be composed of a single drive, mirror, or raid (z1, z2, or z3). So you can create a mixed zpool (not recommended for a variety of reasons) with a different type of vdev for each vdev. The way to expand any ZFS zpool is to add additional vdevs (beyond replacing all drives in a single vdev and then growing to fill the new drives). So you can create a zpool with one raidz1 vdev and then later add a second raidz1 vdev. Or more commonly, start with a mirror vdev and then add a second, third, fourth (etc.) mirror vdev.

It is this two tier structure that is one of ZFSes strengths. It is also a feature that is not well understood.
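A minimal sketch of that growth path (pool and device names are placeholders only):

    # start with a single raidz1 vdev ...
    zpool create tank raidz1 da0 da1 da2
    # ... and grow the pool later by striping in a second raidz1 vdev
    zpool add tank raidz1 da3 da4 da5
    # the same idea with mirror vdevs:
    #   zpool create tank mirror da0 da1
    #   zpool add tank mirror da2 da3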

<snip>
Post by Scott Bennett
raid5: Does not support migration to any other RAID levels or their
equivalents.
raidz: Does not support migration between raidz levels, even by ...
Correct. Once you have created a vdev, that vdev must remain the same type. You can add mirrors to a mirror vdev, but you cannot add drives or change raid level to raidz1, raidz2, or raidz3 vdevs.

<snip>
Post by Scott Bennett
raid5: Does not support additional parity dimensions a la RAID6.
raidz: Supports one (raidz2) or two (raidz3) additional parity ...
ZFS parity is handled slightly differently than for traditional raid-5 (as well as the striping of data / parity blocks). So you cannot just count on losing 1, 2, or 3 drives worth of space to parity. See Matt Ahrens' blog entry here http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for (probably) more data on this than you want :-) And here https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674 is his spreadsheet that relates space lost due to parity to number of drives in a raidz vdev and data block size (yes, the amount of space lost to parity varies with data block size, not configured filesystem block size!). There is a separate tab for each of RAIDz1, RAIDz2, and RAIDz3.

<snip>
Post by Scott Bennett
raid5: Fast performance because each block is on a separate spindle
from the previous and next blocks.
raidz: Slower performance because each block is spread across all
spindles a la RAID3, so many simultaneous I/O operations are required
for each block.
ZFS performance is never that simple, as I/O is requested from the drives in parallel. Unless you are saturating the controller you should be able to keep all the drives busy at once. Also note that ZFS does NOT suffer the RAID-5 read-modify-write penalty on writes, as every write is a new write to disk (there is no modification of existing disk blocks); this is referred to as being Copy On Write (COW).
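One way to see that parallelism for yourself (pool name is a placeholder) is to watch per-vdev activity while a large read or write is running:

    zpool iostat -v tank 1   # per-vdev and per-device ops/bandwidth, refreshed every second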
Post by Scott Bennett
-----------------------
I hoped to start with a minimal number of components and eventually
add more components to increase the space available in the raid5 or raidz
devices. Increasing their sizes that way would also increase the total
percentage of space in the devices devoted to data rather than parity, as
well as improving the performance enhancement of the striping. For various
reasons, having to replace all component spindles with larger-capacity
components is not a viable method of increasing the size of the raid5 or
raidz devices in my case. That would appear to rule out raidz.
Yup.
Post by Scott Bennett
OTOH, the very large-capacity drives available in the last two or
three years appear not to be very reliable(*) compared to older drives of
1 TB or smaller capacities. gvinum's raid5 appears not to offer good
protection against, nor any repair of, damaged data blocks.
Yup. Unless you use ZFS plan on suffering silent data corruption due to the uncorrectable (and undetectable by the drive) error rate off of large drives. All drives suffer uncorrectable errors, read errors that the drive itself does not realize are errors. With traditional filesystems this bad data is returned to the OS and in some cases may cause a filesystem panic and in others just bad data returned to the application. This is one of the HUGE benefits of ZFS, it catches those errors.
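For routine checking, FreeBSD's periodic(8) framework ships a stock ZFS scrub job; a sketch of enabling it in /etc/periodic.conf (the threshold value is only an example):

    # /etc/periodic.conf
    daily_scrub_zfs_enable="YES"
    daily_scrub_zfs_default_threshold="30"   # days between scrubs of each pool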

<snip>
Post by Scott Bennett
Thanks to three failed external drives and
apparently not fully reliable replacements, compounded by a bad ports
update two or three months ago, I have no functioning X11 and no space
set up any longer in which to build ports to fix the X11 problem, so I
really want to get the disk situation settled ASAP. Trying to keep track
of everything using only syscons and window(1) is wearing my patience
awfully thin.
My home server is ZFS only and I have 2 drives mirrored for the OS and 5 drives in a raidz2 for data with one hot spare. I have suffered 3 drive failures (all Seagate), two of which took the four drives in my external enclosure offline (damn sata port multipliers). I have had NO data loss or corruption!

I started like you, wanting to have some drives and add more later. I started with a pair of 1TB drives mirrored, then added a second pair to double my capacity. The problem with 2-way mirrors is that the MTTDL (Mean Time To Data Loss) is much lower than with RAIDz2, with similar cost in spec for a 4 disk configuration. After I had a drive fail in the mirror configuration, I ordered a replacement and crossed my fingers that the other half to *that* mirror would not fail (the pairs of drives in the mirrors were the same make / model bought at the same time … not a good bet for reliability). When I got the replacement drive(s) I took some time and rebuilt my configuration to better handle growth and reliability by going from a 4 disk 2-way mirror configuration to a 5 disk RAIDz2. I went from net about 2TB to net about 3TB capacity and a hot spare.

If being able to easily grow capacity is the primary goal I would go with a 2-way mirror configuration and always include a hot spare (so that *when* a drive fails it immediately starts resilvering (the ZFS term for syncing) the vdev). Then you can simply add pairs of drives to add capacity. Just make sure that the hot spare is at least as large as the largest drive in use. When you buy drives, always buy from as many different manufacturers and models as you can. I just bought four 2TB drives for my backup server. One is a WD, the other 3 are HGST, but they are four different model drives, so that they did not come off the same production line in the same week as each other. If I could have I would have gotten four different manufacturers. I also only buy server class (rated for 24x7 operation with 5 year warranty) drives. The additional cost has been offset by the savings due to being able to have a failed drive replaced under warranty.
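A minimal sketch of that layout (pool and device names are placeholders only):

    # two-way mirror plus a hot spare; capacity grows by adding more mirror pairs
    zpool create tank mirror da0 da1 spare da2
    zpool add tank mirror da3 da4
    zpool status tank   # the spare is listed under its own "spares" section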
Post by Scott Bennett
(*) [Last year I got two defective 3 TB drives in a row from Seagate.
Wow, the only time I have seen that kind of failure rate was buying from Newegg when they were packing them badly.
Post by Scott Bennett
I ended up settling for a 2 TB Seagate that is still running fine AFAIK.
While that process was going on, I bought three 2 TB Seagate drives in
external cases with USB 3.0 interfaces, two of which failed outright
after about 12 months and have been replaced with two refurbished drives
under warranty.
Yup, they all replace failed drives with refurb.

As a side note, on my home server I have had 6 Seagate ES.2 or ES.3 drives, 2 HGST UltraStar drives, and 2 WD RE4 in service on my home server. I have had 3 of the Seagates fail (and one of the Seagate replacements has failed, still under warranty). I have not had any HGST or WD drives fail (and they both have better performance than the Seagates). This does not mean that I do not buy Seagate drives. I spread my purchases around, keeping to the 24x7 5 year warranty drives and followup when I have a failure.
Post by Scott Bennett
While waiting for those replacements to arrive, I bought
a 2 TB Samsung drive in an external case with a USB 3.0 interface. I
discovered by chance that copying very large files to these drives is an
error-prone process.
I would suspect the USB 3.0 layer problem, but that is just a guess.
Post by Scott Bennett
A roughly 1.1 TB file on the one surviving external
Seagate drive from last year's purchase of three, when copied to the
Samsung drive, showed no I/O errors during the copy operation. However,
a comparison check using "cmp -l -z originalfile copyoforiginal" shows
quite a few places where the contents don't match.
ZFS would not tolerate those kinds of errors. On reading the file ZFS would know via the checksum that the file was bad.
Post by Scott Bennett
The same procedure
applied to one of the refurbished Seagates gives similar results, although
the locations and numbers of differing bytes are different from those on
the Samsung drive. The same procedure applied to the other refurbished
drive resulted in a good copy the first time, but a later repetition ended
up with a copied file that differed from the original by a single bit in
each of two widely separated places in the files. These problems have
raised the priority of a self-healing RAID device in my mind.
Self healing RAID will be of little help… See more below
Post by Scott Bennett
I have to say that these are new experiences to me. The disk drives,
controllers, etc. that I grew up with all had parity checking in the hardware,
including the data encoded on the disks, so single-bit errors anywhere in
the process showed up as hardware I/O errors instantly. If the errors were
not eliminated during a limited number of retries, they ended up as permanent
I/O errors that a human would have to resolve at some point.
What controllers and drives? I have never seen a drive that does NOT have uncorrectable errors (these are undetectable by the drive). I have also never seen a controller that checksums the data. The controllers rely on the drive to report errors. If the drive does not report an error, then the controller trusts the data.

The big difference is that with drives under 1TB the odds of running into an uncorrectable error over the life of the drive is very, very small. The uncorrectable error rate does NOT scale down as the drives scale up. It has been stable at 1 error in 10^14 bits read (for cheap drives) to 1 in 10^15 (for good drives) for over the past 10 years (when I started looking at that drive spec). So if the rate is not changing and the total amount of data written / read over the life of the drive goes up by, in some cases, orders of magnitude, the real world occurrence of such errors is increasing.
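Back-of-envelope arithmetic for one full read of a 2 TB drive at the 1-in-10^14 figure (numbers are illustrative only):

    # 2 TB = 2 * 10^12 bytes = 1.6 * 10^13 bits read per full pass
    echo "scale=2; 2 * 10^12 * 8 / 10^14" | bc   # => .16 expected unrecoverable errors per pass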
Post by Scott Bennett
FWIW, I also discovered that I cannot run two such multi-hour-long
copy operations in parallel using two separate pairs of drives. Running
them together seems to go okay for a while, but eventually always results
in a panic. This is on 9.2-STABLE (r264339). I know that that is not up
to date, but I can't do anything about that until my disk hardware situation
is settled.]
I have had mixed luck with large copy operations via USB on FreeBSD 9.x. Under 9.1 I have found it to be completely unreliable. With 9.2 I have managed without too many errors. USB really does not seem to be a good transport for large quantities of data at fast rates. See my rant on USB hubs here: http://pk1048.com/usb-beware/

--
Paul Kraus
***@kraus-haus.org
Scott Bennett
2014-08-02 06:21:54 UTC
Post by Paul Kraus
Post by Scott Bennett
I want to set up a couple of software-based RAID devices across
identically sized partitions on several disks. At first I thought that
gvinum's raid5 would be the way to go, but now that I have finally found
and read some information about raidz, I am unsure which to choose. My
current, and possibly wrong, understanding about the two methods' most
important features (to me, at least) can be summarized as follows.
Disclaimer, I have experience with ZFS but not your other alternative.
Okay, I appreciate the ZFS info anyway. Maybe someone with gvinum
experience will weigh in at some point.
Thanks. I'll check into it.
Post by Paul Kraus
Post by Scott Bennett
raid5: Has parity checking, but any parity errors identified are
assumed to be ...
raidz: Has parity checking *and* frequently spaced checksums ...
ZFS checksums all data for errors. If there is redundancy (mirror, raid, copies > 1) ZFS will transparently repair damaged data (but increment the "checksum" error count so you can know via the zpool status command that you *are* hitting errors).
<snip>
Post by Scott Bennett
raid5: Can be expanded by the addition of more spindles via a
"gvinum grow" operation.
raidz: Can only be expanded by replacing all components with larger
components.  The number ...
All ZFS devices are derived from what are called top level vdevs (virtual devices). The data is striped across all of the top level vdevs. Each vdev may be composed of a single drive. mirror, or raid (z1, z2, or z3). So you can create a mixed zpool (not recommended for a variety of reasons) with a different type of vdev for each vdev. The way to expand any ZFS zpool is to add additional vdevs (beyond the replace all drives in a single vdev and then grow to fill the new drives). So you can create a zpool with one raidz1 vdev and then later add a second raidz1 vdev. Or more commonly, start with a mirror vdev and then add a second, third, fourth (etc.) mirror vdev.
[Ouch. Trying to edit a response into entire paragraphs on single lines
is a drag.]
Post by Paul Kraus
It is this two tier structure that is one of ZFSes strengths. It is also a feature that is not well understood.
I understood that, but apparently I didn't express it well enough
in my comparison table. Thanks, though, for the confirmation of what
I wrote. GEOM devices can be built upon other GEOM devices, too, as
can gvinum devices within some constraints.
Post by Paul Kraus
<snip>
Post by Scott Bennett
raid5: Does not support migration to any other RAID levels or their
equivalents.
raidz: Does not support migration between raidz levels, even by ...
Correct. Once you have created a vdev, that vdev must remain the same type. You can add mirrors to a mirror vdev, but you cannot add drives or change raid level to raidz1, raidz2, or raidz3 vdevs.
Too bad. Increasing the raidz level ought to be not much more
difficult than growing the raidz device by adding more spindles. Doing
the latter ought to be no more difficult than doing it with gvinum's
stripe or raid5 devices. Perhaps the ZFS developers will eventually
implement these capabilities. (A side thought: gstripe and graid3
devices ought also to be expandable in this manner, although the resulting
number of graid3 components would still need to be 2^n + 1.)
Post by Paul Kraus
<snip>
Post by Scott Bennett
raid5: Does not support additional parity dimensions a la RAID6.
raidz: Supports one (raidz2) or two (raidz3) additional parity ...
ZFS parity is handled slightly differently than for traditional raid-5 (as well as the striping of data / parity blocks). So you cannot just count on losing 1, 2, or 3 drives worth of space to parity. See Matt Ahrens' blog entry here http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for (probably) more data on this than you want :-) And here https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674 is his spreadsheet that relates space lost due to parity to number of drives in a raidz vdev and data block size (yes, the amount of space lost to parity varies with data block size, not configured filesystem block size!). There is a separate tab for each of RAIDz1, RAIDz2, and RAIDz3.
Yes, I had found both of those by following links from the ZFS material
at the freebsd.org web site. However, lynx(1) is the only web browser I can
use at present because X11 was screwed on my system by an update that changed
the ABI for the server and various loadable modules, but did not update the
keyboard driver module or the pointing device driver module. If I start X up,
it rejects those two driver modules due to the incompatible ABIs, so I have
no further influence on the system short of an ACPI shutdown triggered by
pushing the power button briefly. Until I get the disk situation settled,
I have no easy way to rebuild X11. Anyway, using lynx(1), it is very hard
to make any sense of the spreadsheet.
Post by Paul Kraus
<snip>
Post by Scott Bennett
raid5: Fast performance because each block is on a separate spindle
from the previous and next blocks.
raidz: Slower performance because each block is spread across all
spindles a la RAID3, so many simultaneous I/O operations are required
for each block.
ZFS performance is never that simple as I/O is requested from the drive in parallel. Unless you are saturating the controller you should be able to keep all the drive busy at once. Also note that ZFS does NOT suffer the RAID-5 read-modify-write penalty on writes as every write is a new write to disk (there is no modification of existing disk blocks), this is referred to as being Copy On Write (COW).
Again, your use of single-line paragraphs makes it tough to respond to
your several points in-line.
The information that I read on-line said that each raidz data block is
distributed across all devices in the raidzN device, just like in RAID3 or
RAID4. That means that, whether reading or writing one data block, *all* of
the drives require a read or a write, not just one as would be the case in
RAID5. So a raidzN device will require N I/O operations * m data blocks to
be read/written, not just m I/O operations. That was the point I was making
in the table entry above, i.e., ZFS raidz, like RAID3 and RAID4, is many
times as I/O-intensive as RAID5. In essence, reading or writing 100 data
blocks from a raidz is, at best, no faster than reading 100 blocks from a
single drive. At worst, there will be bus conflicts leading to overruns and
full rotation delays in the process of gathering all the fragments in a
block, thus performing even slower than a single drive. I.e., raidzN offers
no speed advantage to using multiple spindles, just like RAID3/RAID4. In
other words, the data are not really striped but rather distributed in
parallel. So I guess the question is, was what I read about raidz incorrect,
i.e., are individual data blocks *not* divided into a fragment on each and
every spindle minus the raidz level (number of parity dimensions)?
Post by Paul Kraus
Post by Scott Bennett
-----------------------
I hoped to start with a minimal number of components and eventually
add more components to increase the space available in the raid5 or raidz
devices. Increasing their sizes that way would also increase the total
percentage of space in the devices devoted to data rather than parity, as
well as improving the performance enhancement of the striping. For various
reasons, having to replace all component spindles with larger-capacity
components is not a viable method of increasing the size of the raid5 or
raidz devices in my case. That would appear to rule out raidz.
Yup.
Bummer. Oh, well.
Post by Paul Kraus
Post by Scott Bennett
OTOH, the very large-capacity drives available in the last two or
three years appear not to be very reliable(*) compared to older drives of
1 TB or smaller capacities. gvinum's raid5 appears not to offer good
protection against, nor any repair of, damaged data blocks.
Yup. Unless you use ZFS plan on suffering silent data corruption due to the uncorrectable (and undetectable by the drive) error rate off of large drives. All drives suffer uncorrectable errors, read errors that the drive itself does not realize are errors. With traditional filesystems this bad data is returned to the OS and in some cases may cause a filesystem panic and in others just bad data returned to the application. This is one of the HUGE benefits of ZFS, it catches those errors.
I think you've convinced me right there. Although RAID[3456] offers
protection against drive failures, it offers no protection against silent
data corruption, which seems to be common on the large-capacity drives on
the market for the last three or four years.
Post by Paul Kraus
<snip>
Post by Scott Bennett
Thanks to three failed external drives and
apparently not fully reliable replacements, compounded by a bad ports
update two or three months ago, I have no functioning X11 and no space
set up any longer in which to build ports to fix the X11 problem, so I
really want to get the disk situation settled ASAP. Trying to keep track
of everything using only syscons and window(1) is wearing my patience
awfully thin.
My home server is ZFS only and I have 2 drives mirrored for the OS and 5 drives in a raidz2 for data with one hot spare. I have suffered 3 drive failures (all Seagate), two of which took the four drives in my external enclosure offline (damn sata port multipliers). I have had NO data loss or corruption!
Bravo, then. Looks like ZFS raidz is what I need. Unfortunately,
I only have four drives available for the raidz at present, so it looks
like I'll need to save up for at least one additional drive and probably
two for a raidz2 that doesn't sacrifice an unacceptably high fraction of
the total space to parity blocks. :-( On my "budget" (ha!) that could be
several months or more, by which time three of the four I currently have
will be out of warranty. I suppose more failures could also occur during
that time. Sigh.
Post by Paul Kraus
I started like you, wanting to have some drives and add more later. I started with a pair of 1TB drives mirrored, then added a second pair to double my capacity. The problem with 2-way mirrors is that the MTTDL (Mean Time To Data Loss) is much lower than with RAIDz2, with similar cost in spec for a 4 disk configuration. After I had a drive fail in the mirror configuration, I ordered a replacement and crossed my fingers that the other half to *that* mirror would not fail (the pairs of drives in the mirrors were the same make / model bought at the same time … not a good bet for reliability). When I got the replacement drive(s) I took some time and rebuilt my configuration to better handle growth and reliability by going from a 4 disk 2-way mirror configuration to a 5 disk RAIDz2. I went from net about 2TB to net about 3TB capacity and a hot spare.

Yeah, the mirrors never did look to me to be as good an option either.
Post by Paul Kraus
If being able to easily grow capacity is the primary goal I would go with a 2-way mirror configuration and always include a hot spare (so that *when* a drive fails it immediately starts resilvering (the ZFS term for syncing) the vdev). Then you can simply add pairs of drives to add capacity. Just make sure that the hot spare is at least as large as the largest drive in use. When you buy drives, always buy from as many different manufacturers and models as you can. I just bought four 2TB drives for my backup server. One is a WD, the other 3 are HGST, but they are four different model drives, so that they did not come off the same production line in the same week as each other. If I could have I would have gotten four different manufacturers. I also only buy server class (rated for 24x7 operation with 5 year warranty) drives. The additional cost has been offset by the savings due to being able to have a failed drive replaced under warranty.

I'm not familiar with HGST, but I will look into their products.
Where does one find the server-class drives for sale? What sort of
price difference is there between the server-class and the ordinary
drives?
And yes, I did run across the silly term used in ZFS for rebuilding
a drive's contents. :-}
Post by Paul Kraus
Post by Scott Bennett
(*) [Last year I got two defective 3 TB drives in a row from Seagate.
Wow, the only time I have seen that kind of failure rate was buying from Newegg when they were packing them badly.
At that time, the shop that was getting them for me (to put into a
third-party case with certain interfaces I needed at the time) told me
that, of the 3 TB Seagate drives they had gotten for their own use and
also for sale to customers who wanted them, only roughly 50% survived past
their first 30 days of use, and that none of the Western Digital 3 TB
drives had survived that long. I concluded that the 3 TB drives were not
yet ready for prime time and should not have been marketed as early as
they were. That was the reason for my insisting upon a 2 TB Seagate to
fill the third-party case.
Post by Paul Kraus
Post by Scott Bennett
I ended up settling for a 2 TB Seagate that is still running fine AFAIK.
While that process was going on, I bought three 2 TB Seagate drives in
external cases with USB 3.0 interfaces, two of which failed outright
after about 12 months and have been replaced with two refurbished drives
under warranty.
Yup, they all replace failed drives with refurb.
As a side note, on my home server I have had 6 Seagate ES.2 or ES.3 drives, 2 HGST UltraStar drives, and 2 WD RE4 in service on my home server. I have had 3 of the Seagates fail (and one of the Seagate replacements has failed, still under warranty). I have not had any HGST or WD drives fail (and they both have better performance than the Seagates). This does not mean that I do not buy Seagate drives. I spread my purchases around, keeping to the 24x7 5 year warranty drives and followup when I have a failure.
I had a WD 1 TB drive fail last year. It was just over three years
old at the time.
Post by Paul Kraus
Post by Scott Bennett
While waiting for those replacements to arrive, I bought
a 2 TB Samsung drive in an external case with a USB 3.0 interface. I
discovered by chance that copying very large files to these drives is an
error-prone process.
I would suspect the USB 3.0 layer problem, but that is just a guess.
There has been no evidence to support that conjecture so far. What
the guy at Samsung/Seagate (they appear to be the same company now) told
me was that what I described did not mean that the drive was bad, but
instead was a common event with large-capacity drives. He seemed to
think that the problems were associated with long-running series of write
operations, though he had no explanation for that. It seems to me that
such errors being considered "normal" for these newer, larger-capacity
drives indicates the adoption of a drastically lowered standard of quality,
as compared to just a few years ago. And if that is to be the way of disks
from now on, then self-correcting file systems will soon become the only
acceptable file systems for production use outside of scratch areas.
Post by Paul Kraus
Post by Scott Bennett
A roughly 1.1 TB file on the one surviving external
Seagate drive from last year's purchase of three, when copied to the
Samsung drive, showed no I/O errors during the copy operation. However,
a comparison check using "cmp -l -z originalfile copyoforiginal" shows
quite a few places where the contents don't match.
ZFS would not tolerate those kinds of errors. On reading the file ZFS would know via the checksum that the file was bad.
And ZFS would attempt to rewrite the bad block(s) with the correct
contents? If so, would it then read back what it had written to make
sure the errors had, in fact, been corrected on the disk(s)?
Post by Paul Kraus
Post by Scott Bennett
The same procedure
applied to one of the refurbished Seagates gives similar results, although
the locations and numbers of differing bytes are different from those on
the Samsung drive. The same procedure applied to the other refurbished
drive resulted in a good copy the first time, but a later repetition ended
up with a copied file that differed from the original by a single bit in
each of two widely separated places in the files. These problems have
raised the priority of a self-healing RAID device in my mind.
Self healing RAID will be of little help… See more below
Why would it be of little help? What you wrote here seems to suggest
that it would be very helpful, at least for dealing with the kind of trouble
that caused me to start this thread. Was the above just a typo of some kind?
Post by Paul Kraus
Post by Scott Bennett
I have to say that these are new experiences to me. The disk drives,
controllers, etc. that I grew up with all had parity checking in the hardware,
including the data encoded on the disks, so single-bit errors anywhere in
the process showed up as hardware I/O errors instantly. If the errors were
not eliminated during a limited number of retries, they ended up as permanent
I/O errors that a human would have to resolve at some point.
What controllers and drives? I have never seen a drive that does NOT have uncorrectable errors (these are undetectable by the drive). I have also never seen a controller that checksums the data. The controllers rely on the drive to report errors. If the drive does not report an error, then the controller trusts the data.
Hmm... [scratches head a moment] Well, IBM 1311, 2305, 2311, 2314,
3330, 3350, 3380, third-party equivalents of those, DEC RA80, Harris disks
(model numbers forgotten), HP disks (numbers also forgotten), Prime disks
(ditto). Maybe some others that escape me now. Tape drives until the
early 1990s that I worked with were all 9-track, so each byte was written
across 8 data tracks and 1 parity track. Then we got a cartridge-based
system, and I *think* it may have been 10-track (i.e., 2 parity tracks).
Those computers and I/O subsystems and media had a parity bit for each byte
from the CPU and memory all the way out to the oxide on the media. Anytime
odd parity was broken, the hardware detected it and passed an indication of
the error back to the operating system.
Post by Paul Kraus
The big difference is that with drives under 1TB the odds of running into an uncorrectable error over the life of the drive is very, very small. The uncorrectable error rate does NOT scale down as the drives scale up. It has been stable at 1 error in 10^14 bits read (for cheap drives) to 1 in 10^15 (for good drives) for over the past 10 years (when I started looking at that drive spec). So if the rate is not changing and the total amount of data written / read over the life of the drive goes up by, in some cases, orders of magnitude, the real world occurrence of such errors is increasing.
Interesting. I wondered if that were all there were to it, rather
than the rate per gigabyte increasing due to the increased recording
density at the larger capacities. I do think that newer drives have a
shorter MTBF than the drives of a decade ago, however. I know that none
of the ones I've seen fail had served any 300,000+ hours that the
manufacturers were citing as MTBF values for their products. One
"feature" of many newer drives is an automatic spindown whenever the drive
has been inactive for a short time. The heating/cooling cycles that result
from these "energy-saving" or "standby" responses look to me like probable
culprits for drive failures.
The spindowns also mean that such a drive cannot have a paging/swapping
area on it because the kernel will not wait five to ten seconds (while the
drive spins up) for a pagein to complete. Instead, it will log an error
message on the console and will terminate the process that needed the page.
Post by Paul Kraus
Post by Scott Bennett
FWIW, I also discovered that I cannot run two such multi-hour-long
copy operations in parallel using two separate pairs of drives. Running
them together seems to go okay for a while, but eventually always results
in a panic. This is on 9.2-STABLE (r264339). I know that that is not up
to date, but I can't do anything about that until my disk hardware situation
is settled.]
I have had mixed luck with large copy operations via USB on FreeBSD 9.x. Under 9.1 I have found it to be completely unreliable. With 9.2 I have managed without too many errors. USB really does not seem to be a good transport for large quantities of data at fast rates. See my rant on USB hubs here: http://pk1048.com/usb-beware/
I was referring to kernel panics, not I/O errors. These very long
copy operations all complete normally when run serially. The panics
occur only when I run two such copies in parallel.
I took a look at that link. I've had good luck with Dynex USB 2.0
hubs, both powered and unpowered, but I've only bought their 4-port hubs,
not the 7-ports. One of mine recently failed after at least five years
of service, possibly as long as seven years.
However, the only hard drive I currently have connected via USB 2.0
is my oldest external drive, an 80 GB WD drive in an iomega case, and I
have yet to see any problems with it after nearly ten years of mostly
around-the-clock service. The drives showing the errors I've described
in this thread are all connected via Connectland 4-port USB 3.0 hubs.
I have some other ZFS questions, but this posting is very long
already, so I'll post them in a separate thread.
Well, thank you very much for your reply. I appreciate the helpful
information and perspectives from your actual experiences. There are
some capabilities that I would very much like to see added to ZFS in the
future, but I think I can live with what it can already do right now, at
least for a few years. The protection against data corruption, especially
of the silent type, is something I really, really want, and none of the
standard RAID versions seems to offer it, so I guess I'll have to go with
raidz and deal with the performance hit and the lack of a "grow" command
for raidz for now.


Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet: bennett at sdf.org *xor* bennett at freeshell.org *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good *
* objection to the introduction of that bane of all free governments *
* -- a standing army." *
* -- Gov. John Hancock, New York Journal, 28 January 1790 *
**********************************************************************
Warren Block
2014-08-02 10:25:02 UTC
Post by Paul Kraus
ZFS parity is handled slightly differently than for traditional
raid-5 (as well as the striping of data / parity blocks). So you
cannot just count on losing 1, 2, or 3 drives worth of space to
parity. See Matt Ahrens' blog entry here
http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for
(probably) more data on this than you want :-) And here
https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674
is his spreadsheet that relates space lost due to parity to number of
drives in raidz vdev and data block size (yes, the amount of space
lost to parity varies with data block size, not configured filesystem
block size!). There is a separate tab for each of RAIDz1, RAIDz2, and
RAIDz3.
Anyway, using lynx(1), it is very hard to make any sense of the
spreadsheet.
Even with a graphic browser, let's say that spreadsheet is not a paragon
of clarity. It's not clear what "block size in sectors" means in that
context. Filesystem blocks, presumably, but are sectors physical or
virtual disk blocks, 512 or 4K? What is that number when using a
standard configuration of a disk with 4K sectors and ashift=12? It
could be 1, or 8, or maybe something else.

As I read it, RAIDZ2 with five disks uses somewhere between 67% and 40%
of the data space for redundancy. The first seems unlikely, but I can't
tell. Better labels or rearrangement would help.

A second chart with no labels at all follows the first. It has only the
power-of-two values in the "block size in sectors" column. A
restatement of the first one... but it's not clear why.

My previous understanding was that RAIDZ2 with five disks would leave
60% of the capacity for data.
Arthur Chance
2014-08-02 12:39:04 UTC
Post by Warren Block
Post by Paul Kraus
ZFS parity is handled slightly differently than for traditional
raid-5 (as well as the striping of data / parity blocks). So you
cannot just count on losing 1, 2, or 3 drives worth of space to
parity. See Matt Ahrens' blog entry here
http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for
(probably) more data on this than you want :-) And here
https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674
is his spreadsheet that relates space lost due to parity to number of
drives in raidz vdev and data block size (yes, the amount of space
lost to parity varies with data block size, not configured filesystem
block size!). There is a separate tab for each of RAIDz1, RAIDz2, and
RAIDz3.
Anyway, using lynx(1), it is very hard to make any sense of the
spreadsheet.
Even with a graphic browser, let's say that spreadsheet is not a paragon
of clarity. It's not clear what "block size in sectors" means in that
context. Filesystem blocks, presumably, but are sectors physical or
virtual disk blocks, 512 or 4K? What is that number when using a
standard configuration of a disk with 4K sectors and ashift=12? It
could be 1, or 8, or maybe something else.
As I read it, RAIDZ2 with five disks uses somewhere between 67% and 40%
of the data space for redundancy. The first seems unlikely, but I can't
tell. Better labels or rearrangement would help.
A second chart with no labels at all follows the first. It has only the
power-of-two values in the "block size in sectors" column. A
restatement of the first one... but it's not clear why.
My previous understanding was that RAIDZ2 with five disks would leave
60% of the capacity for data.
Quite right. If you have N disks in a RAIDZx configuration, the fraction
used for data is (N-x)/N and the fraction for parity is x/N. There's
always overhead for the file system bookkeeping of course, but that's
not specific to ZFS or RAID.
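As a worked example for the five-disk RAIDZ2 case above (N = 5, x = 2):

    echo "scale=2; (5 - 2) / 5" | bc   # => .60, i.e. 60% of raw capacity left for data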
Scott Bennett
2014-08-06 05:56:20 UTC
Post by Arthur Chance
Post by Warren Block
Post by Paul Kraus
ZFS parity is handled slightly differently than for traditional
raid-5 (as well as the striping of data / parity blocks). So you
cannot just count on losing 1, 2, or 3 drives worth of space to
parity. See Matt Ahrens' blog entry here
http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for
(probably) more data on this than you want :-) And here
https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674
is his spreadsheet that relates space lost due to parity to number of
drives in raidz vdev and data block size (yes, the amount of space
lost to parity varies with data block size, not configured filesystem
block size!). There is a separate tab for each of RAIDz1, RAIDz2, and
RAIDz3.
Anyway, using lynx(1), it is very hard to make any sense of the
spreadsheet.
Even with a graphic browser, let's say that spreadsheet is not a paragon
of clarity. It's not clear what "block size in sectors" means in that
context. Filesystem blocks, presumably, but are sectors physical or
virtual disk blocks, 512 or 4K? What is that number when using a
standard configuration of a disk with 4K sectors and ashift=12? It
could be 1, or 8, or maybe something else.
As I read it, RAIDZ2 with five disks uses somewhere between 67% and 40%
of the data space for redundancy. The first seems unlikely, but I can't
tell. Better labels or rearrangement would help.
A second chart with no labels at all follows the first. It has only the
power-of-two values in the "block size in sectors" column. A
restatement of the first one... but it's not clear why.
My previous understanding was that RAIDZ2 with five disks would leave
60% of the capacity for data.
Quite right. If you have N disks in a RAIDZx configuration, the fraction
used for data is (N-x)/N and the fraction for parity is x/N. There's
always overhead for the file system bookkeeping of course, but that's
not specific to ZFS or RAID.
I wonder if what varies is the amount of space taken up by the
checksums. If there's a checksum for each block, then the block size
would change the fraction of the space lost to checksums, and the parity
for the checksums would thus also change. Enough to matter? Maybe.


Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet: bennett at sdf.org *xor* bennett at freeshell.org *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good *
* objection to the introduction of that bane of all free governments *
* -- a standing army." *
* -- Gov. John Hancock, New York Journal, 28 January 1790 *
**********************************************************************
Arthur Chance
2014-08-06 10:11:43 UTC
Post by Scott Bennett
Post by Arthur Chance
Post by Warren Block
Post by Paul Kraus
ZFS parity is handled slightly differently than for traditional
raid-5 (as well as the striping of data / parity blocks). So you
cannot just count on losing 1, 2, or 3 drives worth of space to
parity. See Matt Ahrens' blog entry here
http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for
(probably) more data on this than you want :-) And here
https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674
is his spreadsheet that relates space lost due to parity to number of
drives in raidz vdev and data block size (yes, the amount of space
lost to parity varies with data block size, not configured filesystem
block size!). There is a separate tab for each of RAIDz1, RAIDz2, and
RAIDz3.
Anyway, using lynx(1), it is very hard to make any sense of the
spreadsheet.
Even with a graphic browser, let's say that spreadsheet is not a paragon
of clarity. It's not clear what "block size in sectors" means in that
context. Filesystem blocks, presumably, but are sectors physical or
virtual disk blocks, 512 or 4K? What is that number when using a
standard configuration of a disk with 4K sectors and ashift=12? It
could be 1, or 8, or maybe something else.
As I read it, RAIDZ2 with five disks uses somewhere between 67% and 40%
of the data space for redundancy. The first seems unlikely, but I can't
tell. Better labels or rearrangement would help.
A second chart with no labels at all follows the first. It has only the
power-of-two values in the "block size in sectors" column. A
restatement of the first one... but it's not clear why.
My previous understanding was that RAIDZ2 with five disks would leave
60% of the capacity for data.
Quite right. If you have N disks in a RAIDZx configuration, the fraction
used for data is (N-x)/N and the fraction for parity is x/N. There's
always overhead for the file system bookkeeping of course, but that's
not specific to ZFS or RAID.
I wonder if what varies is the amount of space taken up by the
checksums. If there's a checksum for each block, then the block size
would change the fraction of the space lost to checksums, and the parity
for the checksums would thus also change. Enough to matter? Maybe.
I'm not a file system guru, but my (high level) understanding is as
follows. Corrections from anyone more knowledgeable welcome.

1. UFS and ZFS both use tree structures to represent files, with the
data stored at the leaves and bookkeeping stored in the higher nodes.
Therefore the overhead scales as the log of the data size, which is a
negligible fraction for any sufficiently large amount of data.

2. UFS doesn't have data checksums, it relies purely on the hardware
checksums. (This is the area I'm least certain of.)

3. ZFS keeps its checksums in a Merkle tree
(http://en.wikipedia.org/wiki/Merkle_tree) so the checksums are held in
the bookkeeping blocks, not in the data blocks. This simply changes the
constant multiplier in front of the logarithm for the overhead. Also, I
believe ZFS doesn't use fixed size data blocks, but aggregates writes
into blocks of up to 128K.
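(The 128K figure is the default per-dataset recordsize property, and it is an upper bound rather than a fixed block size. A minimal sketch of inspecting or changing it, with a placeholder dataset name:

    zfs get recordsize tank/data
    zfs set recordsize=128K tank/data
)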

Personally, I don't worry about the overheads of checksumming as the
cost of the parity stripe(s) in raidz is dominant. It's a cost well
worth paying though - I have a 3 disk raidz1 pool and a disk went bad
within 3 months of building it (the manufacturer turned out to be having
a few problems at the time) but I didn't lose a byte.
Scott Bennett
2014-08-07 08:31:43 UTC
Post by Arthur Chance
Post by Scott Bennett
[stuff deleted --SB]
I wonder if what varies is the amount of space taken up by the
checksums. If there's a checksum for each block, then the block size
would change the fraction of the space lost to checksums, and the parity
for the checksums would thus also change. Enough to matter? Maybe.
I'm not a file system guru, but my (high level) understanding is as
follows. Corrections from anyone more knowledgeable welcome.
1. UFS and ZFS both use tree structures to represent files, with the
data stored at the leaves and bookkeeping stored in the higher nodes.
Therefore the overhead scales as the log of the data size, which is a
negligible fraction for any sufficiently large amount of data.
2. UFS doesn't have data checksums, it relies purely on the hardware
checksums. (This is the area I'm least certain of.)
What hardware checksums are there? I wasn't aware that this sort of
hardware kept any.
Post by Arthur Chance
3. ZFS keeps its checksums in a Merkle tree
(http://en.wikipedia.org/wiki/Merkle_tree) so the checksums are held in
the bookkeeping blocks, not in the data blocks. This simply changes the
constant multiplier in front of the logarithm for the overhead. Also, I
believe ZFS doesn't use fixed size data blocks, but aggregates writes
into blocks of up to 128K.
Personally, I don't worry about the overheads of checksumming as the
cost of the parity stripe(s) in raidz is dominant. It's a cost well
worth paying though - I have a 3 disk raidz1 pool and a disk went bad
within 3 months of building it (the manufacturer turned out to be having
a few problems at the time) but I didn't lose a byte.
Good testimonial. I'm not worried about the checksum space either.
I figure the benefits make it cheap at the price. Of more concern to me
now is how I'm going to come up with at least two more 2 TB drives to set
up a raidz2 with a tolerably small fraction of the total space being tied
up in combined ZFS overhead (i.e., bookkeeping, parity, checksums, etc.)


Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet: bennett at sdf.org *xor* bennett at freeshell.org *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good *
* objection to the introduction of that bane of all free governments *
* -- a standing army." *
* -- Gov. John Hancock, New York Journal, 28 January 1790 *
**********************************************************************
Trond Endrestøl
2014-08-07 08:35:36 UTC
Post by Scott Bennett
Post by Arthur Chance
Post by Scott Bennett
[stuff deleted --SB]
I wonder if what varies is the amount of space taken up by the
checksums. If there's a checksum for each block, then the block size
would change the fraction of the space lost to checksums, and the parity
for the checksums would thus also change. Enough to matter? Maybe.
I'm not a file system guru, but my (high level) understanding is as
follows. Corrections from anyone more knowledgeable welcome.
1. UFS and ZFS both use tree structures to represent files, with the
data stored at the leaves and bookkeeping stored in the higher nodes.
Therefore the overhead scales as the log of the data size, which is a
negligible fraction for any sufficiently large amount of data.
2. UFS doesn't have data checksums, it relies purely on the hardware
checksums. (This is the area I'm least certain of.)
What hardware checksums are there? I wasn't aware that this sort of
hardware kept any.
To quote http://en.wikipedia.org/wiki/Disk_sector:

In disk drives, each physical sector is made up of three basic parts,
the sector header, the data area and the error-correcting code (ECC).
Post by Scott Bennett
Post by Arthur Chance
3. ZFS keeps its checksums in a Merkle tree
(http://en.wikipedia.org/wiki/Merkle_tree) so the checksums are held in
the bookkeeping blocks, not in the data blocks. This simply changes the
constant multiplier in front of the logarithm for the overhead. Also, I
believe ZFS doesn't use fixed size data blocks, but aggregates writes
into blocks of up to 128K.
Personally, I don't worry about the overheads of checksumming as the
cost of the parity stripe(s) in raidz is dominant. It's a cost well
worth paying though - I have a 3 disk raidz1 pool and a disk went bad
within 3 months of building it (the manufacturer turned out to be having
a few problems at the time) but I didn't lose a byte.
Good testimonial. I'm not worried about the checksum space either.
I figure the benefits make it cheap at the price. Of more concern to me
now is how I'm going to come up with at least two more 2 TB drives to set
up a raidz2 with a tolerably small fraction of the total space being tied
up in combined ZFS overhead (i.e., bookkeeping, parity, checksums, etc.)
--
+-------------------------------+------------------------------------+
| Vennlig hilsen, | Best regards, |
| Trond Endrestøl, | Trond Endrestøl, |
| IT-ansvarlig, | System administrator, |
| Fagskolen Innlandet, | Gjøvik Technical College, Norway, |
| tlf. mob. 952 62 567, | Cellular...: +47 952 62 567, |
| sentralbord 61 14 54 00. | Switchboard: +47 61 14 54 00. |
+-------------------------------+------------------------------------+
Scott Bennett
2014-08-07 09:36:46 UTC
Permalink
Post by Trond Endrestøl
Post by Scott Bennett
Post by Arthur Chance
Post by Scott Bennett
[stuff deleted --SB]
I wonder if what varies is the amount of space taken up by the
checksums. If there's a checksum for each block, then the block size
would change the fraction of the space lost to checksums, and the parity
for the checksums would thus also change. Enough to matter? Maybe.
I'm not a file system guru, but my (high level) understanding is as
follows. Corrections from anyone more knowledgeable welcome.
1. UFS and ZFS both use tree structures to represent files, with the
data stored at the leaves and bookkeeping stored in the higher nodes.
Therefore the overhead scales as the log of the data size, which is a
negligible fraction for any sufficiently large amount of data.
2. UFS doesn't have data checksums, it relies purely on the hardware
checksums. (This is the area I'm least certain of.)
What hardware checksums are there? I wasn't aware that this sort of
hardware kept any.
In disk drives, each physical sector is made up of three basic parts,
the sector header, the data area and the error-correcting code (ECC).
That's interesting, and I know it was true in the days of minicomputers.
However, it appears to be out of date, based upon 1) the observed fact that
corrupted data *do* get recorded onto today's PC-style disk drives with no
indication that an error has occurred, there being no parity bits in the
processor chips, memory cards, motherboards, or PATA/SATA/SCSI/etc.
controllers, and 2) the absence of any such checking in the disk drives
themselves, as confirmed by the technical support guy I spoke with about it
at Seagate/Samsung recently. That guy said that there is *no parity-checking*
of data written to/read from the disks and that some silent errors are now
considered to be "normal" on disks whose capacities exceed 1 TB.
Post by Trond Endrestøl
[remainder deleted --SB]
Trond Endrestøl
2014-08-07 10:36:56 UTC
Permalink
Post by Scott Bennett
Post by Trond Endrestøl
Post by Scott Bennett
Post by Arthur Chance
Post by Scott Bennett
[stuff deleted --SB]
I wonder if what varies is the amount of space taken up by the
checksums. If there's a checksum for each block, then the block size
would change the fraction of the space lost to checksums, and the parity
for the checksums would thus also change. Enough to matter? Maybe.
I'm not a file system guru, but my (high level) understanding is as
follows. Corrections from anyone more knowledgeable welcome.
1. UFS and ZFS both use tree structures to represent files, with the
data stored at the leaves and bookkeeping stored in the higher nodes.
Therefore the overhead scales as the log of the data size, which is a
negligible fraction for any sufficiently large amount of data.
2. UFS doesn't have data checksums, it relies purely on the hardware
checksums. (This is the area I'm least certain of.)
What hardware checksums are there? I wasn't aware that this sort of
hardware kept any.
In disk drives, each physical sector is made up of three basic parts,
the sector header, the data area and the error-correcting code (ECC).
That's interesting, and I know it was true in the days of minicomputers.
However, it appears to be out of date, based upon 1) the observed fact that
corrupted data *do* get recorded onto today's PC-style disk drives with no
indication that an error has occurred, there being no parity bits in the
processor chips, memory cards, motherboards, or PATA/SATA/SCSI/etc.
controllers, and 2) the absence of any such checking in the disk drives
themselves, as confirmed by the technical support guy I spoke with about it
at Seagate/Samsung recently. That guy said that there is *no parity-checking*
of data written to/read from the disks and that some silent errors are now
considered to be "normal" on disks whose capacities exceed 1 TB.
I guess I stand corrected. It's been some years since I had any CS/CE
education, and maybe my professors' knowledge was also a bit dated
back then (1999-2002).

However, some, if not all, enterprise-grade disks use 520-byte
blocks, giving 8 extra bytes compared to the usual 512-byte blocks,
possibly to be used for ECC, etc.

Though, as proved above, I can be equally wrong about the ECC used in
enterprise-grade disks.

It seems modern day consumer electronics are being manufactured way
too brittle. :-/
Scott Bennett
2014-08-07 11:06:19 UTC
Permalink
Post by Trond Endrestøl
Post by Scott Bennett
Post by Trond Endrestøl
Post by Scott Bennett
Post by Arthur Chance
Post by Scott Bennett
[stuff deleted --SB]
I wonder if what varies is the amount of space taken up by the
checksums. If there's a checksum for each block, then the block size
would change the fraction of the space lost to checksums, and the parity
for the checksums would thus also change. Enough to matter? Maybe.
I'm not a file system guru, but my (high level) understanding is as
follows. Corrections from anyone more knowledgeable welcome.
1. UFS and ZFS both use tree structures to represent files, with the
data stored at the leaves and bookkeeping stored in the higher nodes.
Therefore the overhead scales as the log of the data size, which is a
negligible fraction for any sufficiently large amount of data.
2. UFS doesn't have data checksums, it relies purely on the hardware
checksums. (This is the area I'm least certain of.)
What hardware checksums are there? I wasn't aware that this sort of
hardware kept any.
In disk drives, each physical sector is made up of three basic parts,
the sector header, the data area and the error-correcting code (ECC).
That's interesting, and I know it was true in the days of minicomputers.
However, it appears to be out of date, based upon 1) the observed fact that
corrupted data *do* get recorded onto today's PC-style disk drives with no
indication that an error has occurred, there being no parity bits in the
processor chips, memory cards, motherboards, or PATA/SATA/SCSI/etc.
controllers, and 2) the absence of any such checking in the disk drives
themselves, as confirmed by the technical support guy I spoke with about it
at Seagate/Samsung recently. That guy said that there is *no parity-checking*
of data written to/read from the disks and that some silent errors are now
considered to be "normal" on disks whose capacities exceed 1 TB.
I guess I stand corrected. It's been some years since I had any CS/CE
education, and maybe my professors' knowledge was also a bit dated
back then (1999-2002).
However, some, if not all, enterprise-grade disks use 520-byte
blocks, giving 8 extra bytes compared to the usual 512-byte blocks,
possibly to be used for ECC, etc.
Even just as parity bits, those would amount to only one bit per
eight bytes, which seems inadequate. OTOH, the 520 bytes thing is
tickling something in my memory that I can't quite seem to recover, and
I don't know (or can't remember) what else those eight bytes might be
used for. In any case, at the time I spoke with the guy at Seagate/Samsung,
I was unaware of the server grade vs. non-server grade distinction, so I
didn't know to ask him anything about whether silent errors should be
accepted as "normal" for the server grade of disks.
Post by Trond Endrestøl
Though, as proved above, I can be equally wrong about the ECC used in
enterprise-grade disks.
As just noted above, I can't comment about server/enterprise grades
of disks either. However, I do note that many server motherboards can use,
or even require, ECC memory. Whether the parity information from the ECC
memory is also reflected in extra data bus lines in the CPU, I/O controllers,
or peripheral devices of any kind is another matter.
Post by Trond Endrestøl
It seems modern day consumer electronics are being manufactured way
too brittle. :-/
I agree completely.


Paul Kraus
2014-08-08 17:54:18 UTC
Permalink
Post by Scott Bennett
Even just as parity bits, those would amount to only one bit per
eight bytes, which seems inadequate. OTOH, the 520 bytes thing is
tickling something in my memory that I can't quite seem to recover, and
I don't know (or can't remember) what else those eight bytes might be
used for. In any case, at the time I spoke with the guy at Seagate/Samsung,
I was unaware of the server grade vs. non-server grade distinction, so I
didn't know to ask him anything about whether silent errors should be
accepted as "normal" for the server grade of disks.
Take a look at the manufacturer data sheets for these drives. All of the ones that I have looked at over the past ten years have included the “uncorrectable error rate” and it is generally 1 in 10e-14 for “consumer grade drives” and 1 in 1e-15 for “enterprise grade drives”. That right there shows the order of magnitude difference in this error rate between consumer and enterprise drives.

The reason no one even discussed it prior to the appearance of 1TB drives is that over the life of a less than 1TB drive you are statistically almost assured of NOT running into it. It was still there, but no one wrote/read enough data over the life of the drive to hit it.

On the other hand, I am willing to bet that many of the “random” systems crashes (and Windows BSOD) were caused by this issue. A hard disk returned a single bit error in a bad place and the system crashed.

Note that all disk drives include some amount of error checking, even as far back as the 10MB MFM drives of the 1980’s. Anyone remember having to manually manage the “Bad block list” ? Those were blocks that were so bad that the error correction could not fix them. But, as far as I can tell, the uncorrectable errors have always been with us, we just did not see them.
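
As a rough illustration (my own back-of-the-envelope arithmetic, using the
commonly quoted one-error-per-1e14-bits figure rather than any particular
data sheet):

  # ~8e12 bits are read in one full pass over a 1 TB drive; at a spec'd
  # rate of one uncorrectable error per 1e14 bits read, that works out to
  # about 0.08 expected errors per pass -- unlikely on any single pass,
  # but routine once many full-drive reads accumulate
  $ echo 'scale=3; (8 * 10^12) / 10^14' | bc
  .080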

--
Paul Kraus
***@kraus-haus.org
Scott Bennett
2014-08-22 09:40:06 UTC
Permalink
Post by Scott Bennett
Even just as parity bits, those would amount to only one bit per
eight bytes, which seems inadequate. OTOH, the 520 bytes thing is
tickling something in my memory that I can't quite seem to recover, and
I don't know (or can't remember) what else those eight bytes might be
used for. In any case, at the time I spoke with the guy at Seagate/Samsung,
I was unaware of the server grade vs. non-server grade distinction, so I
didn't know to ask him anything about whether silent errors should be
accepted as "normal" for the server grade of disks.
Take a look at the manufacturer data sheets for these drives. All of the ones that I have looked at over the past ten years have included the “uncorrectable error rate” and it is generally 1 in 10e-14 for “consumer grade drives” and 1 in 1e-15 for “enterprise grade drives”. That right there shows the order of magnitude difference in this error rate between consumer and enterprise drives.
I'll assume you meant the reciprocals of those ratios or possibly even
1/10 of the reciprocals. ;-) What I'm seeing here is ~2 KB of errors out
of ~1.1TB, which is an error rate (in bytes, not bits) of ~1.82e+09, and the
majority of the erroneous bytes I looked at had multibit errors. I consider
that to be a huge change in the actual device error rates, specs be damned.
While I was out of town, I came across a trade magazine article that
said that as the areal density of bits approaches the theoretical limit for
the recording technology currently in production, the error rate climbs ever
more steeply, and that the drives larger than 1 TB are now making that effect
easily demonstrable. :-( The article went on to describe superficially a new
recording technology due to appear on the mass market in 2015 that will allow
much higher bit densities, while drastically improving the error rate (at
least until densities eventually close in on that technology's limit). So
it may turn out that next year consumers will begin to move past the hump in
error rates and will find that hardware RAID will have become acceptably safe
once again. The description of the new recording technology looked like a
really spiffed up version of the magneto-optical disks of the 1990s. In the
meantime, though, the current crops of large-capacity disks apparently
require software solutions like ZFS to preserve data integrity.
The reason no one even discussed it prior to the appearance of 1TB drives is that over the life of a less than 1TB drive you are statistically almost assured of NOT running into it. It was still there, but no one wrote/read enough data over the life of the drive to hit it.
That sounds reasonable, but it doesn't account for the error rates I'm
seeing.
On the other hand, I am willing to bet that many of the “random” systems crashes (and Windows BSOD) were caused by this issue. A hard disk returned a single bit error in a bad place and the system crashed.
Quite possibly so, I'd say.
Note that all disk drives include some amount of error checking, even as far back as the 10MB MFM drives of the 1980’s. Anyone remember having to manually manage the “Bad block list” ? Those were blocks that were so bad that the error correction could not fix them. But, as far as I can tell, the uncorrectable errors have always been with us, we just did not see them.
I remember hearing about it, but I was safely tucked away on minicomputers
and a mainframe at that point. As I wrote before, an unrecoverable single-bit
error resulted in a bad sector reassignment by a human or, as was the policy
where I worked at the time, replacement of the disk pack by the vendor. The
smallest drive on any pee cee that I used was 20 MB with an average seek time
of 68 ms, IIRC. Even that 4 MHz 8088 had to cool its heels for long stretches
of time with that drive. Fortunately for me, I didn't have to use that
machine for work, but rather for an applied microclimatology course I was
taking at the time.


Paul Kraus
2014-08-22 13:12:45 UTC
Permalink
Post by Scott Bennett
Take a look at the manufacturer data sheets for these drives. All of the ones that I have looked at over the past ten years have included the “uncorrectable error rate” and it is generally 1 in 10e-14 for “consumer grade drives” and 1 in 1e-15 for “enterprise grade drives”. That right there shows the order of magnitude difference in this error rate between consumer and enterprise drives.
I'll assume you meant the reciprocals of those ratios or possibly even
1/10 of the reciprocals. ;-)
Uhhh, yeah, my bad.
Post by Scott Bennett
What I'm seeing here is ~2 KB of errors out
of ~1.1TB, which is an error rate (in bytes, not bits) of ~1.82e+09, and the
majority of the erroneous bytes I looked at had multibit errors. I consider
that to be a huge change in the actual device error rates, specs be damned.
That seems like a very high error rate. Is the drive reporting those errors or are they getting past the drive’s error correction and showing up as checksum errors in ZFS ? A drive that is throwing that many errors is clearly defective or dying.
Post by Scott Bennett
While I was out of town, I came across a trade magazine article that
said that as the areal density of bits approaches the theoretical limit for
the recording technology currently in production, the error rate climbs ever
more steeply, and that the drives larger than 1 TB are now making that effect
easily demonstrable. :-(
It took perpendicular recording to make >1TB drives possible at all.
Post by Scott Bennett
The article went on to describe superficially a new
recording technology due to appear on the mass market in 2015 that will allow
much higher bit densities, while drastically improving the error rate (at
least until densities eventually close in on that technology's limit). So
it may turn out that next year consumers will begin to move past the hump in
error rates and will find that hardware RAID will have become acceptably safe
once again. The description of the new recording technology looked like a
really spiffed up version of the magneto-optical disks of the 1990s. In the
meantime, though, the current crops of large-capacity disks apparently
require software solutions like ZFS to preserve data integrity.
I do not know the root cause of the uncorrectable errors, but they seem to vary with product line and not capacity. Whether that means the Enterprise drives with the order of magnitude better uncorrectable error rate have better coatings on the platters or better heads or better electronics or better QC I do not know. So I don’t know how much this new technology will affect those errors.

Scott Bennett
2014-08-26 06:41:40 UTC
Permalink
Post by Scott Bennett
What I'm seeing here is ~2 KB of errors out
of ~1.1TB, which is an error rate (in bytes, not bits) of ~1.82e+09, and the
majority of the erroneous bytes I looked at had multibit errors. I consider
that to be a huge change in the actual device error rates, specs be damned.
That seems like a very high error rate. Is the drive reporting those errors or are they getting past the drive’s error correction and showing up as checksum errors in ZFS ? A drive that is throwing that many errors is clearly defective or dying.
I'm not using ZFS yet. Once I get a couple more 2 TB drives, I'll give
it a shot.
The numbers are from running direct comparisons between the source file
and the copy of it using cmp(1). In one case, I ran the cmp twice and got
identical results, which I interpret as an indication that the errors are
occurring during the writes to the target disk during the copying.
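
For reference, a minimal sketch of that kind of test (the file and mount
point names here are made up):

  $ cp /source_fs/dumpfile /suspect_fs/dumpfile
  $ cmp -l /source_fs/dumpfile /suspect_fs/dumpfile | wc -l
  # cmp -l prints one line per differing byte (the offset plus the two
  # octal values), so the count approximates the number of corrupted bytes;
  # running the cmp a second time shows whether the mismatches are stable
  # (a bad copy on disk) or change from run to run (errors on read)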
Post by Scott Bennett
While I was out of town, I came across a trade magazine article that
said that as the areal density of bits approaches the theoretical limit for
the recording technology currently in production, the error rate climbs ever
more steeply, and that the drives larger than 1 TB are now making that effect
easily demonstrable. :-(
It took perpendicular recording to make >1TB drives possible at all.
Post by Scott Bennett
The article went on to describe superficially a new
recording technology due to appear on the mass market in 2015 that will allow
much higher bit densities, while drastically improving the error rate (at
least until densities eventually close in on that technology's limit). So
it may turn out that next year consumers will begin to move past the hump in
error rates and will find that hardware RAID will have become acceptably safe
once again. The description of the new recording technology looked like a
really spiffed up version of the magneto-optical disks of the 1990s. In the
meantime, though, the current crops of large-capacity disks apparently
require software solutions like ZFS to preserve data integrity.
I do not know the root cause of the uncorrectable errors, but they seem to vary with product line and not capacity. Whether that means the Enterprise drives with the order of magnitude better uncorrectable error rate have better coatings on the platters or better heads or better electronics or better QC I do not know. So I don’t know how much this new technology will affect those errors.
I guess we'll have to see what people report after the new technology
appears on the market next year.


Paul Kraus
2014-08-26 13:44:46 UTC
Permalink
Post by Scott Bennett
Post by Scott Bennett
What I'm seeing here is ~2 KB of errors out
of ~1.1TB, which is an error rate (in bytes, not bits) of ~1.82e+09, and the
majority of the erroneous bytes I looked at had multibit errors. I consider
that to be a huge change in the actual device error rates, specs be damned.
That seems like a very high error rate. Is the drive reporting those errors or are they getting past the drive’s error correction and showing up as checksum errors in ZFS ? A drive that is throwing that many errors is clearly defective or dying.
I'm not using ZFS yet. Once I get a couple more 2 TB drives, I'll give
it a shot.
The numbers are from running direct comparisons between the source file
and the copy of it using cmp(1). In one case, I ran the cmp twice and got
identical results, which I interpret as an indication that the errors are
occurring during the writes to the target disk during the copying.
Wow. That implies you are hitting a drive with a very high uncorrectable error rate since the drive did not report any errors and the data is corrupt. I have yet to run into one of those.

Scott Bennett
2014-08-28 06:36:05 UTC
Permalink
Post by Paul Kraus
Post by Scott Bennett
Post by Scott Bennett
What I'm seeing here is ~2 KB of errors out
of ~1.1TB, which is an error rate (in bytes, not bits) of ~1.82e+09, and the
As I caught and corrected before, the above should have said, "~1.82e-09".
Post by Paul Kraus
Post by Scott Bennett
Post by Scott Bennett
majority of the erroneous bytes I looked at had multibit errors. I consider
that to be a huge change in the actual device error rates, specs be damned.
That seems like a very high error rate. Is the drive reporting those errors or are they getting past the drive’s error correction and showing up as checksum errors in ZFS ? A drive that is throwing that many errors is clearly defective or dying.
I'm not using ZFS yet. Once I get a couple more 2 TB drives, I'll give
it a shot.
The numbers are from running direct comparisons between the source file
and the copy of it using cmp(1). In one case, I ran the cmp twice and got
identical results, which I interpret as an indication that the errors are
occurring during the writes to the target disk during the copying.
Wow. That implies you are hitting a drive with a very high uncorrectable error rate since the drive did not report any errors and the data is corrupt. I have yet to run into one of those.
How would an uncorrectable error be detected by the drive without any
parity checking or hardware-implemented write-with-verify?
Are you using any drives larger than 1 TB? If so, try copying a 1.1 TB
file to one of them, and then comparing the copy against the original.
Out of the three drives I could test that way, I got that kind of result on
two every time I tried it. One of the two was a new Samsung (i.e., a
Seagate), and the other was a refurbished Seagate supplied as a replacement
under warranty. The third got a clean copy the first time and two bytes with
single-bit errors on the second try. That one was also a refurbished Seagate
provided under warranty.


Paul Kraus
2014-08-28 14:27:59 UTC
Permalink
Post by Scott Bennett
Post by Paul Kraus
Wow. That implies you are hitting a drive with a very high uncorrectable error rate since the drive did not report any errors and the data is corrupt. I have yet to run into one of those.
How would an uncorrectable error be detected by the drive without any
parity checking or hardware-implemented write-with-verify?
I suppose my point was that an operation that is NOT flagged by the drive as failing and DOES return faulty data is, by definition, an uncorrectable error (as far as the drive is concerned). The point is that an uncorrectable error (from the drive standpoint) is just that, an error that the drive CANNOT detect.
Post by Scott Bennett
Are you using any drives larger than 1 TB?
I have been testing with a bunch of 2TB (3 HGST and 1 WD). I have been using ZFS and it has not reported *any* checksum errors.

I have put one of the 4 into production service (I needed a replacement for a failed 1TB and did not have any more 1TB in stock). It has been running for a couple weeks now with no checksum errors reported. My zpool is 5 x 1TB RAIDz2 and it has about 2TB of data on it right now.
Post by Scott Bennett
If so, try copying a 1.1 TB
file to one of them, and then comparing the copy against the original.
Hurmmm. I have not worked with individual files that large. What filesystem are you using here?
Post by Scott Bennett
Out of the three drives I could test that way, I got that kind of result on
two every time I tried it. One of the two was a new Samsung (i.e., a
Seagate), and the other was a refurbished Seagate supplied as a replacement
under warranty. The third got a clean copy the first time and two bytes with
single-bit errors on the second try. That one was also a refurbished Seagate
provided under warranty.
If you use ZFS on these drives and copy the same file do you get any checksum errors?

Scott Bennett
2014-08-30 01:47:40 UTC
Permalink
Post by Paul Kraus
Post by Scott Bennett
Post by Paul Kraus
Wow. That implies you are hitting a drive with a very high uncorrectable error rate since the drive did not report any errors and the data is corrupt. I have yet to run into one of those.
How would an uncorrectable error be detected by the drive without any
parity checking or hardware-implemented write-with-verify?
I suppose my point was that an operation that is NOT flagged by the drive as failing and DOES return faulty data is, by definition, an uncorrectable error (as far as the drive is concerned). The point is that an uncorrectable error (from the drive standpoint) is just that, an error that the drive CANNOT detect.
Maybe this is just a linguistics/semantics matter, but it seems to me
that an operation that is not flagged by the drive as an error but does return
faulty data is, by definition, not an error at all as far as the drive is
concerned, but it *is* an error as far as a human is concerned. The difference
in wording is why I asked the question, so at least the confusion is now
cleared up. :-)
Post by Paul Kraus
Post by Scott Bennett
Are you using any drives larger than 1 TB?
I have been testing with a bunch of 2TB (3 HGST and 1 WD). I have been using ZFS and it has not reported *any* checksum errors.
What sort of testing? Unless the data written with errors are read back,
how would ZFS know about any checksum errors? Does ZFS implement write-with-
verify? Copying some humongous file and then reading it back for comparison
(or, with ZFS, just reading them) ought to bring the checksums into play. Of
course, a scrub should do that, too.
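
(For illustration, the usual commands, with a hypothetical pool name:

  $ zpool scrub tank       # re-reads every allocated block and verifies its checksum
  $ zpool status -v tank   # shows per-device READ/WRITE/CKSUM counters and lists
                           # any files found to contain unrecoverable errors

A plain read of a file through ZFS checks the checksums of the blocks it
touches; a scrub checks everything that is allocated.)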
I have never bought the enterprise-grade drives--though I may begin doing
so after having read the information you've brought up here--so the difference
in drive quality at the outset may explain why your results so far have been
so much better than mine.
Post by Paul Kraus
I have put one of the 4 into production service (I needed a replacement for a failed 1TB and did not have any more 1TB in stock). It has been running for a couple weeks now with no checksum errors reported. My zpool is 5 x 1TB RAIDz2 and it has about 2TB of data on it right now.
Post by Scott Bennett
If so, try copying a 1.1 TB
file to one of them, and then comparing the copy against the original.
Hurmmm. I have not worked with individual files that large. What filesystem are you using here?
At the moment, all of my file systems on hard drives are UFS2.
Post by Paul Kraus
Post by Scott Bennett
Out of the three drives I could test that way, I got that kind of result on
two every time I tried it. One of the two was a new Samsung (i.e., a
Seagate), and the other was a refurbished Seagate supplied as a replacement
under warranty. The third got a clean copy the first time and two bytes with
single-bit errors on the second try. That one was also a refurbished Seagate
provided under warranty.
If you use ZFS on these drives and copy the same file do you get any checksum errors?
As soon as I can get two more 2 TB drives and set them up under ZFS,
I intend to try the equivalent of that. Because drives cannot be added to
an existing raidzN, I need to wait until then to create the 6-drive raidz2.
However, that original 1.1 TB file is currently sitting on one of the four
drives I already have that are intended for the raidz2, so that file will
be trashed by creating the raidz2. The file is a dump(8) file of a 1.2 TB
file system that is nearly full, so I can run the dump again with the output
going to the newly created pool, after which I can try a
"dd if=dumpfile of=/dev/null" to see whether ZFS detects any problems. If
it doesn't, then I can try a scrub on the pool to see whether that finds any
problems.
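
In command form, that plan might look roughly like the following sketch
(the device, pool, and file system names are placeholders, not my actual
ones):

  $ zpool create tank raidz2 ada1 ada2 ada3 ada4 ada5 ada6
  $ dump -0aLf - /dev/ada0s1f > /tank/dumpfile   # re-run the dump onto the pool
  $ dd if=/tank/dumpfile of=/dev/null bs=1m      # full read back through ZFS checksums
  $ zpool status -v tank                         # any checksum errors show up here
  $ zpool scrub tank                             # then verify everything once more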
My expectation is that I will end up contacting one or more manufacturers
to try to replace at least two drives based on whatever ZFS detects, but I
would be glad to be mistaken about that for now. If two are that bad, then
I hope that ZFS can keep things running until the replacements show up here.


daniel
2014-08-30 13:05:12 UTC
Permalink
Post by Scott Bennett
Post by Paul Kraus
Post by Scott Bennett
Are you using any drives larger than 1 TB?
I have been testing with a bunch of 2TB (3 HGST and 1 WD). I have been
using ZFS and it has not reported *any* checksum errors.
What sort of testing? Unless the data written with errors are read back,
how would ZFS know about any checksum errors? Does ZFS implement write-with-
verify? Copying some humongous file and then reading it back for comparison
(or, with ZFS, just reading them) ought to bring the checksums into play. Of
course, a scrub should do that, too.
No write with verify that I know of (at least, this type of verify) but
any read should bring the checksums into play. (And scrub, of course.)
Post by Scott Bennett
As soon as I can get two more 2 TB drives and set them up under ZFS,
I intend to try the equivalent of that. Because drives cannot be added to
an existing raidzN, I need to wait until then to create the 6-drive raidz2.
However, that original 1.1 TB file is currently sitting on one of the four
drives I already have that are intended for the raidz2, so that file will
be trashed by creating the raidz2. The file is a dump(8) file of a 1.2 TB
file system that is nearly full, so I can run the dump again with the output
going to the newly created pool, after which I can try a
"dd if=dumpfile of=/dev/null" to see whether ZFS detects any problems.
If
it doesn't, then I can try a scrub on the pool to see whether that finds any
problems.
My expectation is that I will end up contacting one or more manufacturers
to try to replace at least two drives based on whatever ZFS detects, but I
would be glad to be mistaken about that for now. If two are that bad, then
I hope that ZFS can keep things running until the replacements show up here.
Just for the testing, you can set up a one-drive zpool. ZFS wouldn't be
able to repair the error in that case (unless you set the 'copies'
property, but then you'll need more disk space for the write; basically
that would mean writing a backup to the same drive), but it will still
be able to detect it.
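
A sketch of that single-drive test setup (the disk and pool names are
hypothetical):

  $ zpool create testpool ada7
  $ zfs set copies=2 testpool     # optional: keep two copies so ZFS can also repair
  $ cp /somewhere/dumpfile /testpool/
  $ zpool scrub testpool
  $ zpool status -v testpool      # checksum errors, if any, are reported here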

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author. Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes. This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------
Paul Kraus
2014-08-30 23:00:32 UTC
Permalink
<snip>
Post by Scott Bennett
Post by Paul Kraus
I have been testing with a bunch of 2TB (3 HGST and 1 WD). I have been using ZFS and it has not reported *any* checksum errors.
What sort of testing? Unless the data written with errors are read back,
how would ZFS know about any checksum errors? Does ZFS implement write-with-
verify? Copying some humongous file and then reading it back for comparison
(or, with ZFS, just reading them) ought to bring the checksums into play. Of
course, a scrub should do that, too.
I typically run a scrub on any new drive after writing a bunch of data to it, specifically to look for infant mortality :-)
Post by Scott Bennett
I have never bought the enterprise-grade drives--though I may begin doing
so after having read the information you've brought up here--so the difference
in drive quality at the outset may explain why your results so far have been
so much better than mine.
Don’t go by what *I* say, go to the manufacturers’ web sites and download and read the full specifications on the drives you are looking at. None of the sales sites (Newegg, CDW, etc.) post the full specs, yet they are all (still) available from the Seagate / Western Digital / HGST etc. web sites.

I am just starting to play with a different WD Enterprise series, so far all my testing (and use) has been with the RE series, I just got two 1TB SE series (which are also 5 year warranty and claim to be Enterprise grade, rated for 24x7 operation). I put them into service today and expect to be loading data on them tomorrow or Monday. So now I will have Seagate ES, ES.2, HGST Ultrastar (various P/N), and WD RE, SE drives in use.

<snip>
Post by Scott Bennett
Post by Paul Kraus
Post by Scott Bennett
If so, try copying a 1.1 TB
file to one of them, and then comparing the copy against the original.
Hurmmm. I have not worked with individual files that large. What filesystem are you using here?
At the moment, all of my file systems on hard drives are UFS2.
I wonder if it is an issue with a single file larger than 1TB … just wondering out loud here.

<snip>
Post by Scott Bennett
My expectation is that I will end up contacting one or more manufacturers
to try to replace at least two drives based on whatever ZFS detects, but I
would be glad to be mistaken about that for now. If two are that bad, then
I hope that ZFS can keep things running until the replacements show up here.
I have never had to warranty a drive for uncorrectable errors, they have been a small enough percentage that I did not worry about them, and when the error rate gets big enough other things start going wrong as well. At least that has been my experience.

Scott Bennett
2014-08-31 07:49:30 UTC
Permalink
Post by Paul Kraus
<snip>
Post by Scott Bennett
Post by Paul Kraus
I have been testing with a bunch of 2TB (3 HGST and 1 WD). I have been using ZFS and it has not reported *any* checksum errors.
What sort of testing? Unless the data written with errors are read back,
how would ZFS know about any checksum errors? Does ZFS implement write-with-
verify? Copying some humongous file and then reading it back for comparison
(or, with ZFS, just reading them) ought to bring the checksums into play. Of
course, a scrub should do that, too.
I typically run a scrub on any new drive after writing a bunch of data to it, specifically to look for infant mortality :-)
Looks like a good idea. Whenever I get the raidz2 set up and some
sizable amount of data loaded into it, I intend to do the same. However,
because the capacity of the 6-drive raidz2 will be about four times the
original UFS2 capacity, I suppose I'll need to find a way to expand the
dump file in other ways, so as to cover the misbehaving tracks on the
individual drives.
Post by Paul Kraus
Post by Scott Bennett
I have never bought the enterprise-grade drives--though I may begin doing
so after having read the information you've brought up here--so the difference
in drive quality at the outset may explain why your results so far have been
so much better than mine.
Don’t go by what *I* say, go to the manufacturers’ web sites and download and read the full specifications on the drives you are looking at. None of the sales sites (Newegg, CDW, etc.) post the full specs, yet they are all (still) available from the Seagate / Western Digital / HGST etc. web sites.
Yes, I understood that from what you had already written. What I meant
was that I hadn't been aware that the manufacturers were selling the drives
divided into two differing grades of reliability. From now on, the issue
will be a matter of my budget vs. the price differences.
Post by Paul Kraus
I am just starting to play with a different WD Enterprise series, so far all my testing (and use) has been with the RE series, I just got two 1TB SE series (which are also 5 year warranty and claim to be Enterprise grade, rated for 24x7 operation). I put them into service today and expect to be loading data on them tomorrow or Monday. So now I will have Seagate ES, ES.2, HGST Ultrastar (various P/N), and WD RE, SE drives in use.
Okay. Thanks again for the info. Just out of curiosity, where do you
usually find those Hitachi drives?
Post by Paul Kraus
<snip>
Post by Scott Bennett
Post by Paul Kraus
Post by Scott Bennett
If so, try copying a 1.1 TB
file to one of them, and then comparing the copy against the original.
Hurmmm. I have not worked with individual files that large. What filesystem are you using here?
At the moment, all of my file systems on hard drives are UFS2.
I wonder if it is an issue with a single file larger than 1TB … just wondering out loud here.
Well, all I can say is that it is not supposed to be. After all, file
systems that were very large were the reason for going from UFS1 to UFS2.
Post by Paul Kraus
<snip>
Post by Scott Bennett
My expectation is that I will end up contacting one or more manufacturers
to try to replace at least two drives based on whatever ZFS detects, but I
would be glad to be mistaken about that for now. If two are that bad, then
I hope that ZFS can keep things running until the replacements show up here.
I have never had to warranty a drive for uncorrectable errors, they have been a small enough percentage that I did not worry about them, and when the error rate gets big enough other things start going wrong as well. At least that has been my experience.
I would count yourself very lucky if I were you, although my previous
remark regarding the difference in reliability grades still holds.


Paul Kraus
2014-08-31 17:12:32 UTC
Permalink
Post by Scott Bennett
Post by Paul Kraus
I typically run a scrub on any new drive after writing a bunch of data to it, specifically to look for infant mortality :-)
Looks like a good idea. Whenever I get the raidz2 set up and some
sizable amount of data loaded into it, I intend to do the same. However,
because the capacity of the 6-drive raidz2 will be about four times the
original UFS2 capacity, I suppose I'll need to find a way to expand the
dump file in other ways, so as to cover the misbehaving tracks on the
individual drives.
I’m not sure I would worry about exercising the entire range of tracks on the platters, if a platter has a problem (heads or coating) it will likely show up all over the platter. If the problem is specific to a region, I would expect the drive to be able to remap the bad sectors (as we previously discussed).
Post by Scott Bennett
Post by Paul Kraus
Don’t go by what *I* say, go to the manufacturers’ web sites and download and read the full specifications on the drives you are looking at. None of the sales sites (Newegg, CDW, etc.) post the full specs, yet they are all (still) available from the Seagate / Western Digital / HGST etc. web sites.
Yes, I understood that from what you had already written. What I meant
was that I hadn't been aware that the manufacturers were selling the drives
divided into two differing grades of reliability. From now on, the issue
will be a matter of my budget vs. the price differences.
Sorry if I was being overly descriptive, I am more of a math and science guy than an English guy, so my writing is often not the most clear. When I started buying Enterprise instead of Desktop drives the price difference was under $20 for a $100 drive. The biggest reason I started buying the Enterprise drives is that they are RATED for 24x7 operation, while Desktop drives are typically designed for 8x5 (but rarely do they say :-) While I do have my desktop and laptop systems set up to spin down the drives when not in use (and I leave some of them booted 24x7), my server(s) run 24x7 and THAT is where I pay for the Enterprise drives. I treat the drives in the laptop / desktop systems as disposable and do NOT keep any important data only on them (I rsync my laptop to the server a couple times per week and use TimeMachine when at the office).

<snip>
Post by Scott Bennett
Okay. Thanks again for the info. Just out of curiosity, where do you
usually find those Hitachi drives?
Newegg … Once they learned how to ship drives without destroying them, I started buying drives from them :-)

<snip>
Post by Scott Bennett
Post by Paul Kraus
I wonder if it is an issue with a single file larger than 1TB … just wondering out loud here.
Well, all I can say is that it is not supposed to be. After all, file
systems that were very large were the reason for going from UFS1 to UFS2.
I realized that I proposed something ludicrous (the problem with thinking “aloud”): if the FS did not support -files- larger than 1TB, then the write operation would have failed when you got to that point. Yes, I remember FSes that could not handle a -file- larger than 2GB!

Note that there is a difference between the size of a filesystem and the size of the largest -file- that filesystem may contain.

<snip>
Post by Scott Bennett
Post by Paul Kraus
I have never had to warranty a drive for uncorrectable errors, they have been a small enough percentage that I did not worry about them, and when the error rate gets big enough other things start going wrong as well. At least that has been my experience.
I would count yourself very lucky if I were you, although my previous
remark regarding the difference in reliability grades still holds.
I have not tried to use Desktop drives in servers (either my own or a client’s) for well over a decade. I do not remember much about drive failures before that. Back then my need for capacity was growing faster than drives were failing, so I was upgrading before the drives failed. I still have a pile of 9GB SCSI drives (and some 18GB and 36GB) kicking around from those days. Not to mention the drawer full of 500MB (yes, 0.5GB) drives I harvested from an old Sun SS1000 before I sold it … I should have left the drives in it.

Scott Bennett
2014-08-09 05:52:36 UTC
Permalink
Post by Scott Bennett
Even just as parity bits, those would amount to only one bit per
eight bytes, which seems inadequate. OTOH, the 520 bytes thing is
tickling something in my memory that I can't quite seem to recover, and
Weren't those drives used on the AS/400?
I never worked with the AS/400 and paid little attention to stuff
about it, so I really don't know. I had assumed that they used the same
disk drive lines that were used on the mainframes. IBM kind of abandoned
sectored drives after the 1311 and switched to all count-data|count-key-data
drives, so if the AS/400 used the drives being made at the time the AS/400
made its appearance in the market, then the 520 bytes stuff was in relation
to non-IBM systems. OTOH, if IBM introduced a new line of drives for the
AS/400, then they could have been sectored drives with 520-byte sectors.


Paul Kraus
2014-08-06 23:30:50 UTC
Permalink
Post by Scott Bennett
Post by Arthur Chance
Quite right. If you have N disks in a RAIDZx configuration, the fraction
used for data is (N-x)/N and the fraction for parity is x/N. There's
always overhead for the file system bookkeeping of course, but that's
not specific to ZFS or RAID.
But ZFS does NOT use fixed width stripes across the devices in the RAIDz<n> vdev. The stripe size changes based on number of devices and size of the write operation. ZFS adds parity and padding to make the data fit among the number of devices.
Post by Scott Bennett
I wonder if what varies is the amount of space taken up by the
checksums. If there's a checksum for each block, then the block size
would change the fraction of the space lost to checksums, and the parity
for the checksums would thus also change. Enough to matter? Maybe.
Nope, the size of the checksum does NOT vary with vdev configuration.

Going back to Matt’s blog again (and I agree that his use of the term “n-sector block” is confusing).

http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/

Read the blog, don’t just look at the charts :-) My summary is below and may help folks to better understand Matt’s text.

According to the blog (and I trust Matt in this regard), RAIDz does NOT calculate parity per stripe across devices, but on a write-by-write basis. Matt linked to a descriptive chart. The chart assumes a 5-device RAIDz1. Each color is a different write operation (remember that ZFS is copy-on-write, so every write is a new write, no modifying existing data on disk).

The orange write consists of 8 data blocks and 2 parity blocks. Assuming 512B disk blocks, that is 4KB of data and 1KB of parity, i.e. a 4KB write operation.

The yellow write is a 1.5KB write (3 data blocks) and 1 parity.

The green is the same as the yellow, just aligned differently.

Note that all columns (drives) are NOT involved in all write (and later read) operations.

The brown write is one data block (512B) and one parity.

The light purple write is 14 data blocks (7KB) and 4 parity.

Quoting directly from Matt:

A 11-sector block will use 1 parity + 4 data + 1 parity + 4 data + 1 parity + 3 data (e.g. the blue block in rows 9-12). Note that if there are several blocks sharing what would traditionally be thought of as a single “stripe”, there will be multiple parity blocks in the “stripe”.

RAID-Z also requires that each allocation be a multiple of (p+1), so that when it is freed it does not leave a free segment which is too small to be used (i.e. too small to fit even a single sector of data plus p parity sectors – e.g. the light blue block at left in rows 8-9 with 1 parity + 2 data + 1 padding). Therefore, RAID-Z requires a bit more space for parity and overhead than RAID-4/5/6.

This leads to the spreadsheet: https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674

The column down the left is filesystem block size in disk sectors (512B sectors), so it goes from 0.5KB to 128KB filesystem block size (recordsize is the max you set when you tune the zfs dataset; zfs can and will write less than full records).

The column across the top is number of devices in the RAIDz1 vdev (see other sheets in the workbook for RAIDz2 and RAIDz3).

Keep in mind that the left column is also the size of the data you are writing. If you are using a database with an 8KB recordsize (16 disk sectors) and you have 6 devices per vdev, then you will lose 20% of the raw space to parity (plus additional for checksums and metadata). The chart further down (rows 29 through 37) shows the same data but just for the powers-of-2 increments.
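
As a concrete check of that 20% figure (my own arithmetic for the assumed
case of 512B sectors, an 8KB record, and a 6-device RAIDz1):

  #   data sectors   = 8192 / 512         = 16
  #   parity sectors = ceil(16 / (6 - 1)) = 4   (one parity sector per row of up to 5 data sectors)
  #   allocation     = 16 + 4             = 20 sectors, already a multiple of p+1 = 2, so no padding
  $ echo 'scale=2; 4 / 20' | bc   # fraction of the allocation that is parity
  .20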

So, as Matt says, the more devices you add to a RAID vdev, the more net capacity you will have. At the expense of performance. Quoting Matt’s opening:

TL;DR: Choose a RAID-Z stripe width based on your IOPS needs and the amount of space you are willing to devote to parity information. If you need more IOPS, use fewer disks per stripe. If you need more usable space, use more disks per stripe. Trying to optimize your RAID-Z stripe width based on exact numbers is irrelevant in nearly all cases.

and his summary at the end:

The strongest valid recommendation based on exact fitting of blocks into stripes is the following: If you are using RAID-Z with 512-byte sector devices with recordsize=4K or 8K and compression=off (but you probably want compression=lz4): use at least 5 disks with RAIDZ1; use at least 6 disks with RAIDZ2; and use at least 11 disks with RAIDZ3.

Note that you would ONLY use recordsize = 4KB or 8KB if you knew that your workload was ONLY 4 or 8 KB blocks of data (a database).

and finally:

To summarize: Use RAID-Z. Not too wide. Enable compression.
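
For completeness, those are ordinary pool/dataset properties; a sketch with
hypothetical names (the 8K recordsize only makes sense for a fixed-record
workload such as a database):

  $ zfs set compression=lz4 tank
  $ zfs create -o recordsize=8K tank/db
  $ zfs get compression,recordsize tank tank/db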

Paul Kraus
2014-08-02 15:58:47 UTC
Permalink
Post by Paul Kraus
ZFS parity is handled slightly differently than for traditional raid-5 (as well as the striping of data / parity blocks). So you cannot just count on losing 1, 2, or 3 drives' worth of space to parity. See Matt Ahrens' blog entry here http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for (probably) more data on this than you want :-) And here https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674 is his spreadsheet that relates space lost due to parity to number of drives in raidz vdev and data block size (yes, the amount of space lost to parity varies with the data block, not the configured filesystem block size!). There is a separate tab for each of RAIDz1, RAIDz2, and RAIDz3.
Anyway, using lynx(1), it is very hard to make any sense of the spreadsheet.
Even with a graphic browser, let's say that spreadsheet is not a paragon of clarity.
Do NOT try to understand the spreadsheet on its own; it is part of the blog entry. Read the blog and look at the spreadsheet as Matt refers to it.
It's not clear what "block size in sectors" means in that context. Filesystem blocks, presumably, but are sectors physical or virtual disk blocks, 512 or 4K? What is that number when using a standard configuration of a disk with 4K sectors and ashift=12? It could be 1, or 8, or maybe something else.
As I read it, RAIDZ2 with five disks uses somewhere between 67% and 40% of the data space for redundancy. The first seems unlikely, but I can't tell. Better labels or rearrangement would help.
A second chart with no labels at all follows the first. It has only the
power-of-two values in the "block size in sectors" column. A restatement of the first one... but it's not clear why.
Look at the names of the sheets in the document. They are referred to back in the blog entry.

Scott Bennett
2014-08-06 05:48:14 UTC
Permalink
Post by Warren Block
Post by Paul Kraus
ZFS parity is handled slightly differently than for traditional
raid-5 (as well as the striping of data / parity blocks). So you
cannot just count on losing 1, 2, or 3 drives' worth of space to
parity. See Matt Ahrens' blog entry here
http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for
(probably) more data on this than you want :-) And here
https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674
is his spreadsheet that relates space lost due to parity to number of
drives in raidz vdev and data block size (yes, the amount of space
lost to parity varies with the data block, not the configured filesystem
block size!). There is a separate tab for each of RAIDz1, RAIDz2, and
RAIDz3.
Anyway, using lynx(1), it is very hard to make any sense of the
spreadsheet.
Even with a graphic browser, let's say that spreadsheet is not a paragon
of clarity. It's not clear what "block size in sectors" means in that
context. Filesystem blocks, presumably, but are sectors physical or
virtual disk blocks, 512 or 4K? What is that number when using a
Sounds like that documents the situation no better than the gcache(8)
man page regarding the use of gcache(8) with graid3(8). :-(
Post by Warren Block
standard configuration of a disk with 4K sectors and ashift=12? It
could be 1, or 8, or maybe something else.
As I read it, RAIDZ2 with five disks uses somewhere between 67% and 40%
of the data space for redundancy. The first seems unlikely, but I can't
tell. Better labels or rearrangement would help.
A second chart with no labels at all follows the first. It has only the
power-of-two values in the "block size in sectors" column. A
restatement of the first one... but it's not clear why.
I wish I knew a way to get these drives to admit to the operating
system that they really use 4k sectors, rather than wasting kernel time
supervising eight 512-byte I/O operations for each real 4096-byte I/O
operation. :-{
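
For what it's worth, the workaround usually suggested (a sketch based on
general gnop(8) and zpool(8) usage, with made-up device names, not
something anyone in this thread has tested) is not to change what the
drive reports but to force ZFS to choose ashift=12 when the pool is
created:

    # create a temporary provider that advertises 4096-byte sectors
    gnop create -S 4096 /dev/ada1
    # build the pool on the .nop device so zpool picks ashift=12
    zpool create tank raidz /dev/ada1.nop /dev/ada2 /dev/ada3
    # the shim is only needed at creation time
    zpool export tank
    gnop destroy /dev/ada1.nop
    zpool import tank

    # on systems that have it, this sysctl does the same job without gnop
    sysctl vfs.zfs.min_auto_ashift=12

That does nothing for other GEOM consumers, but it at least keeps ZFS
from issuing 512-byte writes to a 4K-sector drive.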
Post by Warren Block
My previous understanding was that RAIDZ2 with five disks would leave
60% of the capacity for data.
That was the way I had understood it, too. I have nowhere found
any explanation of his reference to "padding" either.
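
The best sense I can make of it, after rereading the blog entry, is
this (my reading of it, so treat it as a sketch rather than gospel):
raidz allocates each block's data and parity sectors in multiples of
(parity + 1) sectors, and the extra sectors needed to round up to such a
multiple are the "padding." The point of the rounding seems to be that a
freed segment is never smaller than the smallest possible allocation (one
data sector plus its parity). For RAIDZ2 on five disks with 512-byte
sectors, that gives, for example:

      1-sector block:    1 data +  2 parity          =   3 sectors, 67% overhead
      8-sector block:    8 data +  6 parity + 1 pad  =  15 sectors, ~47% overhead
    128-sector block:  128 data + 86 parity + 2 pad  = 216 sectors, ~41% overhead

The overhead approaches the naive 2/5 = 40% only for large blocks, which
would explain the 67%-to-40% spread mentioned above.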


Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet: bennett at sdf.org *xor* bennett at freeshell.org *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good *
* objection to the introduction of that bane of all free governments *
* -- a standing army." *
* -- Gov. John Hancock, New York Journal, 28 January 1790 *
**********************************************************************
Daniel Staal
2014-08-02 22:33:38 UTC
Permalink
--As of August 2, 2014 1:21:54 AM -0500, Scott Bennett is alleged to have
Post by Scott Bennett
Post by Paul Kraus
Post by Scott Bennett
Does not support migration to any other Does not support migration
RAID levels or their equivalents. between raidz levels, even by
Correct. Once you have created a vdev, that vdev must remain the same
type. You can add mirrors to a mirror vdev, but you cannot add drives to,
or change the raid level of, raidz1, raidz2, or raidz3 vdevs.
Too bad. Increasing the raidz level ought to be not much more
difficult than growing the raidz device by adding more spindles. Doing
the latter ought to be no more difficult than doing it with gvinum's
stripe or raid5 devices. Perhaps the ZFS developers will eventually
implement these capabilities. (A side thought: gstripe and graid3
devices ought also to be expandable in this manner, although the resulting
number of graid3 components would still need to be 2^n + 1.)
--As for the rest, it is mine.

There actually is a semi-simple way, even if it's not direct...

You can 'send' a ZFS filesystem to a backup drive, and then 'receive' it
back to a new pool. It will keep all filesystem- and volume-level properties when
you do that, but the pools can be set up differently. It's not something
you can do in-place, but it's not hard either.

(Basically, it's a 'backup and restore to a new setup', but a majorly
simplified one.)
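
A rough sketch of what I mean, assuming a pool 'tank' to be reshaped and
a scratch pool 'backup' big enough to hold a copy (all names are
placeholders):

    # snapshot everything and copy it, properties included, to the scratch pool
    zfs snapshot -r tank@migrate
    zfs send -R tank@migrate | zfs receive -F backup/tank
    # destroy 'tank', re-create it with the new raidz layout, then copy back
    zfs send -R backup/tank@migrate | zfs receive -F tank

The -R flag is what carries the whole dataset tree along with its
snapshots and properties, so the filesystem- and volume-level settings
survive the round trip.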

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author. Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes. This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------
Scott Bennett
2014-08-06 06:02:11 UTC
Permalink
Post by Scott Bennett
[Ouch. Trying to edit a response into entire paragraphs on single lines
is a drag.]
map # !}fmt 72
With this I can then, in vi's command mode, hit the # key to reformat
paragraphs. Easy.
Thanks for the tip! I'll give it a try. But I still think people
should use editors, rather than word processors, to post to mailing lists.
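
(If the mapping earns its keep, the same line can also be put into
~/.nexrc or ~/.exrc, assuming the editor is nvi, so that it is loaded at
startup.)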


Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet: bennett at sdf.org *xor* bennett at freeshell.org *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good *
* objection to the introduction of that bane of all free governments *
* -- a standing army." *
* -- Gov. John Hancock, New York Journal, 28 January 1790 *
**********************************************************************
Scott Bennett
2014-08-06 06:23:35 UTC
Permalink
Post by Daniel Staal
--As of August 2, 2014 1:21:54 AM -0500, Scott Bennett is alleged to have
Post by Scott Bennett
Post by Paul Kraus
Post by Scott Bennett
Does not support migration to any other Does not support migration
RAID levels or their equivalents. between raidz levels, even by
Correct. Once you have created a vdev, that vdev must remain the same
type. You can add mirrors to a mirror vdev, but you cannot add drives to,
or change the raid level of, raidz1, raidz2, or raidz3 vdevs.
Too bad. Increasing the raidz level ought to be not much more
difficult than growing the raidz device by adding more spindles. Doing
the latter ought to be no more difficult than doing it with gvinum's
stripe or raid5 devices. Perhaps the ZFS developers will eventually
implement these capabilities. (A side thought: gstripe and graid3
devices ought also to be expandable in this manner, although the resulting
number of graid3 components would still need to be 2^n + 1.)
--As for the rest, it is mine.
There actually is a semi-simple way, even if it's not direct...
You can 'send' a ZFS filesystem to a backup drive, and then 'receive' it
back to a new pool. It will keep all file and volume level options when
you do that, but the pools can be set up differently. It's not something
you can do in-place, but it's not hard either.
Right, and that's why it is actually *not* a way. A "grow" operation,
as implemented in gvinum(8), is done in place.
Post by Daniel Staal
(Basically, it's a simplified 'backup and restore to new setup', but it is
majorly simplified.)
That would mean having/buying enough more space to be able to do that.
Further, one would probably want to set up a new raidzN on those devices
to hold all that output, so that one would not risk corrupting/losing one's
data during the process. And if one is going to do that, then why copy it
back? Growing it in place would eliminate that space requirement, perhaps
saving the owner a considerable amount of money. The only additional
spindles needed would be the new ones to increase the space or raidz level.


Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet: bennett at sdf.org *xor* bennett at freeshell.org *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good *
* objection to the introduction of that bane of all free governments *
* -- a standing army." *
* -- Gov. John Hancock, New York Journal, 28 January 1790 *
**********************************************************************
Daniel Staal
2014-08-06 15:49:46 UTC
Permalink
--As of August 6, 2014 1:23:35 AM -0500, Scott Bennett is alleged to have
Post by Scott Bennett
That would mean having/buying enough more space to be able to do
that. Further, one would probably want to set up a new raidzN on those
devices to hold all that output, so that one would not risk
corrupting/losing one's data during the process. And if one is going to
do that, then why copy it back? Growing it in place would eliminate that
space requirement, perhaps saving the owner a considerable amount of
money. The only additional spindles needed would be the new ones to
increase the space or raidz level.
--As for the rest, it is mine.

Agreed. ;) But I've seen it mentioned on pages talking about moving to
new datasets, and there are other reasons to do it - notably, the *other*
way to grow a zfs pool, adding vdevs, doesn't spread the I/O across all
the disks without a send/receive to re-write the data, since new writes
go to the *empty* vdev by preference. You can also do things like store
the zfs stream at a cloud provider for the duration of the exchange.
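
For example (made-up device names), growing a pool by adding a whole new
vdev is just:

    # add a second raidz2 vdev; existing data stays where it already is
    zpool add tank raidz2 da6 da7 da8 da9 da10

but until the data is re-written (e.g. via a send/receive cycle), new
writes mostly land on the emptier vdev instead of being striped across
everything.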

So I thought it was worth mentioning in this context, even if it wasn't
exactly the same thing.

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author. Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes. This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------