Discussion:
ZFS and 2 TB disk drive technology :-(
Scott Bennett
2014-09-24 11:08:05 UTC
I've now tried some testing with ZFS on four of the five drives
that I currently have ready to put into use for a raidz2 cluster. In
the process, I've found that some of the recommendations made for
setting various kernel variables in /boot/loader.conf don't seem to
work as represented, at least not in i386. To the best of my memory,
setting vfs.zfs.arc_max or vm.kmem_size results in a panic in very short
order. Secondly, setting vm.kmem_size_max works, but only if the value
to which it is set does not exceed 512 MB. 512 MB, however, does seem
to be sufficient to eliminate the ZFS kernel module's initialization
warning that says to expect unstable behavior, so that problem appears
to have been resolved.
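For reference, the tuning that held on this i386 box can be written as a /boot/loader.conf fragment (a sketch based only on what is reported above; the 512 MB ceiling is specific to this machine):

```
# /boot/loader.conf (i386) -- the only setting that held on this box.
# Setting vfs.zfs.arc_max or vm.kmem_size at all, or using a value
# above 512M here, reportedly panicked the machine in short order.
vm.kmem_size_max="512M"
```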
I created a four-way mirror vdev, where the four drives were as
follows.

da1 WD 2TB drive (new, in old "MyBook" case with USB 2.0,
Firewire 400, and eSATA interfaces, connected via
Firewire 400)

da2 Seagate 2TB drive (refurbished and seems to work
tolerably well, in old Backups Plus case with USB 3.0
interface)

da5 Seagate 2TB drive (refurbished, already shown to get
between 1900 and 2000 bytes in error on a 1.08 TB file
copy, in old Backups Plus case with USB 3.0 interface)

da7 Samsung 2TB drive (Samsung D3 Station, new in June,
already shown to get between 1900 and 2000 bytes in
error on a 1.08 TB file copy, with USB 3.0 interface)

Then I copied the 1.08 TB file again from another Seagate 2 TB drive
to the mirror vdev. No errors were detected during the copy. Then I
began creating a tar file from large parts of a nearly full 1.2 TB file
system (UFS2) on yet another Seagate 2TB on the Firewire 400 bus with the
tar output going to a file in the mirror in order to try to have written
something to most of the sectors on the four-drive mirror. I terminated
tar after the empty space in the mirror got down to about 3% because the
process had slowed to a crawl. (Apparently, space allocation in ZFS
slows down far more than UFS2 when available space gets down to the last
few percent.:-( )
Next, I ran a scrub on the mirror and, after the scrub finished, got
the following output from a "zpool status -v".

pool: testmirror
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: scrub repaired 1.38M in 17h59m with 1 errors on Mon Sep 15 19:53:45 2014
config:

NAME STATE READ WRITE CKSUM
testmirror ONLINE 0 0 1
mirror-0 ONLINE 0 0 2
da1p5 ONLINE 0 0 2
da2p5 ONLINE 0 0 2
da5p5 ONLINE 0 0 8
da7p5 ONLINE 0 0 7

errors: Permanent errors have been detected in the following files:

/backups/testmirror/backups.s2A

Note that the choices of recommended action above do *not* include
replacing a bad drive and having ZFS rebuild its content on the
replacement. Why is that so?
Thinking, apparently naively, that the scrub had repaired some or
most of the errors and wanting to know which drives had ended up with
permanent errors, I did a "zpool clear testmirror" and ran another scrub.
During this scrub, I got some kernel messages on the console:

(da7:umass-sim5:5:0:0): WRITE(10). CDB: 2a 00 3b 20 4d 36 00 00 05 00
(da7:umass-sim5:5:0:0): CAM status: CCB request completed with an error
(da7:umass-sim5:5:0:0): Retrying command
(da7:umass-sim5:5:0:0): WRITE(10). CDB: 2a 00 3b 20 4d 36 00 00 05 00
(da7:umass-sim5:5:0:0): CAM status: CCB request completed with an error
(da7:umass-sim5:5:0:0): Retrying command

I don't know how to decipher these error messages (i.e., what do the hex
digits after "CDB: " mean?) When it had finished, another "zpool status
-v" showed these results.

pool: testmirror
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: scrub repaired 1.25M in 18h4m with 1 errors on Tue Sep 16 15:02:56 2014
config:

NAME STATE READ WRITE CKSUM
testmirror ONLINE 0 0 1
mirror-0 ONLINE 0 0 2
da1p5 ONLINE 0 0 2
da2p5 ONLINE 0 0 2
da5p5 ONLINE 0 0 6
da7p5 ONLINE 0 0 8

errors: Permanent errors have been detected in the following files:

/backups/testmirror/backups.s2A

So it is not clear to me that either scrub fixed *any* errors at
all. I next ran a comparison ("cmp -z -l") of the original against the
copy now on the mirror, which found these differences before cmp(1) was
terminated because the vm_pager got an error while trying to read in a
block from the mirror vdev. (The cpuset stuff was to prevent cmp(1)
from interfering too much with another ongoing, but unrelated, process.)

Script started on Wed Sep 17 01:37:38 2014
[hellas] 101 % time nice +12 cpuset -l 3,0 cmp -z -l /backups/s2C/save/backups.s2A /backups/testmirror/backups.s2A
8169610513 164 124
71816953105 344 304
121604893969 273 233
160321633553 170 130
388494183697 42 2
488384007441 266 226
574339165457 141 101
662115138833 145 105
683519290641 157 117
683546029329 60 20
cmp: Input/output error (caught SIGSEGV)
4144.600u 3948.457s 8:08:08.33 27.6% 15+-393k 5257820+0io 10430953pf+0w
[hellas] 104 % time nice +12 cpuset -l 3,0 cmp -z -l /backups/s2C/save/backups.s2A /backups/testmirror/backups.s2A
6022126866 164 124
69669469458 344 304
119457410322 273 233
158174149906 170 130
386346700050 42 2
486236523794 266 226
572191681810 141 101
659967655186 145 105
681371806994 157 117
681398545682 60 20
cmp: Input/output error (caught SIGSEGV)
4132.551u 4003.112s 8:13:20.95 27.4% 15+-345k 5241297+0io 10560652pf+0w
[hellas] 105 % time nice +12 cpuset -l 3,0 cmp -z -l /backups/s2C/save/backups.s2A /backups/testmirror/backups.s2A
8169610513 164 124
71816953105 344 304
121604893969 273 233
160321633553 170 130
388494183697 42 2
488384007441 266 226
574339165457 141 101
662115138833 145 105
683519290641 157 117
683546029329 60 20
cmp: Input/output error (caught SIGSEGV)
4136.621u 3977.459s 8:07:43.85 27.7% 15+-378k 5257810+0io 10430951pf+0w
[hellas] 106 %

As you can see, the hard error seems to be pretty consistent. Also, the
bytes found to differ up until termination all differ by a single bit that
was on in the original and is off in the copy, always the same bit in the
byte.
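That observation can be checked mechanically: cmp -l prints the differing byte values in octal, and XORing each pair isolates the differing bit. A quick sketch over the values from the output above:

```python
# Differing byte values from the "cmp -z -l" output above,
# in octal exactly as cmp(1) prints them: (original, copy)
pairs = [("164", "124"), ("344", "304"), ("273", "233"), ("170", "130"),
         ("42", "2"), ("266", "226"), ("141", "101"), ("145", "105"),
         ("157", "117"), ("60", "20")]

# XOR each pair to isolate which bit(s) changed
diffs = {int(a, 8) ^ int(b, 8) for a, b in pairs}
print(diffs)  # {32}: every mismatch is bit 5 (0x20)
```

Every pair XORs to 32 (0x20), i.e. bit 5, set in the original and clear in the copy; that pattern looks more like a single stuck bit somewhere in the transfer path than like random media corruption.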
Another issue revealed above is that ZFS, in spite of having *four*
copies of the data and checksums of them, failed to detect any problem
while reading the data back for cmp(1), much less feed cmp(1) the correct
version of the data rather than a corrupted version. Similarly, the hard
error (not otherwise logged by the kernel) apparently encountered by
vm_pager resulted in termination of cmp(1) rather than resulting in ZFS
reading the page from one of the other three drives. I don't see how ZFS
is of much help here, so I guess I must have misunderstood the claims for
ZFS that I've read on this list and in the available materials on-line.
I don't know where to turn next. I will try to call Seagate/Samsung
later today again about the bad Samsung drive and the bad, refurbished
Seagate drive, but they already told me once that having a couple of kB
of errors in a ~1.08 TB file copy does not mean that the drive is bad.
I don't know whether they will consider a hard write error to mean the
drive is bad. The kernel messages shown above are the first ones I've
gotten about any of the drives involved in the copy operation or the
tests described above.
If anyone reading this has any suggestions for a course of action
here, I'd be most interested in reading them. Thanks in advance for any
ideas and also for any corrections if I've misunderstood what a ZFS
mirror was supposed to have done to preserve the data and maintain
correct operation at the application level.


Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet: bennett at sdf.org *xor* bennett at freeshell.org *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good *
* objection to the introduction of that bane of all free governments *
* -- a standing army." *
* -- Gov. John Hancock, New York Journal, 28 January 1790 *
**********************************************************************
Andrew Berg
2014-09-24 11:37:30 UTC
Post by Scott Bennett
If anyone reading this has any suggestions for a course of action
here, I'd be most interested in reading them. Thanks in advance for any
ideas and also for any corrections if I've misunderstood what a ZFS
mirror was supposed to have done to preserve the data and maintain
correct operation at the application level.
I skimmed over the long message, and my first thought is that you have a messed
up controller that is lying. I've run into such a controller on a hard drive
enclosure that is supposed to support disks larger than 2TB, but seems to write
to who knows where when you want a sector beyond 2TB, and the filesystem layer
has no idea anything is wrong. This is all just an educated guess, but
considering you get errors at a level below ZFS (would it be called the CAM
layer?), my advice would be to check the controllers and perhaps even the disks
themselves. AFAIK, issues at that layer are rarely software ones.
Paul Kraus
2014-09-24 15:24:35 UTC
On 9/24/14 7:08, Scott Bennett wrote:

<snip>

What version of FreeBSD are you running ?

What hardware are you running it on ?
Post by Scott Bennett
Then I copied the 1.08 TB file again from another Seagate 2 TB drive
to the mirror vdev. No errors were detected during the copy. Then I
began creating a tar file from large parts of a nearly full 1.2 TB file
system (UFS2) on yet another Seagate 2TB on the Firewire 400 bus with the
tar output going to a file in the mirror in order to try to have written
something to most of the sectors on the four-drive mirror. I terminated
tar after the empty space in the mirror got down to about 3% because the
process had slowed to a crawl. (Apparently, space allocation in ZFS
slows down far more than UFS2 when available space gets down to the last
few percent.:-( )
ZFS's space allocation algorithm will have trouble (performance issues)
allocating new blocks long before you get a few percent free. This is
known behavior and the threshold for performance degradation varies with
work load and historical write patterns. My rule of thumb is that you
really do not want to go past 75-80% full, but I have seen reports over
on the ZFS list of issues with very specific write patterns and work
load with as little as 50% used. For your work load, writing very large
files once, I would expect that you can get close to 90% used before
seeing real performance issues.
Post by Scott Bennett
Next, I ran a scrub on the mirror and, after the scrub finished, got
the following output from a "zpool status -v".
pool: testmirror
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: scrub repaired 1.38M in 17h59m with 1 errors on Mon Sep 15 19:53:45 2014
The above means that ZFS was able to repair 1.38MB of bad data but still
ran into 1 situation (unknown size) that it could not fix.
Post by Scott Bennett
NAME STATE READ WRITE CKSUM
testmirror ONLINE 0 0 1
mirror-0 ONLINE 0 0 2
da1p5 ONLINE 0 0 2
da2p5 ONLINE 0 0 2
da5p5 ONLINE 0 0 8
da7p5 ONLINE 0 0 7
/backups/testmirror/backups.s2A
And here is the file that contains the bad data.
Post by Scott Bennett
Note that the choices of recommended action above do *not* include
replacing a bad drive and having ZFS rebuild its content on the
replacement. Why is that so?
Correct, because for some reason ZFS was not able to read enough of the
data without checksum errors to give you back your data intact.
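What a scrub does per block on a mirror can be pictured with a toy model (illustrative only, not ZFS code; SHA-256 stands in for ZFS's own per-block checksums, which live in the parent block pointers): verify each copy against the checksum, rewrite bad copies from any good one, and declare a permanent error only when no copy validates.

```python
import hashlib

def scrub_block(copies, checksum):
    """Toy model of mirror self-healing (not actual ZFS code).

    copies: list of byte strings, one per mirror side.
    Returns (number_of_sides_repaired, permanent_error).
    """
    # Find any side whose contents still match the stored checksum
    good = next((c for c in copies if hashlib.sha256(c).digest() == checksum), None)
    if good is None:
        return 0, True              # no valid copy anywhere: permanent error
    repaired = 0
    for i, c in enumerate(copies):
        if hashlib.sha256(c).digest() != checksum:
            copies[i] = good        # rewrite the bad side from a good one
            repaired += 1
    return repaired, False

data = b"block"
cksum = hashlib.sha256(data).digest()
print(scrub_block([data, b"xxxxx", data, data], cksum))  # (1, False): one side repaired
print(scrub_block([b"a", b"b", b"c", b"d"], cksum))      # (0, True): unrecoverable
```

The second case is what "scrub repaired 1.38M ... with 1 errors" means: plenty of blocks had at least one good side and were rewritten, but at least one block had no valid copy on any of the four drives.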
Post by Scott Bennett
Thinking, apparently naively, that the scrub had repaired some or
most of the errors
It did, 1.38MB worth. But it also had errors it could not repair.
Post by Scott Bennett
and wanting to know which drives had ended up with
permanent errors, I did a "zpool clear testmirror" and ran another scrub.
(da7:umass-sim5:5:0:0): WRITE(10). CDB: 2a 00 3b 20 4d 36 00 00 05 00
(da7:umass-sim5:5:0:0): CAM status: CCB request completed with an error
(da7:umass-sim5:5:0:0): Retrying command
(da7:umass-sim5:5:0:0): WRITE(10). CDB: 2a 00 3b 20 4d 36 00 00 05 00
(da7:umass-sim5:5:0:0): CAM status: CCB request completed with an error
(da7:umass-sim5:5:0:0): Retrying command
How many device errors have you had since booting the system / creating
the zpool ?
Post by Scott Bennett
I don't know how to decipher these error messages (i.e., what do the hex
digits after "CDB: " mean?)
I do not know the specifics in this case, but whenever I have seen
device errors it has always been due to either bad communication with a
drive or a drive reporting an error. If there are ANY device errors you
must address them before you go any further.

As an anecdotal note, I have not had terribly good luck with USB
attached drives under FreeBSD, especially under 9.x. I suspect that the
USB stack just can't keep up and ends up dropping things (or hanging). I
have had better luck with the 10.x release but still do not trust it for
high traffic loads. I have had no issues with SAS or SATA interfaces
(using supported chipsets, I have had very good luck with any of the
Marvell JBOD SATA controllers), _except_ when I was using a SATA port
multiplier. Over on the ZFS list the consensus is that port multipliers
are problematic at best and they should be avoided.
Post by Scott Bennett
When it had finished, another "zpool status
-v" showed these results.
pool: testmirror
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: scrub repaired 1.25M in 18h4m with 1 errors on Tue Sep 16 15:02:56 2014
This time it fixed 1.25MB of data and still had an error (of unknown
size) that it could not fix.
Post by Scott Bennett
NAME STATE READ WRITE CKSUM
testmirror ONLINE 0 0 1
mirror-0 ONLINE 0 0 2
da1p5 ONLINE 0 0 2
da2p5 ONLINE 0 0 2
da5p5 ONLINE 0 0 6
da7p5 ONLINE 0 0 8
Once again you have errors on ALL your devices. This points to a
systemic problem of some sort on your system. On the ZFS list people
have reported bad memory as sometimes being the cause of these errors. I
would look for a system component that is common to all the drives and
controllers. How healthy is your power supply ? How close to its limits
are you ?
Post by Scott Bennett
/backups/testmirror/backups.s2A
So it is not clear to me that either scrub fixed *any* errors at
all.
Why is it not clear? The message from zpool status is very clear:

scan: scrub repaired 1.25M in 18h4m with 1 errors on Tue Sep 16 15:02:56
2014

There were errors that were repaired and an error that was not.
Post by Scott Bennett
I next ran a comparison ("cmp -z -l") of the original against the
copy
If you are comparing the file that ZFS reported was corrupt, then you
should not expect them to match.
Post by Scott Bennett
now on the mirror, which found these differences before cmp(1) was
terminated because the vm_pager got an error while trying to read in a
block from the mirror vdev. (The cpuset stuff was to prevent cmp(1)
from interfering too much with another ongoing, but unrelated, process.)
It sounds like you are really pushing this system to do more than it
reasonably can. In a situation like this you should really not be doing
anything else at the same time given that you are already pushing what
the system can do.
Post by Scott Bennett
Script started on Wed Sep 17 01:37:38 2014
[hellas] 101 % time nice +12 cpuset -l 3,0 cmp -z -l /backups/s2C/save/backups.s2A /backups/testmirror/backups.s2A
This is the file the ZFS told you was corrupt, all bets are off.

<snip>
Post by Scott Bennett
Another issue revealed above is that ZFS, in spite of having *four*
copies of the data and checksums of them, failed to detect any problem
while reading the data back for cmp(1), much less feed cmp(1) the correct
version of the data rather than a corrupted version.
ZFS told you that file was corrupt. You are choosing to try to read it.
ZFS used to not even let you try to access a corrupt file but that
behavior was changed to permit people to try to salvage what they could
instead of write it all off.
Post by Scott Bennett
Similarly, the hard
error (not otherwise logged by the kernel) apparently encountered by
vm_pager resulted in termination of cmp(1) rather than resulting in ZFS
reading the page from one of the other three drives. I don't see how ZFS
is of much help here, so I guess I must have misunderstood the claims for
ZFS that I've read on this list and in the available materials on-line.
I suggest that you are ignoring what ZFS is telling you, specifically
that your system is incapable of reliably writing to and reading from
_any_ of the four drives you are trying to use, that there is a
corrupt file due to this, and here is the name of that corrupt file.

Until you fix the underlying issues with your system, ZFS (or any FS for
that matter) will not be of much use to you.
Post by Scott Bennett
I don't know where to turn next. I will try to call Seagate/Samsung
later today again about the bad Samsung drive and the bad, refurbished
Seagate drive, but they already told me once that having a couple of kB
of errors in a ~1.08 TB file copy does not mean that the drive is bad.
I don't know whether they will consider a hard write error to mean the
drive is bad. The kernel messages shown above are the first ones I've
gotten about any of the drives involved in the copy operation or the
tests described above.
The fact that you have TWO different drives from TWO different vendors
exhibiting the same problem (and to the same degree) makes me think that
the problem is NOT with the drives but elsewhere with your system. I
have started tracking usage and failure statistics for my personal drives
(currently 26 of them, but I have 4 more coming back from Seagate as
warranty replacements). I know that I do not have a statistically
significant sample, but it is what I have to work with. Taking into
account the drives I have as well as the hundreds of drives I managed at
a past client, I have never seen the kind of bad data failures you are
seeing UNLESS I had another underlying problem. Especially when the
problem appears on multiple drives. I suspect that the real odds of
having the same type of bad data failure on TWO drives in this case is
so small that another cause needs to be identified.
Post by Scott Bennett
If anyone reading this has any suggestions for a course of action
here, I'd be most interested in reading them. Thanks in advance for any
ideas and also for any corrections if I've misunderstood what a ZFS
mirror was supposed to have done to preserve the data and maintain
correct operation at the application level.
The system you are trying to use ZFS on may just not be able to handle
the throughput (both memory and disk I/O) generated by ZFS without
breaking. This may NOT just be a question of amount of RAM, but of the
reliability of the motherboard/CPU/RAM/device interfaces when stressed.
In the early days of ZFS it was noticed that ZFS stressed the CPU and
memory systems of a server harder than virtually any other task.
--
Paul Kraus ***@kraus-haus.org
Co-Chair Albacon 2014.5 http://www.albacon.org/2014/
Scott Bennett
2014-09-28 04:28:41 UTC
Thank you for your reply.
On Wed, 24 Sep 2014 06:37:30 -0500 Andrew Berg
Post by Andrew Berg
Post by Scott Bennett
If anyone reading this has any suggestions for a course of action
here, I'd be most interested in reading them. Thanks in advance for any
ideas and also for any corrections if I've misunderstood what a ZFS
mirror was supposed to have done to preserve the data and maintain
correct operation at the application level.
I skimmed over the long message, and my first thought is that you have a messed
up controller that is lying. I've run into such a controller on a hard drive
Yes, this thought has crossed my mind, too. I don't think it explains
all of the evidence well, but OTOH, I can't quite rule it out yet either.
The error rate appears to differ from drive to drive.
Post by Andrew Berg
enclosure that is supposed to support disks larger than 2TB, but seems to write
to who knows where when you want a sector beyond 2TB, and the filesystem layer
has no idea anything is wrong. This is all just an educated guess, but
I ran into that problem last year when I had a 3 TB drive put into a
case with the interface combination that I wanted. Someone on the list clued
me in about old controllers, so we checked, and sure enough, the controller
in that case was unable to handle devices larger than 2 TB. In the current
situation, however, all four drives are 2 TB drives.
Post by Andrew Berg
considering you get errors at a level below ZFS (would it be called the CAM
layer?), my advice would be to check the controllers and perhaps even the disks
themselves. AFAIK, issues at that layer are rarely software ones.
Yes, as noted in earlier threads, I've already seen the problem of
undetected write errors on these drives without ZFS being involved, just
UFS2. I had been planning to set up a gvinum raid5 device until I realized
that protection against loss of a drive would not protect me from corruption
of data without loss. Although running a parity check on the raid5 device
should reveal errors, it would not fix them, whereas the claim was made that
raidzN would fix them.
So I decided to try ZFS in hopes that errors would be both detected
when they occurred (they are not) and corrected upon detection (they appear
not to be, regardless of what ZFS scrub results say).
Meanwhile, Seagate has not seemed willing to replace running drives
that have errors with running drives that have been tested and shown to
have no errors. :-( That leaves me with my money spent on equipment that
does not work properly.
I may have to buy another card and swap it with the one that is in the
tower at present, but I'd rather not do that unless I find better evidence
that the problems come from the card I have now, especially given that much
of the evidence thus far gathered points to the quality of the drives as
the culprit.
I suppose I could reconnect the USB 3.0 drives to USB 2.0 ports and
then repeat my tests, but I'm already kind of fed up with all the delays.
Also, as previously noted, the Western Digital drive is connected via
Firewire 400 and is showing scrub errors as well, albeit comparatively few.
It has been several months now since I last had a place to write backups,
and the lack of recent backups is giving me the heebie jeebies more and
more by the day.


Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet: bennett at sdf.org *xor* bennett at freeshell.org *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good *
* objection to the introduction of that bane of all free governments *
* -- a standing army." *
* -- Gov. John Hancock, New York Journal, 28 January 1790 *
**********************************************************************
Scott Bennett
2014-09-28 10:30:08 UTC
On Wed, 24 Sep 2014 11:24:35 -0400 Paul Kraus <***@kraus-haus.org>
wrote:
Thanks for chiming in, Paul.
Post by Paul Kraus
<snip>
What version of FreeBSD are you running ?
What hardware are you running it on ?
The CPU is a Q6600 running on a PCIE2 Gigabyte motherboard, whose model
number I did have written down around here somewhere but can't lay hands on
at the moment. I looked it up at the gigabyte.com web site, and it supposed
to be okay for a number of considerably faster CPU models. It has 4 GB of
memory, but FreeBSD ignores the last ~1.1 GB of it, so ~2.9 GB usable. It
has been running mprime worker threads on all four cores with no apparent
problems for almost 11 months now.
The USB 3.0 card is a "rocketfish USB 3.0 PCI Express Card", which
reports a NEC uPD720200 USB 3.0 controller. It has two ports, into which
I have two USB 3.0 hubs plugged. There are currently four 2 TB drives
plugged into those hubs.
The Firewire 400 card reports itself as a "VIA Fire II (VT6306)", and
it has two ports on it with one 2 TB drive connected to each port.
Post by Paul Kraus
Post by Scott Bennett
Then I copied the 1.08 TB file again from another Seagate 2 TB drive
to the mirror vdev. No errors were detected during the copy. Then I
began creating a tar file from large parts of a nearly full 1.2 TB file
system (UFS2) on yet another Seagate 2TB on the Firewire 400 bus with the
tar output going to a file in the mirror in order to try to have written
something to most of the sectors on the four-drive mirror. I terminated
tar after the empty space in the mirror got down to about 3% because the
process had slowed to a crawl. (Apparently, space allocation in ZFS
slows down far more than UFS2 when available space gets down to the last
few percent.:-( )
ZFS's space allocation algorithm will have trouble (performance issues)
allocating new blocks long before you get a few percent free. This is
known behavior and the threshold for performance degradation varies with
work load and historical write patterns. My rule of thumb is that you
really do not want to go past 75-80% full, but I have seen reports over
on the ZFS list of issues with very specific write patterns and work
load with as little as 50% used. For your work load, writing very large
files once, I would expect that you can get close to 90% used before
seeing real performance issues.
Thanks for that information. Yeah, I think I saw it starting to slow
when it got into the low 90s% full, but I wasn't watching all of the time,
so I don't know when it first became noticeable. Anyway, I'll keep those
examples in mind for planning purposes.
Post by Paul Kraus
Post by Scott Bennett
Next, I ran a scrub on the mirror and, after the scrub finished, got
the following output from a "zpool status -v".
pool: testmirror
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: scrub repaired 1.38M in 17h59m with 1 errors on Mon Sep 15 19:53:45 2014
The above means that ZFS was able to repair 1.38MB of bad data but still
ran into 1 situation (unknown size) that it could not fix.
But I'm not sure that the repairs actually took. Read on.
Post by Paul Kraus
Post by Scott Bennett
NAME STATE READ WRITE CKSUM
testmirror ONLINE 0 0 1
mirror-0 ONLINE 0 0 2
da1p5 ONLINE 0 0 2
da2p5 ONLINE 0 0 2
da5p5 ONLINE 0 0 8
da7p5 ONLINE 0 0 7
/backups/testmirror/backups.s2A
And here is the file that contains the bad data.
Yes. There are only two files, and that one is the larger one and
was written first.
Post by Paul Kraus
Post by Scott Bennett
Note that the choices of recommended action above do *not* include
replacing a bad drive and having ZFS rebuild its content on the
replacement. Why is that so?
Correct, because for some reason ZFS was not able to read enough of the
data without checksum errors to give you back your data intact.
Okay, laying aside the question of why no drive out of four in a mirror
vdev can provide the correct data, so that's why a rebuild wouldn't work.
Couldn't it at least give a clue about drive(s) to be replaced/repaired?
I.e., the drive(s) and sector number(s)? Otherwise, one would spend a lot
of time reloading data without knowing whether a failure at the same place(s)
would just happen again.
Post by Paul Kraus
Post by Scott Bennett
Thinking, apparently naively, that the scrub had repaired some or
most of the errors
It did, 1.38MB worth. But it also had errors it could not repair.
It *says* it did, but did it really? How does it know? Did it read
the results of its correction back in from the drive(s) to see?
Post by Paul Kraus
Post by Scott Bennett
and wanting to know which drives had ended up with
permanent errors, I did a "zpool clear testmirror" and ran another scrub.
(da7:umass-sim5:5:0:0): WRITE(10). CDB: 2a 00 3b 20 4d 36 00 00 05 00
(da7:umass-sim5:5:0:0): CAM status: CCB request completed with an error
(da7:umass-sim5:5:0:0): Retrying command
(da7:umass-sim5:5:0:0): WRITE(10). CDB: 2a 00 3b 20 4d 36 00 00 05 00
(da7:umass-sim5:5:0:0): CAM status: CCB request completed with an error
(da7:umass-sim5:5:0:0): Retrying command
How many device errors have you had since booting the system / creating
the zpool ?
Those were the first two times that I've gotten kernel error messages
on any of those devices ever. My (non-ZFS) comparison results presented here
some weeks ago showed that the errors found in the comparisons went
undetected by hardware or software while the file was being written to disk.
No errors were detected during readback either, except by the application
(cmp(1)).
Post by Paul Kraus
Post by Scott Bennett
I don't know how to decipher these error messages (i.e., what do the hex
digits after "CDB: " mean?)
I do not know the specifics in this case, but whenever I have seen
device errors it has always been due to either bad communication with a
drive or a drive reporting an error. If there are ANY device errors you
must address them before you go any further.
Yes, but I need to know what those messages actually say when I talk
to manufacturers. For example, if they contain addresses of failed sectors,
then I need to know what those addresses are if the manufacturers want me
to attempt to reassign them. Also, if I tell them there are kernel messages,
they want to look up Micro$lop messages, and when I point out that I don't
use Micro$lop and that I run FreeBSD, they usually try to tell me that "We
don't support that", so it really helps if I can translate the messages to
them. IOW, I can't address the device errors if I don't know how to read
the messages, which I'm hoping someone reading this may be able to help
with. (I wish the FreeBSD Handbook had a comprehensive list of kernel
messages and what they mean.)
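For what it's worth, the hex bytes after "CDB: " are the raw SCSI Command Descriptor Block, and WRITE(10) has a fixed layout: byte 0 is the opcode (0x2a for WRITE(10)), bytes 2-5 are the big-endian starting LBA, and bytes 7-8 the transfer length in blocks. A small decoding sketch (my own helper, not a FreeBSD tool):

```python
def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB as printed by the FreeBSD CAM layer.

    Byte 0 is the opcode (0x2a), bytes 2-5 the big-endian starting LBA,
    bytes 7-8 the transfer length in blocks.
    """
    cdb = [int(b, 16) for b in cdb_hex.split()]
    assert cdb[0] == 0x2A, "not a WRITE(10) CDB"
    lba = (cdb[2] << 24) | (cdb[3] << 16) | (cdb[4] << 8) | cdb[5]
    nblocks = (cdb[7] << 8) | cdb[8]
    return lba, nblocks

# The CDB from the console messages above:
lba, nblocks = decode_write10("2a 00 3b 20 4d 36 00 00 05 00")
print(lba, nblocks)   # 991972662 5
```

So the failing command was a 5-block write at LBA 991972662 of da7, i.e. roughly 473 GiB into the disk if it uses 512-byte sectors; that is the kind of address you can quote to a manufacturer.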
Post by Paul Kraus
As an anecdotal note, I have not had terribly good luck with USB
attached drives under FreeBSD, especially under 9.x. I suspect that the
USB stack just can't keep up and ends up dropping things (or hanging). I
have had better luck with the 10.x release but still do not trust it for
high traffic loads. I have had no issues with SAS or SATA interfaces
Okay. I'll keep that in mind for the future, but for now I'm stuck
with 9.2 until I can get some stable disk space to work with to do the
upgrades to amd64 and then to later releases. The way things have been
going, I may have to relegate at least four 2 TB drives to paperweight
supply and then wait until I can replace them with smaller capacity drives
that will actually work. Also, I have four 2 TB drives in external cases
that have only USB 3.0 interfaces on them, so I have no other way to
connect them (except USB 2.0, of course), so I'm stuck with (some) USB,
too.
Post by Paul Kraus
(using supported chipsets, I have had very good luck with any of the
Marvell JBOD SATA controllers), _except_ when I was using a SATA port
multiplier. Over on the ZFS list the consensus is that port multipliers
are problematic at best and they should be avoided.
What kinds of problems did they mention? Also, how are those Marvell
controllers connected to your system(s)? I'm just wondering whether
I would be able to use any of those models of controllers. I've not dealt
with SATA port multipliers. Would an eSATA card with two ports on it be
classed as a port multiplier?
At the moment, all of my ZFS devices are connected by either USB 3.0
or Firewire 400. I now have an eSATA card with two ports on it that I
plan to install at some point, which will let me move the Firewire 400
drive to eSATA. Should I expect any new problem for that drive after the
change?
Post by Paul Kraus
Post by Scott Bennett
When it had finished, another "zpool status
-v" showed these results.
pool: testmirror
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: scrub repaired 1.25M in 18h4m with 1 errors on Tue Sep 16 15:02:56 2014
This time it fixed 1.25MB of data and still had an error (of unknown
size) that it could not fix.
Unfortunately, ZFS did not report the addresses of any of the errors.
If there are hard errors, how can I find out where the bad sectors are
located on each disk? It might be possible to reassign those sectors to
spares if ZFS can tell me their addresses.
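As an aside, SMART can often answer the bad-sector question directly, independently of ZFS. A sketch using sysutils/smartmontools (the device name is an example):

```shell
# Show the drive's own error counters; a non-zero Current_Pending_Sector
# value means the drive already knows about unreadable sectors.
smartctl -A /dev/da5

# Run a long surface self-test; when it completes, the self-test log
# reports the LBA of the first failure -- the address a vendor would
# need for a reassignment attempt.
smartctl -t long /dev/da5
smartctl -l selftest /dev/da5
```

One caveat: many USB enclosures need `-d sat` for SMART passthrough, and some refuse it entirely, so this may not work on all of the external cases described above.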
Post by Paul Kraus
Post by Scott Bennett
NAME STATE READ WRITE CKSUM
testmirror ONLINE 0 0 1
mirror-0 ONLINE 0 0 2
da1p5 ONLINE 0 0 2
da2p5 ONLINE 0 0 2
da5p5 ONLINE 0 0 6
da7p5 ONLINE 0 0 8
Once again you have errors on ALL your devices. This points to a
Why did it not fix them during the first scrub?
Post by Paul Kraus
systemic problem of some sort on your system. On the ZFS list people
have reported bad memory as sometimes being the cause of these errors. I
would look for a system component that is common to all the drives and
controllers. How healthy is your power supply? How close to its limits
are you?
The box is mostly empty and nowhere near the power supply's capacity.
The box and all other attached devices are plugged into a SmartUPS 1000,
which is plugged into a surge protector. As for the health of the power
supply, I guess I don't know how to check that, but everything else that
depends upon the power supply seems to be working fine.
Post by Paul Kraus
Post by Scott Bennett
/backups/testmirror/backups.s2A
So it is not clear to me that either scrub fixed *any* errors at
all.
It is not clear because I got similar results reported after two
consecutive scrubs with no changes made by me in between them. If it
really fixed all but one error during the first scrub, why are there
still more errors for the second one to correct? Unless ZFS checks the
results of its corrections, how can it know whether it really succeeded
in fixing anything?
Post by Paul Kraus
scan: scrub repaired 1.25M in 18h4m with 1 errors on Tue Sep 16 15:02:56
2014
There were errors that were repaired and an error that was not.
Yes, like it said the first time.
Post by Paul Kraus
Post by Scott Bennett
I next ran a comparison ("cmp -z -l") of the original against the
copy
If you are comparing the file that ZFS reported was corrupt, then you
should not expect them to match.
Well, you omitted the results of the comparison that I ran after
the second scrub had completed. As I wrote before,
+ Another issue revealed above is that ZFS, in spite of having *four*
+ copies of the data and checksums of them, failed to detect any problem
+ while reading the data back for cmp(1), much less feed cmp(1) the correct
+ version of the data rather than a corrupted version. Similarly, the hard
+ error (not otherwise logged by the kernel) apparently encountered by
+ vm_pager resulted in termination of cmp(1) rather than resulting in ZFS
+ reading the page from one of the other three drives. I don't see how ZFS
+ is of much help here, so I guess I must have misunderstood the claims for
+ ZFS that I've read on this list and in the available materials on-line.

Now consider: there were 10 errors, widely separated in that file
(before the kernel-detected error that killed cmp(1)), which went undetected
by hardware, drivers, or ZFS when the file was written, and which again went
undetected by ZFS when it was read back in, in spite of all the data copies
and checksum copies. On top of that, two consecutive scrubs, with no changes
made to the data between them, yielded nearly identical numbers of errors
"fixed". I am therefore skeptical of the claims for ZFS's "self-healing"
ability. With access
to four copies of each of those 10 data blocks and four copies of each of the
checksums, why did ZFS detect nothing while reading the file? And in the
case of the uncorrectable error, why could ZFS not find any copy of the data
block that matched any copy of the checksum or even two matching copies of
the data block, so that it could provide the correct data to the application
program anyway, while logging the sector(s) involved in the block containing
the uncorrectable error? If it can't do any of those things, why have
redundancy?
Post by Paul Kraus
Post by Scott Bennett
[stuff deleted --SB]
It sounds like you are really pushing this system to do more than it
reasonably can. In a situation like this you should really not be doing
anything else at the same time given that you are already pushing what
the system can do.
It seems to me that the only places that could fail to keep up would
be the motherboard's chip(set) or one of the controller cards. The
motherboard controller knows the speed of the memory, so it will only
cycle the memory at that speed. The CPU, of course, should be at a lower
priority for bus cycles, so it would just use whatever were left over. There
is no overclocking involved, so that is not an issue here. The machine goes
as fast as it goes and no faster. If it takes longer for it to complete a
task, then that's how long it takes. I don't see that "pushing this system
to do more than it reasonably can" is even possible for me to do. It does
what it does, and it does it when it gets to it. Would I like it to do
things faster? Of course, I would, but what I want does not change physics.
I'm not getting any machine check or overrun messages, either.
Further, because one of the drives is limited to 50 MB/s (Firewire 400)
transfer rates, ZFS really can't go any faster than that drive. Most of the
time, a systat vmstat display during the scrubs showed the MB/s actually
transferred for all four drives as being about the same (~23 - ~35 MB/s).
The scrubs took from 5% to 25% of one core's time, and associated
kernel functions took from 0% to ~9% (combined) from other cores. cmp(1)
took 25% - 35% of one core with associated kernel functions taking 5% - 15%
(combined) from other cores. I used cpuset(1) to keep cmp(1) from bothering
the mprime thread I cared about the most. (Note that mprime runs niced
to 18, so its threads should not slow any of the testing I was doing.) It
really doesn't look to me like an overload situation, but I can try moving
the three USB 3.0 drives to USB 2.0 to slow things down even further. That
leaves still unexplained ZFS's failure to make use of multiple copies for
error correction during the reading of a file or to fix in one scrub
everything that was fixable.
Post by Paul Kraus
Post by Scott Bennett
Script started on Wed Sep 17 01:37:38 2014
[hellas] 101 % time nice +12 cpuset -l 3,0 cmp -z -l /backups/s2C/save/backups.s2A /backups/testmirror/backups.s2A
This is the file the ZFS told you was corrupt, all bets are off.
There should be only one bad block because the scrubs fixed everything
else, right? And that bad block is bad on all four drives, right?
Post by Paul Kraus
<snip>
Post by Scott Bennett
[point made again elsewhere deleted --SB]
ZFS told you that file was corrupt. You are choosing to try to read it.
ZFS used to not even let you try to access a corrupt file but that
behavior was changed to permit people to try to salvage what they could
instead of write it all off.
See what I wrote above. It was a *four-way mirror*, not an unreplicated
pool. It strikes me as extremely unlikely that the *same* blocks would be
damaged on *all four* drives. ZFS ought to be able to identify and provide
the correct version from one or more blocks and checksums.
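That intuition survives a back-of-envelope check; here is a sketch with an assumed independent per-sector corruption rate (the 1e-6 figure is illustrative, not measured):

```shell
# Assumed, illustrative numbers: if each drive independently corrupts a
# given sector with probability p = 1e-6, the chance that the *same*
# sector is bad on all four mirror copies is p^4.
awk 'BEGIN { p = 1e-6; printf "%.3g\n", p ^ 4 }'   # prints 1e-24
```

Anything in the 1e-24 neighborhood is effectively impossible, which is why identical damage on all four copies suggests a shared component (controller, bus, RAM) rather than four coincidentally bad disks.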
Post by Paul Kraus
Post by Scott Bennett
[text included in a quotation above deleted here --SB]
I suggest that you are ignoring what ZFS is telling you, specifically
that your system is incapable of reliably writing to and reading from
_any_ of the four drives you are trying to use, that there is a corrupt
file as a result, and here is the name of that corrupt file. Until you
fix the underlying issues with your system, ZFS (or any FS for that
matter) will not be of much use to you.
It looks to me as nearly certain that the underlying issue is poor
manufacturing standards for 2 TB drives.
Post by Paul Kraus
Post by Scott Bennett
I don't know where to turn next. I will try to call Seagate/Samsung
later today again about the bad Samsung drive and the bad, refurbished
Seagate drive, but they already told me once that having a couple of kB
of errors in a ~1.08 TB file copy does not mean that the drive is bad.
I don't know whether they will consider a hard write error to mean the
drive is bad. The kernel messages shown above are the first ones I've
gotten about any of the drives involved in the copy operation or the
tests described above.
The fact that you have TWO different drives from TWO different vendors
exhibiting the same problem (and to the same degree) makes me think that
the problem is NOT with the drives but elsewhere with your system. I
have started tracking usage and failure statistics for my personal drives
(currently 26 of them, but I have 4 more coming back from Seagate as
Whooweee! That's a heap of drives! IIRC, for a chi^2 distribution,
30 isn't bad for a sample size. How many of those drives are of larger
capacity than 1 TB?
Post by Paul Kraus
warranty replacements). I know that I do not have a statistically
significant sample, but it is what I have to work with. Taking into
account the drive I have as well as the hundreds of drives I managed at
a past client, I have never seen the kind of bad data failures you are
seeing UNLESS I had another underlying problem. Especially when the
problem appears on multiple drives. I suspect that the real odds of
having the same type of bad data failure on TWO drives in this case is
so small that another cause needs to be identified.
Recall that I had two 2 TB drives that failed this year at, IIRC,
11 and 13 months since purchase, which is why two of the drives I was
testing in the mirror were refurbished drives (supplied under warranty).
One of those drives was showing hard errors on many sectors for a while
before it failed completely. Having two drives that are bad doesn't
seem so unlikely, although having two drives (much less four!) with an
identical scattering of bad sectors on each is rather a stretch.
You are referring to the two drives that showed two checksum errors
each in both post-scrub status reports? Yes, those were from two
manufacturers, but the two drives with the greatest numbers of checksum
errors, along with one of the drives showing only 2 checksum errors, were
all from one manufacturer, who claims that such occurrences are "normal"
for those drives because they have no parity checking or parity recording.
That manufacturer does not suggest that there is anything wrong with the
system to which they are attached. (By that, I mean that the guy at that
manufacturer who spoke with me on the phone made those claims.)
Post by Paul Kraus
Post by Scott Bennett
If anyone reading this has any suggestions for a course of action
here, I'd be most interested in reading them. Thanks in advance for any
ideas and also for any corrections if I've misunderstood what a ZFS
mirror was supposed to have done to preserve the data and maintain
correct operation at the application level.
The system you are trying to use ZFS on may just not be able to handle
the throughput (both memory and disk I/O) generated by ZFS without
breaking. This may NOT just be a question of amount of RAM, but of the
reliability of the motherboard/CPU/RAM/device interfaces when stressed.
I did do a fair amount of testing with mprime last year and found no
problems. I monitor CPU temperatures frequently, especially when I'm
running a test like the ones I've been doing, and the temperatures have
remained reasonable throughout. (My air-conditioning bill has not been
similarly reasonable, I'm sorry to say.)
That having been said, though, between your remarks and Andrew Berg's,
there does seem cause to run another scrub, perhaps two, with those three
drives connected via USB 2.0 instead of USB 3.0 to see what happens when
everything is slowed down drastically. I'll give that a try when I find
time. That won't address the ZFS-related questions or the differences
in error rates on different drives, but might reveal an underlying system
hardware issue.
Maybe a PCIE2 board is too slow for USB 3.0, although the motherboard
controller, BIOS, USB 3.0 controller, and kernel all declined to complain.
If it is, then the eSATA card I bought (SATA II) would likely be useless
as well. :-<
Post by Paul Kraus
In the early days of ZFS it was noticed that ZFS stressed the CPU and
memory systems of a server harder than virtually any other task.
When would that have been, please? (I don't know much ZFS history.)
I believe this machine dates to 2006 or more likely 2007, although the
USB 3.0 card was new last year. The VIA Firewire card was installed at
the same time as the USB 3.0 card, but it was not new at that time.


Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet: bennett at sdf.org *xor* bennett at freeshell.org *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good *
* objection to the introduction of that bane of all free governments *
* -- a standing army." *
* -- Gov. John Hancock, New York Journal, 28 January 1790 *
**********************************************************************
Warren Block
2014-09-28 14:45:35 UTC
Post by Scott Bennett
Thanks for chiming in, Paul.
Post by Paul Kraus
<snip>
What version of FreeBSD are you running ?
What hardware are you running it on ?
The CPU is a Q6600 running on a PCIE2 Gigabyte motherboard, whose model
number I did have written down around here somewhere but can't lay hands on
at the moment.
sysutils/dmidecode will show that without having to open the case.
Scott Bennett
2014-09-29 00:29:00 UTC
Post by Warren Block
Post by Scott Bennett
Thanks for chiming in, Paul.
Post by Paul Kraus
<snip>
What version of FreeBSD are you running ?
What hardware are you running it on ?
The CPU is a Q6600 running on a PCIE2 Gigabyte motherboard, whose model
number I did have written down around here somewhere but can't lay hands on
at the moment.
sysutils/dmidecode will show that without having to open the case.
I had no idea, so thanks for that. I'll have to install it once I have
stable space for a ports tree again in case it also shows other good stuff.
In any case, I was pretty sure I had seen it somewhere else, and it turns
out that the motherboard manufacturer and model number show up in a couple
of places in the output of kenv(1). :-)
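Both lookups amount to one-liners. The smbios.* variable names below come from the loader and should be present on most SMBIOS-capable boards, though that is an assumption worth checking:

```shell
# Motherboard maker and model as recorded by the loader from SMBIOS;
# no ports tree needed.
kenv smbios.planar.maker
kenv smbios.planar.product

# The same information (and much more) via the port, once installed.
dmidecode -t baseboard
```
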
BTW, your message showed up without a Date: header, so I used the date
and timestamp from the From line of the envelope. Must be something weird
about your mail interface.


Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet: bennett at sdf.org *xor* bennett at freeshell.org *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good *
* objection to the introduction of that bane of all free governments *
* -- a standing army." *
* -- Gov. John Hancock, New York Journal, 28 January 1790 *
**********************************************************************
Paul Kraus
2014-09-30 02:29:51 UTC
Post by Scott Bennett
Thanks for chiming in, Paul.
Post by Paul Kraus
<snip>
What version of FreeBSD are you running ?
I asked this specifically because I have seen lots of issues with hard
drives connected via USB on 9.x. In some cases the system hangs
(silently, with no evidence after a hard reboot), in some cases just
flaky I/O to the drives that caused performance issues. And _none_ of
the attached drives were over 1TB. There were three different 1TB drives
(one IOmega and 2 Seagate) and 1 500GB (Seagate drive in a Gigaware
enclosure). These were on three different systems (two SuperMicro dual
Quad-Xeon CPU and one HP MicroProliant N36L). These were all USB2, I
would expect more and weirder problems with USB3 as it (tries to) go
much faster.

Skipping lots ...
Post by Scott Bennett
Okay, laying aside the question of why no drive out of four in a mirror
vdev can provide the correct data, so that's why a rebuild wouldn't work.
Couldn't it at least give a clue about drive(s) to be replaced/repaired?
I.e., the drive(s) and sector number(s)? Otherwise, one would spend a lot
of time reloading data without knowing whether a failure at the same place(s)
would just happen again.
You can probably dig that out of the zpool using zdb, but I am no zdb
expert and refer you to the experts on the ZFS list (find out how to
subscribe at the bottom of this list
http://wiki.illumos.org/display/illumos/illumos+Mailing+Lists ).
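For the archives, a rough sketch of the zdb approach (the dataset and object-number placeholders are hypothetical, and zdb's options and output vary between releases, so treat this as a starting point, not a recipe):

```shell
# Re-verify every checksum in the pool without repairing anything;
# -cc walks all copies of each block.
zdb -cc testmirror

# Dump a file's dnode, including its block pointers; each DVA in the
# output encodes the vdev number and byte offset of one copy.
# <object-number> comes from a prior "zdb -dd testmirror/<dataset>" listing.
zdb -ddddd testmirror/<dataset> <object-number>
```
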

<snip>
Post by Scott Bennett
Post by Paul Kraus
As an anecdotal note, I have not had terribly good luck with USB
attached drives under FreeBSD, especially under 9.x. I suspect that the
USB stack just can't keep up and ends up dropping things (or hanging). I
have had better luck with the 10.x release but still do not trust it for
high traffic loads. I have had no issues with SAS or SATA interfaces
Okay. I'll keep that in mind for the future, but for now I'm stuck
with 9.2 until I can get some stable disk space to work with to do the
upgrades to amd64 and then to later releases. The way things have been
going, I may have to relegate at least four 2 TB drives to paperweight
supply and then wait until I can replace them with smaller capacity drives
that will actually work. Also, I have four 2 TB drives in external cases
that have only USB 3.0 interfaces on them, so I have no other way to
connect them (except USB 2.0, of course), so I'm stuck with (some) USB,
too.
While I have had nothing but trouble every single time I tried using a
USB attached drive for more than a few MB at a time under 9.x, I have
had no problems with Marvell based JBOD SATA cards under 9.x or 10.0.
Post by Scott Bennett
Post by Paul Kraus
(using supported chipsets, I have had very good luck with any of the
Marvell JBOD SATA controllers), _except_ when I was using a SATA port
multiplier. Over on the ZFS list the consensus is that port multipliers
are problematic at best and they should be avoided.
What kinds of problems did they mention? Also, how are those Marvell
controllers connected to your system(s)? I'm just wondering whether
I would be able to use any of those models of controllers. I've not dealt
with SATA port multipliers. Would an eSATA card with two ports on it be
classed as a port multiplier?
I do not recall specifics, but I do recall a variety of issues, mostly
around SATA bus resets. The problem _I_ had was that if one of the four
drives in the enclosure (behind the port multiplier) failed it knocked
all four off-line.

The cards were PCIE 1X and PCIE 2X. All of the Marvell cards I have seen
have been one logical port per physical port. The chipsets in the add-on
cards seem to be in sets of 4 ports (although the on-board chipsets seem
to be sets of 6). I currently have one 4 port card (2 internal, 2
external) and one 8 port card (4 internal, 4 external) with no problems.
They are in an HP MicroProliant N54L with 16 GB RAM.

Here is the series of cards that I have been using:
http://www.sybausa.com/productList.php?cid=142&currentPage=0
Specifically the SI-PEX40072 and SI-PEX40065, stay away from the RAID
versions and just go for the JBOD. The Marvell JBOD chips were
recommended over on the ZFS list.
Post by Scott Bennett
At the moment, all of my ZFS devices are connected by either USB 3.0
or Firewire 400.
Are the USB drives directly attached or via hubs ? The hubs may be
introducing more errors (I have not had good luck finding USB hubs that
are reliable in transferring data ... on my Mac, I have never tried
using external hubs on my servers).
Post by Scott Bennett
I now have an eSATA card with two ports on it that I
plan to install at some point, which will let me move the Firewire 400
drive to eSATA. Should I expect any new problem for that drive after the
change?
I would expect a decrease in error counts for the eSATA attached drive.

I have been running LOTS of scrubs against my ~1.6TB of data recently,
both on a 2-way 2-column mirror of 2TB drives or the current config of
2-way 3-column mirror of 1TB drives. I have seen no errors of any kind.

[***@FreeBSD2 ~]$ zpool status
pool: KrausHaus
state: ONLINE
scan: scrub repaired 0 in 1h49m with 0 errors on Fri Sep 26 13:51:21 2014
config:

NAME STATE READ WRITE CKSUM
KrausHaus ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
diskid/DISK-Seagate ES.3 ONLINE 0 0 0
diskid/DISK-Seagate ES.2 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
diskid/DISK-WD-SE ONLINE 0 0 0
diskid/DISK-HGST UltraStar ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
diskid/DISK-WD-SE ONLINE 0 0 0
diskid/DISK-Seagate ES.2 ONLINE 0 0 0
spares
diskid/DISK-Seagate ES.3 AVAIL

errors: No known data errors

Note: Disk serial numbers replaced with type of drive. All are 1TB.

<snip>
Post by Scott Bennett
Post by Paul Kraus
It sounds like you are really pushing this system to do more than it
reasonably can. In a situation like this you should really not be doing
anything else at the same time given that you are already pushing what
the system can do.
It seems to me that the only places that could fail to keep up would
be the motherboard's chip(set) or one of the controller cards. The
motherboard controller knows the speed of the memory, so it will only
cycle the memory at that speed. The CPU, of course, should be at a lower
priority for bus cycles, so it would just use whatever were left over. There
is no overclocking involved, so that is not an issue here. The machine goes
as fast as it goes and no faster. If it takes longer for it to complete a
task, then that's how long it takes. I don't see that "pushing this system
to do more than it reasonably can" is even possible for me to do. It does
what it does, and it does it when it gets to it. Would I like it to do
things faster? Of course, I would, but what I want does not change physics.
I'm not getting any machine check or overrun messages, either.
So you deny that race conditions can exist in a system as complex as a
modern computer running a modern OS ? The OS is an integral part of all
this, including all the myriad device drivers. And with multiple CPUs
the problem may be even worse.
Post by Scott Bennett
Further, because one of the drives is limited to 50 MB/s (Firewire 400)
transfer rates, ZFS really can't go any faster than that drive. Most of the
time, a systat vmstat display during the scrubs showed the MB/s actually
transferred for all four drives as being about the same (~23 - ~35 MB/s).
What does `iostat -x -w 1` show ? How many drives are at 100 %b ? How
many drives have a qlen of 10 ? For how many samples in a row ? That is
the limit of what ZFS will dispatch, once there are 10 outstanding I/O
requests for a given device, ZFS does not dispatch more I/O requests
until the qlen drops below 10. This is tunable (look through sysctl -a |
grep vfs.zfs). On my system with the port multiplier I had to tune this
down to 4 (found empirically) or I would see underlying SATA device
errors and retries.

I find it useful to look at 1 second as well as 10 seconds samples (to
see both peak load on the drives as well as more average).

Here is my system with a scrub running on the above zpool and 10 second
sample time (iostat -x -w 10):

extended device statistics
device r/s w/s kr/s kw/s qlen svc_t %b
ada0 783.8 3.0 97834.2 15.8 10 10.4 84
ada1 792.8 3.0 98649.5 15.8 4 3.7 49
ada2 789.9 3.0 98457.0 15.8 4 3.6 47
ada3 0.1 13.1 0.0 59.0 0 6.1 0
ada4 0.8 13.1 0.4 59.0 0 5.8 0
ada5 794.0 3.0 98703.7 15.8 0 4.1 62
ada6 785.9 3.0 98158.3 15.8 10 11.2 98
ada7 0.0 0.0 0.0 0.0 0 0.0 0
ada8 791.4 3.0 98458.2 15.8 0 3.0 53

In the above, ada0 and ada6 have hit their outstanding I/O limit (in
zfs), both are slower than the others, with both longer service time
(svc_t) and % busy (%b). These are the oldest drives in the zpool,
Seagate ES.2 series and are 5 years old (and just out of warranty). So
it is not surprising that they are the slowest. They are the limiting
factor on how fast the scrub can progress.
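The tuning described above can be sketched as follows. The sysctl name varies by release — vfs.zfs.vdev.max_pending is the 9.x-era name, which is an assumption to verify with the grep — and 4 is simply the value found empirically in the anecdote above:

```shell
# Watch per-device queue length (qlen), service time (svc_t), and
# %busy at both peak (1 s) and averaged (10 s) sample intervals.
iostat -x -w 1
iostat -x -w 10

# Find the vdev queue-depth tunable on this release ...
sysctl -a | grep vfs.zfs.vdev

# ... and cap outstanding I/Os per vdev if the bus misbehaves under load.
sysctl vfs.zfs.vdev.max_pending=4
```
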
Post by Scott Bennett
The scrubs took from 5% to 25% of one core's time,
Because they are limited by the I/O stack between the kernel and the device.
Post by Scott Bennett
and associated
kernel functions took from 0% to ~9% (combined) from other cores. cmp(1)
took 25% - 35% of one core with associated kernel functions taking 5% - 15%
(combined) from other cores. I used cpuset(1) to keep cmp(1) from bothering
the mprime thread I cared about the most. (Note that mprime runs niced
to 18, so its threads should not slow any of the testing I was doing.) It
really doesn't look to me like an overload situation, but I can try moving
the three USB 3.0 drives to USB 2.0 to slow things down even further.
Do you have a way to look at errors directly on the USB bus ?
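One way to look, assuming a FreeBSD release new enough to have usbdump(8) (8.2 and later; the bus number below is an example):

```shell
# List attached devices and which usbusN each one sits on.
usbconfig

# Capture live traffic on the bus carrying the drives; -i names the
# bus interface, -s caps how many bytes of each frame are kept, and
# -v prints the decoded transfers (including error statuses).
usbdump -i usbus2 -s 256 -v
```
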
Post by Scott Bennett
That
leaves still unexplained ZFS's failure to make use of multiple copies for
error correction during the reading of a file or to fix in one scrub
everything that was fixable.
Post by Paul Kraus
Post by Scott Bennett
Script started on Wed Sep 17 01:37:38 2014
[hellas] 101 % time nice +12 cpuset -l 3,0 cmp -z -l /backups/s2C/save/backups.s2A /backups/testmirror/backups.s2A
This is the file the ZFS told you was corrupt, all bets are off.
There should be only one bad block because the scrubs fixed everything
else, right?
Not necessarily, I have not looked at the ZFS code (feel free to, it is
all open source), so I do not know for certain whether it gives up on a
file once it finds corruption.
Post by Scott Bennett
And that bad block is bad on all four drives, right?
Or the I/O to all four drives was interrupted at the same TIME ... I
have seen that before when it was a device driver stack that was having
trouble (which is what I suspect here).

<snip>
Post by Scott Bennett
Post by Paul Kraus
The fact that you have TWO different drives from TWO different vendors
exhibiting the same problem (and to the same degree) makes me think that
the problem is NOT with the drives but elsewhere with your system. I
have started tracking usage and failure statistics for my personal drives
(currently 26 of them, but I have 4 more coming back from Seagate as
Whooweee! That's a heap of drives! IIRC, for a chi^2 distribution,
30 isn't bad for a sample size. How many of those drives are of larger
capacity than 1 TB?
Not really, I used to manage hundreds of drives. When I have 2 out of 4
Seagate ES.2 1TB drives and 1 out of 2 HGST UltraStar 1TB drives fail
under warranty I am still not willing to say that overall both Seagate
and HGST have a 50% failure rate ... specifically because I do not
consider 4 (or worse 2) drives a statistically significant sample :-)

In terms of drive sizes, a little over 50% are 1TB or over (not counting
the 4 Seagate 1TB warranty replacement drives that arrived today). Of
the 11 1TB drives in the sample (not counting the ones that arrived
today), 3 have failed under warranty (so far). Of the 4 2TB drives in the
sample set, none have failed yet, but they are all less than 1 year old.

<snip>
Post by Scott Bennett
Post by Paul Kraus
The system you are trying to use ZFS on may just not be able to handle
the throughput (both memory and disk I/O) generated by ZFS without
breaking. This may NOT just be a question of amount of RAM, but of the
reliability of the motherboard/CPU/RAM/device interfaces when stressed.
I did do a fair amount of testing with mprime last year and found no
problems.
From the brief research I did, it looks like mprime is a computational
program and will test only limited portions of a system (CPU and RAM
mostly).
Post by Scott Bennett
I monitor CPU temperatures frequently, especially when I'm
running a test like the ones I've been doing, and the temperatures have
remained reasonable throughout. (My air-conditioning bill has not been
similarly reasonable, I'm sorry to say.)
That having been said, though, between your remarks and Andrew Berg's,
there does seem cause to run another scrub, perhaps two, with those three
drives connected via USB 2.0 instead of USB 3.0 to see what happens when
everything is slowed down drastically. I'll give that a try when I find
time. That won't address the ZFS-related questions or the differences
in error rates on different drives, but might reveal an underlying system
hardware issue.
Maybe a PCIE2 board is too slow for USB 3.0, although the motherboard
controller, BIOS, USB 3.0 controller, and kernel all declined to complain.
If it is, then the eSATA card I bought (SATA II) would likely be useless
as well. :-<
Post by Paul Kraus
In the early days of ZFS it was noticed that ZFS stressed the CPU and
memory systems of a server harder than virtually any other task.
When would that have been, please? (I don't know much ZFS history.)
I believe this machine dates to 2006 or more likely 2007, although the
USB 3.0 card was new last year. The VIA Firewire card was installed at
the same time as the USB 3.0 card, but it was not new at that time.
That would have been the 2005-2007 timeframe. A Sun SF-V240 could be
brought to its knees by a large ZFS copy operation. Both CPUs would peg
and memory bandwidth would all be consumed by the I/O operations. The
Sun T-2000 was much better as it had (effectively) 32 logical CPUs (8
cores each with 4 execution threads) and ZFS really likes multiprocessor
environments.
--
Paul Kraus ***@kraus-haus.org
Co-Chair Albacon 2014.5 http://www.albacon.org/2014/
Scott Bennett
2014-09-28 12:06:43 UTC
Post by Scott Bennett
Post by Paul Kraus
What hardware are you running it on ?
The CPU is a Q6600 running on a PCIE2 Gigabyte motherboard, whose model
number I did have written down around here somewhere but can't lay hands on
at the moment. I looked it up at the gigabyte.com web site, and it is supposed
to be okay for a number of considerably faster CPU models. It has 4 GB of
memory, but FreeBSD ignores the last ~1.1 GB of it, so ~2.9 GB usable. It
has been running mprime worker threads on all four cores with no apparent
problems for almost 11 months now.
Here's the motherboard model info.

Gigabyte Technology Co., Ltd.
G41M-ES2L


Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet: bennett at sdf.org *xor* bennett at freeshell.org *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good *
* objection to the introduction of that bane of all free governments *
* -- a standing army." *
* -- Gov. John Hancock, New York Journal, 28 January 1790 *
**********************************************************************