Debian Bug report logs - #658701
mdadm: should send email if mismatches are reported by a check

Package: mdadm; Maintainer for mdadm is Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>; Source for mdadm is src:mdadm.

Reported by: Russell Coker <russell@coker.com.au>

Date: Sun, 5 Feb 2012 12:36:52 UTC

Severity: important

Found in version mdadm/3.2.3-2


Report forwarded to debian-bugs-dist@lists.debian.org, Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>:
Bug#658701; Package mdadm. (Sun, 05 Feb 2012 12:36:55 GMT)

Acknowledgement sent to Russell Coker <russell@coker.com.au>:
New Bug report received and forwarded. Copy sent to Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>. (Sun, 05 Feb 2012 12:36:58 GMT)

Message #5 received at submit@bugs.debian.org:

From: Russell Coker <russell@coker.com.au>
To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: mdadm: should send email if mismatches are reported by a check
Date: Sun, 05 Feb 2012 23:34:50 +1100
Package: mdadm
Version: 3.2.3-2
Severity: important

Feb  5 22:55:09 xev mdadm[20730]: RebuildFinished event detected on md device /dev/md0, component device  mismatches found: 20608 (on raid level 1)

When a check initiated by /etc/cron.d/mdadm finds an error, mdadm will discover
this and log an error such as the above with facility DAEMON.  But it doesn't
send an email.

I believe that this is a serious bug.  It seems to me that one of the most
significant conditions it can encounter, and one that should be immediately
reported to the sysadmin, is that the contents of disks are changing and
breaking RAID consistency!

For a 3-disk mirror or a RAID-6 such an error can be reliably corrected as long
as all the other disks are fine.  If you have an array with double-redundancy
and one disk fails entirely while another returns dodgy data then you lose,
and obviously anyone who creates a doubly-redundant array wants protection
against that sort of thing.

With a RAID-1 or RAID-5 array every mismatch is an indication of real data
corruption and is very important.
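
The difference between single and double redundancy described above can be
illustrated with a small C sketch.  It is only an illustration under stated
assumptions: the function name pick_majority and the block size are
hypothetical, and nothing here is taken from md or mdadm, which may not
resolve mismatches this way at all.  With three mirror copies a single odd
one out can be outvoted; with two copies a mismatch can be detected, but
nothing says which copy is wrong.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical block size used only for this illustration. */
#define BLOCK_SIZE 8

/*
 * Compare n copies of the same block.
 * Returns the index of a copy that agrees with a strict majority of the
 * copies, or -1 if no strict majority exists (for example, two copies
 * that differ from each other).
 */
int pick_majority(const unsigned char copies[][BLOCK_SIZE], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t votes = 0;

        for (size_t j = 0; j < n; j++) {
            if (memcmp(copies[i], copies[j], BLOCK_SIZE) == 0)
                votes++;
        }
        /* votes includes the copy itself */
        if (votes * 2 > n)
            return (int)i;
    }
    return -1;
}
```

For a two-way mirror whose copies differ, the function returns -1, which is
the situation the paragraph above describes for RAID-1.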

The following patch makes mdadm send email about such events.

--- /tmp/Monitor.c	2012-02-05 23:28:41.873079816 +1100
+++ ./Monitor.c	2012-02-05 23:32:03.961132380 +1100
@@ -364,6 +364,7 @@
 	    (strncmp(event, "Fail", 4)==0 ||
 	     strncmp(event, "Test", 4)==0 ||
 	     strncmp(event, "Spares", 6)==0 ||
+	     (strncmp(event, "RebuildFinished", 15)==0 && disc) ||
 	     strncmp(event, "Degrade", 7)==0)) {
 		FILE *mp = popen(Sendmail, "w");
 		if (mp) {
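
Read in isolation, the condition that the patch extends can be sketched as a
standalone predicate.  This is a minimal sketch, not code from mdadm: the
function name event_wants_mail is hypothetical, only the part of the
condition visible in the diff is reproduced, and it assumes (going by the
log line above) that disc is non-NULL for a RebuildFinished event when mdadm
has a mismatch message to pass along.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper mirroring the patched condition: returns 1 when the
 * event would trigger an email, 0 otherwise.  `disc` is the optional extra
 * value reported with an event and may be NULL.
 */
int event_wants_mail(const char *event, const char *disc)
{
    return strncmp(event, "Fail", 4) == 0 ||
           strncmp(event, "Test", 4) == 0 ||
           strncmp(event, "Spares", 6) == 0 ||
           /* Added by the patch: only when extra detail accompanies the event. */
           (strncmp(event, "RebuildFinished", 15) == 0 && disc != NULL) ||
           strncmp(event, "Degrade", 7) == 0;
}
```

Under that assumption, a RebuildFinished event without any extra detail
still produces no email, while one carrying a mismatch message does.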

-- System Information:
Debian Release: wheezy/sid
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: amd64 (x86_64)

Kernel: Linux 3.2.0-1-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_AU.UTF-8, LC_CTYPE=en_AU.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages mdadm depends on:
ii  debconf      1.5.41
ii  initscripts  2.88dsf-22
ii  libc6        2.13-25
ii  lsb-base     3.2-28.1
ii  udev         175-3

Versions of packages mdadm recommends:
ii  module-init-tools               3.16-1
ii  postfix [mail-transport-agent]  2.8.7-1

mdadm suggests no packages.

-- debconf information:
  mdadm/initrdstart_msg_errexist:
  mdadm/initrdstart_msg_intro:
* mdadm/autostart: false
  mdadm/autocheck: true
  mdadm/initrdstart_msg_errblock:
  mdadm/mail_to: root
  mdadm/initrdstart_msg_errmd:
* mdadm/initrdstart: none
  mdadm/initrdstart_msg_errconf:
  mdadm/initrdstart_notinconf: false
  mdadm/start_daemon: true




Information forwarded to debian-bugs-dist@lists.debian.org, Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>:
Bug#658701; Package mdadm. (Sun, 05 Feb 2012 14:24:21 GMT)

Acknowledgement sent to Michael Tokarev <mjt@tls.msk.ru>:
Extra info received and forwarded to list. Copy sent to Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>. (Sun, 05 Feb 2012 14:24:21 GMT)

Message #10 received at 658701@bugs.debian.org:

From: Michael Tokarev <mjt@tls.msk.ru>
To: Russell Coker <russell@coker.com.au>, 658701@bugs.debian.org
Subject: Re: Bug#658701: mdadm: should send email if mismatches are reported by a check
Date: Sun, 05 Feb 2012 18:22:37 +0400
On 05.02.2012 16:34, Russell Coker wrote:
> Package: mdadm
> Version: 3.2.3-2
> Severity: important
> 
> Feb  5 22:55:09 xev mdadm[20730]: RebuildFinished event detected on md device /dev/md0, component device  mismatches found: 20608 (on raid level 1)
> 
> When a check initiated by /etc/cron.d/mdadm finds an error mdadm will discover
> this and log an error such as the above with facility DAEMON.  But it doesn't
> send an email.

This is the same as discussed in #599821 and #588516.  I'll think about
merging all 3 together.

> I believe that this is a serious bug, it seems to me that one of the most
> significant conditions it can encounter that should be immediately reported to
> the sysadmin is the fact that the contents of disks are changing and breaking
> RAID consistency!

Yes, that's indeed a condition it may encounter.  The question is WHY - under normal
conditions there should be no such errors.

There are two points there.

First, a formal one.  Would it be a serious issue if such a check weren't done at
all?  I think that in that case this bug report wouldn't exist to start with.

And second, more to the point, Neil gave a very good writeup of these checks and
repairs of raid arrays, about deciding which part/component of the array is
"more right".  Unfortunately I can't find it right now.

> 
> For a 3-disk mirror or a RAID-6 such an error can be reliably corrected as long
> as all the other disks are fine.  If you have an array with double-redundancy
> and one disk fails entirely while another returns dodgy data then you lose,
> and obviously anyone who creates a doubly-redundant array wants protection
> against that sort of thing.
> 
> With a RAID-1 or RAID-5 array every mismatch is an indication of real data
> corruption and is very important.
> 
> The following patch makes mdadm send email about such events.
> 
> --- /tmp/Monitor.c	2012-02-05 23:28:41.873079816 +1100
> +++ ./Monitor.c	2012-02-05 23:32:03.961132380 +1100
> @@ -364,6 +364,7 @@
>  	    (strncmp(event, "Fail", 4)==0 ||
>  	     strncmp(event, "Test", 4)==0 ||
>  	     strncmp(event, "Spares", 6)==0 ||
> +	     (strncmp(event, "RebuildFinished", 15)==0 && disc) ||
>  	     strncmp(event, "Degrade", 7)==0)) {
>  		FILE *mp = popen(Sendmail, "w");
>  		if (mp) {
> 

This might be a more interesting approach than those already offered in the two
other patches mentioned.

/mjt




Information forwarded to debian-bugs-dist@lists.debian.org, Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>:
Bug#658701; Package mdadm. (Sun, 05 Feb 2012 15:00:09 GMT)

Acknowledgement sent to russell@coker.com.au:
Extra info received and forwarded to list. Copy sent to Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>. (Sun, 05 Feb 2012 15:00:09 GMT)

Message #15 received at 658701@bugs.debian.org:

From: Russell Coker <russell@coker.com.au>
To: Michael Tokarev <mjt@tls.msk.ru>
Cc: 658701@bugs.debian.org
Subject: Re: Bug#658701: mdadm: should send email if mismatches are reported by a check
Date: Mon, 6 Feb 2012 01:58:13 +1100
On Mon, 6 Feb 2012, Michael Tokarev <mjt@tls.msk.ru> wrote:
> > I believe that this is a serious bug, it seems to me that one of the most
> > significant conditions it can encounter that should be immediately
> > reported to the sysadmin is the fact that the contents of disks are
> > changing and breaking RAID consistency!
> 
> Yes, that's indeed a condition it may encounter.  The question is WHY -
> under normal conditions there should be no such errors.

http://www.zdnet.com/blog/storage/why-raid-6-stops-working-in-2019/805

The disk just has errors sometimes.  The above article has some calculations 
of the probabilities.

> There are two points there.
> 
> First, a formal one.  Would it be a serious issue if such a check weren't
> done at all?  I think that in that case this bug report wouldn't exist to
> start with.

http://etbe.coker.com.au/2012/02/06/reliability-raid/

If there were no checks at all then we would migrate to BTRFS even sooner.  At
the above URL I've written up some of my thoughts about BTRFS vs software RAID.

> And second, more to the point, Neil gave a very good writeup of these
> checks and repairs of raid arrays, about deciding which part/component of
> the array is "more right".  Unfortunately I can't find it right now.

Unfortunately at the moment it seems impossible to determine which disk had 
the error, if you even know that there was an error.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/




Information forwarded to debian-bugs-dist@lists.debian.org, Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>:
Bug#658701; Package mdadm. (Sun, 05 Feb 2012 15:45:03 GMT)

Acknowledgement sent to Michael Tokarev <mjt@tls.msk.ru>:
Extra info received and forwarded to list. Copy sent to Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>. (Sun, 05 Feb 2012 15:45:03 GMT)

Message #20 received at 658701@bugs.debian.org:

From: Michael Tokarev <mjt@tls.msk.ru>
To: russell@coker.com.au
Cc: 658701@bugs.debian.org
Subject: Re: Bug#658701: mdadm: should send email if mismatches are reported by a check
Date: Sun, 05 Feb 2012 19:40:30 +0400
On 05.02.2012 18:58, Russell Coker wrote:
> On Mon, 6 Feb 2012, Michael Tokarev <mjt@tls.msk.ru> wrote:
>>> I believe that this is a serious bug, it seems to me that one of the most
>>> significant conditions it can encounter that should be immediately
>>> reported to the sysadmin is the fact that the contents of disks are
>>> changing and breaking RAID consistency!
>>
>> Yes, that's indeed a condition it may encounter.  The question is WHY -
>> under normal conditions there should be no such errors.
> 
> http://www.zdnet.com/blog/storage/why-raid-6-stops-working-in-2019/805
> 
> The disk just has errors sometimes.  The above article has some calculations 
> of the probabilities.

The point here is latent errors only.  Yes, these become more and more
common "per drive" as drives grow in size, but they also become less and less
common with new/improved technologies (like switching to a 4k sector size,
where error detection checksums work a bit differently and have a better chance
of detecting the error).  Note that these are all internal to the drive, are
usually proprietary know-how of the manufacturer, and can be changed without
breaking any compatibility whatsoever, since again these are all internal things.
It is just not realistic to extrapolate from current volumes, because handling
larger volumes may require more reliable error detection mechanisms, be they
internal to the drives or external (adding (meta)data checksumming, using
various raid techniques and so on).

>> There are two points there.
>>
>> First, a formal one.  Would it be a serious issue if such a check weren't
>> done at all?  I think that in that case this bug report wouldn't exist to
>> start with.
> 
> http://etbe.coker.com.au/2012/02/06/reliability-raid/

To repeat, this is a "formal point".  Lack of any scrubbing would be a serious
bug, but lack of reporting is a wishlist item.  That's all I'm saying, nothing
more.

> If there were no checks at all then we would migrate to BTRFS even sooner.  At
> the above URL I've written up some of my thoughts about BTRFS vs software RAID.
> 
>> And second, more to the point, Neil gave a very good writeup of these
>> checks and repairs of raid arrays, about deciding which part/component of
>> the array is "more right".  Unfortunately I can't find it right now.
> 
> Unfortunately at the moment it seems impossible to determine which disk had 
> the error, if you even know that there was an error.

Yes, that's the bottom line of that article, and that's exactly what I had
in mind.  It describes in great detail (without touching on latent errors much)
why it is so.

For the future, I think drive manufacturers will do something to reduce the
probability of latent errors dramatically, maybe to cryptographically-impossible
levels, by changing how error detection and correction are done.

Please note that I'm not disputing that the reporting is lacking - just
the severity of the bug report.

/mjt





Information forwarded to debian-bugs-dist@lists.debian.org, Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>:
Bug#658701; Package mdadm. (Sat, 11 Feb 2012 14:06:03 GMT)

Acknowledgement sent to Michael Tokarev <mjt@tls.msk.ru>:
Extra info received and forwarded to list. Copy sent to Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>. (Sat, 11 Feb 2012 14:06:04 GMT)

Message #25 received at 658701@bugs.debian.org:

From: Michael Tokarev <mjt@tls.msk.ru>
To: 658701@bugs.debian.org
Cc: russell@coker.com.au
Subject: Re: Bug#658701: mdadm: should send email if mismatches are reported by a check
Date: Sat, 11 Feb 2012 18:03:29 +0400
On 05.02.2012 19:40, Michael Tokarev wrote:
> On 05.02.2012 18:58, Russell Coker wrote:
>> On Mon, 6 Feb 2012, Michael Tokarev <mjt@tls.msk.ru> wrote:
[]
>>> And second, more to the point, Neil gave a very good writeup of these
>>> checks and repairs of raid arrays, about deciding which part/component of
>>> the array is "more right".  Unfortunately I can't find it right now.
>>
>> Unfortunately at the moment it seems impossible to determine which disk had 
>> the error, if you even know that there was an error.
> 
> Yes, that's the bottom line of that article, and that's exactly what I had
> in mind.  It describes in great detail (without touching on latent errors much)
> why it is so.

I meant this one: http://neil.brown.name/blog/20100211050355
"Smart or simple RAID recovery??".

/mjt




Information forwarded to debian-bugs-dist@lists.debian.org, Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>:
Bug#658701; Package mdadm. (Thu, 12 Apr 2012 19:33:03 GMT)

Acknowledgement sent to Michael Tokarev <mjt@tls.msk.ru>:
Extra info received and forwarded to list. Copy sent to Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>. (Thu, 12 Apr 2012 19:33:03 GMT)

Message #30 received at 658701@bugs.debian.org:

From: Michael Tokarev <mjt@tls.msk.ru>
To: Neil Brown <neilb@suse.de>
Cc: 658701@bugs.debian.org, linux-raid <linux-raid@vger.kernel.org>
Subject: Re: Bug#658701: mdadm: should send email if mismatches are reported by a check
Date: Thu, 12 Apr 2012 23:28:45 +0400
Neil, re http://bugs.debian.org/658701 , what do you think:
is it okay for mdadm --monitor to send email when a check
finds mismatches, the same way it sends email about other,
more critical errors?

I think Russell has a good point here, but there's one more
source of mismatches we have in the kernel - some "sporadic"
mismatches in raid1 and raid10, especially when these are
used as swap space...

In Debian we already have several bug reports requesting more
attention to mismatch_cnt, see:

 http://bugs.debian.org/658701 (this one)
 http://bugs.debian.org/599821
 http://bugs.debian.org/588516

Thank you!

/mjt




Information forwarded to debian-bugs-dist@lists.debian.org, Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>:
Bug#658701; Package mdadm. (Sat, 26 May 2012 14:42:03 GMT)

Acknowledgement sent to Michael Tokarev <mjt@tls.msk.ru>:
Extra info received and forwarded to list. Copy sent to Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>. (Sat, 26 May 2012 14:42:04 GMT)

Message #35 received at 658701@bugs.debian.org:

From: Michael Tokarev <mjt@tls.msk.ru>
To: Michael Tokarev <mjt@tls.msk.ru>, 658701@bugs.debian.org
Cc: Neil Brown <neilb@suse.de>, linux-raid <linux-raid@vger.kernel.org>
Subject: Re: Bug#658701: mdadm: should send email if mismatches are reported by a check
Date: Sat, 26 May 2012 18:39:08 +0400
Neil, can you comment on the change to Monitor offered
in the mentioned bugreport please?

On 12.04.2012 23:28, Michael Tokarev wrote:
> Neil, re http://bugs.debian.org/658701 , what do you think:
> is it okay for mdadm --monitor to send email when a check
> finds mismatches, the same way it sends email about other,
> more critical errors?
> 
> I think Russell has a good point here, but there's one more
> source of mismatches we have in the kernel - some "sporadic"
> mismatches in raid1 and raid10, especially when these are
> used as swap space...
> 
> In Debian we already have several bug reports requesting more
> attention to mismatch_cnt, see:
> 
>  http://bugs.debian.org/658701 (this one)
>  http://bugs.debian.org/599821
>  http://bugs.debian.org/588516
> 
> Thank you!
> 
> /mjt




Information forwarded to debian-bugs-dist@lists.debian.org, Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>:
Bug#658701; Package mdadm. (Mon, 28 May 2012 01:45:03 GMT)

Acknowledgement sent to NeilBrown <neilb@suse.de>:
Extra info received and forwarded to list. Copy sent to Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>. (Mon, 28 May 2012 01:45:03 GMT)

Message #40 received at 658701@bugs.debian.org:

From: NeilBrown <neilb@suse.de>
To: Michael Tokarev <mjt@tls.msk.ru>
Cc: 658701@bugs.debian.org, linux-raid <linux-raid@vger.kernel.org>
Subject: Re: Bug#658701: mdadm: should send email if mismatches are reported by a check
Date: Mon, 28 May 2012 11:41:30 +1000
On Sat, 26 May 2012 18:39:08 +0400 Michael Tokarev <mjt@tls.msk.ru> wrote:

> Neil, can you comment on the change to Monitor offered
> in the mentioned bugreport please?
> 
> On 12.04.2012 23:28, Michael Tokarev wrote:
> > Neil, re http://bugs.debian.org/658701 , what do you think:
> > is it okay for mdadm --monitor to send email when a check
> > finds mismatches, the same way it sends email about other,
> > more critical errors?
> > 
> > I think Russell has a good point here, but there's one more
> > source of mismatches we have in the kernel - some "sporadic"
> > mismatches in raid1 and raid10, especially when these are
> > used as swap space...
> > 
> > In Debian we already have several bug reports requesting more
> > attention to mismatch_cnt, see:
> > 
> >  http://bugs.debian.org/658701 (this one)
> >  http://bugs.debian.org/599821
> >  http://bugs.debian.org/588516
> > 
> > Thank you!
> > 
> > /mjt

Sorry for not replying the first time :-(

I do not agree with the suggested change to mdadm.
A non-zero mismatch count may not be a problem.
It could be due to swap writing to a RAID1/RAID10.
It could also be due to a RAID1/RAID10/RAID6 having been
created with --assume-clean.  This is a perfectly safe thing
to do but results in a non-zero mismatch_cnt.

mdadm --monitor will run a program on every event.  If someone
wants more events reported than currently are reported, they are
free to write a script to do whatever they like.
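
A handler of the kind described above could look roughly like the following
C sketch.  It assumes that mdadm --monitor invokes the configured program
with the event name, the md device and optionally a third value as
arguments, and that for a finished check with mismatches the third value has
the form seen in the log line at the top of this report
(" mismatches found: 20608 (on raid level 1)").  The function names
parse_mismatches and handle_event are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the count from a string such as
 * " mismatches found: 20608 (on raid level 1)".
 * Returns the count, or -1 if the text does not have that form.
 */
long parse_mismatches(const char *detail)
{
    static const char key[] = "mismatches found:";
    const char *p;
    long count;

    if (detail == NULL)
        return -1;
    p = strstr(detail, key);
    if (p == NULL)
        return -1;
    /* sizeof(key) - 1 skips the key text without its terminating NUL */
    if (sscanf(p + sizeof(key) - 1, "%ld", &count) != 1)
        return -1;
    return count;
}

/*
 * Hypothetical event handler: argv[1] is the event name, argv[2] the md
 * device, argv[3] (optional) the extra detail.  Writes a one-line report
 * to `out` and returns 1 for a finished check with a positive mismatch
 * count; otherwise writes nothing and returns 0.
 */
int handle_event(int argc, char **argv, FILE *out)
{
    long count;

    if (argc < 3 || strcmp(argv[1], "RebuildFinished") != 0)
        return 0;
    count = parse_mismatches(argc > 3 ? argv[3] : NULL);
    if (count <= 0)
        return 0;
    fprintf(out, "%s: check finished with %ld mismatches\n", argv[2], count);
    return 1;
}
```

A main() that forwards its arguments to handle_event(argc, argv, stdout),
with the output piped to a mailer, would make this usable as the program
that the monitor runs.  Whether a non-zero count deserves an alert at all is
the question under discussion here.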

If md finds unreadable blocks and fixes them, then that certainly
might be interesting.  However that is interesting much more broadly than
just for md, and I believe 'smart' makes that information available.  So
having it reported from SMART would be more sensible.

In brief: mismatch_cnt may be useful to someone who understands what it means
and is investigating some issue, but it is not something that should be
automatically reported to a casual sysadmin.

NeilBrown


Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Sun Apr 20 13:21:29 2014; Machine Name: buxtehude.debian.org
