Debian Bug report logs - #405919
please explain mismatch_cnt so I can sleep better at night

Package: mdadm; Maintainer for mdadm is Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>; Source for mdadm is src:mdadm.

Reported by: Michel Lespinasse <walken@Angel.zoy.org>

Date: Sun, 7 Jan 2007 12:33:02 UTC

Severity: wishlist

Tags: confirmed, upstream, wontfix

Merged with 518834

Found in versions mdadm/2.5.6-7, mdadm/2.6.7.2-1

Fixed in version 3.1.2-1

Done: Andreas Beckmann <anbe@debian.org>

Bug is archived. No further changes may be made.

Forwarded to neilb@suse.de


Report forwarded to debian-bugs-dist@lists.debian.org, Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>:
Bug#405919; Package mdadm. Full text and rfc822 format available.

Acknowledgement sent to Michel Lespinasse <walken@Angel.zoy.org>:
New Bug report received and forwarded. Copy sent to Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>. Full text and rfc822 format available.

Message #5 received at submit@bugs.debian.org (full text, mbox, reply):

From: Michel Lespinasse <walken@Angel.zoy.org>
To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: mdadm: checkarray does not report or fix mismatch_cnt issues
Date: Sun, 07 Jan 2007 03:58:45 -0800
Package: mdadm
Version: 2.5.6-7
Severity: normal



My RAID1 arrays get checked by checkarray the first Sunday of every 
month (default mdadm configuration, I think).

I have noticed that /sys/block/md1/md/mismatch_cnt reports a count of 
128 unsynchronized blocks. checkarray does not report or fix this issue.
Doing the same manually (echo check >/sys/block/md1/md/sync_action) does
not fix the issue either - mismatch_cnt is reset to 0 at the start of 
the resync, and goes up to 128 somewhere between 40% and 50% of the 
resync.

I believe checkarray should be made to report this issue, as my 
understanding was that the point of checkarray was to help with 
unsynchronized arrays.

smartctl -a does not report any issues on either of the devices the 
RAID1 is based on.

I just noticed today that mdadm.conf only lists 3 of my 5 RAID1 volumes,
I do not know why (I did not edit the file after it was auto-generated).


-- Package-specific info:
--- mount output
/dev/md0 on / type ext2 (rw,errors=remount-ro)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
procbususb on /proc/bus/usb type usbfs (rw)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
tmpfs on /tmp type tmpfs (rw)
tmpfs on /var/tmp type tmpfs (rw,size=1g)
/dev/md1 on /home type ext2 (rw,nosuid,nodev,errors=remount-ro)
/dev/md3 on /mnt/extra0 type ext2 (rw,noexec,nosuid,nodev,errors=remount-ro)
/dev/md4 on /mnt/extra1 type ext2 (rw,noexec,nosuid,nodev,errors=remount-ro)
automount(pid1430) on /mnt/autofs type autofs (rw,fd=4,pgrp=1430,minproto=2,maxproto=4)

--- mdadm.conf
# Autogenerated by mdcfg. See mdadm.conf(5) for more details on this file.
DEVICE partitions
ARRAY /dev/md0 level=raid1 num-devices=2 devices=/dev/hde1,/dev/hdg1
   UUID=436d1e33:5613bb60:73401891:847eb5d1
ARRAY /dev/md1 level=raid1 num-devices=2 devices=/dev/hde2,/dev/hdg2
   UUID=096310d2:3269c085:7757eec1:62d16352
ARRAY /dev/md2 level=raid1 num-devices=2 devices=/dev/hde3,/dev/hdg3
   UUID=0b80a931:d5aa8c2c:bd233b49:f4b6d4ab
MAILADDR root

--- /proc/mdstat:
Personalities : [raid1] 
md4 : active raid1 hde3[1] hdc2[0]
      9783488 blocks [2/2] [UU]
      
md1 : active raid1 hdg2[1] hde2[0]
      126953536 blocks [2/2] [UU]
      [===========>.........]  resync = 56.0% (71133440/126953536) finish=27.1min speed=34242K/sec
      
md3 : active raid1 hdg3[1] hdc1[0]
      48827904 blocks [2/2] [UU]
      
md2 : active raid1 hdg4[1] hde4[0]
      1975104 blocks [2/2] [UU]
      
md0 : active raid1 hdg1[1] hde1[0]
      17574976 blocks [2/2] [UU]
      
unused devices: <none>

--- /proc/partitions:
major minor  #blocks  name

  22     0   58615704 hdc
  22     1   48827992 hdc1
  22     2    9787680 hdc2
  33     0  156290904 hde
  33     1   17575078 hde1
  33     2  126953662 hde2
  33     3    9783585 hde3
  33     4    1975995 hde4
  34     0  195360984 hdg
  34     1   17575078 hdg1
  34     2  126953662 hdg2
  34     3   48853665 hdg3
  34     4    1975995 hdg4
   9     0   17574976 md0
   9     2    1975104 md2
   9     3   48827904 md3
   9     1  126953536 md1
   9     4    9783488 md4

--- initrd.img-2.6.16.36:

--- /proc/modules:

-- System Information:
Debian Release: 4.0
  APT prefers testing
  APT policy: (500, 'testing'), (50, 'unstable')
Architecture: i386 (i686)
Shell:  /bin/sh linked to /bin/bash
Kernel: Linux 2.6.16.36
Locale: LANG=C, LC_CTYPE=C (charmap=ANSI_X3.4-1968)

Versions of packages mdadm depends on:
ii  debconf [debconf-2.0]        1.5.11      Debian configuration management sy
ii  libc6                        2.3.6.ds1-8 GNU C Library: Shared libraries
ii  lsb-base                     3.1-22      Linux Standard Base 3.1 init scrip
ii  makedev                      2.3.1-83    creates device files in /dev

Versions of packages mdadm recommends:
ii  module-init-tools             3.3-pre3-1 tools for managing Linux kernel mo
ii  postfix [mail-transport-agent 2.3.4-3    A high-performance mail transport 

-- debconf information:
* mdadm/autostart: true
* mdadm/initrdstart: /dev/md0
  mdadm/initrdstart_notinconf: false
  mdadm/initrdstart_msg_errexist:
  mdadm/initrdstart_msg_intro:
  mdadm/initrdstart_msg_errblock:
* mdadm/warning:
* mdadm/start_daemon: true
* mdadm/mail_to: root
  mdadm/initrdstart_msg_errmd:
  mdadm/initrdstart_msg_errconf:
* mdadm/autocheck: true



Reply sent to martin f krafft <madduck@debian.org>:
You have marked Bug as forwarded. Full text and rfc822 format available.

Message #8 received at 405919-forwarded@bugs.debian.org (full text, mbox, reply):

From: martin f krafft <madduck@debian.org>
To: neilb@suse.de
Cc: 405919-forwarded@bugs.debian.org, Michel Lespinasse <walken@Angel.zoy.org>
Subject: Re: Bug#405919: mdadm: checkarray does not report or fix mismatch_cnt issues
Date: Sun, 7 Jan 2007 14:07:52 +0100
tags 405919 confirmed moreinfo
thanks

Forwarding this to Neil Brown, the md upstream maintainer. Neil, if
you don't feel like reading all of this, skip to "*** Neil:" below.
The full log is at http://bugs.debian.org/405919 .

also sprach Michel Lespinasse <walken@Angel.zoy.org> [2007.01.07.1258 +0100]:
> I have noticed that /sys/block/md1/md/mismatch_cnt reports a count of 
> 128 unsynchronized blocks. checkarray does not report or fix this issue.
> Doing the same manually (echo check >/sys/block/md1/md/sync_action) does
> not fix the issue either - mismatch_cnt is reset to 0 at the start of 
> the resync, and goes up to 128 somewhere between 40% and 50% of the 
> resync

Well, checkarray is called checkarray, not fixarray. Anyway, I agree
that it should report any problems and I thank you for pointing me
to this problem -- I thought that the kernel would log problems
itself, but apparently it does not. Could you please verify this by
checking all your logs?

I am going to address two points in turn: first, repairing the
array, then user notification:

This is the relevant information from md.txt:

  md/sync_action
      This can be used to monitor and control the resync/recovery
      process of MD. In particular, writing "check" here will cause
      the array to read all data blocks and check that they are
      consistent (e.g. parity is correct, or all mirror replicas are
      the same). Any discrepancies found are NOT corrected.

      A count of problems found will be stored in md/mismatch_cnt.

      Alternately, "repair" can be written which will cause the same
      check to be performed, but any errors will be corrected. 

So you can easily repair the array yourself with 'repair' instead of
'check'. checkarray could be doing this, but I'd much rather not
have checkarray write to the array every first Sunday of a month
while the admin may be sleeping. Thus, I am convinced that repairing
should be the job for 'repairarray', which I'll add in a future
release.
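
As a minimal sketch of that manual step, assuming the sysfs layout
described in the md.txt excerpt above: the helper below writes either
"check" or "repair" to an array's sync_action file. The function name
md_sync_action and its optional root-directory argument (there so the
sketch can be tried against a scratch directory) are illustrative and
not part of mdadm.

```shell
# Hypothetical helper, not part of mdadm: request a check or repair pass.
# Usage: md_sync_action md1 repair [sysfs_root]
md_sync_action() {
    dev=$1
    action=$2
    root=${3:-/sys/block}    # overridable so the sketch can run on a scratch tree
    case "$action" in
        check|repair) ;;     # the two actions discussed in this report
        *)
            echo "md_sync_action: unsupported action '$action'" >&2
            return 2
            ;;
    esac
    file="$root/$dev/md/sync_action"
    if [ ! -w "$file" ]; then
        echo "md_sync_action: cannot write $file" >&2
        return 1
    fi
    printf '%s\n' "$action" > "$file"
}
```

The helper rejects anything other than the two documented actions
before touching the file. Whether the kernel then accepts the write is
a separate matter, since it may still refuse it.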

So this leaves user notification. The problem here is that
checkarray is asynchronous (sync_action is asynchronous), meaning
that it just tells the array to run a check and quits -- it does not
actually know when the check finishes.
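
One way around the asynchronous interface is to poll, sketched here
under the assumption that sync_action reads back "idle" once a pass has
finished (as it does in the shell transcripts elsewhere in this
report). The function name md_wait_and_report, the root-directory
argument and the polling interval are all made up for illustration.

```shell
# Hypothetical helper: block until the array's sync_action reads "idle",
# then print the resulting mismatch_cnt.
# Usage: md_wait_and_report md1 [sysfs_root] [poll_seconds]
md_wait_and_report() {
    dev=$1
    root=${2:-/sys/block}
    interval=${3:-60}
    md="$root/$dev/md"
    if [ ! -r "$md/sync_action" ]; then
        echo "md_wait_and_report: $md/sync_action not readable" >&2
        return 1
    fi
    # Poll until no check/repair/resync is running any more.
    while [ "$(cat "$md/sync_action")" != "idle" ]; do
        sleep "$interval"
    done
    cat "$md/mismatch_cnt"
}
```

Polling keeps the sketch simple at the cost of a delay of up to one
interval between the end of the pass and the report.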

*** Neil:

I see three solutions, which I will present in decreasing order of
preference. I am not opposed to combining 1&2:

  1. IMHO, the best solution would be if the md kernel driver would
     tell klogd if it finds a mismatch on an array. This would then
     end up with syslog and thus hopefully reach the admin.
     Alternatively, the kernel could be told to call a user-space
     programme specified via /proc, similar to how hotplug works.

  2. I introduce another cron job or daemon, which doesn't do
     anything but monitor the /sys/block/*/md/mismatch_cnt files and
     report any non-zero contents via email. Obviously, it could
     also write a log entry so the kernel would not have to. The
     problem is simply the delay caused by the period of the checks
     (e.g. only every 5 minutes), and the extra system load, which
     I guess is negligible.

     2b. The monitoring *could* well be done by mdadm --monitor.
         I would greatly favour that.

  3. I make checkarray synchronous by looping until a check
     completes, then checking mismatch_cnt and taking similar action
     to (2). I am not totally opposed to this, apart from the extra
     complexity, since it would advance checkarray from being
     a simple helper to a user-space tool that the admin could use
     at will and be kept up to date on the progress. checkarray
     could even crop /proc/mdstat and keep displaying it for visual
     satisfaction.
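
Option 2 above could look roughly like the following. This is only a
sketch: the function name md_report_mismatches is hypothetical, the
root directory is a parameter so that it can be exercised on a scratch
tree, and how the output reaches the admin (mail, syslog) is left open.

```shell
# Hypothetical cron-style check: list arrays whose mismatch_cnt is non-zero.
# Usage: md_report_mismatches [sysfs_root]
md_report_mismatches() {
    root=${1:-/sys/block}
    for file in "$root"/*/md/mismatch_cnt; do
        [ -r "$file" ] || continue      # also skips an unexpanded glob
        count=$(cat "$file")
        if [ "$count" -ne 0 ]; then
            dev=${file#"$root"/}        # strip the root prefix
            dev=${dev%%/*}              # keep only the device name, e.g. md1
            echo "$dev: mismatch_cnt=$count"
        fi
    done
}
```

The function prints nothing when every array reports zero, so a cron
job wrapping it would only produce output when there is something to
report.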

I am interested as to what Neil has to say. Also, Michel, what are
your thoughts?

> I just noticed today that mdadm.conf only lists 3 of my 5 RAID1
> volumes, I do not know why (I did not edit the file after it was
> auto-generated).

Did you add the two arrays after it was auto-generated? What does
/usr/share/mdadm/mkconf output?

-- 
 .''`.   martin f. krafft <madduck@debian.org>
: :'  :  proud Debian developer, author, administrator, and user
`. `'`   http://people.debian.org/~madduck - http://debiansystem.info
  `-  Debian - when you have better things to do than fixing systems

Message #9 received at 405919-forwarded@bugs.debian.org (full text, mbox, reply):

From: Michel Lespinasse <walken@zoy.org>
To: martin f krafft <madduck@debian.org>
Cc: neilb@suse.de, 405919-forwarded@bugs.debian.org
Subject: Re: Bug#405919: mdadm: checkarray does not report or fix mismatch_cnt issues
Date: Sun, 7 Jan 2007 05:58:53 -0800
On Sun, Jan 07, 2007 at 02:07:52PM +0100, martin f krafft wrote:
> Well, checkarray is called checkarray, not fixarray. Anyway, I agree
> that it should report any problems and I thank you for pointing me
> to this problem -- I thought that the kernel would log problems
> itself, but apparently it does not. Could you please verify this by
> checking all your logs?

I confirm that I could not find any such message. If it's there, it must
be well hidden :)

> So you can easily repair the array yourself with 'repair' instead of
> 'check'. checkarray could be doing this, but I'd much rather not
> have checkarray write to the array every first Sunday of a month
> while the admin may be sleeping. Thus, I am convinced that repairing
> should be the job for 'repairarray', which I'll add in a future
> release.

This is reasonable. I would be worried about unnecessary rewrites too.

> I see three solutions, which I will present in decreasing order of
> preference. I am not opposed to combining 1&2:
> 
>   1. IMHO, the best solution would be if the md kernel driver would
>      tell klogd if it finds a mismatch on an array. This would then
>      end up with syslog and thus hopefully reach the admin.
>      Alternatively, the kernel could be told to call a user-space
>      programme specified via /proc, similar to how hotplug works.
> 
>   2. I introduce another cron job or daemon, which doesn't do
>      anything but monitor the /sys/block/*/md/mismatch_cnt files and
>      report any non-zero contents via email. Obviously, it could
>      also write a log entry so the kernel would not have to. The
>      problem is simply the delay caused by the period of the checks
>      (e.g. only every 5 minutes), and the extra system load, which
>      I guess is negligible.
> 
>      2b. The monitoring *could* well be done by mdadm --monitor.
>          I would greatly favour that.

> I am interested as to what Neil has to say. Also, Michel, what are
> your thoughts?

Pretty close to yours. I think the kernel should printk something when
mismatch_cnt becomes nonzero. Ideally I wish this was also shown somehow
in /proc/mdstat. Finally the mail from the mdadm monitor would seem the
most reliable way to make sure the admin hears of the issue.
So I would think 1 + 2b.

Thanks,

-- 
Michel "Walken" Lespinasse
"Bill Gates is a monocle and a Persian cat away from being the villain
in a James Bond movie." -- Dennis Miller



Message #10 received at 405919-forwarded@bugs.debian.org (full text, mbox, reply):

From: Michel Lespinasse <walken@zoy.org>
To: martin f krafft <madduck@debian.org>
Cc: neilb@suse.de, 405919-forwarded@bugs.debian.org
Subject: Re: Bug#405919: mdadm: checkarray does not report or fix mismatch_cnt issues
Date: Sun, 7 Jan 2007 18:53:47 -0800
Following up about the MD recovery process. Looks like there is a
kernel bug as well (though apparently a benign one).

On Sun, Jan 07, 2007 at 02:07:52PM +0100, martin f krafft wrote:
>   md/sync_action
>       This can be used to monitor and control the resync/recovery
>       process of MD. In particular, writing "check" here will cause
>       the array to read all data blocks and check that they are
>       consistent (e.g. parity is correct, or all mirror replicas are
>       the same). Any discrepancies found are NOT corrected.
> 
>       A count of problems found will be stored in md/mismatch_cnt.
> 
>       Alternately, "repair" can be written which will cause the same
>       check to be performed, but any errors will be corrected. 

Turns out writing "repair" does NOT work with my kernel (2.6.16.36).
Looking at action_store in md.c I see the following:

                if (cmd_match(page, "check"))
                        set_bit(MD_RECOVERY_CHECK, &mddev->recovery);
                else if (cmd_match(page, "repair"))
                        return -EINVAL;

I believe the check for "repair" needs to be inverted -
i.e. everything but repair should be valid at that point.

Here is what I see from the shell:
# cat /sys/block/md0/md/sync_action
idle
# echo repair > /sys/block/md1/md/sync_action 
echo: write error: invalid argument
# echo asdf > /sys/block/md1/md/sync_action
# cat /sys/block/md1/md/sync_action          
repair
# cat /proc/mdstat 
Personalities : [raid1] 
...
md1 : active raid1 hdg2[1] hde2[0]
      126953536 blocks [2/2] [UU]
      [==>..................]  resync = 14.2% (18054976/126953536) finish=53.7min speed=33773K/sec
...
unused devices: <none>
# ... wait one hour ...
# cat /sys/block/md1/md/sync_action
idle
# cat /sys/block/md1/md/mismatch_cnt 
128
# echo check > /sys/block/md1/md/sync_action
# cat /sys/block/md1/md/sync_action         
check
# ... wait one hour ...
# cat /sys/block/md1/md/mismatch_cnt 
128

OK, so I'm confused. Besides the obvious issue that I could not use "repair"
to start the repair process, there is also the issue that this repair did
not fix the mismatch_cnt issue. I'm not sure if it did anything - I get
a nervous feeling that this code path might not have been tested much,
but nothing has blown up in my face yet :)

Cheers,

-- 
Michel "Walken" Lespinasse
"Bill Gates is a monocle and a Persian cat away from being the villain
in a James Bond movie." -- Dennis Miller



Information stored:
Bug#405919; Package mdadm. Full text and rfc822 format available.

Acknowledgement sent to martin f krafft <madduck@debian.org>:
Extra info received and filed, but not forwarded. Full text and rfc822 format available.

Message #15 received at 405919-quiet@bugs.debian.org (full text, mbox, reply):

From: martin f krafft <madduck@debian.org>
To: 405919-quiet@bugs.debian.org
Subject: Fwd: Re: Bug#405919: mdadm: checkarray does not report or fix mismatch_cnt issues
Date: Mon, 8 Jan 2007 16:27:24 +0100
----- Forwarded message from Michel Lespinasse <walken@zoy.org> -----

On Mon, Jan 08, 2007 at 11:39:44AM +0100, martin f krafft wrote:
> also sprach Michel Lespinasse <walken@zoy.org> [2007.01.08.0353 +0100]:
> > Following up about the MD recovery process. Looks like there is a
> > kernel bug as well (though apparently a benign one).
> 
> Could you please create a new bug for this issue?

After you asked, I figured out it would probably not be fair to file
it against mdadm since that second issue seems to be kernel
related. And I checked the linux-2.6_2.6.18-8 debian source package
and I see the cmd_match(page, "repair") there is preceded by a ! as I
thought it should be. So at least the issue about using asdf to start
repairs seems to be fixed in the debian kernel - good. I have not tried
doing a repair yet though.

At this point it looks like debian kernels are fine, so I'll not file
the bug before I try a repair :)

-- 
Michel "Walken" Lespinasse
"Bill Gates is a monocle and a Persian cat away from being the villain
in a James Bond movie." -- Dennis Miller

----- End forwarded message -----

Message #16 received at 405919-forwarded@bugs.debian.org (full text, mbox, reply):

From: Neil Brown <neilb@suse.de>
To: martin f krafft <madduck@debian.org>
Cc: 405919-forwarded@bugs.debian.org, Michel Lespinasse <walken@Angel.zoy.org>
Subject: Re: Bug#405919: mdadm: checkarray does not report or fix mismatch_cnt issues
Date: Tue, 9 Jan 2007 12:34:16 +1100
On Sunday January 7, madduck@debian.org wrote:
> 
> *** Neil:
> 
> I see three solutions, which I will present in decreasing order of
> preference. I am not opposed to combining 1&2:
> 
>   1. IMHO, the best solution would be if the md kernel driver would
>      tell klogd if it finds a mismatch on an array. This would then
>      end up with syslog and thus hopefully reach the admin.
>      Alternatively, the kernel could be told to call a user-space
>      programme specified via /proc, similar to how hotplug works.

Yes, a printk when mismatch_cnt becomes non-zero and another at the
end of the check/recovery process would make sense.

Could possibly put something in /proc/mdstat too, but that is
cluttered enough already.

> 
>   2. I introduce another cron job or daemon, which doesn't do
>      anything but monitor the /sys/block/*/md/mismatch_cnt files and
>      report any non-zero contents via email. Obviously, it could
>      also write a log entry so the kernel would not have to. The
>      problem is simply the delay caused by the period of the checks
>      (e.g. only every 5 minutes), and the extra system load, which
>      I guess is negligible.
> 
>      2b. The monitoring *could* well be done by mdadm --monitor.
>          I would greatly favour that.

Yes, mdadm --monitor should send an email on resync-complete if
 mismatch_cnt != 0 (hmmm. unless it was the initial resync...
 might need to be careful there).

Both these are on my todo lists now.

NeilBrown



Message #17 received at 405919-forwarded@bugs.debian.org (full text, mbox, reply):

From: Neil Brown <neilb@suse.de>
To: Michel Lespinasse <walken@zoy.org>
Cc: martin f krafft <madduck@debian.org>, 405919-forwarded@bugs.debian.org
Subject: Re: Bug#405919: mdadm: checkarray does not report or fix mismatch_cnt issues
Date: Tue, 9 Jan 2007 15:19:29 +1100
On Sunday January 7, walken@zoy.org wrote:
> 
> Turns out writing "repair" does NOT work with my kernel (2.6.16.36).
> Looking at action_store in md.c I see the following:
> 
>                 if (cmd_match(page, "check"))
>                         set_bit(MD_RECOVERY_CHECK, &mddev->recovery);
>                 else if (cmd_match(page, "repair"))
>                         return -EINVAL;
> 
> I believe the check for "repair" needs to be inverted -
> i.e. everything but repair should be valid at that point.

Yes, sorry 'bout that.  It is fixed in more recent kernels.

> # cat /sys/block/md1/md/sync_action          
> repair
....
> # echo check > /sys/block/md1/md/sync_action
> # cat /sys/block/md1/md/sync_action         
> check
> # ... wait one hour ...
> # cat /sys/block/md1/md/mismatch_cnt 
> 128

Hmmm..... oh dear.  'repair' doesn't seem to be working at all on
raid1.
Without the following patch, it has exactly the same effect as
'check'.  I wonder how that slipped through my testing.

Thanks,
NeilBrown


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid1.c |    5 +++++
 1 file changed, 5 insertions(+)

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c	2006-12-13 17:37:16.000000000 +1100
+++ ./drivers/md/raid1.c	2007-01-09 14:00:58.000000000 +1100
@@ -1263,6 +1263,11 @@ static void sync_request_write(mddev_t *
 					sbio->bi_sector = r1_bio->sector +
 						conf->mirrors[i].rdev->data_offset;
 					sbio->bi_bdev = conf->mirrors[i].rdev->bdev;
+					for (j = 0; j < vcnt ; j++)
+						memcpy(page_address(sbio->bi_io_vec[j].bv_page),
+						       page_address(pbio->bi_io_vec[j].bv_page),
+						       PAGE_SIZE);
+
 				}
 			}
 	}



Message #18 received at 405919-forwarded@bugs.debian.org (full text, mbox, reply):

From: Michel Lespinasse <walken@zoy.org>
To: Neil Brown <neilb@suse.de>
Cc: martin f krafft <madduck@debian.org>, 405919-forwarded@bugs.debian.org
Subject: Re: Bug#405919: mdadm: checkarray does not report or fix mismatch_cnt issues
Date: Tue, 9 Jan 2007 02:29:09 -0800
On Tue, Jan 09, 2007 at 03:19:29PM +1100, Neil Brown wrote:
> > I believe the check for "repair" needs to be inverted -
> > i.e. everything but repair should be valid at that point.
> 
> Yes, sorry 'bout that.  It is fixed in more recent kernels.

Yes, I noticed that when I looked at debian's source after martin asked me
to file a separate bug for the kernel issue.

> Hmmm..... oh dear.  'repair' doesn't seem to be working at all on
> raid1.
> Without the following patch, it has exactly the same effect as
> 'check'.  I wonder how that slipped through my testing.

Heh :)

So I tried the change here on my 2.6.16.36 kernel and it worked.
echo asdf > /sys/block/md1/md/sync_action did start a resync
which showed mismatch_cnt=128 at the end, and after that a check
returned mismatch_cnt=0. Thanks !

Since both patches are small maybe they should be pushed into 2.6.16 too ?

Cheers,

-- 
Michel "Walken" Lespinasse
"Bill Gates is a monocle and a Persian cat away from being the villain
in a James Bond movie." -- Dennis Miller



Message #19 received at 405919-forwarded@bugs.debian.org (full text, mbox, reply):

From: martin f krafft <madduck@debian.org>
To: Michel Lespinasse <walken@zoy.org>, Neil Brown <neilb@suse.de>
Cc: 405919-forwarded@bugs.debian.org
Subject: Re: Bug#405919: mdadm: checkarray does not report or fix mismatch_cnt issues
Date: Tue, 16 Jan 2007 14:03:38 +0100
also sprach Neil Brown <neilb@suse.de> [2007.01.09.0234 +0100]:
> Yes, a printk when mismatch_cnt becomes non-zero and another at the
> end of the check/recovery process would make sense.
> 
> Could possibly put something in /proc/mdstat too, but that is
> cluttered enough already.

The point of /proc/mdstat seems to be to provide information, not
look nice, right? :)

It would be great to have mismatch_cnt in /proc/mdstat!

> Yes, mdadm --monitor should send an email on resync-complete if
> mismatch_cnt != 0 (hmmm. unless it was the initial resync... might
> need to be careful there).

Shouldn't it be 0 after the initial resync? Otherwise we are facing
just the same problem as when it is non-zero at any other time.

> Both these are on my todo lists now.

Thanks, Neil!

I am halting mdadm development for now until Debian etch is out,
when I will sync with 2.6 and make many other scheduled
improvements, including better udev integration.

Cheers,

-- 
 .''`.   martin f. krafft <madduck@debian.org>
: :'  :  proud Debian developer, author, administrator, and user
`. `'`   http://people.debian.org/~madduck - http://debiansystem.info
  `-  Debian - when you have better things to do than fixing systems

Tags added: upstream Request was from martin f. krafft <madduck@debian.org> to control@bugs.debian.org. (Thu, 07 Jun 2007 08:09:09 GMT) Full text and rfc822 format available.

Information forwarded to debian-bugs-dist@lists.debian.org, Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>:
Bug#405919; Package mdadm. Full text and rfc822 format available.

Acknowledgement sent to Michael Schmitt <mschmitt@unixkiste.org>:
Extra info received and forwarded to list. Copy sent to Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>. Full text and rfc822 format available.

Message #26 received at 405919@bugs.debian.org (full text, mbox, reply):

From: Michael Schmitt <mschmitt@unixkiste.org>
To: madduck@debian.org, neilb@suse.de
Cc: 405919@bugs.debian.org
Subject: http://bugs.debian.org/405919 mdadm: checkarray does not report or fix mismatch_cnt issues
Date: Tue, 04 Dec 2007 04:06:13 +0100
Hi,

as this issue is fairly old and I am really interested in it too, I
would like to ask about its status. Is there any progress? I know it may
be related to the current discussion on the linux-raid mailing list about
whether mdadm should note desynced arrays or even try to repair them, but
I am asking here again nevertheless.

greetings
Michael





Tags added: confirmed, help Request was from martin f. krafft <madduck@debian.org> to control@bugs.debian.org. (Fri, 11 Apr 2008 12:21:04 GMT) Full text and rfc822 format available.

Changed Bug title to `mdadm should provide a way to fix mismatch_cnt issues' from `mdadm: checkarray does not report or fix mismatch_cnt issues'. Request was from martin f. krafft <madduck@debian.org> to control@bugs.debian.org. (Fri, 11 Apr 2008 12:21:05 GMT) Full text and rfc822 format available.

Forcibly Merged 405919 518834. Request was from martin f krafft <madduck@madduck.net> to control@bugs.debian.org. (Wed, 29 Apr 2009 13:51:03 GMT) Full text and rfc822 format available.

Changed Bug title to `please explain mismatch_cnt so I can sleep better at night' from `mdadm should provide a way to fix mismatch_cnt issues'. Request was from martin f krafft <madduck@madduck.net> to control@bugs.debian.org. (Wed, 29 Apr 2009 13:51:05 GMT) Full text and rfc822 format available.

Severity set to `wishlist' from `normal' Request was from martin f krafft <madduck@madduck.net> to control@bugs.debian.org. (Wed, 29 Apr 2009 13:51:05 GMT) Full text and rfc822 format available.

Tags set to: confirmed, upstream Request was from martin f krafft <madduck@madduck.net> to control@bugs.debian.org. (Wed, 29 Apr 2009 13:51:06 GMT) Full text and rfc822 format available.

Message sent on to Michel Lespinasse <walken@Angel.zoy.org>:
Bug#405919. (Wed, 29 Apr 2009 13:51:11 GMT) Full text and rfc822 format available.

Message #41 received at 405919-submitter@bugs.debian.org (full text, mbox, reply):

From: martin f krafft <madduck@madduck.net>
To: 405919-submitter@bugs.debian.org, 518834-submitter@bugs.debian.org
Subject: the issue with mismatch_cnt
Date: Wed, 29 Apr 2009 15:46:57 +0200
forcemerge 405919 518834
retitle 405919 please explain mismatch_cnt so I can sleep better at night
severity 405919 wishlist
tags 405919 = confirmed upstream
thanks

The following explanation from upstream might help you get better
sleep at night. I am also forwarding it so that I can present people
asking on IRC with a link. We'll work on folding this into
documentation for an upcoming release.

I'll start with the last point from Neil's reply before getting to
the details:

> Can anyone explain what these mismatches are...is my data at risk?

Data is not at risk.

\o/

----- Forwarded message from Neil Brown <neilb@suse.de> -----

> I keep getting more and more requests about those mismatches from
> people, and I cannot quite explain it correctly. I think
> I understand what they are -- redundancy that isn't anymore, but
> I don't know how they come to be, and I don't know why md doesn't
> fix them automatically, or try to be louder about them.
> 
> They're bad, aren't they? And how does one recover from them?
> I mean, if you have a RAID5 and the bits should be [110] across the
> components, but they end up as, say [100], how do you recover from
> them? [100] is a mismatch, just as much as [010] or [111] or [001]
> would be, no? Isn't the original data then lost since you don't
> actually know which bit flipped?
> 
> It would be great if you could help me understand. I'd then take the
> time to write it up for everyone...

For RAID5, unexpected mismatches would be a problem, yes.
But that is not what is being reported.  All the reports related to
raid1, though raid10 could be affected as well.

These mismatches can happen for a number of different reasons, but are
most likely when swap is on the array.

It goes like this:
 - We have a page of memory that hasn't been changed in a while.
   The VM notices and decides to write it out to the swap device so
   that the memory can be freed more easily.
 - The write is sent to the raid1, which creates two write requests for
   that page, one to each device.
 - These write requests sit on the queue for a little while and
   eventually get processed.  Only when their turn arrives is the data
   copied (probably by DMA) out of the page and into a buffer in the
   controller.
   These two copies will almost certainly happen at different times.
 - While the requests are sitting in the queue, the application that
   owns the page happens to wake up and starts writing to the page.
   It is entirely possible that it will make some change to the page
   between the two copies (DMAs) out to the controller(s).  So
   the two pages that are written are different.
 - The VM notices that the page has been changed, and so forgets about
   the fact that the data has been written to swap.  If it ever
   decides that page is suitable for swap-out again, it will write the
   data out again.  In particular, it will never try to read in that
   page, which is different on the two devices.  If it never decides
   to write any page out to that location again, then the location
   stays out-of-sync.
 - As no-one will ever read either of the blocks that are different,
   the fact that they are different isn't really important.  Except
   that md check/repair will notice and report it.
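
The sequence above can be imitated with ordinary files. This is only a
toy model and involves no md or VM code: the file names page, mirror0
and mirror1 are made up, and each cp stands in for one of the two DMA
transfers.

```shell
# Toy model of the race: "page" is the in-memory page, mirror0/mirror1 are
# the two RAID1 members, and each cp stands in for one DMA transfer.
# Usage: simulate_raid1_race scratch_dir   (prints "in sync" or "mismatch")
simulate_raid1_race() {
    dir=$1
    printf 'old contents\n' > "$dir/page"     # page chosen for swap-out
    cp "$dir/page" "$dir/mirror0"             # first write request is serviced
    printf 'new contents\n' > "$dir/page"     # application modifies the page
    cp "$dir/page" "$dir/mirror1"             # second write request is serviced
    if cmp -s "$dir/mirror0" "$dir/mirror1"; then
        echo "in sync"
    else
        echo "mismatch"
    fi
}
```

Because the page changes between the two copies, the two mirrors end up
with different contents, which is what a later check would count as a
mismatch.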

This can conceivably happen without swap being part of the picture,
if memory mapped files are used.  However, in this case it is less
likely to remain out-of-sync, as dirty file data will be written out
soon, whereas there is no guarantee that dirty anonymous pages will
be written to swap in any particular hurry, or at any particular
location.

md/raid1 could avoid this by copying the page once into a temporary
buffer, then doing all writes from this buffer.  That is effectively
what raid5 does, which is why raid5 doesn't suffer from the symptom.
However, that would cause a measurable performance decrease with no
significant value.
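
The temporary-buffer idea can be illustrated with the same kind of toy
file model. Again this is a sketch and not md code: the file names
page, bounce, mirror0 and mirror1 are invented, and cp stands in for
both the extra copy and the DMA transfers.

```shell
# Toy model of the temporary-buffer approach: the page is copied once into
# "bounce", and both mirrors are written from that snapshot.
# Usage: simulate_buffered_write scratch_dir   (prints "in sync" or "mismatch")
simulate_buffered_write() {
    dir=$1
    printf 'old contents\n' > "$dir/page"     # page chosen for write-out
    cp "$dir/page" "$dir/bounce"              # the one extra copy
    cp "$dir/bounce" "$dir/mirror0"           # first device written from the snapshot
    printf 'new contents\n' > "$dir/page"     # application modifies the page
    cp "$dir/bounce" "$dir/mirror1"           # second device sees the same snapshot
    if cmp -s "$dir/mirror0" "$dir/mirror1"; then
        echo "in sync"
    else
        echo "mismatch"
    fi
}
```

Both mirrors now agree regardless of when the page is modified. The
additional cp into the buffer corresponds to the extra per-write copy
whose cost is mentioned above.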

A slightly less intrusive 'fix' could be to check each page when an IO
completes.  If the page is 'dirty', schedule a 'repair' for just that
page of the array.  This might work, but feels like a layering
violation (block device driver has no business looking at any of the
page flags) and would be extra complexity for minimal gain.

Is that sufficiently clear?  Maybe it should go in md.4.

NeilBrown

----- End forwarded message -----

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
"everyone smiles as you drift past the flower
 that grows so incredibly high."
                                                        -- the beatles
 
spamtraps: madduck.bogus@madduck.net
[digital_signature_gpg.asc (application/pgp-signature, inline)]

Added tag(s) fixed-upstream. Request was from martin f. krafft <madduck@debian.org> to control@bugs.debian.org. (Thu, 28 Jan 2010 21:33:02 GMT) Full text and rfc822 format available.

Information forwarded to debian-bugs-dist@lists.debian.org, Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>:
Bug#405919; Package mdadm. (Sun, 07 Feb 2010 13:18:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Jort Koopmans <jort@koopmans-online.net>:
Extra info received and forwarded to list. Copy sent to Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>. (Sun, 07 Feb 2010 13:18:02 GMT) Full text and rfc822 format available.

Message #48 received at 405919@bugs.debian.org (full text, mbox, reply):

From: Jort Koopmans <jort@koopmans-online.net>
To: 405919@bugs.debian.org
Subject: the issue with mismatch_cnt
Date: Sun, 07 Feb 2010 14:05:51 +0100
Thanks for the explanation of this mismatch.
I agree it would be very useful to distinguish between these 'memory
mapped files' mismatches and true data mismatches.
If I'm correct, this issue was not present in Etch; how did it work
there?

J. Koopmans





Information forwarded to debian-bugs-dist@lists.debian.org, Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>:
Bug#405919; Package mdadm. (Thu, 09 Sep 2010 16:36:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Alexander Kurtz <kurtz.alex@googlemail.com>:
Extra info received and forwarded to list. Copy sent to Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>. (Thu, 09 Sep 2010 16:36:03 GMT) Full text and rfc822 format available.

Message #53 received at 405919@bugs.debian.org (full text, mbox, reply):

From: Alexander Kurtz <kurtz.alex@googlemail.com>
To: martin f krafft <madduck@debian.org>
Cc: 495755@bugs.debian.org, 405919@bugs.debian.org, 427777@bugs.debian.org, 534571@bugs.debian.org
Subject: Re: Bug#564004: mdadm: option -N does not work as documented
Date: Thu, 09 Sep 2010 18:33:14 +0200
On Thursday, 09.09.2010, at 07:24 +0200, martin f krafft wrote:
> Indeed. Thanks for your bug triaging contribution, I greatly
> appreciate it.

My pleasure! Maybe you could also look at those four:

	http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=495755
	http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=405919
	http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=427777
	http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=534571

You tagged them all `fixed-upstream' on January 28th/29th. The current
version in squeeze (3.1.2) was released in March, so maybe those are
also fixed, at least for 3.1.2 and later.

Best regards

Alexander Kurtz

Information forwarded to debian-bugs-dist@lists.debian.org, Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>:
Bug#405919; Package mdadm. (Fri, 10 Sep 2010 10:42:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to martin f krafft <madduck@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>. (Fri, 10 Sep 2010 10:42:03 GMT) Full text and rfc822 format available.

Message #58 received at 405919@bugs.debian.org (full text, mbox, reply):

From: martin f krafft <madduck@debian.org>
To: Alexander Kurtz <kurtz.alex@googlemail.com>
Cc: 495755@bugs.debian.org, 405919@bugs.debian.org, 427777@bugs.debian.org, 534571@bugs.debian.org
Subject: Re: Bug#564004: mdadm: option -N does not work as documented
Date: Fri, 10 Sep 2010 12:38:32 +0200
thus spoke Alexander Kurtz <kurtz.alex@googlemail.com> [2010.09.09.1833 +0200]:
> My pleasure! Maybe you could also look at those four:
> 
> 	http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=495755
> 	http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=405919
> 	http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=427777
> 	http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=534571
> 
> You tagged them all `fixed-upstream' on January 28th/29th. The current
> version in squeeze (3.1.2) was released in March, so maybe those are
> also fixed, at least for 3.1.2 and later.

I don't know and I cannot reconstruct why I tagged them
fixed-upstream. Maybe it was an error, as there are no commits
referenced? We'll need to investigate them one by one, but I won't
have time to do so before the end of September.

-- 
 .''`.   martin f. krafft <madduck@d.o>      Related projects:
: :'  :  proud Debian developer               http://debiansystem.info
`. `'`   http://people.debian.org/~madduck    http://vcs-pkg.org
  `-  Debian - when you have better things to do than fixing systems
 
"politics is the entertainment branch of industry."
                                                        -- frank zappa
[digital_signature_gpg.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>:
Bug#405919; Package mdadm. (Thu, 16 Sep 2010 11:51:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Alexander Kurtz <kurtz.alex@googlemail.com>:
Extra info received and forwarded to list. Copy sent to Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>. (Thu, 16 Sep 2010 11:51:03 GMT) Full text and rfc822 format available.

Message #63 received at 405919@bugs.debian.org (full text, mbox, reply):

From: Alexander Kurtz <kurtz.alex@googlemail.com>
To: martin f krafft <madduck@debian.org>
Cc: 495755@bugs.debian.org, 405919@bugs.debian.org, 427777@bugs.debian.org, 534571@bugs.debian.org
Subject: Re: Bug#564004: mdadm: option -N does not work as documented
Date: Thu, 16 Sep 2010 13:47:14 +0200
Hi,

On Friday, 10.09.2010, at 12:38 +0200, martin f krafft wrote:
> thus spoke Alexander Kurtz <kurtz.alex@googlemail.com> [2010.09.09.1833 +0200]:
> > My pleasure! Maybe you could also look at those four:
> > 
> > 	http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=495755
> > 	http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=405919
> > 	http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=427777
> > 	http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=534571
> > 
> > You tagged them all `fixed-upstream' on January 28th/29th. The current
> > version in squeeze (3.1.2) was released in March, so maybe those are
> > also fixed, at least for 3.1.2 and later.
> 
> I don't know and I cannot reconstruct why I tagged them
> fixed-upstream. Maybe it was an error, as there are no commits
> referenced? We'll need to investigate them one by one, but I won't
> have time to do so before the end of September.

I just had some time:

405919: http://git.debian.org/?p=pkg-mdadm/mdadm.git;a=commit;h=1cc44574b2b7089275d2aea592a57294880ee45d
427777: http://git.debian.org/?p=pkg-mdadm/mdadm.git;a=commit;h=a1331cc4068d4c0723dd46f3a170ed100adba000
495755: http://git.debian.org/?p=pkg-mdadm/mdadm.git;a=commit;h=d998adc316299efc44cb6e70ecc2e04bffb76d17
534571: http://git.debian.org/?p=pkg-mdadm/mdadm.git;a=commit;h=39bbb392022d7d3008a0695755ced84fa49d2231

Thanks to Neil Brown who kindly noted the bug numbers in the commits!

Best regards

Alexander Kurtz





Reply sent to Alexander Kurtz <kurtz.alex@googlemail.com>:
You have taken responsibility. (Fri, 24 Sep 2010 10:27:03 GMT) Full text and rfc822 format available.

Notification sent to Michel Lespinasse <walken@Angel.zoy.org>:
Bug acknowledged by developer. (Fri, 24 Sep 2010 10:27:03 GMT) Full text and rfc822 format available.

Message #68 received at 405919-done@bugs.debian.org (full text, mbox, reply):

From: Alexander Kurtz <kurtz.alex@googlemail.com>
To: 405919-done@bugs.debian.org
Subject: Re: please explain mismatch_cnt so I can sleep better at night
Date: Fri, 24 Sep 2010 12:22:54 +0200
Version: 3.1.2-1

This was fixed by an upstream commit[1] which is part of mdadm 3.1.2.
Please feel free to reopen if necessary.

Best regards

Alexander Kurtz

[1] http://git.debian.org/?p=pkg-mdadm/mdadm.git;a=commit;h=1cc44574b2b7089275d2aea592a57294880ee45d

Reply sent to Alexander Kurtz <kurtz.alex@googlemail.com>:
You have taken responsibility. (Fri, 24 Sep 2010 10:27:04 GMT) Full text and rfc822 format available.

Notification sent to Cristian Ionescu-Idbohrn <cristian.ionescu-idbohrn@axis.com>:
Bug acknowledged by developer. (Fri, 24 Sep 2010 10:27:04 GMT) Full text and rfc822 format available.

Information forwarded to debian-bugs-dist@lists.debian.org, Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>:
Bug#405919; Package mdadm. (Tue, 28 Sep 2010 14:55:37 GMT) Full text and rfc822 format available.

Acknowledgement sent to Tim Small <tim@seoss.co.uk>:
Extra info received and forwarded to list. Copy sent to Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>. (Tue, 28 Sep 2010 14:55:37 GMT) Full text and rfc822 format available.

Message #78 received at 405919@bugs.debian.org (full text, mbox, reply):

From: Tim Small <tim@seoss.co.uk>
To: Debian Bug Tracking System <405919@bugs.debian.org>
Subject: Re: checkarray does not report or fix mismatch_cnt issues
Date: Tue, 28 Sep 2010 15:48:10 +0100
Package: mdadm
Version: 3.1.4-1+8efb9d1
Severity: normal
Tags: patch


Err, by the look of it, the patch which was referenced when this bug was
closed fixes the documentation only, not the fact that the mismatch count
isn't normally reported (actually, it does get reported by logcheck, but
it's still easy to miss in the logcheck output, not to mention most
people don't use logcheck).

The attached patch complains more vocally....

Tim.

*** /home/tim/mdadm-mismatch-fix.diff
--- mdadm.old	2010-09-28 15:35:15.954390947 +0100
+++ mdadm	2010-09-28 15:35:24.763517144 +0100
@@ -15,4 +15,49 @@
 MDADM=/sbin/mdadm
 [ -x $MDADM ] || exit 0 # package may be removed but not purged
 
+PRINT_SUMMARY=0
+
+for mcnt in /sys/block/md?/md/mismatch_cnt
+do
+	if [ -f $mcnt ]
+	then
+		read cnt < $mcnt
+		if [ $cnt != 0 ]
+		then
+			cat << WARN_TEXT
+
+Warning - $mcnt indicates that the associated RAID
+device has $cnt blocks in which the data on one array member is inconsistent
+with the data on the other array member(s).
+WARN_TEXT
+			PRINT_SUMMARY=1
+		fi
+	fi
+done
+
+if [ $PRINT_SUMMARY != 0 ]
+then
+	cat << WARN_TEXT
+
+DATA LOSS MAY HAVE OCCURRED.
+
+This condition may have been caused by one or more of the following events:
+
+. A power failure when the array was being written-to.
+. Data corruption by a hard disk drive, drive controller, cable etc.
+. A kernel bug in the md or storage subsystems etc.
+. An array was forcibly created in an inconsistent state using --assume-clean
+
+This count is updated when the md subsystem carries out a 'check' or
+'repair' action.  In the case of 'repair' it reflects the number of
+mismatched blocks prior to carrying out the repair.
+
+Once you have fixed the error, carry out a 'check' action to reset the count
+to zero.
+
+See:
+https://raid.wiki.kernel.org/index.php/Linux_Raid#Frequently_Asked_Questions_-_FAQ
+WARN_TEXT
+fi
+
 exec $MDADM --monitor --scan --oneshot




Information forwarded to debian-bugs-dist@lists.debian.org, Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>:
Bug#405919; Package mdadm. (Tue, 28 Sep 2010 15:03:07 GMT) Full text and rfc822 format available.

Acknowledgement sent to Tim Small <tim@seoss.co.uk>:
Extra info received and forwarded to list. Copy sent to Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>. (Tue, 28 Sep 2010 15:03:07 GMT) Full text and rfc822 format available.

Message #83 received at 405919@bugs.debian.org (full text, mbox, reply):

From: Tim Small <tim@seoss.co.uk>
To: Debian Bug Tracking System <405919@bugs.debian.org>
Subject: Re: checkarray does not report or fix mismatch_cnt issues
Date: Tue, 28 Sep 2010 16:01:29 +0100
Package: mdadm
Version: 3.1.4-1+8efb9d1
Severity: normal
Tags: patch

Perhaps this is a better patch in light of some of the points above.

Tim.

*** /home/tim/mdadm-mismatch-fix.diff
--- /etc/cron.daily/mdadm.old	2010-09-28 15:35:15.954390947 +0100
+++ /etc/cron.daily/mdadm	2010-09-28 15:59:14.422378309 +0100
@@ -15,4 +15,53 @@
 MDADM=/sbin/mdadm
 [ -x $MDADM ] || exit 0 # package may be removed but not purged
 
+PRINT_SUMMARY=0
+
+for mcnt in /sys/block/md?/md/mismatch_cnt
+do
+	if [ -f $mcnt ]
+	then
+		read cnt < $mcnt
+		if [ $cnt != 0 ]
+		then
+			cat << WARN_TEXT
+
+Warning - $mcnt indicates that the associated RAID
+device has $cnt blocks in which the data on one array member is inconsistent
+with the data on the other array member(s).
+WARN_TEXT
+			PRINT_SUMMARY=1
+		fi
+	fi
+done
+
+if [ $PRINT_SUMMARY != 0 ]
+then
+	cat << WARN_TEXT
+
+DATA LOSS MAY HAVE OCCURRED.
+
+This condition may have been caused by one or more of the following events:
+
+. A LEGITIMATE write to a memory mapped file or swap partition backed by a
+    RAID1 (and only a RAID1) device - see the md(4) man page for details.
+. A power failure when the array was being written-to.
+. Data corruption by a hard disk drive, drive controller, cable etc.
+. A kernel bug in the md or storage subsystems etc.
+. An array being forcibly created in an inconsistent state using --assume-clean
+
+This count is updated when the md subsystem carries out a 'check' or
+'repair' action.  In the case of 'repair' it reflects the number of
+mismatched blocks prior to carrying out the repair.
+
+Once you have fixed the error, carry out a 'check' action to reset the count
+to zero.
+
+See the md (section 4) manual page, and the following URL for details:
+
+https://raid.wiki.kernel.org/index.php/Linux_Raid#Frequently_Asked_Questions_-_FAQ
+
+WARN_TEXT
+fi
+
 exec $MDADM --monitor --scan --oneshot




Information forwarded to debian-bugs-dist@lists.debian.org, Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>:
Bug#405919; Package mdadm. (Tue, 28 Sep 2010 15:33:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Tim Small <tim@seoss.co.uk>:
Extra info received and forwarded to list. Copy sent to Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>. (Tue, 28 Sep 2010 15:33:04 GMT) Full text and rfc822 format available.

Message #88 received at 405919@bugs.debian.org (full text, mbox, reply):

From: Tim Small <tim@seoss.co.uk>
To: Debian Bug Tracking System <405919@bugs.debian.org>
Subject: Re: checkarray does not report or fix mismatch_cnt issues
Date: Tue, 28 Sep 2010 16:29:22 +0100
Hmm,

On further reflection, perhaps it's best if this patch is tweaked so 
that it doesn't say anything at all about raid1, or raid10 devices....  
I'll rework it a bit.

The thing about mismatch_cnt being non-trustworthy on RAID1 and 10 is 
nasty tho', and probably means that

/etc/logcheck/violations.d/mdadm

needs patching so that it doesn't mail out about mismatches found, 
because it can send you on a giant wild goose chase (it certainly did 
with me)....

Tim.

-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53  http://seoss.co.uk/ +44-(0)1273-808309





Information forwarded to debian-bugs-dist@lists.debian.org, Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>:
Bug#405919; Package mdadm. (Tue, 28 Sep 2010 16:15:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Tim Small <tim@seoss.co.uk>:
Extra info received and forwarded to list. Copy sent to Debian mdadm maintainers <pkg-mdadm-devel@lists.alioth.debian.org>. (Tue, 28 Sep 2010 16:15:04 GMT) Full text and rfc822 format available.

Message #93 received at 405919@bugs.debian.org (full text, mbox, reply):

From: Tim Small <tim@seoss.co.uk>
To: Debian Bug Tracking System <405919@bugs.debian.org>
Subject: Re: checkarray does not report or fix mismatch_cnt issues
Date: Tue, 28 Sep 2010 17:10:47 +0100
Package: mdadm
Version: 3.1.4-1+8efb9d1
Severity: normal
Tags: patch

How about these patches?

Sorry for the earlier noise....

BTW, would you prefer that I open a new bug for this one?

Cheers,

Tim.

*** mdadm-logcheck-patch.diff
--- mdadm.orig	2010-09-28 16:45:03.000000000 +0100
+++ /etc/logcheck/ignore.d.server/mdadm	2010-09-28 16:58:25.000000000 +0100
@@ -17,7 +17,7 @@
 ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ kernel:( \[ *[[:digit:]]+\.[[:digit:]]+\])? RAID([01456]|10) conf printout:$
 ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ kernel:( \[ *[[:digit:]]+\.[[:digit:]]+\])?[[:space:]]+---( [wrf]d:[[:digit:]]+){2,3}$
 ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ kernel:( \[ *[[:digit:]]+\.[[:digit:]]+\])?[[:space:]]+disk [[:digit:]]+,( wo:[[:digit:]]+,)? o:[[:digit:]]+, dev:[[:alnum:]]+$
-^\w{3} [ :0-9]{11} [._[:alnum:]-]+ mdadm(\[[[:digit:]]+\])?: Rebuild((Start|Finish)ed|[[:digit:]]+) event detected on md device /dev/[-_./[:alnum:]]+$
+^\w{3} [ :0-9]{11} [._[:alnum:]-]+ mdadm(\[[[:digit:]]+\])?: Rebuild((Start|Finish)ed|[[:digit:]]+) event detected on md device /dev/[-_./[:alnum:]]+(, component device  ?mismatches found: [[:digit:]]+)?$
 ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ mdadm(\[[[:digit:]]+\])?: SpareActive event detected on md device /dev/[-_./[:alnum:]]+, component device /dev/[-_./[:alnum:]]+$
 ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ mdadm(\[[[:digit:]]+\])?: (New|Degraded)Array event detected on md device /dev/[-_./[:alnum:]]+$
 ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ mdadm(\[[[:digit:]]+\])?: DeviceDisappeared event detected on md device /dev/[-_./[:alnum:]]+$
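
The effect of the amended ignore rule can be checked with grep -E.  This
is only a sketch: it tests the tail of the pattern from "Rebuild"
onwards, without the syslog prefix, and the sample messages used with it
are made up to fit the pattern rather than copied from a real log.  The
helper name is_ignored is hypothetical.

```shell
#!/bin/sh
# Tail of the amended logcheck pattern, from "Rebuild" onwards.  The
# timestamp, host name and "mdadm[pid]:" prefix are left out for brevity.
pattern='Rebuild((Start|Finish)ed|[[:digit:]]+) event detected on md device /dev/[-_./[:alnum:]]+(, component device  ?mismatches found: [[:digit:]]+)?$'

# Succeeds when the message given as $1 would be matched (and therefore
# ignored) by the pattern above.
is_ignored() {
	printf '%s\n' "$1" | grep -Eq "$pattern"
}
```

With this rule a message ending in "mismatches found: 128" is matched in
the same way as a plain RebuildFinished message, while unrelated events
are still not matched by it.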

*** /home/tim/mdadm-mismatch-fix.diff
--- /etc/cron.daily/mdadm.old	2010-09-28 15:35:15.954390947 +0100
+++ /etc/cron.daily/mdadm	2010-09-28 17:07:19.954518154 +0100
@@ -15,4 +15,59 @@
 MDADM=/sbin/mdadm
 [ -x $MDADM ] || exit 0 # package may be removed but not purged
 
+PRINT_SUMMARY=0
+
+for mcnt in /sys/block/md?/md/mismatch_cnt
+do
+	if [ -f $mcnt ]
+	then
+		read cnt < $mcnt
+		read level < $( dirname $mcnt )/level
+		if [ $cnt != 0 ] && ! ( [ "$level" = "raid10" ] || [ "$level" = "raid1" ])
+		then
+			cat << WARN_TEXT
+
+Warning - $mcnt indicates that the associated RAID
+device has $cnt blocks in which the data on one array member is inconsistent
+with the data on the other array member(s).
+WARN_TEXT
+			PRINT_SUMMARY=1
+		fi
+	fi
+done
+
+
+
+
+if [ $PRINT_SUMMARY != 0 ]
+then
+	cat << WARN_TEXT
+
+DATA LOSS MAY HAVE OCCURRED.
+
+This condition may have been caused by one or more of the following events:
+
+. A power failure whilst the array was being written-to.
+. Data corruption by faulty hard disk drive, drive controller, cabling, RAM,
+    motherboard, PSU etc. etc.
+. A kernel bug.
+. An array being forcibly created in an inconsistent state using the 
+    "--assume-clean" argument to mdadm.
+
+This count is updated when the md subsystem carries out a 'check' or
+'repair' action.  In the case of 'repair' it reflects the number of
+mismatched blocks prior to carrying out the repair.
+
+Once you have fixed the error, carry out a 'check' action to reset the count
+to zero.
+
+Note that this check is only applied to arrays which aren't RAID1 or RAID10,
+due to a kernel limitation.  See the md (section 4) manual page, and the
+following URL for details:
+
+https://raid.wiki.kernel.org/index.php/Linux_Raid#Frequently_Asked_Questions_-_FAQ
+
+WARN_TEXT
+fi
+
 exec $MDADM --monitor --scan --oneshot
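
The level test in the patch above can also be sketched as a standalone
function.  The following is an illustration, not the packaged cron
script: the function name report_mismatches and its optional directory
argument are invented so that it can be tried against a fake tree, and
it globs md* rather than md? on the assumption that names such as md10
or md127 should be covered as well.

```shell
#!/bin/sh
# Print one line per array whose mismatch_cnt is non-zero, skipping RAID1
# and RAID10, where the count is not a reliable sign of trouble.
# $1 is the directory to scan; it defaults to /sys/block.
report_mismatches() {
	root=${1:-/sys/block}
	for mcnt in "$root"/md*/md/mismatch_cnt
	do
		[ -f "$mcnt" ] || continue      # also handles an unmatched glob
		cnt=$(cat "$mcnt")
		level=$(cat "${mcnt%/*}/level" 2>/dev/null)
		case $level in
			raid1|raid10) continue ;;
		esac
		if [ "$cnt" != 0 ]
		then
			dev=${mcnt%/md/mismatch_cnt}
			echo "${dev##*/}: level=$level mismatch_cnt=$cnt"
		fi
	done
}
```

Called without an argument it reads the live sysfs tree; pointing it at
a scratch directory laid out like /sys/block makes it easy to see which
arrays would be reported.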




No longer marked as found in versions mdadm/3.1.4-1+8efb9d1. Request was from Andreas Beckmann <anbe@debian.org> to control@bugs.debian.org. (Sat, 02 Nov 2013 15:57:11 GMT) Full text and rfc822 format available.

Marked as found in versions mdadm/3.1.4-1+8efb9d1 and reopened. Request was from Andreas Beckmann <anbe@debian.org> to control@bugs.debian.org. (Sat, 02 Nov 2013 15:57:11 GMT) Full text and rfc822 format available.

Added tag(s) wontfix. Request was from Michael Tokarev <mjt@tls.msk.ru> to 518834-submit@bugs.debian.org. (Sat, 02 Nov 2013 16:57:10 GMT) Full text and rfc822 format available.

Removed tag(s) fixed-upstream. Request was from Michael Tokarev <mjt@tls.msk.ru> to 518834-submit@bugs.debian.org. (Sat, 02 Nov 2013 16:57:11 GMT) Full text and rfc822 format available.

No longer marked as found in versions mdadm/3.1.4-1+8efb9d1. Request was from Andreas Beckmann <anbe@debian.org> to 518834-submit@bugs.debian.org. (Sat, 02 Nov 2013 18:12:05 GMT) Full text and rfc822 format available.

Marked Bug as done Request was from Andreas Beckmann <anbe@debian.org> to 518834-submit@bugs.debian.org. (Sat, 02 Nov 2013 18:12:07 GMT) Full text and rfc822 format available.

Notification sent to Michel Lespinasse <walken@Angel.zoy.org>:
Bug acknowledged by developer. (Sat, 02 Nov 2013 18:12:07 GMT) Full text and rfc822 format available.

Bug archived. Request was from Debbugs Internal Request <owner@bugs.debian.org> to internal_control@bugs.debian.org. (Sun, 01 Dec 2013 07:30:09 GMT) Full text and rfc822 format available.



Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Tue Aug 2 16:45:40 2016; Machine Name: buxtehude

Debian Bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.