Debian Bug report logs - #988477
xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device

Package: src:xen; Maintainer for src:xen is Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>;

Affects: src:linux

Reported by: Imre Szőllősi <debianbts@virtualzone.hu>

Date: Thu, 13 May 2021 19:15:02 UTC

Severity: critical

Tags: moreinfo, upstream

Found in versions xen/4.14.1+11-gb0b734a8b3-1, xen/4.17.2+76-ge1f9cb16e2-1~deb12u1, xen/4.17.3+10-g091466ba55-1~deb12u1


Report forwarded to debian-bugs-dist@lists.debian.org, debianbts@virtualzone.hu, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#988477; Package src:xen. (Thu, 13 May 2021 19:15:04 GMT) (full text, mbox, link).


Acknowledgement sent to Imre Szőllősi <debianbts@virtualzone.hu>:
New Bug report received and forwarded. Copy sent to debianbts@virtualzone.hu, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Thu, 13 May 2021 19:15:04 GMT) (full text, mbox, link).


Message #5 received at submit@bugs.debian.org (full text, mbox, reply):

From: Imre Szőllősi <debianbts@virtualzone.hu>
To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
Date: Thu, 13 May 2021 21:13:44 +0200
Package: src:xen
Version: 4.14.1+11-gb0b734a8b3-1
Severity: critical
Justification: causes serious data loss
X-Debbugs-Cc: debianbts@virtualzone.hu

Dear Maintainer,

after a clean install of bullseye/testing the xen dmesg shows the following message:
(XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.1 d0 addr fffffffdf8000000 flags 0x8 I
this is the sata device:
01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller (rev 01)
or, on another motherboard:
01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] Device 43eb
during write operations (e.g. dbench or a Windows guest) there are a lot of these messages
sometimes the filesystem goes read-only, and the Windows guest crashes with a BSOD
tested on 3 hardware setups:
1. asus prime b450m-a, ryzen 5 2600x, md raid1, 2x samsung 1TB 860evo, lvm: problem does appear
2. asus prime b550m-k, ryzen 5 5600x, md raid1, 2x samsung 1TB 870evo, lvm: problem does appear
3. asus prime b550m-k, ryzen 5 5600x, 1x samsung 1TB 850evo, lvm: problem does not appear
3. asus prime b550m-k, ryzen 5 5600x, 1x samsung 128GB 840pro, lvm: problem does not appear
3. asus prime b550m-k, ryzen 5 5600x, samsung 1TB 850evo + samsung 128GB 840pro, lvm, dbench on 2 ssds in parallel: problem does appear

as far as i can see, the problem appears when data is written to 2 ssds in parallel

Thanks!

-- System Information:
Debian Release: bullseye/sid
  APT prefers testing-security
  APT policy: (500, 'testing-security'), (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 5.10.0-6-amd64 (SMP w/12 CPU threads)
Locale: LANG=hu_HU.UTF-8, LC_CTYPE=hu_HU.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

xen-hypervisor-4.14-amd64 depends on no packages.

Versions of packages xen-hypervisor-4.14-amd64 recommends:
ii  xen-hypervisor-common  4.14.1+11-gb0b734a8b3-1
ii  xen-utils-4.14         4.14.1+11-gb0b734a8b3-1

xen-hypervisor-4.14-amd64 suggests no packages.

-- no debconf information



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#988477; Package src:xen. (Sun, 13 Jun 2021 14:33:02 GMT) (full text, mbox, link).


Acknowledgement sent to Imre Szőllősi <debianbts@virtualzone.hu>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Sun, 13 Jun 2021 14:33:03 GMT) (full text, mbox, link).


Message #10 received at 988477@bugs.debian.org (full text, mbox, reply):

From: Imre Szőllősi <debianbts@virtualzone.hu>
To: 988477@bugs.debian.org
Subject: Re: Bug#988477: Acknowledgement (xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device)
Date: Sun, 13 Jun 2021 15:58:52 +0200
i tested on a 4th hardware setup

4. asus m4n78 pro, phenom ii x4 905e, md raid1, 2x samsung 1TB 860evo, 
lvm: problem does not appear

as far as i can see, not all motherboard/chipset/sata pcie devices are affected

Thanks!





Added tag(s) bullseye-ignore. Request was from Paul Gevers <elbrus@debian.org> to control@bugs.debian.org. (Sun, 01 Aug 2021 14:15:08 GMT) (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#988477; Package src:xen. (Thu, 05 Aug 2021 20:57:03 GMT) (full text, mbox, link).


Acknowledgement sent to Hans van Kranenburg <hans@knorrie.org>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Thu, 05 Aug 2021 20:57:03 GMT) (full text, mbox, link).


Message #17 received at 988477@bugs.debian.org (full text, mbox, reply):

From: Hans van Kranenburg <hans@knorrie.org>
To: Imre Szőllősi <debianbts@virtualzone.hu>, 988477@bugs.debian.org
Subject: Re: [Pkg-xen-devel] Bug#988477: Acknowledgement (xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device)
Date: Thu, 5 Aug 2021 22:46:39 +0200
severity 988477 normal
tags 988477 + moreinfo + upstream - bullseye-ignore
thanks

Hi!

On 6/13/21 3:58 PM, Imre Szőllősi wrote:
> i tested on a 4th hardware setup
> 
> 4. asus m4n78 pro, phenom ii x4 905e, md raid1, 2x samsung 1TB 860evo, 
> lvm: problem does not appear
> 
> as far as i can see, not all motherboard/chipset/sata pcie devices are affected

Thanks for your report, and for trying out different combinations of
hardware.

After a short internet search about the problems you're seeing with AMD
Ryzen, SATA, NVMe and IOMMU, I suspect this problem does not have a lot
to do with Xen specifically, but more with the hardware and its
firmware.

This also means that it's not a Debian packaging problem, and it cannot
be fixed by me (or the Debian Xen team). If you want to research this
problem more, I can maybe be of some help by providing suggestions.
Still, you will have to do all of the actual work, since I do not have
your hardware here.

The first thing I would suggest is to try to reproduce the problem when
booting just Linux without Xen, and then running the dbench test.

If you don't actually need to directly pass-through hardware to a Xen
guest, you can also try disabling iommu, or researching other iommu=
options that can serve as a workaround.

In any case, further reports will need to have more detailed
information. For example, instead of "there are a lot of messages",
provide a text attachment with a piece of logging that shows these messages.
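
A minimal sketch of how such a log excerpt could be collected, assuming the
output of `xl dmesg` has been saved as text. The pattern is modelled only on
the single IO_PAGE_FAULT line quoted in this report; the function name
`summarize_faults` is hypothetical and other Xen versions may format the
message differently.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in this bug report
# (assumption: other Xen versions may print a different format).
FAULT_RE = re.compile(
    r"\(XEN\) AMD-Vi: IO_PAGE_FAULT: "
    r"(?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"d(?P<domain>\d+) addr (?P<addr>[0-9a-f]+) flags (?P<flags>0x[0-9a-f]+)"
)


def summarize_faults(log_text):
    """Return (matching lines, Counter of faults per PCI device)."""
    lines = []
    per_device = Counter()
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            lines.append(line.strip())
            per_device[match.group("device")] += 1
    return lines, per_device


# Sample input: the line from the original report plus an unrelated line.
sample = (
    "(XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.1 d0 addr fffffffdf8000000 flags 0x8 I\n"
    "(XEN) some unrelated message\n"
)
fault_lines, fault_counts = summarize_faults(sample)
for device, count in fault_counts.most_common():
    print(f"{device}: {count} fault(s)")
```

The matching lines in `fault_lines` can be written to a file and attached to
the bug as-is; the per-device counts only give a quick overview.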

I'm tagging this bug 'moreinfo' now, since it will depend on your
availability and abilities to work on it to have it advance.

Have fun,
Hans van Kranenburg



Severity set to 'normal' from 'critical' Request was from Hans van Kranenburg <hans@knorrie.org> to control@bugs.debian.org. (Thu, 05 Aug 2021 20:57:04 GMT) (full text, mbox, link).


Added tag(s) moreinfo. Request was from Hans van Kranenburg <hans@knorrie.org> to control@bugs.debian.org. (Thu, 05 Aug 2021 20:57:05 GMT) (full text, mbox, link).


Added tag(s) upstream. Request was from Hans van Kranenburg <hans@knorrie.org> to control@bugs.debian.org. (Thu, 05 Aug 2021 20:57:05 GMT) (full text, mbox, link).


Removed tag(s) bullseye-ignore. Request was from Hans van Kranenburg <hans@knorrie.org> to control@bugs.debian.org. (Thu, 05 Aug 2021 20:57:06 GMT) (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#988477; Package src:xen. (Sun, 08 Aug 2021 14:03:03 GMT) (full text, mbox, link).


Acknowledgement sent to Imre Szőllősi <debianbts@virtualzone.hu>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Sun, 08 Aug 2021 14:03:03 GMT) (full text, mbox, link).


Message #30 received at 988477@bugs.debian.org (full text, mbox, reply):

From: Imre Szőllősi <debianbts@virtualzone.hu>
To: Hans van Kranenburg <hans@knorrie.org>, 988477@bugs.debian.org
Subject: Re: [Pkg-xen-devel] Bug#988477: Acknowledgement (xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device)
Date: Sun, 8 Aug 2021 15:34:42 +0200
[Message part 1 (text/html, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#988477; Package src:xen. (Thu, 18 Jan 2024 16:18:04 GMT) (full text, mbox, link).


Acknowledgement sent to Elliott Mitchell <ehem+undef@m5p.com>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Thu, 18 Jan 2024 16:18:04 GMT) (full text, mbox, link).


Message #35 received at 988477@bugs.debian.org (full text, mbox, reply):

From: Elliott Mitchell <ehem+undef@m5p.com>
To: 988477@bugs.debian.org
Cc: control@bugs.debian.org
Subject: Also observing #988477
Date: Thu, 18 Jan 2024 08:04:13 -0800
tags 988477 - moreinfo
found 988477 4.17.2+76-ge1f9cb16e2-1~deb12u1
affects 988477 src:linux
severity 988477 critical
quit

I am also observing #988477.  This machine has an AMD Zen 4
processor.  I first observed it after the motherboard/processor was
swapped out; the older motherboard/processor was several generations old.

The pattern which is emerging is Linux MD RAID1 plus recent AMD processor
which has full IOMMU functionality.  The older machine was believed to
have an IOMMU, but the BIOS wasn't creating appropriate ACPI tables
(IVRS) and thus Xen was unable to utilize it.

This seems to be occurring with a small percentage of write operations.
Subsequent read operations appear to be fine.

I am not convinced this is a Xen bug.  I suspect this is instead a bug
in the Linux MD subsystem.  In particular, the DMA interface may have been
designed assuming only a single device would ever access any page, while
the MD RAID1 driver reuses the same page for both devices.

IOMMU page release could be handled by marking the page unused in a
device data structure and later removing it by sweeping a table.  In that
case, if the MD-RAID1 driver were to redirect the page to another device
between these two steps, the entry for a subsequent device could be wiped
out when trying to invalidate an entry for a prior device.


Anyway, I'm also observing bug #988477.  This could also be a kernel bug.
So far no crashes/confirmed data loss have occurred, but sweeping the
mirror does turn up small numbers of inconsistencies.
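
For readers who want to sweep their own mirror, a minimal sketch using the
Linux MD sysfs files `sync_action` and `mismatch_cnt`. The array name `md0`,
the helper names and the `sysfs_root` parameter are assumptions made for
illustration; writing to `sync_action` needs root, and a non-zero count on
its own does not prove this particular bug.

```python
from pathlib import Path


def request_check(array, sysfs_root="/sys/block"):
    """Ask the kernel to start a read-only consistency check (needs root)."""
    path = Path(sysfs_root) / array / "md" / "sync_action"
    path.write_text("check\n")


def read_mismatch_cnt(array, sysfs_root="/sys/block"):
    """Return the mismatch count recorded by the most recent check."""
    path = Path(sysfs_root) / array / "md" / "mismatch_cnt"
    return int(path.read_text().strip())
```

A typical use would be to call `request_check("md0")`, wait until
`/proc/mdstat` no longer shows the check in progress, and then read
`read_mismatch_cnt("md0")`.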


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445





Removed tag(s) moreinfo. Request was from Elliott Mitchell <ehem+undef@m5p.com> to control@bugs.debian.org. (Thu, 18 Jan 2024 16:18:05 GMT) (full text, mbox, link).


Marked as found in versions xen/4.17.2+76-ge1f9cb16e2-1~deb12u1. Request was from Elliott Mitchell <ehem+undef@m5p.com> to control@bugs.debian.org. (Thu, 18 Jan 2024 16:18:06 GMT) (full text, mbox, link).


Added indication that 988477 affects src:linux Request was from Elliott Mitchell <ehem+undef@m5p.com> to control@bugs.debian.org. (Thu, 18 Jan 2024 16:18:06 GMT) (full text, mbox, link).


Severity set to 'critical' from 'normal' Request was from Elliott Mitchell <ehem+undef@m5p.com> to control@bugs.debian.org. (Thu, 18 Jan 2024 16:18:07 GMT) (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#988477; Package src:xen. (Wed, 10 Jul 2024 19:36:03 GMT) (full text, mbox, link).


Acknowledgement sent to Elliott Mitchell <ehem+debian@m5p.com>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Wed, 10 Jul 2024 19:36:03 GMT) (full text, mbox, link).


Message #48 received at 988477@bugs.debian.org (full text, mbox, reply):

From: Elliott Mitchell <ehem+debian@m5p.com>
To: 988477@bugs.debian.org
Subject: Potential Mitigation for #988477
Date: Wed, 10 Jul 2024 12:25:06 -0700
It was suggested as a debugging step, but adding the option
"iommu=no-intremap" to Xen's command-line may work as a short-term
mitigation for #988477.
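
A minimal sketch of adding that option, assuming a Debian system booted via
GRUB where Xen's command line is taken from `GRUB_CMDLINE_XEN_DEFAULT` in
`/etc/default/grub` (followed by `update-grub` and a reboot). The helper
`add_xen_option` is hypothetical; it only edits text and does not touch any
file.

```python
import re


def add_xen_option(config_text, option, variable="GRUB_CMDLINE_XEN_DEFAULT"):
    """Return config_text with `option` present in the variable's quoted value."""
    pattern = re.compile(
        rf'^{re.escape(variable)}="(?P<value>[^"]*)"[ \t]*$', re.MULTILINE
    )
    match = pattern.search(config_text)
    if match is None:
        # Variable not set yet: append a new assignment at the end.
        needs_newline = bool(config_text) and not config_text.endswith("\n")
        separator = "\n" if needs_newline else ""
        return f'{config_text}{separator}{variable}="{option}"\n'
    words = match.group("value").split()
    if option in words:
        return config_text  # already present, nothing to change
    words.append(option)
    joined = " ".join(words)
    new_line = f'{variable}="{joined}"'
    return config_text[:match.start()] + new_line + config_text[match.end():]


# Example with a made-up existing configuration.
before = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_XEN_DEFAULT="dom0_mem=4096M"\n'
after = add_xen_option(before, "iommu=no-intremap")
print(after)
```

The function is idempotent, so running it twice leaves the line unchanged.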


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445





Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#988477; Package src:xen. (Sun, 25 Aug 2024 21:54:02 GMT) (full text, mbox, link).


Acknowledgement sent to Maximilian Engelhardt <maxi@daemonizer.de>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Sun, 25 Aug 2024 21:54:02 GMT) (full text, mbox, link).


Message #53 received at 988477@bugs.debian.org (full text, mbox, reply):

From: Maximilian Engelhardt <maxi@daemonizer.de>
To: 988477@bugs.debian.org
Cc: Elliott Mitchell <ehem+debian@m5p.com>
Subject: Re: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
Date: Sun, 25 Aug 2024 23:41:44 +0200
[Message part 1 (text/plain, inline)]
Control: severity -1 normal

Hi Elliott,

I am changing the severity back to normal as the xen package works fine for 
many people without any serious issues. From your last message it also seems 
you found a workaround for your problem. Please don't change the bug severity 
without at least giving an explanation why you think the new severity is 
justified.

From the few log lines in this bug report this seems to be an upstream issue 
with xen or the linux kernel. Please report your observations upstream. The 
Debian xen team does not have the resources and knowledge to debug or fix such 
problems. Once the issue has been identified and fixed upstream we can see if 
we can backport a fix to our Debian packages, but this is only possible once 
an upstream fix has landed.

Thanks,
Maxi



[signature.asc (application/pgp-signature, inline)]

Severity set to 'normal' from 'critical' Request was from Maximilian Engelhardt <maxi@daemonizer.de> to 988477-submit@bugs.debian.org. (Sun, 25 Aug 2024 21:54:02 GMT) (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#988477; Package src:xen. (Sun, 25 Aug 2024 23:27:02 GMT) (full text, mbox, link).


Acknowledgement sent to Elliott Mitchell <ehem+debian@m5p.com>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Sun, 25 Aug 2024 23:27:02 GMT) (full text, mbox, link).


Message #60 received at 988477@bugs.debian.org (full text, mbox, reply):

From: Elliott Mitchell <ehem+debian@m5p.com>
To: Maximilian Engelhardt <maxi@daemonizer.de>
Cc: 988477@bugs.debian.org
Subject: Re: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
Date: Sun, 25 Aug 2024 15:58:30 -0700
On Sun, Aug 25, 2024 at 11:41:44PM +0200, Maximilian Engelhardt wrote:
> I am changing the severity back to normal as the xen package works fine for 
> many people without any serious issues. From your last message it also seems 

Yet for some lucky people data is corrupted/lost.  There could be other
people who reproduce this, but don't send e-mail saying "me too" to this
bug report.

Presently the main reason there aren't very many reproductions is that few
people bother to use RAID with flash.  Initial reports are that
SSDs have a lower failure rate than disks, but the failure rate isn't
even close to zero.  Meanwhile the data loss/corruption reproduces easily.

While both cases in #988477 were on systems with AMD hardware, I am
presently doubtful that is a requirement.  The most similar known bug was
found to be more severe on AMD hardware, but to also occur on Intel
hardware.  I suspect this issue may be similar, and simply no one has
noticed the problem yet...

> you found a workaround for your problem. Please don't change the bug severity 

Something was found which seems to have made another issue more
prominent.  It may reduce the rate at which data corruption occurs, but
I've since confirmed data loss/corruption continues to occur.

> without at least giving an explanation why you think the new severity is 
> justified.

I had thought the original reporter's justification was sufficient.  This
appears to have some specific requirements to meet, but if you meet them
you may be in trouble before alerts trigger.

So far both reports are with AMD machines with IOMMUv2 functionality (I
tried on a machine with IOMMUv1/GART and it didn't reproduce).  Both
reports feature Samsung SATA devices.  An NVMe device from another
manufacturer also showed the issue (I'm almost certain Samsung NVMe
devices will also show the issue).

I suspect Intel machines may also be affected by this issue, but it may
not manifest as severely.  I suspect this is a case of people with AMD
machines being a bit more wary of hardware failure (thus actually
bothering to use RAID1 even with flash devices).

> From the few log lines in this bug report this seems to be an upstream issue 
> with xen or the linux kernel. Please report your observations upstream. The 
> Debian xen team does not have the resources and knowledge to debug or fix such 
> problems. Once the issue has been identified and fixed upstream we can see if 
> we can backport a fix to our Debian packages, but this is only possible once 
> an upstream fix has landed.

Perhaps it has become easier to report things upstream, but the original
procedure was that reporters were supposed to report to bugs.debian.org and
NOT forward upstream.

The other problem is that I've run into a chasm with upstream and see no
way to build a bridge across.

I do have one more thing to try, but don't yet have a time-frame for
when I'll check that.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445





Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#988477; Package src:xen. (Tue, 03 Sep 2024 22:03:01 GMT) (full text, mbox, link).


Acknowledgement sent to Elliott Mitchell <ehem+debian@m5p.com>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Tue, 03 Sep 2024 22:03:01 GMT) (full text, mbox, link).


Message #65 received at 988477@bugs.debian.org (full text, mbox, reply):

From: Elliott Mitchell <ehem+debian@m5p.com>
To: Maximilian Engelhardt <maxi@daemonizer.de>
Cc: 988477@bugs.debian.org, control@bugs.debian.org, debianbts@virtualzone.hu
Subject: Re: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
Date: Tue, 3 Sep 2024 14:58:18 -0700
found 988477 4.17.3+10-g091466ba55-1~deb12u1
severity 988477 critical
quit

Justification is the same as the original: data loss.  I'm unsure where the
border between "data loss" and "serious data loss" is, but the original
reporter declared it so and I don't disagree.


On Sun, Aug 25, 2024 at 11:41:44PM +0200, Maximilian Engelhardt wrote:
> I am changing the severity back to normal as the xen package works fine for 
> many people without any serious issues. From your last message it also seems 

critical
    makes unrelated software on the system (or the whole system) break,
    or causes serious data loss, or introduces a security hole on systems
    where you install the package.

grave
    makes the package in question unusable or mostly so, or causes data
    loss, or introduces a security hole allowing access to the accounts
    of users who use the package.

Both of those are lists of conditions.  Since the conditions are
"causes serious data loss" and "causes data loss", those have been met
as there is no mention of "and cannot work acceptably for anyone".


> you found a workaround for your problem. Please don't change the bug severity 
> without at least giving an explanation why you think the new severity is 
> justified.

The key word was "may".  I was being cautious when testing due to the
severity of the issue.  As stated in the previous message, it was found
to merely mildly change the messages and not fix the issue.

> From the few log lines in this bug report this seems to be an upstream issue 
> with xen or the linux kernel. Please report your observations upstream. The 
> Debian xen team does not have the resources and knowledge to debug or fix such 
> problems. Once the issue has been identified and fixed upstream we can see if 
> we can backport a fix to our Debian packages, but this is only possible once 
> an upstream fix has landed.

My understanding is being an upstream issue has no effect on severity.
It allows tagging as "upstream", but does not allow reducing severity.
The severity is meant as an alert to others there is a *severe* problem
lurking.

I've tried interacting with upstream, yet there has been a demand to
release `xl dmesg` to a public area.  While I cannot say that any
information in `xl dmesg` can be used to compromise systems, nor
point to hardware serial numbers or other private data which leak in, it
still triggers the TMI detector.

As such I'm uncomfortable with that being public and I don't know any way
to bridge that chasm.  If I were running an installation of 10K nodes I
wouldn't be too bothered with details of a single test machine leaking,
alas I'm not in that category.

I could also send someone a pair of SATA devices known to manifest the
issue, but that has failed to generate interest.  As such I'm stuck.



Question for the original submitter, Imre Szőllősi, what was your
situation prior to seeing #988477 manifest?

Were you installing Xen 4.14 for the first time on Debian 11/bullseye?

Had you previously used Xen 4.11 with Debian 10/buster or earlier?

Knowing whether the bug was introduced between Xen 4.11 and Xen 4.14
would be valuable knowledge if you have it.  I had been using an older
processor with 4.14, so I hadn't observed it until 4.17.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445





Marked as found in versions xen/4.17.3+10-g091466ba55-1~deb12u1. Request was from Elliott Mitchell <ehem+debian@m5p.com> to control@bugs.debian.org. (Tue, 03 Sep 2024 22:24:03 GMT) (full text, mbox, link).


Severity set to 'critical' from 'normal' Request was from Elliott Mitchell <ehem+debian@m5p.com> to control@bugs.debian.org. (Tue, 03 Sep 2024 22:24:03 GMT) (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#988477; Package src:xen. (Fri, 14 Mar 2025 21:45:03 GMT) (full text, mbox, link).


Acknowledgement sent to Maximilian Engelhardt <maxi@daemonizer.de>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Fri, 14 Mar 2025 21:45:03 GMT) (full text, mbox, link).


Message #74 received at 988477@bugs.debian.org (full text, mbox, reply):

From: Maximilian Engelhardt <maxi@daemonizer.de>
To: 988477@bugs.debian.org
Subject: Re: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
Date: Fri, 14 Mar 2025 22:42:24 +0100
[Message part 1 (text/plain, inline)]
A fix [1] for the IO_PAGE_FAULT went into xen 4.20 which is now available in 
testing and unstable.
The 4.20.0-1 Debian source package can also be compiled for bookworm if you
have a bookworm system running and want to test there. Please note that qemu
also needs to be recompiled for this xen version if you are using qemu.

Can anyone affected by this bug confirm whether their issue is fixed in xen
4.20 or is still there?

[1] https://salsa.debian.org/xen-team/debian-xen/-/commit/b953a99da98d63a7c827248abc450d4e8e015ab6
[signature.asc (application/pgp-signature, inline)]

Added tag(s) moreinfo. Request was from Philipp Kern <pkern@debian.org> to control@bugs.debian.org. (Fri, 11 Apr 2025 12:24:02 GMT) (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#988477; Package src:xen. (Sun, 13 Apr 2025 11:24:02 GMT) (full text, mbox, link).


Acknowledgement sent to Philipp Kern <pkern@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Sun, 13 Apr 2025 11:24:02 GMT) (full text, mbox, link).


Message #81 received at 988477@bugs.debian.org (full text, mbox, reply):

From: Philipp Kern <pkern@debian.org>
To: 852617@bugs.debian.org, 1086145@bugs.debian.org, 988477@bugs.debian.org
Cc: control@bugs.debian.org, Marek Benc <benc.marek.elektro98@proton.me>
Subject: syslinux NMU 3:6.04~git20190206.bf6db5b4+dfsg1-3.1
Date: Sun, 13 Apr 2025 13:22:07 +0200
[Message part 1 (text/plain, inline)]
user debian-release@lists.debian.org
usertag 1091027 + bsp-2025-04-at-vienna
usertag 1057462 + bsp-2025-04-at-vienna
usertag 994274 + bsp-2025-04-at-vienna
tag 1091027 + pending
tag 1057462 + pending
tag 994274 + pending
thanks

Uploaded an NMU to DELAYED/0-day:

> syslinux (3:6.04~git20190206.bf6db5b4+dfsg1-3.1) unstable; urgency=medium
> 
>   * Non-maintainer upload.
>   * Add GCC 14 compatibility patch. Thanks to Marek Benc.
>     (Closes: #1091027, #1057462)
>   * Add wchar_t definition for gnu-efi >= 3.0.16 compatibility.
>     (Closes: #994274)
>   * Update build dependency on e2fslibs-dev => libext2fs-dev.
>   * Update Lintian overrides to match again.
> 
>  -- Philipp Kern <pkern@debian.org>  Sun, 13 Apr 2025 11:31:54 +0200

Kind regards
Philipp Kern
[syslinux_6.04~git20190206.bf6db5b4+dfsg1-3_6.04~git20190206.bf6db5b4+dfsg1-3.1.nmudiff (text/plain, attachment)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#988477; Package src:xen. (Sun, 13 Apr 2025 22:45:02 GMT) (full text, mbox, link).


Acknowledgement sent to Elliott Mitchell <ehem+debian@m5p.com>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Sun, 13 Apr 2025 22:45:02 GMT) (full text, mbox, link).


Message #86 received at 988477@bugs.debian.org (full text, mbox, reply):

From: Elliott Mitchell <ehem+debian@m5p.com>
To: Maximilian Engelhardt <maxi@daemonizer.de>
Cc: 988477@bugs.debian.org
Subject: Re: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
Date: Sun, 13 Apr 2025 15:22:01 -0700
On Fri, Mar 14, 2025 at 10:42:24PM +0100, Maximilian Engelhardt wrote:
> A fix [1] for the IO_PAGE_FAULT went into xen 4.20 which is now available in 
> testing and unstable.
> The 4.20.0-1 Debian source package can also be compiled for bookworm if you
> have a bookworm system running and want to test there. Please note that qemu
> also needs to be recompiled for this xen version if you are using qemu.
> 
> Can anyone affected by this bug confirm whether their issue is fixed in xen
> 4.20 or is still there?
> 
> [1] https://salsa.debian.org/xen-team/debian-xen/-/commit/b953a99da98d63a7c827248abc450d4e8e015ab6

The analysis is that the "(XEN) AMD-Vi: IO_PAGE_FAULT" message and the
software RAID data loss are distinct bugs.  That patch/commit likely
makes the correlated message disappear, but almost certainly leaves the
software RAID data loss behind.

Do any of the Debian maintainers have an AMD machine set up for debugging?
I'm not very well set up for debugging this particular issue.  If you've
got an AMD machine with a pair of available SATA ports (including SATA
power!), I could send a pair of SATA devices known to readily reproduce
the issue.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445





Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#988477; Package src:xen. (Sun, 18 May 2025 12:15:03 GMT) (full text, mbox, link).


Acknowledgement sent to Maximilian Engelhardt <maxi@daemonizer.de>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Sun, 18 May 2025 12:15:03 GMT) (full text, mbox, link).


Message #91 received at 988477@bugs.debian.org (full text, mbox, reply):

From: Maximilian Engelhardt <maxi@daemonizer.de>
To: 988477@bugs.debian.org, Elliott Mitchell <ehem+debian@m5p.com>
Cc: pkg-xen-devel@alioth-lists.debian.net
Subject: Re: [Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
Date: Sun, 18 May 2025 14:10:25 +0200
[Message part 1 (text/plain, inline)]
On Monday, 14 April 2025 00:22:01 CEST Elliott Mitchell wrote:
> The analysis is that the "(XEN) AMD-Vi: IO_PAGE_FAULT" message and the
> software RAID data loss are distinct bugs.  That patch/commit likely
> makes the correlated message disappear, but almost certainly leaves the
> software RAID data loss behind.
> 
> Do any of the Debian maintainers have an AMD machine set up for debugging?
> I'm not very well set up for debugging this particular issue.  If you've
> got an AMD machine with a pair of available SATA ports (including SATA
> power!), I could send a pair of SATA devices known to readily reproduce
> the issue.

I'm not aware of anybody in our team having hardware where they can reproduce 
this issue, else I'm sure they would have already provided feedback here. 
There are also not many reports here of people running into this problem. Thus 
I assume it needs a special (and probably rare) hardware combination to 
trigger this.
One thing I can add is that I have been running software raid1 with Xen on two
SATA SSDs on an Intel CPU for many years without seeing any data corruption.

As the Debian package versions of xen, linux, etc. have changed a bit since the
last time this issue was reported as reproduced in this bug, it would be good
to get confirmation the problem is still there in Debian unstable or testing.


[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#988477; Package src:xen. (Thu, 29 May 2025 00:57:01 GMT) (full text, mbox, link).


Acknowledgement sent to Elliott Mitchell <ehem+debian@m5p.com>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Thu, 29 May 2025 00:57:01 GMT) (full text, mbox, link).


Message #96 received at 988477@bugs.debian.org (full text, mbox, reply):

From: Elliott Mitchell <ehem+debian@m5p.com>
To: Maximilian Engelhardt <maxi@daemonizer.de>
Cc: 988477@bugs.debian.org, pkg-xen-devel@alioth-lists.debian.net
Subject: Re: [Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
Date: Wed, 28 May 2025 17:20:52 -0700
On Sun, May 18, 2025 at 02:10:25PM +0200, Maximilian Engelhardt wrote:
> On Montag, 14. April 2025 00:22:01 CEST Elliott Mitchell wrote:
> > 
> > Do any of the Debian maintainers have an AMD machine set up for debugging?
> > I'm not very well set up for debugging this particular issue.  If you've
> > got an AMD machine with a pair of available SATA ports (including SATA
> > power!), I could send a pair of SATA devices known to readily reproduce
> > the issue.
> 
> I'm not aware of anybody in our team having hardware where they can reproduce 
> this issue, else I'm sure they would have already provided feedback here. 
> There are also not many reports here of people running into this problem. Thus 
> I assume it needs a special (and probably rare) hardware combination to 
> trigger this.
> One thing I can add is that I have been running software raid1 with Xen on two 
> SATA SSDs on an Intel CPU for many years without seeing any data corruption.

I'm skeptical of it being rare, but certainly uncommon.  You've got some
similarity to the reproductions, but there are differences.

First question, what brand/model are the SSDs?  Samsung SSDs are known to
be affected (severely affected for some models), while Crucial/Micron
SSDs are unaffected (some models might be mildly affected).

Second question, where are the SATA ports?  Are they on-motherboard?  Add-on
card?  The reproductions were with on-motherboard ports.

What generation is your processor?  Are you sure it has an IOMMU and Xen
is driving the IOMMU?  I had suspected Intel systems would be affected,
but you may have disproven this.

> As the Debian package versions of xen, linux, etc. have changed a bit since the 
> last time this issue was reported as reproduced in this bug, it would be good 
> to get confirmation the problem is still there in Debian unstable or testing.

This is possible.
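
For anyone re-testing on current packages, a minimal sketch of one way to
check whether the message from this bug's title still shows up.  It assumes
the hypervisor log has been captured as text (for example from `xl dmesg`
run as root in dom0) and that the line format matches the one quoted in the
original report; other Xen versions may word it differently.  The names
count_io_page_faults and FAULT_RE are made up for illustration.  Note the
message is only correlated with the data loss, so a clean log does not show
the RAID problem is gone.

```python
import re
from collections import Counter

# Pattern modelled on the fault line quoted in the original report, e.g.
# (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.1 d0 addr fffffffdf8000000 flags 0x8 I
# Assumption: newer Xen releases keep the same prefix and device field.
FAULT_RE = re.compile(
    r"\(XEN\) AMD-Vi: IO_PAGE_FAULT: "
    r"(?P<dev>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)


def count_io_page_faults(log_text):
    """Map each PCI device address to its number of IO_PAGE_FAULT lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


def report(log_text):
    """Return one summary line per faulting device, most frequent first."""
    return [
        f"{dev}: {n} IO_PAGE_FAULT message(s)"
        for dev, n in count_io_page_faults(log_text).most_common()
    ]
```

Feeding it the saved output of `xl dmesg` after a write-heavy run (dbench
was used in the original report) and comparing the per-device counts with
`lspci` output shows whether the SATA controller is still the faulting
device.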


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445





Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#988477; Package src:xen. (Fri, 04 Jul 2025 00:35:01 GMT) (full text, mbox, link).


Acknowledgement sent to Elliott Mitchell <ehem+debian@m5p.com>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Fri, 04 Jul 2025 00:35:01 GMT) (full text, mbox, link).


Message #101 received at 988477@bugs.debian.org (full text, mbox, reply):

From: Elliott Mitchell <ehem+debian@m5p.com>
To: Maximilian Engelhardt <maxi@daemonizer.de>
Cc: 988477@bugs.debian.org, pkg-xen-devel@alioth-lists.debian.net
Subject: Re: [Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
Date: Thu, 3 Jul 2025 17:25:27 -0700
On Wed, May 28, 2025 at 05:21:00PM -0700, Elliott Mitchell wrote:
> On Sun, May 18, 2025 at 02:10:25PM +0200, Maximilian Engelhardt wrote:
> > On Montag, 14. April 2025 00:22:01 CEST Elliott Mitchell wrote:
> > > 
> > > Do any of the Debian maintainers have an AMD machine set up for debugging?
> > > I'm not very well set up for debugging this particular issue.  If you've
> > > got an AMD machine with a pair of available SATA ports (including SATA
> > > power!), I could send a pair of SATA devices known to readily reproduce
> > > the issue.
> > 
> > I'm not aware of anybody in our team having hardware where they can reproduce 
> > this issue, else I'm sure they would have already provided feedback here. 
> > There are also not many reports here of people running into this problem. Thus 
> > I assume it needs a special (and probably rare) hardware combination to 
> > trigger this.
> > One thing I can add is that I have been running software raid1 with Xen on two 
> > SATA SSDs on an Intel CPU for many years without seeing any data corruption.
> 
> I'm skeptical of it being rare, but certainly uncommon.  You've got some
> similarity to the reproductions, but there are differences.
> 
> First question, what brand/model are the SSDs?  Samsung SSDs are known to
> be affected (severely affected for some models), while Crucial/Micron
> SSDs are unaffected (some models might be mildly affected).
> 
> Second question, where are the SATA ports?  Are they on-motherboard?  Add-on
> card?  The reproductions were with on-motherboard ports.
> 
> What generation is your processor?  Are you sure it has an IOMMU and Xen
> is driving the IOMMU?  I had suspected Intel systems would be affected,
> but you may have disproven this.

Uh.  I did hope you could help narrow things down some.  Right now
we've got two confirmed reproductions, while you're the only person who
isn't seeing this reproduce.

The biggest difference is you've got a system with an Intel processor.
Yet we already know not all SSDs are affected, so it could be your pair are
ones which won't reproduce the issue.  On top of that, similar to the
spurious interrupt issue, it could be less severe on Intel processors
and that has kept you safe.

Presently the shortage of reports seems mostly attributable to few people
using RAID1 with SSDs.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445







Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Thu Oct 30 23:50:02 2025; Machine Name: berlioz
