Report forwarded
to debian-bugs-dist@lists.debian.org, debianbts@virtualzone.hu, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>: Bug#988477; Package src:xen.
(Thu, 13 May 2021 19:15:04 GMT) (full text, mbox, link).
Acknowledgement sent
to Imre Szőllősi <debianbts@virtualzone.hu>:
New Bug report received and forwarded. Copy sent to debianbts@virtualzone.hu, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>.
(Thu, 13 May 2021 19:15:04 GMT) (full text, mbox, link).
To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
Date: Thu, 13 May 2021 21:13:44 +0200
Package: src:xen
Version: 4.14.1+11-gb0b734a8b3-1
Severity: critical
Justification: causes serious data loss
X-Debbugs-Cc: debianbts@virtualzone.hu
Dear Maintainer,
after a clean install of bullseye/testing the xen dmesg shows the following message:
(XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.1 d0 addr fffffffdf8000000 flags 0x8 I
this is the sata device:
01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller (rev 01)
or on another mb
01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] Device 43eb
during write operations (e.g. dbench or a windows guest) a lot of these messages appear
sometimes the filesystem goes into a read-only state, and the windows guest crashes with a bsod
tested on 3 hw:
1. asus prime b450m-a, ryzen 5 2600x, md raid1, 2x samsung 1TB 860evo, lvm: problem does appear
2. asus prime b550m-k, ryzen 5 5600x, md raid1, 2x samsung 1TB 870evo, lvm: problem does appear
3a. asus prime b550m-k, ryzen 5 5600x, 1x samsung 1TB 850evo, lvm: problem does not appear
3b. asus prime b550m-k, ryzen 5 5600x, 1x samsung 128GB 840pro, lvm: problem does not appear
3c. asus prime b550m-k, ryzen 5 5600x, samsung 1TB 850evo + samsung 128GB 840pro, lvm, dbench on 2 ssds in parallel: problem does appear
as far as i can see, the problem appears when data is written to 2 ssds in parallel
Thanks!
-- System Information:
Debian Release: bullseye/sid
APT prefers testing-security
APT policy: (500, 'testing-security'), (500, 'testing')
Architecture: amd64 (x86_64)
Kernel: Linux 5.10.0-6-amd64 (SMP w/12 CPU threads)
Locale: LANG=hu_HU.UTF-8, LC_CTYPE=hu_HU.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled
xen-hypervisor-4.14-amd64 depends on no packages.
Versions of packages xen-hypervisor-4.14-amd64 recommends:
ii xen-hypervisor-common 4.14.1+11-gb0b734a8b3-1
ii xen-utils-4.14 4.14.1+11-gb0b734a8b3-1
xen-hypervisor-4.14-amd64 suggests no packages.
-- no debconf information
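The parallel-write pattern from the report (dbench running against two SSDs at once) can be approximated with a short script. This is only a sketch and not the reporter's actual test: the function names, block counts and sizes are made up for illustration, and it goes through the filesystem instead of the raw block devices.

```python
import hashlib
import os
import threading


def write_and_verify(path, blocks=64, block_size=1 << 20):
    """Write random blocks to path, fsync, re-read, and compare digests."""
    written = hashlib.sha256()
    with open(path, "wb") as f:
        for _ in range(blocks):
            data = os.urandom(block_size)
            written.update(data)
            f.write(data)
        f.flush()
        os.fsync(f.fileno())
    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block_size), b""):
            read_back.update(chunk)
    return written.digest() == read_back.digest()


def parallel_write_test(paths, **kwargs):
    """Run write_and_verify on all paths at once, one thread per path."""
    results = {}

    def worker(p):
        results[p] = write_and_verify(p, **kwargs)

    threads = [threading.Thread(target=worker, args=(p,)) for p in paths]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Pointing the paths at filesystems on different SSDs gives a rough stand-in for the parallel dbench runs. The read-back may be served from the page cache, so a passing result does not prove that the data reached the disks intact.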
Information forwarded
to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>: Bug#988477; Package src:xen.
(Sun, 13 Jun 2021 14:33:02 GMT) (full text, mbox, link).
Acknowledgement sent
to Imre Szőllősi <debianbts@virtualzone.hu>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>.
(Sun, 13 Jun 2021 14:33:03 GMT) (full text, mbox, link).
Subject: Re: Bug#988477: Acknowledgement (xen-hypervisor-4.14-amd64: xen dmesg
shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device)
Date: Sun, 13 Jun 2021 15:58:52 +0200
i tested on a 4th hw
4. asus m4n78 pro, phenom ii x4 905e, md raid1, 2x samsung 1TB 860evo,
lvm: problem does not appear
as far as i can see, not all mb/chipset/sata pcie devices are affected
Thanks!
Added tag(s) bullseye-ignore.
Request was from Paul Gevers <elbrus@debian.org>
to control@bugs.debian.org.
(Sun, 01 Aug 2021 14:15:08 GMT) (full text, mbox, link).
Information forwarded
to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>: Bug#988477; Package src:xen.
(Thu, 05 Aug 2021 20:57:03 GMT) (full text, mbox, link).
Acknowledgement sent
to Hans van Kranenburg <hans@knorrie.org>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>.
(Thu, 05 Aug 2021 20:57:03 GMT) (full text, mbox, link).
To: Imre Szőllősi <debianbts@virtualzone.hu>,
988477@bugs.debian.org
Subject: Re: [Pkg-xen-devel] Bug#988477: Acknowledgement
(xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on
sata pci device)
Date: Thu, 5 Aug 2021 22:46:39 +0200
severity 988477 normal
tags 988477 + moreinfo + upstream - bullseye-ignore
thanks
Hi!
On 6/13/21 3:58 PM, Imre Szőllősi wrote:
> i tested on 4th hw
>
> 4. asus m4n78 pro, phenom ii x4 905e, md raid1, 2x samsung 1TB 860evo,
> lvm: problem does not appear
>
> as i see, not all mb/chipset/sata pcie device affected
Thanks for your report, and for trying out different combinations of
hardware.
After a short internet search about the problems you're seeing while
using AMD ryzen, sata, nvme and iommu, I suspect this problem does
not have a lot to do with Xen specifically, but more with the hardware
and its firmware.
This also means that it's not a Debian packaging problem, and it cannot
be fixed by me (or the Debian Xen team). If you want to research this
problem more, I can maybe be of some help by providing suggestions.
Still, you will have to do all of the actual work, since I do not have
your hardware here.
The first thing I would suggest is to try to reproduce the problem by
booting just Linux without Xen and then running the dbench test.
If you don't actually need to directly pass-through hardware to a Xen
guest, you can also try disabling iommu, or researching other iommu=
options that can serve as a workaround.
In any case, further reports will need to have more detailed
information. For example, instead of "there are a lot of messages",
provide a text attachment with a piece of logging that shows these messages.
I'm tagging this bug 'moreinfo' now, since it will depend on your
availability and abilities to work on it to have it advance.
Have fun,
Hans van Kranenburg
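For collecting the requested log excerpt, the fault lines can be pulled out of saved `xl dmesg` output and counted per device. A minimal sketch, assuming the messages follow the single line format quoted in the original report; the regular expression and the function names are illustrative and not part of any Xen tooling.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in the report (an assumption;
# other Xen versions may format the message differently).
FAULT_RE = re.compile(
    r"\(XEN\) AMD-Vi: IO_PAGE_FAULT: "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"d(?P<domain>\d+) addr (?P<addr>[0-9a-f]+) flags (?P<flags>0x[0-9a-f]+)"
)


def parse_faults(log_text):
    """Return one dict per IO_PAGE_FAULT line found in log_text."""
    faults = []
    for line in log_text.splitlines():
        m = FAULT_RE.search(line)
        if m:
            faults.append({
                "bdf": m.group("bdf"),
                "domain": int(m.group("domain")),
                "addr": int(m.group("addr"), 16),
                "flags": int(m.group("flags"), 16),
            })
    return faults


def count_by_device(faults):
    """Count faults per PCI device address (segment:bus:device.function)."""
    return Counter(f["bdf"] for f in faults)
```

Saving the hypervisor log with `xl dmesg > xen-dmesg.txt` and passing the file contents to `parse_faults` gives per-device counts that can be attached alongside the raw log.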
Severity set to 'normal' from 'critical'
Request was from Hans van Kranenburg <hans@knorrie.org>
to control@bugs.debian.org.
(Thu, 05 Aug 2021 20:57:04 GMT) (full text, mbox, link).
Added tag(s) moreinfo.
Request was from Hans van Kranenburg <hans@knorrie.org>
to control@bugs.debian.org.
(Thu, 05 Aug 2021 20:57:05 GMT) (full text, mbox, link).
Added tag(s) upstream.
Request was from Hans van Kranenburg <hans@knorrie.org>
to control@bugs.debian.org.
(Thu, 05 Aug 2021 20:57:05 GMT) (full text, mbox, link).
Removed tag(s) bullseye-ignore.
Request was from Hans van Kranenburg <hans@knorrie.org>
to control@bugs.debian.org.
(Thu, 05 Aug 2021 20:57:06 GMT) (full text, mbox, link).
Information forwarded
to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>: Bug#988477; Package src:xen.
(Sun, 08 Aug 2021 14:03:03 GMT) (full text, mbox, link).
Acknowledgement sent
to Imre Szőllősi <debianbts@virtualzone.hu>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>.
(Sun, 08 Aug 2021 14:03:03 GMT) (full text, mbox, link).
Information forwarded
to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>: Bug#988477; Package src:xen.
(Thu, 18 Jan 2024 16:18:04 GMT) (full text, mbox, link).
Acknowledgement sent
to Elliott Mitchell <ehem+undef@m5p.com>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>.
(Thu, 18 Jan 2024 16:18:04 GMT) (full text, mbox, link).
tags 988477 - moreinfo
found 988477 4.17.2+76-ge1f9cb16e2-1~deb12u1
affects 988477 src:linux
severity 988477 critical
quit
I am also observing #988477 occur. This machine has an AMD Zen 4
processor. The first observation was after the motherboard/processor was
swapped out; the older motherboard/processor was several generations old.
The pattern which is emerging is Linux MD RAID1 plus recent AMD processor
which has full IOMMU functionality. The older machine was believed to
have an IOMMU, but the BIOS wasn't creating appropriate ACPI tables
(IVRS) and thus Xen was unable to utilize it.
This seems to be occurring with a small percentage of write operations.
Subsequent read operations appear to be fine.
I am not convinced this is a Xen bug. I suspect this is instead a bug
in the Linux MD subsystem. In particular, this could happen if the DMA
interface was designed assuming only a single device would ever access any
page, while the MD RAID1 driver reuses the same page for both devices.
IOMMU page release could be handled by marking the page unused in a
device data structure and later removing the entry by sweeping a table. In
such a case, if the MD-RAID1 driver were to redirect the page to another
device between these two steps, the entry for a subsequent device could be
wiped out when trying to invalidate an entry for a prior device.
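The suspected sequence can be made concrete with a toy model. This is purely hypothetical and is not code from Linux or Xen: the `ToyIommu` class, its single table keyed by page, and the device names `sda` and `sdb` are all assumptions chosen to illustrate the two-step release described above.

```python
class ToyIommu:
    """Toy model: one mapping table keyed by page only, with deferred release."""

    def __init__(self):
        self.table = {}    # page -> device currently allowed to access it
        self.pending = []  # pages marked unused, waiting for the sweep

    def map(self, page, device):
        self.table[page] = device

    def release(self, page):
        # Step 1: only mark the page; the table entry stays until the sweep.
        self.pending.append(page)

    def sweep(self):
        # Step 2: drop every marked page without checking who owns it now.
        for page in self.pending:
            self.table.pop(page, None)
        self.pending.clear()

    def dma(self, page, device):
        """Return True if the access is allowed, False for a page fault."""
        return self.table.get(page) == device


def raid1_write(iommu, page, sweep_between):
    """Write one page to both mirror halves; return (first_ok, second_ok)."""
    iommu.map(page, "sda")
    first_ok = iommu.dma(page, "sda")
    iommu.release(page)
    if sweep_between:
        # Sweep runs before the page is handed to the second device.
        iommu.sweep()
        iommu.map(page, "sdb")
    else:
        # Page is redirected to the second device before the sweep runs.
        iommu.map(page, "sdb")
        iommu.sweep()
    second_ok = iommu.dma(page, "sdb")
    return first_ok, second_ok
```

With `sweep_between=True` both writes succeed in the model; with `sweep_between=False` the deferred sweep removes the second device's entry and its access faults. Whether the real DMA/IOMMU code behaves anything like this is exactly what would need to be verified.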
Anyway, I'm also observing bug #988477. This could also be a kernel bug.
So far no crashes/confirmed data loss have occurred, but sweeping the
mirror does turn up small numbers of inconsistencies.
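One way to put a number on such inconsistencies is the `mismatch_cnt` value that the MD driver exposes in sysfs after a `check` pass. A small sketch for reading it; the function name is illustrative, and whether the sweep mentioned above was done this way is an assumption.

```python
import glob
import os


def read_mismatch_counts(sysfs_root="/sys/block"):
    """Return {array_name: mismatch_cnt} for every md array under sysfs_root."""
    counts = {}
    pattern = os.path.join(sysfs_root, "md*", "md", "mismatch_cnt")
    for path in sorted(glob.glob(pattern)):
        # Path layout: <root>/<array>/md/mismatch_cnt
        array = path.split(os.sep)[-3]
        with open(path) as f:
            counts[array] = int(f.read().strip())
    return counts
```

A check pass is started by writing `check` to `/sys/block/mdX/md/sync_action`; the counters are meaningful once that pass has finished.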
--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg@m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Removed tag(s) moreinfo.
Request was from Elliott Mitchell <ehem+undef@m5p.com>
to control@bugs.debian.org.
(Thu, 18 Jan 2024 16:18:05 GMT) (full text, mbox, link).
Marked as found in versions xen/4.17.2+76-ge1f9cb16e2-1~deb12u1.
Request was from Elliott Mitchell <ehem+undef@m5p.com>
to control@bugs.debian.org.
(Thu, 18 Jan 2024 16:18:06 GMT) (full text, mbox, link).
Added indication that 988477 affects src:linux
Request was from Elliott Mitchell <ehem+undef@m5p.com>
to control@bugs.debian.org.
(Thu, 18 Jan 2024 16:18:06 GMT) (full text, mbox, link).
Severity set to 'critical' from 'normal'
Request was from Elliott Mitchell <ehem+undef@m5p.com>
to control@bugs.debian.org.
(Thu, 18 Jan 2024 16:18:07 GMT) (full text, mbox, link).
Information forwarded
to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>: Bug#988477; Package src:xen.
(Wed, 10 Jul 2024 19:36:03 GMT) (full text, mbox, link).
Acknowledgement sent
to Elliott Mitchell <ehem+debian@m5p.com>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>.
(Wed, 10 Jul 2024 19:36:03 GMT) (full text, mbox, link).
It was suggested as a debugging step, but adding the option
"iommu=no-intremap" to Xen's command-line may work as a short-term
mitigation for #988477.
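Where the option ends up depends on the boot loader. As a sketch only, assuming a GRUB setup where Xen's command line is taken from a `GRUB_CMDLINE_XEN_DEFAULT` variable in a shell-style defaults file (the variable name, the quoting and the helper `add_xen_option` are assumptions here, not something stated in this report), the option could be appended like this:

```python
def add_xen_option(config_text, option, variable="GRUB_CMDLINE_XEN_DEFAULT"):
    """Return config_text with option present exactly once in variable's value."""
    prefix = variable + "="
    lines = config_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(prefix):
            # Strip surrounding quotes and split the value into options.
            value = line[len(prefix):].strip().strip("\"'")
            options = value.split()
            if option not in options:
                options.append(option)
            lines[i] = '%s"%s"' % (prefix, " ".join(options))
            break
    else:
        # Variable not set anywhere yet: add a new assignment at the end.
        lines.append('%s"%s"' % (prefix, option))
    return "\n".join(lines) + "\n"
```

The function only transforms text; writing the result back and regenerating the GRUB configuration (for example with `update-grub`) is left to the administrator.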
Information forwarded
to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>: Bug#988477; Package src:xen.
(Sun, 25 Aug 2024 21:54:02 GMT) (full text, mbox, link).
Acknowledgement sent
to Maximilian Engelhardt <maxi@daemonizer.de>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>.
(Sun, 25 Aug 2024 21:54:02 GMT) (full text, mbox, link).
Control: severity -1 normal
Hi Elliott,
I am changing the severity back to normal as the xen package works fine for
many people without any serious issues. From your last message it also seems
you found a workaround for your problem. Please don't change the bug severity
without at least giving an explanation why you think the new severity is
justified.
From the few log lines in this bug report this seems to be an upstream issue
with xen or the linux kernel. Please report your observations upstream. The
Debian xen team does not have the resources and knowledge to debug or fix such
problems. Once the issue has been identified and fixed upstream we can see if
we can backport a fix to our Debian packages, but this is only possible once
an upstream fix has landed.
Thanks,
Maxi
Severity set to 'normal' from 'critical'
Request was from Maximilian Engelhardt <maxi@daemonizer.de>
to 988477-submit@bugs.debian.org.
(Sun, 25 Aug 2024 21:54:02 GMT) (full text, mbox, link).
Information forwarded
to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>: Bug#988477; Package src:xen.
(Sun, 25 Aug 2024 23:27:02 GMT) (full text, mbox, link).
Acknowledgement sent
to Elliott Mitchell <ehem+debian@m5p.com>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>.
(Sun, 25 Aug 2024 23:27:02 GMT) (full text, mbox, link).
Subject: Re: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi:
IO_PAGE_FAULT on sata pci device
Date: Sun, 25 Aug 2024 15:58:30 -0700
On Sun, Aug 25, 2024 at 11:41:44PM +0200, Maximilian Engelhardt wrote:
> I am changing the severity back to normal as the xen package works fine for
> many people without any serious issues. From your last message it also seems
Yet for some lucky people data is corrupted/lost. There could be other
people who reproduce this, but don't send e-mail saying "me too" to this
bug report.
Presently the main reason there aren't very many reproductions is that few
people are bothering to use RAID with flash. The initial reports are that
SSDs have a lower failure rate than disks, but the failure rate isn't
even close to zero. Meanwhile, the data loss/corruption reproduces easily.
While both cases in #988477 were on systems with AMD hardware, I am
presently doubtful that is a requirement. The most similar known bug was
found to be more severe on AMD hardware, but also occur on Intel
hardware. I suspect this issue may be similar, simply no one has noticed
the problem yet...
> you found a workaround for your problem. Please don't change the bug severity
Something was found which seems to have made another issue more
prominent. It may reduce the rate at which data corruption occurs, but
I've since confirmed data loss/corruption continues to occur.
> without at least giving an explanation why you think the new severity is
> justified.
I had thought the original reporter's justification was sufficient. This
appears to have some specific requirements to meet, but if you meet them
you may be in trouble before alerts trigger.
So far both reports are with AMD machines with IOMMUv2 functionality (I
tried on a machine with IOMMUv1/GART and it didn't reproduce). Both
reports feature Samsung SATA devices. An NVMe device from another
manufacturer also showed the issue (I'm almost certain Samsung NVMe
devices will also show the issue).
I suspect Intel machines may also be affected by this issue, but it may
not manifest as severely. I suspect this is a case of people with AMD
machines being a bit more wary of hardware failure (thus actually
bothering to use RAID1 even with flash devices).
> From the few log lines in this bug report this seems to be an upstream issue
> with xen or the linux kernel. Please report your observations upstream. The
> Debian xen team does not have the resources and knowledge to debug or fix such
> problems. Once the issue has been identified and fixed upstream we can see if
> we can backport a fix to our Debian packages, but this is only possible once
> an upstream fix has landed.
Perhaps it has become easier to report things upstream, but the original
procedure was that reporters were supposed to report to bugs.debian.org and
NOT forward upstream.
The other problem is I've run into a chasm with upstream and no way to build
a bridge across.
I do have one more thing to try, but don't yet have a time-frame for
when I'll check that.
Information forwarded
to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>: Bug#988477; Package src:xen.
(Tue, 03 Sep 2024 22:03:01 GMT) (full text, mbox, link).
Acknowledgement sent
to Elliott Mitchell <ehem+debian@m5p.com>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>.
(Tue, 03 Sep 2024 22:03:01 GMT) (full text, mbox, link).
Subject: Re: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi:
IO_PAGE_FAULT on sata pci device
Date: Tue, 3 Sep 2024 14:58:18 -0700
found 988477 4.17.3+10-g091466ba55-1~deb12u1
severity 988477 critical
quit
Justification is the same as the original: data loss. I'm unsure where the
border between "data loss" and "serious data loss" is, but the original
reporter declared it so and I don't disagree.
On Sun, Aug 25, 2024 at 11:41:44PM +0200, Maximilian Engelhardt wrote:
> I am changing the severity back to normal as the xen package works fine for
> many people without any serious issues. From your last message it also seems
critical
    makes unrelated software on the system (or the whole system) break,
    or causes serious data loss, or introduces a security hole on systems
    where you install the package.
grave
    makes the package in question unusable or mostly so, or causes data
    loss, or introduces a security hole allowing access to the accounts
    of users who use the package.
Both of those are lists of conditions. Since the conditions are
"causes serious data loss" and "causes data loss", those have been met
as there is no mention of "and cannot work acceptably for anyone".
> you found a workaround for your problem. Please don't change the bug severity
> without at least giving an explanation why you think the new severity is
> justified.
The key word was "may". I was being cautious when testing due to the
severity of the issue. As stated in the previous message, it was found
to merely mildly change the messages and not fix the issue.
> From the few log lines in this bug report this seems to be an upstream issue
> with xen or the linux kernel. Please report your observations upstream. The
> Debian xen team does not have the resources and knowledge to debug or fix such
> problems. Once the issue has been identified and fixed upstream we can see if
> we can backport a fix to our Debian packages, but this is only possible once
> an upstream fix has landed.
My understanding is being an upstream issue has no effect on severity.
It allows tagging as "upstream", but does not allow reducing severity.
The severity is meant as an alert to others there is a *severe* problem
lurking.
I've tried interacting with upstream, yet there has been a demand to
release `xl dmesg` to a public area. While I cannot state any
information in `xl dmesg` can be used to compromise systems, nor can
point to hardware serial numbers or other private data which leak in, it
still triggers the TMI detector.
As such I'm uncomfortable with that being public and I don't know any way
to bridge that chasm. If I was an installation of 10K nodes I wouldn't
be too bothered with details of a single test machine leaking, alas I'm
not in that category.
I could also send someone a pair of SATA devices known to manifest the
issue, but that has failed to generate interest. As such I'm stuck.
Question for the original submitter, Imre Szőllősi, what was your
situation prior to seeing #988477 manifest?
Were you installing Xen 4.14 for the first time on Debian 11/bullseye?
Had you previously used Xen 4.11 with Debian 10/buster or earlier?
Knowing whether the bug was introduced between Xen 4.11 and Xen 4.14
would be valuable knowledge if you have it. I had been using an older
processor with 4.14, so I hadn't observed it until 4.17.
Marked as found in versions xen/4.17.3+10-g091466ba55-1~deb12u1.
Request was from Elliott Mitchell <ehem+debian@m5p.com>
to control@bugs.debian.org.
(Tue, 03 Sep 2024 22:24:03 GMT) (full text, mbox, link).
Severity set to 'critical' from 'normal'
Request was from Elliott Mitchell <ehem+debian@m5p.com>
to control@bugs.debian.org.
(Tue, 03 Sep 2024 22:24:03 GMT) (full text, mbox, link).
Information forwarded
to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>: Bug#988477; Package src:xen.
(Fri, 14 Mar 2025 21:45:03 GMT) (full text, mbox, link).
Acknowledgement sent
to Maximilian Engelhardt <maxi@daemonizer.de>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>.
(Fri, 14 Mar 2025 21:45:03 GMT) (full text, mbox, link).
A fix [1] for the IO_PAGE_FAULT went into xen 4.20 which is now available in
testing and unstable.
The 4.20.0-1 Debian source package can also be compiled for bookworm if you
have a bookworm system running and want to test there. Please note that qemu
also needs to be recompiled for this xen version if you are using qemu.
Can anyone affected by this bug confirm if their issue is fixed in xen 4.20 or
is still there?
[1] https://salsa.debian.org/xen-team/debian-xen/-/commit/b953a99da98d63a7c827248abc450d4e8e015ab6
Added tag(s) moreinfo.
Request was from Philipp Kern <pkern@debian.org>
to control@bugs.debian.org.
(Fri, 11 Apr 2025 12:24:02 GMT) (full text, mbox, link).
Information forwarded
to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>: Bug#988477; Package src:xen.
(Sun, 13 Apr 2025 11:24:02 GMT) (full text, mbox, link).
Acknowledgement sent
to Philipp Kern <pkern@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>.
(Sun, 13 Apr 2025 11:24:02 GMT) (full text, mbox, link).
Information forwarded
to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>: Bug#988477; Package src:xen.
(Sun, 13 Apr 2025 22:45:02 GMT) (full text, mbox, link).
Acknowledgement sent
to Elliott Mitchell <ehem+debian@m5p.com>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>.
(Sun, 13 Apr 2025 22:45:02 GMT) (full text, mbox, link).
Subject: Re: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi:
IO_PAGE_FAULT on sata pci device
Date: Sun, 13 Apr 2025 15:22:01 -0700
On Fri, Mar 14, 2025 at 10:42:24PM +0100, Maximilian Engelhardt wrote:
> A fix [1] for the IO_PAGE_FAULT went into xen 4.20 which is now available in
> testing and unstable.
> The 4.20.0-1 Debian source package can also be compiled for bookworm if you
> have a bookworm system running and want to test there. Please note that qemu
> also needs to be recompiled for this xen version if you are using qemu.
>
> Can anyone affected by this bug confirm if their issue is fixed in xen 4.20 or
> is still there?
>
> [1] https://salsa.debian.org/xen-team/debian-xen/-/commit/b953a99da98d63a7c827248abc450d4e8e015ab6
The analysis is that the "(XEN) AMD-Vi: IO_PAGE_FAULT" message and the
software RAID data loss are distinct bugs. That patch/commit likely
makes the correlated message disappear, but almost certainly leaves the
software RAID data loss behind.
Do any of the Debian maintainers have an AMD machine set up for debugging?
I'm not very well set up for debugging this particular issue. If you've
got an AMD machine with a pair of available SATA ports (including SATA
power!), I could send a pair of SATA devices known to readily reproduce
the issue.
Information forwarded
to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>: Bug#988477; Package src:xen.
(Sun, 18 May 2025 12:15:03 GMT) (full text, mbox, link).
Acknowledgement sent
to Maximilian Engelhardt <maxi@daemonizer.de>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>.
(Sun, 18 May 2025 12:15:03 GMT) (full text, mbox, link).
On Monday, 14 April 2025 00:22:01 CEST Elliott Mitchell wrote:
> The analysis is the "(XEN) AMD-Vi: IO_PAGE_FAULT" message, and the
> software RAID data loss are distinct bugs. That patch/commit likely
> makes the correlated message disappear, but almost certainly leaves the
> software RAID data loss behind.
>
> Do any of the Debian maintainers have an AMD machine setup for debugging?
> I'm not very well setup for debugging this particular issue. If you've
> got an AMD machine with a pair of available SATA ports (including SATA
> power!), I could send a pair of SATA devices known to readily reproduce
> the issue.
I'm not aware of anybody in our team having hardware where they can reproduce
this issue, else I'm sure they would have already provided feedback here.
There are also not many reports here of people running into this problem. Thus
I assume it needs a special (and probably rare) hardware combination to
trigger this.
One thing I can add is that I have been running software raid1 with Xen on two
SATA SSDs on an Intel CPU for many years without seeing any data corruption.
As the Debian package versions of xen, linux, etc. have changed a bit since the
last time this issue was reported as reproduced in this bug, it would be good
to get confirmation the problem is still there in Debian unstable or testing.
Information forwarded
to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>: Bug#988477; Package src:xen.
(Thu, 29 May 2025 00:57:01 GMT) (full text, mbox, link).
Acknowledgement sent
to Elliott Mitchell <ehem+debian@m5p.com>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>.
(Thu, 29 May 2025 00:57:01 GMT) (full text, mbox, link).
Subject: Re: [Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg
shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
Date: Wed, 28 May 2025 17:20:52 -0700
On Sun, May 18, 2025 at 02:10:25PM +0200, Maximilian Engelhardt wrote:
> On Monday, 14 April 2025 00:22:01 CEST Elliott Mitchell wrote:
> >
> > Do any of the Debian maintainers have an AMD machine setup for debugging?
> > I'm not very well setup for debugging this particular issue. If you've
> > got an AMD machine with a pair of available SATA ports (including SATA
> > power!), I could send a pair of SATA devices known to readily reproduce
> > the issue.
>
> I'm not aware of anybody in our team having hardware where they can reproduce
> this issue, else I'm sure they would have already provided feedback here.
> There are also not many reports here of people running into this problem. Thus
> I assume it needs a special (and probably rare) hardware combination to
> trigger this.
> One thing I can add is that I have been running software raid1 with Xen on two
> SATA SSDs on an Intel CPU since many years without seeing any data corruption.
I'm skeptical of it being rare, but certainly uncommon. You've got some
similarity to the reproductions, but there are differences.
First question, what brand/model are the SSDs? Samsung SSDs are known to
be affected (severely affected for some models), while Crucial/Micron
SSDs are unaffected (some models might be mildly affected).
Second question, where are the SATA ports? Are they on-motherboard? Add-on
card? The reproductions were with on-motherboard ports.
What generation is your processor? Are you sure it has an IOMMU and Xen
is driving the IOMMU? I had suspected Intel systems would be affected,
but you may have disproven this.
> As Debian packages versions of xen, linux, etc. have changed a bit since the
> last time this issue was reported as reproduced in this bug, it would be good
> to get confirmation the problem is still there in Debian unstable or testing.
This is possible.
Information forwarded
to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>: Bug#988477; Package src:xen.
(Fri, 04 Jul 2025 00:35:01 GMT) (full text, mbox, link).
Acknowledgement sent
to Elliott Mitchell <ehem+debian@m5p.com>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>.
(Fri, 04 Jul 2025 00:35:01 GMT) (full text, mbox, link).
Subject: Re: [Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg
shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
Date: Thu, 3 Jul 2025 17:25:27 -0700
On Wed, May 28, 2025 at 05:21:00PM -0700, Elliott Mitchell wrote:
> On Sun, May 18, 2025 at 02:10:25PM +0200, Maximilian Engelhardt wrote:
> > On Monday, 14 April 2025 00:22:01 CEST Elliott Mitchell wrote:
> > >
> > > Do any of the Debian maintainers have an AMD machine setup for debugging?
> > > I'm not very well setup for debugging this particular issue. If you've
> > > got an AMD machine with a pair of available SATA ports (including SATA
> > > power!), I could send a pair of SATA devices known to readily reproduce
> > > the issue.
> >
> > I'm not aware of anybody in our team having hardware where they can reproduce
> > this issue, else I'm sure they would have already provided feedback here.
> > There are also not many reports here of people running into this problem. Thus
> > I assume it needs a special (and probably rare) hardware combination to
> > trigger this.
> > One thing I can add is that I have been running software raid1 with Xen on two
> > SATA SSDs on an Intel CPU since many years without seeing any data corruption.
>
> I'm skeptical of it being rare, but certainly uncommon. You've got some
> similarity to the reproductions, but there are differences.
>
> First question, what brand/model are the SSDs? Samsung SSDs are known to
> be affected (severely affected for some models), while Crucial/Micron
> SSDs are unaffected (some models might be mildly affected).
>
> Second question, where are the SATA ports? They on-motherboard? Add-on
> card? The reproductions were with on-motherboard ports.
>
> What generation is your processor? Are you sure it has an IOMMU and Xen
> is driving the IOMMU? I had suspected Intel systems would be affected,
> but you may have disproven this.
Uh. I did hope you could help narrow things down some. Right now
we've got two confirmed reproductions, while you're the only person who
isn't seeing this reproduce.
The biggest difference is you've got a system with an Intel processor.
Yet we already know not all SSDs are affected, so it could be your pair are
ones which won't reproduce the issue. On top of that, similar to the
spurious interrupt issue, it could be that it is less severe on Intel
processors and that has kept you safe.
Presently the shortage of reports seems mostly attributable to few people
using RAID1 with SSDs.