Debian Bug report logs - #880554
max grant frames problem (domu freeze with linux-image-4.9.0-4-amd64)

Package: src:xen; Maintainer for src:xen is Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>;

Reported by: Christian Schwamborn <christian.schwamborn@nswit.de>

Date: Thu, 2 Nov 2017 08:03:02 UTC

Severity: important

Found in version xen/4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9

Done: Hans van Kranenburg <hans@knorrie.org>

Bug is archived. No further changes may be made.


Report forwarded to debian-bugs-dist@lists.debian.org, Debian Kernel Team <debian-kernel@lists.debian.org>:
Bug#880554; Package linux-image-4.9.0-4-amd64. (Thu, 02 Nov 2017 08:03:05 GMT)


Acknowledgement sent to Christian Schwamborn <christian.schwamborn@nswit.de>:
New Bug report received and forwarded. Copy sent to Debian Kernel Team <debian-kernel@lists.debian.org>. (Thu, 02 Nov 2017 08:03:05 GMT)


Message #5 received at submit@bugs.debian.org:

From: Christian Schwamborn <christian.schwamborn@nswit.de>
To: submit@bugs.debian.org
Subject: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Thu, 2 Nov 2017 08:53:41 +0100
Package: linux-image-4.9.0-4-amd64
Version: 4.9.51-1
Severity: critical

As far as I can tell right now, the domU system simply freezes. The logs simply
end at some point and resume with the messages from the next reboot. Sometimes it's
still possible to log on to the system, but nothing really works. It is
like all IO to the virtual block devices is suspended indefinitely.
Until this happens, the system seems to work without issues. As the new
kernel hasn't been out that long, I can't tell how often this happens. The first
time was the day before yesterday, and yesterday afternoon it happened
twice within two hours.

Something like 'ls' on a directory listed before still gets a result,
but everything 'new', e.g. 'vim somefile', will cause the shell to stall.
Sadly there is no visible error; services just fail to answer one by
one (maybe when they try to read/write something new to the disk, then
they simply wait for IO to happen).

For testing I installed the older kernel (last linux-image-4.9.0-3-amd64
from security - 4.9.30-2+deb9u5) and realized immediately that the
system boot time with the old kernel is a fraction of that with the
new one. For the time being, I'm staying with that on the production system.

To see if anything will be dumped on the console, I started one within a 
screen on a test machine. Now I have to generate some activity and IO 
and see if something happens there.

I haven't had the time to test the impact on the dom0 kernel yet; as far
as I have observed, the dom0 seems to be unaffected by the kernel update.



Message #10 received at 880554@bugs.debian.org:

From: Christian Schwamborn <christian.schwamborn@nswit.de>
To: 880554@bugs.debian.org
Subject: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Thu, 2 Nov 2017 22:31:54 +0100
Update:

Sadly, my production system froze again in the early afternoon today
with the older kernel as well (4.9.30-2+deb9u5), so that wasn't even a temporary
workaround. Paradoxically, nothing showed up on the xl console (within a
screen) at dom0. No errors, nothing, the VM just stopped responding. As
I was monitoring the system, there were still two open shell
connections. Some basic stuff still worked, but as soon as I tried to open
a file, the shell became unresponsive. I tried a shutdown on the other
shell, but that didn't get very far.

Searching the net for that issue I found this post at the xen project 
mailing list: 
https://lists.xen.org/archives/html/xen-users/2017-07/msg00057.html 
which sounds similar. He got some traces out of it, but no answer on the 
mailing list.

Some information about my setup:

hardware:
Xeon E5-2620 v4
board Supermicro X10SRi-F
32 GB ECC RAM
two 10 TB server disks
two I350 network adapters (onboard)

dom0:
debian stretch (up to date), kernel 4.9.51-1, xen-hypervisor 
4.8.1-1+deb9u3,
the two network adapters as a bond in a bridge
the disks: GPT, 4 partitions (1M, 256M ESP, 256M md mirror with boot, rest as
md mirror for LVM)

domu:
memory: 8192, 2 vcpus
uses a network interface on the bridge
several (thin)lvm volumes as phys devices
debian stretch (up to date)
issue with both kernel versions: 4.9.30-2+deb9u5 and 4.9.51-1

Some other domUs (wheezy, jessie and a Windows 7) seem to run fine.

Next I'll try some newer kernels for the domu, starting with the stretch 
backport kernels.



Message #15 received at 880554@bugs.debian.org:

From: Christian Schwamborn <christian.schwamborn@nswit.de>
To: 880554@bugs.debian.org
Subject: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Mon, 13 Nov 2017 13:52:39 +0100
Update:

First of all: Forget my observation about the 'system boot time'. I
mixed something up; the dom0 boot time did increase, but this was
probably due to the not (well/properly) handled LVM thin activation
during system boot.

One last thing I pulled from domu with the original kernel (4.9.51-1) 
was this top output:

top - 20:41:03 up  6:18,  2 users,  load average: 17.03, 6.98, 2.62
Tasks: 231 total,   1 running, 230 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,  0.0 id,100.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.3 sy,  0.0 ni,  0.0 id, 99.7 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  8212616 total,  1907568 free,  1485276 used,  4819772 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used.  6558984 avail Mem

at this point, the system is more or less unusable, everything depending 
on IO is dead.
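
The `wa` column in the top output is derived from the iowait counter in /proc/stat. As a rough sketch of how that figure can be reproduced from two samples of a `cpu` line, assuming the standard field order (user, nice, system, idle, iowait, irq, softirq, steal, ...): the sample lines below are made up for illustration and are not taken from the affected system.

```python
def cpu_times(stat_line):
    """Parse one 'cpu' line from /proc/stat into a list of jiffy counters."""
    fields = stat_line.split()
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(cpu_times(before), cpu_times(after))]
    # user..steal are the first eight counters; guest time is already
    # accounted for inside user/nice, so it is left out of the total.
    total = sum(deltas[:8])
    return 100.0 * deltas[4] / total if total else 0.0


# Hypothetical samples taken a few seconds apart on one CPU.
sample_before = "cpu0 100 0 50 1000 200 0 5 0 0 0"
sample_after = "cpu0 100 0 51 1000 499 0 5 0 0 0"

print(f"iowait: {iowait_percent(sample_before, sample_after):.1f}%")
```

A value pinned near 100% on every vCPU, as in the output above, means the CPUs are idle only because tasks are waiting on block IO that never completes.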

Currently my production system domU has been running for over a week with the
latest backports kernel (linux-image-4.13.0-0.bpo.1-amd64); dom0 is still
on the current stretch kernel (4.9.51-1), and it seems stable for now.
My guess would be some issue with the xen blkfront driver.
Around the end of last year I experienced something similar with jessie.
After some kernel updates those issues got better. They are not
completely gone; some jessie domUs need a reboot from time to time due
to rising wa (iowait), but the system is still responsive then, it's just
getting slower and slower by the minute.



Message #20 received at 880554@bugs.debian.org:

From: Martin von Wittich <martin.von.wittich@iserv.eu>
To: 880554@bugs.debian.org
Subject: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Tue, 14 Nov 2017 11:18:12 +0100
We're having the same problem here. For some reason, only 2 domUs are 
affected (the dom0 has a total of 22 domUs, 14 of those are running 
Debian stretch, and 13 of those are running Linux 4.9.51-1).

The `xl console` output of the first domU (according to our monitoring,
it has been hanging since yesterday 14:06):

> [ 3746.780086] INFO: task ntpd:670 blocked for more than 120 seconds.
> [ 3746.780094]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
> [ 3746.780097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3746.780223] INFO: task jbd2/xvdb6-8:8173 blocked for more than 120 seconds.
> [ 3746.780228]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
> [ 3746.780233] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3746.780304] INFO: task rsync:8188 blocked for more than 120 seconds.
> [ 3746.780308]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
> [ 3746.780311] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3867.612083] INFO: task jbd2/xvda1-8:203 blocked for more than 120 seconds.
> [ 3867.612091]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
> [ 3867.612091] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3867.612148] INFO: task ntpd:670 blocked for more than 120 seconds.
> [ 3867.612150]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
> [ 3867.612152] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3867.612238] INFO: task jbd2/xvdb6-8:8173 blocked for more than 120 seconds.
> [ 3867.612242]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
> [ 3867.612245] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3867.612287] INFO: task rsync:8188 blocked for more than 120 seconds.
> [ 3867.612291]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
> [ 3867.612294] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3988.444071] INFO: task jbd2/xvda1-8:203 blocked for more than 120 seconds.
> [ 3988.444080]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
> [ 3988.444084] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3988.444154] INFO: task ntpd:670 blocked for more than 120 seconds.
> [ 3988.444159]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
> [ 3988.444162] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3988.444266] INFO: task kworker/2:0:1533 blocked for more than 120 seconds.
> [ 3988.444271]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
> [ 3988.444274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

The other domU had a similar error message before a coworker downgraded
the kernel to 3.16 to get it working again:

> INFO: task jbd2/xvda1-8:191 blocked for more than 120 seconds.
> [  605.148107]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
> [  605.148111] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

The first domU is a backup machine, it mainly uses rsync --link-dest to 
pull backups from other machines, and is therefore rather IO intensive. 
The other domU is a firewall/router and shouldn't be IO intensive at all.
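
To see at a glance which tasks are stuck, console output like the above can be summarized with a small script. This is only a sketch: the regular expression is written against the message format shown in this report, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches kernel hung-task lines such as:
# "[ 3746.780086] INFO: task ntpd:670 blocked for more than 120 seconds."
# The greedy name group lets task names contain ':' (e.g. "kworker/2:0").
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def summarize_hung_tasks(log_text):
    """Count hung-task reports per (task name, pid) in a console log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            counts[(match.group("name"), int(match.group("pid")))] += 1
    return counts


sample = """\
> [ 3746.780086] INFO: task ntpd:670 blocked for more than 120 seconds.
> [ 3746.780223] INFO: task jbd2/xvdb6-8:8173 blocked for more than 120 seconds.
> [ 3867.612148] INFO: task ntpd:670 blocked for more than 120 seconds.
> [ 3988.444266] INFO: task kworker/2:0:1533 blocked for more than 120 seconds.
"""

for (name, pid), reports in summarize_hung_tasks(sample).most_common():
    print(f"{name} (pid {pid}): {reports} report(s)")
```

Journal threads such as jbd2/xvda1-8 showing up in the summary point at the block devices themselves rather than at one misbehaving application.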

-- 
Kind regards
Martin v. Wittich

IServ GmbH
Bültenweg 73
38106 Braunschweig

Phone:     0531-2243666-0
Fax:       0531-2243666-9
Email:     info@iserv.eu
Web:       iserv.eu

VAT ID DE265149425 | Amtsgericht Braunschweig | HRB 201822
Managing directors: Benjamin Heindl, Jörg Ludwig



Message #25 received at 880554@bugs.debian.org:

From: Valentin Vidic <Valentin.Vidic@CARNet.hr>
To: 880554@bugs.debian.org
Subject: Re: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Fri, 17 Nov 2017 07:39:20 +0100
Hi,

The problem seems to be caused by the new multi-queue xen blk driver
and I was advised by the Xen devs to increase the gnttab_max_frames=256
parameter for the hypervisor.  This has solved the blocking issue
for me and it has been running without problems for a few months now.

I/O to LUNs hang / stall under high load when using xen-blkfront
https://www.novell.com/support/kb/doc.php?id=7018590

-- 
Valentin



Message #30 received at 880554@bugs.debian.org:

From: Yves-Alexis Perez <corsac@debian.org>
To: Valentin Vidic <Valentin.Vidic@CARNet.hr>, 880554@bugs.debian.org, Christian Schwamborn <christian.schwamborn@nswit.de>
Subject: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Sat, 06 Jan 2018 15:08:26 +0100
On Fri, 2017-11-17 at 07:39 +0100, Valentin Vidic wrote:
> Hi,
> 
> The problem seems to be caused by the new multi-queue xen blk driver
> and I was advised by the Xen devs to increase the gnttab_max_frames=256
> parameter for the hypervisor.  This has solved the blocking issue
> for me and it has been running without problems for a few months now.

I'm not really fluent in Xen, but does this relate to the kernel in dom0 or
one of the domU then? 
> 
> I/O to LUNs hang / stall under high load when using xen-blkfront
> https://www.novell.com/support/kb/doc.php?id=7018590

According to that link, the fix seems to be configuration rather than code.
Does this mean this bug against the kernel should be closed?

Regards,
-- 
Yves-Alexis

Message #35 received at 880554@bugs.debian.org:

From: Valentin Vidic <Valentin.Vidic@CARNet.hr>
To: Yves-Alexis Perez <corsac@debian.org>
Cc: 880554@bugs.debian.org, Christian Schwamborn <christian.schwamborn@nswit.de>
Subject: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Sat, 6 Jan 2018 15:23:56 +0100
On Sat, Jan 06, 2018 at 03:08:26PM +0100, Yves-Alexis Perez wrote:
> According to that link, the fix seems to be configuration rather than code.
> Does this mean this bug against the kernel should be closed?

Yes, the problem seems to be in the Xen hypervisor and not the Linux
kernel itself.  The default value for the gnttab_max_frames parameter
needs to be increased to avoid domU disk IO hangs, for example:

  GRUB_CMDLINE_XEN="dom0_mem=10240M gnttab_max_frames=256"

So either close the bug or reassign it to xen-hypervisor package so
they can increase the default value for this parameter in the
hypervisor code.

-- 
Valentin
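
For hosts that already carry other hypervisor options, the change amounts to adding or replacing one token in the GRUB_CMDLINE_XEN value. A minimal sketch of that string handling, with a made-up helper name and the example values from this message:

```python
def set_xen_option(cmdline, key, value):
    """Return cmdline with key=value set, replacing any existing key=... token."""
    tokens = [t for t in cmdline.split() if not t.startswith(key + "=")]
    tokens.append(f"{key}={value}")
    return " ".join(tokens)


# Example value as it might appear in /etc/default/grub.
current = "dom0_mem=10240M"
updated = set_xen_option(current, "gnttab_max_frames", 256)
print(f'GRUB_CMDLINE_XEN="{updated}"')
```

The sketch only builds the string; the new value still has to be written to the GRUB configuration, followed by update-grub and a reboot of the host, before the hypervisor picks it up.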



Message #40 received at 880554@bugs.debian.org:

From: Yves-Alexis Perez <corsac@debian.org>
To: Valentin Vidic <Valentin.Vidic@CARNet.hr>, pkg-xen-devel@lists.alioth.debian.org
Cc: 880554@bugs.debian.org, Christian Schwamborn <christian.schwamborn@nswit.de>
Subject: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Sat, 06 Jan 2018 16:11:41 +0100
control: reassign -1 xen-hypervisor-4.8-amd64

On Sat, 2018-01-06 at 15:23 +0100, Valentin Vidic wrote:
> On Sat, Jan 06, 2018 at 03:08:26PM +0100, Yves-Alexis Perez wrote:
> > According to that link, the fix seems to be configuration rather than
> > code.
> > Does this mean this bug against the kernel should be closed?
> 
> Yes, the problem seems to be in the Xen hypervisor and not the Linux
> kernel itself.  The default value for the gnttab_max_frames parameter
> needs to be increased to avoid domU disk IO hangs, for example:
> 
>   GRUB_CMDLINE_XEN="dom0_mem=10240M gnttab_max_frames=256"
> 
> So either close the bug or reassign it to xen-hypervisor package so
> they can increase the default value for this parameter in the
> hypervisor code.
> 
Ok, I'll reassign and let the Xen maintainers handle that (maybe in a stable
update).

@Xen maintainers: see the complete bug log for more information, but basically
it seems that domU freezes happen with the “new” multi-queue xen blk
driver, and the fix is to increase a configuration value. Valentin suggests
making that the default.

Regards,
-- 
Yves-Alexis

Bug reassigned from package 'linux-image-4.9.0-4-amd64' to 'xen-hypervisor-4.8-amd64'. Request was from Yves-Alexis Perez <corsac@debian.org> to 880554-submit@bugs.debian.org. (Sat, 06 Jan 2018 15:15:03 GMT)


No longer marked as found in versions linux/4.9.51-1. Request was from Yves-Alexis Perez <corsac@debian.org> to 880554-submit@bugs.debian.org. (Sat, 06 Jan 2018 15:15:04 GMT)


Message #49 received at 880554@bugs.debian.org:

From: Hans van Kranenburg <hans@knorrie.org>
To: Yves-Alexis Perez <corsac@debian.org>, 880554@bugs.debian.org, Valentin Vidic <Valentin.Vidic@CARNet.hr>, pkg-xen-devel@lists.alioth.debian.org
Cc: Christian Schwamborn <christian.schwamborn@nswit.de>
Subject: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Sat, 6 Jan 2018 23:17:00 +0100
Hi Christian and everyone else,

Ack on reassign to Xen.

On 01/06/2018 04:11 PM, Yves-Alexis Perez wrote:
> control: reassign -1 xen-hypervisor-4.8-amd64
> 
> On Sat, 2018-01-06 at 15:23 +0100, Valentin Vidic wrote:
>> On Sat, Jan 06, 2018 at 03:08:26PM +0100, Yves-Alexis Perez wrote:
>>> According to that link, the fix seems to be configuration rather than
>>> code.
>>> Does this mean this bug against the kernel should be closed?
>>
>> Yes, the problem seems to be in the Xen hypervisor and not the Linux
>> kernel itself.  The default value for the gnttab_max_frames parameter
>> needs to be increased to avoid domU disk IO hangs, for example:
>>
>>   GRUB_CMDLINE_XEN="dom0_mem=10240M gnttab_max_frames=256"
>>
>> So either close the bug or reassign it to xen-hypervisor package so
>> they can increase the default value for this parameter in the
>> hypervisor code.
>>
> Ok, I'll reassign and let the Xen maintainers handle that (maybe in a stable
> update).
> 
> @Xen maintainers: see the complete bug log for more information, but basically
> it seems that a domu freezes happens with the “new” multi-queue xen blk
> driver, and the fix is to increase a configuration value. Valentin suggests
> adding that to the default.

The dom0 gnttab_max_frames boot setting is about how many pages are
allocated to fill with 'grants'. The grant concept is related to sharing
information between the dom0 and domU.

It allows memory pages to be shared back and forth, so that e.g. a domU
can fill a page with outgoing network packets or disk writes. Then the
dom0 can take over ownership of the memory page and read the contents
and do its trick with it. In this way, zero-copy IO is implemented.

When running xen domUs, the total amount of network interfaces and block
devices that are attached to all of the domUs that are running (and,
apparently, how heavy they are used) cause the usage of these grant guys
to increase. At some point you run out of grants because all of the
pages are filled.
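
To put numbers on "running out of grants", here is a back-of-the-envelope sketch. It assumes 4096-byte frames and 8-byte version-1 grant entries, i.e. 512 entries per frame; neither figure is stated in this thread, and version-2 entries are larger, so treat the results as an upper bound.

```python
PAGE_SIZE = 4096      # bytes per grant frame (assumption: one x86 page)
GRANT_ENTRY_V1 = 8    # bytes per version-1 grant entry (assumption)


def grant_capacity(frames, entry_size=GRANT_ENTRY_V1):
    """Number of grant entries that fit into the given number of frames."""
    return frames * (PAGE_SIZE // entry_size)


for frames in (32, 64, 128, 256):
    print(f"gnttab_max_frames={frames}: up to {grant_capacity(frames)} grants")
```

Under these assumptions the default of 32 frames allows 16384 grants per domain, and each doubling of the limit doubles that figure.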

I agree that the upstream default, 32, is quite low. This is indeed a
configuration issue. I myself ran into this years ago with a growing
number of domUs and network interfaces in use. We have been using
gnttab_max_nr_frames=128 for a long time already instead.

I was tempted to reassign src:xen, but in the meantime, this option has
already been removed again, so this bug does not apply to unstable
(well, as soon as we get something new in there) any more (as far as I
can see quickly now).

https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=18b1be5e324bcbe2f10898b116db641d404b3d30

Including a better default for gnttab_max_nr_frames in the grub config
in the debian xen package in stable sounds reasonable from a best
practices point of view.

But I would be interested in learning more about the relation with
block mq, though. Does using newer linux kernels (like from
stretch-backports) for the domU always put a bigger strain on this? Or,
is it just related to the overall number of network devices and block
devices you are adding to your domUs in your specific own situation, and
did you just trip over the default limit?

In any case, the grub option thing is a conffile, so any user upgrading
has to accept/merge the change, so we won't cause a stable user to just
run out of memory because of a few extra kilobytes of memory usage
without notice.

Hans van Kranenburg

P.S. Debian Xen team is in the process of being "rebooted" while the
current shitstorm about meltdown/spectre is going on, so don't hold your
breath. :)


Message #54 received at 880554@bugs.debian.org:

From: Valentin Vidic <Valentin.Vidic@CARNet.hr>
To: Hans van Kranenburg <hans@knorrie.org>
Cc: Yves-Alexis Perez <corsac@debian.org>, 880554@bugs.debian.org, pkg-xen-devel@lists.alioth.debian.org, Christian Schwamborn <christian.schwamborn@nswit.de>
Subject: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Sun, 7 Jan 2018 10:05:07 +0100
On Sat, Jan 06, 2018 at 11:17:00PM +0100, Hans van Kranenburg wrote:
> I agree that the upstream default, 32 is quite low. This is indeed a
> configuration issue. I myself ran into this years ago with a growing
> number of domUs and network interfaces in use. We have been using
> gnttab_max_nr_frames=128 for a long time already instead.
> 
> I was tempted to reassign src:xen, but in the meantime, this option has
> already been removed again, so this bug does not apply to unstable
> (well, as soon as we get something new in there) any more (as far as I
> can see quickly now).
> 
> https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=18b1be5e324bcbe2f10898b116db641d404b3d30

It does not seem to be removed but increased the default from 32 to 64?

> Including a better default for gnttab_max_nr_frames in the grub config
> in the debian xen package in stable sounds reasonable from a best
> practices point of view.
> 
> But, I would be interested in learning more about the relation with
> block mq although. Does using newer linux kernels (like from
> stretch-backports) for the domU always put a bigger strain on this? Or,
> is it just related to the overall number of network devices and block
> devices you are adding to your domUs in your specific own situation, and
> did you just trip over the default limit?

After upgrading the domU and dom0 from jessie to stretch on a big postgresql
database server (50 VCPUs, 200GB RAM), it started freezing very soon
after boot, as posted here:

  https://lists.xen.org/archives/html/xen-users/2017-07/msg00057.html

It did not have these problems while running jessie versions of the
hypervisor and the kernels.  The problem seems to be related to the
number of CPUs used, as smaller domUs with a few VCPUs did not hang
like this.  Could it be that large number of VCPUs -> more queues in
Xen mq driver -> faster exhaustion of allocated pages?

-- 
Valentin
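
The "more queues -> faster exhaustion" hypothesis can be illustrated with a rough worst-case estimate. Everything numeric here is an assumption for illustration only: a ring of 32 requests per queue, up to 32 data segments per request, one extra grant for the ring page itself, and 512 grants per frame. The device and queue counts are hypothetical and not taken from the server described above.

```python
import math

GRANTS_PER_FRAME = 512  # assumption: 4096-byte frame / 8-byte v1 entry


def worst_case_grants(devices, queues, ring_size=32, segments=32):
    """Upper bound on grants if every ring slot of every queue is in flight."""
    # Each in-flight request may reference `segments` granted pages,
    # plus one grant for the shared ring page of each queue.
    return devices * queues * (ring_size * segments + 1)


def frames_needed(grants):
    """Grant frames required to hold the given number of grant entries."""
    return math.ceil(grants / GRANTS_PER_FRAME)


# Hypothetical domU: 4 block devices, compared at 1 and 4 queues per device.
for queues in (1, 4):
    grants = worst_case_grants(devices=4, queues=queues)
    print(f"{queues} queue(s): {grants} grants, {frames_needed(grants)} frames")
```

With these assumed numbers, four devices at four queues each already need more than 32 frames in the worst case, while the same devices with a single queue stay well below the limit. Network interfaces consume grants as well and are not counted here.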



Message #59 received at 880554@bugs.debian.org:

From: Hans van Kranenburg <hans@knorrie.org>
To: Valentin Vidic <Valentin.Vidic@CARNet.hr>
Cc: Yves-Alexis Perez <corsac@debian.org>, 880554@bugs.debian.org, pkg-xen-devel@lists.alioth.debian.org, Christian Schwamborn <christian.schwamborn@nswit.de>
Subject: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Sun, 7 Jan 2018 19:36:40 +0100
On 01/07/2018 10:05 AM, Valentin Vidic wrote:
> On Sat, Jan 06, 2018 at 11:17:00PM +0100, Hans van Kranenburg wrote:
>> I agree that the upstream default, 32 is quite low. This is indeed a
>> configuration issue. I myself ran into this years ago with a growing
>> number of domUs and network interfaces in use. We have been using
>> gnttab_max_nr_frames=128 for a long time already instead.
>>
>> I was tempted to reassign src:xen, but in the meantime, this option has
>> already been removed again, so this bug does not apply to unstable
>> (well, as soon as we get something new in there) any more (as far as I
>> can see quickly now).
>>
>> https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=18b1be5e324bcbe2f10898b116db641d404b3d30
> 
> It does not seem to be removed but increased the default from 32 to 64?

Ehm, yes you are correct. I was misreading and mixing up things. Let's
try again...

The referenced commit is talking about removal of the obsolete
gnttab_max_nr_frames from the documentation, so not related.

>> Including a better default for gnttab_max_nr_frames in the grub config
>> in the debian xen package in stable sounds reasonable from a best
>> practices point of view.

So, that's gnttab_max_frames, not gnttab_max_nr_frames... I was reading
out loud from my old Jessie dom0 grub config.

>> But, I would be interested in learning more about the relation with
>> block mq although. Does using newer linux kernels (like from
>> stretch-backports) for the domU always put a bigger strain on this? Or,
>> is it just related to the overall number of network devices and block
>> devices you are adding to your domUs in your specific own situation, and
>> did you just trip over the default limit?
> 
> After upgrading the domU and dom0 from jessie to stretch on a big postgresql
> database server (50 VCPUs, 200GB RAM) it starting freezing very soon
> after boot as posted there here:
> 
>   https://lists.xen.org/archives/html/xen-users/2017-07/msg00057.html
> 
> It did not have these problems while running jessie versions of the
> hypervisor and the kernels.  The problem seems to be related to the
> number of CPUs used, as smaller domUs with a few VCPUs did not hang
> like this.  Could it be that large number of VCPUs -> more queues in
> Xen mq driver -> faster exhaustion of allocated pages?

That exactly seems to be the case yes. Maybe this is also one of the
reasons that the default max was increased in Xen.

"xen/blkback: make pool of persistent grants and free pages per-queue"
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d4bf0065b7251afb723a29b2fd58f7c38f8ce297

Recently a tool was added to "dump guest grant table info". Could you
see if it compiles on the 4.8 source and whether it works? It would be
interesting to get some idea about how high or low these numbers are in
different scenarios. I mean, I'm using 128, you 256, and we don't even
know if the actual value is maybe just above 32? :]

https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=df36d82e3fc91bee2ff1681fd438c815fa324b6a

If this is something users are going to run into while not doing more
unusual things like having dozens of vcpus or network interfaces, then
changing the default could prevent hours of frustration and debugging
for them.

The least invasive option is to add the option to the documentation of
GRUB_CMDLINE_XEN_DEFAULT in /etc/default/grub.d/xen.cfg like "If you
have more than xyz disks or network interfaces in a domU, use this, blah
blah."

Actually setting the option there is not a good idea, because people can
still have GRUB_CMDLINE_XEN_DEFAULT set in e.g. /etc/default/grub, so
that would override and damage things.

Other option is to add a patch to drag the defaults in the upstream code
from 32 to 64, including documentation etc.

Sorry for the earlier confusion,
Hans



Message #64 received at 880554@bugs.debian.org:

From: Valentin Vidic <Valentin.Vidic@CARNet.hr>
To: Hans van Kranenburg <hans@knorrie.org>
Cc: Yves-Alexis Perez <corsac@debian.org>, 880554@bugs.debian.org, pkg-xen-devel@lists.alioth.debian.org, Christian Schwamborn <christian.schwamborn@nswit.de>
Subject: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Mon, 8 Jan 2018 13:38:20 +0100
On Sun, Jan 07, 2018 at 07:36:40PM +0100, Hans van Kranenburg wrote:
> Recently a tool was added to "dump guest grant table info". You could
> see if it compiles on the 4.8 source and see if it works? Would be
> interesting to get some idea about how high or low these numbers are in
> different scenarios. I mean, I'm using 128, you 256, and we even don't
> know if the actual value is maybe just above 32? :]
> 
> https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=df36d82e3fc91bee2ff1681fd438c815fa324b6a

The diag tool does not build inside xen-4.8:

xen-diag.c: In function ‘gnttab_query_size_func’:
xen-diag.c:50:10: error: implicit declaration of function ‘xc_gnttab_query_size’ [-Werror=implicit-function-declaration]
     rc = xc_gnttab_query_size(xch, &query);
          ^~~~~~~~~~~~~~~~~~~~

but I think the same info is available in the thread on xen-devel:

  https://www.mail-archive.com/xen-devel@lists.xen.org/msg116910.html

When the domU hangs, crash reports nr_grant_frames=32. After increasing
gnttab_max_frames to 256, the domU reports using nr_grant_frames=59.

So the new default of gnttab_max_frames=64 might be a bit close to 59,
but I suppose 128 would be just as safe as the 256 I currently use (if
you prefer 128).

> If this is something users are going to run into while not doing more
> unusual things like having dozens of vcpus or network interfaces, then
> changing the default could prevent hours of frustration and debugging
> for them.

Yes, the failure case is quite nasty, as the domU just hangs without
even suggesting grant frames might be the problem. Not sure if domU
can detect this situation at all?

Anyway, if the value cannot be increased, the situation should at least
be mentioned in the NEWS.Debian of the xen package.

-- 
Valentin



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Fri, 12 Jan 2018 00:39:03 GMT) (full text, mbox, link).


Acknowledgement sent to Hans van Kranenburg <hans@knorrie.org>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Fri, 12 Jan 2018 00:39:04 GMT) (full text, mbox, link).


Message #69 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Hans van Kranenburg <hans@knorrie.org>
To: Valentin Vidic <Valentin.Vidic@CARNet.hr>
Cc: 880554@bugs.debian.org, Christian Schwamborn <christian.schwamborn@nswit.de>
Subject: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Fri, 12 Jan 2018 01:34:10 +0100
Hi,

On 08/01/2018 13:38, Valentin Vidic wrote:
> On Sun, Jan 07, 2018 at 07:36:40PM +0100, Hans van Kranenburg wrote:
>> Recently a tool was added to "dump guest grant table info". You could
>> see if it compiles on the 4.8 source and see if it works? Would be
>> interesting to get some idea about how high or low these numbers are in
>> different scenarios. I mean, I'm using 128, you 256, and we even don't
>> know if the actual value is maybe just above 32? :]
>>
>> https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=df36d82e3fc91bee2ff1681fd438c815fa324b6a
> 
> The diag tool does not build inside xen-4.8:
> 
> xen-diag.c: In function ‘gnttab_query_size_func’:
> xen-diag.c:50:10: error: implicit declaration of function ‘xc_gnttab_query_size’ [-Werror=implicit-function-declaration]
>      rc = xc_gnttab_query_size(xch, &query);
>           ^~~~~~~~~~~~~~~~~~~~

Too bad. :|

> but I think the same info is available in the thread on xen-devel:
> 
>   https://www.mail-archive.com/xen-devel@lists.xen.org/msg116910.html

Ah, great, didn't see that one yet.

> When the domU hangs crash reports nr_grant_frames=32. After increasing
> the gnttab_max_frames=256 the domU reports using nr_grant_frames=59.
> 
> So the new default of gnttab_max_frames=64 might be a bit close to 59,
> but I suppose 128 would be just as safe as 256 I currently use (if
> you prefer 128).

Is the 59 your lots-o-vcpu-monster?

I just finished with the initial preparation of a Xen 4.10 package for
unstable and have it running in my test environment.

So, yay, I have xen-diag now.

-# /usr/lib/xen-4.10/bin/xen-diag
xen-diag: xen diagnostic utility
Usage: xen-diag command [args]
Commands:
  help                       display this help
  gnttab_query_size <domid>  dump the current and max grant frames for <domid>

-# /usr/lib/xen-4.10/bin/xen-diag gnttab_query_size 0
domid=0: nr_frames=1, max_nr_frames=64

That's a 10vcpu PVHv2 domU with two disks attached, running 4.14 guest
kernel, which has only been booted up and is idling now.

So, at least, nice to have some extra tooling available to help.

>> If this is something users are going to run into while not doing more
>> unusual things like having dozens of vcpus or network interfaces, then
>> changing the default could prevent hours of frustration and debugging
>> for them.
> 
> Yes, the failure case is quite nasty, as the domU just hangs without
> even suggesting grant frames might be the problem. Not sure if domU
> can detect this situation at all?

I can't comment on that, since I don't know. Anyone who does, please
chime in.

> Anyway, if the value cannot be increased, the situation should at least
> be mentioned in the NEWS.Debian of the xen package.

Since this has been reported multiple times already, and upstream has
bumped it to 64, my verdict would be:

* Bump default to 64 already like upstream did in a later version.
* Properly document this issue in NEWS.Debian and also mention the
option with documentation in the template grub config file, so there's a
bigger chance users who run unusually big numbers of disks/nics/cpus/etc
will find it.

...so we also better accommodate users who are using newer kernels in the
domU with blk-mq, and prevent them from wasting too much time and
getting frustrated for no reason.

I wouldn't be comfortable with bumping it above the current latest
greatest upstream default, since it would mean we would need to keep a
patch in later versions.

I'll prepare a patch to bump the default to 64 in 4.8, taking changes
from the upstream patch. I probably have to ask upstream (Juergen Gross)
why the commit that was referenced earlier bumps the default without
mentioning it in the commit message.

Since I just joined the Debian Xen team, I'll run anything I can come up
with through the team to get it approved. We'll target the next Stretch
stable update to get it in.

Thanks,
Hans



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Fri, 12 Jan 2018 11:45:03 GMT) (full text, mbox, link).


Acknowledgement sent to Valentin Vidic <Valentin.Vidic@CARNet.hr>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Fri, 12 Jan 2018 11:45:03 GMT) (full text, mbox, link).


Message #74 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Valentin Vidic <Valentin.Vidic@CARNet.hr>
To: Hans van Kranenburg <hans@knorrie.org>
Cc: 880554@bugs.debian.org, Christian Schwamborn <christian.schwamborn@nswit.de>
Subject: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Fri, 12 Jan 2018 12:43:03 +0100
On Fri, Jan 12, 2018 at 01:34:10AM +0100, Hans van Kranenburg wrote:
> Is the 59 your lots-o-vcpu-monster?

Yes, that is the one with a larger vcpu count.

> I just finished with the initial preparation of a Xen 4.10 package for
> unstable and have it running in my test environment.

Unrelated to this issue, but can you tell me if there is a way to
mitigate Meltdown with the Xen 4.8 dom0/domU(PV) running stretch?

> Since this has been reported multiple times already, and upstream has
> bumped it to 64, my verdict would be:
> 
> * Bump default to 64 already like upstream did in a later version.
> * Properly document this issue in NEWS.Debian and also mention the
> option with documentation in the template grub config file, so there's a
> bigger chance users who run unusual big numbers of disks/nics/cpus/etc
> will find it.
> 
> ...so we also better accommodate users who are using newer kernels in the
> domU with blk-mq, and prevent them from wasting too much time and
> getting frustrated for no reason.
> 
> I wouldn't be comfortable with bumping it above the current latest
> greatest upstream default, since it would mean we would need to keep a
> patch in later versions.
> 
> I'll prepare a patch to bump the default to 64 in 4.8, taking changes
> from the upstream patch. I probably have to ask upstream (Juergen Gross)
> why the commit that was referenced earlier bumps the default without
> mentioning it in the commit message.

Thanks, 64 should be a good start.  If there are still problems
reported with that it can be reconsidered.

-- 
Valentin



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Fri, 12 Jan 2018 13:30:12 GMT) (full text, mbox, link).


Acknowledgement sent to Hans van Kranenburg <hans@knorrie.org>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Fri, 12 Jan 2018 13:30:12 GMT) (full text, mbox, link).


Message #79 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Hans van Kranenburg <hans@knorrie.org>
To: Valentin Vidic <Valentin.Vidic@CARNet.hr>
Cc: 880554@bugs.debian.org, Christian Schwamborn <christian.schwamborn@nswit.de>
Subject: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Fri, 12 Jan 2018 14:29:01 +0100
On 01/12/2018 12:43 PM, Valentin Vidic wrote:
> On Fri, Jan 12, 2018 at 01:34:10AM +0100, Hans van Kranenburg wrote:
>> Is the 59 your lots-o-vcpu-monster?
> 
> Yes, that is the one with a larger vcpu count.

Check.

>> I just finished with the initial preparation of a Xen 4.10 package for
>> unstable and have it running in my test environment.
> 
> Unrelated to this issue, but can you tell me if there is a way to
> mitigate Meltdown with the Xen 4.8 dom0/domU(PV) running stretch?

There are no updates for the hypervisor itself yet that we can
distribute in Debian.

This is your starting point for information:

https://xenbits.xen.org/xsa/advisory-254.html
https://blog.xenproject.org/2018/01/04/xen-project-spectremeltdown-faq/

So, 64-bit PV guests can attack the hypervisor and other guests. If you
have untrusted PV guests the short term choices are to 1) convert them
to HVM or 2) shield your hypervisor from them by following the
instructions for the 'PV-in-PVH/HVM shim approach' (where currently for
Xen 4.8 only PV-in-HVM is relevant).

There's still a pending security update for Stretch to address the
previous XSAs (up to 251), and it seems best to piggyback on that and put
some guidance and information for users in there as well.

If you use IRC, you can also join #debian-xen on OFTC if you want, to
discuss things. There's a bunch of people there sharing information and
strategies about what to do with their Debian systems.

>> Since this has been reported multiple times already, and upstream has
>> bumped it to 64, my verdict would be:
>>
>> * Bump default to 64 already like upstream did in a later version.
>> * Properly document this issue in NEWS.Debian and also mention the
>> option with documentation in the template grub config file, so there's a
>> bigger chance users who run unusual big numbers of disks/nics/cpus/etc
>> will find it.
>>
>> ...so we also better accommodate users who are using newer kernels in the
>> domU with blk-mq, and prevent them from wasting too much time and
>> getting frustrated for no reason.
>>
>> I wouldn't be comfortable with bumping it above the current latest
>> greatest upstream default, since it would mean we would need to keep a
>> patch in later versions.
>>
>> I'll prepare a patch to bump the default to 64 in 4.8, taking changes
>> from the upstream patch. I probably have to ask upstream (Juergen Gross)
>> why the commit that was referenced earlier bumps the default without
>> mentioning it in the commit message.
> 
> Thanks, 64 should be a good start.  If there are still problems
> reported with that it can be reconsidered.

Hans



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Mon, 15 Jan 2018 10:21:08 GMT) (full text, mbox, link).


Acknowledgement sent to Christian Schwamborn <christian.schwamborn@nswit.de>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Mon, 15 Jan 2018 10:21:09 GMT) (full text, mbox, link).


Message #84 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Christian Schwamborn <christian.schwamborn@nswit.de>
To: Hans van Kranenburg <hans@knorrie.org>, Valentin Vidic <Valentin.Vidic@CARNet.hr>
Cc: Yves-Alexis Perez <corsac@debian.org>, 880554@bugs.debian.org, pkg-xen-devel@lists.alioth.debian.org
Subject: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Mon, 15 Jan 2018 11:12:03 +0100
Hi Hans and Valentin,

First of all: thanks for your help and explanations, they are very
helpful. I was on vacation last week and couldn't answer right away.

On 07.01.2018 19:36, Hans van Kranenburg wrote:
> If this is something users are going to run into while not doing more
> unusual things like having dozens of vcpus or network interfaces, then
> changing the default could prevent hours of frustration and debugging
> for them.

As a reference:

Dom0 is stretch.

0 root@zero:~# xl list
Name           ID   Mem VCPUs     State    Time(s)
Domain-0        0  1961     2     r-----  407972.8
xaver-jessie   10  2048     2     -b----  177520.8
ustrich-jessie 12  2048     2     -b----    8555.9
ourea-stretch  14  8192     2     -b----  167352.7
arriba         17  4096     2     -b----    5108.3

All DomU's have one network interface on a bridge.
xaver-jessie has 5 block devices (phys, lvm)
ustrich-jessie has 4 block devices (phys, lvm)
ourea-stretch has 16 block devices (phys, lvm)
arriba has just one (phys, lvm) and is a hvm windows system

As you can see, nothing crazy with lots of vcpus or network interfaces.

The crashing (freezing) DomU was ourea-stretch, which is the one with 
the most load (smb, some web services, cal/card dav, psql, ldap, 
postfix, cyrus ...). As mentioned, the freezes stopped after using the 
backports kernel, nothing else changed. I was desperate at that time to
get this newly installed system to work and frankly stopped all planned
updates to stretch on other systems at that point until I know what is
going on.

Is there an easy way to get/monitor the number of used grant frames? As I
understand it, the xen-diag tool you mentioned doesn't compile in xen 4.8?

Christian





Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Mon, 15 Jan 2018 11:03:05 GMT) (full text, mbox, link).


Acknowledgement sent to Valentin Vidic <Valentin.Vidic@CARNet.hr>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Mon, 15 Jan 2018 11:03:05 GMT) (full text, mbox, link).


Message #89 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Valentin Vidic <Valentin.Vidic@CARNet.hr>
To: Christian Schwamborn <christian.schwamborn@nswit.de>
Cc: Hans van Kranenburg <hans@knorrie.org>, Yves-Alexis Perez <corsac@debian.org>, 880554@bugs.debian.org, pkg-xen-devel@lists.alioth.debian.org
Subject: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Mon, 15 Jan 2018 11:59:19 +0100
[Message part 1 (text/plain, inline)]
On Mon, Jan 15, 2018 at 11:12:03AM +0100, Christian Schwamborn wrote:
> Is there an easy way to get/monitor the used 'grants' frames? As I understand
> it, the xen-diag tool you mentioned doesn't compile in xen 4.8?

I just gave it another try and after modifying xen-diag.c
a bit to work with 4.8 here is what I get:

  # ./xen-diag gnttab_query_size 0
  domid=0: nr_frames=4, max_nr_frames=256
  # ./xen-diag gnttab_query_size 1
  domid=1: nr_frames=11, max_nr_frames=256
  
  # ./xen-diag  gnttab_query_size 0
  domid=0: nr_frames=4, max_nr_frames=256
  # ./xen-diag  gnttab_query_size 1
  domid=1: nr_frames=11, max_nr_frames=256
  # ./xen-diag  gnttab_query_size 5
  domid=5: nr_frames=11, max_nr_frames=256

so currently at 11, not high at all.

Attaching a patch for stretch xen package if you want to check
your hosts.

-- 
Valentin
[xen-diag.patch (text/x-diff, attachment)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Mon, 15 Jan 2018 11:09:03 GMT) (full text, mbox, link).


Acknowledgement sent to Valentin Vidic <Valentin.Vidic@CARNet.hr>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Mon, 15 Jan 2018 11:09:03 GMT) (full text, mbox, link).


Message #94 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Valentin Vidic <Valentin.Vidic@CARNet.hr>
To: Christian Schwamborn <christian.schwamborn@nswit.de>
Cc: Hans van Kranenburg <hans@knorrie.org>, Yves-Alexis Perez <corsac@debian.org>, 880554@bugs.debian.org, pkg-xen-devel@lists.alioth.debian.org
Subject: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Mon, 15 Jan 2018 12:07:39 +0100
On Mon, Jan 15, 2018 at 11:12:03AM +0100, Christian Schwamborn wrote:
> Is there an easy way to get/monitor the used 'grants' frames? As I understand
> it, the xen-diag tool you mentioned doesn't compile in xen 4.8?

Here is a status from another host:

domid=0: nr_frames=4, max_nr_frames=256
domid=487: nr_frames=6, max_nr_frames=256
domid=488: nr_frames=5, max_nr_frames=256
domid=489: nr_frames=4, max_nr_frames=256
domid=490: nr_frames=6, max_nr_frames=256
domid=491: nr_frames=7, max_nr_frames=256
domid=492: nr_frames=4, max_nr_frames=256
domid=493: nr_frames=4, max_nr_frames=256
domid=494: nr_frames=29, max_nr_frames=256
domid=495: nr_frames=4, max_nr_frames=256
domid=496: nr_frames=4, max_nr_frames=256
domid=497: nr_frames=5, max_nr_frames=256
domid=498: nr_frames=4, max_nr_frames=256
domid=499: nr_frames=4, max_nr_frames=256
domid=500: nr_frames=4, max_nr_frames=256
domid=501: nr_frames=4, max_nr_frames=256
domid=503: nr_frames=5, max_nr_frames=256
domid=572: nr_frames=13, max_nr_frames=256
domid=575: nr_frames=7, max_nr_frames=256

Most of the hosts have older kernels and nr_frames < 10.

And then 494 has a stretch kernel and only 4 vcpus but is quite close to
the current default of 32.  Maybe it just depends on the amount of disk IO?
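
A listing like the one above can be filtered for domains that are getting
close to the limit. The following is only a sketch: the function name and
the default threshold of 24 (75% of the Xen 4.8 default of 32) are
assumptions for illustration, and the input is expected in the
"domid=N: nr_frames=X, max_nr_frames=Y" format printed by xen-diag.

```shell
# Print the domains whose nr_frames is at or above a threshold.
# $1: threshold (assumed default: 24, i.e. 75% of the 4.8 default of 32).
# stdin: lines as printed by "xen-diag gnttab_query_size <domid>".
flag_high_grant_usage() {
    threshold="${1:-24}"
    # Reduce each matching line to "domid used max"; other lines are ignored.
    sed -n 's/^ *domid=\([0-9]*\): nr_frames=\([0-9]*\), max_nr_frames=\([0-9]*\)$/\1 \2 \3/p' |
    while read -r domid used max; do
        if [ "$used" -ge "$threshold" ]; then
            echo "domid=$domid uses $used of $max grant frames"
        fi
    done
}

# Possible usage on a dom0 (column 2 of "xl list" is the domain ID):
#   for id in $(xl list | awk 'NR > 1 { print $2 }'); do
#       /usr/lib/xen-4.8/bin/xen-diag gnttab_query_size "$id"
#   done | flag_high_grant_usage
```

For the host above this would single out domid 494 with its 29 frames.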

-- 
Valentin



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Fri, 23 Feb 2018 15:27:03 GMT) (full text, mbox, link).


Acknowledgement sent to Hans van Kranenburg <hans@knorrie.org>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Fri, 23 Feb 2018 15:27:03 GMT) (full text, mbox, link).


Message #99 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Hans van Kranenburg <hans@knorrie.org>
To: Valentin Vidic <Valentin.Vidic@CARNet.hr>, Christian Schwamborn <christian.schwamborn@nswit.de>
Cc: 880554@bugs.debian.org, Ian Jackson <ijackson@chiark.greenend.org.uk>
Subject: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Fri, 23 Feb 2018 16:18:01 +0100
Hi Valentin, Christian,
Finally getting back to you about the max grant frames issue.

We discussed this with upstream Xen developers, and a different fix was
proposed. I would really appreciate if you could test it and confirm it
also solves the issue. Testing does not involve recompiling the
hypervisor with patches etc.

The deadline for changes for the 9.4 Stretch point release is the end of
next week, so we aim to get it in then.

The cause of the problem is, as discussed earlier, the "blkback
multipage ring" changes a.k.a. "multi-queue xen blk driver" which eats
grant frame resources way too fast.

As shown in the reports, this issue already exists while using the
normal stretch kernel (not only newer backports) in combination with Xen
4.8.

The upstream change we found earlier that doubles the max number to 64
is part of a bigger change that touches more of the inner workings,
making Xen better able to handle the domU kernel behavior. This whole
change is not going to be backported to Xen 4.8.


Can you please test the following, instead of setting the
gnttab_max_frames value:

Create the file
    /etc/modprobe.d/xen-blkback-fewer-gnttab-frames
with contents...

# apropos of #880554
# workaround is not required for Xen 4.9 and later
options xen_blkback max_ring_page_order=0
options xen_blkback max_queues=1

...and reboot.

This will cause the domU kernels to behave more in a way that Xen 4.8
can cope with.

Regards,
Hans



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Mon, 26 Feb 2018 08:39:05 GMT) (full text, mbox, link).


Acknowledgement sent to Christian Schwamborn <christian.schwamborn@nswit.de>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Mon, 26 Feb 2018 08:39:05 GMT) (full text, mbox, link).


Message #104 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Christian Schwamborn <christian.schwamborn@nswit.de>
To: Hans van Kranenburg <hans@knorrie.org>, Valentin Vidic <Valentin.Vidic@CARNet.hr>
Cc: 880554@bugs.debian.org, Ian Jackson <ijackson@chiark.greenend.org.uk>
Subject: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Mon, 26 Feb 2018 09:29:30 +0100
Hi Hans,

I can try, but the only system I can really test this on is a production
system, as it is the one that 'reliably' shows this issue (and I don't want
to crash it on purpose on a regular basis). Since I set gnttab_max_frames
to a higher value it has been running smoothly. If you're confident this
will work I can try this in the evening, when all users have logged off.

Best regards,
Christian


On 23.02.2018 16:18, Hans van Kranenburg wrote:
> Hi Valentin, Christian,
> Finally getting back to you about the max grant frames issue.
> 
> We discussed this with upstream Xen developers, and a different fix was
> proposed. I would really appreciate if you could test it and confirm it
> also solves the issue. Testing does not involve recompiling the
> hypervisor with patches etc.
> 
> The deadline for changes for the 9.4 Stretch point release is end next
> week, so we aim to get it in then.
> 
> The cause of the problem is, like earlier discussed, the "blkback
> multipage ring" changes a.k.a. "multi-queue xen blk driver" which eats
> grant frame resources way too fast.
> 
> As shown in the reports, this issue already exists while using the
> normal stretch kernel (not only newer backports) in combination with Xen
> 4.8.
> 
> The upstream change we found earlier that doubles the max number to 64
> is part of a bigger change that touches more of the inner workings,
> making Xen better able to handle the domU kernel behavior. This whole
> change is not going to be backported to Xen 4.8.
> 
> 
> Can you please test the following, instead of setting the
> gnttab_max_frames value:
> 
> Create the file
>      /etc/modprobe.d/xen-blkback-fewer-gnttab-frames
> with contents...
> 
> # apropos of #880554
> # workaround is not required for Xen 4.9 and later
> options xen_blkback max_ring_page_order=0
> options xen_blkback max_queues=1
> 
> ...and reboot.
> 
> This will cause the domU kernels to behave more in a way that Xen 4.8
> can cope with.
> 
> Regards,
> Hans
> 



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Mon, 26 Feb 2018 14:57:03 GMT) (full text, mbox, link).


Acknowledgement sent to Ian Jackson <ijackson@chiark.greenend.org.uk>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Mon, 26 Feb 2018 14:57:03 GMT) (full text, mbox, link).


Message #109 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Ian Jackson <ijackson@chiark.greenend.org.uk>
To: Christian Schwamborn <christian.schwamborn@nswit.de>
Cc: Hans van Kranenburg <hans@knorrie.org>, Valentin Vidic <Valentin.Vidic@CARNet.hr>, 880554@bugs.debian.org
Subject: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Mon, 26 Feb 2018 14:52:58 +0000
Christian Schwamborn writes ("Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64"):
> I can try, but the only system I can really test this is a productive 
> system, as this 'reliable' shows this issue (and I don't want to crash 
> it on purpose on a regular basis). Since I set gnttab_max_frame to a 
> higher value it runs smooth. If you're confident this will work I can 
> try this in the evening, when all users logged off.

Thanks.  I understand your reluctance.  I don't want to mislead you.
I think the odds of it working are probably ~75%.

Unless you want to tolerate that risk, it might be better for us to
try to come up with a better way to test it.

Ian.



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Mon, 26 Feb 2018 18:39:06 GMT) (full text, mbox, link).


Acknowledgement sent to Hans van Kranenburg <hans@knorrie.org>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Mon, 26 Feb 2018 18:39:06 GMT) (full text, mbox, link).


Message #114 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Hans van Kranenburg <hans@knorrie.org>
To: Ian Jackson <ijackson@chiark.greenend.org.uk>, Christian Schwamborn <christian.schwamborn@nswit.de>
Cc: Valentin Vidic <Valentin.Vidic@CARNet.hr>, 880554@bugs.debian.org
Subject: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Mon, 26 Feb 2018 19:35:26 +0100
On 02/26/2018 03:52 PM, Ian Jackson wrote:
> Christian Schwamborn writes ("Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64"):
>> I can try, but the only system I can really test this is a productive 
>> system, as this 'reliable' shows this issue (and I don't want to crash 
>> it on purpose on a regular basis). Since I set gnttab_max_frame to a 
>> higher value it runs smooth. If you're confident this will work I can 
>> try this in the evening, when all users logged off.
> 
> Thanks.  I understand your reluctance.  I don't want to mislead you.
> I think the odds of it working are probably ~75%.
> 
> Unless you want to tolerate that risk, it might be better for us to
> try to come up with a better way to test it.

I can try this.

I can run a dom0 with Xen 4.8 and 4.9 domU, I already have the xen-diag
for it (so confirmed the patch in this bug report builds ok, we should
include it for stretch, it's really useful).

I think it's mainly trying to get a domU running with various
combinations of domU kernel, number of disks and vcpus, and then look at
the output of xen-diag.

Hans



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Mon, 26 Feb 2018 23:45:06 GMT) (full text, mbox, link).


Acknowledgement sent to 880554@bugs.debian.org, Hans van Kranenburg <hans@knorrie.org>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Mon, 26 Feb 2018 23:45:06 GMT) (full text, mbox, link).


Message #119 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Hans van Kranenburg <hans@knorrie.org>
To: Ian Jackson <ijackson@chiark.greenend.org.uk>, Christian Schwamborn <christian.schwamborn@nswit.de>
Cc: 880554@bugs.debian.org, Valentin Vidic <Valentin.Vidic@CARNet.hr>
Subject: Re: [Pkg-xen-devel] Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Tue, 27 Feb 2018 00:40:41 +0100
On 02/26/2018 07:35 PM, Hans van Kranenburg wrote:
> On 02/26/2018 03:52 PM, Ian Jackson wrote:
>> Christian Schwamborn writes ("Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64"):
>>> I can try, but the only system I can really test this is a productive
>>> system, as this 'reliable' shows this issue (and I don't want to crash
>>> it on purpose on a regular basis). Since I set gnttab_max_frame to a
>>> higher value it runs smooth. If you're confident this will work I can
>>> try this in the evening, when all users logged off.
>>
>> Thanks.  I understand your reluctance.  I don't want to mislead you.
>> I think the odds of it working are probably ~75%.
>>
>> Unless you want to tolerate that risk, it might be better for us to
>> try to come up with a better way to test it.
>
> I can try this.
>
> I can run a dom0 with Xen 4.8 and 4.9 domU, I already have the xen-diag
> for it (so confirmed the patch in this bug report builds ok, we should
> include it for stretch, it's really useful).
>
> I think it's mainly trying to get a domU running with various
> combinations of domU kernel, number of disks and vcpus, and then look at
> the output of xen-diag.

Ok, I spent some time trying things.

Xen: 4.8.3+comet2+shim4.10.0+comet3-1+deb9u4.1
dom0 kernel 4.9.65-3+deb9u2
domU (PV) kernel 4.9.82-1+deb9u2

Observation so far: nr_frames increases as soon as a combination of
disk+vcpu has actually been doing disk activity, and then never decreases.

I ended up with a 64-vcpu domU with additional 10 1GiB disks (xvdc,
xvdd, etc).

I created ext4 fs on the disks and mounted them.

I used fio to throw some IO at the disk, trying to hit as many
combinations of vcpu and disk.

[things]
rw=randwrite
rwmixread=75
size=8M
directory=/mnt/xvdBLAH
ioengine=libaio
direct=1
iodepth=16
numjobs=64

with BLAH replaced by c, d, e, f etc...
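
The per-disk job files can be generated from that template instead of
being edited by hand. This is a sketch only: the function name is made up
for illustration, and the fio-xvd<letter> file names are taken from the
loop that runs the jobs.

```shell
# Write one fio job file per disk letter, using the [things] job above.
# $1: output directory; remaining arguments: disk letters (c d e ...).
gen_fio_jobs() {
    outdir="$1"
    shift
    for letter in "$@"; do
        cat > "$outdir/fio-xvd$letter" <<EOF
[things]
rw=randwrite
rwmixread=75
size=8M
directory=/mnt/xvd$letter
ioengine=libaio
direct=1
iodepth=16
numjobs=64
EOF
    done
}
```

For example, gen_fio_jobs . c d e f g h i j k l writes the ten job files
into the current directory.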

-# rm */things*; for i in c d e f g h i j k l; do fio fio-xvd$i; done

-# while true; do /usr/lib/xen-4.8/bin/xen-diag gnttab_query_size 2; sleep 10; done
domid=2: nr_frames=6, max_nr_frames=128
domid=2: nr_frames=7, max_nr_frames=128
domid=2: nr_frames=7, max_nr_frames=128
domid=2: nr_frames=10, max_nr_frames=128
domid=2: nr_frames=10, max_nr_frames=128
domid=2: nr_frames=11, max_nr_frames=128
domid=2: nr_frames=13, max_nr_frames=128
domid=2: nr_frames=14, max_nr_frames=128
domid=2: nr_frames=15, max_nr_frames=128
domid=2: nr_frames=16, max_nr_frames=128
domid=2: nr_frames=18, max_nr_frames=128
domid=2: nr_frames=18, max_nr_frames=128
domid=2: nr_frames=19, max_nr_frames=128
domid=2: nr_frames=21, max_nr_frames=128
domid=2: nr_frames=21, max_nr_frames=128
domid=2: nr_frames=23, max_nr_frames=128
domid=2: nr_frames=24, max_nr_frames=128
domid=2: nr_frames=24, max_nr_frames=128
domid=2: nr_frames=24, max_nr_frames=128
domid=2: nr_frames=24, max_nr_frames=128

So I can push it up to about 24 when doing this.

-# grep . /sys/module/xen_blkback/parameters/*
/sys/module/xen_blkback/parameters/log_stats:0
/sys/module/xen_blkback/parameters/max_buffer_pages:1024
/sys/module/xen_blkback/parameters/max_persistent_grants:1056
/sys/module/xen_blkback/parameters/max_queues:4
/sys/module/xen_blkback/parameters/max_ring_page_order:4

Now, I rebooted my test dom0 and put the modprobe file in place.
(Note: the filename has to end in .conf!)

-# grep . /sys/module/xen_blkback/parameters/*
/sys/module/xen_blkback/parameters/log_stats:0
/sys/module/xen_blkback/parameters/max_buffer_pages:1024
/sys/module/xen_blkback/parameters/max_persistent_grants:1056
/sys/module/xen_blkback/parameters/max_queues:1
/sys/module/xen_blkback/parameters/max_ring_page_order:0
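
The same check can be scripted, so that a modprobe file that was silently
ignored (for example because of a missing .conf suffix) is noticed right
away. A sketch, with a made-up function name; the expected values are the
ones from the proposed workaround:

```shell
# Compare the loaded xen_blkback parameters with the workaround values.
# $1: parameter directory (defaults to the sysfs path shown above).
# Returns 0 if both parameters match, 1 otherwise.
check_blkback_params() {
    dir="${1:-/sys/module/xen_blkback/parameters}"
    rc=0
    for pair in max_queues=1 max_ring_page_order=0; do
        name="${pair%%=*}"
        want="${pair#*=}"
        have="$(cat "$dir/$name" 2>/dev/null)"
        if [ "$have" = "$want" ]; then
            echo "$name: $have (ok)"
        else
            echo "$name: ${have:-unreadable} (expected $want)"
            rc=1
        fi
    done
    return "$rc"
}
```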

After doing the same tests, the result ends up being exactly 24 again.
So, the modprobe settings don't seem to do anything.

-# tree /sys/block/xvda/mq
/sys/block/xvda/mq
└── 0
    ├── active
    ├── cpu0
    │   ├── completed
    │   ├── dispatched
    │   ├── merged
    │   └── rq_list
    ├── cpu1
    │   ├── completed
    │   ├── dispatched
    │   ├── merged
    │   └── rq_list
   [...]
    ├── cpu63
    │   ├── completed
    │   ├── dispatched
    │   ├── merged
    │   └── rq_list
   [...]
    ├── cpu_list
    ├── dispatched
    ├── io_poll
    ├── pending
    ├── queued
    ├── run
    └── tags

65 directories, 264 files

Mwooop mwooop mwoop mwooooo (failure trombone).

This obviously didn't involve network traffic yet. Also, everything here
runs stretch kernels, which are already reported to be problematic.

But, the main thing I wanted to test is if the change would result in a
much lower total amount of grants, which is not the case.

So, does anyone have a better idea, or should we just add some clear
documentation for the max frames setting in the grub config example?

Hans



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Tue, 27 Feb 2018 16:09:05 GMT) (full text, mbox, link).


Acknowledgement sent to 880554@bugs.debian.org, Hans van Kranenburg <hans@knorrie.org>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Tue, 27 Feb 2018 16:09:05 GMT) (full text, mbox, link).


Message #124 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Hans van Kranenburg <hans@knorrie.org>
To: Ian Jackson <ijackson@chiark.greenend.org.uk>, Christian Schwamborn <christian.schwamborn@nswit.de>
Cc: 880554@bugs.debian.org, Valentin Vidic <Valentin.Vidic@CARNet.hr>
Subject: Re: [Pkg-xen-devel] Bug#880554: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Tue, 27 Feb 2018 17:05:06 +0100
On 02/27/2018 12:40 AM, Hans van Kranenburg wrote:
> [...]
> 
> But, the main thing I wanted to test is if the change would result in a
> much lower total amount of grants, which is not the case.

So,
* I couldn't reproduce a number > 32
* The proposed fix doesn't help.

There are two scenarios that could be happening:
1. Bug reporters are running a really exceptional sizing and workload.
2. "It's on fire and we don't know how big the fire is" (quote Ian)

ad 1. Christian, Valentin, can you give more specific info that would help
someone else set up a test environment to trigger values > 32?

ad 2. E.g. how many users run into this and do not report it, don't
understand it, switch to KVM and tell their friends that Xen is simply
unstable and crashes?

OTOH:

Since...
* this problem has been fixed in newer Xen already in a different way
* there's a sufficient workaround now (setting max frames)

...I doubt if it's useful (priority wise) to keep spending a lot of time
on this, since the work is really time consuming.

Hans



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Tue, 27 Feb 2018 17:33:03 GMT) (full text, mbox, link).


Acknowledgement sent to 880554@bugs.debian.org, Hans van Kranenburg <hans@knorrie.org>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Tue, 27 Feb 2018 17:33:03 GMT) (full text, mbox, link).


Message #129 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Hans van Kranenburg <hans@knorrie.org>
To: Ian Jackson <ijackson@chiark.greenend.org.uk>, Christian Schwamborn <christian.schwamborn@nswit.de>
Cc: 880554@bugs.debian.org, Valentin Vidic <Valentin.Vidic@CARNet.hr>
Subject: Re: [Pkg-xen-devel] Bug#880554: Bug#880554: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Tue, 27 Feb 2018 18:28:40 +0100
On 02/27/2018 05:05 PM, Hans van Kranenburg wrote:
> [...]
> 
> ...I doubt if it's useful (priority wise) to keep spending a lot of time
> on this, since the work is really time consuming.

It is, but it's also an interesting problem.

An idle, freshly started domU starts at nr_frames=6 or 7 in all cases.

Same test as before: 64 vcpus, 10 disks, trying to hit as many vcpu/disk
combinations as possible:

1. With new modprobe limits applied:

3.16.51-3+deb8u1 -> nr_frames=25
4.9.30-2+deb9u5  -> nr_frames=24
4.9.51-1         -> nr_frames=25
4.14.13-1~bpo9+1 -> nr_frames=23

2. Rebooting dom0, removing limits:

3.16.51-3+deb8u1 -> nr_frames=25
4.9.30-2+deb9u5  -> nr_frames=25
4.9.51-1         -> nr_frames=24
4.14.13-1~bpo9+1 -> nr_frames=46  <--

Well, there it is.
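
To make the comparison explicit, here is a small sketch that checks both result sets against the default of 32 frames (DEFAULT_MAX_NR_GRANT_FRAMES). The dictionaries are just the numbers from the two runs above; the function name is illustrative.

```python
DEFAULT_LIMIT = 32  # DEFAULT_MAX_NR_GRANT_FRAMES as discussed in this thread

# Peak nr_frames per domU kernel, copied from the two test runs above.
WITH_LIMITS = {
    "3.16.51-3+deb8u1": 25,
    "4.9.30-2+deb9u5": 24,
    "4.9.51-1": 25,
    "4.14.13-1~bpo9+1": 23,
}
WITHOUT_LIMITS = {
    "3.16.51-3+deb8u1": 25,
    "4.9.30-2+deb9u5": 25,
    "4.9.51-1": 24,
    "4.14.13-1~bpo9+1": 46,
}


def over_limit(results, limit=DEFAULT_LIMIT):
    """Return the kernels whose peak nr_frames exceeds the given limit."""
    return [kernel for kernel, frames in results.items() if frames > limit]


print("with modprobe limits:   ", over_limit(WITH_LIMITS))
print("without modprobe limits:", over_limit(WITHOUT_LIMITS))
```

Only the backports kernel without the modprobe limits ends up above 32 in this test.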

However, I can not, I repeat, not, see a difference between
4.9.30-2+deb9u5 and 4.9.51-1, the versions used in the report in the
very first message on this bug.

1. If you're running into the problem with a 4.9 stretch domU kernel,
you're likely hitting the limits in the same way I already hit them some
10 years ago: simply by having a large number of vcpus, vbds or vifs.

2. If you're upgrading a domU to use the stretch-backports kernel,
you're suddenly much more likely to bump into the limit.

So:

For 1. the solution is for the user to change the boot parameter, or for
us to reconsider patching DEFAULT_MAX_NR_GRANT_FRAMES 32 to something else
(xen/include/xen/grant_table.h), but that would require another round of
testing to see if it does what we think it does. I vote no.

To accommodate 2. the better option is to ship the modprobe config for
4.8, since running stretch-backports is a valid 'normal' use case. I vote
yes.

Ian, up to you to make a final decision.

kthxbye,
Hans



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Tue, 27 Feb 2018 19:27:03 GMT) (full text, mbox, link).


Acknowledgement sent to Valentin Vidic <Valentin.Vidic@CARNet.hr>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Tue, 27 Feb 2018 19:27:03 GMT) (full text, mbox, link).


Message #134 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Valentin Vidic <Valentin.Vidic@CARNet.hr>
To: 880554@bugs.debian.org, Hans van Kranenburg <hans@knorrie.org>
Cc: Ian Jackson <ijackson@chiark.greenend.org.uk>, Christian Schwamborn <christian.schwamborn@nswit.de>
Subject: Re: [Pkg-xen-devel] Bug#880554: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Tue, 27 Feb 2018 20:22:50 +0100
On Tue, Feb 27, 2018 at 05:05:06PM +0100, Hans van Kranenburg wrote:
> ad 1. Christian, Valentin, can you give more specific info that can help
> someone else to set up a test environment to trigger > 32 values.

I can't touch the original VM that had this issue and tried to
reproduce on another host with recent stretch kernels but without
success.  The maximum number I can get now is nr_frames=11.

Another piece of info that I forgot to mention before is that my VMs
were using DRBD disks. Since DRBD acts like a slow disk, it could cause
IO requests to pile up and hit the limit faster.

Since I can't reproduce it easily anymore, I suspect something was
fixed in the meantime.  My original report was for 4.9.30-2+deb9u2
and since then there seem to have been a number of fixes that could be
related to this:

linux (4.9.65-3) stretch; urgency=medium
  * xen/time: do not decrease steal time after live migration on xen
linux (4.9.65-1) stretch; urgency=medium
    - swiotlb-xen: implement xen_swiotlb_dma_mmap callback
    - xen-netback: Use GFP_ATOMIC to allocate hash
    - xen/gntdev: avoid out of bounds access in case of partial
      gntdev_mmap()
    - xen/manage: correct return value check on xenbus_scanf()
    - xen: don't print error message in case of missing Xenstore entry
    - xen/netback: set default upper limit of tx/rx queues to 8
linux (4.9.47-1) stretch; urgency=medium
    - nvme: use blk_mq_start_hw_queues() in nvme_kill_queues()
    - nvme: avoid to use blk_mq_abort_requeue_list()
    - efi: Don't issue error message when booted under Xen
    - xen/privcmd: Support correctly 64KB page granularity when mapping
      memory
    - xen/blkback: fix disconnect while I/Os in flight
    - xen/blkback: don't use xen_blkif_get() in xen-blkback kthread
    - xen/blkback: don't free be structure too early
    - xen-netback: fix memory leaks on XenBus disconnect
    - xen-netback: protect resource cleaning on XenBus disconnect
    - swiotlb-xen: update dev_addr after swapping pages
    - xen-netfront: Fix Rx stall during network stress and OOM
    - [x86] mm: Fix flush_tlb_page() on Xen
    - xen-netfront: Rework the fix for Rx stall during OOM and network
      stress
    - xen/scsiback: Fix a TMR related use-after-free
    - [x86] xen: allow userspace access during hypercalls
    - [armhf] Xen: Zero reserved fields of xatp before making hypervisor
      call
    - xen-netback: correctly schedule rate-limited queues
    - nbd: blk_mq_init_queue returns an error code on failure, not NULL
    - xen: fix bio vec merging (CVE-2017-12134) (Closes: #866511)
    - blk-mq-pci: add a fallback when pci_irq_get_affinity returns NULL
    - xen-blkfront: use a right index when checking requests
linux (4.9.30-2+deb9u4) stretch-security; urgency=high
  * xen: fix bio vec merging (CVE-2017-12134) (Closes: #866511)
linux (4.9.30-2+deb9u3) stretch-security; urgency=high
  * xen-blkback: don't leak stack data via response ring
    (CVE-2017-10911)
  * mqueue: fix a use-after-free in sys_mq_notify() (CVE-2017-11176)

In fact the original big VM with this problem runs happily with:

  domid=1: nr_frames=11, max_nr_frames=256

so it is quite possible raising the limit is not needed anymore
with the latest stretch kernels.

If no-one else can reproduce this anymore I suggest you close the
issue but include the xen-diag tool in the updated package.  That
way if someone reports the problem again it should be easy to detect.

-- 
Valentin



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Wed, 28 Feb 2018 06:51:05 GMT) (full text, mbox, link).


Acknowledgement sent to Christian Schwamborn <christian.schwamborn@nswit.de>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Wed, 28 Feb 2018 06:51:05 GMT) (full text, mbox, link).


Message #139 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Christian Schwamborn <christian.schwamborn@nswit.de>
To: 880554@bugs.debian.org, Hans van Kranenburg <hans@knorrie.org>, Ian Jackson <ijackson@chiark.greenend.org.uk>
Cc: Valentin Vidic <Valentin.Vidic@CARNet.hr>
Subject: Re: [Pkg-xen-devel] Bug#880554: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Wed, 28 Feb 2018 07:46:34 +0100
I very much appreciate the effort you all put in, and I like the idea of
shipping the xen-diag tool and maybe a hint somewhere about the issues
that occurred and the possible solution of raising max_nr_frames.

On 27.02.2018 17:05, Hans van Kranenburg wrote:
> ad 1. Christian, Valentin, can you give more specific info that can help
> someone else to set up a test environment to trigger > 32 values.

As this isn't my own system but a production system of one of my
customers, I'm really reluctant to use it for invasive testing.

Just to recap: the issue hit me with kernel 4.9.51-1.

hardware:
xeon E5-2620 v4
board supermicro X10SRi-F
32gb ecc ram
two 10 TB server disks
two I350 network adapters (onboard)

dom0:
debian stretch (up to date), kernel 4.9.51-1, xen-hypervisor 4.8.1-1+deb9u3,
the two network adapters as a bond in a bridge
the disks: gpt, 4 partitions (1M, 256M esp, 256M md mirror for boot, rest as
md mirror for lvm)

domu:
memory: 8192, 2 vcpus
uses a network interface on the bridge
16 lvm volumes as phys devices
debian stretch
issue with both kernel versions: 4.9.30-2+deb9u5 and 4.9.51-1
system runs mostly some smb, some web services, cal/card dav, psql, 
ldap, postfix, cyrus ...

In my early tests, before the issue was discussed here, I tried
linux-image-4.13.0-0.bpo.1-amd64 and the system stayed stable for a week.

Oh, and it's worth mentioning that I tried thin lvm in the beginning, but
I dropped it due to (write) performance and boot issues (the thinpool was
always inactive after boot and took about 5-10 minutes to activate once
there were about 4TB of data in it).

Currently the system is running stable with max_nr_frames=256 (I wanted
to be on the safe side) and kernel 4.9.65-3+deb9u2.

Maybe I can try to get some values with xen-diag Valentin provided to 
see the current state of the system, but I'm really busy at the moment 
job wise and private, I hope next week gets better (had some bad luck 
with our water installation - much mopping).



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Wed, 28 Feb 2018 07:57:02 GMT) (full text, mbox, link).


Acknowledgement sent to Valentin Vidic <Valentin.Vidic@CARNet.hr>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Wed, 28 Feb 2018 07:57:02 GMT) (full text, mbox, link).


Message #149 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Valentin Vidic <Valentin.Vidic@CARNet.hr>
To: 880554@bugs.debian.org, Hans van Kranenburg <hans@knorrie.org>
Cc: Ian Jackson <ijackson@chiark.greenend.org.uk>, Christian Schwamborn <christian.schwamborn@nswit.de>
Subject: Re: [Pkg-xen-devel] Bug#880554: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Wed, 28 Feb 2018 08:54:51 +0100
On Tue, Feb 27, 2018 at 08:22:50PM +0100, Valentin Vidic wrote:
> Since I can't reproduce it easily anymore I suspect something was
> fixed in the meanwhile.  My original report was for 4.9.30-2+deb9u2
> and since then there seems to be a number of fixes that could be
> related to this:

Just rebooted both dom0 and domU with 4.9.30-2+deb9u2 and the
postgresql domU is having problems right away after boot:

  domid=1: nr_frames=32, max_nr_frames=32

  [  242.652100] INFO: task kworker/u90:0:6 blocked for more than 120 seconds.

After upgrading the kernels I can't get it above 11 anymore:

  domid=1: nr_frames=11, max_nr_frames=32

So some of those many kernel fixes did the trick and things just
work fine with the newer kernels without raising gnttab_max_frames.
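
The two samples above differ in exactly one thing: whether nr_frames has reached max_nr_frames. A minimal sketch of that check follows (the function name is made up; the line format is the one shown above):

```python
import re

# Matches lines such as "domid=1: nr_frames=32, max_nr_frames=32".
LINE_RE = re.compile(r"domid=(\d+): nr_frames=(\d+), max_nr_frames=(\d+)")


def is_exhausted(line):
    """True when a sample line shows nr_frames at (or above) max_nr_frames."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError(f"unrecognised sample line: {line!r}")
    _, nr, limit = (int(g) for g in m.groups())
    return nr >= limit


print(is_exhausted("domid=1: nr_frames=32, max_nr_frames=32"))  # old kernel
print(is_exhausted("domid=1: nr_frames=11, max_nr_frames=32"))  # upgraded kernel
```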

-- 
Valentin



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Thu, 06 Sep 2018 17:33:05 GMT) (full text, mbox, link).


Acknowledgement sent to Hans van Kranenburg <hans@knorrie.org>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Thu, 06 Sep 2018 17:33:05 GMT) (full text, mbox, link).


Message #154 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Hans van Kranenburg <hans@knorrie.org>
To: Valentin Vidic <Valentin.Vidic@CARNet.hr>, 880554@bugs.debian.org
Cc: Ian Jackson <ijackson@chiark.greenend.org.uk>, Christian Schwamborn <christian.schwamborn@nswit.de>
Subject: Re: [Pkg-xen-devel] Bug#880554: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Thu, 6 Sep 2018 19:23:49 +0200
On 02/28/2018 08:54 AM, Valentin Vidic wrote:
> On Tue, Feb 27, 2018 at 08:22:50PM +0100, Valentin Vidic wrote:
>> Since I can't reproduce it easily anymore I suspect something was
>> fixed in the meanwhile.  My original report was for 4.9.30-2+deb9u2
>> and since then there seems to be a number of fixes that could be
>> related to this:
> 
> Just rebooted both dom0 and domU with 4.9.30-2+deb9u2 and the
> postgresql domU is having problems right away after boot:
> 
>   domid=1: nr_frames=32, max_nr_frames=32
> 
>   [  242.652100] INFO: task kworker/u90:0:6 blocked for more than 120 seconds.
> 
> Upgrading the kernels and I can't get it above 11 anymore:
> 
>   domid=1: nr_frames=11, max_nr_frames=32
> 
> So some of those many kernel fixes did the trick and things just
> work fine with the newer kernels without raising gnttab_max_frames.

During my testing I also couldn't quickly cause the nr_frames exhaustion
to happen with block devices, but I still can with a decent amount of
network interfaces inside the domU.

Anyway, I think the future proof solution here is to have clear
documentation about how to configure related settings, instead of trying
to find values that suit all users and that are not ridiculously high.

By the way, the settings changed in Xen 4.10/4.11. The default for the
dom0 is 64 now, and the default for domUs (still 32) can be set in
xl.conf; I currently have it at max_grant_frames=64. It can also be set
per domU, but I prefer setting it system-wide.

There's still a Xen boot option for this, which sets the dom0 value and
which determines the upper limit for the xl.conf option, iirc.

Oh, and the setting for a domU can also be changed while it's running.
Mind blown.

So yeah, it's a bit complicated: something like 4 to 6 knobs, all of which
you need to turn in the right direction, instead of only the old option.

I just don't know where to put the info pointing the user at the right
places to configure this. NEWS.Debian? Somewhere else? There is reference
documentation about this in the man pages, but I don't think there's a
tutorial/howto kind of documentation.

Hans




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Sun, 09 Sep 2018 20:57:05 GMT) (full text, mbox, link).


Acknowledgement sent to Hans van Kranenburg <hans@knorrie.org>, 880554@bugs.debian.org:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Sun, 09 Sep 2018 20:57:05 GMT) (full text, mbox, link).


Message #159 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Hans van Kranenburg <hans@knorrie.org>
To: Valentin Vidic <Valentin.Vidic@CARNet.hr>, 880554@bugs.debian.org
Cc: Christian Schwamborn <christian.schwamborn@nswit.de>, Ian Jackson <ijackson@chiark.greenend.org.uk>
Subject: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Date: Sun, 9 Sep 2018 21:55:39 +0100
On 06/09/2018 18:23, Hans van Kranenburg wrote:
> 
> Anyway, I think the future proof solution here is to have clear
> documentation about how to configure related settings, instead of trying
> to find values that suit all users and that are not ridiculously high.

I just assisted a user in #xen on freenode with this exact issue again.

The user had already gone through three maintenance windows trying to
upgrade a domU with quite a number of large disks and cpus from Jessie to
Stretch, failing every time with random symptoms (disk doesn't work,
network does not ping), and had already spent quite some hours searching
for solutions.

This reminded me of something else... which is better error logging when
the issue happens. This is an upstream thing to fix I guess, if
possible. As soon as there's a useful error message in logging or on the
console of the domU, then the user has something specific to search for
on ze interwebz.

Hans



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package xen-hypervisor-4.8-amd64. (Tue, 23 Oct 2018 17:39:03 GMT) (full text, mbox, link).


Acknowledgement sent to Ian Jackson <ijackson@chiark.greenend.org.uk>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Tue, 23 Oct 2018 17:39:03 GMT) (full text, mbox, link).


Message #164 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Ian Jackson <ijackson@chiark.greenend.org.uk>
To: 880554@bugs.debian.org
Subject: Re: #880554: max grant frames problem
Date: Tue, 23 Oct 2018 18:34:13 +0100
Control: retitle -1 max grant frames problem (domu freeze with linux-image-4.9.0-4-amd64)
Control: severity -1 important
Control: reassign -1 src:xen 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9

Just gardening here.

(i) Bug title should mention grant frames.

(ii) This does not affect all use cases and is not, IMO, RC.  Although
we should certainly see if we can improve it.

(iii) Britney is confused because the bug was reassigned to
xen-hypervisor-4.8-amd64 and thinks that updating to the 4.11-based
packages from sid would help reduce RC bugs since they lack that .deb.
This is wrong, of course.  For this reason in general bugs should be
reported against src:xen rather than against binary packages with
Xen versions in their package name.

Ian.

-- 
Ian Jackson <ijackson@chiark.greenend.org.uk>   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.



Changed Bug title to 'max grant frames problem (domu freeze with linux-image-4.9.0-4-amd64)' from 'xen domu freezes with kernel linux-image-4.9.0-4-amd64'. Request was from Ian Jackson <ijackson@chiark.greenend.org.uk> to 880554-submit@bugs.debian.org. (Tue, 23 Oct 2018 17:39:03 GMT) (full text, mbox, link).


Severity set to 'important' from 'critical' Request was from Ian Jackson <ijackson@chiark.greenend.org.uk> to 880554-submit@bugs.debian.org. (Tue, 23 Oct 2018 17:39:04 GMT) (full text, mbox, link).


Bug reassigned from package 'xen-hypervisor-4.8-amd64' to 'src:xen'. Request was from Ian Jackson <ijackson@chiark.greenend.org.uk> to 880554-submit@bugs.debian.org. (Tue, 23 Oct 2018 17:39:04 GMT) (full text, mbox, link).


Marked as found in versions xen/4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9. Request was from Ian Jackson <ijackson@chiark.greenend.org.uk> to 880554-submit@bugs.debian.org. (Tue, 23 Oct 2018 17:39:05 GMT) (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package src:xen. (Fri, 22 Feb 2019 21:24:03 GMT) (full text, mbox, link).


Acknowledgement sent to Hans van Kranenburg <hans@knorrie.org>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Fri, 22 Feb 2019 21:24:03 GMT) (full text, mbox, link).


Message #177 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Hans van Kranenburg <hans@knorrie.org>
To: Christian Schwamborn <christian.schwamborn@nswit.de>, 880554@bugs.debian.org
Subject: Re: [Pkg-xen-devel] Bug#880554: #880554: max grant frames problem
Date: Fri, 22 Feb 2019 22:20:37 +0100
Hi,

Our Buster TODO [1] has a TODO item for me about adding a paragraph in
the "Known Issues" section of the Debian README / NEWS / or wherever it
will go about the grant frames issue.

It would be very nice to have some WARN_ONCE code in the linux kernel
exactly at the place where this issue happens, but nobody has added
that yet, so for now, if this happens, domUs will still just hang
without anything meaningful in dmesg.

I also have a suspicion that this issue will be showing up less when
using Xen 4.11+ and Linux 4.19+ in dom0 and domU (so, yay, upgrade all
the things!). Newer Linux kernel versions have patches that also
actually release frame stuff when it's no longer actually in use, which
might end the 'only-going-up' behaviour of the number we see.

Anyway, I'm planning to add some "known issue" documentation about this
to the Xen 4.11 packaging, together with a clear short description of
what to change where in what configuration (because in Xen 4.11+ it's
different than in 4.4 or 4.8 again), and then mark this bts as closed
while that upload happens.

Digging around in kernel code to find out where this happens and adding
this WARN_ONCE is still a nice kernel coding exercise, but I don't want
this bts to track if that ends up on top of my TODO list or not.

Hans

[1] https://salsa.debian.org/xen-team/debian-xen/issues/24



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package src:xen. (Thu, 07 Mar 2019 09:09:03 GMT) (full text, mbox, link).


Acknowledgement sent to Jan Korbel <debian@teptin.net>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Thu, 07 Mar 2019 09:09:03 GMT) (full text, mbox, link).


Message #182 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Jan Korbel <debian@teptin.net>
To: 880554@bugs.debian.org
Subject: Re: max grant frames problem again
Date: Thu, 7 Mar 2019 09:57:54 +0100
Hello.

Same problem here. Everything is up-to-date stable, domU freezing after
mount and R/W on new 1,5TB device (btrfs).

dom0: 64b
domU: 32b OS + 64b kernel

No problem with gnttab_max_frames=256 yet.

Please include some fix in packages. Thanks.

J.



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package src:xen. (Wed, 08 May 2019 10:03:06 GMT) (full text, mbox, link).


Acknowledgement sent to Hans van Kranenburg <hans@knorrie.org>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Wed, 08 May 2019 10:03:06 GMT) (full text, mbox, link).


Message #187 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Hans van Kranenburg <hans@knorrie.org>
To: 880554@bugs.debian.org
Subject: Re: Bug#880554: max grant frames problem again
Date: Wed, 8 May 2019 11:58:47 +0200
jftr: Yesterday I discovered this:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=29d11cfd8698038b87458ba4d1329b9da81150a5

"xen/grant-table: log the lack of grants"

This is great. It is something we have discussed before: the domU
kernel should say something instead of just hanging. So, apparently
someone implemented that already.

Knorrie



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package src:xen. (Wed, 17 Jul 2019 23:33:04 GMT) (full text, mbox, link).


Acknowledgement sent to Hans van Kranenburg <hans@knorrie.org>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Wed, 17 Jul 2019 23:33:04 GMT) (full text, mbox, link).


Message #192 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Hans van Kranenburg <hans@knorrie.org>
To: 880554@bugs.debian.org, Valentin Vidic <Valentin.Vidic@CARNet.hr>, Christian Schwamborn <christian.schwamborn@nswit.de>, Martin von Wittich <martin.von.wittich@iserv.eu>, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>
Cc: Ian Jackson <ijackson@chiark.greenend.org.uk>
Subject: Re: [Pkg-xen-devel] Bug#880554: #880554: max grant frames problem
Date: Thu, 18 Jul 2019 01:30:44 +0200
Hi,

On 10/23/18 7:34 PM, Ian Jackson wrote:
> Control: retitle -1 max grant frames problem (domu freeze with linux-image-4.9.0-4-amd64)
> Control: severity -1 important
> Control: reassign -1 src:xen 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9

my last comment in this bts bug was about:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=29d11cfd8698038b87458ba4d1329b9da81150a5

...which has been in since Linux 4.13-rc2, and buster has 4.19+.

Is there anyone who wants to try to reproduce the max grant frames
problem on buster with Xen 4.11 and Linux 4.19 dom0/domU?

The 'xen/grant-table: max_grant_frames reached' message should show up
on the serial console. I'd like to see a test report of it actually
happening.
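
For anyone collecting such a report, here is a rough sketch for picking these messages out of a saved console log. The field names (cur, extra, limit, gnttab_free_count, req_entries) are the ones visible in the console output quoted later in this thread; the function name is made up, and the sketch assumes the log may be hard-wrapped the way mail clients wrap it.

```python
import re

MARKER = "max_grant_frames reached"
FIELD_RE = re.compile(r"(\w+)=(\d+)")


def parse_grant_warnings(log_text):
    """Return one dict of integer fields per 'max_grant_frames reached' message."""
    # Console output may be hard-wrapped, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    warnings = []
    for chunk in flat.split(MARKER)[1:]:
        # Keep only the fields before the next "[timestamp]" prefix.
        chunk = chunk.split("[", 1)[0]
        warnings.append({key: int(value) for key, value in FIELD_RE.findall(chunk)})
    return warnings


SAMPLE_LOG = """\
[    5.499058] systemd[1]: Set hostname to <debug-btrfs-buster>.
[    5.552968] xen:grant_table: xen/grant-table: max_grant_frames
reached cur=4 extra=1 limit=4 gnttab_free_count=3 req_entries=1
[    5.554012] xen:grant_table: xen/grant-table: max_grant_frames
reached cur=4 extra=1 limit=4 gnttab_free_count=3 req_entries=1
"""

for warning in parse_grant_warnings(SAMPLE_LOG):
    print(f"hit limit {warning['limit']} with cur={warning['cur']}")
```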

No further adjustments/fixes will go into the Stretch Xen packages at
this stage.

Having better documentation about how to set hypervisor and guest
options to deal with all of this is still a TODO. I would really like to
get some people together to start cleaning out the whole Xen related
wiki section for Debian, and actually provide some helpful content,
including FAQ stuff like max grants, PVH, PVH+grub etc...

Whoever wants to participate in that, just reply with a Yay!

Doing documentation work might seem boring, but it's write once, read
many all the way.

Hans



Information forwarded to debian-bugs-dist@lists.debian.org, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>:
Bug#880554; Package src:xen. (Thu, 28 Nov 2019 15:33:03 GMT) (full text, mbox, link).


Acknowledgement sent to Hans van Kranenburg <hans@knorrie.org>:
Extra info received and forwarded to list. Copy sent to Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>. (Thu, 28 Nov 2019 15:33:03 GMT) (full text, mbox, link).


Message #197 received at 880554@bugs.debian.org (full text, mbox, reply):

From: Hans van Kranenburg <hans@knorrie.org>
To: 880554@bugs.debian.org, Valentin Vidic <Valentin.Vidic@CARNet.hr>, Christian Schwamborn <christian.schwamborn@nswit.de>, Martin von Wittich <martin.von.wittich@iserv.eu>, Debian Xen Team <pkg-xen-devel@lists.alioth.debian.org>
Cc: Ian Jackson <ijackson@chiark.greenend.org.uk>
Subject: Re: [Pkg-xen-devel] Bug#880554: #880554: max grant frames problem
Date: Thu, 28 Nov 2019 16:21:33 +0100
On 7/18/19 1:30 AM, Hans van Kranenburg wrote:
> Hi,
> 
> On 10/23/18 7:34 PM, Ian Jackson wrote:
>> Control: retitle -1 max grant frames problem (domu freeze with linux-image-4.9.0-4-amd64)
>> Control: severity -1 important
>> Control: reassign -1 src:xen 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9
> 
> my last comment in this bts bug was about:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=29d11cfd8698038b87458ba4d1329b9da81150a5
> 
> ...which has been in since Linux 4.13-rc2, and buster has 4.19+.
> 
> Is there anyone who wants to try to reproduce the max grant frames
> problem on buster with Xen 4.11 and Linux 4.19 dom0/domU?
> 
> The 'xen/grant-table: max_grant_frames reached' should show up on the
> serial console. I'd like to see a test report of it actually happening.

I actually just did this, by putting max_grant_frames = 4 in a domU
config file and starting it (Linux 4.19 domU on Xen 4.11):

Welcome to Debian GNU/Linux 10 (buster)!

[    5.499058] systemd[1]: Set hostname to <debug-btrfs-buster>.
[    5.552968] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=3 req_entries=1
[    5.554012] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=3 req_entries=1
[    5.555858] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=3 req_entries=1
[    5.556950] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=3 req_entries=1
[    5.557082] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=3 req_entries=1
[    5.557295] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=3 req_entries=1
[    5.557636] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=3 req_entries=1
[    5.558960] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=3 req_entries=1
[    5.559800] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=3 req_entries=1
[    6.014291] gnttab_expand: 159 callbacks suppressed
[    6.014296] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=0 req_entries=3
[    6.014351] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=0 req_entries=8
[    6.033683] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=0 req_entries=3
[    6.055013] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=0 req_entries=3
[    6.055729] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=0 req_entries=26
[    6.060256] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=0 req_entries=3
[    6.077000] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=0 req_entries=3
[    6.109760] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=0 req_entries=3
[    6.138126] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=0 req_entries=3
[    6.148626] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=0 req_entries=3

Yay. Better info for the users!
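
For anyone who wants to check a saved console log for these messages,
here is a minimal sketch in Python. It only assumes the message format
shown above; the function name parse_grant_warnings and the sample text
are made up for the example and are not part of any Xen or kernel tool.

```python
import re
from typing import Dict, List

# Matches the message text shown above. \s+ between the tokens tolerates
# the line wrapping that mail clients add in the middle of a message.
PATTERN = re.compile(
    r"max_grant_frames\s+reached\s+cur=(\d+)\s+extra=(\d+)\s+limit=(\d+)"
    r"\s+gnttab_free_count=(\d+)\s+req_entries=(\d+)"
)
FIELDS = ("cur", "extra", "limit", "gnttab_free_count", "req_entries")


def parse_grant_warnings(console_text: str) -> List[Dict[str, int]]:
    """Return one dict of counters per 'max_grant_frames reached' message."""
    return [
        dict(zip(FIELDS, (int(value) for value in match.groups())))
        for match in PATTERN.finditer(console_text)
    ]


if __name__ == "__main__":
    # Two messages copied from the console output above, wrapped as in the mail.
    sample = (
        "[    5.552968] xen:grant_table: xen/grant-table: max_grant_frames\n"
        "reached cur=4 extra=1 limit=4 gnttab_free_count=3 req_entries=1\n"
        "[    6.055729] xen:grant_table: xen/grant-table: max_grant_frames\n"
        "reached cur=4 extra=1 limit=4 gnttab_free_count=0 req_entries=26\n"
    )
    for warning in parse_grant_warnings(sample):
        print(warning)
```

Any hit at all means the guest ran into its grant frame limit, and the
limit= field shows which value was in effect for that domU.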

Also, there's a patch in review that can improve the situation:

https://lists.xenproject.org/archives/html/xen-devel/2019-11/msg01607.html

The biggest annoyance in our Xen 4.11 now is that the default value of
the gnttab_max_frames hypervisor command line option was raised from 32
to 64 a while ago, but the toolstack overrides this again with its own
default of 32. The patch attempts to fix that.

Hans



Reply sent to Hans van Kranenburg <hans@knorrie.org>:
You have taken responsibility. (Sun, 22 Nov 2020 20:39:02 GMT) (full text, mbox, link).


Notification sent to Christian Schwamborn <christian.schwamborn@nswit.de>:
Bug acknowledged by developer. (Sun, 22 Nov 2020 20:39:03 GMT) (full text, mbox, link).


Message #202 received at 880554-done@bugs.debian.org (full text, mbox, reply):

From: Hans van Kranenburg <hans@knorrie.org>
To: 880554-done@bugs.debian.org, Valentin Vidic <Valentin.Vidic@CARNet.hr>, Christian Schwamborn <christian.schwamborn@nswit.de>, Martin von Wittich <martin.von.wittich@iserv.eu>
Cc: Ian Jackson <ijackson@chiark.greenend.org.uk>
Subject: Re: #880554: max grant frames problem
Date: Sun, 22 Nov 2020 21:29:03 +0100
Hi all,

On 11/28/19 4:21 PM, Hans van Kranenburg wrote:
> On 7/18/19 1:30 AM, Hans van Kranenburg wrote:
>> Hi,
>>
>> On 10/23/18 7:34 PM, Ian Jackson wrote:
>>> Control: retitle -1 max grant frames problem (domu freeze with linux-image-4.9.0-4-amd64)
>>> Control: severity -1 important
>>> Control: reassign -1 src:xen 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9
>>
>> my last comment in this bts bug was about:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=29d11cfd8698038b87458ba4d1329b9da81150a5
>>
>> ...which has been in since Linux 4.13-rc2, and buster has 4.19+.
>>
>> Is there anyone who wants to try to reproduce the max grant frames
>> problem on buster with Xen 4.11 and Linux 4.19 dom0/domU?
>>
>> The 'xen/grant-table: max_grant_frames reached' should show up on the
>> serial console. I'd like to see a test report of it actually happening.
> 
> I actually just did this, by putting max_grant_frames = 4 in a domU
> config file and starting it (Linux 4.19 domU on Xen 4.11):
> 
> Welcome to Debian GNU/Linux 10 (buster)!
> 
> [    5.499058] systemd[1]: Set hostname to <debug-btrfs-buster>.
> [    5.552968] xen:grant_table: xen/grant-table: max_grant_frames reached cur=4 extra=1 limit=4 gnttab_free_count=3 req_entries=1
> [...]
> 
> Yay. Better info for the users!

So, this was already confirmed.

> Also, there's a patch in review that can improve the situation:
> 
> https://lists.xenproject.org/archives/html/xen-devel/2019-11/msg01607.html
> 
> The biggest annoyance in our Xen 4.11 now is that the default value of
> the gnttab_max_frames hypervisor command line option was raised from 32
> to 64 a while ago, but the toolstack overrides this again with its own
> default of 32. The patch attempts to fix that.

That change was included in Xen 4.13. We're about to put Xen 4.14 in
Debian unstable now, which includes the improvement. In Xen 4.11 in
Debian stable, the situation is a bit more annoying, but that's not
going to change any more now. Whoever needs specific settings that are
non-default should have figured out how to set them at this point.

For reference (and for those who do not want to look it up), here's the
commit message of the final patch that went in, which describes the new
Xen 4.14 behavior:

---- 8< ----

commit f2ae59bc4b9b5c3f12de86aa42cdf413d2c3ffbf
Author: George Dunlap <george.dunlap@citrix.com>
Date:   Fri Nov 29 17:24:45 2019 +0000

Rationalize max_grant_frames and max_maptrack_frames handling

Xen used to have single, system-wide limits for the number of grant
frames and maptrack frames a guest was allowed to create. Increasing
or decreasing this single limit on the Xen command-line would change
the limit for all guests on the system.

Later, per-domain limits for these values were created. The system-wide
limits became strict limits: domains could not be created with higher
limits, but could be created with lower limits. However, that change
also introduced a range of different "default" values into various
places in the toolstack:

- The python libxc bindings hard-coded these values to 32 and 1024,
  respectively
- The libxl default values are 32 and 1024 respectively.
- xl will use the libxl default for maptrack, but does its own default
  calculation for grant frames: either 32 or 64, based on the max
  possible mfn.

These defaults interact poorly with the hypervisor command-line limit:

- The hypervisor command-line limit cannot be used to raise the limit
  for all guests anymore, as the default in the toolstack will
  effectively override this.
- If you use the hypervisor command-line limit to *reduce* the limit,
  then the "default" values generated by the toolstack are too high,
  and all guest creations will fail.

In other words, the toolstack defaults require any change to be
effected by having the admin explicitly specify a new value in every
guest.

In order to address this, have grant_table_init treat negative values
for max_grant_frames and max_maptrack_frames as instructions to use the
system-wide default, and have all the above toolstacks default to passing
-1 unless a different value is explicitly configured.

This restores the old behavior in that changing the hypervisor command-line
option can change the behavior for all guests, while retaining the ability
to set per-guest values.  It also removes the bug that reducing the
system-wide max will cause all domains without explicit limits to fail.

NOTE: - The Ocaml bindings require the caller to always specify a value,
  and the code to start a xenstored stubdomain hard-codes these to 4
  and 128 respectively; this behaviour will not be modified.

---- >8 ----
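
To make the difference concrete, here is a toy model of the rule the
commit message describes. It is not Xen code: effective_limit and the
two *_TOOLSTACK_DEFAULT constants are names invented for this sketch,
and the values 64 and 16 are just example hypervisor limits.

```python
# Toy model of the limit handling described in the commit message above.
# Not Xen code: all names here are invented for illustration.

def effective_limit(requested: int, system_wide: int) -> int:
    """Resolve a per-domain frame limit against the system-wide limit.

    A negative request means "use the system-wide default". A request
    above the system-wide limit is rejected, since the commit message
    says domains cannot be created with higher limits.
    """
    if requested < 0:
        return system_wide
    if requested > system_wide:
        raise ValueError(
            f"requested {requested} frames exceeds system-wide limit {system_wide}"
        )
    return requested


OLD_TOOLSTACK_DEFAULT = 32  # hard-coded grant frame default before the patch
NEW_TOOLSTACK_DEFAULT = -1  # after the patch: defer to the hypervisor limit

if __name__ == "__main__":
    # Hypervisor limit raised to 64: the old default still pins guests at 32.
    print(effective_limit(OLD_TOOLSTACK_DEFAULT, 64))
    print(effective_limit(NEW_TOOLSTACK_DEFAULT, 64))

    # Hypervisor limit lowered to 16: the old default makes creation fail.
    try:
        effective_limit(OLD_TOOLSTACK_DEFAULT, 16)
    except ValueError as err:
        print("guest creation fails:", err)
    print(effective_limit(NEW_TOOLSTACK_DEFAULT, 16))
```

An explicit per-guest value still wins as long as it does not exceed the
system-wide limit, which is the "retaining the ability to set per-guest
values" part of the commit message.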

So, I'm closing this debian bug now, since there are no actionable items
left to do.

Hans



Bug archived. Request was from Debbugs Internal Request <owner@bugs.debian.org> to internal_control@bugs.debian.org. (Mon, 21 Dec 2020 07:27:58 GMT) (full text, mbox, link).



Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Wed Jul 24 08:00:21 2024; Machine Name: buxtehude
