Debian Bug report logs -
#900399
memtest86+: very probably kills system controller on Lenovo Thinkpad T500 laptop
Report forwarded
to debian-bugs-dist@lists.debian.org, kogan@bit-integro.ru, Yann Dirson <dirson@debian.org>:
Bug#900399; Package memtest86+.
(Wed, 30 May 2018 08:36:04 GMT).
Acknowledgement sent
to Sergey Kogan <kogan@bit-integro.ru>:
New Bug report received and forwarded. Copy sent to kogan@bit-integro.ru, Yann Dirson <dirson@debian.org>.
(Wed, 30 May 2018 08:36:04 GMT).
Message #5 received at submit@bugs.debian.org:
Package: memtest86+
Version: 5.01-3
Severity: critical
Justification: breaks the whole system
Hi! There is a situation I believe should be reported ASAP. We have two
Lenovo T500 laptops completely dead after overnight testing with memtest86+.
The notebooks do not power on and do not even light the 'external power'
LED when the AC adapter is plugged in.
The whole story is:
29-May-2018: Two Lenovo T500 laptops were upgraded with 4 GB memory sticks.
After the upgrade the laptops powered on with no problems and were used until
the evening.
In the evening of 29-May-2018 both laptops were rebooted into Memtest86+ and
set up for an overnight RAM test with the default memtest settings.
In the morning of 30-May-2018 both laptops were found with a few passes
completed and zero memory errors found, but they were not responding to
keyboard commands. The laptops were turned off with a long press of the power
button and then refused to start.
All the usual tricks were performed including:
- Removing and replacing RAM sticks
- Removing battery and AC power for a period of time
- The "press the power button ten times, then hold it" advice found on the internet
- CMOS battery removal
Still, the two Lenovo laptops show no signs of life. We are going to send one
laptop for service today (maybe they will diagnose the issue better),
but it seems very likely that memtest86+ somehow killed the firmware
of a motherboard system controller.
Until the problem is identified, I recommend issuing a warning and/or
preventing installation/running of memtest86+ on Lenovo T500 laptops.
I will post updates when more information is available.
-- System Information:
Debian Release: 9.1
APT prefers stable
APT policy: (500, 'stable')
Architecture: i386 (i686)
Kernel: Linux 4.9.0-3-686-pae (SMP w/2 CPU cores)
Locale: LANG=en_US, LC_CTYPE=ru_RU.UTF8 (charmap=UTF-8) (ignored: LC_ALL set to en_US.UTF8), LANGUAGE=en_US:en (charmap=UTF-8) (ignored: LC_ALL set to en_US.UTF8)
Shell: /bin/sh linked to /bin/bash
Init: systemd (via /run/systemd/system)
Versions of packages memtest86+ depends on:
ii debconf [debconf-2.0] 1.5.61
memtest86+ recommends no packages.
Versions of packages memtest86+ suggests:
ii grub-pc 2.02~beta3-5
pn hwtools <none>
pn kernel-patch-badram <none>
pn memtest86 <none>
pn memtester <none>
pn mtools <none>
-- debconf-show failed
Information forwarded
to debian-bugs-dist@lists.debian.org, Yann Dirson <dirson@debian.org>:
Bug#900399; Package memtest86+.
(Wed, 06 Jun 2018 09:54:08 GMT).
Acknowledgement sent
to Сергей Коган <kogan@bit-integro.ru>:
Extra info received and forwarded to list. Copy sent to Yann Dirson <dirson@debian.org>.
(Wed, 06 Jun 2018 09:54:08 GMT).
Message #10 received at 900399@bugs.debian.org:
Hi!
Good news and bad news. Both T500 laptops were examined. One was
(almost) repaired. One is dead.
One-line summary: Yes, memtest86+ killed them. No, it is not related to
the embedded controller. It's a short circuit in a power hub IC.
Details:
- It's important to note that the power management logic in Lenovo ThinkPad
laptops is quite sophisticated. The embedded controller provides a
high-level signal, while special ICs issue signals to various gates to
power up or power down specific parts of the system.
- One of those low-level ICs is RINKAN (U61 on the Lenovo schematics).
The important part of this IC is the VCC3SW micro-power LDO (a low-dropout
linear regulator). It provides a limited 3.3 V power supply for the power
button detection circuit, the thermal protection logic and a power hub IC.
- The power hub PMH7 (U28) is more intelligent than RINKAN and has an
SPI connection to the EC. It controls a lot of clocks and power signals
on the main board. Note that the PMH is used across different Lenovo
products, so some of its outputs are left unused. It is common practice
to tie unused IC outputs to ground or VCC instead of leaving them
unconnected.
- Coreboot developers discovered a method of accessing the internal
registers of the PMH. The protocol is simple: write a register address
to one memory-mapped EC address, then read or write the desired value at
the other EC address.
    /* select the register and read its current value */
    outb(reg, EC_LENOVO_PMH7_ADDR);
    val = inb(EC_LENOVO_PMH7_DATA);
    /* select it again and write the value back with one bit set */
    outb(reg, EC_LENOVO_PMH7_ADDR);
    outb(val | (1 << bit), EC_LENOVO_PMH7_DATA);
- From here on we leave the ground of hard facts and start speculating.
- It seems that either the BIOS does not list the memory-mapped EC
registers as a reserved memory area, or memtest86+ fails to process this
reservation correctly.
- The pattern of memory writes made by memtest is (unfortunately) 100%
compatible with the PMH internal register access protocol.
- It is very possible that by writing moving ones and zeros or random
bytes, memtest pulled an unused (tied to ground or VCC) PMH pin high or
low, thereby creating a short circuit on the VCC3SW line.
- This short circuit would tend to overheat the RINKAN LDO, as its
output transistor is in active mode and is easily overloaded by a PMH
output transistor (which is in conduction mode with a resistance of
milliohms). It seems that RINKAN has no over-current or thermal
protection built in.
- A VCC3SW malfunction is not critical while the main board 3.3V/9A and
5V/8A buses are powered by the TPS51221 (U41) IC. Most components draw
power from the main buses and not from VCC3SW. But once the laptop is
powered off, there is no VCC3SW bus to initiate the power-on process.
The laptop is bricked.
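The speculation above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the window size, the register offsets and all `sim_*` names are made up for the example and are not the real PMH7/EC addresses, and nothing here models the actual hardware. It shows only why a blind pattern fill over an index/data register pair ends up storing the pattern into an internal register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, for illustration only: these offsets are NOT the
 * real PMH7/EC addresses. */
enum {
    SIM_SPACE_SIZE = 16,  /* size of the simulated address window */
    SIM_PMH_ADDR   = 4,   /* index ("address") register           */
    SIM_PMH_DATA   = 5,   /* data register                        */
    SIM_PMH_REGS   = 256  /* number of internal registers         */
};

struct sim_pmh {
    uint8_t index;               /* last value written to SIM_PMH_ADDR */
    uint8_t regs[SIM_PMH_REGS];  /* internal register file             */
    uint8_t ram[SIM_SPACE_SIZE]; /* ordinary memory in the same window */
};

/* One write into the window: ADDR latches the index, DATA stores into
 * the selected internal register, any other address is plain RAM. */
void sim_write(struct sim_pmh *s, size_t addr, uint8_t val)
{
    assert(addr < SIM_SPACE_SIZE);
    if (addr == SIM_PMH_ADDR)
        s->index = val;
    else if (addr == SIM_PMH_DATA)
        s->regs[s->index] = val;
    else
        s->ram[addr] = val;
}

/* Blind ascending fill of [start, end) with one pattern byte, the way a
 * memory tester treats a range it believes to be ordinary RAM. */
void sim_pattern_fill(struct sim_pmh *s, size_t start, size_t end,
                      uint8_t pattern)
{
    for (size_t a = start; a < end; a++)
        sim_write(s, a, pattern);
}

/* Number of internal registers that no longer hold their initial 0. */
unsigned sim_count_modified(const struct sim_pmh *s)
{
    unsigned n = 0;
    for (size_t i = 0; i < SIM_PMH_REGS; i++)
        if (s->regs[i] != 0)
            n++;
    return n;
}
```

In this model, filling the whole window of a zero-initialised `struct sim_pmh` with the pattern 0x01 first latches index 1 and then stores 0x01 into internal register 1, while a fill that stops short of the register pair (as it would if the range were honoured as reserved) leaves the register file untouched.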
Findings:
Both laptops were disassembled and the main boards examined using a
multimeter and an oscilloscope. The main boards were of different
revisions (and different types: one with discrete graphics, one without),
but on both the VCC3SW power bus had malfunctioned. The first laptop
provided around 1.2 V on VCC3SW, and the measured resistance from
VCC3SW to GND was around 400 Ohm. After cutting the VCC3SW pin on the
RINKAN IC and providing external power to the VCC3SW line, the laptop
powered up and attempted to boot. We ended up wiring in an external
micro-power LDO (LP2930-3.3) to provide the power permanently. This
laptop still has some minor problems (like refusing to power up unless
the battery is removed and AC-IN is plugged in), but is still usable.
On the second T500, RINKAN was not providing any power to the VCC3SW bus,
and the measured resistance was only ~50 Ohm. We had to cut both the
VCC3SW (output) and VREGIN20 (input) RINKAN pins to remove an over-current
condition. After that we observed power on the main 3.3 V and 5 V buses,
but RINKAN/PMH7 do not issue 'POWER GOOD' signals, which prevents the
system from becoming usable. No repair is possible.
It looks like the T6x, T400/500, T410/510 and T420/520 laptop families
could be affected by this problem. Starting with the T430/530 series, the
communication protocol with the EC was changed, breaking the tp_smapi
driver and fixing the described problem as a side effect.
I have a "revived" T500 on hand and I would be happy to provide any
information to confirm or correct my findings.
I still think that it's appropriate to warn Lenovo users of the
possibility of bricking their laptops with a mere memory test.
---
Sincerely yours,
Sergey Kogan
Information forwarded
to debian-bugs-dist@lists.debian.org, Yann Dirson <dirson@debian.org>:
Bug#900399; Package memtest86+.
(Thu, 07 Jun 2018 13:30:03 GMT).
Acknowledgement sent
to Сергей Коган <kogan@bit-integro.ru>:
Extra info received and forwarded to list. Copy sent to Yann Dirson <dirson@debian.org>.
(Thu, 07 Jun 2018 13:30:04 GMT).
Message #15 received at 900399@bugs.debian.org:
Hi!
Let's lower the severity of this bug and flag it as unverified.
Given the datasheet for the TB62501 and the actual board layout of the
T500, the described scenario (a short from VCC3SW to GND caused by a stray
write to the PMH register) is highly improbable:
- The LDO inside the RINKAN has over-current protection set as low as
55 mA, which should prevent any damage even if VCC3SW is shorted. After
a single over-current/under-voltage event, the RINKAN LDO is locked in the
OFF state and requires a complete power-off to restart.
- Unused pins of the PMH are in fact floating.
- Some RINKAN batches do show a tendency to malfunction for no apparent
reason. The main board temperature could be a contributing factor.
So we have to seriously consider the possibility that the two laptops died
at the same time just by coincidence.
We do plan to run memtest on the restored laptop using a current
measuring/limiting circuit on the VCC3SW bus. If no excessive current
consumption is detected, memtest has nothing to do with the issue. If
excessive current is observed during the test, it will give us a
direction in which to resume the investigation.
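As a rough sanity check of the figures, here is a minimal sketch in C. It assumes a purely resistive load at the nominal 3.3 V and uses the 55 mA limit quoted above; the function and macro names are made up for the example. The resistances reported in the previous message (about 400 Ohm and about 50 Ohm) were measured with a multimeter on already damaged boards, so they say nothing certain about the load during the memtest run.

```c
#include <assert.h>
#include <stddef.h>

#define VCC3SW_NOMINAL_V    3.3   /* nominal VCC3SW voltage, in volts       */
#define RINKAN_OCP_LIMIT_MA 55.0  /* over-current limit quoted above, in mA */

/* Ohm's law: current in mA drawn by a purely resistive load. */
double load_current_ma(double volts, double ohms)
{
    assert(ohms > 0.0);
    return volts / ohms * 1000.0;
}

/* Index of the first sample above limit_ma, or -1 if none exceeds it.
 * Intended for a log of VCC3SW current readings taken during a test. */
long first_overcurrent(const double *samples_ma, size_t n, double limit_ma)
{
    for (size_t i = 0; i < n; i++)
        if (samples_ma[i] > limit_ma)
            return (long)i;
    return -1;
}
```

Under these assumptions a 400 Ohm load stays well below the limit (about 8 mA), while a 50 Ohm load exceeds it (about 66 mA), so the protection would be expected to trip in the second case. The second function is the decision rule for the planned measurement: any reading above the limit during the memtest run would be worth investigating.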
---
Sincerely yours,
Sergey Kogan
Information forwarded
to debian-bugs-dist@lists.debian.org, Yann Dirson <dirson@debian.org>:
Bug#900399; Package memtest86+.
(Tue, 03 Jul 2018 12:36:03 GMT).
Acknowledgement sent
to Tomas Janousek <tomi@nomi.cz>:
Extra info received and forwarded to list. Copy sent to Yann Dirson <dirson@debian.org>.
(Tue, 03 Jul 2018 12:36:03 GMT).
Message #20 received at 900399@bugs.debian.org:
Hi,
On Wed, Jun 06, 2018 at 03:35:36PM +0600, Сергей Коган wrote:
> [...]
> It looks like T6x, T400/500, T410/510, T420/520 laptop families could be
> affected by this problem. Starting from the T430/530 series, a communication
> protocol with the EC was changed - breaking tp_smapi driver and fixing the
> described problem as a side effect.
> [...]
This may be completely unrelated, but it seems somewhat relevant:
When pressing and holding a key during memtest86+ on an otherwise perfectly
working T420, there are errors due to a different value being read than was
written. Initially I thought my memory/motherboard was faulty and the pressure
on the keyboard was triggering this, but the patterns are totally
deterministic: the same key always does the same "damage" to the bits.
Perhaps there is indeed something mapped into the memory... :-)
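One way to check the "deterministic" part is to compare the flipped bits across several errors recorded for the same key. This is a minimal sketch in C, not anything memtest86+ provides: the struct and function names are made up for the example, and it assumes the damage shows up as the same XOR mask regardless of the test pattern, which would not hold if the bits were forced to a fixed value instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One failing comparison reported by a memory test (hypothetical record). */
struct mismatch {
    uint32_t written;   /* pattern the test wrote          */
    uint32_t read_back; /* value it read back from the cell */
};

/* Bits that differ between what was written and what was read. */
uint32_t flipped_bits(const struct mismatch *m)
{
    return m->written ^ m->read_back;
}

/* Returns 1 if all n mismatches show exactly the same flipped-bit mask,
 * 0 otherwise. n must be at least 1. */
int same_damage(const struct mismatch *m, size_t n)
{
    assert(n > 0);
    uint32_t mask = flipped_bits(&m[0]);
    for (size_t i = 1; i < n; i++)
        if (flipped_bits(&m[i]) != mask)
            return 0;
    return 1;
}
```

If every error logged while one key is held yields the same mask, that points away from random cell faults and towards something systematic writing into the tested range.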
--
Tomáš Janoušek, a.k.a. Pivník, a.k.a. Liskni_si, http://work.lisk.in/
Information forwarded
to debian-bugs-dist@lists.debian.org, Yann Dirson <dirson@debian.org>:
Bug#900399; Package memtest86+.
(Sat, 14 Jul 2018 01:27:03 GMT).
Acknowledgement sent
to Dmitry Smirnov <onlyjob@debian.org>:
Extra info received and forwarded to list. Copy sent to Yann Dirson <dirson@debian.org>.
(Sat, 14 Jul 2018 01:27:03 GMT).
Message #25 received at 900399@bugs.debian.org:
IMHO the inflated severity of this bug is unjustified.
Generally speaking, memtest86+ is exposing a hardware problem, which is
exactly what it is designed to do and seems to be doing well; therefore this
bug seems to be targeted against memtest86+'s primary function.
Let me use a hypothetical example: suppose I'm stress testing a notebook
continuously for many hours. But the notebook is not designed with the same
thermal properties as a server, so during testing it is overheated beyond its
thermal specifications for too long and eventually breaks. Fair enough,
arguably memtest86+ exposed a flaw in the thermal design, which is exactly
what's expected. It is unfortunate if hardware ended up damaged, but it is
not a bug in memtest86+.
Isn't it common sense that any burn-in test is not without risk of damage
to hardware?
Maybe this bug should be forwarded to the notebook vendor?
What action do you expect from the Debian maintainer?
Incorporating a warning appears to be a task for upstream developers.
For what it's worth, I've used memtest86+ to extensively test two different
models of T520 and T410 ThinkPads without breaking them...
--
All the best,
Dmitry Smirnov.
---
Lies are the social equivalent of toxic waste: Everyone is potentially
harmed by their spread.
-- Sam Harris
Message sent on
to Sergey Kogan <kogan@bit-integro.ru>:
Bug#900399.
(Sat, 14 Jul 2018 01:27:05 GMT).
Information forwarded
to debian-bugs-dist@lists.debian.org, Yann Dirson <dirson@debian.org>:
Bug#900399; Package memtest86+.
(Sun, 12 Aug 2018 00:24:03 GMT).
Acknowledgement sent
to ydirson@free.fr:
Extra info received and forwarded to list. Copy sent to Yann Dirson <dirson@debian.org>.
(Sun, 12 Aug 2018 00:24:03 GMT).
Message #33 received at 900399@bugs.debian.org:
severity 900399 normal
thanks
I suggest you get some advice from the forum[1] and, as Dmitry mentioned, bring the issue to Lenovo.
[1] http://forum.canardpc.com/forums/73-Memtest86-Official-forum?s=1407c99a4da914ef85e60c32c658ba16
----- Original Message -----
> From: "Сергей Коган" <kogan@bit-integro.ru>
> To: 900399@bugs.debian.org
> Sent: Thursday, 7 June 2018 15:26:43
> Subject: Bug#900399: More good news
>
> Hi!
>
> Let's lower the severity of this bug and flag it as unverified.
>
> Given the datasheet for the TB62501 and actual board layout of the T500
> - the described scenario (short from the VCC3SW to GND caused by a stray
> write to the PMH register) is highly improbable:
>
> - The LDO inside the RINKAN has an over-current protection set as low as
> 55mA and should prevent any damage even if the VCC3SW is shorted. After
> the single over-current/under-voltage event, RINKAN LDO is locked in the
> OFF state and requires a complete power-off to restart.
>
> - Unused pins of the PMH are in fact floating
>
> - Some RINKAN batches do show tendency to malfunction with no apparent
> reasons. The main board temperature could be a contributing factor.
>
> So, we have to seriously consider the possibility that two laptops died
> at the same time just by a coincidence.
>
> We do plan to run a memtest on the restored laptop using a current
> measuring/limiting circuit on the VCC3SW bus. If no excessive current
> consumption would be detected - the memtest has nothing to do with the
> issue. If an excessive current during the test would be observed, it
> would get us a direction to resume the investigation.
>
> ---
> Sincerely yours,
> Sergey Kogan
>
Severity set to 'normal' from 'critical'
Request was from ydirson@free.fr
to control@bugs.debian.org.
(Sun, 12 Aug 2018 00:24:04 GMT).
Information forwarded
to debian-bugs-dist@lists.debian.org, Debian QA Group <packages@qa.debian.org>:
Bug#900399; Package memtest86+.
(Tue, 11 Jan 2022 17:06:02 GMT).
Acknowledgement sent
to fantonifabio@tiscali.it:
Extra info received and forwarded to list. Copy sent to Debian QA Group <packages@qa.debian.org>.
(Tue, 11 Jan 2022 17:06:02 GMT).
Message #40 received at 900399@bugs.debian.org:
Hi, I have used memtest86+ many times and it has never broken hardware, and I
suppose it is not the cause in this case either.
What I have seen in several cases, where the machine restarted, turned off
or froze during the test, is that the cause was an overheating problem, which
should be solved BEFORE running these tests. In my cases it was always solved
by replacing the thermal paste (needed on nearly all servers/PCs/notebooks
after many years).
I am describing my experience in the hope of helping someone who has similar
problems, suspects that memtest is the cause, and reads this bug.
Last modified:
Fri May 26 18:29:25 2023;