Debian Bug report logs - #900399
memtest86+: very probably kills system controller on Lenovo Thinkpad T500 laptop

Package: memtest86+; Maintainer for memtest86+ is Fabio Fantoni <fantonifabio@tiscali.it>; Source for memtest86+ is src:memtest86+ (PTS, buildd, popcon).

Reported by: Sergey Kogan <kogan@bit-integro.ru>

Date: Wed, 30 May 2018 08:36:02 UTC

Severity: normal

Found in version memtest86+/5.01-3


Report forwarded to debian-bugs-dist@lists.debian.org, kogan@bit-integro.ru, Yann Dirson <dirson@debian.org>:
Bug#900399; Package memtest86+. (Wed, 30 May 2018 08:36:04 GMT) (full text, mbox, link).


Acknowledgement sent to Sergey Kogan <kogan@bit-integro.ru>:
New Bug report received and forwarded. Copy sent to kogan@bit-integro.ru, Yann Dirson <dirson@debian.org>. (Wed, 30 May 2018 08:36:04 GMT) (full text, mbox, link).


Message #5 received at submit@bugs.debian.org (full text, mbox, reply):

From: Sergey Kogan <kogan@bit-integro.ru>
To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: memtest86+: very probably kills system controller on Lenovo Thinkpad T500 laptop
Date: Wed, 30 May 2018 14:58:33 +0600
Package: memtest86+
Version: 5.01-3
Severity: critical
Justification: breaks the whole system

Hi! There is a situation I believe should be reported ASAP. We have two
Lenovo T500 laptops completely dead after overnight testing with memtest86+.
The notebooks do not power on, and do not even light the 'external power'
LED when the AC adapter is plugged in.

The whole story is:

29-May-2018 Two Lenovo T500 laptops were upgraded with 4 GB memory sticks. After
the upgrade the laptops were powered on with no problems and were used till the
evening.

In the evening 29-May-2018 both laptops were rebooted into Memtest86+ and
set up for an overnight RAM test with default memtest settings.

In the morning 30-May-2018 both laptops were found with a few passes completed
and zero memory errors found. But the laptops were not responding to
keyboard commands. The laptops were turned off with a long press of the power
button and then refused to start.

All the usual tricks were performed including:
- Removing and replacing RAM sticks
- Removing battery and AC power for a period of time
- The 'press the power button ten times, then long-press it' advice found on the internet
- CMOS battery removal

Still, both Lenovo laptops show no signs of life. We are going to send one
laptop for service today (maybe they will diagnose the issue better),
but it seems very likely that memtest86+ somehow killed the firmware
of a motherboard system controller.

Until the problem is identified, I recommend issuing a warning and/or preventing
installation or running of memtest86+ on Lenovo T500 laptops.

I will post updates when more information is available.


-- System Information:
Debian Release: 9.1
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: i386 (i686)

Kernel: Linux 4.9.0-3-686-pae (SMP w/2 CPU cores)
Locale: LANG=en_US, LC_CTYPE=ru_RU.UTF8 (charmap=UTF-8) (ignored: LC_ALL set to en_US.UTF8), LANGUAGE=en_US:en (charmap=UTF-8) (ignored: LC_ALL set to en_US.UTF8)
Shell: /bin/sh linked to /bin/bash
Init: systemd (via /run/systemd/system)

Versions of packages memtest86+ depends on:
ii  debconf [debconf-2.0]  1.5.61

memtest86+ recommends no packages.

Versions of packages memtest86+ suggests:
ii  grub-pc              2.02~beta3-5
pn  hwtools              <none>
pn  kernel-patch-badram  <none>
pn  memtest86            <none>
pn  memtester            <none>
pn  mtools               <none>

-- debconf-show failed



Information forwarded to debian-bugs-dist@lists.debian.org, Yann Dirson <dirson@debian.org>:
Bug#900399; Package memtest86+. (Wed, 06 Jun 2018 09:54:08 GMT) (full text, mbox, link).


Acknowledgement sent to Сергей Коган <kogan@bit-integro.ru>:
Extra info received and forwarded to list. Copy sent to Yann Dirson <dirson@debian.org>. (Wed, 06 Jun 2018 09:54:08 GMT) (full text, mbox, link).


Message #10 received at 900399@bugs.debian.org (full text, mbox, reply):

From: Сергей Коган <kogan@bit-integro.ru>
To: 900399@bugs.debian.org
Subject: It's confirmed: memtest86+ can kill lenovo mainboard
Date: Wed, 6 Jun 2018 15:35:36 +0600
Hi!

Good news and bad news. Both T500 laptops were examined. One was
(almost) repaired. One is dead.

One-line summary: Yes, memtest86+ killed them. No, it is not related to
the embedded controller. It's a short circuit in a power hub IC.

Details:

- It's important to note that the power management logic in Lenovo ThinkPad
laptops is quite sophisticated. The embedded controller provides a
high-level signal, while special ICs issue signals to various gates to
power up or power down specific parts of the system.

- One of those low-level ICs is RINKAN (U61 on Lenovo schematics).
The important part of the IC is the VCC3SW micro-power LDO (dc/dc
converter). It provides a limited 3.3V power supply for the power button
detection circuit, thermal protection logic and a power hub IC.

- The power hub PMH_7 (U28) is more intelligent than RINKAN, and has an
SPI connection to the EC. It controls a lot of clocks and power signals
on the main board. Note that the PMH is used across different Lenovo products,
so some of its outputs are left unused. It is common practice to tie
unused IC outputs to ground or VCC instead of leaving them unconnected.

- Coreboot developers discovered a method of accessing the internal
registers of the PMH. The protocol is simple: write a register address
to one memory-mapped EC address, then write the desired value to the other
EC address.

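    /* Read the current value of PMH register `reg`. */
    outb(reg, EC_LENOVO_PMH7_ADDR);
    val = inb(EC_LENOVO_PMH7_DATA);
    /* Select the register again and write the value back with `bit` set. */
    outb(reg, EC_LENOVO_PMH7_ADDR);
    outb(val | (1 << bit), EC_LENOVO_PMH7_DATA);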
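
The index/data access sequence above can be modelled in software. The
following is only a sketch: the `pmh_sim` structure, its 256-entry register
file and all helper names are hypothetical, chosen to show the
read-modify-write pattern. It is not coreboot's API and it touches no real
hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of an index/data register pair.
 * Nothing here touches hardware; sizes and names are assumptions. */
struct pmh_sim {
    uint8_t addr;        /* last value written to the "address" port */
    uint8_t regs[256];   /* simulated internal registers */
};

/* Model of outb(reg, ADDR): select which internal register is visible. */
static void sim_write_addr(struct pmh_sim *p, uint8_t reg)
{
    p->addr = reg;
}

/* Model of inb(DATA): read the currently selected register. */
static uint8_t sim_read_data(const struct pmh_sim *p)
{
    return p->regs[p->addr];
}

/* Model of outb(val, DATA): write the currently selected register. */
static void sim_write_data(struct pmh_sim *p, uint8_t val)
{
    p->regs[p->addr] = val;
}

/* Read-modify-write that sets one bit, mirroring the sequence above. */
static void sim_set_bit(struct pmh_sim *p, uint8_t reg, unsigned bit)
{
    assert(bit < 8);
    sim_write_addr(p, reg);
    uint8_t val = sim_read_data(p);
    sim_write_addr(p, reg);
    sim_write_data(p, (uint8_t)(val | (1u << bit)));
}
```

In this model any write to the address slot followed by a write to the data
slot changes a register, regardless of who issued the two writes.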

- Now we leave the ground of hard facts and start speculating.

- It seems to be the case that either the BIOS does not list the memory-mapped
EC registers as a reserved memory area, or memtest86+ fails to process this
reservation correctly.

- The pattern of the memory writes by memtest is (unfortunately) 100%
compatible with the PMH internal register access protocol.

- It is very possible that by writing some moving ones and zeros or
random bytes, memtest has pulled an unused (tied to ground or VCC)
PMH pin high or low, thereby creating a short circuit on the VCC3SW line.

- This short circuit would tend to overheat the RINKAN LDO, as its
output transistor is in active mode and is easily overloaded by a PMH
output transistor (which is in conduction mode with a resistance of
milliohms). It seems that RINKAN has no over-current or thermal
protection built in.

- VCC3SW malfunction is not critical while the main board 3.3V/9A and
5V/8A buses are powered by TPS51221 (U41) IC. Most components draw power 
from main buses and not from VCC3SW. But when the laptop is powered off, 
there is no VCC3SW bus to initiate the power-on process. The laptop is 
bricked.
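
The speculation about reserved memory areas can be made concrete with a
small sketch of the check a memory tester would need: only touch ranges the
firmware explicitly reports as usable. The `mem_region` structure, the
region types and the function name below are hypothetical (loosely modelled
on an E820-style table) and are not taken from the memtest86+ sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified firmware memory-map entry (modelled loosely on
 * an E820-style table; field names are assumptions for illustration). */
enum region_type { REGION_USABLE = 1, REGION_RESERVED = 2 };

struct mem_region {
    uint64_t base;
    uint64_t length;
    enum region_type type;
};

/* Return true only if [start, end) lies entirely inside one usable region.
 * Anything not explicitly listed as usable is treated as off-limits. */
static bool range_is_testable(const struct mem_region *map, size_t n,
                              uint64_t start, uint64_t end)
{
    assert(start < end);
    for (size_t i = 0; i < n; i++) {
        uint64_t lo = map[i].base;
        uint64_t hi = map[i].base + map[i].length;
        if (map[i].type == REGION_USABLE && start >= lo && end <= hi)
            return true;
    }
    return false;
}
```

If the firmware omitted a device window from such a table, or the tester
misread the table, a check like this would pass for an address range that
is not plain RAM. That is the failure mode the bullets above speculate about.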

Findings:

Both laptops were disassembled and the main boards examined using a
multimeter and an oscilloscope. The main boards were of different
revisions (and different types: one with discrete graphics, one without)
but both had a malfunctioning VCC3SW power bus. The first laptop
provided around 1.2V on VCC3SW, and the measured resistance from
VCC3SW to GND was around 400 Ohm. After cutting the VCC3SW pin on the RINKAN
IC and providing external power to the VCC3SW line, the laptop
powered up and attempted to boot. We ended up wiring in an external
micro-power LDO (LP2930-3.3) to provide the power permanently. This
laptop still has some minor problems (like refusing to power up unless
the battery is removed and AC-IN is plugged in), but is still usable.

The second T500's RINKAN was not providing any power to the VCC3SW bus,
and the measured resistance was only ~50 Ohms. We had to cut both the VCC3SW
(output) and VREGIN20 (input) RINKAN pins to remove an over-current
condition. After that we observed power on the main 3.3V and 5V buses,
but RINKAN/PMH7 do not issue 'POWER GOOD' signals, which prevents the system
from becoming usable. No repair is possible.
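
For orientation, the measured resistances can be turned into a rough load
current with Ohm's law. This is a sketch under a strong assumption: it
treats the VCC3SW-to-GND path as a plain resistor at the nominal 3.3 V,
which a board full of semiconductor junctions is not. Under that assumption
400 Ohm corresponds to roughly 8 mA and 50 Ohm to roughly 66 mA. The
function name is hypothetical.

```c
#include <assert.h>

/* Current in milliamps that a purely resistive load would draw.
 * Treating the measured VCC3SW-to-GND resistance as a plain resistor is an
 * assumption; real boards contain semiconductor junctions. */
static double load_current_ma(double volts, double ohms)
{
    assert(ohms > 0.0);
    return volts / ohms * 1000.0;
}
```

Calling `load_current_ma(3.3, 400.0)` and `load_current_ma(3.3, 50.0)`
reproduces the two figures quoted above.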

It looks like the T6x, T400/500, T410/510 and T420/520 laptop families could be
affected by this problem. Starting with the T430/530 series, the
communication protocol with the EC was changed, breaking the tp_smapi
driver and fixing the described problem as a side effect.

I have a "revived" T500 on hand and I would be happy to provide any
information to confirm or correct my findings.

I still think that it's appropriate to warn Lenovo users of the
possibility of bricking their laptops with a mere memory test.

---
Sincerely yours,
Sergey Kogan



Information forwarded to debian-bugs-dist@lists.debian.org, Yann Dirson <dirson@debian.org>:
Bug#900399; Package memtest86+. (Thu, 07 Jun 2018 13:30:03 GMT) (full text, mbox, link).


Acknowledgement sent to Сергей Коган <kogan@bit-integro.ru>:
Extra info received and forwarded to list. Copy sent to Yann Dirson <dirson@debian.org>. (Thu, 07 Jun 2018 13:30:04 GMT) (full text, mbox, link).


Message #15 received at 900399@bugs.debian.org (full text, mbox, reply):

From: Сергей Коган <kogan@bit-integro.ru>
To: 900399@bugs.debian.org
Subject: More good news
Date: Thu, 7 Jun 2018 19:26:43 +0600
Hi!

Let's lower the severity of this bug and flag it as unverified.

Given the datasheet for the TB62501 and the actual board layout of the T500,
the described scenario (a short from VCC3SW to GND caused by a stray
write to the PMH register) is highly improbable:

- The LDO inside the RINKAN has over-current protection set as low as
55mA, which should prevent any damage even if VCC3SW is shorted. After
a single over-current/under-voltage event, the RINKAN LDO is locked in the
OFF state and requires a complete power-off to restart.

- Unused pins of the PMH are in fact floating

- Some RINKAN batches do show a tendency to malfunction for no apparent
reason. The main board temperature could be a contributing factor.
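
The latching behaviour described in the first bullet can be sketched as a
tiny state machine. Only the 55 mA limit comes from the text above; the
structure, the function names and the decision to model the over-current
trip alone (not the under-voltage one) are illustrative assumptions, not
taken from the TB62501 datasheet.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a latching over-current protection.
 * The 55 mA limit comes from the message above; everything else is an
 * illustrative assumption, not taken from the TB62501 datasheet. */
#define OCP_LIMIT_MA 55.0

struct ldo_model {
    bool latched_off;   /* true once a trip has occurred */
};

/* Feed one load-current sample; returns true while the output is enabled. */
static bool ldo_step(struct ldo_model *ldo, double load_ma)
{
    assert(load_ma >= 0.0);
    if (load_ma > OCP_LIMIT_MA)
        ldo->latched_off = true;      /* trip and stay off */
    return !ldo->latched_off;
}

/* Model of a complete power-off: the only way to clear the latch. */
static void ldo_power_cycle(struct ldo_model *ldo)
{
    ldo->latched_off = false;
}
```

In this model a single sample above the limit switches the output off, and
it stays off for all later samples until `ldo_power_cycle` is called.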

So, we have to seriously consider the possibility that the two laptops died
at the same time purely by coincidence.

We do plan to run memtest on the restored laptop using a current
measuring/limiting circuit on the VCC3SW bus. If no excessive current
consumption is detected, memtest has nothing to do with the
issue. If excessive current is observed during the test, it
will give us a direction in which to resume the investigation.
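
The decision rule of the planned experiment can be written down as a simple
scan over logged current samples. This is a sketch only: the function name,
the idea of logging samples in milliamps and the caller-supplied threshold
are assumptions, and no measurement results have been reported in this
thread.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the first sample above limit_ma, or -1 if none.
 * Samples are assumed to be VCC3SW current readings in milliamps. */
static long first_excess_sample(const double *samples_ma, size_t n,
                                double limit_ma)
{
    assert(samples_ma != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (samples_ma[i] > limit_ma)
            return (long)i;
    return -1;
}
```

A result of -1 over a full test run would correspond to the "memtest has
nothing to do with the issue" outcome; any other result marks the point in
the log at which to resume the investigation.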

---
Sincerely yours,
Sergey Kogan



Information forwarded to debian-bugs-dist@lists.debian.org, Yann Dirson <dirson@debian.org>:
Bug#900399; Package memtest86+. (Tue, 03 Jul 2018 12:36:03 GMT) (full text, mbox, link).


Acknowledgement sent to Tomas Janousek <tomi@nomi.cz>:
Extra info received and forwarded to list. Copy sent to Yann Dirson <dirson@debian.org>. (Tue, 03 Jul 2018 12:36:03 GMT) (full text, mbox, link).


Message #20 received at 900399@bugs.debian.org (full text, mbox, reply):

From: Tomas Janousek <tomi@nomi.cz>
To: Сергей Коган <kogan@bit-integro.ru>, 900399@bugs.debian.org
Subject: Re: Bug#900399: It's confirmed: memtest86+ can kill lenovo mainboard
Date: Tue, 3 Jul 2018 14:24:28 +0200
Hi,

On Wed, Jun 06, 2018 at 03:35:36PM +0600, Сергей Коган wrote:
> [...]
> It looks like T6x, T400/500, T410/510, T420/520 laptop families could be
> affected by this problem. Starting from the T430/530 series, a communication
> protocol with the EC was changed - breaking tp_smapi driver and fixing the
> described problem as a side effect.
> [...]

This may be completely unrelated, but it seems somewhat relevant:

When pressing and holding a key during memtest86+ on an otherwise perfectly
working T420, there are errors due to a different value being read than was
written. Initially I thought my memory/motherboard was faulty and the keyboard
pressure was triggering this, but the patterns are totally deterministic: the
same key always does the same "damage" to the bits.

Perhaps there is indeed something mapped into the memory... :-)
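
One possible way to check whether such errors are deterministic is to XOR
the written and read-back values and compare the resulting bit masks across
errors. The sketch below assumes the "damage" shows up as the same set of
flipped bits each time; the `mem_error` structure and function names are
hypothetical, and no actual error values were posted in this report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One recorded mismatch: what was written and what was read back. */
struct mem_error {
    uint32_t written;
    uint32_t read_back;
};

/* Bits that differ between the written and the read-back value. */
static uint32_t flipped_bits(struct mem_error e)
{
    return e.written ^ e.read_back;
}

/* True if every recorded error shows the same set of flipped bits. */
static bool same_flip_mask(const struct mem_error *errs, size_t n)
{
    assert(n > 0);
    uint32_t mask = flipped_bits(errs[0]);
    for (size_t i = 1; i < n; i++)
        if (flipped_bits(errs[i]) != mask)
            return false;
    return true;
}
```

If a held key instead forced bits to a fixed value, the XOR mask would vary
with the test pattern, so a differing mask does not by itself rule out a
deterministic cause.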

-- 
Tomáš Janoušek, a.k.a. Pivník, a.k.a. Liskni_si, http://work.lisk.in/



Information forwarded to debian-bugs-dist@lists.debian.org, Yann Dirson <dirson@debian.org>:
Bug#900399; Package memtest86+. (Sat, 14 Jul 2018 01:27:03 GMT) (full text, mbox, link).


Acknowledgement sent to Dmitry Smirnov <onlyjob@debian.org>:
Extra info received and forwarded to list. Copy sent to Yann Dirson <dirson@debian.org>. (Sat, 14 Jul 2018 01:27:03 GMT) (full text, mbox, link).


Message #25 received at 900399@bugs.debian.org (full text, mbox, reply):

From: Dmitry Smirnov <onlyjob@debian.org>
To: 900399@bugs.debian.org
Cc: 900399-submitter@bugs.debian.org
Subject: Re: #900399 memtest86+: very probably kills system controller on Lenovo Thinkpad T500 laptop
Date: Sat, 14 Jul 2018 11:22:10 +1000
IMHO the inflated severity of this bug is unjustified.

Generally speaking, memtest86+ is exposing a hardware problem, which is
exactly what it is designed to do and seems to be doing well; therefore this
bug seems to be targeted against memtest86+'s primary function.

Let me use a hypothetical example: suppose I'm stress testing a notebook
continuously for many hours. But a notebook is not designed with the same
thermal properties as a server, so during testing the notebook overheats
beyond its thermal specifications for too long and eventually breaks. Fair
enough, arguably memtest86+ exposed a flaw in the thermal design, which is
exactly what's expected. It is unfortunate if hardware ends up damaged, but
it is not a bug in memtest86+.

Isn't it common sense that any burn-out test is not without risks of damage 
to hardware?

Maybe this bug should be forwarded to the notebook vendor?

What action do you expect from the Debian maintainer?
Incorporating a warning appears to be a task for upstream developers.

For what it's worth, I've used memtest86+ to extensively test two different 
models of T520 and T410 Thinkpads without breaking them...

-- 
All the best,
 Dmitry Smirnov.

---

Lies are the social equivalent of toxic waste: Everyone is potentially
harmed by their spread.
        -- Sam Harris

Message sent on to Sergey Kogan <kogan@bit-integro.ru>:
Bug#900399. (Sat, 14 Jul 2018 01:27:05 GMT) (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Yann Dirson <dirson@debian.org>:
Bug#900399; Package memtest86+. (Sun, 12 Aug 2018 00:24:03 GMT) (full text, mbox, link).


Acknowledgement sent to ydirson@free.fr:
Extra info received and forwarded to list. Copy sent to Yann Dirson <dirson@debian.org>. (Sun, 12 Aug 2018 00:24:03 GMT) (full text, mbox, link).


Message #33 received at 900399@bugs.debian.org (full text, mbox, reply):

From: ydirson@free.fr
To: Сергей Коган <kogan@bit-integro.ru>, 900399@bugs.debian.org
Cc: control@bugs.debian.org
Subject: Re: Bug#900399: More good news
Date: Sun, 12 Aug 2018 02:21:36 +0200 (CEST)
severity 900399 normal
thanks

I suggest you get some advice from the forum[1], and as Dmitry mentioned, bring the issue to Lenovo.

[1] http://forum.canardpc.com/forums/73-Memtest86-Official-forum?s=1407c99a4da914ef85e60c32c658ba16

----- Original Message -----
> From: "Сергей Коган" <kogan@bit-integro.ru>
> To: 900399@bugs.debian.org
> Sent: Thursday, 7 June 2018 15:26:43
> Subject: Bug#900399: More good news
> 
> Hi!
> 
> Let's lower the severity of this bug and flag it as unverified.
> 
> Given the datasheet for the TB62501 and actual board layout of the
> T500
> - the described scenario (short from the VCC3SW to GND caused by a
> stray
> write to the PMH register) is highly improbable:
> 
> - The LDO inside the RINKAN has an over-current protection set as low
> as
> 55mA and should prevent any damage even if the VCC3SW is shorted.
> After
> the single over-current/under-voltage event, RINKAN LDO is locked in
> the
> OFF state and requires a complete power-off to restart.
> 
> - Unused pins of the PMH are in fact floating
> 
> - Some RINKAN batches do show tendency to malfunction with no
> apparent
> reasons. The main board temperature could be a contributing factor.
> 
> So, we have to seriously consider the possibility that two laptops
> died
> at the same time just by a coincidence.
> 
> We do plan to run a memtest on the restored laptop using a current
> measuring/limiting circuit on the VCC3SW bus. If no excessive current
> consumption would be detected - the memtest has nothing to do with
> the
> issue. If an excessive current during the test would be observed, it
> would get us a direction to resume the investigation.
> 
> ---
> Sincerely yours,
> Sergey Kogan
> 



Severity set to 'normal' from 'critical' Request was from ydirson@free.fr to control@bugs.debian.org. (Sun, 12 Aug 2018 00:24:04 GMT) (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Debian QA Group <packages@qa.debian.org>:
Bug#900399; Package memtest86+. (Tue, 11 Jan 2022 17:06:02 GMT) (full text, mbox, link).


Acknowledgement sent to fantonifabio@tiscali.it:
Extra info received and forwarded to list. Copy sent to Debian QA Group <packages@qa.debian.org>. (Tue, 11 Jan 2022 17:06:02 GMT) (full text, mbox, link).


Message #40 received at 900399@bugs.debian.org (full text, mbox, reply):

From: Fabio Fantoni <fantonifabio@tiscali.it>
To: 900399@bugs.debian.org
Subject: Re: memtest86+: very probably kills system controller on Lenovo Thinkpad T500 laptop
Date: Tue, 11 Jan 2022 18:02:05 +0100
Hi, I have used memtest86+ many times and it has never broken hardware, and
I suppose it is not the cause in this case either.

What I have seen instead, in several cases where the machine restarted,
turned off or froze, is that the cause was an overheating problem that
should be solved BEFORE running these tests. In my cases it was always solved
by replacing the thermal paste (needed on nearly all servers, PCs and
notebooks after many years).

I have described my experience in the hope of helping anyone who has similar
problems and, thinking that memtest is the cause, reads this bug.



Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Fri May 26 18:29:25 2023; Machine Name: buxtehude