Debian Bug report logs -
#1010368
python3.10: python variables called _m lead to unreproducible pyc installations
Report forwarded
to debian-bugs-dist@lists.debian.org, reproducible-bugs@lists.alioth.debian.org, Matthias Klose <doko@debian.org>:
Bug#1010368; Package src:python3.10.
(Fri, 29 Apr 2022 16:24:03 GMT).
Acknowledgement sent
to Johannes Schauer Marin Rodrigues <josch@debian.org>:
New Bug report received and forwarded. Copy sent to reproducible-bugs@lists.alioth.debian.org, Matthias Klose <doko@debian.org>.
(Fri, 29 Apr 2022 16:24:03 GMT).
Message #5 received at submit@bugs.debian.org:
Source: python3.10
Version: 3.10.4-3
Severity: wishlist
Tags: patch
User: reproducible-builds@lists.alioth.debian.org
Usertags: randomness
X-Debbugs-Cc: reproducible-bugs@lists.alioth.debian.org
Hi,
if a package contains python code with a variable named _m, then after
installing that package the pyc file resulting from that code is
unreproducible because of some randomness. Minimal reproducer:
export SOURCE_DATE_EPOCH="$(date +%s)"
for i in `seq 1 10`; do
    mmdebstrap --quiet --variant=apt --include=python3.10 \
        --customize-hook='echo _m > "$1"/tmp/decoder.py' \
        --customize-hook='chroot "$1" python3.10 -m py_compile /tmp/decoder.py' \
        --customize-hook='cat "$1"/tmp/__pycache__/decoder.cpython-310.pyc | md5sum' \
        unstable /dev/null 2>&1
done | sort | uniq -c
The above will print something like:
6 4662176a6024d5eec15033097cd7e588 -
4 aeb00bedc784e7cca3eb42cf50e92f8d -
If you run the loop more often, you can see that about 2/3 of the time
the pyc file will have one hash and the other 1/3 of the time the other.
So there are two distinct possible contents that the pyc file generated
from the same python script just containing "_m" can have. Below you can
find a diff between the hexdumps of these two possible pyc versions.
I have no idea why this happens. But why does it matter? Since #1004558
got fixed, a Priority:standard chroot is now mostly bit-by-bit
identical. Only "mostly" because there is one remaining difference:
/usr/lib/python3.10/json/__pycache__/decoder.cpython-310.pyc
But why does that pyc file differ (randomly) while all the others remain
stable? Even if it sounds ridiculous, I tracked it down to the use of
the variable _m in /usr/lib/python3.10/json/decoder.py.
Also, the problem only shows when compiling all pyc files in a fresh
chroot. Given the same chroot with all pyc files already generated, the
pyc file generated from the minimal test case (just a python script
containing the variable name "_m" as above) will remain stable. So the
following will *not* reproduce the problem:
echo _m > test.py
for i in `seq 1 100`; do
    rm -rf __pycache__
    python3.10 -m py_compile test.py
    md5sum __pycache__/test.cpython-310.pyc
done
It needs to be done in a fresh chroot. Since the pyc contents also rely
on the modification time of the python scripts involved, maybe the
reason for this behaviour is some unreproducible mtimes after
unpacking the packages? This is why I'm filing it here. This might as
well be some sort of packaging problem.
For the minimal test case (a python script just containing the variable
name "_m"), the pyc file is very tiny and the diffoscope output will
display the whole file via the diff context:
@@ -1,8 +1,8 @@
00000000: 6f0d 0d0a 0300 0000 5371 fe33 17b6 dd59 o.......Sq.3...Y
00000010: e300 0000 0000 0000 0000 0000 0000 0000 ................
00000020: 0001 0000 0040 0000 0073 0800 0000 6500 .....@...s....e.
-00000030: 0100 6400 5300 2901 4e29 01da 025f 6da9 ..d.S.).N)..._m.
-00000040: 0072 0200 0000 7202 0000 00fa 0f2f 746d .r....r....../tm
+00000030: 0100 6400 5300 2901 4e29 015a 025f 6da9 ..d.S.).N).Z._m.
+00000040: 0072 0100 0000 7201 0000 00fa 0f2f 746d .r....r....../tm
00000050: 702f 6465 636f 6465 722e 7079 da08 3c6d p/decoder.py..<m
00000060: 6f64 756c 653e 0100 0000 7302 0000 0008 odule>....s.....
00000070: 00 .
I'm not familiar with the pyc format so I cannot tell what the bits that
differ mean but maybe somebody who can, can figure this out given the
hexdump difference from above.
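For readers who are equally unfamiliar with the format: the first 16
bytes of the hexdump above are the pyc header as laid out in PEP 552, and
everything after that is a marshalled code object. Below is a minimal
Python sketch that decodes those 16 bytes. The name parse_pyc_header is
made up for illustration. Assuming the PEP 552 layout, the flags word of 3
marks this file as a checked hash-based pyc (which is what py_compile
defaults to when SOURCE_DATE_EPOCH is set), so bytes 8 to 15 hold a hash
of the source rather than an mtime and a size.

```python
import struct

# First 16 bytes of the hexdump above (identical in both variants).
HEADER = bytes.fromhex("6f0d0d0a" "03000000" "5371fe3317b6dd59")


def parse_pyc_header(header: bytes) -> dict:
    """Decode a 16-byte pyc header, assuming the PEP 552 layout (Python >= 3.7)."""
    magic, flags = struct.unpack("<4sI", header[:8])
    info = {
        # The first two magic bytes are a little-endian version-specific number.
        "magic_number": int.from_bytes(magic[:2], "little"),
        "hash_based": bool(flags & 0x1),
        "check_source": bool(flags & 0x2),
    }
    if info["hash_based"]:
        # Hash-based pyc: 8 bytes of source hash.
        info["source_hash"] = header[8:16].hex()
    else:
        # Timestamp-based pyc: 4 bytes of mtime, 4 bytes of source size.
        info["mtime"], info["source_size"] = struct.unpack("<II", header[8:16])
    return info


print(parse_pyc_header(HEADER))
```

Both variants share the same header, so the difference lies entirely in
the marshalled part that follows it.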
But it's crazy that a simple choice of variable name triggers randomness
in the pyc files, right? So to further test this theory, I patched the
python3.10 source package like this:
--- a/Lib/json/decoder.py
+++ b/Lib/json/decoder.py
@@ -67,7 +67,7 @@ def _decode_uXXXX(s, pos):
raise JSONDecodeError(msg, s, pos)
def py_scanstring(s, end, strict=True,
- _b=BACKSLASH, _m=STRINGCHUNK.match):
+ _b=BACKSLASH, m=STRINGCHUNK.match):
"""Scan the string s for a JSON string. End is the index of the
character in s after the quote that started the JSON string.
Unescapes all valid JSON string escape sequences and raises ValueError
@@ -80,7 +80,7 @@ def py_scanstring(s, end, strict=True,
_append = chunks.append
begin = end - 1
while 1:
- chunk = _m(s, end)
+ chunk = m(s, end)
if chunk is None:
raise JSONDecodeError("Unterminated string starting at", s, begin)
end = chunk.end()
This solves the problem of random unreproducibility. All pyc files in a
priority:standard chroot are now reproducible even when running the
reproducer from the top of this mail 100 times. This is why I'm tagging
this bug with "patch". I know this is just a workaround but maybe it can
be applied until the underlying problem is identified?
With the above patch, a priority:standard chroot is now finally always
bit-by-bit reproducible. I know that I also claimed that this was the
case for the patch I submitted in #1004558 but since the pyc contents
change randomly, it is very possible that I just did two tests which
happened to produce identical output and called it a day and thus never
encountered the randomly occurring difference of
decoder.cpython-310.pyc. Due to the random nature of the pyc file
contents, it's completely possible to run the reproducer 10 times and
always get the same result and only the 11th run shows the difference.
But what is so special about variables named _m? Following a hunch I
searched the python codebase and found another use of the name _m in
Lib/types.py, this time as a method name. Choosing _m here seemed
arbitrary so I tried what happens if the method name is changed from _m
to something else:
--- a/Lib/types.py
+++ b/Lib/types.py
@@ -37,8 +37,8 @@ _ag = _ag()
AsyncGeneratorType = type(_ag)
class _C:
- def _m(self): pass
-MethodType = type(_C()._m)
+ def _abc(self): pass
+MethodType = type(_C()._abc)
BuiltinFunctionType = type(len)
BuiltinMethodType = type([].append) # Same as BuiltinFunctionType
And this *also* fixes the reproducibility issue! So now there exists a
second workaround patch and it seems that somehow private names from
Lib/types.py have an influence on pyc files generated from code that
uses the same names in a completely different context?
So yes, this is a bug that probably needs to be properly fixed elsewhere
but until then, please consider applying either of the above temporary
workarounds so that a priority:standard chroot can become reproducible
again for our next stable release.
Thanks!
cheers, josch
Information forwarded
to debian-bugs-dist@lists.debian.org, Matthias Klose <doko@debian.org>:
Bug#1010368; Package src:python3.10.
(Fri, 29 Apr 2022 18:24:03 GMT).
Acknowledgement sent
to "Chris Lamb" <lamby@debian.org>:
Extra info received and forwarded to list. Copy sent to Matthias Klose <doko@debian.org>.
(Fri, 29 Apr 2022 18:24:03 GMT).
Message #10 received at 1010368@bugs.debian.org:
Hey Johannes,
> I'm not familiar with the pyc format so I cannot tell what the bits that
> differ mean but maybe somebody who can, can figure this out given the
> hexdump difference from above.
As I understand it, a .pyc file consists of a .pyc-specific header but
the bulk of the file is "just" a marshalled PyCode object. The hexdump
you referenced has the change within this marshalled part. When I
disassemble this part using the dis module, there is no "semantic"
difference between the two different .pyc files from your loop:
1 0 LOAD_NAME 0 (_m)
2 POP_TOP
4 LOAD_CONST 0 (None)
6 RETURN_VALUE
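A disassembly like the one above can be obtained roughly as follows.
This is only a sketch and not necessarily the exact steps used here:
load_code_from_pyc is a hypothetical helper, and the 16-byte header size
assumes Python 3.7 or later. The opcodes printed will vary between Python
versions.

```python
import dis
import marshal
import pathlib
import py_compile
import tempfile

PYC_HEADER_SIZE = 16  # assumption: PEP 552 header layout (Python >= 3.7)


def load_code_from_pyc(path):
    """Skip the pyc header and unmarshal the code object that follows it."""
    data = pathlib.Path(path).read_bytes()
    return marshal.loads(data[PYC_HEADER_SIZE:])


with tempfile.TemporaryDirectory() as tmp:
    # Same one-line script as in the reproducer: just the name "_m".
    src = pathlib.Path(tmp, "decoder.py")
    src.write_text("_m\n")
    pyc = py_compile.compile(
        str(src), cfile=str(src.with_suffix(".pyc")), doraise=True
    )
    code = load_code_from_pyc(pyc)

# The name table and the bytecode are what matter at execution time;
# marshal-level details such as FLAG_REF are not visible here.
print(code.co_names)
dis.dis(code)
```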
<conjectures>
This suggests that the difference is some internal implementation
detail of the marshalled PyCode object which does not affect its
execution semantics. I could imagine that some kind of string
internalisation algorithm is resulting in nondeterministic hashmap
entry numbers... or something. Still, it might not even be an
implementation detail: it could merely be uninitialised memory that is
happily skipped over by the parser.
</conjectures>
As it happens, I don't think you are the first to discover the
peculiarity of "_m" — take a look at this enigmatic comment:
https://github.com/python/cpython/issues/78903#issuecomment-1093799639
Regards,
--
,''`.
: :' : Chris Lamb
`. `'` lamby@debian.org 🍥 chris-lamb.co.uk
`-
Information forwarded
to debian-bugs-dist@lists.debian.org, Matthias Klose <doko@debian.org>:
Bug#1010368; Package src:python3.10.
(Sat, 30 Apr 2022 04:27:03 GMT).
Acknowledgement sent
to Keith Amling <me@amling2.org>:
Extra info received and forwarded to list. Copy sent to Matthias Klose <doko@debian.org>.
(Sat, 30 Apr 2022 04:27:03 GMT).
Message #15 received at 1010368@bugs.debian.org:
From skimming some of cpython's "marshal" code [1] my best guess is that
the first difference is between it thinking the `_m` string might have
another reference to it (and thus adding 0x80, or FLAG_REF to it) or
not. This seems driven by whether or not python's object for the string
has other references (it calls Py_REFCNT(v) to decide, see line 302).
I assume the difference is whether or not python has bothered to collect
some other reference to the string or not. Type "Z" is an interned
string type, TYPE_SHORT_ASCII_INTERNED, so it makes sense that
it might be shared with who knows what else. I'm assuming this stops
reproducing when you change it to a unique name since no one else will
share the reference and you'll just deterministically get no FLAG_REF.
Just my best guesses.
Keith
[1] https://github.com/python/cpython/blob/main/Python/marshal.c
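To make the 0x80 observation concrete, here is a small Python sketch that
splits the two differing bytes from the hexdump in the report (offset
0x3b) into a type code and the FLAG_REF bit. The helper name
split_type_byte is made up for illustration; the value 0x80 is FLAG_REF
as defined in Python/marshal.c.

```python
FLAG_REF = 0x80  # high bit of the type byte, as defined in Python/marshal.c


def split_type_byte(byte: int):
    """Return (type code character, whether FLAG_REF is set) for a marshal type byte."""
    return chr(byte & ~FLAG_REF), bool(byte & FLAG_REF)


# The byte at offset 0x3b in the two pyc variants from the report.
for byte in (0xDA, 0x5A):
    type_code, has_ref = split_type_byte(byte)
    print(f"0x{byte:02x}: type {type_code!r}, FLAG_REF={has_ref}")
# prints:
# 0xda: type 'Z', FLAG_REF=True
# 0x5a: type 'Z', FLAG_REF=False
```

The second difference in the hexdump (72 02 versus 72 01) appears
consistent with this reading: 0x72 is 'r' (TYPE_REF) followed by an index
into the table of flagged objects, and that index shifts down by one when
`_m` is not entered into the table.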
Information forwarded
to debian-bugs-dist@lists.debian.org, Matthias Klose <doko@debian.org>:
Bug#1010368; Package src:python3.10.
(Mon, 02 May 2022 11:33:04 GMT).
Message #18 received at 1010368@bugs.debian.org:
Control: forwarded -1 https://github.com/python/cpython/issues/92132
Hi,
On Fri, 29 Apr 2022 21:00:52 -0700 Keith Amling <me@amling2.org> wrote:
> From skimming some of cpython's "marshal" code [1] my best guess is that
> first difference is between it thinking the `_m` string might have
> another reference to it (and thus adding 0x80, or FLAG_REF to it) or
> not. This seems driven by whether or not python's object for the string
> has other references (it calls Py_REFCNT(v) to decide, see line 302).
>
> I assume the difference is whether or not python has bothered to collect
> some other reference to the string or not. Type "Z" is an interned
> string type, TYPE_SHORT_ASCII_INTERNED, which therefore makes sense that
> it might be shared with who knows what else. I'm assuming this stops
> reproducing when you change it to a unique name since no one else will share
> the reference and you'll just deterministically get no FLAG_REF.
thank you! It was indeed about that line and there exists a pull request
upstream that fixes this issue:
https://github.com/python/cpython/pull/8226
Specifically, the following patch to python3.10 in Debian seems to solve this.
I also attached a full debdiff for your convenience. Thanks!
cheers, josch
From 6c8ea7c1dacd42f3ba00440231ec0e6b1a38300d Mon Sep 17 00:00:00 2001
From: Inada Naoki <songofacandy@gmail.com>
Date: Sat, 14 Jul 2018 00:46:11 +0900
Subject: [PATCH] Use FLAG_REF always for interned strings
---
Python/marshal.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/Python/marshal.c
+++ b/Python/marshal.c
@@ -298,9 +298,14 @@ w_ref(PyObject *v, char *flag, WFILE *p)
if (p->version < 3 || p->hashtable == NULL)
return 0; /* not writing object references */
- /* if it has only one reference, it definitely isn't shared */
- if (Py_REFCNT(v) == 1)
+ /* If it has only one reference, it definitely isn't shared.
+ * But we use TYPE_REF always for interned string, to PYC file stable
+ * as possible.
+ */
+ if (Py_REFCNT(v) == 1 &&
+ !(PyUnicode_CheckExact(v) && PyUnicode_CHECK_INTERNED(v))) {
return 0;
+ }
entry = _Py_hashtable_get_entry(p->hashtable, v);
if (entry != NULL) {
[python.debdiff (text/x-diff, attachment)]
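The effect of the changed condition can be modelled in a few lines of
Python. This is a simplified sketch of the early-return check in w_ref()
only, for an object that is written for the first time; it ignores the
marshal version and hashtable checks, and the function name and
parameters are hypothetical.

```python
def first_write_gets_flag_ref(refcnt: int, interned_str: bool, patched: bool) -> bool:
    """Model whether an object seen for the first time is written with FLAG_REF."""
    if patched:
        # Patched: only skip the reference table for unshared, non-interned objects.
        skip = refcnt == 1 and not interned_str
    else:
        # Unpatched: any object with a single reference is treated as unshared.
        skip = refcnt == 1
    return not skip


# An interned string such as "_m", with one or two references at marshal time.
for refcnt in (1, 2):
    before = first_write_gets_flag_ref(refcnt, interned_str=True, patched=False)
    after = first_write_gets_flag_ref(refcnt, interned_str=True, patched=True)
    print(f"refcnt={refcnt}: unpatched FLAG_REF={before}, patched FLAG_REF={after}")
```

In this model the unpatched result depends on the reference count, while
the patched result for an interned string does not, which matches the
stable output seen with the patch applied.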
Information forwarded
to debian-bugs-dist@lists.debian.org, Matthias Klose <doko@debian.org>:
Bug#1010368; Package src:python3.10.
(Wed, 04 May 2022 06:00:02 GMT).
Message #23 received at 1010368@bugs.debian.org:
Hi,
Quoting Johannes Schauer Marin Rodrigues (2022-05-02 13:31:15)
> thank you! It was indeed about that line and there exists a pull request
> upstream that fixes this issue:
>
> https://github.com/python/cpython/pull/8226
>
> Specifically, the following patch to python3.10 in Debian seems to solve this.
> I also attached a full debdiff for your convenience. Thanks!
Note that the pull request has now been merged into main, so I guess it's
safe to backport it to 3.10:
https://github.com/python/cpython/commit/6dcfd6c5e3cb46543e82dc3f7234546adf4bb04a
Thanks!
cheers, josch
Added tag(s) fixed-upstream.
Request was from debian-bts-link@lists.debian.org
to control@bugs.debian.org.
(Thu, 05 May 2022 17:39:35 GMT).
Reply sent
to Matthias Klose <doko@debian.org>:
You have taken responsibility.
(Fri, 13 May 2022 12:39:05 GMT).
Notification sent
to Johannes Schauer Marin Rodrigues <josch@debian.org>:
Bug acknowledged by developer.
(Fri, 13 May 2022 12:39:05 GMT).
Message #30 received at 1010368-close@bugs.debian.org:
Source: python3.10
Source-Version: 3.10.4-4
Done: Matthias Klose <doko@debian.org>
We believe that the bug you reported is fixed in the latest version of
python3.10, which is due to be installed in the Debian FTP archive.
A summary of the changes between this version and the previous one is
attached.
Thank you for reporting the bug, which will now be closed. If you
have further comments please address them to 1010368@bugs.debian.org,
and the maintainer will reopen the bug report if appropriate.
Debian distribution maintenance software
pp.
Matthias Klose <doko@debian.org> (supplier of updated python3.10 package)
(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing ftpmaster@ftp-master.debian.org)
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
Format: 1.8
Date: Fri, 13 May 2022 14:08:11 +0200
Source: python3.10
Architecture: source
Version: 3.10.4-4
Distribution: unstable
Urgency: medium
Maintainer: Matthias Klose <doko@debian.org>
Changed-By: Matthias Klose <doko@debian.org>
Closes: 1010368
Changes:
python3.10 (3.10.4-4) unstable; urgency=medium
.
* Source-only upload.
* Backport gh-78214: marshal: Stabilize FLAG_REF usage. Closes: #1010368.
Checksums-Sha1:
7e8325c2bc1479aad02c278b977e695ec61f9089 3609 python3.10_3.10.4-4.dsc
b9ba4b7f5ebbc527ec37ed0f036cc1037977cbb6 217696 python3.10_3.10.4-4.debian.tar.xz
f5c6758b02d9282c8a3faaa468d2f499fac60f64 8048 python3.10_3.10.4-4_source.buildinfo
Checksums-Sha256:
42d2b36f6261fc9c85fc6c930b6c13b50cdeae5ae4ce88b8dda03f261c3cee2d 3609 python3.10_3.10.4-4.dsc
9e1306ae69558959d447f01d107a32a8b809a9cce423fc0d83b3c3ef2a19d3f6 217696 python3.10_3.10.4-4.debian.tar.xz
2527dff8f8bbfea062d1ad892c945a77fa584c8dd26e1747aafc77086d24824f 8048 python3.10_3.10.4-4_source.buildinfo
Files:
3c914503d7b58d44414a78614c2a6b7c 3609 python optional python3.10_3.10.4-4.dsc
89b16bf08393af51282c5a0bb4d26556 217696 python optional python3.10_3.10.4-4.debian.tar.xz
3ec1c681467ecb1c75d3ae0f4a1675d2 8048 python optional python3.10_3.10.4-4_source.buildinfo
-----BEGIN PGP SIGNATURE-----
iQJEBAEBCAAuFiEE1WVxuIqLuvFAv2PWvX6qYHePpvUFAmJ+S7cQHGRva29AZGVi
aWFuLm9yZwAKCRC9fqpgd4+m9WO3D/9O3iTYRl7zRquKO5I2WeJ9OnHa9D2Z91yb
luz2JPO8j/3K4XnvYVqrS6qQa9zb+4crmxz84LC5qg/fQFSX2rYymYEHHWig73Uh
LwVQimIvrN/Lskr9RAnpw/chv6C874c1cyWqUM3mdGsqut2eAfCjTfoHYd4SRW8e
3ClxPMtwzK8zYeLnpRyBVZO/rLKttmdK13DJe45UiiOr8lX4y2Tz/8z4pr8iOrGo
smk5Djh9c8SP+HzXcPl9EKXszLw/w+iKjbm1zdITm3P27u8pokn3OeqcOyeQEWH/
/KfJKhtgSDVRJVB7CbeihEnEKYSQ/YcydLRqQmreZf8utmFIOrs+5FCxqht6QuwR
rw1dLC+aVAooK4CeaaVK0PqWJ/KmPVtyP+oKU94U9Uew/DAvRvDOnfviIdJeeTwQ
yKpAP8x4y4LFXxYrF3n3rTj/wbi34235KfTm8LB/0Mr4oz357lhRkQZSEZ08kGHV
tKs40MWeQDzQ4t48KQfKPtCBPynNsFhF0J1Yifh/hl+u+tE+DY0B3wu1Y7stjq2e
esV2eSjVjY6KicT/atP9MZbkTiQFeRBwbszIl3RoS4c8L3dTsMeTOFNMmqgCoT6Y
OR0Qj8rJQg/pHV2wNwVMADfnJCsOB+nzlqKTqLxt4ReGndjJTp5Xm5Ks3+0hKXBU
SgxZxs2miA==
=YrU3
-----END PGP SIGNATURE-----
Bug archived.
Request was from Debbugs Internal Request <owner@bugs.debian.org>
to internal_control@bugs.debian.org.
(Sat, 11 Jun 2022 07:25:44 GMT).
Debian bug tracking system administrator <owner@bugs.debian.org>.
Last modified:
Wed May 17 10:41:32 2023;
Machine Name:
buxtehude