Debian Bug report logs -
#909122
diffoscope: MemoryError when comparing big ISO images
Report forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#909122; Package diffoscope.
(Tue, 18 Sep 2018 18:21:04 GMT) (full text, mbox, link).
Acknowledgement sent
to Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>:
New Bug report received and forwarded. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Tue, 18 Sep 2018 18:21:04 GMT) (full text, mbox, link).
Message #5 received at submit@bugs.debian.org (full text, mbox, reply):
Package: diffoscope
Version: 101
Severity: normal
Dear Maintainer,
When comparing two 4.5 GB ISO images, diffoscope tries to load each of
them fully into memory, which fails with a MemoryError in the JSON
comparator:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/diffoscope/main.py", line 470, in main
    sys.exit(run_diffoscope(parsed_args))
  File "/usr/lib/python3/dist-packages/diffoscope/main.py", line 442, in run_diffoscope
    difference = compare_root_paths(path1, path2)
  File "/usr/lib/python3/dist-packages/diffoscope/comparators/utils/compare.py", line 65, in compare_root_paths
    file1 = specialize(FilesystemFile(path1, container=container1))
  File "/usr/lib/python3/dist-packages/diffoscope/comparators/utils/specialize.py", line 49, in specialize
    if try_recognize(file, cls, cls.recognizes):
  File "/usr/lib/python3/dist-packages/diffoscope/comparators/utils/specialize.py", line 36, in try_recognize
    if not recognizes(file):
  File "/usr/lib/python3/dist-packages/diffoscope/comparators/json.py", line 52, in recognizes
    f.read().decode('utf-8', errors='ignore'),
MemoryError
Obviously ISO file is not JSON.
The whole thing could be avoided if earlier check (if initial 10 chars
contains '[' or '{') would be executed not only on "text" files.
Any reasons for that "is_text" there? Alternatively, if is_text=False,
maybe the function should return False early?
I can provide a patch for either option, but I'd like to know which one
of them you prefer.
The JSONFile.recognizes function, for context:
    @classmethod
    def recognizes(cls, file):
        with open(file.path, 'rb') as f:
            # Try fuzzy matching for JSON files
            is_text = any(
                file.magic_file_type.startswith(x)
                for x in ('ASCII text', 'UTF-8 Unicode text'),
            )
            if is_text and not file.name.endswith('.json'):
                buf = f.read(10)
                if not any(x in buf for x in b'{['):
                    return False
                f.seek(0)

            try:
                file.parsed = json.loads(
                    f.read().decode('utf-8', errors='ignore'),
                    object_pairs_hook=collections.OrderedDict,
                )
            except ValueError:
                return False

        return True
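
A side note on why the traceback above reaches main() even though the function has an `except ValueError`: MemoryError is not a subclass of ValueError, so that handler only covers decode failures. A minimal sketch of this; `recognizes_like` and the two loader functions are hypothetical stand-ins, not diffoscope code:

```python
import json


def recognizes_like(load):
    """Mimic the try/except shape of the function above around `load()`."""
    try:
        load()
    except ValueError:
        # json.JSONDecodeError is a ValueError subclass, so bad JSON lands here.
        return False
    return True


def load_bad_json():
    # Invalid JSON: raises json.JSONDecodeError.
    return json.loads("not json")


def load_too_big():
    # Stand-in for f.read() failing on a multi-gigabyte file.
    raise MemoryError


print(recognizes_like(load_bad_json))  # False

try:
    recognizes_like(load_too_big)
except MemoryError:
    print("MemoryError escaped the ValueError handler")
```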
-- System Information:
Debian Release: buster/sid
APT prefers testing
APT policy: (500, 'testing')
Architecture: amd64 (x86_64)
Kernel: Linux 4.14.67-1.pvops.qubes.x86_64 (SMP w/8 CPU cores)
Locale: LANG=C, LC_CTYPE=C (charmap=ANSI_X3.4-1968), LANGUAGE=C (charmap=ANSI_X3.4-1968)
Shell: /bin/sh linked to /usr/bin/dash
Init: unable to detect
Versions of packages diffoscope depends on:
ii libpython3.6-stdlib 3.6.6-1
ii python3 3.6.5-3
ii python3-distro 1.3.0-1
ii python3-distutils 3.6.6-1
ii python3-libarchive-c 2.1-3.1
ii python3-magic 2:0.4.15-2
ii python3-pkg-resources 40.2.0-1
Versions of packages diffoscope recommends:
ii abootimg 0.6-1+b2
ii acl 2.2.52-3+b1
pn apktool <none>
ii binutils-multiarch 2.31.1-5
ii bzip2 1.0.6-9
ii caca-utils 0.99.beta19-2+b3
ii colord 1.3.3-2
ii db-util 5.3.1
ii default-jdk-headless 2:1.10-68
ii device-tree-compiler 1.4.7-3
ii docx2txt 1.4-1
ii e2fsprogs 1.44.4-2
ii enjarify 1:1.0.3-4
ii fontforge-extras 0.3-4
ii fp-utils 3.0.4+dfsg-20
ii fp-utils-3.0.4 [fp-utils] 3.0.4+dfsg-20
ii genisoimage 9:1.1.11-3+b2
ii gettext 0.19.8.1-7
ii ghc 8.2.2-4
ii ghostscript 9.25~dfsg-2
ii giflib-tools 5.1.4-3
ii gnumeric 1.12.41-1
ii gnupg 2.2.10-1
ii imagemagick 8:6.9.10.8+dfsg-1
ii imagemagick-6.q16 [imagemagick] 8:6.9.10.8+dfsg-1
ii jsbeautifier 1.6.4-7
ii libarchive-tools 3.2.2-5
ii llvm 1:6.0-43
ii lz4 1.8.2-1
ii mono-utils 4.6.2.7+dfsg-1
ii odt2txt 0.5-1+b2
pn oggvideotools <none>
ii openssh-client 1:7.8p1-1
ii pgpdump 0.33-1
ii poppler-utils 0.63.0-2
ii procyon-decompiler 0.5.32-4
ii python3-argcomplete 1.8.1-1
ii python3-binwalk 2.1.2~git20180830+dfsg1-1
ii python3-debian 0.1.33
ii python3-defusedxml 0.5.0-1
ii python3-guestfs 1:1.38.4-1
ii python3-jsondiff 1.1.1-2
ii python3-progressbar 2.3-4
ii python3-pyxattr 0.6.0-2+b2
ii python3-tlsh 3.4.4+20151206-1+b4
ii r-base-core 3.5.1-1+b1
ii rpm2cpio 4.14.1+dfsg1-4
ii sng 1.1.0-1+b1
ii sqlite3 3.24.0-1
ii squashfs-tools 1:4.3-6
ii tcpdump 4.9.2-3
ii unzip 6.0-21
ii vim-common 2:8.1.0320-1
ii xmlbeans 2.6.0+dfsg-4
ii xxd 2:8.1.0320-1
ii xz-utils 5.2.2-1.3
Versions of packages diffoscope suggests:
ii libjs-jquery 3.2.1-1
-- no debconf information
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#909122; Package diffoscope.
(Tue, 18 Sep 2018 18:39:05 GMT) (full text, mbox, link).
Acknowledgement sent
to Chris Lamb <lamby@debian.org>:
Extra info received and forwarded to list. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Tue, 18 Sep 2018 18:39:05 GMT) (full text, mbox, link).
Message #10 received at submit@bugs.debian.org (full text, mbox, reply):
Hi Marek,
> The whole thing could be avoided if earlier check (if initial 10 chars
> contains '[' or '{') would be executed not only on "text" files.
Indeed. The origins of this appear to be:
https://salsa.debian.org/reproducible-builds/diffoscope/commit/2a758d3d0205e934ed6dffebb5d6462b00fe590d
> I can provide a patch for either option, but I'd like to know which one
> of them you prefer.
I'm not quite sure, but I would probably go with dropping the whole
`is_text` thing and keeping everything else the same. Are you happy and
comfortable creating an MR for this? Thanks in advance…
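
For reference, the option preferred here (drop `is_text`, keep the rest) might look roughly like the sketch below. It is not necessarily the patch that was eventually merged. `recognizes_json` is a hypothetical standalone function that takes a path and a file name instead of diffoscope's file object, and it returns only a boolean rather than storing the parsed result.

```python
import collections
import json


def recognizes_json(path, name):
    """Return True if the file at `path` parses as JSON (illustrative only)."""
    with open(path, 'rb') as f:
        # Cheap pre-check for every file not named *.json, text or not:
        # look for an opening bracket or brace in the first 10 bytes.
        if not name.endswith('.json'):
            buf = f.read(10)
            if not any(x in buf for x in b'{['):
                return False
            f.seek(0)

        try:
            json.loads(
                f.read().decode('utf-8', errors='ignore'),
                object_pairs_hook=collections.OrderedDict,
            )
        except ValueError:
            return False

    return True
```

With this shape, a large binary file whose first 10 bytes contain neither `{` nor `[` is rejected after a 10-byte read. Files named `*.json`, and files that pass the pre-check, are still read in full.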
> Obviously ISO file is not JSON.
♥
Best wishes,
--
,''`.
: :' : Chris Lamb
`. `'` lamby@debian.org / chris-lamb.co.uk
`-
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#909122; Package diffoscope.
(Tue, 18 Sep 2018 18:39:12 GMT) (full text, mbox, link).
Acknowledgement sent
to Chris Lamb <lamby@debian.org>:
Extra info received and forwarded to list. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Tue, 18 Sep 2018 18:39:12 GMT) (full text, mbox, link).
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#909122; Package diffoscope.
(Tue, 18 Sep 2018 18:42:02 GMT) (full text, mbox, link).
Acknowledgement sent
to Daniel Shahaf <danielsh@apache.org>:
Extra info received and forwarded to list. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Tue, 18 Sep 2018 18:42:02 GMT) (full text, mbox, link).
Message #20 received at 909122@bugs.debian.org (full text, mbox, reply):
Marek Marczykowski-Górecki wrote on Tue, Sep 18, 2018 at 20:17:03 +0200:
>   File "/usr/lib/python3/dist-packages/diffoscope/comparators/json.py", line 52, in recognizes
>     f.read().decode('utf-8', errors='ignore'),
> MemoryError
>
> The JSONFile.recognizes function, for context:
>
>     @classmethod
>     def recognizes(cls, file):
>         with open(file.path, 'rb') as f:
>             # Try fuzzy matching for JSON files
>             is_text = any(
>                 file.magic_file_type.startswith(x)
>                 for x in ('ASCII text', 'UTF-8 Unicode text'),
>             )
>             if is_text and not file.name.endswith('.json'):
>                 buf = f.read(10)
>                 if not any(x in buf for x in b'{['):
>                     return False
>                 f.seek(0)
>
>             try:
>                 file.parsed = json.loads(
>                     f.read().decode('utf-8', errors='ignore'),
>                     object_pairs_hook=collections.OrderedDict,
>                 )
Slurping the file to a string object is an antipattern. Instead of
using f.read() to create a 4.5GB string, it would be better to use
json.load(f) to read the file incrementally. That should raise an
exception rather quickly.
>             except ValueError:
>                 return False
>
>         return True
> Obviously ISO file is not JSON.
> The whole thing could be avoided if earlier check (if initial 10 chars
> contains '[' or '{') would be executed not only on "text" files.
> Any reasons for that "is_text" there? Alternatively, if is_text=False,
> maybe the function should return False early?
>
> I can provide a patch for either option, but I'd like to know which one
> of them you prefer.
No opinion on these.
Thanks for including the function in the report.
Cheers,
Daniel
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#909122; Package diffoscope.
(Tue, 18 Sep 2018 19:03:03 GMT) (full text, mbox, link).
Acknowledgement sent
to Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>:
Extra info received and forwarded to list. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Tue, 18 Sep 2018 19:03:03 GMT) (full text, mbox, link).
Message #25 received at 909122@bugs.debian.org (full text, mbox, reply):
On Tue, Sep 18, 2018 at 06:39:28PM +0000, Daniel Shahaf wrote:
> Slurping the file to a string object is an antipattern. Instead of
> using f.read() to create a 4.5GB string, it would be better to use
> json.load(f) to read the file incrementally. That should raise an
> exception rather quickly.
That may be even better! Expect MR in a moment. Should I include some
magic text in commit message to link it with this bug?
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#909122; Package diffoscope.
(Tue, 18 Sep 2018 19:12:04 GMT) (full text, mbox, link).
Acknowledgement sent
to Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>:
Extra info received and forwarded to list. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Tue, 18 Sep 2018 19:12:04 GMT) (full text, mbox, link).
Message #30 received at 909122@bugs.debian.org (full text, mbox, reply):
On Tue, Sep 18, 2018 at 09:00:11PM +0200, Marek Marczykowski-Górecki wrote:
> On Tue, Sep 18, 2018 at 06:39:28PM +0000, Daniel Shahaf wrote:
> > Slurping the file to a string object is an antipattern. Instead of
> > using f.read() to create a 4.5GB string, it would be better to use
> > json.load(f) to read the file incrementally. That should raise an
> > exception rather quickly.
>
> That may be even better! Expect MR in a moment. Should I include some
> magic text in commit message to link it with this bug?
Nope, json.load:
    def load(fp, *, cls=None, object_hook=None, parse_float=None,
    ...
        return loads(fp.read(),
    ...
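
The same thing can be observed without reading the standard library source. A small sketch, assuming CPython's `json` module; `RecordingReader` is a hypothetical helper that logs the size argument of each `read()` call:

```python
import io
import json


class RecordingReader:
    """File-like wrapper that records the size passed to each read() call."""

    def __init__(self, text):
        self._inner = io.StringIO(text)
        self.read_sizes = []

    def read(self, size=-1):
        self.read_sizes.append(size)
        return self._inner.read(size)


# Valid input: json.load() issues a single unbounded read().
good = RecordingReader('{"a": 1}')
print(json.load(good), good.read_sizes)

# Invalid input: the error is raised only after the whole stream was read.
bad = RecordingReader('x' * 1000)
try:
    json.load(bad)
except ValueError:
    print("rejected after reads:", bad.read_sizes)
```

In both cases the recorded list holds a single entry of -1, i.e. one read of the entire stream, so switching from json.loads(f.read()) to json.load(f) does not reduce peak memory.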
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#909122; Package diffoscope.
(Tue, 18 Sep 2018 19:15:07 GMT) (full text, mbox, link).
Acknowledgement sent
to Daniel Shahaf <danielsh@apache.org>:
Extra info received and forwarded to list. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Tue, 18 Sep 2018 19:15:07 GMT) (full text, mbox, link).
Message #35 received at 909122@bugs.debian.org (full text, mbox, reply):
Marek Marczykowski-Górecki wrote on Tue, 18 Sep 2018 21:09 +0200:
> On Tue, Sep 18, 2018 at 09:00:11PM +0200, Marek Marczykowski-Górecki wrote:
> > On Tue, Sep 18, 2018 at 06:39:28PM +0000, Daniel Shahaf wrote:
> > > Slurping the file to a string object is an antipattern. Instead of
> > > using f.read() to create a 4.5GB string, it would be better to use
> > > json.load(f) to read the file incrementally. That should raise an
> > > exception rather quickly.
> >
> > That may be even better! Expect MR in a moment. Should I include some
> > magic text in commit message to link it with this bug?
>
> Nope, json.load:
>
>     def load(fp, *, cls=None, object_hook=None, parse_float=None,
>     ...
>         return loads(fp.read(),
>     ...
I stand corrected...
... and surprised.
Cheers,
Daniel
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#909122; Package diffoscope.
(Tue, 18 Sep 2018 19:15:09 GMT) (full text, mbox, link).
Acknowledgement sent
to Chris Lamb <lamby@debian.org>:
Extra info received and forwarded to list. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Tue, 18 Sep 2018 19:15:09 GMT) (full text, mbox, link).
Message #40 received at 909122@bugs.debian.org (full text, mbox, reply):
Hi Marek,
> magic text in commit message to link it with this bug?
Sure:
"Blah blah blah. (Closes: #909122)"
Thanks!
Best wishes,
--
,''`.
: :' : Chris Lamb
`. `'` lamby@debian.org / chris-lamb.co.uk
`-
Reply sent
to Mattia Rizzolo <mattia@debian.org>:
You have taken responsibility.
(Sun, 23 Sep 2018 09:21:12 GMT) (full text, mbox, link).
Notification sent
to Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>:
Bug acknowledged by developer.
(Sun, 23 Sep 2018 09:21:12 GMT) (full text, mbox, link).
Message #45 received at 909122-close@bugs.debian.org (full text, mbox, reply):
Source: diffoscope
Source-Version: 102
We believe that the bug you reported is fixed in the latest version of
diffoscope, which is due to be installed in the Debian FTP archive.
A summary of the changes between this version and the previous one is
attached.
Thank you for reporting the bug, which will now be closed. If you
have further comments please address them to 909122@bugs.debian.org,
and the maintainer will reopen the bug report if appropriate.
Debian distribution maintenance software
pp.
Mattia Rizzolo <mattia@debian.org> (supplier of updated diffoscope package)
(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing ftpmaster@ftp-master.debian.org)
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
Format: 1.8
Date: Sun, 23 Sep 2018 10:43:40 +0200
Source: diffoscope
Binary: diffoscope
Architecture: source
Version: 102
Distribution: unstable
Urgency: medium
Maintainer: Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>
Changed-By: Mattia Rizzolo <mattia@debian.org>
Description:
diffoscope - in-depth comparison of files, archives, and directories
Closes: 908900 909122
Changes:
diffoscope (102) unstable; urgency=medium
.
[ Chris Lamb ]
* Fix tests under colord >= 1.4.3. Closes: #908900
.
[ Xavier Briand ]
* Add an "Add a comparator" section in CONTRIBUTING. MR: !9
.
[ Mattia Rizzolo ]
* debian: Use the new debhelper-compat(=11) build dep and drop d/compat.
.
[ Marek Marczykowski-Górecki ]
* comparators/json: Try fuzzy matching for non-text files too.
This avoids loading very large file just to discover they aren't JSON.
Closes: #909122
Checksums-Sha1:
bb80029110f656ed86241bdcb4266bb77749a9c1 4072 diffoscope_102.dsc
92246250370e173e5b97de98f336214b1e7eee5e 9252320 diffoscope_102.tar.xz
0d51edabd976683027c534cd5bb604a827acec21 21640 diffoscope_102_amd64.buildinfo
Checksums-Sha256:
882c29062247ec93d39e5c5180a5539d62bec9c8f5259fa215e225aa0b1ddda2 4072 diffoscope_102.dsc
ce3f3ef52fc1fea17b31a890c9d9d3b49951e92501f515922f0f756ef64c59cb 9252320 diffoscope_102.tar.xz
02ec4740f9992630affb1903e6b791a79c0146c57277e055897377d25c0d246e 21640 diffoscope_102_amd64.buildinfo
Files:
5410095debfd02eddf4b77e4621c98ad 4072 devel optional diffoscope_102.dsc
75b3c90e33dae8da49cc33c784a7c7aa 9252320 devel optional diffoscope_102.tar.xz
ba4a04c653164ca0d13e1d36c2e9d00d 21640 devel optional diffoscope_102_amd64.buildinfo
-----BEGIN PGP SIGNATURE-----
iQIzBAEBCgAdFiEEi3hoeGwz5cZMTQpICBa54Yx2K60FAlunVkIACgkQCBa54Yx2
K63JbhAAgzCLLcAp9PomUHUsqnMlmw5JROdsSLb1iCAT6YklAxdWijl4DefGyNFH
zH8gvWgmSyO2kCIqnN6sAiDGJIwkb4CcaKhn6KaiNkucwBRa6kVe7/+2VPlBnGeF
vA76pKogLuvwgRUI1OicKC+5zaWegnlQcUDOKKXbM3srqu1+QvHEEO4Q3F5gXUOi
Vf+SKIYYwvUtY8hy/9ibAA+rHXyefVhJn88Xn335cele9X8soP/t0k1FFJ1NdgJd
Lpi4//w8P+M4eL6ItPVumQWENtekEXrohC5W+aUS6AQJF0/Gb119OGHbXIlYqJvj
sVUxiY4qTF4ETQOBeuBUJAT33Fa3vRH2L2XvVqJLJDlrDTCjBZy3QHkaEapmSDe6
BskX+YYNinK4t1CxRSU0I25JK7PuA67X4jK9NZt8rtMhUK1iNV3+LKxf5Dwf7qC6
o43vAoKfFHFAqwHdBFqHLyAmZmkQ0aBuMPAJ2LyBM3iq1K+Ymv4mDHHAJqfnRk7P
SdkyBg6UZgkOd54UX0vTxT2GSYdAJth6BS/h7ueuNR0L1FrH8qVpJYOTWDoK1H+m
CBX+MPxQmv2PjQzWcMV8YygiCIHEWiz2fvi5j6xaKDaeimgofZkms0IpFSGESMdO
hPZyQ/BlAf34dWcfvdKGOmWK0743ctisSTKRdR2Ba2UD4eqSELE=
=2mIQ
-----END PGP SIGNATURE-----
Bug archived.
Request was from Debbugs Internal Request <owner@bugs.debian.org>
to internal_control@bugs.debian.org.
(Tue, 23 Oct 2018 07:26:29 GMT) (full text, mbox, link).
Debian bug tracking system administrator <owner@bugs.debian.org>.
Last modified:
Wed May 17 12:54:42 2023;