Debian Bug report logs -
#898022
diffoscope: Traceback when comparing paths with invalid unicode characters
Reported by: Chris Lamb <lamby@debian.org>
Date: Sun, 6 May 2018 00:42:01 UTC
Severity: important
Tags: patch
Found in version diffoscope/93
Fixed in version diffoscope/95
Done: Mattia Rizzolo <mattia@debian.org>
Bug is archived. No further changes may be made.
Toggle useless messages
Report forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#898022; Package diffoscope.
(Sun, 06 May 2018 00:42:04 GMT) (full text, mbox, link).
Acknowledgement sent
to Chris Lamb <lamby@debian.org>:
New Bug report received and forwarded. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Sun, 06 May 2018 00:42:04 GMT) (full text, mbox, link).
Message #5 received at submit@bugs.debian.org (full text, mbox, reply):
Package: diffoscope
Version: 93
Severity: important
Hi,
This is via <https://github.com/lamby/trydiffoscope/issues/35>, but I
think the bug is in diffoscope itself.
So, given the following test:
import os
import pytest
import subprocess
def test_invalid_filename(capsys, tmpdir):
base = str(tmpdir.mkdir('src')).encode('utf-8')
a = os.path.join(base, b'\xf0\x28\x8c\x28')
b = os.path.join(base, b'\xf0\x28\x8c\x29')
with open(a, 'w'), open(b, 'w'):
pass
subprocess.check_call(('bin/diffoscope', a, b))
I get:
____________________________ test_invalid_filename _____________________________
capsys = <_pytest.capture.CaptureFixture object at 0x7f25bd267710>
tmpdir = local('/tmp/pytest-of-lamby/pytest-43/test_invalid_filename0')
def test_invalid_filename(capsys, tmpdir):
base = str(tmpdir.mkdir('src')).encode('utf-8')
a = os.path.join(base, b'\xf0\x28\x8c\x28')
b = os.path.join(base, b'\xf0\x28\x8c\x29')
with open(a, 'w'), open(b, 'w'):
pass
> subprocess.check_call(('bin/diffoscope', a, b))
tests/test_filenames.py:34:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
popenargs = (('bin/diffoscope', b'/tmp/pytest-of-lamby/pytest-43/test_invalid_filename0/src/\xf0(\x8c(', b'/tmp/pytest-of-lamby/pytest-43/test_invalid_filename0/src/\xf0(\x8c)'),)
kwargs = {}, retcode = 2
cmd = ('bin/diffoscope', b'/tmp/pytest-of-lamby/pytest-43/test_invalid_filename0/src/\xf0(\x8c(', b'/tmp/pytest-of-lamby/pytest-43/test_invalid_filename0/src/\xf0(\x8c)')
def check_call(*popenargs, **kwargs):
"""Run command with arguments. Wait for command to complete. If
the exit code was zero then return, otherwise raise
CalledProcessError. The CalledProcessError object will have the
return code in the returncode attribute.
The arguments are the same as for the call function. Example:
check_call(["ls", "-l"])
"""
retcode = call(*popenargs, **kwargs)
if retcode:
cmd = kwargs.get("args")
if cmd is None:
cmd = popenargs[0]
> raise CalledProcessError(retcode, cmd)
E subprocess.CalledProcessError: Command '('bin/diffoscope', b'/tmp/pytest-of-lamby/pytest-43/test_invalid_filename0/src/\xf0(\x8c(', b'/tmp/pytest-of-lamby/pytest-43/test_invalid_filename0/src/\xf0(\x8c)')' returned non-zero exit status 2.
/usr/lib/python3.6/subprocess.py:291: CalledProcessError
------------------------------ Captured log setup ------------------------------
locale.py 33 DEBUG Normalising locale, timezone, etc.
__init__.py 128 DEBUG Loaded 66 comparator classes
__init__.py 128 DEBUG Loaded 66 comparator classes
----------------------------- Captured stderr call -----------------------------
Traceback (most recent call last):
File "/home/lamby/git/debian/reproducible/diffoscope/diffoscope/main.py", line 448, in main
sys.exit(run_diffoscope(parsed_args))
File "/home/lamby/git/debian/reproducible/diffoscope/diffoscope/main.py", line 420, in run_diffoscope
difference = compare_root_paths(path1, path2)
File "/home/lamby/git/debian/reproducible/diffoscope/diffoscope/comparators/utils/compare.py", line 65, in compare_root_paths
file1 = specialize(FilesystemFile(path1, container=container1))
File "/home/lamby/git/debian/reproducible/diffoscope/diffoscope/comparators/utils/specialize.py", line 49, in specialize
if try_recognize(file, cls, cls.recognizes):
File "/home/lamby/git/debian/reproducible/diffoscope/diffoscope/comparators/utils/specialize.py", line 36, in try_recognize
if not recognizes(file):
File "/home/lamby/git/debian/reproducible/diffoscope/diffoscope/comparators/debian.py", line 169, in recognizes
if not super().recognizes(file):
File "/home/lamby/git/debian/reproducible/diffoscope/diffoscope/comparators/utils/file.py", line 141, in recognizes
lambda m, t: t.search(m), file.magic_file_type),
File "/home/lamby/git/debian/reproducible/diffoscope/diffoscope/comparators/utils/file.py", line 227, in magic_file_type
self._magic_file_type = File.guess_file_type(self.path)
File "/home/lamby/git/debian/reproducible/diffoscope/diffoscope/comparators/utils/file.py", line 71, in guess_file_type
return self._mimedb.file(path)
File "/usr/lib/python3/dist-packages/magic/compat.py", line 148, in file
return Magic.__tostr(_file(self._magic_t, Magic.__tobytes(filename)))
File "/usr/lib/python3/dist-packages/magic/compat.py", line 138, in __tobytes
return bytes(b, 'utf-8')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf0' in position 58: surrogates not allowed
=========================== 1 failed in 0.75 seconds ===========================
However, I can't seem to minimally reproduce with file by itself:
import magic
filename = b'\xf0\x28\x8c\x28'
with open(filename, 'w'):
pass
m = magic.open(magic.NONE)
m.load()
m.file(filename)
Regards,
--
,''`.
: :' : Chris Lamb
`. `'` lamby@debian.org / chris-lamb.co.uk
`-
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#898022; Package diffoscope.
(Thu, 10 May 2018 15:39:03 GMT) (full text, mbox, link).
Message #8 received at 898022@bugs.debian.org (full text, mbox, reply):
[Message part 1 (text/plain, inline)]
Control: tag -1 patch
On Sun, May 06, 2018 at 01:38:58AM +0100, Chris Lamb wrote:
> This is via <https://github.com/lamby/trydiffoscope/issues/35>, but I
> think the bug is in diffoscope itself.
It is, although one could say it's a bug in argparse.
> However, I can't seem to minimally reproduce with file by itself:
>
> import magic
> filename = b'\xf0\x28\x8c\x28'
> with open(filename, 'w'):
> pass
> m = magic.open(magic.NONE)
> m.load()
> m.file(filename)
That's because argparse decodes the arguments, you can get the same
traceback by using this instead of the last command above:
|>>> m.file(filename.decode('utf-8', errors='surrogateescape'))
|Traceback (most recent call last):
| File "<stdin>", line 1, in <module>
| File "/usr/lib/python3/dist-packages/magic/compat.py", line 148, in file
| return Magic.__tostr(_file(self._magic_t, Magic.__tobytes(filename)))
| File "/usr/lib/python3/dist-packages/magic/compat.py", line 138, in __tobytes
| return bytes(b, 'utf-8')
|UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf0' in position 0: surrogates not allowed
What do you think if we try to use:
|>>> m.file(f.encode('utf-8', errors='surrogateescape'))
In that place?
I.e., the following patch would fix this bug for me.
See also:
https://www.python.org/dev/peps/pep-0383/
https://bugs.python.org/issue21416
|diff --git a/diffoscope/comparators/utils/file.py b/diffoscope/comparators/utils/file.py
|index 4fd49ac..0638ef4 100644
|--- a/diffoscope/comparators/utils/file.py
|+++ b/diffoscope/comparators/utils/file.py
|@@ -68,7 +68,7 @@ class File(object, metaclass=abc.ABCMeta):
| if not hasattr(self, '_mimedb'):
| self._mimedb = magic.open(magic.NONE)
| self._mimedb.load()
|- return self._mimedb.file(path)
|+ return self._mimedb.file(path.encode('utf-8', errors='surrogateescape'))
|
| @classmethod
| def guess_encoding(self, path):
Do you think this would be fine?
--
regards,
Mattia Rizzolo
GPG Key: 66AE 2B4A FCCF 3F52 DA18 4D18 4B04 3FCD B944 4540 .''`.
more about me: https://mapreri.org : :' :
Launchpad user: https://launchpad.net/~mapreri `. `'`
Debian QA page: https://qa.debian.org/developer.php?login=mattia `-
[signature.asc (application/pgp-signature, inline)]
Added tag(s) patch.
Request was from Mattia Rizzolo <mattia@debian.org>
to 898022-submit@bugs.debian.org.
(Thu, 10 May 2018 15:39:03 GMT) (full text, mbox, link).
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#898022; Package diffoscope.
(Thu, 10 May 2018 15:45:05 GMT) (full text, mbox, link).
Acknowledgement sent
to Chris Lamb <lamby@debian.org>:
Extra info received and forwarded to list. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Thu, 10 May 2018 15:45:05 GMT) (full text, mbox, link).
Message #15 received at 898022@bugs.debian.org (full text, mbox, reply):
Hi Mattia,
> That's because argparse decodes the arguments
Ahh! Nice spot.
> Do you think this would be fine?
Whilst this works, would it not be better if we could use bytes for
filenames throughout? I mean, AIUI there is no assumption that
filesystems need to have any form of valid encoding whatsoever, let
alone UTF-8.
However, somewhat happy to see this in diffoscope as it certainly
improves the current state of affairs. If you do commit it, please
include my testcase (or something based on it) that I added in:
https://bugs.debian.org/898022#5
Best wishes,
--
,''`.
: :' : Chris Lamb
`. `'` lamby@debian.org / chris-lamb.co.uk
`-
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#898022; Package diffoscope.
(Thu, 10 May 2018 16:03:06 GMT) (full text, mbox, link).
Message #18 received at 898022@bugs.debian.org (full text, mbox, reply):
[Message part 1 (text/plain, inline)]
On Thu, May 10, 2018 at 04:43:37PM +0100, Chris Lamb wrote:
> > Do you think this would be fine?
>
> Whilst this works, would it not be better if we could use bytes for
> filenames throughout? I mean, AIUI there is no assumption that
> filesystems need to have any form of valid encoding whatsoever, let
> alone UTF-8.
That was my initial idea as well, but apparently the Python developers
are of different opinion. Check out the PEP I linked in my previous
email: https://www.python.org/dev/peps/pep-0383/
Together with the argparse bug I also linked:
https://bugs.python.org/issue21416 - apparently it's "hard" (more like
impossible?) to get bytes from the CLI...
I believe that, like that bug is showing, we should just specify
type=os.fsencode # https://docs.python.org/3/library/os.html#os.fsencode
in the parser.add_argument() calls using a filename (to make sure
argparse doesn't change output), and then re-encode them before passing
them to functions that can't handle surrogate encoded stuff like this
magic module.
> However, somewhat happy to see this in diffoscope as it certainly
> improves the current state of affairs. If you do commit it, please
> include my testcase (or something based on it) that I added in:
>
> https://bugs.debian.org/898022#5
Of course.
--
regards,
Mattia Rizzolo
GPG Key: 66AE 2B4A FCCF 3F52 DA18 4D18 4B04 3FCD B944 4540 .''`.
more about me: https://mapreri.org : :' :
Launchpad user: https://launchpad.net/~mapreri `. `'`
Debian QA page: https://qa.debian.org/developer.php?login=mattia `-
[signature.asc (application/pgp-signature, inline)]
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#898022; Package diffoscope.
(Thu, 10 May 2018 16:09:03 GMT) (full text, mbox, link).
Acknowledgement sent
to Chris Lamb <lamby@debian.org>:
Extra info received and forwarded to list. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Thu, 10 May 2018 16:09:03 GMT) (full text, mbox, link).
Message #23 received at 898022@bugs.debian.org (full text, mbox, reply):
Hi Mattia,
> That was my initial idea as well, but apparently the Python developers
> are of different opinion. Check out the PEP I linked in my previous
> email: https://www.python.org/dev/peps/pep-0383/
Ah, thanks I merely skimmed this one and not the the issue. Hey, the Python developers could be wrong — they didn't like the SOURCE_DATE_EPOCH patches!
> > please include my testcase (or something based on it) that I added in:
> >
> > https://bugs.debian.org/898022#5
>
> Of course.
(Cool, just checking you saw it as, of course, there was no initial
"patch" tag)
Nice work :)
Regards,
--
,''`.
: :' : Chris Lamb
`. `'` lamby@debian.org / chris-lamb.co.uk
`-
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#898022; Package diffoscope.
(Thu, 10 May 2018 19:15:03 GMT) (full text, mbox, link).
Message #26 received at 898022@bugs.debian.org (full text, mbox, reply):
[Message part 1 (text/plain, inline)]
Control: clone -1 -2
Control: tag -2 - patch
Control: severity -2 wishlist
Control: retitle -2 diffoscope: should handle the filenames as bytes
On Thu, May 10, 2018 at 06:01:50PM +0200, Mattia Rizzolo wrote:
> I believe that, like that bug is showing, we should just specify
> type=os.fsencode # https://docs.python.org/3/library/os.html#os.fsencode
> in the parser.add_argument() calls using a filename (to make sure
> argparse doesn't change output)
This indeed makes the filename be a .bytes through all the code, and
therefore requires a bunch of changes pretty much everywhere in the
comparators (str.endsiwth → bytes.endswith and with them all the
comparing strings needs to be become bytes as well).
I'm cloning this bug to keep track of this thing, as I'm not going to do
that now.
Initial patch to start (from there just try run it and it will crash)
|--- a/diffoscope/main.py
|+++ b/diffoscope/main.py
|@@ -74,11 +74,13 @@ def create_parser():
| parser = argparse.ArgumentParser(
| description='Calculate differences between two files or directories',
| add_help=False, formatter_class=HelpFormatter)
|- parser.add_argument('path1', nargs='?', help='First file or directory to '
|- 'compare. If omitted, tries to read a diffoscope diff from stdin.')
|- parser.add_argument('path2', nargs='?', help='Second file or directory to '
|- 'compare. If omitted, no comparison is done but instead we read a '
|- 'diffoscope diff from path1 and will output this in the formats '
|+ parser.add_argument('path1', nargs='?', type=os.fsencode,
|+ help='First file or directory to compare. If omitted, '
|+ 'tries to read a diffoscope diff from stdin.')
|+ parser.add_argument('path2', nargs='?', type=os.fsencode,
|+ help='Second file or directory to compare. If omitted, '
|+ 'no comparison is done but instead we read a diffoscope '
|+ 'diff from path1 and will output this in the formats '
| 'specified by the rest of the command line.')
| parser.add_argument('--debug', action='store_true',
| default=False, help='Display debug messages')
|
Also, I believe doing that change also requires changing the handling of
the filenames in the Container objects, thanks to diffoscope's
recursivity. So really, a quite big change.
In the meantime to fix #898022 I'll apply the patch I posted earlier.
--
regards,
Mattia Rizzolo
GPG Key: 66AE 2B4A FCCF 3F52 DA18 4D18 4B04 3FCD B944 4540 .''`.
more about me: https://mapreri.org : :' :
Launchpad user: https://launchpad.net/~mapreri `. `'`
Debian QA page: https://qa.debian.org/developer.php?login=mattia `-
[signature.asc (application/pgp-signature, inline)]
Bug 898022 cloned as bug 898361
Request was from Mattia Rizzolo <mattia@debian.org>
to 898022-submit@bugs.debian.org.
(Thu, 10 May 2018 19:15:03 GMT) (full text, mbox, link).
Reply sent
to Mattia Rizzolo <mattia@debian.org>:
You have taken responsibility.
(Sun, 20 May 2018 16:39:10 GMT) (full text, mbox, link).
Notification sent
to Chris Lamb <lamby@debian.org>:
Bug acknowledged by developer.
(Sun, 20 May 2018 16:39:10 GMT) (full text, mbox, link).
Message #33 received at 898022-close@bugs.debian.org (full text, mbox, reply):
Source: diffoscope
Source-Version: 95
We believe that the bug you reported is fixed in the latest version of
diffoscope, which is due to be installed in the Debian FTP archive.
A summary of the changes between this version and the previous one is
attached.
Thank you for reporting the bug, which will now be closed. If you
have further comments please address them to 898022@bugs.debian.org,
and the maintainer will reopen the bug report if appropriate.
Debian distribution maintenance software
pp.
Mattia Rizzolo <mattia@debian.org> (supplier of updated diffoscope package)
(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing ftpmaster@ftp-master.debian.org)
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
Format: 1.8
Date: Sun, 20 May 2018 17:46:57 +0200
Source: diffoscope
Binary: diffoscope
Architecture: source
Version: 95
Distribution: unstable
Urgency: medium
Maintainer: Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>
Changed-By: Mattia Rizzolo <mattia@debian.org>
Description:
diffoscope - in-depth comparison of files, archives, and directories
Closes: 872826 898022 898683
Changes:
diffoscope (95) unstable; urgency=medium
.
[ Mattia Rizzolo ]
* tests:
+ test_binary: Don't capture unused output from subprocess.
+ test_git: Fix test failure on FreeBSD. Closes: #872826
Thanks to Ximin Luo <infinity0@debian.org> for the initial patch
* Fix handling of filesnames with non-unicode chars. Closes: #898022
* diff: Use bytes as much as possible, to prevent possible encoding issues.
* d/control: Make the dependency on python3-distutils an alternative on the
previous versions of libpython3.*-stdlib that used to ship it.
Closes: #898683
* d/watch: Update URL to the new archive.
.
[ Chris Lamb ]
* The Git repository has been migrated to salsa, update all the references.
* Drop extra whitespace in supported file format output.
Checksums-Sha1:
782c2ae2fd11e557de59a3b369077879c40d6785 3448 diffoscope_95.dsc
6b3e0c05cfb94459de1169ae612bdcdf36bad6a4 9244552 diffoscope_95.tar.xz
cec82d83e5852e0434b4510836b184b5fd285bd5 21700 diffoscope_95_amd64.buildinfo
Checksums-Sha256:
1def6a6a0dc17e0291886eee6249597c685027b05ab48af6cc412c6ab192d95f 3448 diffoscope_95.dsc
7991d7680d56315e99efa284cb278915b74f3646e2ae10fc10389c318cf7711c 9244552 diffoscope_95.tar.xz
9323e37a2cbc9cc14b0f0dd4deaa79c29cddfe9612d70ae5016a34a16319af78 21700 diffoscope_95_amd64.buildinfo
Files:
d2e620a8779cbed99cb1c6444163e100 3448 devel optional diffoscope_95.dsc
d8068a84700c902dd7cc2c5fb6f9e8bc 9244552 devel optional diffoscope_95.tar.xz
1f42fe134ee41e3857354614e860ddf4 21700 devel optional diffoscope_95_amd64.buildinfo
-----BEGIN PGP SIGNATURE-----
iQIzBAEBCgAdFiEEi3hoeGwz5cZMTQpICBa54Yx2K60FAlsBmnMACgkQCBa54Yx2
K62qeQ/+OFubz/aQeKo0zkDAWaOSXAHbTCJdz6LE13KB6qHrMQl9G4+s5wpG2DDb
t6DMiHKvQcg4rVSuSvFWIhyIt17Y2wgShLTT4VgN0KHldNjB9s7T9xQHQScNG++o
9HM2OzprRmgR2P+1E+blnJNHtEEt/TxsCmBYgOrdqq7ACrAFmoRPQWI8nZK0qO3B
o2VRl4v/GQNagXiTHbHn/63pQXCrK9ZQ0WV+U0BDSaYYa98NpIk5PebvyPi+Itsc
QbCkFxFUq07lJQlFqlL+LJUFeg0EnLzkMSRxCfp13rvusFvgAe3HBxQiv4hlTJpF
x9gL9tdNB1UiDIbPn7z5vxHS+FDrBDm7D2hXSYjdOmSbszk79yrDxjLllRAijGoZ
DH07LZmF+G9U1S74IzLt4e7fmAQb5BzX6/Gjvkv9UVbhbNrdn2CjEz7baAL8wAh7
kgCabsBmjkqnSK5iTj+0MCbpc8CZVNdF5hZRx4XlE5pv/yIKws9kF0gscit7m4Ln
FHZdDONf8qTzrq8vYqbVS+RP5KAlE2K6mwP8+qKY9g5vZaZOPVQ9wlT88Wi7YYrG
fLeVHCtfjCxO8ecCE6WLu9/0cmvVSECAYjghhYmDpiMpzyAh5/VyRXop9dm4mCI3
abBQ2dXIvImnLNKx8axAofajeGLzzZP7cgj0TJzaBufKg21SKY0=
=DK1/
-----END PGP SIGNATURE-----
Bug archived.
Request was from Debbugs Internal Request <owner@bugs.debian.org>
to internal_control@bugs.debian.org.
(Wed, 20 Jun 2018 07:26:10 GMT) (full text, mbox, link).
Send a report that this bug log contains spam.
Debian bug tracking system administrator <owner@bugs.debian.org>.
Last modified:
Wed May 17 11:52:47 2023;
Machine Name:
buxtehude
Debian Bug tracking system
Debbugs is free software and licensed under the terms of the GNU
Public License version 2. The current version can be obtained
from https://bugs.debian.org/debbugs-source/.
Copyright © 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson,
2005-2017 Don Armstrong, and many other contributors.