Debian Bug report logs -
#848049
diffoscope: Add detection of order-only differences in plain text formats
Reported by: Emanuel Bronshtein <e3amn2l@gmx.com>
Date: Tue, 13 Dec 2016 16:21:02 UTC
Severity: wishlist
Fixed in version diffoscope/66
Done: Chris Lamb <lamby@debian.org>
Bug is archived. No further changes may be made.
Report forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#848049; Package diffoscope.
(Tue, 13 Dec 2016 16:21:04 GMT) (full text, mbox, link).
Acknowledgement sent
to Emanuel Bronshtein <e3amn2l@gmx.com>:
New Bug report received and forwarded. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Tue, 13 Dec 2016 16:21:04 GMT) (full text, mbox, link).
Message #5 received at submit@bugs.debian.org (full text, mbox, reply):
Package: diffoscope
Severity: wishlist
Please add detection of order-only variations in plain text files (TXT/HTML files),
similar to the recent addition of order-only difference detection in JSON output in:
https://anonscm.debian.org/cgit/reproducible/diffoscope.git/commit/?id=8faf040767b892e1cf19d4ec2965a29301b9ae40
This would allow easier and faster classification of issues. For example, the issues below:
random_order_in_md5sums
clilibs_line_order
varying_ordering_in_data_tar_gz_or_control_tar_gz
random_order_in_tarball
qt_translate_noop_nondeterminstic_ordering
random_order_in_ibus_table_createdb_output
db_ordering_noop_detect
random_ordering_in_pom
random_order_in_sisu_javax_inject_named
sphinx_htmlhelp_readdir_sensitive
nondeterminstic_ordering_in_gsettings_glib_enums_xml
valac_permutes_get_type_calls
share the same 'order-only difference'. It can be implemented by collecting all the changed lines in file A (first build) and in the same file B (second build): if both contain the same lines in the same number, there is no difference other than ordering.
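A minimal sketch of the idea described above, assuming both versions of the file fit in memory as lists of lines. The function name `differs_only_in_order` is hypothetical and not part of diffoscope; it compares the two line lists as multisets using `collections.Counter`.

```python
from collections import Counter


def differs_only_in_order(lines_a, lines_b):
    """Return True if both lists hold the same lines, but in a different order."""
    if lines_a == lines_b:
        # Identical content: there is no difference at all.
        return False
    if len(lines_a) != len(lines_b):
        # Cheap early exit: differing line counts rule out a pure reordering.
        return False
    # Counter maps each line to its number of occurrences (a multiset).
    return Counter(lines_a) == Counter(lines_b)


build_a = ["b.txt\n", "a.txt\n", "c.txt\n"]
build_b = ["a.txt\n", "b.txt\n", "c.txt\n"]
print(differs_only_in_order(build_a, build_b))
```

Comparing multisets rather than plain line counts matters when lines repeat: two files with equal line counts can still differ in content.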
Changed Bug title to 'diffoscope: Add detection of order-only differences in plain text formats' from 'diffoscope: Add detection for order only differnce by lines'.
Request was from Chris Lamb <lamby@debian.org>
to control@bugs.debian.org.
(Thu, 22 Dec 2016 09:33:13 GMT) (full text, mbox, link).
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#848049; Package diffoscope.
(Sat, 24 Dec 2016 18:18:02 GMT) (full text, mbox, link).
Acknowledgement sent
to Маша Глухова <siamezzze@gmail.com>:
Extra info received and forwarded to list. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Sat, 24 Dec 2016 18:18:02 GMT) (full text, mbox, link).
Message #12 received at 848049@bugs.debian.org (full text, mbox, reply):
[Message part 1 (text/plain, inline)]
I believe the attached patch would provide the requested functionality.
[Message part 2 (text/html, inline)]
[0001-Add-detection-of-order-only-difference-in-plain-text.patch (text/x-diff, attachment)]
Reply sent
to Chris Lamb <lamby@debian.org>:
You have taken responsibility.
(Sat, 24 Dec 2016 19:51:04 GMT) (full text, mbox, link).
Notification sent
to Emanuel Bronshtein <e3amn2l@gmx.com>:
Bug acknowledged by developer.
(Sat, 24 Dec 2016 19:51:04 GMT) (full text, mbox, link).
Message #17 received at 848049-close@bugs.debian.org (full text, mbox, reply):
Source: diffoscope
Source-Version: 66
We believe that the bug you reported is fixed in the latest version of
diffoscope, which is due to be installed in the Debian FTP archive.
A summary of the changes between this version and the previous one is
attached.
Thank you for reporting the bug, which will now be closed. If you
have further comments please address them to 848049@bugs.debian.org,
and the maintainer will reopen the bug report if appropriate.
Debian distribution maintenance software
pp.
Chris Lamb <lamby@debian.org> (supplier of updated diffoscope package)
(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing ftpmaster@ftp-master.debian.org)
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
Format: 1.8
Date: Sat, 24 Dec 2016 19:18:30 +0000
Source: diffoscope
Binary: diffoscope
Architecture: source
Version: 66
Distribution: unstable
Urgency: medium
Maintainer: Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>
Changed-By: Chris Lamb <lamby@debian.org>
Description:
diffoscope - in-depth comparison of files, archives, and directories
Closes: 848049 849142
Changes:
diffoscope (66) unstable; urgency=medium
.
[ Chris Lamb ]
* Update dex_expected_diffs and test requirement to ensure test compatibility
with enjarify >= 1.0.3. (Closes: #849142)
* Print the detected version in @skip_unless_tool_is_at_least test utility.
.
[ Maria Glukhova ]
* Add detection of order-only difference in plain text format. (Closes: #848049)
.
[ anthraxx ]
* Add OpenSSH Arch package to Recommends.
Checksums-Sha1:
90f5202c59082bfa9f446c9cb61f785b82537c98 2923 diffoscope_66.dsc
a83d0ae72f61eeb2ac8166b313d7c0d38103f90c 315872 diffoscope_66.tar.xz
Checksums-Sha256:
7a5e88ce749f0b3169e2fdb46b3e6fd5c13df26d6e54762fb73deff281f3ee84 2923 diffoscope_66.dsc
fe41876d0c1889663b963090cc2f30d58e1afe8bb3a16e61118d6ad81deed3f4 315872 diffoscope_66.tar.xz
Files:
da0c63fe0280c90dd49ff0efb54903ea 2923 devel optional diffoscope_66.dsc
fbfb52a82f331a5bd8ebd6ebcbdb45f5 315872 devel optional diffoscope_66.tar.xz
-----BEGIN PGP SIGNATURE-----
iQIzBAEBCAAdFiEEwv5L0nHBObhsUz5GHpU+J9QxHlgFAlhezDEACgkQHpU+J9Qx
Hlg4XxAAwhkGTmPkmIEQ7Gph01RCKCjtkOdjagwDIPE3nBzp0NBjSksbyi8QkY4n
yfjk8bLjX01YRGegqaT0CXLtbsCjsTBabaKgh4X1mKn/51nghuOFChIeQRG/rsA0
/eYLAJAAQ/ag2siEiPoISa5YCWRNLYV8K0JUtJnEKCmqPvVxk5Q8zKFaJqUBO8QV
Om1rV1RBUGb863UUMpOYZCXkaN1gvtv+/u3l4yhr3SyV0cEquh5JiDMLcl27yAlB
7gDw4Nb5UuwkNTlUr+5Fkc5GsTeW3A9lsI61VEQPTqc7zwGRB5LjjMVr2+muGfjR
x9ZNtrDZozjwB/6ufnueVoPIE9sHv4IgXvyU5biSk19oN6ZXYHmw7ekZwXKyI7Rm
zGzfJIkIC7Fu4LL5bsUYD6atjQWpyxwvcFCVFv3wmAhN/sQsUs/V/ljUehfUTnEw
rOloYZdALE8E8TBTMpPXIAJNrnM73wMIOpI1wOrFgYYCTsaSVf/NTLzKZAm7FDXy
JWTM5UXeI8wDT30hvCQCFNxVgEMR3KeR3ipN/YqUWBdlilFU4WAEVlsdReV2PC9t
DpYF69VX0J7+UId+O8frmW9jDwyBpqR3TmnCmd9OL99Oozg595nL+rwX2ZLMCoFU
+m6nBaU9/Q/l8bEKcjnXoSncsbHtuJ0la4RqTr20opEiRLRSAQM=
=M1F+
-----END PGP SIGNATURE-----
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#848049; Package diffoscope.
(Sat, 24 Dec 2016 20:33:03 GMT) (full text, mbox, link).
Acknowledgement sent
to Daniel Shahaf <danielsh@apache.org>:
Extra info received and forwarded to list. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Sat, 24 Dec 2016 20:33:03 GMT) (full text, mbox, link).
Message #22 received at 848049@bugs.debian.org (full text, mbox, reply):
Маша Глухова wrote on Sat, Dec 24, 2016 at 18:14:16 +0000:
> +def order_only_difference(unified_diff):
> +    diff_lines = unified_diff.splitlines()
> +    added_lines = [line[1:] for line in diff_lines if line.startswith('+')]
> +    removed_lines = [line[1:] for line in diff_lines if line.startswith('-')]
> +    # Faster check: does number of lines match?
> +    if len(added_lines) != len(removed_lines):
> +        return False
> +    # Counter stores line and number of its occurrences.
> +    return sorted(added_lines) == sorted(removed_lines)
What happens if one of the files has a trailing newline and one does
not? Strictly speaking, that's not an "ordering only difference", but
this function doesn't seem to handle this case.
Example:
% diff -u <(echo foo) <(printf foo)
--- /proc/self/fd/11 2016-12-24 20:24:22.064115616 +0000
+++ /proc/self/fd/12 2016-12-24 20:24:22.064115616 +0000
@@ -1 +1 @@
-foo
+foo
\ No newline at end of file
Cheers,
Daniel
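The trailing-newline case raised above can be sketched as follows. This is an illustrative variant, not the patch that was merged: it assumes the input is unified diff text, and the name `order_only_difference_strict` is hypothetical. It skips any file headers that precede the first hunk and declines to report an order-only difference when diff emitted a `\ No newline at end of file` marker.

```python
def order_only_difference_strict(unified_diff):
    """Return True only if added and removed lines match and no newline marker appears."""
    added, removed = [], []
    in_hunk = False
    for line in unified_diff.splitlines():
        if line.startswith('@@'):
            # Hunk header: content lines follow from here on.
            in_hunk = True
        elif not in_hunk:
            # File headers ("--- a", "+++ b") precede the first hunk; skip them.
            continue
        elif line.startswith('\\'):
            # "\ No newline at end of file": the inputs differ in more than order.
            return False
        elif line.startswith('+'):
            added.append(line[1:])
        elif line.startswith('-'):
            removed.append(line[1:])
    # An empty diff is not an order-only difference either.
    return bool(added) and sorted(added) == sorted(removed)


newline_diff = "@@ -1 +1 @@\n-foo\n+foo\n\\ No newline at end of file\n"
reorder_diff = "@@ -1,2 +1,2 @@\n-a\n b\n+a\n"
print(order_only_difference_strict(newline_diff))
print(order_only_difference_strict(reorder_diff))
```

Treating any such marker as "not order-only" is a conservative choice: it may miss a reordering that also changes the final newline, but it never mislabels a newline change as a reordering.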
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#848049; Package diffoscope.
(Sun, 25 Dec 2016 11:45:06 GMT) (full text, mbox, link).
Acknowledgement sent
to Jérémy Bobbio <lunar@debian.org>:
Extra info received and forwarded to list. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Sun, 25 Dec 2016 11:45:06 GMT) (full text, mbox, link).
Message #27 received at 848049@bugs.debian.org (full text, mbox, reply):
[Message part 1 (text/plain, inline)]
Маша Глухова:
> I believe the attached patch would provide the requested functionality.
Nice work! :)
> From: Maria Glukhova <siamezzze@gmail.com>
> Date: Sat, 24 Dec 2016 12:29:57 +0200
> Subject: [PATCH] Add detection of order-only difference in plain text format.
>
> Detect if the text files' contents differ only in line ordering, and give an appropriate comment.
> […]
> +def order_only_difference(unified_diff):
> +    diff_lines = unified_diff.splitlines()
> +    added_lines = [line[1:] for line in diff_lines if line.startswith('+')]
> +    removed_lines = [line[1:] for line in diff_lines if line.startswith('-')]
> +    # Faster check: does number of lines match?
> +    if len(added_lines) != len(removed_lines):
> +        return False
> +    # Counter stores line and number of its occurrences.
> +    return sorted(added_lines) == sorted(removed_lines)
I guess it's a fine approach to the problem, but I wonder if it would
not be better to use a slightly less accurate strategy that would be
nicer to memory and CPU.
What I have in mind is to incrementally compute a hash value that would
give the same result even if the lines are in different order.
Drawing from discussions on StackOverflow [1], I think doing a sum of
Python's hash() would work. My test was:
    def unordered_hash(lines):
        h = 0
        for line in lines:
            h += hash(line)
        return h

    h1 = unordered_hash(open('tests/data/text_order1').readlines())
    h2 = unordered_hash(open('tests/data/text_order2').readlines())
    print(h1, h2, h1 == h2)
That way, it could probably be implemented directly in the difference
module and work for other file types than just text files.
[1]: https://stackoverflow.com/questions/30734848/order-independant-hash-algorithm
--
Lunar .''`.
lunar@debian.org : :Ⓐ : # apt-get install anarchism
`. `'`
`-
[signature.asc (application/pgp-signature, inline)]
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#848049; Package diffoscope.
(Sun, 25 Dec 2016 14:21:06 GMT) (full text, mbox, link).
Acknowledgement sent
to Маша Глухова <siamezzze@gmail.com>:
Extra info received and forwarded to list. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Sun, 25 Dec 2016 14:21:06 GMT) (full text, mbox, link).
Message #32 received at 848049@bugs.debian.org (full text, mbox, reply):
[Message part 1 (text/plain, inline)]
Jeremy,
Thank you for sharing that!
The reason I did not use an algorithm like that is that it requires
reading the files a second time. Right now, all the actual work with the
content of the files (except for the quick has_same_content check) is
delegated to diff, and on big files it takes up most of the time. Assuming
that for big files, reading them from disk would be the bottleneck, I
tried to avoid reading them again and worked with the result of the
diff instead.
Still, I would be happy to be proven wrong. I will implement your version and
compare the performance.
Thank you again :)
Maria Glukhova
[Message part 2 (text/html, inline)]
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#848049; Package diffoscope.
(Sun, 25 Dec 2016 14:33:07 GMT) (full text, mbox, link).
Acknowledgement sent
to Jérémy Bobbio <lunar@debian.org>:
Extra info received and forwarded to list. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Sun, 25 Dec 2016 14:33:07 GMT) (full text, mbox, link).
Message #37 received at 848049@bugs.debian.org (full text, mbox, reply):
[Message part 1 (text/plain, inline)]
Hi!
Маша Глухова:
> The reason I did not use an algorithm like that is that it requires
> reading the files a second time. Right now, all the actual work with the
> content of the files (except for the quick has_same_content check) is
> delegated to diff, and on big files it takes up most of the time. Assuming
> that for big files, reading them from disk would be the bottleneck, I
> tried to avoid reading them again and worked with the result of the
> diff instead.
> Still, I would be happy to be proven wrong. I will implement your version and
> compare the performance.
You would not have to read the file twice as long as you do the hash
in the difference module, when each line is actually fed to diff.
A similar trick is already used to cope with files that are too long,
see diffoscope.difference.make_feeder_from_raw_reader()
I don't know if my suggestion is a good one. It might not be a good
idea at all. Feel free to discuss it with your mentor before spending
too much time on it.
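A sketch of hashing lines while they are being fed to diff, so that each file is read only once. The names `hashing_feeder` and `OrderIndependentDigest` are hypothetical; this does not reproduce the actual signature of `make_feeder_from_raw_reader()`, and it writes to an in-memory buffer standing in for diff's input.

```python
import hashlib
import io


class OrderIndependentDigest:
    """Accumulate a digest of byte lines that ignores their arrival order."""

    MODULUS = 2 ** 256

    def __init__(self):
        self.value = 0
        self.count = 0

    def update(self, line):
        # Sum per-line SHA-256 digests; addition is commutative, so order is ignored.
        digest = hashlib.sha256(line).digest()
        self.value = (self.value + int.from_bytes(digest, 'big')) % self.MODULUS
        self.count += 1


def hashing_feeder(lines, out_file, digest):
    """Write each line to out_file and update the digest in the same pass."""
    for line in lines:
        digest.update(line)
        out_file.write(line)


d1, d2 = OrderIndependentDigest(), OrderIndependentDigest()
hashing_feeder(io.BytesIO(b"a\nb\nc\n"), io.BytesIO(), d1)
hashing_feeder(io.BytesIO(b"c\na\nb\n"), io.BytesIO(), d2)
print((d1.value, d1.count) == (d2.value, d2.count))
```

Equal digests only indicate a probable reordering; as noted above, this strategy trades some accuracy for lower memory use.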
> Thank you again :)
PS: Please call me Lunar. :)
--
Lunar .''`.
lunar@debian.org : :Ⓐ : # apt-get install anarchism
`. `'`
`-
[signature.asc (application/pgp-signature, inline)]
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#848049; Package diffoscope.
(Sun, 25 Dec 2016 15:21:03 GMT) (full text, mbox, link).
Acknowledgement sent
to Chris Lamb <lamby@debian.org>:
Extra info received and forwarded to list. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Sun, 25 Dec 2016 15:21:03 GMT) (full text, mbox, link).
Message #42 received at 848049@bugs.debian.org (full text, mbox, reply):
Jérémy Bobbio wrote:
[…]
> h += hash(line)
[…]
Watch out, using hash() often leads to unreproducible output. :)
Regards,
--
,''`.
: :' : Chris Lamb
`. `'` lamby@debian.org / chris-lamb.co.uk
`-
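The caveat about hash() can be illustrated with a variant of the quoted `unordered_hash` that avoids the built-in. This is a sketch, not code from the thread: the name `stable_unordered_hash` and the choice of `hashlib.blake2b` are assumptions. Python's built-in hash() for str is randomized per process unless PYTHONHASHSEED is fixed, so its values are consistent within one run but not across runs.

```python
import hashlib


def stable_unordered_hash(lines):
    """Order-independent hash of an iterable of str lines, stable across runs."""
    total = 0
    for line in lines:
        # blake2b without a key gives the same digest in every process,
        # unlike the built-in hash() for str.
        digest = hashlib.blake2b(line.encode('utf-8'), digest_size=8).digest()
        total = (total + int.from_bytes(digest, 'big')) % 2 ** 64
    return total


h1 = stable_unordered_hash(["a\n", "b\n", "c\n"])
h2 = stable_unordered_hash(["c\n", "a\n", "b\n"])
print(h1 == h2)
```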
Information forwarded
to debian-bugs-dist@lists.debian.org, Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>:
Bug#848049; Package diffoscope.
(Thu, 12 Jan 2017 10:15:05 GMT) (full text, mbox, link).
Acknowledgement sent
to Maria Glukhova <siamezzze@gmail.com>:
Extra info received and forwarded to list. Copy sent to Reproducible builds folks <reproducible-builds@lists.alioth.debian.org>.
(Thu, 12 Jan 2017 10:15:06 GMT) (full text, mbox, link).
Message #47 received at 848049@bugs.debian.org (full text, mbox, reply):
[Message part 1 (text/plain, inline)]
On Sun, 25 Dec 2016 15:28:52 +0100 Jérémy Bobbio <lunar@debian.org> wrote:
Hi Lunar!
> You would not have to read the file twice as long as you do the hash
> in the difference module, when each line is actually fed to diff.
> A similar trick is already used to cope with files that are too long,
> see diffoscope.difference.make_feeder_from_raw_reader()
>
I implemented what I believe was your idea in the attached patch. Thank you
for pointing me to it!
Still, I don't think that feature is worth intruding into the diff.py/diffoscope.py
modules. It doesn't speed up comparison significantly, because the call to diff
still takes most of the time on big files that differ only in line
order. Besides, I can't think of many examples where that feature would
be needed, apart from text files.
In any case, thank you again for taking time to provide me with that idea!
Maria
[Message part 2 (text/html, inline)]
[0001-Generic-order-line-difference-for-all-kind-of-inputs.patch (text/x-diff, attachment)]
Bug archived.
Request was from Debbugs Internal Request <owner@bugs.debian.org>
to internal_control@bugs.debian.org.
(Fri, 10 Feb 2017 07:28:47 GMT) (full text, mbox, link).
Debian bug tracking system administrator <owner@bugs.debian.org>.
Last modified:
Wed May 17 13:57:30 2023;