Debian Bug report logs - #779207
unzip fails to unpack filenames containing 'ä' 'ö' 'ü' -> results in "(invalid encoding)"

Package: unzip; Maintainer for unzip is Santiago Vila <sanvila@debian.org>; Source for unzip is src:unzip.

Reported by: derMaria <rmr@mailfish.de>

Date: Wed, 25 Feb 2015 13:09:07 UTC

Severity: wishlist

Tags: l10n, patch

Found in versions unzip/6.0-16, unzip/6.0-21


Report forwarded to debian-bugs-dist@lists.debian.org, rmr@mailfish.de, Santiago Vila <sanvila@debian.org>:
Bug#779207; Package unzip. (Wed, 25 Feb 2015 13:09:11 GMT) (full text, mbox, link).


Acknowledgement sent to derMaria <rmr@mailfish.de>:
New Bug report received and forwarded. Copy sent to rmr@mailfish.de, Santiago Vila <sanvila@debian.org>. (Wed, 25 Feb 2015 13:09:11 GMT) (full text, mbox, link).


Message #5 received at submit@bugs.debian.org (full text, mbox, reply):

From: derMaria <rmr@mailfish.de>
To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: unzip fails to unpack filenames containing 'ä' 'ö' 'ü' -> results in "(invalid encoding)"
Date: Wed, 25 Feb 2015 14:07:28 +0100
Package: unzip
Version: 6.0-16
Severity: normal
Tags: l10n

Dear Maintainer,

whenever I try to unpack a ZIP file in which a filename contains 'ä',
'ö' or 'ü', the character is replaced by '�' and the term " (ungültige Kodierung)"
(invalid encoding) is added as part of the extracted filename.
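
The '�' is what a UTF-8 decoder produces for a byte that is not valid UTF-8. A minimal Python sketch of how such a name can arise, assuming (the report does not say this) that the archive stored its names in the DOS code page CP850, and using a made-up file name:

```python
# Assumption: the archive stored the name in CP850, a common DOS/OEM code
# page for Western Europe. The file name itself is a made-up example.
stored = "Bär.txt".encode("cp850")

# In CP850, 'ä' is the single byte 0x84, which is not valid UTF-8 on its own.
assert stored == b"B\x84r.txt"

# Reading the raw bytes as UTF-8 turns the stray byte into U+FFFD.
shown = stored.decode("utf-8", errors="replace")
print(shown)

# Decoding with the code page that was actually used recovers the name.
print(stored.decode("cp850"))
```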

It is a whole lot of work to remove the term " (ungültige Kodierung)" from all
of the files, as these characters are used quite often in German and I don't
pack the files myself, so I can't influence the original names of the files.

See also:

http://forum.ubuntuusers.de/topic/falsche-buchstaben-nach-entpacken/
https://blueprints.launchpad.net/unzip/+spec/unzip-detect-filename-encoding

Thanks - Rafael



-- System Information:
Debian Release: jessie/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: i386 (i686)

Kernel: Linux 3.16-2-486
Locale: LANG=de_DE.utf8, LC_CTYPE=de_DE.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages unzip depends on:
ii  libbz2-1.0  1.0.6-7+b1
ii  libc6       2.19-13

unzip recommends no packages.

Versions of packages unzip suggests:
ii  zip  3.0-8

-- no debconf information



Information forwarded to debian-bugs-dist@lists.debian.org, Santiago Vila <sanvila@debian.org>:
Bug#779207; Package unzip. (Sun, 17 Jan 2016 18:51:04 GMT) (full text, mbox, link).


Acknowledgement sent to Andrey Skvortsov <andrej.skvortzov@gmail.com>:
Extra info received and forwarded to list. Copy sent to Santiago Vila <sanvila@debian.org>. (Sun, 17 Jan 2016 18:51:04 GMT) (full text, mbox, link).


Message #10 received at 779207@bugs.debian.org (full text, mbox, reply):

From: Andrey Skvortsov <andrej.skvortzov@gmail.com>
To: 779207@bugs.debian.org
Subject: comment
Date: Sun, 17 Jan 2016 21:47:56 +0300
[Message part 1 (text/plain, inline)]
I quite often have this problem with Cyrillic characters in
filenames. So this is not limited to German; it affects all locales
with non-Latin characters.
This problem has already been solved by Altlinux (a Russian Linux distro).
Ubuntu is using their patch.

unzip (6.0-20ubuntu1) xenial; urgency=medium

  * Resynchronise with Debian. Remaining changes:
    - Add patch from archlinux which adds the -O option, allowing a charset
      to be specified for the proper unzipping of non-Latin and non-Unicode
      filenames.

Archlinux uses it too, according to the bug report:
https://bugs.archlinux.org/task/8383

Patch from Ubuntu is attached.
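
What a charset override such as -O has to do can be sketched in Python. This is not the attached patch, only an illustration built on the standard zipfile module, which decodes names lacking the UTF-8 flag as CP437. The helper names recover_name and list_names are hypothetical.

```python
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: the name is stored as UTF-8


def recover_name(name: str, flag_bits: int, charset: str) -> str:
    """Re-decode a member name that zipfile decoded as CP437."""
    if flag_bits & UTF8_FLAG:
        return name  # already decoded as UTF-8, nothing to repair
    # zipfile decodes unflagged names as CP437; undo that and use `charset`.
    return name.encode("cp437").decode(charset)


def list_names(path: str, charset: str) -> list[str]:
    """Return the member names of the archive at `path`, re-decoded."""
    with zipfile.ZipFile(path) as zf:
        return [recover_name(i.filename, i.flag_bits, charset)
                for i in zf.infolist()]
```

For an archive written with Cyrillic names in CP866, list_names("archive.zip", "cp866") would return the readable names, provided the guess about the charset is right; "archive.zip" is a placeholder path.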

-- 
Best regards,
Andrey Skvortsov

Secure e-mail with gnupg: See http://www.gnupg.org/
PGP Key ID: 0x57A3AEAD


[unzip60-alt-iconv-utf8.patch (text/x-diff, attachment)]

Information forwarded to debian-bugs-dist@lists.debian.org, Santiago Vila <sanvila@debian.org>:
Bug#779207; Package unzip. (Sun, 21 May 2017 14:03:05 GMT) (full text, mbox, link).


Acknowledgement sent to Osamu Aoki <osamu@debian.org>:
Extra info received and forwarded to list. Copy sent to Santiago Vila <sanvila@debian.org>. (Sun, 21 May 2017 14:03:05 GMT) (full text, mbox, link).


Message #15 received at 779207@bugs.debian.org (full text, mbox, reply):

From: Osamu Aoki <osamu@debian.org>
To: 779207@bugs.debian.org
Cc: derMaria <rmr@mailfish.de>, Andrey Skvortsov <andrej.skvortzov@gmail.com>, Marc Lehmann <debian-reportbug@plan9.de>, Steve Langasek <steve.langasek@ubuntu.com>
Subject: non-UTF8 encoded ZIP archive
Date: Sun, 21 May 2017 23:00:07 +0900
[Message part 1 (text/plain, inline)]
control: tags -1 patch
control: severity -1 important

Hi,

zip as shipped currently with Debian squeeze lacks encoding support.
This is a widely known problem with some workarounds.
  https://superuser.com/questions/872596/decompress-zip-with-given-encoding
  https://unix.stackexchange.com/questions/251969/how-can-i-correctly-decompress-a-zip-archive-of-files-with-hebrew-names

Seemingly the same problem is reported as https://bugs.debian.org/696914
too.

Apparently, Ubuntu, Arch, Redhat and FreeBSD ship (or shipped) a patched
version of unzip to cope with this widely known encoding issue (it seems
this issue has been open for more than 10 years.  An upstream change seems
to have broken the old patch at some point.  But I see Ubuntu has an updated
patch.).  Knowing how slow upstream is, maybe it is a good idea to apply a
patch to fix this shortcoming in Debian too.

Arch bug and patch in 2009:
  https://bugs.archlinux.org/task/15256

Ubuntu discussion on this bug is here:
  https://bugs.launchpad.net/ubuntu/+source/unzip/+bug/580961

In this:
  Mathew Hodson (mathew-hodson) wrote on 2016-05-16:    #198
I've closed the remaining tasks. This particular bug was fixed in
Precise and later. For remaining issues in p7zip and file-roller, see
Bug #1382106 and Bug #495880

Current Ubuntu fixed this bug and its diff is here:
  https://ubuntudiff.debian.net/q/package/unzip

unzip (6.0-21ubuntu1) artful; urgency=low

  * Merge from Debian unstable.  Remaining changes:
    - Add patch from archlinux which adds the -O option, allowing a charset
      to be specified for the proper unzipping of non-Latin and non-Unicode
      filenames.

Looks quite reasonable.

The same patch has been in use from unzip version 6.0-19ubuntu1 packaged
by Sebastien Bacher <seb128@ubuntu.com>  Fri, 23 Oct 2015 15:58:43 +0200

So this patch should have been well tested by now!

As long as we apply the same patch as Ubuntu, the security concern is
minimal, too.  (I understand that, with so many recent CVE fixes, you may
be very conservative about deviating from upstream.)

If you don't feel like updating under freeze, please seriously consider
uploading right after the release and backporting.

Regards,

Osamu

[signature.asc (application/pgp-signature, inline)]

Added tag(s) patch. Request was from Osamu Aoki <osamu@debian.org> to 779207-submit@bugs.debian.org. (Sun, 21 May 2017 14:03:05 GMT) (full text, mbox, link).


Severity set to 'important' from 'normal' Request was from Osamu Aoki <osamu@debian.org> to 779207-submit@bugs.debian.org. (Sun, 21 May 2017 14:03:06 GMT) (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Santiago Vila <sanvila@debian.org>:
Bug#779207; Package unzip. (Sun, 21 May 2017 14:42:03 GMT) (full text, mbox, link).


Acknowledgement sent to Santiago Vila <sanvila@unex.es>:
Extra info received and forwarded to list. Copy sent to Santiago Vila <sanvila@debian.org>. (Sun, 21 May 2017 14:42:03 GMT) (full text, mbox, link).


Message #24 received at 779207@bugs.debian.org (full text, mbox, reply):

From: Santiago Vila <sanvila@unex.es>
To: Osamu Aoki <osamu@debian.org>, 779207@bugs.debian.org
Cc: derMaria <rmr@mailfish.de>, Andrey Skvortsov <andrej.skvortzov@gmail.com>, Marc Lehmann <debian-reportbug@plan9.de>, Steve Langasek <steve.langasek@ubuntu.com>
Subject: Re: Bug#779207: non-UTF8 encoded ZIP archive
Date: Sun, 21 May 2017 16:38:51 +0200
severity 779207 wishlist
thanks

This is still a feature request. Granted, a feature request that many
people request, but still a feature request.

The proposed patch, even if it's "well tested", may or may not be
compatible with whatever thing upstream finally implements.

If we only had some assurance that the upstream patch will be like
this, then yes, it would be fine to apply the patch (I would be happy
to backport the changes from upstream git or whatever source control
version they have), but we don't really know, so no, I still do not
feel like deviating from upstream, not under freeze, and not after
the freeze.

Anyway, I'll ask the authors about this once again.

Thanks.



Severity set to 'wishlist' from 'important' Request was from Santiago Vila <sanvila@unex.es> to control@bugs.debian.org. (Sun, 21 May 2017 14:42:05 GMT) (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Santiago Vila <sanvila@debian.org>:
Bug#779207; Package unzip. (Fri, 24 Nov 2017 08:39:04 GMT) (full text, mbox, link).


Acknowledgement sent to Antonio Ospite <ao2@ao2.it>:
Extra info received and forwarded to list. Copy sent to Santiago Vila <sanvila@debian.org>. (Fri, 24 Nov 2017 08:39:04 GMT) (full text, mbox, link).


Message #31 received at 779207@bugs.debian.org (full text, mbox, reply):

From: Antonio Ospite <ao2@ao2.it>
To: Debian Bug Tracking System <779207@bugs.debian.org>
Subject: Re: unzip fails to unpack filenames containing 'ä' 'ö' 'ü' -> results in "(invalid encoding)"
Date: Fri, 24 Nov 2017 09:30:26 +0100
Package: unzip
Version: 6.0-21
Followup-For: Bug #779207

Dear Maintainer,

it looks like an upstream beta version from 2010 fixes this by adding
the -I and -O options, and the changelog says it's based on the
unzip60-alt-iconv-utf8.patch proposed in this thread.

This is the beta version:
ftp://ftp.info-zip.org/pub/infozip/beta/unzip610b.zip

You'll need to compile with -DUSE_ICONV_MAPPING to enable this, and
depend on the iconv library.

This is also mentioned in #197427; other possibly related bugs are
#696914 and #483290.

Ciao,
   Antonio

-- System Information:
Debian Release: buster/sid
  APT prefers unstable
  APT policy: (900, 'unstable'), (500, 'unstable-debug')
Architecture: amd64 (x86_64)

Kernel: Linux 4.13.0-1-amd64 (SMP w/2 CPU cores)
Locale: LANG=it_IT.utf8, LC_CTYPE=it_IT.utf8 (charmap=UTF-8), LANGUAGE=it_IT.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages unzip depends on:
ii  libbz2-1.0  1.0.6-8.1
ii  libc6       2.25-2

unzip recommends no packages.

Versions of packages unzip suggests:
ii  zip  3.0-11+b1

-- no debconf information
-- 
Antonio Ospite
https://ao2.it
https://twitter.com/ao2it

A: Because it messes up the order in which people normally read text.
   See http://en.wikipedia.org/wiki/Posting_style
Q: Why is top-posting such a bad thing?



Information forwarded to debian-bugs-dist@lists.debian.org, Santiago Vila <sanvila@debian.org>:
Bug#779207; Package unzip. (Fri, 24 Nov 2017 10:03:07 GMT) (full text, mbox, link).


Acknowledgement sent to Antonio Ospite <ao2@ao2.it>:
Extra info received and forwarded to list. Copy sent to Santiago Vila <sanvila@debian.org>. (Fri, 24 Nov 2017 10:03:07 GMT) (full text, mbox, link).


Message #36 received at 779207@bugs.debian.org (full text, mbox, reply):

From: Antonio Ospite <ao2@ao2.it>
To: Debian Bug Tracking System <779207@bugs.debian.org>
Subject: Re: unzip fails to unpack filenames containing 'ä' 'ö' 'ü' -> results in "(invalid encoding)"
Date: Fri, 24 Nov 2017 10:24:24 +0100
On Fri, 24 Nov 2017 09:30:26 +0100
Antonio Ospite <ao2@ao2.it> wrote:

[...]
> You'll need to compile with -DUSE_ICONV_MAPPING to enable this, and
> depend on the iconv library.
> 

Well, the changelog mentions the "iconv library", but on Linux the
functionality is in glibc, so no extra dependencies should be needed.

Ciao,
   Antonio



Information forwarded to debian-bugs-dist@lists.debian.org, Santiago Vila <sanvila@debian.org>:
Bug#779207; Package unzip. (Mon, 07 Jan 2019 16:12:03 GMT) (full text, mbox, link).


Acknowledgement sent to Shengjing Zhu <zhsj@debian.org>:
Extra info received and forwarded to list. Copy sent to Santiago Vila <sanvila@debian.org>. (Mon, 07 Jan 2019 16:12:03 GMT) (full text, mbox, link).


Message #41 received at 779207@bugs.debian.org (full text, mbox, reply):

From: Shengjing Zhu <zhsj@debian.org>
To: Santiago Vila <sanvila@unex.es>
Cc: 779207@bugs.debian.org
Subject: Re: Bug#779207: non-UTF8 encoded ZIP archive
Date: Tue, 8 Jan 2019 00:09:54 +0800
[Message part 1 (text/plain, inline)]
Hi Santiago,

On Sun, May 21, 2017 at 04:38:51PM +0200, Santiago Vila wrote:
> severity 779207 wishlist
> thanks
> 
> This is still a feature request. Granted, a feature request that many
> people request, but still a feature request.
> 
> The proposed patch, even if it's "well tested", may or may not be
> compatible with whatever thing upstream finally implements.
> 
> If we only had some assurance that the upstream patch will be like
> this, then yes, it would be fine to apply the patch (I would be happy
> to backport the changes from upstream git or whatever source control
> version they have), but we don't really know, so no, I still do not
> feel like deviating from upstream, not under freeze, and not after
> the freeze.

How is this bug going?

I think we can assume the upstream patch is like this.

Upstream made a 610beta release in 2010.
https://sourceforge.net/projects/infozip/files/unreleased%20Betas/UnZip%20betas/

It adds support for the -O/-I options for non-UTF8 encodings.

I checked the code in 610beta against the patch applied in archlinux AUR[1][2].
Most of the code is the same, except:

1. The upstream code is guarded by USE_ICONV_MAPPING.
2. 610b has a big refactoring of the command line parser, so the command
   parsing code differs considerably.

The rest of the code is identical.

Could you backport this lovely feature to Debian?

[1] https://aur.archlinux.org/packages/unzip-iconv
[2] http://www.conostix.com/pub/adv/06-unzip60-alt-iconv-utf8_CVE-2015-1315.patch
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Santiago Vila <sanvila@debian.org>:
Bug#779207; Package unzip. (Tue, 05 Feb 2019 10:24:02 GMT) (full text, mbox, link).


Acknowledgement sent to Andreas Schlager <andreas.schlager@sbg.at>:
Extra info received and forwarded to list. Copy sent to Santiago Vila <sanvila@debian.org>. (Tue, 05 Feb 2019 10:24:02 GMT) (full text, mbox, link).


Message #46 received at 779207@bugs.debian.org (full text, mbox, reply):

From: Andreas Schlager <andreas.schlager@sbg.at>
To: 779207@bugs.debian.org
Subject: Re: Bug#779207: non-UTF8 encoded ZIP archive
Date: Tue, 5 Feb 2019 11:01:14 +0100
[Message part 1 (text/plain, inline)]
Hi Santiago,

I think the community in all non-English-speaking countries would highly
appreciate a solution for this bug, which affects many people almost every day.

The effort is hopefully not too big, but the benefit would be really
tremendous.

Many thanks in advance!

Best Regards,

-Andreas.

---------------------------------------------------------
"Sed quis custodiet ipsos custodes?"
(Who, other than the guards themselves, guards the guards?)


[signature.asc (application/pgp-signature, attachment)]

Information forwarded to debian-bugs-dist@lists.debian.org, Santiago Vila <sanvila@debian.org>:
Bug#779207; Package unzip. (Wed, 22 May 2024 16:09:03 GMT) (full text, mbox, link).


Acknowledgement sent to Ivan Sorokin <unxed@mail.ru>:
Extra info received and forwarded to list. Copy sent to Santiago Vila <sanvila@debian.org>. (Wed, 22 May 2024 16:09:03 GMT) (full text, mbox, link).


Message #51 received at 779207@bugs.debian.org (full text, mbox, reply):

From: Ivan Sorokin <unxed@mail.ru>
To: 779207@bugs.debian.org
Subject: unzip fails to unpack filenames containing 'ä' 'ö' 'ü' -> results in "(invalid encoding)"
Date: Wed, 22 May 2024 18:40:21 +0300
[Message part 1 (text/plain, inline)]
The built-in .zip archiver in older versions of Windows used the DOS (OEM) or Windows (ANSI) code page corresponding to the current regional settings for new archives. Lots of such archives still exist.
 
The correct behavior is to determine the relevant OEM or ANSI code page based on the system locale and use it. You can look at this PR for a reference implementation:
https://github.com/p7zip-project/p7zip/pull/232

Sample archive showing this bug attached. It uses CP866 (often called DOS or OEM) for Cyrillic (Russian) letters.
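
The locale-based selection described above could look roughly like the following Python sketch. It is not taken from the referenced PR. The table is a small illustrative subset of the (OEM, ANSI) pairs Windows traditionally used, and the names LEGACY_CODEPAGES, FALLBACK and guess_codepages are hypothetical.

```python
# Illustrative subset only: (OEM, ANSI) code pages that Windows traditionally
# used for a few languages. The table and helper names are hypothetical.
LEGACY_CODEPAGES = {
    "ru": ("cp866", "cp1251"),
    "de": ("cp850", "cp1252"),
    "pl": ("cp852", "cp1250"),
    "el": ("cp737", "cp1253"),
    "tr": ("cp857", "cp1254"),
    "he": ("cp862", "cp1255"),
}
FALLBACK = ("cp437", "cp1252")  # assumed default for unlisted languages


def guess_codepages(locale_name: str) -> tuple[str, str]:
    """Map a POSIX locale name such as 'ru_RU.UTF-8' to (OEM, ANSI)."""
    language = locale_name.split(".")[0].split("_")[0].lower()
    return LEGACY_CODEPAGES.get(language, FALLBACK)


oem, ansi = guess_codepages("ru_RU.UTF-8")
raw = "Дикий помещик.htm".encode(oem)  # bytes as an OEM archiver would store them
print(raw.decode(oem))
```

A real implementation would need a much larger table and a way to decide between the OEM and ANSI candidates for a given archive.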

 
 
 
[Message part 2 (text/html, inline)]
[Desktop.zip (application/zip, attachment)]

Information forwarded to debian-bugs-dist@lists.debian.org, Santiago Vila <sanvila@debian.org>:
Bug#779207; Package unzip. (Sun, 26 May 2024 15:51:03 GMT) (full text, mbox, link).


Acknowledgement sent to Ivan Sorokin <unxed@mail.ru>:
Extra info received and forwarded to list. Copy sent to Santiago Vila <sanvila@debian.org>. (Sun, 26 May 2024 15:51:03 GMT) (full text, mbox, link).


Message #56 received at 779207@bugs.debian.org (full text, mbox, reply):

From: Ivan Sorokin <unxed@mail.ru>
To: 779207@bugs.debian.org
Subject: patch submission for improving code pages support in unzip
Date: Sun, 26 May 2024 18:48:23 +0300
[Message part 1 (text/plain, inline)]
Dear colleagues,
 
I am writing to bring to your attention an issue with the current upstream version of unzip that has not been updated for many years. In the modern environment, where the vast majority of systems use UTF-8, unzip exhibits several problems that need addressing:
 
1) unzip is unable to correctly extract files that have bit 11 set in the general purpose bit flag. This bit indicates that the file names are encoded in UTF-8. However, unzip attempts to re-encode them as if they were in an OEM codepage, leading to incorrect file names.
2) By default, unzip does not display UTF-8 encoding correctly on Unix systems.
3) It is necessary to determine the OEM codepage correctly based on the system locale, rather than using a single codepage for all archives.
4) The assumption that archives for which the legacy codepage cannot be determined are encoded in ISO 8859-1 is incorrect. In reality, most archivers used the user's system codepage, which could be any codepage. It is reasonable not to alter the encoding in this case, ensuring that the archive can be opened at least on the same system where it was created. Additionally, options -O and -I have been added to specify the encoding manually.
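
For item 1, the flag in question sits in the local file header of each member. A minimal Python sketch (not part of the patch) that reads it with the standard struct module; the archive is built in memory with zipfile purely as a test fixture, and first_entry and make_zip are hypothetical helper names:

```python
import io
import struct
import zipfile

UTF8_FLAG = 1 << 11  # general purpose bit 11 (0x0800)
# Local file header: signature, version, flags, method, time, date,
# CRC-32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def first_entry(data: bytes):
    """Return (flags, raw_name) from the first local file header in `data`."""
    fields = LOCAL_HEADER.unpack_from(data, 0)
    if fields[0] != b"PK\x03\x04":
        raise ValueError("not a local file header")
    flags, name_len = fields[2], fields[9]
    start = LOCAL_HEADER.size
    return flags, data[start:start + name_len]


def make_zip(name: str) -> bytes:
    """Build a one-member archive in memory (test fixture only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, b"")
    return buf.getvalue()


flags, raw = first_entry(make_zip("ä.txt"))
if flags & UTF8_FLAG:
    print(raw.decode("utf-8"))  # flagged names are UTF-8, no OEM re-encoding
```

When the bit is set, the raw name bytes can be decoded as UTF-8 directly; only names without the bit need a legacy codepage guess.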
 
I have prepared a patch (based on a similar patch from Ubuntu, with significant enhancements) that addresses these issues. A significant difference from the Ubuntu patch is that my code is capable of selecting the OEM codepage based on the system locale, instead of assuming the Russian/Cyrillic CP866 codepage for all archives when the system is set to UTF-8.
 
I hope you will find this patch useful.
 
Best regards,
Ivan Sorokin
 
 
[Message part 2 (text/html, inline)]
[29-fix-code-pages.patch (text/x-diff, attachment)]

Information forwarded to debian-bugs-dist@lists.debian.org, Santiago Vila <sanvila@debian.org>:
Bug#779207; Package unzip. (Tue, 28 May 2024 07:42:04 GMT) (full text, mbox, link).


Acknowledgement sent to Ivan Sorokin <unxed@mail.ru>:
Extra info received and forwarded to list. Copy sent to Santiago Vila <sanvila@debian.org>. (Tue, 28 May 2024 07:42:04 GMT) (full text, mbox, link).


Message #61 received at 779207@bugs.debian.org (full text, mbox, reply):

From: Ivan Sorokin <unxed@mail.ru>
To: 779207@bugs.debian.org
Subject: Re: patch submission for improving code pages support in unzip
Date: Tue, 28 May 2024 10:37:54 +0300
[Message part 1 (text/plain, inline)]
I slightly modified the patch for unzip:
1) I found a sample ANSI archive, which requires a separate code branch, so I added one (sample archive attached).
2) I fixed the -I and -O options, which were broken. They now mean the same thing for all types of archives; both are kept for compatibility, and you can specify either one. That is, they now work similarly to -mcp in 7zip.
 

  
>Sunday, 26 May 2024, 17:48 +02:00 from Ivan Sorokin <unxed@mail.ru>:
> 
>Dear colleagues,
> 
>I am writing to bring to your attention an issue with the current upstream version of unzip that has not been updated for many years. In the modern environment, where the vast majority of systems use UTF-8, unzip exhibits several problems that need addressing:
> 
>1) unzip is unable to correctly extract files containing the bit 11 in the General Purpose flag. This bit indicates that the file names are encoded in UTF-8. However, unzip attempts to re-encode them as if they are in OEM codepage, leading to incorrect file names.
>2) By default, unzip does not display UTF-8 encoding correctly on Unix systems.
>3) It is necessary to determine the OEM codepage correctly based on the system locale, rather than using a single codepage for all archives.
>4) The assumption that archives for which the legacy codepage cannot be determined are encoded in ISO 8859-1 is incorrect. In reality, most archivers used the user's system codepage, which could be any codepage. It is reasonable not to alter the encoding in this case, ensuring that the archive can be opened at least on the same system where it was created. Additionally, options -O and -I have been added to specify the encoding manually.
> 
>I have prepared a patch (based on a similar patch from Ubuntu, with significant enhancements) that addresses these issues. A significant difference from the Ubuntu patch is that my code is capable of selecting the OEM codepage based on the system locale, instead of assuming the Russian/Cyrillic CP866 codepage for all archives when the system is set to UTF-8.
> 
>I hope you will find this patch useful.
> 
>Best regards,
>Ivan Sorokin
> 
>  
 
 
 
 
[Message part 2 (text/html, inline)]
[29-fix-code-pages.patch (text/x-diff, attachment)]
[Дикий помещик.htm.zip (application/zip, attachment)]


Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Thu Aug 8 02:45:16 2024; Machine Name: bembo

Debian Bug tracking system

Debbugs is free software and licensed under the terms of the GNU Public License version 2. The current version can be obtained from https://bugs.debian.org/debbugs-source/.

Copyright © 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson, 2005-2017 Don Armstrong, and many other contributors.