Debian Bug report logs - #418058
iconv: half-smart on ascii compatible code conversion (shift-jis)

Package: libc6; Maintainer for libc6 is GNU Libc Maintainers <debian-glibc@lists.debian.org>; Source for libc6 is src:glibc.

Reported by: Osamu Aoki <osamu@debian.org>

Date: Fri, 6 Apr 2007 16:06:01 UTC

Severity: minor

Found in version glibc/2.3.6.ds1-13

Done: Pierre Habouzit <madcoder@debian.org>

Bug is archived. No further changes may be made.


Report forwarded to debian-bugs-dist@lists.debian.org, GNU Libc Maintainers <debian-glibc@lists.debian.org>:
Bug#418058; Package libc6.


Acknowledgement sent to Osamu Aoki <osamu@debian.org>:
New Bug report received and forwarded. Copy sent to GNU Libc Maintainers <debian-glibc@lists.debian.org>.


Message #5 received at submit@bugs.debian.org:

From: Osamu Aoki <osamu@debian.org>
To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: iconv: half-smart on ascii compatible code conversion (latin1, shift-jis, ...)
Date: Sat, 7 Apr 2007 01:04:32 +0900
Package: libc6
Version: 2.3.6.ds1-13
Severity: important

Problem: ~ ' \ conversion.

In short, iconv should not do smart guessing for the 7-bit section of
each traditional encoding that is ASCII compatible.  They should all be
the same in that 7-bit section.

Here we go....

All popular C/perl/shell/... programs originally written in latin-1,
latin-2, ..., shift-jis, euc-jp, ... encodings will break if iconv is
used to convert them to UTF-8.  iconv does a half-smart job to please
some cosmetic factors but forgets how these encodings were originally
developed and used in real life, so it is harmful to the data.  (Of
course, the funny 8-bit text is in the comments and the quoted text.)

In this sense, I could file a grave bug for breaking data, but
considering the timing, I will stay with important.  (After etch, I may
raise this bug's severity.)

All these encodings (latin-1, latin-2, ..., shift-jis, euc-jp, ...)
were developed so that non-ASCII characters can be expressed without
breaking existing tools/code developed for ASCII.  That is why they are
ASCII compatible.  All characters represented by 0x00-0x7f (7-bit) share
the same positions.  (We do use an alternative font for ASCII 0x5c =
backslash = '\' in Japan, which looks like the Japanese yen mark, but
\ in ASCII and yen in shift-jis serve the same purpose in the program
world.  The C standard even mentions the dual nature of \.)  So when
changing the encoding of a file, we expect all of 0x00-0x7f (7-bit) to
remain the same.

But iconv does many funny things.

The code 0x27 (single quote) is changed to something else (a long UTF-8
sequence for a single quote) when converted from any of latin-1,
latin-2, shift-jis, euc-jp, ... to UTF-8.  This is not expected.

For shift-jis, it is even worse.  iconv maps character 0x5c to the
UTF-8 YEN mark.  That mapping should be done for the yen mark code in
the 16-bit (full-width character) section and not for this 7-bit one.
This is very bad for any program.  Another issue is 0x7e '~'.  This is
translated to an upper bar.  Although some old Japanese PCs (pre-IBM
compatible, NEC 98 machines, I think) had an upper-bar-shaped glyph for
~, converting this ~ in data to the UTF-8 upper bar breaks URL data
stored on shift-jis machines.

The choice of conversion table should not be based on superficial shape
comparison but should take full account of actual usage and
implications.

iconv being a basic tool, it should not do these conversions on 7-bit
codes for these encodings.  If anyone wants a syntactical pretty-print
conversion of UTF-8 text, they should rely on some other tool.  Then
they can use opening and closing quotes if they wish, and we can keep C
programs right.  Many old C programs in each locale used these ASCII
compatible encodings, and all we want to do is convert the quoted text
and comments to UTF-8.

-- System Information:
Debian Release: lenny/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing')
Architecture: amd64 (x86_64)
Shell:  /bin/sh linked to /bin/bash
Kernel: Linux 2.6.18-mactel64
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)

Versions of packages libc6 depends on:
ii  tzdata                        2007d-1    Time Zone and Daylight Saving Time

libc6 recommends no packages.

-- debconf-show failed

Conversion results are attached as diffs.
[eucj-utf8.diff (text/x-diff, attachment)]
[shiftj-utf8.diff (text/x-diff, attachment)]
[latin1-utf8.diff (text/x-diff, attachment)]
[latin2-utf8.diff (text/x-diff, attachment)]

Information forwarded to debian-bugs-dist@lists.debian.org, GNU Libc Maintainers <debian-glibc@lists.debian.org>:
Bug#418058; Package libc6.


Acknowledgement sent to Aurelien Jarno <aurelien@aurel32.net>:
Extra info received and forwarded to list. Copy sent to GNU Libc Maintainers <debian-glibc@lists.debian.org>.


Message #10 received at 418058@bugs.debian.org:

From: Aurelien Jarno <aurelien@aurel32.net>
To: Osamu Aoki <osamu@debian.org>, 418058@bugs.debian.org
Cc: control@bugs.debian.org
Subject: Re: Bug#418058: iconv: half-smart on ascii compatible code conversion (latin1, shift-jis, ...)
Date: Thu, 12 Apr 2007 12:11:02 +0200
tags 418058 + unreproducible
thanks

All the diffs you provided are actually wrong.  In all those files, the
input character for ' is not 27 but E2 80 99, which is a UTF-8
sequence.  iconv behaves correctly here.

Please provide us with a correct input file (check it with hexdump) that
exhibits the problem.  I suggest gzipping it to avoid encoding
translation by your MUA.
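
The check requested here can be sketched in a few lines.  The following
is only an illustration in Python and is not part of the original
thread; the helper name non_ascii_bytes is hypothetical.  It lists every
byte above 0x7f in an input and shows that E2 80 99 is the UTF-8
encoding of U+2019 (right single quotation mark).  It also shows one
plausible way a "long UTF-8 sequence" can appear: if those three bytes
are treated as latin-1 input, each byte is converted separately.

```python
def non_ascii_bytes(data: bytes) -> list:
    """Return (offset, byte) pairs for every byte outside 0x00-0x7f."""
    return [(i, b) for i, b in enumerate(data) if b > 0x7F]


# A plain ASCII apostrophe is the single byte 0x27.
ascii_quote = b"'"

# U+2019 RIGHT SINGLE QUOTATION MARK encoded as UTF-8 is E2 80 99.
curly_quote = "\u2019".encode("utf-8")

# Reading those three bytes as latin-1 and re-encoding as UTF-8 converts
# each byte on its own, which yields a six-byte sequence.
double_encoded = curly_quote.decode("latin-1").encode("utf-8")

if __name__ == "__main__":
    print(ascii_quote.hex(" "))
    print(curly_quote.hex(" "))
    print(double_encoded.hex(" "))
    print(non_ascii_bytes(b"don" + curly_quote + b"t"))
```

An input file meant to exercise only the 7-bit range should make
non_ascii_bytes return an empty list before it is handed to iconv.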

-- 
  .''`.  Aurelien Jarno	            | GPG: 1024D/F1BCDB73
 : :' :  Debian developer           | Electrical Engineer
 `. `'   aurel32@debian.org         | aurelien@aurel32.net
   `-    people.debian.org/~aurel32 | www.aurel32.net



Tags added: unreproducible Request was from Aurelien Jarno <aurelien@aurel32.net> to control@bugs.debian.org. (Thu, 12 Apr 2007 10:21:03 GMT).


Information forwarded to debian-bugs-dist@lists.debian.org, GNU Libc Maintainers <debian-glibc@lists.debian.org>:
Bug#418058; Package libc6.


Acknowledgement sent to Osamu Aoki <osamu@debian.org>:
Extra info received and forwarded to list. Copy sent to GNU Libc Maintainers <debian-glibc@lists.debian.org>.


Message #17 received at 418058@bugs.debian.org:

From: Osamu Aoki <osamu@debian.org>
To: Aurelien Jarno <aurelien@aurel32.net>
Cc: 418058@bugs.debian.org
Subject: Re: Bug#418058: iconv: half-smart on ascii compatible code conversion (latin1, shift-jis, ...)
Date: Fri, 13 Apr 2007 00:28:29 +0900
On Thu, Apr 12, 2007 at 12:11:02PM +0200, Aurelien Jarno wrote:
> tags 418058 + unreproducible
> thanks
> All the diff you provide are actually wrong. In all those file, the
> input character for ' is not 27 but E2 80 99, which is an UTF-8
> sequence. iconv behaves correctly here.

Hmmm... I created these ASCII tables from the output of "man ascii" but
did not check their content enough.

> Please provide us a correct input file (check it with hexdump) that
> exhibits the problem. I suggests to gzip it to avoid encoding
> translation by your MUA.

Let me sort out the situation.  Quite likely, the problem is just the
Japanese yen mark.

Thanks for replying to this.



Severity set to `minor' from `important' Request was from Pierre Habouzit <madcoder@debian.org> to control@bugs.debian.org. (Mon, 23 Apr 2007 20:45:55 GMT).


Information forwarded to debian-bugs-dist@lists.debian.org, GNU Libc Maintainers <debian-glibc@lists.debian.org>:
Bug#418058; Package libc6.


Acknowledgement sent to Osamu Aoki <osamu@debian.org>:
Extra info received and forwarded to list. Copy sent to GNU Libc Maintainers <debian-glibc@lists.debian.org>.


Message #24 received at 418058@bugs.debian.org:

From: Osamu Aoki <osamu@debian.org>
To: Aurelien Jarno <aurelien@aurel32.net>, 418058@bugs.debian.org
Cc: control@bugs.debian.org
Subject: Bug#418058: iconv: half-smart on ascii compatible code conversion (shift-jis)
Date: Sat, 16 Jun 2007 10:37:47 +0900
tags 418058 - unreproducible
retitle 418058 iconv: half-smart on ascii compatible code conversion (shift-jis)
thanks

Let me update with better data, since the original report was
contaminated with other bugs such as groff's.  (Thanks to Aurelien Jarno
for checking them.)

Bug: \ and ~ (ASCII 92 and 126) are not handled right by iconv under
SHIFT-JIS (SJIS).

The conversion error of iconv itself over printable 7-bit characters was
tested with the attached script, with its result in diff.txt.
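
A test of this kind can be sketched as follows.  This is not the
attached script, whose contents are not shown in this log; it is a
minimal illustration in Python that uses Python's built-in codecs
instead of glibc's iconv, so its conversion tables may differ from
glibc's.  The function name changed_7bit is hypothetical.

```python
def changed_7bit(codec: str) -> dict:
    """Decode every printable 7-bit byte with `codec` and return only the
    bytes whose decoded character differs from the ASCII character."""
    changed = {}
    for byte in range(0x20, 0x7F):
        decoded = bytes([byte]).decode(codec)
        if decoded != chr(byte):
            changed[byte] = decoded
    return changed


if __name__ == "__main__":
    # An empty dict means the codec leaves printable ASCII untouched.
    for name in ("ascii", "latin-1", "cp932"):
        print(name, changed_7bit(name))
```

Any byte reported by such a check is a position where the encoding's
table is not ASCII compatible in the 7-bit range.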

The conversion error also occurs on GB for # and ~ (ASCII 35 and 126).
Please ask Chinese-speaking people about the GB situation, but I am
fairly certain this is a bug too.

NB: As for Japanese, I remember EUC-JP used to have a similar problem.

Rationale:

iconv should not do smart guessing for the 7-bit section of each
traditional encoding that is ASCII compatible.  They should all be the
same in that 7-bit section.

Popular C/perl/shell/... programs originally written in shift-jis
should not break if iconv is used to convert them to UTF-8.

Details:
For shift-jis, iconv maps character 0x5c to the UTF-8 YEN mark.  That
mapping to the UTF-8 YEN mark should be done from the yen mark code in
the 16-bit (full-width character) section and not from this 7-bit one,
which is 0x5c.  This is very bad for any program.  Another issue is
0x7e '~'.  This is translated to an upper bar.  Although some old
Japanese PCs (pre-IBM compatible, NEC 98 and Sharp MZ machines, which
used to run IBM-incompatible MS-DOS, I think) had an upper-bar-shaped
glyph for ~ on screen and keyboard, converting this ~ in data to the
UTF-8 upper bar breaks URL data stored on shift-jis machines.

These cosmetic differences were just font differences.  The code points
should not be moved for these.
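
The two code points in question can be written out explicitly.  The
sketch below is an illustration in Python and does not call iconv; the
names JISX0201_ROMAN_OVERRIDES and decode_7bit are hypothetical.  It
models a conversion that reads the 7-bit half as JIS X 0201 Roman (0x5c
as YEN SIGN U+00A5, 0x7e as OVERLINE U+203E) and compares the resulting
UTF-8 bytes with the plain ASCII reading.

```python
# Hypothetical table: the two positions where JIS X 0201 Roman differs
# from ASCII.
JISX0201_ROMAN_OVERRIDES = {0x5C: "\u00a5", 0x7E: "\u203e"}


def decode_7bit(data: bytes, overrides: dict) -> str:
    """Decode 7-bit bytes as ASCII except where `overrides` says otherwise."""
    if any(b > 0x7F for b in data):
        raise ValueError("this sketch only handles 7-bit input")
    return "".join(overrides.get(b, chr(b)) for b in data)


url = b"http://example.org/~user"
path = b"C:\\dir"

# ASCII reading: the UTF-8 output is byte-for-byte identical to the input.
as_ascii = decode_7bit(url, {}).encode("utf-8")

# JIS X 0201 Roman reading: the tilde becomes a three-byte overline.
as_roman = decode_7bit(url, JISX0201_ROMAN_OVERRIDES).encode("utf-8")

if __name__ == "__main__":
    print(as_ascii)
    print(as_roman)
    print(decode_7bit(path, JISX0201_ROMAN_OVERRIDES).encode("utf-8"))
```

Under the second reading the URL no longer contains the byte 0x7e and
the path no longer contains 0x5c, which is the breakage described above.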

Osamu

[ascii.tar.gz (application/octet-stream, attachment)]

Tags removed: unreproducible Request was from Osamu Aoki <osamu@debian.org> to control@bugs.debian.org. (Sat, 16 Jun 2007 01:39:02 GMT).


Changed Bug title to `iconv: half-smart on ascii compatible code conversion (shift-jis)' from `iconv: half-smart on ascii compatible code conversion (latin1, shift-jis, ...)'. Request was from Osamu Aoki <osamu@debian.org> to control@bugs.debian.org. (Sat, 16 Jun 2007 01:39:04 GMT).


Information forwarded to debian-bugs-dist@lists.debian.org, GNU Libc Maintainers <debian-glibc@lists.debian.org>:
Bug#418058; Package libc6.


Acknowledgement sent to Kenshi Muto <kmuto@debian.org>:
Extra info received and forwarded to list. Copy sent to GNU Libc Maintainers <debian-glibc@lists.debian.org>.


Message #33 received at 418058@bugs.debian.org:

From: Kenshi Muto <kmuto@debian.org>
To: 418058@bugs.debian.org
Subject: Re: iconv: half-smart on ascii compatible code conversion (shift-jis)
Date: Fri, 06 Jul 2007 14:17:39 +0900
> iconv should not to smart guessing for 7 bit section of each traditional
> encodings which was ASCII compatible.  They should be same in that 7 bit
> section.
> 
> For all popular C/perl/shell/... programs written originally in
> shift-jis  should not break if iconv is used to convert them in UTF-8.

Use 'cp932' (or 'windows-31j') instead of shift-jis.
IMO libc6 implements the "Shift-JIS to Unicode" mapping defined by
Unicode Inc. in the right way.
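
As a rough illustration of this suggestion, the sketch below converts
Shift-JIS-family source text to UTF-8 through a CP932 table.  It uses
Python's built-in cp932 codec as a stand-in for running iconv with CP932
as the source encoding, so it says nothing definite about glibc's own
tables; the function name sjis_source_to_utf8 and the sample text are
made up for the example.

```python
def sjis_source_to_utf8(raw: bytes) -> bytes:
    """Re-encode Shift-JIS-family bytes as UTF-8 via Python's cp932 codec."""
    return raw.decode("cp932").encode("utf-8")


# Sample C-like line with a Japanese comment (U+65E5 U+672C U+8A9E).
# The input bytes are produced by encoding, so no byte values are
# hard-coded here.
source_text = 'puts("~\\n"); /* \u65e5\u672c\u8a9e */'
raw = source_text.encode("cp932")
converted = sjis_source_to_utf8(raw)

if __name__ == "__main__":
    print(converted.decode("utf-8"))
```

With this codec the backslash and tilde come through as the same single
bytes, and only the comment text changes its byte representation.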

Thanks,
-- 
Kenshi Muto
kmuto@debian.org



Information forwarded to debian-bugs-dist@lists.debian.org, GNU Libc Maintainers <debian-glibc@lists.debian.org>:
Bug#418058; Package libc6.


Acknowledgement sent to Aurelien Jarno <aurelien@aurel32.net>:
Extra info received and forwarded to list. Copy sent to GNU Libc Maintainers <debian-glibc@lists.debian.org>.


Message #38 received at 418058@bugs.debian.org:

From: Aurelien Jarno <aurelien@aurel32.net>
To: Kenshi Muto <kmuto@debian.org>, 418058@bugs.debian.org
Subject: Re: Bug#418058: iconv: half-smart on ascii compatible code conversion (shift-jis)
Date: Sun, 22 Jul 2007 01:34:20 +0200
Kenshi Muto wrote:
>> iconv should not to smart guessing for 7 bit section of each traditional
>> encodings which was ASCII compatible.  They should be same in that 7 bit
>> section.
>>
>> For all popular C/perl/shell/... programs written originally in
>> shift-jis  should not break if iconv is used to convert them in UTF-8.
> 
> Use 'cp932' (or 'windows-31j') instead of shift-jis.
> IMO libc6 implements the "Shift-JIS to Unicode" mapping defined by
> Unicode Inc. in the right way.

Does this mean that this bug could be closed?

-- 
  .''`.  Aurelien Jarno	            | GPG: 1024D/F1BCDB73
 : :' :  Debian developer           | Electrical Engineer
 `. `'   aurel32@debian.org         | aurelien@aurel32.net
   `-    people.debian.org/~aurel32 | www.aurel32.net



Information forwarded to debian-bugs-dist@lists.debian.org, GNU Libc Maintainers <debian-glibc@lists.debian.org>:
Bug#418058; Package libc6.


Acknowledgement sent to Kenshi Muto <kmuto@debian.org>:
Extra info received and forwarded to list. Copy sent to GNU Libc Maintainers <debian-glibc@lists.debian.org>.


Message #43 received at 418058@bugs.debian.org:

From: Kenshi Muto <kmuto@debian.org>
To: Aurelien Jarno <aurelien@aurel32.net>
Cc: 418058@bugs.debian.org
Subject: Re: Bug#418058: iconv: half-smart on ascii compatible code conversion (shift-jis)
Date: Sun, 22 Jul 2007 09:09:40 +0900
Hi,

At Sun, 22 Jul 2007 01:34:20 +0200,
Aurelien Jarno wrote:
> Kenshi Muto wrote:
> > Use 'cp932' (or 'windows-31j') instead of shift-jis.
> > IMO libc6 implements the "Shift-JIS to Unicode" mapping defined by
> > Unicode Inc. in the right way.
> 
> Does it means that this bug could be closed?

Yes, I think so.  This is by design rather than a bug.

Thanks,
-- 
Kenshi Muto
kmuto@debian.org



Reply sent to Pierre Habouzit <madcoder@debian.org>:
You have taken responsibility.


Notification sent to Osamu Aoki <osamu@debian.org>:
Bug acknowledged by developer.


Message #48 received at 418058-done@bugs.debian.org:

From: Pierre Habouzit <madcoder@debian.org>
To: Kenshi Muto <kmuto@debian.org>, 418058-done@bugs.debian.org
Cc: Aurelien Jarno <aurelien@aurel32.net>
Subject: Re: Bug#418058: iconv: half-smart on ascii compatible code conversion (shift-jis)
Date: Sun, 22 Jul 2007 11:04:17 +0200
On Sun, Jul 22, 2007 at 09:09:40AM +0900, Kenshi Muto wrote:
> Hi,
> 
> At Sun, 22 Jul 2007 01:34:20 +0200,
> Aurelien Jarno wrote:
> > Kenshi Muto wrote:
> > > Use 'cp932' (or 'windows-31j') instead of shift-jis.
> > > IMO libc6 implements the "Shift-JIS to Unicode" mapping defined by
> > > Unicode Inc. in the right way.
> > 
> > Does it means that this bug could be closed?
> 
> Yes, I think so. This is from a design rather than a bug.
> 
> Thanks,
> -- 
> Kenshi Muto
> kmuto@debian.org
> 

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

Information stored:
Bug#418058; Package libc6.


Acknowledgement sent to Osamu Aoki <osamu@debian.org>:
Extra info received and filed, but not forwarded.


Message #53 received at 418058-quiet@bugs.debian.org:

From: Osamu Aoki <osamu@debian.org>
To: 418058-quiet@bugs.debian.org, madcoder@debian.org, kmuto@debian.org
Subject: Thanks: Bug#418058
Date: Sun, 22 Jul 2007 21:42:57 +0900
Hi,

You guys beat me to closing this.

> > Yes, I think so. This is from a design rather than a bug.
> > Kenshi Muto

Right.

Thanks,

Osamu




Bug archived. Request was from Debbugs Internal Request <owner@bugs.debian.org> to internal_control@bugs.debian.org. (Mon, 20 Aug 2007 07:33:34 GMT).



Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Mon Jun 5 03:19:02 2023; Machine Name: buxtehude
