Debian Bug report logs - #139861
tr: no UTF-8 support

Package: coreutils; Maintainer for coreutils is Michael Stone <mstone@debian.org>; Source for coreutils is src:coreutils.

Reported by: Torsten Hilbrich <debian-user-german@myrkr.in-berlin.de>

Date: Mon, 25 Mar 2002 17:18:01 UTC

Severity: normal

Tags: confirmed, upstream

Merged with 388689, 431231, 613155, 649729, 721324

Found in versions coreutils/5.97-5.3, coreutils/8.13-3, coreutils/8.21-1, coreutils/8.32-4, coreutils/5.97-5, coreutils/8.5-1, coreutils/5.96-3, coreutils/9.1-1, coreutils/6.10~20071127-1


Report forwarded to debian-bugs-dist@lists.debian.org, Ben Collins <bcollins@debian.org>, glibc@packages.qa.debian.org:
Bug#139861; Package locales. (full text, mbox, link).


Acknowledgement sent to Torsten Hilbrich <debian-user-german@myrkr.in-berlin.de>:
New Bug report received and forwarded. Copy sent to Ben Collins <bcollins@debian.org>, glibc@packages.qa.debian.org. (full text, mbox, link).


Message #5 received at submit@bugs.debian.org (full text, mbox, reply):

From: Torsten Hilbrich <debian-user-german@myrkr.in-berlin.de>
To: submit@bugs.debian.org
Subject: LC_CTYPE with UTF-8 doesn't work correctly
Date: Mon, 25 Mar 2002 18:11:00 +0100
[Message part 1 (text/plain, inline)]
Package: locales
Version: 2.2.5-3

Yesterday I noticed that the UTF-8 encoding doesn't seem to be
correctly supported by the current locales package.  I have problems
using the lower and upper case conversion.

Here are two different ways to reproduce this behaviour.  In both cases
I used an
"xterm -u8 -fn -misc-fixed-medium-r-normal--14-130-75-75-c-70-iso10646-1"
with LC_ALL=de_DE.UTF-8 to test the programs.  In this email I display
the umlaut characters in latin1; I will append the typescript with the
real UTF-8 encoding of the characters.

- Programs like tr (textutils 2.0-12):

$ tr [:lower:] [:upper:]
oauöäü                           # the input
OAUöäü                           # the output

The ASCII alphabetic characters are correctly transformed; the UTF-8
encoded umlauts are not.
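
The output shown above is what byte-wise case conversion produces on UTF-8 input. The following Python sketch illustrates the difference; it is not how tr is implemented and only mimics the observed behaviour. `bytes.upper()` converts ASCII letters only, much like a converter that looks at one byte at a time, while `str.upper()` works on decoded characters.

```python
# Illustration only: contrasts byte-wise and character-wise upper-casing.
# This mimics the observed behaviour; it is not tr's implementation.

text = "oauöäü"

# Byte-wise: each umlaut is two bytes >= 0x80 in UTF-8, and bytes.upper()
# only converts ASCII letters, so the umlauts pass through unchanged.
bytewise = text.encode("utf-8").upper().decode("utf-8")

# Character-wise: convert whole decoded characters.
charwise = text.upper()

print(bytewise)  # OAUöäü
print(charwise)  # OAUÖÄÜ
```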

- The bash (2.05a-9):
$ for i in a A ä Ä; do case $i in [[:lower:]]) echo "$i is lc"; esac; done
a is lc                          # the output

The ä umlaut should also be output.

My /etc/locale.gen has the following contents:

# /etc/locale.gen
de_DE ISO-8859-1
de_DE.UTF-8 UTF-8
de_DE@euro ISO-8859-15

Here is the locale setting while doing the tests:

~$ locale
LANG=de_DE
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=de_DE.UTF-8

Here is the typescript file, as binary attachment to allow correct
transmission of utf-8.

[typescript (application/octet-stream, attachment)]
[Message part 3 (text/plain, inline)]
        Torsten

Information forwarded to debian-bugs-dist@lists.debian.org, GNU Libc Maintainers <debian-glibc@lists.debian.org>, glibc@packages.qa.debian.org:
Bug#139861; Package locales. (full text, mbox, link).


Acknowledgement sent to GOTO Masanori <gotom@debian.or.jp>:
Extra info received and forwarded to list. Copy sent to GNU Libc Maintainers <debian-glibc@lists.debian.org>, glibc@packages.qa.debian.org. (full text, mbox, link).


Message #10 received at 139861@bugs.debian.org (full text, mbox, reply):

From: GOTO Masanori <gotom@debian.or.jp>
To: Torsten Hilbrich <debian-user-german@myrkr.in-berlin.de>, 139861@bugs.debian.org
Subject: Re: LC_CTYPE with UTF-8 doesn't work correctly
Date: Wed, 26 Feb 2003 20:29:55 +0900
> Yesterday I noticed that the UTF-8 encoding doesn't seem to be
> correctly supported by the current locales package.  I have problems
> using the lower and upper case conversion.
> 
> Here are two different ways to reproduce this behaviour.  In both cases
> I used an
> "xterm -u8 -fn -misc-fixed-medium-r-normal--14-130-75-75-c-70-iso10646-1"
> with LC_ALL=de_DE.UTF-8 to test the programs.  In this email I display
> the umlaut characters in latin1; I will append the typescript with the
> real UTF-8 encoding of the characters.
> 
> - Programs like tr (textutils 2.0-12):
> 
> $ tr [:lower:] [:upper:]
> oauöäü                           # the input
> OAUöäü                           # the output
> 
> The ASCII alphabetic characters are correctly transformed; the UTF-8
> encoded umlauts are not.
> 
> - The bash (2.05a-9):
> $ for i in a A ä Ä; do case $i in [[:lower:]]) echo "$i is lc"; esac; done
> a is lc                          # the output
> 
> The ä umlaut should also be output.

It should be fixed in sid glibc 2.3.1, please check.

Regards,
-- gotom



Information forwarded to debian-bugs-dist@lists.debian.org, GNU Libc Maintainers <debian-glibc@lists.debian.org>, glibc@packages.qa.debian.org:
Bug#139861; Package locales. (full text, mbox, link).


Acknowledgement sent to Torsten Hilbrich <debbug@myrkr.in-berlin.de>:
Extra info received and forwarded to list. Copy sent to GNU Libc Maintainers <debian-glibc@lists.debian.org>, glibc@packages.qa.debian.org. (full text, mbox, link).


Message #15 received at 139861@bugs.debian.org (full text, mbox, reply):

From: Torsten Hilbrich <debbug@myrkr.in-berlin.de>
To: 139861@bugs.debian.org, GOTO Masanori <gotom@debian.or.jp>
Subject: Re: LC_CTYPE with UTF-8 doesn't work correctly
Date: Thu, 27 Feb 2003 20:01:25 +0100
GOTO Masanori <gotom@debian.or.jp> writes:

[Problems with [:lower:] and [:upper:] in UTF-8 locale]

> It should be fixed in sid glibc 2.3.1, please check.

I have now installed the latest versions of these programs:

ii  bash           2.05b-3        The GNU Bourne Again SHell
ii  coreutils      4.5.7-1        The GNU core utilities
ii  grep           2.5.1-2        GNU grep, egrep and fgrep
ii  locales        2.3.1-14       GNU C Library: National Language (locale) da
ii  libc6          2.3.1-14       GNU C Library: Shared libraries and Timezone

The following statements work as expected:

$ grep [[:lower:]]                   
$ grep [[:upper:]]                   
$ case ... in [[:lower:]]) ... esac  # bash
$ case ... in [[:upper:]]) ... esac  # bash

The following don't work with non-ASCII characters when LC_CTYPE is
set to de_DE.UTF8

$ tr [:lower:] [:upper:]

Using "tr [:alpha:] '-'" I found out that non-ASCII letters (valid
letters in the de_DE locale) are not even recognized.  In the
de_DE.ISO-8859-1 locale both statements work correctly.

I don't know if this is related to this single program or can be
caused by problems in libc6 or locales data.  Please tell me if you
think that I should report to coreutils instead.

So half the bug report is resolved,

        Torsten



Information forwarded to debian-bugs-dist@lists.debian.org, GNU Libc Maintainers <debian-glibc@lists.debian.org>, glibc@packages.qa.debian.org:
Bug#139861; Package locales. (full text, mbox, link).


Acknowledgement sent to GOTO Masanori <gotom@debian.or.jp>:
Extra info received and forwarded to list. Copy sent to GNU Libc Maintainers <debian-glibc@lists.debian.org>, glibc@packages.qa.debian.org. (full text, mbox, link).


Message #20 received at 139861@bugs.debian.org (full text, mbox, reply):

From: GOTO Masanori <gotom@debian.or.jp>
To: Torsten Hilbrich <debbug@myrkr.in-berlin.de>
Cc: 139861@bugs.debian.org, GOTO Masanori <gotom@debian.or.jp>
Subject: Re: LC_CTYPE with UTF-8 doesn't work correctly
Date: Fri, 28 Feb 2003 10:15:02 +0900
At Thu, 27 Feb 2003 20:01:25 +0100,
Torsten Hilbrich wrote:
> 
> GOTO Masanori <gotom@debian.or.jp> writes:
> 
> [Problems with [:lower:] and [:upper:] in UTF-8 locale]
> 
> > It should be fixed in sid glibc 2.3.1, please check.
> 
> I have now installed the latest versions of these programs:
> 
> ii  bash           2.05b-3        The GNU Bourne Again SHell
> ii  coreutils      4.5.7-1        The GNU core utilities
> ii  grep           2.5.1-2        GNU grep, egrep and fgrep
> ii  locales        2.3.1-14       GNU C Library: National Language (locale) da
> ii  libc6          2.3.1-14       GNU C Library: Shared libraries and Timezone
> 
> The following statements work as expected:
> 
> $ grep [[:lower:]]                   
> $ grep [[:upper:]]                   
> $ case ... in [[:lower:]]) ... esac  # bash
> $ case ... in [[:upper:]]) ... esac  # bash
> 
> The following don't work with non-ASCII characters when LC_CTYPE is
> set to de_DE.UTF8
> 
> $ tr [:lower:] [:upper:]
> 
> Using "tr [:alpha:] '-'" I found out that non-ASCII letters (valid
> letters in the de_DE locale) are not even recognized.  In the
> de_DE.ISO-8859-1 locale both statements work correctly.
> 
> I don't know if this is related to this single program or can be
> caused by problems in libc6 or locales data.  Please tell me if you
> think that I should report to coreutils instead.
> 
> So half the bug report is resolved,

Coreutils uses an old regex engine, so tr is not ready for UTF-8.
I think it's a TODO item for coreutils/textutils.
I am reassigning this bug to coreutils.

-- gotom



Bug reassigned from package `locales' to `coreutils'. Request was from GOTO Masanori <gotom@debian.or.jp> to control@bugs.debian.org. (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Michael Stone <mstone@debian.org>:
Bug#139861; Package coreutils. (full text, mbox, link).


Acknowledgement sent to Margarita Manterola <debian@marga.com.ar>:
Extra info received and forwarded to list. Copy sent to Michael Stone <mstone@debian.org>. (full text, mbox, link).


Message #27 received at 139861@bugs.debian.org (full text, mbox, reply):

From: Margarita Manterola <debian@marga.com.ar>
To: 139861@bugs.debian.org
Subject: This bug is still present.
Date: Thu, 27 Jan 2005 16:40:40 -0300
[Message part 1 (text/plain, inline)]
I've just verified, and with version  5.2.1-2 of coreutils, I can still
reproduce the bug:

using LANG=es_AR.UTF8

$ echo áéí | tr áéí ÁÉÍ
ÁÉÍ

$ echo áéí | tr [:lower:] [:upper:]
áéí

$ echo aeiáéí | tr [:lower:] [:upper:]
AEIáéí

$ echo áéí | grep [[:lower:]]
áéí
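
The first command above succeeds even without multibyte support, most likely by coincidence. Assuming tr maps byte for byte, every character in both sets is two bytes long and shares the lead byte 0xC3, so the byte-level table happens to line up. A Python sketch of that assumption, with `bytes.translate` standing in for tr:

```python
# Sketch under the assumption that tr translates one byte at a time.
src = "áéí".encode("utf-8")   # c3 a1 c3 a9 c3 ad
dst = "ÁÉÍ".encode("utf-8")   # c3 81 c3 89 c3 8d

# Position-by-position byte mapping: c3->c3, a1->81, a9->89, ad->8d.
table = bytes.maketrans(src, dst)

result = "áéí\n".encode("utf-8").translate(table)
print(result.decode("utf-8"), end="")  # ÁÉÍ
```

The character classes in the second and third commands have no such luck, since [:lower:] and [:upper:] only contain single-byte characters.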

Please try and fix it.

-- 
 Bessos,    (o_
    Marga.  (\)_
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org:
Bug#139861; Package coreutils. (full text, mbox, link).


Acknowledgement sent to Michael Stone <mstone@debian.org>:
Extra info received and forwarded to list. (full text, mbox, link).


Message #32 received at 139861@bugs.debian.org (full text, mbox, reply):

From: Michael Stone <mstone@debian.org>
To: Margarita Manterola <debian@marga.com.ar>, 139861@bugs.debian.org
Subject: Re: Bug#139861: This bug is still present.
Date: Thu, 27 Jan 2005 15:08:21 -0500
On Thu, Jan 27, 2005 at 04:40:40PM -0300, you wrote:
>I've just verified, and with version  5.2.1-2 of coreutils, I can still
>reproduce the bug:

Yes, coreutils does not claim to handle utf-8.

Mike Stone




Information forwarded to debian-bugs-dist@lists.debian.org, Michael Stone <mstone@debian.org>:
Bug#139861; Package coreutils. (full text, mbox, link).


Acknowledgement sent to Margarita Manterola <debian@marga.com.ar>:
Extra info received and forwarded to list. Copy sent to Michael Stone <mstone@debian.org>. (full text, mbox, link).


Message #37 received at 139861@bugs.debian.org (full text, mbox, reply):

From: Margarita Manterola <debian@marga.com.ar>
To: Michael Stone <mstone@debian.org>
Cc: 139861@bugs.debian.org
Subject: Re: Bug#139861: This bug is still present.
Date: Thu, 27 Jan 2005 17:35:13 -0300
[Message part 1 (text/plain, inline)]
Hola Michael Stone!

> On Thu, Jan 27, 2005 at 04:40:40PM -0300, you wrote:
> >I've just verified, and with version  5.2.1-2 of coreutils, I can still
> >reproduce the bug:
> Yes, coreutils does not claim to handle utf-8.

Will it ever handle it?  I guess you must know that UTF8 seems to be "the
encoding of the future" :), and thus it's a good idea to handle it.

The bug had no activity since 2003, and I'm re-checking old bugs to see if
they are still present, so that they don't go forgotten.

-- 
 Besitos,   {o_
     Marga. (')_
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Michael Stone <mstone@debian.org>:
Bug#139861; Package coreutils. (full text, mbox, link).


Acknowledgement sent to Jim Meyering <jim@meyering.net>:
Extra info received and forwarded to list. Copy sent to Michael Stone <mstone@debian.org>. (full text, mbox, link).


Message #42 received at 139861@bugs.debian.org (full text, mbox, reply):

From: Jim Meyering <jim@meyering.net>
To: Margarita Manterola <debian@marga.com.ar>
Cc: 139861@bugs.debian.org, Michael Stone <mstone@debian.org>
Subject: Re: Bug#139861: This bug is still present.
Date: Thu, 27 Jan 2005 22:02:51 +0100
Margarita Manterola <debian@marga.com.ar> wrote:
> Hola Michael Stone!
>
>> On Thu, Jan 27, 2005 at 04:40:40PM -0300, you wrote:
>> >I've just verified, and with version  5.2.1-2 of coreutils, I can still
>> >reproduce the bug:
>> Yes, coreutils does not claim to handle utf-8.
>
> Will it ever handle it?  I guess you must know that UTF8 seems to be "the
> encoding of the future" :), and thus it's a good idea to handle it.
>
> The bug had no activity since 2003, and I'm re-checking old bugs to see if
> they are still present, so that they don't go forgotten.

FWIW, converting some of the textutils to deal with utf-8 is
most definitely on the upstream to-do list, but I don't know
when it'll be done.



Tags added: confirmed, upstream Request was from Thomas Hood <jdthood@yahoo.co.uk> to control@bugs.debian.org. (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Michael Stone <mstone@debian.org>:
Bug#139861; Package coreutils. (full text, mbox, link).


Acknowledgement sent to Max Kutny <mkut@umc.ua>:
Extra info received and forwarded to list. Copy sent to Michael Stone <mstone@debian.org>. (full text, mbox, link).


Message #49 received at 139861@bugs.debian.org (full text, mbox, reply):

From: Max Kutny <mkut@umc.ua>
To: Debian Bug Tracking System <139861@bugs.debian.org>
Subject: coreutils are utf-8 aware on redhat
Date: Sat, 03 Jun 2006 10:16:41 +0300
Package: coreutils
Version: 5.96-3
Followup-For: Bug #139861

I'm not sure if it's fixed upstream but RedHat coreutils package (5.2.1-48)
is definitely utf-8 aware (tested on fold).

So there might be a sort of patch that could be applied to debian package.

Thanks.

-- System Information:
Debian Release: testing/unstable
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: i386 (i686)
Shell:  /bin/sh linked to /bin/bash
Kernel: Linux 2.6.16-1-686
Locale: LANG=uk_UA.UTF-8, LC_CTYPE=uk_UA.UTF-8 (charmap=UTF-8)

Versions of packages coreutils depends on:
ii  libacl1                       2.2.37-1   Access control list shared library
ii  libc6                         2.3.6-9    GNU C Library: Shared libraries
ii  libselinux1                   1.30-1     SELinux shared libraries

coreutils recommends no packages.

-- no debconf information



Information forwarded to debian-bugs-dist@lists.debian.org:
Bug#139861; Package coreutils. (full text, mbox, link).


Acknowledgement sent to Michael Stone <mstone@debian.org>:
Extra info received and forwarded to list. (full text, mbox, link).


Message #54 received at 139861@bugs.debian.org (full text, mbox, reply):

From: Michael Stone <mstone@debian.org>
To: Max Kutny <mkut@umc.ua>, 139861@bugs.debian.org
Subject: Re: Bug#139861: coreutils are utf-8 aware on redhat
Date: Sat, 03 Jun 2006 08:05:15 -0400
On Sat, Jun 03, 2006 at 10:16:41AM +0300, you wrote:
>I'm not sure if it's fixed upstream but RedHat coreutils package (5.2.1-48)
>is definitely utf-8 aware (tested on fold).

Last time I looked, redhat hacked kinda utf-8 support onto the package. 
It wasn't complete, it was just enough to pass the tests. I'd rather not 
do it than do it wrong.

Mike Stone



Information forwarded to debian-bugs-dist@lists.debian.org, Michael Stone <mstone@debian.org>:
Bug#139861; Package coreutils. (full text, mbox, link).


Acknowledgement sent to Lucas Nussbaum <lucas@lucas-nussbaum.net>:
Extra info received and forwarded to list. Copy sent to Michael Stone <mstone@debian.org>. (full text, mbox, link).


Message #59 received at 139861@bugs.debian.org (full text, mbox, reply):

From: Lucas Nussbaum <lucas@lucas-nussbaum.net>
To: 431231@bugs.debian.org, 139861@bugs.debian.org, 388689@bugs.debian.org
Cc: control@bugs.debian.org
Subject: Re: Bug#431231: tr fails with UTF-8
Date: Tue, 22 Jan 2008 20:56:51 +0100
forcemerge 139861 388689 431231
tags 139861 + upstream confirmed wontfix
found 139861 6.10~20071127-1
thanks

Hi,

I'm merging these bugs (all about tr not supporting UTF-8), which still
affect the current coreutils in experimental. "wontfix" indicates that
this is not going to be fixed by a debian-specific patch, but that the
problem should be fixed upstream first.
-- 
| Lucas Nussbaum
| lucas@lucas-nussbaum.net   http://www.lucas-nussbaum.net/ |
| jabber: lucas@nussbaum.fr             GPG: 1024D/023B3F4F |




Forcibly Merged 139861 388689 431231. Request was from Lucas Nussbaum <lucas@lucas-nussbaum.net> to control@bugs.debian.org. (Tue, 22 Jan 2008 19:57:05 GMT) (full text, mbox, link).


Tags added: upstream, confirmed, wontfix Request was from Lucas Nussbaum <lucas@lucas-nussbaum.net> to control@bugs.debian.org. (Tue, 22 Jan 2008 19:57:07 GMT) (full text, mbox, link).


Bug marked as found in version 6.10~20071127-1. Request was from Lucas Nussbaum <lucas@lucas-nussbaum.net> to control@bugs.debian.org. (Tue, 22 Jan 2008 19:57:08 GMT) (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Michael Stone <mstone@debian.org>:
Bug#139861; Package coreutils. (Fri, 11 Feb 2011 10:06:03 GMT) (full text, mbox, link).


Acknowledgement sent to Jonathan Nieder <jrnieder@gmail.com>:
Extra info received and forwarded to list. Copy sent to Michael Stone <mstone@debian.org>. (Fri, 11 Feb 2011 10:06:03 GMT) (full text, mbox, link).


Message #70 received at 139861@bugs.debian.org (full text, mbox, reply):

From: Jonathan Nieder <jrnieder@gmail.com>
To: Lucas Nussbaum <lucas@lucas-nussbaum.net>
Cc: 139861@bugs.debian.org
Subject: Re: tr fails with UTF-8
Date: Fri, 11 Feb 2011 04:04:13 -0600
tags 139861 - wontfix
quit

Lucas Nussbaum wrote:

> I'm merging these bugs (all about tr not supporting UTF-8), which still
> affect the current coreutils in experimental. "wontfix" indicates that
> this is not going to be fixed by a debian-specific patch, but that the
> problem should be fixed upstream first.

That isn't what wontfix usually means, is it?

Thanks; I do agree that it seems best for anyone working on this to
just communicate directly with upstream.




Removed tag(s) wontfix. Request was from Jonathan Nieder <jrnieder@gmail.com> to control@bugs.debian.org. (Fri, 11 Feb 2011 10:06:06 GMT) (full text, mbox, link).


Merged 139861 388689 431231 613155. Request was from Benoît Knecht <benoit.knecht@fsfe.org> to control@bugs.debian.org. (Sun, 10 Jul 2011 09:57:24 GMT) (full text, mbox, link).


Changed Bug title to 'tr: no UTF-8 support' from 'LC_CTYPE with UTF-8 doesn't work correctly' Request was from Benoît Knecht <benoit.knecht@fsfe.org> to control@bugs.debian.org. (Sun, 10 Jul 2011 09:57:26 GMT) (full text, mbox, link).


Forcibly Merged 139861 388689 431231 613155 649729. Request was from Bob Proulx <bob@proulx.com> to control@bugs.debian.org. (Fri, 03 Feb 2012 07:03:57 GMT) (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Michael Stone <mstone@debian.org>:
Bug#139861; Package coreutils. (Fri, 13 Apr 2012 11:04:08 GMT) (full text, mbox, link).


Acknowledgement sent to Paul Sladen <debian@paul.sladen.org>:
Extra info received and forwarded to list. Copy sent to Michael Stone <mstone@debian.org>. (Fri, 13 Apr 2012 11:05:21 GMT) (full text, mbox, link).


Message #83 received at 139861@bugs.debian.org (full text, mbox, reply):

From: Paul Sladen <debian@paul.sladen.org>
To: Debbugs - Bug 139861 <139861@bugs.debian.org>
Subject: sort -u incorrect behaviour
Date: Fri, 13 Apr 2012 11:37:04 +0100 (BST)
More examples:

$ echo -e "日\n本\nで\nは" | sort -u | wc -l
3
$ echo -e "日\n本\nで\nは" | sort | wc -l
4

Something is quite wrong (i.e. this is definitely *incorrect behaviour*,
rather than merely a difference of opinion over implementation).





Information forwarded to debian-bugs-dist@lists.debian.org, Michael Stone <mstone@debian.org>:
Bug#139861; Package coreutils. (Mon, 29 Sep 2014 17:33:05 GMT) (full text, mbox, link).


Acknowledgement sent to dimas000@ya.ru:
Extra info received and forwarded to list. Copy sent to Michael Stone <mstone@debian.org>. (Mon, 29 Sep 2014 17:33:05 GMT) (full text, mbox, link).


Message #88 received at 139861@bugs.debian.org (full text, mbox, reply):

From: dimas <dimas000@ya.ru>
To: 139861@bugs.debian.org
Subject: Re: tr: no UTF-8 support
Date: Mon, 29 Sep 2014 21:30:27 +0400
here is yet another example of tr working wrong with cyrillic chars:
>21:17:36 272 ~$ echo "ололо" | xxd
0000000: d0be d0bb d0be d0bb d0be 0a              ...........
>21:18:06 272 ~$ echo "ОЛОЛО" | xxd
0000000: d09e d09b d09e d09b d09e 0a              ...........
>21:18:21 272 ~$ echo "ололо" | tr 'а-я' 'А-Я' | xxd
0000000: b09e b09b b09e b09b b09e 0a              ...........
first is cyrillic text in lowercase, second in uppercase, and the last is what
tr produced for given range substitution. as we can see, the range was
mistakenly moved from 0xd0XX (where cyrillic chars reside in unicode) to 0xb0XX
(don't know, what's that), while each second byte is correct



Information forwarded to debian-bugs-dist@lists.debian.org, Michael Stone <mstone@debian.org>:
Bug#139861; Package coreutils. (Mon, 29 Sep 2014 19:03:10 GMT) (full text, mbox, link).


Message #91 received at 139861@bugs.debian.org (full text, mbox, reply):

From: Bob Proulx <bob@proulx.com>
To: dimas000@ya.ru, 139861@bugs.debian.org
Subject: Re: Bug#139861: tr: no UTF-8 support
Date: Mon, 29 Sep 2014 12:58:35 -0600
[Message part 1 (text/plain, inline)]
dimas wrote:
> here is yet another example of tr working wrong with cyrillic chars:
> >21:17:36 272 ~$ echo "ололо" | xxd

Yes.  Please read through the bug log that you are adding information
to.  It is well known that coreutils does not support UTF-8
characters.  Please start at the top and read through to the bottom.

  https://bugs.debian.org/139861

> 0000000: d0be d0bb d0be d0bb d0be 0a              ...........
> >21:18:06 272 ~$ echo "ОЛОЛО" | xxd
> 0000000: d09e d09b d09e d09b d09e 0a              ...........
> >21:18:21 272 ~$ echo "ололо" | tr 'а-я' 'А-Я' | xxd
> 0000000: b09e b09b b09e b09b b09e 0a              ...........

When you specify 'а-я' 'А-Я' you *think* you are specifying a range
from 'а' to 'я' but since the utilities are not multibyte aware what
you are actually specifying is:

  printf аяАЯ | od -c
  0000000 320 260 321 217 320 220 320 257

Here is the mapping.

  а \320\260
  я \321\217
  А \320\220
  Я \320\257

Therefore what you are *actually* telling tr is:

  tr '\320\260-\321\217' '\320\220-\320\257'
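
The reported output can be reproduced from that byte-level reading. The Python sketch below is an illustration, not coreutils code: `expand` is a hypothetical helper that treats `X-Y` as an inclusive range of byte values and ignores the escapes and character classes that real tr also accepts. It also assumes that later pairs overwrite earlier ones, which is what the observed output suggests.

```python
# Hypothetical helper: expand a tr-style set byte by byte, treating
# "X-Y" as an inclusive range of byte values.
def expand(spec):
    out, i = [], 0
    while i < len(spec):
        if i + 2 < len(spec) and spec[i + 1] == ord("-"):
            out.extend(range(spec[i], spec[i + 2] + 1))
            i += 3
        else:
            out.append(spec[i])
            i += 1
    return out

set1 = expand("а-я".encode("utf-8"))  # d0, then b0..d1, then 8f
set2 = expand("А-Я".encode("utf-8"))  # d0, then 90..d0, then af

# Build a 256-entry byte table; later pairs overwrite earlier ones,
# so d0 ends up mapped to b0 (from the range) rather than to itself.
table = list(range(256))
for a, b in zip(set1, set2):
    table[a] = b

data = "ололо\n".encode("utf-8")      # d0be d0bb d0be d0bb d0be 0a
result = data.translate(bytes(table))
print(result.hex())  # b09eb09bb09eb09bb09e0a
```

The lead byte 0xd0 falls inside the accidental range 0xb0-0xd1 and is shifted to 0xb0, while the second bytes 0xbe and 0xbb are shifted to 0x9e and 0x9b, which is why every second byte looks correct.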

> first is cyrillic text in lowercase, second in uppercase, and the last is what
> tr produced for given range substitution. as we can see, the range was
> mistakenly moved from 0xd0XX (where cyrillic chars reside in unicode) to 0xb0XX
> (don't know, what's that), while each second byte is correct

It is a known deficiency in coreutils that the utilities are not
multibyte aware.  The following can be found in the upstream source
package TODO file.

  Adapt tools like wc, tr, fmt, etc. (most of the textutils) to be
    multibyte aware.  The problem is that I want to avoid duplicating
    significant blocks of logic, yet I also want to incur only minimal
    (preferably `no') cost when operating in single-byte mode.

Some vendors have hacked in patches to make the utilities multibyte
aware but none of those patches have been considered clean enough to
incorporate into the upstream source yet.  Debian's maintainer has
stated that he does not want to diverge from upstream this radically
especially since there have been bugs reported with the multibyte
hacks.  The patches are very messy and incomplete.  The best course of
action would be to get this resolved upstream with the functionality
properly integrated.  Until then this remains a known deficiency.

Bob
[signature.asc (application/pgp-signature, inline)]

Marked as found in versions coreutils/8.21-1. Request was from Bob Proulx <bob@proulx.com> to control@bugs.debian.org. (Sun, 30 Nov 2014 21:54:15 GMT) (full text, mbox, link).


Merged 139861 388689 431231 613155 649729 721324 Request was from Bob Proulx <bob@proulx.com> to control@bugs.debian.org. (Sun, 30 Nov 2014 21:54:20 GMT) (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Michael Stone <mstone@debian.org>:
Bug#139861; Package coreutils. (Mon, 19 Feb 2018 03:48:03 GMT) (full text, mbox, link).


Acknowledgement sent to vadyba@klientai.eu:
Extra info received and forwarded to list. Copy sent to Michael Stone <mstone@debian.org>. (Mon, 19 Feb 2018 03:48:03 GMT) (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Michael Stone <mstone@debian.org>:
Bug#139861; Package coreutils. (Tue, 14 Jun 2022 15:30:02 GMT) (full text, mbox, link).


Acknowledgement sent to sergio <sergio+it@outerface.net>:
Extra info received and forwarded to list. Copy sent to Michael Stone <mstone@debian.org>. (Tue, 14 Jun 2022 15:30:02 GMT) (full text, mbox, link).


Message #105 received at 139861@bugs.debian.org (full text, mbox, reply):

From: sergio <sergio+it@outerface.net>
To: 139861@bugs.debian.org
Subject: Is this the same bug?
Date: Tue, 14 Jun 2022 17:50:20 +0300
Is this the same bug, or should a separate one be opened?

% echo '¡Hola!' | tr -d '¿'
�Hola!
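
This output is consistent with the byte-wise behaviour discussed earlier in this log. Assuming tr -d removes individual bytes rather than characters, deleting "¿" (c2 bf) also strips the shared lead byte c2 from "¡" (c2 a1) and leaves a lone continuation byte. A Python sketch of that assumption, with `bytes.translate` standing in for tr:

```python
# Sketch assuming tr -d removes individual bytes, not characters.
delete_set = "¿".encode("utf-8")      # c2 bf
data = "¡Hola!\n".encode("utf-8")     # c2 a1 48 6f 6c 61 21 0a

# bytes.translate(None, delete_set) drops every byte found in delete_set.
remaining = data.translate(None, delete_set)

# The orphaned a1 byte is not valid UTF-8 on its own, so a terminal
# (or errors="replace") shows U+FFFD in its place.
print(remaining)
print(remaining.decode("utf-8", errors="replace"), end="")
```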

-- 
sergio.



Information forwarded to debian-bugs-dist@lists.debian.org, tg@mirbsd.de, Michael Stone <mstone@debian.org>:
Bug#139861; Package coreutils. (Mon, 13 Mar 2023 14:42:07 GMT) (full text, mbox, link).


Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
Extra info received and forwarded to list. Copy sent to tg@mirbsd.de, Michael Stone <mstone@debian.org>. (Mon, 13 Mar 2023 14:42:07 GMT) (full text, mbox, link).


Message #110 received at 139861@bugs.debian.org (full text, mbox, reply):

From: Thorsten Glaser <tg@mirbsd.de>
To: Debian Bug Tracking System <139861@bugs.debian.org>
Subject: Re: tr: no UTF-8 support
Date: Mon, 13 Mar 2023 15:39:57 +0100
Package: coreutils
Version: 8.32-4+b1
Followup-For: Bug #139861
X-Debbugs-Cc: tg@mirbsd.de
Control: found 139861 9.1-1

Oh wow is this an old bug.

I thought, at first, it’s just character classes…

$ echo mäÄH | tr '[:upper:]' '[:lower:]'
mäÄh

… but apparently, yes, multibyte support is broken:

$ echo 'mäæn' | tr ä Ȁ
mȀȦn
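
The stray "Ȧ" follows from a byte-for-byte mapping. The Python sketch below assumes tr maps single bytes and uses `bytes.translate` as a stand-in; it is an illustration, not coreutils code.

```python
# Sketch assuming tr maps single bytes: "ä" is c3 a4 and "Ȁ" is c8 80,
# so the table becomes c3->c8 and a4->80.
table = bytes.maketrans("ä".encode("utf-8"), "Ȁ".encode("utf-8"))

data = "mäæn\n".encode("utf-8")     # 6d c3 a4 c3 a6 6e 0a
result = data.translate(table)       # 6d c8 80 c8 a6 6e 0a

# "æ" (c3 a6) shares the lead byte c3 with "ä", so it becomes c8 a6,
# which is the UTF-8 encoding of "Ȧ" (U+0226).
print(result.decode("utf-8"), end="")  # mȀȦn
```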



-- System Information:
Debian Release: 11.6
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable-security'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 5.10.0-20-amd64 (SMP w/2 CPU threads)
Locale: LANG=C, LC_CTYPE=C (charmap=UTF-8) (ignored: LC_ALL set to C.UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /bin/lksh
Init: sysvinit (via /sbin/init)

Versions of packages coreutils depends on:
ii  libacl1      2.2.53-10
ii  libattr1     1:2.4.48-6
ii  libc6        2.31-13+deb11u5
ii  libgmp10     2:6.2.1+dfsg-1+deb11u1
ii  libselinux1  3.1-3

coreutils recommends no packages.

coreutils suggests no packages.

-- no debconf information

Marked as found in versions coreutils/9.1-1. Request was from Thorsten Glaser <tg@mirbsd.de> to 139861-submit@bugs.debian.org. (Mon, 13 Mar 2023 14:42:07 GMT) (full text, mbox, link).




Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Tue Jul 2 18:24:38 2024; Machine Name: bembo
