Debian Bug report logs - #404980
mawk: UTF-8 multibyte characters are not handled properly

Package: mawk; Maintainer for mawk is Boyuan Yang <byang@debian.org>; Source for mawk is src:mawk.

Reported by: Teemu Likonen <tlikonen@iki.fi>

Date: Mon, 13 Jun 2005 14:48:09 UTC

Severity: normal

Done: Jonathan Nieder <jrnieder@gmail.com>

Bug is archived. No further changes may be made.



Report forwarded to debian-bugs-dist@lists.debian.org, James Troup <james@nocrew.org>:
Bug#313411; Package gawk.


Acknowledgement sent to Teemu Likonen <tlikonen@ulapland.fi>:
New Bug report received and forwarded. Copy sent to James Troup <james@nocrew.org>.


Message #5 received at submit@bugs.debian.org:

From: Teemu Likonen <tlikonen@ulapland.fi>
To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: gawk: UTF-8 multibyte characters are not handled properly
Date: Mon, 13 Jun 2005 17:42:48 +0300
Package: gawk
Version: 1:3.1.4-2
Severity: important


gawk does not handle UTF-8 multibyte characters properly. Here's an
example:


$ cat example.txt

A Only_a_singlebyte_character_here_(UTF-8:_41)
Ö A_letter_which_takes_two_bytes_(UTF-8:_c3_96)
€ A_currency_symbol_which_takes_three_bytes_(UTF-8:_e2_82_ac)


$ cat example.txt | awk '{ printf "%-5s%s\n",$1, $2 }'

A    Only_a_singlebyte_character_here_(UTF-8:_41)
Ö   A_letter_which_takes_two_bytes_(UTF-8:_c3_96)
€  A_currency_symbol_which_takes_three_bytes_(UTF-8:_e2_82_ac)


As we can see, the format specifier %-5s does not calculate field widths
correctly when the string contains multibyte characters. Unfortunately, this
makes gawk's field widths mostly unusable in a UTF-8 locale.
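
The padding in the output above is consistent with the field width being
counted in bytes rather than in characters. The following is a minimal
Python sketch of that arithmetic, not gawk's actual code; pad_bytes and
pad_chars are hypothetical helper names introduced here for illustration.

```python
def pad_bytes(s: str, width: int) -> str:
    """Left-justify s to `width` UTF-8 bytes, as a byte-counting %-5s would."""
    missing = width - len(s.encode("utf-8"))
    return s + " " * max(missing, 0)


def pad_chars(s: str, width: int) -> str:
    """Left-justify s to `width` code points, which is what the report expects."""
    return s.ljust(width)


# Compare both paddings for the three first fields of example.txt.
for first in ("A", "Ö", "€"):
    print(repr(pad_bytes(first, 5)), repr(pad_chars(first, 5)))
```

With byte counting, the two-byte letter gets three spaces and the three-byte
symbol gets two, which matches the misaligned columns in the report.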


-- System Information:
Debian Release: 3.1
  APT prefers testing
  APT policy: (850, 'testing'), (800, 'unstable')
Architecture: i386 (i686)
Kernel: Linux 2.6.8-2-k7
Locale: LANG=fi_FI.UTF-8, LC_CTYPE=fi_FI.UTF-8 (charmap=UTF-8)

Versions of packages gawk depends on:
ii  libc6                       2.3.2.ds1-22 GNU C Library: Shared libraries an

-- no debconf information



Changed Bug submitter from Teemu Likonen <tlikonen@ulapland.fi> to Teemu Likonen <tlikonen@iki.fi>. Request was from Teemu Likonen <tlikonen@iki.fi> to control@bugs.debian.org.


Changed Bug submitter from Teemu Likonen <tlikonen@iki.fi> to Teemu Likonen <tlikonen@iki.fi>. Request was from Teemu Likonen <tlikonen@iki.fi> to control@bugs.debian.org.


Information forwarded to debian-bugs-dist@lists.debian.org, James Troup <james@nocrew.org>:
Bug#313411; Package gawk.


Acknowledgement sent to Teemu Likonen <tlikonen@iki.fi>:
Extra info received and forwarded to list. Copy sent to James Troup <james@nocrew.org>.


Message #14 received at 313411@bugs.debian.org:

From: Teemu Likonen <tlikonen@iki.fi>
To: control@bugs.debian.org, 313411@bugs.debian.org
Subject: This also affects mawk
Date: Fri, 29 Dec 2006 23:55:30 +0200
clone 313411 -1
retitle -1 mawk: UTF-8 multibyte characters are not handled properly
thanks

/usr/bin/mawk seems to be the default awk interpreter in Etch. The same
UTF-8 bug is present in mawk too.



Bug 313411 cloned as bug 404980. Request was from Teemu Likonen <tlikonen@iki.fi> to control@bugs.debian.org.


Changed Bug title. Request was from Teemu Likonen <tlikonen@iki.fi> to control@bugs.debian.org.


Bug reassigned from package `gawk' to `mawk'. Request was from Teemu Likonen <tlikonen@iki.fi> to control@bugs.debian.org.


Severity set to `normal' from `important'. Request was from Teemu Likonen <tlikonen@iki.fi> to control@bugs.debian.org. (Tue, 08 Jan 2008 21:30:05 GMT).


Information forwarded to debian-bugs-dist@lists.debian.org, Steve Langasek <vorlon@debian.org>:
Bug#404980; Package mawk.


Acknowledgement sent to Francesco Poli <frx@firenze.linux.it>:
Extra info received and forwarded to list. Copy sent to Steve Langasek <vorlon@debian.org>.


Message #27 received at 404980@bugs.debian.org:

From: Francesco Poli <frx@firenze.linux.it>
To: 313411@bugs.debian.org, 404980@bugs.debian.org
Subject: What's the status of this bug?
Date: Thu, 18 Sep 2008 23:49:27 +0200
Hi!

This bug is a show-stopper whenever one wants to set a field width with
the %s format specifier.
It was reported quite some time ago, and I cannot see any activity since.
Can this bug at least be forwarded upstream, please?


BTW, I experienced this bug while trying to center lines of text inside
an 80-column container:

$ echo 'hello world' | awk '{ w = int((80 + length())/2); printf "%" w "s\n", $0; }'
                                  hello world
$ echo 'hèllo wörld' | awk '{ w = int((80 + length())/2); printf "%" w "s\n", $0; }'
                                hèllo wörld

Do you happen to know of a command-line tool that can read text lines
from stdin and write them centered to stdout?
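
One possible approach, sketched in Python rather than awk: apply the same
width formula as the one-liner above, but pad with a function that counts
code points instead of bytes. This is only an illustrative sketch;
center_line and center_lines are hypothetical names, and the 80-column
default is taken from the example above.

```python
def center_line(line: str, columns: int = 80) -> str:
    """Right-justify `line` so that it appears centered in `columns` columns."""
    width = (columns + len(line)) // 2  # same formula as the awk one-liner
    return line.rjust(width)            # str.rjust counts code points, not bytes


def center_lines(lines, columns: int = 80):
    """Yield each input line centered, with any trailing newline removed."""
    for line in lines:
        yield center_line(line.rstrip("\n"), columns)


# Both sample lines have 11 code points, so they get identical padding.
for out in center_lines(["hello world\n", "hèllo wörld\n"]):
    print(out)
```

Passing sys.stdin to center_lines would turn this into a stdin-to-stdout
filter. Counting code points is still only an approximation of display
width: double-width and combining characters would need extra handling.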


Thanks for any help.

-- 
 http://frx.netsons.org/doc/index.html#nanodocs
 The nano-document series is here!
..................................................... Francesco Poli .
 GnuPG key fpr == C979 F34B 27CE 5CD8 DC12  31B5 78F4 279B DD6D FCF4

Bug 404980 cloned as bug 572138. Request was from Jonathan Nieder <jrnieder@gmail.com> to control@bugs.debian.org. (Mon, 01 Mar 2010 20:12:08 GMT).


Reply sent to Jonathan Nieder <jrnieder@gmail.com>:
You have taken responsibility. (Mon, 01 Mar 2010 20:12:15 GMT).


Notification sent to Teemu Likonen <tlikonen@iki.fi>:
Bug acknowledged by developer. (Mon, 01 Mar 2010 20:12:15 GMT).


Message #34 received at 404980-done@bugs.debian.org:

From: Jonathan Nieder <jrnieder@gmail.com>
To: Teemu Likonen <tlikonen@ulapland.fi>, control@bugs.debian.org, 404980-done@bugs.debian.org
Cc: Francesco Poli <frx@firenze.linux.it>, Thomas E Dickey <dickey@invisible-island.net>
Subject: Re: mawk: UTF-8 multibyte characters are not handled properly
Date: Mon, 1 Mar 2010 14:09:38 -0600
clone 404980 -1
retitle -1 mawk: Please add a function wrapping wcswidth()
severity -1 wishlist
tags -1 + upstream
thanks

Hi Teemu,

Teemu Likonen wrote:

> $ cat example.txt
> 
> A Only_a_singlebyte_character_here_(UTF-8:_41)
> Ö A_letter_which_takes_two_bytes_(UTF-8:_c3_96)
> € A_currency_symbol_which_takes_three_bytes_(UTF-8:_e2_82_ac)
> 
> 
> $ cat example.txt | awk '{ printf "%-5s%s\n",$1, $2 }'
> 
> A    Only_a_singlebyte_character_here_(UTF-8:_41)
> Ö   A_letter_which_takes_two_bytes_(UTF-8:_c3_96)
> €  A_currency_symbol_which_takes_three_bytes_(UTF-8:_e2_82_ac)
> 
> 
> As we can see, the format specifier %-5s does not calculate field widths
> correctly when the string contains multibyte characters.

This behavior is shared with C printf, and sadly it is required.
POSIX is clear about this: the numeric argument to a %s format is a
number of bytes.  See the target of the “File Format Notation” link in
http://www.opengroup.org/onlinepubs/9699919799/utilities/awk.html#tag_20_06_13_10
So closing.

On the other hand, the functionality you are asking for would be very
nice to have in some form.

> Unfortunately, this
> makes gawk's field widths mostly unusable in a UTF-8 locale.

In C, it is understandable why the number of bytes was chosen: it avoids
nonobvious buffer overflow bugs with sprintf().  That problem does not
apply to awk, so maybe it would be possible to convince the Open Group
people to change the behavior (or add a new function)?
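
To illustrate what such a width function might compute, here is a rough
Python sketch. It is not wcswidth() itself and not a proposed mawk
interface: display_width and ljust_columns are hypothetical names, and the
column rules (two columns for East Asian wide or fullwidth characters, zero
for combining marks, one otherwise) are a simplifying assumption. Control
characters, for which wcswidth() returns -1, are not handled.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate the number of terminal columns `text` occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks share the column of their base character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters take two columns
        else:
            width += 1
    return width


def ljust_columns(text: str, columns: int) -> str:
    """Left-justify `text` to `columns` display columns instead of bytes."""
    return text + " " * max(columns - display_width(text), 0)


for sample in ("A", "Ö", "€", "日本"):
    print(repr(ljust_columns(sample, 5)), display_width(sample))
```

A formatter built on a function like this would pad by columns, so both the
byte length and the code point count of the string stop mattering.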

See http://unix.org/2008edition/ for the latest standards, and
http://austingroupbugs.net/main_page.php to contact the standards
bodies.

Jonathan




Bug archived. Request was from Debbugs Internal Request <owner@bugs.debian.org> to internal_control@bugs.debian.org. (Tue, 30 Mar 2010 07:40:58 GMT).




Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Sun Aug 11 20:40:58 2024; Machine Name: buxtehude
