Debian Bug report logs - #404980
mawk: UTF-8 multibyte characters are not handled properly
Reported by: Teemu Likonen <tlikonen@iki.fi>
Date: Mon, 13 Jun 2005 14:48:09 UTC
Severity: normal
Done: Jonathan Nieder <jrnieder@gmail.com>
Bug is archived. No further changes may be made.
Report forwarded to debian-bugs-dist@lists.debian.org, James Troup <james@nocrew.org>:
Bug#313411; Package gawk.
Acknowledgement sent to Teemu Likonen <tlikonen@ulapland.fi>:
New Bug report received and forwarded. Copy sent to James Troup <james@nocrew.org>.
Message #5 received at submit@bugs.debian.org:
Package: gawk
Version: 1:3.1.4-2
Severity: important
gawk does not handle UTF-8 multibyte characters properly. Here's an
example:
$ cat example.txt
A Only_a_singlebyte_character_here_(UTF-8:_41)
Ö A_letter_which_takes_two_bytes_(UTF-8:_c3_96)
€ A_currency_symbol_which_takes_three_bytes_(UTF-8:_e2_82_ac)
$ cat example.txt | awk '{ printf "%-5s%s\n",$1, $2 }'
A    Only_a_singlebyte_character_here_(UTF-8:_41)
Ö   A_letter_which_takes_two_bytes_(UTF-8:_c3_96)
€  A_currency_symbol_which_takes_three_bytes_(UTF-8:_e2_82_ac)
As we can see, the format specifier %-5s does not calculate field widths
correctly when the string contains multibyte characters. Unfortunately, this
makes gawk's field widths mostly unusable in a UTF-8 locale.
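The misalignment described above can be reproduced outside awk. The following is a minimal Python sketch, not awk's actual implementation: it assumes the awk in question pads "%-5s" by counting UTF-8 bytes, and the helper names pad_left_bytes and pad_left_chars are hypothetical, introduced only for illustration.

```python
def pad_left_bytes(s: str, width: int) -> str:
    """Left-justify the way a byte-counting %-5s would:
    pad with spaces until the UTF-8 encoding is `width` bytes long."""
    nbytes = len(s.encode("utf-8"))
    return s + " " * max(0, width - nbytes)


def pad_left_chars(s: str, width: int) -> str:
    """Left-justify by counting code points, which is what the
    report expected the field width to mean."""
    return s + " " * max(0, width - len(s))


if __name__ == "__main__":
    # "A" is 1 byte, U+00D6 is 2 bytes, U+20AC is 3 bytes in UTF-8.
    for ch in ("A", "\u00d6", "\u20ac"):
        print(repr(pad_left_bytes(ch, 5)), repr(pad_left_chars(ch, 5)))
```

With byte counting, the two-byte and three-byte characters receive one and two fewer padding spaces respectively, so the second column drifts to the left on a terminal that renders each of them in a single cell.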
-- System Information:
Debian Release: 3.1
APT prefers testing
APT policy: (850, 'testing'), (800, 'unstable')
Architecture: i386 (i686)
Kernel: Linux 2.6.8-2-k7
Locale: LANG=fi_FI.UTF-8, LC_CTYPE=fi_FI.UTF-8 (charmap=UTF-8)
Versions of packages gawk depends on:
ii libc6 2.3.2.ds1-22 GNU C Library: Shared libraries an
-- no debconf information
Changed Bug submitter from Teemu Likonen <tlikonen@ulapland.fi> to Teemu Likonen <tlikonen@iki.fi>.
Request was from Teemu Likonen <tlikonen@iki.fi>
to control@bugs.debian.org.
Changed Bug submitter from Teemu Likonen <tlikonen@iki.fi> to Teemu Likonen <tlikonen@iki.fi>.
Request was from Teemu Likonen <tlikonen@iki.fi>
to control@bugs.debian.org.
Information forwarded to debian-bugs-dist@lists.debian.org, James Troup <james@nocrew.org>:
Bug#313411; Package gawk.
Acknowledgement sent to Teemu Likonen <tlikonen@iki.fi>:
Extra info received and forwarded to list. Copy sent to James Troup <james@nocrew.org>.
Message #14 received at 313411@bugs.debian.org:
clone 313411 -1
retitle -1 mawk: UTF-8 multibyte characters are not handled properly
thanks
/usr/bin/mawk seems to be the default awk interpreter in Etch. The same
UTF-8 bug is present in mawk too.
Changed Bug title.
Request was from Teemu Likonen <tlikonen@iki.fi>
to control@bugs.debian.org.
Bug reassigned from package `gawk' to `mawk'.
Request was from Teemu Likonen <tlikonen@iki.fi>
to control@bugs.debian.org.
Severity set to `normal' from `important'
Request was from Teemu Likonen <tlikonen@iki.fi>
to control@bugs.debian.org.
(Tue, 08 Jan 2008 21:30:05 GMT)
Information forwarded to debian-bugs-dist@lists.debian.org, Steve Langasek <vorlon@debian.org>:
Bug#404980; Package mawk.
Acknowledgement sent to Francesco Poli <frx@firenze.linux.it>:
Extra info received and forwarded to list. Copy sent to Steve Langasek <vorlon@debian.org>.
Message #27 received at 404980@bugs.debian.org:
Hi!
This bug is a show-stopper whenever one wants to set a field width with
the %s format specifier.
It was reported quite some time ago and I cannot see any activity.
Can this bug at least be forwarded upstream, please?
BTW, I experienced this bug while trying to center lines of text inside
an 80-column container:
$ echo 'hello world' | awk '{ w = int((80 + length())/2); printf "%" w "s\n", $0; }'
hello world
$ echo 'hèllo wörld' | awk '{ w = int((80 + length())/2); printf "%" w "s\n", $0; }'
hèllo wörld
Do you happen to know of a command-line tool that can read text lines
from stdin and write them centered to stdout?
Thanks for any help.
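
One possible answer to the centering question is a small filter that measures length in characters instead of bytes. This is only a sketch in Python, not an existing tool: the function name center_line is hypothetical, and it assumes every character occupies exactly one terminal column (no combining marks, no double-width CJK characters).

```python
def center_line(line: str, columns: int = 80) -> str:
    """Center `line` inside a container `columns` wide.

    Uses the same arithmetic as the awk one-liner above,
    w = int((columns + length) / 2), but `length` here is a count of
    code points rather than bytes, and the padding is added by
    str.rjust, which also counts code points.
    """
    text = line.rstrip("\n")
    w = (columns + len(text)) // 2
    return text.rjust(w)


if __name__ == "__main__":
    for sample in ("hello world", "h\u00e8llo w\u00f6rld"):
        print(center_line(sample))
```

To use it as a stdin-to-stdout filter, the demo loop can be replaced by a loop over sys.stdin that prints center_line(line) for each line.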
--
http://frx.netsons.org/doc/index.html#nanodocs
The nano-document series is here!
..................................................... Francesco Poli .
GnuPG key fpr == C979 F34B 27CE 5CD8 DC12 31B5 78F4 279B DD6D FCF4
Bug 404980 cloned as bug 572138.
Request was from Jonathan Nieder <jrnieder@gmail.com>
to control@bugs.debian.org.
(Mon, 01 Mar 2010 20:12:08 GMT)
Reply sent
to Jonathan Nieder <jrnieder@gmail.com>:
You have taken responsibility.
(Mon, 01 Mar 2010 20:12:15 GMT)
Notification sent
to Teemu Likonen <tlikonen@iki.fi>:
Bug acknowledged by developer.
(Mon, 01 Mar 2010 20:12:15 GMT)
Message #34 received at 404980-done@bugs.debian.org:
clone 404980 -1
retitle -1 mawk: Please add a function wrapping wcswidth()
severity -1 wishlist
tags -1 + upstream
thanks
Hi Teemu,
Teemu Likonen wrote:
> $ cat example.txt
>
> A Only_a_singlebyte_character_here_(UTF-8:_41)
> Ö A_letter_which_takes_two_bytes_(UTF-8:_c3_96)
> € A_currency_symbol_which_takes_three_bytes_(UTF-8:_e2_82_ac)
>
>
> $ cat example.txt | awk '{ printf "%-5s%s\n",$1, $2 }'
>
> A    Only_a_singlebyte_character_here_(UTF-8:_41)
> Ö   A_letter_which_takes_two_bytes_(UTF-8:_c3_96)
> €  A_currency_symbol_which_takes_three_bytes_(UTF-8:_e2_82_ac)
>
>
> As we can see, the format specifier %-5s does not calculate field widths
> correctly when the string contains multibyte characters.
This behavior is shared with C printf and, sadly, it is required.
POSIX is clear about this: the numeric argument to a %s format is a
number of bytes. See the target of the “File Format Notation” link in
http://www.opengroup.org/onlinepubs/9699919799/utilities/awk.html#tag_20_06_13_10
So closing.
On the other hand, the functionality you are asking for would be very
nice to have in some form.
> Unfortunately, this
> makes gawk's field widths mostly unusable in a UTF-8 locale.
In C, it is understandable why it was chosen to use number of bytes,
to avoid nonobvious buffer overflow bugs with sprintf(). That problem
does not apply to awk, so maybe it would be possible to convince the
Open Group people to change the behavior (or add a new function)?
See http://unix.org/2008edition/ for the latest standards,
http://austingroupbugs.net/main_page.php to contact the standards
bodies.
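
To make the wishlist item concrete, here is a rough sketch of what a wcswidth()-style helper computes, written in Python for illustration. It is not the C library's wcswidth(): the names display_width and ljust_columns are hypothetical, the width rules are approximated from Python's unicodedata tables, and non-printable characters are not rejected the way wcswidth() would reject them.

```python
import unicodedata


def display_width(s: str) -> int:
    """Approximate the number of terminal columns `s` occupies."""
    width = 0
    for ch in s:
        if unicodedata.combining(ch):
            # Combining marks are drawn over the previous character.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            # Wide and fullwidth characters take two columns.
            width += 2
        else:
            width += 1
    return width


def ljust_columns(s: str, width: int) -> str:
    """Left-justify `s` to `width` columns rather than `width` bytes."""
    return s + " " * max(0, width - display_width(s))


if __name__ == "__main__":
    for ch in ("A", "\u00d6", "\u20ac", "O\u0308", "\u65e5\u672c"):
        print(repr(ljust_columns(ch, 5)), display_width(ch))
```

An awk built-in along these lines would let scripts compute the padding themselves while leaving the POSIX-mandated byte semantics of %s untouched.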
Jonathan
Bug archived.
Request was from Debbugs Internal Request <owner@bugs.debian.org>
to internal_control@bugs.debian.org.
(Tue, 30 Mar 2010 07:40:58 GMT)
Last modified: Sun Aug 11 20:40:58 2024