Debian Bug report logs - #522776
debian-policy: mandate existence of a standardised UTF-8 locale

version graph

Package: debian-policy; Maintainer for debian-policy is Debian Policy List <debian-policy@lists.debian.org>; Source for debian-policy is src:debian-policy.

Reported by: Thorsten Glaser <tg@mirbsd.de>

Date: Mon, 6 Apr 2009 12:09:02 UTC

Severity: wishlist

Found in version debian-policy/3.8.1.0

Reply or subscribe to this bug.

Toggle useless messages

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to debian-bugs-dist@lists.debian.org, tg@mirbsd.de, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Mon, 06 Apr 2009 12:09:04 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
New Bug report received and forwarded. Copy sent to tg@mirbsd.de, Debian Policy List <debian-policy@lists.debian.org>. (Mon, 06 Apr 2009 12:09:04 GMT) Full text and rfc822 format available.

Message #5 received at submit@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@mirbsd.de>
To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Mon, 06 Apr 2009 14:06:55 +0200
Package: debian-policy
Version: 3.8.1.0
Severity: wishlist

For the mksh regression tests, I need a UTF-8 locale working; most
systems either provide “en_US.UTF-8” or “en_US.utf8” with the former
being recommended.

Build-depending on locales-all has worked for me so far, except it
won’t do in Kubuntu where said package does not exist (workaround
is to run 「locale-gen en_US.UTF-8」 in a pbuilder hook, but that’s
almost certainly not allowed in debian/rules *and* requires root),
and fails on hurd-i386 recently (locales-all fails to install).

The promise of the etch release to bring UTF-8 support was not met
because a standard installation of etch does not supply any locale
which can be used for LC_CTYPE with UTF-8 support; only installing
locales-all, or installing locales and debconfing one will do so.
I do not know about lenny, though, I have to admit.

The most light-weight solution would be to
• introduce a “C.UTF-8” locale, as some other OSes did, which is
  equivalent to the “C” (POSIX) locale in all respects *except*
  for LC_CTYPE, where it uses UTF-8 instead of a 7/8-bit charac-
  ter set or encoding
• deliver the “C.UTF-8” locale with the base system
• allow Debian packages to depend on its existence, both at
  build and run time

A more controversial solution would be to do the second and third
point of the above with the “en_US.UTF-8” locale, but that would
be favouring US americanism. (On the other hand, it’s *the* one
most widely spread UTF-8 capable locale available, and as such,
the mksh regression tests use it upstream already.)

Thanks in advance.


-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.26-1-xen-amd64 (SMP w/1 CPU core)
Locale: LANG=C, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/mksh

debian-policy depends on no packages.

debian-policy recommends no packages.

Versions of packages debian-policy suggests:
pn  doc-base                      <none>     (no description available)

-- no debconf information




Blocking bugs of 522777 added: 274699 and 522776 Request was from Thorsten Glaser <tg@mirbsd.de> to control@bugs.debian.org. (Mon, 06 Apr 2009 12:27:06 GMT) Full text and rfc822 format available.

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Mon, 06 Apr 2009 14:03:07 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Giacomo A. Catenazzi" <cate@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Mon, 06 Apr 2009 14:03:07 GMT) Full text and rfc822 format available.

Message #12 received at 522776@bugs.debian.org (full text, mbox):

From: "Giacomo A. Catenazzi" <cate@debian.org>
To: Thorsten Glaser <tg@mirbsd.de>, 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Mon, 06 Apr 2009 16:02:18 +0200
Thorsten Glaser wrote:
> For the mksh regression tests, I need a UTF-8 locale working; most
> systems either provide “en_US.UTF-8” or “en_US.utf8” with the former
> being recommended.
> 
> Build-depending on locales-all has worked for me so far, except it
> won’t do in Kubuntu where said package does not exist (workaround
> is to run 「locale-gen en_US.UTF-8」 in a pbuilder hook, but that’s
> almost certainly not allowed in debian/rules *and* requires root),
> and fails on hurd-i386 recently (locales-all fails to install).
> 
> The promise of the etch release to bring UTF-8 support was not met
> because a standard installation of etch does not supply any locale
> which can be used for LC_CTYPE with UTF-8 support; only installing
> locales-all, or installing locales and debconfing one will do so.
> I do not know about lenny, though, I have to admit.
> 
> The most light-weight solution would be to
> • introduce a “C.UTF-8” locale, as some other OSes did, which is
>   equivalent to the “C” (POSIX) locale in all respects *except*
>   for LC_CTYPE, where it uses UTF-8 instead of a 7/8-bit charac-
>   ter set or encoding
> • deliver the “C.UTF-8” locale with the base system
> • allow Debian packages to depend on its existence, both at
>   build and run time
> 
> A more controversial solution would be to do the second and third
> point of the above with the “en_US.UTF-8” locale, but that would
> be favouring US americanism. (On the other hand, it’s *the* one
> most widely spread UTF-8 capable locale available, and as such,
> the mksh regression tests use it upstream already.)

I don't understand the problem.
In POSIX the choice of locale and charset is done by user
(in the list of system supported locales/charset).
The default is the locale "C" (alias "POSIX").

If you need a specific locale (as seems from "mksh", not
sure if it is a bug in that program), you need to set it.
Why does mksh need UTF-8? What is wrong with other charsets
or with simple ASCII7?

Debian target is that all program should support (and
possibly display) UTF8 inputs and outputs. Mandate
UTF-8 as default (instead of C/POSIX) would probably
be worse (and non POSIX conformant).

About "C.UTF-8". I really think it is an error. If a user
need a locale, it should set it with the right language
(maybe "en_US.UTF-8").
"C" doesn't mean "default" or "English", but it specify a specific
output, usually for automatic processing. (Check POSIX standard,
and output requirement on "C" locale). en_US could be more user
friendly, but "C" means "old sysadmin gergo".

So, if I interpret right your problem, the right solution is:
- mksh should allow all locales and charsets
and one of:
- Debian should mandate (ev. recommend en_US.UTF-8)
  [ I think it is right on standard installation, but IMHO
  it could be to strong for a minimal essential base (chroot)]
- or a "en_US.UTF-8" package dependency should be required.

ciao
	cate




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Mon, 06 Apr 2009 14:21:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Bill Allombert <Bill.Allombert@math.u-bordeaux1.fr>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Mon, 06 Apr 2009 14:21:02 GMT) Full text and rfc822 format available.

Message #17 received at 522776@bugs.debian.org (full text, mbox):

From: Bill Allombert <Bill.Allombert@math.u-bordeaux1.fr>
To: 522776@bugs.debian.org
Cc: Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Mon, 6 Apr 2009 16:18:59 +0200
On Mon, Apr 06, 2009 at 02:06:55PM +0200, Thorsten Glaser wrote:
> Package: debian-policy
> Version: 3.8.1.0
> Severity: wishlist
> 
> For the mksh regression tests, I need a UTF-8 locale working; most
> systems either provide “en_US.UTF-8” or “en_US.utf8” with the former
> being recommended.

Hello Thorsten,
I have some sympathy with your proposal because dgettext does not work
in the "C" locale but there are too much open question.

> The most light-weight solution would be to
> • introduce a “C.UTF-8” locale, as some other OSes did, which is
>   equivalent to the “C” (POSIX) locale in all respects *except*
>   for LC_CTYPE, where it uses UTF-8 instead of a 7/8-bit charac-
>   ter set or encoding

What about LC_COLLATE (which is a major problem with sort(1)) ?

> • deliver the “C.UTF-8” locale with the base system
> • allow Debian packages to depend on its existence, both at
>   build and run time
> 
> A more controversial solution would be to do the second and third
> point of the above with the “en_US.UTF-8” locale, but that would
> be favouring US americanism. (On the other hand, it’s *the* one
> most widely spread UTF-8 capable locale available, and as such,
> the mksh regression tests use it upstream already.)

What about packages that run before /usr is mounted ? 
What about embedded systems with tight space requirement ?

Cheers,
-- 
Bill. <ballombe@debian.org>

Imagine a large red swirl here. 




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Mon, 06 Apr 2009 17:39:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Mon, 06 Apr 2009 17:39:03 GMT) Full text and rfc822 format available.

Message #22 received at 522776@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@mirbsd.de>
To: Bill Allombert <Bill.Allombert@math.u-bordeaux1.fr>, 522776@bugs.debian.org
Cc: debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Mon, 6 Apr 2009 17:27:12 +0000 (UTC)
Bill Allombert dixit:

>What about LC_COLLATE (which is a major problem with sort(1)) ?

1:1, just like the C locale does.

>What about packages that run before /usr is mounted ? 

They do not have /usr/*/locale/ anyway. This is a glibc problem.

>What about embedded systems with tight space requirement ?

They have different rules anyway… they need to see themselves
if the C.UTF-8 locale (estimated ~200K) is worth it.

//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
	-- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Mon, 06 Apr 2009 17:39:04 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Mon, 06 Apr 2009 17:39:04 GMT) Full text and rfc822 format available.

Message #27 received at 522776@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@mirbsd.de>
To: "Giacomo A. Catenazzi" <cate@debian.org>, 522776@bugs.debian.org
Cc: debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Mon, 6 Apr 2009 17:33:35 +0000 (UTC)
Giacomo A. Catenazzi dixit:

> If you need a specific locale (as seems from "mksh", not
> sure if it is a bug in that program), you need to set it.

You can only set a locale on a glibc-based system if it’s
installed beforehand, which root needs to do.

> Why does mksh need UTF-8?

The regression tests check if the Unicode mode of mksh is
properly enabled in a UTF-8 locale, and properly disabled
outside of them.

> Mandate
> UTF-8 as default (instead of C/POSIX) would probably
> be worse (and non POSIX conformant).

This is not what I proposed. I proposed that an additional
C.UTF-8 locale shall be available on all Debian systems, to
complement the default 7/8-bit C locale.

> but "C" means "old sysadmin gergo".

Yes, but some programmes basically need that plus UTF-8.
For example, the traditional sorting order, gcc output
warnings, date format, etc.

Note that mksh *is* fine with any locale, UTF-8 or not,
it just makes a distinguishing on the nl_langinfo(CODESET).
However, the *regression test suite* for mksh, run at build
time, needs one UTF-8 locale, and it needs to know which one.
On most systems, this is “en_US.UTF-8”. But Debian, despite
its release goals of UTF-8 support, does not guarantee its
existence. This is what I’d like to have changed.

> So, if I interpret right your problem, the right solution is:
> - mksh should allow all locales and charsets

This part I think you don’t interpret correctly.

> and one of:
> - Debian should mandate (ev. recommend en_US.UTF-8)
>  [ I think it is right on standard installation, but IMHO
>  it could be to strong for a minimal essential base (chroot)]
> - or a "en_US.UTF-8" package dependency should be required.

Right, one of them. Or at least, have the locales pregenerated,
maybe so that I can depend on a "locale_en_US_UTF_8" package.

bye,
//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
	-- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Mon, 06 Apr 2009 18:12:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Steve Langasek <vorlon@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Mon, 06 Apr 2009 18:12:06 GMT) Full text and rfc822 format available.

Message #32 received at 522776@bugs.debian.org (full text, mbox):

From: Steve Langasek <vorlon@debian.org>
To: Thorsten Glaser <tg@mirbsd.de>
Cc: 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Mon, 6 Apr 2009 11:09:17 -0700
On Mon, Apr 06, 2009 at 05:33:35PM +0000, Thorsten Glaser wrote:
> > If you need a specific locale (as seems from "mksh", not
> > sure if it is a bug in that program), you need to set it.

> You can only set a locale on a glibc-based system if it’s
> installed beforehand, which root needs to do.

You can build-depend on the locales package and generate the locales you
want locally, using LOCPATH to reference them.  There's no need for Debian
to guarantee the presence of a particular locale ahead of time -
particularly one that isn't actually useful to end users, as C.UTF-8 would
be.

-- 
Steve Langasek                   Give me a lever long enough and a Free OS
Debian Developer                   to set it on, and I can move the world.
Ubuntu Developer                                    http://www.debian.org/
slangasek@ubuntu.com                                     vorlon@debian.org




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Mon, 06 Apr 2009 21:57:07 GMT) Full text and rfc822 format available.

Acknowledgement sent to Roger Leigh <rleigh@codelibre.net>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Mon, 06 Apr 2009 21:57:07 GMT) Full text and rfc822 format available.

Message #37 received at 522776@bugs.debian.org (full text, mbox):

From: Roger Leigh <rleigh@codelibre.net>
To: Steve Langasek <vorlon@debian.org>, 522776@bugs.debian.org
Cc: Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Mon, 6 Apr 2009 22:52:26 +0100
On Mon, Apr 06, 2009 at 11:09:17AM -0700, Steve Langasek wrote:
> On Mon, Apr 06, 2009 at 05:33:35PM +0000, Thorsten Glaser wrote:
> > > If you need a specific locale (as seems from "mksh", not
> > > sure if it is a bug in that program), you need to set it.
> 
> > You can only set a locale on a glibc-based system if it’s
> > installed beforehand, which root needs to do.
> 
> You can build-depend on the locales package and generate the locales you
> want locally, using LOCPATH to reference them.  There's no need for Debian
> to guarantee the presence of a particular locale ahead of time -
> particularly one that isn't actually useful to end users, as C.UTF-8 would
> be.

I think that it would be very useful, I'll detail why below.

The GCC toolchain has, for some time now, been using UTF-8 as the
internal representation for narrow strings (-fexec-charset).  It has
also been using UTF-8 as the default input encoding for C source code
(-finput-charset).  This means that unless you take any special
measures, your program will be outputting UTF-8 strings for all file
and terminal I/O.  Of course, this is backward compatible with ASCII,
and is also transcoded automatically when in a non-UTF-8 locale.  I've
attached a trivial example.  Just to be clear: this handling is
completely built into GCC and libc, and is completely transparent.

Now, this will work fine in all locales *except for C/POSIX*.
Obviously the charsets of some locales can't represent all the
characters used in this example, but the C library will actually
transcode (iconv) to the locale codeset as best it can.  Except for
C/POSIX.

Now, why is this needed?

If I write a program, I might want to use non-ASCII UTF-8 characters
in the sources.  We have been doing this for years without realising
since GCC switched to UTF-8 as the default internal encoding, but
simply for portability when using the C locale we are restricted to
using ASCII only in the sources, and then a translation library such
as libintl/gettext to get translated strings with the extended
characters in them.  This is workable, but it imposes a big burden on
translators because I might want to use symbols and other characters
which are not part of a /language/ translation, but need adding by
each and every translator through explicit translator comments in the
sources.  This is tedious and error-prone.  If the sources were UTF-8
encoded, this would work perfectly since I could just use the
necessary UTF-8 characters directly in the source rather than abusing
the translation machinery to avoid non-ASCII codes.  A UTF-8 C locale
thus cuts out a big pile of cruft and complexity in sources which only
exists to cater for people who want to run your code in a C locale!
And the translators can completely ignore the now no longer needed
job of translating special characters as well doing as the actual
translation work, so the symbol usage is identical in all
translations, and their job is much easier.

I've tested all this, and it all works *perfectly*.  Except that if
you do this, your program will not run in the C locale (and *only*
the C locale) due to having completely borked output.  A C.UTF-8 would
be a solution to this problem, and allow full use of the *existing*
UTF-8 string handling which all sources are built with, yet only a
tiny fraction dare to use.  Note that gettext is *completely disabled*
if used in a C locale, and this does additional mangling in addition
to the plain libc damage, resulting in *no output at all*!  (I would
need to double check that; this was the case when I last looked,
and the reason I had to abandon use of UTF-8 string literals.)


There are other uses for a UTF-8 C locale as well.  I've needed at
several times a UTF-8 locale at build time for various tasks,
mainly related to translation work.  While you mentioned it's
possible to do this by generation of locales at build time, in
practice I've found this rather error prone and unreliable.  Having
the C locale (which is the locale all our buildds use by default)
UTF-8 by default would make these jobs much easier.  Some of the
projects I work on such as gutenprint have needed to reimplement some
of the gettext internals to work around this in a portable manner.


Regarding the standards conformance of using a UTF-8 C locale:
I've spent some time reading the standards (SUSv3), and see no reason
why C can't use UTF-8 as its default codeset and still remain strictly
conforming.

The standards specifies a minimum requirement of a portable character
set and control character set.  This is satisfied by the 7-bit ASCII
encoding which we currently use as the C0 and G1 control and graphics
sets.  However, UTF-8 is a strict 8-bit superset of this standard, and
it is eminently reasonable to use UTF-8 *and still remain conforming*
with the minimum functionality required by the standard.  It's
explicity spelled out in SUSv2, though the wording was dropped in
SUSv3 (definitely not forbidden, though).

POSIX/C locale:
http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap07.html#tag_07_02

Portable charset:
http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap06.html#tag_06

"Implementations may also add other characters."
This is from the charset documentation in SUSv2
http://opengroup.org/onlinepubs/007908775/xbd/charset.html


UTF-8 is the default character set on Debian GNU/Linux.  It's what
we all use, it's what all the tools use, and the C locale is the
last ASCII holdout.  It would make the lives of many maintainers
and users more bearable if it was also UTF-8, as well as getting
rid of the current buggy behaviour if you use UTF-8-encoded sources.
It's currently *the only blocker* preventing us using UTF-8 encoded
sources.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Mon, 06 Apr 2009 21:57:08 GMT) Full text and rfc822 format available.

Acknowledgement sent to Roger Leigh <rleigh@codelibre.net>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Mon, 06 Apr 2009 21:57:08 GMT) Full text and rfc822 format available.

Message #42 received at 522776@bugs.debian.org (full text, mbox):

From: Roger Leigh <rleigh@codelibre.net>
To: Steve Langasek <vorlon@debian.org>, 522776@bugs.debian.org
Cc: Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Mon, 6 Apr 2009 22:53:43 +0100
[Message part 1 (text/plain, inline)]
On Mon, Apr 06, 2009 at 11:09:17AM -0700, Steve Langasek wrote:
> On Mon, Apr 06, 2009 at 05:33:35PM +0000, Thorsten Glaser wrote:
> > > If you need a specific locale (as seems from "mksh", not
> > > sure if it is a bug in that program), you need to set it.
> 
> > You can only set a locale on a glibc-based system if it’s
> > installed beforehand, which root needs to do.
> 
> You can build-depend on the locales package and generate the locales you
> want locally, using LOCPATH to reference them.  There's no need for Debian
> to guarantee the presence of a particular locale ahead of time -
> particularly one that isn't actually useful to end users, as C.UTF-8 would
> be.

Example attached of direct UTF-8 encoding in sources.  Just run
in a few locales such as UTF-8, ISO-8859-1 and C and check the
differences in output.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
[unicode.c (text/x-csrc, attachment)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Mon, 06 Apr 2009 21:57:09 GMT) Full text and rfc822 format available.

Acknowledgement sent to Roger Leigh <rleigh@codelibre.net>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Mon, 06 Apr 2009 21:57:09 GMT) Full text and rfc822 format available.

Message #47 received at 522776@bugs.debian.org (full text, mbox):

From: Roger Leigh <rleigh@codelibre.net>
To: Bill Allombert <Bill.Allombert@math.u-bordeaux1.fr>, 522776@bugs.debian.org
Cc: Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Mon, 6 Apr 2009 22:56:25 +0100
On Mon, Apr 06, 2009 at 04:18:59PM +0200, Bill Allombert wrote:
> On Mon, Apr 06, 2009 at 02:06:55PM +0200, Thorsten Glaser wrote:
> > Package: debian-policy
> > Version: 3.8.1.0
> > Severity: wishlist
> > 
> > For the mksh regression tests, I need a UTF-8 locale working; most
> > systems either provide “en_US.UTF-8” or “en_US.utf8” with the former
> > being recommended.
> 
> Hello Thorsten,
> I have some sympathy with your proposal because dgettext does not work
> in the "C" locale but there are too much open question.

Is there any hope of fixing this?  I consider this hardcoded
gettext behaviour in a C locale a severe misfeature, which has caused
me (as a programmer) no end of problems.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Tue, 07 Apr 2009 08:42:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Giacomo A. Catenazzi" <cate@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Tue, 07 Apr 2009 08:42:03 GMT) Full text and rfc822 format available.

Message #52 received at 522776@bugs.debian.org (full text, mbox):

From: "Giacomo A. Catenazzi" <cate@debian.org>
To: Roger Leigh <rleigh@codelibre.net>, 522776@bugs.debian.org
Cc: Steve Langasek <vorlon@debian.org>, Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Tue, 07 Apr 2009 10:36:20 +0200
Roger Leigh wrote:
> On Mon, Apr 06, 2009 at 11:09:17AM -0700, Steve Langasek wrote:
>> On Mon, Apr 06, 2009 at 05:33:35PM +0000, Thorsten Glaser wrote:
>>>> If you need a specific locale (as seems from "mksh", not
>>>> sure if it is a bug in that program), you need to set it.
>>> You can only set a locale on a glibc-based system if it’s
>>> installed beforehand, which root needs to do.
>> You can build-depend on the locales package and generate the locales you
>> want locally, using LOCPATH to reference them.  There's no need for Debian
>> to guarantee the presence of a particular locale ahead of time -
>> particularly one that isn't actually useful to end users, as C.UTF-8 would
>> be.
> 
> I think that it would be very useful, I'll detail why below.
> 
> The GCC toolchain has, for some time now, been using UTF-8 as the
> internal representation for narrow strings (-fexec-charset).  It has
> also been using UTF-8 as the default input encoding for C source code
> (-finput-charset).  This means that unless you take any special
> measures, your program will be outputting UTF-8 strings for all file
> and terminal I/O.  Of course, this is backward compatible with ASCII,
> and is also transcoded automatically when in a non-UTF-8 locale.  I've
> attached a trivial example.  Just to be clear: this handling is
> completely built into GCC and libc, and is completely transparent.

Hmm. Warning, you confuse some terms.
- input charset is the source charset (used to parse C code)
- exec charset is the charset of the target machine (which run the program).
- C99 must support unicode identifier (written with \uxxxx or in other
  non portable implementation defined way)
- standard libraries can use locales (but only if you initialized the locale),
  but not all the functions, not all uses.
- wide charaters are yet an other things (as you note in your example,
  the wide string is not in UTF-8, but I think UTF-32)

Same input and exec charset really means: don't translate strings
(e.g. in
   if(c = 'a') printf("bcde\n");
 'a' and "bcde\n" will have the same values as in the input file, else
 it will put in binary the representation of exec charset)

I expect that your program will run fine (i.e. really no changes: the
same binary output), if you use tell GCC that you use any other ASCII-7
derived 8-bit encoding (both for input and exec charset).

printf/wprintf uses locale only for numeric representation.

Usually the interpretation of bytes is done by terminal, not by compiler.


> Now, this will work fine in all locales *except for C/POSIX*.
> Obviously the charsets of some locales can't represent all the
> characters used in this example, but the C library will actually
> transcode (iconv) to the locale codeset as best it can.  Except for
> C/POSIX.
> 
> Now, why is this needed?
> 
> If I write a program, I might want to use non-ASCII UTF-8 characters
> in the sources.  We have been doing this for years without realising
> since GCC switched to UTF-8 as the default internal encoding, but
> simply for portability when using the C locale we are restricted to
> using ASCII only in the sources,

Really minimal C charset is smaller than ASCII (a portable program
must not have "$" and no "@", plus C supports also smaller charset,
with trigraps [preprocessor] and/or new bigraphs [compiler])

> and then a translation library such
> as libintl/gettext to get translated strings with the extended
> characters in them.  This is workable, but it imposes a big burden on
> translators because I might want to use symbols and other characters
> which are not part of a /language/ translation, but need adding by
> each and every translator through explicit translator comments in the
> sources.  This is tedious and error-prone.  If the sources were UTF-8
> encoded, this would work perfectly since I could just use the
> necessary UTF-8 characters directly in the source rather than abusing
> the translation machinery to avoid non-ASCII codes.  A UTF-8 C locale
> thus cuts out a big pile of cruft and complexity in sources which only
> exists to cater for people who want to run your code in a C locale!
> And the translators can completely ignore the now no longer needed
> job of translating special characters as well doing as the actual
> translation work, so the symbol usage is identical in all
> translations, and their job is much easier.

yes, in a perfect world we need only one charset (and maybe only
one language and one locale). From all the proposals to reach this
target, unicode and UTF-8 seems the best solution.
But... for now take care about locales and don't assume UTF-8,
or you will cause trouble with a lot of non-UTF-8 users.
Converting locale (from non-UTF-8 to UTF-8) is simple for
English and few European languages, but it is a tedious work
for many user: it need a "flag day", in which I should convert
all my files to UTF-8 or annotate every file with the right
encoding (most of editors and tools understands such annotations).

So for now we support UTF-8, we try to set UTF-8 default to
new users, and UTF-8 is the encoding for debian files in packages.
But it will take a lot of years (or maybe never) before
we can assume UTF-8 if user don't loudly tell the system to
use other encodings.


 > I've tested all this, and it all works *perfectly*.  Except that if
> you do this, your program will not run in the C locale (and *only*
> the C locale) due to having completely borked output.

It is the terminal, not the C program.

>  A C.UTF-8 would
> be a solution to this problem, and allow full use of the *existing*
> UTF-8 string handling which all sources are built with, yet only a
> tiny fraction dare to use.  Note that gettext is *completely disabled*
> if used in a C locale, and this does additional mangling in addition
> to the plain libc damage, resulting in *no output at all*!  (I would
> need to double check that; this was the case when I last looked,
> and the reason I had to abandon use of UTF-8 string literals.)

Use "en_US.UTF-8".
"C.UTF-8" is a bad name. Locale "C" means "no locale, old behaviour,
for machine". Do we need to translate all strings also on C.UTF-8?
Which alphabetic characters?  Which numeric characters?  Which
alphabetic order? etc. etc.  You see: it is difficult to create
a new locale, and people must understand the meaning of such locale
(without reading all the locale definition). "en_US.UTF-8" is
clear.


> There are other uses for a UTF-8 C locale as well.  I've needed at
> several times a UTF-8 locale at build time for various tasks,
> mainly related to translation work.  While you mentioned it's
> possible to do this by generation of locales at build time, in
> practice I've found this rather error prone and unreliable.  Having
> the C locale (which is the locale all our buildds use by default)
> UTF-8 by default would make these jobs much easier.  Some of the
> projects I work on such as gutenprint have needed to reimplement some
> of the gettext internals to work around this in a portable manner.
> 
> 
> Regarding the standards conformance of using a UTF-8 C locale:
> I've spent some time reading the standards (SUSv3), and see no reason
> why C can't use UTF-8 as its default codeset and still remain strictly
> conforming.

UTF-8 as a lot of characters (alphabetic, numeric, white).
C locale requires that whitespace are only SPACE and TAB.
I did look for all requirement, but I found that some requirement
are incompatible from what one should expect.

So a C local in UTF-8 would cause more trouble to users (no warning,
but the whitespace are missinterpreted (note: some windows editors
are know to insert a lot of non standard whitespace, instead of spaces).


> The standards specifies a minimum requirement of a portable character
> set and control character set.  This is satisfied by the 7-bit ASCII
> encoding which we currently use as the C0 and G1 control and graphics
> sets.  However, UTF-8 is a strict 8-bit superset of this standard, and
> it is eminently reasonable to use UTF-8 *and still remain conforming*
> with the minimum functionality required by the standard.  It's
> explicity spelled out in SUSv2, though the wording was dropped in
> SUSv3 (definitely not forbidden, though).
> 
> POSIX/C locale:
> http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap07.html#tag_07_02
> 
> Portable charset:
> http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap06.html#tag_06
> 
> "Implementations may also add other characters."
> This is from the charset documentation in SUSv2
> http://opengroup.org/onlinepubs/007908775/xbd/charset.html
> 
> 
> UTF-8 is the default character set on Debian GNU/Linux.  It's what
> we all use, it's what all the tools use, and the C locale is the
> last ASCII holdout.  It would make the lives of many maintainers
> and users more bearable if it was also UTF-8, as well as getting
> rid of the current buggy behaviour if you use UTF-8-encoded sources.
> It's currently *the only blocker* preventing us using UTF-8 encoded
> sources.

I think ASCII 7 would simplify the finding bugs.
An c>127 in a C locale is simply wrong, it will miss interpreted
by different terminal (local and remote, etc.).
Not always, there are terminal libraries and standard libraries that
do the right things, but with your proposal, I think in few months
programs will simply write UTF-8 to terminal, ignoring charset
choose by user.

Before was: all must use English because I understand English
now we want: all must use UTF-8 because I use UTF-8?

If English is the most spoken language (and easier to type), or
that UTF-8 is technically very good, doesn't mean that we
should oblige users to use English or UTF-8.

ciao
	cate




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Tue, 07 Apr 2009 18:33:16 GMT) Full text and rfc822 format available.

Acknowledgement sent to Bill Allombert <Bill.Allombert@math.u-bordeaux1.fr>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Tue, 07 Apr 2009 18:33:16 GMT) Full text and rfc822 format available.

Message #57 received at 522776@bugs.debian.org (full text, mbox):

From: Bill Allombert <Bill.Allombert@math.u-bordeaux1.fr>
To: Roger Leigh <rleigh@codelibre.net>
Cc: 522776@bugs.debian.org, Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Tue, 7 Apr 2009 20:32:25 +0200
On Mon, Apr 06, 2009 at 10:56:25PM +0100, Roger Leigh wrote:
> On Mon, Apr 06, 2009 at 04:18:59PM +0200, Bill Allombert wrote:
> > On Mon, Apr 06, 2009 at 02:06:55PM +0200, Thorsten Glaser wrote:
> > > Package: debian-policy
> > > Version: 3.8.1.0
> > > Severity: wishlist
> > > 
> > > For the mksh regression tests, I need a UTF-8 locale working; most
> > > systems either provide “en_US.UTF-8” or “en_US.utf8” with the former
> > > being recommended.
> > 
> > Hello Thorsten,
> > I have some sympathy with your proposal because dgettext does not work
> > in the "C" locale but there are too much open question.
> 
> Is there any hope of fixing this?  I consider this hardcoded
> gettext behaviour in a C locale a severe misfeature, which has caused
> me (as a programmer) no end of problems.

None: I discussed extensively this issue with Bruno Haible, and while he
was sympathetic to my cause, he says there were no chance that upstream
glibc would accept such a change.

On the other hand, technically it is a one-line patch to remove that
restriction. I even considered to ship menu with a patched gettext to
avoid that issue. Fortunately, since Sarge, debian-installer set LANG in
/etc/environment so programs almost never run under C locale anymore.

Cheers,
-- 
Bill. <ballombe@debian.org>

Imagine a large red swirl here. 




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Tue, 07 Apr 2009 19:03:09 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Tue, 07 Apr 2009 19:03:09 GMT) Full text and rfc822 format available.

Message #62 received at 522776@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@mirbsd.de>
To: Bill Allombert <Bill.Allombert@math.u-bordeaux1.fr>, 522776@bugs.debian.org
Cc: Roger Leigh <rleigh@codelibre.net>, debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Tue, 7 Apr 2009 18:54:59 +0000 (UTC)
Bill Allombert dixit:

>Fortunately, since Sarge, debian-installer set LANG in
>/etc/environment so programs almost never run under C locale anymore.

Except the ton which sets LC_ALL=C to get sane (parsable,
dependable, historically compatible) output.

These would then unset all other LC_* and LANG and LANGUAGE,
and only set LC_CTYPE to C.UTF-8 to get "old" behaviour but
with UTF-8 (and mbrtowc and iswctype and and and) available.


For what it's worth: vorlon gave me the means to change the
mksh regression test (LOCPATH), so that this will no longer
block it on the HURD. However, I'm still in favour of a de-
fault UTF-8 locale (be it C.UTF-8 or en_US.UTF-8) installed
plus, maybe, one binary package per locale? Aurelien - if I
remember correctly - said something along these lines too.

bye,
//mirabilos
-- 
[...] if maybe ext3fs wasn't a better pick, or jfs, or maybe reiserfs, oh but
what about xfs, and if only i had waited until reiser4 was ready... in the be-
ginning, there was ffs, and in the middle, there was ffs, and at the end, there
was still ffs, and the sys admins knew it was good. :)  -- Ted Unangst über *fs




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Tue, 07 Apr 2009 19:27:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Adeodato Simó <dato@net.com.org.es>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Tue, 07 Apr 2009 19:27:03 GMT) Full text and rfc822 format available.

Message #67 received at 522776@bugs.debian.org (full text, mbox):

From: Adeodato Simó <dato@net.com.org.es>
To: Thorsten Glaser <tg@mirbsd.de>
Cc: 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Tue, 7 Apr 2009 21:24:38 +0200
+ Thorsten Glaser (Tue, 07 Apr 2009 18:54:59 +0000):

> Except the ton which sets LC_ALL=C to get sane (parsable,
> dependable, historically compatible) output.

> These would then unset all other LC_* and LANG and LANGUAGE,
> and only set LC_CTYPE to C.UTF-8 to get "old" behaviour but
> with UTF-8 (and mbrtowc and iswctype and and and) available.

Isn’t setting LC_ALL=C.UTF-8 going to be about the same and less work?
I’m genuinely interested if that would behave any different to what you
said (unsetting all, setting LC_CTYPE).

Cheers,

-- 
- Are you sure we're good?
- Always.
        -- Rory and Lorelai





Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Tue, 07 Apr 2009 19:51:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Tue, 07 Apr 2009 19:51:02 GMT) Full text and rfc822 format available.

Message #72 received at 522776@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@mirbsd.de>
To: Adeodato Simó <dato@net.com.org.es>, 522776@bugs.debian.org
Cc: debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Tue, 7 Apr 2009 19:40:30 +0000 (UTC)
Adeodato Simó dixit:

>+ Thorsten Glaser (Tue, 07 Apr 2009 18:54:59 +0000):
>
>> Except the ton which sets LC_ALL=C to get sane (parsable,
>> dependable, historically compatible) output.
>
>> These would then unset all other LC_* and LANG and LANGUAGE,
>> and only set LC_CTYPE to C.UTF-8 to get "old" behaviour but
>> with UTF-8 (and mbrtowc and iswctype and and and) available.
>
>Isn’t setting LC_ALL=C.UTF-8 going to be about the same and less work?

Indeed.

>I’m genuinely interested if that would behave any different to what you
>said (unsetting all, setting LC_CTYPE).

For my proposed C.UTF-8 "locale" it would be exactly zero, nada,
difference. (For en_US.UTF-8 it is a lot of difference, for example
sorting order.)

Unfortunately, GNU libc needs a locale to even enable UTF-8 support.

bye,
//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
	-- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Tue, 07 Apr 2009 19:57:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Roger Leigh <rleigh@codelibre.net>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Tue, 07 Apr 2009 19:57:03 GMT) Full text and rfc822 format available.

Message #77 received at 522776@bugs.debian.org (full text, mbox):

From: Roger Leigh <rleigh@codelibre.net>
To: Thorsten Glaser <tg@mirbsd.de>
Cc: Bill Allombert <Bill.Allombert@math.u-bordeaux1.fr>, 522776@bugs.debian.org, debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Tue, 7 Apr 2009 20:53:27 +0100
[Message part 1 (text/plain, inline)]
On Tue, Apr 07, 2009 at 06:54:59PM +0000, Thorsten Glaser wrote:
> Bill Allombert dixit:
> 
> >Fortunately, since Sarge, debian-installer set LANG in
> >/etc/environment so programs almost never run under C locale anymore.
> 
> Except the ton which sets LC_ALL=C to get sane (parsable,
> dependable, historically compatible) output.

The gettext bug itself won't cause any change in typical behaviour
with gettext().

As an optimisation, it's OK to skip translating if running in a C
locale.  However, if we use dgettext/dcgettext etc., we are
explicitly asking for a given text domain and want translation
even in a C locale.

As Bill said, the change is trivial (I've also looked at libintl
and libc to look at fixing it).  One use case I need this for is
the generation of PPD files in gutenprint; we generate single files
containing multiple languages and so use dgettext, but this totally
breaks in the C locale due to the C locale special casing in gettext.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Tue, 07 Apr 2009 20:27:04 GMT) Full text and rfc822 format available.

Acknowledgement sent to Roger Leigh <rleigh@codelibre.net>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Tue, 07 Apr 2009 20:27:04 GMT) Full text and rfc822 format available.

Message #82 received at 522776@bugs.debian.org (full text, mbox):

From: Roger Leigh <rleigh@codelibre.net>
To: Adeodato Simó <dato@net.com.org.es>, 522776@bugs.debian.org
Cc: Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Tue, 7 Apr 2009 21:25:42 +0100
[Message part 1 (text/plain, inline)]
On Tue, Apr 07, 2009 at 09:24:38PM +0200, Adeodato Simó wrote:
> + Thorsten Glaser (Tue, 07 Apr 2009 18:54:59 +0000):
> 
> > Except the ton which sets LC_ALL=C to get sane (parsable,
> > dependable, historically compatible) output.
> 
> > These would then unset all other LC_* and LANG and LANGUAGE,
> > and only set LC_CTYPE to C.UTF-8 to get "old" behaviour but
> > with UTF-8 (and mbrtowc and iswctype and and and) available.
> 
> Isn’t setting LC_ALL=C.UTF-8 going to be about the same and less work?
> I’m genuinely interested if that would behave any different to what you
> said (unsetting all, setting LC_CTYPE).

% sudo localedef -c -i POSIX -f UTF-8 C.UTF-8

% LANG=C.UTF8 locale charmap
UTF-8

% LANG=C locale charmap
ANSI_X3.4-1968

This appears to work correctly at first glance.

However, I would ideally like the C/POSIX locales to be UTF-8
by default as on other systems (with a C.ASCII variant if required).


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Tue, 07 Apr 2009 20:33:08 GMT) Full text and rfc822 format available.

Acknowledgement sent to Adeodato Simó <dato@net.com.org.es>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Tue, 07 Apr 2009 20:33:08 GMT) Full text and rfc822 format available.

Message #87 received at 522776@bugs.debian.org (full text, mbox):

From: Adeodato Simó <dato@net.com.org.es>
To: 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Tue, 7 Apr 2009 22:32:46 +0200
+ Steve Langasek (Mon, 06 Apr 2009 11:09:17 -0700):

> On Mon, Apr 06, 2009 at 05:33:35PM +0000, Thorsten Glaser wrote:
> > > If you need a specific locale (as seems from "mksh", not
> > > sure if it is a bug in that program), you need to set it.

> > You can only set a locale on a glibc-based system if it’s
> > installed beforehand, which root needs to do.

> You can build-depend on the locales package and generate the locales you
> want locally, using LOCPATH to reference them.  There's no need for Debian
> to guarantee the presence of a particular locale ahead of time -

It is my impression that more packages than mksh could use an UTF-8
locale at build time (I’m afraid I don’t have pointers, but I’m sure
I’ve come across at least a couple).

Wouldn’t it be just better to change Debian’s default to make an UTF-8
locale available by default, rather than to force all those packages to
play tricks with LOCPATH?

I would go as far as suggesting that some package like libc6 itself
ships the locale, both as a way of ensuring it’ll always be there, and
of not forcing the locales package on every system (not sure if this was
part of your concerns).

Unfortunately, and from my limited knowledge and recent poking of this,
it seems the supported locales for a running system are kept in a single
file (/usr/lib/locale/locale-archive), so I’m unsure how the above could
work out, if at all.

> particularly one that isn't actually useful to end users, as C.UTF-8 would
> be.

Is that point really important? It is useful for building some packages,
plus I’m sure we have pedant enough users that would prefer C.UTF-8 over
en_US.UTF-8. :-P

Finally, this stuff that Roger proposes about making “C” be UTF-8, and
create some C.ASCII for people needing that, sounds shocking at the same
time as appealing.

Cheers,

-- 
- Are you sure we're good?
- Always.
        -- Rory and Lorelai





Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Tue, 07 Apr 2009 20:51:05 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Tue, 07 Apr 2009 20:51:05 GMT) Full text and rfc822 format available.

Message #92 received at 522776@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@mirbsd.de>
To: Roger Leigh <rleigh@codelibre.net>
Cc: Adeodato Simó <dato@net.com.org.es>, 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Tue, 7 Apr 2009 20:41:56 +0000 (UTC)
Roger Leigh dixit:

>However, I would ideally like the C/POSIX locales to be UTF-8
>by default as on other systems (with a C.ASCII variant if required).

No, this has the potential to break, for example, tr(1).
I lived through that on MirBSD.

//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
	-- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Tue, 07 Apr 2009 21:15:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Tue, 07 Apr 2009 21:15:03 GMT) Full text and rfc822 format available.

Message #97 received at 522776@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@mirbsd.de>
To: Adeodato Simó <dato@net.com.org.es>, 522776@bugs.debian.org
Cc: debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Tue, 7 Apr 2009 21:00:50 +0000 (UTC)
Adeodato Simó dixit:

>I would go as far as suggesting that some package like libc6 itself

FWIW:

-rw-r--r-- 1 tg tg 238336 Apr  7 22:59 en_US.UTF-8/LC_CTYPE

It's not *that* much...

>Finally, this stuff that Roger proposes about making “C” be UTF-8, and
>create some C.ASCII for people needing that, sounds shocking at the same
>time as appealing.

It won't work, because in a UTF-8 locale, for example stdio
functions must reject "invalid" (not valid UTF-8) input, so
it would not be 8-bit clean/transparent any more.

//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
	-- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Tue, 07 Apr 2009 21:27:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Andrew McMillan <andrew@morphoss.com>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Tue, 07 Apr 2009 21:27:11 GMT) Full text and rfc822 format available.

Message #102 received at 522776@bugs.debian.org (full text, mbox):

From: Andrew McMillan <andrew@morphoss.com>
To: Adeodato Simó <dato@net.com.org.es>, 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Wed, 08 Apr 2009 09:17:40 +1200
On Tue, 2009-04-07 at 22:32 +0200, Adeodato Simó wrote:
> 
> It is my impression that more packages than mksh could use an UTF-8
> locale at build time (I’m afraid I don’t have pointers, but I’m sure
> I’ve come across at least a couple).
> 
> Wouldn’t it be just better to change Debian’s default to make an UTF-8
> locale available by default, rather than to force all those packages to
> play tricks with LOCPATH?

I too would really like to see a UTF-8 locale available by default, and
would prefer to see this be the C.UTF-8 locale, which doesn't screw with
the collation / character type settings like any other UTF-8 locale
would.

It seems to me that the consensus here is that having a UTF-8 locale
available is a good idea and I don't hear any very strong argument
against such a change.

Consequently I think we should move on from the discussion and start
working out a patch to resolve this in policy.

Regards,
					Andrew.

------------------------------------------------------------------------
andrew (AT) morphoss (DOT) com                            +64(272)DEBIAN
           Time to be aggressive.  Go after a tattooed Virgo.
------------------------------------------------------------------------






Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Tue, 07 Apr 2009 21:50:04 GMT) Full text and rfc822 format available.

Acknowledgement sent to Roger Leigh <rleigh@codelibre.net>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Tue, 07 Apr 2009 21:50:46 GMT) Full text and rfc822 format available.

Message #107 received at 522776@bugs.debian.org (full text, mbox):

From: Roger Leigh <rleigh@codelibre.net>
To: "Giacomo A. Catenazzi" <cate@debian.org>, 522776@bugs.debian.org
Cc: Steve Langasek <vorlon@debian.org>, Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Tue, 7 Apr 2009 22:33:24 +0100
[Message part 1 (text/plain, inline)]
On Tue, Apr 07, 2009 at 10:36:20AM +0200, Giacomo A. Catenazzi wrote:

I can't help but feel that your reply completely missed the
purpose of what I want to do, and why.  I hope the following
response clears things up.

> Roger Leigh wrote:
>> On Mon, Apr 06, 2009 at 11:09:17AM -0700, Steve Langasek wrote:
>>> On Mon, Apr 06, 2009 at 05:33:35PM +0000, Thorsten Glaser wrote:
>>>>> If you need a specific locale (as seems from "mksh", not
>>>>> sure if it is a bug in that program), you need to set it.
>>>> You can only set a locale on a glibc-based system if it’s
>>>> installed beforehand, which root needs to do.
>>> You can build-depend on the locales package and generate the locales you
>>> want locally, using LOCPATH to reference them.  There's no need for Debian
>>> to guarantee the presence of a particular locale ahead of time -
>>> particularly one that isn't actually useful to end users, as C.UTF-8 would
>>> be.
>>
>> I think that it would be very useful, I'll detail why below.
>>
>> The GCC toolchain has, for some time now, been using UTF-8 as the
>> internal representation for narrow strings (-fexec-charset).  It has
>> also been using UTF-8 as the default input encoding for C source code
>> (-finput-charset).  This means that unless you take any special
>> measures, your program will be outputting UTF-8 strings for all file
>> and terminal I/O.  Of course, this is backward compatible with ASCII,
>> and is also transcoded automatically when in a non-UTF-8 locale.  I've
>> attached a trivial example.  Just to be clear: this handling is
>> completely built into GCC and libc, and is completely transparent.
>
> Hmm. Warning, you confuse some terms.

I'm not really sure how relevant these minor points are to the general
point that I was trying to make.

> - input charset is the source charset (used to parse C code)
> - exec charset is the charset of the target machine (which run the program).

That's pretty much what I said.

> - C99 must support unicode identifier (written with \uxxxx or in other
>   non portable implementation defined way)

OK.  But that's really nothing to do with the fact that you can use
UTF-8 sources directly.  It's akin to having to support trigraphs,
but we don't use trigraphs because they are bloody annoying and nowadays
competelely unnecessary.  But mainly, it doesn't affect the exec charset
whether you use UTF-8 encoded sources or \uxxxx.

> - standard libraries can use locales (but only if you initialized the locale),
>   but not all the functions, not all uses.
> - wide charaters are yet an other things (as you note in your example,
>   the wide string is not in UTF-8, but I think UTF-32)
>
> Same input and exec charset really means: don't translate strings
> (e.g. in
>    if(c = 'a') printf("bcde\n");
>  'a' and "bcde\n" will have the same values as in the input file, else
>  it will put in binary the representation of exec charset)

Of course.  However, the test program I posted showed what that if the
locale has been appropriately initialised, there is an additional
translation between the exec charset and the output charset specified
by the locale (see the Latin characters correctly preserved and output
as ISO-8859-1 in an ISO-8859-1 locale).

> I expect that your program will run fine (i.e. really no changes: the
> same binary output), if you use tell GCC that you use any other ASCII-7
> derived 8-bit encoding (both for input and exec charset).

Of course.

> Usually the interpretation of bytes is done by terminal, not by compiler.

It's done at several points:
compiler: source->exec
runtime: locale-dependent exec->output (and optional use of gettext)
terminal: output->display

>> Now, this will work fine in all locales *except for C/POSIX*.
>> Obviously the charsets of some locales can't represent all the
>> characters used in this example, but the C library will actually
>> transcode (iconv) to the locale codeset as best it can.  Except for
>> C/POSIX.
>>
>> Now, why is this needed?
>>
>> If I write a program, I might want to use non-ASCII UTF-8 characters
>> in the sources.  We have been doing this for years without realising
>> since GCC switched to UTF-8 as the default internal encoding, but
>> simply for portability when using the C locale we are restricted to
>> using ASCII only in the sources,
>
> Really minimal C charset is smaller than ASCII (a portable program
> must not have "$" and no "@", plus C supports also smaller charset,
> with trigraps [preprocessor] and/or new bigraphs [compiler])

I'm not sure how relevant this is.  This is specified as the minimum
requirement by the *C standard*.  But, it's the *minimum* requirement.
GCC supports full use of UTF-8 (or whatever) encoded sources, and I
want to make better use of it, while still remaining in compliance
with the standard (which it is--I've read the ISO C standard relating
to source and execution character sets, and you're allowed to do better
than 7 bit ASCII!).

>> and then a translation library such
>> as libintl/gettext to get translated strings with the extended
>> characters in them.  This is workable, but it imposes a big burden on
>> translators because I might want to use symbols and other characters
>> which are not part of a /language/ translation, but need adding by
>> each and every translator through explicit translator comments in the
>> sources.  This is tedious and error-prone.  If the sources were UTF-8
>> encoded, this would work perfectly since I could just use the
>> necessary UTF-8 characters directly in the source rather than abusing
>> the translation machinery to avoid non-ASCII codes.  A UTF-8 C locale
>> thus cuts out a big pile of cruft and complexity in sources which only
>> exists to cater for people who want to run your code in a C locale!
>> And the translators can completely ignore the now no longer needed
>> job of translating special characters as well doing as the actual
>> translation work, so the symbol usage is identical in all
>> translations, and their job is much easier.
>
> yes, in a perfect world we need only one charset (and maybe only
> one language and one locale). From all the proposals to reach this
> target, unicode and UTF-8 seems the best solution.
> But... for now take care about locales and don't assume UTF-8,
> or you will cause trouble with a lot of non-UTF-8 users.
> Converting locale (from non-UTF-8 to UTF-8) is simple for
> English and few European languages, but it is a tedious work
> for many user: it need a "flag day", in which I should convert
> all my files to UTF-8 or annotate every file with the right
> encoding (most of editors and tools understands such annotations).

I have never *ever* suggested that we only use one charset.  I'm only
suggesting that the *C locale* must be UTF-8 in order to allow for
full UTF-8 support.  Normal user locales can use whatever charset
they like.

Non-UTF-8 users won't be disadvantaged because the UTF-8 exec charset
will be recoded to their locale-specific output codeset, either by
libc or gettext.

The C locale is special in that normal users won't use it, but
system programs and code needing locale independence do use it.
Any program wanting to work correctly in a C locale must only use
ASCII or it *breaks*.  This means we are /de facto/ restricted to
ASCII unless we take special effort to work around the fact (and
this was the point of my l10n/i18n comments above).

Most programs do need to work correctly in a C locale, and so can't
use UTF-8 either as a source or exec charset.  This is a severe
limitation.

> So for now we support UTF-8, we try to set UTF-8 default to
> new users, and UTF-8 is the encoding for debian files in packages.
> But it will take a lot of years (or maybe never) before
> we can assume UTF-8 if user don't loudly tell the system to
> use other encodings.

We're at that point now, but this really is not relevant to the
purpose of this discussion.

>  > I've tested all this, and it all works *perfectly*.  Except that if
>> you do this, your program will not run in the C locale (and *only*
>> the C locale) due to having completely borked output.
>
> It is the terminal, not the C program.

No, it is the program.  I have tested this with different terminal
input encodings and by examining the program output byte-by-byte
(as my test program shows).

>>  A C.UTF-8 would
>> be a solution to this problem, and allow full use of the *existing*
>> UTF-8 string handling which all sources are built with, yet only a
>> tiny fraction dare to use.  Note that gettext is *completely disabled*
>> if used in a C locale, and this does additional mangling in addition
>> to the plain libc damage, resulting in *no output at all*!  (I would
>> need to double check that; this was the case when I last looked,
>> and the reason I had to abandon use of UTF-8 string literals.)
>
> Use "en_US.UTF-8".

Why?  Did you actually understand the rationale I provided above.
I could use en_US.UTF-8, or any locale.  But the point is that the
code works in all locales *except* C.

> "C.UTF-8" is a bad name. Locale "C" means "no locale, old behaviour,
> for machine".

No.  "C" and "POSIX" mean the /default/ POSIX-specified locale.  And
there's nothing written in the standard that restricts that locale
to 7-bit ASCII as its codeset.  There are UNIX systems out there
right now using UTF-8 in their C locale.

> Do we need to translate all strings also on C.UTF-8?

Of course not.  We don't do any translation in the C locale.  The
only difference is the character encoding, which is backward
compatible with ASCII in any case.

> Which alphabetic characters?  Which numeric characters?  Which
> alphabetic order? etc. etc.  You see: it is difficult to create
> a new locale, and people must understand the meaning of such locale
> (without reading all the locale definition).

For a minimal locale it could just use strict numerical ordering.
It should probably copy what existing systems using UTF-8 C locales do.

>> Regarding the standards conformance of using a UTF-8 C locale:
>> I've spent some time reading the standards (SUSv3), and see no reason
>> why C can't use UTF-8 as its default codeset and still remain strictly
>> conforming.
>
> UTF-8 as a lot of characters (alphabetic, numeric, white).
> C locale requires that whitespace are only SPACE and TAB.

Where is this requirement?  Can you point me to the SUSv3 definition?

> I did look for all requirement, but I found that some requirement
> are incompatible from what one should expect.

Again, do you have references or examples?

> So a C local in UTF-8 would cause more trouble to users (no warning,
> but the whitespace are missinterpreted (note: some windows editors
> are know to insert a lot of non standard whitespace, instead of spaces).

Huh?

>> The standards specifies a minimum requirement of a portable character
>> set and control character set.  This is satisfied by the 7-bit ASCII
>> encoding which we currently use as the C0 and G1 control and graphics
>> sets.  However, UTF-8 is a strict 8-bit superset of this standard, and
>> it is eminently reasonable to use UTF-8 *and still remain conforming*
>> with the minimum functionality required by the standard.  It's
>> explicity spelled out in SUSv2, though the wording was dropped in
>> SUSv3 (definitely not forbidden, though).
>>
>> POSIX/C locale:
>> http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap07.html#tag_07_02
>>
>> Portable charset:
>> http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap06.html#tag_06
>>
>> "Implementations may also add other characters."
>> This is from the charset documentation in SUSv2
>> http://opengroup.org/onlinepubs/007908775/xbd/charset.html
>>
>>
>> UTF-8 is the default character set on Debian GNU/Linux.  It's what
>> we all use, it's what all the tools use, and the C locale is the
>> last ASCII holdout.  It would make the lives of many maintainers
>> and users more bearable if it was also UTF-8, as well as getting
>> rid of the current buggy behaviour if you use UTF-8-encoded sources.
>> It's currently *the only blocker* preventing us using UTF-8 encoded
>> sources.

Just to clarify what I meant here.  We can currently support any
locale using charset recoding via iconv, or by abusing gettext
(which does recoding as a side effect of its main purpose of
translating text).  This works for all locales except C, where
it doesn't do any translation.

If the C locale used UTF-8, the UTF-8 strings in the sources would
display correctly in the absence of any recoding or translating
machinery (which is effectively what happens in the C locale).  This
is pretty much the crux of the point I'm trying to make.

Solely due to the C locale being a throwback to the 1960s, we are not
able to make use UTF-8 encoded sources or strings unless the C locale
changes.  It's just this one locale.

> I think ASCII 7 would simplify the finding bugs.

In what context?

> An c>127 in a C locale is simply wrong, it will miss interpreted
> by different terminal (local and remote, etc.).

Err, why?  This is a recursive argument.  If the C locale used UTF-8,
then c>127 would be perfectly OK.  And code which does things based
on the locale charset should check the locale charmap if it's
important.

> Not always, there are terminal libraries and standard libraries that
> do the right things, but with your proposal, I think in few months
> programs will simply write UTF-8 to terminal, ignoring charset
> choose by user.

Correctly written programs will always use the locale chosen by the
user.  I have not ever said I wanted to ignore the user's charset:
I don't.  They can select (or make) any locale of their choosing,
without it affecting anything to do with the C locale.

> Before was: all must use English because I understand English
> now we want: all must use UTF-8 because I use UTF-8?

Err, *no*.  Whatever gave you that idea?

> If English is the most spoken language (and easier to type), or
> that UTF-8 is technically very good, doesn't mean that we
> should oblige users to use English or UTF-8.

Err, I'm not doing *either*.  I'm talking about the C locale only,
which isn't a locale any *user* should be choosing unless they
want untranslated (English or whatever the programmer used) text.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Tue, 07 Apr 2009 21:57:42 GMT) Full text and rfc822 format available.

Acknowledgement sent to Roger Leigh <rleigh@codelibre.net>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Tue, 07 Apr 2009 21:57:42 GMT) Full text and rfc822 format available.

Message #112 received at 522776@bugs.debian.org (full text, mbox):

From: Roger Leigh <rleigh@codelibre.net>
To: Thorsten Glaser <tg@mirbsd.de>
Cc: Adeodato Simó <dato@net.com.org.es>, 522776@bugs.debian.org, debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Tue, 7 Apr 2009 22:42:56 +0100
[Message part 1 (text/plain, inline)]
On Tue, Apr 07, 2009 at 09:00:50PM +0000, Thorsten Glaser wrote:
> Adeodato Simó dixit:
> 
> >I would go as far as suggesting that some package like libc6 itself
> 
> FWIW:
> 
> -rw-r--r-- 1 tg tg 238336 Apr  7 22:59 en_US.UTF-8/LC_CTYPE
> 
> It's not *that* much...
> 
> >Finally, this stuff that Roger proposes about making “C” be UTF-8, and
> >create some C.ASCII for people needing that, sounds shocking at the same
> >time as appealing.
> 
> It won't work, because in a UTF-8 locale, for example stdio
> functions must reject "invalid" (not valid UTF-8) input, so
> it would not be 8-bit clean/transparent any more.

I wasn't aware that this level of checking was performed, though
it does make sense.  But, does it not reject non 7-bit input in the C
locale for completeness?

Should tools doing "raw" I/O not be using lower level interfaces
such as fread() and fwrite() rather than the "formatted" print
functions which are specified to behave in a locale-dependent
manner?  This strikes me as bugs in the form of assumptions in the
code which should be fixed, rather than a fundamental problem with
the locale itself using a non-7-bit-ASCII codeset.


Thanks,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Tue, 07 Apr 2009 22:15:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Tue, 07 Apr 2009 22:15:02 GMT) Full text and rfc822 format available.

Message #117 received at 522776@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@mirbsd.de>
To: Roger Leigh <rleigh@codelibre.net>
Cc: Adeodato Simó <dato@net.com.org.es>, 522776@bugs.debian.org, debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Tue, 7 Apr 2009 22:01:16 +0000 (UTC)
Roger Leigh dixit:

>But, does it not reject non 7-bit input in the C
>locale for completeness?

No, it doesn't - "we" (before my time though, I think) fought
hard for eight-bit transparence and eight-bit cleanliness.

>Should tools doing "raw" I/O not be using lower level interfaces
>such as fread() and fwrite()

These too are affected.

//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
	-- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Tue, 07 Apr 2009 22:45:07 GMT) Full text and rfc822 format available.

Acknowledgement sent to Roger Leigh <rleigh@codelibre.net>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Tue, 07 Apr 2009 22:45:07 GMT) Full text and rfc822 format available.

Message #122 received at 522776@bugs.debian.org (full text, mbox):

From: Roger Leigh <rleigh@codelibre.net>
To: Thorsten Glaser <tg@mirbsd.de>
Cc: Adeodato Simó <dato@net.com.org.es>, 522776@bugs.debian.org, debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Tue, 7 Apr 2009 23:44:29 +0100
[Message part 1 (text/plain, inline)]
On Tue, Apr 07, 2009 at 10:01:16PM +0000, Thorsten Glaser wrote:
> Roger Leigh dixit:
> 
> >But, does it not reject non 7-bit input in the C
> >locale for completeness?
> 
> No, it doesn't - "we" (before my time though, I think) fought
> hard for eight-bit transparence and eight-bit cleanliness.
> 
> >Should tools doing "raw" I/O not be using lower level interfaces
> >such as fread() and fwrite()
> 
> These too are affected.

Are you sure?  The documentation does not suggest they are affected
by locale.  These functions are operating on binary objects, and
should not be affected by the locale.  From SUSv3:

fwrite - binary output
The fwrite() function shall write, from the array pointed to by ptr, up to
nitems elements whose size is specified by size, to the stream pointed to by
stream. For each object, size calls shall be made to the fputc() function,
taking the values (in order) from an array of unsigned char exactly overlaying
the object.

And for fputc

fputc - put a byte on a stream
The fputc() function shall write the byte specified by c (converted to an
unsigned char) to the output stream pointed to by stream, at the position
indicated by the associated file-position indicator for the stream (if
defined), and shall advance the indicator appropriately. If the file cannot
support positioning requests, or if the stream was opened with append mode, the
byte shall be appended to the output stream.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Tue, 07 Apr 2009 22:51:05 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Tue, 07 Apr 2009 22:51:05 GMT) Full text and rfc822 format available.

Message #127 received at 522776@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@mirbsd.de>
To: Roger Leigh <rleigh@codelibre.net>
Cc: Adeodato Simó <dato@net.com.org.es>, 522776@bugs.debian.org, debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Tue, 7 Apr 2009 22:47:00 +0000 (UTC)
Roger Leigh dixit:

>Are you sure?

Not entirely, but I recall fgetc (or was it fgetwc?)
being affected.

//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
	-- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Wed, 08 Apr 2009 07:42:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Giacomo A. Catenazzi" <cate@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Wed, 08 Apr 2009 07:42:02 GMT) Full text and rfc822 format available.

Message #132 received at 522776@bugs.debian.org (full text, mbox):

From: "Giacomo A. Catenazzi" <cate@debian.org>
To: Roger Leigh <rleigh@codelibre.net>
Cc: Thorsten Glaser <tg@mirbsd.de>, Adeodato Simó <dato@net.com.org.es>, 522776@bugs.debian.org, debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Wed, 08 Apr 2009 09:41:18 +0200
Roger Leigh wrote:
 > I wasn't aware that this level of checking was performed, though
> it does make sense.  But, does it not reject non 7-bit input in the C
> locale for completeness?
> 
> Should tools doing "raw" I/O not be using lower level interfaces
> such as fread() and fwrite() rather than the "formatted" print
> functions which are specified to behave in a locale-dependent
> manner? 

printf is not locale dependent, but on numeric display
(and eventually on some extensions).

ciao
	cate





Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Wed, 08 Apr 2009 08:18:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Giacomo A. Catenazzi" <cate@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Wed, 08 Apr 2009 08:18:03 GMT) Full text and rfc822 format available.

Message #137 received at 522776@bugs.debian.org (full text, mbox):

From: "Giacomo A. Catenazzi" <cate@debian.org>
To: Andrew McMillan <andrew@morphoss.com>, 522776@bugs.debian.org
Cc: Adeodato Simó <dato@net.com.org.es>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Wed, 08 Apr 2009 10:15:27 +0200
Andrew McMillan wrote:
> On Tue, 2009-04-07 at 22:32 +0200, Adeodato Simó wrote:
>> It is my impression that more packages than mksh could use an UTF-8
>> locale at build time (I’m afraid I don’t have pointers, but I’m sure
>> I’ve come across at least a couple).
>>
>> Wouldn’t it be just better to change Debian’s default to make an UTF-8
>> locale available by default, rather than to force all those packages to
>> play tricks with LOCPATH?
> 
> I too would really like to see a UTF-8 locale available by default, and
> would prefer to see this be the C.UTF-8 locale, which doesn't screw with
> the collation / character type settings like any other UTF-8 locale
> would.
> 
> It seems to me that the consensus here is that having a UTF-8 locale
> available is a good idea and I don't hear any very strong argument
> against such a change.
> 
> Consequently I think we should move on from the discussion and start
> working out a patch to resolve this in policy.

So I've a question: what does UTF-8 mean in this context (C.UTF-8) ?

It is not a stupid question, and the answer is not the UTF-8 algorithm
to code/decode unicode.
I'm still thinking that you are confusing the various meanings.
And until I understand the problem, I cannot propose a solution.

- terminals should be sensible to charsets, on choosing how to display
  things
- programs should be sensible to locales (topic of this discussion):
  the locales provides some charsets dependent strings, and interpretation
  of the various characters, but (usually) they MUST NOT translate characters.

Anyway:

The locale C is already a UTF-8 compatible locale.
No? so what it misses?
- other alphabetic, numeric, currency, whitespace characters?  But not UTF-8
  local provides all characters: they define only the needed range for the
  language [see wikipedia, which should code UTF-8 as binary for this reason].
  The "C" "spoken" language require only ASCII-7 (or maybe only a subrange of it).
  So why we need further characters?
  Note: whitespace are restricted in "C" locale by POSIX, in only two values

  We could use charset UTF-8 for C locale, declaring unused/illegal all
  c > 127.  Whould this solve the problems with mksh? I don't think so,
  so what you need in this C.UTF-8?

I still think that "en_US.UTF-8" is the right default (note:
I'm not a US citizen, nor I speak English).

The installation will install the correct locale, so the en_US period is very
short (we'll dominate them ;-) ).

On debootstrap/pbuild/... things are different.  But if it this the problem,
let check a solution for building environment (and I still think that in this
env "en_US.UTF-8" could be nice.

But I'll prefer a simple basic ASCII-7 "C" for basic/plain build, and only
after packager thinks if it is a bug or a feature to have a specific build with
UTF-8, it should manually set it.
Why build need to depend to a locale?
UNIX way is to allow to compile things for remote (maybe other OS, other arch)
system.
For testing? So why not test various locales (UTF-8, but also other non
ascii based encodings)

ciao
	cate





Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Wed, 08 Apr 2009 08:24:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Roger Leigh <rleigh@codelibre.net>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Wed, 08 Apr 2009 08:24:02 GMT) Full text and rfc822 format available.

Message #142 received at 522776@bugs.debian.org (full text, mbox):

From: Roger Leigh <rleigh@codelibre.net>
To: Thorsten Glaser <tg@mirbsd.de>
Cc: Adeodato Simó <dato@net.com.org.es>, 522776@bugs.debian.org, debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Wed, 8 Apr 2009 09:21:31 +0100
On Tue, Apr 07, 2009 at 10:47:00PM +0000, Thorsten Glaser wrote:
> Roger Leigh dixit:
> 
> >Are you sure?
> 
> Not entirely, but I recall fgetc (or was it fgetwc?)
> being affected.

Ah, fgetc/fputc are specified in the standard as "byte oriented"
rather than character-oriented, so are probably locale-independent
for binary I/O.  OTOH, the wide variants are for wide character I/O
and may require conversion between the narrow and wide forms which
might well need to involve the locale.  I thought I spotted this
reading the standard last night, but I can't find the text this
morning.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Wed, 08 Apr 2009 08:24:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Giacomo A. Catenazzi" <cate@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Wed, 08 Apr 2009 08:24:03 GMT) Full text and rfc822 format available.

Message #147 received at 522776@bugs.debian.org (full text, mbox):

From: "Giacomo A. Catenazzi" <cate@debian.org>
To: Roger Leigh <rleigh@codelibre.net>, 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Wed, 08 Apr 2009 10:22:15 +0200
Roger Leigh wrote:
> On Tue, Apr 07, 2009 at 09:24:38PM +0200, Adeodato Simó wrote:
>> + Thorsten Glaser (Tue, 07 Apr 2009 18:54:59 +0000):
>>
>>> Except the ton which sets LC_ALL=C to get sane (parsable,
>>> dependable, historically compatible) output.
>>> These would then unset all other LC_* and LANG and LANGUAGE,
>>> and only set LC_CTYPE to C.UTF-8 to get "old" behaviour but
>>> with UTF-8 (and mbrtowc and iswctype and and and) available.
>> Isn’t setting LC_ALL=C.UTF-8 going to be about the same and less work?
>> I’m genuinely interested if that would behave any different to what you
>> said (unsetting all, setting LC_CTYPE).
> 
> % sudo localedef -c -i POSIX -f UTF-8 C.UTF-8
> 
> % LANG=C.UTF8 locale charmap
> UTF-8
> 
> % LANG=C locale charmap
> ANSI_X3.4-1968
> 
> This appears to work correctly at first glance.
> 
> However, I would ideally like the C/POSIX locales to be UTF-8
> by default as on other systems (with a C.ASCII variant if required).

POSIX doesn't mandate "C" to be ASCII7.

BTW ASCII7 is a subset of UTF-8, so what would be different with
normal "C"?  I don't expect any differences on any program (which
are POSIX compatible). The output characters will still be only on
the c<128 range.

ciao
	cate




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Wed, 08 Apr 2009 09:09:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Roger Leigh <rleigh@codelibre.net>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Wed, 08 Apr 2009 09:09:06 GMT) Full text and rfc822 format available.

Message #152 received at 522776@bugs.debian.org (full text, mbox):

From: Roger Leigh <rleigh@codelibre.net>
To: "Giacomo A. Catenazzi" <cate@debian.org>, 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Wed, 8 Apr 2009 10:07:37 +0100
On Wed, Apr 08, 2009 at 10:22:15AM +0200, Giacomo A. Catenazzi wrote:
> Roger Leigh wrote:
>> On Tue, Apr 07, 2009 at 09:24:38PM +0200, Adeodato Simó wrote:
>>> + Thorsten Glaser (Tue, 07 Apr 2009 18:54:59 +0000):
>>>
>>>> Except the ton which sets LC_ALL=C to get sane (parsable,
>>>> dependable, historically compatible) output.
>>>> These would then unset all other LC_* and LANG and LANGUAGE,
>>>> and only set LC_CTYPE to C.UTF-8 to get "old" behaviour but
>>>> with UTF-8 (and mbrtowc and iswctype and and and) available.
>>> Isn’t setting LC_ALL=C.UTF-8 going to be about the same and less work?
>>> I’m genuinely interested if that would behave any different to what you
>>> said (unsetting all, setting LC_CTYPE).
>>
>> % sudo localedef -c -i POSIX -f UTF-8 C.UTF-8
>>
>> % LANG=C.UTF8 locale charmap
>> UTF-8
>>
>> % LANG=C locale charmap
>> ANSI_X3.4-1968
>>
>> This appears to work correctly at first glance.
>>
>> However, I would ideally like the C/POSIX locales to be UTF-8
>> by default as on other systems (with a C.ASCII variant if required).
>
> POSIX doesn't mandate "C" to be ASCII7.
>
> BTW ASCII7 is a subset of UTF-8, so what would be different with
> normal "C"?  I don't expect any differences on any program (which
> are POSIX compatible). The output characters will still be only on
> the c<128 range.

Exactly.  For a conforming program only using c<128, there will
be zero differences running in a UTF-8 C locale and running and
an ASCII C locale, just like there are no differences today when
running in any UTF-8 locale (except maybe collation, but for the
UTF-8 C locale we would need to keep it fully backward compatible
with the existing behaviour).

However, what is different is that programs may /optionally/ choose
to use the UTF-8 superset of ASCII7 and have output and string
formatting and wide/narrow character conversion work correctly.
This is what is currently lacking.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Wed, 08 Apr 2009 09:36:42 GMT) Full text and rfc822 format available.

Acknowledgement sent to Roger Leigh <rleigh@codelibre.net>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Wed, 08 Apr 2009 09:36:52 GMT) Full text and rfc822 format available.

Message #157 received at 522776@bugs.debian.org (full text, mbox):

From: Roger Leigh <rleigh@codelibre.net>
To: "Giacomo A. Catenazzi" <cate@debian.org>
Cc: Thorsten Glaser <tg@mirbsd.de>, Adeodato Simó <dato@net.com.org.es>, 522776@bugs.debian.org, debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Wed, 8 Apr 2009 10:25:55 +0100
On Wed, Apr 08, 2009 at 09:41:18AM +0200, Giacomo A. Catenazzi wrote:
> Roger Leigh wrote:
>  > I wasn't aware that this level of checking was performed, though
>> it does make sense.  But, does it not reject non 7-bit input in the C
>> locale for completeness?
>>
>> Should tools doing "raw" I/O not be using lower level interfaces
>> such as fread() and fwrite() rather than the "formatted" print
>> functions which are specified to behave in a locale-dependent
>> manner? 
>
> printf is not locale dependent, but on numeric display
> (and eventually on some extensions).

Each C FILE* stream has an associated locale.
Look at struct _IO_FILE_complete in libio.h.
The example program I posted demonstrates that this does actually
happen; the output streams use the current locale, and there is
a UTF-8 [narrow]/UCS-4 [wide] conversion to the locale codeset on
output.

When you output a string to a stream, there is a conversion step
from the exec charset (either narrow or wide) to the stream's
associated locale.  I haven't yet found documented exactly where
this happens (it's all in the libc internals), but I would
hazard a guess that all the "string" functions use this step,
where the lower-level byte-based I/O functions skip this step.

This machinery is also used by the C++ iostream locale imbue()
mechanism.

So while printf itself might not do the conversion, it's done
at some point, probably when printf copies the formatted string
to the stream buffer.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Wed, 08 Apr 2009 09:57:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Giacomo A. Catenazzi" <cate@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Wed, 08 Apr 2009 09:57:02 GMT) Full text and rfc822 format available.

Message #162 received at 522776@bugs.debian.org (full text, mbox):

From: "Giacomo A. Catenazzi" <cate@debian.org>
To: Roger Leigh <rleigh@codelibre.net>, 522776@bugs.debian.org
Cc: Steve Langasek <vorlon@debian.org>, Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Wed, 08 Apr 2009 11:54:44 +0200
Roger Leigh wrote:
> On Tue, Apr 07, 2009 at 10:36:20AM +0200, Giacomo A. Catenazzi wrote:

>> Roger Leigh wrote:

> I can't help but feel that your reply completely missed the
> purpose of what I want to do, and why.  I hope the following
> response clears things up.

I know that I missed the original point, but IMHO you was and still
misunderstandings locale, charset and C language behaviour.

So I'm trying to explain you how these things works, and after
this, we can go to the real problem.
[Note: maybe I am on the wrong side. Often standards are not
so consistent on these behaviours, and thus maybe I interpreted them
wrongly]

> 

> 
>> - input charset is the source charset (used to parse C code)
>> - exec charset is the charset of the target machine (which run the program).
> 
> That's pretty much what I said.
> 
>> - C99 must support unicode identifier (written with \uxxxx or in other
>>   non portable implementation defined way)
> 
> OK.  But that's really nothing to do with the fact that you can use
> UTF-8 sources directly.  It's akin to having to support trigraphs,
> but we don't use trigraphs because they are bloody annoying and nowadays
> competelely unnecessary.  But mainly, it doesn't affect the exec charset
> whether you use UTF-8 encoded sources or \uxxxx.

ok.

>> - standard libraries can use locales (but only if you initialized the locale),
>>   but not all the functions, not all uses.
>> - wide charaters are yet an other things (as you note in your example,
>>   the wide string is not in UTF-8, but I think UTF-32)
>>
>> Same input and exec charset really means: don't translate strings
>> (e.g. in
>>    if(c = 'a') printf("bcde\n");
>>  'a' and "bcde\n" will have the same values as in the input file, else
>>  it will put in binary the representation of exec charset)
> 
> Of course.  However, the test program I posted showed what that if the
> locale has been appropriately initialised, there is an additional
> translation between the exec charset and the output charset specified
> by the locale (see the Latin characters correctly preserved and output
> as ISO-8859-1 in an ISO-8859-1 locale).

No ;-)  Ok, it take me some modifications of your program and
looking to POSIX to discover the reason.

You forget to check error codes. In this case we have
"Invalid or incomplete multibyte or wide character" in the
non UTF-8 locale.

So looking to POSIX:
"Wide-character codes for other characters are locale and implementation-defined."
so you (and me) compiled the code with UTF-8, so in binary there is
different wchar representation. Which is invalid on non-UTF-8 locale.

Note that that it is locale dependent, so same charset with different
language could give different results (I don't know if there are such
cases on glibc).


>> Usually the interpretation of bytes is done by terminal, not by compiler.
> 
> It's done at several points:
> compiler: source->exec
> runtime: locale-dependent exec->output (and optional use of gettext)
> terminal: output->display

to go to the point: what is the problem in mksh?
At which level it fails?


>> yes, in a perfect world we need only one charset (and maybe only
>> one language and one locale). From all the proposals to reach this
>> target, unicode and UTF-8 seems the best solution.
>> But... for now take care about locales and don't assume UTF-8,
>> or you will cause trouble with a lot of non-UTF-8 users.
>> Converting locale (from non-UTF-8 to UTF-8) is simple for
>> English and few European languages, but it is a tedious work
>> for many user: it need a "flag day", in which I should convert
>> all my files to UTF-8 or annotate every file with the right
>> encoding (most of editors and tools understands such annotations).
> 
> I have never *ever* suggested that we only use one charset.  I'm only
> suggesting that the *C locale* must be UTF-8 in order to allow for
> full UTF-8 support.  Normal user locales can use whatever charset
> they like.

(see the other mail: what do "full UTF-8" mean)


> Non-UTF-8 users won't be disadvantaged because the UTF-8 exec charset
> will be recoded to their locale-specific output codeset, either by
> libc or gettext.

Not sure to understand. Debian is moving all file to UTF-8
(manual pages, documentation, debian control files, ...).
So I totally agree.
But was not the point of the original proglem?


> The C locale is special in that normal users won't use it, but
> system programs and code needing locale independence do use it.
> Any program wanting to work correctly in a C locale must only use
> ASCII or it *breaks*.  This means we are /de facto/ restricted to
> ASCII unless we take special effort to work around the fact (and
> this was the point of my l10n/i18n comments above).
> 
> Most programs do need to work correctly in a C locale, and so can't
> use UTF-8 either as a source or exec charset.  This is a severe
> limitation.

No. "locale" is not really charset. A program can use
as input and output any charset (note: most of editor handle
different file charsets, indipendently).
The problem are the terminals. If you print a non-ASCII char,
terminal will confuse. It is the reason of "libncurse"
(maybe more oriented to control terminal then charsets).

Debian target is to support UTF-8 on all programs, but
the problem is that I connect to debian machine from
outside Debian and also the contrary, I connect
from my Debian machine to other machines.

So a program which support only UTF-8 could cause problem
to such user, and it is outside Debian control.

On a long term I can imagine that UTF-8 will become nearly
standard, but I think we should wait for other distribution
and vendors before to make such big jump.
But now UTF-8 is nearly default in Debian.

But if mksh don't work on "C", I'm very worried.
The problems are on inputs or on outputs?


>>>  A C.UTF-8 would
>>> be a solution to this problem, and allow full use of the *existing*
>>> UTF-8 string handling which all sources are built with, yet only a
>>> tiny fraction dare to use.  Note that gettext is *completely disabled*
>>> if used in a C locale, and this does additional mangling in addition
>>> to the plain libc damage, resulting in *no output at all*!  (I would
>>> need to double check that; this was the case when I last looked,
>>> and the reason I had to abandon use of UTF-8 string literals.)
>> Use "en_US.UTF-8".
> 
> Why?  Did you actually understand the rationale I provided above.
> I could use en_US.UTF-8, or any locale.  But the point is that the
> code works in all locales *except* C.

ah. This is strange (considering the huge list of locales).
Why doesn't work in C?


>> "C.UTF-8" is a bad name. Locale "C" means "no locale, old behaviour,
>> for machine".
> 
> No.  "C" and "POSIX" mean the /default/ POSIX-specified locale.  And
> there's nothing written in the standard that restricts that locale
> to 7-bit ASCII as its codeset.  There are UNIX systems out there
> right now using UTF-8 in their C locale.

No. Default locale is "". "C" is a precise locale with fulfill
precise rules. Yes, it can be UTF-8, but why it matters?
(see the other mail: what you need from UTF-8)


>> Do we need to translate all strings also on C.UTF-8?
> 
> Of course not.  We don't do any translation in the C locale.  The
> only difference is the character encoding, which is backward
> compatible with ASCII in any case.

"C" and "en" could have different translations. POSIX mandates
output in "C" locale, usually providing a printf like string,
so that it can be used in scripts.
I think user must use en_ or other language and only scripts
"C".
So the frequent question: why does scripts need a UTF-8 locale? ;-)

> 
>> Which alphabetic characters?  Which numeric characters?  Which
>> alphabetic order? etc. etc.  You see: it is difficult to create
>> a new locale, and people must understand the meaning of such locale
>> (without reading all the locale definition).
> 
> For a minimal locale it could just use strict numerical ordering.
> It should probably copy what existing systems using UTF-8 C locales do.

Different language use different unicode range.


>>> Regarding the standards conformance of using a UTF-8 C locale:
>>> I've spent some time reading the standards (SUSv3), and see no reason
>>> why C can't use UTF-8 as its default codeset and still remain strictly
>>> conforming.
>> UTF-8 as a lot of characters (alphabetic, numeric, white).
>> C locale requires that whitespace are only SPACE and TAB.
> 
> Where is this requirement?  Can you point me to the SUSv3 definition?

7.3.1:
"In the POSIX locale, only the <space> and <tab> shall be included."

Ok. I confused "blank" with "white". Anyway in 7.3.1 you see requirement
of "C". So a UTF-8 is ok, but which definition?
It need to be simple (but people don't want us_EN like, because of
collation and other complex rules).

(...)

ciao
	cate




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Wed, 08 Apr 2009 11:54:05 GMT) Full text and rfc822 format available.

Acknowledgement sent to Andrew McMillan <andrew@morphoss.com>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Wed, 08 Apr 2009 11:54:05 GMT) Full text and rfc822 format available.

Message #167 received at 522776@bugs.debian.org (full text, mbox):

From: Andrew McMillan <andrew@morphoss.com>
To: "Giacomo A. Catenazzi" <cate@debian.org>, 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Wed, 08 Apr 2009 23:52:08 +1200
On Wed, 2009-04-08 at 10:15 +0200, Giacomo A. Catenazzi wrote:
> 
> So I've a question: what does UTF-8 mean in this context (C.UTF-8) ?
> 
> It is not a stupid question, and the answer is not the UTF-8 algorithm
> to code/decode unicode.
> I'm still thinking that you are confusing the various meanings.
> And until I understand the problem, I cannot propose a solution.

While it is true that the C locale is (already) a UTF-8 compatible
locale, it provides no clues to the system for the encoding of
characters outside that locale.

We can all be pure about the C locale and believe that all characters
have 7 bits, but we all know that reality is not like that.  It's not
like that even in the northern part of the content pair that 'ASCII'
gets it's name from.

I believe that Debian should endorse Unicode as the preferred method for
mapping between numbers to characters.  I do not expect there is any
real argument against this, although I do understand that current
versions of Unicode may not yet comprehensively/satisfactorily represent
all glyphs in some languages.  I think there is hope that these problems
will eventually be ironed out.

There are, of course, a number of systems for encoding unicode
characters, but I do not seriously expect that anyone is recommending
that Debian should use UTF-16, UTF-32 (or, $DEITY forbid, Punycode :-)
as something which should be available everywhere.

So given a character which is outside of the 0x00 <= 0x7f range, in an
environment which does not specify an encoding, I would like to one day
be able to categorically state that "Debian will by default assume that
character is unicode, encoded according to UTF-8".

In such an environment, with a C.UTF-8 encoding selected, when I start a
word processing program and insert an a-umlaut in there, I would expect
that my file will be written with a UTF-8 encoded unicode character in
it.  I would not expect that if I sort the lines in that file, that the
lines beginning with a-umlaut would sort before 'z'.  I would not expect
that if I grep such a file for '^[[:alpha:]]$' that my a-umlaut line
would appear.

At present I don't believe that this does happen.  At present we
continue to perpetuate encodings such as ISO 8859-1 in these situations,
making pain for our children and grandchildren to resolve.


So as a first step in this process of 'cleaning up our world', this bug
is proposing a smaller change than that, and a smaller change than I
believe you think it is.


The proposal, at this stage is only that the C.UTF-8 locale is
*installed* and *available* by default.  Not that it *be* the default,
but that it *be there* as a default. People will naturally continue to
be free to uninstall it, or to leave their locale to 'C'.


Once this minimum step is made, and we've all calmed down, we can think
further on radical and dramatic changes over coming years where more
significant shifts are made, like:

* The default locale at installation is C.UTF-8 rather than C.
* The default locale at installation is assigned based on the
installation language.
* If a locale is set which doesn't specify an encoding, the system
defaults to assuming UTF-8.
* All ISO8859 locales are moved to a new locales-legacy-encodings
package.
* ... and so on.


Yes, I think that the C.UTF-8 locale offers something different that the
C locale doesn't.  Primarily it offers us a way out of the current
default encodings which are legacy encodings, without jumping boots and
all into a world where suddenly our sort ordering is changed, and our
users are screaming at us that en_US.UTF-8 is wrong for *them*, or that
'sort' is suddenly putting 'A' next to 'a' and all of their legacy shell
scripts expect are broken because they expect a different behaviour.


I believe that the list above might be the set of smallest useful
incremental changes in this process.  I would really like to see that
second step taken too, where the default locale is set to the most basic
UTF-8 locale possible, but I'm happy to see a second bug and further
discussion, if that's what we need to do to get agreement.


> - terminals should be sensible to charsets, on choosing how to display
>    things
> - programs should be sensible to locales (topic of this discussion):
>    the locales provides some charsets dependent strings, and interpretation
>    of the various characters, but (usually) they MUST NOT translate characters.
Not so.  They have to consider how to handle input also, unless by
'terminal' you mean any program which might handle character input and
output...

An example I have had in the last week was that some software processing
information from the internet was converting &nbsp; into the character
0xa0.  While I have now stopped using that particular software
(Html::Strip, if anyone's interested), it illustrates exactly how
software currently doesn't know, and through not knowing it can
perpetuate encoding systems which need to die.


> Anyway:
> 
> The locale C is already a UTF-8 compatible locale.
> No? so what it misses?
> - other alphabetic, numeric, currency, whitespace characters?  But not UTF-8
>    local provides all characters: they define only the needed range for the
>    language [see wikipedia, which should code UTF-8 as binary for this reason].
>    The "C" "spoken" language require only ASCII-7 (or maybe only a subrange of it).
>    So why we need further characters?
>    Note: whitespace are restricted in "C" locale by POSIX, in only two values
> 
>    We could use charset UTF-8 for C locale, declaring unused/illegal all
>    c > 127.  Whould this solve the problems with mksh? I don't think so,
>    so what you need in this C.UTF-8?
> 
> I still think that "en_US.UTF-8" is the right default (note:
> I'm not a US citizen, nor I speak English).

Note that this proposal is not that we change the default sort ordering
or character typing, which en_US *would* do (vs C).

This proposal (if it were that strong) would be pushing for adoption of
UTF-8 encoding as the default encoding.  It isn't as strong as that,
though.  It is merely pushing for the *availability* of a UTF-8 locale
on a default install.


> The installation will install the correct locale, so the en_US period is very
> short (we'll dominate them ;-) ).
> 
> On debootstrap/pbuild/... things are different.  But if it this the problem,
> let check a solution for building environment (and I still think that in this
> env "en_US.UTF-8" could be nice.
> 
> But I'll prefer a simple basic ASCII-7 "C" for basic/plain build, and only
> after packager thinks if it is a bug or a feature to have a specific build with
> UTF-8, it should manually set it.
> Why build need to depend to a locale?
> UNIX way is to allow to compile things for remote (maybe other OS, other arch)
> system.
> For testing? So why not test various locales (UTF-8, but also other non
> ascii based encodings)

What environments people build or test in is a separate issue to what
environments are available to them to build or test in, and indeed Steve
Langasek has already suggested a seemingly reasonable workaround for the
immediate problem which was, funnily enough, that mksh wants to have a
UTF-8 locale *available* in order for it to *test the build*...

So we could close this bug as 'why bother', really, but the discussion
is much more important than that.

Regards,
					Andrew McMillan.

------------------------------------------------------------------------
andrew (AT) morphoss (DOT) com                            +64(272)DEBIAN
              Does the turtle move for you?  www.kame.net
------------------------------------------------------------------------






Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Wed, 08 Apr 2009 12:51:08 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Giacomo A. Catenazzi" <cate@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Wed, 08 Apr 2009 12:51:08 GMT) Full text and rfc822 format available.

Message #172 received at 522776@bugs.debian.org (full text, mbox):

From: "Giacomo A. Catenazzi" <cate@debian.org>
To: "Giacomo A. Catenazzi" <cate@debian.org>, 522776@bugs.debian.org
Cc: Roger Leigh <rleigh@codelibre.net>, Steve Langasek <vorlon@debian.org>, Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: locale dependend compilation
Date: Wed, 08 Apr 2009 14:47:30 +0200
Ok, maybe I found the problem.

Giacomo A. Catenazzi wrote:
 > No ;-)  Ok, it take me some modifications of your program and
> looking to POSIX to discover the reason.
> 
> You forget to check error codes. In this case we have
> "Invalid or incomplete multibyte or wide character" in the
> non UTF-8 locale.
> 
> So looking to POSIX:
> "Wide-character codes for other characters are locale and 
> implementation-defined."
> so you (and me) compiled the code with UTF-8, so in binary there is
> different wchar representation. Which is invalid on non-UTF-8 locale.
> 
> Note that that it is locale dependent, so same charset with different
> language could give different results (I don't know if there are such
> cases on glibc).

So it means that NO portable programs could use constant (i.e. as fixed
value in sources) wchars and wstrings, because a compiled program has
now way to distinguish a wstring build at compiler time and a wstring from
outside, thus with possible two different locales/charsets.
[GCC uses as default UTF-16 or UTF-32 for wchar, according to the space need
for chars in current locale]

BTW we have a similar problem with "normal" strings.

This is very unfortunate, and it is *worse* than the initial problem.
Changing locale will not solve this, but probably will reduce the
visibility of the error. [no locale specified means UTF-8 for GCC].

So maybe we need to specify the locale to be passed to debian/rule
or the parameter to gcc, in order to have a (default) fix source
encoding.

But this doesn't not solve the problem. An encoded UTF-8 or
UTF-32 (for wchar) is not decoded correctly on non UTF-8 terminals.

But in this case we have iconv() function (because NOW we know the
inizial encoding), to convert constant-string to the right locale.


So: programs that use constant wchar or string with chars outside ASCII
must be compiled with the right encoding (ev. with right locale), specified
in debian/rule (or every developer will see a different output).
Such program should convert the string to the right locale, before to
print it to terminal.


Alternatively, the string must be put outside source code, and read
from a file. The iconv() apply also in this case.


PS: requiring "us_EN.UTF-8" as default to debian/rule seems also
nice, so logs can be read from all developers.

Possibly also "C" in UTF-8 could be good. Such "C" should have
only charset UTF-8 and not other additional meaning to
characters outside ASCII-7.  But this should be carefully tested:
I really things that there are existing wrong assumption and
cases we forgot.


So ok: I think I've understood the problem (but part of the bug
is in the program / Makefile).

ciao
	cate




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Wed, 08 Apr 2009 13:33:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Giacomo A. Catenazzi" <cate@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Wed, 08 Apr 2009 13:33:03 GMT) Full text and rfc822 format available.

Message #177 received at 522776@bugs.debian.org (full text, mbox):

From: "Giacomo A. Catenazzi" <cate@debian.org>
To: Andrew McMillan <andrew@morphoss.com>
Cc: 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Wed, 08 Apr 2009 15:31:34 +0200
Andrew McMillan wrote:
> On Wed, 2009-04-08 at 10:15 +0200, Giacomo A. Catenazzi wrote:
>> So I've a question: what does UTF-8 mean in this context (C.UTF-8) ?
>>
>> It is not a stupid question, and the answer is not the UTF-8 algorithm
>> to code/decode unicode.
>> I'm still thinking that you are confusing the various meanings.
>> And until I understand the problem, I cannot propose a solution.
> 
> While it is true that the C locale is (already) a UTF-8 compatible
> locale, it provides no clues to the system for the encoding of
> characters outside that locale.
> 
> We can all be pure about the C locale and believe that all characters
> have 7 bits, but we all know that reality is not like that.  It's not
> like that even in the northern part of the content pair that 'ASCII'
> gets it's name from.
> 
> I believe that Debian should endorse Unicode as the preferred method for
> mapping between numbers to characters.  I do not expect there is any
> real argument against this, although I do understand that current
> versions of Unicode may not yet comprehensively/satisfactorily represent
> all glyphs in some languages.  I think there is hope that these problems
> will eventually be ironed out.
> 
> There are, of course, a number of systems for encoding unicode
> characters, but I do not seriously expect that anyone is recommending
> that Debian should use UTF-16, UTF-32 (or, $DEITY forbid, Punycode :-)
> as something which should be available everywhere.
>
> So given a character which is outside of the 0x00 <= 0x7f range, in an
> environment which does not specify an encoding, I would like to one day
> be able to categorically state that "Debian will by default assume that
> character is unicode, encoded according to UTF-8".

I agreem but the last sentence.
"Debian will use as default unicode, encoded according to UTF-8", but
not *assume*.  It is again portability.  Let (old) programs to works
also on the future Debian.

Note that the problem with ASCII7 arise also to other encoding.
We are Europeans or Americans, so UTF-8 seems an easy transition,
but for people who use other non-ASCII based encoding, this could be
very hard.  If we start assuming UTF-8 we cause a lot of troubles in
other continents.  Files which were readable in Lenny will be readable
in future only using a command line utility, what a nightmare for our
users!


So if your first paragraph are a nice objective, we should not
add "assumptions" that causes more troubles.
I think the opposite direction will be the best: let assume
less about locale, and let user and system to find and choose
the right encodings.
I.e. let me read file with "less" in many encodings
(heuristic, magic strings, or command line argument), instead of building
"less" to assume UTF-8.


We have the same objective, but two different ways. And because
I used and use a lot of different systems, I think my way is the best.


> In such an environment, with a C.UTF-8 encoding selected, when I start a
> word processing program and insert an a-umlaut in there, I would expect
> that my file will be written with a UTF-8 encoded unicode character in
> it.  I would not expect that if I sort the lines in that file, that the
> lines beginning with a-umlaut would sort before 'z'.  I would not expect
> that if I grep such a file for '^[[:alpha:]]$' that my a-umlaut line
> would appear.

I think nobody should use "C" or "C.UTF-8" as user encoding.
And I really hope that Debian will try to convince user to use a
proper locale.


> At present I don't believe that this does happen.  At present we
> continue to perpetuate encodings such as ISO 8859-1 in these situations,
> making pain for our children and grandchildren to resolve.

No, I think Debian is really pushing UTF-8, and fortunately we can
distinguish automatically ISO 8859-1 from UTF-8 (but few "degenerate"
cases). This could help.  But world is not only ASCII based, so
mandate UTF-8 will causes more trouble.

I think we can do more heuristic to find the right encoding,
and encouraging programmers to annotate file with the right
encoding (you see more and more file with tell explicitly
the editor about the encoding).

> So as a first step in this process of 'cleaning up our world', this bug
> is proposing a smaller change than that, and a smaller change than I
> believe you think it is.

It helps you, it helps Europeans and Americans, but it doesn't help
writing program that all world could use (also to read older documents).

Setting a real locale (not "POSIX" or "C") solve this, and BTW is
what Debian is doing.
C.UTF-8 will create a new locale, not destroying one, so not going
in the right direction.


> The proposal, at this stage is only that the C.UTF-8 locale is
> *installed* and *available* by default.  Not that it *be* the default,
> but that it *be there* as a default. People will naturally continue to
> be free to uninstall it, or to leave their locale to 'C'.
> 
> 
> Once this minimum step is made, and we've all calmed down, we can think
> further on radical and dramatic changes over coming years where more
> significant shifts are made, like:
> 
> * The default locale at installation is C.UTF-8 rather than C.

BTW is not C.  The real default is en_US.UTF-8 (if you press
Enter continously on installation time), so already a UTF-8
encoding. We could hide further the non UTF-8 encoding
(but it seems that in lenny the other encoding are already
hidden, in "European" languages)

> * The default locale at installation is assigned based on the
> installation language.

Already in Lenny

> * If a locale is set which doesn't specify an encoding, the system
> defaults to assuming UTF-8.

ok. "C" is not the default in POSIX. Systems can choose any locale
But only in few case we need it. Locales are normally set.
So let look at the different cases when we have no locale,
and see why, and the best solution (debootstrap, ssh on
some remote machine (ok outside debian), ...)

> * All ISO8859 locales are moved to a new locales-legacy-encodings
> package.

This encoding is used also on CD/, floppy, remote filesystems, USB pens,
on a lot of internet pages, etc.

So we can discurage in new contents, but we must be able to read the
actual and the old world!


> * ... and so on.
> 
> 
> Yes, I think that the C.UTF-8 locale offers something different that the
> C locale doesn't.  Primarily it offers us a way out of the current
> default encodings which are legacy encodings, without jumping boots and
> all into a world where suddenly our sort ordering is changed, and our
> users are screaming at us that en_US.UTF-8 is wrong for *them*, or that
> 'sort' is suddenly putting 'A' next to 'a' and all of their legacy shell
> scripts expect are broken because they expect a different behaviour.

But an ASCII7 "C" encoding allow you to do the same things. It doesn't
forbid 8 bit characters (thus UTF-8). Unix is transparent on characters
(i.e. binary and text are the same, you can grep binaries, ...).

So scripts should use LANG=C on most cases.

If you have trouble seems characters, is because the terminal.
In this case we can force terminal to use UTF-8 on "C" encoding
(as an option) or you should use a real locale.  In this case
you are the user, so you can choose the right localte.


There are problem with binary code, when compiler run in a different
locale, and when code was not so "portable". But this is a different
problem, which require a different solution (possibly at building time)


> I believe that the list above might be the set of smallest useful
> incremental changes in this process.  I would really like to see that
> second step taken too, where the default locale is set to the most basic
> UTF-8 locale possible, but I'm happy to see a second bug and further
> discussion, if that's what we need to do to get agreement.

already in Lenny.

> 
> 
>> - terminals should be sensible to charsets, on choosing how to display
>>    things
>> - programs should be sensible to locales (topic of this discussion):
>>    the locales provides some charsets dependent strings, and interpretation
>>    of the various characters, but (usually) they MUST NOT translate characters.

> Not so.  They have to consider how to handle input also, unless by
> 'terminal' you mean any program which might handle character input and
> output...
> 
> An example I have had in the last week was that some software processing
> information from the internet was converting &nbsp; into the character
> 0xa0.  While I have now stopped using that particular software
> (Html::Strip, if anyone's interested), it illustrates exactly how
> software currently doesn't know, and through not knowing it can
> perpetuate encoding systems which need to die.

No, I mean true terminals. Programs should be usually transparent to
encoding (when used as filter, etc..).  The "sed" would not have
such problem.  Hmm. but 0xa0 should be specified by number.
If che c=0xa0 was in source ok, I see the problem, but most of
language permit annotation at the top of the source.

ciao
	cate




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Wed, 08 Apr 2009 17:51:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Wed, 08 Apr 2009 17:51:07 GMT) Full text and rfc822 format available.

Message #182 received at 522776@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@mirbsd.de>
To: 522776@bugs.debian.org
Cc: debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Wed, 8 Apr 2009 17:41:24 +0000 (UTC)
Giacomo A. Catenazzi dixit:

> The locale C is already a UTF-8 compatible locale.

It is UTF-8 transparent but that's its pro and con.
It does not tell the system that UTF-8 encoding is to be used.
It basically says the encoding is none/unknown.


> Why build need to depend to a locale?
[...]
> For testing? So why not test various locales (UTF-8, but also other non
> ascii based encodings)

> to go to the point: what is the problem in mksh?
> At which level it fails?
[...]
> But if mksh don't work on "C", I'm very worried.
> The problems are on inputs or on outputs?

I think you misunderstand the mksh part of the problem.

mksh has two modi: a legacy mode, in which it does not make any
assumptions about charsets or encodings and is 8-bit clean and
mostly 8-bit transparent, safe a few mostly past bugs and imple-
mentation shortcomings, and a unicode mode, in which it assumes
its input is UTF-8 (although, with ^V, you can still enter non-
UTF-8 sequences, and tabcomplete filenames in legacy encodings
as well). The unicode mode is enabled with "mksh -U" or "set -U".
However, mksh has a feature which automatically enables the uni-
code mode if
- the current CODESET is UTF-8 (or the locale ends in .utf8 or
  .UTF-8 or something similar, in some cases), or
- the input begins with a UTF-8 BOM.

The regression test suite merely checks for this feature. To do
so, it needs a way to set the checked mksh process' CODESET to
UTF-8, which is only possible by setting a non-C/POSIX locale.


Andrew McMillan dixit:

>The proposal, at this stage is only that the C.UTF-8 locale is
>*installed* and *available* by default.  Not that it *be* the default,
>but that it *be there* as a default.

This is about what I was to propose, indeed.


Andrew McMillan dixit:

>Once this minimum step is made, and we've all calmed down, we can think
>further on radical and dramatic changes over coming years where more
>significant shifts are made, like:
>
>* The default locale at installation is C.UTF-8 rather than C.

That would be nice.

>* If a locale is set which doesn't specify an encoding, the system
>defaults to assuming UTF-8.


Andrew McMillan dixit:

>[...] and indeed Steve
>Langasek has already suggested a seemingly reasonable workaround for the
>immediate problem which was, funnily enough, that mksh wants to have a
>UTF-8 locale *available* in order for it to *test the build*...

Yes, his suggestion and searching for someone to actually use it
(Daniel Jacobowitz does) helped that part of the problem. However,
the mksh regression test suite is only one of the manifestations.
Even as a mere user, I'd like to have, see above, a UTF-8 locale
available and, if possible, default. Well, maybe not a UTF-8 locale,
just UTF-8 encoding (especially when I ssh from a MirBSD system to
a Debian system, since on MirBSD there is *only* UTF-8¹), but glibc
defines encodings exclusively via locales, which is why I'm in fa-
vour of C.UTF-8 for myself, but setting LC_CTYPE only has the same
effect (and I often set LC_MESSAGES to en_GB.UTF-8 for gcc's bene-
fit).


Giacomo A. Catenazzi dixit:

> "Debian will use as default unicode, encoded according to UTF-8", but
> not *assume*.  It is again portability.

I agree too. You cannot simply assume things.

> Let (old) programs to works
> also on the future Debian.

These need to export LC_ALL=C already, since you've been able to
choose a locale in d-i for a while, so no change there.


bye,
//mirabilos
-- 
23:22⎜«mikap:#grml» mirabilos: und dein bootloader ist geil :)
23:29⎜«mikap:#grml» und ich finds saugeil dass ich ein bsd zum booten mit
     ⎜  grml hab, das muss ich dann gleich mal auf usb-stick installieren
-- Michael Prokop von grml.org über MirGRML und MirOS bsd4grml




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Wed, 08 Apr 2009 21:30:08 GMT) Full text and rfc822 format available.

Acknowledgement sent to Andrew McMillan <andrew@morphoss.com>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Wed, 08 Apr 2009 21:30:08 GMT) Full text and rfc822 format available.

Message #187 received at 522776@bugs.debian.org (full text, mbox):

From: Andrew McMillan <andrew@morphoss.com>
To: "Giacomo A. Catenazzi" <cate@debian.org>
Cc: 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Thu, 09 Apr 2009 09:24:32 +1200
On Wed, 2009-04-08 at 15:31 +0200, Giacomo A. Catenazzi wrote:
> 
> We have the same objective, but two different ways.

Indeed, but it seems to me that you are pushing for a much bigger change
than I am.

So the smallest step which is in the same direction both of us want to
go, is for *a* UTF-8 locale to be *available* on all Debian systems,
which is what is being proposed by this bug.


Cheers,
					Andrew.

------------------------------------------------------------------------
andrew (AT) morphoss (DOT) com                            +64(272)DEBIAN
                       Just to have it is enough.
------------------------------------------------------------------------






Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Thu, 09 Apr 2009 08:12:07 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Giacomo A. Catenazzi" <cate@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Thu, 09 Apr 2009 08:12:07 GMT) Full text and rfc822 format available.

Message #192 received at 522776@bugs.debian.org (full text, mbox):

From: "Giacomo A. Catenazzi" <cate@debian.org>
To: 522776@bugs.debian.org
Cc: debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Thu, 09 Apr 2009 10:08:05 +0200
Thorsten Glaser wrote:
> Giacomo A. Catenazzi dixit:

> I think you misunderstand the mksh part of the problem.
> 
> mksh has two modi: a legacy mode, in which it does not make any
> assumptions about charsets or encodings and is 8-bit clean and
> mostly 8-bit transparent, safe a few mostly past bugs and imple-
> mentation shortcomings, and a unicode mode, in which it assumes
> its input is UTF-8 (although, with ^V, you can still enter non-
> UTF-8 sequences, and tabcomplete filenames in legacy encodings
> as well). The unicode mode is enabled with "mksh -U" or "set -U".
> However, mksh has a feature which automatically enables the uni-
> code mode if
> - the current CODESET is UTF-8 (or the locale ends in .utf8 or
>   .UTF-8 or something similar, in some cases), or
> - the input begins with a UTF-8 BOM.

This is good way to do things!

> 
> The regression test suite merely checks for this feature. To do
> so, it needs a way to set the checked mksh process' CODESET to
> UTF-8, which is only possible by setting a non-C/POSIX locale.

This means that we make few automatic regression tests ;-)

But so, the UTF-8 requirement are a lot narrow than the
rest of discussion.

I think that we should provide some package that give pbuilder
environment a UTF-8 environment. Or a debhelper (or like) utility
that "construct it" for build needs.

In this case "us_EN.UTF-8" is a sensible locale (we want to test
a real locale), but in this case I would also test some UTF-16
or Asian locale (mksh should not assume UTF-8 in these cases).

You had already a solution (but embedding in a standard utility
is IMHO better, which hide the complexity, and show direct what
you need).  BTW the locale could be also a pathname, so
no root power needs (i.e. for other tests in user gleba).


> Andrew McMillan dixit:
> 
>> The proposal, at this stage is only that the C.UTF-8 locale is
>> *installed* and *available* by default.  Not that it *be* the default,
>> but that it *be there* as a default.
> 
> This is about what I was to propose, indeed.

I agree that we must provide by default also a UTF-8, but I don't
like "C.UTF-8".  A solution: force all locales to have also the
UTF-8 "brother", and force installation of such locale when user
choose (at installation time) a non UTF-8 locale.

"C" is not offered at installation time (but IIRC KDE offered
at first run, some versions ago).

For building env I prefer a "us_EN.UTF-8" (we need English to
read logs) or build when needed (better because probably
we need other locales to test, and probably some packages
needs some Asian locale for building/testing)



> Andrew McMillan dixit:
> 
>> Once this minimum step is made, and we've all calmed down, we can think
>> further on radical and dramatic changes over coming years where more
>> significant shifts are made, like:
>>
>> * The default locale at installation is C.UTF-8 rather than C.
> 
> That would be nice.

C is not the default locale. "en_US.UTF-8" is the default
(d-i of lenny, pressing only ENTERs).


> Andrew McMillan dixit:
> 
>> [...] and indeed Steve
>> Langasek has already suggested a seemingly reasonable workaround for the
>> immediate problem which was, funnily enough, that mksh wants to have a
>> UTF-8 locale *available* in order for it to *test the build*...
> 
> Yes, his suggestion and searching for someone to actually use it
> (Daniel Jacobowitz does) helped that part of the problem. However,
> the mksh regression test suite is only one of the manifestations.
> Even as a mere user, I'd like to have, see above, a UTF-8 locale
> available and, if possible, default. Well, maybe not a UTF-8 locale,
> just UTF-8 encoding (especially when I ssh from a MirBSD system to
> a Debian system, since on MirBSD there is *only* UTF-8¹), but glibc
> defines encodings exclusively via locales, which is why I'm in fa-
> vour of C.UTF-8 for myself, but setting LC_CTYPE only has the same
> effect (and I often set LC_MESSAGES to en_GB.UTF-8 for gcc's bene-
> fit).

But your case is very specific (to building package). And
in these case we want a minimal build environment.
Additionally it is for testing purpose, so you test UTF-8,
other package maybe needs other locales.

Anyway I agree that a UTF-8 locale could be installed by default
(also on pbuilder), but I we need also a locale utility for
debian/rules, and that user has the right UTF-8 locale
(so for a generic user, not C.UTF-8, but xz_YW.UTF-8,
if is normally using xz_YW)

ciao
	cate




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Thu, 09 Apr 2009 08:27:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Thu, 09 Apr 2009 08:27:06 GMT) Full text and rfc822 format available.

Message #197 received at 522776@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@mirbsd.de>
To: "Giacomo A. Catenazzi" <cate@debian.org>, 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Thu, 9 Apr 2009 08:21:45 +0000 (UTC)
Giacomo A. Catenazzi dixit:

> This is good way to do things!

Thanks.

> Or a debhelper (or like) utility
> that "construct it" for build needs.

That’s already done, as I said – vorlon gave me an idea, I implemented
it, it works, I uploaded a new mksh package… and then I saw someone’s
added it to the D-D-R since I last looked into it…

> In this case "us_EN.UTF-8" is a sensible locale (we want to test

It’s “en_US.UTF-8” by the way.

> a real locale), but in this case I would also test some UTF-16
> or Asian locale (mksh should not assume UTF-8 in these cases).

It doesn’t. This test is already run for the C locale.
Besides, there are no UTF-16 or somesuch locales on UNIX® or
compatible systems.

//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
	-- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Thu, 09 Apr 2009 09:30:05 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Giacomo A. Catenazzi" <cate@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Thu, 09 Apr 2009 09:30:09 GMT) Full text and rfc822 format available.

Message #202 received at 522776@bugs.debian.org (full text, mbox):

From: "Giacomo A. Catenazzi" <cate@debian.org>
To: Thorsten Glaser <tg@mirbsd.de>, 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Thu, 09 Apr 2009 11:24:42 +0200
Thorsten Glaser wrote:
> Giacomo A. Catenazzi dixit:

>> a real locale), but in this case I would also test some UTF-16
>> or Asian locale (mksh should not assume UTF-8 in these cases).
> 
> It doesn’t. This test is already run for the C locale.
> Besides, there are no UTF-16 or somesuch locales on UNIX® or
> compatible systems.

Yes, right. ASCII-7 characters need to be encoded as a single
char (octet), with values between 1 and 127, but not necessarily
with ASCII values. With a quick look, it seems that all locales
implement are ASCII compatible charsets, which is also very
nice for filename portability (also between users and system).

Recently there was a short discussion in POSIX about locales
which code "/" in a non stanrdard place, thus creating a lot
of problems (also security related), but this is an other
story.

ciao
	cate





Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Thu, 07 May 2009 12:21:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Joey Hess <joeyh@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Thu, 07 May 2009 12:21:02 GMT) Full text and rfc822 format available.

Message #207 received at 522776@bugs.debian.org (full text, mbox):

From: Joey Hess <joeyh@debian.org>
To: 522776@bugs.debian.org, Debian Policy List <debian-policy@lists.debian.org>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Thu, 7 May 2009 08:18:55 -0400
[Message part 1 (text/plain, inline)]
FWIW, the installation-locale udeb provides a C.UTF-8 locale,
which d-i runs under. Takes about 168k.

-- 
see shy jo
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 27 Nov 2009 02:12:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Albert Cahalan <acahalan@gmail.com>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 27 Nov 2009 02:12:03 GMT) Full text and rfc822 format available.

Message #212 received at 522776@bugs.debian.org (full text, mbox):

From: Albert Cahalan <acahalan@gmail.com>
To: 522776@bugs.debian.org, tg@mirbsd.de, vorlon@debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Thu, 26 Nov 2009 21:06:38 -0500
Steve Langasek writes:
> On Mon, Apr 06, 2009 at 05:33:35PM +0000, Thorsten Glaser wrote:

>> If you need a specific locale (as seems from "mksh", not
>> sure if it is a bug in that program), you need to set it.
>>
>> You can only set a locale on a glibc-based system if it's
>> installed beforehand, which root needs to do.

This is of course a horrid bug. I'm fighting it right now.
I install a zam.mo file, nothing else, and I damn well expect
that file to get used for messages! Obviously, it's UTF-8.
Obviously, I expect towupper() to follow Unicode defaults.

> You can build-depend on the locales package and generate the locales
> you want locally, using LOCPATH to reference them.  There's no need
> for Debian to guarantee the presence of a particular locale ahead of
> time - particularly one that isn't actually useful to end users,
> as C.UTF-8 would be.

Unless plain "C" goes UTF-8, that's exactly the locale I need.
The stupid broken en_US.UTF-8 fucks up the sort order.

Granted, fixing en_US.UTF-8 would be sweet, but it may be far too late.

We really need a do-nothing locale that follows the Unicode spec
using the UTF-8 encoding. We could also use a do-nothing locale
that follows the Unicode spec using the Latin-1 encoding.




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 27 Nov 2009 03:21:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Albert Cahalan <acahalan@gmail.com>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 27 Nov 2009 03:21:03 GMT) Full text and rfc822 format available.

Message #217 received at 522776@bugs.debian.org (full text, mbox):

From: Albert Cahalan <acahalan@gmail.com>
To: 522776@bugs.debian.org, rleigh@codelibre.net, dato@net.com.org.es, tg@mirbsd.de
Subject: Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Thu, 26 Nov 2009 22:16:41 -0500
Roger Leigh writes:
> On Tue, Apr 07, 2009 at 09:24:38PM +0200, Adeodato Simó wrote:
>> + Thorsten Glaser (Tue, 07 Apr 2009 18:54:59 +0000):

>>> Except the ton which sets LC_ALL=C to get sane (parsable,
>>> dependable, historically compatible) output.
>>
>>> These would then unset all other LC_* and LANG and LANGUAGE,
>>> and only set LC_CTYPE to C.UTF-8 to get "old" behaviour but
>>> with UTF-8 (and mbrtowc and iswctype and and and) available.
>>
>> Isn't setting LC_ALL=C.UTF-8 going to be about the same and less work?
>> I'm genuinely interested if that would behave any different to what you
>> said (unsetting all, setting LC_CTYPE).
>
> % sudo localedef -c -i POSIX -f UTF-8 C.UTF-8
>
> % LANG=C.UTF8 locale charmap
> UTF-8
>
> % LANG=C locale charmap
> ANSI_X3.4-1968
>
> This appears to work correctly at first glance.
>
> However, I would ideally like the C/POSIX locales to be UTF-8
> by default as on other systems (with a C.ASCII variant if required).

By far the most critical thing is that the <wctype.h> functions
work in the normal Unicode manner, with wchar_t assumed to be
purely Unicode. This means iswupper() works, towupper() works, etc.

This applies for locales called "", "C", and "some-unknown-junk".
The only possible exception would be when there are environment
variables set which are known to need something else. Unrecognized
locales and all other defaults have to support full Unicode.

Note that none of the above necessarily requires UTF-8, though UTF-8
seems desirable. You could use Latin-1 and still have wchar_t work.
This could all be configurable of course. Suppose /etc/locale had:

"" UTF-8        # setlocale with "" and no environment variables
"C" Latin-1     # if the "C" locale is specifically requested
unknown UTF-8   # if we don't recognize the locale
broken UTF-8    # if parts of the locale info are missing/broken

Right now, gettext doesn't even distinguish those cases. This could
be considered part of the problem. When I put a zam.mo file (Zapotec)
in the right place and set LC_ALL to "zam", I get the "C" locale!!!
Any imperfection in a locale results in "C", as ASCII as can be.




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 27 Nov 2009 04:03:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Albert Cahalan <acahalan@gmail.com>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 27 Nov 2009 04:03:03 GMT) Full text and rfc822 format available.

Message #222 received at 522776@bugs.debian.org (full text, mbox):

From: Albert Cahalan <acahalan@gmail.com>
To: 522776@bugs.debian.org, cate@debian.org, andrew@morphoss.com
Subject: Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Thu, 26 Nov 2009 22:56:20 -0500
Andrew McMillan writes:
> On Wed, 2009-04-08 at 10:15 +0200, Giacomo A. Catenazzi wrote:

>> So I've a question: what does UTF-8 mean in this context (C.UTF-8) ?
...
> So given a character which is outside of the 0x00 <= 0x7f range, in an
> environment which does not specify an encoding, I would like to one day
> be able to categorically state that "Debian will by default assume that
> character is unicode, encoded according to UTF-8".

Damn right. The obscure languages of the world are numerous. Unlike
the languages of countries that were wealthy enough to participate
in native-language computing prior to UTF-8, these less-popular
languages are getting done in UTF-8. We mostly aren't inventing
new incompatible encodings.

> In such an environment, with a C.UTF-8 encoding selected, when I start a
> word processing program and insert an a-umlaut in there, I would expect
> that my file will be written with a UTF-8 encoded unicode character in
> it.  I would not expect that if I sort the lines in that file, that the
> lines beginning with a-umlaut would sort before 'z'.

Right...

> I would not expect
> that if I grep such a file for '^[[:alpha:]]$' that my a-umlaut line
> would appear.

No. It's a letter in the Unicode spec.

> The proposal, at this stage is only that the C.UTF-8 locale is
> *installed* and *available* by default.  Not that it *be* the default,
> but that it *be there* as a default. People will naturally continue to
> be free to uninstall it, or to leave their locale to 'C'.

What if you don't set your locale to anything, or if you set it
to something that isn't recognized? You should get UTF-8 in any
of those cases.

The mechanism isn't so important. It could be that the fallback
locale used by gettext is no longer "C" (perhaps "C.UTF-8"), or it
could be that the "C" locale does UTF-8.

LC_ALL=pirate  -->  you get UTF-8, with messages from pirate.mo

> Yes, I think that the C.UTF-8 locale offers something different that the
> C locale doesn't.  Primarily it offers us a way out of the current
> default encodings which are legacy encodings, without jumping boots and
> all into a world where suddenly our sort ordering is changed, and our
> users are screaming at us that en_US.UTF-8 is wrong for *them*, or that
> 'sort' is suddenly putting 'A' next to 'a' and all of their legacy shell
> scripts expect are broken because they expect a different behaviour.

> I believe that the list above might be the set of smallest useful
> incremental changes in this process.  I would really like to see that
> second step taken too, where the default locale is set to the most basic
> UTF-8 locale possible, but I'm happy to see a second bug and further
> discussion, if that's what we need to do to get agreement.

There are different meanings of "default".

By default, the locale should not be set in the environment.
That should give UTF-8. It could map to "C", "C.UTF-8", "(nil)",
or whatever.

>> I still think that "en_US.UTF-8" is the right default (note:
>> I'm not a US citizen, nor I speak English).

As a US citizen who does speak English, I guess I'm an authority
on the en_US.UTF-8 locale. It is offensively defective. It sorts
stuff in a crazy order designed by some moronic committee.
I doubt it even accepts Cyrillic and Korean as having letters.




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 27 Nov 2009 04:18:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Albert Cahalan <acahalan@gmail.com>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 27 Nov 2009 04:18:03 GMT) Full text and rfc822 format available.

Message #227 received at 522776@bugs.debian.org (full text, mbox):

From: Albert Cahalan <acahalan@gmail.com>
To: 522776@bugs.debian.org, andrew@morphoss.com, cate@debian.org
Subject: Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Thu, 26 Nov 2009 23:13:21 -0500
Giacomo A. Catenazzi writes:
> [Andrew McMillan probably]

> I think nobody should use "C" or "C.UTF-8" as user encoding.
> And I really hope that Debian will try to convince user to
> use a proper locale.

Debian doesn't ship a proper locale. I want sorting according
to the raw Unicode values. I want iswprint() to return non-zero
for a Cyrillic character, a Korean character, etc.

Debian shouldn't be setting locale-related environment variables
unless the user specifically chooses. The implementation-specific
defaults, applied in the absense of any environment variables,
should support Unicode.

>> * All ISO8859 locales are moved to a new locales-legacy-encodings
>> package.
>
> This encoding is used also on CD/, floppy, remote filesystems,
> USB pens, on a lot of internet pages, etc.

Nope.

It's actually UTF-16 in VFAT, Joliet, CIFS, and so on. Linux has
mount options to control how that gets make POSIX-compatible.
You can choose UTF-8. (this should be Debian's default)

> But an ASCII7 "C" encoding allow you to do the same things. It doesn't
> forbid 8 bit characters (thus UTF-8). Unix is transparent on characters
> (i.e. binary and text are the same, you can grep binaries, ...).
>
> So scripts should use LANG=C on most cases.

That leaves iswprint() and towupper() broken. (not that it must)




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 27 Nov 2009 11:03:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 27 Nov 2009 11:03:03 GMT) Full text and rfc822 format available.

Message #232 received at 522776@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@mirbsd.de>
To: 522776@bugs.debian.org
Subject: Re: Bug#522776: Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Fri, 27 Nov 2009 10:46:19 +0000 (UTC)
Albert Cahalan dixit:

>Giacomo A. Catenazzi writes:

>> I think nobody should use "C" or "C.UTF-8" as user encoding.

I’d use it.

>Debian doesn't ship a proper locale. I want sorting according
>to the raw Unicode values.

Also called ASCIIbetically ☺ But C exists, C.UTF-8 doesn’t.

>>> * All ISO8859 locales are moved to a new locales-legacy-encodings
>>> package.
>>
>> This encoding is used also on CD/, floppy, remote filesystems,
>> USB pens, on a lot of internet pages, etc.
>
>Nope.
>
>It's actually UTF-16 in VFAT, Joliet, CIFS, and so on.

And cp437 (or, worse, cp850) in FAT SFNs.

>> So scripts should use LANG=C on most cases.
>
>That leaves iswprint() and towupper() broken. (not that it must)

No, LANG is *also* wrong. Scripts relying on certain behaviour
use LC_ALL=C (and, on GNU OSes, also must “unset LANGUAGE”), but
some things just require UTF-8, so the current approach is to
unset everything beginning with LC_*, setting LANG=C (or unsetting
it) and LC_ALL=en_US.UTF-8 or en_GB.UTF-8 or whatever and hoping
that that locale is installed… not acceptable!

bye,
//mirabilos
-- 
16:47⎜«mika:#grml» .oO(mira ist einfach gut....)      23:22⎜«mikap:#grml»
mirabilos: und dein bootloader ist geil :)    23:29⎜«mikap:#grml» und ich
finds saugeil dass ich ein bsd zum booten mit grml hab, das muss ich dann
gleich mal auf usb-stick installieren	-- Michael Prokop über MirOS bsd4grml




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 27 Nov 2009 11:03:04 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 27 Nov 2009 11:03:04 GMT) Full text and rfc822 format available.

Message #237 received at 522776@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@mirbsd.de>
Cc: 522776@bugs.debian.org
Subject: Re: Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Fri, 27 Nov 2009 10:39:25 +0000 (UTC)
Albert Cahalan dixit:

>Any imperfection in a locale results in "C", as ASCII as can be.

Yes, and "C" shall not imply latin1 but 7-bit ASCII but 8-bit
transparent.

//mirabilos
-- 
Sometimes they [people] care too much: pretty printers [and syntax highligh-
ting, d.A.] mechanically produce pretty output that accentuates irrelevant
detail in the program, which is as sensible as putting all the prepositions
in English text in bold font.	-- Rob Pike in "Notes on Programming in C"




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 27 Nov 2009 11:03:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 27 Nov 2009 11:03:06 GMT) Full text and rfc822 format available.

Message #242 received at 522776@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@mirbsd.de>
To: 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Fri, 27 Nov 2009 10:37:37 +0000 (UTC)
Albert Cahalan dixit:

>Unless plain "C" goes UTF-8

Not going to happen, it’s not binary-safe. (I fought that in
MirBSD with the OPTU-8/16 encoding scheme.)

>The stupid broken en_US.UTF-8 fucks up the sort order.

So true… (and paper size!)

>We really need a do-nothing locale that follows the Unicode spec
>using the UTF-8 encoding.

Yes, my proposal exactly.

>We could also use a do-nothing locale
>that follows the Unicode spec using the Latin-1 encoding.

No, for two reasons:
① legacy encodings must die
② then you need one for EVERY legacy encoding (why special-case one?)

bye,
//mirabilos
-- 
Sometimes they [people] care too much: pretty printers [and syntax highligh-
ting, d.A.] mechanically produce pretty output that accentuates irrelevant
detail in the program, which is as sensible as putting all the prepositions
in English text in bold font.	-- Rob Pike in "Notes on Programming in C"




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Tue, 01 Dec 2009 16:51:04 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Giacomo A. Catenazzi" <cate@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Tue, 01 Dec 2009 16:51:04 GMT) Full text and rfc822 format available.

Message #247 received at 522776@bugs.debian.org (full text, mbox):

From: "Giacomo A. Catenazzi" <cate@debian.org>
To: Thorsten Glaser <tg@mirbsd.de>, 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Tue, 01 Dec 2009 17:42:39 +0100
Thorsten Glaser wrote:
> Albert Cahalan dixit:
> 
>> Unless plain "C" goes UTF-8
> 
> Not going to happen, it’s not binary-safe. (I fought that in
> MirBSD with the OPTU-8/16 encoding scheme.)

Why not? Note that usual functions work on bytes, not on characters, and 
on POSIX utilities the old/classical options work on bytes by default. 
POSIX introduced new options for characters. E.g. the -c in 'wc' means 
really bytes, not characters (which is given by -m). Not so logical, but
compatible with the expected old behaviour.

POSIX was discussing if is is "legal" to have a UTF-8 POSIX/C locale.
IIRC the doubts was about the language in the standard, not about real 
problems. OTOH they acknowledged that real bugs could appear.

OTOH I use by default the UTF-8 locale, because I don't expect that 
Debian will corrupt my data. And I think system utilities will do
the right things with locale.


I start to think that moving C to UTF-8 will be the real simpler and
faster way to *hide* most of the encoding bugs.

ciao
	cate





Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Tue, 01 Dec 2009 18:36:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Tue, 01 Dec 2009 18:36:06 GMT) Full text and rfc822 format available.

Message #252 received at 522776@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@mirbsd.de>
To: 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Tue, 1 Dec 2009 18:28:06 +0000 (UTC)
Giacomo A. Catenazzi dixit:

>> Not going to happen, it’s not binary-safe. (I fought that in
>> MirBSD with the OPTU-8/16 encoding scheme.)
>
> Why not? Note that usual functions work on bytes

Not really.

The difference between 'tr u x' on binary files can, depending on
the implementation of tr (if it does 'tr ¥ €' correctly in an UTF-8
locale), trash it because it must use mbsrtowcs then, which is, by
POSIX, required to fail for non-representable strings.

In MirBSD, we have solved that by clever use of the PUA.

//mirabilos
-- 
Sometimes they [people] care too much: pretty printers [and syntax highligh-
ting, d.A.] mechanically produce pretty output that accentuates irrelevant
detail in the program, which is as sensible as putting all the prepositions
in English text in bold font.	-- Rob Pike in "Notes on Programming in C"




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Thu, 02 Sep 2010 22:48:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Samuel Thibault <sthibault@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Thu, 02 Sep 2010 22:48:03 GMT) Full text and rfc822 format available.

Message #257 received at 522776@bugs.debian.org (full text, mbox):

From: Samuel Thibault <sthibault@debian.org>
To: Thorsten Glaser <tg@mirbsd.de>, 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Fri, 3 Sep 2010 00:44:53 +0200
Hello,

No news on this?

Hurd's console needs a UTF-8 locale to be able to use wcwidth() for
proper double-width support.

Note: debian-installer is already providing a C.UTF-8 locale to d-i
components, so it works there.

Samuel




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Thu, 02 Sep 2010 22:57:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Russ Allbery <rra@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Thu, 02 Sep 2010 22:57:03 GMT) Full text and rfc822 format available.

Message #262 received at 522776@bugs.debian.org (full text, mbox):

From: Russ Allbery <rra@debian.org>
To: Samuel Thibault <sthibault@debian.org>
Cc: 522776@bugs.debian.org, Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Thu, 02 Sep 2010 15:53:50 -0700
Samuel Thibault <sthibault@debian.org> writes:

> No news on this?

> Hurd's console needs a UTF-8 locale to be able to use wcwidth() for
> proper double-width support.

> Note: debian-installer is already providing a C.UTF-8 locale to d-i
> components, so it works there.

Does libc in Debian provide a C.UTF-8 locale?  I think that's a
prerequisite for doing anything in Policy, no?

-- 
Russ Allbery (rra@debian.org)               <http://www.eyrie.org/~eagle/>




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Thu, 02 Sep 2010 23:06:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Samuel Thibault <sthibault@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Thu, 02 Sep 2010 23:06:03 GMT) Full text and rfc822 format available.

Message #267 received at 522776@bugs.debian.org (full text, mbox):

From: Samuel Thibault <sthibault@debian.org>
To: Russ Allbery <rra@debian.org>
Cc: 522776@bugs.debian.org, Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Fri, 3 Sep 2010 01:05:10 +0200
Russ Allbery, le Thu 02 Sep 2010 15:53:50 -0700, a écrit :
> Samuel Thibault <sthibault@debian.org> writes:
> > No news on this?
> 
> > Hurd's console needs a UTF-8 locale to be able to use wcwidth() for
> > proper double-width support.
> 
> > Note: debian-installer is already providing a C.UTF-8 locale to d-i
> > components, so it works there.
> 
> Does libc in Debian provide a C.UTF-8 locale?

It doesn't yet but it's easy to do, that's not the question.  See the
questions in the bug thread.

> I think that's a prerequisite for doing anything in Policy, no?

Probably, but before doing anything in libc we need to decide what to
do.

Samuel




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Thu, 02 Sep 2010 23:09:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Russ Allbery <rra@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Thu, 02 Sep 2010 23:09:06 GMT) Full text and rfc822 format available.

Message #272 received at 522776@bugs.debian.org (full text, mbox):

From: Russ Allbery <rra@debian.org>
To: Samuel Thibault <sthibault@debian.org>
Cc: 522776@bugs.debian.org, Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Thu, 02 Sep 2010 16:07:25 -0700
Samuel Thibault <sthibault@debian.org> writes:
> Russ Allbery, le Thu 02 Sep 2010 15:53:50 -0700, a écrit :

>> Does libc in Debian provide a C.UTF-8 locale?

> It doesn't yet but it's easy to do, that's not the question.  See the
> questions in the bug thread.

>> I think that's a prerequisite for doing anything in Policy, no?

> Probably, but before doing anything in libc we need to decide what to
> do.

Ah, then no, in that case there has been no progress.  I don't believe
anyone is currently working on this.

-- 
Russ Allbery (rra@debian.org)               <http://www.eyrie.org/~eagle/>




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Thu, 02 Sep 2010 23:15:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Samuel Thibault <sthibault@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Thu, 02 Sep 2010 23:15:06 GMT) Full text and rfc822 format available.

Message #277 received at 522776@bugs.debian.org (full text, mbox):

From: Samuel Thibault <sthibault@debian.org>
To: Russ Allbery <rra@debian.org>
Cc: 522776@bugs.debian.org, Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Fri, 3 Sep 2010 01:12:44 +0200
Russ Allbery, le Thu 02 Sep 2010 16:07:25 -0700, a écrit :
> Samuel Thibault <sthibault@debian.org> writes:
> > Russ Allbery, le Thu 02 Sep 2010 15:53:50 -0700, a écrit :
> 
> >> Does libc in Debian provide a C.UTF-8 locale?
> 
> > It doesn't yet but it's easy to do, that's not the question.  See the
> > questions in the bug thread.
> 
> >> I think that's a prerequisite for doing anything in Policy, no?
> 
> > Probably, but before doing anything in libc we need to decide what to
> > do.
> 
> Ah, then no, in that case there has been no progress.  I don't believe
> anyone is currently working on this.

Well, no work is needed, what is needed is to agree on what work to do.

Samuel




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Thu, 02 Sep 2010 23:27:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Russ Allbery <rra@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Thu, 02 Sep 2010 23:27:03 GMT) Full text and rfc822 format available.

Message #282 received at 522776@bugs.debian.org (full text, mbox):

From: Russ Allbery <rra@debian.org>
To: Samuel Thibault <sthibault@debian.org>
Cc: 522776@bugs.debian.org, Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Thu, 02 Sep 2010 16:24:56 -0700
Samuel Thibault <sthibault@debian.org> writes:
> Russ Allbery, le Thu 02 Sep 2010 16:07:25 -0700, a écrit :

>> Ah, then no, in that case there has been no progress.  I don't believe
>> anyone is currently working on this.

> Well, no work is needed, what is needed is to agree on what work to do.

That's work.  :)

Generally what that means is that someone needs to digest the discussion
in the thread and the technical requirements into a concrete proposal for
what Policy should say and then send that to this bug for discussion.
After that discussion concludes, they should then either propose a patch
or get someone else to write wording to propose a patch and ask for
seconds.  It will then go in to the next version of Policy if there are at
least three DD supporters and no objections.

-- 
Russ Allbery (rra@debian.org)               <http://www.eyrie.org/~eagle/>




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Thu, 02 Sep 2010 23:39:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Samuel Thibault <sthibault@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Thu, 02 Sep 2010 23:39:06 GMT) Full text and rfc822 format available.

Message #287 received at 522776@bugs.debian.org (full text, mbox):

From: Samuel Thibault <sthibault@debian.org>
To: Russ Allbery <rra@debian.org>
Cc: 522776@bugs.debian.org, Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Fri, 3 Sep 2010 01:37:24 +0200
Russ Allbery, le Thu 02 Sep 2010 16:24:56 -0700, a écrit :
> Generally what that means is that someone needs to digest the discussion
> in the thread

Well, it's mostly

- some people saying "it's useless",
- while other people saying "I need it",

and also

- "en_US.UTF-8 is just fine" vs.
- "en_US.UTF-8 sucks, we really need C.UTF-8 instead"

without any convergence.

Samuel




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Thu, 02 Sep 2010 23:51:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Russ Allbery <rra@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Thu, 02 Sep 2010 23:51:02 GMT) Full text and rfc822 format available.

Message #292 received at 522776@bugs.debian.org (full text, mbox):

From: Russ Allbery <rra@debian.org>
To: Samuel Thibault <sthibault@debian.org>
Cc: 522776@bugs.debian.org, Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Thu, 02 Sep 2010 16:46:43 -0700
Samuel Thibault <sthibault@debian.org> writes:

> Well, it's mostly

> - some people saying "it's useless",
> - while other people saying "I need it",

> and also

> - "en_US.UTF-8 is just fine" vs.
> - "en_US.UTF-8 sucks, we really need C.UTF-8 instead"

> without any convergence.

I think the way to get past that is to make a specific proposal.

With my Lintian maintainer hat on, I need a UTF-8 locale that's guaranteed
to always be available.  Right now, we're doing something complicated and
annoying (and fragile on Ubuntu) to generate one on the fly (en_US.UTF-8
just because it's probably always there), and we would love to stop doing
that.

I agree with others in this thread that having a UTF-8 locale without the
collation changes implied by en_US is very useful for various software
packages such as automated test suites that want reproducible results and
were originally written for the C locale.

-- 
Russ Allbery (rra@debian.org)               <http://www.eyrie.org/~eagle/>




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 03 Sep 2010 13:18:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 03 Sep 2010 13:18:03 GMT) Full text and rfc822 format available.

Message #297 received at 522776@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@debian.org>
To: Russ Allbery <rra@debian.org>
Cc: Samuel Thibault <sthibault@debian.org>, 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Fri, 3 Sep 2010 13:02:31 +0000 (UTC)
Russ Allbery dixit:

>I agree with others in this thread that having a UTF-8 locale without the
>collation changes implied by en_US is very useful for various software
>packages such as automated test suites that want reproducible results and
>were originally written for the C locale.

Same for testsuites that are written for UTF-8 but don’t care about
anything other than LC_CTYPE. And for people to whom en_US.UTF-8 is
too fat or “politically incorrect” (though the latter is usually be
fixed by en_GB.UTF-8 which has metric and ISO A4 paper) and others,
like apparently Hurd.

To me, strictly spoken, it doesn’t matter which one as long as there
is one, for the mksh testsuite, but as user, being able to run a
command with 'env LC_ALL=C.UTF-8 foo' on a “hostile” system (e.g.
my cow-orkers insist on installing systems in German *shudder*)
simply rocks.

If nobody beats me, I’ll digest-and-write-a-proposal as suggested.

bye,
//mirabilos
-- 
I believe no one can invent an algorithm. One just happens to hit upon it
when God enlightens him. Or only God invents algorithms, we merely copy them.
If you don't believe in God, just consider God as Nature if you won't deny
existence.		-- Coywolf Qi Hunt




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 03 Sep 2010 13:33:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Giacomo A. Catenazzi" <cate@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 03 Sep 2010 13:33:03 GMT) Full text and rfc822 format available.

Message #302 received at 522776@bugs.debian.org (full text, mbox):

From: "Giacomo A. Catenazzi" <cate@debian.org>
To: Russ Allbery <rra@debian.org>, 522776@bugs.debian.org
Cc: Samuel Thibault <sthibault@debian.org>, Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Fri, 03 Sep 2010 15:26:47 +0200
On 03.09.2010 01:46, Russ Allbery wrote:
> Samuel Thibault<sthibault@debian.org>  writes:
>
>> Well, it's mostly
>
>> - some people saying "it's useless",
>> - while other people saying "I need it",
>
>> and also
>
>> - "en_US.UTF-8 is just fine" vs.
>> - "en_US.UTF-8 sucks, we really need C.UTF-8 instead"
>
>> without any convergence.
>
> I think the way to get past that is to make a specific proposal.
>
> With my Lintian maintainer hat on, I need a UTF-8 locale that's guaranteed
> to always be available.  Right now, we're doing something complicated and
> annoying (and fragile on Ubuntu) to generate one on the fly (en_US.UTF-8
> just because it's probably always there), and we would love to stop doing
> that.
>
> I agree with others in this thread that having a UTF-8 locale without the
> collation changes implied by en_US is very useful for various software
> packages such as automated test suites that want reproducible results and
> were originally written for the C locale.

BTW I think we should wait some more time. Last week I was on 
debian-glibc list a bug: printf fails if it find an invalid UTF-8
character (when the locale uses UTF-8). Note it is allowed in POSIX,
which distinguish raw strings and parts which uses locale definitions.
So I don't think a C.UTF-8 is safe.
But a good release goal for squeeze+1.

ciao
	cate




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 03 Sep 2010 13:45:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Samuel Thibault <sthibault@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 03 Sep 2010 13:45:03 GMT) Full text and rfc822 format available.

Message #307 received at 522776@bugs.debian.org (full text, mbox):

From: Samuel Thibault <sthibault@debian.org>
To: Thorsten Glaser <tg@debian.org>
Cc: Russ Allbery <rra@debian.org>, 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Fri, 3 Sep 2010 15:43:13 +0200
Thorsten Glaser, le Fri 03 Sep 2010 13:02:31 +0000, a écrit :
> Russ Allbery dixit:
> >I agree with others in this thread that having a UTF-8 locale without the
> >collation changes implied by en_US is very useful for various software
> >packages such as automated test suites that want reproducible results and
> >were originally written for the C locale.
> 
> Same for testsuites that are written for UTF-8 but don’t care about
> anything other than LC_CTYPE.

A sequence of remarks here: one could think that it'd be just enough to
unset LC_ALL and set LC_CTYPE to achieve the same.  However, even
LC_CTYPE has differences between locales, transliterations notably.  For
the transliterations alone we'd probably better go with a stable C.UTF-8
which doesn't depend on transliteration fixes in whichever locale would
be chosen to provide a UTF-8 variant.

> If nobody beats me, I’ll digest-and-write-a-proposal as suggested.

I'd say go on :)
(of course we'll need to wait for libc to provide the locale
(post-squeeze I guess) before changing the policy).

Samuel




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 03 Sep 2010 13:48:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Samuel Thibault <sthibault@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 03 Sep 2010 13:48:03 GMT) Full text and rfc822 format available.

Message #312 received at 522776@bugs.debian.org (full text, mbox):

From: Samuel Thibault <sthibault@debian.org>
To: "Giacomo A. Catenazzi" <cate@debian.org>
Cc: Russ Allbery <rra@debian.org>, 522776@bugs.debian.org, Thorsten Glaser <tg@mirbsd.de>, glibc@packages.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Fri, 3 Sep 2010 15:45:43 +0200
Giacomo A. Catenazzi, le Fri 03 Sep 2010 15:26:47 +0200, a écrit :
> BTW I think we should wait some more time. Last week I was on 
> debian-glibc list a bug: printf fails if it find an invalid UTF-8
> character (when the locale uses UTF-8). Note it is allowed in POSIX,
> which distinguish raw strings and parts which uses locale definitions.
> So I don't think a C.UTF-8 is safe.

It's not safe as a system default, yes.  But we're not talking about
making the system default a UTF-8 locale.  We're talking about providing
one for those packages which need it.  Such package should know what
they are doing already, and should probably actually prefer to get such
error properly.

> But a good release goal for squeeze+1.

I wasn't planning to push it for Squeeze actually, unless glibc people
think it's ok to add it.

Samuel




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 03 Sep 2010 13:57:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Roger Leigh <rleigh@codelibre.net>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 03 Sep 2010 13:57:03 GMT) Full text and rfc822 format available.

Message #317 received at 522776@bugs.debian.org (full text, mbox):

From: Roger Leigh <rleigh@codelibre.net>
To: Samuel Thibault <sthibault@debian.org>, 522776@bugs.debian.org
Cc: Russ Allbery <rra@debian.org>, Thorsten Glaser <tg@mirbsd.de>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Fri, 3 Sep 2010 14:52:39 +0100
[Message part 1 (text/plain, inline)]
On Fri, Sep 03, 2010 at 01:37:24AM +0200, Samuel Thibault wrote:
> Russ Allbery, le Thu 02 Sep 2010 16:24:56 -0700, a écrit :
> > Generally what that means is that someone needs to digest the discussion
> > in the thread
> 
> Well, it's mostly
> 
> - some people saying "it's useless",
> - while other people saying "I need it",
> 
> and also
> 
> - "en_US.UTF-8 is just fine" vs.
> - "en_US.UTF-8 sucks, we really need C.UTF-8 instead"
> 
> without any convergence.

I think reading back through the entire log, people who were initially
rather opposed to the proposal did come around once they appreciated
exactly what the changes would be, and why they were needed.  The
conversation was mostly constructive bar some initial misunderstandings
about what the changes actually meant--it did flesh out some of the
issues WRT standards conformance and what might break if the default
was changed, but this bug isn't really about the default, it's about
having a standard UTF-8 locale available.

Andrew Macmillan's message in
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=522776#167
is a rather good look at a summary of the issues and the
"big picture" behind the motives for changing.

Introducing a C.UTF-8 is a trivial change to make and does not
impact any existing software.  It doesn't mandate a specific
national locale, nor does it alter the existing C locale.  To quote:

"The proposal, at this stage is only that the C.UTF-8 locale is
*installed* and *available* by default.  Not that it *be* the default,
but that it *be there* as a default. People will naturally continue to
be free to uninstall it, or to leave their locale to 'C'."

There were no objections to having a UTF-8 locale installed and
available by default, just to it *being* the default.  Taking this
first small step is IMO important to do, preferably for squeeze if
possible.  Since it's a tiny one-liner change, this should be no
trouble in getting this done.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 03 Sep 2010 14:24:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Samuel Thibault <sthibault@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 03 Sep 2010 14:24:03 GMT) Full text and rfc822 format available.

Message #322 received at 522776@bugs.debian.org (full text, mbox):

From: Samuel Thibault <sthibault@debian.org>
To: Roger Leigh <rleigh@codelibre.net>
Cc: 522776@bugs.debian.org, Russ Allbery <rra@debian.org>, Thorsten Glaser <tg@mirbsd.de>, glibc@packages.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Fri, 3 Sep 2010 16:20:27 +0200
Roger Leigh, le Fri 03 Sep 2010 14:52:39 +0100, a écrit :
> On Fri, Sep 03, 2010 at 01:37:24AM +0200, Samuel Thibault wrote:
> > without any convergence.
> 
> I think reading back through the entire log,

Thanks for having done it!

> people who were initially
> rather opposed to the proposal did come around once they appreciated
> exactly what the changes would be, and why they were needed.

Ok.  There was still a question of en_US.UTF-8 vs C.UTF-8, but I believe
the "en_US.UTF-8 is fine enough" argument doesn't hold any more since
some other people say that it isn't for them.

> There were no objections to having a UTF-8 locale installed and
> available by default, just to it *being* the default.  Taking this
> first small step is IMO important to do, preferably for squeeze if
> possible.  Since it's a tiny one-liner change, this should be no
> trouble in getting this done.

I believe so too, I just didn't want to push it too much, but yes, I
believe that's something that shouldn't break Squeeze at all.

Samuel




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 03 Sep 2010 15:48:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Ben Finney <ben+debian@benfinney.id.au>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 03 Sep 2010 15:48:03 GMT) Full text and rfc822 format available.

Message #327 received at 522776@bugs.debian.org (full text, mbox):

From: Ben Finney <ben+debian@benfinney.id.au>
To: Roger Leigh <rleigh@codelibre.net>
Cc: 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Sat, 04 Sep 2010 01:46:13 +1000
[Message part 1 (text/plain, inline)]
Roger Leigh <rleigh@codelibre.net> writes:

> There were no objections to having a UTF-8 locale installed and
> available by default, just to it *being* the default.
[…]

Would a less confusing way to make this distinction be to say something
like: “The minimal Debian installation must have a locale available that
uses the UTF-8 character encoding.”?

That is, avoiding the “be there by default” versus “be the default”
confusion altogether?

-- 
 \            “Simplicity is prerequisite for reliability.” —Edsger W. |
  `\                                                          Dijkstra |
_o__)                                                                  |
Ben Finney
[Message part 2 (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 03 Sep 2010 16:30:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 03 Sep 2010 16:30:06 GMT) Full text and rfc822 format available.

Message #332 received at 522776@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@mirbsd.de>
To: Samuel Thibault <sthibault@debian.org>
Cc: Russ Allbery <rra@debian.org>, 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Fri, 3 Sep 2010 16:20:16 +0000 (UTC)
Samuel Thibault dixit:

>LC_CTYPE has differences between locales, transliterations notably.  For

Oh, okay – good to know…

>I'd say go on :)

OK.

>(of course we'll need to wait for libc to provide the locale
>(post-squeeze I guess) before changing the policy).

Sure. Maybe think of something to help backporters, make a
source package able to detect if it already can use this
locale or has to use the localedef way.

bye,
//mirabilos
-- 
I believe no one can invent an algorithm. One just happens to hit upon it
when God enlightens him. Or only God invents algorithms, we merely copy them.
If you don't believe in God, just consider God as Nature if you won't deny
existence.		-- Coywolf Qi Hunt




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 03 Sep 2010 16:30:08 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 03 Sep 2010 16:30:08 GMT) Full text and rfc822 format available.

Message #337 received at 522776@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@mirbsd.de>
To: Samuel Thibault <sthibault@debian.org>
Cc: Roger Leigh <rleigh@codelibre.net>, 522776@bugs.debian.org, Russ Allbery <rra@debian.org>, glibc@packages.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Fri, 3 Sep 2010 16:21:31 +0000 (UTC)
Samuel Thibault dixit:

>believe that's something that shouldn't break Squeeze at all.

I also believe it cannot possibly do that.

bye,
//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
	-- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 03 Sep 2010 17:18:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Aurelien Jarno <aurelien@aurel32.net>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 03 Sep 2010 17:18:06 GMT) Full text and rfc822 format available.

Message #342 received at 522776@bugs.debian.org (full text, mbox):

From: Aurelien Jarno <aurelien@aurel32.net>
To: Samuel Thibault <sthibault@debian.org>
Cc: Roger Leigh <rleigh@codelibre.net>, 522776@bugs.debian.org, Russ Allbery <rra@debian.org>, Thorsten Glaser <tg@mirbsd.de>, glibc@packages.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Fri, 3 Sep 2010 19:16:40 +0200
On Fri, Sep 03, 2010 at 04:20:27PM +0200, Samuel Thibault wrote:
> Roger Leigh, le Fri 03 Sep 2010 14:52:39 +0100, a écrit :
> > On Fri, Sep 03, 2010 at 01:37:24AM +0200, Samuel Thibault wrote:
> > > without any convergence.
> > 
> > I think reading back through the entire log,
> 
> Thanks for having done it!
> 
> > people who were initially
> > rather opposed to the proposal did come around once they appreciated
> > exactly what the changes would be, and why they were needed.
> 
> Ok.  There was still a question of en_US.UTF-8 vs C.UTF-8, but I believe
> the "en_US.UTF-8 is fine enough" argument doesn't hold any more since
> some other people say that it isn't for them.
> 
> > There were no objections to having a UTF-8 locale installed and
> > available by default, just to it *being* the default.  Taking this
> > first small step is IMO important to do, preferably for squeeze if
> > possible.  Since it's a tiny one-liner change, this should be no
> > trouble in getting this done.
> 
> I believe so too, I just didn't want to push it too much, but yes, I
> believe that's something that shouldn't break Squeeze at all.
> 

That's not something allowed anymore at this period of the freeze, you
will have to get an exception from the release team first.

-- 
Aurelien Jarno	                        GPG: 1024D/F1BCDB73
aurelien@aurel32.net                 http://www.aurel32.net




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 03 Sep 2010 18:18:09 GMT) Full text and rfc822 format available.

Acknowledgement sent to Russ Allbery <rra@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 03 Sep 2010 18:18:09 GMT) Full text and rfc822 format available.

Message #347 received at 522776@bugs.debian.org (full text, mbox):

From: Russ Allbery <rra@debian.org>
To: Ben Finney <ben+debian@benfinney.id.au>
Cc: 522776@bugs.debian.org, Roger Leigh <rleigh@codelibre.net>
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Fri, 03 Sep 2010 11:16:18 -0700
Ben Finney <ben+debian@benfinney.id.au> writes:

> Would a less confusing way to make this distinction be to say something
> like: “The minimal Debian installation must have a locale available that
> uses the UTF-8 character encoding.”?

The other angle here is that it can't just be any UTF-8 locale, since that
isn't very helpful to software that needs to choose a UTF-8 locale on an
automated basis.  Lintian, for example, just needs *some* locale that's
UTF-8, but I don't want to have to try en_US.UTF-8 and then fr.UTF-8 and
then pt_BR.UTF-8 and then....

I think we need to explicitly require a *specific* UTF-8 locale be
available.  C.UTF-8 has a lot of appeal since it's the minimal UTF-8
locale and it doesn't get into issues of favoring one particular language
and its corresponding collation rules, etc.

-- 
Russ Allbery (rra@debian.org)               <http://www.eyrie.org/~eagle/>




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Fri, 03 Sep 2010 22:39:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Samuel Thibault <sthibault@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Fri, 03 Sep 2010 22:39:03 GMT) Full text and rfc822 format available.

Message #352 received at 522776@bugs.debian.org (full text, mbox):

From: Samuel Thibault <sthibault@debian.org>
To: Aurelien Jarno <aurelien@aurel32.net>
Cc: Roger Leigh <rleigh@codelibre.net>, 522776@bugs.debian.org, Russ Allbery <rra@debian.org>, Thorsten Glaser <tg@mirbsd.de>, glibc@packages.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Sat, 4 Sep 2010 00:37:07 +0200
Aurelien Jarno, le Fri 03 Sep 2010 19:16:40 +0200, a écrit :
> On Fri, Sep 03, 2010 at 04:20:27PM +0200, Samuel Thibault wrote:
> > Roger Leigh, le Fri 03 Sep 2010 14:52:39 +0100, a écrit :
> > > There were no objections to having a UTF-8 locale installed and
> > > available by default, just to it *being* the default.  Taking this
> > > first small step is IMO important to do, preferably for squeeze if
> > > possible.  Since it's a tiny one-liner change, this should be no
> > > trouble in getting this done.
> > 
> > I believe so too, I just didn't want to push it too much, but yes, I
> > believe that's something that shouldn't break Squeeze at all.
> 
> That's not something allowed anymore at this period of the freeze, you
> will have to get an exception from the release team first.

Ok.  I don't feel any urgency so I won't ask for it myself.

Samuel




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Sat, 04 Sep 2010 14:39:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Bill Allombert <Bill.Allombert@math.u-bordeaux1.fr>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Sat, 04 Sep 2010 14:39:03 GMT) Full text and rfc822 format available.

Message #357 received at 522776@bugs.debian.org (full text, mbox):

From: Bill Allombert <Bill.Allombert@math.u-bordeaux1.fr>
To: Samuel Thibault <sthibault@debian.org>, 522776@bugs.debian.org
Cc: Aurelien Jarno <aurelien@aurel32.net>, Roger Leigh <rleigh@codelibre.net>, glibc@packages.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Sat, 4 Sep 2010 16:08:33 +0200
On Sat, Sep 04, 2010 at 12:37:07AM +0200, Samuel Thibault wrote:
> Aurelien Jarno, le Fri 03 Sep 2010 19:16:40 +0200, a écrit :
> > On Fri, Sep 03, 2010 at 04:20:27PM +0200, Samuel Thibault wrote:
> > > Roger Leigh, le Fri 03 Sep 2010 14:52:39 +0100, a écrit :
> > > > There were no objections to having a UTF-8 locale installed and
> > > > available by default, just to it *being* the default.  Taking this
> > > > first small step is IMO important to do, preferably for squeeze if
> > > > possible.  Since it's a tiny one-liner change, this should be no
> > > > trouble in getting this done.
> > > 
> > > I believe so too, I just didn't want to push it too much, but yes, I
> > > believe that's something that shouldn't break Squeeze at all.
> > 
> > That's not something allowed anymore at this period of the freeze, you
> > will have to get an exception from the release team first.
> 
> Ok.  I don't feel any urgency so I won't ask for it myself.

Well, the big advantage to have it in squeeze is that this allows squeeze+1 to rely on
it without worrying with partial upgrades.

For me, the fact that d-i already provides it is a major point in favor of C.UTF-8, because
this show that this actually work and is useful.

Cheers,
-- 
Bill. <ballombe@debian.org>

Imagine a large red swirl here. 




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Mon, 06 Sep 2010 07:21:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to andrew@morphoss.com:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Mon, 06 Sep 2010 07:21:03 GMT) Full text and rfc822 format available.

Message #362 received at 522776@bugs.debian.org (full text, mbox):

From: Andrew McMillan <andrew@morphoss.com>
To: Bill Allombert <Bill.Allombert@math.u-bordeaux1.fr>, 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale
Date: Mon, 06 Sep 2010 19:16:00 +1200
On Sat, 2010-09-04 at 16:08 +0200, Bill Allombert wrote:
> On Sat, Sep 04, 2010 at 12:37:07AM +0200, Samuel Thibault wrote:
> > Aurelien Jarno, le Fri 03 Sep 2010 19:16:40 +0200, a écrit :
> > > On Fri, Sep 03, 2010 at 04:20:27PM +0200, Samuel Thibault wrote:
> > > > Roger Leigh, le Fri 03 Sep 2010 14:52:39 +0100, a écrit :
> > > > > There were no objections to having a UTF-8 locale installed and
> > > > > available by default, just to it *being* the default.  Taking this
> > > > > first small step is IMO important to do, preferably for squeeze if
> > > > > possible.  Since it's a tiny one-liner change, this should be no
> > > > > trouble in getting this done.
> > > > 
> > > > I believe so too, I just didn't want to push it too much, but yes, I
> > > > believe that's something that shouldn't break Squeeze at all.
> > > 
> > > That's not something allowed anymore at this period of the freeze, you
> > > will have to get an exception from the release team first.
> > 
> > Ok.  I don't feel any urgency so I won't ask for it myself.
> 
> Well, the big advantage to have it in squeeze is that this allows squeeze+1 to rely on
> it without worrying with partial upgrades.
> 
> For me, the fact that d-i already provides it is a major point in favor of C.UTF-8, because
> this show that this actually work and is useful.

I agree.  I think that the impact of having a guaranteed UTF-8 locale
available is only positive.  It may be that nothing presently depends on
it, but for that very reason it should be fine to promote to the release
team for a freeze exception.

Regards,
					Andrew McMillan.


-- 
------------------------------------------------------------------------
andrew (AT) morphoss (DOT) com                            +64(272)DEBIAN
                    Pyros of the world... IGNITE !!!
------------------------------------------------------------------------






Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Sat, 08 Jan 2011 02:45:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to David Holland <dholland@eecs.harvard.edu>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Sat, 08 Jan 2011 02:45:03 GMT) Full text and rfc822 format available.

Message #367 received at 522776@bugs.debian.org (full text, mbox):

From: David Holland <dholland@eecs.harvard.edu>
To: 522776@bugs.debian.org
Subject: Re: Bug#522776: debian-policy: mandate existence of a standardised locale
Date: Fri, 7 Jan 2011 21:14:47 -0500
hello,

Can this please get done (adding a C.UTF-8 locale)? It is absolutely
required for writing shell scripts that handle UTF-8 data, if you want
those shell scripts to have anything like portable or reliable
behavior.

-- 
   - David A. Holland / dholland@eecs.harvard.edu




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Sat, 08 Jan 2011 11:48:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Roger Leigh <rleigh@codelibre.net>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Sat, 08 Jan 2011 11:48:03 GMT) Full text and rfc822 format available.

Message #372 received at 522776@bugs.debian.org (full text, mbox):

From: Roger Leigh <rleigh@codelibre.net>
To: David Holland <dholland@eecs.harvard.edu>, 522776@bugs.debian.org
Cc: GNU Libc Maintainers <debian-glibc@lists.debian.org>, debian-release@lists.debian.org
Subject: C.UTF-8 in squeeze (was: Re: Bug#522776: debian-policy: mandate existence of a standardised locale)
Date: Sat, 8 Jan 2011 11:44:56 +0000
[Message part 1 (text/plain, inline)]
clone 522776 -1
reassign -1 eglibc
retitle -1 eglibc: Please provide a C.UTF-8 locale by default
severity -1 important
thanks

On Fri, Jan 07, 2011 at 09:14:47PM -0500, David Holland wrote:
> Can this please get done (adding a C.UTF-8 locale)? It is absolutely
> required for writing shell scripts that handle UTF-8 data, if you want
> those shell scripts to have anything like portable or reliable
> behavior.

This is really in the hands of the glibc maintainers.  I thought that
a bug had been filed months ago, but I can't find it.  I've done so
now.

Note this comment from Aurelien Jarno:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=522776#342

This will only be done with the approval of the release team, who
I've copied in.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
[signature.asc (application/pgp-signature, inline)]

Bug 522776 cloned as bug 609306. Request was from Roger Leigh <rleigh@codelibre.net> to control@bugs.debian.org. (Sat, 08 Jan 2011 11:48:05 GMT) Full text and rfc822 format available.

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Sat, 08 Jan 2011 12:00:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Julien Cristau <jcristau@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Sat, 08 Jan 2011 12:00:03 GMT) Full text and rfc822 format available.

Message #379 received at 522776@bugs.debian.org (full text, mbox):

From: Julien Cristau <jcristau@debian.org>
To: Roger Leigh <rleigh@codelibre.net>
Cc: David Holland <dholland@eecs.harvard.edu>, 522776@bugs.debian.org, GNU Libc Maintainers <debian-glibc@lists.debian.org>, debian-release@lists.debian.org
Subject: Re: C.UTF-8 in squeeze (was: Re: Bug#522776: debian-policy: mandate existence of a standardised locale)
Date: Sat, 8 Jan 2011 12:57:05 +0100
[Message part 1 (text/plain, inline)]
On Sat, Jan  8, 2011 at 11:44:56 +0000, Roger Leigh wrote:

> This will only be done with the approval of the release team, who
> I've copied in.
> 
I don't think that's not going to happen.  Try again for wheezy, and
maybe you can manage not to wait until the last minute of the freeze.

Cheers,
Julien
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Sat, 08 Jan 2011 12:15:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Julien Cristau <jcristau@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Sat, 08 Jan 2011 12:15:06 GMT) Full text and rfc822 format available.

Message #384 received at 522776@bugs.debian.org (full text, mbox):

From: Julien Cristau <jcristau@debian.org>
To: Roger Leigh <rleigh@codelibre.net>
Cc: David Holland <dholland@eecs.harvard.edu>, 522776@bugs.debian.org, GNU Libc Maintainers <debian-glibc@lists.debian.org>, debian-release@lists.debian.org
Subject: Re: C.UTF-8 in squeeze (was: Re: Bug#522776: debian-policy: mandate existence of a standardised locale)
Date: Sat, 8 Jan 2011 13:10:04 +0100
[Message part 1 (text/plain, inline)]
On Sat, Jan  8, 2011 at 12:57:05 +0100, Julien Cristau wrote:

> On Sat, Jan  8, 2011 at 11:44:56 +0000, Roger Leigh wrote:
> 
> > This will only be done with the approval of the release team, who
> > I've copied in.
> > 
> I don't think that's not going to happen.  Try again for wheezy, and
                       ^^^
scratch that 'not'.  I need more coffee.

> maybe you can manage not to wait until the last minute of the freeze.
> 
Cheers,
Julien
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Sat, 08 Jan 2011 19:24:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to David Holland <dholland@eecs.harvard.edu>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Sat, 08 Jan 2011 19:24:02 GMT) Full text and rfc822 format available.

Message #389 received at 522776@bugs.debian.org (full text, mbox):

From: David Holland <dholland@eecs.harvard.edu>
To: Roger Leigh <rleigh@codelibre.net>
Cc: David Holland <dholland@eecs.harvard.edu>, 522776@bugs.debian.org, GNU Libc Maintainers <debian-glibc@lists.debian.org>, debian-release@lists.debian.org
Subject: Re: C.UTF-8 in squeeze (was: Re: Bug#522776: debian-policy: mandate existence of a standardised locale)
Date: Sat, 8 Jan 2011 14:22:43 -0500
On Sat, Jan 08, 2011 at 11:44:56AM +0000, Roger Leigh wrote:
 > > Can this please get done (adding a C.UTF-8 locale)? It is absolutely
 > > required for writing shell scripts that handle UTF-8 data, if you want
 > > those shell scripts to have anything like portable or reliable
 > > behavior.
 > 
 > This is really in the hands of the glibc maintainers.  I thought that
 > a bug had been filed months ago, but I can't find it.  I've done so
 > now.

Thanks.

 > Note this comment from Aurelien Jarno:
 > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=522776#342
 > 
 > This will only be done with the approval of the release team, who
 > I've copied in.

Right, I think I wasn't very clear. I'm less concerned about whether
this makes it into squeeze than that it gets done eventually; if it's
been applied already in a later version I must have missed this in the
comment log. :-/

-- 
   - David A. Holland / dholland@eecs.harvard.edu




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Mon, 10 Jan 2011 08:21:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Aurelien Jarno <aurelien@aurel32.net>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Mon, 10 Jan 2011 08:21:07 GMT) Full text and rfc822 format available.

Message #394 received at 522776@bugs.debian.org (full text, mbox):

From: Aurelien Jarno <aurelien@aurel32.net>
To: Roger Leigh <rleigh@codelibre.net>
Cc: David Holland <dholland@eecs.harvard.edu>, 522776@bugs.debian.org, GNU Libc Maintainers <debian-glibc@lists.debian.org>, debian-release@lists.debian.org
Subject: Re: C.UTF-8 in squeeze
Date: Mon, 10 Jan 2011 08:44:10 +0100
tag


Roger Leigh a écrit :
> clone 522776 -1
> reassign -1 eglibc
> retitle -1 eglibc: Please provide a C.UTF-8 locale by default
> severity -1 important
> thanks
> 
> On Fri, Jan 07, 2011 at 09:14:47PM -0500, David Holland wrote:
>> Can this please get done (adding a C.UTF-8 locale)? It is absolutely
>> required for writing shell scripts that handle UTF-8 data, if you want
>> those shell scripts to have anything like portable or reliable
>> behavior.
> 
> This is really in the hands of the glibc maintainers.  I thought that
> a bug had been filed months ago, but I can't find it.  I've done so
> now.
> 
> Note this comment from Aurelien Jarno:
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=522776#342
> 
> This will only be done with the approval of the release team, who
> I've copied in.
> 

I know some persons already tried to work on that, so if patches are
already available, they will be really appreciated.

Providing a C.UTF-8 locale is quite easy, d-i is already doing that.
Providing a C.UTF-8 *by default* is more complicated, as it has to be
done in the GNU libc code, we can't really on the locale package
generating one. This would mean this package should always be installed,
and that we should trust on user to correctly regenerate the locales if
they do.

-- 
Aurelien Jarno                          GPG: 1024D/F1BCDB73
aurelien@aurel32.net                 http://www.aurel32.net




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Mon, 10 Jan 2011 10:39:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Roger Leigh <rleigh@codelibre.net>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Mon, 10 Jan 2011 10:39:06 GMT) Full text and rfc822 format available.

Message #399 received at 522776@bugs.debian.org (full text, mbox):

From: Roger Leigh <rleigh@codelibre.net>
To: Aurelien Jarno <aurelien@aurel32.net>
Cc: David Holland <dholland@eecs.harvard.edu>, 522776@bugs.debian.org, GNU Libc Maintainers <debian-glibc@lists.debian.org>, debian-release@lists.debian.org
Subject: Re: C.UTF-8 in squeeze
Date: Mon, 10 Jan 2011 10:38:24 +0000
[Message part 1 (text/plain, inline)]
On Mon, Jan 10, 2011 at 08:44:10AM +0100, Aurelien Jarno wrote:
> Roger Leigh a écrit :
> > On Fri, Jan 07, 2011 at 09:14:47PM -0500, David Holland wrote:
> >> Can this please get done (adding a C.UTF-8 locale)? It is absolutely
> >> required for writing shell scripts that handle UTF-8 data, if you want
> >> those shell scripts to have anything like portable or reliable
> >> behavior.
> > 
> > This is really in the hands of the glibc maintainers.  I thought that
> > a bug had been filed months ago, but I can't find it.  I've done so
> > now.
> 
> I know some persons already tried to work on that, so if patches are
> already available, they will be really appreciated.
> 
> Providing a C.UTF-8 locale is quite easy, d-i is already doing that.
> Providing a C.UTF-8 *by default* is more complicated, as it has to be
> done in the GNU libc code, we can't really on the locale package
> generating one. This would mean this package should always be installed,
> and that we should trust on user to correctly regenerate the locales if
> they do.

Hi Aurelien,

I think that initially, simply guaranteeing the presence of C.UTF-8
as a standard locale, generated by localedef/gen will be sufficient.
This will allow packages to rely on its presence during normal
system operation e.g. in maintainer scripts, for lintian and other
programs requiring it.

I think having it hardcoded into libc is rather more difficult and
having it prior to /usr being mounted is not that important--all
of the known use cases do not require this.  So at least initially,
I think simply providing it outside libc will be more than sufficient.

I would like to see it in libc itself eventually, but I am concerned
about the UTF-8 codeset table being duplicated for every locale.  I'd
like to see it shared so that users using it don't have to pay a
large penalty for the needless duplication.  Possibly best looked at
upstream; I did already mention it a year or so back, but I didn't get
too far--it was more of a casual enquiry about the possibilities.

Regards,
Roger
-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Mon, 10 Jan 2011 10:57:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Aurelien Jarno <aurelien@aurel32.net>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Mon, 10 Jan 2011 10:57:03 GMT) Full text and rfc822 format available.

Message #404 received at 522776@bugs.debian.org (full text, mbox):

From: Aurelien Jarno <aurelien@aurel32.net>
To: Roger Leigh <rleigh@codelibre.net>
Cc: David Holland <dholland@eecs.harvard.edu>, 522776@bugs.debian.org, GNU Libc Maintainers <debian-glibc@lists.debian.org>, debian-release@lists.debian.org
Subject: Re: C.UTF-8 in squeeze
Date: Mon, 10 Jan 2011 11:53:54 +0100
Roger Leigh a écrit :
> On Mon, Jan 10, 2011 at 08:44:10AM +0100, Aurelien Jarno wrote:
>> Roger Leigh a écrit :
>>> On Fri, Jan 07, 2011 at 09:14:47PM -0500, David Holland wrote:
>>>> Can this please get done (adding a C.UTF-8 locale)? It is absolutely
>>>> required for writing shell scripts that handle UTF-8 data, if you want
>>>> those shell scripts to have anything like portable or reliable
>>>> behavior.
>>> This is really in the hands of the glibc maintainers.  I thought that
>>> a bug had been filed months ago, but I can't find it.  I've done so
>>> now.
>> I know some persons already tried to work on that, so if patches are
>> already available, they will be really appreciated.
>>
>> Providing a C.UTF-8 locale is quite easy, d-i is already doing that.
>> Providing a C.UTF-8 *by default* is more complicated, as it has to be
>> done in the GNU libc code, we can't really on the locale package
>> generating one. This would mean this package should always be installed,
>> and that we should trust on user to correctly regenerate the locales if
>> they do.
> 
> Hi Aurelien,
> 
> I think that initially, simply guaranteeing the presence of C.UTF-8
> as a standard locale, generated by localedef/gen will be sufficient.
> This will allow packages to rely on its presence during normal
> system operation e.g. in maintainer scripts, for lintian and other
> programs requiring it.
> 

Doing so means that the locales or locales-all package will be installed
by default. People are going to shout... Or we should create a
locales-cutf8 packages, but then the integration with the two other
packages will become quite complex.

-- 
Aurelien Jarno                          GPG: 1024D/F1BCDB73
aurelien@aurel32.net                 http://www.aurel32.net




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Mon, 10 Jan 2011 13:12:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Thorsten Glaser <tg@mirbsd.de>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Mon, 10 Jan 2011 13:12:03 GMT) Full text and rfc822 format available.

Message #409 received at 522776@bugs.debian.org (full text, mbox):

From: Thorsten Glaser <tg@mirbsd.de>
To: 522776@bugs.debian.org
Subject: Re: Bug#522776: C.UTF-8 in squeeze
Date: Mon, 10 Jan 2011 13:09:10 +0000 (UTC)
Aurelien Jarno dixit:

>Doing so means that the locales or locales-all package will be installed

Hm, localedef is in libc-bin – can C.UTF-8 not be generated
by its postinst (with some logic in locales-all to restore
C.UTF-8 in its postrm)?

bye,
//mirabilos
-- 
  “Having a smoking section in a restaurant is like having
          a peeing section in a swimming pool.”
						-- Edward Burr




Information forwarded to debian-bugs-dist@lists.debian.org, Debian Policy List <debian-policy@lists.debian.org>:
Bug#522776; Package debian-policy. (Mon, 10 Jan 2011 13:33:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Roger Leigh <rleigh@codelibre.net>:
Extra info received and forwarded to list. Copy sent to Debian Policy List <debian-policy@lists.debian.org>. (Mon, 10 Jan 2011 13:33:06 GMT) Full text and rfc822 format available.

Message #414 received at 522776@bugs.debian.org (full text, mbox):

From: Roger Leigh <rleigh@codelibre.net>
To: Thorsten Glaser <tg@mirbsd.de>, 522776@bugs.debian.org
Subject: Re: Bug#522776: C.UTF-8 in squeeze
Date: Mon, 10 Jan 2011 13:29:21 +0000
[Message part 1 (text/plain, inline)]
On Mon, Jan 10, 2011 at 01:09:10PM +0000, Thorsten Glaser wrote:
> Aurelien Jarno dixit:
> 
> >Doing so means that the locales or locales-all package will be installed
> 
> Hm, localedef is in libc-bin – can C.UTF-8 not be generated
> by its postinst (with some logic in locales-all to restore
> C.UTF-8 in its postrm)?

I was thinking the same thing, but I think it needs things like
/usr/share/i18n/charmaps/UTF-8.gz which are in the locales package.

Could we pre-generate it using --no-archive so we don't use the
locale archive; would this be sufficient to providing it in a
package separate from the locales package or are there other issues?


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
[signature.asc (application/pgp-signature, inline)]

Send a report that this bug log contains spam.


Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Thu Apr 17 12:30:20 2014; Machine Name: buxtehude.debian.org

Debian Bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.