Package: manpages; Maintainer for manpages is Dr. Tobias Quathamer <toddy@debian.org>; Source for manpages is src:manpages.
Reported by: Hugo Herbelin <Hugo.Herbelin@inria.fr>
Date: Tue, 10 Mar 2009 12:18:01 UTC
Severity: wishlist
Tags: l10n
Fixed in version manpages/3.20-1
Done: Joey Schulze <joey@infodrom.org>
Bug is archived. No further changes may be made.
Report forwarded
to debian-bugs-dist@lists.debian.org, Colin Watson <cjwatson@debian.org>:
Bug#519095; Package man-db.
(Tue, 10 Mar 2009 12:18:03 GMT).
Acknowledgement sent
to Hugo Herbelin <Hugo.Herbelin@inria.fr>:
New Bug report received and forwarded. Copy sent to Colin Watson <cjwatson@debian.org>.
(Tue, 10 Mar 2009 12:18:03 GMT).
Message #5 received at submit@bugs.debian.org:
Package: man-db
Version: 2.5.4-1
Severity: wishlist
Tags: l10n

Hi,

My primary wish was to be able to correctly display the pages
iso_8859-* and I end up with a suggestion for better supporting
all pages encoded in one of the iso-8859-X coding systems.

Here were my successive experiences for displaying, e.g., the
iso_8859-15 man page:

* Bad solutions *

- If I set my locale to utf8, I see all non-ascii characters in the
  iso_8859-* pages as if they were iso-8859-1 characters. As reported
  by "man -d", the information in the pipeline that is relevant to the
  encoding is:

  manconv -f UTF-8:ISO-8859-1 -t ISO-8859-1//IGNORE | nroff -mandoc -Tutf8

  and indeed, nroff assumes having latin1 as default input and utf8 in
  output.

- If I set my locale to iso885915@euro, I see "?" for the euro sign
  and "1/4", "1/2" and "3/4" for the oe ligature and Y with
  diaeresis. Indeed, the pipeline is

  manconv -f UTF-8:ISO-8859-1 -t ISO-8859-1//IGNORE | nroff -mandoc -Tlatin1 | iconv -c -f ISO-8859-1 -t ISO-8859-15//TRANSLIT

  which behaves as if the page were in ISO-8859-1 (while in fact it is
  in ISO-8859-15) and translates what it thinks are ISO-8859-1 chars
  into valid ISO-8859-15 sequences (the "¤" currency sign becomes "?"
  because it has no equivalent and the "¼", "½", "¾" characters become
  "1/4" and so on).

* Better solutions *

In a second step, I tried to move the page iso_8859-* to a directory
whose name tells what the encoding is (I typically move the
iso_8859-15 page to a directory named "en.ISO8859-15/man7"). The pipeline
seems to become better as we now obtain:

- with a utf8 locale:

  page_encoding = ISO-8859-15
  source_encoding = ISO-8859-1
  roff_encoding = ISO-8859-1
  output_encoding = UTF-8
  pipeline is: manconv -f UTF-8:ISO-8859-15 -t ISO-8859-1//IGNORE | nroff -mandoc -Tutf8

- with an iso885915@euro locale:

  page_encoding = ISO-8859-15
  source_encoding = ISO-8859-1
  roff_encoding = ISO-8859-1
  output_encoding = ISO-8859-1
  pipeline is: manconv -f UTF-8:ISO-8859-15 -t ISO-8859-1//IGNORE | nroff -mandoc -Tlatin1 | iconv -c -f ISO-8859-1 -t ISO-8859-15//TRANSLIT

What is better is that man has recognized that the encoding of the
page is iso-8859-15 (based on the directory name), but it has failed
to propagate this information when it turned to find an encoding that
nroff supports. Something is strange there regarding the respective
roles of the "source" and "page" encodings in the calls to manconv and
roff.

From what I understand (but I'm uncertain), nroff does not support
multibyte characters and hence, pages have to be converted to
single-byte characters using the ascii8 device (it seems there is
something special for east-asia languages but I don't understand well
how it works). The problem seems to be that the single-byte encoding
used to call nroff forgets about the encoding mentioned in the
directory name and only keeps the language part of the directory name,
then reassigning to each language a canonical default encoding. This
strategy would be good for pages encoded in utf8: since nroff does not
support utf8, we assume that, say, a Polish page in utf8 can always be
converted to the single-byte iso-8859-2 encoding. But this strategy
loses information when we already know that the page is encoded in a
single-byte encoding.

My suggestions then are:

- Change the definition of "source encoding" so that if the language
  directory name already mentions a single-byte encoding (say one of
  the iso-8859-* encodings), it considers it to be the source encoding
  and looks for a language-based canonical single-byte encoding (table
  directory_table in file encodings.c of the man package) only if the
  language directory name tells it is an UTF-8 page.

- Move the English-written pages using iso-8859-X encodings in
  directories named en.ISO8859-X (this is about the manpages package).

Hoping I did not miss some other complex parts of the conversion
process...

Hugo Herbelin

Remark: assuming the current version of man-db, there is still a
(tedious) workaround to see correctly the iso_8859-* pages: to see a
page with the iso-8859-X encoding, choose an iso88591 locale
(e.g. fr_FR.iso88591) so that no translation happens, and display the
result in a terminal, setting first the terminal to believe it is
displaying iso-8859-X text.

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.27-1-amd64 (SMP w/2 CPU cores)
Locale: LANG=fr_FR.utf-8, LC_CTYPE=fr_FR.utf-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages man-db depends on:
ii  bsdmainutils           6.1.10             collection of more utilities from
ii  debconf [debconf-2.0]  1.5.26             Debian configuration management sy
ii  dpkg                   1.14.25            Debian package management system
ii  groff-base             1.18.1.1-21        GNU troff text-formatting system (
ii  libc6                  2.9-4              GNU C Library: Shared libraries
ii  libgdbm3               1.8.3-4            GNU dbm database routines (runtime
ii  zlib1g                 1:1.2.3.3.dfsg-13  compression library - runtime

man-db recommends no packages.

Versions of packages man-db suggests:
ii  epiphany-gecko [www-browser  2.22.3-9      Intuitive GNOME web browser - Geck
ii  groff                        1.18.1.1-21   GNU troff text-formatting system
ii  iceweasel [www-browser]      3.0.7-1       lightweight web browser based on M
ii  less                         418-1         Pager program similar to more
ii  lynx-cur [www-browser]       2.8.7dev13-1  Text-mode WWW Browser with NLS sup
ii  w3m [www-browser]            0.5.2-2+b1    WWW browsable pager with excellent

-- debconf information excluded
Information forwarded
to debian-bugs-dist@lists.debian.org:
Bug#519095; Package man-db.
(Tue, 10 Mar 2009 23:42:02 GMT).
Acknowledgement sent
to Colin Watson <cjwatson@debian.org>:
Extra info received and forwarded to list.
(Tue, 10 Mar 2009 23:42:02 GMT).
Message #10 received at 519095@bugs.debian.org:
clone 519095 -1
user man-db@packages.debian.org
usertags 519095 target-2.5.5
tags 519095 fixed-upstream
reassign -1 manpages
retitle -1 manpages: state encoding of iso-8859-* pages
thanks
[Dear manpages maintainer: please read down for the part that affects
you.]
On Tue, Mar 10, 2009 at 01:16:18PM +0100, Hugo Herbelin wrote:
> My primary wish was to be able to correctly display the pages
> iso_8859-* and I end up with a suggestion for better supporting
> all pages encoded in one of the iso-8859-X coding systems.
So, this is really pretty complicated. I agree with almost all of your
analysis, but let me try to explain a bit further.
> Here were my successive experiences for displaying, e.g., the
> iso_8859-15 man page:
>
> * Bad solutions *
>
> - If I set my locale to utf8, I see all non-ascii characters in the
> iso_8859-* pages as if they were iso-8859-1 characters. As reported
> by "man -d", the information in the pipeline that is relevant to the
> encoding is:
>
> manconv -f UTF-8:ISO-8859-1 -t ISO-8859-1//IGNORE | nroff -mandoc -Tutf8
>
> and indeed, nroff assumes having latin1 as default input and utf8 in
> output.
Correct.
> - If I set my locale to iso885915@euro, I see "?" for the euro sign
> and "1/4", "1/2" and "3/4" for the oe ligature and Y with
> diaeresis. Indeed, the pipeline is
>
> manconv -f UTF-8:ISO-8859-1 -t ISO-8859-1//IGNORE | nroff -mandoc -Tlatin1 | iconv -c -f ISO-8859-1 -t ISO-8859-15//TRANSLIT
>
> which does as if the page were in ISO-8859-1 (while in fact it is in
> ISO-8859-15) and translate what it thinks are ISO-8859-1 chars into
> valid ISO-8859-15 sequences (the "¤" currency sign becomes "?"
> because it has no equivalent and the "¼", "½", "¾" characters become
> "1/4" and so on).
Correct. If man treated a file on the filesystem as being in a different
encoding just because you were using a different locale, that would be a
bug in itself; files don't change encoding just because you set an
environment variable.
That said, using the latin1 device and then recoding to ISO-8859-15 is
not really the best solution. I think it might be better to use the utf8
device and then recode to ISO-8859-15 from there. This doesn't entirely
fix the problem, though; see below.
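For readers who want to see the misreading byte by byte, here is a minimal Python sketch. It is not man-db code; the helper name `encodable` is made up for the illustration. It decodes the four byte values discussed above once as ISO-8859-15 and once as ISO-8859-1, then checks which of the misread characters can be encoded back to ISO-8859-15.

```python
import unicodedata

# The four byte values discussed in the report.
RAW = bytes([0xA4, 0xBC, 0xBD, 0xBE])

# What the page author meant (the page really is ISO-8859-15).
as_latin9 = RAW.decode("iso8859_15")
# What the pipeline assumes (groff input treated as ISO-8859-1).
as_latin1 = RAW.decode("iso8859_1")


def encodable(ch: str, encoding: str) -> bool:
    """Return True if the single character `ch` exists in `encoding`."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Characters of the misread text that have no ISO-8859-15 equivalent,
# so a converter can only drop or approximate them.
lost = [ch for ch in as_latin1 if not encodable(ch, "iso8859_15")]

for intended, misread in zip(as_latin9, as_latin1):
    # Print Unicode names rather than the characters themselves so the
    # output does not depend on the terminal encoding.
    print(unicodedata.name(intended), "<->", unicodedata.name(misread))
```

All four misread characters end up in `lost`, which matches the "?" and "1/4"-style approximations seen with the iso885915@euro locale.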
> * Better solutions *
>
> In a second step, I tried to move the page iso_8859-* to a directory
> whose name tells what the encoding is (I typically move the
> iso_8859-15 page to a directory named "en.ISO8859-15/man7"). The pipeline
> seems to become better as we now obtain:
This is one approach, but a cleaner one would be to change the first
line of iso-8859-15.7.gz to:
'\" t -*- coding: ISO-8859-15 -*-
(See manconv(1) for documentation of this.) Although you won't see
evidence of this in the debugging output, this will cause manconv to
ignore the input encoding(s) given to it and instead assume ISO-8859-15.
Although see my comments below about bugs in this ...
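A rough Python sketch of how such a declaration can be picked out of the first line follows. The function name `declared_encoding` and the regular expression are assumptions made for illustration; the actual parsing in manconv is written in C and may differ in detail.

```python
import re
from typing import Optional

# Matches "-*- ... coding: NAME ... -*-" and captures NAME.
# This pattern is an illustrative assumption, not man-db's parser.
_CODING = re.compile(r"-\*-.*?\bcoding:\s*([A-Za-z0-9_.:-]+).*?-\*-")

# A preprocessor line starts with the three characters: ' \ "
_PREFIX = "'\\\""


def declared_encoding(first_line: str) -> Optional[str]:
    """Return the encoding named on a preprocessor line, or None."""
    if not first_line.startswith(_PREFIX):
        return None
    match = _CODING.search(first_line)
    return match.group(1) if match else None


example = "'\\\" t -*- coding: ISO-8859-15 -*-"
print(declared_encoding(example))
```

A line that only lists preprocessors (for example `'\" t`) yields `None`, so the caller would fall back to whatever encoding it would have used otherwise.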
I've cloned this bug and reassigned the clone to manpages, since,
regardless of any other work done in this area, any English manual pages
that are not encoded in ISO-8859-1 or UTF-8 should state an explicit
encoding using the above mechanism.
> - with a utf8 locale:
>
> page_encoding = ISO-8859-15
> source_encoding = ISO-8859-1
> roff_encoding = ISO-8859-1
> output_encoding = UTF-8
> pipeline is: manconv -f UTF-8:ISO-8859-15 -t ISO-8859-1//IGNORE | nroff -mandoc -Tutf8
>
> - with an iso885915@euro locale:
>
> page_encoding = ISO-8859-15
> source_encoding = ISO-8859-1
> roff_encoding = ISO-8859-1
> output_encoding = ISO-8859-1
> pipeline is: manconv -f UTF-8:ISO-8859-15 -t ISO-8859-1//IGNORE | nroff -mandoc -Tlatin1 | iconv -c -f ISO-8859-1 -t ISO-8859-15//TRANSLIT
>
> What is better, is that man has recognized that the encoding of the
> page is iso-8859-15 (based on the directory name) but it has failed
> to propagate this information when it turned to find an encoding that
> nroff supports. Something is strange there regarding the respective
> roles of the "source" and "page" encodings in the calls to manconv and
> roff.
>
> From what I understand (but I'm uncertain), nroff does not support
> multibyte characters and hence, pages have to be converted to
> single-byte characters using the ascii8 device (it seems there is
> something special for east-asia languages but I don't understand well
> how it works). The problem seems to be that the single-byte encoding
> used to call nroff forgets about the encoding mentioned in the
> directory name and only keeps the language part of the directory name,
> then reassigning to each language a canonical default encoding. This
> strategy would be good for pages encoded in utf8: since nroff does not
> support utf8, we assume that, say, a Polish page in utf8 can always be
> converted to the single-byte iso-8859-2 encoding. But this strategy
> loses information when we already know that the page is encoded in a
> single-byte encoding.
I agree that the recoding from one legacy encoding to another loses
information, and this is definitely a bug.
It's important to remember that, with some exceptions, the current
version of groff in Debian cannot really be told to use a different
input encoding, which is where a lot of this weirdness comes from. It's
not just about single-byte vs. multibyte; with the exception of some
hacks for CJK (the nippon device), and the awful, awful ascii8 hack,
groff always assumes that its input is ISO-8859-1.
This has been fixed upstream by the introduction of the preconv
preprocessor, which will allow man to feed in any input encoding it
likes and have preconv convert it to a notation involving Unicode
codepoints that the groff core can understand. man-db is already
prepared to use this once it's available. However, there is one last
significant blocker to upgrading the Debian package, namely the
introduction of character class support so that the new groff can format
CJK text reasonably without the massive non-forward-portable Debian
patch. I'm working on this on and off at the moment.
Now, we can work around this somewhat by using the awful, awful hack I
mentioned above: the purpose of the ascii8 device is that its output
encoding is always the same as its input encoding (so far from
converting multibyte characters to single-byte characters, the ascii8
device exists to perform no conversion at all). This is typographically
unsound because groff is not supposed to just pass through character
data, but also to interpret it (e.g. hyphenation) and unless it knows
what characters are which it can't do its job properly. Nevertheless, in
the case of manual pages the consequences are not too bad, so this will
do as a workaround for the time being.
Using the ascii8 device for pages declared as ISO-8859-15 in their
preprocessor line breaks in man-db 2.5.4-1 for the following reasons:
* manconv doesn't spot the -*- coding -*- line, because zsoelim puts a
".lf 1 -" line number marker before it. I've fixed this upstream by
arranging for zsoelim to put the .lf request after any leading
comment line.
* The -*- coding -*- line is only read by manconv, not man itself.
Thus, man recodes to ISO-8859-1 unnecessarily when it should realise
that it needs to just use ISO-8859-15 all the way (with ascii8).
I've fixed this upstream.
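The first of these two problems can be mimicked in a few lines of Python. This is only a sketch: the helper names are hypothetical, the check is far cruder than manconv's, and the thread does not say exactly what marker text the fixed zsoelim emits, so the same ".lf 1 -" string is reused purely for illustration.

```python
PREFIX = "'\\\""      # the three characters: ' \ "
MARKER = ".lf 1 -"    # marker text quoted in the message above


def first_line_declares_coding(text: str) -> bool:
    """Crude stand-in for a reader that inspects only the first line."""
    first = text.split("\n", 1)[0]
    return first.startswith(PREFIX) and "coding:" in first


def marker_first(page: str) -> str:
    """Old behaviour as described: the marker precedes everything."""
    return MARKER + "\n" + page


def marker_after_comment(page: str) -> str:
    """Described fix: keep a leading comment line first, marker second."""
    first, _, rest = page.partition("\n")
    if first.startswith(PREFIX):
        return first + "\n" + MARKER + "\n" + rest
    return MARKER + "\n" + page


page = "'\\\" t -*- coding: ISO-8859-15 -*-\n.TH ISO_8859-15 7\n"
print(first_line_declares_coding(marker_first(page)))
print(first_line_declares_coding(marker_after_comment(page)))
```

With the marker in front, the declaration is no longer on the first line and goes unnoticed; with the marker moved behind the comment line, it is found again.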
> My suggestions then are:
>
> - Change the definition of "source encoding" so that if the language
> directory name already mentions a single-byte encoding (say one of
> the iso-8859-* encodings), it considers it to be the source encoding
> and looks for a language-based canonical single-byte encoding (table
> directory_table in file encodings.c of the man package) only if the
> language directory name tells it is an UTF-8 page.
I think that makes sense on the grounds that recoding between legacy
encodings tends to do more harm than good, and have implemented this.
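In outline, the rule now reads as in the following Python sketch. The table is a tiny illustrative subset built only from the language/encoding pairs mentioned in this thread, and the function name and the fallback for unknown languages are assumptions; the real logic lives in src/man.c and src/encodings.c.

```python
# Illustrative subset only: pairs taken from this thread.
DEFAULT_LEGACY = {
    "en": "ISO-8859-1",
    "pl": "ISO-8859-2",
    "el": "ISO-8859-7",
}


def source_encoding(page_encoding: str, language: str) -> str:
    """Pick the encoding to hand to a groff that cannot read UTF-8.

    A page already in a legacy encoding is passed through unchanged;
    only UTF-8 pages are mapped to the language's default legacy
    encoding. Falling back to ISO-8859-1 for unknown languages is an
    assumption of this sketch.
    """
    if page_encoding.upper().replace("_", "-") in ("UTF-8", "UTF8"):
        return DEFAULT_LEGACY.get(language, "ISO-8859-1")
    return page_encoding


print(source_encoding("ISO-8859-15", "en"))
print(source_encoding("UTF-8", "pl"))
```

The first call keeps ISO-8859-15 instead of recoding to ISO-8859-1, which is the behaviour change requested in the report.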
> - Move the English-written pages using iso-8859-X encodings in
> directories named en.ISO8859-X (this is about the manpages package).
As mentioned above, I think this is better done by way of a preprocessor
encoding declaration, assuming the fixes I've applied upstream and
intend to backport to Debian. That's neater than creating new
directories for a small number of pages.
Tue Mar 10 23:24:27 GMT 2009  Colin Watson  <cjwatson@debian.org>

  Fix handling of pages that declare a non-default encoding in their
  preprocessor lines. Thanks to Hugo Herbelin for some of the ideas
  here (Debian bug #519095).

  * src/encodings.c (get_source_encoding): Note that this function
    should only be called if the page encoding is UTF-8. Add another
    example.
  * src/manconv.c (check_preprocessor_encoding): Move to ...
  * src/encodings.c (check_preprocessor_encoding): ... here.
  * src/encodings.h (check_preprocessor_encoding): Add prototype.
  * src/man.c (make_roff_command): Use preprocessor-declared encoding
    as page_encoding if known. Set source_encoding to page_encoding
    unless the latter is UTF-8.
  * src/Makefile.am (manconv_SOURCES): Add encodings.c.
  * src/encodings.c (charset_table): Use ISO-8859-15 -> latin1 entry
    only in the !MULTIBYTE_GROFF case; true ISO-8859-15 pages are
    better handled using ascii8 or preconv if possible.

Tue Mar 10 14:11:14 GMT 2009  Colin Watson  <cjwatson@debian.org>

  * src/zsoelim.l (zsoelim_parse_file): Put the initial .lf request
    after any initial comment line, so that manconv can find encoding
    instructions more easily.
Thanks a lot,
--
Colin Watson [cjwatson@debian.org]
Bug 519095 cloned as bug 519209.
Request was from Colin Watson <cjwatson@debian.org>
to control@bugs.debian.org.
(Tue, 10 Mar 2009 23:42:02 GMT).
Bug reassigned from package `man-db' to `manpages'.
Request was from Colin Watson <cjwatson@debian.org>
to control@bugs.debian.org.
(Tue, 10 Mar 2009 23:42:05 GMT).
Changed Bug title to `manpages: state encoding of iso-8859-* pages' from `man-db: Improving man support for pages iso-8859-* encoded'.
Request was from Colin Watson <cjwatson@debian.org>
to control@bugs.debian.org.
(Tue, 10 Mar 2009 23:42:06 GMT).
Information forwarded
to debian-bugs-dist@lists.debian.org, Martin Schulze <joey@debian.org>:
Bug#519209; Package manpages.
(Wed, 11 Mar 2009 10:48:04 GMT).
Acknowledgement sent
to Hugo Herbelin <hugo.herbelin@inria.fr>:
Extra info received and forwarded to list. Copy sent to Martin Schulze <joey@debian.org>.
(Wed, 11 Mar 2009 10:48:04 GMT).
Message #21 received at 519209@bugs.debian.org:
Thanks for your detailed answer.

> As mentioned above, I think this is better done by way of a preprocessor
> encoding declaration, assuming the fixes I've applied upstream and
> intend to backport to Debian. That's neater than creating new
> directories for a small number of pages.

Wonderful. I just tried the new version 2.5.4-2 and, if I (manually)
pre-patch the iso_8859-* man pages with a "-*- coding: ... -*-" header,
it works as expected. Using the "coding" declaration is indeed simpler
than creating a new directory for just a very few files.

I had tried this solution with 2.5.4-1 before but without being able
to make it work (I failed to notice that zsoelim was actually
responsible for this not working).

Thanks very much.

Hugo Herbelin
Information forwarded
to debian-bugs-dist@lists.debian.org, Martin Schulze <joey@debian.org>:
Bug#519209; Package manpages.
(Wed, 11 Mar 2009 18:18:06 GMT).
Acknowledgement sent
to mtk.manpages@gmail.com:
Extra info received and forwarded to list. Copy sent to Martin Schulze <joey@debian.org>.
(Wed, 11 Mar 2009 18:18:06 GMT).
Message #26 received at 519209@bugs.debian.org:
[As noted earlier, I was updating the wrong bug (517074) with my
previous emails -- now I'll reforward my mails, and Colin's reply, to
the right bug.]

---------- Forwarded message ----------
From: Michael Kerrisk <mtk.manpages@googlemail.com>
Date: Wed, Mar 11, 2009 at 1:40 PM
Subject: Bug#517074: manpages: state encoding of iso-8859-* pages
To: 517074@bugs.debian.org
Cc: Colin Watson <cjwatson@debian.org>

Hello Colin,

(I'm the upstream manpages maintainer, and I'm going to defer totally
to your judgement on what needs to be done here.)

In this report I see:

[[
> * Better solutions *
>
> In a second step, I tried to move the page iso_8859-* to a directory
> whose name tells what the encoding is (I typically move the
> iso_8859-15 page to a directory named "en.ISO8859-15/man7"). The pipeline
> seems to become better as we now obtain:

This is one approach, but a cleaner one would be to change the first
line of iso-8859-15.7.gz to:

'\" t -*- coding: ISO-8859-15 -*-
]]

It looks like this is the only piece that applies for the man-pages
maintainer, right?

Am I correct to assume that there should be analogous changes in all
of the other iso_8859-*.7 pages, so that each such page specifies its
specific locale at the top?

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
git://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
man-pages online: http://www.kernel.org/doc/man-pages/online_pages.html
Found a bug? http://www.kernel.org/doc/man-pages/reporting_bugs.html
Information forwarded
to debian-bugs-dist@lists.debian.org, Martin Schulze <joey@debian.org>:
Bug#519209; Package manpages.
(Wed, 11 Mar 2009 18:18:07 GMT).
Acknowledgement sent
to mtk.manpages@gmail.com:
Extra info received and forwarded to list. Copy sent to Martin Schulze <joey@debian.org>.
(Wed, 11 Mar 2009 18:18:07 GMT).
Message #31 received at 519209@bugs.debian.org:
---------- Forwarded message ----------
From: Colin Watson <cjwatson@debian.org>
Date: Wed, Mar 11, 2009 at 11:15 PM
Subject: Bug#517074: manpages: state encoding of iso-8859-* pages
To: mtk.manpages@gmail.com
Cc: 517074@bugs.debian.org

On Wed, Mar 11, 2009 at 01:40:41PM +1300, Michael Kerrisk wrote:
> (I'm the upstream manpages maintainer, and I'm going to defer totally
> to your judgement on what needs to be done here.)

:-)

> In this report I see:
>
> [[
> > * Better solutions *
> >
> > In a second step, I tried to move the page iso_8859-* to a directory
> > whose name tells what the encoding is (I typically move the
> > iso_8859-15 page to a directory named "en.ISO8859-15/man7"). The pipeline
> > seems to become better as we now obtain:
>
> This is one approach, but a cleaner one would be to change the first
> line of iso-8859-15.7.gz to:
>
> '\" t -*- coding: ISO-8859-15 -*-
> ]]
>
> It looks like this is the only piece that applies for the man-pages
> maintainer, right?

That's correct.

There is one small downside: versions of man-db before 2.4.4 will
misparse this because they don't know to stop at the first space after
the "t", and spew some error messages but otherwise behave correctly.
This was released in 2007, though, and I don't think that there will be
any distributions that take the new manpages but don't take the new
man-db. (The old version of the alternative 'man' package that I have
lying around from 2001 doesn't suffer from this problem, so I think it
must be OK.)

> Am I correct to assume that there should be analogous changes in all
> of the other iso_8859-*.7 pages, so that each such page specifies its
> specific locale at the top?

Yes.

For your reference, here's a potted summary of the rules that reasonably
recent versions of man-db apply to manual page encoding. (man is much
less tolerant here; as I understand it, it basically has to be
configured to expect a single encoding for any given directory tree and
can't really cope with anything more complicated, so if I were you I'd
leave that problem to distributions shipping it.)

 * man-db will always attempt to decode a page as UTF-8 before anything
   else, since in practice text is only going to successfully decode as
   UTF-8 if it actually is UTF-8.

 * Explicitly declared encodings are used next if available, whether
   they're explicitly declared in a preprocessor line as above, or by
   means of installing the manual page into a directory such as
   /usr/share/man/en_GB.ISO-8859-15.

 * Every manual page hierarchy has a default legacy encoding which is
   tried next, usually that in which the vast majority of historical
   pages are encoded. For English, of course, that's ISO-8859-1.

 * Unless groff 1.20 is available, man-db effectively recodes the page
   to the legacy encoding before feeding it to groff, since older
   versions of groff can't deal with UTF-8 input. (The patches I just
   applied as a result of this bug cause man-db to only do this for
   UTF-8 pages, since recoding between legacy encodings is generally a
   mug's game.)

In practice, this means that even if you encode your pages in UTF-8 you
can only use those characters available in the appropriate legacy
encoding; anything else will at best be approximated, perhaps badly.

I haven't yet been encouraging upstream maintainers to switch to
shipping manual pages in UTF-8 because I think there are still a number
of distributions that would have trouble dealing with that (although
most of the major distributions have switched, including Debian).
Declaring an explicit encoding for anything that isn't ISO-8859-1 or
UTF-8 is a good middle ground for the moment.
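The first three rules above amount to an ordered list of decoding attempts. A minimal Python sketch of that ordering, with a hypothetical function name and signature rather than man-db's own code, could look like this:

```python
from typing import Optional, Tuple


def decode_page(raw: bytes, declared: Optional[str],
                default_legacy: str) -> Tuple[str, str]:
    """Return (text, encoding_used), trying encodings in rule order.

    1. UTF-8 first: arbitrary legacy text rarely decodes as valid UTF-8.
    2. An explicitly declared encoding (preprocessor line or directory).
    3. The hierarchy's default legacy encoding.
    """
    candidates = ["UTF-8"]
    if declared:
        candidates.append(declared)
    candidates.append(default_legacy)

    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Invalid bytes for this encoding, or an unknown encoding name.
            continue
    raise ValueError("page could not be decoded with any candidate")


# A lone 0xA4 byte is not valid UTF-8, so the declared encoding is used.
text, used = decode_page(b"\xa4", "ISO-8859-15", "ISO-8859-1")
print(used)
```

The fourth rule, recoding for older groff versions, is a separate step that happens after the page has been decoded and is not covered by this sketch.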
--
Colin Watson [cjwatson@debian.org]

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
git://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
man-pages online: http://www.kernel.org/doc/man-pages/online_pages.html
Found a bug? http://www.kernel.org/doc/man-pages/reporting_bugs.html
Information forwarded
to debian-bugs-dist@lists.debian.org, Martin Schulze <joey@debian.org>:
Bug#519209; Package manpages.
(Wed, 11 Mar 2009 18:21:03 GMT).
Acknowledgement sent
to mtk.manpages@gmail.com:
Extra info received and forwarded to list. Copy sent to Martin Schulze <joey@debian.org>.
(Wed, 11 Mar 2009 18:21:03 GMT).
Message #36 received at 519209@bugs.debian.org:
---------- Forwarded message ----------
From: Michael Kerrisk <mtk.manpages@googlemail.com>
Date: Thu, Mar 12, 2009 at 7:00 AM
Subject: Re: Bug#517074: manpages: state encoding of iso-8859-* pages
To: Colin Watson <cjwatson@debian.org>, 517074@bugs.debian.org
Cc: edimitro@tee.gr
Hello Colin,
On Wed, Mar 11, 2009 at 11:15 PM, Colin Watson <cjwatson@debian.org> wrote:
> On Wed, Mar 11, 2009 at 01:40:41PM +1300, Michael Kerrisk wrote:
>> (I'm the upstream manpages maintainer, and I'm going to defer totally
>> to your judgement on what needs to be done here.)
>
> :-)
>
>> In this report I see:
>>
>> [[
>> > * Better solutions *
>> >
>> > In a second step, I tried to move the page iso_8859-* to a directory
>> > whose name tells what the encoding is (I typically move the
>> > iso_8859-15 page to a directory named "en.ISO8859-15/man7"). The pipeline
>> > seems to become better as we now obtain:
>>
>> This is one approach, but a cleaner one would be to change the first
>> line of iso-8859-15.7.gz to:
>>
>> '\" t -*- coding: ISO-8859-15 -*-
>> ]]
>>
>> It looks like this is the only piece that applies for the man-pages
>> maintainer, right?
>
> That's correct.
>
> There is one small downside: versions of man-db before 2.4.4 will
> misparse this because they don't know to stop at the first space after
> the "t", and spew some error messages but otherwise behave correctly.
> This was released in 2007, though, and I don't think that there will be
> any distributions that take the new manpages but don't take the new
> man-db. (The old version of the alternative 'man' package that I have
> lying around from 2001 doesn't suffer from this problem, so I think it
> must be OK.)
Thanks for the info.
>> Am I correct to assume that there should be analogous changes in all
>> of the other iso_8859-*.7 pages, so that each such page specifies its
>> specific locale at the top?
>
> Yes.
Okay -- I've made the suggested changes for
iso_8859-{2,3,4,5,6,7,8,9,10,11,13,14,15,16}.7. The changes will be
in upstream man-pages-3.20.
Since a couple of releases ago (man-pages-3.18, downloadable at
http://www.kernel.org/doc/man-pages/ ), we have a few other character
set pages in section 7. These are:
armscii-8.7 (Armenian SCII)
cp1251.7
koi8-r.7 (Russian Net CS)
koi8-u.7 (Ukrainian Net CS)
Are analogous changes also needed for these?
> For your reference, here's a potted summary of the rules that reasonably
<snip>
Thanks for that detailed explanation!
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
git://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
man-pages online: http://www.kernel.org/doc/man-pages/online_pages.html
Found a bug? http://www.kernel.org/doc/man-pages/reporting_bugs.html
Information forwarded
to debian-bugs-dist@lists.debian.org, Martin Schulze <joey@debian.org>:
Bug#519209; Package manpages.
(Wed, 11 Mar 2009 22:12:46 GMT).
Acknowledgement sent
to edimitro@tee.gr:
Extra info received and forwarded to list. Copy sent to Martin Schulze <joey@debian.org>.
(Wed, 11 Mar 2009 22:12:46 GMT).
Message #41 received at 519209@bugs.debian.org:
On Wednesday 11 March 2009 23:07:30, Michael Kerrisk wrote:
> Lefteris,
>
> Sorry -- this was my fault (I started replying to the wrong bug number)
> -- could you please resend your message to the bug 519209?
>
> Cheers,
>
> Michael
>
> On Thu, Mar 12, 2009 at 10:04 AM, Lefteris Dimitroulakis
> <edimitro@tee.gr> wrote:
> > On Wednesday 11 March 2009 20:00:08, Michael Kerrisk wrote:
> >> Hello Colin,
> >>
> >> On Wed, Mar 11, 2009 at 11:15 PM, Colin Watson <cjwatson@debian.org> wrote:
> >> > On Wed, Mar 11, 2009 at 01:40:41PM +1300, Michael Kerrisk wrote:
> >> >> (I'm the upstream manpages maintainer, and I'm going to defer totally
> >> >> to your judgement on what needs to be done here.)
> >> >
> >> > :-)
> >> >
> >> >> In this report I see:
> >> >>
> >> >> [[
> >> >> > * Better solutions *
> >> >> >
> >> >> > In a second step, I tried to move the page iso_8859-* to a directory
> >> >> > whose name tells what the encoding is (I typically move the
> >> >> > iso_8859-15 page to a directory named "en.ISO8859-15/man7"). The pipeline
> >> >> > seems to become better as we now obtain:
> >> >>
> >> >> This is one approach, but a cleaner one would be to change the first
> >> >> line of iso-8859-15.7.gz to:
> >> >>
> >> >> '\" t -*- coding: ISO-8859-15 -*-
> >> >> ]]
> >> >>
> >> >> It looks like this is the only piece that applies for the man-pages
> >> >> maintainer, right?
> > What about the translation of man pages?
> > Anyhow, for a discussion with similar content, please have
> > a look around here:
> > http://mail.nl.linux.org/linux-utf8/2005-07/msg00006.html
> >
> > regards
> > Lefteris
Information forwarded
to debian-bugs-dist@lists.debian.org, Martin Schulze <joey@debian.org>:
Bug#519209; Package manpages.
(Wed, 11 Mar 2009 23:42:03 GMT).
Acknowledgement sent
to Colin Watson <cjwatson@debian.org>:
Extra info received and forwarded to list. Copy sent to Martin Schulze <joey@debian.org>.
(Wed, 11 Mar 2009 23:42:03 GMT).
Message #46 received at 519209@bugs.debian.org:
On Thu, Mar 12, 2009 at 07:00:08AM +1300, Michael Kerrisk wrote:
> On Wed, Mar 11, 2009 at 11:15 PM, Colin Watson <cjwatson@debian.org> wrote:
> > On Wed, Mar 11, 2009 at 01:40:41PM +1300, Michael Kerrisk wrote:
> >> Am I correct to assume that there should be analogous changes in all
> >> of the other iso_8859-*.7 pages, so that each such page specifies its
> >> specific locale at the top?
> >
> > Yes.
>
> Okay -- I've made the suggested changes for
> iso_8859-{2,3,4,5,6,7,8,9,10,11,13,14,15,16}.7. The changes will be
> in upstream man-pages-3.20.
Great, thanks.
> Since a couple of releases ago (man-pages-3.18, downloadable at
> http://www.kernel.org/doc/man-pages/ ), we have a few other character
> set pages in section 7. These are:
>
> armscii-8.7 (Armenian SCII)
> cp1251.7
> koi8-r.7 (Russian Net CS)
> koi8-u.7 (Ukrainian Net CS)
>
> Are analogous changes also needed for these?
Yes. If you set the appropriate coding for these (simply the capitalised
versions of the page names) then almost everything works with the
current version of man-db in Debian unstable: I made the appropriate
changes to man-pages-3.19 and tested this.
The only thing that breaks is that input character codes 0x80-0x9f are
reserved for internal use by groff, so the corresponding ranges of the
CP1251, KOI8-R, and KOI8-U character sets fail to display; however, you
can't do any better than this with the current groff package in Debian,
and the rest of those character sets and all of ARMSCII-8 display
perfectly as far as I can tell.
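To get a feel for how much of each character set sits in that reserved range, here is a short Python sketch. It relies only on Python's standard codecs, which include CP1251, KOI8-R and KOI8-U but not ARMSCII-8, so ARMSCII-8 is left out; the function name is made up for the example.

```python
import unicodedata

# The input byte range that groff reserves for internal use.
RESERVED = bytes(range(0x80, 0xA0))


def visible_in_reserved_range(encoding: str) -> int:
    """Count bytes in 0x80-0x9f that map to a non-control character."""
    text = RESERVED.decode(encoding, errors="replace")
    return sum(
        1
        for ch in text
        # U+FFFD marks a byte the codec leaves undefined; category "Cc"
        # covers the C1 control characters.
        if ch != "\ufffd" and unicodedata.category(ch) != "Cc"
    )


for name in ("iso8859_1", "iso8859_15", "cp1251", "koi8_r", "koi8_u"):
    print(name, visible_in_reserved_range(name))
```

The ISO-8859 encodings only have control characters in this range, so nothing visible is lost for them, whereas CP1251, KOI8-R and KOI8-U assign letters or graphic symbols to almost all of these byte values.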
Once we switch to a version of groff with preconv support, this problem
will go away; indeed, I've tested with Debian man-db 2.5.4-2 and a local
build of groff 1.20.1, and the entirety of all four character ranges is
displayed perfectly.
Regards,
--
Colin Watson [cjwatson@debian.org]
Information forwarded
to debian-bugs-dist@lists.debian.org, Martin Schulze <joey@debian.org>:
Bug#519209; Package manpages.
(Thu, 12 Mar 2009 00:21:04 GMT).
Acknowledgement sent
to Colin Watson <cjwatson@debian.org>:
Extra info received and forwarded to list. Copy sent to Martin Schulze <joey@debian.org>.
(Thu, 12 Mar 2009 00:21:04 GMT).
Message #51 received at 519209@bugs.debian.org:
On Wed, Mar 11, 2009 at 11:04:05PM +0200, Lefteris Dimitroulakis wrote:
> What about the translation of man pages?
Translated manual pages will be installed in a different manual page
hierarchy with a different default legacy encoding. For example, man-db
will attempt to decode pages in /usr/share/man/el/ as UTF-8 as usual,
but will then fall back to ISO-8859-7, not ISO-8859-1.
Just like English manual pages, if a single page is in a different
encoding (neither the usual legacy encoding nor UTF-8), then it should
state its encoding at the top of the page. Obviously it would be
ridiculous to require this for all translated manual pages, which is why
we have default encodings.
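The decoding order described above (UTF-8 first, then the hierarchy's legacy default) can be sketched roughly as follows. This is a simplified Python illustration, not man-db's code: the table and function names are hypothetical, only the English and "el" entries are taken from this thread, and falling back to ISO-8859-1 for an unknown hierarchy is an assumption of the sketch.

```python
# Hypothetical per-hierarchy defaults. man-db keeps its real table in
# src/encodings.c; only these two entries are mentioned in this thread.
LEGACY_DEFAULTS = {"": "ISO-8859-1", "el": "ISO-8859-7"}

def decode_page(raw, lang=""):
    """Try UTF-8 first, then fall back to the hierarchy's legacy encoding."""
    try:
        return raw.decode("utf-8"), "UTF-8"
    except UnicodeDecodeError:
        # Assumption: unknown hierarchies fall back to ISO-8859-1.
        legacy = LEGACY_DEFAULTS.get(lang, "ISO-8859-1")
        return raw.decode(legacy), legacy

if __name__ == "__main__":
    # Byte 0xE1 is not valid UTF-8 on its own, so the "el" page
    # is decoded with the ISO-8859-7 fallback instead.
    print(decode_page(b"\xe1", lang="el"))
```

A page that declares its own encoding on the first line would bypass this guessing altogether.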
(Obviously, if the default encoding that man-db uses is just plain
wrong, please file a bug report on man-db.)
> Anyhow, for a discussion with similar content, please have
> a look around here:
> http://mail.nl.linux.org/linux-utf8/2005-07/msg00006.html
Yes, this discussion is not new to me. (Andries did a lot of good work
on man in the past, and in the post you quote he is mostly correct; I
think I am safe in saying that I have more recent experience of the
implementation work involved in this, though.)
Since that discussion, Bruno Haible implemented preconv in groff, which
can be used by any caller of groff to state an appropriate input
encoding.
Andries' point (2) is correct as far as I'm concerned. The correct
program to use to format arbitrary manual pages is /usr/bin/man, and it
can take care of passing appropriate groff options as necessary (indeed,
it often had to do so even before encodings became a concern; consider
preprocessors).
With regard to Andries' point (3A), man-db implements this differently,
and I would say rather better. The notion of a default device (latin1
vs. nippon vs. utf8) in man's configuration file is a fundamental
misdesign, probably from the days when these matters were less
well-understood. The device passed to groff -T needs to be determined
dynamically based on combining the page encoding with the locale, as
man-db does.
(3B) is a statement of fact and of course correct.
Regarding (3C), a charset file in each manual page hierarchy would be a
possibility, but I chose to put this in the man-db executable instead
(src/encodings.c). I think this is probably simpler for distributions,
which thus don't need to care about file ownership conflicts between
packages for this.
Regarding (4), I agree entirely that it is bad to require encoding
information, which is why man-db has the mechanism it has to try a
couple of possibilities with appropriate defaults. In my opinion,
though, Andries' objections to it being on the first line do not stand
up to close examination. Preprocessor directives and encoding
information can provably exist on the first line. The .so directive does
not need to be on the first line, except perhaps when there is nothing
else in the file, in which case this discussion is irrelevant since
manual page file names and hence .so arguments are always ASCII today.
(Let's not get into Unix file name encoding issues just now; if you
really wanted to do that, UTF-8 would be a rather more reasonable thing
to require than it is in the content of manual pages.) Andries says
"etc.", but I have never encountered anything else that needs to go on
the first line of a manual page. If this truly becomes a problem, you
can always use the alternative approach of installing the page into a
directory whose name explicitly states its encoding.
Andries makes a comment towards the end of his post that "The man
program [...] can react to the user's locale settings". He may have
meant that it can select the appropriate output device, which would be
fair enough. However, in case he meant that the user's locale could
influence the choice of encoding for the input file: any design that
infers the encoding of the contents of files from the user's locale
settings is fundamentally broken. It just doesn't work. If a user has
one terminal open with the locale set to el_GR (ISO-8859-7) and another
with the locale set to el_GR.UTF-8, then the same manual pages should be
displayable in both terminals and produce output encoded according to
what was requested by the locale, subject to necessary restrictions such
as whether characters are actually representable in each locale. The
manual pages are in a fixed encoding on disk and changing one's locale
does not magically change that; therefore the locale should not be a
deciding factor in how to read the manual page input. I fixed this
broken assumption in man-db in April 2003, and do not intend to revert
to the old incorrect algorithm.
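The separation argued for above can be shown with a small Python sketch: the input encoding is a property of the file, and the locale only selects the output encoding. The function name is hypothetical, and replacing unrepresentable characters with "?" is a simplification chosen for the example, not a description of what man-db or iconv actually emit.

```python
def render(raw, page_encoding, locale_charset):
    """Decode with the page's fixed encoding; encode for the terminal's locale."""
    text = raw.decode(page_encoding)  # never depends on the locale
    # Characters the locale cannot represent become '?' in this sketch.
    return text.encode(locale_charset, errors="replace")

if __name__ == "__main__":
    page = "\u03b1\u03b2\u03b3".encode("iso8859_7")  # bytes as stored on disk
    # The same on-disk bytes rendered for two differently configured terminals.
    print(render(page, "iso8859_7", "iso8859_7"))
    print(render(page, "iso8859_7", "utf-8"))
```

Both calls read the page identically; only the bytes written to the terminal differ.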
In any case, all of this is already implemented in man-db, so I don't
think it's necessary to spend more time going over old ground. Nowadays,
the simple rules for all manual pages (including translated ones) are:
* Generally you can just write your manual page in the appropriate
legacy encoding for your language. If man-db doesn't know what that
is, contact me and I'll fix it; the same probably goes for man.
* If you want to write it in UTF-8, that's fine too. (Systems using
something based on Debian's old groff package will lose any
characters not representable in the default legacy encoding, until
we get round to updating to groff 1.20. Don't lose too much sleep
over this.)
* If for some reason you need to write the page in something other
than these two encodings, declare it on the first line using either
of these two forms (depending on whether you also need
preprocessors):
'\" -*- coding: YOUR-ENCODING-HERE -*-
'\" t -*- coding: YOUR-ENCODING-HERE -*-
* Any distribution using a reasonably recent version of man-db will
get almost all of this right. If you declare a custom encoding, they
may need to grab the patch from Debian bug #519095 until such time
as I release man-db 2.5.5.
* My understanding is that distributions using man typically need to
recode all pages in a given hierarchy to the same encoding; I think
Fedora had a flag day where they switched everything over to UTF-8,
although that's a vague memory and I have not actually gone and
researched it. I think it would have been less effort to switch to
man-db instead, but it's their problem not mine. :-) You can
probably just leave these guys to figure it out, or lobby them to
use better software if they don't.
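For illustration, a first-line declaration in either of the two forms above could be recognised along these lines. This is a minimal Python sketch with hypothetical names; it handles only the two forms shown in the list and makes no claim about how man-db itself parses the line.

```python
import re

# Emacs-style "-*- coding: NAME -*-" tag, as used in the forms above.
CODING_TAG = re.compile(r"-\*-\s*coding:\s*([\w.-]+)\s*-\*-")

def declared_encoding(first_line):
    """Return the encoding named on a page's first line, or None."""
    # The line must be a roff comment starting with: '\"
    if not first_line.startswith("'\\\""):
        return None
    match = CODING_TAG.search(first_line)
    return match.group(1) if match else None

if __name__ == "__main__":
    print(declared_encoding("'\\\" t -*- coding: ISO-8859-15 -*-"))
```

A line carrying only preprocessor letters, or an ordinary first line such as a .TH request, yields None, in which case the default encodings apply.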
Cheers,
--
Colin Watson [cjwatson@debian.org]
Information forwarded
to debian-bugs-dist@lists.debian.org, Martin Schulze <joey@debian.org>:
Bug#519209; Package manpages.
(Thu, 12 Mar 2009 01:12:02 GMT).
Acknowledgement sent
to mtk.manpages@gmail.com:
Extra info received and forwarded to list. Copy sent to Martin Schulze <joey@debian.org>.
(Thu, 12 Mar 2009 01:12:02 GMT).
Message #56 received at 519209@bugs.debian.org:
Colin,
On Thu, Mar 12, 2009 at 12:40 PM, Colin Watson <cjwatson@debian.org> wrote:
> On Thu, Mar 12, 2009 at 07:00:08AM +1300, Michael Kerrisk wrote:
>> On Wed, Mar 11, 2009 at 11:15 PM, Colin Watson <cjwatson@debian.org> wrote:
>> > On Wed, Mar 11, 2009 at 01:40:41PM +1300, Michael Kerrisk wrote:
>> >> Am I correct to assume that there should be analogous changes in all
>> >> of the other iso_8859-*.7 pages, so that each such page specifies its
>> >> specific locale at the top?
>> >
>> > Yes.
>>
>> Okay -- I've made the suggested changes for
>> iso_8859-{2,3,4,5,6,7,8,9,10,11,13,14,15,16}.7. The changes will be
>> in upstream man-pages-3.20.
>
> Great, thanks.
>
>> Since a couple of releases ago (man-pages-3.18, downloadable at
>> http://www.kernel.org/doc/man-pages/ ), we have a few other character
>> set pages in section 7. These are:
>>
>> armscii-8.7 (Armenian SCII)
>> cp1251.7
>> koi8-r.7 (Russian Net CS)
>> koi8-u.7 (Ukrainian Net CS)
>>
>> Are analogous changes also needed for these?
>
> Yes. If you set the appropriate coding for these (simply the capitalised
> versions of the page names) then almost everything works with the
> current version of man-db in Debian unstable: I made the appropriate
> changes to man-pages-3.19 and tested this.
>
> The only thing that breaks is that input character codes 0x80-0x9f are
> reserved for internal use by groff, so the corresponding ranges of the
> CP1251, KOI8-R, and KOI8-U character sets fail to display; however, you
> can't do any better than this with the current groff package in Debian,
> and the rest of those character sets and all of ARMSCII-8 display
> perfectly as far as I can tell.
>
> Once we switch to a version of groff with preconv support, this problem
> will go away; indeed, I've tested with Debian man-db 2.5.4-2 and a local
> build of groff 1.20.1, and the entirety of all four character ranges is
> displayed perfectly.
I AM IN AWE of the thoroughness of your testing and written responses
on this bug. It is so helpful! Thank you.
I've fixed the other 4 pages (again, to appear in man-pages-3.20).
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
git://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
man-pages online: http://www.kernel.org/doc/man-pages/online_pages.html
Found a bug? http://www.kernel.org/doc/man-pages/reporting_bugs.html
Information forwarded
to debian-bugs-dist@lists.debian.org, Martin Schulze <joey@debian.org>:
Bug#519209; Package manpages.
(Thu, 12 Mar 2009 13:51:02 GMT).
Acknowledgement sent
to edimitro@tee.gr:
Extra info received and forwarded to list. Copy sent to Martin Schulze <joey@debian.org>.
(Thu, 12 Mar 2009 13:51:03 GMT).
Message #61 received at 519209@bugs.debian.org:
On Thursday 12 March 2009 02:20:38 you wrote:
> On Wed, Mar 11, 2009 at 11:04:05PM +0200, Lefteris Dimitroulakis wrote:
> > What about the translation of man pages?
>
[snip all the good stuff from here...]
> > Anyhow, for a discussion with similar content, please have
> > a look around here:
> > http://mail.nl.linux.org/linux-utf8/2005-07/msg00006.html
> >
[...and from here]
>
Many thanks Colin. It will take me quite some time to digest all the
info you kindly offered here.
> Cheers,
>
Regards
lefteris
Reply sent
to Joey Schulze <joey@infodrom.org>:
You have taken responsibility.
(Sun, 26 Apr 2009 10:12:11 GMT).
Notification sent
to Hugo Herbelin <Hugo.Herbelin@inria.fr>:
Bug acknowledged by developer.
(Sun, 26 Apr 2009 10:12:11 GMT).
Message #66 received at 519209-close@bugs.debian.org:
Source: manpages
Source-Version: 3.20-1
We believe that the bug you reported is fixed in the latest version of
manpages, which is due to be installed in the Debian FTP archive:
manpages-dev_3.20-1_all.deb
to pool/main/m/manpages/manpages-dev_3.20-1_all.deb
manpages_3.20-1.diff.gz
to pool/main/m/manpages/manpages_3.20-1.diff.gz
manpages_3.20-1.dsc
to pool/main/m/manpages/manpages_3.20-1.dsc
manpages_3.20-1_all.deb
to pool/main/m/manpages/manpages_3.20-1_all.deb
manpages_3.20.orig.tar.gz
to pool/main/m/manpages/manpages_3.20.orig.tar.gz
A summary of the changes between this version and the previous one is
attached.
Thank you for reporting the bug, which will now be closed. If you
have further comments please address them to 519209@bugs.debian.org,
and the maintainer will reopen the bug report if appropriate.
Debian distribution maintenance software
pp.
Joey Schulze <joey@infodrom.org> (supplier of updated manpages package)
(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing ftpmaster@debian.org)
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Format: 1.8
Date: Sun, 26 Apr 2009 11:36:34 +0200
Source: manpages
Binary: manpages manpages-dev
Architecture: source all
Version: 3.20-1
Distribution: unstable
Urgency: low
Maintainer: Martin Schulze <joey@debian.org>
Changed-By: Joey Schulze <joey@infodrom.org>
Description:
manpages - Manual pages about using a GNU/Linux system
manpages-dev - Manual pages about using GNU/Linux for development
Closes: 516677 517074 517485 519209 519230 520904
Changes:
manpages (3.20-1) unstable; urgency=low
.
* New upstream version
. Add explicit character set encoding to first line of several
manpages (closes: Bug#519209)
. Fix type of 'offset' argument in seekdir(3) and return type in
telldir (closes: Bug#519230)
. Small fix to description in strftime(3) (closes: Bug#516677)
. Fix 'argp' type for KDGETLED description in console_ioctl(4)
(closes: Bug#517485)
. Add description of /srv in hier(7) (closes: Bug#520904)
. Fix types used to declare sin6_family and sin6_port in ipv6(7)
(closes: Bug#517074)
* Corrected fclose(3)
Checksums-Sha1:
03fa98f500d63bad215d1aa54ea840b1629157cf 964 manpages_3.20-1.dsc
e85794d5f613f8d70ac83a92a626eb5ca91465af 1590787 manpages_3.20.orig.tar.gz
4351e33a2116f749181a2386bace0dbe9d8cff00 48327 manpages_3.20-1.diff.gz
a32965aaa7633303c2e96328ffaac652d6705cfa 713822 manpages_3.20-1_all.deb
4b5211c2297342a64a420bd24e9883c40490919a 1555264 manpages-dev_3.20-1_all.deb
Checksums-Sha256:
74704b80b6472549572e3903f850ebd6b73f2576e5cb9daf9d4cb81f6cffea26 964 manpages_3.20-1.dsc
4351a0537c7d05f23e1c17f99b62e4b75d1d81a9c99347821bca0bbb5794ed09 1590787 manpages_3.20.orig.tar.gz
010a879e44692acea4d7ef103ad0e340863799d2511d14a8f86e7a518f68e9c4 48327 manpages_3.20-1.diff.gz
75a576115a5ccd8f8624171507ab736d831161b0b850ab7fd9435d62c8d91c43 713822 manpages_3.20-1_all.deb
f84b168719f293efaccfc379bd0f07313efe23b7b9f732ef01fab409386b8a65 1555264 manpages-dev_3.20-1_all.deb
Files:
676509b49bc7740897adb400d2c8c802 964 doc important manpages_3.20-1.dsc
2df4c07be521ef7c2579eeaae677ad8c 1590787 doc important manpages_3.20.orig.tar.gz
e7791092c3a2b721b36f75ca21a4d23a 48327 doc important manpages_3.20-1.diff.gz
18d26a0f0bd2b5c8eed4aeeba3421abd 713822 doc important manpages_3.20-1_all.deb
696caf409bc48e9193d395f08ae4696f 1555264 doc optional manpages-dev_3.20-1_all.deb
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
iD8DBQFJ9CvOW5ql+IAeqTIRArxjAKCFOzN3tWq2Ajij6sjBryP+Ne4i9ACfQ6dy
4zmFH3XqZqqRy2gbVfu6E+w=
=/jco
-----END PGP SIGNATURE-----
Bug archived.
Request was from Debbugs Internal Request <owner@bugs.debian.org>
to internal_control@bugs.debian.org.
(Mon, 25 May 2009 07:30:23 GMT).