Debian Bug report logs - #959474
Issues with Chinese language (all variants) when building some pages in buster

Package: www.debian.org; Maintainer for www.debian.org is Debian WWW Team <debian-www@lists.debian.org>;

Reported by: Laura Arjona Reina <larjona@debian.org>

Date: Sat, 2 May 2020 18:45:01 UTC

Severity: normal



Report forwarded to debian-bugs-dist@lists.debian.org, debian-i18n@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Sat, 02 May 2020 18:45:03 GMT).


Acknowledgement sent to Laura Arjona Reina <larjona@debian.org>:
New Bug report received and forwarded. Copy sent to debian-i18n@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>. (Sat, 02 May 2020 18:45:03 GMT).


Message #5 received at submit@bugs.debian.org:

From: Laura Arjona Reina <larjona@debian.org>
To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: Issues with Chinese language (all variants) when building some pages in buster
Date: Sat, 2 May 2020 20:43:38 +0200
Package: www.debian.org
Severity: normal
User: www.debian.org@packages.debian.org
Usertags: scripts
X-Debbugs-CC: debian-l10n-chinese@lists.debian.org
X-Debbugs-CC: debian-i18n@lists.debian.org

Hi all,

TL;DR

Some Chinese pages have issues when they are built on a
buster machine.
We need to fix those issues (at least the "Malformed UTF-8 character
[...] at ../../bin/tocn.pl [...]" ones) so DSA can upgrade the
www-master machine to buster. See the extract of the build log at the
bottom to find out which files produce this error.
I have no idea how to fix the issues, so any help from the Chinese
team or web team mates is greatly appreciated.
Additional issues may arise (e.g. I haven't tested the release-notes
or doc-manual yet). Any help with testing is welcome too; please create
bug reports for each different issue or update the existing ones. Thanks!

LONG VERSION

I've done a test build of the /english and /chinese subdirs on a buster
machine, and I have noticed some warnings/errors related to some (not
all) of the Chinese pages.

It would be desirable to upgrade the www-master machine to buster as soon
as possible, so any help with this (from website or Chinese team members)
is much appreciated.

Below you can find an extract of the build log, including only the
files for which I got some error or warning message.

After the build, I have compared the problematic HTML files of a build
in stretch and a build in buster with a diff tool, to see if there were
significant changes in the html output due to these issues.

Here are my results:

* For the messages of the type ", [zh_TW]Invalid UTF8: " when building,
I did not notice any difference between the output of a stretch build and
the output of a buster build.

I would say this is not a blocker for the buster upgrade of www-master.

* For the messages of the type "Malformed UTF-8 character [...] at
../../bin/tocn.pl [...]" I have seen important changes in the HTML diff,
I think the output in the stretch build is totally broken (fortunately,
there are not many files in that situation).

I would say this is a blocker for the buster upgrade of www-master, but
I would prefer somebody from the Chinese team to confirm (by building
those files on a buster machine and reviewing the output).

Additional notes:

* I have only tested the wml build, not the rest of the cron scripts
that run on www-master. I will try to do that in the following days, but
if you already know of any that work well (e.g. release-notes,
doc-manuals...) just tell me so I can skip them.

* When I build files on my machines, something is wrong in my
environment so that the .po files are not always integrated; for
example, the Chinese pages I build show the menus and footnote in
English. Therefore, if there is any issue with the encoding of the .po
files themselves, I guess I cannot detect it until I fix my particular
issue :/

* The local build that I make uses the SAMPLE_FILES that are needed in
some folders, so additional issues may arise when we use the actual
files that are generated at runtime in the often and lessoften cron jobs.

That's all for now, I think. Thanks for your patience reading and for
your help!

Kind regards,
-- 
Laura Arjona Reina
https://wiki.debian.org/LauraArjona


--- extract of the build log file

/chinese

Processing
donations.wml:
[zh_CN]Invalid UTF8:
ïŒŒç‚¹å‡»â€œæ·»åŠ åˆ°èŽ­ç‰©èœŠâ€ïŒŒç„¶åŽå®Œæˆå‰©äœ™è¿‡çš‹ã€‚
, [zh_TW]Invalid UTF8:
ïŒŒç‚¹å‡»â€œæ·»åŠ åˆ°èŽ­ç‰©èœŠâ€ïŒŒç„¶åŽå®Œæˆå‰©äœ™è¿‡çš‹ã€‚
, [zh_HK]Invalid UTF8:
ïŒŒç‚¹å‡»â€œæ·»åŠ åˆ°èŽ­ç‰©èœŠâ€ïŒŒç„¶åŽå®Œæˆå‰©äœ™è¿‡çš‹ã€‚
.

make[1]: Entering directory '/webwml/chinese/Bugs'
Processing Reporting.wml: [zh_CN]Invalid UTF8:
°äž€æ¬¡ç€ºäŸ‹äŒšè¯çš„过皋。</li>
, [zh_TW]Invalid UTF8: °äž€æ¬¡ç€ºäŸ‹äŒšè¯çš„过皋。</li>
, [zh_HK]Invalid UTF8: °äž€æ¬¡ç€ºäŸ‹äŒšè¯çš„过皋。</li>
.

make[2]: Entering directory '/webwml/chinese/News/2000'

Processing 20000815.wml:
[zh_CN]Invalid UTF8: µ·å€–朋友的錎力協助包括
, [zh_TW]Invalid UTF8: µ·å€–朋友的錎力協助包括
, [zh_HK]Invalid UTF8: µ·å€–朋友的錎力協助包括
.

make[2]: Entering directory '/webwml/chinese/News/2009'
Processing 20090214.wml: [zh_CN]Invalid UTF8: šSun SPARC (sparc)、
, [zh_TW]Invalid UTF8: šSun SPARC (sparc)、
, [zh_HK]Invalid UTF8: šSun SPARC (sparc)、
.

make[2]: Entering directory '/webwml/chinese/News/weekly'

copying index.zh-cn.html to ../../../../www/News/weekly/./2002/48
Processing index.wml: [zh_CN]Malformed UTF-8 character (unexpected end
of string) in substitution (s///) at ../../bin/tocn.pl line 13, <> line 146.
Malformed UTF-8 character (unexpected end of string) in substitution
(s///) at ../../bin/tocn.pl line 15, <> line 146.
panic: do_trans_simple_utf8 line 362 at ../../bin/tocn.pl line 20, <>
line 146.
, [zh_TW]Invalid UTF8: å‘
, [zh_HK]Invalid UTF8: å‘
.
copying index.zh-cn.html to ../../../../www/News/weekly/./2002/49

copying index.zh-cn.html to ../../../../www/News/weekly/./2003/09
Processing index.wml: [zh_CN]Invalid UTF8: –‡æª”描述了埞安裝
, [zh_TW]Invalid UTF8: –‡ä»¶æè¿°äº†åŸžå®‰è£
, [zh_HK]Invalid UTF8: –‡ä»¶æè¿°äº†åŸžå®‰è£
.
copying index.zh-cn.html to ../../../../www/News/weekly/./2003/10
Processing index.wml: [zh_CN]Invalid UTF8: ˆ‘们的<a
href="../../../../events/talks">挔讲页面</a>来吗
, [zh_TW]Invalid UTF8: ˆ‘们的<a
href="../../../../events/talks">挔讲页面</a>来吗
, [zh_HK]Invalid UTF8: ˆ‘们的<a
href="../../../../events/talks">挔讲页面</a>来吗
.
copying index.zh-cn.html to ../../../../www/News/weekly/./2012/15

make[1]: Entering directory '/webwml/chinese/devel'

Processing
testing.wml:
[zh_CN],
[zh_TW]Invalid
UTF8: ˆ°äº† 4
個䞍打算曎新的軟件包因爲它們會砎壞䟝賎。<q>(0)</q> 是無
, [zh_HK]Invalid
UTF8: ˆ°äº† 4
個䞍打算曎新的軟件包因爲它們會砎壞䟝賎。<q>(0)</q> 是無
.

make[2]: Entering directory '/webwml/chinese/devel/join'
Processing index.wml: [zh_CN]Malformed UTF-8 character: \xe9\x98\x0a
(unexpected non-continuation byte 0x0a, 2 bytes after start byte 0xe9;
need 3 bytes, got 2) in substitution (s///) at ../../bin/tocn.pl line
108, <> line 52.
, [zh_TW], [zh_HK].
copying index.zh-cn.html to ../../../../www/devel/join
copying index.zh-hk.html to ../../../../www/devel/join
copying index.zh-tw.html to ../../../../www/devel/join

make[1]: Entering directory '/webwml/chinese/international'
Processing index.wml: [zh_CN]Malformed UTF-8 character: \xe9\x98\x0a
(unexpected non-continuation byte 0x0a, 2 bytes after start byte 0xe9;
need 3 bytes, got 2) in substitution (s///) at ../bin/tocn.pl line 108,
<> line 89.
, [zh_TW]Invalid UTF8: …皋序
, [zh_HK]Invalid UTF8: …皋序
.

make[2]: Entering directory '/webwml/chinese/international/Chinese'

Processing thanks.wml: [zh_CN]Invalid UTF8: «™é»žçš„æœ‹å‹
, [zh_TW]Invalid UTF8: «™é»žçš„æœ‹å‹
, [zh_HK]Invalid UTF8: «™é»žçš„æœ‹å‹
.

make[1]: Entering directory '/webwml/chinese/intro'
Processing about.wml: [zh_CN], [zh_TW], [zh_HK]panic: swash_fetch got
swatch of unexpected bit width, slen=512, needents=64 at ../bin/tohk.pl
line 131, <> line 95.
.

make -C legal install
make[1]: Entering directory '/webwml/chinese/legal'
Processing index.wml: [zh_CN]Malformed UTF-8 character: \xe9\x98\x0a
(unexpected non-continuation byte 0x0a, 2 bytes after start byte 0xe9;
need 3 bytes, got 2) in substitution (s///) at ../bin/tocn.pl line 108,
<> line 68.
, [zh_TW], [zh_HK].
copying index.zh-cn.html to ../../../www/legal
copying index.zh-hk.html to ../../../www/legal
copying index.zh-tw.html to ../../../www/legal

make[1]: Entering directory '/webwml/chinese/releases'

Processing proposed-updates.wml: [zh_CN],
[zh_TW]Invalid UTF8: ‰èƒœæœ€çµ‚到達 proposed-updates
, [zh_HK]Invalid UTF8: ‰èƒœæœ€çµ‚到達 proposed-updates
.

make[2]: Entering directory '/webwml/chinese/releases/hamm'
Processing HOWTO.upgrade.wml: [zh_CN], [zh_TW]Malformed UTF-8 character:
\xe5\x8c\x0a (unexpected non-continuation byte 0x0a, 2 bytes after start
byte 0xe5; need 3 bytes, got 2) in substitution (s///) at
../../bin/totw.pl line 111, <> line 71.
, [zh_HK].
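
The "need 3 bytes, got 2" wording in the tocn.pl and totw.pl messages above follows from how a UTF-8 lead byte announces the length of its sequence. What follows is a minimal sketch in Python, not part of the webwml scripts; the helper names `expected_length` and `bytes_present` are hypothetical and exist only for this illustration.

```python
def expected_length(lead: int) -> int:
    """Number of bytes a UTF-8 sequence should have, judged by its lead byte."""
    if lead < 0x80:
        return 1          # plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"0x{lead:02x} is not a valid UTF-8 lead byte")


def bytes_present(data: bytes) -> int:
    """Count the lead byte plus the continuation bytes (0x80..0xBF) that follow it."""
    count = 1
    for b in data[1:expected_length(data[0])]:
        if not 0x80 <= b <= 0xBF:
            break         # e.g. a newline arrived before the sequence was complete
        count += 1
    return count


sample = bytes([0xE9, 0x98, 0x0A])   # the sequence quoted in the build log
print(f"need {expected_length(sample[0])} bytes, got {bytes_present(sample)}")
```

In other words, the log entries describe a three-byte character that reached the conversion script with its last byte missing.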



Information forwarded to debian-bugs-dist@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Sun, 03 May 2020 21:00:04 GMT).


Acknowledgement sent to Holger Wansing <hwansing@mailbox.org>:
Extra info received and forwarded to list. Copy sent to Debian WWW Team <debian-www@lists.debian.org>. (Sun, 03 May 2020 21:00:04 GMT).


Message #10 received at 959474@bugs.debian.org:

From: Holger Wansing <hwansing@mailbox.org>
To: Laura Arjona Reina <larjona@debian.org>, 959474@bugs.debian.org
Cc: debian-l10n-chinese@lists.debian.org, debian-i18n@lists.debian.org
Subject: Re: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster
Date: Sun, 3 May 2020 22:57:39 +0200
[Message part 1 (text/plain, inline)]
Hi,

Laura Arjona Reina <larjona@debian.org> wrote:
> There are some issues with some Chinese pages when they are built in a
> buster machine.
> We need to fix those issues (at least the "Malformed UTF-8 character
> [...] at ../../bin/tocn.pl [...]" ones) so DSA can upgrade the
> www-master machine to buster. See the summary of the log at the bottom
> to know which files produce this error.
> I have no idea of how to fix the issues, so any help from the Chinese
> team or web team mates is greatly appreciated.
> Additional issues may arise (e.g. I still didn't test the release-notes
> or doc-manual), any help testing is welcome too, please create bug
> reports for each different issue or update the existing ones. Thanks!
> 
> LONG VERSION
> 
> I've done a test build of the /english and /chinese subdirs in a buster
> machine, and I have noticed some warnings/errors related to the Chinese
> pages (some, not all of them).
> 
> It would be desirable to upgrade www-master machine to buster as soon as
> possible, so any help with this (from website or Chinese team members)
> is much appreciated.
> 
> Below you can find an extract of the build log, including only the
> files for which I got some error or warning message.
> 
> After the build, I have compared the problematic HTML files of a build
> in stretch and a build in buster with a diff tool, to see if there were
> significant changes in the html output due to these issues.
> 
> Here are my results:
> 
> * For the messages of the type ", [zh_TW]Invalid UTF8: " when building,
> I couldn't note any difference between the output of a stretch build and
> the output of a buster build.
> 
> I would say this is not a blocker for the buster upgrade of www-master.

I don't know what I did differently from Laura, but here some of the built HTML
files with "Invalid UTF8: ... " messages are lacking much of their content, compared
to the ones currently on www-master.
So maybe those messages are serious as well.

> * For the messages of the type "Malformed UTF-8 character [...] at
> ../../bin/tocn.pl [...]" I have seen important changes in the HTML diff,
> I think the output in the stretch build is totally broken (fortunately,
> there are not many files in that situation).
> 
> I would say this is a blocker for the buster upgrade of www-master, but
> I would prefer somebody of the Chinese team to confirm (try to build
> those files in a buster machine, and review the output).

Maybe someone from the Chinese team can solve this, but if not, I want
to propose a possible (temporary) solution:

If I delete the files below from the webwml/chinese tree, I can build
chinese without any errors. So we could probably go with a workaround like this:
delete these files to get these upgrade blockers out of the way, upgrade
wolkenstein to buster, and then try to re-add the files step by step, maybe
with some modifications at some point, to get the original situation back.


Holger



-- 
Holger Wansing <hwansing@mailbox.org>
PGP-Fingerprint: 496A C6E8 1442 4B34 8508  3529 59F1 87CA 156E B076
[files-deleted-from-chinese.txt (text/plain, attachment)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Tue, 05 May 2020 00:33:07 GMT).


Acknowledgement sent to Boyuan Yang <byang@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian WWW Team <debian-www@lists.debian.org>. (Tue, 05 May 2020 00:33:07 GMT).


Message #15 received at 959474@bugs.debian.org:

From: Boyuan Yang <byang@debian.org>
To: Holger Wansing <hwansing@mailbox.org>, 959474@bugs.debian.org, Laura Arjona Reina <larjona@debian.org>, abe@debian.org
Cc: debian-l10n-chinese@lists.debian.org, debian-i18n@lists.debian.org, wml@packages.debian.org
Subject: Re: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster
Date: Mon, 04 May 2020 20:31:01 -0400
[Message part 1 (text/plain, inline)]
Hi all,

(with my Debian Chinese Team hat on)

(see bottom...)

On Sunday, 2020-05-03 at 22:57 +0200, Holger Wansing wrote:
> Hi,
> 
> Laura Arjona Reina <larjona@debian.org> wrote:
> > There are some issues with some Chinese pages when they are built in a
> > buster machine.
> > We need to fix those issues (at least the "Malformed UTF-8 character
> > [...] at ../../bin/tocn.pl [...]" ones) so DSA can upgrade the
> > www-master machine to buster. See the summary of the log at the bottom
> > to know which files produce this error.
> > I have no idea of how to fix the issues, so any help from the Chinese
> > team or web team mates is greatly appreciated.
> > Additional issues may arise (e.g. I still didn't test the release-notes
> > or doc-manual), any help testing is welcome too, please create bug
> > reports for each different issue or update the existing ones. Thanks!
> > 
> > LONG VERSION
> > 
> > I've done a test build of the /english and /chinese subdirs in a buster
> > machine, and I have noticed some warnings/errors related to the Chinese
> > pages (some, not all of them).
> > 
> > It would be desirable to upgrade www-master machine to buster as soon as
> > possible, so any help with this (from website or Chinese team members)
> > is much appreciated.
> > 
> > Below you can find an extract of the build log, including only the
> > files for which I got some error or warning message.
> > 
> > After the build, I have compared the problematic HTML files of a build
> > in stretch and a build in buster with a diff tool, to see if there were
> > significant changes in the html output due to these issues.
> > 
> > Here are my results:
> > 
> > * For the messages of the type ", [zh_TW]Invalid UTF8: " when building,
> > I couldn't note any difference between the output of a stretch build and
> > the output of a buster build.
> > 
> > I would say this is not a blocker for the buster upgrade of www-master.
> 
> I don't know what I did differently from Laura, but here some of the built
> HTML files with "Invalid UTF8: ... " messages are lacking much of their
> content, compared to the ones currently on www-master.
> So maybe those messages are serious as well.
> 
> > * For the messages of the type "Malformed UTF-8 character [...] at
> > ../../bin/tocn.pl [...]" I have seen important changes in the HTML diff,
> > I think the output in the stretch build is totally broken (fortunately,
> > there are not many files in that situation).
> > 
> > I would say this is a blocker for the buster upgrade of www-master, but
> > I would prefer somebody of the Chinese team to confirm (try to build
> > those files in a buster machine, and review the output).
> 
> Maybe someone from the Chinese team can solve this, but if not, I want
> to propose a possible (temporary) solution:
> 
> If I delete the files below from the webwml/chinese tree, I can build
> chinese without any errors. So we could probably go with a workaround like
> this:
> delete these files to get these upgrade blockers out of the way, upgrade
> wolkenstein to buster, and then try to re-add the files step by step, maybe
> with some modifications at some point, to get the original situation back.

Thanks for raising this issue. These build errors might have multiple causes,
but I stripped the issue down to a (possible) regression of wml. Let's fix
this issue first before talking about others.

=======================================
$ wml --version
This is WML Version 2.12.2
Copyright (c) 1996-2001 Ralf S. Engelschall.
Copyright (c) 1999-2001 Denis Barbier.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.
$ cat /etc/issue
Debian GNU/Linux bullseye/sid \n \l

$ cat a.wml
<p>
包
</p>
$ hexdump -C a.wml
00000000  3c 70 3e 0a e5 8c 85 0a  3c 2f 70 3e 0a           |<p>.....</p>.|
0000000d
$ wml a.wml > test.txt
$ cat test.txt
<p>
�
</p>
$ hexdump -C test.txt
00000000  3c 70 3e 0a e5 8c 0a 3c  2f 70 3e 0a              |<p>....</p>.|
0000000c
$ 

==================================================

The single character in the a.wml above is U+5305 [1], namely "CJK Unified
Ideograph-5305", a commonly-used Chinese character. Its UTF-8 encoding is
"0xE5 0x8C 0x85". However after wml transformation, only "0xE5 0x8C" was kept
and the "0x85" was dropped. That's surely a regression.
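
The two hexdumps above can be compared mechanically. This is a minimal sketch in Python (not the tooling used in this report); the byte strings are taken from the dumps, and the variable names are only illustrative.

```python
# a.wml as shown in the first hexdump: "<p>", newline, U+5305, newline, "</p>", newline
original = "<p>\n\u5305\n</p>\n".encode("utf-8")

# test.txt, copied byte for byte from the second hexdump
damaged = bytes.fromhex("3c703e0ae58c0a3c2f703e0a")

print(" ".join(f"{b:02x}" for b in original))
print(len(original), "bytes in,", len(damaged), "bytes out")

# Locate the first offset at which the two files differ.
offset = next(i for i, (a, b) in enumerate(zip(original, damaged)) if a != b)
print(f"first difference at offset {offset}: 0x{original[offset]:02x} is missing")

# The truncated sequence can no longer be decoded as UTF-8.
try:
    damaged.decode("utf-8")
except UnicodeDecodeError as err:
    print("output is not valid UTF-8:", err.reason)
```

The only difference is the byte at offset 6, which matches the observation that 0x85 was dropped while the rest of the file passed through unchanged.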

I am using Debian Unstable but similar things also happen in Buster.

I cc-ed the wml maintainer in Debian. Axel, is there any possibility to solve
this regression in both Sid/Testing and Stable?

-- 
Regards,
Boyuan Yang


[1] https://www.compart.com/en/unicode/U+5305
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Tue, 05 May 2020 01:03:02 GMT).


Acknowledgement sent to Axel Beckert <abe@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian WWW Team <debian-www@lists.debian.org>. (Tue, 05 May 2020 01:03:02 GMT).


Message #20 received at 959474@bugs.debian.org:

From: Axel Beckert <abe@debian.org>
To: Boyuan Yang <byang@debian.org>
Cc: Holger Wansing <hwansing@mailbox.org>, 959474@bugs.debian.org, Laura Arjona Reina <larjona@debian.org>, debian-l10n-chinese@lists.debian.org, debian-i18n@lists.debian.org, wml@packages.debian.org
Subject: Re: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster
Date: Tue, 5 May 2020 03:01:00 +0200
[Message part 1 (text/plain, inline)]
Control: clone -1 -2
Control: reassign -2 wml 2.12.2~ds1-2
Control: retitle -2 wml: Regression in "htmlstrip -O2" (default) with Chinese language

Hi,

Boyuan Yang wrote:
> Thanks for raising this issue.

Thanks from me, too. I wasn't aware of such a regression, sorry.

> These build errors might have multiple causes,
> but I stripped the issue down to a (possible) regression of wml. Let's fix
> this issue first before talking about others.
> 
> =======================================
> $ wml --version
> This is WML Version 2.12.2
> Copyright (c) 1996-2001 Ralf S. Engelschall.
> Copyright (c) 1999-2001 Denis Barbier.
> 
> This program is distributed in the hope that it will be useful,
> but WITHOUT ANY WARRANTY; without even the implied warranty of
> MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> GNU General Public License for more details.
> $ cat /etc/issue
> Debian GNU/Linux bullseye/sid \n \l
> 
> $ cat a.wml
> <p>
> 包
> </p>
> $ hexdump -C a.wml
> 00000000  3c 70 3e 0a e5 8c 85 0a  3c 2f 70 3e 0a           |<p>.....</p>.|
> 0000000d
> $ wml a.wml > test.txt
> $ cat test.txt
> <p>
> �
> </p>
> $ hexdump -C test.txt
> 00000000  3c 70 3e 0a e5 8c 0a 3c  2f 70 3e 0a              |<p>....</p>.|
> 0000000c
> $ 
[…]
> I am using Debian Unstable but similar things also happen in Buster.

Can confirm that this is a regression between Stretch and Buster. :-(

> The single character in the a.wml above is U+5305 [1], namely "CJK Unified
> Ideograph-5305", a commonly-used Chinese character. Its UTF-8 encoding is
> "0xE5 0x8C 0x85". However after wml transformation, only "0xE5 0x8C" was kept
> and the "0x85" was dropped. That's surely a regression.

Ack. Figured out that it's pass 8 of 9 passes in WML:

→ cat a.wml | wml -p1-8
<p>
�
</p>
→ cat a.wml | wml -p1-7
<p>
包
</p>
→ cat a.wml | wml -p1-7,9
<p>
包
</p>
→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip
�
→

Pass 8 is htmlstrip, something similar to uglifyjs, but for HTML.

Since that pass should only be there for delivery performance and disk
space reasons, it can likely be left out easily.

So I see multiple ways to more or less quickly fix this issue in the
Debian web:

* Always call wml with "-p1-7,9".
* Call wml with "-p1-7,9" if any of the affected languages is built.
* Add <nostrip>…</nostrip> containers in the header and footer
  templates for the affected languages.

To be more precise, it's the optimisation level 2 of htmlstrip:

→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 0
包
→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 1
包
→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 2
�
→

The man page says:

       Level 2:
           Good stripping: Same as level 1 plus compression of
	   multiple whitespaces (more then one in sequence) to single
	   whitespaces [txt,tag] and stripping of trailing whitespaces
	   at the of of a line [txt,tag,pre].
	   
           This level is the default because while providing good
	   optimization the HTML markup is not destroyed and remains
	   human readable.

So instead of skipping htmlstrip completely, "-O1" could be passed to
wml everywhere I suggested passing "-p1-7,9", as wml passes this
option on to htmlstrip:

→ cat a.wml | wml -O1
<p>
包
</p>

> I cc-ed the wml maintainer in Debian. Axel, is there any possibility to solve
> this regression in both Sid/Testing and Stable?

I think the above is a good first workaround on buster. With this
mail, I am cloning the bug report and will try to figure out which
change in htmlstrip caused the regression and/or how it can be fixed.

However, I currently have issues building more recent upstream versions
of WML, which is the reason why wml in Unstable hasn't seen an update
yet. A more recent version is in git, but IIRC there was another
release or two recently, which I haven't looked at yet.

		Regards, Axel
-- 
 ,''`.  |  Axel Beckert <abe@debian.org>, https://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5
  `-    |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE
[signature.asc (application/pgp-signature, inline)]

Bug 959474 cloned as bug 959761. Request was from Axel Beckert <abe@debian.org> to 959474-submit@bugs.debian.org. (Tue, 05 May 2020 01:03:02 GMT).


Information forwarded to debian-bugs-dist@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Tue, 05 May 2020 01:39:02 GMT).


Acknowledgement sent to Axel Beckert <abe@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian WWW Team <debian-www@lists.debian.org>. (Tue, 05 May 2020 01:39:02 GMT).


Message #27 received at 959474@bugs.debian.org:

From: Axel Beckert <abe@debian.org>
To: Boyuan Yang <byang@debian.org>, Holger Wansing <hwansing@mailbox.org>, 959474@bugs.debian.org, Laura Arjona Reina <larjona@debian.org>, debian-l10n-chinese@lists.debian.org, debian-i18n@lists.debian.org
Cc: 959761@bugs.debian.org, perl@packages.debian.org
Subject: Re: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster
Date: Tue, 5 May 2020 03:34:28 +0200
[Message part 1 (text/plain, inline)]
Hi,

I found the culprit quicker than expected. However, I'm no longer sure
whether it's really a WML issue or whether it sits even deeper:

Axel Beckert wrote:
> → echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 1
> 包
> → echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 2
> �

Level 2 actually only consists of these two regular expressions being
applied:

* s|(\S+)[ \t]{2,}|$1 |sg
* s|\s+\n|\n|sg

It's the latter one (a really simple regexp) which causes the
breakage. But not always. It depends on which Perl version
compatibility level is used:

→ echo 包 | perl -pe 's|\s+\n|\n|sg;'
包
→ echo 包 | perl -pE 's|\s+\n|\n|sg;'
�

"-E" instead of "-e" means "use the most recent Perl version feature
set"; for this bug it is equivalent to "use 5.014;", as that's what is
used in htmlstrip.

From some point of view, we're lucky, because the feature set of Perl
5.14 wasn't that big: "say state switch unicode_strings".

It's obvious that neither say, state nor switch are causing this. So
it seems as if "use feature unicode_strings" is the culprit. Proof:

→ echo 包 | perl -pe 's|\s+\n|\n|sg;'
包
→ echo 包 | perl -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
�

Which kinda sounds like a Perl bug. Cc'ing the maintainers of Debian's
perl package (not the whole Debian Perl Team), maybe they have some
insight what actually goes wrong here and if that's indeed a Perl bug.
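
The byte loss can be modelled outside Perl. The following is a rough sketch in Python, under the assumption that the one-liners above read their input as undecoded bytes, so that each byte is treated as one character in the Latin-1 range; the `re.ASCII` flag stands in for the run without `unicode_strings`, and the default Unicode matching stands in for the run with it. It is only a model of the observed behaviour and does not settle whether Perl itself is at fault.

```python
import re

line = "\u5305\n".encode("utf-8")        # b'\xe5\x8c\x85\n', what `echo 包` emits

# Model of the undecoded input: one character per byte (Latin-1 view).
as_chars = line.decode("latin-1")

# ASCII whitespace rules: models the run without unicode_strings.
ascii_rules = re.sub(r"\s+\n", "\n", as_chars, flags=re.ASCII)

# Unicode whitespace rules: models the run with unicode_strings.
# U+0085 (NEL) and U+00A0 (NBSP) now count as whitespace.
unicode_rules = re.sub(r"\s+\n", "\n", as_chars)

print(ascii_rules.encode("latin-1"))     # bytes unchanged
print(unicode_rules.encode("latin-1"))   # final byte 0x85 stripped
```

Under Unicode rules the trailing 0x85 is taken for a "next line" control character and removed as trailing whitespace, which leaves the same two-byte remnant seen in the wml output.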

I'm leaving #959761 open in wml as I now have an idea how to fix this
there (adding "no feature unicode_strings" to htmlstrip in the hope
that this doesn't do any collateral damage):

→ echo 包 | perl -pE 'no feature unicode_strings; s|\s+\n|\n|sg;'
包
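
For comparison, here is a hedged sketch in Python of two ways to strip trailing whitespace without damaging multi-byte characters. The first (a byte-oriented pattern with ASCII-only `\s`) models the effect of `no feature unicode_strings`; the second (decoding to text before matching) is an alternative that is not proposed in this thread and is shown only as an assumption about what would also work.

```python
import re

# The character followed by genuine trailing whitespace (space and tab).
raw = "\u5305 \t\n".encode("utf-8")

# Option 1: match on bytes. In a bytes pattern, \s covers ASCII whitespace only,
# so the continuation byte 0x85 is never mistaken for whitespace.
stripped_bytes = re.sub(rb"\s+\n", b"\n", raw)

# Option 2: decode to text first, strip, then encode again.
stripped_text = re.sub(r"\s+\n", "\n", raw.decode("utf-8")).encode("utf-8")

print(stripped_bytes)
print(stripped_text)
```

Both variants remove the space and tab and keep all three bytes of the character.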

		Regards, Axel
-- 
 ,''`.  |  Axel Beckert <abe@debian.org>, https://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5
  `-    |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, perl@packages.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Tue, 05 May 2020 02:24:03 GMT).


Acknowledgement sent to "Yao Wei (魏銘廷)" <mwei@lxde.org>:
Extra info received and forwarded to list. Copy sent to perl@packages.debian.org, Debian WWW Team <debian-www@lists.debian.org>. (Tue, 05 May 2020 02:24:03 GMT).


Message #32 received at 959474@bugs.debian.org:

From: "Yao Wei (魏銘廷)" <mwei@lxde.org>
To: Debian Bug Tracking System <959474@bugs.debian.org>
Subject: Re: Issues with Chinese language (all variants) when building some pages in buster
Date: Tue, 5 May 2020 10:18:39 +0800
[Message part 1 (text/plain, inline)]
Package: www.debian.org
Followup-For: Bug #959474

Hi,

After a bit of investigation of the Perl source code (5.31.11, downloaded
from upstream), I found that it handles whitespace in an unexpected way
when `feature unicode_strings` is turned on.  I am not a Perl person and I
haven't executed the source code yet, so my interpretation might be
wrong.

When `unicode_strings` is on, `in_uni_8_bit` should be true internally, and
in three places (pp.c:6040, pp.c:6076, pp.c:6114) `isSPACE_L1` is
called to check whether the character being examined is whitespace, by
checking whether the character is 0x85 or 0xA0 (handy.h:1611).  In the
case of the character 包, the last byte of its 3-byte UTF-8 encoding is
0x85, hence the problem.
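
The two byte values named above can be cross-checked without reading the Perl source. This is a small sketch in Python that uses `str.isspace()` as a stand-in for `isSPACE_L1`; that the two agree on the range 0x80..0xFF is an assumption of this illustration, not something taken from the Perl code.

```python
# Code points in the upper half of Latin-1 that Unicode treats as whitespace.
latin1_space = [cp for cp in range(0x80, 0x100) if chr(cp).isspace()]
print([hex(cp) for cp in latin1_space])

# UTF-8 continuation bytes occupy 0x80..0xBF, so each of these values can
# also appear as the final byte of a multi-byte character.
can_be_continuation = all(0x80 <= cp <= 0xBF for cp in latin1_space)
print(can_be_continuation)

# Final UTF-8 byte of the character from the report.
print(hex("\u5305".encode("utf-8")[-1]))
```

The list contains exactly 0x85 (NEL) and 0xA0 (NBSP), and both fall inside the continuation-byte range, which is why a byte-wise whitespace test can hit the tail of a valid character.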

-- System Information:
Debian Release: bullseye/sid
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: amd64 (x86_64)

Kernel: Linux 5.6.0-1-amd64 (SMP w/8 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Tue, 05 May 2020 02:24:04 GMT).


Acknowledgement sent to Boyuan Yang <byang@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian WWW Team <debian-www@lists.debian.org>. (Tue, 05 May 2020 02:24:04 GMT).


Message #37 received at 959474@bugs.debian.org:

From: Boyuan Yang <byang@debian.org>
To: Axel Beckert <abe@debian.org>, Holger Wansing <hwansing@mailbox.org>, 959474@bugs.debian.org, Laura Arjona Reina <larjona@debian.org>, debian-l10n-chinese@lists.debian.org, debian-i18n@lists.debian.org
Cc: 959761@bugs.debian.org, perl@packages.debian.org, mwei@debian.org
Subject: Re: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster
Date: Mon, 04 May 2020 22:19:02 -0400
[Message part 1 (text/plain, inline)]
Hi,

On Tuesday, 2020-05-05 at 03:34 +0200, Axel Beckert wrote:
> → echo 包 | perl -pe 's|\s+\n|\n|sg;'
> 包
> → echo 包 | perl -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
> �
> 
> Which kinda sounds like a Perl bug. Cc'ing the maintainers of Debian's
> perl package (not the whole Debian Perl Team), maybe they have some
> insight what actually goes wrong here and if that's indeed a Perl bug.

I guess it is a Perl bug. Here are more Chinese characters besides "包"
that trigger the problem:


% echo 包 | perl -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
�
% echo 赠 | perl -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
�
% echo 传 | perl -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
�
% echo 阅 | perl -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
�
% echo 加 | perl -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
�
% echo 者 | perl -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
�

% echo -n 赠 | hexdump -C
00000000  e8 b5 a0
% echo -n 传 | hexdump -C
00000000  e4 bc a0
% echo -n 包 | hexdump -C                                        
00000000  e5 8c 85
% echo -n 阅 | hexdump -C
00000000  e9 98 85
% echo -n 加 | hexdump -C
00000000  e5 8a a0
% echo -n 者 | hexdump -C
00000000  e8 80 85

(Note the 0xA0 and 0x85 at the end of each sequence.)
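
The pattern in the hexdumps above can be checked programmatically. Below is a minimal sketch in Python; the helper name `ends_in_trigger` is hypothetical, and the choice of U+4E00..U+9FFF (the basic CJK Unified Ideographs block) as the range to count over is an assumption made for illustration.

```python
TRIGGER_BYTES = {0x85, 0xA0}


def ends_in_trigger(ch: str) -> bool:
    """True if the last UTF-8 byte of ch is 0x85 or 0xA0."""
    return ch.encode("utf-8")[-1] in TRIGGER_BYTES


# The six characters listed in this message.
reported = "包赠传阅加者"
for ch in reported:
    print(f"U+{ord(ch):04X}", ch.encode("utf-8").hex(), ends_in_trigger(ch))

# How many code points in the basic CJK block share this property?
affected = sum(ends_in_trigger(chr(cp)) for cp in range(0x4E00, 0xA000))
print(affected, "of", 0xA000 - 0x4E00, "code points in U+4E00..U+9FFF")
```

Since the final UTF-8 byte carries the low six bits of the code point, two out of every 64 consecutive code points end in one of these bytes, so the problem is not limited to a handful of rare characters.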

Mwei (https://nm.debian.org/person/mwei/) just talked to me, saying that it
could be a bug in the isSPACE_L1 macro used in perl's pp.c. He will be
replying to this email soon.

-- 
Thanks,
Boyuan Yang
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Tue, 05 May 2020 03:48:02 GMT).


Acknowledgement sent to Yao Wei <mwei@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian WWW Team <debian-www@lists.debian.org>. (Tue, 05 May 2020 03:48:02 GMT).


Message #42 received at 959474@bugs.debian.org:

From: Yao Wei <mwei@debian.org>
To: Boyuan Yang <byang@debian.org>, 959474@bugs.debian.org
Cc: Axel Beckert <abe@debian.org>, Holger Wansing <hwansing@mailbox.org>, Laura Arjona Reina <larjona@debian.org>, debian-l10n-chinese@lists.debian.org, debian-i18n@lists.debian.org, 959761@bugs.debian.org, perl@packages.debian.org
Subject: Re: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster
Date: Tue, 5 May 2020 11:14:31 +0800
On Mon, May 04, 2020 at 10:19:02PM -0400, Boyuan Yang wrote:
> Mwei (https://nm.debian.org/person/mwei/) just talked to me, saying that it
> could be a bug with the isSPACE_L1 macro in perl's pp.c. He will be replying
> to this email soon.
> 

Hi,

(I used reportbug to handle reply of this thread, and I missed a lot of
recipients here.  This is a resend of reply in #959474.  Sorry for the
noise.)

After a bit of investigation of the Perl source code (5.31.11, downloaded
from upstream), I found that it handles whitespace oddly when
`feature unicode_strings` is turned on.  I am not a Perl person and I
haven't executed the source code yet, so my interpretation might be
wrong.

When `unicode_strings` is on, `in_uni_8_bit` should be true internally, and
in three places (pp.c:6040, pp.c:6076, pp.c:6114) `isSPACE_L1` is
called to check whether the character being examined is whitespace, by
checking whether the character is 0x85 or 0xA0 (handy.h:1611).  In the
case of the character 包, the last byte of its 3-byte UTF-8 encoding is
0x85, hence the problem.

Information forwarded to debian-bugs-dist@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Tue, 05 May 2020 05:57:02 GMT) (full text, mbox, link).


Message #45 received at 959474@bugs.debian.org (full text, mbox, reply):

From: Damyan Ivanov <dmn@debian.org>
To: Boyuan Yang <byang@debian.org>, Holger Wansing <hwansing@mailbox.org>, 959474@bugs.debian.org, Laura Arjona Reina <larjona@debian.org>, debian-l10n-chinese@lists.debian.org, debian-i18n@lists.debian.org, 959761@bugs.debian.org, perl@packages.debian.org
Subject: Re: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster
Date: Tue, 5 May 2020 08:45:11 +0300
(not a Perl maintainer here)

-=| Axel Beckert, 05.05.2020 03:34:28 +0200 |=-
> → echo 包 | perl -pe 's|\s+\n|\n|sg;'
> 包
> → echo 包 | perl -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
> �
> 
> Which kinda sounds like a Perl bug. Cc'ing the maintainers of Debian's
> perl package (not the whole Debian Perl Team), maybe they have some
> insight what actually goes wrong here and if that's indeed a Perl 
> bug.

Seems like a user (wml) bug to me (improper handling of UTF-8 encoded data):

→ echo 包赠传阅加者 | perl -CS -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
包赠传阅加者

From perlrun(1):

      -C [number/list]
            The -C flag controls some of the Perl Unicode features.

            As of 5.8.1, the -C can be followed either by a number or a list
            of option letters.  The letters, their numeric values, and effects
            are as follows; listing the letters is equal to summing the
            numbers.

                I     1   STDIN is assumed to be in UTF-8
                O     2   STDOUT will be in UTF-8
                E     4   STDERR will be in UTF-8
                S     7   I + O + E

Perhaps the strings in wml need to be decoded from UTF-8 so that they 
aren't treated as a sequence of independent bytes?

U+0085 is "Next line (NEL)", which seems to be treated as "\n".


(
Strangely, replacing -CS with a call to STDIN->binmode("UTF-8") 
doesn't help:

 echo 包 | perl -E 'STDIN->binmode("UTF-8"); while(<>) { s|\s+\n|\n|sg; print }'
 �

Explicitly using Encode helps:

 echo 包 | perl -E 'use Encode qw(decode_utf8); while(<>) { $_ = decode_utf8($_); s|\s+\n|\n|sg; print }'
 Wide character in print at -e line 1, <> line 1.
 包

(the wide character warning is expected, because STDOUT is not instructed how to encode Unicode characters)
)

-- dam



Information forwarded to debian-bugs-dist@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Tue, 05 May 2020 07:45:02 GMT) (full text, mbox, link).


Message #48 received at 959474@bugs.debian.org (full text, mbox, reply):

From: Jakub Wilk <jwilk@jwilk.net>
To: Damyan Ivanov <dmn@debian.org>, 959474@bugs.debian.org
Cc: Boyuan Yang <byang@debian.org>, Holger Wansing <hwansing@mailbox.org>, Laura Arjona Reina <larjona@debian.org>, debian-l10n-chinese@lists.debian.org, debian-i18n@lists.debian.org, 959761@bugs.debian.org, perl@packages.debian.org
Subject: Re: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster
Date: Tue, 5 May 2020 09:43:01 +0200
* Damyan Ivanov <dmn@debian.org>, 2020-05-05, 08:45:
>Strangely, replacing -CS with a call to STDIN->binmode("UTF-8")
>doesn't help:
>
> echo 包 | perl -E 'STDIN->binmode("UTF-8"); while(<>) { s|\s+\n|\n|sg; print }'
> �

That's because "UTF-8" is not a valid argument for binmode().

You want:

  $ echo 包 | perl -E 'STDIN->binmode(":encoding(UTF-8)") or die; while(<>) { s|\s+\n|\n|sg; print }'
  Wide character in print at -e line 1, <> line 1.
  包

or:

  $ echo 包 | perl -E 'STDIN->binmode(":utf8") or die; while(<>) { s|\s+\n|\n|sg; print }'
  Wide character in print at -e line 1, <> line 1.
  包

-- 
Jakub Wilk



Information forwarded to debian-bugs-dist@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Tue, 05 May 2020 08:57:03 GMT) (full text, mbox, link).


Acknowledgement sent to Axel Beckert <abe@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian WWW Team <debian-www@lists.debian.org>. (Tue, 05 May 2020 08:57:03 GMT) (full text, mbox, link).


Message #53 received at 959474@bugs.debian.org (full text, mbox, reply):

From: Axel Beckert <abe@debian.org>
To: Damyan Ivanov <dmn@debian.org>, 959761@bugs.debian.org
Cc: Boyuan Yang <byang@debian.org>, Holger Wansing <hwansing@mailbox.org>, 959474@bugs.debian.org, Laura Arjona Reina <larjona@debian.org>, debian-l10n-chinese@lists.debian.org, debian-i18n@lists.debian.org, perl@packages.debian.org
Subject: Re: Bug#959761: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster
Date: Tue, 5 May 2020 10:53:29 +0200
Hi Damyan,

Damyan Ivanov wrote:
> (not a Perl maintainer here)

Did help nevertheless. Just didn't want to spam the whole Perl Team
with potential Perl bugs. ;-)

> -=| Axel Beckert, 05.05.2020 03:34:28 +0200 |=-
> > → echo 包 | perl -pe 's|\s+\n|\n|sg;'
> > 包
> > → echo 包 | perl -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
> > �
> > 
> > Which kinda sounds like a Perl bug. Cc'ing the maintainers of Debian's
> > perl package (not the whole Debian Perl Team), maybe they have some
> > insight what actually goes wrong here and if that's indeed a Perl 
> > bug.
> 
> Seems like a user (wml) bug to me (improper handling of UTF-8 encoded data):
> 
> → echo 包赠传阅加者 | perl -CS -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
> 包赠传阅加者
> 
> From perlrun(1):
> 
>       -C [number/list]
>             The -C flag controls some of the Perl Unicode features.
> 
>             As of 5.8.1, the -C can be followed either by a number or a list
>             of option letters.  The letters, their numeric values, and effects
>             are as follows; listing the letters is equal to summing the
>             numbers.
> 
>                 I     1   STDIN is assumed to be in UTF-8
>                 O     2   STDOUT will be in UTF-8
>                 E     4   STDERR will be in UTF-8
>                 S     7   I + O + E

Thanks! I was not aware of the -C option...

> Perhaps the strings in wml need to be decoded from UTF-8 so that they 
> aren't treated as a sequence of independent bytes?

... and would have expected "use feature unicode_strings;" to already
activate all of this.

> U+0085 is "Next line (NEL)", which seems to be treated as "\n".

I see.

> Strangely, replacing -CS with a call to STDIN->binmode("UTF-8") 
> doesn't help:
> 
>  echo 包 | perl -E 'STDIN->binmode("UTF-8"); while(<>) { s|\s+\n|\n|sg; print }'
>  �
> 
> Explicitly using Encode helps:
> 
>  echo 包 | perl -E 'use Encode qw(decode_utf8); while(<>) { $_ = decode_utf8($_); s|\s+\n|\n|sg; print }'
>  Wide character in print at -e line 1, <> line 1.
>  包

Thanks, will try to use whatever works from these.

		Regards, Axel
-- 
 ,''`.  |  Axel Beckert <abe@debian.org>, https://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5
  `-    |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE



Information forwarded to debian-bugs-dist@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Tue, 05 May 2020 10:18:02 GMT) (full text, mbox, link).


Acknowledgement sent to gregor herrmann <gregoa@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian WWW Team <debian-www@lists.debian.org>. (Tue, 05 May 2020 10:18:02 GMT) (full text, mbox, link).


Message #58 received at 959474@bugs.debian.org (full text, mbox, reply):

From: gregor herrmann <gregoa@debian.org>
To: Damyan Ivanov <dmn@debian.org>, 959761@bugs.debian.org, Boyuan Yang <byang@debian.org>, Holger Wansing <hwansing@mailbox.org>, 959474@bugs.debian.org, Laura Arjona Reina <larjona@debian.org>, debian-l10n-chinese@lists.debian.org, debian-i18n@lists.debian.org, perl@packages.debian.org
Subject: Re: Bug#959761: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster
Date: Tue, 5 May 2020 12:16:17 +0200
On Tue, 05 May 2020 10:53:29 +0200, Axel Beckert wrote:

> > Perhaps the strings in wml need to be decoded from UTF-8 so that they 
> > aren't treated as a sequence of independent bytes?
> ... and would have expect "use feature unicode_strings;" already
> activates all of this.

(I haven't read the thread in detail …).

Personally I often use "use utf8::all" (from libutf8-all-perl) if I'm
reasonably sure that the input is not weird and I want to output
utf-8. It is sometimes a bit slow but handles all the en/decoding in
my experience.
 
> > Explicitly using Encode helps:
> > 
> >  echo 包 | perl -E 'use Encode qw(decode_utf8); while(<>) { $_ = decode_utf8($_); s|\s+\n|\n|sg; print }'
> >  Wide character in print at -e line 1, <> line 1.
> >  包

% time echo 包 | perl -E 'use Encode qw(decode_utf8); while(<>) { $_ = decode_utf8($_); s|\s+\n|\n|sg; print }'
Wide character in print at -e line 1, <> line 1.
包
echo 包  0.00s user 0.00s system 42% cpu 0.002 total
perl -E   0.03s user 0.01s system 97% cpu 0.034 total

% time echo 包 | perl -Mutf8::all -E ' while(<>) { s|\s+\n|\n|sg; print }'
包
echo 包  0.00s user 0.00s system 63% cpu 0.002 total
perl -Mutf8::all -E ' while(<>) { s|\s+\n|\n|sg; print }'  0.04s user 0.01s system 98% cpu 0.050 total

% time echo 包 | perl -CS -E 'while(<>) { s|\s+\n|\n|sg; print }'
包
echo 包  0.00s user 0.00s system 60% cpu 0.002 total
perl -CS -E 'while(<>) { s|\s+\n|\n|sg; print }'  0.00s user 0.00s system 83% cpu 0.005 total


Cheers,
gregor

-- 
 .''`.  https://info.comodo.priv.at -- Debian Developer https://www.debian.org
 : :' : OpenPGP fingerprint D1E1 316E 93A7 60A8 104D  85FA BB3A 6801 8649 AA06
 `. `'  Member VIBE!AT & SPI Inc. -- Supporter Free Software Foundation Europe
   `-   BOFH excuse #378:  Operators killed by year 2000 bug bite. 



Information forwarded to debian-bugs-dist@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Wed, 06 May 2020 02:57:02 GMT) (full text, mbox, link).


Acknowledgement sent to Boyuan Yang <byang@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian WWW Team <debian-www@lists.debian.org>. (Wed, 06 May 2020 02:57:02 GMT) (full text, mbox, link).


Message #63 received at 959474@bugs.debian.org (full text, mbox, reply):

From: Boyuan Yang <byang@debian.org>
To: "959761@bugs.debian.org" <959761@bugs.debian.org>, Axel Beckert <abe@debian.org>
Cc: "959474@bugs.debian.org" <959474@bugs.debian.org>, Laura Arjona Reina <larjona@debian.org>
Subject: Follow-up fix for wml in Debian Stable?
Date: Tue, 05 May 2020 22:51:56 -0400
Hi Axel,

I just tested the new wml 2.12.2~ds1-3 on Chinese translations for website
(webwml). It looks like the previous bug has been properly fixed.

Since the webmaster team is trying to upgrade the machine from Debian 9 to
Debian 10, it would be better to have this fix pushed into stable soon.
Can you make a stable update for package wml with this fix?

-- 
Thanks,
Boyuan Yang

Information forwarded to debian-bugs-dist@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Wed, 06 May 2020 08:33:06 GMT) (full text, mbox, link).


Acknowledgement sent to Axel Beckert <abe@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian WWW Team <debian-www@lists.debian.org>. (Wed, 06 May 2020 08:33:07 GMT) (full text, mbox, link).


Message #68 received at 959474@bugs.debian.org (full text, mbox, reply):

From: Axel Beckert <abe@debian.org>
To: Boyuan Yang <byang@debian.org>, 959761@bugs.debian.org
Cc: "959474@bugs.debian.org" <959474@bugs.debian.org>, Laura Arjona Reina <larjona@debian.org>
Subject: Re: Bug#959761: Follow-up fix for wml in Debian Stable?
Date: Wed, 6 May 2020 10:28:09 +0200
Hi Boyuan,

Boyuan Yang wrote:
> I just tested the new wml 2.12.2~ds1-3 on Chinese translations for website
> (webwml). It looks like the previous bug has been properly fixed.

Thanks a lot for testing and verifying!

> Since the webmaster team is trying to upgrade the machine from Debian 9 to
> Debian 10, it would be better to have this fix pushed into stable soon.
> Can you make a stable update for package wml with this fix?

As mentioned on IRC (not sure if you're on #debian-www, probably not),
this is my plan.

I will have to wait until wml 2.12.2~ds1-3 migrates to testing, though.
That should happen within 2 or 3 days, once autopkgtest has been
run and passed.

Laura said on IRC, though, that the webmasters might not want to wait
until the next stable update.

But maybe I can get it to stable-proposed-updates soon and they can
use it from there, so that wouldn't cause much of a lag.

(While I was writing this mail, on #debian-www it was decided that
they will use one of the workarounds, likely the "-O1" one.)

		Regards, Axel
-- 
 ,''`.  |  Axel Beckert <abe@debian.org>, https://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5
  `-    |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE



Information forwarded to debian-bugs-dist@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Sun, 07 Jun 2020 13:45:02 GMT) (full text, mbox, link).


Acknowledgement sent to Laura Arjona Reina <larjona@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian WWW Team <debian-www@lists.debian.org>. (Sun, 07 Jun 2020 13:45:02 GMT) (full text, mbox, link).


Message #73 received at 959474@bugs.debian.org (full text, mbox, reply):

From: Laura Arjona Reina <larjona@debian.org>
To: Axel Beckert <abe@debian.org>, 959474@bugs.debian.org, Boyuan Yang <byang@debian.org>, 959761@bugs.debian.org
Subject: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster
Date: Sun, 7 Jun 2020 15:43:54 +0200
Hi all

As a workaround for the Debian website, until wml 2.12.2~ds1-3 or higher
arrives in stable, I have added the option "-O1" to the options passed
to wml for Chinese, in the /chinese/Make.lang file:


+# Add "-O1"  to wml to be passed to htmlstrip, to avoid malformed UTF-8
+# see bug #959474
+# This option needs to be kept in Chinese until wml 2.12.2~ds1-3 or higher
+# arrives to Debian stable
+
+WMLOPTIONSZH = -O1

 WMLOUTPUT = -o UNDEFuZH@uCNuCNHKuCNTW:$(*F).zh-cn.html.tmp@g+w \
        -o UNDEFuZH@uHKuCNHKuHKTWuTWHK:$(*F).zh-hk.html.tmp@g+w \
@@ -54,7 +60,7 @@ WMLPROLOG = --prolog=$(FORMAT_ZH)
 # Remove initial blank line due "[ZH::]" in $(TEMPLDIR)/common_tags.wml,
 # an unfortunate but necessary workaround of a bug in slice < 1.3.9
 WMLEPILOG = --epilog=$(STRIP_INITIAL_BLANK_LINE)
-WML = wml $(WMLOPTIONS) $(WMLOUTPUT) $(WMLPROLOG) $(WMLEPILOG)
+WML = wml $(WMLOPTIONS) $(WMLOPTIONSZH) $(WMLOUTPUT) $(WMLPROLOG) $(WMLEPILOG)

I have compared the results of builds in stretch and buster both with
and without the option, and there are no changes in stretch, and the
UTF-8 issues are fixed in buster with the option (by the way, thanks
Boyuan for the additional fixes you did to mitigate the error).

So, I think that Bug#959474 can be closed, but I'll leave it open until
we actually migrate to buster and see the results in www.debian.org
"live" :-)

Thanks everybody for your work!

Kind regards,
-- 
Laura Arjona Reina
https://wiki.debian.org/LauraArjona



Information forwarded to debian-bugs-dist@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Sun, 07 Jun 2020 14:06:02 GMT) (full text, mbox, link).


Acknowledgement sent to Axel Beckert <abe@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian WWW Team <debian-www@lists.debian.org>. (Sun, 07 Jun 2020 14:06:02 GMT) (full text, mbox, link).


Message #78 received at 959474@bugs.debian.org (full text, mbox, reply):

From: Axel Beckert <abe@debian.org>
To: Laura Arjona Reina <larjona@debian.org>
Cc: 959474@bugs.debian.org, Boyuan Yang <byang@debian.org>, 959761@bugs.debian.org
Subject: Re: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster
Date: Sun, 7 Jun 2020 16:02:03 +0200
Hi,

Laura Arjona Reina wrote:
> I have compared the results of builds in stretch and buster both with
> and without the option, and there are no changes in stretch, and the
> UTF-8 issues are fixed in buster with the option

Thanks for these tests.

> So, I think that Bug#959474 can be closed, but I'll leave it open until
> we actually migrate to buster and see the results in www.debian.org
> "live" :-)

Just to be sure: I should still provide a stable update for buster,
right?

(Sorry, was a bit busy IRL and nearly forgot about this open "to do"
item. So thanks for the reminder.)

		Regards, Axel
-- 
 ,''`.  |  Axel Beckert <abe@debian.org>, https://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5
  `-    |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE



Information forwarded to debian-bugs-dist@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Sun, 07 Jun 2020 19:27:03 GMT) (full text, mbox, link).


Acknowledgement sent to Laura Arjona Reina <larjona@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian WWW Team <debian-www@lists.debian.org>. (Sun, 07 Jun 2020 19:27:03 GMT) (full text, mbox, link).


Message #83 received at 959474@bugs.debian.org (full text, mbox, reply):

From: Laura Arjona Reina <larjona@debian.org>
To: Axel Beckert <abe@debian.org>
Cc: 959474@bugs.debian.org, Boyuan Yang <byang@debian.org>, 959761@bugs.debian.org
Subject: Re: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster
Date: Sun, 7 Jun 2020 21:23:17 +0200
Hi

On 7/6/20 at 16:02, Axel Beckert wrote:

> Just to be sure: I should still provide a stable update for buster,
> right?
> 

I don't know if the type of bug qualifies for a stable update.

For www.debian.org, we'll be using the -O1 workaround for building the
Chinese pages. That option only concerns optimization and we don't lose
any functionality, so I think we can wait for bullseye.

Boyuan, please correct me if I am wrong...

Kind regards,
-- 
Laura Arjona Reina
https://wiki.debian.org/LauraArjona



Information forwarded to debian-bugs-dist@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Wed, 10 Jun 2020 00:48:03 GMT) (full text, mbox, link).


Acknowledgement sent to Boyuan Yang <byang@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian WWW Team <debian-www@lists.debian.org>. (Wed, 10 Jun 2020 00:48:03 GMT) (full text, mbox, link).


Message #88 received at 959474@bugs.debian.org (full text, mbox, reply):

From: Boyuan Yang <byang@debian.org>
To: Laura Arjona Reina <larjona@debian.org>, Axel Beckert <abe@debian.org>
Cc: 959474@bugs.debian.org, 959761@bugs.debian.org
Subject: Re: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster
Date: Tue, 09 Jun 2020 20:44:02 -0400
On Sunday, 2020-06-07 at 21:23 +0200, Laura Arjona Reina wrote:
> Hi
> 
> On 7/6/20 at 16:02, Axel Beckert wrote:
> 
> > Just to be sure: I should still provide a stable update for buster,
> > right?
> > 
> 
> I don't know if the type of bug qualifies for a stable update.

If I were the maintainer, I would try to make the stable update.
(Why not?)

> For www.debian.org, we'll be using the -O1 workaround for building the
> Chinese pages. That option only concerns optimization and we don't lose
> any functionality, so I think we can wait for bullseye.
> 
> Boyuan, please correct me if I am wrong...

If we have the workaround applied, website building with Chinese
contents should not be an issue anymore.

-- 
Thanks,
Boyuan Yang




Information forwarded to debian-bugs-dist@lists.debian.org, Debian WWW Team <debian-www@lists.debian.org>:
Bug#959474; Package www.debian.org. (Thu, 28 Jan 2021 02:03:03 GMT) (full text, mbox, link).


Acknowledgement sent to Changwoo Ryu <cwryu@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian WWW Team <debian-www@lists.debian.org>. (Thu, 28 Jan 2021 02:03:03 GMT) (full text, mbox, link).


Message #93 received at 959474@bugs.debian.org (full text, mbox, reply):

From: Changwoo Ryu <cwryu@debian.org>
To: 959474@bugs.debian.org
Subject: Re: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster
Date: Thu, 28 Jan 2021 11:01:34 +0900
Korean is affected too, and I have added the "-O1" workaround for Korean as well.





Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Sun Jun 4 07:03:37 2023; Machine Name: buxtehude
