Debian Bug report logs - #338264
w3m vs. HTML entities
Report forwarded to debian-bugs-dist@lists.debian.org, Fumitoshi UKAI <ukai@debian.or.jp>:
Bug#338264; Package w3m.
Message #3 received at submit@bugs.debian.org:
Package: w3m
Version: 0.5.1-4
Severity: wishlist
w3m has big problems reading a file full of HTML entities.
$ w3m http://ehp.niehs.nih.gov/cehp/docs/2005/113-1c/toc.html
We see lots of "?". Firefox doesn't have any problems.
Even after
$ wwwoffle -o http://ehp.niehs.nih.gov/cehp/docs/2005/113-1c/toc.html|\
perl -pwe 'use HTML::Entities;$_=decode_entities($_);\
s/gb2312/utf-8/'>file.html
w3m has problems.
OK, I was finally able to prepare it for a big5 PDA:
wwwoffle -o http://ehp.niehs.nih.gov/cehp/docs/2005/113-1c/toc.html|\
perl -pwe 'use HTML::Entities;$_=decode_entities($_);\
s/gb2312/big5/'|iconv -f utf-8 -t gb2312 -c|\
iconv -f gb2312 -t big5 -c > file.html
Note that two iconv steps are needed, probably because of their incomplete
mapping, which I recall reporting to them. Also, the original file in fact
contains no gb2312.
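For readers without wwwoffle or iconv at hand, the pipeline above can be approximated in Python. This is only a rough sketch, not the commands that were actually run: the function name `prepare_for_big5` and the sample input are hypothetical, and Python's `big5` codec may not map exactly the same characters as iconv does.

```python
import html


def prepare_for_big5(page: str) -> bytes:
    """Decode HTML entities and encode the result as Big5.

    Hypothetical helper that mirrors the wwwoffle | perl | iconv pipeline.
    """
    # Same role as HTML::Entities::decode_entities in the Perl one-liner.
    text = html.unescape(page)
    # Same role as s/gb2312/big5/, except that every occurrence is
    # replaced, not just the first match on each line.
    text = text.replace("gb2312", "big5")
    # errors="ignore" drops characters with no Big5 mapping, like iconv -c.
    return text.encode("big5", errors="ignore")


if __name__ == "__main__":
    # U+4E2D exists in Big5; U+8BF4 is a Simplified-only form and is dropped.
    sample = "<p>&#20013;&#35828;</p>"
    print(prepare_for_big5(sample))
```

As with `iconv -c`, characters that cannot be mapped disappear silently, so the output can be shorter than the input.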
-- System Information:
Locale: LANG=zh_TW.Big5, LC_CTYPE=zh_TW.Big5 (charmap=BIG5)
Acknowledgement sent to Karsten Schoelzel <kuser@gmx.de>:
Extra info received and forwarded to list. Copy sent to Fumitoshi UKAI <ukai@debian.or.jp>.
Message #8 received at 338264@bugs.debian.org:
package w3m
retitle 338264 Improve conversion from GB2312 to Big5 characters
thanks
On Wed, Nov 09, 2005 at 03:37:17AM +0800, Dan Jacobson wrote:
> Package: w3m
> Version: 0.5.1-4
> Severity: wishlist
>
> w3m has big problems reading a file full of HTML entities.
> $ w3m http://ehp.niehs.nih.gov/cehp/docs/2005/113-1c/toc.html
> We see lots of "?". Firefox doesn't have any problems.
>
> Even after
> $ wwwoffle -o http://ehp.niehs.nih.gov/cehp/docs/2005/113-1c/toc.html|\
> perl -pwe 'use HTML::Entities;$_=decode_entities($_);\
> s/gb2312/utf-8/'>file.html
> w3m has problems.
>
> OK, I was finally able to prepare it for a big5 PDA:
> wwwoffle -o http://ehp.niehs.nih.gov/cehp/docs/2005/113-1c/toc.html|\
> perl -pwe 'use HTML::Entities;$_=decode_entities($_);\
> s/gb2312/big5/'|iconv -f utf-8 -t gb2312 -c|\
> iconv -f gb2312 -t big5 -c > file.html
>
> We note the two iconv steps probably due to their non complete mapping
> which I recall telling them. Also there is in fact no gb2312 in the original file.
> -- System Information:
> Locale: LANG=zh_TW.Big5, LC_CTYPE=zh_TW.Big5 (charmap=BIG5)
>
The problem is not the HTML entities but the conversion from GB2312 to
Big5 characters. The entities in the file specify characters which map
directly to characters in GB2312. But they are not directly mapped to Big5,
because the mapping from Simplified Chinese to Traditional Chinese is
ambiguous and can depend, for example, on the context.
One solution is to use either a GB2312 or a UTF-8 locale. With
LANG=zh_CN.GB2312 only two question marks were left, and with
LANG=zh_TW.UTF-8 all symbols were displayed.
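The effect can be illustrated with a small Python sketch. It assumes Python's built-in codecs as a stand-in for w3m's own conversion tables, which may differ in detail; the helper `can_encode` and the sample character are chosen for illustration and are not taken from the page in question.

```python
import html


def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text can be encoded in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # U+8BF4, a Simplified Chinese character, written as a numeric entity.
    char = html.unescape("&#35828;")
    for charset in ("gb2312", "big5", "utf-8"):
        print(charset, can_encode(char, charset))
```

The entity decodes without trouble; only the final step into the Big5 charset fails, which matches the question marks seen under a Big5 locale.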
Regards,
Karsten Schölzel
Acknowledgement sent to Jeff Abrahamson <jeff@purple.com>:
Extra info received and forwarded to list. Copy sent to Fumitoshi UKAI <ukai@debian.or.jp>.
Message #13 received at 338264@bugs.debian.org:
Package: w3m
Version: 0.5.1-4
Followup-For: Bug #338264
Attached is a simple example on which w3m fails. It should render
'a"b' but it renders 'a?b'.
The economist.com site is a rich source of pages that w3m has trouble
rendering for this reason. By contrast, Mozilla/Firefox renders them
just fine, as do lynx and links.
(Is this bug related to 291735?)
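One way such output can arise is sketched below in Python. The attachment is not reproduced in this log, so the sketch assumes the entity names a character outside ISO-8859-1, such as a typographic quote (`&ldquo;`); that input and the helper `render_for_charset` are hypothetical. It does not claim to show what w3m does internally.

```python
import html


def render_for_charset(markup: str, charset: str) -> str:
    """Decode entities, then show what survives in the given charset."""
    text = html.unescape(markup)
    # errors="replace" substitutes "?" for characters the charset lacks.
    return text.encode(charset, errors="replace").decode(charset)


if __name__ == "__main__":
    # Hypothetical input: U+201C has no ISO-8859-1 equivalent.
    print(render_for_charset("a&ldquo;b", "iso-8859-1"))
    # A plain ASCII quote entity survives unchanged.
    print(render_for_charset("a&quot;b", "iso-8859-1"))
```

A browser that prints 'a"b' in this situation is presumably falling back to an ASCII approximation of the quote instead of a placeholder.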
-- System Information:
Debian Release: testing/unstable
APT prefers testing
APT policy: (500, 'testing')
Architecture: i386 (i686)
Shell: /bin/sh linked to /bin/bash
Kernel: Linux 2.6.15-1-686
Locale: LANG=en_US, LC_CTYPE=en_US (charmap=ISO-8859-1)
Versions of packages w3m depends on:
ii  libc6         2.3.6-15    GNU C Library: Shared libraries
ii  libgc1c2      1:6.7-2     conservative garbage collector for
ii  libgpmg1      1.19.6-22   General Purpose Mouse - shared lib
ii  libncurses5   5.5-2       Shared libraries for terminal hand
ii  libssl0.9.7   0.9.7i-1    SSL shared libraries
ii  zlib1g        1:1.2.3-11  compression library - runtime
Versions of packages w3m recommends:
ii ca-certificates 20050804 Common CA Certificates PEM files
-- no debconf information
[foo.html (text/html, attachment)]
Last modified: Mon Jun 5 03:08:42 2023