Debian Bug report logs - #279221
should transcode characters from utf-8 if the terminal is not utf-8 capable

Package: w3m; Maintainer for w3m is Tatsuya Kinoshita <tats@debian.org>; Source for w3m is src:w3m (PTS, buildd, popcon).

Reported by: Joey Hess <joeyh@debian.org>

Date: Mon, 1 Nov 2004 15:03:02 UTC

Severity: minor

Merged with 322823

Found in versions 0.5.1-3, w3m/0.5.1-3


Report forwarded to debian-bugs-dist@lists.debian.org, Fumitoshi UKAI <ukai@debian.or.jp>:
Bug#279221; Package w3m. (full text, mbox, link).


Acknowledgement sent to Joey Hess <joeyh@debian.org>:
New Bug report received and forwarded. Copy sent to Fumitoshi UKAI <ukai@debian.or.jp>. (full text, mbox, link).


Message #5 received at submit@bugs.debian.org (full text, mbox, reply):

From: Joey Hess <joeyh@debian.org>
To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: should transcde characters from utf-8 if the terminal is not utf-8 capable
Date: Mon, 1 Nov 2004 00:50:05 -0500
[Message part 1 (text/plain, inline)]
Package: w3m
Version: 0.5.1-3
Severity: wishlist

Here's the problem:

joey@dragon:~>locale | grep CTYPE
LC_CTYPE="POSIX"
joey@dragon:~>echo '&mdash;' > foo.html 
joey@dragon:~>w3m -dump foo.html        
?

That comes out as a '?' because w3m apparently internally converts it to the
utf-8 character for mdash (which is not '-', but the other dash), and then
discovers it's not in the character set for this terminal and decides to render
it as a question mark. When reading a document with lots of &mdash;, &ldquo;,
&hellip; and other fancy entities, this gets very annoying.

Instead, w3m should be aware of the character set and just use available
characters that are close to the right ones, like "-". Other browsers, such
as lynx, do that.
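
A minimal sketch of such a fallback, assuming a small hand-written table of
ASCII approximations. The table contents and the names ascii_approximation()
and append_approximation() are hypothetical; they are not taken from w3m or
lynx, and a real browser would need a much larger table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fallback table: Unicode code point -> ASCII approximation.
 * The entries are illustrative only. */
struct ascii_fallback {
    unsigned long codepoint;
    const char *replacement;
};

static const struct ascii_fallback fallback_table[] = {
    { 0x2013, "-"   },  /* en dash (&ndash;) */
    { 0x2014, "--"  },  /* em dash (&mdash;) */
    { 0x2018, "'"   },  /* left single quotation mark (&lsquo;) */
    { 0x2019, "'"   },  /* right single quotation mark (&rsquo;) */
    { 0x201C, "\""  },  /* left double quotation mark (&ldquo;) */
    { 0x201D, "\""  },  /* right double quotation mark (&rdquo;) */
    { 0x2026, "..." },  /* horizontal ellipsis (&hellip;) */
};

/* Return an ASCII approximation for cp, or "?" when the table has none. */
const char *ascii_approximation(unsigned long cp)
{
    size_t n = sizeof fallback_table / sizeof fallback_table[0];
    for (size_t i = 0; i < n; i++) {
        if (fallback_table[i].codepoint == cp)
            return fallback_table[i].replacement;
    }
    return "?";
}

/* Append the approximation of cp to the NUL-terminated string in buf.
 * Returns 0 on success, -1 if the result would not fit in bufsize bytes. */
int append_approximation(char *buf, size_t bufsize, unsigned long cp)
{
    assert(buf != NULL);
    const char *rep = ascii_approximation(cp);
    size_t used = strlen(buf);
    size_t need = strlen(rep);
    if (used + need + 1 > bufsize)
        return -1;
    memcpy(buf + used, rep, need + 1);  /* copies the trailing NUL too */
    return 0;
}
```

With this table an em dash becomes "--" and only characters that have no entry
still come out as "?".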

-- System Information:
Debian Release: 3.1
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: i386 (i686)
Kernel: Linux 2.4.27
Locale: LANG=en_US, LC_CTYPE=en_US (charmap=ISO-8859-1)

Versions of packages w3m depends on:
ii  libc6                       2.3.2.ds1-18 GNU C Library: Shared libraries an
ii  libgc1                      1:6.3-1      Conservative garbage collector for
ii  libgpmg1                    1.19.6-19    General Purpose Mouse - shared lib
ii  libncurses5                 5.4-4        Shared libraries for terminal hand
ii  libssl0.9.7                 0.9.7d-5     SSL shared libraries
ii  zlib1g                      1:1.2.2-1    compression library - runtime

-- no debconf information

-- 
see shy jo
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Fumitoshi UKAI <ukai@debian.or.jp>:
Bug#279221; Package w3m. (full text, mbox, link).


Acknowledgement sent to Samuel Thibault <samuel.thibault@ens-lyon.org>:
Extra info received and forwarded to list. Copy sent to Fumitoshi UKAI <ukai@debian.or.jp>. (full text, mbox, link).


Message #10 received at 279221@bugs.debian.org (full text, mbox, reply):

From: Samuel Thibault <samuel.thibault@ens-lyon.org>
To: 279221@bugs.debian.org
Subject: Re: should transcde characters from utf-8 if the terminal is not utf-8 capable
Date: Tue, 7 Jun 2005 14:56:16 +0200
Hi,

For this, iconv can be very helpful:

$ hexdump foo
0000000  e2 80 94 0a                                    
$ iconv -f utf-8 -t latin1//translit < foo
--
$

The //translit suffix tells iconv to transliterate characters that the target charset cannot represent.

So w3m should do something like:

#define TRANSLIT "//translit"
char *codeset = nl_langinfo(CODESET);
int len = strlen(codeset);
char *charset = malloc(len+strlen(TRANSLIT)+1);
memcpy(charset,codeset,len);
memcpy(charset+len,TRANSLIT,strlen(TRANSLIT)+1);
conv = iconv_open(charset, page_charset);
iconv(conv, ...);
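
A fuller sketch of the same idea, with error handling and an output buffer
that grows as needed. The function names convert_translit() and
terminal_codeset() are hypothetical and not part of w3m. Note that //TRANSLIT
is a GNU extension (glibc and GNU libiconv), not something POSIX requires, and
the replacement text it produces can differ between libc versions and locales.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Codeset of the current locale, e.g. "ISO-8859-1" or "UTF-8". */
const char *terminal_codeset(void)
{
    setlocale(LC_CTYPE, "");    /* pick up LC_CTYPE from the environment */
    return nl_langinfo(CODESET);
}

/* Convert the NUL-terminated string `in` from from_charset to to_codeset,
 * appending "//TRANSLIT" to the target so that iconv may approximate
 * characters the target cannot represent.
 * Returns a malloc'd NUL-terminated string, or NULL on any failure. */
char *convert_translit(const char *to_codeset, const char *from_charset,
                       const char *in)
{
    static const char suffix[] = "//TRANSLIT";
    assert(to_codeset != NULL && from_charset != NULL && in != NULL);

    size_t len = strlen(to_codeset);
    char *target = malloc(len + sizeof suffix);  /* sizeof counts the NUL */
    if (target == NULL)
        return NULL;
    memcpy(target, to_codeset, len);
    memcpy(target + len, suffix, sizeof suffix);

    iconv_t cd = iconv_open(target, from_charset);
    free(target);
    if (cd == (iconv_t)-1)
        return NULL;            /* unknown charset or unsupported pair */

    size_t inleft = strlen(in);
    size_t cap = inleft * 4 + 16;   /* initial guess, grown on E2BIG */
    char *out = malloc(cap);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }
    char *inp = (char *)in;     /* iconv() does not modify the input bytes */
    char *outp = out;
    size_t outleft = cap - 1;   /* keep one byte for the final NUL */

    while (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        size_t used = (size_t)(outp - out);
        char *bigger = NULL;
        if (errno == E2BIG) {   /* output buffer full: double it and retry */
            cap *= 2;
            bigger = realloc(out, cap);
        }
        if (bigger == NULL) {   /* invalid input or out of memory */
            free(out);
            iconv_close(cd);
            return NULL;
        }
        out = bigger;
        outp = out + used;
        outleft = cap - 1 - used;
    }
    *outp = '\0';
    iconv_close(cd);
    return out;
}
```

A caller would pass terminal_codeset() as to_codeset and the page charset as
from_charset, then free() the result. The sketch does not flush shift state,
so it is only suitable for stateless target encodings such as ISO-8859-1.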

Regards,
Samuel



Severity set to `minor'. Request was from Samuel Thibault <samuel.thibault@ens-lyon.org> to control@bugs.debian.org. (full text, mbox, link).


Merged 279221 322823. Request was from Samuel Thibault <samuel.thibault@ens-lyon.org> to control@bugs.debian.org. (full text, mbox, link).


Changed Bug title to `should transcode characters from utf-8 if the terminal is not utf-8 capable' from `should transcde characters from utf-8 if the terminal is not utf-8 capable'. Request was from Vincent Lefevre <vincent@vinc17.org> to control@bugs.debian.org. (Wed, 06 Jun 2007 12:09:02 GMT) (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Fumitoshi UKAI <ukai@debian.or.jp>:
Bug#279221; Package w3m. (full text, mbox, link).


Acknowledgement sent to Vincent Lefevre <vincent@vinc17.org>:
Extra info received and forwarded to list. Copy sent to Fumitoshi UKAI <ukai@debian.or.jp>. (full text, mbox, link).


Message #21 received at 279221@bugs.debian.org (full text, mbox, reply):

From: Vincent Lefevre <vincent@vinc17.org>
To: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: 279221@bugs.debian.org
Subject: Re: should transcde characters from utf-8 if the terminal is not utf-8 capable
Date: Wed, 6 Jun 2007 14:16:57 +0200
Hi,

On 2005-06-07 14:56:16 +0200, Samuel Thibault wrote:
> The //translit suffix tells iconv to transliterate characters that the target charset cannot represent.
> 
> So w3m should do something like:
> 
> #define TRANSLIT "//translit"
> char *codeset = nl_langinfo(CODESET);
> int len = strlen(codeset);
> char *charset = malloc(len+strlen(TRANSLIT)+1);
> memcpy(charset,codeset,len);
> memcpy(charset+len,TRANSLIT,strlen(TRANSLIT)+1);
> conv = iconv_open(charset, page_charset);
> iconv(conv, ...);

Any news?

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)



Noted your statement that Bug has been forwarded to http://sourceforge.net/tracker/index.php?func=detail&aid=1338256&group_id=39518&atid=425439. Request was from Samuel Thibault <samuel.thibault@ens-lyon.org> to control@bugs.debian.org. (Wed, 06 Jun 2007 17:57:02 GMT) (full text, mbox, link).


Unset Bug forwarded-to-address Request was from d+deb@vdr.jp to control@bugs.debian.org. (Fri, 23 Jul 2010 18:27:04 GMT) (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Tatsuya Kinoshita <tats@debian.org>:
Bug#279221; Package w3m. (Sun, 12 Oct 2014 12:45:04 GMT) (full text, mbox, link).


Acknowledgement sent to Markus Hiereth <markus.hiereth@freenet.de>:
Extra info received and forwarded to list. Copy sent to Tatsuya Kinoshita <tats@debian.org>. (Sun, 12 Oct 2014 12:45:04 GMT) (full text, mbox, link).


Message #30 received at 279221@bugs.debian.org (full text, mbox, reply):

From: Markus Hiereth <markus.hiereth@freenet.de>
To: Debian Bug Tracking System <279221@bugs.debian.org>
Subject: Re: should transcode characters from utf-8 if the terminal is not utf-8 capable
Date: Sun, 12 Oct 2014 14:31:24 +0200
Package: w3m
Followup-For: Bug #279221

Dear Maintainer,

I wonder if this bug report can be closed for w3m in Debian 7.

I got the correct output

$ echo '&mdash;' > foo.html
$ w3m -dump < foo.html 
&mdash;

Regards
Markus



-- System Information:
Debian Release: 7.6
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: i386 (i686)

Kernel: Linux 3.2.0-4-486
Locale: LANG=de_DE, LC_CTYPE=de_DE (charmap=ISO-8859-1)
Shell: /bin/sh linked to /bin/dash

Versions of packages w3m depends on:
ii  libc6        2.13-38+deb7u2
ii  libgc1c2     1:7.1-9.1
ii  libgpm2      1.20.4-6
ii  libssl1.0.0  1.0.1e-2+deb7u11
ii  libtinfo5    5.9-10
ii  zlib1g       1:1.2.7.dfsg-13

Versions of packages w3m recommends:
ii  ca-certificates  20130119

Versions of packages w3m suggests:
ii  man-db        2.6.2-1
pn  menu          <none>
pn  migemo        <none>
ii  mime-support  3.52-1
pn  w3m-el        <none>
ii  w3m-img       0.5.3-8

-- no debconf information



Information forwarded to debian-bugs-dist@lists.debian.org:
Bug#279221; Package w3m. (Sun, 12 Oct 2014 15:00:05 GMT) (full text, mbox, link).


Acknowledgement sent to Tatsuya Kinoshita <tats@debian.org>:
Extra info received and forwarded to list. (Sun, 12 Oct 2014 15:00:05 GMT) (full text, mbox, link).


Message #35 received at 279221@bugs.debian.org (full text, mbox, reply):

From: Tatsuya Kinoshita <tats@debian.org>
To: markus.hiereth@freenet.de, 279221@bugs.debian.org
Subject: Re: Bug#279221: should transcode characters from utf-8 if the terminal is not utf-8 capable
Date: Sun, 12 Oct 2014 23:46:45 +0900 (JST)
[Message part 1 (text/plain, inline)]
On October 12, 2014 at 2:31PM +0200, markus.hiereth (at freenet.de) wrote:
> I got the correct output
>
> $ echo '&mdash;' > foo.html
> $ w3m -dump < foo.html
> &mdash;

Still not improved.

    $ w3m -dump foo.html
    ?
    $ w3m -dump -T text/html < foo.html
    ?

Thanks,
--
Tatsuya Kinoshita
[Message part 2 (application/pgp-signature, inline)]


Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Mon Jun 5 03:09:25 2023; Machine Name: buxtehude
