Debian Bug report logs - #496226
html2text: should recognize 'meta' html tag and make input recoding

Package: html2text; Maintainer for html2text is Holger Levsen <holger@debian.org>; Source for html2text is src:html2text.

Reported by: Samuel Thibault <samuel.thibault@ens-lyon.org>

Date: Sat, 23 Aug 2008 15:24:02 UTC

Severity: minor

Tags: upstream

Found in version html2text/1.3.2a-6

Fixed in versions html2text/1.3.2a-9, html2text/1.3.2a-10

Done: jackyf.devel@gmail.com (Eugene V. Lyubimkin)

Bug is archived. No further changes may be made.


Report forwarded to debian-bugs-dist@lists.debian.org, jackyf.devel@gmail.com (Eugene V. Lyubimkin):
Bug#496226; Package html2text.

Acknowledgement sent to Samuel Thibault <samuel.thibault@ens-lyon.org>:
New Bug report received and forwarded. Copy sent to jackyf.devel@gmail.com (Eugene V. Lyubimkin).

Message #5 received at submit@bugs.debian.org:

From: Samuel Thibault <samuel.thibault@ens-lyon.org>
To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: html2text: Input and output charsets should be independant and if possible detected
Date: Sat, 23 Aug 2008 15:39:10 +0100
Package: html2text
Version: 1.3.2a-6
Severity: minor

Hello,

As the information below says, I'm not using a UTF-8 locale.  html2text
will however, on UTF-8 HTML pages, produce UTF-8 text.  Conversely, on a
UTF-8 system, html2text will, on latin1 HTML pages, produce latin1 text.
The recently added -utf8 option handles the UTF-8 on UTF-8 case, but
not the two cases above.

Generally speaking, there is no reason why the input and output charsets
should be related at all.  For the input, html2text should recognize
the meta http-equiv tag; that should work for a lot of pages, else an
input-charset option can be provided.  For the output, the current
locale's charset should be used (as returned by nl_langinfo(CODESET)
after calling setlocale(LC_CTYPE,"")); that should work in almost all
cases, else an output-charset option can be provided.
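
A minimal sketch of the input-side detection described above, assuming a crude scan for the first "charset=" token rather than a real HTML parser. The names find_meta_charset and has_prefix_nocase are hypothetical and are not taken from html2text or from any patch mentioned in this report.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: case-insensitive test that s starts with prefix. */
static int has_prefix_nocase(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* Hypothetical helper: copy the charset named after the first "charset="
 * in an HTML buffer into out (always NUL-terminated).
 * Returns 1 when a non-empty charset name was found, 0 otherwise.
 * This is only an illustration; it does not check that the token sits
 * inside a <meta http-equiv=...> element. */
static int find_meta_charset(const char *html, char *out, size_t outsz)
{
    assert(html != NULL && out != NULL && outsz > 0);

    for (const char *p = html; *p != '\0'; p++) {
        if (!has_prefix_nocase(p, "charset="))
            continue;
        p += strlen("charset=");
        /* Skip an optional opening quote, as in charset="utf-8". */
        if (*p == '"' || *p == '\'')
            p++;
        size_t n = 0;
        /* Accept the characters commonly used in charset names. */
        while (n + 1 < outsz &&
               (isalnum((unsigned char)*p) || *p == '-' || *p == '_' ||
                *p == '.' || *p == ':'))
            out[n++] = *p++;
        out[n] = '\0';
        return n > 0;
    }
    out[0] = '\0';
    return 0;
}
```

When the function returns 0, a caller would fall back to a user-supplied input charset, which is the role the report gives to the proposed input-charset option.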

Yes, that means conversions.  But that's the way charsets are supposed
to be handled.  Note btw that for the conversions, one can just use
iconv_open(nl_langinfo(CODESET), page_charset), but one can also append
"//translit" to nl_langinfo(CODESET), so that iconv makes the
transliterations itself, i.e. turns curly quotes and long dashes into
equivalents in the target charset.
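
The conversion suggested above could look roughly like the following sketch. It assumes a glibc-style iconv that accepts the "//TRANSLIT" suffix; the helper names recode and locale_charset are hypothetical, and this is not the code from the Debian patches.

```c
#include <assert.h>
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Charset of the current locale, obtained as described in the report. */
static const char *locale_charset(void)
{
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}

/* Hypothetical helper: convert the NUL-terminated string `in`, encoded in
 * charset `from`, into charset `to`, asking iconv to transliterate.
 * Returns 0 on success and -1 on failure; `out` is always NUL-terminated. */
static int recode(const char *from, const char *to,
                  const char *in, char *out, size_t outsz)
{
    assert(in != NULL && out != NULL && outsz > 0);

    char target[128];
    /* "//TRANSLIT" asks iconv to approximate characters that the target
     * charset cannot represent (assumed to be supported, as in glibc). */
    snprintf(target, sizeof target, "%s//TRANSLIT", to);

    iconv_t cd = iconv_open(target, from);
    if (cd == (iconv_t)-1) {
        out[0] = '\0';
        return -1;
    }

    char *src = (char *)in;        /* iconv() takes a non-const char ** */
    size_t srcleft = strlen(in);
    char *dst = out;
    size_t dstleft = outsz - 1;    /* keep room for the terminating NUL */

    size_t rc = iconv(cd, &src, &srcleft, &dst, &dstleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &dst, &dstleft);  /* flush shift state */
    iconv_close(cd);

    *dst = '\0';
    return rc == (size_t)-1 ? -1 : 0;
}
```

A caller would pass the charset detected from the page as `from` and locale_charset() as `to`, for example recode(page_charset, locale_charset(), text, buf, sizeof buf).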

Samuel

-- System Information:
Debian Release: lenny/sid
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'unstable'), (500, 'stable'), (1, 'experimental')
Architecture: i386 (i686)

Kernel: Linux 2.6.26
Locale: LANG=fr_FR@euro, LC_CTYPE=fr_FR@euro (charmap=ISO-8859-15)
Shell: /bin/sh linked to /bin/bash

Versions of packages html2text depends on:
ii  libc6                         2.7-13     GNU C Library: Shared libraries
ii  libgcc1                       1:4.3.1-2  GCC support library
ii  libstdc++6                    4.3.1-2    The GNU Standard C++ Library v3

html2text recommends no packages.

Versions of packages html2text suggests:
ii  curl                          7.18.2-5   Get a file from an HTTP, HTTPS or 
ii  wget                          1.11.4-1   retrieves files from the web

-- no debconf information




Tags added: upstream Request was from "Eugene V. Lyubimkin" <jackyf.devel@gmail.com> to control@bugs.debian.org. (Sat, 23 Aug 2008 20:00:02 GMT)

Bug 496226 cloned as bug 498797. Request was from "Eugene V. Lyubimkin" <jackyf.devel@gmail.com> to control@bugs.debian.org. (Sat, 13 Sep 2008 14:03:41 GMT)

Changed Bug title to `html2text: should recognize 'meta' html tag and make input recoding' from `html2text: Input and output charsets should be independant and if possible detected'. Request was from "Eugene V. Lyubimkin" <jackyf.devel@gmail.com> to control@bugs.debian.org. (Sat, 13 Sep 2008 14:03:49 GMT)

Message sent on to Samuel Thibault <samuel.thibault@ens-lyon.org>:
Bug#496226.

Message #14 received at 496226-submitter@bugs.debian.org:

From: "Eugene V. Lyubimkin" <jackyf.devel@gmail.com>
To: control@bugs.debian.org, 496226-submitter@bugs.debian.org
Subject: dividing recoding-related bugs
Date: Sat, 13 Sep 2008 15:05:04 +0300
package html2text
clone 496226 -1
retitle -1 html2text: should convert output to current locale charset
retitle 496226 html2text: should recognize 'meta' html tag and make input recoding
thanks

-- 
Eugene V. Lyubimkin aka JackYF, Ukrainian C++ developer.

Tags added: pending Request was from jackyf.devel@gmail.com to control@bugs.debian.org. (Sat, 13 Sep 2008 16:09:02 GMT)

Reply sent to jackyf.devel@gmail.com (Eugene V. Lyubimkin):
You have taken responsibility.

Notification sent to Samuel Thibault <samuel.thibault@ens-lyon.org>:
Bug acknowledged by developer.

Message #21 received at 496226-close@bugs.debian.org:

From: jackyf.devel@gmail.com (Eugene V. Lyubimkin)
To: 496226-close@bugs.debian.org
Subject: Bug#496226: fixed in html2text 1.3.2a-9
Date: Sun, 14 Sep 2008 10:47:03 +0000
Source: html2text
Source-Version: 1.3.2a-9

We believe that the bug you reported is fixed in the latest version of
html2text, which is due to be installed in the Debian FTP archive:

html2text_1.3.2a-9.diff.gz
  to pool/main/h/html2text/html2text_1.3.2a-9.diff.gz
html2text_1.3.2a-9.dsc
  to pool/main/h/html2text/html2text_1.3.2a-9.dsc
html2text_1.3.2a-9_amd64.deb
  to pool/main/h/html2text/html2text_1.3.2a-9_amd64.deb



A summary of the changes between this version and the previous one is
attached.

Thank you for reporting the bug, which will now be closed.  If you
have further comments please address them to 496226@bugs.debian.org,
and the maintainer will reopen the bug report if appropriate.

Debian distribution maintenance software
pp.
Eugene V. Lyubimkin <jackyf.devel@gmail.com> (supplier of updated html2text package)

(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing ftpmaster@debian.org)


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Format: 1.8
Date: Sun, 07 Sep 2008 11:12:35 +0300
Source: html2text
Binary: html2text
Architecture: source amd64
Version: 1.3.2a-9
Distribution: experimental
Urgency: low
Maintainer: Eugene V. Lyubimkin <jackyf.devel@gmail.com>
Changed-By: Eugene V. Lyubimkin <jackyf.devel@gmail.com>
Description: 
 html2text  - advanced HTML to text converter
Closes: 285378 307425 496226
Changes: 
 html2text (1.3.2a-9) experimental; urgency=low
 .
   The "grepping binary device for patch parts" release.
   * debian/patches:
     - Refreshed all patches.
     - Add comments to all patches.
     - New patch 400-remove-builtin-http-support.patch: remove limited built-in
       http support. "Wontfix" bugs related to http support are closed thus.
       (Closes: #307425, #285378)
     - New patch 600-multiple-meta-tags.patch: recognize all 'meta' tags, not
       one. Thanks to Dmirty E. Oboukhov for the patch. Thanks to
       Stanislav Maslovski <stanislav.maslovski@gmail.com> for the help in bison
       patching.
     - New patch 611-recognize-input-encoding.patch: recode input according to
       'meta' tag. Thanks to Dmirty E. Oboukhov for the idea of patch.
       (Closes: #496226)
   * debian/html2text.1:
     - Mentioned new '-nometa' option.
     - Updated descriptions of '-utf8' and '-ascii' options.
     - Mentioned that Debian version of html2text has no http support.
     - Updated author's mail and download page.
   * debian/README.Debian:
     - Updated HTTP section, wrote META HTTP-EQUIV section.
Checksums-Sha1: 
 6c7ad448c4d2a8917fd247d83ecd9afa077a9082 1037 html2text_1.3.2a-9.dsc
 4d72e2e1af6088f07881d539cbd923c9b2a47395 24358 html2text_1.3.2a-9.diff.gz
 4d4eb1a815a4819bd6a5f4754b4714026f1a9826 100096 html2text_1.3.2a-9_amd64.deb
Checksums-Sha256: 
 042a0bb25e2a4121abda5b3465be1e06d416b87a0b8e8c1a11a41b59db1fb844 1037 html2text_1.3.2a-9.dsc
 38c22b233f01dacf7601b774ccd1fea7c91eb9203cfc742f7f9abf5a28e4ac40 24358 html2text_1.3.2a-9.diff.gz
 06dd95853b6edc3dad041d1adefdbbaa493ed6f9bfcc6b4269e5f007107409bf 100096 html2text_1.3.2a-9_amd64.deb
Files: 
 35439190e784388ec3d6cb396ca531f5 1037 web optional html2text_1.3.2a-9.dsc
 35ba538b0993ec9161560242ec1bc8e3 24358 web optional html2text_1.3.2a-9.diff.gz
 f216aebaf249623b6da246babcb4171c 100096 web optional html2text_1.3.2a-9_amd64.deb

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkjM6Y4ACgkQKFvXofIqeU7GAACgwXt5HcGFk4wzMAWDWHsuwF87
eKYAoL9qSvxS4aT/ULznBbxEv5dDFiAy
=RsXf
-----END PGP SIGNATURE-----





Reply sent to jackyf.devel@gmail.com (Eugene V. Lyubimkin):
You have taken responsibility. (Fri, 03 Oct 2008 13:00:15 GMT)

Notification sent to Samuel Thibault <samuel.thibault@ens-lyon.org>:
Bug acknowledged by developer. (Fri, 03 Oct 2008 13:00:16 GMT)

Message #26 received at 496226-close@bugs.debian.org:

From: jackyf.devel@gmail.com (Eugene V. Lyubimkin)
To: 496226-close@bugs.debian.org
Subject: Bug#496226: fixed in html2text 1.3.2a-10
Date: Fri, 03 Oct 2008 12:47:03 +0000
Source: html2text
Source-Version: 1.3.2a-10

We believe that the bug you reported is fixed in the latest version of
html2text, which is due to be installed in the Debian FTP archive:

html2text_1.3.2a-10.diff.gz
  to pool/main/h/html2text/html2text_1.3.2a-10.diff.gz
html2text_1.3.2a-10.dsc
  to pool/main/h/html2text/html2text_1.3.2a-10.dsc
html2text_1.3.2a-10_i386.deb
  to pool/main/h/html2text/html2text_1.3.2a-10_i386.deb



A summary of the changes between this version and the previous one is
attached.

Thank you for reporting the bug, which will now be closed.  If you
have further comments please address them to 496226@bugs.debian.org,
and the maintainer will reopen the bug report if appropriate.

Debian distribution maintenance software
pp.
Eugene V. Lyubimkin <jackyf.devel@gmail.com> (supplier of updated html2text package)

(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing ftpmaster@debian.org)


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Format: 1.8
Date: Sat, 20 Sep 2008 14:10:09 +0300
Source: html2text
Binary: html2text
Architecture: source i386
Version: 1.3.2a-10
Distribution: experimental
Urgency: low
Maintainer: Eugene V. Lyubimkin <jackyf.devel@gmail.com>
Changed-By: Eugene V. Lyubimkin <jackyf.devel@gmail.com>
Description: 
 html2text  - advanced HTML to text converter
Closes: 285378 307425 496226 498797
Changes: 
 html2text (1.3.2a-10) experimental; urgency=low
 .
   * debian/rules:
     - Really install NEWS.Debian as documentation.
   * debian/patches:
     - 220-nobs-when-stdout-is-a-tty.patch: deleted, useless now, since
       backspaces are not produced at all.
     - 400-remove-builtin-http-support.patch: refreshed.
     - 500-utf8-support.patch: refreshed.
     - 510-utf8-implies-nobs.patch: deleted, useless now.
     - New 510-disable-backspaces.patch: disable backspaces because parser
       cannot produce them rightly in multi-byte sequences now.
     - 611-recognize-input-encoding.patch:
       - Corrected to don't produce error if '-nometa' option was not supplied
         and input html doesn't contain valid 'meta http-equiv' tag.
       - Corrected to don't display debug info twicely (if -debug-parser or
         -debug-scanner was supplied).
       - Corrected: now parser always processes UTF-8 text, needed for proper
         output recoding.
       - Moved recoding code to separate function.
       - Close input stream directly after read, not after the processing.
       - Correctly mark the end of converted sequence.
     - New 630-recode-output-to-locale-charset.patch: convert output to current
       locale charset. (Closes: #498797)
     - 300-replace-zeroes-with-null.patch: renamed to
       800-replace-zeroes-with-null.patch.
     - New 810-fix-deprecated-conversion-warnings.patch: fix 'deprecated
       conversion from string constant to ‘char*’' warnings during build by
       supplying 'const' qualifier in needed places.
   * debian/README.Debian:
     - Renamed 'META HTTP-EQUIV' section to 'Input recoding'.
     - Added correct input encoding cases to 'Input recoding' section.
     - Added 'Backspaces' section.
     - Added 'Output recoding' section.
   * debian/html2text.1:
     - Mentioned that Debian version of html2text doesn't produce backspaces,
       so '-nobs' does nothing.
     - Added paragraph about input/output recoding.
   * debian/NEWS.Debian:
     - Corrected news for 1.3.2a-9.
   * debian/control:
     - Renewed long description.
   [unera]
   * debian/changelog:
     - fixed incorrect changelog record 1.3.2a-9 (Thanks for Stanislav
       Maslovski <stanislav.maslovski@gmail.com> for the
       600-multiple-meta-tags.patch :)).
 .
 html2text (1.3.2a-9) experimental; urgency=low
 .
   The "grepping binary device for patch parts" release.
   * debian/patches:
     - Refreshed all patches.
     - Add comments to all patches.
     - New patch 400-remove-builtin-http-support.patch: remove limited built-in
       http support. "Wontfix" bugs related to http support are closed thus.
       (Closes: #307425, #285378)
     - New patch 600-multiple-meta-tags.patch: recognize all 'meta' tags, not
       one. Thanks to Stanislav Maslovski <stanislav.maslovski@gmail.com> for
       the patch, thanks to Dmitry E. Oboukhov for the idea of patch.
     - New patch 611-recognize-input-encoding.patch: recode input according to
       'meta' tag. Thanks to Dmirty E. Oboukhov for the idea of patch.
       (Closes: #496226)
   * debian/html2text.1:
     - Mentioned new '-nometa' option.
     - Updated descriptions of '-utf8' and '-ascii' options.
     - Mentioned that Debian version of html2text has no http support.
     - Updated author's mail and download page.
   * debian/README.Debian:
     - Updated HTTP section, wrote META HTTP-EQUIV section.
Checksums-Sha1: 
 b2f86e2c6de48dbb33fd8ef1c4bc57e7ad3db209 1033 html2text_1.3.2a-10.dsc
 6b916eee26412677e814d6240b82354c7d265889 27387 html2text_1.3.2a-10.diff.gz
 ba895bdab623c68842ae74c575290f3e14b868f5 97532 html2text_1.3.2a-10_i386.deb
Checksums-Sha256: 
 9de781984b64445d96686ac95bc2e8dbae1e5745aef8714b428c9034f8d65e88 1033 html2text_1.3.2a-10.dsc
 dee337dbafa0b79eff59b215e5727696d444bd524c8f990b8417be818e4296e7 27387 html2text_1.3.2a-10.diff.gz
 314c924bc21be146af89f6764243310d1b25f5b6e99f587c6363a3e9feac6891 97532 html2text_1.3.2a-10_i386.deb
Files: 
 338106f0781fa56e59a8b2b0d054326c 1033 web optional html2text_1.3.2a-10.dsc
 1f93477ccdee23a16733dc0b611b8553 27387 web optional html2text_1.3.2a-10.diff.gz
 b06963d527ecbc741aade80115b1a3f9 97532 web optional html2text_1.3.2a-10_i386.deb

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFI5g+Qq4wAz/jiZTcRAn99AKDJ824+P0f7AMnG70zZa5zWk9OUbQCgt9Pn
So82b7Io7lR+JK47K9Ahj6o=
=R0Tp
-----END PGP SIGNATURE-----





Bug archived. Request was from Debbugs Internal Request <owner@bugs.debian.org> to internal_control@bugs.debian.org. (Sat, 01 Nov 2008 07:27:18 GMT)


Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Thu Apr 24 22:55:29 2014; Machine Name: buxtehude.debian.org

Debian Bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.