Debian Bug report logs -
#420728
mairix returns incorrect results in certain local-situations
Reported by: Vincent Lefevre <vincent@vinc17.org>
Date: Tue, 24 Apr 2007 10:45:02 UTC
Severity: normal
Tags: upstream
Found in version mairix/0.20-1
Forwarded to Richard Curnow <rc@rc0.org.uk>
Report forwarded to
debian-bugs-dist@lists.debian.org, Benjamin Mako Hill <mako@debian.org>:
Bug#420728; Package
mairix.
Full text and
rfc822 format available.
Acknowledgement sent to
Vincent Lefevre <vincent@vinc17.org>:
New Bug report received and forwarded. Copy sent to
Benjamin Mako Hill <mako@debian.org>.
Full text and
rfc822 format available.
Message #5 received at submit@bugs.debian.org (full text, mbox):
Package: mairix
Version: 0.20-1
Severity: normal
mairix finds no matches in UTF-8 locales. For instance, in ISO-8859-1:
vin:~> locale
LANG=POSIX
LC_CTYPE=en_US.ISO8859-1
LC_NUMERIC="POSIX"
LC_TIME=en_DK
LC_COLLATE=POSIX
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
vin:~> mairix testé
Matched 305 messages
But in UTF-8, on the same machine:
vin:~> locale
LANG=POSIX
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="POSIX"
LC_TIME=en_DK
LC_COLLATE=POSIX
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
vin:~> mairix testé
Matched 0 messages
-- System Information:
Debian Release: lenny/sid
APT prefers unstable
APT policy: (500, 'unstable'), (500, 'stable')
Architecture: i386 (i686)
Kernel: Linux 2.6.18-4-686-bigmem (SMP w/2 CPU cores)
Locale: LANG=POSIX, LC_CTYPE=en_US.ISO8859-1 (charmap=ISO-8859-1)
Shell: /bin/sh linked to /bin/bash
Versions of packages mairix depends on:
ii libbz2-1.0 1.0.3-6 high-quality block-sorting file co
ii libc6 2.5-2 GNU C Library: Shared libraries
ii zlib1g 1:1.2.3-13 compression library - runtime
mairix recommends no packages.
-- no debconf information
Acknowledgement sent to
"Benj. Mako Hill" <mako@debian.org>:
Extra info received and forwarded to list. Copy sent to
Benjamin Mako Hill <mako@debian.org>.
Full text and
rfc822 format available.
Message #10 received at 420728@bugs.debian.org (full text, mbox):
Something's going on here. I *only* use mairix in UTF-8 locales and, as
you can imagine, it works just fine. Are you sure you have generated
your locale data?
Regards,
Mako
--
Benjamin Mako Hill
mako@debian.org
http://mako.cc/
Acknowledgement sent to
Vincent Lefevre <vincent@vinc17.org>:
Extra info received and forwarded to list. Copy sent to
Benjamin Mako Hill <mako@debian.org>.
Full text and
rfc822 format available.
Message #15 received at 420728@bugs.debian.org (full text, mbox):
On 2007-04-25 09:21:21 -0400, Benj. Mako Hill wrote:
> Something's going on here. I *only* use mairix in UTF-8 locales and, as
> you can imagine, it works just fine. Are you sure you have generated
> your locale data?
Yes, UTF-8 works fine with other applications.
Also, I most often use mairix in ISO-8859-1.
--
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)
Acknowledgement sent to
"Benj. Mako Hill" <mako@debian.org>:
Extra info received and forwarded to list. Copy sent to
Benjamin Mako Hill <mako@debian.org>.
Full text and
rfc822 format available.
Message #20 received at 420728@bugs.debian.org (full text, mbox):
<quote who="Vincent Lefevre" date="Wed, Apr 25, 2007 at 04:35:35PM +0200">
> On 2007-04-25 09:21:21 -0400, Benj. Mako Hill wrote:
> > Something's going on here. I *only* use mairix in UTF-8 locales and, as
> > you can imagine, it works just fine. Are you sure you have generated
> > your locale data?
>
> Yes, UTF-8 works fine with other applications.
My point was that it works fine with Mairix. I can't reproduce your bug.
I'm pretty sure you've got something screwy going on with your system.
Regards,
Mako
--
Benjamin Mako Hill
mako@debian.org
http://mako.cc/
Acknowledgement sent to
Vincent Lefevre <vincent@vinc17.org>:
Extra info received and forwarded to list. Copy sent to
Benjamin Mako Hill <mako@debian.org>.
Full text and
rfc822 format available.
Message #25 received at 420728@bugs.debian.org (full text, mbox):
On 2007-04-25 11:13:54 -0400, Benj. Mako Hill wrote:
> My point was that it works fine with Mairix. I can't reproduce your bug.
> I'm pretty sure you've got something screwy going on with your system.
I have exactly the same problem under Mac OS X.
--
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)
Acknowledgement sent to
Vincent Lefevre <vincent@vinc17.org>:
Extra info received and forwarded to list. Copy sent to
Benjamin Mako Hill <mako@debian.org>.
Full text and
rfc822 format available.
Message #30 received at 420728@bugs.debian.org (full text, mbox):
I've done some more tests under Debian/unstable.
* With a database rebuilt (rm Mail/.mairix; mairix -p -v) under
UTF-8 locales, "mairix testé" finds no matches, whether it is
typed under ISO-8859-1 locales or UTF-8 locales.
* With a database rebuilt (rm Mail/.mairix; mairix -p -v) under
ISO-8859-1 locales, "mairix testé" finds matches only when typed
under ISO-8859-1 locales.
Applications (e.g. mutt) work fine, whether I'm under ISO-8859-1
or UTF-8 locales. This is a mairix-specific problem.
I'll try to provide a simple testcase.
--
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)
Acknowledgement sent to
Vincent Lefevre <vincent@vinc17.org>:
Extra info received and forwarded to list. Copy sent to
Benjamin Mako Hill <mako@debian.org>.
Full text and
rfc822 format available.
Message #35 received at 420728@bugs.debian.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On 2007-12-11 15:27:30 +0100, Vincent Lefevre wrote:
> I'll try to provide a simple testcase.
Here it is. To reproduce the bug, run the script. In /tmp/mairix-test,
it will create a mailbox with two messages: one with the iso-8859-1
charset, one with the utf-8 charset. Both contain the word "accentué".
The script also creates the mairix database under 3 different locales:
C, en_US.iso88591 and en_US.utf8 (all installed here).
Then cd to /tmp/mairix-test and run the following commands:
mairix -f mairixrc-C accentué
mairix -f mairixrc-en_US.iso88591 accentué
mairix -f mairixrc-en_US.utf8 accentué
in iso-8859-1 locales and in utf-8 locales.
Here, I get 0 messages except for mairixrc-en_US.iso88591 in iso88591
locales, where I get 1 message: the one written in iso-8859-1.
The correct behavior would be 2 messages in each case.
--
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)
[non-ascii.sh (application/x-sh, attachment)]
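The attached non-ascii.sh is not reproduced in this log. As a rough stand-in only, the mailbox described above (two messages, one declared iso-8859-1 and one utf-8, both containing "accentué") could be sketched in Python as follows. The addresses, the date in the "From " separator line, the body text, and the temporary directory are all assumptions made for illustration, and the sketch does not create the mairixrc files or run mairix.

```python
import os
import tempfile

# Hypothetical body text; the real test script's wording is not known.
BODY = "Un mot accentué.\n"

def raw_message(charset: str) -> bytes:
    """Return one mbox entry whose body is encoded in the given charset."""
    headers = (
        "From sender@example.org Tue Dec 11 15:00:00 2007\n"  # mbox separator
        "From: sender@example.org\n"
        "To: recipient@example.org\n"
        f"Subject: test {charset}\n"
        "MIME-Version: 1.0\n"
        f"Content-Type: text/plain; charset={charset}\n"
        "Content-Transfer-Encoding: 8bit\n"
        "\n"
    )
    # Headers are pure ASCII; only the body depends on the charset.
    return headers.encode("ascii") + BODY.encode(charset) + b"\n"

# Assumed location: a fresh temporary directory rather than /tmp/mairix-test.
workdir = tempfile.mkdtemp(prefix="mairix-test-")
mbox_path = os.path.join(workdir, "mbox")
with open(mbox_path, "wb") as f:
    for charset in ("iso-8859-1", "utf-8"):
        f.write(raw_message(charset))

# Check that the same word is present under two different byte spellings.
with open(mbox_path, "rb") as f:
    raw = f.read()
has_latin1 = b"accentu\xe9." in raw
has_utf8 = b"accentu\xc3\xa9." in raw
print(has_latin1, has_utf8)
```

The point of the two messages is that the same word reaches the indexer as two different byte sequences, which is what the three databases and two query locales then exercise.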
Changed Bug title to `mairix returns incorrect results in certain local-situations' from `mairix finds no matches in UTF-8 locales'.
Request was from
"Benj. Mako Hill" <mako@debian.org>
to
control@bugs.debian.org.
(Mon, 14 Jan 2008 15:30:04 GMT)
Full text and
rfc822 format available.
Tags added: upstream
Request was from
"Benj. Mako Hill" <mako@debian.org>
to
control@bugs.debian.org.
(Mon, 14 Jan 2008 15:30:05 GMT)
Full text and
rfc822 format available.
Reply sent to
"Benj. Mako Hill" <mako@debian.org>:
You have marked Bug as forwarded.
Full text and
rfc822 format available.
Message #42 received at 420728-forwarded@bugs.debian.org (full text, mbox):
[Message part 1 (text/plain, inline)]
retitle 420728 mairix returns incorrect results in certain local-situations
tags 420728 upstream
thanks
Richard,
I'm forwarding a bug in Mairix reported by a Debian user. I've worked
with Vincent to confirm and provide a test case for this issue. I'm
forwarding our complete correspondence on this issue but you can, as
always, check out the bug and necessary attachments in the Debian bug
tracking system.
Let me know if you have additional questions or if there's anything I
can do to help with this.
Thanks Vincent, for getting us here and thanks Richard for maintaining
Mairix!
Regards
Mako
--
Benjamin Mako Hill
mako@debian.org
http://mako.cc/
Message #43 received at 420728-forwarded@bugs.debian.org (full text, mbox):
On Sun, Jan 13, 2008 at 06:10:18PM -0500, Benj. Mako Hill wrote:
>
> I'm forwarding a bug in Mairix reported by a Debian user. I've worked
> with Vincent to confirm and provide a test case for this issue. I'm
> forwarding our complete correspondence on this issue but you can, as
> always, check out the bug and necessary attachments in the Debian bug
> tracking system.
>
> Let me know if you have additional questions or if there's anything I
> can do to help with this.
Firstly, sorry for the dreadfully slow reply.
I think the problem might be that the index is built without any attempt
to canonicalise the tokens with regard to the message encodings. If the
messages containing 'testé' all used iso-8859-1 encoding for their
bodies, the 'é' would be stored as 0xe9 inside the tokens in the index.
However, I suspect that when the locale is utf-8 at search time, the 'é'
is represented as the byte pair 0xc3 0xa9 (maybe my unicode is incorrect
though), and it doesn't match. Or something along those lines :-)
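To make the suspected mismatch concrete, here is a small Python illustration. It is not mairix code and assumes nothing about the index format; it only shows the two byte spellings of the same word. The byte pair 0xc3 0xa9 is indeed the UTF-8 encoding of 'é'.

```python
# Illustration only: the same word as bytes under the two encodings.
word = "testé"

latin1_token = word.encode("iso-8859-1")  # as it would appear in an ISO-8859-1 body
utf8_query = word.encode("utf-8")         # as it would be typed in a UTF-8 locale

print(latin1_token)  # b'test\xe9'
print(utf8_query)    # b'test\xc3\xa9'

# A byte-wise comparison fails even though both spell the same word.
print(latin1_token == utf8_query)  # False
```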
Generalising, if someone has various messages with bodies using all
sorts of encodings, this is going to be chaos. I could imagine polyglots
with friends in Russia, Japan, Europe who have messages in a mix of
SJIS, iso-8859-1 & whatever Russia uses.
I suppose the 'fix' would be to pass all bodies through iconv and store
the index in some canonical form (utf-8 or ucs-2 probably.) As I
largely still inhabit a world of us-ascii encodings, or iso-8859-1 at a
pinch, I am not itching to tackle this. We would also acquire a
dependency on the iconv library.
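A minimal sketch of the canonical-form idea, in Python rather than C. Python's built-in codecs stand in for iconv here, and the function names, the word-splitting regex, the lower-casing, and the NFC normalisation step are all assumptions made for illustration; none of this is mairix's actual tokeniser or index layout.

```python
import re
import unicodedata
from collections import defaultdict

def canonical_tokens(raw: bytes, charset: str) -> set:
    """Decode bytes with the given charset and return normalised words."""
    text = raw.decode(charset, errors="replace")
    # NFC makes precomposed and decomposed accents compare equal.
    text = unicodedata.normalize("NFC", text).lower()
    return set(re.findall(r"\w+", text))

def build_index(messages):
    """Map each canonical token to the set of message numbers containing it."""
    index = defaultdict(set)
    for msg_id, (raw, charset) in enumerate(messages):
        for token in canonical_tokens(raw, charset):
            index[token].add(msg_id)
    return index

def search(index, query: bytes, locale_charset: str) -> set:
    """Canonicalise the query the same way, then intersect the posting sets."""
    hits = None
    for token in canonical_tokens(query, locale_charset):
        ids = index.get(token, set())
        hits = ids if hits is None else hits & ids
    return hits or set()

# Two bodies with the same text but different declared charsets.
messages = [
    ("Un mot accentué".encode("iso-8859-1"), "iso-8859-1"),
    ("Un mot accentué".encode("utf-8"), "utf-8"),
]
index = build_index(messages)

# The query is typed once in an ISO-8859-1 locale and once in a UTF-8 locale.
matches_latin1 = search(index, "accentué".encode("iso-8859-1"), "iso-8859-1")
matches_utf8 = search(index, "accentué".encode("utf-8"), "utf-8")
print(sorted(matches_latin1), sorted(matches_utf8))
```

Because both the message bodies and the query are converted to one representation before comparison, the locale at index time and at search time no longer affects the result, which is the behaviour the reporter asked for (2 messages in each case).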
Message #44 received at 420728-forwarded@bugs.debian.org (full text, mbox):
[Message part 1 (text/plain, inline)]
<quote who="Richard Curnow" date="Thu, Feb 14, 2008 at 12:16:26AM +0000">
> I suppose the 'fix' would be to pass all bodies through iconv and store
> the index in some canonical form (utf-8 or ucs-2 probably.) As I
> largely still inhabit a world of us-ascii encodings, or iso-8859-1 at a
> pinch, I am not itching to tackle this. We would also acquire a
> dependency on the iconv library.
Thanks for the explanation about what is going on.
I tend to agree that a canonical UTF-8 or UCS-2 form is going to be the right approach.
I'm happy to update the bug accordingly and leave it at that until
someone can provide a patch.
Regards,
Mako
--
Benjamin Mako Hill
mako@atdot.cc
http://mako.cc/
Creativity can be a social contribution, but only in so far
as society is free to use the results. --GNU Manifesto
Debian bug tracking system administrator <owner@bugs.debian.org>.
Last modified:
Fri Feb 10 06:21:56 2012;
Machine Name:
duarte.debian.org
Debian Bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.