Debian Bug report logs - #420728
mairix returns incorrect results in certain local-situations

Package: mairix; Maintainer for mairix is Benjamin Mako Hill <mako@debian.org>; Source for mairix is src:mairix.

Reported by: Vincent Lefevre <vincent@vinc17.org>

Date: Tue, 24 Apr 2007 10:45:02 UTC

Severity: normal

Tags: upstream

Found in version mairix/0.20-1

Forwarded to Richard Curnow <rc@rc0.org.uk>


Report forwarded to debian-bugs-dist@lists.debian.org, Benjamin Mako Hill <mako@debian.org>:
Bug#420728; Package mairix. Full text and rfc822 format available.

Acknowledgement sent to Vincent Lefevre <vincent@vinc17.org>:
New Bug report received and forwarded. Copy sent to Benjamin Mako Hill <mako@debian.org>. Full text and rfc822 format available.

Message #5 received at submit@bugs.debian.org (full text, mbox):

From: Vincent Lefevre <vincent@vinc17.org>
To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: mairix finds no matches in UTF-8 locales
Date: Tue, 24 Apr 2007 12:43:43 +0200
Package: mairix
Version: 0.20-1
Severity: normal

mairix finds no matches in UTF-8 locales. For instance, in ISO-8859-1:

vin:~> locale
LANG=POSIX
LC_CTYPE=en_US.ISO8859-1
LC_NUMERIC="POSIX"
LC_TIME=en_DK
LC_COLLATE=POSIX
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
vin:~> mairix testé
Matched 305 messages

But in UTF-8, on the same machine:

vin:~> locale
LANG=POSIX
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="POSIX"
LC_TIME=en_DK
LC_COLLATE=POSIX
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
vin:~> mairix testé
Matched 0 messages

-- System Information:
Debian Release: lenny/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'stable')
Architecture: i386 (i686)

Kernel: Linux 2.6.18-4-686-bigmem (SMP w/2 CPU cores)
Locale: LANG=POSIX, LC_CTYPE=en_US.ISO8859-1 (charmap=ISO-8859-1)
Shell: /bin/sh linked to /bin/bash

Versions of packages mairix depends on:
ii  libbz2-1.0                    1.0.3-6    high-quality block-sorting file co
ii  libc6                         2.5-2      GNU C Library: Shared libraries
ii  zlib1g                        1:1.2.3-13 compression library - runtime

mairix recommends no packages.

-- no debconf information



Message #10 received at 420728@bugs.debian.org (full text, mbox):

From: "Benj. Mako Hill" <mako@debian.org>
To: Vincent Lefevre <vincent@vinc17.org>, 420728@bugs.debian.org
Subject: Re: Bug#420728: mairix finds no matches in UTF-8 locales
Date: Wed, 25 Apr 2007 09:21:21 -0400
Something's going on here. I *only* use mairix in UTF-8 locales and, as
you can imagine, it works just fine. Are you sure you have generated
your locale data?

Regards,
Mako

-- 
Benjamin Mako Hill
mako@debian.org
http://mako.cc/




Message #15 received at 420728@bugs.debian.org (full text, mbox):

From: Vincent Lefevre <vincent@vinc17.org>
To: "Benj. Mako Hill" <mako@debian.org>
Cc: 420728@bugs.debian.org
Subject: Re: Bug#420728: mairix finds no matches in UTF-8 locales
Date: Wed, 25 Apr 2007 16:35:35 +0200
On 2007-04-25 09:21:21 -0400, Benj. Mako Hill wrote:
> Something's going on here. I *only* use mairix in UTF-8 locales and, as
> you can imagine, it works just fine. Are you sure you have generated
> your locale data?

Yes, UTF-8 works fine with other applications.

Also, I most often use mairix in ISO-8859-1.

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)



Message #20 received at 420728@bugs.debian.org (full text, mbox):

From: "Benj. Mako Hill" <mako@debian.org>
To: Vincent Lefevre <vincent@vinc17.org>
Cc: 420728@bugs.debian.org
Subject: Re: Bug#420728: mairix finds no matches in UTF-8 locales
Date: Wed, 25 Apr 2007 11:13:54 -0400
<quote who="Vincent Lefevre" date="Wed, Apr 25, 2007 at 04:35:35PM +0200">
> On 2007-04-25 09:21:21 -0400, Benj. Mako Hill wrote:
> > Something's going on here. I *only* use mairix in UTF-8 locales and, as
> > you can imagine, it works just fine. Are you sure you have generated
> > your locale data?
> 
> Yes, UTF-8 works fine with other applications.

My point was that it works fine with Mairix. I can't reproduce your bug.
I'm pretty sure you've got something screwy going on with your system.

Regards,
Mako

-- 
Benjamin Mako Hill
mako@debian.org
http://mako.cc/




Message #25 received at 420728@bugs.debian.org (full text, mbox):

From: Vincent Lefevre <vincent@vinc17.org>
To: "Benj. Mako Hill" <mako@debian.org>
Cc: 420728@bugs.debian.org
Subject: Re: Bug#420728: mairix finds no matches in UTF-8 locales
Date: Wed, 25 Apr 2007 18:21:47 +0200
On 2007-04-25 11:13:54 -0400, Benj. Mako Hill wrote:
> My point was that it works fine with Mairix. I can't reproduce your bug.
> I'm pretty sure you've got something screwy going on with your system.

I have exactly the same problem under Mac OS X.

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)



Message #30 received at 420728@bugs.debian.org (full text, mbox):

From: Vincent Lefevre <vincent@vinc17.org>
To: 420728@bugs.debian.org
Subject: Re: mairix finds no matches in UTF-8 locales
Date: Tue, 11 Dec 2007 15:27:30 +0100
I've done some more tests under Debian/unstable.

* With a database rebuilt (rm Mail/.mairix; mairix -p -v) under
  UTF-8 locales, "mairix testé" finds no matches, whether it is
  typed under ISO-8859-1 locales or UTF-8 locales.

* With a database rebuilt (rm Mail/.mairix; mairix -p -v) under
  ISO-8859-1 locales, "mairix testé" finds matches only when typed
  under ISO-8859-1 locales.

Applications (e.g. mutt) work fine, whether I'm under ISO-8859-1
or UTF-8 locales. This is a mairix-specific problem.

I'll try to provide a simple testcase.

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)




Message #35 received at 420728@bugs.debian.org (full text, mbox):

From: Vincent Lefevre <vincent@vinc17.org>
To: 420728@bugs.debian.org
Subject: Re: mairix finds no matches in UTF-8 locales
Date: Sat, 22 Dec 2007 21:41:56 +0100
[Message part 1 (text/plain, inline)]
On 2007-12-11 15:27:30 +0100, Vincent Lefevre wrote:
> I'll try to provide a simple testcase.

Here it is. To reproduce the bug, run the script. In /tmp/mairix-test,
it will create a mailbox with two messages: one with the iso-8859-1
charset, one with the utf-8 charset. Both contain the word "accentué".
The script also creates the mairix database under 3 different locales:
C, en_US.iso88591 and en_US.utf8 (all installed here).

Then cd to /tmp/mairix-test and run the following commands:

   mairix -f mairixrc-C accentué
   mairix -f mairixrc-en_US.iso88591 accentué
   mairix -f mairixrc-en_US.utf8 accentué

in iso-8859-1 locales and in utf-8 locales.

Here, I get 0 messages except for mairixrc-en_US.iso88591 in iso88591
locales, where I get 1 message: the one written in iso-8859-1.

The correct behavior would be 2 messages in each case.
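
For readers without the attachment: the following is a minimal Python sketch of the kind of mailbox described above, not the attached non-ascii.sh (which is not reproduced in this log). The addresses, subjects, body text and the temporary directory are all assumptions made for illustration. It only writes the two messages and does not run mairix.

```python
import os
import tempfile

def make_message(charset, n):
    """Return one mbox entry whose 8-bit body is encoded in `charset`."""
    headers = (
        "From test@example.org Sat Dec 22 21:41:56 2007\n"
        "From: test@example.org\n"
        "To: test@example.org\n"
        f"Subject: test {n}\n"
        "MIME-Version: 1.0\n"
        f"Content-Type: text/plain; charset={charset}\n"
        "Content-Transfer-Encoding: 8bit\n"
        "\n"
    ).encode("ascii")
    # "\u00e9" is the precomposed e-acute, so it can be encoded in both charsets.
    body = "Un mot accentu\u00e9.\n\n".encode(charset)
    return headers + body

# Hypothetical working directory; the original script uses /tmp/mairix-test.
workdir = tempfile.mkdtemp(prefix="mairix-test-")
mbox_path = os.path.join(workdir, "mbox")
with open(mbox_path, "wb") as f:
    f.write(make_message("iso-8859-1", 1))
    f.write(make_message("utf-8", 2))

# Each encoding of the word occurs exactly once in the file.
with open(mbox_path, "rb") as f:
    data = f.read()
print(data.count(b"accentu\xe9"), data.count(b"accentu\xc3\xa9"))
```

The same word is thus present in the mailbox as two different byte sequences, which is what the three mairix invocations above exercise.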

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)
[non-ascii.sh (application/x-sh, attachment)]

Changed Bug title to `mairix returns incorrect results in certain local-situations' from `mairix finds no matches in UTF-8 locales'. Request was from "Benj. Mako Hill" <mako@debian.org> to control@bugs.debian.org. (Mon, 14 Jan 2008 15:30:04 GMT) Full text and rfc822 format available.

Tags added: upstream Request was from "Benj. Mako Hill" <mako@debian.org> to control@bugs.debian.org. (Mon, 14 Jan 2008 15:30:05 GMT) Full text and rfc822 format available.

Reply sent to "Benj. Mako Hill" <mako@debian.org>:
You have marked Bug as forwarded. Full text and rfc822 format available.

Message #42 received at 420728-forwarded@bugs.debian.org (full text, mbox):

From: "Benj. Mako Hill" <mako@debian.org>
To: Richard Curnow <rc@rc0.org.uk>
Cc: 420728-forwarded@bugs.debian.org, control@bugs.debian.org,
	Vincent Lefevre <vincent@vinc17.org>
Subject: [vincent@vinc17.org: Bug#420728: mairix finds no matches in UTF-8
	locales]
Date: Sun, 13 Jan 2008 18:10:18 -0500
[Message part 1 (text/plain, inline)]
retitle 420728 mairix returns incorrect results in certain local-situations
tags 420728 upstream
thanks

Richard,

I'm forwarding a bug in Mairix reported by a Debian user. I've worked
with Vincent to confirm and provide a test case for this issue. I'm
forwarding our complete correspondence on this issue but you can, as
always, check out the bug and necessary attachments in the Debian bug
tracking system.

Let me know if you have additional questions or if there's anything I
can do to help with this.

Thanks Vincent, for getting us here and thanks Richard for maintaining
Mairix!

Regards
Mako

-- 
Benjamin Mako Hill
mako@debian.org
http://mako.cc/



Message #43 received at 420728-forwarded@bugs.debian.org (full text, mbox):

From: Richard Curnow <rc@rc0.org.uk>
To: "Benj. Mako Hill" <mako@debian.org>
Cc: 420728-forwarded@bugs.debian.org, control@bugs.debian.org,
	Vincent Lefevre <vincent@vinc17.org>
Subject: Re: [vincent@vinc17.org: Bug#420728: mairix finds no matches in
	UTF-8 locales]
Date: Thu, 14 Feb 2008 00:16:26 +0000
On Sun, Jan 13, 2008 at 06:10:18PM -0500, Benj. Mako Hill wrote:
> 
> I'm forwarding a bug in Mairix reported by a Debian user. I've worked
> with Vincent to confirm and provide a test case for this issue. I'm
> forwarding our complete correspondence on this issue but you can, as
> always, check out the bug and necessary attachments in the Debian bug
> tracking system.
> 
> Let me know if you have additional questions or if there's anything I
> can do to help with this.

Firstly, sorry for the dreadfully slow reply.

I think the problem might be that the index is built without any attempt
to canonicalise the tokens with regard to the message encodings.  If the
messages containing 'testé' all used iso-8859-1 encoding for their
bodies, the 'é' would be stored as 0xe9 inside the tokens in the index.
However, I suspect that when the locale is utf-8 at search time, the 'é'
is represented as the byte pair 0xc3 0xa9 (maybe my unicode is incorrect
though), and it doesn't match.  Or something along those lines :-)
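
The byte values mentioned above can be checked with a short Python snippet. This is only an illustration of the two encodings; it says nothing about how mairix tokenises messages internally.

```python
# Illustrative only: the same word as bytes in the two encodings.
word = "test\u00e9"  # "testé", with a precomposed e-acute

latin1_bytes = word.encode("iso-8859-1")  # as found in an ISO-8859-1 message body
utf8_bytes = word.encode("utf-8")         # as typed on a UTF-8 command line

print(latin1_bytes.hex(" "))  # 74 65 73 74 e9
print(utf8_bytes.hex(" "))    # 74 65 73 74 c3 a9

# A byte-wise comparison of the two forms cannot succeed.
print(latin1_bytes == utf8_bytes)  # False
```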

Generalising, if someone has various messages with bodies using all
sorts of encodings, this is going to be chaos.  I could imagine polyglots
with friends in Russia, Japan, Europe who have messages in a mix of
SJIS, iso-8859-1 & whatever Russia uses.

I suppose the 'fix' would be to pass all bodies through iconv and store
the index in some canonical form (utf-8 or ucs-2 probably.)  As I
largely still inhabit a world of us-ascii encodings, or iso-8859-1 at a
pinch, I am not itching to tackle this.  We would also acquire a
dependency on the iconv library.
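
A rough sketch of that idea, in Python rather than C and with no claim about mairix's actual data structures: every body is decoded from its declared charset, tokens are stored in one canonical form (UTF-8 here), and the query is converted from the locale's charset before lookup. The function names, the tokenising regex and the sample messages are all hypothetical.

```python
import re
from collections import defaultdict

def tokens(raw_body, charset):
    """Decode a body from its declared charset and split it into lowercase words."""
    text = raw_body.decode(charset, errors="replace")
    return {t.lower() for t in re.findall(r"\w+", text)}

def build_index(messages):
    """Map each canonical token (stored as UTF-8 bytes) to a set of message ids."""
    index = defaultdict(set)
    for msg_id, (raw_body, charset) in enumerate(messages):
        for tok in tokens(raw_body, charset):
            index[tok.encode("utf-8")].add(msg_id)
    return index

def search(index, raw_query, locale_charset):
    """Convert the query from the locale's charset to the canonical form, then look it up."""
    key = raw_query.decode(locale_charset).lower().encode("utf-8")
    return index.get(key, set())

# Two messages with the same text, one per charset (as in the test case above).
messages = [
    ("Un mot accentu\u00e9".encode("iso-8859-1"), "iso-8859-1"),
    ("Un mot accentu\u00e9".encode("utf-8"), "utf-8"),
]
index = build_index(messages)

# The query bytes differ by locale, but both find both messages.
hits_latin1 = search(index, "accentu\u00e9".encode("iso-8859-1"), "iso-8859-1")
hits_utf8 = search(index, "accentu\u00e9".encode("utf-8"), "utf-8")
print(sorted(hits_latin1), sorted(hits_utf8))  # [0, 1] [0, 1]
```

In a C implementation the two decode steps would be where iconv comes in, which is the new dependency mentioned above.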





Message #44 received at 420728-forwarded@bugs.debian.org (full text, mbox):

From: "Benj. Mako Hill" <mako@atdot.cc>
To: Richard Curnow <rc@rc0.org.uk>
Cc: 420728-forwarded@bugs.debian.org, control@bugs.debian.org,
	Vincent Lefevre <vincent@vinc17.org>
Subject: Re: [vincent@vinc17.org: Bug#420728: mairix finds no matches in
	UTF-8 locales]
Date: Fri, 15 Feb 2008 10:37:30 -0500
[Message part 1 (text/plain, inline)]
<quote who="Richard Curnow" date="Thu, Feb 14, 2008 at 12:16:26AM +0000">
> I suppose the 'fix' would be to pass all bodies through iconv and store
> the index in some canonical form (utf-8 or ucs-2 probably.)  As I
> largely still inhabit a world of us-ascii encodings, or iso-8859-1 at a
> pinch, I am not itching to tackle this.  We would also acquire a
> dependency on the iconv library.

Thanks for the explanation about what is going on. 

I tend to agree that a canonical UTF-8 or UCS-2 index is going to be the right approach.
I'm happy to update the bug accordingly and leave it at that until
someone can provide a patch.

Regards,
Mako

-- 
Benjamin Mako Hill
mako@atdot.cc
http://mako.cc/

Creativity can be a social contribution, but only in so far
as society is free to use the results. --GNU Manifesto
[signature.asc (application/pgp-signature, inline)]

Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Fri Feb 10 06:21:56 2012; Machine Name: duarte.debian.org

Debian Bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.