Debian Bug report logs - #358858
grep uses too much memory

version graph

Package: grep; Maintainer for grep is Anibal Monsalve Salazar <anibal@debian.org>; Source for grep is src:grep (PTS, buildd, popcon).

Reported by: John Goerzen <jgoerzen@complete.org>

Date: Fri, 24 Mar 2006 20:18:05 UTC

Severity: normal

Merged with 195719

Found in version 2.5.1-5

Fixed in version 2.5.1.ds2-4

Done: Aníbal Monsalve Salazar <anibal@debian.org>

Bug is archived. No further changes may be made.

Forwarded to bug-gnu-utils@gnu.org

Toggle useless messages

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to debian-bugs-dist@lists.debian.org, Anibal Monsalve Salazar <anibal@debian.org>:
Bug#358858; Package grep. (full text, mbox, link).


Acknowledgement sent to John Goerzen <jgoerzen@complete.org>:
New Bug report received and forwarded. Copy sent to Anibal Monsalve Salazar <anibal@debian.org>. (full text, mbox, link).


Message #5 received at submit@bugs.debian.org (full text, mbox, reply):

From: John Goerzen <jgoerzen@complete.org>
To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: grep: Leaks memory
Date: Fri, 24 Mar 2006 14:09:29 -0600
Package: grep
Version: 2.5.1.ds2-4
Severity: important

As I was trying to track down why a lot of memory was in use on my
system, I noticed:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27123 jgoerzen  18   0  220m 219m  604 R  6.0 24.8   1:13.36 grep

That is just running grep -r with an unescaped IP over the current
directory.  That is:

grep -r 1.2.3.4 .

I see no reason why it should be using so much RAM.

-- System Information:
Debian Release: testing/unstable
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: i386 (i686)
Shell:  /bin/sh linked to /bin/bash
Kernel: Linux 2.6.13.3
Locale: LANG=C, LC_CTYPE=en_US (charmap=ISO-8859-1)

Versions of packages grep depends on:
ii  libc6                         2.3.6-3    GNU C Library: Shared libraries an

grep recommends no packages.

-- no debconf information



Message sent on to John Goerzen <jgoerzen@complete.org>:
Bug#358858. (full text, mbox, link).


Message #8 received at 358858-submitter@bugs.debian.org (full text, mbox, reply):

From: Justin Pryzby <justinpryzby@users.sourceforge.net>
To: 358858-submitter@bugs.debian.org
Subject: your bug/grep: Leaks memory
Date: Wed, 12 Apr 2006 13:39:35 -0400
Hello John,

Can you still reproduce this bug?

grep: Leaks memory
http://bugs.debian.org/358858

Thanks
Justin



Message sent on to John Goerzen <jgoerzen@complete.org>:
Bug#358858. (full text, mbox, link).


Message #11 received at 358858-submitter@bugs.debian.org (full text, mbox, reply):

From: Justin Pryzby <justinpryzby@users.sourceforge.net>
To: 358858-submitter@bugs.debian.org
Subject: reproducing your bug/grep: Leaks memory
Date: Wed, 12 Apr 2006 15:03:08 -0400
I had some success reproducing this.  I grepped through my laptop's
nearly-full 10GB disk filesystem (not the device), and grep had
nearly-constant memory usage through the 30m period :)

But it did have trouble (as can be expected) eg with grepping devices
(in particular, /dev/zero).

Could you provide details about the environment under which grep
exhibited this behaviour?  For starters,

  pwd
  du -sh
  find |wc -l

Perhaps some statistics about the filesizes, etc.

Also a list of opened files (ls -l /proc/pid/fd).

Also, it would be interesting to see the valgrind output if you
compile grep locally with DEB_BUILD_OPTIONS=nostrip, or just use the
unstrpped binary from the build-tree.  Something like:

  valgrind --log-file-exactly=/tmp/grep-r-valgrind --show-reachable=yes --leak-check=full grep -r 1.2.3.4

Thanks
Justin



Information forwarded to debian-bugs-dist@lists.debian.org, Anibal Monsalve Salazar <anibal@debian.org>:
Bug#358858; Package grep. (full text, mbox, link).


Acknowledgement sent to Nicolas François <nicolas.francois@centraliens.net>:
Extra info received and forwarded to list. Copy sent to Anibal Monsalve Salazar <anibal@debian.org>. (full text, mbox, link).


Message #16 received at 358858@bugs.debian.org (full text, mbox, reply):

From: Nicolas François <nicolas.francois@centraliens.net>
To: 195719@bugs.debian.org, 358858@bugs.debian.org
Cc: Justin Pryzby <justinpryzby@users.sourceforge.net>, John Goerzen <jgoerzen@complete.org>, KORN Andras <korn-debbugs@chardonnay.math.bme.hu>
Subject: Re: Bug#195719/#358858: grep: Leaks memory
Date: Sat, 6 Oct 2007 12:26:23 +0200
forcemerge 195719 358858
thanks

Hello,

I investigated this memory leak, and did not find any evidence of a memory
leak (a large amount of memory is used, without leaking).

Thus I'm merging 195719 358858 which are both about grep leaking memory.
Consequently, the 358858's severity is downgraded from important to normal.

I also recommend to close these bugs or tag them as wontfix and changing
the title to "grep uses too much memory", or a wishlist --lowmem option
could be implemented.



I'm answering below to some of the comments of #195719 and #358858

grep uses a DFA algorithm to perform regexp matching. This DFA algorithm
is either implemented in grep, or in the libc (when re_search is used).
Which DFA algorithm is used depends on the version of grep and on the grep
options.

The DFA algorithm is a state machine and each time it is used, the
automaton which represents the regexp may use more memory because a new
transition is investigated.
The memory allocated for the automaton is not freed after each line is
parsed, but it is kept so that if a transition path is also used in a
later line, the processing will be faster.
Thus more and more memory is allocated.

This explains Justin concern in #195719:

On Thu, Nov 25, 2004 at 06:38:53PM -0500, Justin Pryzby wrote:
> 
> What I don't understand and cannot explain is why memory usage begins
> to climb linearly *after* the compilation.  It seems to me that it
> should do the exact same thing once per input line, and that should
> never require more memory (assuming input lines are about the same
> length, which they are).  Thus I suspect some other problem.

Also, in #195719:
On Thu, Dec 10, 2005 at 20:08:28 -0500, Justin Pryzby wrote:
>
> The culprit seems to be
> 
>    1024816 bytes in 2596 blocks are still reachable in loss record 18 of 18
>    at 0x1B90459D: malloc (vg_replace_malloc.c:130)
>    by 0x804C5EC: xmalloc (dfa.c:143)
>    by 0x804E9BB: build_state (dfa.c:2330)
>    by 0x804FC7E: dfaexec (dfa.c:2372)
>    by 0x8054DE2: EGexecute (search.c:402)
>    by 0x804A6C5: grepbuf (grep.c:732)
>    by 0x804B1A5: grepfile (grep.c:851)
>    by 0x804C39D: main (grep.c:1788)
> 
> (valgrind --leak-check=full --show-reachable=yes ./build-tree/grep-2.5.1/src/grep -f /tmp/words /usr/share/dict/words)

I could not reproduce exactly the same (the grep DFA algorithm was
used, and currently grep uses re_search from the libc, so the `leak' is
now in the libc)

Given that /tmp/words was composed of > 16 characters words, the DFA
algorithm had to build one automaton per lines in /tmp/words, each of
these automaton should have more than 16 states, if there are 150 lines in
/tmp/words, we are close to the number of blocks allocated by build_state.

As the automata are used up to the last line, the buffers are not freed by
grep, but this is not a bug.
Doing a regfree at the end of grep shows that this valgrind record is then
freed (hence the memory is not leaking).



In this use case, I would recommend to use the -F option, as Justin
mentioned in #195719. This will avoid using a automaton.

In #358858, the search pattern is an IP, but John used "1.2.3.4".
To match an IP, again, the -F option could (should?) be used, or at least
the regular expression should be tighten to "1\.2\.3\.4"
(The maximum number of transitions is 7*256 versus probably something like
256*256*256*256)

So I'm not really surprised by a 200M memory usage.

Kind Regards,
-- 
Nekral




Forcibly Merged 195719 358858. Request was from Nicolas François <nicolas.francois@centraliens.net> to control@bugs.debian.org. (Sat, 06 Oct 2007 10:36:05 GMT) (full text, mbox, link).


Information forwarded to debian-bugs-dist@lists.debian.org, Anibal Monsalve Salazar <anibal@debian.org>:
Bug#358858; Package grep. (full text, mbox, link).


Acknowledgement sent to Justin Pryzby <jpryzby+d@quoininc.com>:
Extra info received and forwarded to list. Copy sent to Anibal Monsalve Salazar <anibal@debian.org>. (full text, mbox, link).


Message #23 received at 358858@bugs.debian.org (full text, mbox, reply):

From: Justin Pryzby <jpryzby+d@quoininc.com>
To: Nicolas François <nicolas.francois@centraliens.net>
Cc: 195719@bugs.debian.org, 358858@bugs.debian.org, John Goerzen <jgoerzen@complete.org>, KORN Andras <korn-debbugs@chardonnay.math.bme.hu>
Subject: Re: Bug#195719/#358858: grep: Leaks memory
Date: Sat, 6 Oct 2007 13:56:28 -0400
On Sat, Oct 06, 2007 at 12:26:23PM +0200, Nicolas François wrote:
> forcemerge 195719 358858
> thanks
> 
> Hello,
> 
> I investigated this memory leak, and did not find any evidence of a memory
> leak (a large amount of memory is used, without leaking).
Thanks for looking into it.  I also tried to reproduce it, both on
current sid-ish versions and in an etch chroot.  Unfortunatley I was
unable to reproduce it even with my own test case (doh)..  I initially
came across the problem in a different context.  Perhaps I'll try to
dig this up.

> grep uses a DFA algorithm to perform regexp matching. This DFA algorithm
> is either implemented in grep, or in the libc (when re_search is used).
> Which DFA algorithm is used depends on the version of grep and on the grep
> options.
Neat; thanks for the explanation.  Is this configured at compiletime
or ???  What options have what effect?

Thanks
Justin




Changed Bug title to `grep uses too much memory' from `grep: Leaks memory'. Request was from Aníbal Monsalve Salazar <anibal@debian.org> to control@bugs.debian.org. (Sun, 29 Mar 2009 02:15:03 GMT) (full text, mbox, link).


Reply sent to Aníbal Monsalve Salazar <anibal@debian.org>:
You have taken responsibility. (Sun, 29 Mar 2009 02:24:10 GMT) (full text, mbox, link).


Notification sent to John Goerzen <jgoerzen@complete.org>:
Bug acknowledged by developer. (Sun, 29 Mar 2009 02:24:10 GMT) (full text, mbox, link).


Message #30 received at 358858-done@bugs.debian.org (full text, mbox, reply):

From: Aníbal Monsalve Salazar <anibal@debian.org>
To: 358858-done@bugs.debian.org, 195719-done@bugs.debian.org
Subject: Re: Bug#195719: /#358858: grep: Leaks memory
Date: Sun, 29 Mar 2009 13:21:05 +1100
Version: 2.5.1.ds2-4

On Sat, Oct 06, 2007 at 12:26:23PM +0200, Nicolas François wrote:
>forcemerge 195719 358858
>thanks
>
>Hello,
>
>I investigated this memory leak, and did not find any evidence of a
>memory leak (a large amount of memory is used, without leaking).
>
>Thus I'm merging 195719 358858 which are both about grep leaking
>memory.  Consequently, the 358858's severity is downgraded from
>important to normal.
>
>I also recommend to close these bugs or tag them as wontfix and
>changing the title to "grep uses too much memory", or a wishlist
>--lowmem option could be implemented.
>
>I'm answering below to some of the comments of #195719 and #358858
>
>grep uses a DFA algorithm to perform regexp matching. This DFA
>algorithm is either implemented in grep, or in the libc (when re_search
>is used).  Which DFA algorithm is used depends on the version of grep
>and on the grep options.
>
>The DFA algorithm is a state machine and each time it is used, the
>automaton which represents the regexp may use more memory because a new
>transition is investigated.  The memory allocated for the automaton is
>not freed after each line is parsed, but it is kept so that if a
>transition path is also used in a later line, the processing will be
>faster.  Thus more and more memory is allocated.
>
>This explains Justin concern in #195719:
>
>On Thu, Nov 25, 2004 at 06:38:53PM -0500, Justin Pryzby wrote:
>>
>>What I don't understand and cannot explain is why memory usage begins
>>to climb linearly *after* the compilation.  It seems to me that it
>>should do the exact same thing once per input line, and that should
>>never require more memory (assuming input lines are about the same
>>length, which they are).  Thus I suspect some other problem.
>
>Also, in #195719:
>On Thu, Dec 10, 2005 at 20:08:28 -0500, Justin Pryzby wrote:
>>
>>The culprit seems to be
>>
>>   1024816 bytes in 2596 blocks are still reachable in loss record 18 of 18
>>   at 0x1B90459D: malloc (vg_replace_malloc.c:130)
>>   by 0x804C5EC: xmalloc (dfa.c:143)
>>   by 0x804E9BB: build_state (dfa.c:2330)
>>   by 0x804FC7E: dfaexec (dfa.c:2372)
>>   by 0x8054DE2: EGexecute (search.c:402)
>>   by 0x804A6C5: grepbuf (grep.c:732)
>>   by 0x804B1A5: grepfile (grep.c:851)
>>   by 0x804C39D: main (grep.c:1788)
>>
>>(valgrind --leak-check=full --show-reachable=yes ./build-tree/grep-2.5.1/src/grep -f /tmp/words /usr/share/dict/words)
>
>I could not reproduce exactly the same (the grep DFA algorithm was
>used, and currently grep uses re_search from the libc, so the `leak' is
>now in the libc)
>
>Given that /tmp/words was composed of > 16 characters words, the DFA
>algorithm had to build one automaton per lines in /tmp/words, each of
>these automaton should have more than 16 states, if there are 150 lines
>in /tmp/words, we are close to the number of blocks allocated by
>build_state.
>
>As the automata are used up to the last line, the buffers are not freed
>by grep, but this is not a bug.  Doing a regfree at the end of grep
>shows that this valgrind record is then freed (hence the memory is not
>leaking).
>
>In this use case, I would recommend to use the -F option, as Justin
>mentioned in #195719. This will avoid using a automaton.
>
>In #358858, the search pattern is an IP, but John used "1.2.3.4".  To
>match an IP, again, the -F option could (should?) be used, or at least
>the regular expression should be tighten to "1\.2\.3\.4" (The maximum
>number of transitions is 7*256 versus probably something like
>256*256*256*256)
>
>So I'm not really surprised by a 200M memory usage.
>
>Kind Regards,
>-- 
>Nekral




Reply sent to Aníbal Monsalve Salazar <anibal@debian.org>:
You have taken responsibility. (Sun, 29 Mar 2009 02:24:11 GMT) (full text, mbox, link).


Notification sent to KORN Andras <korn-debbugs@chardonnay.math.bme.hu>:
Bug acknowledged by developer. (Sun, 29 Mar 2009 02:24:11 GMT) (full text, mbox, link).


No longer marked as found in versions grep/2.5.1.ds2-4. Request was from Andreas Beckmann <anbe@debian.org> to control@bugs.debian.org. (Fri, 01 Nov 2013 01:21:41 GMT) (full text, mbox, link).


Bug archived. Request was from Debbugs Internal Request <owner@bugs.debian.org> to internal_control@bugs.debian.org. (Fri, 29 Nov 2013 07:29:28 GMT) (full text, mbox, link).


Send a report that this bug log contains spam.


Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Wed Oct 11 00:24:15 2017; Machine Name: beach

Debian Bug tracking system

Debbugs is free software and licensed under the terms of the GNU Public License version 2. The current version can be obtained from https://bugs.debian.org/debbugs-source/.

Copyright © 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson, 2005-2017 Don Armstrong, and many other contributors.