Debian Bug report logs -
#195719
grep uses too much memory
Reported by: KORN Andras <korn-debbugs@chardonnay.math.bme.hu>
Date: Sun, 1 Jun 2003 19:48:01 UTC
Severity: normal
Merged with 358858
Found in version 2.5.1-5
Fixed in version 2.5.1.ds2-4
Done: Aníbal Monsalve Salazar <anibal@debian.org>
Bug is archived. No further changes may be made.
Forwarded to bug-gnu-utils@gnu.org
Toggle useless messages
Report forwarded to debian-bugs-dist@lists.debian.org, rmgolbeck@debian.org (Ryan M. Golbeck):
Bug#195719; Package grep.
(full text, mbox, link).
Acknowledgement sent to KORN Andras <korn-debbugs@chardonnay.math.bme.hu>:
New Bug report received and forwarded. Copy sent to rmgolbeck@debian.org (Ryan M. Golbeck).
(full text, mbox, link).
Message #5 received at submit@bugs.debian.org (full text, mbox, reply):
Package: grep
Version: 2.5.1-5
Severity: normal
Hi,
21936 webalize 19 19 2661m 808m 404 R 38.4 80.1 693:26.00 grep -haf /store/tmp/grepfile.21921
/store/tmp/grepfile.21921 is 17272 lines long; it contains patterns
like:
^yokiza.demon.nl
I.e. all lines begin with ^ and contain a hostname or IP address.
grep is invoked as:
zcat large-file1.gz [...] large-fileN.gz \
| grep -haf /store/tmp/grepfile.21921 \
| gzip -9 \
> somefile.gz
The large-files are apache accesslogs in "combined" format.
I fail to see why this should need so much RAM. (Perhaps there is a
reason, but it isn't apparent.)
Andrew
-- System Information:
Debian Release: testing/unstable
Architecture: i386
Kernel: Linux stop 2.4.21-pre5-xfs-stop #1 Thu Feb 27 12:03:37 CET 2003 i686
Locale: LANG=C, LC_CTYPE=C
Versions of packages grep depends on:
ii libc6 2.3.1-17 GNU C Library: Shared libraries an
-- no debconf information
Information forwarded to debian-bugs-dist@lists.debian.org, rmgolbeck@debian.org (Ryan M. Golbeck):
Bug#195719; Package grep.
(full text, mbox, link).
Acknowledgement sent to Justin Pryzby <justinpryzby@users.sourceforge.net>:
Extra info received and forwarded to list. Copy sent to rmgolbeck@debian.org (Ryan M. Golbeck).
(full text, mbox, link).
Message #10 received at 195719@bugs.debian.org (full text, mbox, reply):
Hi Andrew,
Last year, you reported a bug in with `grep -f `. I'm trying to do
something similar, and I, too, believe that grep has a memory leak.
The memory use increases mostly-monotonically, and my grep is
eventually killed, after using 1.5GB of RAM+Swap. My file is 8000
lines, with each line beginning with ^ and ending with $. I have not
escaped some . characters.
Do you have any more information to report?
Thanks,
Justin
Acknowledgement sent to Justin Pryzby <justinpryzby@users.sourceforge.net>:
Extra info received and filed, but not forwarded.
(full text, mbox, link).
Message #15 received at 195719-quiet@bugs.debian.org (full text, mbox, reply):
src/grep.c:main (*compile)(); uses a constant 82MB of RAM and takes 7
minutes. That is acceptable, but maybe not right. Afterwards, it
begins its linearly increasing mem usage.
Justin
Information forwarded to debian-bugs-dist@lists.debian.org, rmgolbeck@debian.org (Ryan M. Golbeck):
Bug#195719; Package grep.
(full text, mbox, link).
Acknowledgement sent to Justin Pryzby <justinpryzby@users.sourceforge.net>:
Extra info received and forwarded to list. Copy sent to rmgolbeck@debian.org (Ryan M. Golbeck).
(full text, mbox, link).
Message #20 received at 195719@bugs.debian.org (full text, mbox, reply):
Hi,
I think I've found the solution to the problem that you and I have
seen. However, I suspect that there is still a problem with grep.
Try grep -Fxvf. F means it is a string, not a pattern; x means to
match the *whole* string (it is equivalent, I think, to using ^ and
$). v inverts the match (I want it, maybe you don't). And f reads
from the file.
Alternately, try replacing and bare . characters in the file with \.
(otherwise . will matching anything). Then, use grep -Fvf. This also
does what I want (nearly the same thing as grep -Fxvf unescaped,
above; this also yields some false positives if one line is a strict
superset of another line).
This explains why grep -f takes such a long time to compile; it is
probably building a character list for every possible match match;
that also explains the memory usage (though I don't know that this is
its algorithm, or if that algorithm can't be vastly improved):
8000 lines * 20 chars per line = 160000 chars
This would also be the size of each element of the list of possible
matching strings
2 metachar per line * 256 bytes per metachar = 512 bytes / line
512*160000= 81920000 bytes = 82MB, almost precisely what I observed
compilation to use.
What I don't understand and cannot explain is why memory usage begins
to climb linearly *after* the compilation. It seems to me that it
should do the exact same thing once per input line, and that should
never require more memory (assuming input lines are about the same
length, which they are). Thus I suspect some other problem.
Cheers,
Justin
Information forwarded to debian-bugs-dist@lists.debian.org, rmgolbeck@debian.org (Ryan M. Golbeck):
Bug#195719; Package grep.
(full text, mbox, link).
Acknowledgement sent to Andras Korn <korn-debbugs@chardonnay.math.bme.hu>:
Extra info received and forwarded to list. Copy sent to rmgolbeck@debian.org (Ryan M. Golbeck).
(full text, mbox, link).
Message #25 received at 195719@bugs.debian.org (full text, mbox, reply):
On Thu, Nov 25, 2004 at 06:38:53PM -0500, Justin Pryzby wrote:
Hi,
> I think I've found the solution to the problem that you and I have
> seen. However, I suspect that there is still a problem with grep.
>
> Try grep -Fxvf. F means it is a string, not a pattern; x means to
> match the *whole* string (it is equivalent, I think, to using ^ and
> $). v inverts the match (I want it, maybe you don't). And f reads
> from the file.
>
> Alternately, try replacing and bare . characters in the file with \.
> (otherwise . will matching anything). Then, use grep -Fvf. This also
> does what I want (nearly the same thing as grep -Fxvf unescaped,
> above; this also yields some false positives if one line is a strict
> superset of another line).
I don't think this is a solution. What you suggest can only reduce the
memory requirement of reading the -f file. However, as you also have noted,
memory usage increases as input length increases. As I see it, this doesn't
have anything to do with how the patterns are organized; it's either a bug
or a misfeature (some data structures being kept around for efficiency,
maybe?).
> What I don't understand and cannot explain is why memory usage begins
> to climb linearly *after* the compilation. It seems to me that it
> should do the exact same thing once per input line, and that should
> never require more memory (assuming input lines are about the same
> length, which they are). Thus I suspect some other problem.
Exactly.
Andras
--
Andras Korn <korn at chardonnay.math.bme.hu>
<http://chardonnay.math.bme.hu/~korn/> QOTD:
At least a toad can eat what bugs him.
Reply sent to Justin Pryzby <justinpryzby@users.sourceforge.net>:
You have marked Bug as forwarded.
(full text, mbox, link).
Message #28 received at 195719-forwarded@bugs.debian.org (full text, mbox, reply):
Please see <http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=195719>
where it is documented that grep -f uses large amounts of memory, and
it is suggested that it might be leaking memory, causing linearly
increasing mem usage.
You can test with:
grep '^c...............' /usr/share/dict/words >/tmp/words
grep -f /tmp/words /usr/share/dict/words
The culprit seems to be
1024816 bytes in 2596 blocks are still reachable in loss record 18 of 18
at 0x1B90459D: malloc (vg_replace_malloc.c:130)
by 0x804C5EC: xmalloc (dfa.c:143)
by 0x804E9BB: build_state (dfa.c:2330)
by 0x804FC7E: dfaexec (dfa.c:2372)
by 0x8054DE2: EGexecute (search.c:402)
by 0x804A6C5: grepbuf (grep.c:732)
by 0x804B1A5: grepfile (grep.c:851)
by 0x804C39D: main (grep.c:1788)
(valgrind --leak-check=full --show-reachable=yes ./build-tree/grep-2.5.1/src/grep -f /tmp/words /usr/share/dict/words)
--
Clear skies,
Justin
Information forwarded to debian-bugs-dist@lists.debian.org, Anibal Monsalve Salazar <anibal@debian.org>:
Bug#195719; Package grep.
(full text, mbox, link).
Acknowledgement sent to Touko Korpela <tkorpela@phnet.fi>:
Extra info received and forwarded to list. Copy sent to Anibal Monsalve Salazar <anibal@debian.org>.
(full text, mbox, link).
Message #33 received at 195719@bugs.debian.org (full text, mbox, reply):
Is this bug still present with grep version 2.5.1.ds2-6 (etch) or
2.5.3~dfsg-2 (sid/lenny)?
Message sent on to KORN Andras <korn-debbugs@chardonnay.math.bme.hu>:
Bug#195719.
(full text, mbox, link).
Acknowledgement sent to Andras Korn <korn-debbugs@chardonnay.math.bme.hu>:
Extra info received and filed, but not forwarded.
(full text, mbox, link).
Message #41 received at 195719-quiet@bugs.debian.org (full text, mbox, reply):
On Wed, Sep 19, 2007 at 12:26:35AM +0300, Touko Korpela wrote:
Hi,
> Is this bug still present with grep version 2.5.1.ds2-6 (etch) or
> 2.5.3~dfsg-2 (sid/lenny)?
uh, yes, but it will take a while. I don't do that particular thing anymore,
so I'll need to approximate the scenario.
Andras
--
Andras Korn <korn at chardonnay.math.bme.hu>
<http://chardonnay.math.bme.hu/~korn/> QOTD:
Assassins do it from behind.
Information forwarded to debian-bugs-dist@lists.debian.org, Anibal Monsalve Salazar <anibal@debian.org>:
Bug#195719; Package grep.
(full text, mbox, link).
Acknowledgement sent to Nicolas François <nicolas.francois@centraliens.net>:
Extra info received and forwarded to list. Copy sent to Anibal Monsalve Salazar <anibal@debian.org>.
(full text, mbox, link).
Message #46 received at 195719@bugs.debian.org (full text, mbox, reply):
forcemerge 195719 358858
thanks
Hello,
I investigated this memory leak, and did not find any evidence of a memory
leak (a large amount of memory is used, without leaking).
Thus I'm merging 195719 358858 which are both about grep leaking memory.
Consequently, the 358858's severity is downgraded from important to normal.
I also recommend to close these bugs or tag them as wontfix and changing
the title to "grep uses too much memory", or a wishlist --lowmem option
could be implemented.
I'm answering below to some of the comments of #195719 and #358858
grep uses a DFA algorithm to perform regexp matching. This DFA algorithm
is either implemented in grep, or in the libc (when re_search is used).
Which DFA algorithm is used depends on the version of grep and on the grep
options.
The DFA algorithm is a state machine and each time it is used, the
automaton which represents the regexp may use more memory because a new
transition is investigated.
The memory allocated for the automaton is not freed after each line is
parsed, but it is kept so that if a transition path is also used in a
later line, the processing will be faster.
Thus more and more memory is allocated.
This explains Justin concern in #195719:
On Thu, Nov 25, 2004 at 06:38:53PM -0500, Justin Pryzby wrote:
>
> What I don't understand and cannot explain is why memory usage begins
> to climb linearly *after* the compilation. It seems to me that it
> should do the exact same thing once per input line, and that should
> never require more memory (assuming input lines are about the same
> length, which they are). Thus I suspect some other problem.
Also, in #195719:
On Thu, Dec 10, 2005 at 20:08:28 -0500, Justin Pryzby wrote:
>
> The culprit seems to be
>
> 1024816 bytes in 2596 blocks are still reachable in loss record 18 of 18
> at 0x1B90459D: malloc (vg_replace_malloc.c:130)
> by 0x804C5EC: xmalloc (dfa.c:143)
> by 0x804E9BB: build_state (dfa.c:2330)
> by 0x804FC7E: dfaexec (dfa.c:2372)
> by 0x8054DE2: EGexecute (search.c:402)
> by 0x804A6C5: grepbuf (grep.c:732)
> by 0x804B1A5: grepfile (grep.c:851)
> by 0x804C39D: main (grep.c:1788)
>
> (valgrind --leak-check=full --show-reachable=yes ./build-tree/grep-2.5.1/src/grep -f /tmp/words /usr/share/dict/words)
I could not reproduce exactly the same (the grep DFA algorithm was
used, and currently grep uses re_search from the libc, so the `leak' is
now in the libc)
Given that /tmp/words was composed of > 16 characters words, the DFA
algorithm had to build one automaton per lines in /tmp/words, each of
these automaton should have more than 16 states, if there are 150 lines in
/tmp/words, we are close to the number of blocks allocated by build_state.
As the automata are used up to the last line, the buffers are not freed by
grep, but this is not a bug.
Doing a regfree at the end of grep shows that this valgrind record is then
freed (hence the memory is not leaking).
In this use case, I would recommend to use the -F option, as Justin
mentioned in #195719. This will avoid using a automaton.
In #358858, the search pattern is an IP, but John used "1.2.3.4".
To match an IP, again, the -F option could (should?) be used, or at least
the regular expression should be tighten to "1\.2\.3\.4"
(The maximum number of transitions is 7*256 versus probably something like
256*256*256*256)
So I'm not really surprised by a 200M memory usage.
Kind Regards,
--
Nekral
Forcibly Merged 195719 358858.
Request was from Nicolas François <nicolas.francois@centraliens.net>
to control@bugs.debian.org.
(Sat, 06 Oct 2007 10:36:05 GMT) (full text, mbox, link).
Information forwarded to debian-bugs-dist@lists.debian.org, Anibal Monsalve Salazar <anibal@debian.org>:
Bug#195719; Package grep.
(full text, mbox, link).
Acknowledgement sent to Justin Pryzby <jpryzby+d@quoininc.com>:
Extra info received and forwarded to list. Copy sent to Anibal Monsalve Salazar <anibal@debian.org>.
(full text, mbox, link).
Message #53 received at 195719@bugs.debian.org (full text, mbox, reply):
On Sat, Oct 06, 2007 at 12:26:23PM +0200, Nicolas François wrote:
> forcemerge 195719 358858
> thanks
>
> Hello,
>
> I investigated this memory leak, and did not find any evidence of a memory
> leak (a large amount of memory is used, without leaking).
Thanks for looking into it. I also tried to reproduce it, both on
current sid-ish versions and in an etch chroot. Unfortunatley I was
unable to reproduce it even with my own test case (doh).. I initially
came across the problem in a different context. Perhaps I'll try to
dig this up.
> grep uses a DFA algorithm to perform regexp matching. This DFA algorithm
> is either implemented in grep, or in the libc (when re_search is used).
> Which DFA algorithm is used depends on the version of grep and on the grep
> options.
Neat; thanks for the explanation. Is this configured at compiletime
or ??? What options have what effect?
Thanks
Justin
Changed Bug title to `grep uses too much memory' from `grep: memory leak?'.
Request was from Aníbal Monsalve Salazar <anibal@debian.org>
to control@bugs.debian.org.
(Sun, 29 Mar 2009 02:15:03 GMT) (full text, mbox, link).
Reply sent
to Aníbal Monsalve Salazar <anibal@debian.org>:
You have taken responsibility.
(Sun, 29 Mar 2009 02:24:08 GMT) (full text, mbox, link).
Notification sent
to KORN Andras <korn-debbugs@chardonnay.math.bme.hu>:
Bug acknowledged by developer.
(Sun, 29 Mar 2009 02:24:08 GMT) (full text, mbox, link).
Message #60 received at 195719-done@bugs.debian.org (full text, mbox, reply):
Version: 2.5.1.ds2-4
On Sat, Oct 06, 2007 at 12:26:23PM +0200, Nicolas François wrote:
>forcemerge 195719 358858
>thanks
>
>Hello,
>
>I investigated this memory leak, and did not find any evidence of a
>memory leak (a large amount of memory is used, without leaking).
>
>Thus I'm merging 195719 358858 which are both about grep leaking
>memory. Consequently, the 358858's severity is downgraded from
>important to normal.
>
>I also recommend to close these bugs or tag them as wontfix and
>changing the title to "grep uses too much memory", or a wishlist
>--lowmem option could be implemented.
>
>I'm answering below to some of the comments of #195719 and #358858
>
>grep uses a DFA algorithm to perform regexp matching. This DFA
>algorithm is either implemented in grep, or in the libc (when re_search
>is used). Which DFA algorithm is used depends on the version of grep
>and on the grep options.
>
>The DFA algorithm is a state machine and each time it is used, the
>automaton which represents the regexp may use more memory because a new
>transition is investigated. The memory allocated for the automaton is
>not freed after each line is parsed, but it is kept so that if a
>transition path is also used in a later line, the processing will be
>faster. Thus more and more memory is allocated.
>
>This explains Justin concern in #195719:
>
>On Thu, Nov 25, 2004 at 06:38:53PM -0500, Justin Pryzby wrote:
>>
>>What I don't understand and cannot explain is why memory usage begins
>>to climb linearly *after* the compilation. It seems to me that it
>>should do the exact same thing once per input line, and that should
>>never require more memory (assuming input lines are about the same
>>length, which they are). Thus I suspect some other problem.
>
>Also, in #195719:
>On Thu, Dec 10, 2005 at 20:08:28 -0500, Justin Pryzby wrote:
>>
>>The culprit seems to be
>>
>> 1024816 bytes in 2596 blocks are still reachable in loss record 18 of 18
>> at 0x1B90459D: malloc (vg_replace_malloc.c:130)
>> by 0x804C5EC: xmalloc (dfa.c:143)
>> by 0x804E9BB: build_state (dfa.c:2330)
>> by 0x804FC7E: dfaexec (dfa.c:2372)
>> by 0x8054DE2: EGexecute (search.c:402)
>> by 0x804A6C5: grepbuf (grep.c:732)
>> by 0x804B1A5: grepfile (grep.c:851)
>> by 0x804C39D: main (grep.c:1788)
>>
>>(valgrind --leak-check=full --show-reachable=yes ./build-tree/grep-2.5.1/src/grep -f /tmp/words /usr/share/dict/words)
>
>I could not reproduce exactly the same (the grep DFA algorithm was
>used, and currently grep uses re_search from the libc, so the `leak' is
>now in the libc)
>
>Given that /tmp/words was composed of > 16 characters words, the DFA
>algorithm had to build one automaton per lines in /tmp/words, each of
>these automaton should have more than 16 states, if there are 150 lines
>in /tmp/words, we are close to the number of blocks allocated by
>build_state.
>
>As the automata are used up to the last line, the buffers are not freed
>by grep, but this is not a bug. Doing a regfree at the end of grep
>shows that this valgrind record is then freed (hence the memory is not
>leaking).
>
>In this use case, I would recommend to use the -F option, as Justin
>mentioned in #195719. This will avoid using a automaton.
>
>In #358858, the search pattern is an IP, but John used "1.2.3.4". To
>match an IP, again, the -F option could (should?) be used, or at least
>the regular expression should be tighten to "1\.2\.3\.4" (The maximum
>number of transitions is 7*256 versus probably something like
>256*256*256*256)
>
>So I'm not really surprised by a 200M memory usage.
>
>Kind Regards,
>--
>Nekral
Reply sent
to Aníbal Monsalve Salazar <anibal@debian.org>:
You have taken responsibility.
(Sun, 29 Mar 2009 02:24:09 GMT) (full text, mbox, link).
Notification sent
to John Goerzen <jgoerzen@complete.org>:
Bug acknowledged by developer.
(Sun, 29 Mar 2009 02:24:09 GMT) (full text, mbox, link).
No longer marked as found in versions grep/2.5.1.ds2-4.
Request was from Andreas Beckmann <anbe@debian.org>
to control@bugs.debian.org.
(Fri, 01 Nov 2013 01:21:41 GMT) (full text, mbox, link).
Bug archived.
Request was from Debbugs Internal Request <owner@bugs.debian.org>
to internal_control@bugs.debian.org.
(Fri, 29 Nov 2013 07:29:28 GMT) (full text, mbox, link).
Send a report that this bug log contains spam.
Debian bug tracking system administrator <owner@bugs.debian.org>.
Last modified:
Wed Oct 11 00:26:58 2017;
Machine Name:
beach
Debian Bug tracking system
Debbugs is free software and licensed under the terms of the GNU
Public License version 2. The current version can be obtained
from https://bugs.debian.org/debbugs-source/.
Copyright © 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson,
2005-2017 Don Armstrong, and many other contributors.