Debian Bug report logs - #497038
apt-file speed improvement patch

Package: apt-file; Maintainer for apt-file is Niels Thykier <niels@thykier.net>; Source for apt-file is src:apt-file.

Reported by: Emmanuel Rodriguez <emmanuel.rodriguez@gmail.com>

Date: Fri, 29 Aug 2008 11:54:02 UTC

Severity: normal

Found in version apt-file/2.1.5


Report forwarded to debian-bugs-dist@lists.debian.org, Stefan Fritsch <sf@debian.org>:
Bug#497038; Package apt-file. Full text and rfc822 format available.

Acknowledgement sent to Emmanuel Rodriguez <emmanuel.rodriguez@gmail.com>:
New Bug report received and forwarded. Copy sent to Stefan Fritsch <sf@debian.org>. Full text and rfc822 format available.

Message #5 received at submit@bugs.debian.org (full text, mbox):

From: Emmanuel Rodriguez <emmanuel.rodriguez@gmail.com>
To: submit@bugs.debian.org
Subject: apt-file speed improvement patch
Date: Fri, 29 Aug 2008 13:49:24 +0200
[Message part 1 (text/plain, inline)]
Package: apt-file
Version: 2.1.5

Searching for files with apt-file is slow, mainly because the original 
input files are compressed. I did some benchmarking and it turns out that 
the zgrep and zcat utilities are about 5 times slower than their 
uncompressed equivalents (grep and cat).
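
A comparison of this kind could be reproduced roughly as follows. This is only a sketch, not the benchmark used for the figure above: the sample file, its size, and the search string are made up, and it assumes grep and zgrep are installed and in PATH.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use IO::Compress::Gzip qw(gzip $GzipError);
use Time::HiRes qw(time);

# Build a small synthetic Contents-like file (real Contents files are much
# larger) and a gzip-compressed copy next to it.
my $dir   = tempdir( CLEANUP => 1 );
my $plain = "$dir/Contents-sample";
open my $out, '>', $plain or die "Can't write $plain: $!";
print {$out} "usr/bin/tool$_\tutils/pkg$_\n" for 1 .. 200_000;
close $out or die "Can't close $plain: $!";
gzip $plain => "$plain.gz" or die "gzip failed: $GzipError";

# Run a command without a shell and return (elapsed seconds, its output).
sub timed {
    my @cmd   = @_;
    my $start = time();
    open my $pipe, '-|', @cmd or die "Can't run $cmd[0]: $!";
    my $output = do { local $/; <$pipe> };
    close $pipe;
    $output = '' unless defined $output;
    chomp $output;
    return ( time() - $start, $output );
}

# Same fixed-string search, once on the plain file and once on the .gz copy.
my ( $t_plain, $n_plain ) = timed( 'grep',  '-c', '-F', 'tool1999', $plain );
my ( $t_gzip,  $n_gzip )  = timed( 'zgrep', '-c', '-F', 'tool1999', "$plain.gz" );

printf "grep : %.3fs (%s matching lines)\n", $t_plain, $n_plain;
printf "zgrep: %.3fs (%s matching lines)\n", $t_gzip,  $n_gzip;
```

The timings depend heavily on the machine and on whether the files are already in the buffer cache, so several runs on a real Contents file would be needed before drawing conclusions.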

I've created a patch that lets apt-file keep an uncompressed cache and 
perform the search operations using the uncompressed versions of the GNU 
utilities. As disk space is quite cheap today, this patch gives a 
considerable speed gain where disk space is not an issue.
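
The switch between the compressed and uncompressed tools can be sketched in isolation as below. The helper name search_plan and the example cache path are hypothetical; the logic follows the patched do_grep(), except that the search pattern is left out of the command string for brevity.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper mirroring the patched do_grep(): pick the reader
# command and the cache file name for one Contents file.
sub search_plan {
    my ( $conf, $cached_gz ) = @_;
    my $cmd
        = $conf->{is_regexp}   ? 'zcat'
        : $conf->{ignore_case} ? 'zfgrep -i'
        :                        'zfgrep';
    my $file = $cached_gz;
    if ( $conf->{uncompress} ) {
        $cmd  =~ s/^z//;       # zcat -> cat, zfgrep -> fgrep
        $file =~ s/\.gz$//;    # Contents file name without the .gz suffix
    }
    return ( $cmd, $file );
}

# The path below is an invented example, not a real cache entry.
my $example = '/var/cache/apt/apt-file/example_Contents-i386.gz';
for my $uncompress ( 0, 1 ) {
    my ( $cmd, $file ) = search_plan( { uncompress => $uncompress }, $example );
    print "uncompress=$uncompress: $cmd $file\n";
}
```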

Since the input files used by apt-file can be quite big, the patch 
keeps the previous behavior of apt-file by default, so the input files 
are left intact. The patch can be activated by adding the following 
configuration parameter to apt-file.conf:

   # If true then the contents files will be decompressed; this takes more
   # space but gives faster results
   uncompress = yes
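
As a rough illustration of which values switch the option on, the normalization done in the patched main() can be written as a standalone function. The name normalize_uncompress is hypothetical; the rules (digits kept as-is, "true" and "yes" accepted case-insensitively, everything else off) are taken from the attached script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map the raw "uncompress" config string to a Perl
# boolean, following the same rules as the patched main().
sub normalize_uncompress {
    my ($raw) = @_;
    return 0 unless defined $raw;          # option absent: feature is off
    my $value = lc $raw;
    return $value if $value =~ /^\d+$/;    # digits are kept as-is ("0" is false)
    return 1 if $value eq 'true' or $value eq 'yes';
    return 0;                              # anything else counts as off
}

for my $raw ( 'yes', 'TRUE', '1', '0', 'no', undef ) {
    my $shown = defined $raw ? "'$raw'" : 'undef';
    printf "%-7s => %s\n", $shown, normalize_uncompress($raw) ? 'on' : 'off';
}
```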

[apt-file (text/plain, inline)]
#!/usr/bin/perl -w

#
# apt-file - APT package searching utility -- command-line interface
#
# (c) 2001 Sebastien J. Gross <seb@debian.org>
#
# This package is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; version 2 dated June, 1991.
#
# This package is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this package; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
# MA 02110-1301 USA.

use strict;
use Config::File "read_config_file";
use Getopt::Long qw/:config no_ignore_case/;
use Data::Dumper;
use File::Basename;
use AptPkg::Config '$_config';
use constant VERSION => "2.1.0";
use List::MoreUtils qw/uniq/;

my $Conf;
my $Version;

sub error($) {
    print STDERR "E: ", shift, $! ? ": $!" : "", "\n";
    undef $!;
    exit 1;
}

sub warning($) {
    print STDERR "W: ", shift, $! ? ": $!" : "", "\n";
    undef $!;
}

sub debug($;$) {
    return if !defined $Conf->{verbose};
    my ( $msg, $use_errstr ) = @_;
    print STDERR "D: ", $msg;
    print STDERR $! ? ": $!" : "" if $use_errstr;
    print STDERR "\n";
    undef $!;
}

sub debug_line($) {
    return if !defined $Conf->{verbose};
    print STDERR shift;
}

sub unique($) {
    my $seen = ();
    return [ grep { !$seen->{$_}++ } @{ (shift) } ];
}

sub reverse_hash($) {
    my $hash = shift;
    my $ret;
    foreach my $key ( keys %$hash ) {
        foreach ( @{ $hash->{$key} } ) {
            push @{ $ret->{$_} }, $key;
        }
    }
    return $ret;
}

# find_command
# looks through the PATH environment variable for the command named by
# $conf->{$scheme}, if that command doesn't exist, it will look for
# $conf->{${scheme}2}, and so on until it runs out of configured
# commands or an executable is found.
#
sub find_command {
    my $conf   = shift;
    my $scheme = shift;

    my $i = 1;
    while (1) {
        my $key = $scheme;
        $key = $key . $i if $i != 1;
        return unless defined $conf->{$key};
        my $cmd = $conf->{$key};
        $cmd =~ s/^[( ]+//;
        $cmd =~ s/ .*//;
        if ( $cmd =~ m{^/} and -x $cmd ) {
            return $conf->{$key};
        }
        for my $path ( split( /:/, $ENV{'PATH'} ) ) {
            return $conf->{$key} if -x ( $path . '/' . $cmd );
        }
        $i = $i + 1;
    }
}

sub parse_sources_list($) {
    my $file = shift;
    my $uri;
    my @uri_items;
    my @tmp;
    my $line;
    my $ret;

    my ( $cmd, $dest );

    my @files = ref $file ? @$file : [$file];

    foreach $file ( grep -f, @files ) {
        debug "reading sources file $file";
        open( SOURCE, "< $file" ) || error "Can't open $file";
        while (<SOURCE>) {
            next if /^\s*(?:$|\#|(?:deb-|rpm-))/xo;
            chomp;
            my $line = $_;
            debug "got \'$line\'";
            $line =~ s/([^\/])\#.*$/$1/o;
            $line =~ s/^(\S+\s+)\[\S+\]/$1/o;
            $line =~ s/\s+/ /go;
            $line =~ s/^\s+//o;

            # CDROM entry
            if ( @tmp = $line =~ m/^([^\[]*)\[([^\]]*)\](.*)$/o ) {
                $tmp[1] =~ s/ /_/g;
                $line = $tmp[0] . '[' . $tmp[1] . ']' . $tmp[2];
            }

            # Handle $(ARCH) in sources.list
            $line =~ s/\$\(ARCH\)/$Conf->{arch}/g;
            debug "kept \'$line\'";

            my ( $pkg, $uri, $dist, @extra ) = split /\s+/, $line;
            $uri =~ s/\/+$//;
            my ( $scheme, $user, $passwd, $host, $port, $path, $query,
                $fragment )
                = $uri =~ m|^
                    (?:([^:/?\#]+):)?           # scheme
                    (?://
                        (?:
                            ([^:@]*)            #username
                            (?::([^@]*))?       #passwd
                        @)?
                        ([^:/?\#]*)             # host
                        (?::(\d+))?             # port
                    )?
                    ([^?\#]*)                   # path
                    (?:\?([^\#]*))?             # query
                    (?:\#(.*))?                 # fragment
                |ox;

            my $fetch = [];

            foreach (@extra) {
                push @$fetch, m/(.*?)\/(?:.*)/o ? "$dist/$1" : "$dist";
            }

            foreach ( @{ ( unique $fetch) } ) {
                if ( !defined $Conf->{"${scheme}"} ) {
                    warning "Don't know how to handle $scheme";
                    next;
                }
                $dist = $_;
                $cmd = find_command( $Conf, $scheme );
                die "Could not find suitable command for $scheme" unless $cmd;
                $dest = $Conf->{destination};
                my $cache = $Conf->{cache};
                my $arch  = $Conf->{arch};
                my $cdrom = $Conf->{cdrom_mount};
                foreach my $var (
                    qw/host port user passwd path dist pkg
                    cache arch uri cdrom/
                    )
                {
                    map {
                        $_ =~ 
                            s{<$var(?:\|(.+?))?>}
                             { defined eval "\$$var" ? eval "\$$var" 
                             : defined $1            ? $1
                             : "";
                             }gsex;
                    } ( $cmd, $dest );
                }
                $dest =~ s/(\/|_)+/_/go;
                $cmd  =~ s/<dest>/$dest/g;
                my $hash;
                foreach (
                    qw/host port user passwd path dist pkg uri line
                    dest cmd scheme/
                    )
                {
                    $hash->{$_} = eval "\$$_";
                }
                push @$ret, $hash;
            }
        }
        close SOURCE;
    }
    return $ret;
}

sub fetch_files ($) {
    umask 0022;
    if ( !-d $Conf->{cache} ) {
        mkdir $Conf->{cache} or error "Can't create $Conf->{cache}";
    }
    error "Can't write in $Conf->{cache}" if !-w $Conf->{cache};
    foreach ( @{ (shift) } ) {
        if (   $Conf->{"non_interactive"}
            && $Conf->{interactive}->{ $_->{scheme} } )
        {
            debug "Ignoring interactive scheme $_->{scheme}";
            next;
        }
        local %ENV = %ENV;
        my $proxy = defined $_->{host}
            && $_config->get("Acquire::$_->{scheme}::Proxy::$_->{host}")
            || $_config->get("Acquire::$_->{scheme}::Proxy");
        if ($proxy) {

         # wget expects lower case, curl expects upper case (except for http).
         # we just set/unset both
            delete $ENV{no_proxy};
            delete $ENV{NO_PROXY};
            delete $ENV{all_proxy};
            delete $ENV{ALL_PROXY};
            if ( $proxy =~ /^(?:DIRECT|false)$/i ) {
                debug "not using proxy";
                delete $ENV{ lc("$_->{scheme}_proxy") };
                delete $ENV{ uc("$_->{scheme}_proxy") };
            }
            else {
                debug "using proxy: $proxy";
                $ENV{ lc("$_->{scheme}_proxy") } = $proxy;
                $ENV{ uc("$_->{scheme}_proxy") } = $proxy;
            }
        }
        debug $_->{cmd};
        my $cmd = $_->{cmd};
        $cmd = "set -x; $cmd"       if $Conf->{verbose};
        $cmd = "($cmd) < /dev/null" if $Conf->{non_interactive};
        system($cmd) if !defined $Conf->{dummy};
        my $file = "$Conf->{cache}/$_->{dest}";
        if ( $Conf->{uncompress} ) {
            system("gunzip", "--force", $file) if -e $file;
        }
        else {
            # If previously we where using uncompressed files and now we changed
            # our mind we should remove the old files otherwise we will have
            # both uncompressed and the compressed files in the disk!
            $file =~ s/\.gz$//;
            unlink $file;
        }
    }
}

sub print_winners ($$) {
    my ( $db, $matchfname ) = @_;
    my $filtered_db;

    # $db is a hash from package name to array of file names.  It is
    # a superset of the matching cases, so first we filter this by the
    # real pattern.
    foreach my $key ( keys %$db ) {
        if ( $matchfname || ( $key =~ /$Conf->{pattern}/ ) ) {
            $filtered_db->{$key} = $db->{$key};
        }
    }

    # Now print the winners
    if ( !defined $Conf->{package_only} ) {
        foreach my $key ( sort keys %$filtered_db ) {
            foreach ( uniq sort @{ $filtered_db->{$key} } ) {
                print "$key: $_\n";
            }
        }
    }
    else {
        print map {"$_\n"} ( sort keys %$filtered_db );
    }
    exit 0;
}

sub do_grep($$) {
    my ( $data, $pattern ) = @_;
    my $ret;
    my ( $pkgs, $fname );
    debug "regexp: $pattern";
    $| = 1;
    my $zgrep_pattern = $Conf->{pattern};
    $zgrep_pattern =~ s{^\\/}{};
    my $zcat
        = $Conf->{is_regexp}   ? "zcat"
        : $Conf->{ignore_case} ? "zfgrep -i $zgrep_pattern"
        :                        "zfgrep $zgrep_pattern";
    $zcat =~ s/^z// if $Conf->{uncompress};
    my $regexp = eval { $Conf->{ignore_case} ? qr/$pattern/i : qr/$pattern/ };
    error($@) if $@;
    my $quick_regexp = escape_parens($regexp);
    my %seen         = ();

    foreach (@$data) {
        my $file = "$Conf->{cache}/$_->{dest}";
        $file =~ s/\.gz$// if $Conf->{uncompress};
        next if ( !-f $file );

        # Skip already searched files:
        next if $seen{$file}++;
        debug "Search in $file using $zcat";
        # If the command is 'cat' then bypass the fork and just read the file
        my $open_cmd = ($zcat eq 'cat') ? $file : "$zcat \Q$file\E |";
        open( ZCAT, $open_cmd )
            || warning "Can't $zcat $file";
        while (<ZCAT>) {

            # faster, non-capturing search first
            next if !/$quick_regexp/o;

            next if !( ( $fname, $pkgs ) = /$regexp/o );

            # skip header lines
            # we can safely assume that the name of the top level directory
            # does not contain spaces
            next if !m{^[^\s/]*/};

            debug_line ".";
            foreach ( split /,/, $pkgs ) {

                # Put leading slash on file name
                push @{ $ret->{"/$fname"} }, basename $_;
            }
        }
        close ZCAT;
        debug_line "\n";
    }
    return reverse_hash($ret);
}

sub escape_parens {
    my $pattern = shift;

    # turn any capturing ( ... ) into non capturing (?: ... )
    $pattern =~ s{ (?<! \\ )    # not preceded by a \ 
                        \(      # (
                   (?!  \? )    # not followed by a ?
                 }{(?:}gx;
    return $pattern;
}

sub grep_file($) {
    my $data    = shift;
    my $pattern = $Conf->{pattern};

    # If pattern starts with /, we need to match both ^pattern-without-slash
    # (which is put in $pattern) and ^.*pattern (put in $pattern2).
    # Later, they will be or'ed together.
    my $pattern2;

    if ( $Conf->{is_regexp} ) {
        if ( substr( $pattern, 0, 1 ) eq '^' ) {

            # Pattern is anchored, so we're just not prefixing it with .*
            # and remove ^ and slash
            $pattern =~ s/^\^\/?//;
        }
        elsif ( substr( $pattern, 0, 1 ) eq '/' ) {

            # same logic as below, but the "/" is not escaped here
            $pattern2 = '.*?' . $pattern;
            $pattern  = substr( $pattern, 1 );
        }
        else {
            $pattern = '.*?' . $pattern;
        }
        $pattern  = escape_parens($pattern);
        $pattern2 = escape_parens($pattern2) if defined $pattern2;
    }
    elsif ( substr( $pattern, 0, 2 ) eq '\/' ) {
        if ( $Conf->{fixed_strings} ) {

            # remove leading /
            $pattern = substr( $pattern, 2 );
        }
        else {

            # If pattern starts with /, match both ^pattern-without-slash
            # and ^.*pattern.
            $pattern2 = '.*?' . $pattern;
            $pattern  = substr( $pattern, 2 );
        }
    }
    else {
        $pattern = '.*?' . $pattern unless $Conf->{fixed_strings};
    }

    if ( ! defined $Conf->{fixed_strings} ) {
        $pattern  .= '[^\s]*';
        $pattern2 .= '[^\s]*' if defined $pattern2;
    }

    $pattern = "$pattern|$pattern2" if defined $pattern2;
    $pattern = '^(' . $pattern . ')\s+(\S+)\s*$';

    my $ret = do_grep $data, $pattern;
    print_winners $ret, 1;
}

sub grep_package($) {
    my $data = shift;

    # Strip leading^ / trailing $ if regexp
    my $pkgpat = $Conf->{pattern};
    if ( $Conf->{is_regexp} ) {
        if ( !substr( $pkgpat, 0, 1 ) eq "^" ) {
            $pkgpat = '\S*';
        }
        $pkgpat = substr( $pkgpat, 1 );
        $pkgpat = escape_parens($pkgpat);
    }
    else {
        $pkgpat = '\S*' . $Conf->{pattern};
    }

    # File name may contain spaces, so match template is
    # ($fname, $pkgs) = (line =~ '^\s*(.*?)\s+(\S+)\s*$')
    my $pattern = join "",
        (
        '^\s*(.*?)\s+', '(\S*/', $pkgpat,
        defined $Conf->{fixed_strings} ? '(,\S*|)' : '\S*', ')\s*$',
        );
    my $ret = do_grep $data, $pattern;
    print_winners $ret, 0;
}

sub purge_cache($) {
    my $data = shift;
    foreach (@$data) {
        my $file = "$Conf->{cache}/$_->{dest}";
        $file =~ s/\.gz$// if $Conf->{uncompress};
        debug "Purging $file";
        next if defined $Conf->{dummy};
        next unless -e $file;
        next if ( unlink $file ) > 0;
        warning "Can't remove $file";
    }
}

sub print_version {
    print <<EOF;
apt-file version $Version
(c) 2002 Sebastien J. Gross <sjg\@debian.org>

EOF
}

sub print_help {
    my $err_code = shift || 0;

    print_version;
    print <<"EOF";

apt-file [options] action [pattern]

Configuration options:
    --sources-list     -s  <file>       sources.list location
    --cache            -c  <dir>        Cache directory
    --architecture     -a  <arch>       Use specific architecture
    --cdrom-mount      -d  <cdrom>      Use specific cdrom mountpoint
    --non-interactive  -N               Skip schemes requiring user input
                                        (useful in cron jobs)
    --package-only     -l               Only display packages name
    --fixed-string     -F               Do not expand pattern
    --ignore-case      -i               Ignore case distinctions
    --regexp           -x               pattern is a regular expression
    --verbose          -v               run in verbose mode
    --dummy            -y               run in dummy mode (no action)
    --help             -h               Show this help.
    --version          -V               Show version number

Action:
    update                              Fetch Contents files from apt-sources.
    search|find        <pattern>        Search files in packages
    list|show          <pattern>        List files in packages
    purge                               Remove cache files
EOF
    exit $err_code;
}

sub get_options() {
    my %options = (
        "sources-list|s=s"  => \$Conf->{sources_list},
        "cache|c=s"         => \$Conf->{cache},
        "architecture|a=s"  => \$Conf->{arch},
        "cdrom-mount|d=s"   => \$Conf->{cdrom_mount},
        "verbose|v"         => \$Conf->{verbose},
        "ignore-case|i"     => \$Conf->{ignore_case},
        "regexp|x"          => \$Conf->{is_regexp},
        "dummy|y"           => \$Conf->{dummy},
        "package-only|l"    => \$Conf->{package_only},
        "fixed-string|F"    => \$Conf->{fixed_strings},
        "non-interactive|N" => \$Conf->{non_interactive},
        "help|h"            => \$Conf->{help},
        "version|V"         => \$Conf->{version},
    );
    Getopt::Long::Configure("bundling");
    GetOptions(%options) || print_help 1;
}

sub dir_is_empty {
    my ($path) = @_;
    opendir DIR, $path or die "Cannot read cache directory $path: $!\n";
    while ( my $entry = readdir DIR ) {
        next if ( $entry =~ /^\.\.?$/ );
        closedir DIR;
        return 0;
    }
    closedir DIR;
    return 1;
}

sub main {
    my $conf_file;
    map { $conf_file = $_ if -f $_ } (
        "/etc/apt/apt-file.conf", "apt-file.conf", "$ENV{HOME}/.apt-file.conf"
    );

    error "No config file found\n" if !defined $conf_file;
    debug "Using $conf_file";

    $Conf = read_config_file $conf_file;
    get_options();
    if ( defined $Conf->{version} ) {
        print_version;
        exit 0;
    }

    if ( defined $Conf->{uncompress} ) {
        my $uncompress = lc $Conf->{uncompress};
        if ( $uncompress =~ /^\d+$/ ) {
            $Conf->{uncompress} = $uncompress;
        }
        elsif ( $uncompress eq 'true' or $uncompress eq 'yes' ) {
            $Conf->{uncompress} = 1;
        }
        else {
            $Conf->{uncompress} =  0;
        }
    }
    else {
        $Conf->{uncompress} =  0;
    }

    my $interactive = $Conf->{interactive};
    defined $interactive or $interactive = "cdrom rsh ssh";
    $Conf->{interactive} = {};
    foreach my $s ( split /\s+/, $interactive ) {
        $Conf->{interactive}{$s} = 1;
        if ( !$Conf->{$s} ) {
            warn "interactive scheme $s does not exist\n";
        }
    }

    $_config->init;
    $Conf->{arch} ||= $_config->{'APT::Architecture'};
    $Conf->{sources_list} = [
          $Conf->{sources_list}
        ? $Conf->{sources_list}
        : ( $_config->get_file('Dir::Etc::sourcelist'),
            glob( $_config->get_dir('Dir::Etc::sourceparts') . '/*.list' )
        )
    ];
    $Conf->{cache} ||= $_config->get_dir('Dir::Cache') . 'apt-file';
    $Conf->{cache} =~ s/\/\s*$//;
    $Conf->{cdrom_mount} ||= $_config->{'Acquire::cdrom::Mount'}
        || "/cdrom";

    $Conf->{action} = shift @ARGV || "none";
    $Conf->{pattern} = shift @ARGV;
    if ( defined $Conf->{pattern} ) {
        $Conf->{pattern} = quotemeta( $Conf->{pattern} )
            unless $Conf->{is_regexp};
        if ( $Conf->{is_regexp} and $Conf->{pattern} =~ /(\\[zZ]|\$)$/ ) {
            $Conf->{pattern} =~ s/(\\[zZ]|\$)$//;
            $Conf->{fixed_strings} = 1;
        }
    }
    undef $!;

    my $actions = {
        update => \&fetch_files,
        search => \&grep_file,
        find   => \&grep_file,
        list   => \&grep_package,
        show   => \&grep_package,
        purge  => \&purge_cache,
    };

    $Conf->{help} = 2
        if $Conf->{action} =~ m/search|find|list|show/
            && !defined $Conf->{pattern};
    $Conf->{help} = 2
        if !defined $actions->{ $Conf->{action} }
            && !defined $Conf->{help};
    print_help( $Conf->{help} - 1 ) if defined $Conf->{help};

    my $sources = parse_sources_list $Conf->{sources_list};
    error "No valid sources in @{$Conf->{sources_list}}" if !defined $sources;

    if ( $Conf->{action} =~ m/search|find|list|show/
        && dir_is_empty( $Conf->{cache} ) )
    {
        undef $!;    # unset "Bad file descriptor" error from dir_is_empty
        error
            "The cache directory is empty. You need to run 'apt-file update' first.";
    }
    $actions->{ $Conf->{action} }->($sources);
}

BEGIN {
    $Version = VERSION;
}

main();

__END__

# our style is roughly "perltidy -pbp"
# vim:sts=4:sw=4:expandtab

Information forwarded to debian-bugs-dist@lists.debian.org, Stefan Fritsch <sf@debian.org>:
Bug#497038; Package apt-file. Full text and rfc822 format available.

Acknowledgement sent to Emmanuel Rodriguez <emmanuel.rodriguez@gmail.com>:
Extra info received and forwarded to list. Copy sent to Stefan Fritsch <sf@debian.org>. Full text and rfc822 format available.

Message #10 received at 497038@bugs.debian.org (full text, mbox):

From: Emmanuel Rodriguez <emmanuel.rodriguez@gmail.com>
To: 497038@bugs.debian.org
Subject: apt-file speed improvements patch
Date: Fri, 29 Aug 2008 14:01:33 +0200
[Message part 1 (text/plain, inline)]
I'm sorry I submitted the modified version of apt-file and not the 
patch. The actual patch is in this message.

[apt-file-uncompress.patch (text/x-diff, inline)]
--- apt-file	2008-08-29 13:32:06.000000000 +0200
+++ apt-file-2.1.5/apt-file	2008-08-23 18:50:37.000000000 +0200
@@ -245,17 +245,6 @@
         $cmd = "set -x; $cmd"       if $Conf->{verbose};
         $cmd = "($cmd) < /dev/null" if $Conf->{non_interactive};
         system($cmd) if !defined $Conf->{dummy};
-        my $file = "$Conf->{cache}/$_->{dest}";
-        if ( $Conf->{uncompress} ) {
-            system("gunzip", "--force", $file) if -e $file;
-        }
-        else {
-            # If previously we where using uncompressed files and now we changed
-            # our mind we should remove the old files otherwise we will have
-            # both uncompressed and the compressed files in the disk!
-            $file =~ s/\.gz$//;
-            unlink $file;
-        }
     }
 }
 
@@ -298,7 +287,6 @@
         = $Conf->{is_regexp}   ? "zcat"
         : $Conf->{ignore_case} ? "zfgrep -i $zgrep_pattern"
         :                        "zfgrep $zgrep_pattern";
-    $zcat =~ s/^z// if $Conf->{uncompress};
     my $regexp = eval { $Conf->{ignore_case} ? qr/$pattern/i : qr/$pattern/ };
     error($@) if $@;
     my $quick_regexp = escape_parens($regexp);
@@ -306,15 +294,13 @@
 
     foreach (@$data) {
         my $file = "$Conf->{cache}/$_->{dest}";
-        $file =~ s/\.gz$// if $Conf->{uncompress};
         next if ( !-f $file );
 
         # Skip already searched files:
         next if $seen{$file}++;
+        $file = quotemeta $file;
         debug "Search in $file using $zcat";
-        # If the command is 'cat' then bypass the fork and just read the file
-        my $open_cmd = ($zcat eq 'cat') ? $file : "$zcat \Q$file\E |";
-        open( ZCAT, $open_cmd )
+        open( ZCAT, "$zcat $file |" )
             || warning "Can't $zcat $file";
         while (<ZCAT>) {
 
@@ -440,13 +426,10 @@
 sub purge_cache($) {
     my $data = shift;
     foreach (@$data) {
-        my $file = "$Conf->{cache}/$_->{dest}";
-        $file =~ s/\.gz$// if $Conf->{uncompress};
-        debug "Purging $file";
+        debug "Purging $Conf->{cache}/$_->{dest}";
         next if defined $Conf->{dummy};
-        next unless -e $file;
-        next if ( unlink $file ) > 0;
-        warning "Can't remove $file";
+        next if ( unlink "$Conf->{cache}/$_->{dest}" ) > 0;
+        warning "Can't remove $Conf->{cache}/$_->{dest}";
     }
 }
 
@@ -539,22 +522,6 @@
         exit 0;
     }
 
-    if ( defined $Conf->{uncompress} ) {
-        my $uncompress = lc $Conf->{uncompress};
-        if ( $uncompress =~ /^\d+$/ ) {
-            $Conf->{uncompress} = $uncompress;
-        }
-        elsif ( $uncompress eq 'true' or $uncompress eq 'yes' ) {
-            $Conf->{uncompress} = 1;
-        }
-        else {
-            $Conf->{uncompress} =  0;
-        }
-    }
-    else {
-        $Conf->{uncompress} =  0;
-    }
-
     my $interactive = $Conf->{interactive};
     defined $interactive or $interactive = "cdrom rsh ssh";
     $Conf->{interactive} = {};
@@ -618,6 +585,7 @@
         error
             "The cache directory is empty. You need to run 'apt-file update' first.";
     }
+
     $actions->{ $Conf->{action} }->($sources);
 }
 

Information forwarded to debian-bugs-dist@lists.debian.org, Stefan Fritsch <sf@debian.org>:
Bug#497038; Package apt-file. Full text and rfc822 format available.

Acknowledgement sent to "Thijs Kinkhorst" <thijs@debian.org>:
Extra info received and forwarded to list. Copy sent to Stefan Fritsch <sf@debian.org>. Full text and rfc822 format available.

Message #15 received at 497038@bugs.debian.org (full text, mbox):

From: "Thijs Kinkhorst" <thijs@debian.org>
To: "Emmanuel Rodriguez" <emmanuel.rodriguez@gmail.com>, 497038@bugs.debian.org
Subject: Re: Bug#497038: apt-file speed improvements patch
Date: Fri, 29 Aug 2008 14:26:23 +0200 (CEST)
On Fri, August 29, 2008 14:01, Emmanuel Rodriguez wrote:
> I'm sorry I submitted the modified version of apt-file and not the
> patch. The actual patch is in this message.

Thanks. The patch seems reversed though.

I think this would be a good addition to apt-file, but obviously only
after the lenny release.


Thijs





Information forwarded to debian-bugs-dist@lists.debian.org, Stefan Fritsch <sf@debian.org>:
Bug#497038; Package apt-file. Full text and rfc822 format available.

Acknowledgement sent to Emmanuel Rodriguez <emmanuel.rodriguez@gmail.com>:
Extra info received and forwarded to list. Copy sent to Stefan Fritsch <sf@debian.org>. Full text and rfc822 format available.

Message #20 received at 497038@bugs.debian.org (full text, mbox):

From: Emmanuel Rodriguez <emmanuel.rodriguez@gmail.com>
To: 497038@bugs.debian.org
Subject: Re: Bug#497038: apt-file speed improvements patch (3rd strike)
Date: Fri, 29 Aug 2008 16:04:00 +0200
[Message part 1 (text/plain, inline)]
Thijs Kinkhorst wrote:
> On Fri, August 29, 2008 14:01, Emmanuel Rodriguez wrote:
>   
>> I'm sorry I submitted the modified version of apt-file and not the
>> patch. The actual patch is in this message.
>>     
>
> Thanks. The patch seems reversed though.
>   
You're right! So I will, once again, try to send the right patch. 
Third time's the charm!
> I think this would be a good addition to apt-file, but obviously only
> after the lenny release.
>   
That's fine; by then I might manage to send the patch successfully!

[apt-file-uncompress.patch (text/x-diff, inline)]
--- apt-file-2.1.5/apt-file	2008-08-29 15:55:14.000000000 +0200
+++ apt-file	2008-08-29 13:32:06.000000000 +0200
@@ -245,6 +245,17 @@
         $cmd = "set -x; $cmd"       if $Conf->{verbose};
         $cmd = "($cmd) < /dev/null" if $Conf->{non_interactive};
         system($cmd) if !defined $Conf->{dummy};
+        my $file = "$Conf->{cache}/$_->{dest}";
+        if ( $Conf->{uncompress} ) {
+            system("gunzip", "--force", $file) if -e $file;
+        }
+        else {
+            # If previously we where using uncompressed files and now we changed
+            # our mind we should remove the old files otherwise we will have
+            # both uncompressed and the compressed files in the disk!
+            $file =~ s/\.gz$//;
+            unlink $file;
+        }
     }
 }
 
@@ -287,6 +298,7 @@
         = $Conf->{is_regexp}   ? "zcat"
         : $Conf->{ignore_case} ? "zfgrep -i $zgrep_pattern"
         :                        "zfgrep $zgrep_pattern";
+    $zcat =~ s/^z// if $Conf->{uncompress};
     my $regexp = eval { $Conf->{ignore_case} ? qr/$pattern/i : qr/$pattern/ };
     error($@) if $@;
     my $quick_regexp = escape_parens($regexp);
@@ -294,13 +306,15 @@
 
     foreach (@$data) {
         my $file = "$Conf->{cache}/$_->{dest}";
+        $file =~ s/\.gz$// if $Conf->{uncompress};
         next if ( !-f $file );
 
         # Skip already searched files:
         next if $seen{$file}++;
-        $file = quotemeta $file;
         debug "Search in $file using $zcat";
-        open( ZCAT, "$zcat $file |" )
+        # If the command is 'cat' then bypass the fork and just read the file
+        my $open_cmd = ($zcat eq 'cat') ? $file : "$zcat \Q$file\E |";
+        open( ZCAT, $open_cmd )
             || warning "Can't $zcat $file";
         while (<ZCAT>) {
 
@@ -426,10 +440,13 @@
 sub purge_cache($) {
     my $data = shift;
     foreach (@$data) {
-        debug "Purging $Conf->{cache}/$_->{dest}";
+        my $file = "$Conf->{cache}/$_->{dest}";
+        $file =~ s/\.gz$// if $Conf->{uncompress};
+        debug "Purging $file";
         next if defined $Conf->{dummy};
-        next if ( unlink "$Conf->{cache}/$_->{dest}" ) > 0;
-        warning "Can't remove $Conf->{cache}/$_->{dest}";
+        next unless -e $file;
+        next if ( unlink $file ) > 0;
+        warning "Can't remove $file";
     }
 }
 
@@ -522,6 +539,22 @@
         exit 0;
     }
 
+    if ( defined $Conf->{uncompress} ) {
+        my $uncompress = lc $Conf->{uncompress};
+        if ( $uncompress =~ /^\d+$/ ) {
+            $Conf->{uncompress} = $uncompress;
+        }
+        elsif ( $uncompress eq 'true' or $uncompress eq 'yes' ) {
+            $Conf->{uncompress} = 1;
+        }
+        else {
+            $Conf->{uncompress} =  0;
+        }
+    }
+    else {
+        $Conf->{uncompress} =  0;
+    }
+
     my $interactive = $Conf->{interactive};
     defined $interactive or $interactive = "cdrom rsh ssh";
     $Conf->{interactive} = {};

Information forwarded to debian-bugs-dist@lists.debian.org, Stefan Fritsch <sf@debian.org>:
Bug#497038; Package apt-file. Full text and rfc822 format available.

Acknowledgement sent to Emmanuel Rodriguez <emmanuel.rodriguez@gmail.com>:
Extra info received and forwarded to list. Copy sent to Stefan Fritsch <sf@debian.org>. Full text and rfc822 format available.

Message #25 received at 497038@bugs.debian.org (full text, mbox):

From: Emmanuel Rodriguez <emmanuel.rodriguez@gmail.com>
To: 497038@bugs.debian.org
Subject: Speed improvements are not warrantied
Date: Sat, 30 Aug 2008 17:11:18 +0200
I just wanted to report that I got some new benchmark results for the 
patched version of apt-file. This time the program was running on a 
different system. Based on the results, it turns out that the speed 
improvements are not always guaranteed; in fact, on some systems the 
execution times could be worse!

On systems with a small buffer cache (low RAM), using the original 
compressed files yields faster results, as the program has to read the 
whole file from the disk each time. However, if a system has enough 
memory to cache the input files, then processing the uncompressed files 
is faster than reading the compressed input and decompressing it on the 
fly.

The benchmarks where performed in two different systems. The first one 
has 2 Gigs of RAM and the patch is clearly showing a speed improvement. 
The second system has 64 Megs and shows a speed degradation. It's hard 
to quantify how much RAM is enough, although RAM is getting cheaper each 
day and it's not uncommon to see laptops and desktops with Gigs of RAM.

Hopefully, this will not impact the acceptance of the patch.





Information forwarded to debian-bugs-dist@lists.debian.org, Stefan Fritsch <sf@debian.org>:
Bug#497038; Package apt-file. Full text and rfc822 format available.

Acknowledgement sent to Stefan Fritsch <sf@sfritsch.de>:
Extra info received and forwarded to list. Copy sent to Stefan Fritsch <sf@debian.org>. Full text and rfc822 format available.

Message #30 received at 497038@bugs.debian.org (full text, mbox):

From: Stefan Fritsch <sf@sfritsch.de>
To: Emmanuel Rodriguez <emmanuel.rodriguez@gmail.com>
Cc: 497038@bugs.debian.org
Subject: Re: Bug#497038: Speed improvements are not warrantied
Date: Sat, 30 Aug 2008 23:10:35 +0200
On Saturday 30 August 2008, Emmanuel Rodriguez wrote:
> In systems with a low buffer cache (low RAM) using the original
> compressed files yields faster results as the program has to read
> the whole file each time from the disk. Although, if a system has
> enough memory to cache the input files then processing the
> uncompressed files is faster than the time need for reading the
> compressed input and deflating it on the fly.

I expect that recompressing the file with lzo would give a better 
result, overall. It decompresses a lot faster than gzip and the file 
is still a lot smaller than uncompressed.

We will implement some speed improvement after lenny, one way or the 
other.




Information forwarded to debian-bugs-dist@lists.debian.org, Stefan Fritsch <sf@debian.org>:
Bug#497038; Package apt-file. (Sat, 21 Feb 2009 11:48:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Paul Wise <pabs3@bonedaddy.net>:
Extra info received and forwarded to list. Copy sent to Stefan Fritsch <sf@debian.org>. (Sat, 21 Feb 2009 11:48:02 GMT) Full text and rfc822 format available.

Message #35 received at 497038@bugs.debian.org (full text, mbox):

From: Paul Wise <pabs3@bonedaddy.net>
To: 497038@bugs.debian.org
Cc: control <control@bugs.debian.org>
Subject: apt-file: 497038: xapian index?
Date: Sat, 21 Feb 2009 20:45:03 +0900
[Message part 1 (text/plain, inline)]
usertags 497038 + bittenby
thanks

How about adding a xapian index option? That would reduce search times
but probably prevent regex searches.

Also, according to a recent blog post, gzip -1 is almost the same as
lzop. So, for those times you need regex searching (or don't have a
xapian index built), you can recompress with lzop if available and if
not, recompress with gzip -1.

http://changelog.complete.org/archives/931-how-to-think-about-compression-part-2
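
As an illustration of the fallback suggested above, here is a minimal sketch. The function name recompress_contents and the ".fast.gz" suffix are made up for this example and are not part of apt-file.

```shell
# Hypothetical helper: recompress a gzip'ed Contents file for faster reads.
# Uses lzop when it is installed, otherwise falls back to gzip -1.
# Prints the path of the new file on success.
recompress_contents() {
    src=$1                 # path to an existing .gz Contents file
    base=${src%.gz}
    if command -v lzop >/dev/null 2>&1; then
        out=$base.lzo
        gzip -dc < "$src" | lzop -c > "$out" || return 1
    else
        out=$base.fast.gz
        gzip -dc < "$src" | gzip -1 -c > "$out" || return 1
    fi
    printf '%s\n' "$out"
}

# Usage: new_file=$(recompress_contents /path/to/Contents-amd64.gz)
```

The original .gz file is left in place, so the search code would still need to prefer the recompressed copy when one exists.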

-- 
bye,
pabs

http://bonedaddy.net/pabs3/
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Stefan Fritsch <sf@debian.org>:
Bug#497038; Package apt-file. (Sat, 21 Feb 2009 13:51:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Paul Wise <pabs@debian.org>:
Extra info received and forwarded to list. Copy sent to Stefan Fritsch <sf@debian.org>. (Sat, 21 Feb 2009 13:51:02 GMT) Full text and rfc822 format available.

Message #40 received at 497038@bugs.debian.org (full text, mbox):

From: Paul Wise <pabs@debian.org>
To: 497038@bugs.debian.org
Subject: Re: apt-file: 497038: xapian index?
Date: Sat, 21 Feb 2009 22:42:31 +0900
[Message part 1 (text/plain, inline)]
On Sat, 2009-02-21 at 20:45 +0900, Paul Wise wrote:

> How about adding a xapian index option? That would reduce search times
> but probably prevent regex searches.

Looks like Enrico has already implemented this:

<pabs> enrico: do you think xapian would be appropriate for an index for apt-file to speed things up (I suggested it in #497038)?
<enrico> pabs: I think so.  In fact, it's already implemented: curl http://dde.debian.net/dde/q/aptfile/byfile/YOUR/PATH
<enrico> pabs: add ?t=csv for human readable results
<enrico> pabs: I have it in my todo list to patch apt-file so that if apt-file update hasn't been run, it queries the remote one (with a warning that the query syntax is slightly different
<pabs> cool, sounds good
<enrico> pabs: http://dde.debian.net/dde/q/aptfile shows what's available
<enrico> pabs: the problem is that querying xapian you can't look for partial paths like *foo
<enrico> pabs: you can do */foo/bar/baz* if you want, though.  Or even */foo*/bar/baz
<enrico> pabs: but you can't have a '*' at the beginning of an item
<pabs> ok
<enrico> pabs: please copy and paste it, no problem.  apt-file's author is informed of my intentions, I'm sorry I haven't had time to pursue it yet
<pabs> will do

-- 
bye,
pabs

http://wiki.debian.org/PaulWise
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Stefan Fritsch <sf@debian.org>:
Bug#497038; Package apt-file. (Sat, 02 Jul 2011 12:09:41 GMT) Full text and rfc822 format available.

Acknowledgement sent to Sami Liedes <sliedes@cc.hut.fi>:
Extra info received and forwarded to list. Copy sent to Stefan Fritsch <sf@debian.org>. (Sat, 02 Jul 2011 12:09:57 GMT) Full text and rfc822 format available.

Message #45 received at 497038@bugs.debian.org (full text, mbox):

From: Sami Liedes <sliedes@cc.hut.fi>
To: 497038@bugs.debian.org
Subject: lzop indeed would speed things up
Date: Sat, 2 Jul 2011 14:39:17 +0300
[Message part 1 (text/plain, inline)]
I have two ftp.$LANG_CODE.debian.org unstable mirrors in my
/etc/apt/sources.list, because sometimes one of them might be down or
out-of-date.

Doing an "apt-file search aoeui" on my fast (modern 8-core Sandy
Bridge) computer takes around 4 seconds. Almost all of this time is
spent gunzipping the file lists.

I believe merely using lzop could give a 2-4x speedup:

------------------------------------------------------------
$ time gzip -d <ftp.fi.debian.org_debian_dists_unstable_Contents-amd64.gz >/dev/null 

real    0m1.608s
user    0m1.584s
sys     0m0.016s

$ time lzop -d <ftp.fi.debian.org_debian_dists_unstable_Contents-amd64.lzo >/dev/null 

real    0m0.382s
user    0m0.356s
sys     0m0.024s
------------------------------------------------------------

In other words, gzip -d takes about 4.4 times as long as lzop -d
(by user CPU time). The price for this is a 1.7x increase in the size
of the compressed file:

-rw-r--r-- 1 sliedes sliedes 19892895 Jul  2 14:21 ftp.fi.debian.org_debian_dists_unstable_Contents-amd64.gz
-rw-r--r-- 1 sliedes sliedes 34062072 Jul  2 14:23 ftp.fi.debian.org_debian_dists_unstable_Contents-amd64.lzo

Decompressing a gzip -1 compressed file still takes 3.3x the time of
lzop -d:

------------------------------------------------------------
$ time gunzip -d <gzip-1/ftp.fi.debian.org_debian_dists_unstable_Contents-amd64.gz >/dev/null 

real    0m1.199s
user    0m1.164s
sys     0m0.012s
------------------------------------------------------------
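
The ratios quoted in this message can be recomputed from the timings and file sizes shown. The sketch below assumes the speed figures are based on the "user" rows, which is what matches the quoted 4.4x and 3.3x. The helper name ratio is hypothetical.

```shell
# ratio A B: print A divided by B, rounded to one decimal place.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

ratio 1.584 0.356          # gzip -d vs lzop -d, user time      -> 4.4
ratio 1.608 0.382          # gzip -d vs lzop -d, real time      -> 4.2
ratio 34062072 19892895    # .lzo size vs .gz size              -> 1.7
ratio 1.164 0.356          # gzip -1 file vs lzop -d, user time -> 3.3
```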

	Sami
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, Stefan Fritsch <sf@debian.org>:
Bug#497038; Package apt-file. (Wed, 30 Nov 2011 23:33:09 GMT) Full text and rfc822 format available.

Acknowledgement sent to Jakub Wilk <jwilk@debian.org>:
Extra info received and forwarded to list. Copy sent to Stefan Fritsch <sf@debian.org>. (Wed, 30 Nov 2011 23:33:09 GMT) Full text and rfc822 format available.

Message #50 received at 497038@bugs.debian.org (full text, mbox):

From: Jakub Wilk <jwilk@debian.org>
To: 497038@bugs.debian.org
Cc: Sami Liedes <sliedes@cc.hut.fi>
Subject: Re: Bug#497038: lzop indeed would speed things up
Date: Thu, 1 Dec 2011 00:29:49 +0100
* Sami Liedes <sliedes@cc.hut.fi>, 2011-07-02, 14:39:
>I believe merely using lzop could give a 2-4x speedup:
>
>------------------------------------------------------------
>$ time gzip -d <ftp.fi.debian.org_debian_dists_unstable_Contents-amd64.gz >/dev/null
>
>real    0m1.608s
>user    0m1.584s
>sys     0m0.016s
>
>$ time lzop -d <ftp.fi.debian.org_debian_dists_unstable_Contents-amd64.lzo >/dev/null
>
>real    0m0.382s
>user    0m0.356s
>sys     0m0.024s
>------------------------------------------------------------

You could achieve a further speedup by using lzop's -F (--no-checksum) option:

$ time gzip -dc < dists/unstable/main/Contents-amd64.gz > /dev/null

real    0m2.197s
user    0m2.184s
sys     0m0.012s

$ time lzop -dc < dists/unstable/main/Contents-amd64.lzo > /dev/null

real    0m0.475s
user    0m0.448s
sys     0m0.028s

$ time lzop -dFc < dists/unstable/main/Contents-amd64.lzo > /dev/null

real    0m0.266s
user    0m0.232s
sys     0m0.032s

That's only about twice as slow as reading the uncompressed file:

$ time cat dists/unstable/main/Contents-amd64 > /dev/null

real    0m0.130s
user    0m0.000s
sys     0m0.128s

However, that's only because the files are already cached. If I drop 
caches before running the commands, lzop wins:

$ time cat dists/unstable/main/Contents-amd64 > /dev/null

real    0m1.495s
user    0m0.000s
sys     0m0.224s

$ time lzop -dFc < dists/unstable/main/Contents-amd64.lzo > /dev/null

real    0m0.370s
user    0m0.264s
sys     0m0.020s
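
For illustration, a reader that picks the decompressor from the file suffix and passes -F to lzop could look like the sketch below. The name cat_contents is hypothetical and is not an existing apt-file function.

```shell
# Hypothetical helper: stream a Contents file to stdout, choosing the
# decompressor from the file suffix.
cat_contents() {
    case $1 in
        *.lzo) lzop -dFc < "$1" ;;   # -F: skip checksum verification
        *.gz)  gzip -dc < "$1" ;;
        *)     cat < "$1" ;;         # already uncompressed
    esac
}

# Usage: cat_contents dists/unstable/main/Contents-amd64.lzo | grep -F 'bin/lzop'
```

Skipping the checksum trades integrity checking for speed, so a damaged .lzo file may go unnoticed with this variant.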

-- 
Jakub Wilk





Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Fri Apr 18 16:23:49 2014; Machine Name: beach.debian.org

Debian Bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.