Debian Bug report logs - #703366
RFH: apt-file -- search for files within Debian packages (command-line interface)


Package: wnpp; Maintainer for wnpp is wnpp@debian.org;

Reported by: Stefan Fritsch <sf@sfritsch.de>

Date: Mon, 18 Mar 2013 20:42:02 UTC

Severity: normal

Fixed in version apt-file/2.5.2

Done: Niels Thykier <niels@thykier.net>

Bug is archived. No further changes may be made.



Report forwarded to debian-bugs-dist@lists.debian.org, debian-devel@lists.debian.org, thijs@debian.org, enrico@debian.org, wnpp@debian.org:
Bug#703366; Package wnpp. (Mon, 18 Mar 2013 20:42:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Stefan Fritsch <sf@sfritsch.de>:
New Bug report received and forwarded. Copy sent to debian-devel@lists.debian.org, thijs@debian.org, enrico@debian.org, wnpp@debian.org. (Mon, 18 Mar 2013 20:42:06 GMT) Full text and rfc822 format available.

Message #5 received at submit@bugs.debian.org (full text, mbox):

From: Stefan Fritsch <sf@sfritsch.de>
To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: RFH: apt-file -- search for files within Debian packages (command-line interface)
Date: Mon, 18 Mar 2013 21:38:19 +0100
Package: wnpp
Severity: normal

We are looking for a developer / co-maintainer for apt-file.


The package description is:
 apt-file is a command line tool for searching files contained in packages
 for the APT packaging system. You can search in which package a file is
 included or list the contents of a package without installing or fetching it.
 If you would prefer not to download the large files used by apt-file you can
 run rapt-file, which calls a remote server to do the searches.


I no longer have the time or interest to improve apt-file, and Thijs is
also only keeping it operational but not doing significant new
development.

The largest task as I see it would be to better integrate some or all of
the transport mechanisms, so that apt-file can behave properly for
different kinds of errors, report up-to-date-ness, etc.  Also better
download progress reporting and bandwidth limiting would be nice.  Of
course, there are many other possible improvements that no one had time
to implement, see the open wishlist bugs.

Apt-file is written in perl, so the prospective developer should have
decent perl knowledge. Since Apt-file interprets data received over
unauthenticated methods, being aware of possible security issues is also
necessary.

There is also rapt-file (written by Enrico), which consults a web
service instead of a local database. It is written in python and could
use some improvements, in particular to make it behave more similar to
apt-file.



Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#703366; Package wnpp. (Wed, 20 Mar 2013 10:42:16 GMT) Full text and rfc822 format available.

Acknowledgement sent to David Kalnischkies <kalnischkies@gmail.com>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Wed, 20 Mar 2013 10:42:16 GMT) Full text and rfc822 format available.

Message #10 received at 703366@bugs.debian.org (full text, mbox):

From: David Kalnischkies <kalnischkies@gmail.com>
To: Stefan Fritsch <sf@sfritsch.de>, 703366@bugs.debian.org
Cc: debian-devel@lists.debian.org, deity@lists.debian.org
Subject: Re: Bug#703366: RFH: apt-file -- search for files within Debian packages (command-line interface)
Date: Wed, 20 Mar 2013 11:40:08 +0100
On Mon, Mar 18, 2013 at 9:38 PM, Stefan Fritsch <sf@sfritsch.de> wrote:
> The largest task as I see it would be to better integrate some or all of
> the transport mechanisms, so that apt-file can behave properly for
> different kinds of errors, report up-to-date-ness, etc.  Also better
> download progress reporting and bandwidth limiting would be nice.  Of
> course, there are many other possible improvements that no one had time
> to implement, see the open wishlist bugs.

I would like to take the opportunity to invite anyone interested to join
deity@lists.debian.org (cc'ed) and discuss if and how we could work
on integrating apt-file more closely with other apt-* tools.

I (and I guess many other users) would be pleased if we could reach a
point in which "apt-get update" (or its countless alternative ways) would
update indeed all data I requested to be downloaded as a user rather than
remembering to run also "apt-file update" (and "debtags update" and and and).

Besides pleasing users, it might also free some resources on the code front
as APT already does progress reporting, bandwidth limiting and security,
even though I am certain we can improve all these further.


The deeper layers of APT we would need to touch for this are written in C++
and help in that area is certainly welcome, but would be independent from
apt-file as a consumer of the files on disk, so I see no need to rewrite
apt-file in $whatever language if that is a concern for you.


Best regards

David Kalnischkies



Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#703366; Package wnpp. (Wed, 20 Mar 2013 12:21:04 GMT) Full text and rfc822 format available.

Acknowledgement sent to nick black <nick.black@sprezzatech.com>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Wed, 20 Mar 2013 12:21:04 GMT) Full text and rfc822 format available.

Message #15 received at 703366@bugs.debian.org (full text, mbox):

From: nick black <nick.black@sprezzatech.com>
To: sf@sfritsch.de, 703366@bugs.debian.org
Subject: apt-file assistance
Date: Wed, 20 Mar 2013 08:06:26 -0400
Stefan,

I'd be interested in helping. I'm not a DD, but I am the developer of
"raptorial", an APT clone. I was already planning on starting in on the
apt-file component, so this is fortuitous.

-- 
nick black     http://www.sprezzatech.com -- unix and hpc consulting
to make an apple pie from scratch, you need first invent a universe.



Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#703366; Package wnpp. (Wed, 20 Mar 2013 16:39:07 GMT) Full text and rfc822 format available.

Acknowledgement sent to Niels Thykier <niels@thykier.net>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Wed, 20 Mar 2013 16:39:07 GMT) Full text and rfc822 format available.

Message #20 received at 703366@bugs.debian.org (full text, mbox):

From: Niels Thykier <niels@thykier.net>
To: 703366@bugs.debian.org
Cc: Stefan Fritsch <sf@sfritsch.de>, deity@lists.debian.org
Subject: Re: Bug#703366: RFH: apt-file -- search for files within Debian packages (command-line interface)
Date: Wed, 20 Mar 2013 17:35:54 +0100
(Dropping CC for d-devel)

On 2013-03-20 11:40, David Kalnischkies wrote:
> On Mon, Mar 18, 2013 at 9:38 PM, Stefan Fritsch <sf@sfritsch.de> wrote:
>> The largest task as I see it would be to better integrate some or all of
>> the transport mechanisms, so that apt-file can behave properly for
>> different kinds of errors, report up-to-date-ness, etc.  Also better
>> download progress reporting and bandwidth limiting would be nice.  Of
>> course, there are many other possible improvements that no one had time
>> to implement, see the open wishlist bugs.
> 

I am interested in working on apt-file.

> I would like to take the opportunity to invite anyone interested to join
> deity@lists.debian.org (cc'ed) and discuss if and how we could work
> on integrating apt-file more closely with other apt-* tools.
> 
> I (and I guess many other users) would be pleased if we could reach a
> point in which "apt-get update" (or its countless alternative ways) would
> update indeed all data I requested to be downloaded as a user rather than
> remembering to run also "apt-file update" (and "debtags update" and and and).
> 

Indeed that would be great.  Maybe packages like apt-file could install
a file in some directory APT reads saying "Please download X with updates" ?

> Besides pleasing users, it might also free some resources on the code front
> as APT already does progress reporting, bandwidth limiting and security,
> even though I am certain we can improve all these further.
> 
> 
> [...]
> 
> 
> Best regards
> 
> David Kalnischkies
> 
> 


Not to mention new "transports" come for free. :)

~Niels





Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#703366; Package wnpp. (Wed, 20 Mar 2013 17:33:04 GMT) Full text and rfc822 format available.

Acknowledgement sent to Stefan Fritsch <sf@sfritsch.de>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Wed, 20 Mar 2013 17:33:04 GMT) Full text and rfc822 format available.

Message #25 received at 703366@bugs.debian.org (full text, mbox):

From: Stefan Fritsch <sf@sfritsch.de>
To: deity@lists.debian.org
Cc: Niels Thykier <niels@thykier.net>, 703366@bugs.debian.org
Subject: Re: Bug#703366: RFH: apt-file -- search for files within Debian packages (command-line interface)
Date: Wed, 20 Mar 2013 18:30:32 +0100
On Wednesday 20 March 2013, Niels Thykier wrote:
> > I would like to take the opportunity to invite anyone interested
> > to join deity@lists.debian.org (cc'ed) and discuss if and how we
> > could work on integrating apt-file more closely with other apt-*
> > tools.
> >
> > 
> >
> > I (and I guess many other users) would be pleased if we could
> > reach a point in which "apt-get update" (or its countless
> > alternative ways) would update indeed all data I requested to be
> > downloaded as a user rather than remembering to run also
> > "apt-file update" (and "debtags update" and and and).
> >
> > 
> 
> Indeed that would be great.  Maybe packages like apt-file could
> install a file in some directory APT reads saying "Please download
> X with updates" ?

That would be the perfect solution. Unfortunately, it would also mean 
that apt's pdiff implementation would need to be rewritten because it 
is so inefficient. AFAICS, with N the number of lines in the Contents 
or Packages file, and M the number of diffs, apt currently scales like 
O(N*M) while apt-file's implementation scales more like O(M+N). Since 
the content files are much larger than the packages files, this would 
be an even bigger issue with apt-file than it is with apt. In order to 
get decent performance, one really must download all diffs and apply 
them at the same time. Also, it is not possible to keep the whole 
Contents file in memory (though I don't know if apt does that).
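To make the scaling argument concrete, here is a minimal sketch under strong simplifying assumptions: a "patch" is modelled as a hypothetical mapping from line index to replacement text (real pdiffs are ed-style scripts that also insert and delete lines), and the counters only tally how many lines or patch entries are touched. It is not the code of either apt or apt-file.

```python
from typing import Dict, List, Tuple

# Hypothetical simplified patch: line index -> replacement text.
Patch = Dict[int, str]


def apply_one_at_a_time(lines: List[str], patches: List[Patch]) -> Tuple[List[str], int]:
    """Rewrite the whole file once per patch: roughly N*M lines touched."""
    touched = 0
    for patch in patches:
        new_lines = []
        for i, line in enumerate(lines):
            new_lines.append(patch.get(i, line))
            touched += 1
        lines = new_lines
    return lines, touched


def apply_merged(lines: List[str], patches: List[Patch]) -> Tuple[List[str], int]:
    """Combine all patches first, then rewrite the file once: roughly M+N."""
    touched = 0
    merged: Patch = {}
    for patch in patches:
        for i, text in patch.items():
            merged[i] = text  # a later patch overrides an earlier one
            touched += 1
    result = []
    for i, line in enumerate(lines):
        result.append(merged.get(i, line))
        touched += 1
    return result, touched
```

With N = 1000 lines and M = 50 single-line patches, the first function touches 50000 lines while the second touches 1050 items, and both produce the same output.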

But of course, if someone would tackle that problem, the benefit would 
be much greater than only to apt-file. Maybe this would be a nice GSOC 
project? Don't know if it is too late for this year's deadline, 
though.



Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#703366; Package wnpp. (Thu, 21 Mar 2013 15:18:04 GMT) Full text and rfc822 format available.

Acknowledgement sent to Niels Thykier <niels@thykier.net>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Thu, 21 Mar 2013 15:18:04 GMT) Full text and rfc822 format available.

Message #30 received at 703366@bugs.debian.org (full text, mbox):

From: Niels Thykier <niels@thykier.net>
To: Stefan Fritsch <sf@sfritsch.de>, 703366@bugs.debian.org, deity@lists.debian.org
Cc: Nick Black <nick.black@sprezzatech.com>
Subject: Re: Bug#703366: RFH: apt-file -- search for files within Debian packages (command-line interface)
Date: Thu, 21 Mar 2013 16:14:56 +0100
On 2013-03-20 18:30, Stefan Fritsch wrote:
> On Wednesday 20 March 2013, Niels Thykier wrote:
>>> [...]
>>>
>>> I (and I guess many other users) would be pleased if we could
>>> reach a point in which "apt-get update" (or its countless
>>> alternative ways) would update indeed all data I requested to be
>>> downloaded as a user rather than remembering to run also
>>> "apt-file update" (and "debtags update" and and and).
>>>
>>>
>>
>> Indeed that would be great.  Maybe packages like apt-file could
>> install a file in some directory APT reads saying "Please download
>> X with updates" ?
> 
> That would be the perfect solution. Unfortunately, it would also mean 
> that apt's pdiff implementation would need to be rewritten because it 
> is so inefficient. [...]

I spoke with David Kalnischkies (DonKult) and he told me that (part of)
the reason why it is slow is that it makes no assumption about pdiffs.
It is my understanding (of the code) that apt-file just blindly
downloads all ("new") patches and applies them in one go.

Allegedly, reprepro can merge pdiffs so not all of them need to be
applied and (understandably) the APT maintainers do not want that to
break.  The solution is probably to extend the pdiff format (e.g. like
the suggestion in [1]), so the client side can see exactly which patches
are needed (instead of having to do them one at a time).
  To this end, I have been making a bit of noise in #d-ftp; hopefully I
will have news here soon.

> But of course, if someone would tackle that problem, the benefit would 
> be much greater than only to apt-file. Maybe this would be a nice GSOC 
> project? Don't know if it is too late for this year's deadline, 
> though.
> 

David reminded me that the APT side of things already had a GSoC last
year[2].  The code has not been merged yet but at least a
proof-of-concept branch is there.  Assuming that can be used, we are
probably very close to making apt-file's update/purge commands obsolete.
  As I understood Nick, he was not interested in maintaining the current
Perl variant of apt-file, but he would be interested in rewriting (and
maintaining said rewrite of) apt-file.  He was certain he could improve the
search speed of apt-file while doing so.  Given the results of his
apt-show-versions rewrite I am looking forward to that rewrite with
great anticipation.  :)

What I propose we do is that I take over the maintenance of the current
apt-file.  I will focus on making apt-file update/purge obsolete.
  Meanwhile, Nick can work on his rewrite in parallel - possibly in the
same source tree.  I am okay either way and I certainly do not mind an
extra co-maintainer.
  As Nick's rewrite becomes more feature complete, the current code could
then delegate more and more tasks to it.  In a (hopefully) not too
distant future, the Perl code can then be removed. :)

~Niels

[1] https://lists.debian.org/deity/2009/08/msg00169.html


[2]
http://wiki.debian.org/SummerOfCode2012/StudentApplications/BogdanPurcareata

https://launchpad.net/apt-fetcher





Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#703366; Package wnpp. (Thu, 21 Mar 2013 15:57:10 GMT) Full text and rfc822 format available.

Acknowledgement sent to Nick Black <nick.black@sprezzatech.com>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Thu, 21 Mar 2013 15:57:11 GMT) Full text and rfc822 format available.

Message #35 received at 703366@bugs.debian.org (full text, mbox):

From: Nick Black <nick.black@sprezzatech.com>
To: Niels Thykier <niels@thykier.net>
Cc: Stefan Fritsch <sf@sfritsch.de>, 703366@bugs.debian.org, deity@lists.debian.org
Subject: Re: Bug#703366: RFH: apt-file -- search for files within Debian packages (command-line interface)
Date: Thu, 21 Mar 2013 11:51:48 -0400
Niels Thykier left as an exercise for the reader:
>   As I understood Nick, he was not interested in maintaining the current
> Perl variant of apt-file, but he would be interested in rewriting (and
> maintaining said rewrite of) apt-file.  He was certain he could improve the
> search speed of apt-file while doing so.  Given the results of his
> apt-show-versions rewrite I am looking forward to that rewrite with
> great anticipation.  :)

As discussed on #debian-apt, aye, I'm going to begin on this today, hoping
to finish it up by early next week. Apt-file has more interactions with the
APT ecosystem than apt-show-versions did, and also a larger feature set.
The core is very amenable to the design patterns that already exist.

Regarding needing to remember to run apt-file update and apt-show-versions
-i, please know that part of the RAPTORIAL design has been, from the
beginning, that these all must go. The performance numbers I posted for
apt-show-versions already reflect a cacheless case -- I'm taking no
prisoners, and lexing the plaintext package lists and status file directly.

ps: I appreciate the graciousness of the APT team in extending me this
opportunity, especially after the somewhat aggressive way I launched this
project. I think we're all on the same page, for at least the short term.

-- 
nick black     http://www.sprezzatech.com -- unix and hpc consulting
to make an apple pie from scratch, you need first invent a universe.



Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#703366; Package wnpp. (Thu, 21 Mar 2013 17:57:04 GMT) Full text and rfc822 format available.

Acknowledgement sent to Stefan Fritsch <sf@sfritsch.de>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Thu, 21 Mar 2013 17:57:04 GMT) Full text and rfc822 format available.

Message #40 received at 703366@bugs.debian.org (full text, mbox):

From: Stefan Fritsch <sf@sfritsch.de>
To: Niels Thykier <niels@thykier.net>
Cc: 703366@bugs.debian.org, deity@lists.debian.org, Nick Black <nick.black@sprezzatech.com>
Subject: Re: Bug#703366: RFH: apt-file -- search for files within Debian packages (command-line interface)
Date: Thu, 21 Mar 2013 18:53:10 +0100
On Thursday 21 March 2013, Niels Thykier wrote:
> On 2013-03-20 18:30, Stefan Fritsch wrote:
> > That would be the perfect solution. Unfortunately, it would also
> > mean that apt's pdiff implementation would need to be rewritten
> > because it is so inefficient. [...]
> 
> I spoke with David Kalnischkies (DonKult) and he told me that (part
> of) the reason why it is slow is that it makes no assumption about
> pdiffs. It is my understanding (of the code) that apt-file just
> blindly downloads all ("new") patches and applies them in one go.

I was under the impression that the Index file tells you exactly which 
patches are necessary. But due to the lack of any formal specification 
(at least at the time I wrote diffindex-* in apt-file), maybe I was 
wrong.

> Allegedly, reprepro can merge pdiffs so not all of them need to be
> applied and (understandably) the APT maintainers do not want that
> to break.

This seems very broken to me. Merging the diffs on the server side has 
little benefit. You still need exactly the same number of diffs on the 
server but each diff gets larger and there is more change among the 
diffs so that the efficiency of caching proxies goes down. With keep-
alive connections and pipelining, downloading a few dozen files is not 
that big a problem.

And there are some implementations (at least apt-file's and the 
security tracker's) that depend on the pdiffs being incremental in 
order to be faster than apt by at least one order of magnitude. So if 
the archive would ever use the diff merging, those implementations 
would break.

> The solution is probably to extend the pdiff format
> (e.g. like the suggestion in [1]), so the client side can see
> exactly which patches are needed (instead of having to do them one
> at a time).
>   To this end, I have been making a bit of noise in #d-ftp;
> hopefully I will have news here soon.

I think apt should still be changed to assume incremental diffs unless 
the Index file is of a new format. That would bring the benefit even 
for old-style archives. Merging diffs on the server does not give 
comparable benefit.

> David reminded me that the APT side of things already had a GSoC
> last year[2].  The code has not been merged yet but at least a
> proof-of-concept branch is there.  Assuming that can be used, we
> are probably very close to making apt-file's update/purge commands
> obsolete.

Nice. But the pdiff problem still needs to be solved. You don't want 
to slow down apt-file update by a factor of 10 or more.

> As I understood Nick, he was not interested in maintaining
> the current Perl variant of apt-file, but he would be interested
> in rewriting (and maintaining said rewrite of) apt-file.  He was
> certain he could improve the search speed of apt-file while doing
> so.  Given the results of his apt-show-versions rewrite I am
> looking forward to that rewrite with great anticipation.  :)
> 
> What I propose we do is that I take over the maintenance of the
> current apt-file.  I will focus on making apt-file update/purge
> obsolete.

Sure. It's in collab-maint. Just commit away. But don't remove Thijs 
or Enrico, they still want to stay co-maintainers.

Cheers,
Stefan



Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#703366; Package wnpp. (Thu, 21 Mar 2013 18:03:07 GMT) Full text and rfc822 format available.

Acknowledgement sent to Stefan Fritsch <sf@sfritsch.de>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Thu, 21 Mar 2013 18:03:07 GMT) Full text and rfc822 format available.

Message #45 received at 703366@bugs.debian.org (full text, mbox):

From: Stefan Fritsch <sf@sfritsch.de>
To: Nick Black <nick.black@sprezzatech.com>
Cc: Niels Thykier <niels@thykier.net>, 703366@bugs.debian.org, deity@lists.debian.org
Subject: Re: Bug#703366: RFH: apt-file -- search for files within Debian packages (command-line interface)
Date: Thu, 21 Mar 2013 18:58:54 +0100
On Thursday 21 March 2013, Nick Black wrote:
> Niels Thykier left as an exercise for the reader:
> > As I understood Nick, he was not interested in maintaining the
> > current Perl variant of apt-file, but he would be interested in
> > rewriting (and maintaining said rewrite of) apt-file.  He was
> > certain he could improve the search speed of apt-file while doing
> > so.  Given the results of his apt-show-versions rewrite I am
> > looking forward to that rewrite with great anticipation.  :)

The search speed is currently dominated by the decompression time. 
That could be easily cut down by changing the local files to be 
compressed with lzo instead of gzip. But IMHO you don't want to keep > 
300MB of uncompressed data lying on the local hard disk, so there are 
limits to the speed up you can achieve. Unless you want to introduce 
some nifty compressed searchable index format.
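As an illustration of why decompression dominates, a search over a gzip-compressed Contents file has to stream-decompress every line before it can test it. The sketch below is a minimal example, not apt-file's implementation. It assumes each line has the form "path<whitespace>section/package" and that the file has no free-form header. The function name `search_contents` is made up for this example.

```python
import gzip
import re
from typing import Iterator, Tuple


def search_contents(path: str, pattern: str) -> Iterator[Tuple[str, str]]:
    """Yield (filename, packages) for lines whose filename matches pattern.

    The file is decompressed on the fly, so nothing uncompressed is
    written to disk, but every search pays the full decompression cost.
    """
    regex = re.compile(pattern)
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            # Assumed layout: file path, whitespace, package list.
            parts = line.rstrip("\n").rsplit(None, 1)
            if len(parts) != 2:
                continue
            filename, packages = parts
            if regex.search(filename):
                yield filename, packages
```

Swapping `gzip.open` for a faster decompressor would leave the loop unchanged, which is the kind of speedup suggested above.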

> As discussed on #debian-apt, aye, I'm going to begin on this today,
> hoping to finish it up by early next week. Apt-file has more
> interactions with the APT ecosystem than apt-show-versions did,
> and also a larger feature set. The core is very amenable to the
> design patterns that already exist.

If you and Niels could send me the backlogs of those talks, I would be 
interested.

Cheers,
Stefan



Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#703366; Package wnpp. (Sun, 24 Mar 2013 19:45:07 GMT) Full text and rfc822 format available.

Acknowledgement sent to David Kalnischkies <kalnischkies+debian@gmail.com>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Sun, 24 Mar 2013 19:45:07 GMT) Full text and rfc822 format available.

Message #50 received at 703366@bugs.debian.org (full text, mbox):

From: David Kalnischkies <kalnischkies+debian@gmail.com>
To: Stefan Fritsch <sf@sfritsch.de>
Cc: Niels Thykier <niels@thykier.net>, 703366@bugs.debian.org, deity@lists.debian.org, Nick Black <nick.black@sprezzatech.com>
Subject: Re: Bug#703366: RFH: apt-file -- search for files within Debian packages (command-line interface)
Date: Sun, 24 Mar 2013 20:41:14 +0100
On Thu, Mar 21, 2013 at 6:53 PM, Stefan Fritsch <sf@sfritsch.de> wrote:
> On Thursday 21 March 2013, Niels Thykier wrote:
>> On 2013-03-20 18:30, Stefan Fritsch wrote:
>> Allegedly, reprepro can merge pdiffs so not all of them need to be
>> applied and (understandably) the APT maintainers do not want that
>> to break.
>
> This seems very broken to me. Merging the diffs on the server side has
> little benefit. You still need exactly the same number of diffs on the
> server but each diff gets larger and there is more change among the
> diffs so that the efficiency of caching proxies goes down. With keep-
> alive connections and pipelining, downloading a few dozen files is not
> that big a problem.

We needed to disable pipelining recently as we failed to "force" broken
proxies and servers into supporting it properly. Think e.g. squid and amazon.
Maybe the big webbrowsers are able to get them to behave now that
they all start to use pipelining …

Still, assuming a perfect world, we download a lot of files which means
a lot of gz-overhead per file. There is also the theory that a package that was
touched is soon touched again (e.g. to fix a bug) meaning we have a lot of
"useless" data downloaded. Add slow systems and those behind a self-controlled
mirror (where you could merge).

So in a perfect world we would support both.


> And there are some implementations (at least apt-file's and the
> security tracker's) that depend on the pdiffs being incremental in
> order to be faster than apt by at least one order of magnitude. So if
> the archive would ever use the diff merging, those implementations
> would break.

I wonder if that is the reason why the announced pdiff change in dak has not
been merged to this day:
https://lists.debian.org/debian-devel-announce/2012/09/msg00012.html


>> The solution is probably to extend the pdiff format
>> (e.g. like the suggestion in [1]), so the client side can see
>> exactly which patches are needed (instead of having to do them one
>> at a time).
>>   To this end, I have been making a bit of noise in #d-ftp;
>> hopefully I will have news here soon.
>
> I think apt should still be changed to assume incremental diffs unless
> the Index file is of a new format. That would bring the benefit even
> for old-style archives. Merging diffs on the server does not give
> comparable benefit.

As said, depends. Anyway, APT is usually extremely conservative regarding
breaking workflows, even if only a few users use this flow, so I highly doubt
we would change to incremental by default.


>> David reminded me that the APT side of things already had a GSoC
>> last year[2].  The code has not been merged yet but at least a
>> proof-of-concept branch is there.  Assuming that can be used, we
>> are probably very close to making apt-file's update/purge commands
>> obsolete.

I had unfortunately less time than I hoped, but I will try to write a proper
follow-up on this soon. Until then some loose ends:

The GSoC bundles another big change regarding sources.list handling which
needs work before we can merge this (the new code is incompatible with the
 old). On top of this the acquire system is extended to deal with more
complex extensions on the file front, which is interesting but independent
as most files we download do not need a complicated handling (like fallbacks
 and conditionals – think: (In)Release(.gpg)) so we need code for "simple"
files anyway, therefore no problem to do this independently.

Rewriting debReleaseIndex::ComputeIndexTargets in apt-pkg/deb/debmetaindex.cc
to query files based on configs rather than hardcoded should be key here
(beside moving this code up in the class hierarchy then).
Something along the lines of Acquire::Files::<Type>::<Identifier>::<Data>
where <Type> is "Base", "Flat" and "Tree", to have different settings for
"Flat" and "Tree" style archives. <Identifier> being a random name like
"Packages", "Contents", … And finally <Data> to set "URI", "Description" …
(I wonder if we need Acquire::Files::http://example.org/:: … too)

URI should be built with placeholders like BaseURI, Architectures,
NativeArchitecture, Languages. Many of these should be available in the
other <Data> elements as well (think: Description for Translation-*).
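A rough sketch of how such placeholder expansion could work, purely for illustration: the `${...}` syntax, the per-item `Architecture` placeholder and the function name `expand_target` are assumptions made for this example. None of them is an existing APT interface or part of the proposal's wording.

```python
from string import Template
from typing import List


def expand_target(template: str, base_uri: str, architectures: List[str]) -> List[str]:
    """Expand a URI template once per configured architecture.

    Placeholder names here are hypothetical stand-ins for the
    BaseURI / Architectures placeholders mentioned above.
    """
    tmpl = Template(template)
    return [
        tmpl.substitute(BaseURI=base_uri, Architecture=arch)
        for arch in architectures
    ]


# One configured "Contents" entry would yield one download per architecture.
uris = expand_target(
    "${BaseURI}/Contents-${Architecture}.gz",
    "http://example.org/debian/dists/sid/main",
    ["amd64", "i386"],
)
```

The same substitution could be applied to a Description template, so that each expanded file gets a matching human-readable label.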

While we have IndexTargets and OptionalIndexTargets the latter aren't really
optional (but hardcoded-optional as we couldn't break ABI at that point),
fixing this now would be good [aka: needed].


So long,
Best regards

David Kalnischkies



Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#703366; Package wnpp. (Mon, 25 Mar 2013 15:06:09 GMT) Full text and rfc822 format available.

Acknowledgement sent to Nick Black <nick.black@sprezzatech.com>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Mon, 25 Mar 2013 15:06:09 GMT) Full text and rfc822 format available.

Message #55 received at 703366@bugs.debian.org (full text, mbox):

From: Nick Black <nick.black@sprezzatech.com>
To: Stefan Fritsch <sf@sfritsch.de>
Cc: Niels Thykier <niels@thykier.net>, 703366@bugs.debian.org, deity@lists.debian.org
Subject: Re: Bug#703366: RFH: apt-file -- search for files within Debian packages (command-line interface)
Date: Mon, 25 Mar 2013 11:02:29 -0400
So, as I posted to deity/debdev/derivatives last evening, apt-file has been
rewritten as raptorial-file. On typical queries on my quadcore,
raptorial-file() takes about 50% of the runtime of apt-file(), a speedup of
between 1.5x and 2x. For pathological queries, raptorial-file() takes about 3%
of the runtime of apt-file(), a speedup of 50x or so.
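For reference, relative runtime and speedup are reciprocals of each other. The small helper below is only a worked-arithmetic sketch (the function name is made up): a runtime of about 50% of the original gives 2x, and runtimes of about 2% to 3% give roughly 50x to 33x.

```python
def speedup(runtime_fraction: float) -> float:
    """Speedup factor for a new runtime given as a fraction of the old one."""
    if runtime_fraction <= 0:
        raise ValueError("runtime fraction must be positive")
    return 1.0 / runtime_fraction


typical = speedup(0.50)          # half the runtime
pathological_hi = speedup(0.02)  # 2% of the runtime
pathological_lo = speedup(0.03)  # 3% of the runtime
```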

This email characterizes performance of apt-file: http://lists.debian.org/debian-devel/2013/03/msg00409.html
This email gives early results: http://lists.debian.org/debian-devel/2013/03/msg00415.html
Lucky email #420 updates common-case results following a bugfix: http://lists.debian.org/debian-devel/2013/03/msg00420.html 

Let me know how you'd like to proceed.

Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#703366; Package wnpp. (Mon, 25 Mar 2013 15:33:08 GMT) Full text and rfc822 format available.

Acknowledgement sent to Nick Black <nick.black@sprezzatech.com>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Mon, 25 Mar 2013 15:33:09 GMT) Full text and rfc822 format available.

Message #60 received at 703366@bugs.debian.org (full text, mbox):

From: Nick Black <nick.black@sprezzatech.com>
To: Stefan Fritsch <sf@sfritsch.de>, Niels Thykier <niels@thykier.net>, 703366@bugs.debian.org, deity@lists.debian.org
Subject: Re: Bug#703366: RFH: apt-file -- search for files within Debian packages (command-line interface)
Date: Mon, 25 Mar 2013 11:30:34 -0400
i ought point out here that raptorial-file does not have "update" or "purge"
functionality. it seems to me that if the user has installed apt-file or
some equivalent, they're interested in the contents of Contents files, and
thus they ought be downloaded with apt updates. if they have not installed
it, they can't use the data anyway, so don't download them. this can be done
with apt hooks, no? and the first time it's installed it ought try to
download contents. apt-file(1) as stands is perfectly reasonable at the
update and purge task; we could gut it of "list"/"search" functionality,
rename it apt-contents-get or something, and call it from the aforementioned
hooks.

let me know if there's some reason why this is impossible (i do not yet
pretend to grasp APT's semantics in their full detail). otherwise, let me
know if you'd like me to proceed along this path, or who should do what, or
whatever. i'd love to solve the longstanding annoyance of "apt-file update".
i might be overlooking something obvious, though?



Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#703366; Package wnpp. (Tue, 26 Mar 2013 00:24:04 GMT) Full text and rfc822 format available.

Acknowledgement sent to Nick Black <nick.black@sprezzatech.com>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Tue, 26 Mar 2013 00:24:04 GMT) Full text and rfc822 format available.

Message #65 received at 703366@bugs.debian.org (full text, mbox):

From: Nick Black <nick.black@sprezzatech.com>
To: Stefan Fritsch <sf@sfritsch.de>
Cc: Niels Thykier <niels@thykier.net>, deity@lists.debian.org, 703366@bugs.debian.org
Subject: Re: Bug#703366: RFH: apt-file -- search for files within Debian packages (command-line interface)
Date: Mon, 25 Mar 2013 20:20:09 -0400
Stefan Fritsch left as an exercise for the reader:
> Well, the contents files are much larger than the package files and 
> are usually used less frequently. So some users may prefer to download 
> the contents files only when necessary. Apart from that, I don't see 

then can't they just leave apt-file uninstalled? especially as installing it
would perform the initial apt-get update?

> any problem. But that's not my decision anymore :-)

yeah i'm not wedded to any particular solution, but this one seems right to
me. if it's something that's been thrashed out at length, though, no need to
entertain my suggestions.

> - Significant speedup could be attained by recompressing the local 
> file with lzop instead of gzip. You write "processing time is roughly

Absolutely. If we can make local changes, there's all kinds of things we can
do. I left this entire class of optimizations aside.

For that matter, if we stripped the free-form header section, that would,
perhaps surprisingly, *vastly* simplify my code. Again, I wanted to do an
implementation which conformed precisely to current disk layouts and such,
since I want to deploy this in SprezzOS independently of Debian's decision.

> - Try benchmarks with a single core, too. It's nice if you can use 
> more cores but you don't want to have too much regression on single 
> core systems.

Yep, I will send those around. I'm not doing anything stupid like launching
a fixed-sized pool; libblossom offers us per-CPU workers etc.

> - apt-file regex search performance sucks because it doesn't use grep. 
> Nowadays grep has -P, so grep could be used, too. Which regex type do 
> you use?

Hold on for a bit of theory:

 - I'm matching multiple patterns using an "Advanced Aho-Corasick
   automaton". The set of all AACAs is clearly a subset of all DFA (discrete
   finite automatons).

 - The set of languages recognized by DFAs is exactly the set of regular
   languages, i.e. those described by regular expressions.

 - Thus, any regular operation can be encoded into a DFA, though possibly at
   a high cost in states. See Sipser or Hopcroft+Ullman.

 - Thus, we just encode the regular operations as alternates in our AACA.
   Since we already match the AACA in time independent of the number of
   patterns, adding these alternate patterns costs us no time in the main,
   but only in the preprocessing.

I'm doing basically the exact same thing grep/egrep does: Boyer-Moore-Galil
for one pattern, or AAC for multiple patterns.
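To make the multi-pattern idea concrete, here is a minimal sketch of the
classic (textbook) Aho-Corasick automaton in Python. It is not the "Advanced"
variant described above and does not encode regular operations; the function
names build_automaton and search are made up for this illustration. It only
shows why scan time does not grow with the number of patterns: the text is
walked once, and all pattern-dependent work happens while building the
automaton.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and output lists."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> patterns that end at this state
    for pat in patterns:
        state = 0
        for ch in pat:
            nxt = goto[state].get(ch)
            if nxt is None:
                goto.append({})
                fail.append(0)
                out.append([])
                nxt = len(goto) - 1
                goto[state][ch] = nxt
            state = nxt
        out[state].append(pat)

    # Breadth-first pass: a state's failure link is known before its
    # children are processed, so links can be computed top-down.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, automaton):
    """Return (start_offset, pattern) for every match, in one pass."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

For example, search("usr/bin/grep usr/lib", build_automaton(["bin", "usr",
"usr/lib"])) reports all three patterns while reading the text once.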

> - Are you limiting the used memory? Remember there may still be VMs 
> with 256MB RAM and you shouldn't cause swapping on such systems.

Even if they only have 256MB of physical RAM, they still have however large
a virtual address space. mmap() is not going to map in all the requested
pages; it's just associating a VMA in our process with them. They'll be
faulted in as they're referenced. Thus what matters is not how much we have
mmap()d (which can be large, equivalent to the sum of the compressed files
plus dynamic state) but how much we touch at any one time. Our dynamic state
per thread is limited, so we'll never be, say, trying to uncompress 256MB of
text (this would be bad for parallelism anyway).
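The lazy-paging behaviour can be illustrated with a small Python sketch. This
is not raptorial's code: it maps an uncompressed temporary file rather than
the compressed Contents files, and count_occurrences and demo are
hypothetical names. The point is only that mapping a file reserves address
space, while physical pages are brought in as the scan touches them.

```python
import mmap
import os
import tempfile


def count_occurrences(path, needle):
    """Count matches of needle in a file through a read-only mapping.

    The mapping covers the whole file, but pages are only faulted in
    as find() walks over them. Note that mmap refuses empty files.
    """
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as view:
            count = 0
            pos = view.find(needle)
            while pos != -1:
                count += 1
                pos = view.find(needle, pos + 1)
            return count


def demo():
    """Write a throwaway Contents-like file and scan it via mmap."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"usr/bin/grep\tutils/grep\n" * 1000)
        path = tmp.name
    try:
        return count_occurrences(path, b"bin/grep")
    finally:
        os.unlink(path)
```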

Hope that answers these questions. Feel free to hit me with more.

Hack on!

--nick


-- 
nick black     http://www.sprezzatech.com -- unix and hpc consulting
to make an apple pie from scratch, you need first invent a universe.



Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#703366; Package wnpp. (Tue, 26 Mar 2013 07:57:04 GMT)

Acknowledgement sent to Niels Thykier <niels@thykier.net>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Tue, 26 Mar 2013 07:57:04 GMT)

Message #70 received at 703366@bugs.debian.org:

From: Niels Thykier <niels@thykier.net>
To: Nick Black <nick.black@sprezzatech.com>
Cc: Stefan Fritsch <sf@sfritsch.de>, deity@lists.debian.org, 703366@bugs.debian.org
Subject: Re: Bug#703366: RFH: apt-file -- search for files within Debian packages (command-line interface)
Date: Tue, 26 Mar 2013 08:53:32 +0100
On 2013-03-26 01:20, Nick Black wrote:
> Stefan Fritsch left as an exercise for the reader:
>> Well, the contents files are much larger than the package files and 
>> are usually used less frequently. So some users may prefer to download 
>> the contents files only when necessary. Apart from that, I don't see 
> 
> then can't they just leave apt-file uninstalled? especially as installing it
> would perform the initial apt-get update?
> 
>> any problem. But that's not my decision anymore :-)
> 
> yeah i'm not wedded to any particular solution, but this one seems right to
> me. if it's something that's been thrashed out at length, though, no need to
> entertain my suggestions.
> 

Sounds like we should have some config variable that people can set (or
clear) to disable Contents fetching via apt-get update.  Assuming the
APT side of this would support such a use-case, I think we can have it.
  But to be honest, I would really like to remove the "apt-file update"
if I can get away with it.  It always seemed like something APT ought to
do... though I suppose if I end up delegating the entire thing to
apt-get update it will not really have any maintenance overhead.

>> - Significant speedup could be attained by recompressing the local 
>> file with lzop instead of gzip. You write "processing time is roughly
> 

I tried a bit with lzop and indeed it seems to halve my runtimes with
search and show (at least the non-regex variant).  Though it comes at
lower compression ratios, which is not a problem atm but might be when
multi-arch support is "added" (also see my comment about redundancy below).

> Absolutely. If we can make local changes, there's all kinds of things we can
> do. I left this entire class of optimizations aside.
> 
> For that matter, if we stripped the free-form header section, that would,
> perhaps surprisingly, *vastly* simplify my code. Again, I wanted to do an
> implementation which conformed precisely to current disk layouts and such,
> since I want to deploy this in SprezzOS independently of Debian's decision.
> 

There are also things we could do at update time:

 * prepending / to all paths, as people expect a leading slash.  To this
   end, apt-file currently tries to rewrite people's search patterns to
   match reality, but I hope we could eventually avoid that (because it
   does not work in all cases etc.).
 * remove redundancy between Contents-* files.  Between unstable and
   testing (or i386 and amd64) there is a huge overlap in files.  That
   would likely allow us to scale better as the number of enabled
   architectures and distributions increases.
   (related bugs include #658794, #578727 and #632254)
 * make optimized caches for certain use-cases like "list/show".  Maybe
   even "match pattern X against programs in default PATH".

The second item probably requires merging the Contents files, which we
probably need to do in a very efficient manner.  I believe the files are
pre-sorted, so we could abuse this to do the "merge" part of mergesort
without having the whole ordeal loaded in memory (which is sadly quickly
measured in GB).
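A streaming merge along these lines could look roughly like the Python sketch
below. It assumes each input is already sorted by path in the same order that
Python uses to compare strings (which would need checking against the real
files), and that each line has the form "path<whitespace>location[,location]".
The names parse and merge_contents are invented for this example.

```python
import heapq
from itertools import groupby


def parse(line):
    """Split 'path  section/pkg[,section/pkg...]' into (path, [locations]).

    The path may contain spaces, so only the last column is split off.
    """
    path, locations = line.rstrip("\n").rsplit(None, 1)
    return path.rstrip(), locations.split(",")


def merge_contents(*sources):
    """Merge pre-sorted Contents-like inputs, one line per source in memory.

    Entries for the same path are collapsed and their package lists are
    united, which removes the overlap between e.g. testing and unstable.
    """
    parsed = (map(parse, src) for src in sources)
    merged = heapq.merge(*parsed, key=lambda entry: entry[0])
    for path, group in groupby(merged, key=lambda entry: entry[0]):
        packages = sorted({pkg for _, pkgs in group for pkg in pkgs})
        yield path, packages
```

Since heapq.merge and groupby are both lazy, the sources can be open file
objects (or gzip streams) and the result can be written out as it is
produced.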

>> - Try benchmarks with a single core, too. It's nice if you can use 
>> more cores but you don't want to have too much regression on single 
>> core systems.
> 
> Yep, I will send those around. I'm not doing anything stupid like launching
> a fixed-sized pool; libblossom offers us per-CPU workers etc.
> 
>> - apt-file regex search performance sucks because it doesn't use grep. 
>> Nowadays grep has -P, so grep could be used, too. Which regex type do 
>> you use?
> 

Also possibly because Perl (Python, Java etc.) uses an expensive regular
expression implementation[1].

> Hold on for a bit of theory:
> 
>  - I'm matching multiple patterns using an "Advanced Aho-Corasick
>    automaton". The set of all AACAs is clearly a subset of all DFA (discrete
>    finite automatons).
> 

I think you mean s/discrete/deterministic/, since the counterpart NFA (which
can be used to match any regular language as well) is a "non-deterministic
finite automaton".

>  - The set of languages recognized by DFAs is exactly the set of regular
>    languages, i.e. those described by regular expressions.
> 
>  - Thus, any regular operation can be encoded into a DFA, though possibly at
>    a high cost in states. See Sipser or Hopcroft+Ullman.
> 
>  - Thus, we just encode the regular operations as alternates in our AACA.
>    Since we already match the AACA in time independent of the number of
>    patterns, adding these alternate patterns costs us no time in the main,
>    but only in the preprocessing.
> 
> I'm doing basically the exact same thing grep/egrep does: Boyer-Moore-Galil
> for one pattern, or AAC for multiple patterns.
> 
>> [...]
> Hack on!
> 
> --nick
> 
> 

True, but the perl "regular expression" is in fact more powerful than an
NFA.  Admittedly I believe the only real feature that exceeds NFAs is
the "backref"s, which are thankfully not used that often.
  I have no concerns about compiling the "perl regex" case into a
DFA/NFA where possible, but we have to either handle the backref case or
explicitly document that backrefs are not supported.
  I am planning on doing an apt-file 3 release post Wheezy where I
permit backwards incompatible changes (e.g. exit code and making -F
default with show/list), so whichever way we choose to do it will be fine.
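As a small illustration of why backrefs fall outside what a DFA/NFA can do,
the Python sketch below uses a backreference to accept strings of the form
a^n b a^n, a language that is not regular (the pumping lemma rules it out).
It uses Python's re module rather than Perl, and the name is_mirrored is made
up for the example.

```python
import re

# (a+) captures a run of 'a's; \1 demands the identical run again after 'b'.
# No finite automaton can remember an unbounded count, so this needs
# a backtracking matcher rather than a compiled DFA.
MIRRORED = re.compile(r"(a+)b\1")


def is_mirrored(text):
    """True if text is some number of 'a's, a 'b', then as many 'a's again."""
    return MIRRORED.fullmatch(text) is not None
```

A pattern without \1, such as a+ba+, is regular and could be compiled to a
DFA, but it would also accept "aabaaa".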

~Niels

[1] http://swtch.com/~rsc/regexp/regexp1.html

Article is from 2007, so things could have changed.  Though it is my
understanding that they haven't.

NB: The first two graphs do not have the same unit on the Y-axis (i.e.
seconds vs. microseconds).



Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#703366; Package wnpp. (Tue, 26 Mar 2013 10:39:03 GMT)

Acknowledgement sent to Stefan Fritsch <sf@sfritsch.de>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Tue, 26 Mar 2013 10:39:03 GMT)

Message #75 received at 703366@bugs.debian.org:

From: Stefan Fritsch <sf@sfritsch.de>
To: Nick Black <nick.black@sprezzatech.com>
Cc: Niels Thykier <niels@thykier.net>, deity@lists.debian.org, 703366@bugs.debian.org
Subject: Re: Bug#703366: RFH: apt-file -- search for files within Debian packages (command-line interface)
Date: Tue, 26 Mar 2013 00:00:13 +0100
On Monday 25 March 2013, Nick Black wrote:
> i ought point out here that raptorial-file does not have "update"
> or "purge" functionality. it seems to me that if the user has
> installed apt-file or some equivalent, they're interested in the
> contents of Contents files, and thus they ought be downloaded with
> apt updates. if they have not installed it, they can't use the
> data anyway, so don't download them. this can be done with apt
> hooks, no? and the first time it's installed it ought try to
> download contents. apt-file(1) as stands is perfectly reasonable
> at the update and purge task; we could gut it of "list"/"search"
> functionality, rename it apt-contents-get or something, and call
> it from the aforementioned hooks.
> 
> let me know if there's some reason why this is impossible (i do not
> yet pretend to grasp APT's semantics in their full detail).
> otherwise, let me know if you'd like me to proceed along this
> path, or who should do what, or whatever. i'd love to solve the
> longstanding annoyance of "apt-file update". i might be
> overlooking something obvious, though?

Well, the contents files are much larger than the package files and 
are usually used less frequently. So some users may prefer to download 
the contents files only when necessary. Apart from that, I don't see 
any problem. But that's not my decision anymore :-)


Other comments:

- Significant speedup could be attained by recompressing the local 
file with lzop instead of gzip. You write "processing time is roughly
characterized as at least twice inflation time" which is not what I 
remember from my experiments some years back. Decompression time was 
dominant (assuming a query that has only a few matches, which is the 
common case IMHO).

- Try benchmarks with a single core, too. It's nice if you can use 
more cores but you don't want to have too much regression on single 
core systems.

- apt-file regex search performance sucks because it doesn't use grep. 
Nowadays grep has -P, so grep could be used, too. Which regex type do 
you use?

- Are you limiting the used memory? Remember there may still be VMs 
with 256MB RAM and you shouldn't cause swapping on such systems.
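One way to check which phase dominates is to time decompression and matching
separately. The Python sketch below does this for gzip on synthetic data; it
is only a rough measuring aid, not apt-file's or raptorial's code, and
make_sample and time_search are hypothetical names. Real Contents files and
an lzop build would be needed for a meaningful comparison.

```python
import gzip
import time


def make_sample(lines=200_000):
    """Build a gzip blob of synthetic Contents-like lines."""
    rows = (b"usr/share/doc/pkg%d/file%d\tdoc/pkg%d\n" % (i, i, i)
            for i in range(lines))
    return gzip.compress(b"".join(rows))


def time_search(blob, needle):
    """Return (hits, seconds spent inflating, seconds spent matching)."""
    start = time.perf_counter()
    text = gzip.decompress(blob)
    inflated = time.perf_counter()
    hits = text.count(needle)
    done = time.perf_counter()
    return hits, inflated - start, done - inflated
```

Calling time_search(make_sample(), b"pkg42/") and comparing the two timings
gives a first impression for a query with few matches.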


Cheers,
Stefan



Reply sent to Niels Thykier <niels@thykier.net>:
You have taken responsibility. (Sun, 05 May 2013 08:09:55 GMT)

Notification sent to Stefan Fritsch <sf@sfritsch.de>:
Bug acknowledged by developer. (Sun, 05 May 2013 08:09:55 GMT)

Message #80 received at 703366-close@bugs.debian.org:

From: Niels Thykier <niels@thykier.net>
To: 703366-close@bugs.debian.org
Subject: Bug#703366: fixed in apt-file 2.5.2
Date: Sun, 05 May 2013 07:47:31 +0000
Source: apt-file
Source-Version: 2.5.2

We believe that the bug you reported is fixed in the latest version of
apt-file, which is due to be installed in the Debian FTP archive.

A summary of the changes between this version and the previous one is
attached.

Thank you for reporting the bug, which will now be closed.  If you
have further comments please address them to 703366@bugs.debian.org,
and the maintainer will reopen the bug report if appropriate.

Debian distribution maintenance software
pp.
Niels Thykier <niels@thykier.net> (supplier of updated apt-file package)

(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing ftpmaster@debian.org)


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Format: 1.8
Date: Sun, 05 May 2013 09:32:24 +0200
Source: apt-file
Binary: apt-file
Architecture: source all
Version: 2.5.2
Distribution: unstable
Urgency: low
Maintainer: Niels Thykier <niels@thykier.net>
Changed-By: Niels Thykier <niels@thykier.net>
Description: 
 apt-file   - search for files within Debian packages (command-line interface)
Closes: 687221 703366 703594
Changes: 
 apt-file (2.5.2) unstable; urgency=low
 .
   [ Stefan Fritsch ]
   * Properly detect failed attempt to apply patches. Closes: #687221
 .
   [ Niels Thykier ]
   * New maintainer.  (Closes: #703366)
     Kudos to Stefan Fritsch for his work on apt-file.
   * When using "-f" and no files are given, default to stdin.
     (Closes: #703594)
   * Bump Standards-Versions to 3.9.4 - no changes required.
   * Use the canonical URIs for the Vcs-* fields.
Checksums-Sha1: 
 ed19dbfc449e00546afbd873a0917849746eaf3b 1864 apt-file_2.5.2.dsc
 52dc3520fc5faf720d685b5d9642fd649dca4c9d 42233 apt-file_2.5.2.tar.gz
 7e0c9353853dad1654dc5d4ee016a3c28faf9678 33652 apt-file_2.5.2_all.deb
Checksums-Sha256: 
 928ee0ef2572b690db0a1fe6be666080ca9a4d56284281e22a8cd0d5d0d6a6fa 1864 apt-file_2.5.2.dsc
 39f717e2c2df150da35e22ffb529a23c487b224efa2c0aae219d609399775ec6 42233 apt-file_2.5.2.tar.gz
 75bcbbbfb0c513f1cfdd1e94b111c5475aea46ce7d86b90b5a06c9ced8754225 33652 apt-file_2.5.2_all.deb
Files: 
 4195c32f4787152a190a76df90dd2213 1864 admin optional apt-file_2.5.2.dsc
 29f7f9fc363076b1a68cbd3127a70a7a 42233 admin optional apt-file_2.5.2.tar.gz
 feb117ff21cba2924e58544ca08dd763 33652 admin optional apt-file_2.5.2_all.deb

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iQIcBAEBCAAGBQJRhgwiAAoJEAVLu599gGRC+qkQAIVu7tnOSp1SuEu686AdiqYx
UftN5s9Wlumew+fNFDDFXKbilLf99TByll6c4372BaTMy9YYwjuYF8VIGz9UFopK
ryQcyBma+mjifCkBnUUv7TN1t5iEnyxPyJjTd26GSXg3VBuuUrWgKmFQEdOfp9Xl
d5lrY1WTMutKHirFas6ezwR+GTctjbq7JFC3g9pHkkl5fAAFwsbsMuaeB7ZcLXGn
vPnAHVWrAiP2m9VyjG4K1t15SftsnhdnU+JnhMDaqTfYpHplIIjmkRdGrUrYzUwy
MR4rTSM4wDkInhZEWXLqGf5VNEu9SsaZugymd++BrUQmDLine9yinr5DQ7PUZ24P
7grzYjn4PLOQILBLfT8xTOHWdDsSQTn88kyg6a35Mlf3+APOP7BQt3XP+rZI0+9+
W/t12Y02uRqUTYJy4Mu03dpkw7k3QmcfVunMR7Lueo3NbEcwRaBkWPmrk/clRW85
taoAsObiUZl4KR/MtrgrBfoX24iYhSJU+9lyQ0ScdZqznB3+I7Kqxm9lpXAeM6u8
nVh9+zcue72orEj1jPYv3gnEON0sSft0DmdZvehqi2oOXGkgCLKQ0Pppnnxb7qPz
Urrofurq7fbYy8QSNgstcy/1lSaM50kqgqM1N2Tg8L4laNoCmiO41V9NEf/QdQ+2
SlmxrSIZYgMSpJ3uGFvv
=opDU
-----END PGP SIGNATURE-----




Bug archived. Request was from Debbugs Internal Request <owner@bugs.debian.org> to internal_control@bugs.debian.org. (Mon, 03 Jun 2013 08:51:31 GMT)


Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Thu Apr 17 07:09:18 2014; Machine Name: buxtehude.debian.org

Debian Bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.