Debian Bug report logs - #128818
WISH: apt-get update using rsync protocol

Package: apt; Maintainer for apt is APT Development Team <deity@lists.debian.org>; Source for apt is src:apt.

Reported by: Radim Kolar <hsn@cybermail.net>

Date: Fri, 11 Jan 2002 22:03:08 UTC

Severity: wishlist

Merged with 213551

Found in version 0.5.4

Fixed in version 0.6.44

Done: "Eugene V. Lyubimkin" <jackyf.devel@gmail.com>

Bug is archived. No further changes may be made.


Report forwarded to APT Development Team <deity@lists.debian.org>:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Radim Kolar <hsn@cybermail.net>:
New Bug report received and forwarded. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #5 received at maintonly@bugs.debian.org (full text, mbox):

From: Radim Kolar <hsn@cybermail.net>
To: Debian Bug Tracking System <maintonly@bugs.debian.org>
Subject: WISH: apt-get update using rsync protocol
Date: Fri, 11 Jan 2002 22:32:19 +0100
Package: apt
Version: 0.5.4
Severity: wishlist

My wish is for apt to (optionally) support an rsync method when updating
the lists of available packages. This would cut down the amount of data
transferred considerably.


-- System Information
Debian Release: 3.0
Architecture: i386
Kernel: Linux home 2.2.20 #1 Sat Nov 17 12:08:35 CET 2001 i586
Locale: LANG=C, LC_CTYPE=C

Versions of packages apt depends on:
ii  libc6                  2.2.4-5           GNU C Library: Shared libraries an
ii  libstdc++2.10-glibc2.2 1:2.95.4-0.011006 The GNU stdc++ library




Information forwarded to APT Development Team <deity@lists.debian.org>:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Radim Kolar <hsn@cybermail.net>:
Extra info received and forwarded to maintainer. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #10 received at 128818-maintonly@bugs.debian.org (full text, mbox):

From: Radim Kolar <hsn@cybermail.net>
To: 128818-maintonly@bugs.debian.org
Subject: apt-get update using rsync protocol
Date: Wed, 16 Jan 2002 13:25:57 +0100
Hi, 
 here is my code for doing that:

#! /usr/bin/python
#  GPL v2 code. Radim Kolar
import os

SOURCES='/etc/apt/sources.list'
LISTS='/var/lib/apt/lists/'
ARCH='i386'
RSYNCOPT='-t -b'

# read sources.list
f=open(SOURCES)
lines=f.readlines()
f.close()

for line in lines:
    data=line.split()
    if len(data)==0:
        continue # skip empty lines
    if data[0][0:1]=='#':
        continue # and comments
    if data[0]!='deb':
        continue # we are not interested in other entry types either
    if data[1][0:5]=='cdrom':
        continue # no CD
    i=data[1].find("://")
    if i==-1: continue
    basefn=data[1][i+3:]

    # data[2] is the distribution, data[3:] are the parts
    basefn=basefn+'/dists/'+data[2]+'/'
    for part in data[3:]:
        fn=basefn+part
        fn=fn+"/binary-"+ARCH+"/Packages"
        # the local list file is the remote path with '/' replaced by '_'
        COMMAND='rsync '+RSYNCOPT+' rsync://'+fn+' '+LISTS+fn.replace('/','_')
        print(COMMAND)
        os.system(COMMAND)

os.system('apt-cache gencaches')



Information forwarded to APT Development Team <deity@lists.debian.org>, apt@packages.qa.debian.org:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Radim Kolar <hsn@cybermail.net>:
Extra info received and forwarded to maintainer. Copy sent to APT Development Team <deity@lists.debian.org>, apt@packages.qa.debian.org. Full text and rfc822 format available.

Message #15 received at 128818-maintonly@bugs.debian.org (full text, mbox):

From: Radim Kolar <hsn@cybermail.net>
To: 128818-maintonly@bugs.debian.org
Subject: Re: apt-get update using rsync protocol
Date: Wed, 20 Feb 2002 21:20:05 +0100
I have posted an updated version to my homepage. It supports locking

http://home.worldonline.cz/~cz210552/aptrsync.html



Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>, apt@packages.qa.debian.org:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to tim@fungible.com (Tim Freeman):
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>, apt@packages.qa.debian.org. Full text and rfc822 format available.

Message #20 received at 128818@bugs.debian.org (full text, mbox):

From: tim@fungible.com (Tim Freeman)
To: 128818@bugs.debian.org
Subject: apt-rsync is great
Date: Mon, 8 Apr 2002 09:28:33 -0700
From: Radim Kolar <hsn@cybermail.net>
>I have posted an updated version to my homepage. It supports locking
>
>http://home.worldonline.cz/~cz210552/aptrsync.html

I'm using his code, and it works quite well.  If you could get
everyone to use it, you'd save time for your users and you'd save
bandwidth at your servers.

--
Tim Freeman       
tim@fungible.com



Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>, apt@packages.qa.debian.org:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Jason Gunthorpe <jgg@debian.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>, apt@packages.qa.debian.org. Full text and rfc822 format available.

Message #25 received at 128818@bugs.debian.org (full text, mbox):

From: Jason Gunthorpe <jgg@debian.org>
To: Tim Freeman <tim@fungible.com>, 128818@bugs.debian.org, hsn@cybermail.net
Cc: debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>, apt@packages.qa.debian.org
Subject: Re: Bug#128818: apt-rsync is great
Date: Mon, 8 Apr 2002 12:53:15 -0600 (MDT)
On Mon, 8 Apr 2002, Tim Freeman wrote:

> From: Radim Kolar <hsn@cybermail.net>
> >I have posted an updated version to my homepage. It supports locking
> >
> >http://home.worldonline.cz/~cz210552/aptrsync.html
> 
> I'm using his code, and it works quite well.  If you could get
> everyone to use it, you'd save time for your users and you'd save
> bandwidth at your servers.

Hmm.

Actually, if everyone started using it I'd deny anon rsync access to our
servers.

I'm sure its existence is already disrupting our mirroring process <sigh>

Jason




Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>, apt@packages.qa.debian.org:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to tim@fungible.com (Tim Freeman):
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>, apt@packages.qa.debian.org. Full text and rfc822 format available.

Message #30 received at 128818@bugs.debian.org (full text, mbox):

From: tim@fungible.com (Tim Freeman)
To: jgg@debian.org
Cc: 128818@bugs.debian.org, hsn@cybermail.net, debian-bugs-dist@lists.debian.org, deity@lists.debian.org, apt@packages.qa.debian.org
Subject: Re: Bug#128818: apt-rsync is great
Date: Mon, 8 Apr 2002 13:19:28 -0700
From: Jason Gunthorpe <jgg@debian.org>
>Actually, if everyone started using it I'd deny anon rsync access to our
>servers.

From the way you present that, it looks like an arbitrary purposeless
decision.  Would you care to explain your motives?

>I'm sure its existence is already disrupting our mirroring process <sigh>

I don't see how doing an rsync against a server would be more
disruptive than doing an HTTP query against the same server,
especially when the rsync transfers many fewer bytes.  How could this
be disrupting the mirroring process?

Each time your server is hit by a user of dselect, in the present
scheme an entire file is copied (which may be gzipped) or in the
proposed new scheme an rsync will be done (on an uncompressed file, to
make comparisons efficient).  The files don't change all that fast, so
an rsync will generally be more efficient than copying the whole file.
Why is this a bad thing?

-- 
Tim Freeman       
tim@fungible.com



Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>, apt@packages.qa.debian.org:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Jason Gunthorpe <jgg@debian.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>, apt@packages.qa.debian.org. Full text and rfc822 format available.

Message #35 received at 128818@bugs.debian.org (full text, mbox):

From: Jason Gunthorpe <jgg@debian.org>
To: Tim Freeman <tim@fungible.com>
Cc: 128818@bugs.debian.org, hsn@cybermail.net, Deity Creation Team <deity@lists.debian.org>
Subject: Re: Bug#128818: apt-rsync is great
Date: Mon, 8 Apr 2002 16:42:49 -0600 (MDT)
On Mon, 8 Apr 2002, Tim Freeman wrote:

> >Actually, if everyone started using it I'd deny anon rsync access to our
> >servers.
> 
> From the way you present that, it looks like an arbitrary purposeless
> decision.  Would you care to explain your motives?

I think the folks on -devel have gone over it enough, rsync is extremely
resource intensive, we tend to have a 10 connection limit for our rsyncd's
to prevent DOS'ing the box.

> I don't see how doing an rsync against a server would be more
> disruptive than doing an HTTP query against the same server,
> especially when the rsync transfers many fewer bytes.  How could this
> be disrupting the mirroring process?

Quite simply, rsync uses tremendous amounts of disk IO, it reads the
entire file on the server side and does lots of math on it, http on the
other hand is intrinsically rate limited by the requester.

Mirrors have to use rsync because it is the only thing that can reliably
and efficiently (well, to some degree anyhow) mirror the archive. But they
are often just anon users, so having slow modem folks taking up rsync
slots will deny them access.

Our boxes have excessive network bandwidth, so it is actually far better
to have people download the whole file than try to support rsync.
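
To illustrate the cost described above: an rsync sender has to slide a
checksum window over every byte offset of its copy of the file, which is
why the whole file gets read. The sketch below is a simplified Python
version of the weak rolling checksum from the rsync technical report. The
function names are made up for illustration, and real rsync additionally
computes a strong hash for every candidate match, so treat this as a
rough picture rather than what rsyncd actually runs.

```python
M = 1 << 16  # both checksum halves are kept modulo 2^16


def weak_checksum(block):
    """Compute the two-part weak checksum of a block directly."""
    n = len(block)
    a = sum(block) % M
    # Each byte is weighted by its distance from the end of the block.
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte to the right in constant time."""
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b


def scan(data, block_len):
    """Yield (offset, checksum) for every window; every byte is touched."""
    data = bytearray(data)
    a, b = weak_checksum(data[:block_len])
    yield 0, (a, b)
    for off in range(1, len(data) - block_len + 1):
        a, b = roll(a, b, data[off - 1], data[off + block_len - 1], block_len)
        yield off, (a, b)
```

Even with the constant-time update, the scan is linear in the size of the
file for every client request, whereas a plain HTTP download needs no
per-request computation on the file contents.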

Jason




Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>, apt@packages.qa.debian.org:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to tim@fungible.com (Tim Freeman):
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>, apt@packages.qa.debian.org. Full text and rfc822 format available.

Message #40 received at 128818@bugs.debian.org (full text, mbox):

From: tim@fungible.com (Tim Freeman)
To: jgg@debian.org
Cc: 128818@bugs.debian.org, hsn@cybermail.net, deity@lists.debian.org
Subject: Re: Bug#128818: apt-rsync is great
Date: Mon, 8 Apr 2002 19:27:59 -0700
From: Jason Gunthorpe <jgg@debian.org>
>I think the folks on -devel have gone over it enough, 

For the benefit of other readers, the conversation you're talking
about probably starts at:

   http://lists.debian.org/deity/1999/deity-199910/msg00002.html

and continues on the rsync list at:

   http://lists.samba.org/pipermail/rsync/1999-October/001403.html

>Quite simply, rsync uses tremendous amounts of disk IO, it reads the
>entire file on the server side and does lots of math on it, http on the
>other hand is intrinsically rate limited by the requester.

I see.  rsync has a "batch" mode which might resolve some of the
issues.  It's described as "experimental" in the man entry for version
2.5.4-1 of the rsync package, which is the one in testing right now,
so maybe it makes sense not to depend on it yet.  The discussion cited
above is about 2.5 years old and doesn't mention batch mode at all,
perhaps because rsync's batch mode didn't exist then.

The as-yet-nonexistent compressor that is rsync-friendly would be
required to make a good solution for the whole problem.

However, I'm satisfied that building a version of apt-rsync that is
friendly to the server is blocked on the development of this other
software, so it's time to set this issue aside.

-- 
Tim Freeman       
tim@fungible.com



Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>, apt@packages.qa.debian.org:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Yann Dirson <ydirson@fr.alcove.com>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>, apt@packages.qa.debian.org. Full text and rfc822 format available.

Message #45 received at 128818@bugs.debian.org (full text, mbox):

From: Yann Dirson <ydirson@fr.alcove.com>
To: 128818@bugs.debian.org
Subject: Re: Bug#128818: apt-rsync is great
Date: Tue, 23 Apr 2002 17:24:55 +0200
Another option would be to provide xproxy-based access to the Packages
file.  Xproxy server stores file history in an xdfs volume (see
xdelta2), and only sends the delta to an xproxy client, to which one
can connect as an HTTP 1.1 proxy.

The xproxy binary is in the xdelta2 package currently in sid.  I have
split it into a separate package, xproxy-http, with startup scripts,
docs, etc., but the split package is still waiting in queue/new.

Note that AFAIK it still lacks purging of very old versions from the
xdfs volume; this has to be done manually.

HTH,
-- 
Yann Dirson <Yann.Dirson@fr.alcove.com>                 http://www.alcove.com/
Technical support manager                Responsable de l'assistance technique
Senior Free-Software Consultant          Consultant senior en Logiciels Libres
Debian developer (dirson@debian.org)                        Développeur Debian



Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>, apt@packages.qa.debian.org:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Martin Pool <mbp@samba.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>, apt@packages.qa.debian.org. Full text and rfc822 format available.

Message #50 received at 128818@bugs.debian.org (full text, mbox):

From: Martin Pool <mbp@samba.org>
To: Debian Bug Tracking System <128818@bugs.debian.org>
Subject: apt: paper about apt/rsync
Date: Tue, 14 May 2002 10:29:14 +1000
Package: apt
Version: 0.5.4
Followup-For: Bug #128818

A paper about apt, rsync, and Debian is available here: 

  http://rsync.samba.org/rsync-and-debian/

It's certainly not the last word on the topic.  Patches welcome.

-- 
Martin




Merged 128818 213551. Request was from Matt Zimmerman <mdz@debian.org> to control@bugs.debian.org. Full text and rfc822 format available.

Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to meldroc@frii.com:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #57 received at 128818@bugs.debian.org (full text, mbox):

From: Doug Holland <meldroc@frii.com>
To: 128818@bugs.debian.org
Subject: Any way to get this working?
Date: Fri, 10 Oct 2003 14:02:34 -0600
I've just been referred here after suggesting that apt allow users to download
diffs in place of full debs and Packages.gz files, and have read Martin Pool's
paper.  For those of us (me included :( ) who are stuck on 57.6 dialup
connections, even a little delta compression would be helpful, say on the
Packages files.  For example, doing an apt-get update downloads more than 3
megs of Packages and Sources files.  These files are simple ASCII lists of
package information, which would benefit greatly from delta compression, even
if it only meant running diff on the uncompressed Packages files and making
Packages.diff.gz files available.




Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Matt Zimmerman <mdz@debian.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #62 received at 128818@bugs.debian.org (full text, mbox):

From: Matt Zimmerman <mdz@debian.org>
To: meldroc@frii.com, 128818@bugs.debian.org
Subject: Re: Bug#128818: Any way to get this working?
Date: Fri, 10 Oct 2003 18:00:14 -0400
On Fri, Oct 10, 2003 at 02:02:34PM -0600, Doug Holland wrote:

> I've just been referred here after suggesting that apt allow users to download
> diffs in place of full debs and Packages.gz files, and have read Martin Pool's
> paper.  For those of us (me included :( ) who are stuck on 57.6 dialup
> connections, even a little delta compression would be helpful, say on the
> Packages files.  For example, doing an apt-get update downloads more than 3
> megs of Packages and Sources files.  These files are simple ASCII lists of
> package information, which would benefit greatly from delta compression, even
> if it only meant running diff on the uncompressed Packages files and making
> Packages.diff.gz files available.

The way to get this working is to review some of the numerous times when
this has come up before, and develop a solution which addresses the concerns
which have been raised (most of which involve the archive mirror network and
ftpmasters, so I am not the best person to ask).

-- 
 - mdz



Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Michael Vogt <mvogt@acm.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #67 received at 128818@bugs.debian.org (full text, mbox):

From: Michael Vogt <mvogt@acm.org>
To: 128818@bugs.debian.org
Cc: Gustavo Niemeyer <niemeyer@conectiva.com>
Subject: [patch] packages.gz diff support for apt
Date: Thu, 18 Nov 2004 20:39:37 +0100
[Message part 1 (text/plain, inline)]
Hi,

attached is the first version of a patch that will enable diff files
for the index files (Packages.gz, Sources.gz). It's basically an
implementation of the ideas of
http://lists.debian.org/debian-devel/2002/04/msg00502.html and
http://azure.humbug.org.au/~aj/blog/2003/12/02#2003-12-02-pdiffs

Patches for the package file are generated like this:
"diff Packages-23-08-2004 Packages-24-08-2004 | gzip -c >      \
 Packages_diff_`md5sum Packages-23-08-2004|awk '{print $1}'`.gz"

The code will download until it finds an empty patch, it assumes then
that the index is now up-to-date and stops. If it does not find a
patch it will auto-fallback to Packages.bz2 and then to
Packages.gz. The code is diffed against the arch repository at:
http://people.debian.org/~mdz/arch/apt@packages.debian.org
(apt@packages.debian.org/apt--main--0) 
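
For illustration, the download loop described above might look roughly
like this in Python 3. This is only a sketch and not the code from the
attached patch: `fetch` and `apply_patch` are hypothetical callables
supplied by the caller (`fetch` returns the gzipped patch bytes, or None
when the server has no such file), and the `max_steps` guard is an added
assumption. Only the `Packages_diff_<md5sum>.gz` naming comes from the
description above.

```python
import gzip
import hashlib


def patch_name(content):
    """Name of the patch that applies to a Packages file with this content."""
    return "Packages_diff_%s.gz" % hashlib.md5(content).hexdigest()


def follow_patch_chain(current, fetch, apply_patch, max_steps=50):
    """Apply patches until an empty one is found.

    Returns (content, up_to_date). When up_to_date is False the caller
    should fall back to downloading Packages.bz2 or Packages.gz.
    """
    for _ in range(max_steps):
        blob = fetch(patch_name(current))
        if blob is None:
            # No patch for this version: signal the fallback.
            return current, False
        patch = gzip.decompress(blob)
        if not patch:
            # An empty patch marks the end of the chain.
            return current, True
        current = apply_patch(current, patch)
    return current, False
```

Each pass costs one request to the server, so a client that is N patches
behind makes N+1 requests before it sees the empty patch.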

It's up to the people generating the diffs how many they want to
provide. A number like 10-20 sounds reasonable to me; that means 10-20
days in unstable.

I would love to get feedback from the apt upstream people (like Jason,
Gustavo, Matt). I wonder if this should be implemented differently
(like with a "patch" method). 

thanks,
 Michael

-- 
The first rule of holes is: when you find yourself in one, stop digging. - PJ
Linux is not The Answer. Yes is the answer. Linux is The Question. - Neo
[apt-incremental-package-diffs4 (text/plain, attachment)]

Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Michael Vogt <mvogt@acm.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #72 received at 128818@bugs.debian.org (full text, mbox):

From: Michael Vogt <mvogt@acm.org>
To: 128818@bugs.debian.org
Subject: Re: Bug#128818: [patch] packages.gz diff support for apt
Date: Fri, 19 Nov 2004 22:27:38 +0100
On Thu, Nov 18, 2004 at 08:39:37PM +0100, Michael Vogt wrote:
> Hi,
> 
> attached is the first version of a patch that will enable diff files
> for the index files (Packages.gz, Sources.gz). 

FYI, I put an up-to-date version of the patch online at:
http://people.ubuntulinux.org/~mvo/pdiffs/apt-incremental-package-diffs7 

(against the arch archive at
http://people.debian.org/~mdz/arch/apt@packages.debian.org)

It contains some fixes for the previous patch.

> Patches for the package file are generated like this:
> "diff Packages-23-08-2004 Packages-24-08-2004 | gzip -c >      \
>  Packages_diff_`md5sum Packages-23-08-2004|awk '{print $1}'`.gz"

Please note that an empty diff file:
gzip -c > Packages_diff_`md5sum Packages|awk '{print $1}'`.gz < /dev/null
marks the last patch.

thanks,
 Michael

-- 
The first rule of holes is: when you find yourself in one, stop digging. - PJ
Linux is not The Answer. Yes is the answer. Linux is The Question. - Neo



Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Anthony Towns <aj@azure.humbug.org.au>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #77 received at 128818@bugs.debian.org (full text, mbox):

From: Anthony Towns <aj@azure.humbug.org.au>
To: Michael Vogt <mvogt@acm.org>, 128818@bugs.debian.org
Subject: Re: Bug#128818: [patch] packages.gz diff support for apt
Date: Wed, 24 Nov 2004 04:49:34 +1000
[Message part 1 (text/plain, inline)]
Michael wrote:
> The code will download until it finds an empty patch, it assumes then
> that the index is now up-to-date and stops. If it does not find a
> patch it will auto-fallback to Packages.bz2 and then to
> Packages.gz. The code is diffed against the arch repository at:
> http://people.debian.org/~mdz/arch/apt@packages.debian.org
> (apt@packages.debian.org/apt--main--0)

FWIW, what I was considering last I looked at this (Dec 2003 
apparently...) was a combination of an index file and gzipped --ed 
diffs. The index file gives you a bit more control over your patches, 
and some redundancy so you can check if you've gotten everything screwed 
up; --ed style diffs happen to kick ass for this problem.

So the index file I was imagining looked like:

Canonical-Name: netstat.txt
MD5-History:
 e43a9e356e65b79bde65ea8794594d9b    1934 2003-12-01-1259.10
 68ad5015da0dbd75b83ae03ea68c0fbd    1934 2003-12-01-1259.44
 51d331edc38ba522a1c95002e6ee91c9    2096 2003-12-01-1300.19
 11d476ccadd18072dfbb6d6907274b8b    1853 2003-12-01-1300.53
 ae1076d482f0376ee86c2ee9c6342fd4    1691 2003-12-01-1301.27
 756f08019c209eea1cc1fe0497ebd2f7    1691 2003-12-01-1302.01
MD5-Patches:
 1968c0ddf9761d0e6a8b1fa8766b32c8     882 2003-12-01-1259.10
 f3d6619a17d3065dee83bb3b6e328453     797 2003-12-01-1259.44
 5f2791687760176a6d243f4da0f6757b     468 2003-12-01-1300.19
 fc22e01fc575d8f9dbb4d4cd1ef1fb2d     468 2003-12-01-1300.53
 b75d4c0b33d2a76284ed86c395a60192     461 2003-12-01-1301.27
 b4e2aa24bda367acc9f83740840e5bc1     461 2003-12-01-1302.01

The History section tells you what the original file you're patching 
from was, and the Patches section lets you validate the patch you're 
about to apply. Knowing the md5sum/size of the original file is 
obviously crucial, since that's how you know what patch to apply. 
Knowing the md5sum/size of what you're going to end up with is a useful 
sanity check, so that you can stop halfway through if you've somehow 
managed to get yourself into a loop or similar. Knowing the md5sum of 
the patches is useful just in case diff has a root exploit. Knowing the 
size of the patches you need to download is good for progress bars. 
Knowing the date of the resulting Packages file you're going to create 
at each step is useful for debugging -- while you might expect daily 
patches for testing/unstable, they'll come at much more irregular 
intervals for stable or security updates.
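
As a rough sketch of how a client could use such an index, the Python
below parses the format shown above and works out which patches are still
needed for a given local file. The function names are hypothetical, and
the sketch assumes that patches are applied one after another starting
from the History entry that matches the local file.

```python
def parse_index(text):
    """Return (canonical_name, history, patches) from an index file.

    history and patches are lists of (md5, size, timestamp) tuples.
    """
    name = None
    sections = {"MD5-History:": [], "MD5-Patches:": []}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if line.startswith(" ") and current is not None and len(fields) == 3:
            current.append((fields[0], int(fields[1]), fields[2]))
        elif fields and fields[0] == "Canonical-Name:":
            name = fields[1]
            current = None
        elif fields and fields[0] in sections:
            current = sections[fields[0]]
        else:
            current = None
    return name, sections["MD5-History:"], sections["MD5-Patches:"]


def patches_needed(history, patches, local_md5, local_size):
    """List the patches to apply, oldest first, or None if unknown."""
    by_stamp = {stamp: (md5, size, stamp) for md5, size, stamp in patches}
    for i, (md5, size, _stamp) in enumerate(history):
        if (md5, size) == (local_md5, local_size):
            return [by_stamp[s] for _m, _z, s in history[i:]]
    # The local file is not in the history: download the full file.
    return None
```

Summing the sizes of the returned patches gives the total download size
up front, which is what makes the progress bar mentioned above possible.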

The attached "update.py" is a python script that when invoked as:

	./update.py Index file file.prev

will generate an --ed style diff and update the Index in the format 
listed above. It'll also limit the number of patches to 14, deleting any 
that are too far out of date.

The above example was generated by something like:

	while : ; do
		cat netstat.txt > orignetstat.txt
		netstat > netstat.txt
		./update.py index.txt netstat.txt orignetstat.txt
		sleep 30
	done

Cheers,
aj
[update.py (text/plain, inline)]
#!/usr/bin/env python

import datetime, sys, os
import apt_pkg

class Updates:
    def __init__(self, readme = None):
        self.can_name = None
        self.history = {}
        self.max = 14

        if readme:
            f = open(readme)
            x = f.readline()

            # read the indented "md5 size timestamp" lines of one section;
            # ind is 0 for MD5-History and 1 for MD5-Patches
            def read_md5s(ind, x=x):
                while 1:
                    x = f.readline()
                    if not x or x[0] != " ": break
                    l = x.split()
                    if l[2] not in self.history:
                        self.history[l[2]] = [None,None]
                    self.history[l[2]][ind] = (l[0], int(l[1]))
                return x

            while x:
                l = x.split()

                if len(l) == 0:
                    x = f.readline()
                    continue

                if l[0] == "Canonical-Name:":
                    self.can_name = l[1]
                    x = f.readline()
                    continue

                if l[0] == "MD5-History:":
                    x = read_md5s(0)
                    continue

                if l[0] == "MD5-Patches:":
                    x = read_md5s(1)
                    continue

                x = f.readline()

            f.close()

    def dump(self, out=sys.stdout):
        out.write("Canonical-Name: %s\n" % (self.can_name))
        hs = self.history
        l = sorted(self.history.keys())

        # keep only the newest self.max patches, deleting older diff files
        cnt = len(l)
        if cnt > self.max:
            for h in l[:cnt-self.max]:
                os.unlink("%s.diff" % (h))
                del hs[h]
            l = l[cnt-self.max:]

        out.write("MD5-History:\n")
        for h in l:
            out.write(" %s %7d %s\n" % (hs[h][0][0], hs[h][0][1], h))
        out.write("MD5-Patches:\n")
        for h in l:
            out.write(" %s %7d %s\n" % (hs[h][1][0], hs[h][1][1], h))


format = "%Y-%m-%d-%H%M.%S"
now = datetime.datetime.utcnow().strftime(format)
(outfile, newfile, oldfile) = sys.argv[1:4]

tmpfile = oldfile + ".tmp"
difffile = now + ".diff"

upd = Updates(outfile)

os.link(newfile, tmpfile)

# return (md5sum, size) for the named file
def sizemd5(fn):
    size = os.stat(fn)[6]
    f = open(fn)
    md5sum = apt_pkg.md5sum(f)
    f.close()
    return (md5sum, size)

oldsizemd5 = sizemd5(oldfile)
newsizemd5 = sizemd5(tmpfile)

if newsizemd5 == oldsizemd5:
    # nothing changed, so there is no patch to record
    os.unlink(tmpfile)
else:
    os.system("diff --ed %s %s > %s" % (oldfile, tmpfile, difffile))
    difsizemd5 = sizemd5(difffile)

    upd.history[now] = (oldsizemd5, difsizemd5)

    os.rename(tmpfile, oldfile)

    f = open(outfile, "w")
    upd.dump(f)
    f.close()
[signature.asc (application/pgp-signature, attachment)]

Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Jeroen van Wolffelaar <jeroen@wolffelaar.nl>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #82 received at 128818@bugs.debian.org (full text, mbox):

From: Jeroen van Wolffelaar <jeroen@wolffelaar.nl>
To: Michael Vogt <mvogt@acm.org>
Cc: 128818@bugs.debian.org, Gustavo Niemeyer <niemeyer@conectiva.com>, aba@debian.org, ajt@debian.org
Subject: Re: [patch] packages.gz diff support for apt
Date: Tue, 23 Nov 2004 20:17:43 +0100
On Thu, Nov 18, 2004 at 08:39:37PM +0100, Michael Vogt wrote:
> Patches for the package file are generated like this:
> "diff Packages-23-08-2004 Packages-24-08-2004 | gzip -c >      \
>  Packages_diff_`md5sum Packages-23-08-2004|awk '{print $1}'`.gz"
> 
> The code will download until it finds an empty patch, it assumes then
> that the index is now up-to-date and stops. If it does not find a
> patch it will auto-fallback to Packages.bz2 and then to
> Packages.gz. The code is diffed against the arch repository at:
> http://people.debian.org/~mdz/arch/apt@packages.debian.org
> (apt@packages.debian.org/apt--main--0) 

This sounds like a good and easy solution to me. However, it does
require N+1 iterations of downloading a patch, applying it, md5summing,
and again polling the webserver.

This is avoidable:
Every time a new Packages file becomes available, calculate diff -e (ed
script) of the old and today's Packages file. As a bonus, this is also
smaller since deletions are simple ranges, and not included in the
'patch'.

Then, to each existing Packages.<md5sum>.diff.gz, append the newly
calculated ed script. This results in a new ed script that transforms
the Packages file whose md5sum is <md5sum> into the most current one.

Applying the ed script in-place goes like this:
$ ( zcat $patch ; echo w ) | ed Packages

The md5sum can (should?) be checked afterwards of course; there are multiple
possibilities here. You could append the md5sum as an ed comment in the ed
script, for example, which will not make apt do any additional download
-- this way, a Packages file update requires exactly one download, and
two md5sum calculations client-side (one before to determine the filename,
one after to verify). Analogous to the original suggestion, you can add
an empty ed script for the current md5sum to cater for people running
apt-get update while already up to date -- though this is no longer
required to signal the last ed script, as all ed scripts will transform
into the newest Packages file.

Downside is of course a little bit more wasted diskspace server-side,
but on the upside, a much faster round-trip time for clients.
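
To show why appending ed scripts yields a single cumulative patch, here
is a minimal Python sketch of an applier for the subset of ed commands
that `diff -e` emits (append, change, delete). The function name is made
up, and the sketch ignores the escaping `diff -e` uses for text lines
consisting of a single dot, so it is an illustration rather than a
replacement for ed.

```python
import re

COMMAND = re.compile(r"^(\d+)(?:,(\d+))?([acd])$")


def apply_ed_script(lines, script):
    """Apply an ed script (as produced by diff -e) to a list of lines."""
    out = list(lines)
    cmds = script.splitlines()
    i = 0
    while i < len(cmds):
        m = COMMAND.match(cmds[i])
        if m is None:
            raise ValueError("unsupported ed command: %r" % cmds[i])
        start = int(m.group(1))
        end = int(m.group(2) or start)
        op = m.group(3)
        i += 1
        text = []
        if op in "ac":
            # Collect the text block, which ends with a lone ".".
            while cmds[i] != ".":
                text.append(cmds[i])
                i += 1
            i += 1
        if op == "a":
            out[start:start] = text      # insert after line `start`
        elif op == "c":
            out[start - 1:end] = text    # replace lines start..end
        else:
            out[start - 1:end] = []      # delete lines start..end
    return out
```

Because the commands are applied strictly in order, running the script
for version 1 to 2 followed by the script for version 2 to 3 is the same
as running their concatenation, which is the property the scheme above
relies on.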

--Jeroen

-- 
Jeroen van Wolffelaar
jeroen@wolffelaar.nl
http://jeroen.A-Eskwadraat.nl



Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Matt Zimmerman <mdz@debian.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #87 received at 128818@bugs.debian.org (full text, mbox):

From: Matt Zimmerman <mdz@debian.org>
To: Anthony Towns <aj@azure.humbug.org.au>, 128818@bugs.debian.org
Cc: Michael Vogt <mvogt@acm.org>
Subject: Re: Bug#128818: [patch] packages.gz diff support for apt
Date: Tue, 23 Nov 2004 11:40:12 -0800
On Wed, Nov 24, 2004 at 04:49:34AM +1000, Anthony Towns wrote:

> Michael wrote:
> > The code will download until it finds an empty patch, it assumes then
> > that the index is now up-to-date and stops. If it does not find a
> > patch it will auto-fallback to Packages.bz2 and then to
> > Packages.gz. The code is diffed against the arch repository at:
> > http://people.debian.org/~mdz/arch/apt@packages.debian.org
> > (apt@packages.debian.org/apt--main--0)
> 
> FWIW, what I was considering last I looked at this (Dec 2003 
> apparently...) was a combination of an index file and gzipped --ed 
> diffs. The index file gives you a bit more control over your patches, 
> and some redundancy so you can check if you've gotten everything screwed 
> up; --ed style diffs happen to kick ass for this problem.
> 
> So the index file I was imagining looked like:
> 
> Canonical-Name: netstat.txt
> MD5-History:
>  e43a9e356e65b79bde65ea8794594d9b    1934 2003-12-01-1259.10
>  68ad5015da0dbd75b83ae03ea68c0fbd    1934 2003-12-01-1259.44
>  51d331edc38ba522a1c95002e6ee91c9    2096 2003-12-01-1300.19
>  11d476ccadd18072dfbb6d6907274b8b    1853 2003-12-01-1300.53
>  ae1076d482f0376ee86c2ee9c6342fd4    1691 2003-12-01-1301.27
>  756f08019c209eea1cc1fe0497ebd2f7    1691 2003-12-01-1302.01
> MD5-Patches:
>  1968c0ddf9761d0e6a8b1fa8766b32c8     882 2003-12-01-1259.10
>  f3d6619a17d3065dee83bb3b6e328453     797 2003-12-01-1259.44
>  5f2791687760176a6d243f4da0f6757b     468 2003-12-01-1300.19
>  fc22e01fc575d8f9dbb4d4cd1ef1fb2d     468 2003-12-01-1300.53
>  b75d4c0b33d2a76284ed86c395a60192     461 2003-12-01-1301.27
>  b4e2aa24bda367acc9f83740840e5bc1     461 2003-12-01-1302.01

I like this approach, and I think it's general enough to be useful beyond
apt.  Maintaining an index like this avoids the problem of poking around
looking for the deltas when they're remote.  For that reason, I think it
should be implemented separately, and then we can write an apt method to use
it.

Given a simple Python library doing the retrieval, I think this could be
folded into apt in a fairly straightforward way.  The client side would need
a cache of old files to work with, and a URL for the new file to retrieve,
and it could handle the delta stuff transparently.
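
A minimal sketch of how a client library might read the index format quoted
above, in Python since that is the language suggested here. It assumes that
the n-th MD5-Patches entry is the patch that upgrades the file described by
the n-th MD5-History entry; that pairing, and the function names
parse_index and patches_needed, are illustrative assumptions and not part of
any existing apt code.

```python
def parse_index(text):
    """Parse 'Canonical-Name' plus the MD5-History / MD5-Patches sections."""
    index = {"name": None, "history": [], "patches": []}
    section = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(" ") and section:
            # Entry line: "<md5> <size> <timestamp>"
            md5, size, stamp = line.split()
            index[section].append((md5, int(size), stamp))
        elif line.startswith("Canonical-Name:"):
            index["name"] = line.split(":", 1)[1].strip()
        elif line.startswith("MD5-History:"):
            section = "history"
        elif line.startswith("MD5-Patches:"):
            section = "patches"
    return index


def patches_needed(index, local_md5):
    """Return the patch entries to apply, oldest first, or None if unknown."""
    known = [md5 for md5, _size, _stamp in index["history"]]
    if local_md5 not in known:
        return None  # local file is not in the history: fetch the full file
    # Assumption: patch i upgrades history entry i to the next version.
    return index["patches"][known.index(local_md5):]
```

With the index in hand, the client knows the number and total size of the
patches before it downloads any of them, and a None result tells it to fetch
the complete file instead.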

-- 
 - mdz



Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Michael Vogt <mvogt@acm.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #92 received at 128818@bugs.debian.org (full text, mbox):

From: Michael Vogt <mvogt@acm.org>
To: Anthony Towns <aj@azure.humbug.org.au>
Cc: 128818@bugs.debian.org
Subject: Re: Bug#128818: [patch] packages.gz diff support for apt
Date: Wed, 24 Nov 2004 23:18:44 +0100
On Wed, Nov 24, 2004 at 04:49:34AM +1000, Anthony Towns wrote:
> Michael wrote:
> > The code will download until it finds an empty patch, it assumes then
> > that the index is now up-to-date and stops. If it does not find a
> > patch it will auto-fallback to Packages.bz2 and then to
> > Packages.gz. The code is diffed against the arch repository at:
> > http://people.debian.org/~mdz/arch/apt@packages.debian.org
> > (apt@packages.debian.org/apt--main--0)
> 
> FWIW, what I was considering last I looked at this (Dec 2003 
> apparently...) was a combination of an index file and gzipped --ed 
> diffs. The index file gives you a bit more control over your patches, 
> and some redundancy so you can check if you've gotten everything screwed 
> up; --ed style diffs happen to kick ass for this problem.

Thanks for your answer. I'm happy about your comments. As I wrote in
the original mail, most of the patch is based on the ideas in your
blog. 

It should be easy enough to modify the code to generate/apply --ed
style diffs. 

I set up a cron job running a simple script at
http://people.debian.org/~mvo/pdiffs to see whether the code is stable
in the real world (still using normal diffs, not --ed style). ed-style
diffs should halve the size of the diffs again :)
 
> So the index file I was imagining looked like:
[..]

While all the information is certainly useful, I wonder if it's all
needed. A problem I see is that with the index file the client still
needs to download a bunch of patches. I wonder if Jeroen van
Wolffelaar's idea of using only one ed-style diff is workable. It would
indeed give much better performance for the client.

Below I outline my thoughts on the index file. I would very much
appreciate your comments. My current feeling is that we may go without
an explicit index file. But I may be wrong here of course.

> The History section tells you what the original file you're patching 
> from was, and the Patches section lets you validate the patch you're 
> about to apply. Knowing the md5sum/size of the original file is 
> obviously crucial, since that's how you know what patch to apply. 

The current approach calculates the md5sum of the local Packages
file. Then it checks if there is a patch on the server that matches
this md5sum. It's just one attempt to download a file. If the file is
not found, it will fall back to the Packages.gz file anyway.
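
A minimal sketch of that lookup, assuming the server names each patch after
the md5sum of the file it applies to. The Packages.<md5sum>.diff.gz naming
follows Jeroen's message; the helper names and the base URL layout are
hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Hex md5sum of a file, read in chunks so large indexes fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def candidate_urls(base_url, local_md5):
    """URLs to try in order: the matching patch, then the full-file fallbacks."""
    return [
        f"{base_url}/Packages.{local_md5}.diff.gz",  # assumed patch naming
        f"{base_url}/Packages.bz2",
        f"{base_url}/Packages.gz",
    ]
```

The client would request the first URL and move on to the next one only when
the server reports that the file does not exist.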

> Knowing the md5sum/size of what you're going to end up with is a useful 
> sanity check, so that you can stop halfway through if you've somehow 
> managed to get yourself into a loop or similar. 

If applying the patch goes wrong for some reason, the next calculated
md5sum will not match any file on the server and the code will fall back
to downloading the Packages.gz file. If patch itself fails, apt will
notice and fall back to downloading the Packages.gz file.

> Knowing the md5sum of the patches is useful just in case diff has a
> root exploit. 

I'm not sure if I understand this correctly. You think that someone
could sneak in a rogue diff to exploit apt?

> Knowing the size of the patches you need to download is good for
> progress bars.

http/ftp will tell us about that and it should already work with the
current patch.

> Knowing the date of the resulting Packages file you're going to
> create at each step is useful for debugging -- while you might
> expect daily patches for testing/unstable, they'll come at much more
> irregular intervals for stable or security updates.

That's indeed useful.

thanks,
 Michael

-- 
The first rule of holes is: when you find yourself in one, stop digging. - PJ
Linux is not The Answer. Yes is the answer. Linux is The Question. - Neo



Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Michael Vogt <mvogt@acm.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #97 received at 128818@bugs.debian.org (full text, mbox):

From: Michael Vogt <mvogt@acm.org>
To: Jeroen van Wolffelaar <jeroen@wolffelaar.nl>
Cc: 128818@bugs.debian.org, Gustavo Niemeyer <niemeyer@conectiva.com>, aba@debian.org, ajt@debian.org
Subject: Re: [patch] packages.gz diff support for apt
Date: Thu, 25 Nov 2004 00:00:05 +0100
On Tue, Nov 23, 2004 at 08:17:43PM +0100, Jeroen van Wolffelaar wrote:
[..]
> This is avoidable:
> Every time a new packages file comes available, calculate diff -e (ed
> script) of the old and today's packages file. As a bonus, this is also
> smaller since deletions are simple ranges, and not included in the
> 'patch'.
> 
> Then, for each existing Packages.<md5sum>.diff.gz, append the thusly
> calculated edscript. It will result in a new edscript that transforms
> the packages file with <md5sum> as md5sum into the most current one.
> 
> Applying the ed script in-place goes like this:
> $ ( zcat $patch ; echo w ) | ed Packages
> 
> md5sum can (should?) be checked afterwards of course, multiple
> possibilities here. You could append the md5sum as ed comment in the ed
> script, for example, which will not make apt do any additional download
> -- this way, a Packages file update requires exactly one download, and
> two md5sum calculations client-side (one before to determine filename,
> one after to verify). 

I very much like the idea of having only one --ed script patch. It
will make the patch simpler, as the looping is no longer needed.
Embedding the final md5sum as an ed comment in the first line sounds
like the way to go to me. We definitely want to check the md5sum
afterwards. If it does not match we can fall back to the Packages.bz2/gz.
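
A minimal sketch of that check in Python, assuming the first line of the ed
script is a comment of the form "# md5sum: <hex>". That comment layout and
the function names are assumptions made for illustration; no format has been
agreed on in this thread.

```python
import hashlib


def expected_md5(ed_script):
    """Read the target md5sum from a leading '# md5sum: <hex>' comment."""
    first_line = ed_script.split("\n", 1)[0]
    prefix = "# md5sum: "  # hypothetical comment layout
    if first_line.startswith(prefix):
        return first_line[len(prefix):].strip()
    return None


def patched_file_is_valid(ed_script, patched_bytes):
    """True only if the script names an md5sum and the patched file matches it."""
    wanted = expected_md5(ed_script)
    return wanted is not None and hashlib.md5(patched_bytes).hexdigest() == wanted
```

A False result is the signal to discard the patched file and download
Packages.bz2 or Packages.gz instead.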



thanks,
 Michael

-- 
The first rule of holes is: when you find yourself in one, stop digging. - PJ
Linux is not The Answer. Yes is the answer. Linux is The Question. - Neo



Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Jeroen van Wolffelaar <jeroen@wolffelaar.nl>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #102 received at 128818@bugs.debian.org (full text, mbox):

From: Jeroen van Wolffelaar <jeroen@wolffelaar.nl>
To: Michael Vogt <mvogt@acm.org>
Cc: Anthony Towns <aj@azure.humbug.org.au>, 128818@bugs.debian.org
Subject: Re: Bug#128818: [patch] packages.gz diff support for apt
Date: Thu, 25 Nov 2004 01:19:49 +0100
On Wed, Nov 24, 2004 at 11:18:44PM +0100, Michael Vogt wrote:
> On Wed, Nov 24, 2004 at 04:49:34AM +1000, Anthony Towns wrote:
> > Knowing the md5sum of the patches is useful just in case diff has a
> > root exploit. 
> 
> I'm not sure if I understand this correctly. You think that someone
> could sneak in a rogue diff to exploit apt?

ed also comes with 'red', which doesn't allow any execution, just buffer
manipulation commands. The subset of ed needed for this application can
also be reimplemented by hand; it is extremely simple (line-indexed
removals and additions).
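
A minimal sketch of such a reimplementation in Python. It handles only the
append (a), change (c) and delete (d) commands that "diff -e" normally emits
and rejects everything else; the name apply_ed_script is hypothetical, and
the workaround diff uses for text lines consisting of a single "." is not
handled.

```python
import re

# "<line>[,<line>]" followed by one of the three supported commands.
COMMAND = re.compile(r"^(\d+)(?:,(\d+))?([acd])$")


def apply_ed_script(lines, script):
    """Apply the a/c/d commands of a 'diff -e' script to a list of lines."""
    lines = list(lines)
    commands = iter(script.splitlines())
    for command in commands:
        if not command or command.startswith("#"):
            continue  # blank line or comment
        match = COMMAND.match(command)
        if match is None:
            raise ValueError(f"unsupported ed command: {command!r}")
        start = int(match.group(1))
        end = int(match.group(2) or start)
        action = match.group(3)
        new_text = []
        if action in "ac":
            # Text block runs until a line holding only "."
            for text in commands:
                if text == ".":
                    break
                new_text.append(text)
        if action == "a":
            lines[start:start] = new_text      # insert after line `start`
        elif action == "c":
            lines[start - 1:end] = new_text    # replace lines start..end
        else:
            lines[start - 1:end] = []          # delete lines start..end
    return lines
```

Because "diff -e" lists its commands from the end of the file towards the
start, the line numbers stay valid while the commands are applied in order.
Anything outside the three commands, including shell escapes and
substitutions, raises an error and never runs.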
 
> > Knowing the date of the resulting Packages file you're going to
> > create at each step is useful for debugging -- while you might
> > expect daily patches for testing/unstable, they'll come at much more
> > irregular intervals for stable or security updates.
> 
> That's indeed useful. 

You could make sure the patch files have the same mtime as the resulting
packages file, and then on the client side, you touch the result to the
date that the http/ftp protocol reports for the patch file; just as
with size, the date can also be transferred via the protocol.
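
A minimal sketch of the client side for the HTTP case, assuming the download
code hands over the value of the Last-Modified response header. The function
name is hypothetical.

```python
import os
from email.utils import parsedate_to_datetime


def touch_from_last_modified(path, last_modified):
    """Set atime and mtime of `path` from an HTTP Last-Modified value."""
    stamp = parsedate_to_datetime(last_modified).timestamp()
    os.utime(path, (stamp, stamp))
    return stamp
```

For example, touch_from_last_modified("Packages", "Thu, 25 Nov 2004 00:19:49 GMT")
would give the patched file the date the server reported for the patch.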

--Jeroen

-- 
Jeroen van Wolffelaar
jeroen@wolffelaar.nl
http://jeroen.A-Eskwadraat.nl



Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Anthony Towns <aj@azure.humbug.org.au>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #107 received at 128818@bugs.debian.org (full text, mbox):

From: Anthony Towns <aj@azure.humbug.org.au>
To: Michael Vogt <mvogt@acm.org>
Cc: 128818@bugs.debian.org
Subject: Re: Bug#128818: [patch] packages.gz diff support for apt
Date: Thu, 25 Nov 2004 14:53:04 +1000
[Message part 1 (text/plain, inline)]
Michael Vogt wrote:
>>So the index file I was imagining looked like:
> While all the information is certainly useful, I wonder if it's all
> needed. 

It's not /needed/ but it is /useful/. On the server side, it's useful to 
have the source timestamps and ordering available, eg.

> A problem I see that the index-file still needs to download
> a bunch of patches.

Upside is that you can download the Index file, then all the patches 
simultaneously. That's essentially two round trips instead of N (for 
your current version) or one (for Jeroen's idea).

> I wonder if the idea of Jeroen van Wolffelaar to
> use only one ed-style diff is workable. It would indeed have a much
> better performance for the client. 

I don't know that it's such a big deal -- your general use case is a 
daily "apt-get update", anyway, and that'll only become moreso once 
those require downloading kBs instead of MBs. The other issue is that it 
makes server-side space requirements be squared instead of linear 
(you've got N patches, the most recent of which is stored N times, the 
oldest of which is stored 1 time). If we've got enough space for N=10, 
then the choice is between storing 10 days of patches Jeroen-style, or 
55 days of patches (11*10/2) ordinary style. The bandwidth hit might 
also be obnoxious, I'm not sure.
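
The arithmetic behind the 10 versus 55 comparison can be written out as a
short Python sketch. It counts storage in "patch-days" and assumes every
day's change is the same size, which is a simplification; the function names
are made up for illustration.

```python
def daily_units(days):
    """Patch-days stored when each file holds exactly one day's change."""
    return days


def cumulative_units(days):
    """Patch-days stored when each file holds every change since its base.

    The newest day's change appears in all `days` files, the oldest in one,
    so the total is 1 + 2 + ... + days.
    """
    return days * (days + 1) // 2
```

For days=10 the cumulative layout stores 55 patch-days, which is the 11*10/2
figure above: the same space would hold 55 days of ordinary daily patches.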

I'd be interested in seeing how that actually ends up looking for 
unstable and testing, though.

I'm half tempted to suggest thinking about an annotated patch file, that 
looks like:

	patch-for abcdef12341231def1123 4123 2004-11-23-131421.1234
	* a 31
	* blahblah
	* .
	patch-for a4234534562bce123423f ...
	* ...

that concatenates all the information for the patches in a single file, 
most recent to least recent with some index stuff at the top, and you 
just stop downloading once you've got enough information, or you find 
out it's not going to work. Might be overly complicated though.

> Below I outline my thoughts on the index file. I would very much
> appreciate your comments. My current feeling is that we may go without
> a explicit index-file. But I may be wrong here of course.

"we", huh?

>>Knowing the md5sum/size of what you're going to end up with is a useful 
>>sanity check, so that you can stop halfway through if you've somehow 
>>managed to get yourself into a loop or similar. 
> If the patch fails for some reason the next calculated md5sum will not
> match any file on the server and the code will fallback to download
> the Packages.gz file. 

What makes you think it won't match a file on the server? It's easy to 
write a CGI script that'll return a patch that adds lines to your file 
no matter what md5sum you ask for. If it returned a script like 
"a\na\n.\n%s/a/aaaaaa/g\n" it should do a good job of breaking your 
system reasonably quickly.

>>Knowing the md5sum of the patches is useful just in case diff has a
>>root exploit. 
> I'm not sure if I understand this correctly. You think that someone
> could sneak in a rogue diff to exploit apt?

It'd be a rogue diff that'd exploit patch, or ed, or whatever you used 
to apply it. Hopefully pretty unlikely, but defense in depth is always good.

>>Knowing the size of the patches you need to download is good for
>>progress bars.
> http/ftp will tell us about that and it should already work with the
> current patch.

It'll tell you how much you're downloading for the current patch; but 
not if you need to download another 100kB of patches after that one's done.

Cheers,
aj
[signature.asc (application/pgp-signature, attachment)]

Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Robert Lemmen <robertle@semistable.com>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #112 received at 128818@bugs.debian.org (full text, mbox):

From: Robert Lemmen <robertle@semistable.com>
To: debian-devel@lists.debian.org
Cc: 128818@bugs.debian.org
Subject: Re: New method for Packages/Sources file updates
Date: Thu, 25 Nov 2004 19:31:58 +0100
[Message part 1 (text/plain, inline)]
On Thu, Nov 25, 2004 at 06:59:15PM +0100, Frank Küster wrote:
> There was also an ITP for a client-side rsync method that puts the load
> on the client side, and which was intended to solve the same problem,
> eventually. I didn't bother to read the details, though.

that's zsync [0], packages are in NEW right now.

short story: you generate a file on the server side (cheap, small) and place
it alongside your Packages file. a client that is capable of understanding
it can retrieve it and then calculate on the client side (= for free) which
parts of the original file it needs and retrieve those over http range
requests.

i think it's kinda perfect for debian packages files, and already works
pretty well. that said it's a pretty new piece of software and has some
issues that need to be ironed out and of course needs wider testing. but
in general it doesn't cause much load on the servers (like rsync does),
doesn't need huge amounts of files on the server side (like incremental
patches do) and has almost no dependencies, so it would be easy to
integrate into apt. and as a bonus it would be easy to modify it so that
you don't need to put the delta/checksum files on the same server as the
actual file you want to download -- cool for testing purposes.
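
A much simplified sketch of the idea in Python. It is not zsync's actual file
format or algorithm: it uses plain md5 block checksums and a brute-force scan
of the old file where zsync uses a cheap rolling checksum, and the function
names are made up for illustration.

```python
import hashlib


def block_checksums(data, block_size):
    """Server side: one checksum per fixed-size block of the new file."""
    return [hashlib.md5(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def ranges_to_fetch(old_data, checksums, block_size, new_length):
    """Client side: byte ranges of the new file the old file cannot supply."""
    # Checksum the window starting at every offset of the old file, so that
    # blocks which merely moved are still found.
    available = set()
    for offset in range(len(old_data)):
        window = old_data[offset:offset + block_size]
        available.add(hashlib.md5(window).hexdigest())
    missing = []
    for number, checksum in enumerate(checksums):
        if checksum in available:
            continue
        start = number * block_size
        end = min(start + block_size, new_length) - 1  # inclusive, as in HTTP
        if missing and missing[-1][1] + 1 == start:
            missing[-1] = (missing[-1][0], end)  # merge adjacent ranges
        else:
            missing.append((start, end))
    return missing
```

The server only publishes the checksum list once per new file; all matching
work happens on the client, which then asks for the missing ranges with
ordinary http range requests.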

cu  robert

[0] http://zsync.moria.org.uk

-- 
Robert Lemmen                               http://www.semistable.com 
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Jeroen van Wolffelaar <jeroen@wolffelaar.nl>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #117 received at 128818@bugs.debian.org (full text, mbox):

From: Jeroen van Wolffelaar <jeroen@wolffelaar.nl>
To: Anthony Towns <aj@azure.humbug.org.au>
Cc: Michael Vogt <mvogt@acm.org>, 128818@bugs.debian.org
Subject: Re: Bug#128818: [patch] packages.gz diff support for apt
Date: Sat, 27 Nov 2004 17:52:53 +0100
On Thu, Nov 25, 2004 at 02:53:04PM +1000, Anthony Towns wrote:
> Michael Vogt wrote:
> >I wonder if the idea of Jeroen van Wolffelaar to
> >use only one ed-style diff is workable. It would indeed have a much
> >better performance for the client. 
> 
> I don't know that it's such a big deal -- your general use case is a 
> daily "apt-get update", anyway, and that'll only become moreso once 
> those require downloading kBs instead of MBs. The other issue is that it 
> makes server-side space requirements be squared instead of linear 
> (you've got N patches, the most recent of which is stored N times, the 
> oldest of which is stored 1 time). If we've got enough space for N=10, 
> then the choice is between storing 10 days of patches Jeroen-style, or 
> 55 days of patches (11*10/2) ordinary style. The bandwidth hit might 
> also be obnoxious, I'm not sure.

Regarding bandwidth: that only matters among mirrors carrying all the
Packages.gz files etc., and mirrors are assumed to have plenty of space
anyway (or not to mirror these files if they don't want to).

Regarding space requirements, you're absolutely right. It can be a bit
less (~37% actually) when the diffs have duplicate/useless information
purged, rather than simply concatted.

> I'd be interested in seeing how that actually ends up looking for 
> unstable and testing, though.

I've run some stats over the past 8 weeks of Packages.gz files for sid's
main i386. The full datasheet (badly formatted and a bit raw) is at
http://www.wolffelaar.nl/~jeroen/pdiff.sxc (OO.o calc)

The raw daily ed-diffs are on average 50kB (ranging 30kB - 150kB), the
bzip2-compressed versions on average 12kB. I'll now list some numbers
for the 27 November Packages.gz files (all dates are defined as the day
on which those files are already available at 0:00 UTC):

For each
    number of weeks to keep on the server
I will list
(1) total size needed for daily ed diffs (Anthony-style) (bz2)
(2) total size needed for cumulative ed diffs (Jeroen-style) (bz2)
(3) total size needed for optimized cumulative ed diffs (bz2)

Total server requirements will be about 25 times that (11 architectures
times two often-changing suites (testing&sid), plus I added 10% for
sources and contrib/non-free). The figure of 25 is a bit guessed
though... I could run it for all.

weeks   (1)   (2)      (3)  -- x25 --> (1)     (2)     (3)
1       86kB   382kB   333kB          2.1MB   9.3MB   8.1MB
2      182kB  1315kB  1044kB          4.4MB  32.0MB  25.4MB
3      266kB  2948kB  2161kB          6.5MB  71.9MB  52.7MB
4      368kB  5221kB  3635kB          8.9MB 127.4MB  88.7MB
5      460kB  8093kB  5415kB         11.2MB 197.5MB 132.2MB
6      536kB 11591kB  7488kB         13.0MB 282.9MB 182.8MB
7      613kB 15668kB  9823kB         14.9MB 382.5MB 239.8MB
8      675kB 19589kB 12014kB         16.4MB 478.2MB 293.3MB

So, while the space requirements for the cumulative diffs don't look too
extreme, the table also shows that with less than 17MB of mirror space
you can keep two months' worth of daily ed diffs for all architectures
and suites (do note that part of the data is guessed, the number 25 as
explained above).

If you're going to support only a week or so, however, it doesn't
matter much. As Anthony Towns' suggestion scales much better, I
suggest going for the index file. Daily updates will be the most common,
and with this index file and http connection reuse, you can download
all the patches you need quite efficiently.
 
> I'm half tempted to suggest thinking about an annotated patch file, that 
> looks like:
> 
> 	patch-for abcdef12341231def1123 4123 2004-11-23-131421.1234
> 	* a 31
> 	* blahblah
> 	* .
> 	patch-for a4234534562bce123423f ...
> 	* ...
> 
> that concatenates all the information for the patches in a single file, 
> most recent to least recent with some index stuff at the top, and you 
> just stop downloading once you've got enough information, or you find 
> out it's not going to work. Might be overly complicated though.

This is a nice idea: it combines downloading only one file with
moderate space requirements. The implementation is indeed a bit more
tricky, but I don't think it's prohibitively more difficult. An added
bonus is that it is just one file that gets prepended to, so the
directory listing near the Packages.gz files doesn't grow an enormous
number of files. On the (small) downside, prepending and then
recompressing with bz2 makes this big patch file rsync-unfriendly to
transfer among mirrors.

--Jeroen

-- 
Jeroen van Wolffelaar
jeroen@wolffelaar.nl
http://jeroen.A-Eskwadraat.nl



Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Anthony Towns <aj@azure.humbug.org.au>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #122 received at 128818@bugs.debian.org (full text, mbox):

From: Anthony Towns <aj@azure.humbug.org.au>
To: Jeroen van Wolffelaar <jeroen@wolffelaar.nl>
Cc: 128818@bugs.debian.org
Subject: Re: Bug#128818: [patch] packages.gz diff support for apt
Date: Sun, 28 Nov 2004 14:11:19 +1000
[Message part 1 (text/plain, inline)]
Jeroen van Wolffelaar wrote:
>>I don't know that it's such a big deal -- your general use case is a 
>>daily "apt-get update", anyway, and that'll only become moreso once 
>>those require downloading kBs instead of MBs. The other issue is that it 
>>makes server-side space requirements be squared instead of linear 
>>(you've got N patches, the most recent of which is stored N times, the 
>>oldest of which is stored 1 time). If we've got enough space for N=10, 
>>then the choice is between storing 10 days of patches Jeroen-style, or 
>>55 days of patches (11*10/2) ordinary style. The bandwidth hit might 
>>also be obnoxious, I'm not sure.
> Regarding bandwidth, only among mirrors having all packages.gz files etc,
> and mirrors are assumed to have plenty of space anyway (or not to mirror
> these files if they don't want to).

Err, complete mirrors aren't assumed to have infinite bandwidth, and 
they're not assumed to have arbitrary amounts of bandwidth we can waste. 
Note that if you've got the "10" days of patches, the single diff per 
day needs downloading two files (index and patch), and removing one (10 
day old patch); the complete-patch-for-each-day needs to download 10 
files (that are in total 55 times the size of the other patch we're 
downloading). 55*30kB is ~1.6MB. I'm still not convinced that counts as 
obnoxious, but it's not clearly unobnoxious either (in the way 30kB is).

> Regarding space requirements, you're absolutely right. It can be a bit
> less (~37% actually) when the diffs have duplicate/useless information
> purged, rather than simply concatted.

That seems difficult to do without keeping all the old Packages files 
around, which would be nice to avoid?

37% less is around 33% less, is around a 1/3rd less, 2/3rd of 55 is 
around 37 times, for 30kB versus 1.1MB, which still isn't real convincing.

> I've run some stats over the past 8 weeks of Packages.gz files for sid's
> main i386. Full datasheet (badly formatted and a bit raw) are at
> http://www.wolffelaar.nl/~jeroen/pdiff.sxc (OO.o calc)

Any chance of dumping to a .csv file?

> The raw daily ed-diffs are on average 50kB (ranging 30kB - 150kB), the
> bzipped2 version of it on average 12kB.

What's the gzipped size? It'd probably be nicer to go with that for 
small things, I think?

> I'll now list for the 27
> november Packages.gz files (all dates are defined as the day that at
> 0:00 UTC those files are already available) some numbers:
> 
> For each
>     number of weeks to keep on the server
> I will list
> (1) total size needed for daily ed diffs (Anthony-style) (bz2)
> (2) total size needed for cumulative ed diffs (Jeroen-style) (bz2)
> (3) total size needed for optimized cumulative ed diffs (bz2)
> 
> Total server requirements will be about 25 times that (11 architectures
> times two often-changing suites (testing&sid), plus I added 10% for
> sources and contrib/non-free). The figure of 25 is a bit guessed
> though... I could run it for all.

25 sounds pretty fair as an estimate, though I'd expect Sources to 
change less than Packages (no descriptions or Depends: lines that get 
tweaked regularly, just Version: fields) rather than more; and not all 
architectures are going to be the same either, though I don't know how 
significant that is. How about running it on everything anyway? Three 
cheers for brute force and ignorance! My guess: factor of 19 or 20. Note 
it'll go up anyway when new architectures start getting added again.

> weeks   (1)   (2)      (3)  -- x25 --> (1)     (2)     (3)
> 1       86kB   382kB   333kB          2.1MB   9.3MB   8.1MB
> 2      182kB  1315kB  1044kB          4.4MB  32.0MB  25.4MB
> 3      266kB  2948kB  2161kB          6.5MB  71.9MB  52.7MB
> 4      368kB  5221kB  3635kB          8.9MB 127.4MB  88.7MB
> 5      460kB  8093kB  5415kB         11.2MB 197.5MB 132.2MB
> 6      536kB 11591kB  7488kB         13.0MB 282.9MB 182.8MB
> 7      613kB 15668kB  9823kB         14.9MB 382.5MB 239.8MB
> 8      675kB 19589kB 12014kB         16.4MB 478.2MB 293.3MB
> 
> So, while the space requirements for this don't look too extreme, it
> also shows that with less than 17MB mirror space you can keep two months
> worth of ed diffs for all architectures and suites (do note that part of
> the data is guessed, the number 25 as explained above).

Err, aren't you also guessing that the 1 week uses 86kB consistently? I 
find it hard to believe that it's /really/ that consistent.

Hrm, 8 weeks of index file isn't even such a big deal -- I use up about 
120 bytes per entry, which is under 7kB for 8 weeks of daily entries.

Also worth investigating: how long does it take to apply (1), (2) and 
(3) after 6 to 8 weeks of changes have accumulated? I'd guess (3) should 
be okay, but I'd be a little worried about (1) and (2).

> This is a nice idea, it combines the only one file to be downloaded with
> the moderate space requirements. Implementation is a bit more tricky
> indeed, though, but I don't think it's prohibitively more difficult. Added
> bonus is that it is just one file, where there's being prepended to:
> directory listing near the packages.gz files isn't having that enormous
> amount of files. On the (small) downside, prepending, but then
> recompressing with bz2 makes it non-rsync friendly to transfer this big
> patchfile amongst mirrors.

Yeah -- but it's only 17MB a day in total; so big deal. And I suspect 
gzip --rsyncable wouldn't make it that much bigger either anyway. It's 
the client side implementation issues that's really tricky.

Hrm. How about two files; an index and a single concatenated patch file, 
where the index tells you where to start and where to finish, and you 
just download those bytes, and apply them? Can apt methods reliably be 
made to support one of "download bytes 1..N of <url>" or "download bytes 
M..EOF of <url>"? I guess we can trust that ./Packages.diffdex.gz and 
./Packages.diff.gz will all be in sync pretty much all the time on 
non-broken mirrors. :-/ One file, or an index and n files would be 
easier to make reliable.

Hrm. 25 weeks at 2 entries a day would be 350 entries, at 160 bytes per 
entry would be 56,000 bytes. Crap. Okay, 10 weeks at 1 entry a day is 70 
entries, at 160 bytes per entry (2 lines), gives a little over 10kB of 
index information. *grump* Okay, my ideal one file format is therefore:

 Patch: 2004-11-28-035413
 Patch-MD5Sum: 12311 764efa883dda1e11db47671c4a3bbd9e
 File-MD5Sum: 8112111 c1e3db8ccea4541a0f3d7e5c75feb3fb
 Ed-Commands:
  1231112c
  Version: 1.00-2
  ...

 Patch: 2004-11-27-032836
 Patch-MD5Sum: ...
   ...

except that sucks too because you don't know if you're too far out of 
date until you get right to the end. Blehblehbleh.

Usage scenarios:

 (a) Download one patch to get from last update to today.
 (b) Download n patches to get from n updates ago to today
 (c) Download entire Packages file because patches aren't available or
     downloading patches would be slower/larger.

(a) should be as quick as possible, since it's the common case.
(c) shouldn't require downloading too much extra information to work out 
it's necessary -- 20-50kB is acceptable, 500kB isn't.

So (c) implies you need to be able to quickly get a list of all the 
"File-MD5Sums" we have patches for. So they have to be together in some 
index file, or an index section of some file.

Meh, I'm putting this down to premature optimisation and going back to 
an index file and n patch files. They can go in some subdirectory, 
Packages.gz, Packages.bz2, Packages.diffs/. Whatever.
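
A minimal sketch in Python of how a client could choose between scenarios
(a), (b) and (c) once it has the index. The tuple layout of `history`, the
`current_md5` parameter and the name plan_update are assumptions made for
this example and are not a description of any existing apt code.

```python
def plan_update(history, local_md5, current_md5, full_size):
    """Pick what to download for one index file.

    `history` is a list of (file_md5, patch_name, patch_size) tuples, oldest
    first; each patch turns the file with `file_md5` into the next version.
    """
    if local_md5 == current_md5:
        return ("current", [])                  # nothing to download
    known = [file_md5 for file_md5, _name, _size in history]
    if local_md5 not in known:
        return ("full", [])                     # (c): no patch chain exists
    chain = history[known.index(local_md5):]
    if sum(size for _md5, _name, size in chain) >= full_size:
        return ("full", [])                     # (c): patches would be larger
    return ("patches", [name for _md5, name, _size in chain])  # (a) or (b)
```

Only the index has to be fetched before the decision is made, which keeps
the cost of discovering case (c) down to the size of the index itself.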

Cheers,
aj
[signature.asc (application/pgp-signature, attachment)]

Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Adam Heath <doogie@debian.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #127 received at 128818@bugs.debian.org (full text, mbox):

From: Adam Heath <doogie@debian.org>
To: Anthony Towns <aj@azure.humbug.org.au>, 128818@bugs.debian.org
Subject: Re: Bug#128818: [patch] packages.gz diff support for apt
Date: Mon, 29 Nov 2004 01:36:30 -0600 (CST)
On Sun, 28 Nov 2004, Anthony Towns wrote:

> Yeah -- but it's only 17MB a day in total; so big deal. And I suspect
> gzip --rsyncable wouldn't make it that much bigger either anyway. It's
> the client side implementation issues that's really tricky.
>
> Hrm. How about two files; an index and a single concatenated patch file,
> where the index tells you where to start and where to finish, and you
> just download those bytes, and apply them? Can apt methods reliably be
> made to support one of "download bytes 1..N of <url>" or "download bytes
> M..EOF of <url>"? I guess we can trust that ./Packages.diffdex.gz and
> ./Packages.diff.gz will all be in sync pretty much all the time on
> non-broken mirrors. :-/ One file, or an index and n files would be
> easier to make reliable.

Only http can do ranges.  FTP can only start at an offset; the client then has
to abort when it gets what it wants, by closing the tcp connection.
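
For the http case, a minimal Python sketch of building such a ranged request
with the standard library. The function name and any URL passed in are
placeholders. HTTP byte ranges are zero-based and inclusive, and a server is
free to ignore the header and send the whole file, so a real client has to
check for a 206 status in the response.

```python
import urllib.request


def range_request(url, first_byte, last_byte=None):
    """Build a GET request for bytes first_byte..last_byte of `url`.

    Leaving `last_byte` as None asks for everything from `first_byte` to EOF.
    """
    end = "" if last_byte is None else str(last_byte)
    request = urllib.request.Request(url)
    request.add_header("Range", f"bytes={first_byte}-{end}")
    return request
```

Passing the result to urllib.request.urlopen would perform the download; the
open-ended form corresponds to the "download bytes M..EOF" variant, which is
also the only one FTP can offer via its restart offset.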

I'm not certain about the rsh/ssh method.



Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#128818; Package apt. Full text and rfc822 format available.

Acknowledgement sent to Dan Jacobson <jidanni@jidanni.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #132 received at 128818@bugs.debian.org (full text, mbox):

From: Dan Jacobson <jidanni@jidanni.org>
To: 128818@bugs.debian.org
Subject: apt-offline (dselect-upgrade via CDROM)
Date: Sat, 11 Dec 2004 05:37:48 +0800
Dudes, for the lowest of the low bandwidth users:
http://jidanni.org/comp/apt-offline/index_en.html :
   If just "apt-get update", even say with apt-rsync, still takes too
   much modem time, thus ruling out apt-zip, then let's move as much
   of the operation as we can off of our low bandwidth computer, doing all
   apt computing instead on a high bandwidth computer where we have an
   account. We implement apt-doc's offline idea.




Acknowledgement sent to Matt Zimmerman <mdz@debian.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #137 received at 128818@bugs.debian.org (full text, mbox):

From: Matt Zimmerman <mdz@debian.org>
To: 128818@bugs.debian.org
Subject: Packages.gz diffs
Date: Mon, 27 Dec 2004 18:53:13 -0800
My general position on this feature is that it should be implemented as an
external program or library, which apt can interface with, rather than as
something specific to apt.  The problem of creating and using these delta
updates is more general than apt or Debian.

-- 
 - mdz




Acknowledgement sent to Andreas Barth <aba@not.so.argh.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #142 received at 128818@bugs.debian.org (full text, mbox):

From: Andreas Barth <aba@not.so.argh.org>
To: 128818@bugs.debian.org
Cc: debian-devel@lists.debian.org
Subject: partial patches - server application
Date: Thu, 6 Jan 2005 10:12:12 +0100
Dear all,

with ideas and code (and a lot more) from Anthony, I was able to put
together the server part for partial patches in a way that it seems to
me that it might be included in dak. The resulting files are available
from
 deb http://merkel.debian.org/~aba/debian sid main contrib non-free
(or any other combination of suites and components you like)

However, there are only the dist files on that place, _no_ downloadable
pool is available there.

The partial files are included in a subdirectory called diff in each
"low-level" directory like unstable/main/source, and have an Index-file
pointing to the other files, and one or more (at maximum 14) patches.
Such an Index-File looks like:

Canonical-Path: dists/sid/main/binary-i386/Packages
SHA1-History:
 f3a0c1972021af11782c661d1bd5214f1d443868 13345332 2005-01-04-1633.27
 9891de37f8f56b15e2dcffe6b02afa94f8bfa472 13346502 2005-01-05-1633.08
SHA1-Patches:
 c3ad4f802238c5becefb1551722fd26d00452db4   33228 2005-01-04-1633.27
 2314857b6ffed5f55c3f667ec14bba860818a7ad   66436 2005-01-05-1633.08

This means: If the local file dists/sid/main/binary-i386/Packages has
the sha1-sum of f3a0c1972021af11782c661d1bd5214f1d443868, take the patch
named 2005-01-04-1633.27 (and this patch has the given size and
sha1-sum). Of course, this patch is a gz'ed file. The Patches are
ed-style, which is better for size.
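
To make the lookup concrete, here is a minimal shell sketch of it. The
function name patches_from is made up for this illustration, and it
assumes (the description above does not say so outright) that once the
local checksum matches a SHA1-History entry, that patch and every later
one listed are to be applied in order.

```shell
#!/bin/sh
# patches_from INDEX SHA1   (hypothetical helper, for illustration only)
# Print the names of the patches to apply, oldest first, when the local
# file currently has checksum SHA1. Exits non-zero if SHA1 is not listed
# under SHA1-History; a caller would then fall back to a full download.
patches_from () {
  awk -v want="$2" '
    /^[^ ]/ { section = $1; next }                    # a "Field:" header line
    section == "SHA1-History:" && $1 == want { found = 1 }
    section == "SHA1-History:" && found { print $3 }  # this patch and all later ones
    END { exit !found }
  ' "$1"
}
```

With the Index shown above saved as ./Index, calling
patches_from ./Index f3a0c1972021af11782c661d1bd5214f1d443868
would list both patches, while the second checksum would list only the
2005-01-05 one.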


The script is run once every night at about 23:00 UTC. The script itself
is available at http://merkel.debian.org/~aba/tiffani


If there are any questions, please don't hesitate to speak with me.


Cheers,
Andi




Acknowledgement sent to Andreas Metzler <ametzler@downhill.at.eu.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #147 received at 128818@bugs.debian.org (full text, mbox):

From: Andreas Metzler <ametzler@downhill.at.eu.org>
To: debian-devel@lists.debian.org
Cc: Andreas Barth <aba@not.so.argh.org>, 128818@bugs.debian.org
Subject: Re: partial patches - server application
Date: Thu, 6 Jan 2005 10:35:07 +0100
On 2005-01-06 Andreas Barth <aba@not.so.argh.org> wrote:
[...]
>  deb http://merkel.debian.org/~aba/debian sid main contrib non-free
> (or any other combination of suites and components you like)

> However, there are only the dist files on that place, _no_ downloadable
> pool is available there.

> The partial files are included in a subdirectory called diff in each
> "low-level" directory like unstable/main/source, and have an Index-file
> pointing to the other files, and one or more (at maximum 14) patches.
> Such an Index-File looks like:
[...]

Hello,
This looks extremely promising, thank you.

Is there actually a good[1] reason for keeping the patches both as
plain and gzipped?

I guess we are carrying around Packages _and_ Packages.gz _and_
Packages.bz2 for backwards compatibility but that does not apply for a
fresh piece of code like the diff-idea.
                  cu andreas
[1] I realize that un-zipped might be a little bit faster for
file-URIs but is this worth it?
-- 
"See, I told you they'd listen to Reason," [SPOILER] Svfurlr fnlf,
fuhggvat qbja gur juveyvat tha.
Neal Stephenson in "Snow Crash"
                                           http://downhill.aus.cc/




Acknowledgement sent to Andreas Barth <aba@not.so.argh.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #152 received at 128818@bugs.debian.org (full text, mbox):

From: Andreas Barth <aba@not.so.argh.org>
To: debian-devel@lists.debian.org, 128818@bugs.debian.org
Subject: Re: partial patches - server application
Date: Thu, 6 Jan 2005 11:22:53 +0100
* Andreas Metzler (ametzler@downhill.at.eu.org) [050106 11:10]:
> On 2005-01-06 Andreas Barth <aba@not.so.argh.org> wrote:
> [...]
> >  deb http://merkel.debian.org/~aba/debian sid main contrib non-free
> > (or any other combination of suites and components you like)
> 
> > However, there are only the dist files on that place, _no_ downloadable
> > pool is available there.
> 
> > The partial files are included in a subdirectory called diff in each
> > "low-level" directory like unstable/main/source, and have an Index-file
> > pointing to the other files, and one or more (at maximum 14) patches.
> > Such an Index-File looks like:
> [...]
> 
> Hello,
> This looks extremely promising, thank you.
> 
> Is there actually a good[1] reason for keeping the patches both as
> plain and gzipped?

The good reason is that it was easier to test with them by hand, and I
was too lazy to remove them before now - they are gone now, and tiffani
is changed to not produce them any more.



Cheers,
Andi
-- 
   http://home.arcor.de/andreas-barth/
   PGP 1024/89FB5CE5  DC F1 85 6D A6 45 9C 0F  3B BE F1 D0 C5 D1 D9 0C




Acknowledgement sent to Florian Weimer <fw@deneb.enyo.de>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #157 received at 128818@bugs.debian.org (full text, mbox):

From: Florian Weimer <fw@deneb.enyo.de>
To: Andreas Barth <aba@not.so.argh.org>
Cc: 128818@bugs.debian.org, debian-devel@lists.debian.org
Subject: Re: partial patches - server application
Date: Thu, 06 Jan 2005 11:41:41 +0100
* Andreas Barth:

> This means: If the local file dists/sid/main/binary-i386/Packages has
> the sha1-sum of f3a0c1972021af11782c661d1bd5214f1d443868, take the patch
> named 2005-01-04-1633.27 (and this patch has the given size and
> sha1-sum). Of course, this patch is a gz'ed file. The Patches are
> ed-style, which is better for size.

Is this really a good idea?  patch invokes ed(1) to process ed
scripts, and this might lead to execution of arbitrary commands.




Acknowledgement sent to Andreas Barth <aba@not.so.argh.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #162 received at 128818@bugs.debian.org (full text, mbox):

From: Andreas Barth <aba@not.so.argh.org>
To: Florian Weimer <fw@deneb.enyo.de>
Cc: 128818@bugs.debian.org, debian-devel@lists.debian.org
Subject: Re: partial patches - server application
Date: Thu, 6 Jan 2005 11:50:29 +0100
* Florian Weimer (fw@deneb.enyo.de) [050106 11:45]:
> * Andreas Barth:

> > This means: If the local file dists/sid/main/binary-i386/Packages has
> > the sha1-sum of f3a0c1972021af11782c661d1bd5214f1d443868, take the patch
> > named 2005-01-04-1633.27 (and this patch has the given size and
> > sha1-sum). Of course, this patch is a gz'ed file. The Patches are
> > ed-style, which is better for size.
 
> Is this really a good idea?  patch invokes ed(1) to process ed
> scripts, and this might lead to execution of arbitrary commands.

It is agreed that the usage of patch and ed is _not_ the recommended
way for production code (but acceptable for prototype code). However, as
already discussed last time, the patches need only a tiny subset of ed
that is not only provided by red, but can even be implemented internally
in apt.


Cheers,
Andi
-- 
   http://home.arcor.de/andreas-barth/
   PGP 1024/89FB5CE5  DC F1 85 6D A6 45 9C 0F  3B BE F1 D0 C5 D1 D9 0C




Acknowledgement sent to Florian Weimer <fw@deneb.enyo.de>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #167 received at 128818@bugs.debian.org (full text, mbox):

From: Florian Weimer <fw@deneb.enyo.de>
To: Andreas Barth <aba@not.so.argh.org>
Cc: 128818@bugs.debian.org, debian-devel@lists.debian.org
Subject: Re: partial patches - server application
Date: Thu, 06 Jan 2005 17:53:55 +0100
* Andreas Barth:

>> Is this really a good idea?  patch invokes ed(1) to process ed
>> scripts, and this might lead to execution of arbitrary commands.
>
> It is agreed that the usage of patch and ed is _not_ the recommended
> way for production code (but acceptable for prototype code). However, as
> already discussed last time, the patches need only a tiny subset of ed
> that is not only provided by red, but can even be implemented internally
> in apt.

Unfortunately, deltas created with "diff -e" are not suitable for
one-pass processing.  The output from "diff -f" is, but it cannot
handle some input files ("." on a line by itself cannot be added and
results in silently corrupted output).  "diff -n" (RCS style) output
seems to be most suitable.  It appears to be completely binary
transparent ("diff -e" isn't), and it is suitable for one-pass
processing.

An example implementation (written in C) of the corresponding patch
operation is available at:

  svn://subversion.enyo.de/rcs-patch/rcs-patch/trunk
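
For readers without access to that repository, the one-pass property
can be sketched in a few lines of shell and awk. This is not the C
implementation referenced above, only an illustration; the name
apply_rcs is made up. It relies on the "diff -n" format, where "aN K"
adds the following K lines after original line N and "dN K" deletes K
lines starting at original line N, with hunks in ascending order.

```shell
#!/bin/sh
# apply_rcs ORIG PATCH   (hypothetical helper, for illustration only)
# Apply a "diff -n" (RCS format) patch to ORIG in a single pass and write
# the result to stdout. ORIG is streamed, never held in memory as a whole.
apply_rcs () {
  awk -v orig="$1" '
    # copy original lines to the output up to and including line k
    function copy_through(k,    l) {
      while (pos < k && (getline l < orig) > 0) { print l; pos++ }
    }
    # drop the next k original lines
    function skip(k,    l) {
      while (k-- > 0 && (getline l < orig) > 0) pos++
    }
    toadd > 0 { print; toadd--; next }               # literal text of an "a" hunk
    /^a[0-9]+ [0-9]+$/ { copy_through(substr($1, 2) + 0); toadd = $2 + 0; next }
    /^d[0-9]+ [0-9]+$/ { copy_through(substr($1, 2) - 1); skip($2 + 0); next }
    { print "apply_rcs: unexpected line: " $0 > "/dev/stderr"; bad = 1; exit 1 }
    END {
      if (bad) exit 1
      while ((getline l < orig) > 0) print l         # rest of the original
    }
  ' "$2"
}
```

Because the added text is counted rather than terminated by a marker
line, a "." on a line by itself needs no special treatment. Output is
streamed, so a caller would write to a temporary file and only replace
the original when the function returns success.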

There is a drawback, though: "diff -n" output is slightly larger than
"diff -e" output, even after compression (by about 7% to 10%).




Acknowledgement sent to Anthony Towns <aj@azure.humbug.org.au>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #172 received at 128818@bugs.debian.org (full text, mbox):

From: Anthony Towns <aj@azure.humbug.org.au>
To: debian-devel@lists.debian.org
Cc: 128818@bugs.debian.org
Subject: Re: partial patches - server application
Date: Thu, 13 Jan 2005 09:30:43 +1000
Andreas Barth wrote:
> with ideas and code (and a lot more) from Anthony, I was able to put
> together the server part for partial patches in a way that it seems to
> me that it might be included in dak. The resulting files are available
> from
>  deb http://merkel.debian.org/~aba/debian sid main contrib non-free
> (or any other combination of suites and components you like)

So here's a script to make use of them:

---
#!/bin/sh

file="$1"
index="$2"
patchurl="${3%/}"
patch=$(tempfile)

sizeof () {
 wc -c "$1" | sed 's/^ *//;s/ .*//'
}

c_sha1=""
c_size="-1"
cur_sha1 () {
  if [ "$c_sha1" = "" ]; then
    # sha1sum prints "<hash>  -" when reading stdin; keep only the hash
    c_sha1=$(sha1sum < "$file" | cut -d' ' -f1)
  fi
  echo $c_sha1
}
cur_size () {
  c2_size=$(sizeof "$file")
  if [ $c2_size != $c_size ]; then
    c_size=$c2_size
    c_sha1=""
  fi
  echo $c_size
}

# print "<sha1> <size>" for patch $1 as listed under SHA1-Patches
patch_sha1_size () {
  sed -n '/^SHA1-Patches:/,/^[^ ]/'"s/^ \([^ ]*\)  *\([^ ]*\) $1\$/\1 \2/p" "$index"
}

sed -n '/^SHA1-History:/,/^[^ ]/s/^ / /p' "$index" |
  while read n_sha1 n_size n_patch; do
    echo "try: $n_patch"
    if [ $(cur_size) = "$n_size" ]; then
      # only patch when the local file really is the expected version
      if [ $(cur_sha1) = "$n_sha1" ]; then
        curl -s "$patchurl/$n_patch.diff.gz" | zcat > $patch
        p_size=$(sizeof $patch)
        p_sha1=$(sha1sum < $patch | cut -d' ' -f1)

        if [ "$p_sha1 $p_size" = "$(patch_sha1_size $n_patch)" ]; then
          echo "applying patch $n_patch"
          (cat $patch; echo "wq") | ed "$file" >/dev/null
          c_size=0
        fi
      fi
    fi
  done

rm -f "$patch"
---

If the script's installed in /usr/local/bin/untiffani, usage is 
something like:

---
  url=http://merkel.debian.org/~aba/debian
  tmp=$(tempfile) || exit
  cd /var/lib/apt/lists
  for p in *_Packages; do
    path=${p##*_dists_}
    if [ "$path" = "$p" ]; then continue; fi
    path=$(echo ${path%_Packages} | tr _ /)
    >"$tmp"
    url2="${url}/dists/${path}/diff"
    url2=$(echo "$url2" | sed 's/testing/sarge/;s/unstable/sid/')
    wget -q -O "$tmp" "${url2}/Index"
    if [ -s "$tmp" ]; then
      cp "$p" "${p}.bak"
      /usr/local/bin/untiffani "$p" "$tmp" "${url2}"
    fi
  done
  rm -f "$tmp"

  apt-get update
---

It's uh, rather untested. If you try it and it actually works (or you 
fix it so it works), might be worth posting a followup if nobody else 
has already.

For comparison, today's update of unstable/main/i386/Packages.bz2 goes 
from downloading the full 2.5MB to downloading a 2kB Index file, and a 
15kB patch.

Note that the ~aba url is only updated about now every day, and if it 
and your mirror are out of sync, you'll likely end up unhappy.

Cheers,
aj




Acknowledgement sent to Robert Lemmen <robertle@semistable.com>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #177 received at 128818@bugs.debian.org (full text, mbox):

From: Robert Lemmen <robertle@semistable.com>
To: 128818@bugs.debian.org
Subject: c implementation
Date: Tue, 1 Mar 2005 16:38:44 +0100
hi folks,

i am working on a c implementation of the code above, as we will need
this for inclusion into apt (and some other stuff). if anybody else is
working on it, please tell me so we don't duplicate work.

the big question is however how to do the ed part. having full ed
compatibility is imho undesirable because it is very complex and brings
security problems with it. can we agree on a limited subset? for example
no lines that start with a dot (should not happen in packages files
anyway) and the hunks must be ordered by line number (good for
single-pass processing, present in the current diffs)? 
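
to show how small such a subset can be, here is a rough shell/awk sketch
(not the c code, and not single-pass: it simply loads the file into an
array). the name apply_red is made up. it accepts only "Na", "N[,M]c"
and "N[,M]d", requires hunks in descending line order as "diff -e"
emits them, and rejects everything else, so "w", "q", "!" and "s" never
get executed. as an assumption of this sketch, a text line consisting of
a single "." cannot be represented at all.

```shell
#!/bin/sh
# apply_red ORIG PATCH   (hypothetical helper, for illustration only)
# Apply a restricted ed script to ORIG and write the result to stdout.
# Nothing is printed unless the whole script was accepted.
apply_red () {
  awk -v orig="$1" '
    function complain(msg) { print "apply_red: " msg > "/dev/stderr"; bad = 1 }
    # remove lines a..b
    function del(a, b,    i, k) {
      k = b - a + 1
      for (i = a; i + k <= n; i++) line[i] = line[i + k]
      n -= k
    }
    # insert buf[1..m] after line p
    function ins(p,    i) {
      for (i = n; i > p; i--) line[i + m] = line[i]
      for (i = 1; i <= m; i++) line[p + i] = buf[i]
      n += m; m = 0
    }
    BEGIN {
      while ((getline l < orig) > 0) line[++n] = l   # load the original
      limit = n                                      # highest line a hunk may touch
    }
    intext {
      if ($0 == ".") { ins(at); intext = 0 } else buf[++m] = $0
      next
    }
    /^[0-9]+a$|^[0-9]+(,[0-9]+)?[cd]$/ {
      cmd = substr($0, length($0))
      cnt = split(substr($0, 1, length($0) - 1), r, ",")
      from = r[1] + 0
      to = (cnt > 1) ? r[2] + 0 : from
      if (from > to || to > limit || (cmd != "a" && from < 1)) {
        complain("bad or out-of-order range: " $0); exit 1
      }
      if (cmd == "a") { at = from; limit = from }
      else { del(from, to); at = from - 1; limit = from - 1 }
      intext = (cmd != "d")
      next
    }
    { complain("command not allowed: " $0); exit 1 }
    END {
      if (bad || intext) exit 1                      # also fail on an unterminated block
      for (i = 1; i <= n; i++) print line[i]
    }
  ' "$2"
}
```

usage would be something like apply_red Packages patch > Packages.new,
followed by a checksum comparison before moving the new file into place.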

cu  robert

-- 
Robert Lemmen                               http://www.semistable.com 

Acknowledgement sent to Robert Lemmen <robertle@semistable.com>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #182 received at 128818@bugs.debian.org (full text, mbox):

From: Robert Lemmen <robertle@semistable.com>
To: 128818@bugs.debian.org
Subject: alpha c implementation
Date: Thu, 10 Mar 2005 14:48:43 +0100
hi everyone,

i have managed to put some code together that does the client-side stuff
in c, so we can perhaps use it in apt at some point in the future. right
now the code is a bit rough at the edges, but it basically works. i will
continue to clean it up, but i would like to get some feedback too. so
please have a look! code is at
http://www.semistable.com/files/apt-cqupdate-20050310.tar.gz

(the 20050310 will of course change, you'll manage)

cu  robert

-- 
Robert Lemmen                               http://www.semistable.com 

Acknowledgement sent to Michael Vogt <mvogt@acm.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #187 received at 128818@bugs.debian.org (full text, mbox):

From: Michael Vogt <mvogt@acm.org>
To: 128818@bugs.debian.org
Subject: [patch] update for apt-0.6
Date: Fri, 15 Apr 2005 14:07:08 +0200
Hi,

attached is a patch that prototypes an implementation of the new pdiffs
with index file as aba has implemented it on merkel. It works for
me(tm), but it still has some shortcomings. But I won't be able to
work on it in the next two weeks so I wanted to share the code with
you :)

The patch is against the current apt version from:
apt@packages.debian.org/apt--main--0

Here is the stuff that still needs to be done:
- the patch Index file needs to be used for If-Modified-Since
  requests, currently it wastes bandwidth by asking for it every time
- the external ed is called, I hope Robert's ed implementation will
  come in handy here :)
- a lot of checks are not done yet (e.g. the checksum of the
  downloaded patch is not checked, the checksum of the resulting
  package file etc)
- the fetcher output looks a bit strange (missing description and 
  overall  progress calculation)
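
The missing patch check from the list above could look roughly like the
following shell sketch. It is not taken from the attached patch; the
name verify_patch is made up, and it simply compares whatever file it
is given against the SHA1-Patches entry. Whether the Index describes
the compressed or the uncompressed patch is an assumption the caller
has to settle.

```shell
#!/bin/sh
# verify_patch INDEX NAME FILE   (hypothetical helper, for illustration only)
# Succeed only if FILE has the SHA1 and size listed for patch NAME under
# SHA1-Patches in INDEX.
verify_patch () {
  want=$(awk -v name="$2" '
    /^[^ ]/ { section = $1; next }                   # a "Field:" header line
    section == "SHA1-Patches:" && $3 == name { print $1, $2; exit }
  ' "$1")
  [ -n "$want" ] || return 1                         # patch not listed at all
  have_sha1=$(sha1sum < "$3" | cut -d' ' -f1)
  have_size=$(wc -c < "$3" | tr -d ' ')
  [ "$have_sha1 $have_size" = "$want" ]
}
```

The same comparison, run against the SHA1-History entries, would cover
the check of the resulting Packages file.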

The code is also available in tla at:
http://people.ubuntu.com/~mvo/arch/ubuntu
as
michael.vogt@ubuntu.com--2005/apt--pdiff--0

Cheers,
 Michael

-- 
Linux is not The Answer. Yes is the answer. Linux is The Question. - Neo
[apt-incr-0.6.diff (text/plain, attachment)]


Acknowledgement sent to Robert Lemmen <robertle@semistable.com>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #192 received at 128818@bugs.debian.org (full text, mbox):

From: Robert Lemmen <robertle@semistable.com>
To: 128818@bugs.debian.org
Subject: gzip and ed stuff
Date: Mon, 2 May 2005 11:03:17 +0200
hi folks,

i had a look at the patch above and have some suggestions and questions:

- the "zcat" part should be done by the apt-internal method, using the
  library directly in the ed part or somewhere else spoils apt's design.
  as far as i can see this should be easy
- the big question is how to implement the ed part. it would be most
  natural to make it a method as well, but i don't think this is
  reasonable as normal methods take one argument and produce one file,
  where the ed part would need two arguments (the original and the
  patch) and produce one file. of course this could be coded into the
  url, but that's ugly. so should i just create a method that takes the
  two file(name)s and do the patch? 

perhaps some of the people who are closer to the design of apt could
comment on the preferred way to take here...

cu  robert

-- 
Robert Lemmen                               http://www.semistable.com 

Acknowledgement sent to Michael Vogt <michael.vogt@ubuntu.com>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #197 received at 128818@bugs.debian.org (full text, mbox):

From: Michael Vogt <michael.vogt@ubuntu.com>
To: 128818@bugs.debian.org
Subject: pdiff support
Date: Fri, 19 Aug 2005 18:13:41 +0200
Hi,

just a quick update on this bug. Robert and I have converted the code to
use an internal rred method now [1]. There is no need to call an external
ed anymore.

The code has also been cleaned up and updated to apt--main--0, but the
fetcher output is still a bit strange. Testing/feedback is welcome; it
works here against aba's test server.

Cheers,
 Michael

[1] in michael.vogt@ubuntu.com--2005/apt--pdiff--0
-- 
Linux is not The Answer. Yes is the answer. Linux is The Question. - Neo




Acknowledgement sent to Michael Vogt <mvogt@acm.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #202 received at 128818@bugs.debian.org (full text, mbox):

From: Michael Vogt <mvogt@acm.org>
To: 128818@bugs.debian.org
Subject: status update
Date: Tue, 30 Aug 2005 10:32:53 +0200
Dear Friends,

here is another update on the status of this bug. The current version
in michael.vogt@ubuntu.com--2005/apt--pdiff--0 [1] works well for my
little tests so far. I put some packages (built against current sid) at:

deb http://people.debian.org/~mvo/apt/pdiffs /

It's built with a new soname. I don't actually think that this is
needed, I don't think the patch breaks the ABI. But the URL-Remap
stuff will probably break without (so it's not needed in the final
version). 


I use the following line in my sources.list to test the diff apply
code:

deb http://merkel.debian.org/~aba/debian/ sid main contrib

You can even use this url for real package installs/upgrades if you
tell apt to remap that url to the archive url:

apt-get -o APT::URL-Remap::http://merkel.debian.org/~aba/debian/=http://ftp.debian.org/debian/ dist-upgrade

This "feature" is only added for testing the pdiff stuff until it's
implemented somewhere officially and will go away once it's no longer
needed. It also still displays the non-remapped url in the gui but
connects to the remapped version (I haven't bothered yet).

It still gives some funny output when it calculates the transfer
speed, but otherwise the output of the fetcher looks mostly correct
now. 

Cheers,
 Michael


[1] baz archive is at http://people.ubuntu.com/~mvo/arch/ubuntu
-- 
Linux is not The Answer. Yes is the answer. Linux is The Question. - Neo




Acknowledgement sent to Andreas Metzler <ametzler@downhill.at.eu.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #207 received at 128818@bugs.debian.org (full text, mbox):

From: Andreas Metzler <ametzler@downhill.at.eu.org>
To: 128818@bugs.debian.org
Subject: Re: status update
Date: Sat, 7 Jan 2006 16:02:27 +0100
The respective pdiff patch has now lived happily in experimental for
quite some time without any rc-bugs being discovered.

Could we please have it merged into unstable, to find the bugs and
have it as part of etch?
             cu andreas
-- 
The 'Galactic Cleaning' policy undertaken by Emperor Zhark is a personal
vision of the emperor's, and its inclusion in this work does not constitute
tacit approval by the author or the publisher for any such projects,
howsoever undertaken.                                (c) Jasper Ffforde




Acknowledgement sent to "Maxim Grechkin" <maximsch2@gmail.com>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #212 received at 128818@bugs.debian.org (full text, mbox):

From: "Maxim Grechkin" <maximsch2@gmail.com>
To: 128818@bugs.debian.org
Subject: About delta updates
Date: Sat, 8 Apr 2006 19:13:10 +0400
When will it be merged into sid? Does it have a chance of hitting Ubuntu Dapper?




Acknowledgement sent to Michael Vogt <mvo@debian.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. Full text and rfc822 format available.

Message #217 received at 128818@bugs.debian.org (full text, mbox):

From: Michael Vogt <mvo@debian.org>
To: Maxim Grechkin <maximsch2@gmail.com>, 128818@bugs.debian.org
Subject: Re: Bug#128818: About delta updates
Date: Wed, 26 Apr 2006 13:30:35 +0200
On Sat, Apr 08, 2006 at 07:13:10PM +0400, Maxim Grechkin wrote:
> When will it be merged into sid? Does it have a chance of hitting Ubuntu Dapper?

They are already in the debian-sid bzr tree at
http://people.debian.org/~mvo/bzr/apt/apt--debian-sid/ and will be
part of the next upload.

They will probably never hit ubuntu because ubuntu regenerates the
Packages files up to twice an hour and that is not what the pdiff
deltas are designed for.

Cheers,
 Michael 

-- 
Linux is not The Answer. Yes is the answer. Linux is The Question. - Neo



Reply sent to "Eugene V. Lyubimkin" <jackyf.devel@gmail.com>:
You have taken responsibility. (Mon, 12 Jan 2009 14:06:04 GMT) Full text and rfc822 format available.

Notification sent to Radim Kolar <hsn@cybermail.net>:
Bug acknowledged by developer. (Mon, 12 Jan 2009 14:06:04 GMT) Full text and rfc822 format available.

Message #222 received at 128818-done@bugs.debian.org (full text, mbox):

From: "Eugene V. Lyubimkin" <jackyf.devel@gmail.com>
To: 128818-done@bugs.debian.org, 213551-done@bugs.debian.org
Subject: closing #128818, #213551
Date: Mon, 12 Jan 2009 16:10:35 +0200
Version: 0.6.44

PDiffs, discussed at length in this bug thread, reached apt in version
0.6.44, so I'm closing this bug now.

-- 
Eugene V. Lyubimkin aka JackYF, JID: jackyf.devel(maildog)gmail.com
Ukrainian C++ Developer, Debian Maintainer, APT contributor


Reply sent to "Eugene V. Lyubimkin" <jackyf.devel@gmail.com>:
You have taken responsibility. (Mon, 12 Jan 2009 14:06:05 GMT) Full text and rfc822 format available.

Notification sent to Doug Holland <meldroc@frii.com>:
Bug acknowledged by developer. (Mon, 12 Jan 2009 14:06:05 GMT) Full text and rfc822 format available.

Bug archived. Request was from Debbugs Internal Request <owner@bugs.debian.org> to internal_control@bugs.debian.org. (Tue, 10 Feb 2009 07:27:17 GMT) Full text and rfc822 format available.



Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Fri Apr 18 13:49:25 2014; Machine Name: buxtehude.debian.org

Debian Bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.