Debian Bug report logs - #662080
RFP: hadori -- Hardlinks identical files

Package: wnpp; Maintainer for wnpp is wnpp@debian.org;

Reported by: Timo Weingärtner <timo@tiwe.de>

Date: Sat, 3 Mar 2012 23:33:05 UTC

Severity: wishlist



Report forwarded to debian-bugs-dist@lists.debian.org, debian-devel@lists.debian.org, wnpp@debian.org:
Bug#662080; Package wnpp. (Sat, 03 Mar 2012 23:33:08 GMT) Full text and rfc822 format available.

Acknowledgement sent to Timo Weingärtner <timo@tiwe.de>:
New Bug report received and forwarded. Copy sent to debian-devel@lists.debian.org, wnpp@debian.org. (Sat, 03 Mar 2012 23:33:08 GMT) Full text and rfc822 format available.

Message #5 received at submit@bugs.debian.org (full text, mbox):

From: Timo Weingärtner <timo@tiwe.de>
To: submit@bugs.debian.org
Subject: ITP: hadori -- Hardlinks identical files
Date: Sun, 4 Mar 2012 00:31:16 +0100
[Message part 1 (text/plain, inline)]
Package: wnpp
Severity: wishlist
X-Debbugs-CC: debian-devel@lists.debian.org

   Package name: hadori
        Version: 0.2
Upstream Author: Timo Weingärtner <timo@tiwe.de>
            URL: https://github.com/tiwe-de/hadori
        License: GPL3+
    Description: Hardlinks identical files
 This might look like yet another hardlinking tool, but it is the only one
 which only memorizes one filename per inode. That results in less memory
 consumption and faster execution compared to its alternatives. Therefore
 (and because all the other names are already taken) it's called
 "HArdlinking DOne RIght".
 .
 Advantages over other hardlinking tools:
  * predictability: arguments are scanned in order, each first version is kept
  * much lower CPU and memory consumption
  * hashing option: speedup on many equal-sized, mostly identical files

The initial comparison was with hardlink, which got OOM killed with a hundred 
backups of my home directory. Last night I compared it to duff and rdfind 
which would have happily linked files with different st_mtime and st_mode.

I need a sponsor. I'll upload it to mentors.d.n as soon as I get the bug 
number.


Greetings
Timo
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#662080; Package wnpp. (Sun, 04 Mar 2012 00:09:05 GMT) Full text and rfc822 format available.

Acknowledgement sent to Julian Andres Klode <jak@debian.org>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Sun, 04 Mar 2012 00:09:06 GMT) Full text and rfc822 format available.

Message #10 received at 662080@bugs.debian.org (full text, mbox):

From: Julian Andres Klode <jak@debian.org>
To: Timo Weingärtner <timo@tiwe.de>, 662080@bugs.debian.org
Subject: Re: Bug#662080: ITP: hadori -- Hardlinks identical files
Date: Sun, 4 Mar 2012 01:07:42 +0100
[Message part 1 (text/plain, inline)]
On Sun, Mar 04, 2012 at 12:31:16AM +0100, Timo Weingärtner wrote:
> Package: wnpp
> Severity: wishlist
> X-Debbugs-CC: debian-devel@lists.debian.org
> 
>    Package name: hadori
>         Version: 0.2
> Upstream Author: Timo Weingärtner <timo@tiwe.de>
>             URL: https://github.com/tiwe-de/hadori
>         License: GPL3+
>     Description: Hardlinks identical files
>  This might look like yet another hardlinking tool, but it is the only one
>  which only memorizes one filename per inode. That results in less memory
>  consumption and faster execution compared to its alternatives. Therefore
>  (and because all the other names are already taken) it's called
>  "HArdlinking DOne RIght".
>  .
>  Advantages over other hardlinking tools:
>   * predictability: arguments are scanned in order, each first version is kept
>   * much lower CPU and memory consumption
>   * hashing option: speedup on many equal-sized, mostly identical files
> 
> The initial comparison was with hardlink, which got OOM killed with a hundred 
> backups of my home directory. Last night I compared it to duff and rdfind 
> which would have happily linked files with different st_mtime and st_mode.

You might want to try hardlink 0.2~rc1. In any case, I don't think we need
yet another such tool in the archive. If you want that algorithm, we can
implement it in hardlink 0.2 using probably about 10 lines. I had that
locally and it works, so if you want it, we can add it and avoid the
need for one more hack in that space.

hardlink 0.2 is written in C, and uses a binary tree to map
(dev_t, off_t) to a struct file which contains the stat information
plus name for linking. It requires two allocations per file, one for
the struct file with the filename, and one for the node in the tree
(well, actually we only need the node for the first file with a
 specific (dev_t, off_t) tuple). A node has 3 pointers.
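
For illustration only, a rough C++ analogue of that bookkeeping; the container
choice and the field names are my guesses from the description above, not
hardlink's actual code:

#include <sys/stat.h>
#include <sys/types.h>
#include <map>
#include <string>
#include <utility>

// Rough analogue of the layout described above: stat information plus one
// name kept for linking, indexed by a (dev_t, off_t) key in an ordered
// (tree-based) container.
struct file {
    struct stat st;    // full stat information
    std::string name;  // the name used when creating hard links later
};

// std::multimap is tree-based, mirroring the binary tree mentioned above;
// several files may end up under the same (dev_t, off_t) key.
std::multimap<std::pair<dev_t, off_t>, file> files;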

-- 
Julian Andres Klode  - Debian Developer, Ubuntu Member

See http://wiki.debian.org/JulianAndresKlode and http://jak-linux.org/.
[Message part 2 (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#662080; Package wnpp. (Sun, 04 Mar 2012 02:21:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Timo Weingärtner <timo@tiwe.de>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Sun, 04 Mar 2012 02:21:03 GMT) Full text and rfc822 format available.

Message #15 received at 662080@bugs.debian.org (full text, mbox):

From: Timo Weingärtner <timo@tiwe.de>
To: Julian Andres Klode <jak@debian.org>
Cc: 662080@bugs.debian.org
Subject: Re: Bug#662080: ITP: hadori -- Hardlinks identical files
Date: Sun, 4 Mar 2012 03:18:50 +0100
[Message part 1 (text/plain, inline)]
Hello Julian Andres,

On 2012-03-04 at 01:07:42 you wrote:
> On Sun, Mar 04, 2012 at 12:31:16AM +0100, Timo Weingärtner wrote:

> > The initial comparison was with hardlink, which got OOM killed with a
> > hundred backups of my home directory. Last night I compared it to duff
> > and rdfind which would have happily linked files with different st_mtime
> > and st_mode.
> 
> You might want to try hardlink 0.2~rc1. In any case, I don't think we need
> yet another such tool in the archive. If you want that algorithm, we can
> implement it in hardlink 0.2 using probably about 10 lines. I had that
> locally and it works, so if you want it, we can add it and avoid the
> need for one more hack in that space.

And why is lighttpd in the archive? Apache can do the same ...

> hardlink 0.2 is written in C, and uses a binary tree to map
> (dev_t, off_t) to a struct file which contains the stat information
> plus name for linking. It requires two allocations per file, one for
> the struct file with the filename, and one for the node in the tree
> (well, actually we only need the node for the first file with a
>  specific (dev_t, off_t) tuple). A node has 3 pointers.

The "hardlink" I used at that time was written in python and definitely didn't 
do it the way I want.

hadori is written in C++11 which IMHO makes it look a little more readable. It 
started with tree based map and multimap, now it uses the unordered_ (hash 
based) versions which made it twice as fast in a typical workload.

The main logic is in handle_file in hadori.C and uses:

std::unordered_map<ino_t, inode const> kept;
std::unordered_map<ino_t, ino_t> to_link;
std::unordered_multimap<off_t, ino_t> sizes;

class inode contains a struct stat, a file name and an adler checksum, but I 
plan to drop the last one because I think the hashing option is no great gain.
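
For illustration, a rough reconstruction of such a keep-first handle_file()
around those maps; the simplified inode struct and the compare_contents() and
do_link() helpers below are placeholders of mine, not the actual hadori code:

#include <string>
#include <sys/stat.h>
#include <sys/types.h>
#include <unordered_map>

struct inode { std::string name; struct stat st; };                // simplified stand-in
static bool compare_contents(inode const &, std::string const &)  // placeholder for the
{ return false; }                                                  // real content comparison
static void do_link(inode const &, std::string const &) {}        // placeholder for linking

static std::unordered_map<ino_t, inode const> kept;
static std::unordered_map<ino_t, ino_t> to_link;
static std::unordered_multimap<off_t, ino_t> sizes;

void handle_file(std::string const & name, struct stat const & st)
{
    if (kept.count(st.st_ino))                    // a name of this inode is already kept
        return;
    auto linked = to_link.find(st.st_ino);
    if (linked != to_link.end()) {                // inode was matched earlier:
        do_link(kept.at(linked->second), name);   // just link this name as well
        return;
    }
    auto range = sizes.equal_range(st.st_size);   // only files of equal size can match
    for (auto it = range.first; it != range.second; ++it) {
        inode const & candidate = kept.at(it->second);
        if (compare_contents(candidate, name)) {  // identical content and metadata
            to_link.emplace(st.st_ino, it->second);
            do_link(candidate, name);             // replace name with a hard link
            return;
        }
    }
    kept.emplace(st.st_ino, inode{name, st});     // first of its kind: keep it
    sizes.emplace(st.st_size, st.st_ino);
}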


Regards
Timo
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#662080; Package wnpp. (Sun, 04 Mar 2012 06:03:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Goswin von Brederlow <goswin-v-b@web.de>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Sun, 04 Mar 2012 06:03:03 GMT) Full text and rfc822 format available.

Message #20 received at submit@bugs.debian.org (full text, mbox):

From: Goswin von Brederlow <goswin-v-b@web.de>
To: Timo Weingärtner <timo@tiwe.de>
Cc: 662080@bugs.debian.org, submit@bugs.debian.org
Subject: Re: Bug#662080: ITP: hadori -- Hardlinks identical files
Date: Sun, 04 Mar 2012 07:00:13 +0100
Timo Weingärtner <timo@tiwe.de> writes:

> Package: wnpp
> Severity: wishlist
> X-Debbugs-CC: debian-devel@lists.debian.org
>
>    Package name: hadori
>         Version: 0.2
> Upstream Author: Timo Weingärtner <timo@tiwe.de>
>             URL: https://github.com/tiwe-de/hadori
>         License: GPL3+
>     Description: Hardlinks identical files
>  This might look like yet another hardlinking tool, but it is the only one
>  which only memorizes one filename per inode. That results in less memory
>  consumption and faster execution compared to its alternatives. Therefore
>  (and because all the other names are already taken) it's called
>  "HArdlinking DOne RIght".
>  .
>  Advantages over other hardlinking tools:
>   * predictability: arguments are scanned in order, each first version is kept
>   * much lower CPU and memory consumption
>   * hashing option: speedup on many equal-sized, mostly identical files
>
> The initial comparison was with hardlink, which got OOM killed with a hundred 
> backups of my home directory. Last night I compared it to duff and rdfind 
> which would have happily linked files with different st_mtime and st_mode.
>
> I need a sponsor. I'll upload it to mentors.d.n as soon as I get the bug 
> number.
>
>
> Greetings
> Timo

I've been thinking about the problem of memory consumption too. But I've
come to a different solution. One that doesn't need memory at all.

Instead of remembering inodes, filenames and checksums create a global
cache (e.g. directory hierarchy like .cache/<start of hash>/<hash>)
and hardlink every file to there. If you want/need to include uid, gid,
mtime, mode in there then make that part of the .cache path.

Garbage collection in the cache would be removing all files with a link
count of 1.

Going one step further link files with unique size [uid, gid, mtime,
...] to .cache/<size> and change that into .cache/<size>/<start of
hash>/<hash> when you find a second file with the same size that isn't
identical. That would save on the expensive hashing of clearly unique
files.
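
Purely as a sketch of the basic layout (without the size-first refinement);
the path scheme and function names below are made up and error handling is
mostly omitted:

#include <string>
#include <sys/stat.h>   // mkdir()
#include <unistd.h>     // link()

// Map a content hash to a cache path like ".cache/ab/abcdef...", then try to
// hardlink the file there. If the link succeeds the file is the first of its
// kind; if it fails with EEXIST an identical file is already cached and the
// caller would replace `filename` with a link to the cache entry instead.
std::string cache_path(std::string const & hexdigest)
{
    return ".cache/" + hexdigest.substr(0, 2) + "/" + hexdigest;
}

bool add_to_cache(std::string const & filename, std::string const & hexdigest)
{
    mkdir(".cache", 0700);                                      // ignore EEXIST
    mkdir((".cache/" + hexdigest.substr(0, 2)).c_str(), 0700);  // ignore EEXIST
    return link(filename.c_str(), cache_path(hexdigest).c_str()) == 0;
}

// Garbage collection then simply removes every cache entry whose st_nlink == 1.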

You could also use a hash that computes the first byte from the first
4k, second byte from 64k, third from 1 MB and so on. That way you can
check if the beginnings of 2 files match without having to checksum the
whole file or literally compare the two.
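
As a sketch of such a prefix hash: the block sizes follow the idea above;
using zlib's crc32() as the underlying checksum is just an arbitrary choice
for the example:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>
#include <zlib.h>   // crc32(); any incremental checksum would do

// One byte per exponentially growing prefix: byte 0 summarizes the first
// 4 KiB, byte 1 the first 64 KiB, byte 2 the first 1 MiB, and so on. Two
// files can only be identical if these bytes agree, so a mismatch in an
// early byte rules them out after reading only a small prefix.
std::vector<std::uint8_t> prefix_hash(std::string const & filename)
{
    std::vector<std::uint8_t> out;
    std::ifstream in(filename, std::ios::binary);
    std::vector<char> buf(4096);
    uLong crc = crc32(0L, Z_NULL, 0);
    std::size_t limit = 4096, total = 0;
    while (in) {
        std::size_t want = std::min(buf.size(), limit - total);
        in.read(buf.data(), static_cast<std::streamsize>(want));
        std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0)
            break;
        crc = crc32(crc, reinterpret_cast<Bytef const *>(buf.data()),
                    static_cast<uInt>(got));
        total += got;
        if (total == limit) {
            out.push_back(static_cast<std::uint8_t>(crc & 0xff)); // one byte per prefix
            limit *= 16;                                          // 4 KiB, 64 KiB, 1 MiB, ...
        }
    }
    out.push_back(static_cast<std::uint8_t>(crc & 0xff));         // covers the final, partial prefix
    return out;
}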

Regards
        Goswin




Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#662080; Package wnpp. (Sun, 04 Mar 2012 06:03:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Goswin von Brederlow <goswin-v-b@web.de>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Sun, 04 Mar 2012 06:03:06 GMT) Full text and rfc822 format available.

Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#662080; Package wnpp. (Sun, 04 Mar 2012 11:35:19 GMT) Full text and rfc822 format available.

Acknowledgement sent to Julian Andres Klode <jak@debian.org>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Sun, 04 Mar 2012 11:36:05 GMT) Full text and rfc822 format available.

Message #30 received at 662080@bugs.debian.org (full text, mbox):

From: Julian Andres Klode <jak@debian.org>
To: Timo Weingärtner <timo@tiwe.de>
Cc: 662080@bugs.debian.org
Subject: Re: Bug#662080: ITP: hadori -- Hardlinks identical files
Date: Sun, 4 Mar 2012 12:31:39 +0100
[Message part 1 (text/plain, inline)]
On Sun, Mar 04, 2012 at 03:18:50AM +0100, Timo Weingärtner wrote:
> Hello Julian Andres,
> 
> On 2012-03-04 at 01:07:42 you wrote:
> > On Sun, Mar 04, 2012 at 12:31:16AM +0100, Timo Weingärtner wrote:
> 
> > > The initial comparison was with hardlink, which got OOM killed with a
> > > hundred backups of my home directory. Last night I compared it to duff
> > > and rdfind which would have happily linked files with different st_mtime
> > > and st_mode.
> > 
> > You might want to try hardlink 0.2~rc1. In any case, I don't think we need
> > yet another such tool in the archive. If you want that algorithm, we can
> > implement it in hardlink 0.2 using probably about 10 lines. I had that
> > locally and it works, so if you want it, we can add it and avoid the
> > need for one more hack in that space.
> 
> And why is lighttpd in the archive? Apache can do the same ...

Can it? Apache uses threads, lighttpd uses an event loop for
processing. That's a different topic though.

But in any case, avoiding yet another tool with the same security
issues (CVE-2011-3632) and bugs (and more bugs) as we currently
have would be a good idea.

hadori bugs:
  - Race, possible data loss: Calls unlink() before link(). If
    link() fails the data might be lost (best solution appears
    to be to link to a temporary file in the target directory
    and then rename to target name, making the replacement
    atomic; see the sketch below)
  - Error checking: Errors when opening files or reading
    files are not checked (ifstream uses the failbit and
    stuff).
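
For the first point, a minimal sketch of the link-to-temporary-then-rename
replacement described there; the temporary naming below is illustrative and
collision handling is omitted:

#include <cstdio>    // std::rename
#include <string>
#include <unistd.h>  // link(), unlink()

// Replace `target` with a hard link to `source` without a window in which
// `target` is missing: link to a temporary name first, then rename over the
// target, which replaces it atomically on POSIX filesystems.
bool replace_with_link(std::string const & source, std::string const & target)
{
    std::string tmp = target + ".tmp";              // illustrative; real code would
                                                    // pick a collision-free name
    if (link(source.c_str(), tmp.c_str()) != 0)
        return false;                               // target left untouched
    if (std::rename(tmp.c_str(), target.c_str()) != 0) {
        unlink(tmp.c_str());                        // clean up; target still intact
        return false;
    }
    return true;
}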

Common security issue, same as CVE-2011-3632 for Fedora's hardlink:
	[Unsafe operations on changing trees]
  - If a regular file is replaced by a non-regular one before an
    open() for reading, the program reads from a non-regular file
  - A source file is replaced by one file with different owner
    or permissions after the stat() and before the link()
  - A component of the path is replaced by a symbolic link after
    the initial stat()ing and readdir()ing. An attacker may use
    that to write outside of the intended directory.

(Fixed in Fedora's hardlink, and my hardlink by adding a section
 to the manual page stating that it is not safe to run the
 program on changing trees).

Possibly hardlink only bugs:
   - Exaggeration of sizes. hardlink currently counts every
     link replaced -st_size, even if st_nlink > 1. I don't
     know what hadori does there. (and it requires more work
     to fix that in hardlink, as we currently do not combine
     links to the same inode in any way, and likely want
     --dry-run to work correctly as well).

You can also drop your race check. The tool is unsafe on
changing trees anyway, so you don't need to check whether
someone else deleted the file, especially if you're then
linking to it anyway.

> 	
> > hardlink 0.2 is written in C, and uses a binary tree to map
> > (dev_t, off_t) to a struct file which contains the stat information
> > plus name for linking. It requires two allocations per file, one for
> > the struct file with the filename, and one for the node in the tree
> > (well, actually we only need the node for the first file with a
> >  specific (dev_t, off_t) tuple). A node has 3 pointers.
> 
> The "hardlink" I used at that time was written in python and definitely didn't 
> do it the way I want.

I know. I knew that there were problems on large trees in 2009, but got
nowhere with a fix in Python. We still have the two passes in hardlink
and thus need to keep all the files currently, as I did not carry the
link-first mode over from my temporary C++ rewrite, as memory usage
was not much different in my test case. But as my test case was just
running on /, the whole thing may not be representative. If there
are lots of duplicates, link-first can definitely help.

The one that works exactly as you want is most likely Fedora's hardlink.

> 
> hadori is written in C++11 which IMHO makes it look a little more readable. 

Yes. It looks readable, but also has far fewer features than hardlink (which
were added to hardlink because of user requests). You can get it even shorter
and more efficient if you use POSIX's nftw() to walk the directory tree
instead of your recurse()/recurse_start() pair (you just need to write
one callback that takes a complete file path, a type info, a stat buffer,
and the basename of the file). Does not work on Haiku though, or BSDs
prior to 2002. Alternatively, there's also fts() if you don't care about
the System V line of UNIXes (e.g. Solaris).

You can tell both not to cross mount points and not to follow symbolic
links, which makes the whole thing very easy.
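
For reference, a small sketch of such an nftw() walk (the callback body is a
placeholder; FTW_PHYS and FTW_MOUNT give exactly the two guarantees just
mentioned):

#include <cstdio>
#include <ftw.h>
#include <sys/stat.h>

// nftw() hands every entry to one callback together with its stat buffer,
// a type flag and the offset of the basename within the path.
static int visit(char const * path, struct stat const * st,
                 int typeflag, struct FTW * ftw)
{
    if (typeflag == FTW_F)                          // regular files only
        std::printf("%lld %s (basename: %s)\n",
                    static_cast<long long>(st->st_size), path, path + ftw->base);
    return 0;                                       // non-zero would stop the walk
}

int main(int argc, char ** argv)
{
    // FTW_PHYS: do not follow symbolic links; FTW_MOUNT: stay on one filesystem.
    return nftw(argc > 1 ? argv[1] : ".", visit, 32, FTW_PHYS | FTW_MOUNT) ? 1 : 0;
}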

> It 
> started with tree based map and multimap, now it uses the unordered_ (hash 
> based) versions which made it twice as fast in a typical workload.

That's strange. In my (not published) C++ version of hardlink, unordered
(multi) maps were only slightly faster than ordered ones. I then rewrote
the code in C to make it more readable to the common DD who does not
want to work with C++, and more portable.

And it does not seem correct if you spend so much time in the map, at
least not without caching. And normally, you most likely do not have
the tree(s) you're hardlinking on cached.

> 
> The main logic is in handle_file in hadori.C and uses:
> 
> std::unordered_map<ino_t, inode const> kept;
> std::unordered_map<ino_t, ino_t> to_link;
> std::unordered_multimap<off_t, ino_t> sizes;
> 
> class inode contains a struct stat, a file name and an adler checksum, but I 
> plan to drop the last one because I think the hashing option is no great gain.

I basically have the equivalent of a multimap<off_t, inode> in my code. For
hashing, you may want to use crc32 instead of adler32. Adler32 likes to have
collisions (mostly on small sizes, though, but still, far more collisions
than CRC32). Performance is not really different; you'll spend far more
time in read()ing anyway.
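
To illustrate how cheap the swap is (assuming zlib's adler32() and crc32()):
both checksums are updated incrementally with the same calling convention, so
switching is a one-line change:

#include <zlib.h>   // adler32() and crc32() share the same calling convention

// Seed with adler32(0L, Z_NULL, 0) or crc32(0L, Z_NULL, 0), then feed each
// buffer read from the file through one of the two update calls.
uLong update_checksum(uLong state, unsigned char const * buf, unsigned len, bool use_crc)
{
    return use_crc ? crc32(state, buf, len)     // fewer collisions on short inputs
                   : adler32(state, buf, len);
}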


-- 
Julian Andres Klode  - Debian Developer, Ubuntu Member

See http://wiki.debian.org/JulianAndresKlode and http://jak-linux.org/.

[Message part 2 (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#662080; Package wnpp. (Sun, 04 Mar 2012 11:51:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Julian Andres Klode <jak@debian.org>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Sun, 04 Mar 2012 11:51:08 GMT) Full text and rfc822 format available.

Message #35 received at 662080@bugs.debian.org (full text, mbox):

From: Julian Andres Klode <jak@debian.org>
To: Goswin von Brederlow <goswin-v-b@web.de>, 662080@bugs.debian.org
Cc: Timo Weingärtner <timo@tiwe.de>
Subject: Re: Bug#662080: ITP: hadori -- Hardlinks identical files
Date: Sun, 4 Mar 2012 12:47:27 +0100
[Message part 1 (text/plain, inline)]
On Sun, Mar 04, 2012 at 07:00:13AM +0100, Goswin von Brederlow wrote:
> Timo Weingärtner <timo@tiwe.de> writes:
> 
> > Package: wnpp
> > Severity: wishlist
> > X-Debbugs-CC: debian-devel@lists.debian.org
> >
> >    Package name: hadori
> >         Version: 0.2
> > Upstream Author: Timo Weingärtner <timo@tiwe.de>
> >             URL: https://github.com/tiwe-de/hadori
> >         License: GPL3+
> >     Description: Hardlinks identical files
> >  This might look like yet another hardlinking tool, but it is the only one
> >  which only memorizes one filename per inode. That results in less memory
> >  consumption and faster execution compared to its alternatives. Therefore
> >  (and because all the other names are already taken) it's called
> >  "HArdlinking DOne RIght".
> >  .
> >  Advantages over other hardlinking tools:
> >   * predictability: arguments are scanned in order, each first version is kept
> >   * much lower CPU and memory consumption
> >   * hashing option: speedup on many equal-sized, mostly identical files
> >
> > The initial comparison was with hardlink, which got OOM killed with a hundred 
> > backups of my home directory. Last night I compared it to duff and rdfind 
> > which would have happily linked files with different st_mtime and st_mode.
> >
> > I need a sponsor. I'll upload it to mentors.d.n as soon as I get the bug 
> > number.
> >
> >
> > Greetings
> > Timo
> 
> I've been thinking about the problem of memory consumption too. But I've
> come to a different solution. One that doesn't need memory at all.

I know yet another solution. For each file you visit, you simply visit
the complete tree. Then you have n + 1 visits, but get constant space
usage.

> 
> Instead of remembering inodes, filenames and checksums create a global
> cache (e.g. directory hierarchy like .cache/<start of hash>/<hash>)
> and hardlink every file to there. If you want/need to include uid, gid,
> mtime, mode in there then make that part of the .cache path.
> 
> Garbage collection in the cache would be removing all files with a link
> count of 1.
> 
> Going one step further link files with unique size [uid, gid, mtime,
> ...] to .cache/<size> and change that into .cache/<size>/<start of
> hash>/<hash> when you find a second file with the same size that isn't
> identical. That would save on the expensive hashing of clearly unique
> files.

So implement an object store and replace files outside the object
store with hardlinks to the store. Yes, this is guaranteed to work
for some cases, but also has problems. If you create files first, and
then move them to the store, you still need to check every file with
link count != 1 and check whether it is in the cache already. And for
this, you need a lookup by inode if you want to avoid hashing.

And this is basically the same hierarchy as git has:
	.git/objects/first 2 hex digits of sha1sum/remaining sha1sum

> 
> You could also use a hash that computes the first byte from the first
> 4k, second byte from 64k, third from 1 MB and so on. That way you can
> check if the beginnings of 2 files match without having to checksum the
> whole file or literally compare the two.

If the beginning can match. They're not guaranteed to match just because
the hashes match.

-- 
Julian Andres Klode  - Debian Developer, Ubuntu Member

See http://wiki.debian.org/JulianAndresKlode and http://jak-linux.org/.
[Message part 2 (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#662080; Package wnpp. (Sun, 04 Mar 2012 18:30:14 GMT) Full text and rfc822 format available.

Acknowledgement sent to Timo Weingärtner <timo@tiwe.de>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Sun, 04 Mar 2012 18:30:14 GMT) Full text and rfc822 format available.

Message #40 received at 662080@bugs.debian.org (full text, mbox):

From: Timo Weingärtner <timo@tiwe.de>
To: Julian Andres Klode <jak@debian.org>
Cc: 662080@bugs.debian.org
Subject: Re: Bug#662080: ITP: hadori -- Hardlinks identical files
Date: Sun, 4 Mar 2012 19:26:13 +0100
[Message part 1 (text/plain, inline)]
Hello Julian Andres,

On 2012-03-04 at 12:31:39 you wrote:
> But in any case, avoiding yet another tool with the same security
> issues (CVE-2011-3632) and bugs (and more bugs) as we currently
> have would be a good idea.
> 
> hadori bugs:
>   - Race, possible data loss: Calls unlink() before link(). If
>     link() fails the data might be lost (best solution appears
>     to be to link to a temporary file in the target directory
>     and then rename to target name, making the replacement
>     atomic)

I copied that from ln -f, which then has the same bug.

>   - Error checking: Errors when opening files or reading
>     files are not checked (ifstream uses the failbit and
>     stuff).

If only one of the files fails, nothing bad happens. If both fail, bad things
might happen; that's right.

> Common security issue, same as CVE-2011-3632 for Fedora's hardlink:
> 	[Unsafe operations on changing trees]
>   - If a regular file is replaced by a non-regular one before an
>     open() for reading, the program reads from a non-regular file
>   - A source file is replaced by one file with different owner
>     or permissions after the stat() and before the link()
>   - A component of the path is replaced by a symbolic link after
>     the initial stat()ing and readdir()ing. An attacker may use
>     that to write outside of the intended directory.
> 
> (Fixed in Fedora's hardlink, and my hardlink by adding a section
>  to the manual page stating that it is not safe to run the
>  program on changing trees).

I think that kind of bug will stay until it is possible to open/link by inode
number. Perhaps the *at() functions can help with the file currently being examined.
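
Something like the following, purely as a sketch of the idea; the helper name
is made up and the re-verification step is only hinted at:

#include <fcntl.h>     // AT_* constants, open() with O_DIRECTORY
#include <sys/stat.h>  // fstatat()
#include <unistd.h>    // linkat()

// Work relative to directory file descriptors, so a path component swapped
// for a symlink after the initial readdir() no longer redirects the calls.
bool link_in_dir(int srcdirfd, char const * srcname,
                 int dstdirfd, char const * tmpname)
{
    struct stat st;
    if (fstatat(srcdirfd, srcname, &st, AT_SYMLINK_NOFOLLOW) != 0)
        return false;
    // ... re-check here that st still describes the regular file we expect ...
    return linkat(srcdirfd, srcname, dstdirfd, tmpname, 0) == 0;
}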

Right now I only used it for my backups which are only accessible by me (and 
root).

> Possibly hardlink only bugs:
>    - Exaggeration of sizes. hardlink currently counts every
>      link replaced -st_size, even if st_nlink > 1. I don't
>      know what hadori does there.

hadori does not have statistics. They should be easy to add, but I had no use 
for them.

> You can also drop your race check. The tool is unsafe on
> changing trees anyway, so you don't need to check whether
> someone else deleted the file, especially if you're then
> linking to it anyway.

I wanted it to exit when something unexpected happens.

> I knew that there were problems on large trees in 2009, but got nowhere with
> a fix in Python. We still have the two passes in hardlink and thus need to
> keep all the files currently, as I did not carry the link-first mode over
> from my temporary C++ rewrite, as memory usage was not much different in my
> test case. But as my test case was just running on /, the whole thing may
> not be representative. If there are lots of duplicates, link-first can
> definitely help.
> 
> The one that works exactly as you want is most likely Fedora's hardlink.

I've searched for other implementations and all the others do two passes when 
one is obviously enough.

> Yes. It looks readable, but also has far fewer features than hardlink (which
> were added to hardlink because of user requests).

I still don't get what --maximize (and --minimize) are needed for. In my 
incremental full backup scenario I get best results with keep-first. When 
hardlinking only $last and $dest (see below) even --maximize can disconnect 
files from older backups.

> > It
> > started with tree based map and multimap, now it uses the unordered_
> > (hash based) versions which made it twice as fast in a typical workload.
> 
> That's strange. In my (not published) C++ version of hardlink, unordered
> (multi) maps were only slightly faster than ordered ones. I then rewrote
> the code in C to make it more readable to the common DD who does not
> want to work with C++, and more portable.
> 
> And it does not seem correct if you spend so much time in the map, at
> least not without caching. And normally, you most likely do not have
> the tree(s) you're hardlinking on cached.

I have, because I usually run:
$ rsync -aH $source $dest --link-dest $last
$ hadori $last $dest


Regards
Timo
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#662080; Package wnpp. (Sun, 04 Mar 2012 19:00:05 GMT) Full text and rfc822 format available.

Acknowledgement sent to Julian Andres Klode <jak@debian.org>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Sun, 04 Mar 2012 19:00:05 GMT) Full text and rfc822 format available.

Message #45 received at 662080@bugs.debian.org (full text, mbox):

From: Julian Andres Klode <jak@debian.org>
To: Timo Weingärtner <timo@tiwe.de>
Cc: 662080@bugs.debian.org
Subject: Re: Bug#662080: ITP: hadori -- Hardlinks identical files
Date: Sun, 4 Mar 2012 19:57:36 +0100
[Message part 1 (text/plain, inline)]
On Sun, Mar 04, 2012 at 07:26:13PM +0100, Timo Weingärtner wrote:
> Hello Julian Andres,
> 
> On 2012-03-04 at 12:31:39 you wrote:
> > But in any case, avoiding yet another tool with the same security
> > issues (CVE-2011-3632) and bugs (and more bugs) as we currently
> > have would be a good idea.
> > 
> > hadori bugs:
> >   - Race, possible data loss: Calls unlink() before link(). If
> >     link() fails the data might be lost (best solution appears
> >     to be to link to a temporary file in the target directory
> >     and then rename to target name, making the replacement
> >     atomic)
> 
> I copied that from ln -f, which then has the same bug.

Could be. The other way is definitely safer.

> 
> >   - Error checking: Errors when opening files or reading
> >     files are not checked (ifstream uses the failbit and
> >     stuff).
> 
> If only one of the files fails, nothing bad happens. If both fail, bad things
> might happen; that's right.

> 
> > Common security issue, same as CVE-2011-3632 for Fedora's hardlink:
> > 	[Unsafe operations on changing trees]
> >   - If a regular file is replaced by a non-regular one before an
> >     open() for reading, the program reads from a non-regular file
> >   - A source file is replaced by one file with different owner
> >     or permissions after the stat() and before the link()
> >   - A component of the path is replaced by a symbolic link after
> >     the initial stat()ing and readdir()ing. An attacker may use
> >     that to write outside of the intended directory.
> > 
> > (Fixed in Fedora's hardlink, and my hardlink by adding a section
> >  to the manual page stating that it is not safe to run the
> >  program on changing trees).
> 
> I think that kind of bug will stay until it is possible to open/link by inode
> number. Perhaps the *at() functions can help with the file currently being examined.

Nobody said they have to be fixed. As I wrote, the "fix" was to mention
it in the manpage.

> 
> Right now I only used it for my backups which are only accessible by me (and 
> root).
> 
> > Possibly hardlink only bugs:
> >    - Exaggeration of sizes. hardlink currently counts every
> >      link replaced -st_size, even if st_nlink > 1. I don't
> >      know what hadori does there.
> 
> hadori does not have statistics. They should be easy to add, but I had no use 
> for them.
> 
> > You can also drop your race check. The tool is unsafe on
> > changing trees anyway, so you don't need to check whether
> > someone else deleted the file, especially if you're then
> > linking to it anyway.
> 
> I wanted it to exit when something unexpected happens.

But then there are just other cases where you don't, like
opening files. And also, users will complain if you exit
just because one file has a problem. They expect the
program to continue with the remaining ones (at least they
expected this in my audio conversion script, so I do this
in hardlink as well).

> 
> > I knew that there were problems on large trees in 2009, but got nowhere with
> > a fix in Python. We still have the two passes in hardlink and thus need to
> > keep all the files currently, as I did not carry the link-first mode over
> > from my temporary C++ rewrite, as memory usage was not much different in my
> > test case. But as my test case was just running on /, the whole thing may
> > not be representative. If there are lots of duplicates, link-first can
> > definitely help.
> > 
> > The one that works exactly as you want is most likely Fedora's hardlink.
> 
> I've searched for other implementations and all the others do two passes when 
> one is obviously enough.

Fedora's hardlink should do one pass only and link the first file to later
files. It's fairly simple code. But nobody apart from Fedora and some
other RPM distributions uses it.

> 
> > Yes. It looks readable, but also has far fewer features than hardlink (which
> > were added to hardlink because of user requests).
> 
> I still don't get what --maximize (and --minimize) are needed for. In my 
> incremental full backup scenario I get best results with keep-first. When 
> hardlinking only $last and $dest (see below) even --maximize can disconnect 
> files from older backups.

That's to be expected. I don't know the reasons either. I thought
they came from hardlinkpy (of which hardlink was a rewrite with
the goal of increasing object-orientedness, and readability). But
it seems they didn't. It turns out that this was a user request,
or actually a feature of that user's fork of hardlinkpy:

  **** LOG STARTED AT Fri Dec 19 15:48:39 2008

  Dez 19 18:54:56 <chf> If need be, it should be merged, because the
                        modification I need also compares the link
                        count of the files and always keeps the one
                        with the higher count, replacing the one with
                        the lower count by a hardlink to the other.

(translated from the original German IRC log)

I think he probably ran this on all of his backups at once. And it
makes sense if you run it on a single directory as well. The
option that does not make sense at all is --minimize, though;
I don't know why anyone would want this. Historical garbage, in
my opinion.


> 
> > > It
> > > started with tree based map and multimap, now it uses the unordered_
> > > (hash based) versions which made it twice as fast in a typical workload.
> > 
> > That's strange. In my (not published) C++ version of hardlink, unordered
> > (multi) maps were only slightly faster than ordered ones. I then rewrote
> > the code in C to make it more readable to the common DD who does not
> > want to work with C++, and more portable.
> > 
> > And it does not seem correct if you spend so much time in the map, at
> > least not without caching. And normally, you most likely do not have
> > the tree(s) you're hardlinking on cached.
> 
> I have, because I usually run:
> $ rsync -aH $source $dest --link-dest $last
> $ hadori $last $dest

OK, but then you either have large amounts of RAM or not
much to compare. If I read one million files, my cache is
practically useless afterwards and only takes care of some
of them, but most have to be re-read.

-- 
Julian Andres Klode  - Debian Developer, Ubuntu Member

See http://wiki.debian.org/JulianAndresKlode and http://jak-linux.org/.
[Message part 2 (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org:
Bug#662080; Package wnpp. (Mon, 05 Mar 2012 00:03:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Goswin von Brederlow <goswin-v-b@web.de>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org. (Mon, 05 Mar 2012 00:03:06 GMT) Full text and rfc822 format available.

Message #50 received at 662080@bugs.debian.org (full text, mbox):

From: Goswin von Brederlow <goswin-v-b@web.de>
To: Julian Andres Klode <jak@debian.org>
Cc: Goswin von Brederlow <goswin-v-b@web.de>, 662080@bugs.debian.org, Timo Weingärtner <timo@tiwe.de>
Subject: Re: Bug#662080: ITP: hadori -- Hardlinks identical files
Date: Mon, 05 Mar 2012 01:01:43 +0100
Julian Andres Klode <jak@debian.org> writes:

> On Sun, Mar 04, 2012 at 07:00:13AM +0100, Goswin von Brederlow wrote:
>> Timo Weingärtner <timo@tiwe.de> writes:
>> 
>> > Package: wnpp
>> > Severity: wishlist
>> > X-Debbugs-CC: debian-devel@lists.debian.org
>> >
>> >    Package name: hadori
>> >         Version: 0.2
>> > Upstream Author: Timo Weingärtner <timo@tiwe.de>
>> >             URL: https://github.com/tiwe-de/hadori
>> >         License: GPL3+
>> >     Description: Hardlinks identical files
>> >  This might look like yet another hardlinking tool, but it is the only one
>> >  which only memorizes one filename per inode. That results in less memory
>> >  consumption and faster execution compared to its alternatives. Therefore
>> >  (and because all the other names are already taken) it's called
>> >  "HArdlinking DOne RIght".
>> >  .
>> >  Advantages over other hardlinking tools:
>> >   * predictability: arguments are scanned in order, each first version is kept
>> >   * much lower CPU and memory consumption
>> >   * hashing option: speedup on many equal-sized, mostly identical files
>> >
>> > The initial comparison was with hardlink, which got OOM killed with a hundred 
>> > backups of my home directory. Last night I compared it to duff and rdfind 
>> > which would have happily linked files with different st_mtime and st_mode.
>> >
>> > I need a sponsor. I'll upload it to mentors.d.n as soon as I get the bug 
>> > number.
>> >
>> >
>> > Greetings
>> > Timo
>> 
>> I've been thinking about the problem of memory consumption too. But I've
>> come to a different solution. One that doesn't need memory at all.
>
> I know yet another solution. For each file you visit, you simply visit
> the complete tree. Then you have n + 1 visits, but get constant space
> usage.
>
>> 
>> Instead of remembering inodes, filenames and checksums create a global
>> cache (e.g. directory hierarchy like .cache/<start of hash>/<hash>)
>> and hardlink every file to there. If you want/need to include uid, gid,
>> mtime, mode in there then make that part of the .cache path.
>> 
>> Garbage collection in the cache would be removing all files with a link
>> count of 1.
>> 
>> Going one step further link files with unique size [uid, gid, mtime,
>> ...] to .cache/<size> and change that into .cache/<size>/<start of
>> hash>/<hash> when you find a second file with the same size that isn't
>> identical. That would save on the expensive hashing of clearly unique
>> files.
>
> So implement an object store and replace files outside the object
> store with hardlinks to the store. Yes, this is guaranteed to work
> for some cases, but also has problems. If you create files first, and
> then move them to the store, you still need to check every file with
> link count != 1 and check whether it is in the cache already. And for
> this, you need a lookup by inode if you want to avoid hashing.
>
> And this is basically the same hierarchy as git has:
> 	.git/objects/first 2 hex digits of sha1sum/remaining sha1sum

In the above every file is in the cache. A link count of 1 would
indicate a new file that hasn't been processed yet. Unfortunately you
can also have files with link count != 1 that aren't processed yet,
e.g. 2 new files that are hardlinked to each other.

Detecting whether a file is already in the cache or not actually needs to
check 2 things:

1) link count == 1
   => new file, add to cache
2) link count != 1 but hash of file not known (e.g. extended attribute
   not set)
   => new set of files that are hardlinks to each other

Actually the link count can be completely ignored if you always add a
flag when you've processed a file (sketched below).

Note: The above wastes time in the 2nd case since it would checksum all
the files that are hardlinks one by one and replace them with hardlinks
into the cache. But you could remember the inode and name of the first
occurrence. This would only use up memory proportional to the number of
new inodes.
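
A sketch of such a "processed" marker using a user extended attribute
(Linux-specific <sys/xattr.h>); the attribute name is made up:

#include <cstring>       // std::strlen
#include <sys/xattr.h>   // getxattr(), setxattr()

// A file is new for the cache if it carries no marker attribute yet,
// regardless of its link count. "user.dedup.hash" is an illustrative name.
bool already_processed(char const * path)
{
    char buf[64];
    return getxattr(path, "user.dedup.hash", buf, sizeof buf) >= 0;
}

void mark_processed(char const * path, char const * hexdigest)
{
    setxattr(path, "user.dedup.hash", hexdigest, std::strlen(hexdigest), 0);  // 0: create or replace
}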

>> You could also use a hash that computes the first byte from the first
>> 4k, second byte from 64k, third from 1 MB and so on. That way you can
>> check if the beginnings of 2 files match without having to checksum the
>> whole file or literally compare the two.
>
> If the beginning can match. They're not guaranteed to match just because
> the hashes match.

This wouldn't be to prove identity but to quickly prove difference. If
the first 4k differ then the files will not match. It only makes sense if
you have a lot of big files of equal size.

Regards
        Goswin




Owner recorded as Timo Weingärtner <timo@tiwe.de>. Request was from Raphael Geissert <geissert@debian.org> to control@bugs.debian.org. (Mon, 26 Mar 2012 02:51:06 GMT) Full text and rfc822 format available.

Information forwarded to debian-bugs-dist@lists.debian.org, wnpp@debian.org, Timo Weingärtner <timo@tiwe.de>:
Bug#662080; Package wnpp. (Mon, 27 May 2013 14:20:18 GMT) Full text and rfc822 format available.

Acknowledgement sent to Lucas Nussbaum <lucas@debian.org>:
Extra info received and forwarded to list. Copy sent to wnpp@debian.org, Timo Weingärtner <timo@tiwe.de>. (Mon, 27 May 2013 14:20:19 GMT) Full text and rfc822 format available.

Message #57 received at 662080@bugs.debian.org (full text, mbox):

From: Lucas Nussbaum <lucas@debian.org>
To: 662080@bugs.debian.org
Cc: control@bugs.debian.org
Subject: hadori: changing back from ITP to RFP
Date: Mon, 27 May 2013 15:24:16 +0200
retitle 662080 RFP: hadori -- Hardlinks identical files
noowner 662080
tag 662080 - pending
thanks

Hi,

This is an automatic email to change the status of hadori back from ITP
(Intent to Package) to RFP (Request for Package), because this bug hasn't seen
any activity during the last 12 months.

If you are still interested in adopting hadori, please send a mail to
<control@bugs.debian.org> with:

 retitle 662080 ITP: hadori -- Hardlinks identical files
 owner 662080 !
 thanks

However, it is not recommended to keep ITP for a long time without acting on
the package, as it might cause other prospective maintainers to refrain from
packaging that software. It is also a good idea to document your progress on
this ITP from time to time, by mailing <662080@bugs.debian.org>.

Thank you for your interest in Debian,
-- 
Lucas, for the QA team <debian-qa@lists.debian.org>



Changed Bug title to 'RFP: hadori -- Hardlinks identical files' from 'ITP: hadori -- Hardlinks identical files' Request was from Lucas Nussbaum <lucas@debian.org> to control@bugs.debian.org. (Mon, 27 May 2013 14:26:10 GMT) Full text and rfc822 format available.

Removed annotation that Bug was owned by Timo Weingärtner <timo@tiwe.de>. Request was from Lucas Nussbaum <lucas@debian.org> to control@bugs.debian.org. (Mon, 27 May 2013 14:26:10 GMT) Full text and rfc822 format available.
