Debian Bug report logs - #217243
wget: Possibility to really reject files on recursive downloads


Package: wget; Maintainer for wget is Noël Köthe <noel@debian.org>; Source for wget is src:wget.

Reported by: Konstantin Seiler <list@kseiler.de>

Date: Thu, 23 Oct 2003 13:48:32 UTC

Severity: wishlist

Merged with 331613

Found in versions 1.8.1-6.1, wget/1.10.1-1



Report forwarded to Noel Koethe <noel@debian.org>:
Bug#217243; Package wget. Full text and rfc822 format available.

Acknowledgement sent to Konstantin Seiler <list@kseiler.de>:
New Bug report received and forwarded. Copy sent to Noel Koethe <noel@debian.org>.

Message #5 received at maintonly@bugs.debian.org (full text, mbox, reply):

From: Konstantin Seiler <list@kseiler.de>
To: Debian Bug Tracking System <maintonly@bugs.debian.org>
Subject: wget: Possibility to really reject files on recursive downloads
Date: Thu, 23 Oct 2003 23:42:28 +1000
Package: wget
Version: 1.8.1-6.1
Severity: wishlist

It should be possible to reject certain file types when downloading
recursively. The -R option makes wget download a matching file anyway, so
that new URLs can be extracted from it, and delete it afterwards.
There should be an extra option to prevent wget from downloading such
files at all.

In my situation I'm often downloading directories with an Apache-generated
index. Since that index can be sorted by each column in both directions, I
always get nine different pages:
index.html      index.html?D=D  index.html?M=D  index.html?N=D  index.html?S=D
index.html?D=A  index.html?M=A  index.html?N=A  index.html?S=A

Rejecting the suffixes with -R =A,=D does not have the desired effect of
speeding up the download, since the files are downloaded anyway.
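The behaviour being complained about can be sketched in Python (a simplified model written for this report, not wget's actual code): a -R list element containing a wildcard character is treated as a glob pattern, anything else as a filename suffix, and a rejected URL that might still be HTML is fetched anyway so its links can be harvested, then deleted.

```python
from fnmatch import fnmatch

WILDCARDS = set("*?[]")

def matches_reject(filename, rejlist):
    """Simplified model of wget's -A/-R matching: a list element that
    contains a wildcard character is treated as an fnmatch pattern,
    anything else as a plain filename suffix."""
    for elem in rejlist:
        if WILDCARDS & set(elem):
            if fnmatch(filename, elem):
                return True
        elif filename.endswith(elem):
            return True
    return False

def recursive_fetch_decision(filename, rejlist, may_be_html=True):
    """Model of the complaint: a rejected file that might still be HTML
    is downloaded anyway (so its links can feed the recursion) and is
    only deleted afterwards -- the bandwidth is already spent."""
    if not matches_reject(filename, rejlist):
        return "download and keep"
    if may_be_html:
        return "download, parse for links, then delete"
    return "skip"
```

With rejlist = ["=A", "=D"] every sorted-index variant matches by suffix, yet the decision is still "download, then delete", which is exactly why -R saves no bandwidth here.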


-- System Information
Debian Release: 3.0
Architecture: i386
Kernel: Linux kaymes 2.4.19kmsb #1 Sam Apr 5 17:15:16 EST 2003 i686
Locale: LANG=de_DE@euro, LC_CTYPE=de_DE@euro

Versions of packages wget depends on:
ii  libc6                         2.2.5-11.5 GNU C Library: Shared libraries an




Merged 217243 331613. Request was from Noèl Köthe <noel@debian.org> to control@bugs.debian.org.

Information forwarded to debian-bugs-dist@lists.debian.org, Noèl Köthe <noel@debian.org>:
Bug#217243; Package wget.

Acknowledgement sent to Sergey Svishchev <svs@ropnet.ru>:
Extra info received and forwarded to list. Copy sent to Noèl Köthe <noel@debian.org>.

Message #12 received at 217243@bugs.debian.org:

From: Sergey Svishchev <svs@ropnet.ru>
To: 217243@bugs.debian.org
Subject: rejecting apache generated indexes
Date: Mon, 26 Jun 2006 02:09:17 +0400
The wget maintainer plans to implement a more powerful URL-filtering
mechanism after 1.11 is released; see
http://www.mail-archive.com/wget%40sunsite.dk/msg08816.html

In the meantime, the 'apache-directory-indexes' problem is easily fixed:

--- src/recur.c.orig	2003-10-11 17:57:11.000000000 +0400
+++ src/recur.c
@@ -532,6 +532,14 @@ download_child_p (const struct urlpos *u
	  goto out;
	}
    }
+  if (u->file[0] == '\0')
+    {
+      if (!acceptable (url))
+	{
+	  DEBUGP (("%s does not match acc/rej rules.\n", url));
+	  goto out;
+	}
+    }

  /* 7. */
  if (schemes_are_similar_p (u->scheme, parent->scheme))


Then, you can use the following pattern (I put it into $WGETRC):

\??=?* 

-- 
Sergey Svishchev



Information forwarded to debian-bugs-dist@lists.debian.org, Noël Köthe <noel@debian.org>:
Bug#217243; Package wget. (Mon, 30 Dec 2013 22:21:04 GMT)

Acknowledgement sent to Gwern Branwen <gwern@gwern.net>:
Extra info received and forwarded to list. Copy sent to Noël Köthe <noel@debian.org>. (Mon, 30 Dec 2013 22:21:04 GMT)

Message #17 received at 217243@bugs.debian.org:

From: Gwern Branwen <gwern@gwern.net>
To: 217243@bugs.debian.org
Subject: Real problem, not wishlist
Date: Mon, 30 Dec 2013 17:16:21 -0500
This bug was a very serious, almost fatal bug for me recently, and I
thought I would share my story to emphasize that, for me, this was not
a 'wishlist'-severity bug.

I research Tor black-markets (see http://www.gwern.net/Silk%20Road )
because I am interested in them from economic, historical, and
statistical perspectives. Black-markets are dangerous, risky
enterprises, even when run as Tor hidden-services, and so people like
me or Nicolas Christin often download or spider them so as to have
copies to analyze later.

In October 2013, Silk Road was famously busted (to everyone's complete
surprise). Fortunately, the FBI seizure left the SR forums alone, and
it became a top priority for me to grab a copy of the forums while I
still could since they would be invaluable in the post-mortem of SR
and the wave of arrests everyone expected to follow the bust. The wget
spider of the public forum went fine. But even more importantly, I
needed to get a copy of the members-only subforum, the Vendor
Roundtable, where all the Silk Road drug dealers talked shop, and more
importantly, turned out to have discovered some of the early bits and
pieces of how Silk Road/Ross Ulbricht was busted.

I'm not a drug dealer, but I know a few of the SR ones and was able to
get login credentials. I logged in, checked that I had access to the
Roundtable, exported my cookies, and read the wget man page for
guidance:

       -R rejlist --reject rejlist
           Specify comma-separated lists of file name suffixes or patterns to
           accept or reject. Note that if any of the wildcard characters, *,
           ?, [ or ], appear in an element of acclist or rejlist, it will be
           treated as a pattern, rather than a suffix.

Perfect. Exactly what I needed to avoid being logged out. I threw in a
`--reject '*logout*'` to cover all possible logout links, and kicked
the spider off. I watched for a few minutes, everything looked like it
was going fine with no suspicious 'index.php?logout' files showing up
or anything, and I went off to deal with other aspects of breaking
news.

Two days later, the spider was still running (it's a very big forum and
Tor has high latency), and I needed to check a particular claim about
a Roundtable thread. No problem, I had a copy of the Roundtable - I'd
just check that. NOPE. The thread wasn't there at all. In fact, almost
*nothing* in the Roundtable had been downloaded at all!

I panicked. No one knew why the FBI hadn't shut down the forums, who
was running them, or when they would disappear into the digital ether.
Christin wasn't spidering the Roundtable, and I was it. If I didn't
have a copy, then likely, no one did. It would be gone permanently.
Luckily, the forums were still up... but for how long? Minutes, hours,
or days? What had gone wrong and how could I fix it?

I logged in again, exported cookies, restarted, checked in a few
hours. No Roundtable. WTF?! I logged in, exported, restarted, watched
closely... I spotted in the stream a mention of 'index.php?logout'.
But why? I went back to the `--reject` documentation. Had I called it
wrong? Made a syntax error? Did `--reject` not do what it was supposed
to do? But the documentation is perfectly clear: --reject rejects URLs
from being downloaded. It doesn't do something remotely as absurd as
download a URL and then delete it! There is no use case for that in
combination with rejecting URLs, it's trivially broken for many
use cases, and it would *definitely* be documented in the manpage.

I went back, logged in... Repeat 5 or 10 times with various
invocations of `--reject` and regexps and escalating blood pressure,
until I checked the downloaded pages and resigned myself to the fact
that somehow or other (I couldn't begin to explain the how or the why)
wget was logging itself out of the forums. As absurd as it
sounded, nothing else fit the evidence.

I started googling 'wget reject'.

To discover this bug report, among others.

Oh how I raged that night. 'principle of least surprise', 'betrayal',
'crime against posterity', 'moronic', 'deliberately malicious', 'what
the hell', and more indelicate phrases were uttered.

I was also not pleased to discover that, `--reject` aside, there was
apparently no way whatsoever to genuinely reject URLs inside wget.

Eventually, I rigged up a hack where I pointed wget to Privoxy, and
wrote Privoxy rules to block certain URLs including the logout links.
It's ugly, it's not easy to modify, I'm not really familiar with
Privoxy syntax, but at least it does, in fact, work. And I was able to
get a good chunk of the Roundtable before the forums went down. (Not
all of it, but that's another story which is not wget's but the forum
software's fault - I think.)

Summary: `--reject` is a problem. It can't be *that* hard to fix,
there are short patches floating around. Please fix it.

-- 
gwern



Information forwarded to debian-bugs-dist@lists.debian.org, Noël Köthe <noel@debian.org>:
Bug#217243; Package wget. (Wed, 08 Jan 2014 00:39:04 GMT)

Acknowledgement sent to Marcin Sochacki <wanted@gnu.univ.gda.pl>:
Extra info received and forwarded to list. Copy sent to Noël Köthe <noel@debian.org>. (Wed, 08 Jan 2014 00:39:04 GMT)

Message #22 received at 217243@bugs.debian.org:

From: Marcin Sochacki <wanted@gnu.univ.gda.pl>
To: 217243@bugs.debian.org
Subject: Re: Real problem, not wishlist
Date: Wed, 8 Jan 2014 00:54:16 +0100
pavuk (http://www.pavuk.org/man.html) seems to be a good replacement for
wget. Check out -skip_url_pattern and/or -skip_url_rpattern.

-- 
+---------------------------------------+
|  -o)  http://wanted.eu.org/
|  /\\  Message void if penguin violated
+ _\_V  Don't mess with the penguin





Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Fri Oct 21 02:16:05 2016; Machine Name: beach

Debian Bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.