Debian Bug report logs - #874321
backtraces from generic extractor: "Compressed file ended before the end-of-stream marker was reached"
Reported by: Joey Hess <id@joeyh.name>
Date: Tue, 5 Sep 2017 02:15:01 UTC
Severity: normal
Tags: moreinfo
Found in version youtube-dl/2017.05.18.1-1
Report forwarded
to debian-bugs-dist@lists.debian.org, Rogério Brito <rbrito@ime.usp.br>:
Bug#874321; Package youtube-dl.
(Tue, 05 Sep 2017 02:15:04 GMT) (full text, mbox, link).
Acknowledgement sent
to Joey Hess <id@joeyh.name>:
New Bug report received and forwarded. Copy sent to Rogério Brito <rbrito@ime.usp.br>.
(Tue, 05 Sep 2017 02:15:04 GMT) (full text, mbox, link).
Message #5 received at submit@bugs.debian.org (full text, mbox, reply):
[Message part 1 (text/plain, inline)]
Package: youtube-dl
Version: 2017.05.18.1-1
Severity: normal
joey@darkstar:~>youtube-dl http://debian.org/
[generic] debian: Requesting header
[redirect] Following redirect to http://www.debian.org/
[generic] www.debian: Requesting header
WARNING: Falling back on generic information extractor.
[generic] www.debian: Downloading webpage
Traceback (most recent call last):
File "/usr/bin/youtube-dl", line 11, in <module>
load_entry_point('youtube-dl==2017.5.18.1', 'console_scripts', 'youtube-dl')()
File "/usr/lib/python3/dist-packages/youtube_dl/__init__.py", line 465, in main
_real_main(argv)
File "/usr/lib/python3/dist-packages/youtube_dl/__init__.py", line 455, in _real_main
retcode = ydl.download(all_urls)
File "/usr/lib/python3/dist-packages/youtube_dl/YoutubeDL.py", line 1896, in download
url, force_generic_extractor=self.params.get('force_generic_extractor', False))
File "/usr/lib/python3/dist-packages/youtube_dl/YoutubeDL.py", line 771, in extract_info
return self.process_ie_result(ie_result, download, extra_info)
File "/usr/lib/python3/dist-packages/youtube_dl/YoutubeDL.py", line 832, in process_ie_result
extra_info=extra_info)
File "/usr/lib/python3/dist-packages/youtube_dl/YoutubeDL.py", line 760, in extract_info
ie_result = ie.extract(url)
File "/usr/lib/python3/dist-packages/youtube_dl/extractor/common.py", line 433, in extract
ie_result = self._real_extract(url)
File "/usr/lib/python3/dist-packages/youtube_dl/extractor/generic.py", line 1942, in _real_extract
full_response = self._request_webpage(request, video_id)
File "/usr/lib/python3/dist-packages/youtube_dl/extractor/common.py", line 502, in _request_webpage
return self._downloader.urlopen(url_or_request)
File "/usr/lib/python3/dist-packages/youtube_dl/YoutubeDL.py", line 2106, in urlopen
return self._opener.open(req, timeout=self._socket_timeout)
File "/usr/lib/python3.5/urllib/request.py", line 472, in open
response = meth(req, response)
File "/usr/lib/python3/dist-packages/youtube_dl/utils.py", line 981, in http_response
uncompressed = io.BytesIO(gz.read())
File "/usr/lib/python3.5/gzip.py", line 274, in read
return self._buffer.read(size)
File "/usr/lib/python3.5/gzip.py", line 480, in read
raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
I'm able to reproduce this over an Excede satellite internet connection,
but not from a VPS. There's some transparent proxying involved,
which is apparently confusing the gzip Content-encoding support in
youtube-dl. (I have not seen the transparent proxying cause any
other problems with other programs.) Only http urls cause the problem,
since https bypasses the transparent proxy.
I edited the code to dump out the gzip compressed content it received
before it tries to decompress it.
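The edit was roughly along these lines (a sketch rather than the exact
patch), in youtube_dl/utils.py where http_response() reads the gzip-encoded
body:

if resp.headers.get('Content-encoding', '') == 'gzip':
    content = resp.read()
    # instrumentation: write the compressed body to a file exactly as received
    with open('dump', 'wb') as f:
        f.write(content)
    gz = gzip.GzipFile(fileobj=io.BytesIO(content), mode='rb')
    uncompressed = io.BytesIO(gz.read())   # the EOFError comes from this read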
joey@darkstar:~>file dump
dump: gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT)
joey@darkstar:~>ls -l dump
-rw-r--r-- 1 joey joey 4744 Sep 4 21:00 dump
joey@darkstar:~>zcat < dump > data
gzip: stdin: unexpected end of file
joey@darkstar:~>curl --compressed -so raw http://www.debian.org/
joey@darkstar:~>cmp data raw
joey@darkstar:~>
So youtube-dl has apparently downloaded a gzip compressed chunk of data
which contains the whole page, but the gzip data is somehow malformed,
although not in a way that prevents decompressing the whole page
content. I've attached the `dump` file to this bug report.
I've also attached a `wireshark.pcapng` which has the curl traffic
first followed by youtube-dl.
I suspect that the gzip compressed data is missing its gzip footer.
Normally, the last 8 bytes of `dump` would be the gzip footer. Those are:
93 C6 FF 00 00 00 FF FF
If that were a footer, the size field (which gzip stores little-endian) would
decode to 0xFFFF0000, which is not the actual size. And changing any of these
bytes except for the last one exposes parts of the compression dictionary, so
they must not be part of the footer, and seem to instead be part of the
DEFLATE data.
Similarly, looking at the http response to curl, the last 8 bytes of that
are:
9D 7B D7 66 E6 3B 00 00
which again do not look like a gzip footer.
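For reference, decoding the last 8 bytes of `dump` as if they were a footer
(a quick sketch):

import struct

with open('dump', 'rb') as f:
    tail = f.read()[-8:]                # 93 C6 FF 00 00 00 FF FF

# a complete gzip member ends with CRC32 then ISIZE, both little-endian uint32
crc32, isize = struct.unpack('<II', tail)
print(hex(crc32), isize)                # 0xffc693 4294901760 -- not a sane size

The trailing 00 00 FF FF is also the tell-tale ending of an empty stored
DEFLATE block, the kind zlib emits on a sync flush, which fits the idea that
these bytes are still DEFLATE data rather than a footer.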
curl seems to follow Postel's law in handling this, so perhaps youtube-dl
should too?
-- System Information:
Debian Release: buster/sid
APT prefers unstable
APT policy: (500, 'unstable'), (500, 'testing'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386
Kernel: Linux 4.11.0-2-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.utf8, LC_CTYPE=en_US.utf8 (charmap=UTF-8), LANGUAGE=en_US.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
Versions of packages youtube-dl depends on:
ii dpkg 1.18.24
ii python3 3.5.3-3
ii python3-pkg-resources 36.2.7-2
Versions of packages youtube-dl recommends:
ii aria2 1.32.0-1
ii ca-certificates 20170717
ii curl 7.55.1-1
ii ffmpeg 7:3.3.3-3
ii libav-tools 7:3.3.3-3
ii mplayer 2:1.3.0-6+b4
ii mpv 0.26.0-3
ii rtmpdump 2.4+20151223.gitfa8646d.1-1+b1
ii wget 1.19.1-4
youtube-dl suggests no packages.
-- no debconf information
--
see shy jo
[dump (application/octet-stream, attachment)]
[wireshark.pcapng (application/octet-stream, attachment)]
[signature.asc (application/pgp-signature, inline)]
Information forwarded
to debian-bugs-dist@lists.debian.org:
Bug#874321; Package youtube-dl.
(Thu, 19 Oct 2017 20:15:06 GMT) (full text, mbox, link).
Acknowledgement sent
to Rogério Brito <rbrito@ime.usp.br>:
Extra info received and forwarded to list.
(Thu, 19 Oct 2017 20:15:06 GMT) (full text, mbox, link).
Message #10 received at 874321@bugs.debian.org (full text, mbox, reply):
Control: tag -1 moreinfo
Hi, Joey.
On Sep 04 2017, Joey Hess wrote:
> joey@darkstar:~>youtube-dl http://debian.org/
> [generic] debian: Requesting header
> [redirect] Following redirect to http://www.debian.org/
> [generic] www.debian: Requesting header
> WARNING: Falling back on generic information extractor.
> [generic] www.debian: Downloading webpage
(...)
> Traceback (most recent call last):
(...)
> File "/usr/lib/python3.5/urllib/request.py", line 472, in open
> response = meth(req, response)
> File "/usr/lib/python3/dist-packages/youtube_dl/utils.py", line 981, in http_response
> uncompressed = io.BytesIO(gz.read())
> File "/usr/lib/python3.5/gzip.py", line 274, in read
> return self._buffer.read(size)
> File "/usr/lib/python3.5/gzip.py", line 480, in read
> raise EOFError("Compressed file ended before the "
> EOFError: Compressed file ended before the end-of-stream marker was reached
Just for the record, the above is a problem coming from Python, of course.
> I'm able to reproduce this over an Excede satellite internet connection,
> but not from a VPS.
Right. I am not able to reproduce this with whatever I have at my disposal.
> There's some transparent proxying involved, which is apparently confusing
> the gzip Content-encoding support in youtube-dl. (I have not seen the
> transparent proxying cause any other problems with other programs.)
With other python programs?
> Only http urls cause the problem, since https bypasses the transparent
> proxy.
Sure.
> I edited the code to dump out the gzip compressed content it received
> before it tries to decompress it.
(... snip detailed analysis ...)
> curl seems to follow Postel's law in handling this, so perhaps youtube-dl
> should too?
I believe that you meant to file this as a Python bug, and I think that the
severity is, quite frankly, lower than normal...
That being said, I'm only tagging this as moreinfo before reassigning it
to python itself, since I want to understand what kind of goal you would
like to achieve here.
From my point of view, if anything here needs to deal more gracefully with
errors, that would be Python's gzip module...
Thanks,
--
Rogério Brito : rbrito@{ime.usp.br,gmail.com} : GPG key 4096R/BCFCAAAA
http://cynic.cc/blog/ : github.com/rbrito : profiles.google.com/rbrito
DebianQA: http://qa.debian.org/developer.php?login=rbrito%40ime.usp.br
Added tag(s) moreinfo.
Request was from Rogério Brito <rbrito@ime.usp.br>
to control@bugs.debian.org.
(Thu, 19 Oct 2017 20:27:12 GMT) (full text, mbox, link).
Information forwarded
to debian-bugs-dist@lists.debian.org, Rogério Brito <rbrito@ime.usp.br>:
Bug#874321; Package youtube-dl.
(Tue, 24 Oct 2017 22:27:07 GMT) (full text, mbox, link).
Acknowledgement sent
to Joey Hess <id@joeyh.name>:
Extra info received and forwarded to list. Copy sent to Rogério Brito <rbrito@ime.usp.br>.
(Tue, 24 Oct 2017 22:27:07 GMT) (full text, mbox, link).
Message #17 received at 874321@bugs.debian.org (full text, mbox, reply):
[Message part 1 (text/plain, inline)]
Rogério Brito wrote:
> I believe that you meant to file this as a Python bug and I think that the
> severity is, quite frankly, lower than normal...
I don't think this is a python bug. It's reasonable for python's gzip
library to fail when presented with corrupted data. It does not know
it's being used to download a url. Perhaps it should have a mode where
it tries to extract as much data as it can, in case its caller wants to
try to be robust.
I think this is a bug in youtube-dl though, because of this code:
std_headers = {
    ...
    'Accept-Encoding': 'gzip, deflate',
}

if resp.headers.get('Content-encoding', '') == 'gzip':
    content = resp.read()
    gz = gzip.GzipFile(fileobj=io.BytesIO(content), mode='rb')
    try:
        uncompressed = io.BytesIO(gz.read())
    except IOError as original_ioerror:
        # There may be junk add the end of the file
        # See http://stackoverflow.com/q/4928560/35070 for details
        for i in range(1, 1024):
            try:
                gz = gzip.GzipFile(fileobj=io.BytesIO(content[:-i]), mode='rb')
                uncompressed = io.BytesIO(gz.read())
            except IOError:
                continue
            break
        else:
            raise original_ioerror
It's encouraging gzip to be used (rather than deflate or no compression),
and it already contains workarounds for similar problems. This code
smells.
There is probably a python library that implements this robustly.
I tried python-urllib3:
joey@darkstar:~>python
Python 2.7.14 (default, Sep 17 2017, 18:50:44)
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib3
>>> http = urllib3.PoolManager()
>>> headers = {'Accept-Encoding': 'gzip'}
>>> r = http.request('GET', 'http://www.debian.org/', headers=headers)
>>> r.headers.get("Content-Encoding")
'gzip'
>>> len(r.data)
14871
So that seems to work, I think because it uses zlib to decompress the data
rather than the gzip module.
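For comparison, a gzip-wrapped body can also be decoded directly with zlib,
which is roughly what urllib3 does internally (a sketch, not youtube-dl's
actual code):

import zlib

def lenient_gunzip(content):
    # wbits = 16 + MAX_WBITS makes zlib parse the gzip wrapper itself; unlike
    # gzip.GzipFile, decompressobj() hands back whatever it managed to decode
    # instead of raising EOFError when the stream ends before the
    # end-of-stream marker.
    return zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(content)

Run on the attached `dump`, that should give back the same bytes that zcat
managed to extract before giving up.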
--
see shy jo
[signature.asc (application/pgp-signature, inline)]