Debian Bug report logs - #874321
backtraces from generic extractor: "Compressed file ended before the end-of-stream marker was reached"


Package: youtube-dl; Maintainer for youtube-dl is Rogério Brito <rbrito@ime.usp.br>; Source for youtube-dl is src:youtube-dl (PTS, buildd, popcon).

Reported by: Joey Hess <id@joeyh.name>

Date: Tue, 5 Sep 2017 02:15:01 UTC

Severity: normal

Tags: moreinfo

Found in version youtube-dl/2017.05.18.1-1



Report forwarded to debian-bugs-dist@lists.debian.org, Rogério Brito <rbrito@ime.usp.br>:
Bug#874321; Package youtube-dl. (Tue, 05 Sep 2017 02:15:04 GMT)


Acknowledgement sent to Joey Hess <id@joeyh.name>:
New Bug report received and forwarded. Copy sent to Rogério Brito <rbrito@ime.usp.br>. (Tue, 05 Sep 2017 02:15:04 GMT)


Message #5 received at submit@bugs.debian.org:

From: Joey Hess <id@joeyh.name>
To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: backtraces from generic extractor: "Compressed file ended before the end-of-stream marker was reached"
Date: Mon, 4 Sep 2017 22:12:54 -0400
[Message part 1 (text/plain, inline)]
Package: youtube-dl
Version: 2017.05.18.1-1
Severity: normal

joey@darkstar:~>youtube-dl  http://debian.org/
[generic] debian: Requesting header
[redirect] Following redirect to http://www.debian.org/
[generic] www.debian: Requesting header
WARNING: Falling back on generic information extractor.
[generic] www.debian: Downloading webpage
Traceback (most recent call last):
  File "/usr/bin/youtube-dl", line 11, in <module>
    load_entry_point('youtube-dl==2017.5.18.1', 'console_scripts', 'youtube-dl')()
  File "/usr/lib/python3/dist-packages/youtube_dl/__init__.py", line 465, in main
    _real_main(argv)
  File "/usr/lib/python3/dist-packages/youtube_dl/__init__.py", line 455, in _real_main
    retcode = ydl.download(all_urls)
  File "/usr/lib/python3/dist-packages/youtube_dl/YoutubeDL.py", line 1896, in download
    url, force_generic_extractor=self.params.get('force_generic_extractor', False))
  File "/usr/lib/python3/dist-packages/youtube_dl/YoutubeDL.py", line 771, in extract_info
    return self.process_ie_result(ie_result, download, extra_info)
  File "/usr/lib/python3/dist-packages/youtube_dl/YoutubeDL.py", line 832, in process_ie_result
    extra_info=extra_info)
  File "/usr/lib/python3/dist-packages/youtube_dl/YoutubeDL.py", line 760, in extract_info
    ie_result = ie.extract(url)
  File "/usr/lib/python3/dist-packages/youtube_dl/extractor/common.py", line 433, in extract
    ie_result = self._real_extract(url)
  File "/usr/lib/python3/dist-packages/youtube_dl/extractor/generic.py", line 1942, in _real_extract
    full_response = self._request_webpage(request, video_id)
  File "/usr/lib/python3/dist-packages/youtube_dl/extractor/common.py", line 502, in _request_webpage
    return self._downloader.urlopen(url_or_request)
  File "/usr/lib/python3/dist-packages/youtube_dl/YoutubeDL.py", line 2106, in urlopen
    return self._opener.open(req, timeout=self._socket_timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 472, in open
    response = meth(req, response)
  File "/usr/lib/python3/dist-packages/youtube_dl/utils.py", line 981, in http_response
    uncompressed = io.BytesIO(gz.read())
  File "/usr/lib/python3.5/gzip.py", line 274, in read
    return self._buffer.read(size)
  File "/usr/lib/python3.5/gzip.py", line 480, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

I'm able to reproduce this over an Excede satellite internet connection,
but not from a VPS. There's some transparent proxying involved,
which is apparently confusing the gzip Content-encoding support in
youtube-dl. (I have not seen the transparent proxying cause any
other problems with other programs.) Only http urls cause the problem,
since https bypasses the transparent proxy.

I edited the code to dump out the gzip compressed content it received
before it tries to decompress it.

joey@darkstar:~>file dump
dump: gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT)
joey@darkstar:~>ls -l dump
-rw-r--r-- 1 joey joey 4744 Sep  4 21:00 dump
joey@darkstar:~>zcat < dump > data
gzip: stdin: unexpected end of file
joey@darkstar:~>curl --compressed -so raw http://www.debian.org/
joey@darkstar:~>cmp data raw
joey@darkstar:~>

So, it's apparently downloaded a gzip compressed chunk of data
which contains the whole url, but the gzip data is somehow shady,
although not in a way that prevents decompressing the whole page
content. I've attached the `dump` file to this bug report.

I've also attached a `wireshark.pcapng` which has the curl traffic
first followed by youtube-dl.

I suspect that the gzip compressed data has a missing gzip footer.
Normally, the last 8 bytes of `dump` would be the gzip footer. Those are:
93 C6 FF 00 00 00 FF FF
If that were a footer, the size would be 0000FFFF which is not the
actual size. And, changing any of these bytes except for the last one
exposes parts of the compression dictionary, so they must not be
part of the footer, and seem to instead be part of the DEFLATE data.
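The footer layout being reasoned about here can be checked directly. The following is a minimal standard-library sketch (not part of the bug report's tooling) showing what the trailing 8 bytes of a well-formed gzip stream contain:

```python
import gzip
import struct
import zlib

# A well-formed gzip stream ends with an 8-byte footer, little-endian:
# 4 bytes of CRC32 over the uncompressed data, then 4 bytes of ISIZE
# (the uncompressed length mod 2**32).
data = b"example payload " * 64
blob = gzip.compress(data)

crc, isize = struct.unpack("<II", blob[-8:])
assert crc == zlib.crc32(data)
assert isize == len(data) & 0xFFFFFFFF
```

Reading the last 8 bytes of a dump the same way makes it easy to see whether they are a plausible CRC32/ISIZE pair for the page.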

Similarly, looking at the http response to curl,
the last 8 bytes of that are
9D 7B D7 66 E6 3B 00 00
which again does not look like a gzip footer.

curl seems to follow Postel's law in handling this, so perhaps youtube-dl
should too?
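The missing-footer theory is easy to reproduce synthetically: stripping the 8-byte footer from an otherwise valid stream triggers the exact error in the backtrace above. A sketch (simulating, not capturing, the proxy's behaviour):

```python
import gzip
import io

data = b"Debian home page " * 64
blob = gzip.compress(data)

# Drop the 8-byte CRC32/ISIZE footer, leaving the DEFLATE stream
# intact, and feed the rest to the gzip module.
truncated = blob[:-8]
try:
    gzip.GzipFile(fileobj=io.BytesIO(truncated)).read()
except EOFError as exc:
    # Same "end-of-stream marker" EOFError as in the backtrace.
    print(exc)
```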

-- System Information:
Debian Release: buster/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 4.11.0-2-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.utf8, LC_CTYPE=en_US.utf8 (charmap=UTF-8), LANGUAGE=en_US.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages youtube-dl depends on:
ii  dpkg                   1.18.24
ii  python3                3.5.3-3
ii  python3-pkg-resources  36.2.7-2

Versions of packages youtube-dl recommends:
ii  aria2            1.32.0-1
ii  ca-certificates  20170717
ii  curl             7.55.1-1
ii  ffmpeg           7:3.3.3-3
ii  libav-tools      7:3.3.3-3
ii  mplayer          2:1.3.0-6+b4
ii  mpv              0.26.0-3
ii  rtmpdump         2.4+20151223.gitfa8646d.1-1+b1
ii  wget             1.19.1-4

youtube-dl suggests no packages.

-- no debconf information

-- 
see shy jo
[dump (application/octet-stream, attachment)]
[wireshark.pcapng (application/octet-stream, attachment)]
[signature.asc (application/pgp-signature, inline)]

Information forwarded to debian-bugs-dist@lists.debian.org:
Bug#874321; Package youtube-dl. (Thu, 19 Oct 2017 20:15:06 GMT)


Acknowledgement sent to Rogério Brito <rbrito@ime.usp.br>:
Extra info received and forwarded to list. (Thu, 19 Oct 2017 20:15:06 GMT)


Message #10 received at 874321@bugs.debian.org:

From: Rogério Brito <rbrito@ime.usp.br>
To: Joey Hess <id@joeyh.name>, 874321@bugs.debian.org
Subject: Re: Bug#874321: backtraces from generic extractor: "Compressed file ended before the end-of-stream marker was reached"
Date: Thu, 19 Oct 2017 18:13:09 -0200
Control: tag -1 moreinfo

Hi, Joey.

On Sep 04 2017, Joey Hess wrote:
> joey@darkstar:~>youtube-dl  http://debian.org/
> [generic] debian: Requesting header
> [redirect] Following redirect to http://www.debian.org/
> [generic] www.debian: Requesting header
> WARNING: Falling back on generic information extractor.
> [generic] www.debian: Downloading webpage
(...)
> Traceback (most recent call last):
(...)
>   File "/usr/lib/python3.5/urllib/request.py", line 472, in open
>     response = meth(req, response)
>   File "/usr/lib/python3/dist-packages/youtube_dl/utils.py", line 981, in http_response
>     uncompressed = io.BytesIO(gz.read())
>   File "/usr/lib/python3.5/gzip.py", line 274, in read
>     return self._buffer.read(size)
>   File "/usr/lib/python3.5/gzip.py", line 480, in read
>     raise EOFError("Compressed file ended before the "
> EOFError: Compressed file ended before the end-of-stream marker was reached

Just for the record, the above is a problem coming from Python, of course.

> I'm able to reproduce this over an Excede satelite internet connection,
> but not from a VPS.

Right. I am not able to reproduce this with whatever I have at my disposal.

> There's some transparent proxying involved, which is apparently confusing
> the gzip Content-encoding support in youtube-dl. (I have not seen the
> transparent proxying cause any other problems with other programs.)

With other python programs?

> Only http urls cause the problem, since https bypasses the transparent
> proxy.

Sure.

> I edited the code to dump out the gzip compressed content it received
> before it trys to decompress it.
(... snip detailed analysis ...)
> curl seems to follow Postel's law in handling this, so perhaps youtube-dl
> should too?

I believe that you meant to file this as a Python bug and I think that the
severity is, quite frankly, lower than normal...

That being said, I'm only tagging this as moreinfo before reassigning this
to python itself, since I want to understand what kind of goal you would
like to achieve here.

From my point of view, if anything here needs to deal more gracefully with
errors, that would be Python's gzip module...


Thanks,

-- 
Rogério Brito : rbrito@{ime.usp.br,gmail.com} : GPG key 4096R/BCFCAAAA
http://cynic.cc/blog/ : github.com/rbrito : profiles.google.com/rbrito
DebianQA: http://qa.debian.org/developer.php?login=rbrito%40ime.usp.br



Added tag(s) moreinfo. Request was from Rogério Brito <rbrito@ime.usp.br> to control@bugs.debian.org. (Thu, 19 Oct 2017 20:27:12 GMT)


Information forwarded to debian-bugs-dist@lists.debian.org, Rogério Brito <rbrito@ime.usp.br>:
Bug#874321; Package youtube-dl. (Tue, 24 Oct 2017 22:27:07 GMT)


Acknowledgement sent to Joey Hess <id@joeyh.name>:
Extra info received and forwarded to list. Copy sent to Rogério Brito <rbrito@ime.usp.br>. (Tue, 24 Oct 2017 22:27:07 GMT)


Message #17 received at 874321@bugs.debian.org:

From: Joey Hess <id@joeyh.name>
To: Rogério Brito <rbrito@ime.usp.br>
Cc: 874321@bugs.debian.org
Subject: Re: Bug#874321: backtraces from generic extractor: "Compressed file ended before the end-of-stream marker was reached"
Date: Tue, 24 Oct 2017 18:16:12 -0400
[Message part 1 (text/plain, inline)]
Rogério Brito wrote:
> I believe that you meant to file this as a Python bug and I think that the
> severity is, quite frankly, lower than normal...

I don't think this is a python bug. It's reasonable for Python's gzip
library to fail when presented with corrupted data. It does not know
it's being used to download a URL. Perhaps it should have a mode where
it tries to extract as much data as it can, in case its caller wants to
try to be robust.

I think this is a bug in youtube-dl though, because of this code:

std_headers = {
...
    'Accept-Encoding': 'gzip, deflate',
}

        if resp.headers.get('Content-encoding', '') == 'gzip':
            content = resp.read()
            gz = gzip.GzipFile(fileobj=io.BytesIO(content), mode='rb')
            try:
                uncompressed = io.BytesIO(gz.read())
            except IOError as original_ioerror:
                # There may be junk add the end of the file
                # See http://stackoverflow.com/q/4928560/35070 for details
                for i in range(1, 1024):
                    try:
                        gz = gzip.GzipFile(fileobj=io.BytesIO(content[:-i]), mode='rb')
                        uncompressed = io.BytesIO(gz.read())
                    except IOError:
                        continue
                    break
                else:
                    raise original_ioerror

It's encouraging gzip to be used (rather than deflate or no compression),
and it already contains workarounds for similar problems. This code
smells.
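For comparison, a more tolerant decoder can be built on zlib's streaming interface, which hands back whatever it managed to inflate instead of insisting on a complete footer. A minimal sketch (this is an illustration, not youtube-dl's actual code):

```python
import zlib

def lenient_gunzip(blob):
    """Decompress gzip data, tolerating a truncated stream or a
    missing footer. wbits = 32 + MAX_WBITS tells zlib to auto-detect
    a gzip or zlib header."""
    d = zlib.decompressobj(32 + zlib.MAX_WBITS)
    out = d.decompress(blob)
    # flush() returns any remaining buffered output; it does not
    # require the CRC32/ISIZE footer to be present.
    return out + d.flush()

# A stream whose footer was eaten in transit still decodes fully:
import gzip
page = b"<html>hello</html>" * 32
assert lenient_gunzip(gzip.compress(page)[:-8]) == page
```

This avoids the retry loop entirely: rather than repeatedly re-parsing ever-shorter prefixes, it inflates once and keeps whatever the DEFLATE stream yields.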

There is probably a python library that implements this robustly.
I tried python-urllib3:

joey@darkstar:~>python
Python 2.7.14 (default, Sep 17 2017, 18:50:44) 
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib3
>>> http = urllib3.PoolManager()
>>> headers = {'Accept-Encoding': 'gzip'}
>>> r = http.request('GET', 'http://www.debian.org/', headers=headers)
>>> r.headers.get("Content-Encoding")
'gzip'
>>> len(r.data)
14871

So that seems to work. I think that's because it uses zlib to decompress
the data rather than the gzip module.

-- 
see shy jo
[signature.asc (application/pgp-signature, inline)]



Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Tue Aug 14 21:33:24 2018; Machine Name: buxtehude
