Debian Bug report logs - #1010821
PyPDF2 fails to read a PDF file with a beginbfchar entry with an empty second element

Package: pypdf2; Maintainer for pypdf2 is Debian Python Team <team+python@tracker.debian.org>;

Affects: src:weasyprint, src:xml2rfc, src:pypdf2

Reported by: Paul Gevers <elbrus@debian.org>

Date: Tue, 10 May 2022 19:57:02 UTC

Severity: serious

Tags: bookworm, sid

Found in version 2.4.2-1

Fixed in version 2.6.0-1

Done: László Böszörményi (GCS) <gcs@debian.org>

Bug is archived. No further changes may be made.

Forwarded to https://github.com/py-pdf/PyPDF2/issues/1111


Report forwarded to debian-bugs-dist@lists.debian.org, Laszlo Boszormenyi (GCS) <gcs@debian.org>, Daniel Kahn Gillmor <dkg@fifthhorseman.net>:
Bug#1010821; Package src:pypdf2, src:xml2rfc. (Tue, 10 May 2022 19:57:04 GMT)


Acknowledgement sent to Paul Gevers <elbrus@debian.org>:
New Bug report received and forwarded. Copy sent to Laszlo Boszormenyi (GCS) <gcs@debian.org>, Daniel Kahn Gillmor <dkg@fifthhorseman.net>. (Tue, 10 May 2022 19:57:04 GMT)


Message #5 received at submit@bugs.debian.org:

From: Paul Gevers <elbrus@debian.org>
To: submit@bugs.debian.org
Subject: pypdf2 breaks xml2rfc autopkgtest: lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1
Date: Tue, 10 May 2022 21:53:30 +0200
Source: pypdf2, xml2rfc
Control: found -1 pypdf2/1.27.12-1
Control: found -1 xml2rfc/3.12.4-1
Severity: serious
Tags: sid bookworm
User: debian-ci@lists.debian.org
Usertags: breaks needs-update

Dear maintainer(s),

With a recent upload of pypdf2 the autopkgtest of xml2rfc fails in 
testing when that autopkgtest is run with the binary packages of pypdf2 
from unstable. It passes when run with only packages from testing. In 
tabular form:

                       pass            fail
pypdf2                 from testing    1.27.12-1
xml2rfc                from testing    3.12.4-1
all others             from testing    from testing

I copied some of the output at the bottom of this report.

Currently this regression is blocking the migration of pypdf2 to testing 
[1]. Due to the nature of this issue, I filed this bug report against 
both packages. Can you please investigate the situation and reassign the 
bug to the right package?

More information about this bug and the reason for filing it can be found on
https://wiki.debian.org/ContinuousIntegration/RegressionEmailInformation

Paul

[1] https://qa.debian.org/excuses.php?package=pypdf2

https://ci.debian.net/data/autopkgtest/testing/amd64/x/xml2rfc/21504535/log.gz

======================================================================
ERROR: setUpClass (__main__.PdfWriterTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/autopkgtest-lxc.mlxdmdjo/downtmp/build.EDj/src/xxx/test.py", line 495, in setUpClass
    cls.elements_pdfxml = xmldoc(None, bytes=elements_pdfdoc)
  File "/usr/lib/python3/dist-packages/xml2rfc/walkpdf.py", line 97, in xmldoc
    return lxml.etree.fromstring(text)
  File "src/lxml/etree.pyx", line 3252, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1913, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1793, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "<string>", line 11931
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1, line 11931, column 5

----------------------------------------------------------------------
Ran 42 tests in 32.420s

FAILED (errors=1)
autopkgtest [04:57:54]: test run-pytest
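
For context on the error above: XML 1.0 does not permit the control character U+0001 in character data, so a parser rejects any document whose text contains it. The following is a minimal sketch, assuming only the Python standard library. `xml.etree.ElementTree` stands in for lxml here, so the exception type and message differ from the traceback, and the sample strings and the `parses` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parses(document: str) -> bool:
    """Return True if the string is accepted as well-formed XML."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents, not taken from the xml2rfc test suite.
clean = "<text>habibi</text>"
dirty = "<text>habibi\x01</text>"  # contains the forbidden character U+0001

print(parses(clean))  # True
print(parses(dirty))  # False
```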

Marked as found in versions pypdf2/1.27.12-1. Request was from Paul Gevers <elbrus@debian.org> to submit@bugs.debian.org. (Tue, 10 May 2022 19:57:04 GMT)


Marked as found in versions xml2rfc/3.12.4-1. Request was from Paul Gevers <elbrus@debian.org> to submit@bugs.debian.org. (Tue, 10 May 2022 19:57:05 GMT)


Bug reassigned from package 'src:pypdf2, src:xml2rfc' to 'src:xml2rfc'. Request was from Paul Gevers <elbrus@debian.org> to control@bugs.debian.org. (Thu, 30 Jun 2022 06:54:03 GMT)


No longer marked as found in versions pypdf2/1.27.12-1 and xml2rfc/3.12.4-1. Request was from Paul Gevers <elbrus@debian.org> to control@bugs.debian.org. (Thu, 30 Jun 2022 06:54:03 GMT)


Marked as found in versions xml2rfc/3.12.4-1. Request was from Paul Gevers <elbrus@debian.org> to control@bugs.debian.org. (Thu, 30 Jun 2022 06:54:04 GMT)


Added indication that 1010821 affects src:pypdf2 Request was from Paul Gevers <elbrus@debian.org> to control@bugs.debian.org. (Thu, 30 Jun 2022 06:54:05 GMT)


Information forwarded to debian-bugs-dist@lists.debian.org:
Bug#1010821; Package src:xml2rfc. (Fri, 15 Jul 2022 15:12:02 GMT)


Acknowledgement sent to Daniel Kahn Gillmor <dkg@fifthhorseman.net>:
Extra info received and forwarded to list. (Fri, 15 Jul 2022 15:12:02 GMT)


Message #22 received at 1010821@bugs.debian.org:

From: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
To: Paul Gevers <elbrus@debian.org>, 1010821@bugs.debian.org
Subject: Re: Bug#1010821: pypdf2 breaks xml2rfc autopkgtest: lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1
Date: Fri, 15 Jul 2022 11:08:32 -0400
Control: reassign 1010821 pypdf2/2.4.2-1
Control: forwarded 1010821 https://github.com/py-pdf/PyPDF2/issues/1111
Control: retitle 1010821 PyPDF2 fails to read a PDF file with a beginbfchar entry with an empty second element
Control: affects 1010821 + src:xml2rfc src:weasyprint

On Tue 2022-05-10 21:53:30 +0200, Paul Gevers wrote:
> With a recent upload of pypdf2 the autopkgtest of xml2rfc fails in 
> testing when that autopkgtest is run with the binary packages of pypdf2 
> from unstable. It passes when run with only packages from testing. In 
> tabular form:

The problem here is indeed a bug in the latest versions of PyPDF2.  I've
traced it back to a failure in how PyPDF2 deals with an empty second
element in a bfchar list:

   https://github.com/py-pdf/PyPDF2/issues/1111

You can replicate the problem with this file (habibi.html):

--------
<!DOCTYPE html>
<head>
<meta charset="utf-8">
<title>habibi</title>
</head>
<body>
<div>حَبيبي habibi</div>
</body>
</html>
--------

Feed it through weasyprint from the command line:

------
weasyprint habibi.html habibi.pdf
------

and then in python:

-------
from PyPDF2 import PdfReader
r = PdfReader('habibi.pdf')
t = r.pages[0].extract_text()
-------

This causes a crash in PyPDF2.  The crash can be worked around with this
patch:

-----------------
--- a/PyPDF2/_cmap.py
+++ b/PyPDF2/_cmap.py
@@ -245,7 +245,7 @@ def parse_to_unicode(
         elif process_char:
             lst = [x for x in l.split(b" ") if x]
             map_dict[-1] = len(lst[0]) // 2
-            while len(lst) > 0:
+            while len(lst) > 1:
                 map_dict[
                     unhexlify(lst[0]).decode(
                         "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
-----------------

But the patch is insufficient, because the result of extract_text()
("t" in the Python above) is then wrong.  The underlying problem is subtly
wrong parsing in _cmap.py's parse_to_unicode().  It manipulates the input
by hand, removing the angle brackets and then splitting and recombining
strings based on whitespace.  When the contents of some of the angle
brackets are empty, this technique doesn't work.
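
To illustrate the parsing problem described above, here is a minimal sketch. It is not PyPDF2's code and not the upstream fix. The sample bfchar line and the helpers `naive_tokens`, `bracket_tokens` and `pair_up` are hypothetical. `naive_tokens` is a simplified model of the "remove brackets, split on whitespace" approach, and `bracket_tokens` is one possible alternative that tokenizes on the brackets themselves.

```python
import re
from typing import Dict, List

# Hypothetical bfchar line: source code 0002 maps to an empty destination.
line = b"<0001> <0068> <0002> <> <0003> <0061>"


def naive_tokens(raw: bytes) -> List[bytes]:
    """Drop the angle brackets, then split on whitespace.

    An empty <> leaves nothing behind, so it vanishes from the token list.
    """
    stripped = raw.replace(b"<", b" ").replace(b">", b" ")
    return [tok for tok in stripped.split(b" ") if tok]


def bracket_tokens(raw: bytes) -> List[bytes]:
    """Capture the contents of each <...>, keeping empty strings."""
    return re.findall(rb"<([0-9A-Fa-f]*)>", raw)


def pair_up(tokens: List[bytes]) -> Dict[bytes, bytes]:
    """Pair tokens as (source, destination); a trailing odd token is dropped."""
    return {tokens[i]: tokens[i + 1] for i in range(0, len(tokens) - 1, 2)}


print(naive_tokens(line))
# [b'0001', b'0068', b'0002', b'0003', b'0061']  (five tokens, an odd count)
print(pair_up(naive_tokens(line)))
# {b'0001': b'0068', b'0002': b'0003'}  (0002 is mapped to the wrong value)
print(pair_up(bracket_tokens(line)))
# {b'0001': b'0068', b'0002': b'', b'0003': b'0061'}
```

With the naive tokens, a loop that consumes two items at a time while the list is non-empty runs out of elements on the last, unpaired token. Stopping while one token remains avoids the crash but leaves every pair after the empty entry shifted by one, which matches the wrong extract_text() output noted above.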

    --dkg

Set Bug forwarded-to-address to 'https://github.com/py-pdf/PyPDF2/issues/1111'. Request was from Daniel Kahn Gillmor <dkg@fifthhorseman.net> to 1010821-submit@bugs.debian.org. (Fri, 15 Jul 2022 15:12:02 GMT)


Changed Bug title to 'PyPDF2 fails to read a PDF file with a beginbfchar entry with an empty second element' from 'pypdf2 breaks xml2rfc autopkgtest: lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1'. Request was from Daniel Kahn Gillmor <dkg@fifthhorseman.net> to 1010821-submit@bugs.debian.org. (Fri, 15 Jul 2022 15:12:03 GMT)


Added indication that 1010821 affects src:xml2rfc and src:weasyprint Request was from Daniel Kahn Gillmor <dkg@fifthhorseman.net> to 1010821-submit@bugs.debian.org. (Fri, 15 Jul 2022 15:12:04 GMT)


Bug reassigned from package 'src:xml2rfc' to 'pypdf2'. Request was from Daniel Kahn Gillmor <dkg@fifthhorseman.net> to submit@bugs.debian.org. (Tue, 19 Jul 2022 04:33:05 GMT)


No longer marked as found in versions xml2rfc/3.12.4-1. Request was from Daniel Kahn Gillmor <dkg@fifthhorseman.net> to submit@bugs.debian.org. (Tue, 19 Jul 2022 04:33:06 GMT)


Marked as found in versions 2.4.2-1. Request was from Daniel Kahn Gillmor <dkg@fifthhorseman.net> to submit@bugs.debian.org. (Tue, 19 Jul 2022 04:33:06 GMT)


Information forwarded to debian-bugs-dist@lists.debian.org, Laszlo Boszormenyi (GCS) <gcs@debian.org>:
Bug#1010821; Package pypdf2. (Tue, 19 Jul 2022 14:42:03 GMT)


Acknowledgement sent to László Böszörményi (GCS) <gcs@debian.org>:
Extra info received and forwarded to list. Copy sent to Laszlo Boszormenyi (GCS) <gcs@debian.org>. (Tue, 19 Jul 2022 14:42:03 GMT)


Message #39 received at 1010821@bugs.debian.org:

From: László Böszörményi (GCS) <gcs@debian.org>
To: Paul Gevers <elbrus@debian.org>, 1010821@bugs.debian.org
Subject: Re: Bug#1010821: pypdf2 breaks xml2rfc autopkgtest: lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1
Date: Tue, 19 Jul 2022 16:38:43 +0200
Version: 2.6.0-1

On Tue, May 10, 2022 at 9:57 PM Paul Gevers <elbrus@debian.org> wrote:
> With a recent upload of pypdf2 the autopkgtest of xml2rfc fails in
> testing when that autopkgtest is run with the binary packages of pypdf2
> from unstable.
This was fixed in the recent pypdf2 upload, but I forgot to close this
bug report.

Laszlo/GCS



Reply sent to László Böszörményi (GCS) <gcs@debian.org>:
You have taken responsibility. (Tue, 19 Jul 2022 14:54:08 GMT)


Notification sent to Paul Gevers <elbrus@debian.org>:
Bug acknowledged by developer. (Tue, 19 Jul 2022 14:54:08 GMT)


Message #44 received at 1010821-done@bugs.debian.org:

From: László Böszörményi (GCS) <gcs@debian.org>
To: 1010821-done@bugs.debian.org
Subject: Re: Bug#1010821: pypdf2 breaks xml2rfc autopkgtest: lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1
Date: Tue, 19 Jul 2022 16:52:10 +0200
Version: 2.6.0-1

On Tue, May 10, 2022 at 9:57 PM Paul Gevers <elbrus@debian.org> wrote:
> With a recent upload of pypdf2 the autopkgtest of xml2rfc fails in
> testing when that autopkgtest is run with the binary packages of pypdf2
> from unstable.
Really closing this time.



Bug archived. Request was from Debbugs Internal Request <owner@bugs.debian.org> to internal_control@bugs.debian.org. (Fri, 18 Nov 2022 07:26:14 GMT)

