Debian Bug report logs - #1010821
PyPDF2 fails to read a PDF file with a beginbfchar entry with an empty second element
Report forwarded
to debian-bugs-dist@lists.debian.org, Laszlo Boszormenyi (GCS) <gcs@debian.org>, Daniel Kahn Gillmor <dkg@fifthhorseman.net>:
Bug#1010821; Package src:pypdf2, src:xml2rfc.
(Tue, 10 May 2022 19:57:04 GMT) (full text, mbox, link).
Acknowledgement sent
to Paul Gevers <elbrus@debian.org>:
New Bug report received and forwarded. Copy sent to Laszlo Boszormenyi (GCS) <gcs@debian.org>, Daniel Kahn Gillmor <dkg@fifthhorseman.net>.
(Tue, 10 May 2022 19:57:04 GMT) (full text, mbox, link).
Message #5 received at submit@bugs.debian.org (full text, mbox, reply):
[Message part 1 (text/plain, inline)]
Source: pypdf2, xml2rfc
Control: found -1 pypdf2/1.27.12-1
Control: found -1 xml2rfc/3.12.4-1
Severity: serious
Tags: sid bookworm
User: debian-ci@lists.debian.org
Usertags: breaks needs-update
Dear maintainer(s),
With a recent upload of pypdf2 the autopkgtest of xml2rfc fails in
testing when that autopkgtest is run with the binary packages of pypdf2
from unstable. It passes when run with only packages from testing. In
tabular form:
                       pass            fail
pypdf2                 from testing    1.27.12-1
xml2rfc                from testing    3.12.4-1
all others             from testing    from testing
I copied some of the output at the bottom of this report.
Currently this regression is blocking the migration of pypdf2 to testing
[1]. Due to the nature of this issue, I filed this bug report against
both packages. Can you please investigate the situation and reassign the
bug to the right package?
More information about this bug and the reason for filing it can be found on
https://wiki.debian.org/ContinuousIntegration/RegressionEmailInformation
Paul
[1] https://qa.debian.org/excuses.php?package=pypdf2
https://ci.debian.net/data/autopkgtest/testing/amd64/x/xml2rfc/21504535/log.gz
======================================================================
ERROR: setUpClass (__main__.PdfWriterTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/autopkgtest-lxc.mlxdmdjo/downtmp/build.EDj/src/xxx/test.py", line 495, in setUpClass
    cls.elements_pdfxml = xmldoc(None, bytes=elements_pdfdoc)
  File "/usr/lib/python3/dist-packages/xml2rfc/walkpdf.py", line 97, in xmldoc
    return lxml.etree.fromstring(text)
  File "src/lxml/etree.pyx", line 3252, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1913, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1793, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "<string>", line 11931
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1, line 11931, column 5
----------------------------------------------------------------------
Ran 42 tests in 32.420s
FAILED (errors=1)
autopkgtest [04:57:54]: test run-pytest
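The XMLSyntaxError above reports character value 1 (U+0001), which XML 1.0 does not allow in character data. A minimal sketch of that rejection follows. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml, which is an assumption made to keep the example self-contained, and the helper name parses_as_xml is hypothetical:

```python
import xml.etree.ElementTree as ET


def parses_as_xml(text: str) -> bool:
    """Return True if `text` is accepted by the XML parser (hypothetical helper)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # Raised for malformed input, including forbidden control characters.
        return False
    return True


# Ordinary text is accepted; a U+0001 control character in the text is rejected.
print(parses_as_xml("<text>habibi</text>"))
print(parses_as_xml("<text>\x01</text>"))
```

Any text extracted from a PDF and embedded in XML would need such characters removed or escaped first.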
[OpenPGP_signature (application/pgp-signature, attachment)]
Marked as found in versions pypdf2/1.27.12-1.
Request was from Paul Gevers <elbrus@debian.org>
to submit@bugs.debian.org.
(Tue, 10 May 2022 19:57:04 GMT) (full text, mbox, link).
Marked as found in versions xml2rfc/3.12.4-1.
Request was from Paul Gevers <elbrus@debian.org>
to submit@bugs.debian.org.
(Tue, 10 May 2022 19:57:05 GMT) (full text, mbox, link).
No longer marked as found in versions pypdf2/1.27.12-1 and xml2rfc/3.12.4-1.
Request was from Paul Gevers <elbrus@debian.org>
to control@bugs.debian.org.
(Thu, 30 Jun 2022 06:54:03 GMT) (full text, mbox, link).
Marked as found in versions xml2rfc/3.12.4-1.
Request was from Paul Gevers <elbrus@debian.org>
to control@bugs.debian.org.
(Thu, 30 Jun 2022 06:54:04 GMT) (full text, mbox, link).
Added indication that 1010821 affects src:pypdf2
Request was from Paul Gevers <elbrus@debian.org>
to control@bugs.debian.org.
(Thu, 30 Jun 2022 06:54:05 GMT) (full text, mbox, link).
Information forwarded
to debian-bugs-dist@lists.debian.org:
Bug#1010821; Package src:xml2rfc.
(Fri, 15 Jul 2022 15:12:02 GMT) (full text, mbox, link).
Acknowledgement sent
to Daniel Kahn Gillmor <dkg@fifthhorseman.net>:
Extra info received and forwarded to list.
(Fri, 15 Jul 2022 15:12:02 GMT) (full text, mbox, link).
Message #22 received at 1010821@bugs.debian.org (full text, mbox, reply):
[Message part 1 (text/plain, inline)]
Control: reassign 1010821 pypdf2/2.4.2-1
Control: forwarded 1010821 https://github.com/py-pdf/PyPDF2/issues/1111
Control: retitle 1010821 PyPDF2 fails to read a PDF file with a beginbfchar entry with an empty second element
Control: affects 1010821 + src:xml2rfc src:weasyprint
On Tue 2022-05-10 21:53:30 +0200, Paul Gevers wrote:
> With a recent upload of pypdf2 the autopkgtest of xml2rfc fails in
> testing when that autopkgtest is run with the binary packages of pypdf2
> from unstable. It passes when run with only packages from testing. In
> tabular form:
The problem here is indeed a bug in the latest versions of PyPDF2. I've
traced it back to a failure in how PyPDF2 deals with an empty second
element in a bfchar list:
https://github.com/py-pdf/PyPDF2/issues/1111
You can replicate the problem with this file (habibi.html):
--------
<!DOCTYPE html>
<head>
<meta charset="utf-8">
<title>habibi</title>
</head>
<body>
<div>حَبيبي habibi</div>
</body>
</html>
--------
Feed it through weasyprint from the command line:
------
weasyprint habibi.html habibi.pdf
------
and then in python:
-------
from PyPDF2 import PdfReader
r = PdfReader('habibi.pdf')
t = r.pages[0].extract_text()
-------
This causes a crash in PyPDF2. The crash can be worked around with this
patch:
-----------------
--- a/PyPDF2/_cmap.py
+++ b/PyPDF2/_cmap.py
@@ -245,7 +245,7 @@ def parse_to_unicode(
elif process_char:
lst = [x for x in l.split(b" ") if x]
map_dict[-1] = len(lst[0]) // 2
- while len(lst) > 0:
+ while len(lst) > 1:
map_dict[
unhexlify(lst[0]).decode(
"charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
-----------------
But the patch is insufficient, because then the result of extract_text()
("t" in the python above) is wrong. The problem has to do with subtly
wrong parsing in _cmap.py's parse_to_unicode(). It does manual
manipulation by removing angle brackets and then splitting and
recombining strings based on whitespace. When the contents of some of
the angle brackets are empty, this technique doesn't work.
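
The failure mode described above can be sketched in isolation. This is a simplified illustration, not PyPDF2's actual code: the sample bfchar line and the function names naive_tokens, bfchar_pairs and decode_bfchar are hypothetical. It contrasts stripping angle brackets and splitting on whitespace with matching each <...> hex string as a token, so that an empty <> keeps its position:

```python
import re
from binascii import unhexlify

# Hypothetical bfchar line: code 0x0003 maps to an empty string,
# code 0x0010 maps to U+062D.
SAMPLE = b"<0003> <> <0010> <062D>"

# One hex string per match; the group may be empty for "<>".
HEX_STRING = re.compile(rb"<([0-9A-Fa-f\s]*)>")


def naive_tokens(line: bytes) -> list:
    """Strip angle brackets, then split on spaces (simplified model of the bug)."""
    flat = line.replace(b"<", b" ").replace(b">", b" ")
    return [tok for tok in flat.split(b" ") if tok]


def bfchar_pairs(line: bytes) -> list:
    """Tokenize on <...> so empty hex strings are preserved as b''."""
    tokens = [re.sub(rb"\s", b"", tok) for tok in HEX_STRING.findall(line)]
    if len(tokens) % 2:
        raise ValueError("bfchar line must hold source/destination pairs")
    return list(zip(tokens[0::2], tokens[1::2]))


def decode_bfchar(line: bytes) -> dict:
    """Map each source code to its destination string, decoded as UTF-16-BE."""
    return {
        int(src, 16): unhexlify(dst).decode("utf-16-be")
        for src, dst in bfchar_pairs(line)
    }


# The naive split yields three tokens: the empty destination has vanished,
# so every later source/destination pair is misaligned.
print(naive_tokens(SAMPLE))
# Token-based parsing keeps the empty destination in place.
print(decode_bfchar(SAMPLE))
```

With the naive split the token count is odd, which is why a loop consuming two items at a time runs off the end of the list. Changing the loop bound stops the crash but leaves the pairs shifted.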
--dkg
[signature.asc (application/pgp-signature, inline)]
Changed Bug title to 'PyPDF2 fails to read a PDF file with a beginbfchar entry with an empty second element' from 'pypdf2 breaks xml2rfc autopkgtest: lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1'.
Request was from Daniel Kahn Gillmor <dkg@fifthhorseman.net>
to 1010821-submit@bugs.debian.org.
(Fri, 15 Jul 2022 15:12:03 GMT) (full text, mbox, link).
Added indication that 1010821 affects src:xml2rfc and src:weasyprint
Request was from Daniel Kahn Gillmor <dkg@fifthhorseman.net>
to 1010821-submit@bugs.debian.org.
(Fri, 15 Jul 2022 15:12:04 GMT) (full text, mbox, link).
Bug reassigned from package 'src:xml2rfc' to 'pypdf2'.
Request was from Daniel Kahn Gillmor <dkg@fifthhorseman.net>
to submit@bugs.debian.org.
(Tue, 19 Jul 2022 04:33:05 GMT) (full text, mbox, link).
No longer marked as found in versions xml2rfc/3.12.4-1.
Request was from Daniel Kahn Gillmor <dkg@fifthhorseman.net>
to submit@bugs.debian.org.
(Tue, 19 Jul 2022 04:33:06 GMT) (full text, mbox, link).
Marked as found in versions 2.4.2-1.
Request was from Daniel Kahn Gillmor <dkg@fifthhorseman.net>
to submit@bugs.debian.org.
(Tue, 19 Jul 2022 04:33:06 GMT) (full text, mbox, link).
Information forwarded
to debian-bugs-dist@lists.debian.org, Laszlo Boszormenyi (GCS) <gcs@debian.org>:
Bug#1010821; Package pypdf2.
(Tue, 19 Jul 2022 14:42:03 GMT) (full text, mbox, link).
Acknowledgement sent
to László Böszörményi (GCS) <gcs@debian.org>:
Extra info received and forwarded to list. Copy sent to Laszlo Boszormenyi (GCS) <gcs@debian.org>.
(Tue, 19 Jul 2022 14:42:03 GMT) (full text, mbox, link).
Message #39 received at 1010821@bugs.debian.org (full text, mbox, reply):
Version: 2.6.0-1
On Tue, May 10, 2022 at 9:57 PM Paul Gevers <elbrus@debian.org> wrote:
> With a recent upload of pypdf2 the autopkgtest of xml2rfc fails in
> testing when that autopkgtest is run with the binary packages of pypdf2
> from unstable.
This was fixed in the recent pypdf2 upload, but I forgot to close this
bug report.
Laszlo/GCS
Reply sent
to László Böszörményi (GCS) <gcs@debian.org>:
You have taken responsibility.
(Tue, 19 Jul 2022 14:54:08 GMT) (full text, mbox, link).
Notification sent
to Paul Gevers <elbrus@debian.org>:
Bug acknowledged by developer.
(Tue, 19 Jul 2022 14:54:08 GMT) (full text, mbox, link).
Message #44 received at 1010821-done@bugs.debian.org (full text, mbox, reply):
Version: 2.6.0-1
On Tue, May 10, 2022 at 9:57 PM Paul Gevers <elbrus@debian.org> wrote:
> With a recent upload of pypdf2 the autopkgtest of xml2rfc fails in
> testing when that autopkgtest is run with the binary packages of pypdf2
> from unstable.
Really closing this time.
Bug archived.
Request was from Debbugs Internal Request <owner@bugs.debian.org>
to internal_control@bugs.debian.org.
(Fri, 18 Nov 2022 07:26:14 GMT) (full text, mbox, link).
Debian bug tracking system administrator <owner@bugs.debian.org>.
Last modified:
Thu May 9 12:19:52 2024;