Debian Bug report logs - #656288
python3-apt: difficulties with non-UTF-8-encoded TagFiles

Package: python3-apt; Maintainer for python3-apt is APT Development Team <deity@lists.debian.org>; Source for python3-apt is src:python-apt.

Reported by: Colin Watson <cjwatson@debian.org>

Date: Wed, 18 Jan 2012 01:00:02 UTC

Severity: normal

Found in version python-apt/0.8.3

Fixed in version python-apt/0.8.5

Done: Julian Andres Klode <jak@debian.org>

Bug is archived. No further changes may be made.


Report forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#656288; Package python3-apt. (Wed, 18 Jan 2012 01:00:05 GMT)


Acknowledgement sent to Colin Watson <cjwatson@debian.org>:
New Bug report received and forwarded. Copy sent to APT Development Team <deity@lists.debian.org>. (Wed, 18 Jan 2012 01:00:05 GMT)


Message #5 received at submit@bugs.debian.org:

From: Colin Watson <cjwatson@debian.org>
To: submit@bugs.debian.org
Subject: python3-apt: difficulties with non-UTF-8-encoded TagFiles
Date: Wed, 18 Jan 2012 00:56:03 +0000

Package: python3-apt
Version: 0.8.3
Severity: normal

In Python 3, I can find no way to get apt_pkg.TagFile to read a file
that isn't encoded in UTF-8:

  >>> import sys
  >>> import apt_pkg
  >>> sys.version
  '3.2.2+ (default, Jan  8 2012, 07:26:18) \n[GCC 4.6.2]'
  >>> with open("test", "w", encoding="iso-8859-1") as test:
  ...     print("Package: test", file=test)
  ...     print("Maintainer: M\xe4intainer <test@example.org>", file=test)
  ...     print(file=test)
  ...
  >>> tagfile = apt_pkg.TagFile(open("test", "rb"))
  >>> next(tagfile)["Maintainer"]
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 1: invalid continuation byte
  >>> tagfile = apt_pkg.TagFile(open("test", encoding="iso-8859-1"))
  >>> next(tagfile)["Maintainer"]
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 1: invalid continuation byte

Whereas in Python 2:

  >>> import sys
  >>> import apt_pkg
  >>> sys.version
  '2.7.2+ (default, Jan 13 2012, 23:15:17) \n[GCC 4.6.2]'
  >>> tagfile = apt_pkg.TagFile(open("test", "rb"))
  >>> tagfile.next()["Maintainer"]
  'M\xe4intainer <test@example.org>'

This breaks part of the python-debian test suite (I'm currently trying
to port python-debian to Python 3), which is interested in such things
as making sure that it's possible to parse old Sources files from before
Debian switched to UTF-8.
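
The decoding failure can be reproduced without python-apt at all. The following is a minimal sketch using only the standard library; the variable names are illustrative and are not part of python-apt or python-debian.

```python
# Byte 0xe4 is "ä" in ISO-8859-1, but it is not valid UTF-8 when followed
# by a plain ASCII byte such as "i".
raw = "M\xe4intainer <test@example.org>".encode("iso-8859-1")

try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Decoding with the encoding the file was written in round-trips cleanly.
value = raw.decode("iso-8859-1")
```

This suggests the bytes reach Python intact and that only the hard-coded choice of codec is at fault.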

A fix is tricky.  We can't do anything actually nice using Python 3's
I/O facilities, because python-apt just pokes around to find the file
descriptor and passes that directly to apt.  However, one idea that
comes to mind is that if you open a file with the 'encoding' parameter
then python-apt could spot that in the file object, remember it, and
decode bytes using that encoding any time it wants to return a Unicode
string.
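
In pure Python, that idea might look roughly like the sketch below. The helper names detect_encoding and decode_value are hypothetical and exist only for illustration; the real change would have to live in python-apt's C++ bindings, and the UTF-8 default is an assumption.

```python
import os
import tempfile


def detect_encoding(fileobj):
    """Return the file object's encoding name, or None if it has none."""
    encoding = getattr(fileobj, "encoding", None)
    return encoding if isinstance(encoding, str) else None


def decode_value(raw, fileobj):
    """Decode raw bytes with the file's encoding, defaulting to UTF-8."""
    return raw.decode(detect_encoding(fileobj) or "utf-8")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test")
    with open(path, "w", encoding="iso-8859-1") as f:
        f.write("Maintainer: M\xe4intainer <test@example.org>\n\n")

    # A text-mode file object remembers the encoding it was opened with.
    with open(path, encoding="iso-8859-1") as text_file:
        text_encoding = detect_encoding(text_file)
        maintainer = decode_value(
            b"M\xe4intainer <test@example.org>", text_file)

    # A binary-mode file object has no "encoding" attribute.
    with open(path, "rb") as binary_file:
        binary_encoding = detect_encoding(binary_file)
```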

python-debian's test suite also tests that it's possible to parse old
Sources files in *mixed* encodings.  This is going to be harder because
it basically means having apt_pkg.TagSection return bytes, which I don't
think is desirable in general.  Maybe this could be optional somehow?

Thanks,

-- 
Colin Watson                                       [cjwatson@debian.org]




Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#656288; Package python3-apt. (Wed, 18 Jan 2012 10:06:03 GMT)


Acknowledgement sent to Colin Watson <cjwatson@debian.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. (Wed, 18 Jan 2012 10:06:05 GMT)


Message #10 received at submit@bugs.debian.org:

From: Colin Watson <cjwatson@debian.org>
To: submit@bugs.debian.org
Subject: Re: Bug#656288: python3-apt: difficulties with non-UTF-8-encoded TagFiles
Date: Wed, 18 Jan 2012 10:02:31 +0000

On Wed, Jan 18, 2012 at 12:56:03AM +0000, Colin Watson wrote:
> python-debian's test suite also tests that it's possible to parse old
> Sources files in *mixed* encodings.  This is going to be harder because
> it basically means having apt_pkg.TagSection return bytes, which I don't
> think is desirable in general.  Maybe this could be optional somehow?

Thinking about it, this seems a reasonable thing to make switchable in
TagFile's constructor.  After all:

  >>> with open("test", encoding="iso-8859-1") as test:
  ...     print(test.read().__class__)
  ...
  <class 'str'>
  >>> with open("test", mode="rb") as test:
  ...     print(test.read().__class__)
  ...
  <class 'bytes'>

So there's clear precedent in the language for the same method returning
str or bytes depending on how the class was constructed.  Maybe a bytes=
keyword argument?
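
If values came back as bytes, a caller dealing with mixed-encoding files could then decode each value itself. The sketch below assumes a simple "try UTF-8, fall back to ISO-8859-1" policy; decode_mixed is a hypothetical helper, not something python-apt or python-debian is known to provide.

```python
def decode_mixed(raw, encodings=("utf-8", "iso-8859-1")):
    """Decode raw bytes with the first candidate encoding that accepts them."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


value = "T\xe9st Pers\xf6n <test@example.org>"
# The same header value as it would appear in two stanzas of a
# mixed-encoding file, handed back as bytes.
raw_values = [value.encode("utf-8"), value.encode("iso-8859-1")]
decoded = [decode_mixed(raw) for raw in raw_values]
```

This is only a heuristic: an ISO-8859-1 byte sequence that happens to be valid UTF-8 would be decoded as UTF-8.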

-- 
Colin Watson                                       [cjwatson@debian.org]




Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#656288; Package python3-apt. (Wed, 18 Jan 2012 11:18:04 GMT)


Acknowledgement sent to Julian Andres Klode <jak@debian.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. (Wed, 18 Jan 2012 11:18:08 GMT)


Message #15 received at submit@bugs.debian.org:

From: Julian Andres Klode <jak@debian.org>
To: Colin Watson <cjwatson@debian.org>, 656288@bugs.debian.org
Cc: submit@bugs.debian.org
Subject: Re: Bug#656288: python3-apt: difficulties with non-UTF-8-encoded TagFiles
Date: Wed, 18 Jan 2012 12:14:02 +0100

On Wed, Jan 18, 2012 at 10:02:31AM +0000, Colin Watson wrote:
> On Wed, Jan 18, 2012 at 12:56:03AM +0000, Colin Watson wrote:
> > python-debian's test suite also tests that it's possible to parse old
> > Sources files in *mixed* encodings.  This is going to be harder because
> > it basically means having apt_pkg.TagSection return bytes, which I don't
> > think is desirable in general.  Maybe this could be optional somehow?
> 
> Thinking about it, this seems a reasonable thing to make switchable in
> TagFile's constructor.  After all:
> 
>   >>> with open("test", encoding="iso-8859-1") as test:
>   ...     print(test.read().__class__)
>   ...
>   <class 'str'>
>   >>> with open("test", mode="rb") as test:
>   ...     print(test.read().__class__)
>   ...
>   <class 'bytes'>
> 
> So there's clear precedent in the language for the same method returning
> str or bytes depending on how the class was constructed.  Maybe a bytes=
> keyword argument?

You'd also need to take care of TagSection if that is done, which should
then work in bytes mode when passed a bytes string.

Basically you'd need to modify TagSection and TagFile to both store whether
to use bytes or unicode and pass the value of that flag from the TagFile
to the TagSection. Then create a function

	PyObject *TagFile_ToString(char *s, size_t n)

or similar that uses PyString_* functions or PyBytes_* functions depending
on the context (where PyString is mapped to unicode in Python 3, and
str in Python 2). Then use that function everywhere we currently
create strings in the TagFile.
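
The shape of that design can be sketched in Python. SketchFile and SketchSection below are hypothetical stand-ins for TagFile and TagSection, and _to_string plays the role of the proposed TagFile_ToString; the real code is C++ and would use the PyBytes_* and PyUnicode_* functions instead.

```python
class SketchSection:
    """Hypothetical stand-in for TagSection: one stanza of raw header bytes."""

    def __init__(self, raw_fields, use_bytes=False):
        self._raw = raw_fields        # dict mapping str name -> bytes value
        self._use_bytes = use_bytes   # flag copied from the owning file

    def _to_string(self, raw):
        # Single conversion point used wherever a value is returned.
        return raw if self._use_bytes else raw.decode("utf-8")

    def __getitem__(self, name):
        return self._to_string(self._raw[name])


class SketchFile:
    """Hypothetical stand-in for TagFile: passes its flag to each section."""

    def __init__(self, stanzas, use_bytes=False):
        self._stanzas = stanzas
        self._use_bytes = use_bytes

    def __iter__(self):
        for fields in self._stanzas:
            yield SketchSection(fields, use_bytes=self._use_bytes)


latin1 = [{"Maintainer": b"M\xe4intainer <test@example.org>"}]
utf8 = [{"Maintainer": "M\xe4intainer <test@example.org>".encode("utf-8")}]

as_bytes = next(iter(SketchFile(latin1, use_bytes=True)))["Maintainer"]
as_text = next(iter(SketchFile(utf8)))["Maintainer"]
```

Keeping the conversion in one helper means the bytes-or-text decision is made in exactly one place, whichever accessor is called.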


-- 
Julian Andres Klode  - Debian Developer, Ubuntu Member

See http://wiki.debian.org/JulianAndresKlode and http://jak-linux.org/.




Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#656288; Package python3-apt. (Wed, 18 Jan 2012 11:18:12 GMT)


Acknowledgement sent to Julian Andres Klode <jak@debian.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. (Wed, 18 Jan 2012 11:18:16 GMT)


Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#656288; Package python3-apt. (Fri, 20 Jan 2012 16:00:03 GMT)


Acknowledgement sent to Colin Watson <cjwatson@ubuntu.com>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. (Fri, 20 Jan 2012 16:00:03 GMT)


Message #25 received at 656288@bugs.debian.org:

From: Colin Watson <cjwatson@ubuntu.com>
To: Julian Andres Klode <jak@debian.org>, 656288@bugs.debian.org
Subject: Re: Bug#656288: python3-apt: difficulties with non-UTF-8-encoded TagFiles
Date: Fri, 20 Jan 2012 15:57:45 +0000

On Wed, Jan 18, 2012 at 12:14:02PM +0100, Julian Andres Klode wrote:
> You'd also need to take care of TagSection if that is done, which should
> then work in bytes mode when passed a bytes string.
> 
> Basically you'd need to modify TagSection and TagFile to both store whether
> to use bytes or unicode and pass the value of that flag from the TagFile
> to the TagSection. Then create a function
> 
> 	PyObject *TagFile_ToString(char *s, size_t n)
> 
> or similar that uses PyString_* functions or PyBytes_* functions depending
> on the context (where PyString is mapped to unicode in Python 3, and
> str in Python 2). Then use that function everywhere we currently
> create strings in the TagFile.

OK.  How about something like this?  I added both an explicit bytes=
parameter and a fallback which tries to detect the encoding from the
file object.

=== modified file 'python/tag.cc'
--- python/tag.cc	2011-11-10 16:20:58 +0000
+++ python/tag.cc	2012-01-20 14:47:56 +0000
@@ -38,6 +38,10 @@ using namespace std;
 struct TagSecData : public CppPyObject<pkgTagSection>
 {
    char *Data;
+   bool Bytes;
+#if PY_MAJOR_VERSION >= 3
+   PyObject *Encoding;
+#endif
 };
 
 // The owner of the TagFile is a Python file object.
@@ -45,6 +49,10 @@ struct TagFileData : public CppPyObject<
 {
    TagSecData *Section;
    FileFd Fd;
+   bool Bytes;
+#if PY_MAJOR_VERSION >= 3
+   PyObject *Encoding;
+#endif
 };
 
 // Traversal and Clean for owned objects
@@ -60,6 +68,35 @@ int TagFileClear(PyObject *self) {
     return 0;
 }
 
+// Helpers to return Unicode or bytes as appropriate.
+#if PY_MAJOR_VERSION < 3
+#define TagSecString_FromStringAndSize(self, v, len) \
+    PyString_FromStringAndSize((v), (len))
+#define TagSecString_FromString(self, v) PyString_FromString(v)
+#else
+PyObject *TagSecString_FromStringAndSize(PyObject *self, const char *v,
+	 				 Py_ssize_t len) {
+   TagSecData *Self = (TagSecData *)self;
+   if (Self->Bytes)
+      return PyBytes_FromStringAndSize(v, len);
+   else if (Self->Encoding)
+      return PyUnicode_Decode(v, len, PyUnicode_AsString(Self->Encoding), 0);
+   else
+      return PyUnicode_FromStringAndSize(v, len);
+}
+
+PyObject *TagSecString_FromString(PyObject *self, const char *v) {
+   TagSecData *Self = (TagSecData *)self;
+   if (Self->Bytes)
+      return PyBytes_FromString(v);
+   else if (Self->Encoding)
+      return PyUnicode_Decode(v, strlen(v),
+			      PyUnicode_AsString(Self->Encoding), 0);
+   else
+      return PyUnicode_FromString(v);
+}
+#endif
+
 
 									/*}}}*/
 // TagSecFree - Free a Tag Section					/*{{{*/
@@ -107,9 +144,9 @@ static PyObject *TagSecFind(PyObject *Se
    {
       if (Default == 0)
 	 Py_RETURN_NONE;
-      return PyString_FromString(Default);
+      return TagSecString_FromString(Self,Default);
    }
-   return PyString_FromStringAndSize(Start,Stop-Start);
+   return TagSecString_FromStringAndSize(Self,Start,Stop-Start);
 }
 
 static char *doc_FindRaw =
@@ -128,14 +165,14 @@ static PyObject *TagSecFindRaw(PyObject
    {
       if (Default == 0)
 	 Py_RETURN_NONE;
-      return PyString_FromString(Default);
+      return TagSecString_FromString(Self,Default);
    }
 
    const char *Start;
    const char *Stop;
    GetCpp<pkgTagSection>(Self).Get(Start,Stop,Pos);
 
-   return PyString_FromStringAndSize(Start,Stop-Start);
+   return TagSecString_FromStringAndSize(Self,Start,Stop-Start);
 }
 
 static char *doc_FindFlag =
@@ -161,21 +198,18 @@ static PyObject *TagSecFindFlag(PyObject
 // Map access, operator []
 static PyObject *TagSecMap(PyObject *Self,PyObject *Arg)
 {
-   if (PyString_Check(Arg) == 0)
-   {
-      PyErr_SetNone(PyExc_TypeError);
+   const char *Name = PyObject_AsString(Arg);
+   if (Name == 0)
       return 0;
-   }
-
    const char *Start;
    const char *Stop;
-   if (GetCpp<pkgTagSection>(Self).Find(PyString_AsString(Arg),Start,Stop) == false)
+   if (GetCpp<pkgTagSection>(Self).Find(Name,Start,Stop) == false)
    {
-      PyErr_SetString(PyExc_KeyError,PyString_AsString(Arg));
+      PyErr_SetString(PyExc_KeyError,Name);
       return 0;
    }
 
-   return PyString_FromStringAndSize(Start,Stop-Start);
+   return TagSecString_FromStringAndSize(Self,Start,Stop-Start);
 }
 
 // len() operation
@@ -230,9 +264,9 @@ static PyObject *TagSecExists(PyObject *
 
 static int TagSecContains(PyObject *Self,PyObject *Arg)
 {
-   if (PyString_Check(Arg) == 0)
-       return 0;
-   const char *Name = PyString_AsString(Arg);
+   const char *Name = PyObject_AsString(Arg);
+   if (Name == 0)
+      return 0;
    const char *Start;
    const char *Stop;
    if (GetCpp<pkgTagSection>(Self).Find(Name,Start,Stop) == false)
@@ -256,7 +290,7 @@ static PyObject *TagSecStr(PyObject *Sel
    const char *Start;
    const char *Stop;
    GetCpp<pkgTagSection>(Self).GetSection(Start,Stop);
-   return PyString_FromStringAndSize(Start,Stop-Start);
+   return TagSecString_FromStringAndSize(Self,Start,Stop-Start);
 }
 									/*}}}*/
 // TagFile Wrappers							/*{{{*/
@@ -286,6 +320,12 @@ static PyObject *TagFileNext(PyObject *S
    Obj.Section->Owner = Self;
    Py_INCREF(Obj.Section->Owner);
    Obj.Section->Data = 0;
+   Obj.Section->Bytes = Obj.Bytes;
+#if PY_MAJOR_VERSION >= 3
+   // We don't need to incref Encoding as the previous Section object already
+   // held a reference to it.
+   Obj.Section->Encoding = Obj.Encoding;
+#endif
    if (Obj.Object.Step(Obj.Section->Object) == false)
       return HandleErrors(NULL);
 
@@ -347,11 +387,12 @@ static PyObject *TagFileJump(PyObject *S
 static PyObject *TagSecNew(PyTypeObject *type,PyObject *Args,PyObject *kwds) {
    char *Data;
    int Len;
-   char *kwlist[] = {"text", 0};
+   char Bytes = 0;
+   char *kwlist[] = {"text", "bytes", 0};
 
    // this allows reading "byte" types from python3 - but we don't
    // make (much) use of it yet
-   if (PyArg_ParseTupleAndKeywords(Args,kwds,"s#",kwlist,&Data,&Len) == 0)
+   if (PyArg_ParseTupleAndKeywords(Args,kwds,"s#|b",kwlist,&Data,&Len,&Bytes) == 0)
       return 0;
 
    // Create the object..
@@ -359,6 +400,10 @@ static PyObject *TagSecNew(PyTypeObject
    new (&New->Object) pkgTagSection();
    New->Data = new char[strlen(Data)+2];
    snprintf(New->Data,strlen(Data)+2,"%s\n",Data);
+   New->Bytes = Bytes;
+#if PY_MAJOR_VERSION >= 3
+   New->Encoding = 0;
+#endif
 
    if (New->Object.Scan(New->Data,strlen(New->Data)) == false)
    {
@@ -390,9 +435,10 @@ PyObject *ParseSection(PyObject *self,Py
 
 static PyObject *TagFileNew(PyTypeObject *type,PyObject *Args,PyObject *kwds)
 {
-   PyObject *File;
-   char *kwlist[] = {"file", 0};
-   if (PyArg_ParseTupleAndKeywords(Args,kwds,"O",kwlist,&File) == 0)
+   PyObject *File = 0;
+   char Bytes = 0;
+   char *kwlist[] = {"file", "bytes", 0};
+   if (PyArg_ParseTupleAndKeywords(Args,kwds,"O|b",kwlist,&File,&Bytes) == 0)
       return 0;
    int fileno = PyObject_AsFileDescriptor(File);
    if (fileno == -1)
@@ -405,8 +451,15 @@ static PyObject *TagFileNew(PyTypeObject
 #else
    new (&New->Fd) FileFd(fileno,false);
 #endif
+   New->Bytes = Bytes;
    New->Owner = File;
    Py_INCREF(New->Owner);
+#if PY_MAJOR_VERSION >= 3
+   New->Encoding = PyObject_GetAttr(File, PyUnicode_FromString("encoding"));
+   if (!PyUnicode_Check(New->Encoding))
+      New->Encoding = 0;
+   Py_XINCREF(New->Encoding);
+#endif
    new (&New->Object) pkgTagFile(&New->Fd);
 
    // Create the section
@@ -415,6 +468,11 @@ static PyObject *TagFileNew(PyTypeObject
    New->Section->Owner = New;
    Py_INCREF(New->Section->Owner);
    New->Section->Data = 0;
+   New->Section->Bytes = Bytes;
+#if PY_MAJOR_VERSION >= 3
+   New->Section->Encoding = New->Encoding;
+   Py_XINCREF(New->Section->Encoding);
+#endif
 
    return HandleErrors(New);
 }
@@ -492,7 +550,7 @@ PyObject *RewriteSection(PyObject *self,
    }
 
    // Return the string
-   PyObject *ResObj = PyString_FromStringAndSize(bp,size);
+   PyObject *ResObj = TagSecString_FromStringAndSize(Section,bp,size);
    free(bp);
    return HandleErrors(ResObj);
 }
@@ -521,11 +579,15 @@ PySequenceMethods TagSecSeqMeth = {0,0,0
 PyMappingMethods TagSecMapMeth = {TagSecLength,TagSecMap,0};
 
 
-static char *doc_TagSec = "TagSection(text: str)\n\n"
+static char *doc_TagSec = "TagSection(text: str, [bytes: bool = False])\n\n"
    "Provide methods to access RFC822-style header sections, like those\n"
    "found in debian/control or Packages files.\n\n"
    "TagSection() behave like read-only dictionaries and also provide access\n"
-   "to the functions provided by the C++ class (e.g. find)";
+   "to the functions provided by the C++ class (e.g. find).\n\n"
+   "By default, text read from files is treated as strings (binary data in\n"
+   "Python 2, Unicode strings in Python 3). Use bytes=True to cause all\n"
+   "header values read from this TagSection to be bytes even in Python 3.\n"
+   "Header names are always treated as Unicode.";
 PyTypeObject PyTagSection_Type =
 {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
@@ -596,7 +658,7 @@ static PyGetSetDef TagFileGetSet[] = {
 };
 
 
-static char *doc_TagFile = "TagFile(file)\n\n"
+static char *doc_TagFile = "TagFile(file, [bytes: bool = False])\n\n"
    "TagFile() objects provide access to debian control files, which consist\n"
    "of multiple RFC822-style sections.\n\n"
    "To provide access to those sections, TagFile objects provide an iterator\n"
@@ -608,7 +670,11 @@ static char *doc_TagFile = "TagFile(file
    "It is important to not mix the use of both APIs, because this can have\n"
    "unwanted effects.\n\n"
    "The parameter 'file' refers to an object providing a fileno() method or\n"
-   "a file descriptor (an integer)";
+   "a file descriptor (an integer).\n\n"
+   "By default, text read from files is treated as strings (binary data in\n"
+   "Python 2, Unicode strings in Python 3). Use bytes=True to cause all\n"
+   "header values read from this TagFile to be bytes even in Python 3.\n"
+   "Header names are always treated as Unicode.";
 
 // Type for a Tag File
 PyTypeObject PyTagFile_Type =

=== added file 'tests/test_tagfile.py'
--- tests/test_tagfile.py	1970-01-01 00:00:00 +0000
+++ tests/test_tagfile.py	2012-01-20 14:52:28 +0000
@@ -0,0 +1,106 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+#
+# Copyright (C) 2012 Canonical Ltd.
+# Author: Colin Watson <cjwatson@ubuntu.com>
+#
+# Copying and distribution of this file, with or without modification,
+# are permitted in any medium without royalty provided the copyright
+# notice and this notice are preserved.
+"""Unit tests for verifying the correctness of apt_pkg.TagFile."""
+from __future__ import print_function, unicode_literals
+import io
+import os
+import shutil
+import sys
+import tempfile
+import unittest
+
+import apt_pkg
+
+
+class TestTagFile(unittest.TestCase):
+    """Test apt_pkg.TagFile."""
+
+    def setUp(self):
+        apt_pkg.init()
+        self.temp_dir = tempfile.mkdtemp()
+
+    def tearDown(self):
+        shutil.rmtree(self.temp_dir)
+
+    def test_utf8(self):
+        value = "Tést Persön <test@example.org>"
+        packages = os.path.join(self.temp_dir, "Packages")
+        with io.open(packages, "w", encoding="UTF-8") as packages_file:
+            print("Maintainer: %s" % value, file=packages_file)
+            print("", file=packages_file)
+        if sys.version < '3':
+            # In Python 2, test the traditional file interface.
+            with open(packages) as packages_file:
+                tagfile = apt_pkg.TagFile(packages_file)
+                tagfile.step()
+                self.assertEqual(
+                    value.encode("UTF-8"), tagfile.section["Maintainer"])
+        with io.open(packages, encoding="UTF-8") as packages_file:
+            tagfile = apt_pkg.TagFile(packages_file)
+            tagfile.step()
+            if sys.version < '3':
+                self.assertEqual(
+                    value.encode("UTF-8"), tagfile.section["Maintainer"])
+            else:
+                self.assertEqual(value, tagfile.section["Maintainer"])
+
+    def test_latin1(self):
+        value = "Tést Persön <test@example.org>"
+        packages = os.path.join(self.temp_dir, "Packages")
+        with io.open(packages, "w", encoding="ISO-8859-1") as packages_file:
+            print("Maintainer: %s" % value, file=packages_file)
+            print("", file=packages_file)
+        if sys.version < '3':
+            # In Python 2, test the traditional file interface.
+            with open(packages) as packages_file:
+                tagfile = apt_pkg.TagFile(packages_file)
+                tagfile.step()
+                self.assertEqual(
+                    value.encode("ISO-8859-1"), tagfile.section["Maintainer"])
+        with io.open(packages) as packages_file:
+            tagfile = apt_pkg.TagFile(packages_file, bytes=True)
+            tagfile.step()
+            self.assertEqual(
+                value.encode("ISO-8859-1"), tagfile.section["Maintainer"])
+        if sys.version >= '3':
+            # In Python 3, TagFile can pick up the encoding of the file
+            # object.
+            with io.open(packages, encoding="ISO-8859-1") as packages_file:
+                tagfile = apt_pkg.TagFile(packages_file)
+                tagfile.step()
+                self.assertEqual(value, tagfile.section["Maintainer"])
+
+    def test_mixed(self):
+        value = "Tést Persön <test@example.org>"
+        packages = os.path.join(self.temp_dir, "Packages")
+        with io.open(packages, "w", encoding="UTF-8") as packages_file:
+            print("Maintainer: %s" % value, file=packages_file)
+            print("", file=packages_file)
+        with io.open(packages, "a", encoding="ISO-8859-1") as packages_file:
+            print("Maintainer: %s" % value, file=packages_file)
+            print("", file=packages_file)
+        if sys.version < '3':
+            # In Python 2, test the traditional file interface.
+            with open(packages) as packages_file:
+                tagfile = apt_pkg.TagFile(packages_file)
+                tagfile.step()
+                self.assertEqual(
+                    value.encode("UTF-8"), tagfile.section["Maintainer"])
+                tagfile.step()
+                self.assertEqual(
+                    value.encode("ISO-8859-1"), tagfile.section["Maintainer"])
+        with io.open(packages) as packages_file:
+            tagfile = apt_pkg.TagFile(packages_file, bytes=True)
+            tagfile.step()
+            self.assertEqual(
+                value.encode("UTF-8"), tagfile.section["Maintainer"])
+            tagfile.step()
+            self.assertEqual(
+                value.encode("ISO-8859-1"), tagfile.section["Maintainer"])

-- 
Colin Watson                                       [cjwatson@ubuntu.com]




Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#656288; Package python3-apt. (Sat, 21 Jan 2012 03:15:03 GMT)


Acknowledgement sent to Colin Watson <cjwatson@ubuntu.com>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. (Sat, 21 Jan 2012 03:15:03 GMT)


Message #30 received at 656288@bugs.debian.org:

From: Colin Watson <cjwatson@ubuntu.com>
To: Julian Andres Klode <jak@debian.org>, 656288@bugs.debian.org
Subject: Re: Bug#656288: python3-apt: difficulties with non-UTF-8-encoded TagFiles
Date: Sat, 21 Jan 2012 03:10:54 +0000

On Fri, Jan 20, 2012 at 03:57:45PM +0000, Colin Watson wrote:
> OK.  How about something like this?  I added both an explicit bytes=
> parameter and a fallback which tries to detect the encoding from the
> file object.

This crashed in the python-debian test suite due to a silly mistake.
Here's a fixed version.

(That said, I haven't entirely got the python-debian test suite to pass
yet.  It's parsing mixed-encoding files in terrifying ways which are
going to require some rearrangement to work well with Python 3.  So,
strictly, I can't claim to be 100% confident in this change yet; but I'd
appreciate knowing whether the general approach is OK with you.)

=== modified file 'python/tag.cc'
--- python/tag.cc	2011-11-10 16:20:58 +0000
+++ python/tag.cc	2012-01-20 17:12:36 +0000
@@ -38,6 +38,10 @@ using namespace std;
 struct TagSecData : public CppPyObject<pkgTagSection>
 {
    char *Data;
+   bool Bytes;
+#if PY_MAJOR_VERSION >= 3
+   PyObject *Encoding;
+#endif
 };
 
 // The owner of the TagFile is a Python file object.
@@ -45,6 +49,10 @@ struct TagFileData : public CppPyObject<
 {
    TagSecData *Section;
    FileFd Fd;
+   bool Bytes;
+#if PY_MAJOR_VERSION >= 3
+   PyObject *Encoding;
+#endif
 };
 
 // Traversal and Clean for owned objects
@@ -60,6 +68,35 @@ int TagFileClear(PyObject *self) {
     return 0;
 }
 
+// Helpers to return Unicode or bytes as appropriate.
+#if PY_MAJOR_VERSION < 3
+#define TagSecString_FromStringAndSize(self, v, len) \
+    PyString_FromStringAndSize((v), (len))
+#define TagSecString_FromString(self, v) PyString_FromString(v)
+#else
+PyObject *TagSecString_FromStringAndSize(PyObject *self, const char *v,
+	 				 Py_ssize_t len) {
+   TagSecData *Self = (TagSecData *)self;
+   if (Self->Bytes)
+      return PyBytes_FromStringAndSize(v, len);
+   else if (Self->Encoding)
+      return PyUnicode_Decode(v, len, PyUnicode_AsString(Self->Encoding), 0);
+   else
+      return PyUnicode_FromStringAndSize(v, len);
+}
+
+PyObject *TagSecString_FromString(PyObject *self, const char *v) {
+   TagSecData *Self = (TagSecData *)self;
+   if (Self->Bytes)
+      return PyBytes_FromString(v);
+   else if (Self->Encoding)
+      return PyUnicode_Decode(v, strlen(v),
+			      PyUnicode_AsString(Self->Encoding), 0);
+   else
+      return PyUnicode_FromString(v);
+}
+#endif
+
 
 									/*}}}*/
 // TagSecFree - Free a Tag Section					/*{{{*/
@@ -107,9 +144,9 @@ static PyObject *TagSecFind(PyObject *Se
    {
       if (Default == 0)
 	 Py_RETURN_NONE;
-      return PyString_FromString(Default);
+      return TagSecString_FromString(Self,Default);
    }
-   return PyString_FromStringAndSize(Start,Stop-Start);
+   return TagSecString_FromStringAndSize(Self,Start,Stop-Start);
 }
 
 static char *doc_FindRaw =
@@ -128,14 +165,14 @@ static PyObject *TagSecFindRaw(PyObject
    {
       if (Default == 0)
 	 Py_RETURN_NONE;
-      return PyString_FromString(Default);
+      return TagSecString_FromString(Self,Default);
    }
 
    const char *Start;
    const char *Stop;
    GetCpp<pkgTagSection>(Self).Get(Start,Stop,Pos);
 
-   return PyString_FromStringAndSize(Start,Stop-Start);
+   return TagSecString_FromStringAndSize(Self,Start,Stop-Start);
 }
 
 static char *doc_FindFlag =
@@ -161,21 +198,18 @@ static PyObject *TagSecFindFlag(PyObject
 // Map access, operator []
 static PyObject *TagSecMap(PyObject *Self,PyObject *Arg)
 {
-   if (PyString_Check(Arg) == 0)
-   {
-      PyErr_SetNone(PyExc_TypeError);
+   const char *Name = PyObject_AsString(Arg);
+   if (Name == 0)
       return 0;
-   }
-
    const char *Start;
    const char *Stop;
-   if (GetCpp<pkgTagSection>(Self).Find(PyString_AsString(Arg),Start,Stop) == false)
+   if (GetCpp<pkgTagSection>(Self).Find(Name,Start,Stop) == false)
    {
-      PyErr_SetString(PyExc_KeyError,PyString_AsString(Arg));
+      PyErr_SetString(PyExc_KeyError,Name);
       return 0;
    }
 
-   return PyString_FromStringAndSize(Start,Stop-Start);
+   return TagSecString_FromStringAndSize(Self,Start,Stop-Start);
 }
 
 // len() operation
@@ -230,9 +264,9 @@ static PyObject *TagSecExists(PyObject *
 
 static int TagSecContains(PyObject *Self,PyObject *Arg)
 {
-   if (PyString_Check(Arg) == 0)
-       return 0;
-   const char *Name = PyString_AsString(Arg);
+   const char *Name = PyObject_AsString(Arg);
+   if (Name == 0)
+      return 0;
    const char *Start;
    const char *Stop;
    if (GetCpp<pkgTagSection>(Self).Find(Name,Start,Stop) == false)
@@ -256,7 +290,7 @@ static PyObject *TagSecStr(PyObject *Sel
    const char *Start;
    const char *Stop;
    GetCpp<pkgTagSection>(Self).GetSection(Start,Stop);
-   return PyString_FromStringAndSize(Start,Stop-Start);
+   return TagSecString_FromStringAndSize(Self,Start,Stop-Start);
 }
 									/*}}}*/
 // TagFile Wrappers							/*{{{*/
@@ -286,6 +320,12 @@ static PyObject *TagFileNext(PyObject *S
    Obj.Section->Owner = Self;
    Py_INCREF(Obj.Section->Owner);
    Obj.Section->Data = 0;
+   Obj.Section->Bytes = Obj.Bytes;
+#if PY_MAJOR_VERSION >= 3
+   // We don't need to incref Encoding as the previous Section object already
+   // held a reference to it.
+   Obj.Section->Encoding = Obj.Encoding;
+#endif
    if (Obj.Object.Step(Obj.Section->Object) == false)
       return HandleErrors(NULL);
 
@@ -347,11 +387,12 @@ static PyObject *TagFileJump(PyObject *S
 static PyObject *TagSecNew(PyTypeObject *type,PyObject *Args,PyObject *kwds) {
    char *Data;
    int Len;
-   char *kwlist[] = {"text", 0};
+   char Bytes = 0;
+   char *kwlist[] = {"text", "bytes", 0};
 
    // this allows reading "byte" types from python3 - but we don't
    // make (much) use of it yet
-   if (PyArg_ParseTupleAndKeywords(Args,kwds,"s#",kwlist,&Data,&Len) == 0)
+   if (PyArg_ParseTupleAndKeywords(Args,kwds,"s#|b",kwlist,&Data,&Len,&Bytes) == 0)
       return 0;
 
    // Create the object..
@@ -359,6 +400,10 @@ static PyObject *TagSecNew(PyTypeObject
    new (&New->Object) pkgTagSection();
    New->Data = new char[strlen(Data)+2];
    snprintf(New->Data,strlen(Data)+2,"%s\n",Data);
+   New->Bytes = Bytes;
+#if PY_MAJOR_VERSION >= 3
+   New->Encoding = 0;
+#endif
 
    if (New->Object.Scan(New->Data,strlen(New->Data)) == false)
    {
@@ -390,9 +435,10 @@ PyObject *ParseSection(PyObject *self,Py
 
 static PyObject *TagFileNew(PyTypeObject *type,PyObject *Args,PyObject *kwds)
 {
-   PyObject *File;
-   char *kwlist[] = {"file", 0};
-   if (PyArg_ParseTupleAndKeywords(Args,kwds,"O",kwlist,&File) == 0)
+   PyObject *File = 0;
+   char Bytes = 0;
+   char *kwlist[] = {"file", "bytes", 0};
+   if (PyArg_ParseTupleAndKeywords(Args,kwds,"O|b",kwlist,&File,&Bytes) == 0)
       return 0;
    int fileno = PyObject_AsFileDescriptor(File);
    if (fileno == -1)
@@ -405,8 +451,15 @@ static PyObject *TagFileNew(PyTypeObject
 #else
    new (&New->Fd) FileFd(fileno,false);
 #endif
+   New->Bytes = Bytes;
    New->Owner = File;
    Py_INCREF(New->Owner);
+#if PY_MAJOR_VERSION >= 3
+   New->Encoding = PyObject_GetAttr(File, PyUnicode_FromString("encoding"));
+   if (New->Encoding && !PyUnicode_Check(New->Encoding))
+      New->Encoding = 0;
+   Py_XINCREF(New->Encoding);
+#endif
    new (&New->Object) pkgTagFile(&New->Fd);
 
    // Create the section
@@ -415,6 +468,11 @@ static PyObject *TagFileNew(PyTypeObject
    New->Section->Owner = New;
    Py_INCREF(New->Section->Owner);
    New->Section->Data = 0;
+   New->Section->Bytes = Bytes;
+#if PY_MAJOR_VERSION >= 3
+   New->Section->Encoding = New->Encoding;
+   Py_XINCREF(New->Section->Encoding);
+#endif
 
    return HandleErrors(New);
 }
@@ -492,7 +550,7 @@ PyObject *RewriteSection(PyObject *self,
    }
 
    // Return the string
-   PyObject *ResObj = PyString_FromStringAndSize(bp,size);
+   PyObject *ResObj = TagSecString_FromStringAndSize(Section,bp,size);
    free(bp);
    return HandleErrors(ResObj);
 }
@@ -521,11 +579,15 @@ PySequenceMethods TagSecSeqMeth = {0,0,0
 PyMappingMethods TagSecMapMeth = {TagSecLength,TagSecMap,0};
 
 
-static char *doc_TagSec = "TagSection(text: str)\n\n"
+static char *doc_TagSec = "TagSection(text: str, [bytes: bool = False])\n\n"
    "Provide methods to access RFC822-style header sections, like those\n"
    "found in debian/control or Packages files.\n\n"
    "TagSection() behave like read-only dictionaries and also provide access\n"
-   "to the functions provided by the C++ class (e.g. find)";
+   "to the functions provided by the C++ class (e.g. find).\n\n"
+   "By default, text read from files is treated as strings (binary data in\n"
+   "Python 2, Unicode strings in Python 3). Use bytes=True to cause all\n"
+   "header values read from this TagSection to be bytes even in Python 3.\n"
+   "Header names are always treated as Unicode.";
 PyTypeObject PyTagSection_Type =
 {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
@@ -596,7 +658,7 @@ static PyGetSetDef TagFileGetSet[] = {
 };
 
 
-static char *doc_TagFile = "TagFile(file)\n\n"
+static char *doc_TagFile = "TagFile(file, [bytes: bool = False])\n\n"
    "TagFile() objects provide access to debian control files, which consist\n"
    "of multiple RFC822-style sections.\n\n"
    "To provide access to those sections, TagFile objects provide an iterator\n"
@@ -608,7 +670,11 @@ static char *doc_TagFile = "TagFile(file
    "It is important to not mix the use of both APIs, because this can have\n"
    "unwanted effects.\n\n"
    "The parameter 'file' refers to an object providing a fileno() method or\n"
-   "a file descriptor (an integer)";
+   "a file descriptor (an integer).\n\n"
+   "By default, text read from files is treated as strings (binary data in\n"
+   "Python 2, Unicode strings in Python 3). Use bytes=True to cause all\n"
+   "header values read from this TagFile to be bytes even in Python 3.\n"
+   "Header names are always treated as Unicode.";
 
 // Type for a Tag File
 PyTypeObject PyTagFile_Type =

=== added file 'tests/test_tagfile.py'
--- tests/test_tagfile.py	1970-01-01 00:00:00 +0000
+++ tests/test_tagfile.py	2012-01-20 14:52:28 +0000
@@ -0,0 +1,106 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+#
+# Copyright (C) 2012 Canonical Ltd.
+# Author: Colin Watson <cjwatson@ubuntu.com>
+#
+# Copying and distribution of this file, with or without modification,
+# are permitted in any medium without royalty provided the copyright
+# notice and this notice are preserved.
+"""Unit tests for verifying the correctness of apt_pkg.TagFile."""
+from __future__ import print_function, unicode_literals
+import io
+import os
+import shutil
+import sys
+import tempfile
+import unittest
+
+import apt_pkg
+
+
+class TestTagFile(unittest.TestCase):
+    """Test apt_pkg.TagFile."""
+
+    def setUp(self):
+        apt_pkg.init()
+        self.temp_dir = tempfile.mkdtemp()
+
+    def tearDown(self):
+        shutil.rmtree(self.temp_dir)
+
+    def test_utf8(self):
+        value = "Tést Persön <test@example.org>"
+        packages = os.path.join(self.temp_dir, "Packages")
+        with io.open(packages, "w", encoding="UTF-8") as packages_file:
+            print("Maintainer: %s" % value, file=packages_file)
+            print("", file=packages_file)
+        if sys.version < '3':
+            # In Python 2, test the traditional file interface.
+            with open(packages) as packages_file:
+                tagfile = apt_pkg.TagFile(packages_file)
+                tagfile.step()
+                self.assertEqual(
+                    value.encode("UTF-8"), tagfile.section["Maintainer"])
+        with io.open(packages, encoding="UTF-8") as packages_file:
+            tagfile = apt_pkg.TagFile(packages_file)
+            tagfile.step()
+            if sys.version < '3':
+                self.assertEqual(
+                    value.encode("UTF-8"), tagfile.section["Maintainer"])
+            else:
+                self.assertEqual(value, tagfile.section["Maintainer"])
+
+    def test_latin1(self):
+        value = "Tést Persön <test@example.org>"
+        packages = os.path.join(self.temp_dir, "Packages")
+        with io.open(packages, "w", encoding="ISO-8859-1") as packages_file:
+            print("Maintainer: %s" % value, file=packages_file)
+            print("", file=packages_file)
+        if sys.version < '3':
+            # In Python 2, test the traditional file interface.
+            with open(packages) as packages_file:
+                tagfile = apt_pkg.TagFile(packages_file)
+                tagfile.step()
+                self.assertEqual(
+                    value.encode("ISO-8859-1"), tagfile.section["Maintainer"])
+        with io.open(packages) as packages_file:
+            tagfile = apt_pkg.TagFile(packages_file, bytes=True)
+            tagfile.step()
+            self.assertEqual(
+                value.encode("ISO-8859-1"), tagfile.section["Maintainer"])
+        if sys.version >= '3':
+            # In Python 3, TagFile can pick up the encoding of the file
+            # object.
+            with io.open(packages, encoding="ISO-8859-1") as packages_file:
+                tagfile = apt_pkg.TagFile(packages_file)
+                tagfile.step()
+                self.assertEqual(value, tagfile.section["Maintainer"])
+
+    def test_mixed(self):
+        value = "Tést Persön <test@example.org>"
+        packages = os.path.join(self.temp_dir, "Packages")
+        with io.open(packages, "w", encoding="UTF-8") as packages_file:
+            print("Maintainer: %s" % value, file=packages_file)
+            print("", file=packages_file)
+        with io.open(packages, "a", encoding="ISO-8859-1") as packages_file:
+            print("Maintainer: %s" % value, file=packages_file)
+            print("", file=packages_file)
+        if sys.version < '3':
+            # In Python 2, test the traditional file interface.
+            with open(packages) as packages_file:
+                tagfile = apt_pkg.TagFile(packages_file)
+                tagfile.step()
+                self.assertEqual(
+                    value.encode("UTF-8"), tagfile.section["Maintainer"])
+                tagfile.step()
+                self.assertEqual(
+                    value.encode("ISO-8859-1"), tagfile.section["Maintainer"])
+        with io.open(packages) as packages_file:
+            tagfile = apt_pkg.TagFile(packages_file, bytes=True)
+            tagfile.step()
+            self.assertEqual(
+                value.encode("UTF-8"), tagfile.section["Maintainer"])
+            tagfile.step()
+            self.assertEqual(
+                value.encode("ISO-8859-1"), tagfile.section["Maintainer"])
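
The decoding rule described in the docstrings and test comments above can be sketched in plain Python. This is only an illustration and not python-apt code: `decode_value` and its parameters are hypothetical names, with `as_bytes` standing in for the `bytes=True` argument and `encoding` standing in for the file object's `encoding` attribute. The UTF-8 default is an assumption based on the traceback in the original report.

```python
# Pure-Python sketch of the value-decoding rule; apt_pkg itself is not used.
def decode_value(raw, as_bytes=False, encoding=None):
    """Return a header value roughly the way the patch describes.

    raw      -- the undecoded header value (bytes)
    as_bytes -- hypothetical stand-in for the bytes=True argument
    encoding -- hypothetical stand-in for the file object's .encoding
    """
    if as_bytes:
        return raw                   # caller asked for undecoded data
    if encoding is not None:
        return raw.decode(encoding)  # honour the file object's encoding
    return raw.decode("utf-8")       # assumed default: UTF-8


value = "T\xe9st Pers\xf6n <test@example.org>"
latin1 = value.encode("iso-8859-1")

# bytes=True style: the raw bytes come back untouched.
print(decode_value(latin1, as_bytes=True) == latin1)
# Encoding taken from the file object: decoding succeeds.
print(decode_value(latin1, encoding="iso-8859-1") == value)
# No hint at all: the UTF-8 default rejects the Latin-1 bytes.
try:
    decode_value(latin1)
except UnicodeDecodeError:
    print("default UTF-8 decode fails")
```

For a mixed-encoding file such as the one in test_mixed, no single encoding fits every section, which is why those tests read it with bytes=True and leave decoding to the caller.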

Thanks,

-- 
Colin Watson                                       [cjwatson@ubuntu.com]




Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#656288; Package python3-apt. (Sun, 22 Jan 2012 14:42:08 GMT).


Acknowledgement sent to Colin Watson <cjwatson@ubuntu.com>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. (Sun, 22 Jan 2012 14:42:08 GMT).


Message #35 received at 656288@bugs.debian.org:

From: Colin Watson <cjwatson@ubuntu.com>
To: Tshepang Lekhonkhobe <tshepang@gmail.com>, 625509@bugs.debian.org
Cc: Julian Andres Klode <jak@debian.org>, 656288@bugs.debian.org
Subject: Re: Bug#625509: python-debian: please port to Py3k
Date: Sun, 22 Jan 2012 14:37:55 +0000
On Wed, Jan 18, 2012 at 10:54:28AM +0000, Colin Watson wrote:
> On Wed, May 04, 2011 at 03:10:29AM +0200, Tshepang Lekhonkhobe wrote:
> > Can you either make this package capable of running for Python 2 and 3,
> > or make separate packages for it, as python-apt does.
> 
> I'm working on this here:
> 
>   http://anonscm.debian.org/gitweb/?p=users/cjwatson/python-debian.git;a=shortlog;h=refs/heads/python3
> 
> I will probably end up depending on the six module, which I uploaded to
> unstable yesterday.  It's tiny, so I shouldn't expect this to cause much
> of a problem.
> 
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=656288 in python3-apt
> is getting in the way a bit, but I suppose worst case I can just skip
> those tests when running under Python 3 for now.

I believe this port is now complete, in the git branch above.  It passes
all tests provided that a version of python3-apt with the most recent
patch in #656288 is available.

I would very much appreciate review of this branch.  In case it eases
review, I've attached the 31-patch series (!) to this mail.  I've tried
to arrange it roughly in ascending order of complexity.

Cheers,

-- 
Colin Watson                                       [cjwatson@ubuntu.com]
[0001-Fix-test-warnings-with-python2.7-3.patch (text/x-diff, attachment)]
[0002-Avoid-various-old-syntactic-forms-which-are-no-longe.patch (text/x-diff, attachment)]
[0003-Use-Python-3-style-print-function.patch (text/x-diff, attachment)]
[0004-Use-a-list-comprehension-instead-of-map-which-return.patch (text/x-diff, attachment)]
[0005-Use-iterkeys-iteritems-when-an-iterator-is-all-we-ne.patch (text/x-diff, attachment)]
[0006-Use-absolute-imports.patch (text/x-diff, attachment)]
[0007-Use-Python-3-style-print-function-in-examples.patch (text/x-diff, attachment)]
[0008-Use-key-in-dict-rather-than-obsolete-dict.has_key-ke.patch (text/x-diff, attachment)]
[0009-Use-open-rather-than-file-file-does-not-exist-in-Pyt.patch (text/x-diff, attachment)]
[0010-Use-sep.join-list-rather-than-string.join-list-sep.patch (text/x-diff, attachment)]
[0011-Implement-rich-comparison-methods-the-only-kind-avai.patch (text/x-diff, attachment)]
[0012-Use-assertTrue-and-assertEquals-rather-than-deprecat.patch (text/x-diff, attachment)]
[0013-Try-to-import-pickle-if-importing-cPickle-fails.-Pyt.patch (text/x-diff, attachment)]
[0014-Use-io.StringIO-if-StringIO.StringIO-is-absent-as-in.patch (text/x-diff, attachment)]
[0015-Use-collections.Mapping-collections.MutableMapping-i.patch (text/x-diff, attachment)]
[0016-Use-list-comprehensions-instead-of-map-where-a-list-.patch (text/x-diff, attachment)]
[0017-If-StandardError-does-not-exist-as-in-Python-3-inher.patch (text/x-diff, attachment)]
[0018-Use-six-to-paper-over-dict-iteration-differences-bet.patch (text/x-diff, attachment)]
[0019-Use-six-to-paper-over-int-long-differences-between-P.patch (text/x-diff, attachment)]
[0020-Cope-with-the-absence-of-a-file-class-in-Python-3.patch (text/x-diff, attachment)]
[0021-Python-3-renamed-raw_input-to-input.patch (text/x-diff, attachment)]
[0022-Be-much-more-careful-about-closing-files-in-a-timely.patch (text/x-diff, attachment)]
[0023-Use-six-to-paper-over-iterator.next-vs.-next-iterato.patch (text/x-diff, attachment)]
[0024-Use-string.ascii_letters-rather-than-the-deprecated-.patch (text/x-diff, attachment)]
[0025-In-Python-3-encode-Unicode-strings-before-passing-th.patch (text/x-diff, attachment)]
[0026-Fix-up-debian.changelog-for-string-handling-changes-.patch (text/x-diff, attachment)]
[0027-Only-define-DebPart.has_key-method-for-Python-2.patch (text/x-diff, attachment)]
[0028-Fix-up-debian.arfile-and-debian.debfile-for-string-h.patch (text/x-diff, attachment)]
[0029-Fix-up-most-of-debian.deb822-for-string-handling-cha.patch (text/x-diff, attachment)]
[0030-Fix-up-the-rest-of-debian.deb822-for-Python-3-string.patch (text/x-diff, attachment)]
[0031-Add-a-python3-debian-package.patch (text/x-diff, attachment)]

Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#656288; Package python3-apt. (Mon, 23 Jan 2012 00:30:08 GMT).


Acknowledgement sent to John Wright <jsw@debian.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. (Mon, 23 Jan 2012 00:30:08 GMT).


Message #40 received at 656288@bugs.debian.org:

From: John Wright <jsw@debian.org>
To: Colin Watson <cjwatson@ubuntu.com>, 625509@bugs.debian.org
Cc: Tshepang Lekhonkhobe <tshepang@gmail.com>, Julian Andres Klode <jak@debian.org>, 656288@bugs.debian.org
Subject: Re: Bug#625509: python-debian: please port to Py3k
Date: Sun, 22 Jan 2012 16:21:41 -0800
On Sun, Jan 22, 2012 at 02:37:55PM +0000, Colin Watson wrote:
> On Wed, Jan 18, 2012 at 10:54:28AM +0000, Colin Watson wrote:
> > On Wed, May 04, 2011 at 03:10:29AM +0200, Tshepang Lekhonkhobe wrote:
> > > Can you either make this package capable of running for Python 2 and 3,
> > > or make separate packages for it, as python-apt does.
> > 
> > I'm working on this here:
> > 
> >   http://anonscm.debian.org/gitweb/?p=users/cjwatson/python-debian.git;a=shortlog;h=refs/heads/python3
> > 
> > I will probably end up depending on the six module, which I uploaded to
> > unstable yesterday.  It's tiny, so I shouldn't expect this to cause much
> > of a problem.
> > 
> > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=656288 in python3-apt
> > is getting in the way a bit, but I suppose worst case I can just skip
> > those tests when running under Python 3 for now.
> 
> I believe this port is now complete, in the git branch above.  It passes
> all tests provided that a version of python3-apt with the most recent
> patch in #656288 is available.
> 
> I would very much appreciate review of this branch.  In case it eases
> review, I've attached the 31-patch series (!) to this mail.  I've tried
> to arrange it roughly in ascending order of complexity.

Wow.  I'll be glad to review them, but I'm not sure when I'll have the
opportunity.  I'll try to make time later this week.

Thanks for the effort!

-- 
John Wright <jsw@debian.org>




Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#656288; Package python3-apt. (Wed, 14 Mar 2012 16:09:06 GMT).


Acknowledgement sent to Stefano Zacchiroli <zack@debian.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. (Wed, 14 Mar 2012 16:09:06 GMT).


Message #45 received at 656288@bugs.debian.org:

From: Stefano Zacchiroli <zack@debian.org>
To: John Wright <jsw@debian.org>, 625509@bugs.debian.org
Cc: Colin Watson <cjwatson@ubuntu.com>, Tshepang Lekhonkhobe <tshepang@gmail.com>, 656288@bugs.debian.org, Julian Andres Klode <jak@debian.org>
Subject: Re: Bug#625509: python-debian: please port to Py3k
Date: Wed, 14 Mar 2012 17:06:51 +0100
On Sun, Jan 22, 2012 at 04:21:41PM -0800, John Wright wrote:
> On Sun, Jan 22, 2012 at 02:37:55PM +0000, Colin Watson wrote:
> > I would very much appreciate review of this branch.  In case it eases
> > review, I've attached the 31-patch series (!) to this mail.  I've tried
> > to arrange it roughly in ascending order of complexity.
> 
> Wow.  I'll be glad to review them, but I'm not sure when I'll have the
> opportunity.  I'll try to make time later this week.

Heya John,
  do you think you'll have time to do the review in the near future?
Just a friendly ping because, unfortunately, I haven't yet looked into
Python 3 in enough detail to be able to do a review myself.

Still, I'd love to see python-debian ported to Python 3 in the archive
... and I'll be happy to test early versions!

TIA,
Cheers.
-- 
Stefano Zacchiroli     zack@{upsilon.cc,pps.jussieu.fr,debian.org} . o .
Maître de conférences   ......   http://upsilon.cc/zack   ......   . . o
Debian Project Leader    .......   @zack on identi.ca   .......    o o o
« the first rule of tautology club is the first rule of tautology club »

Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#656288; Package python3-apt. (Mon, 19 Mar 2012 04:09:05 GMT).


Acknowledgement sent to John Wright <jsw@debian.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. (Mon, 19 Mar 2012 04:09:05 GMT).


Message #50 received at 656288@bugs.debian.org:

From: John Wright <jsw@debian.org>
To: Stefano Zacchiroli <zack@debian.org>, 625509@bugs.debian.org
Cc: Tshepang Lekhonkhobe <tshepang@gmail.com>, 656288@bugs.debian.org, Colin Watson <cjwatson@ubuntu.com>, Julian Andres Klode <jak@debian.org>
Subject: Re: Bug#625509: python-debian: please port to Py3k
Date: Sun, 18 Mar 2012 21:02:29 -0700
On Wed, Mar 14, 2012 at 05:06:51PM +0100, Stefano Zacchiroli wrote:
> On Sun, Jan 22, 2012 at 04:21:41PM -0800, John Wright wrote:
> > On Sun, Jan 22, 2012 at 02:37:55PM +0000, Colin Watson wrote:
> > > I would very much appreciate review of this branch.  In case it eases
> > > review, I've attached the 31-patch series (!) to this mail.  I've tried
> > > to arrange it roughly in ascending order of complexity.
> > 
> > Wow.  I'll be glad to review them, but I'm not sure when I'll have the
> > opportunity.  I'll try to make time later this week.
> 
> Heya John,
>   do you think you'll have time to do the review in the near future?
> Just a friendly ping because, unfortunately, I haven't yet looked into
> Python 3 in enough detail to be able to do a review myself.

I also don't know when I'll have time...  I thought I would a couple of
months ago, but things aren't getting any less busy.  :-(  I also need
to take some time to familiarize myself with Python 3.

> Still, I'd love to see python-debian ported to Python 3 in the archive
> ... and I'll be happy to test early versions!

I'll see how many patches I can review next weekend.  Maybe it'll be
worth making an upload to experimental for testing beyond our unit
tests.

-- 
John Wright <jsw@debian.org>




Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#656288; Package python3-apt. (Mon, 19 Mar 2012 14:36:09 GMT).


Acknowledgement sent to Stefano Zacchiroli <zack@debian.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. (Mon, 19 Mar 2012 14:36:09 GMT).


Message #55 received at 656288@bugs.debian.org:

From: Stefano Zacchiroli <zack@debian.org>
To: John Wright <jsw@debian.org>, 625509@bugs.debian.org
Cc: Tshepang Lekhonkhobe <tshepang@gmail.com>, 656288@bugs.debian.org, Colin Watson <cjwatson@ubuntu.com>, Julian Andres Klode <jak@debian.org>
Subject: Re: Bug#625509: python-debian: please port to Py3k
Date: Mon, 19 Mar 2012 15:33:11 +0100
On Sun, Mar 18, 2012 at 09:02:29PM -0700, John Wright wrote:
> > Still, I'd love to see python-debian ported to Python 3 in the archive
> > ... and I'll be happy to test early versions!
> 
> I'll see how many patches I can review next weekend.  Maybe it'll be
> worth making an upload to experimental for testing beyond our unit
> tests.

That would be a good idea indeed.  Having the package there would make
it easier to call for testing, and would also expedite the final upload
since the package will have to go through a round of (binary) NEW.

TIA,
Cheers.
-- 
Stefano Zacchiroli     zack@{upsilon.cc,pps.jussieu.fr,debian.org} . o .
Maître de conférences   ......   http://upsilon.cc/zack   ......   . . o
Debian Project Leader    .......   @zack on identi.ca   .......    o o o
« the first rule of tautology club is the first rule of tautology club »

Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#656288; Package python3-apt. (Wed, 13 Jun 2012 17:39:03 GMT).


Acknowledgement sent to Colin Watson <cjwatson@ubuntu.com>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. (Wed, 13 Jun 2012 17:39:03 GMT).


Message #60 received at 656288@bugs.debian.org:

From: Colin Watson <cjwatson@ubuntu.com>
To: Julian Andres Klode <jak@debian.org>, 656288@bugs.debian.org
Subject: Re: Bug#656288: python3-apt: difficulties with non-UTF-8-encoded TagFiles
Date: Wed, 13 Jun 2012 18:35:19 +0100
On Sat, Jan 21, 2012 at 03:10:54AM +0000, Colin Watson wrote:
> (That said, I haven't entirely got the python-debian test suite to pass
> yet.  It's parsing mixed-encoding files in terrifying ways which are
> going to require some rearrangement to work well with Python 3.  So,
> strictly, I can't claim to be 100% confident in this change yet; but I'd
> appreciate knowing whether the general approach is OK with you.)

This is no longer a concern (see #625509), and I would really appreciate
feedback from python-apt maintainers on this; there's a substantial
stack of other Python 3 porting work that depends on getting a
python3-debian into the archive, and that depends on this patch ...

Here's an updated version which applies to HEAD.  Please note that this
also fixes what I consider a bug whereby apt_pkg.TagFile() accepts byte
strings but not Unicode strings in Python 2.
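
As a rough illustration of the argument handling in the updated TagFileNew in the patch that follows, the sketch below tries to read the argument as a filename first (byte string or Unicode string alike) and only then falls back to a file descriptor. It is plain Python rather than the C++ from the patch, and `resolve_source` is a hypothetical helper name that does not exist in python-apt.

```python
import tempfile


def resolve_source(file):
    """Classify a TagFile-style argument as a filename or a descriptor.

    Hypothetical helper: mirrors the order of checks in the patch, not
    its actual implementation.
    """
    # Both string types count as filenames, so Unicode is no longer rejected.
    if isinstance(file, (bytes, str)):
        return ("filename", file)
    # A bare integer is taken to be a file descriptor.
    if isinstance(file, int):
        return ("fd", file)
    # Otherwise expect a file-like object that provides fileno().
    fileno = getattr(file, "fileno", None)
    if callable(fileno):
        return ("fd", fileno())
    raise TypeError("expected a filename, a descriptor or a fileno() method")


print(resolve_source("Packages"))
print(resolve_source(b"Packages"))
with tempfile.TemporaryFile() as handle:
    kind, fd = resolve_source(handle)
    print(kind, fd == handle.fileno())
```

Anything that is neither a string nor descriptor-like still raises TypeError, which matches the existing `assertRaises(TypeError, apt_pkg.TagFile, object())` check in tests/test_tagfile.py.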

=== modified file 'python/tag.cc'
--- python/tag.cc	2012-02-06 13:55:25 +0000
+++ python/tag.cc	2012-06-13 15:29:08 +0000
@@ -38,6 +38,10 @@ using namespace std;
 struct TagSecData : public CppPyObject<pkgTagSection>
 {
    char *Data;
+   bool Bytes;
+#if PY_MAJOR_VERSION >= 3
+   PyObject *Encoding;
+#endif
 };
 
 // The owner of the TagFile is a Python file object.
@@ -45,6 +49,10 @@ struct TagFileData : public CppPyObject<
 {
    TagSecData *Section;
    FileFd Fd;
+   bool Bytes;
+#if PY_MAJOR_VERSION >= 3
+   PyObject *Encoding;
+#endif
 };
 
 // Traversal and Clean for owned objects
@@ -60,6 +68,35 @@ int TagFileClear(PyObject *self) {
     return 0;
 }
 
+// Helpers to return Unicode or bytes as appropriate.
+#if PY_MAJOR_VERSION < 3
+#define TagSecString_FromStringAndSize(self, v, len) \
+    PyString_FromStringAndSize((v), (len))
+#define TagSecString_FromString(self, v) PyString_FromString(v)
+#else
+PyObject *TagSecString_FromStringAndSize(PyObject *self, const char *v,
+	 				 Py_ssize_t len) {
+   TagSecData *Self = (TagSecData *)self;
+   if (Self->Bytes)
+      return PyBytes_FromStringAndSize(v, len);
+   else if (Self->Encoding)
+      return PyUnicode_Decode(v, len, PyUnicode_AsString(Self->Encoding), 0);
+   else
+      return PyUnicode_FromStringAndSize(v, len);
+}
+
+PyObject *TagSecString_FromString(PyObject *self, const char *v) {
+   TagSecData *Self = (TagSecData *)self;
+   if (Self->Bytes)
+      return PyBytes_FromString(v);
+   else if (Self->Encoding)
+      return PyUnicode_Decode(v, strlen(v),
+			      PyUnicode_AsString(Self->Encoding), 0);
+   else
+      return PyUnicode_FromString(v);
+}
+#endif
+
 
 									/*}}}*/
 // TagSecFree - Free a Tag Section					/*{{{*/
@@ -107,9 +144,9 @@ static PyObject *TagSecFind(PyObject *Se
    {
       if (Default == 0)
 	 Py_RETURN_NONE;
-      return PyString_FromString(Default);
+      return TagSecString_FromString(Self,Default);
    }
-   return PyString_FromStringAndSize(Start,Stop-Start);
+   return TagSecString_FromStringAndSize(Self,Start,Stop-Start);
 }
 
 static char *doc_FindRaw =
@@ -128,14 +165,14 @@ static PyObject *TagSecFindRaw(PyObject
    {
       if (Default == 0)
 	 Py_RETURN_NONE;
-      return PyString_FromString(Default);
+      return TagSecString_FromString(Self,Default);
    }
 
    const char *Start;
    const char *Stop;
    GetCpp<pkgTagSection>(Self).Get(Start,Stop,Pos);
 
-   return PyString_FromStringAndSize(Start,Stop-Start);
+   return TagSecString_FromStringAndSize(Self,Start,Stop-Start);
 }
 
 static char *doc_FindFlag =
@@ -161,21 +198,18 @@ static PyObject *TagSecFindFlag(PyObject
 // Map access, operator []
 static PyObject *TagSecMap(PyObject *Self,PyObject *Arg)
 {
-   if (PyString_Check(Arg) == 0)
-   {
-      PyErr_SetNone(PyExc_TypeError);
+   const char *Name = PyObject_AsString(Arg);
+   if (Name == 0)
       return 0;
-   }
-
    const char *Start;
    const char *Stop;
-   if (GetCpp<pkgTagSection>(Self).Find(PyString_AsString(Arg),Start,Stop) == false)
+   if (GetCpp<pkgTagSection>(Self).Find(Name,Start,Stop) == false)
    {
-      PyErr_SetString(PyExc_KeyError,PyString_AsString(Arg));
+      PyErr_SetString(PyExc_KeyError,Name);
       return 0;
    }
 
-   return PyString_FromStringAndSize(Start,Stop-Start);
+   return TagSecString_FromStringAndSize(Self,Start,Stop-Start);
 }
 
 // len() operation
@@ -230,9 +264,9 @@ static PyObject *TagSecExists(PyObject *
 
 static int TagSecContains(PyObject *Self,PyObject *Arg)
 {
-   if (PyString_Check(Arg) == 0)
-       return 0;
-   const char *Name = PyString_AsString(Arg);
+   const char *Name = PyObject_AsString(Arg);
+   if (Name == 0)
+      return 0;
    const char *Start;
    const char *Stop;
    if (GetCpp<pkgTagSection>(Self).Find(Name,Start,Stop) == false)
@@ -256,7 +290,7 @@ static PyObject *TagSecStr(PyObject *Sel
    const char *Start;
    const char *Stop;
    GetCpp<pkgTagSection>(Self).GetSection(Start,Stop);
-   return PyString_FromStringAndSize(Start,Stop-Start);
+   return TagSecString_FromStringAndSize(Self,Start,Stop-Start);
 }
 									/*}}}*/
 // TagFile Wrappers							/*{{{*/
@@ -286,6 +320,12 @@ static PyObject *TagFileNext(PyObject *S
    Obj.Section->Owner = Self;
    Py_INCREF(Obj.Section->Owner);
    Obj.Section->Data = 0;
+   Obj.Section->Bytes = Obj.Bytes;
+#if PY_MAJOR_VERSION >= 3
+   // We don't need to incref Encoding as the previous Section object already
+   // held a reference to it.
+   Obj.Section->Encoding = Obj.Encoding;
+#endif
    if (Obj.Object.Step(Obj.Section->Object) == false)
       return HandleErrors(NULL);
 
@@ -347,11 +387,12 @@ static PyObject *TagFileJump(PyObject *S
 static PyObject *TagSecNew(PyTypeObject *type,PyObject *Args,PyObject *kwds) {
    char *Data;
    int Len;
-   char *kwlist[] = {"text", 0};
+   char Bytes = 0;
+   char *kwlist[] = {"text", "bytes", 0};
 
    // this allows reading "byte" types from python3 - but we don't
    // make (much) use of it yet
-   if (PyArg_ParseTupleAndKeywords(Args,kwds,"s#",kwlist,&Data,&Len) == 0)
+   if (PyArg_ParseTupleAndKeywords(Args,kwds,"s#|b",kwlist,&Data,&Len,&Bytes) == 0)
       return 0;
 
    // Create the object..
@@ -359,6 +400,10 @@ static PyObject *TagSecNew(PyTypeObject
    new (&New->Object) pkgTagSection();
    New->Data = new char[strlen(Data)+2];
    snprintf(New->Data,strlen(Data)+2,"%s\n",Data);
+   New->Bytes = Bytes;
+#if PY_MAJOR_VERSION >= 3
+   New->Encoding = 0;
+#endif
 
    if (New->Object.Scan(New->Data,strlen(New->Data)) == false)
    {
@@ -391,19 +436,21 @@ PyObject *ParseSection(PyObject *self,Py
 static PyObject *TagFileNew(PyTypeObject *type,PyObject *Args,PyObject *kwds)
 {
    TagFileData *New;
-   PyObject *File;
+   PyObject *File = 0;
+   char Bytes = 0;
 
-   char *kwlist[] = {"file", 0};
-   if (PyArg_ParseTupleAndKeywords(Args,kwds,"O",kwlist,&File) == 0)
+   char *kwlist[] = {"file", "bytes", 0};
+   if (PyArg_ParseTupleAndKeywords(Args,kwds,"O|b",kwlist,&File,&Bytes) == 0)
       return 0;
 
    // check if we got a filename or a file object
    int fileno = -1;
    const char *filename = NULL;
-   if (PyString_Check(File))
-      filename = PyObject_AsString(File);
-   else
+   filename = PyObject_AsString(File);
+   if (filename == NULL) {
+      PyErr_Clear();
       fileno = PyObject_AsFileDescriptor(File);
+   }
 
    // handle invalid arguments
    if (fileno == -1 && filename == NULL)
@@ -432,8 +479,18 @@ static PyObject *TagFileNew(PyTypeObject
       new (&New->Fd) FileFd(filename, FileFd::ReadOnly, false);
 #endif
    } 
+   New->Bytes = Bytes;
    New->Owner = File;
    Py_INCREF(New->Owner);
+#if PY_MAJOR_VERSION >= 3
+   if (fileno > 0) {
+      New->Encoding = PyObject_GetAttr(File, PyUnicode_FromString("encoding"));
+      if (New->Encoding && !PyUnicode_Check(New->Encoding))
+         New->Encoding = 0;
+   } else
+      New->Encoding = 0;
+   Py_XINCREF(New->Encoding);
+#endif
    new (&New->Object) pkgTagFile(&New->Fd);
 
    // Create the section
@@ -442,6 +499,11 @@ static PyObject *TagFileNew(PyTypeObject
    New->Section->Owner = New;
    Py_INCREF(New->Section->Owner);
    New->Section->Data = 0;
+   New->Section->Bytes = Bytes;
+#if PY_MAJOR_VERSION >= 3
+   New->Section->Encoding = New->Encoding;
+   Py_XINCREF(New->Section->Encoding);
+#endif
 
    return HandleErrors(New);
 }
@@ -519,7 +581,7 @@ PyObject *RewriteSection(PyObject *self,
    }
 
    // Return the string
-   PyObject *ResObj = PyString_FromStringAndSize(bp,size);
+   PyObject *ResObj = TagSecString_FromStringAndSize(Section,bp,size);
    free(bp);
    return HandleErrors(ResObj);
 }
@@ -548,11 +610,15 @@ PySequenceMethods TagSecSeqMeth = {0,0,0
 PyMappingMethods TagSecMapMeth = {TagSecLength,TagSecMap,0};
 
 
-static char *doc_TagSec = "TagSection(text: str)\n\n"
+static char *doc_TagSec = "TagSection(text: str, [bytes: bool = False])\n\n"
    "Provide methods to access RFC822-style header sections, like those\n"
    "found in debian/control or Packages files.\n\n"
    "TagSection() behave like read-only dictionaries and also provide access\n"
-   "to the functions provided by the C++ class (e.g. find)";
+   "to the functions provided by the C++ class (e.g. find).\n\n"
+   "By default, text read from files is treated as strings (binary data in\n"
+   "Python 2, Unicode strings in Python 3). Use bytes=True to cause all\n"
+   "header values read from this TagSection to be bytes even in Python 3.\n"
+   "Header names are always treated as Unicode.";
 PyTypeObject PyTagSection_Type =
 {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
@@ -623,7 +689,7 @@ static PyGetSetDef TagFileGetSet[] = {
 };
 
 
-static char *doc_TagFile = "TagFile(file)\n\n"
+static char *doc_TagFile = "TagFile(file, [bytes: bool = False])\n\n"
    "TagFile() objects provide access to debian control files, which consist\n"
    "of multiple RFC822-style sections.\n\n"
    "To provide access to those sections, TagFile objects provide an iterator\n"
@@ -635,7 +701,11 @@ static char *doc_TagFile = "TagFile(file
    "It is important to not mix the use of both APIs, because this can have\n"
    "unwanted effects.\n\n"
    "The parameter 'file' refers to an object providing a fileno() method or\n"
-   "a file descriptor (an integer)";
+   "a file descriptor (an integer).\n\n"
+   "By default, text read from files is treated as strings (binary data in\n"
+   "Python 2, Unicode strings in Python 3). Use bytes=True to cause all\n"
+   "header values read from this TagFile to be bytes even in Python 3.\n"
+   "Header names are always treated as Unicode.";
 
 // Type for a Tag File
 PyTypeObject PyTagFile_Type =

=== modified file 'tests/test_tagfile.py'
--- tests/test_tagfile.py	2012-02-03 12:46:08 +0000
+++ tests/test_tagfile.py	2012-06-13 15:21:06 +0000
@@ -1,18 +1,26 @@
 #!/usr/bin/python
+# -*- coding: utf-8 -*-
 #
 # Copyright (C) 2010 Michael Vogt <mvo@ubuntu.com>
+# Copyright (C) 2012 Canonical Ltd.
+# Author: Colin Watson <cjwatson@ubuntu.com>
 #
 # Copying and distribution of this file, with or without modification,
 # are permitted in any medium without royalty provided the copyright
 # notice and this notice are preserved.
 """Unit tests for verifying the correctness of apt_pkg.TagFile"""
 
+from __future__ import print_function, unicode_literals
+
+import io
 import glob
 import os
+import shutil
+import sys
+import tempfile
 import unittest
 
 from test_all import get_library_dir
-import sys
 sys.path.insert(0, get_library_dir())
 
 import apt_pkg
@@ -20,6 +28,13 @@ import apt_pkg
 class TestTagFile(unittest.TestCase):
     """ test the apt_pkg.TagFile """
 
+    def setUp(self):
+        apt_pkg.init()
+        self.temp_dir = tempfile.mkdtemp()
+
+    def tearDown(self):
+        shutil.rmtree(self.temp_dir)
+
     def test_tag_file(self):
         basepath = os.path.dirname(__file__)
         tagfilepath = os.path.join(basepath, "./data/tagfile/*")
@@ -38,5 +53,81 @@ class TestTagFile(unittest.TestCase):
         # Raises Type error
         self.assertRaises(TypeError, apt_pkg.TagFile, object())
 
+    def test_utf8(self):
+        value = "Tést Persön <test@example.org>"
+        packages = os.path.join(self.temp_dir, "Packages")
+        with io.open(packages, "w", encoding="UTF-8") as packages_file:
+            print("Maintainer: %s" % value, file=packages_file)
+            print("", file=packages_file)
+        if sys.version < '3':
+            # In Python 2, test the traditional file interface.
+            with open(packages) as packages_file:
+                tagfile = apt_pkg.TagFile(packages_file)
+                tagfile.step()
+                self.assertEqual(
+                    value.encode("UTF-8"), tagfile.section["Maintainer"])
+        with io.open(packages, encoding="UTF-8") as packages_file:
+            tagfile = apt_pkg.TagFile(packages_file)
+            tagfile.step()
+            if sys.version < '3':
+                self.assertEqual(
+                    value.encode("UTF-8"), tagfile.section["Maintainer"])
+            else:
+                self.assertEqual(value, tagfile.section["Maintainer"])
+
+    def test_latin1(self):
+        value = "Tést Persön <test@example.org>"
+        packages = os.path.join(self.temp_dir, "Packages")
+        with io.open(packages, "w", encoding="ISO-8859-1") as packages_file:
+            print("Maintainer: %s" % value, file=packages_file)
+            print("", file=packages_file)
+        if sys.version < '3':
+            # In Python 2, test the traditional file interface.
+            with open(packages) as packages_file:
+                tagfile = apt_pkg.TagFile(packages_file)
+                tagfile.step()
+                self.assertEqual(
+                    value.encode("ISO-8859-1"), tagfile.section["Maintainer"])
+        with io.open(packages) as packages_file:
+            tagfile = apt_pkg.TagFile(packages_file, bytes=True)
+            tagfile.step()
+            self.assertEqual(
+                value.encode("ISO-8859-1"), tagfile.section["Maintainer"])
+        if sys.version >= '3':
+            # In Python 3, TagFile can pick up the encoding of the file
+            # object.
+            with io.open(packages, encoding="ISO-8859-1") as packages_file:
+                tagfile = apt_pkg.TagFile(packages_file)
+                tagfile.step()
+                self.assertEqual(value, tagfile.section["Maintainer"])
+
+    def test_mixed(self):
+        value = "Tést Persön <test@example.org>"
+        packages = os.path.join(self.temp_dir, "Packages")
+        with io.open(packages, "w", encoding="UTF-8") as packages_file:
+            print("Maintainer: %s" % value, file=packages_file)
+            print("", file=packages_file)
+        with io.open(packages, "a", encoding="ISO-8859-1") as packages_file:
+            print("Maintainer: %s" % value, file=packages_file)
+            print("", file=packages_file)
+        if sys.version < '3':
+            # In Python 2, test the traditional file interface.
+            with open(packages) as packages_file:
+                tagfile = apt_pkg.TagFile(packages_file)
+                tagfile.step()
+                self.assertEqual(
+                    value.encode("UTF-8"), tagfile.section["Maintainer"])
+                tagfile.step()
+                self.assertEqual(
+                    value.encode("ISO-8859-1"), tagfile.section["Maintainer"])
+        with io.open(packages) as packages_file:
+            tagfile = apt_pkg.TagFile(packages_file, bytes=True)
+            tagfile.step()
+            self.assertEqual(
+                value.encode("UTF-8"), tagfile.section["Maintainer"])
+            tagfile.step()
+            self.assertEqual(
+                value.encode("ISO-8859-1"), tagfile.section["Maintainer"])
+
 if __name__ == "__main__":
     unittest.main()

Thanks,

-- 
Colin Watson                                       [cjwatson@ubuntu.com]




Information forwarded to debian-bugs-dist@lists.debian.org, APT Development Team <deity@lists.debian.org>:
Bug#656288; Package python3-apt. (Fri, 15 Jun 2012 21:24:06 GMT).


Acknowledgement sent to Julian Andres Klode <jak@debian.org>:
Extra info received and forwarded to list. Copy sent to APT Development Team <deity@lists.debian.org>. (Fri, 15 Jun 2012 21:24:06 GMT).


Message #65 received at 656288@bugs.debian.org:

From: Julian Andres Klode <jak@debian.org>
To: Colin Watson <cjwatson@ubuntu.com>
Cc: 656288@bugs.debian.org
Subject: Re: Bug#656288: python3-apt: difficulties with non-UTF-8-encoded TagFiles
Date: Fri, 15 Jun 2012 23:20:25 +0200
On Wed, Jun 13, 2012 at 06:35:19PM +0100, Colin Watson wrote:
> On Sat, Jan 21, 2012 at 03:10:54AM +0000, Colin Watson wrote:
> > (That said, I haven't entirely got the python-debian test suite to pass
> > yet.  It's parsing mixed-encoding files in terrifying ways which are
> > going to require some rearrangement to work well with Python 3.  So,
> > strictly, I can't claim to be 100% confident in this change yet; but I'd
> > appreciate knowing whether the general approach is OK with you.)
> 
> This is no longer a concern (see #625509), and I would really appreciate
> feedback from python-apt maintainers on this; there's a substantial
> stack of other Python 3 porting work that depends on getting a
> python3-debian into the archive, and that depends on this patch ...
> 
> Here's an updated version which applies to HEAD.  Please note that this
> also fixes what I consider a bug whereby apt_pkg.TagFile() accepts byte
> strings but not Unicode strings in Python 2.

Committed, with documentation in doc/source/library/apt_pkg.rst added.
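
For readers following along, here is a minimal sketch of what a caller might
do with the byte strings that TagFile(file, bytes=True) returns for header
values. It assumes values are either UTF-8 or ISO-8859-1, as in the patch's
test_mixed case. The helper decode_header_value is hypothetical and not part
of python-apt; the example uses plain byte strings in place of an actual
TagFile so that it runs without apt_pkg installed.

```python
def decode_header_value(raw, fallback="iso-8859-1"):
    """Decode one header value: try UTF-8 first, then a fallback encoding.

    Hypothetical helper for illustration; not part of python-apt.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail.
        return raw.decode(fallback)


value = "Tést Persön <test@example.org>"
# Stand-ins for tagfile.section["Maintainer"] read with bytes=True from a
# file whose first section is UTF-8 and whose second is ISO-8859-1.
raw_values = [value.encode("utf-8"), value.encode("iso-8859-1")]
decoded = [decode_header_value(raw) for raw in raw_values]
```

Trying UTF-8 first works here because the ISO-8859-1 bytes for this value are
not valid UTF-8; text that happens to be valid in both encodings would be
read as UTF-8, so this is a heuristic rather than a guarantee.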

-- 
Julian Andres Klode  - Debian Developer, Ubuntu Member

See http://wiki.debian.org/JulianAndresKlode and http://jak-linux.org/.




Added tag(s) pending. Request was from Julian Andres Klode <jak@debian.org> to control@bugs.debian.org. (Sun, 17 Jun 2012 15:57:05 GMT).


Reply sent to Julian Andres Klode <jak@debian.org>:
You have taken responsibility. (Fri, 22 Jun 2012 09:23:42 GMT).


Notification sent to Colin Watson <cjwatson@debian.org>:
Bug acknowledged by developer. (Fri, 22 Jun 2012 09:24:13 GMT).


Message #72 received at 656288-close@bugs.debian.org:

From: Julian Andres Klode <jak@debian.org>
To: 656288-close@bugs.debian.org
Subject: Bug#656288: fixed in python-apt 0.8.5
Date: Fri, 22 Jun 2012 09:19:51 +0000
Source: python-apt
Source-Version: 0.8.5

We believe that the bug you reported is fixed in the latest version of
python-apt, which is due to be installed in the Debian FTP archive:

python-apt-common_0.8.5_all.deb
  to main/p/python-apt/python-apt-common_0.8.5_all.deb
python-apt-dbg_0.8.5_amd64.deb
  to main/p/python-apt/python-apt-dbg_0.8.5_amd64.deb
python-apt-dev_0.8.5_all.deb
  to main/p/python-apt/python-apt-dev_0.8.5_all.deb
python-apt-doc_0.8.5_all.deb
  to main/p/python-apt/python-apt-doc_0.8.5_all.deb
python-apt_0.8.5.dsc
  to main/p/python-apt/python-apt_0.8.5.dsc
python-apt_0.8.5.tar.gz
  to main/p/python-apt/python-apt_0.8.5.tar.gz
python-apt_0.8.5_amd64.deb
  to main/p/python-apt/python-apt_0.8.5_amd64.deb
python3-apt-dbg_0.8.5_amd64.deb
  to main/p/python-apt/python3-apt-dbg_0.8.5_amd64.deb
python3-apt_0.8.5_amd64.deb
  to main/p/python-apt/python3-apt_0.8.5_amd64.deb



A summary of the changes between this version and the previous one is
attached.

Thank you for reporting the bug, which will now be closed.  If you
have further comments please address them to 656288@bugs.debian.org,
and the maintainer will reopen the bug report if appropriate.

Debian distribution maintenance software
pp.
Julian Andres Klode <jak@debian.org> (supplier of updated python-apt package)

(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing ftpmaster@debian.org)


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Format: 1.8
Date: Fri, 22 Jun 2012 10:37:23 +0200
Source: python-apt
Binary: python-apt python-apt-doc python-apt-dbg python-apt-dev python-apt-common python3-apt python3-apt-dbg
Architecture: source amd64 all
Version: 0.8.5
Distribution: unstable
Urgency: low
Maintainer: APT Development Team <deity@lists.debian.org>
Changed-By: Julian Andres Klode <jak@debian.org>
Description: 
 python-apt - Python interface to libapt-pkg
 python-apt-common - Python interface to libapt-pkg (locales)
 python-apt-dbg - Python interface to libapt-pkg (debug extension)
 python-apt-dev - Python interface to libapt-pkg (development files)
 python-apt-doc - Python interface to libapt-pkg (API documentation)
 python3-apt - Python 3 interface to libapt-pkg
 python3-apt-dbg - Python 3 interface to libapt-pkg (debug extension)
Closes: 567765 629624 645970 652335 656288 661062 669458 676960 676973 677331 677916 677934 678286
Changes: 
 python-apt (0.8.5) unstable; urgency=low
 .
   [ Michael Vogt ]
   * python/cache.cc:
     - ensure that pkgApplyStatus is called when the cache is opened
       (thanks to Sebastian Heinlein for finding this bug), LP: #659438
 .
   [ Stéphane Graber ]
   * data/templates/Ubuntu.info.in:
     - add quantal
 .
   [ Steve Langasek ]
   * utils/get_ubuntu_mirrors_from_lp.py: move this script to python3
   * pre-build.sh: call dpkg-checkbuilddeps with the list of our
     source-build-dependencies; this may save someone else an hour down the
     line scratching their head over gratuitous test-suite failures...
 .
   [ Sebastian Heinlein ]
   * lp:~glatzor/python-apt/auth:
     - this is a port of the software-properties AptAuth module to python-apt
       with some cleanups. It provides a wrapper API for the apt-key command
 .
   [ David Prévot ]
   * po/*.po: update PO files against current POT file
   * po/be.po: Belarusian translation by Viktar Siarheichyk (closes: #678286)
   * po/de.po: German translation updated by Holger Wansing (closes: #677916)
   * po/el.po: Greek translation updated by Thomas Vasileiou (closes: #677331)
   * po/en_GB.po: Remove useless file <20120610190618.GA1387@burratino>
   * po/eo.po: Esperanto translation by Kristjan Schmidt and Michael Moroni
   * po/fi.po: Finnish translation updated by Timo Jyrinki
   * po/fr.po: French translation updated (closes: #567765)
   * po/hu.po: Hungarian translation updated by Gabor Kelemen
   * po/id.po: Indonesian translation by Andika Triwidada (closes: #676960)
   * po/nl.po: Dutch translation updated by Jeroen Schot (closes: #652335)
   * po/pt_BR.po: Brazilian translation updated by Sérgio Cipolla
   * po/ru.po: incomplete Russian translation updated by Andrey
   * po/sk.po: Slovak translation updated by Ivan Masár (closes: #676973)
   * po/sl.po: Slovenian translation updated by Matej Urbančič
   * po/sr.po: incomplete Serbian translation updated by Nikola Nenadic
   * po/tl.po: Tagalog translation updated by Ariel S. Betan
   * po/am.po po/br.po po/et.po po/eu.po po/fa.po po/fur.po po/hi.po
     po/mr.po po/ms.po po/nn.po po/pa.po po/ps.po po/qu.po po/rw.po po/ta.po
     po/ur.po po/xh.po: remove useless (empty) translations
 .
   [ Julian Andres Klode ]
   * Merge patch from Colin Watson to handle non-UTF8 tag files in
     Python 3, by using bytes instead of str when requested; and
     document this in the RST documentation (Closes: #656288)
   * debian/control:
     - Drop Recommends on python2.6 (Closes: #645970)
     - Replace xz-lzma Recommends by xz-utils (Closes: #677934)
   * python/configuration.cc:
     - Handle the use of "del" on configuration values. Those are represented
       by calling the setter with NULL, which we did not handle before, causing
       a segmentation fault (Closes: #661062)
   * python/tag.cc:
     - Correctly handle file descriptor 0 aka stdin (Closes: #669458)
   * python/acquire.cc:
     - Use pkgAcquire::Setup() to setup the acquire class and handle errors
       from this (Closes: #629624)
   * debian/control:
     - Set Standards-Version to 3.9.3
   * utils/get_ubuntu_mirrors_from_lp.py:
     - Revert move to Python 3, python3-feedparser is not in the archive yet
   * tests:
     - Fix new tests from Sebastian to work with Python 2.6
Checksums-Sha1: 
 ca7cbe75c3b05db803e765d89bc93a105a0c1d9d 1549 python-apt_0.8.5.dsc
 fe546221e8f5fdfa5fcb3d6092cf0b39c359ea96 395569 python-apt_0.8.5.tar.gz
 7077b9c279e0665dc321356e687c10cee72da845 317216 python-apt_0.8.5_amd64.deb
 6aeaacba22c2595adea123306bd77504ef7e8991 242552 python-apt-doc_0.8.5_all.deb
 2502104b0d84852b325ce4c6c4ab16156cdfe047 4359670 python-apt-dbg_0.8.5_amd64.deb
 4ddf725986375b228963119245cba52d79526a3d 7628 python-apt-dev_0.8.5_all.deb
 c3159c3dbe235a56b9ebf43db72a1d88854b2ca3 110736 python-apt-common_0.8.5_all.deb
 c112edb4ec6f380208f963b13230a516a706194d 194692 python3-apt_0.8.5_amd64.deb
 1ce048eef89926a9e93cc06afa0072f83f52f9ea 2173584 python3-apt-dbg_0.8.5_amd64.deb
Checksums-Sha256: 
 355e4113ebb1a668072c89e732ee40acb444db3ebdd4187a0fbc3fed2b5508ae 1549 python-apt_0.8.5.dsc
 40b5cfe63b3b7481cc48f44c0e7cc6fec99a2adc53bf7443f9f032ce389ee4c0 395569 python-apt_0.8.5.tar.gz
 0e3493f96c2d0dbd6e81a633e61c746aed8b02dcd8708ea778bc3fd8912b2825 317216 python-apt_0.8.5_amd64.deb
 f2ee162159757dc6ed543eeffc3edf7f80d178ef87393ac5055d954161ca3c30 242552 python-apt-doc_0.8.5_all.deb
 ccedd2a4e264eae96f5d5db0a58b15508ed4a019e8bd44f843ffcf6d8e373e55 4359670 python-apt-dbg_0.8.5_amd64.deb
 8e8064455a940285148b567bafd624466c481d2318f03978636bd6d2d238a368 7628 python-apt-dev_0.8.5_all.deb
 579b360e5c4b65b70996961ec2da2088e866a41836615f5c21ece2828be64c41 110736 python-apt-common_0.8.5_all.deb
 7c828d99f67da7c057fb604576281cf82ffefe6b25dce474b15315a3b1d56351 194692 python3-apt_0.8.5_amd64.deb
 94bb706e3fb3c3c8cfaec226017eba5936e7113b6d1746d25f73eecd8dae4ece 2173584 python3-apt-dbg_0.8.5_amd64.deb
Files: 
 c4ce306c90e29db559fe16e1a44d3709 1549 python standard python-apt_0.8.5.dsc
 f19ff896847f1234de13692206f4ef32 395569 python standard python-apt_0.8.5.tar.gz
 58b430548c9e6493c4a19f256c55dffb 317216 python standard python-apt_0.8.5_amd64.deb
 f2e43816c20d0442b5714361e23cf410 242552 doc optional python-apt-doc_0.8.5_all.deb
 2109e04afb4c206da152e59215e01bba 4359670 debug extra python-apt-dbg_0.8.5_amd64.deb
 021e4a506df571cf885da1cb65e6385e 7628 python optional python-apt-dev_0.8.5_all.deb
 16d7350ff857169abc5bea15f1613cfc 110736 python optional python-apt-common_0.8.5_all.deb
 f0d987bd178d6d5d4b26154587305bb7 194692 python optional python3-apt_0.8.5_amd64.deb
 e85bb10c2605f143f0f191dca3fd47d1 2173584 debug extra python3-apt-dbg_0.8.5_amd64.deb

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAk/kMPEACgkQrCpf/gCCPsJuZwCffwdudjz4RayJu3OUnpKDLmOD
nAMAn194KTKhMfV7AYyCJ5CXD6BeNuwh
=19NA
-----END PGP SIGNATURE-----





Bug archived. Request was from Debbugs Internal Request <owner@bugs.debian.org> to internal_control@bugs.debian.org. (Mon, 26 Nov 2012 07:27:45 GMT).




Debian bug tracking system administrator <owner@bugs.debian.org>. Last modified: Sat Jan 13 01:59:35 2018; Machine Name: beach
