ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes
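This error looks like an encoding problem. More precisely, lxml refuses any string that contains NULL bytes or control characters, because XML does not allow them.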
Traceback (most recent call last):
  File "/tmp/a.py", line 4, in <module>
    html5lib.parse('<p>', treebuilder='lxml')
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 28, in parse
    return p.parse(doc, encoding=encoding)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 224, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 93, in _parse
    self.mainLoop()
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 183, in mainLoop
    new_token = phase.processCharacters(new_token)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 991, in processCharacters
    self.tree.insertText(token["data"])
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/_base.py", line 320, in insertText
    parent.insertText(data)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree_lxml.py", line 240, in insertText
    builder.Element.insertText(self, data, insertBefore)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree.py", line 108, in insertText
    self._element.text += data
  File "lxml.etree.pyx", line 921, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:41467)
  File "apihelpers.pxi", line 652, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:18888)
  File "apihelpers.pxi", line 1335, in lxml.etree._utf8 (src/lxml/lxml.etree.c:24701)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
I ran into this error unexpectedly while generating a document. My first idea was to fix it by changing the encoding, like this:
p = document.add_paragraph(u"哈哈 ")
or:
p = document.add_paragraph(p.encode('utf-8').decode("utf-8"))
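I tried both approaches and the error was still there. In the end I stripped the offending characters with a regular expression, which got rid of the error. This is a compromise for now; if I find a better solution later, I will come back and update this post: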
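import re
# Remove control characters (\x00 to \x08, \x0b, \x0e to \x1f, and \x7f) before adding the text
s = re.sub(u"[\\x00-\\x08\\x0b\\x0e-\\x1f\\x7f]", "", s)
p = self.doc.add_paragraph(s)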
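The replacement approach can be wrapped in a small reusable helper. The sketch below is only an illustration: the names `strip_xml_illegal` and `_XML_ILLEGAL` are hypothetical, and the character class is my assumption based on the XML 1.0 character rules (it also removes the form feed `\x0c`, which the pattern above leaves in, and it keeps `\x7f`, which XML permits). It also shows why the encode/decode attempt could not help: a UTF-8 round trip returns the identical string, control characters included.

```python
import re

# Assumption: characters outside the XML 1.0 "Char" production, i.e. C0 control
# characters other than tab, newline and carriage return, plus lone surrogates
# and the noncharacters U+FFFE / U+FFFF.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]")


def strip_xml_illegal(text):
    """Return text with XML-incompatible characters removed (hypothetical helper)."""
    return _XML_ILLEGAL.sub("", text)


dirty = "hello\x00 wor\x0cld\x1f\n"

# Encoding to UTF-8 and decoding again changes nothing, so the control
# characters survive and the ValueError would still be raised.
assert dirty.encode("utf-8").decode("utf-8") == dirty

clean = strip_xml_illegal(dirty)
print(repr(clean))  # 'hello world\n'
```

The cleaned string can then be passed to a call such as `add_paragraph(clean)` in place of the raw text.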