Python crawler: how to convert encoded text like &#x5c0f;&#x8bf4; into Chinese?

The problem

<a target="_blank" title="&#x97F3;&#x4E50;&#x63A5;&#x529B;/&#x5168;&#x80FD;&#x6295;&#x5C4F;/&#x89E6;&#x78B0;&#x8054;&#x7F51;/&#x7C73;&#x5BB6;&#x667A;&#x80FD;&#x573A;&#x666F;">

Solution: in Python 3


# tested under python3.4

def convert(s):
    s = s.strip('&#x;')  # turn '&#x957f;' into '957f'
    s = bytes(r'\u' + s, 'ascii')  # turn '957f' into b'\\u957f'
    # Decode the bytes object with the unicode_escape codec, which turns the
    # escape sequence b'\\u957f' into the character '长'. See the codecs docs.
    return s.decode('unicode_escape')

print(convert('&#x957f;'))  # => 长
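As an alternative to building the escape sequence by hand, the standard library's `html.unescape` (available since Python 3.4) decodes character references directly. The sketch below is only an illustration; the wrapper name `convert_all` is made up for this example and is not part of the original solution.

```python
import html

def convert_all(text):
    # html.unescape handles hex (&#x957f;), decimal (&#38271;) and
    # named (&amp;) references anywhere in the string, in a single pass.
    return html.unescape(text)

print(convert_all('&#x957f;'))          # 长
print(convert_all('&#x5c0f;&#x8bf4;'))  # 小说
```

Unlike `convert`, this takes a whole string rather than a single reference, so text that is not a reference is left untouched.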

Result on my data

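title = "&#x97F3;&#x4E50;&#x63A5;&#x529B;/&#x5168;&#x80FD;&#x6295;&#x5C4F;/&#x89E6;&#x78B0;&#x8054;&#x7F51;/&#x7C73;&#x5BB6;&#x667A;&#x80FD;&#x573A;&#x666F;"

# Drop the '&#x' prefix and turn each ';' terminator into a ',' separator
title = title.replace("&#x", "").replace(";", ",")

# Remove the '/' separators and split into individual hex code points
codes = title.replace('/', '').split(',')

for i in codes:
    if len(i) == 4:  # skips the empty string left after the trailing ','
        s = bytes(r'\u' + i, 'ascii')
        print(s.decode('unicode_escape'), end='')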

音乐接力全能投屏触碰联网米家智能场景
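The loop above drops the `/` separators and only handles four-digit codes. A minimal sketch of a variant that keeps the surrounding text intact is shown here, assuming the input only needs hexadecimal references decoded; `HEX_REF` and `decode_hex_refs` are names invented for this example.

```python
import re

# Matches a hexadecimal character reference such as &#x97F3; and
# captures the hex digits (any length, not just four).
HEX_REF = re.compile(r'&#x([0-9a-fA-F]+);')

def decode_hex_refs(text):
    # Replace each reference with the character for that code point;
    # everything else in the string (like '/') is left as it is.
    return HEX_REF.sub(lambda m: chr(int(m.group(1), 16)), text)

title = "&#x97F3;&#x4E50;&#x63A5;&#x529B;/&#x5168;&#x80FD;&#x6295;&#x5C4F;"
print(decode_hex_refs(title))  # 音乐接力/全能投屏
```

Using `chr(int(..., 16))` instead of the `\u` escape also covers code points above U+FFFF, which `\u` with four digits cannot express.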

In Python 2

# for python2.7
import re

def convert(s):
    return ''.join([r'\u', s.strip('&#x;')]).decode('unicode_escape')

# ss is assumed to hold the page content as a gbk-encoded byte string
ss = unicode(ss, 'gbk')  # convert it to a unicode string

# Replace every four-digit hex reference in ss with its character
print re.sub(r'&#x[0-9a-fA-F]{4};', lambda match: convert(match.group()), ss)
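For comparison, the same two steps (decode the gbk bytes, then replace the references) might look like this in Python 3. This is a sketch under assumptions: `decode_page` is a hypothetical helper, and the page is assumed to be gbk-encoded as in the Python 2 snippet.

```python
import html

def decode_page(raw_bytes, encoding='gbk'):
    # Step 1: turn the raw response bytes into a str.
    # errors='replace' keeps going if a few bytes are malformed.
    text = raw_bytes.decode(encoding, errors='replace')
    # Step 2: decode the character references inside the text.
    return html.unescape(text)

# Simulated page content; real code would use the bytes of an HTTP response.
raw = '标题: &#x5c0f;&#x8bf4;'.encode('gbk')
print(decode_page(raw))  # 标题: 小说
```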

posted @ 2021-04-21 16:20 wzqwer
