Obtaining and Processing the Chinese Wikipedia Corpus

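Chinese Wikipedia is a high-quality, easy-to-obtain corpus. Wikimedia is quite generous about it: roughly every month it packages all articles into a downloadable dump. Love you, Wikipedia. Baidu Baike and Hudong Baike, which offer nothing comparable, get a thumbs-down!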

Source data download: https://dumps.wikimedia.org/zhwiki/
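As a rough sketch of fetching the dump from a script rather than a browser, the snippet below builds the URL of the pages-articles archive and streams it to disk. The helper names `dump_url` and `download`, the User-Agent string, and the chunk size are illustrative assumptions, not part of the original post; the file name matches the one the extraction script opens.

```python
import shutil
import urllib.request

BASE = 'https://dumps.wikimedia.org'


def dump_url(wiki='zhwiki', date='latest'):
    """Build the URL of the pages-articles dump (hypothetical helper)."""
    name = '%s-%s-pages-articles.xml.bz2' % (wiki, date)
    return '%s/%s/%s/%s' % (BASE, wiki, date, name)


def download(url, path, chunk_size=1 << 20):
    """Stream url to path in chunks so the archive never sits in memory."""
    # The User-Agent value is a made-up placeholder for this sketch.
    req = urllib.request.Request(url, headers={'User-Agent': 'zhwiki-corpus-demo/0.1'})
    with urllib.request.urlopen(req) as resp, open(path, 'wb') as out:
        shutil.copyfileobj(resp, out, chunk_size)
    return path


# Usage (left commented out because the archive is several gigabytes):
# download(dump_url(), 'zhwiki-latest-pages-articles.xml.bz2')
```

The dump is large, so a download tool that can resume interrupted transfers may be the more practical choice in day-to-day use.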

Data extraction script:

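# Third-party dependencies: gensim, bz2file, opencc, tqdm
from gensim.corpora.wikicorpus import extract_pages, filter_wiki
import bz2file
import re
import opencc
from tqdm import tqdm
import codecs

# extract_pages lazily yields one tuple per page: d[0] is the title, d[1] the raw wikitext
wiki = extract_pages(bz2file.open('zhwiki-latest-pages-articles.xml.bz2'))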

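def wiki_replace(d):
    s = d[1]
    # Drop wiki tables ({| ... |}) together with any leading indentation colons
    s = re.sub(r':*\{\|[\s\S]*?\|\}', '', s)
    # Drop image galleries
    s = re.sub(r'<gallery>[\s\S]*?</gallery>', '', s)
    # Rewrite inline {{a|b}} templates as [[a|b]] links, presumably so filter_wiki keeps their text
    s = re.sub(r'(.)\{\{([^{}\n]*?\|[^{}\n]*?)\}\}', r'\1[[\2]]', s)
    s = filter_wiki(s)
    # Remove empty list items and bold/italic quote markers
    s = re.sub(r"\* *\n|'{2,}", '', s)
    # Collapse runs of newlines, then strip a leading ':', ';' or spaces from each line
    s = re.sub(r'\n+', '\n', s)
    s = re.sub(r'\n[:;]|\n +', '\n', s)
    # Put a blank line before each section heading
    s = re.sub(r'\n==', '\n\n==', s)
    # Prefix the article with its title in 【】 brackets
    s = u'【' + d[0] + u'】\n' + s
    # Convert to Simplified Chinese; assumes an opencc binding with a module-level convert()
    return opencc.convert(s).strip()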

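i = 0
w = tqdm(wiki, desc=u'Fetched 0 articles')
# The with block closes wiki.txt even if extraction fails partway through
with codecs.open('wiki.txt', 'w', encoding='utf-8') as f:
    for d in w:
        # Skip namespaced pages (e.g. Category:, Wikipedia:), empty titles and redirects
        if not re.findall('^[a-zA-Z]+:', d[0]) and d[0] and not re.findall(u'^#', d[1]):
            s = wiki_replace(d)
            f.write(s + '\n\n\n')
            i += 1
            if i % 100 == 0:
                w.set_description(u'Fetched %s articles' % i)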
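To see what the regex steps in wiki_replace do without downloading the dump, the sketch below applies the same patterns to small hand-written samples. The helper names `strip_tables_and_templates` and `tidy_whitespace` and the sample strings are made up for illustration. The `filter_wiki` and `opencc` steps are left out so the snippet needs only the standard library.

```python
import re


def strip_tables_and_templates(s):
    """The substitutions applied before filter_wiki in the script."""
    s = re.sub(r':*\{\|[\s\S]*?\|\}', '', s)            # wiki tables
    s = re.sub(r'<gallery>[\s\S]*?</gallery>', '', s)   # image galleries
    # Inline {{a|b}} becomes [[a|b]]; the leading (.) means a template
    # that starts a line (such as an infobox) is left untouched.
    s = re.sub(r'(.)\{\{([^{}\n]*?\|[^{}\n]*?)\}\}', r'\1[[\2]]', s)
    return s


def tidy_whitespace(s):
    """The substitutions applied after filter_wiki in the script."""
    s = re.sub(r"\* *\n|'{2,}", '', s)        # empty bullets, quote markers
    s = re.sub(r'\n+', '\n', s)               # collapse blank lines
    s = re.sub(r'\n[:;]|\n +', '\n', s)       # leading ':', ';' or spaces
    s = re.sub(r'\n==', '\n\n==', s)          # blank line before headings
    return s


table = '前文\n{| class="wikitable"\n|-\n| a || b\n|}\n后文'
print(repr(strip_tables_and_templates(table)))
print(strip_tables_and_templates('文字{{lang|en|text}}结尾'))
print(repr(tidy_whitespace("'''粗体'''文字\n\n\n:缩进\n== 标题 ==\n正文")))
```

Running the helpers on a few paragraphs copied from a real article is a quick way to check whether the patterns still fit the current dump format before processing the whole archive.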

References:

https://spaces.ac.cn/archives/4176

https://cloud.tencent.com/developer/article/1435977

https://blog.csdn.net/weixin_40547993/article/details/97781179

posted @ 2021-08-03 16:56 今夜无风
