Chinese Wikipedia (zh) data processing pipeline and obtaining word vectors

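I. Download the dataset

Chinese Wikipedia dumps: https://dumps.wikimedia.org/zhwiki/
English Wikipedia dumps: https://dumps.wikimedia.org/enwiki/

After opening the link, go into the most recent dump directory, e.g. 20201020/, and download the file zhwiki-20201020-pages-articles-multistream.xml.bz2, which is roughly 2 GB.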
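The file name follows a fixed pattern, so the download URL can be assembled from the wiki name and the dump date. Below is a minimal sketch using only the standard library. The helper names `dump_url` and `download` are illustrative and not part of any Wikimedia tooling, and the date has to be one that is still listed on the dumps site, since older dumps are eventually removed.

```python
import shutil
import urllib.request

BASE = "https://dumps.wikimedia.org"


def dump_url(wiki, date):
    """Build the URL of the pages-articles multistream dump."""
    name = f"{wiki}-{date}-pages-articles-multistream.xml.bz2"
    return f"{BASE}/{wiki}/{date}/{name}"


def download(url, dest):
    """Stream the file to disk so the ~2 GB dump never sits in memory."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)


if __name__ == "__main__":
    url = dump_url("zhwiki", "20201020")
    print(url)
    # Uncomment to actually fetch the file into the current directory:
    # download(url, url.rsplit("/", 1)[-1])
```

Passing "enwiki" instead of "zhwiki" gives the corresponding English dump URL.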

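II. Clean the data (method 1)

1. Process the dump with Wikipedia Extractor from GitHub

Wiki link 1
GitHub link

Download the script, then run the command python WikiExtractor.py --infn zhwiki-20201020-pages-articles-multistream.xml.bz2

After a patient wait, a text file named wiki.txt appears in the current directory.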
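Because wiki.txt is large, it is easier to check the result by looking at its first few lines than by opening the whole file in an editor. A small sketch, where `preview` is a hypothetical helper name:

```python
from itertools import islice


def preview(path, n=5, encoding="utf-8"):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, encoding=encoding) as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Example usage, once wiki.txt exists:
# for line in preview("wiki.txt", n=10):
#     print(line)
```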

2. Further processing

(1) Convert Traditional Chinese to Simplified Chinese, using the opencc-python package.
Install it with: pip install opencc-python-reimplemented

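from opencc import OpenCC


def traditional_2_easy(inputname, outputname):
    """Convert a Traditional Chinese text file to Simplified Chinese."""
    cc = OpenCC('t2s')
    # Convert line by line so the whole file is never held in memory,
    # and let the with-block close both files.
    with open(inputname, encoding='utf-8') as f, \
            open(outputname, 'w', encoding='utf-8') as f2:
        for line in f:
            f2.write(cc.convert(line))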

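(2) The output of Wikipedia Extractor drops the content marked with {{}}, which leaves some artifacts behind. This step handles a few of those special cases, such as empty parentheses.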

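import re


def handle(inputname, outputname):
    """Clean up punctuation artifacts left behind by Wikipedia Extractor."""
    with open(inputname, encoding='utf-8') as f:
        to_convert = f.read()
    # Normalize ASCII parentheses to full-width ones first, so the rules
    # below only have to deal with one style.
    converted = to_convert.replace('(', '（').replace(')', '）')
    # Replace the symbols 「」｢｣『』 with double quotes.
    converted = re.sub(r'[「」｢｣『』]', '"', converted)
    # Drop a dangling comma or semicolon before a closing parenthesis,
    # e.g. 罗德·法尼(Rod Dodji Fanni，）
    converted = re.sub(r'[，；]）', '）', converted)
    # Drop a dangling comma after an opening parenthesis,
    # e.g. 阿魯拉·基馬(Alula Girma (，
    converted = re.sub(r'（，', '（', converted)
    # Remove empty parentheses last, so that pairs emptied by the rules
    # above are removed as well.
    converted = converted.replace('（）', '')
    with open(outputname, 'w', encoding='utf-8') as f2:
        f2.write(converted)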

# Entry point
def main():
    traditional_2_easy('wiki.txt', 'out.txt')
    handle('out.txt', 'out2.txt')


if __name__ == '__main__':
    main()
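The clean-up rules in handle can be tried on short strings before running them over the whole file. The sketch below applies the same substitutions as a pure function. As an assumption of this sketch, parentheses are normalized first and empty pairs are removed last, so that pairs emptied by the other rules are dropped too. `clean_punctuation` is an illustrative name.

```python
import re


def clean_punctuation(text):
    """Apply the punctuation clean-up rules to a single string."""
    # Normalize ASCII parentheses to full-width ones.
    text = text.replace("(", "（").replace(")", "）")
    # Replace corner brackets with double quotes.
    text = re.sub(r"[「」｢｣『』]", '"', text)
    # Drop a dangling comma or semicolon next to a parenthesis.
    text = re.sub(r"[，；]）", "）", text)
    text = re.sub(r"（，", "（", text)
    # Remove parentheses that are (or have become) empty.
    return text.replace("（）", "")


if __name__ == "__main__":
    samples = [
        "罗德·法尼(Rod Dodji Fanni，）",
        "基馬(，1990年)",
        "「你好」",
        "北京()是首都",
    ]
    for s in samples:
        print(s, "->", clean_punctuation(s))
```

Keeping the rules in a string-to-string function like this also makes it possible to apply them line by line instead of reading the whole file at once.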

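II. Clean the data (method 2)

Reference: 获取并处理中文维基百科语料 (Obtaining and Processing a Chinese Wikipedia Corpus)

1. Extract the text with gensim's wikicorpus module.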

from gensim.corpora.wikicorpus import extract_pages, filter_wiki
import bz2file
import re
from opencc import OpenCC
from tqdm import tqdm
import codecs
import jieba

cc = OpenCC('t2s')


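def wiki_replace(d):
    """Clean one (title, text, page id) tuple and return Simplified Chinese text."""
    s = d[1]
    s = re.sub(r':*{\|[\s\S]*?\|}', '', s)  # drop wiki tables {| ... |}
    s = re.sub(r'<gallery>[\s\S]*?</gallery>', '', s)  # drop image galleries
    # Rewrite {{a|b}} templates as [[a|b]] links so that filter_wiki keeps
    # part of their text instead of dropping them entirely.
    s = re.sub(r'(.){{([^{}\n]*?\|[^{}\n]*?)}}', '\\1[[\\2]]', s)
    s = filter_wiki(s)  # strip the remaining wiki markup
    s = re.sub(r'\* *\n|\'{2,}', '', s)  # drop empty bullets and ''/''' emphasis marks
    s = re.sub(r'\n+', '\n', s)  # collapse consecutive newlines
    s = re.sub(r'\n[:;]|\n +', '\n', s)  # strip leading ':', ';' and indentation
    s = re.sub(r'\n==', '\n\n==', s)  # put a blank line before each heading
    s = '【' + d[0] + '】\n' + s  # prefix the title, wrapped in 【】
    return cc.convert(s).strip()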


def step_1_extract():
    """Extract the articles; each title is wrapped in 【】."""
    wiki = extract_pages(bz2file.open(
        'zhwiki-20201020-pages-articles-multistream.xml.bz2'))
    i = 0
    w = tqdm(wiki, desc='Extracted 0 articles')

    with codecs.open('wiki.cn.txt', 'w', encoding='utf-8') as f:
        for d in w:  # d is (title, text, page id)
            # The '^[a-zA-Z]+:' check on the title skips help pages and similar.
            # The '^#' check on the text skips redirect pages.
            # About 919,000 pages remain in the end.
            if not re.findall('^[a-zA-Z]+:', d[0]) and d[0] and not re.findall('^#', d[1]):
                s = wiki_replace(d)
                f.write(s+'\n\n\n')
                i += 1
                if i % 100 == 0:
                    w.set_description('Extracted %s articles' % i)
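The page filter and the substitutions that run before filter_wiki can be checked in isolation on small inputs, without gensim or the dump. The following is a sketch using only the standard library. The names `keep_page` and `pre_clean` are illustrative, the sample inputs in the usage are made up, and the braces are escaped here only for readability.

```python
import re

NAMESPACE = re.compile(r"^[a-zA-Z]+:")  # titles such as "Help:..."
REDIRECT = re.compile(r"^#")            # redirect pages start with "#"


def keep_page(title, text):
    """Return True for ordinary articles, False for namespaced or redirect pages."""
    return bool(title) and not NAMESPACE.match(title) and not REDIRECT.match(text)


def pre_clean(s):
    """Apply the substitutions that run before filter_wiki."""
    s = re.sub(r":*\{\|[\s\S]*?\|\}", "", s)           # remove {| ... |} tables
    s = re.sub(r"<gallery>[\s\S]*?</gallery>", "", s)  # remove galleries
    # Rewrite {{a|b}} as [[a|b]]; templates without a "|" are left alone.
    s = re.sub(r"(.)\{\{([^{}\n]*?\|[^{}\n]*?)\}\}", r"\1[[\2]]", s)
    return s


if __name__ == "__main__":
    print(keep_page("数学", "数学是研究数量、结构的学科"))
    print(keep_page("Help:目录", "帮助页面"))
    print(keep_page("数学家", "#REDIRECT [[数学]]"))
    print(pre_clean("长约{{convert|5|km}}。"))
```

The title pattern is a heuristic: an article whose own title begins with Latin letters followed by a colon is skipped as well, which may or may not be acceptable depending on the corpus.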

2. Tokenize with jieba

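def step_2_tokenize():
    """Segment each non-empty line with jieba and write space-separated tokens."""
    # Set the encoding explicitly: wiki.cn.txt was written as UTF-8, and the
    # platform default may differ.
    with open('wiki.cn.txt', 'r', encoding='utf-8') as fr, \
            open('wiki.sen.txt', 'w', encoding='utf-8') as fw:
        for line in tqdm(fr):
            line = line.strip()
            if not line:  # skip the blank lines between articles
                continue
            fw.write(' '.join(jieba.cut(line)) + '\n')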


step_2_tokenize()
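The resulting wiki.sen.txt holds one line of space-separated tokens per input line. Below is a sketch of how that file can be read back as token lists and summarized into token counts, using only the standard library. The names `TokenizedLines` and `count_tokens` are illustrative. A restartable iterable of token lists like this is the general shape that word-vector trainers such as gensim's Word2Vec accept as their `sentences` input, though no training is done here.

```python
from collections import Counter


class TokenizedLines:
    """Iterate over a tokenized file, yielding one list of tokens per line."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # Re-opening the file on every call lets the object be iterated
        # more than once.
        with open(self.path, encoding=self.encoding) as f:
            for line in f:
                tokens = line.split()
                if tokens:
                    yield tokens


def count_tokens(sentences):
    """Count how often each token occurs across all sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tokens)
    return counts


# Example usage, once wiki.sen.txt exists:
# counts = count_tokens(TokenizedLines("wiki.sen.txt"))
# print(counts.most_common(20))
```

Looking at the most frequent tokens is a quick way to spot leftover markup or punctuation before moving on to training.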

posted @ 2020-10-22 20:11 戴墨镜的长颈鹿
