Splitting Every Word in a File into Individual Letters with Python

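Before building a character-based language model, the words in the corpus have to be split apart: each letter is pulled out on its own and saved to a separate file for later use.
The Python script below does the splitting: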

# -*- coding: UTF-8 -*-
import string
import sys

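def SplitIntoCharacters(sourceFilePath, outputFileName):
    # Full-width and CJK punctuation that string.punctuation does not cover.
    chn_punctuations = "！？｡＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏."
    # The ideographic full stop and the opening title mark are absent from the list above.
    chn_punctuations += "。《"
    punctuations = set(string.punctuation) | set(chn_punctuations)
    # Read and write as UTF-8 explicitly. The output is opened in append mode,
    # so repeated runs add to an existing file instead of overwriting it.
    with open(sourceFilePath, encoding="utf-8") as sourceFile, \
            open(outputFileName, 'a', encoding="utf-8") as newFile:
        # split() discards all whitespace, so only word characters are visited.
        for word in sourceFile.read().split():
            for character in word:
                if character not in punctuations:
                    # One lowercased character per line.
                    newFile.write(character.lower() + "\n")
    print("done!")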


if __name__ == "__main__":
    # print('args list:', str(sys.argv))
    # Expect exactly two non-blank arguments: the source path and the output path.
    if len(sys.argv) != 3 or not sys.argv[1].strip() or not sys.argv[2].strip():
        print("Error: Source file path or the output file name is empty")
    else:
        SplitIntoCharacters(sys.argv[1], sys.argv[2])

# by Alexander Enharjan
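
To make the filtering rule concrete, here is a minimal sketch of the same idea applied to a string in memory rather than a file. The function name `split_into_characters` and the short `CJK_PUNCTUATION` list are assumptions made for this illustration only; the script above uses a much longer punctuation list.

```python
import string

# Hypothetical, abbreviated list of full-width punctuation for this demo.
CJK_PUNCTUATION = "！？，。、；：（）《》“”‘’…—"


def split_into_characters(text):
    """Return the lowercased non-punctuation characters of text, in order."""
    punctuation = set(string.punctuation) | set(CJK_PUNCTUATION)
    return [
        character.lower()
        for word in text.split()      # split() drops spaces and newlines
        for character in word
        if character not in punctuation
    ]


sample = "Hello, World! 你好，世界！ 42"
print("".join(split_into_characters(sample)))  # helloworld你好世界42
```

Note that digits and CJK characters pass through unchanged, because the rule only removes whitespace and punctuation; whether digits belong in the model's alphabet depends on the corpus.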

Usage:

python3 wordSpliter (INPUT_FILE_PATH) (OUTPUT_FILE_PATH)
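
For a large corpus, reading the whole file at once may not be desirable. The following is a rough sketch of a line-by-line variant, not the script's own behavior: `split_file_streaming`, its `extra_punctuation` parameter, and the temporary file names are all hypothetical, and it opens the output with `"w"` so that an existing file is overwritten rather than appended to.

```python
import os
import string
import tempfile


def split_file_streaming(source_path, output_path, extra_punctuation=""):
    """Write each non-punctuation character of source_path on its own line.

    Hypothetical variant: iterates over the file line by line instead of
    calling read(), and overwrites output_path instead of appending to it.
    """
    punctuation = set(string.punctuation) | set(extra_punctuation)
    with open(source_path, encoding="utf-8") as source, \
            open(output_path, "w", encoding="utf-8") as output:
        for line in source:
            for character in line:
                # Skip the whitespace that split() would otherwise discard.
                if character.isspace() or character in punctuation:
                    continue
                output.write(character.lower() + "\n")


# Round trip through temporary files to show the output layout.
with tempfile.TemporaryDirectory() as workdir:
    source_path = os.path.join(workdir, "corpus.txt")
    output_path = os.path.join(workdir, "letters.txt")
    with open(source_path, "w", encoding="utf-8") as handle:
        handle.write("Good day!\nSee you.\n")
    split_file_streaming(source_path, output_path)
    with open(output_path, encoding="utf-8") as handle:
        letters = handle.read().split()

print("".join(letters))  # gooddayseeyou
```

The output file holds one character per line, the same layout the script produces.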

Author: 艾孜尔江·艾尔斯兰

Please be sure to credit the source when reposting!

posted @ 2022-09-26 15:16 艾孜尔江
