Splitting every word in a file into individual letters with Python

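  1. Before building a character-based language model, the words in the corpus need to be split: each letter is pulled out on its own and stored in a separate file for later use;
  2. The Python script below does the splitting: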
# -*- coding: UTF-8 -*-
import string
import sys

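def SplitIntoCharacters(sourceFilePath, outputFileName):
    # Chinese and typographic punctuation to drop, in addition to string.punctuation.
    # A raw triple-quoted string lets the embedded quotes and backslash appear unescaped.
    chn_punctuations = r'''!?。"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏.'''
    punctuations = set(string.punctuation) | set(chn_punctuations)
    # Explicit UTF-8 so non-ASCII text is handled the same way on every platform.
    # Mode 'a' appends, so rerunning the script adds to an existing output file.
    with open(sourceFilePath, encoding='utf-8') as sourceFile, \
            open(outputFileName, 'a', encoding='utf-8') as newFile:
        for word in sourceFile.read().split():
            for character in word:
                if character not in punctuations:
                    # Write one lowercased character per line.
                    newFile.write(character.lower() + "\n")
    print("done!")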


if __name__ == "__main__":
    # print('args list:', str(sys.argv))
    # Expect exactly two non-blank arguments: the source file path and the output file name.
    if len(sys.argv) != 3 or not sys.argv[1].strip() or not sys.argv[2].strip():
        print("Error: Source file path or the output file name is empty")
    else:
        SplitIntoCharacters(sys.argv[1], sys.argv[2])

# by Alexander Enharjan
  3. Usage:
python3 wordSpliter (INPUT_FILE_PATH) (OUTPUT_FILE_PATH)
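
To make the effect of the script easier to see, here is a minimal in-memory sketch of the same splitting step. It is an illustration under assumptions, not the script itself: the function name `split_into_characters` and the short `EXTRA_PUNCTUATION` list are hypothetical and only cover a small subset of the punctuation handled above.

```python
import string

# Assumption: a short illustrative subset of non-ASCII punctuation,
# not the full list used by the script above.
EXTRA_PUNCTUATION = "。、「」『』【】“”‘’…—"


def split_into_characters(text):
    """Return the lowercased, non-punctuation characters of text, in order."""
    punctuation = set(string.punctuation) | set(EXTRA_PUNCTUATION)
    return [
        character.lower()
        for word in text.split()      # split() discards all whitespace
        for character in word
        if character not in punctuation
    ]


if __name__ == "__main__":
    sample = "Hello, World!"
    # Print one character per line, mirroring the layout of the output file.
    print("\n".join(split_into_characters(sample)))
```

For the sample input `Hello, World!` this prints the ten characters h, e, l, l, o, w, o, r, l, d, one per line, which is the same layout the script writes to the output file.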




Author: 艾孜尔江·艾尔斯兰

Please be sure to credit the source when reposting!

posted @ 2022-09-26 15:16  艾孜尔江