首先在网上下载一个汉语词典的txt文件, 汉语词典
1.用正则去掉词语的解释,即提取出所有汉语词语;
import re def getHanYuCi(st): p = re.compile(r'【.*?】') # 挑选出: [汉字] rt = p.findall(st) p = re.compile(r'[\u4E00-\u9FA5]+') # 去掉【】:只保留汉字; *:前一个字符0次或无限次; +:表示1次或无限制 rt = p.findall(str(rt)) #print(str[0:1000]) return rt def test_1(): path = r'C:\Users\sss\Desktop\hanyucidian.txt' with open(path, 'rb') as f: st = f.read().decode('gb18030') rt = getHanYuCi(st) dict = {} for x in rt: dict[x] = 0 #print('+++++++++++++') #print(len(rt)) #print(rt) # 由于rt特别大,直接print不会读出任何东西,但只读某一段时可以读出来 #print(rt[1:10]) path = r'C:\Users\sss\Desktop\hanyu_ci.txt' with open(path, 'w') as f: f.write(str(dict)) test_1()