Using the jieba library
The jieba library is used to segment Chinese text into words.
import jieba

# Precise mode: returns a list of tokens
# jieba.lcut(s)
print(jieba.lcut("python是一门编程语言"))

# Full mode: returns a list of tokens, with redundancy
# jieba.lcut(s, cut_all=True)
print(jieba.lcut("python是一门编程语言", cut_all=True))

# Search-engine mode: returns a list of tokens, with redundancy
print(jieba.lcut_for_search("python是一门编程语言"))

# Add a new word w to the segmentation dictionary
# jieba.add_word(w)
jieba.add_word("啦啦啦")
print(jieba.lcut("python是一门编程语言啦啦啦啦"))
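To see why a dictionary matters for Chinese segmentation, here is a toy forward-maximum-matching sketch. It only illustrates the general idea of dictionary-based segmentation; jieba's actual algorithm is more sophisticated (a prefix dictionary with dynamic programming, plus an HMM for unknown words). The `DICT` entries and `fmm_cut` name below are made up for this example.

```python
# Toy forward maximum matching (FMM) segmenter: at each position, greedily
# take the longest dictionary word; fall back to a single character.
# DICT is a hypothetical mini-dictionary, not jieba's real one.
DICT = {"一门", "编程", "语言", "编程语言"}
MAX_LEN = 4  # longest word we will try to match

def fmm_cut(s):
    tokens = []
    i = 0
    while i < len(s):
        # Try the longest candidate first, shrinking toward one character.
        for j in range(min(len(s), i + MAX_LEN), i, -1):
            if s[i:j] in DICT:
                tokens.append(s[i:j])
                i = j
                break
        else:
            # No dictionary match: emit the single character as a token.
            tokens.append(s[i])
            i += 1
    return tokens

print(fmm_cut("一门编程语言"))  # ['一门', '编程语言']
```

Note how "编程语言" wins over "编程" + "语言" because the longer match is tried first; jieba's precise mode makes a similar longest/most-probable choice, while full mode keeps all the overlapping matches.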
Finding the most frequent words in the hamlet file, along with their counts
def getText():
    txt = open("D:\\Python\\hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '|"!#$%()*+,-./;:<=>?@[\\]^_{|}~':
        txt = txt.replace(ch, " ")
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
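The manual dictionary-and-sort pattern above can be written more compactly with the standard library's collections.Counter, whose most_common() replaces the list(), sort(), and indexing steps. A minimal sketch on a short sample string (the text here is just an illustration, not the hamlet file):

```python
from collections import Counter

# Count whitespace-separated words, then take the most frequent ones.
text = "to be or not to be that is the question"
counts = Counter(text.split())
for word, count in counts.most_common(3):
    print("{0:<10}{1:>5}".format(word, count))
```

Counter is a dict subclass, so counts.get(word, 0) and the rest of the original code would still work on it unchanged.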
Finding the most frequent words in the 沉默的羔羊 (The Silence of the Lambs) file and their counts; compared with the English example, this adds one extra step: jieba segmentation.
import jieba

txt = open("沉默的羔羊.txt", "r", encoding="utf-8").read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens (mostly punctuation and particles)
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
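A common refinement of this exercise is to also filter out frequent but uninformative multi-character words with a stopword set, since the len(word) == 1 check only removes single characters. A sketch of the idea, using a hypothetical pre-segmented token list and a made-up stopword set in place of the real jieba output:

```python
from collections import Counter

# Hypothetical output of jieba.lcut on some novel text.
tokens = ["博士", "自己", "说", "博士", "克拉丽丝", "自己"]

stopwords = {"自己", "他们"}  # made-up stopword list; real ones are loaded from a file

# Keep multi-character tokens that are not stopwords, then count.
counts = Counter(w for w in tokens if len(w) > 1 and w not in stopwords)
print(counts.most_common(2))  # [('博士', 2), ('克拉丽丝', 1)]
```

In practice the stopword list is read from a text file (one word per line) and the filter is applied before counting, exactly as the generator expression above does.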