Using the jieba library

The jieba library performs Chinese word segmentation: it splits a run of Chinese text into a list of words.

import jieba
# Precise mode: returns the segmentation as a list, with no redundancy
# jieba.lcut(s)
print(jieba.lcut("python是一门编程语言"))
# Full mode: returns every possible word as a list; contains redundancy
# jieba.lcut(s, cut_all=True)
print(jieba.lcut("python是一门编程语言", cut_all=True))
# Search-engine mode: precise mode first, then long words are re-segmented; contains redundancy
# jieba.lcut_for_search(s)
print(jieba.lcut_for_search("python是一门编程语言"))
# Add a new word w to the segmentation dictionary
# jieba.add_word(w)
jieba.add_word("啦啦啦")
print(jieba.lcut("python是一门编程语言啦啦啦啦"))
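
A side note: each lcut function wraps a corresponding cut function that returns a generator instead of a list, so joining the tokens with a separator is a handy way to inspect a segmentation without printing a list. A minimal sketch:

import jieba

s = "python是一门编程语言"
print("/".join(jieba.cut(s)))                 # precise mode
print("/".join(jieba.cut(s, cut_all=True)))   # full mode
print("/".join(jieba.cut_for_search(s)))      # search-engine mode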

Finding the most frequent words in the Hamlet file, along with their counts

def getText():
    # Read the file, lowercase it, and replace punctuation with spaces
    txt = open("D:\\Python\\hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '|"!#$%()*+,-./;:<=>?@[\\]^_{|}~':
        txt = txt.replace(ch, " ")
    return txt

hamletTxt = getText()
words = hamletTxt.split()        # split on whitespace into a word list
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1    # accumulate each word's count
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
for i in range(10):              # report the ten most frequent words
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Finding the most frequent words in the 沉默的羔羊 ("The Silence of the Lambs") file, along with their counts; the only step beyond the English version is jieba segmentation

import jieba
txt = open("沉默的羔羊.txt", "r", encoding="utf-8").read()
words = jieba.lcut(txt)          # precise-mode segmentation of the whole text
counts = {}
for word in words:
    if len(word) == 1:           # skip single-character tokens (particles, punctuation)
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):              # report the fifteen most frequent words
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
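
In practice the top of this ranking is usually dominated by common function words rather than names. A typical refinement is to filter against an exclusion set before counting; the words in excludes below are hypothetical placeholders, not from the original post:

import jieba
from collections import Counter

# Hypothetical exclusion set: frequent function words that would
# otherwise crowd out the character names in the ranking.
excludes = {"一个", "我们", "你们", "他们", "自己", "没有", "什么"}

txt = open("沉默的羔羊.txt", "r", encoding="utf-8").read()
words = [w for w in jieba.lcut(txt)
         if len(w) > 1 and w not in excludes]
for word, count in Counter(words).most_common(15):
    print("{0:<10}{1:>5}".format(word, count))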
