Counting English word occurrences in Python [small example]

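Task: you have a directory holding a month of diary entries, all .txt files. To sidestep word-segmentation issues, assume the content is entirely in English. Find the word you consider most important in each entry.
1.txt: i love you beijing
2.txt: i love you beijing hello world
3.txt: today is a good day
Source code: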

import os,re

def find_word(file_path):
    file_list = os.listdir(file_path)  # names of the entries in the directory
    word_dict = {}
    # The r prefix marks a raw string; \w matches any word character,
    # including the underscore (same as [A-Za-z0-9_] for ASCII text)
    word_re = re.compile(r'\w+')
    for file_name in file_list:
        # listdir() returns bare names, so join each one with the directory
        full_path = os.path.join(file_path, file_name)
        # os.path.splitext('c:\\csv\\test.csv') returns ('c:\\csv\\test', '.csv')
        if os.path.isfile(full_path) and os.path.splitext(file_name)[1] == '.txt':
            try:
                with open(full_path, 'r') as f:
                    data = f.read()
            except (IOError, OSError, UnicodeDecodeError):
                print('open %s Error' % file_name)
                continue
            # The pattern has no groups, so findall() returns the full text of every match
            words = word_re.findall(data)
            for word in words:
                if word not in word_dict:
                    word_dict[word] = 1  # first occurrence: start the count at 1
                else:
                    word_dict[word] += 1
    # t[0] would sort by the word (key); t[1] sorts by the count (value)
    result_list = sorted(word_dict.items(), key=lambda t: t[1], reverse=True)
    for key, value in result_list:
        print('word', key, 'appears %d times' % value)

if __name__ == '__main__':
    find_word('.')

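Result (as printed by Python 2, where print with a parenthesized argument list displays a tuple; words with equal counts come out in no particular order):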

('word', 'beijing', 'appears 2 times')
('word', 'love', 'appears 2 times')
('word', 'i', 'appears 2 times')
('word', 'you', 'appears 2 times')
('word', 'a', 'appears 1 times')
('word', 'good', 'appears 1 times')
('word', 'is', 'appears 1 times')
('word', 'day', 'appears 1 times')
('word', 'world', 'appears 1 times')
('word', 'hello', 'appears 1 times')
('word', 'today', 'appears 1 times')
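
The task statement asks for a word per diary entry, while the script above totals the counts across all files. A minimal sketch of a per-file variant follows, assuming "most important" simply means "most frequent". The function name top_word_per_file and the lowercasing step are illustrative choices, not part of the original script.

```python
import os
import re
from collections import Counter

WORD_RE = re.compile(r'\w+')

def top_word_per_file(dir_path):
    """Return {file name: (word, count)} for every .txt file in dir_path."""
    result = {}
    for file_name in sorted(os.listdir(dir_path)):
        full_path = os.path.join(dir_path, file_name)
        if not (os.path.isfile(full_path) and file_name.endswith('.txt')):
            continue
        with open(full_path, 'r') as f:
            # Lowercase first so that "Rain" and "rain" count as the same word
            counts = Counter(WORD_RE.findall(f.read().lower()))
        if counts:  # skip empty files
            result[file_name] = counts.most_common(1)[0]
    return result

if __name__ == '__main__':
    for name, (word, count) in top_word_per_file('.').items():
        print('%s: %s appears %d times' % (name, word, count))
```

In short entries such as the three sample files, every word appears once, so raw frequency cannot really single out a winner; a stop-word list or a weighting such as TF-IDF would be the usual next step.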

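Difference between sort and sorted:

sort is a method of list, whereas sorted works on any iterable.

list.sort() sorts the existing list in place and returns None, whereas the built-in function sorted() returns a new list and leaves the original untouched.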
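
A short sketch of that difference; the variable names are made up for illustration.

```python
counts = [3, 1, 2]

new_list = sorted(counts)   # returns a new list; counts is left unchanged
assert new_list == [1, 2, 3]
assert counts == [3, 1, 2]

ret = counts.sort()         # sorts counts in place and returns None
assert ret is None
assert counts == [1, 2, 3]

# sorted() accepts any iterable, for example a string
letters = sorted('beijing')
print(letters)
```

A common slip is writing counts = counts.sort(), which replaces the list with None.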

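sorted syntax (Python 2):

sorted(iterable, cmp=None, key=None, reverse=False)

Parameters:

iterable -- an iterable object.
cmp -- a comparison function taking two arguments, both drawn from the iterable. It must return a positive number (e.g. 1) if the first is greater, a negative number (e.g. -1) if it is smaller, and 0 if they are equal. Python 3 removed this parameter; functools.cmp_to_key converts an old-style comparison function into a key.
key -- a function of one argument, called on each element of the iterable, that returns the value to sort by.
reverse -- sort order: reverse = True for descending, reverse = False for ascending (the default).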
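
The parameters can be tried on a small word-count dictionary, as in the sketch below. The sample data and the helper compare_length are invented for illustration; functools.cmp_to_key is the standard way to reuse a cmp-style function where only key is accepted.

```python
from functools import cmp_to_key

word_counts = {'beijing': 2, 'hello': 1, 'love': 2, 'a': 1}

# key + reverse: sort the (word, count) pairs by count, largest first
by_count = sorted(word_counts.items(), key=lambda t: t[1], reverse=True)

# A tuple key breaks ties: count descending, then word ascending
by_count_then_word = sorted(word_counts.items(), key=lambda t: (-t[1], t[0]))

# A cmp-style function returns a negative, zero, or positive number
def compare_length(a, b):
    return (len(a) > len(b)) - (len(a) < len(b))

# cmp_to_key wraps it so it can be passed as key
by_length = sorted(word_counts, key=cmp_to_key(compare_length))

print(by_count_then_word)
print(by_length)
```

The tuple key makes the order of tied words deterministic, which the plain t[1] key does not guarantee on older Python versions where dict order is arbitrary.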

posted @ 2018-03-14 15:00 酱测
