Google Python course basic exercise: wordcount

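Having covered Python's basic data types and syntax, this post combines list, tuple, str, dict, sorting, and file handling to solve the following problem.

The problem statement and code are as follows: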

# -*- coding: cp936 -*-
"""
1. For the --count flag, implement a print_words(filename) function that counts
how often each word appears in the text and prints:
word1 count1
word2 count2
...

Print the above list in order sorted by word. Store all the words as lowercase,
so 'The' and 'the' count as the same word.

2. For the --topcount flag, implement a print_top(filename) which is similar
to print_words() but which prints just the top 20 most common words sorted
so the most common word is first, then the next most common, and so on.
"""
import sys


def word_count(filename):
    """Helper shared by both flags: return a dict of {word: count}."""
    dict_1 = {}
    with open(filename, 'r') as f:   # the with block closes the file automatically
        # f.read() returns the whole file as one string, lower() folds case,
        # and split() breaks on any whitespace, giving a list of words.
        text = f.read().lower().split()
    for word in text:                # tally each word in dict_1
        if word in dict_1:
            dict_1[word] += 1
        else:
            dict_1[word] = 1
    return dict_1


def print_words(filename):
    dict_2 = word_count(filename)
    # Common mistake: sorted() returns a list of keys, not a dict.
    for k in sorted(dict_2.keys()):
        print('%s %s' % (k, dict_2[k]))


def print_top(filename):
    dict_2 = word_count(filename)

    def get_value(tup):
        return tup[-1]

    # items() gives (word, count) tuples; sort by count, most common first.
    dict_3 = sorted(dict_2.items(), key=get_value, reverse=True)
    for k, v in dict_3[:20]:
        print('%s %s' % (k, v))


def main():
    # sys.argv[0] is the script itself, sys.argv[1] is the option flag,
    # and sys.argv[2] is the filename.
    if len(sys.argv) != 3:
        print('usage: ./wordcount.py {--count | --topcount} file')
        sys.exit(1)

    option = sys.argv[1]
    filename = sys.argv[2]
    if option == '--count':
        print_words(filename)
    elif option == '--topcount':
        print_top(filename)
    else:
        print('unknown option: ' + option)
        sys.exit(1)


if __name__ == '__main__':
    main()
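
The two sorting calls in the listing can be tried on a small dictionary. The following is only a minimal sketch: the sample dictionary `counts` and the helper name `by_count` are made up for illustration, and the counts are not taken from word.txt. It shows that `sorted()` returns a list in both cases, and that `Counter.most_common()` from the standard library's `collections` module gives the same ranking as sorting `items()` by count.

```python
from collections import Counter

# Hypothetical sample data standing in for the result of word_count().
counts = {'the': 5, 'sexual': 3, 'of': 4, 'children': 2}


def by_count(pair):
    """Return the count from a (word, count) tuple."""
    return pair[1]


# sorted() over a dict iterates its keys and returns a new list of words.
words_sorted = sorted(counts)

# items() yields (word, count) tuples; sort them by count, largest first.
pairs_sorted = sorted(counts.items(), key=by_count, reverse=True)

# Counter.most_common(n) returns the n highest (word, count) tuples.
top_two = Counter(counts).most_common(2)

print(words_sorted)
print(pairs_sorted)
print(top_two)
```

Using `Counter` would replace both the manual tally and the custom sort key, at the cost of hiding the dict and sorting mechanics that the exercise is meant to practise.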

[Screenshots: program output, shown in two parts]

Contents of word.txt:

Campaigners say the increasing sexualisation of society, 

fuelled by easy access to internet pornography, is behind 

the disturbing figures.

Only yesterday the National Union of Teachers warned that 

sexual equality has been ‘rebranded by big business’ into 

a ‘raunch culture’ which is damaging the way girls view 

themselves.

Playboy bunnies adorn children’s pencil cases, pole dancing 

is portrayed as an ‘empowering’ form of exercise and 

beauty pageants have become a staple of student life, 

delegates said.

Statistics from the Department for Education show that in 

2009/10, there were 3,330 exclusions for sexual misconduct. 

In 2010/11, a further 3,030 children were excluded for the 

same reason.

The 6,000-plus cases cover lewd behaviour, sexual abuse, 

assault, bullying, daubing sexual graffiti, and sexual 

harassment.

The 2010/11 total includes 200 exclusions from primary 

schools: 190 suspensions and ten expulsions.

There have been warnings that the number of expulsions may 

only hint at the true scale of the problem.

England’s deputy children’s commissioner has told MPs that 

head teachers are reluctant to tackle sexual exploitation of 

children for fear of the message it will send out about 

their schools. 

Sue Berelowitz said schools were not facing up to the fact 

that some bullying amounts to sexual violence.

PS:

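  1. In essence, the task is to compute word statistics for a given file and report the top 20. This is one of the most basic functions in web text retrieval.
  2. Parts 1 and 2 of the problem share one step: getting the word/count pairs from the file. That is why the helper function word_count() is factored out. It covers reading a text file, splitting the text, and counting words.
  3. Details: sorted(dict.keys()) returns a list, not a dict; dict.items() gives the dict's contents as (key, value) tuples (a list of them in Python 2).
  4. Room for improvement: comparing word.txt with the '--count' output shows that the word counting basically works. Interestingly, for 'rebranded by big business', 'empowering', 'raunch culture', and children's in the text, split() treats punctuation attached to a word as part of that word. This clearly distorts the counts and rankings of those words, so the function should also handle punctuation and other non-alphanumeric characters.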
    # Strip one leading and one trailing non-alphanumeric character, if present.
    if word and not word[0].isalnum():
        word = word[1:]
    if word and not word[-1].isalnum():
        word = word[:-1]

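    Adding the statements above inside the loop in word_count() is enough. Note that this preserves words with punctuation in the middle, such as children's and 2010/11, while cleaning up words like 'empowering.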
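
The snippet in item 4 removes at most one character from each end, so a token with two symbols on one side would keep one of them. A minimal sketch of a more thorough variant follows. The names `strip_edges` and `count_words` are hypothetical and not part of the original script, and stripping every leading and trailing non-alphanumeric character is an assumption about the desired cleanup, not the exercise's reference solution.

```python
def strip_edges(word):
    """Remove every leading and trailing non-alphanumeric character."""
    start, end = 0, len(word)
    while start < end and not word[start].isalnum():
        start += 1
    while end > start and not word[end - 1].isalnum():
        end -= 1
    return word[start:end]


def count_words(text):
    """Return a dict mapping each cleaned, lowercased word to its count."""
    counts = {}
    for token in text.lower().split():
        word = strip_edges(token)
        if word:  # skip tokens that were punctuation only
            counts[word] = counts.get(word, 0) + 1
    return counts


# Made-up sample line reusing a few tokens from word.txt.
sample = "'empowering' form of exercise, children's cases -- 2010/11, Exercise."
result = count_words(sample)
for word in sorted(result):
    print('%s %d' % (word, result[word]))
```

Punctuation inside a word is left alone, so children's and 2010/11 survive intact, while a token made only of symbols is dropped instead of being counted as an empty word.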

posted @ 2013-04-02 17:14  Emma437211