Counting Words in a Document with Python
While working through the challenges on hacker.org, I came across one, Didactic Byte, that asks you to count the words of length 9 in RFC3280 and find the one that occurs most often.
The RFC3280 document is over 7,000 lines long, so counting by hand is out of the question; this is exactly the kind of problem you solve with a program, and a scripting language is clearly the convenient choice here, so I went straight for Python.
At first I misread the problem as asking for the nine most frequently occurring letters, and only realized the mistake when the program had finished and I went to submit the answer. Embarrassing.
Since the program was already written, I might as well post it.
Letter frequency count for a document:
# find the occurrence count of each letter in a document
#import fileinput

test = ""
dicts = {}

# read content from the document
# we could use readline, readlines, etc.
# we could also use fileinput.input(filename)

text = open('RFC3280.txt')
for line in text.readlines():
    test += line.lower().replace(" ", "")

# use a dictionary to store each letter and its occurrence count
for i in xrange(97, 123):
    letter = chr(i)
    count = test.count(letter)
    #print "the count of %c is %d" % (letter, count)
    dicts.setdefault(letter, count)

# sort the dict by values; here a lambda expression is used
# to sort by keys instead:
# sorted_dict = sorted(dicts.iteritems(), key = lambda d: d[0])

sorted_dict = sorted(dicts.iteritems(), key = lambda d: d[1], reverse = True)
print sorted_dict
The code is quite short and the comments are fairly detailed; you can see that a scripting language really is convenient for this kind of task.
Next, let's look at getting the occurrence counts of words of a given length from the document and sorting them:
# this script finds all 9-letter words in a specific document
# and then picks the most common one among them
# since a dictionary is used to store them, the single word I want is not printed out by itself
# a dictionary object differs from a list or an array: its key:value pairs are stored in no particular order

test = ""
storage = []
dicts = {}
word_length = 9

# read content from the document
# since the script reads the document line by line, each line may contain '\n'
# and a sentence ends with '.', both of which need handling with string.replace

text = open('RFC3280.txt')
for line in text.readlines():
    # replace '\n' with a space so words at line breaks are not glued together
    test += line.lower().replace('\n', ' ').replace('.', '')

# convert the string to a list of words
lists = test.split(' ')

# pick the words whose length equals word_length, here 9
for i in xrange(0, len(lists)):
    if len(lists[i]) == word_length:
        storage.append(lists[i])

#print storage
#print len(storage)

# now use dictionary.setdefault to add elements to the dictionary
for n in xrange(0, len(storage)):
    word = storage[n]
    count = storage.count(word)
    dicts.setdefault(word, count)

# sort the dictionary by values
sorted_dict = sorted(dicts.iteritems(), key = lambda d: d[1], reverse = True)
print sorted_dict
The two scripts follow roughly the same idea: read the file content into a single string, line by line, concatenating as you go. The string then needs some cleanup, because the original text contains lots of spaces and newlines, and the '.' at the end of a sentence would otherwise be counted as part of a word; fortunately there are only a few symbols to handle here, otherwise regular expressions would have to be called in. The string is then converted into a list, and list has a count method that counts an element directly. Since the number of occurrences of each element is needed, a dictionary is used to store the "key => value" pairs, with setdefault used to add entries. Note that the code uses a lambda expression for the sort; if you are not familiar with lambda expressions, it is worth looking them up. Unfortunately, after not using Python for a while I only have a vague memory of these details myself, and I don't have the relevant references at hand (the manual is good, but it can be inconvenient when you want to look up some small detail; having bpython installed would probably make this a lot easier).
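Incidentally, the standard library's collections.Counter handles exactly this kind of counting-and-sorting job in one go. Here is a minimal sketch of the 9-letter-word count using it (written for Python 2 to match the code above, and assuming the same RFC3280.txt in the current directory):

# a more concise variant using collections.Counter (Python 2)
from collections import Counter

words = []
with open('RFC3280.txt') as text:
    for line in text:
        # lowercase, drop periods, and split on whitespace
        words.extend(line.lower().replace('.', '').split())

# keep only the 9-letter words and count them
counts = Counter(w for w in words if len(w) == 9)

# most_common() returns (word, count) pairs already sorted by count
print counts.most_common(5)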
The two scripts above give the desired result, but dumping a big pile of dictionary entries in one block is not pleasant to read, so the output formatting is something I plan to tidy up later.
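For instance, instead of printing the whole sorted list at once, you could print just the top few entries, one per line; a small sketch reusing the sorted_dict from the second script:

# print only the ten most frequent 9-letter words, one per line
for word, count in sorted_dict[:10]:
    print "%-15s %d" % (word, count)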
Finally, here is part of the test file's content. The file is opened by name directly because the Python script and the text file sit in the same directory:
This memo profiles the X.509 v3 certificate and X.509 v2 Certificate Revocation List (CRL) for use in the Internet. An overview of this approach and model are provided as an introduction. The X.509 v3 certificate format is described in detail, with additional information regarding the format and semantics of Internet name forms. Standard certificate extensions are described and two Internet-specific extensions are defined. A set of required certificate extensions is specified. The X.509 v2 CRL format is described in detail, and required extensions are defined. An algorithm for X.509 certification path validation is described. An ASN.1 module and examples are provided in the appendices.