现代软件工程HW1：词频统计

　　作业详细要求：http://www.cnblogs.com/denghp83/p/8627840.html

　　基本功能

　　1. 统计文件的字符数（只需要统计Ascii码，汉字不用考虑，换行符不用考虑,'\0'不用考虑）（ascii码大小在[32,126]之间）

　　2. 统计文件的单词总数

　　3. 统计文件的总行数（任何字符构成的行，都需要统计）（不要只看换行符的数量，要小心最后一行没有换行符的情形）（空行算一行）

　　4. 统计文件中各单词的出现次数，输出频率最高的10个。

　　5. 对给定文件夹及其递归子文件夹下的所有文件进行统计

　　6. 统计两个单词（词组）在一起的频率，输出频率最高的前10个。

　　7. 在Linux系统下，进行性能分析，过程写到blog中（附加题）

　　注意：

　　a) 空格，水平制表符，换行符，均算字符

　　b) 单词的定义：至少以4个英文字母开头，跟上字母数字符号，单词以分隔符分割，不区分大小写。

　　英文字母：A-Z，a-z

　　字母数字符号：A-Z，a-z，0-9

　　分割符：空格，非字母数字符号

　　例如：”file123”是一个单词，”123file”不是一个单词。file，File和FILE是同一个单词。

　　如果两个单词只有最后的数字结尾不同，则认为是同一个单词，例如，windows，windows95和windows7是同一个单词，iPhone4和IPhone5是同一个单词，但是，windows和windows32a是不同的单词，因为他们不是仅有数字结尾不同

输出按字典顺序，例如，windows95，windows98和windows2000同时出现时，输出windows2000

单词长度只需要考虑[4, 1024]，超出此范围的不用统计。

　　c)词组的定义：windows95 good， windows2000 good123，可以算是同一种词组。按照词典顺序输出。三词相同的情形，比如good123 good456 good789，根据定义，则是 good123 　　　good123 这个词组出现了两次。

　　两个合法单词之间，出现一个非法字符串，比如：windows2000 abc good123，因为abc按照定义不是单词，因此这个词组其实是windows2000 good123，中间的abc当做分隔符看待。

　　good123 good456 good789这种情况，由于这三个单词与good123都是同一个词，最终统计结果是good123 good123这个词组出现了2次。

　　两个单词分属两行，也可以直接组成一个词组。统计词组，只看顺序上，是否相邻。

　　d) 输入文件名以命令行参数传入。需要遍历整个文件夹时，则要输入文件夹的路径。

　　e) 输出文件result.txt

　　characters: number

　　words: number

　　lines: number

　　<word>: number

　　<word>为文件中真实出现的单词大小写格式，例如，如果文件中只出现了File和file，程序不应当输出FILE，且<word>按字典顺序（基于ASCII）排列，上例中程序应该输出File: 2

　　f) 根据命令行参数判断是否为目录

　　g) 将所有文件中的词汇，进行统计，最终只输出一个整体的词频统计结果。

　　PSP表格：

一、需求分析

　　①从宏观到微观，先忽略对一个文件的具体操作，发现作业要求中的字符统计等功能均需要对文件夹中的每个文件进行遍历，故优先实现文件夹遍历读取文件功能。采用深度优先方式遍历文件，具体方法参考 https://blog.csdn.net/qq289665044/article/details/48623325 。

　　②考虑细节问题，对文件调用函数进行读操作，读入字符进行处理。考虑到C++中getline较方便，初步决定使用getline获取文件中一行行的内容，再进行相应处理。

　　③考虑遍历需求，应当设定结构体并进行相关遍历。考虑到需求中“既要保留字典序最小的原型，又要忽略大小写和无关后缀进行统计”，结构体中应该至少包含原型、重复次数、标准型三个变量。进一步考虑发现，对一组n个标准型相同的单词，有n-1个其实没必要保存，故考虑通过unordered_map实现，将标准型作为KEY，可以节省空间。

　　④考虑测试需求，应学习单元测试、Ubuntu下测试等知识。

　　⑤综合上述需求，需要学习C++相关内容，包括文件操作、输出、自带命令行编译等，需要学习vs下性能测试，还需要熟悉linux下的各种操作。

二、具体设计

　　①设计两个类：Word和Phrase，Word中包含重复次数、原型，Phrase中包含重复次数、词1、词2，用作数据存储。词和字符串的处理一律采用string类。

　　②配置两个全局Map，其KEY为toupper标准化的单词原型，例如ABcde234的标准型是ABCED，abdfe./.qwert 的标准型是 ABDFE QWERT，用作全局存储。Phrase的KEY设计为“词1”+“ ”+“词2”

　　③思路：遍历文件，逐行读入，调用Operate函数处理这一行字符串，按规矩存入和统计，再由排序函数按KEY值（标准型）排出存于Map中的TOP10。

　　④测试方法：设计空文件夹、一个文件、多个文件、文件夹中包含文件夹等等样例测试，并使用grof和vs中的性能测试进行调试。

三、具体编码

　　此处直接放上Windows版和Linux版的源码，其中Linux版参考了某赵姓大佬的建议和 https://blog.csdn.net/angle_birds/article/details/8503039.

　　Windows:

  1 //#include "WordsFrequency.h"
  2 #include<iostream>
  3 #include<unordered_map>
  4 #include<string>
  5 #include<fstream>
  6 #include<algorithm>
  7 #include<time.h>
  8 #include<iomanip>
  9 #include<io.h> //if Windows, use this and dfsFolder()
 10 //#include<dirent.h>//if linux, use this and traverseFile() and 
 11 
 12 using namespace std;
 13 class Word {
 14 public:    
 15     string value;
 16     int repeatTimes;
 17 public:
 18     Word();
 19     ~Word();
 20 };
 21 Word::Word() {
 22     this->value = "";
 23     this->repeatTimes = 0;
 24 }
 25 Word::~Word(){}
 26 
 27 class Phrase {
 28 public:
 29     string firstWord;
 30     string secondWord;
 31     int repeatTimes;
 32 public:
 33     Phrase();
 34     ~Phrase();
 35 };
 36 Phrase::Phrase() {
 37     this->firstWord = "";
 38     this->secondWord = "";
 39     this->repeatTimes = 0;
 40 }
 41 Phrase::~Phrase(){}
 42 
 43 void dfsFolder(string folderPath, int depth);
 44 int wordType(char n);
 45 bool IsChar(char n);
 46 void wordOperate(string &str);
 47 void wordMapInsert(string word);
 48 vector<Word> wordTopTen();
 49 void phraseMapInsert(Phrase phrase);
 50 vector<Phrase> phraseTopTen();
 51 //above: something move in from WordsFrequency.h 
 52 
 53 string strNew, strOld;
 54 int charSum=0,lineSum=0,wordSum=0;
 55 unordered_map<string, Word> wordMap;
 56 unordered_map<string, Phrase> phraseMap;
 57 
 58 //dfs traverse files in Windows, consult https://blog.csdn.net/qq289665044/article/details/48623325 
 59 void dfsFolder(string folderPath, int depth)
 60 {
 61     ifstream infile;
 62     string strReg;
 63     _finddata_t fileInfo;
 64     string strFind = folderPath + "/*.*";     
 65     long handle = _findfirst(strFind.c_str(), &fileInfo);
 66 
 67     //if fail
 68     if (handle == -1)
 69     {
 70         cout << "cannot match the folder path" << endl;
 71         return ;
 72     }
 73 
 74     do
 75     { 
 76         //prevent to dfs files on its same depth(".") or from its root("..")  
 77         if (fileInfo.attrib == _A_SUBDIR){
 78             int depthTemp = depth;
 79             if (strcmp(fileInfo.name, ".") != 0 && strcmp(fileInfo.name, "..") != 0)    
 80                 dfsFolder(folderPath + '/' + fileInfo.name, depthTemp + 1);
 81         }
 82         else{
 83             infile.open(folderPath + '/' + fileInfo.name, ios::in);
 84             
 85             while (getline(infile, strReg)) {  
 86                 lineSum++;
 87                 wordOperate(strReg);
 88             } 
 89             infile.close();
 90         }
 91     } while (!_findnext(handle, &fileInfo));    
 92                                                   
 93     _findclose(handle);
 94 }
 95 
 96 //traverseFile in Linux
 97 /*
 98 void listDir(char *path){
 99     DIR *pDir; 
100     struct dirent *ent; 
101     int i = 0;
102     char childpath[512]; 
103     pDir = opendir(path); 
104     memset(childpath, 0, sizeof(childpath)); 
105     while ((ent = readdir(pDir)) != NULL)
106         
107     {
108         if (ent->d_type & DT_DIR)
109             
110         {
111             if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
112                 continue;
113             sprintf(childpath, "%s/%s", path, ent->d_name);
114             listDir(childpath);
115         }
116         else
117         {
118             sprintf(childpath, "%s/%s", path, ent->d_name);
119             traverseFile(childpath);
120         }
121     }
122 }
123 */
124 
125 //judge wordType,1:alpha,2:symbol,0:others
126 int wordType(char n) {
127     if ((n >= 65 && n <= 90 )|| (n >= 97 && n <= 122) || (n >= 48 && n <= 57)) {
128         if ((n >= 65 && n <= 90) || (n >= 97 && n <= 122)) return 1;
129         else return 0;
130     }
131     else return 2;
132 }
133 
134 //judge if n is a char.(ascii 32~126)
135 bool IsChar(char n) {
136     if (n >= 32 && n <= 126) return true;
137     return false;
138 }
139 
140 //when getline, operate the string and realize function:find Words' and Phrases' frequency.
141 void wordOperate(string &strx) {
142     Phrase phTemp;
143     size_t length = 1+strx.length();
144     int wordBegin, wordEnd;
145     bool flag = true;
146 
147     strx += " ";//avoid overflow or miss
148 
149     for(int i=0;(size_t)i<length-1;i++)
150         if (IsChar(strx[i])) charSum++;//------charSum----------
151 
152     for (wordBegin = 0; wordType(strx[wordBegin]) == 2 && (size_t)wordBegin < length; wordBegin++) {
153         //if (IsChar(strx[wordBegin])) charSum++;
154     }//find the first alpha
155 
156     for (int it = wordBegin; strx[it] != '\0' && (size_t)it < length; it++) {
157         if (wordType(strx[it])==2) {
158             for (wordEnd = wordBegin; wordEnd - wordBegin < 4 && (size_t)wordEnd < length; wordEnd++) {
159                 if (wordType(strx[wordEnd])!=1) {
160                     flag = false;
161                     break;
162                 }
163             }//when flag==false,it means the thing we find is NOT a word
164             if (wordEnd == length && wordEnd-wordBegin<4) flag = false;//boundary condition
165 
166             if (flag == true) {
167                 wordSum++;
168                 strNew = strx.substr(wordBegin, it - wordBegin);
169                 wordMapInsert(strNew);
170 
171                 if (strOld != "") {
172                     phTemp.firstWord = strOld;
173                     phTemp.secondWord = strNew;
174                     phraseMapInsert(phTemp);
175                 }
176                 strOld = strNew;
177             }//if it's a word, insert into wordMap; if it can be used to build a phrase,
178             //insert into phraseMap, update strNew
179             flag = true;
180             wordBegin = it + 1;
181         }
182     }
183 }
184 
185 //deal with word and insert into wordMap
186 void wordMapInsert(string word) {
187     if (wordType(word[0])!=1)    return;
188     
189     string wordTemp = word;
190     string::iterator it=word.end();
191 
192     while (wordType(*it) != 1) it--; //find alpha
193 
194     word.erase(it + 1, word.end());//throw tails                     
195         
196     transform(word.begin(), word.end(), word.begin(), ::toupper); 
197     wordMap[word].repeatTimes++;//transform to upper and repeatTimes++               
198 
199     if (wordMap[word].value == "" || wordMap[word].value.compare(wordTemp)>0) 
200             wordMap[word].value = wordTemp;
201             //if the origin shape of word < value, update value=word.originShape
202 }
203 
204 //rank words by repeatTimes ,rank 10
205 vector<Word>wordTopTen() {
206     vector<Word> rankTen(10);//at first, rankTen[i].repeatTimes=0
207     unordered_map<string,Word>::iterator it = wordMap.begin();
208 
209     //insert rank to select top10
210     for (; it != wordMap.end(); it++){
211         if (it->second.repeatTimes > rankTen[9].repeatTimes) {
212             if (it->second.repeatTimes > rankTen[0].repeatTimes)
213                 rankTen.insert(rankTen.begin(), it->second);
214             else {
215                 for (int i = 1; i <= 9; i++) {
216                     if ((it->second.repeatTimes >= rankTen[i].repeatTimes) && (it->second.repeatTimes <= rankTen[i - 1].repeatTimes)) {
217                         rankTen.insert(rankTen.begin() + i, it->second);
218                         break;
219                     }
220                 }
221             }
222         }
223     }
224     rankTen.erase(rankTen.begin() + 10, rankTen.end());
225     return rankTen;
226 }
227 
228 //deal with phrase and insert into phraseMap
229 void phraseMapInsert(Phrase phrase) {
230     if (wordType(phrase.firstWord[0])!=1 || wordType(phrase.secondWord[0])!=1)    return; 
231 
232     Phrase phTemp;
233     phTemp.firstWord = phrase.firstWord;
234     phTemp.secondWord = phrase.secondWord;
235 
236     string::iterator it1 = phrase.firstWord.end();
237     string::iterator it2 = phrase.secondWord.end();
238     it1--; it2--;
239     while (wordType(*it1)!=1) {
240         it1--;
241     };                                                               
242     while (wordType(*it2)!=1) {
243         it2--;
244     };
245     phrase.firstWord.erase(it1 + 1, phrase.firstWord.end());                                           
246     phrase.secondWord.erase(it2 + 1, phrase.secondWord.end());//同wordMapInsert中的操作,对两个单词分别去除尾部无关字符
247 
248     transform(phrase.firstWord.begin(), phrase.firstWord.end(), phrase.firstWord.begin(), ::toupper);         
249     transform(phrase.secondWord.begin(), phrase.secondWord.end(), phrase.secondWord.begin(), ::toupper);
250     string upperStr = phrase.firstWord +" "+ phrase.secondWord;
251     phraseMap[upperStr].repeatTimes++;//转化为大写，统计词频
252 
253     if (phraseMap[upperStr].firstWord == "" || (phraseMap[upperStr].firstWord + phraseMap[upperStr].secondWord).compare(phTemp.firstWord + phTemp.secondWord)>0) {
254         phraseMap[upperStr].firstWord = phTemp.firstWord;
255         phraseMap[upperStr].secondWord = phTemp.secondWord;
256     }//词组原型与map中存的value比较，留下字典序较小的                                                          
257 }
258 
259 //rank phrases by repeatTimes ,rank 10
260 vector<Phrase> phraseTopTen() {
261     vector<Phrase> rankTen(10);//at first, rankTen[i].repeatTimes=0
262     unordered_map<string, Phrase>::iterator it = phraseMap.begin();
263 
264     //insert rank to select top10
265     for (; it != phraseMap.end();it++) {
266         if (it->second.repeatTimes > rankTen[9].repeatTimes) {
267             if (it->second.repeatTimes > rankTen[0].repeatTimes)
268                 rankTen.insert(rankTen.begin(), it->second);
269             else {
270                 for (int i = 1; i <= 9; i++) {
271                     if ((it->second.repeatTimes >= rankTen[i].repeatTimes) && (it->second.repeatTimes <= rankTen[i - 1].repeatTimes)) {
272                         rankTen.insert(rankTen.begin() + i, it->second);
273                         break;
274                     }
275                 }
276             }
277         }
278     }
279 
280     rankTen.erase(rankTen.begin() + 10, rankTen.end());
281     return rankTen;
282 }
283 
284 int main(int argc, char * argv[]) {
285     string fileName=argv[1];
286     ofstream fout;
287     fout.open("result.out", ios::out);
288     //printf("please enter the filePath:\n");
289     //getline(cin, fileName);
290     cout << "wait a moment plz....." << endl;
291 
292     double timeSum;
293     clock_t tStart = clock();
294     dfsFolder(fileName, 0);
295     vector<Word> word= wordTopTen();
296     vector<Phrase> phrase = phraseTopTen();
297     timeSum =(double)(clock() - tStart) / CLOCKS_PER_SEC;
298 
299     //printf("NumOfChar：%d\n", charSum);
300     fout << "NumOfChar:" << charSum << endl;
301     //printf("NumOfWord：%d\n", wordSum);
302     fout << "NumOfWord:" << wordSum << endl;
303     //printf("NumOfLine：%d\n\n", lineSum);
304     fout << "NumOfLine:" << lineSum << endl;
305     //printf("Top10 Words:\n");
306     fout << "Top10 Words:" << endl;
307 
308     for (int i = 0; i < 10; i++) {
309         fout << setw(12) << word[i].value << " " << word[i].repeatTimes << endl;
310     }
311     fout << endl;
312     fout << "Top10 Phrases:" << endl;
313     for (int i = 0; i < 10; i++) {
314         fout << setw(12) << phrase[i].firstWord << " " << setw(12) << phrase[i].secondWord << " " << phrase[i].repeatTimes << endl;
315     }
316     fout << endl;
317     fout << "TimeSum:" << setprecision(4)<< timeSum <<"S"<< endl;
318     fout.close();
319     /*for (int i = 0; i < 10; i++) {
320         cout << setw(12)<< word[i].value <<" "<< word[i].repeatTimes << endl;
321     }
322     printf("\nTop10 Phrases:\n");
323     for (int i = 0; i < 10; i++) {
324         cout << setw(12)<< phrase[i].firstWord << " " <<setw(12)<< phrase[i].secondWord <<" "<<phrase[i].repeatTimes<< endl;
325     }
326     printf("\nTimeSum：%.2fs\n", timeSum);*/
327 
328 }

View Code

　　Linux:

  1 //#include "WordsFrequency.h"
  2 #include<iostream>
  3 #include<unordered_map>
  4 #include<string>
  5 #include<fstream>
  6 #include<algorithm>
  7 #include<time.h>
  8 #include<iomanip>
  9 #include<string.h>
 10 //#include<io.h> //if Windows, use this and dfsFolder()
 11 #include<dirent.h>//if linux, use this and traverseFile() and 
 12 #include<sys/stat.h>
 13 
 14 using namespace std;
 15 class Word {
 16 public:    
 17     string value;
 18     int repeatTimes;
 19 public:
 20     Word();
 21     ~Word();
 22 };
 23 Word::Word() {
 24     this->value = "";
 25     this->repeatTimes = 0;
 26 }
 27 Word::~Word(){}
 28 
 29 class Phrase {
 30 public:
 31     string firstWord;
 32     string secondWord;
 33     int repeatTimes;
 34 public:
 35     Phrase();
 36     ~Phrase();
 37 };
 38 Phrase::Phrase() {
 39     this->firstWord = "";
 40     this->secondWord = "";
 41     this->repeatTimes = 0;
 42 }
 43 Phrase::~Phrase(){}
 44 
 45 void dfsFolder(string folderPath, int depth);
 46 void dfsFolderLinux(string folderPath);
 47 int wordType(char n);
 48 bool IsChar(char n);
 49 void wordOperate(string &str);
 50 void wordMapInsert(string word);
 51 vector<Word> wordTopTen();
 52 void phraseMapInsert(Phrase phrase);
 53 vector<Phrase> phraseTopTen();
 54 //above: something move in from WordsFrequency.h 
 55 
 56 string strNew, strOld;
 57 int charSum=0,lineSum=0,wordSum=0;
 58 unordered_map<string, Word> wordMap;
 59 unordered_map<string, Phrase> phraseMap;
 60 
 61 //dfs traverse files in Windows, consult https://blog.csdn.net/qq289665044/article/details/48623325 
 62 /*
 63 void dfsFolder(string folderPath, int depth)
 64 {
 65     ifstream infile;
 66     string strReg;
 67     _finddata_t fileInfo;
 68     string strFind = folderPath + "/*.*";     
 69     long handle = _findfirst(strFind.c_str(), &fileInfo);
 70 
 71     //if fail
 72     if (handle == -1)
 73     {
 74         cout << "cannot match the folder path" << endl;
 75         return ;
 76     }
 77 
 78     do
 79     { 
 80         //prevent to dfs files on its same depth(".") or from its root("..")  
 81         if (fileInfo.attrib == _A_SUBDIR){
 82             int depthTemp = depth;
 83             if (strcmp(fileInfo.name, ".") != 0 && strcmp(fileInfo.name, "..") != 0)    
 84                 dfsFolder(folderPath + '/' + fileInfo.name, depthTemp + 1);
 85         }
 86         else{
 87             infile.open(folderPath + '/' + fileInfo.name, ios::in);
 88             
 89             while (getline(infile, strReg)) {  
 90                 lineSum++;
 91                 wordOperate(strReg);
 92             } 
 93             infile.close();
 94         }
 95     } while (!_findnext(handle, &fileInfo));    
 96                                                   
 97     _findclose(handle);
 98 }
 99 */
100 
101 //traverseFile in Linux
102 void dfsFolderLinux(string folderPath)
103 {
104     DIR *dir_ptr;
105     struct stat infobuf;
106     struct dirent *direntp;
107     string name, temp;
108     ifstream infile;
109     string  strReg;
110     if ((dir_ptr = opendir(folderPath.c_str())) == NULL)
111         perror("can not open");
112     else
113     {
114         while ((direntp = readdir(dir_ptr)) != NULL)
115         {
116             temp = "";
117             name = direntp->d_name;
118             if (name == "." || name == "..")
119             {
120                 ;
121             }
122             else
123             {
124                 temp += folderPath;
125                 temp += "/";
126                 temp += name;
127  
128                 if ((stat(temp.c_str(), &infobuf)) == -1)
129                     printf("#########\n");
130                 if ((infobuf.st_mode & 0170000) == 0040000)
131                 {
132 
133                     dfsFolderLinux(temp);
134                 }
135                 else
136                 {  
137                     infile.open(temp, ios::in);
138                     while (getline(infile, strReg)) {
139                         lineSum++;
140                         wordOperate(strReg);
141                     }
142                     infile.close();
143                 }
144             }
145         }
146     }
147     closedir(dir_ptr);
148 }
149 
150 
151 //judge wordType,1:alpha,2:symbol,0:others
152 int wordType(char n) {
153     if ((n >= 65 && n <= 90 )|| (n >= 97 && n <= 122) || (n >= 48 && n <= 57)) {
154         if ((n >= 65 && n <= 90) || (n >= 97 && n <= 122)) return 1;
155         else return 0;
156     }
157     else return 2;
158 }
159 
160 //judge if n is a char.(ascii 32~126)
161 bool IsChar(char n) {
162     if (n >= 32 && n <= 126) return true;
163     return false;
164 }
165 
166 //when getline, operate the string and realize function:find Words' and Phrases' frequency.
167 void wordOperate(string &strx) {
168     Phrase phTemp;
169     size_t length = 1+strx.length();
170     int wordBegin, wordEnd;
171     bool flag = true;
172 
173     strx += " ";//avoid overflow or miss
174 
175     for(int i=0;(size_t)i<length-1;i++)
176         if (IsChar(strx[i])) charSum++;//------charSum----------
177 
178     for (wordBegin = 0; wordType(strx[wordBegin]) == 2 && (size_t)wordBegin < length; wordBegin++) {
179         //if (IsChar(strx[wordBegin])) charSum++;
180     }//find the first alpha
181 
182     for (int it = wordBegin; strx[it] != '\0' && (size_t)it < length; it++) {
183         if (wordType(strx[it])==2) {
184             for (wordEnd = wordBegin; wordEnd - wordBegin < 4 && (size_t)wordEnd < length; wordEnd++) {
185                 if (wordType(strx[wordEnd])!=1) {
186                     flag = false;
187                     break;
188                 }
189             }//when flag==false,it means the thing we find is NOT a word
190             if (wordEnd == length && wordEnd-wordBegin<4) flag = false;//boundary condition
191 
192             if (flag == true) {
193                 wordSum++;
194                 strNew = strx.substr(wordBegin, it - wordBegin);
195                 wordMapInsert(strNew);
196 
197                 if (strOld != "") {
198                     phTemp.firstWord = strOld;
199                     phTemp.secondWord = strNew;
200                     phraseMapInsert(phTemp);
201                 }
202                 strOld = strNew;
203             }//if it's a word, insert into wordMap; if it can be used to build a phrase,
204             //insert into phraseMap, update strNew
205             flag = true;
206             wordBegin = it + 1;
207         }
208     }
209 }
210 
211 //deal with word and insert into wordMap
212 void wordMapInsert(string word) {
213     if (wordType(word[0])!=1)    return;
214     
215     string wordTemp = word;
216     string::iterator it=word.end();
217 
218     while (wordType(*it) != 1) it--; //find alpha
219 
220     word.erase(it + 1, word.end());//throw tails                     
221         
222     transform(word.begin(), word.end(), word.begin(), ::toupper); 
223     wordMap[word].repeatTimes++;//transform to upper and repeatTimes++               
224 
225     if (wordMap[word].value == "" || wordMap[word].value.compare(wordTemp)>0) 
226             wordMap[word].value = wordTemp;
227             //if the origin shape of word < value, update value=word.originShape
228 }
229 
230 //rank words by repeatTimes ,rank 10
231 vector<Word>wordTopTen() {
232     vector<Word> rankTen(10);//at first, rankTen[i].repeatTimes=0
233     unordered_map<string,Word>::iterator it = wordMap.begin();
234 
235     //insert rank to select top10
236     for (; it != wordMap.end(); it++){
237         if (it->second.repeatTimes > rankTen[9].repeatTimes) {
238             if (it->second.repeatTimes > rankTen[0].repeatTimes)
239                 rankTen.insert(rankTen.begin(), it->second);
240             else {
241                 for (int i = 1; i <= 9; i++) {
242                     if ((it->second.repeatTimes >= rankTen[i].repeatTimes) && (it->second.repeatTimes <= rankTen[i - 1].repeatTimes)) {
243                         rankTen.insert(rankTen.begin() + i, it->second);
244                         break;
245                     }
246                 }
247             }
248         }
249     }
250     rankTen.erase(rankTen.begin() + 10, rankTen.end());
251     return rankTen;
252 }
253 
254 //deal with phrase and insert into phraseMap
255 void phraseMapInsert(Phrase phrase) {
256     if (wordType(phrase.firstWord[0])!=1 || wordType(phrase.secondWord[0])!=1)    return; 
257 
258     Phrase phTemp;
259     phTemp.firstWord = phrase.firstWord;
260     phTemp.secondWord = phrase.secondWord;
261 
262     string::iterator it1 = phrase.firstWord.end();
263     string::iterator it2 = phrase.secondWord.end();
264     it1--; it2--;
265     while (wordType(*it1)!=1) {
266         it1--;
267     };                                                               
268     while (wordType(*it2)!=1) {
269         it2--;
270     };
271     phrase.firstWord.erase(it1 + 1, phrase.firstWord.end());                                           
272     phrase.secondWord.erase(it2 + 1, phrase.secondWord.end());//同wordMapInsert中的操作,对两个单词分别去除尾部无关字符
273 
274     transform(phrase.firstWord.begin(), phrase.firstWord.end(), phrase.firstWord.begin(), ::toupper);         
275     transform(phrase.secondWord.begin(), phrase.secondWord.end(), phrase.secondWord.begin(), ::toupper);
276     string upperStr = phrase.firstWord +" "+ phrase.secondWord;
277     phraseMap[upperStr].repeatTimes++;//转化为大写，统计词频
278 
279     if (phraseMap[upperStr].firstWord == "" || (phraseMap[upperStr].firstWord + phraseMap[upperStr].secondWord).compare(phTemp.firstWord + phTemp.secondWord)>0) {
280         phraseMap[upperStr].firstWord = phTemp.firstWord;
281         phraseMap[upperStr].secondWord = phTemp.secondWord;
282     }//词组原型与map中存的value比较，留下字典序较小的                                                          
283 }
284 
285 //rank phrases by repeatTimes ,rank 10
286 vector<Phrase> phraseTopTen() {
287     vector<Phrase> rankTen(10);//at first, rankTen[i].repeatTimes=0
288     unordered_map<string, Phrase>::iterator it = phraseMap.begin();
289 
290     //insert rank to select top10
291     for (; it != phraseMap.end();it++) {
292         if (it->second.repeatTimes > rankTen[9].repeatTimes) {
293             if (it->second.repeatTimes > rankTen[0].repeatTimes)
294                 rankTen.insert(rankTen.begin(), it->second);
295             else {
296                 for (int i = 1; i <= 9; i++) {
297                     if ((it->second.repeatTimes >= rankTen[i].repeatTimes) && (it->second.repeatTimes <= rankTen[i - 1].repeatTimes)) {
298                         rankTen.insert(rankTen.begin() + i, it->second);
299                         break;
300                     }
301                 }
302             }
303         }
304     }
305 
306     rankTen.erase(rankTen.begin() + 10, rankTen.end());
307     return rankTen;
308 }
309 
310 int main(int argc, char * argv[]) {
311     string fileName=argv[1];
312     ofstream fout;
313     fout.open("result.out", ios::out);
314     //printf("please enter the filePath:\n");
315     //getline(cin, fileName);
316     cout << "wait a moment plz....." << endl;
317 
318     double timeSum;
319     clock_t tStart = clock();
320     //dfsFolder(fileName, 0);
321     dfsFolderLinux(fileName);
322     vector<Word> word= wordTopTen();
323     vector<Phrase> phrase = phraseTopTen();
324     timeSum =(double)(clock() - tStart) / CLOCKS_PER_SEC;
325 
326     //printf("NumOfChar：%d\n", charSum);
327     fout << "NumOfChar:" << charSum << endl;
328     //printf("NumOfWord：%d\n", wordSum);
329     fout << "NumOfWord:" << wordSum << endl;
330     //printf("NumOfLine：%d\n\n", lineSum);
331     fout << "NumOfLine:" << lineSum << endl;
332     //printf("Top10 Words:\n");
333     fout << "Top10 Words:" << endl;
334     for (int i = 0; i < 10; i++) {
335         fout << setw(12) << word[i].value << " " << word[i].repeatTimes << endl;
336     }
337     fout << endl;
338     fout << "Top10 Phrases:" << endl;
339     for (int i = 0; i < 10; i++) {
340         fout << setw(12) << phrase[i].firstWord << " " << setw(12) << phrase[i].secondWord << " " << phrase[i].repeatTimes << endl;
341     }
342     fout << endl;
343     fout << "TimeSum:" << setprecision(4)<< timeSum <<"S"<< endl;
344     fout.close();
345 }

View Code

　　相关注释应该还算足够，其实原版中文注释更适当一些，考虑到需求里要求代码内不得含有中文字符，只好一一翻成英文。

四、测试

I 单元测试

　　采用VS中自带的单元测试框架进行测试，结合断点调试及变量观察，保证每个函数功能符合预期。

II 白盒测试

　　①读入空文件夹、只含一个空文件。

　　②读入含有单字符的文件、含有一个单词的文件、含有词组的文件。（右侧为用于测试的文件，带t前缀）

　　③读入含有多行的单文件。

　　④读入多文件

　　⑤读入带子文件夹的多文件

III 黑盒测试

　　读入助教提供的文件newsample，结果中TOP10的单词和词组与预期结果一致，字符数、单词数和行数有较小的不同。

四、性能分析：

Windows下：采用vs2015自带性能分析工具进行热行分析，图如下：

　　可以看到，递归调用的文件遍历行数理所当然地占了几乎所有时间，毕竟它是程序基石。同时，wordOperate函数作为总操作的顶层函数，也占用了相当多时间。继续。

　　可见phraseMapInsert和wordMapInsert占去多数资源，再进一步深入。

　　可见这两个函数中，耗时最多的热行是unordered_map的生成和string的比较，经考虑，string的遍历比较和map结点的生成，暂时无法缺省，这两部分耗时可以看做设计时选择数据结构unordered_map和string的代价。

　　这份代价正是程序的主要时间开销，由此，我短期内很难在不改变数据结构选择的基础上做进一步优化，分析完毕。

Linux下：gprof 导出的函数耗时表如下：

　　后边基本CPU耗时占比都比较小。经分析可以看出，Hashtable的生成占去主要时间，与Window下性能分析的结果基本一致。

五、总结与反思

　　由于课业繁忙，本次作业虽限时一周，我却从本周一才开始需求分析，周一晚上才开始正式敲代码，实际开发时间差不多是两天半里见缝插针的三十几小时，周三晚上还熬了一个通宵，期间遇到的各种困难包括数据结构不熟悉、算法设计考虑不周、对单元测试和性能调试不了解、Linux下操作不熟甚至Ubuntu版本落后需要重装等等，靠着组里各路大佬的鼎力相助，总算是在ddl之前交上了，感慨万千。

　　收获方面，我学到了Windows和Linux下文件遍历的方法，学到了C++中相关数据结构的操作和算法设计，学到了平台移植以及测试、性能分析的方法和技巧，短短三天可以说收获颇丰了。

　　反思这一次作业，我发现自己的编程基础相当不扎实，写排序算法时还特意去查了一下插入排序怎么写，对类自带的各种函数也基本靠着查手册来做，菜的真实，需要多做多练多学。编程过程中，由于对github不熟悉，我在做版本更新改进、性能调整时，都没做commit，先前较失败的版本丢失了，只留下了完成品，今后编程中应该养成及时commit的习惯，以供来日借鉴、预防重蹈覆辙。

　　本程序在Windows10下跑助教的样例，耗时28S，而在Ubuntu16下需要50+S，其中的原因我还没弄清楚；单元测试时，我对那些改变了全局变量的函数进行测试，总是很难达到预期，最后不得不修改函数，因此我准备阅读单元测试相关书籍，系统地学习单元测试方法，向TDD靠拢。

　　值得称道的地方：效率还算高，没有出大差错，在编程中应用了之前读书笔记中提到的简洁代码编程方法，积极地查找各种方法和学习。

　　本次作业总结与反思如上，至此告一段落，博客提交的ddl也快要到了。来日方长，这一次的教训牢记于心，继续前行吧。

posted @ 2018-03-30 23:54 CGYR 阅读(223) 评论(2) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

CGYR

现代软件工程HW1：词频统计

公告