Word Frequency Counting: First Individual Assignment for Software Engineering

Word Frequency Counting

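1. Project Requirements and Basic Features

Project requirements

Count the characters, words, lines, and word frequencies of source files (*.txt, *.cpp, *.h, *.cs, *.html, *.js, *.java, *.py, *.php, and so on; every file in the folder), write the results to a default file in the specified format, support the other extended features, and handle many files quickly.
Analyze the program with a performance profiling tool, find the bottleneck, and improve it.
Run code quality analysis and eliminate all warnings: http://msdn.microsoft.com/en-us/library/dd264897.aspx
Design 10 test cases to make sure the program runs correctly (for example: an empty file, a file containing a single word, a file with only one line, a typical file, and so on).
Manage the code with GitHub.
Write a blog post.

Basic features

Count the characters in a file (only ASCII needs to be counted; Chinese characters can be ignored).
Count the total number of words in a file.
Count the total number of lines in a file (a line made of any characters counts).
Count the occurrences of each word in the files and output the 10 most frequent.
Process every file under a given folder and, recursively, its subfolders.
Count how often two words appear together (a phrase) and output the 10 most frequent.
Run the performance analysis on Linux and describe the process in the blog (bonus task).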

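2. PSP Table

Status	Stages	Estimated time/min	Actual time/min
Accept	[Planning]	30	20
Accept	Time estimation	30	20
Accept	[Development]	1330	1910
Accept	Requirements analysis	20	30
Accept	Design document	30	30
Accept	Design review	10	5
Accept	Coding standard	10	5
Accept	Detailed design	60	60
Accept	Coding	600	1000
Accept	Code review	300	300
Accept	Testing	300	480
Accept	[Time logging]	10	10
Accept	[Test report]	30	60
Accept	[Workload calculation]	10	10
Accept	[Postmortem and improvement]	60	60
Accept	[Total]	1470	2090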

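3. Approach

Data structures:

Global variables

unsigned long characterNum; // number of characters
unsigned long wordNum;      // number of words
unsigned long lineNum;      // number of lines

A dynamically allocated array of structs stores each word and its occurrence count
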

struct wordInfo {
    char* wordStr;
    char**  nextWordPoint;
    int*  nextWordFrequency;
    int      presentNextWordNum;
    int   frequency;
    int   strlength;
    int   wordLength; // excludes the trailing digits
};
struct alphaArray {
    wordInfo* wordArray;
    int       presentWordArrayLength;
};

struct wordStatisticsResult {
    char* wordStr;
    int   wordFrequency;
};
struct phaseStatisticsResult {
    char* firstStr;
    char* secondStr;
    int   phaseFrequency;
};


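How the files are traversed:

On Windows, with the _findfirst and _findnext functions, based on a reference example

On Linux, with the readdir function, based on a reference example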

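Implementation plan:

1> Main function:

Initialize the variables
Traverse every file in the given folder
Open each matching file read-only
Count words and phrases
Repeat until every file has been traversed
Close the file
Output the statistics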

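2> Word counting:

Walk through the characters, counting each one
Check whether the character is a newline and count it
Keep a buffer for the consecutive characters of one word
Collect the word string
Compute the word's hash (ELFHash as the hash function, quadratic probing to resolve collisions)
Use the first letter and the hash to pick the word's slot, then store the word information
Store the current word's address in the previous word's struct so that phrase frequencies can be counted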
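
Because the hash is computed only up to the last letter of a word and ignores case, spellings that differ only in case or in trailing digits land in the same slot. The sketch below shows that normalization on its own; normalizeKey is a hypothetical helper and uses std::string rather than the raw buffers of my program.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Build the lookup key for a token: drop trailing digits and lowercase the
// rest, so "File123", "FILE" and "file1" all map to the key "file".
std::string normalizeKey(const std::string& token)
{
    std::size_t end = token.size();
    while (end > 0 && std::isdigit(static_cast<unsigned char>(token[end - 1])))
        --end;
    std::string key;
    key.reserve(end);
    for (std::size_t i = 0; i < end; ++i)
        key.push_back(static_cast<char>(
            std::tolower(static_cast<unsigned char>(token[i]))));
    return key;
}
```

Digits in the middle of a token are kept, so "abc1d2" becomes "abc1d".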

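3> Phrase counting:

Store the word
Return the current word's position in the word table
If it is not the first word
    Get the string pointer from that position
    Look for that pointer in the previous word's struct
    If the pointer exists, increment its count
    If it does not exist, store the pointer with an initial count of 1
Record the position

A final pass over the table then yields the frequency of every phrase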
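
The same idea can be shown in a few lines with a standard container. This is only an illustration of counting adjacent pairs; it swaps my per-word pointer arrays for a std::map, and the names PhraseCounts and countAdjacentPairs are invented for the example.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using PhraseCounts = std::map<std::pair<std::string, std::string>, int>;

// Count every pair of adjacent words in the sequence.
// The first word has no predecessor, so the loop starts at index 1.
PhraseCounts countAdjacentPairs(const std::vector<std::string>& words)
{
    PhraseCounts counts;
    for (std::size_t i = 1; i < words.size(); ++i)
        ++counts[{words[i - 1], words[i]}];
    return counts;
}
```

For the input "hello world hello world again" this yields three phrases, with "hello world" counted twice.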

4. Code Implementation

(1) Initialize the word table

void dictionaryInit(struct alphaArray* dictionary)
{
    int i, j, k;
    characterNum = 0;
    wordNum = 0;
    lineNum = 0;
    for (i = 0; i < alphabet; i++)
    {
        (dictionary + i)->wordArray = (wordInfo*)malloc(sizeof(wordInfo)*wordArrayLength);
        if ((dictionary + i)->wordArray == NULL) exit(-1);
        (dictionary + i)->presentWordArrayLength = wordArrayLength;
        for (j = 0; j < (dictionary + i)->presentWordArrayLength; j++)
        {
            wordInfo* word = (dictionary + i)->wordArray + j;
            word->wordStr = (char*)malloc(sizeof(char)*wordStrLength);
            if (word->wordStr == NULL) exit(-1);
            *(word->wordStr) = '\0';
            word->frequency = 0;
            word->strlength = wordStrLength;
            word->wordLength = 0;
            word->nextWordPoint = (char**)malloc(sizeof(char*)*nextWordNum);
            if (word->nextWordPoint == NULL) exit(-1);
            word->nextWordFrequency = (int*)malloc(sizeof(int)*nextWordNum);
            if (word->nextWordFrequency == NULL) exit(-1);
            for (k = 0; k < nextWordNum; k++)
            {
                *(word->nextWordPoint + k) = NULL;
                *(word->nextWordFrequency + k) = 0;
            }
            word->presentNextWordNum = nextWordNum;
        }
    }
}


Allocate the initial memory and set every value to zero.

(2) Traverse the folder

    a) Windows

void traverseFileandCount(char* filePath, struct alphaArray* dictionary)
{
    _finddata_t FileInfo;
    char* presentPath;
    char* newPath;
    presentPath = (char*)malloc(sizeof(char)*filePathLength);
    if (presentPath == NULL) exit(-1);
    newPath = (char*)malloc(sizeof(char)*filePathLength);
    if (newPath == NULL) exit(-1);
    strcpy_s(presentPath, filePathLength, filePath);
    strcat_s(presentPath, filePathLength, "\\*");
    // _findfirst returns intptr_t; storing it in a long truncates the handle on 64-bit builds
    intptr_t Handle = _findfirst(presentPath, &FileInfo);
    if (Handle == -1) exit(-1);
    do {
        if (FileInfo.attrib & _A_SUBDIR)
        {
            // recurse into subfolders, skipping "." and ".."
            if ((strcmp(FileInfo.name, ".") != 0) && (strcmp(FileInfo.name, "..") != 0))
            {
                generatePath(FileInfo, filePath, newPath);
                traverseFileandCount(newPath, dictionary);
            }
        }
        else
        {
            // an ordinary file: build its full path and count it
            generatePath(FileInfo, filePath, presentPath);
            count(presentPath, dictionary);
        }
    } while (_findnext(Handle, &FileInfo) == 0);
    _findclose(Handle);
    free(presentPath);
    free(newPath);
}


    b) Linux

void traverseFileandCount(char* path, struct alphaArray* dictionary)
{
    DIR *pDir;                   // directory stream
    struct dirent *ent = NULL;   // current directory entry
    char childpath[512];         // buffer for the full path of each entry
    pDir = opendir(path);        // open the directory at path
    if (pDir == NULL) exit(-1);  // same failure handling as the Windows version
    memset(childpath, 0, sizeof(childpath));
    while ((ent = readdir(pDir)) != NULL) // read entries until the directory is exhausted
    {
        // compare with ==, because a bitwise test against DT_DIR (4) also matches other entry types
        if (ent->d_type == DT_DIR)
        {
            // skip the current-directory and parent-directory entries
            if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
                continue;
            snprintf(childpath, sizeof(childpath), "%s/%s", path, ent->d_name);
            // recurse into the subdirectory
            traverseFileandCount(childpath, dictionary);
        }
        else
        {
            // the entry is a file: childpath becomes the path passed to the counting function
            snprintf(childpath, sizeof(childpath), "%s/%s", path, ent->d_name);
            count(childpath, dictionary);
        }
    }
    closedir(pDir); // release the directory stream
}


(3) Counting

void count(char* path, struct alphaArray* dictionary)
{
    FILE* fp;
    bool firstWordSign = 1;
    int i = 0;
    int finalAlphaPosition = 0;
    int tempWordStrLength = wordStrLength;
    int presentWordOffset;
    int ch; // int rather than char, so that EOF can be told apart from a data byte
    char *tempWordStr;
    unsigned long hash;
    struct wordInfo* lastWordInfo = NULL, *presentWordInfo = NULL;
    tempWordStr = (char*)malloc(sizeof(char)*wordStrLength);
    if (tempWordStr == NULL) exit(-1);
    if (fopen_s(&fp, path, "r") != 0) exit(-1);
    do
    {
        ch = fgetc(fp);
        characterNumandLineNum((char)ch);
        if (!isDigitorAlpha((char)ch))
        {
            // a separator ends the current token
            tempWordStr[i] = '\0';
            if (isWord(tempWordStr))
            {
                hash = storeTempWord(dictionary, tempWordStr, finalAlphaPosition);
                getOffset(presentWordOffset, tempWordStr[0]);
                presentWordInfo = ((dictionary + presentWordOffset)->wordArray + hash);
                if (!firstWordSign)
                    storePhaseInfo(lastWordInfo, presentWordInfo);
                lastWordInfo = presentWordInfo;
                firstWordSign = 0;
            }
            i = 0;
            finalAlphaPosition = 0;
            tempWordStr[0] = '\0';
        }
        else
        {
            // leave one byte for the terminating '\0'; extra characters of an over-long token are dropped
            if (i < wordStrLength - 1)
            {
                if (isAlpha((char)ch)) finalAlphaPosition = i;
                tempWordStr[i++] = (char)ch;
            }
            /*if (i >= tempWordStrLength)
            {
                tempWordStrLength *= 2;
                tempWordStr = (char*)realloc(tempWordStr, sizeof(char)*tempWordStrLength);
                if (tempWordStr == NULL) exit(-1);
            }*/
        }
    } while (ch != EOF);
    free(tempWordStr);
    lineNum++; // count the last line of the file
    fclose(fp);
}

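
The token-collecting part of count can be looked at in isolation. The sketch below splits a string into runs of letters and digits, which is the job the buffer does above; it leaves out the isWord check and the statistics, and splitAlnumTokens is a name invented for this example.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split text into maximal runs of letters and digits.
// Any other character acts as a separator and is discarded.
std::vector<std::string> splitAlnumTokens(const std::string& text)
{
    std::vector<std::string> tokens;
    std::string current;
    for (char ch : text)
    {
        if (std::isalnum(static_cast<unsigned char>(ch)))
        {
            current.push_back(ch);
        }
        else if (!current.empty())
        {
            tokens.push_back(current);
            current.clear();
        }
    }
    // A token that runs up to the end of the text has no separator after it.
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

Each token would still have to pass a word test before being stored, as isWord does in my program.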

(4) Store the word information

unsigned long storeTempWord(struct alphaArray* dictionary, char* tempWordArray, int lastAlphaPosition)
{
    unsigned long hash = 0;
    int i = 0, offset;
    char* wordstrPoint;
    struct alphaArray* page;
    hash = ELFHash(tempWordArray, lastAlphaPosition);
    getOffset(offset, tempWordArray[0]); // the first letter selects the page
    page = dictionary + offset;
    hash = hash % (page->presentWordArrayLength);
    //hash=hash%wordArrayLength;
    wordstrPoint = (page->wordArray + hash)->wordStr;
    // probe until an empty slot or the slot of the same word is found
    while (!isEmpty(wordstrPoint) && isDifferent(page->wordArray + hash, tempWordArray, lastAlphaPosition))
    {
        i++;
        if (i > (page->presentWordArrayLength))
        {
            // entries stored earlier are not rehashed after the table grows (see section 7)
            enlargeWordArrayLength(page);
            i = 0;
        }
        hash += i * i;
        hash = hash % (page->presentWordArrayLength);
        wordstrPoint = (page->wordArray + hash)->wordStr;
    }
    /*while ((int)strlen(tempWordArray) >= (page->wordArray + hash)->strlength)
        enlargeStrLength(page, hash);*/
    // truncate a word that does not fit in the slot's string buffer
    if ((int)strlen(tempWordArray) >= (page->wordArray + hash)->strlength)
        *(tempWordArray + (page->wordArray + hash)->strlength - 1) = '\0';
    wordstrPoint = (page->wordArray + hash)->wordStr;
    if (isEmpty(wordstrPoint))
    {
        // the second argument of strcpy_s is the size of the destination buffer
        strcpy_s(wordstrPoint, (page->wordArray + hash)->strlength, tempWordArray);
        (page->wordArray + hash)->wordLength = lastAlphaPosition;
    }
    else
    {
        // keep the spelling that comes first in dictionary order
        if (strcmp(wordstrPoint, tempWordArray) > 0)
            strcpy_s(wordstrPoint, (page->wordArray + hash)->strlength, tempWordArray);
    }
    (page->wordArray + hash)->frequency++;
    wordNum++;
    return hash;
}


(5) Store the phrase information

void storePhaseInfo(struct wordInfo* lastWordInfo, struct wordInfo* presentWordInfo)
{
    int i = 0, k = 0;
    bool stored = 0;
    // scan the used entries; afterwards i is the index of the first free entry
    for (i = 0; i < (lastWordInfo->presentNextWordNum);)
    {
        if ((*(lastWordInfo->nextWordFrequency + i)) != 0)
        {
            if ((*(lastWordInfo->nextWordPoint + i)) == presentWordInfo->wordStr && !stored)
            {
                // this phrase has been seen before
                (*(lastWordInfo->nextWordFrequency + i))++;
                stored = 1;
            }
            else
                i++;
        }
        else
            break;
    }
    // every entry is in use: double both arrays and clear the new half
    if (i == (lastWordInfo->presentNextWordNum))
    {
        lastWordInfo->nextWordPoint = (char**)realloc(lastWordInfo->nextWordPoint, sizeof(char*)*(lastWordInfo->presentNextWordNum) * 2);
        if (lastWordInfo->nextWordPoint == NULL) exit(-1);
        lastWordInfo->nextWordFrequency = (int*)realloc(lastWordInfo->nextWordFrequency, sizeof(int)*(lastWordInfo->presentNextWordNum) * 2);
        if (lastWordInfo->nextWordFrequency == NULL) exit(-1);
        for (k = (lastWordInfo->presentNextWordNum); k < (lastWordInfo->presentNextWordNum)*2; k++)
        {
            *(lastWordInfo->nextWordPoint + k) = NULL;
            *(lastWordInfo->nextWordFrequency + k) = 0;
        }
        (lastWordInfo->presentNextWordNum) *= 2;
    }
    // a new phrase: record the pointer with an initial count of 1
    if (!stored)
    {
        *(lastWordInfo->nextWordPoint + i) = presentWordInfo->wordStr;
        (*(lastWordInfo->nextWordFrequency + i))++;
        stored = 1;
    }
}


(6) ELF hash function

unsigned long ELFHash(char* tempWordArray, int lastAlphaPosition)
{
    unsigned long hash = 0, i = 0, x = 0;
    char *hashStr;
    hashStr = (char*)malloc(sizeof(char)*(lastAlphaPosition + 1));
    if (hashStr == NULL) exit(-1);
    // copy the word up to its last letter, converting uppercase letters to lowercase
    for (i = 0; i <= (unsigned long)lastAlphaPosition; i++)
    {
        if ((tempWordArray[i] >= 'a' && tempWordArray[i] <= 'z') || (tempWordArray[i] >= '0' && tempWordArray[i] <= '9'))
            *(hashStr + i) = tempWordArray[i];
        else
            *(hashStr + i) = tempWordArray[i] - 'A' + 'a';
    }
    // standard ELF hash over the lowercase copy
    for (i = 0; i <= (unsigned long)lastAlphaPosition; i++)
    {
        hash = (hash << 4) + *(hashStr + i);
        if ((x = hash & 0xf0000000) != 0)
        {
            hash ^= (x >> 24);
            hash &= ~x;
        }
    }
    hash &= 0x7fffffff;
    free(hashStr);
    return hash;
}

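
The temporary lowercase copy costs one malloc and one free per word. A possible variant, shown here only as a sketch, lowercases each character while hashing. Note two deliberate differences from my function: it takes the number of characters to hash instead of the index of the last letter, and it uses a fixed 32-bit type. The name elfHashLower is hypothetical.

```cpp
#include <cctype>
#include <cstdint>

// ELF hash over the first `length` characters of `word`, folding letters to
// lowercase on the fly so that no temporary copy of the word is needed.
std::uint32_t elfHashLower(const char* word, int length)
{
    std::uint32_t hash = 0;
    for (int i = 0; i < length; ++i)
    {
        unsigned char c = static_cast<unsigned char>(word[i]);
        hash = (hash << 4) + static_cast<std::uint32_t>(std::tolower(c));
        std::uint32_t high = hash & 0xF0000000u;
        if (high != 0)
        {
            hash ^= high >> 24; // fold the top four bits back into the low bits
            hash &= ~high;      // then clear them
        }
    }
    return hash & 0x7FFFFFFFu;
}
```

With this variant, "That" and "THAT" hash to the same value without any allocation.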

(7) Top ten words and phrases by frequency

void topFrequencyWordStatistics(struct alphaArray* dictionary, struct wordStatisticsResult* topFrequencyWord)
{
    int i = 0, j = 0;
    int minWordFrequency = 0;
    for (i = 0; i < topFrequencyWordNum; i++)
    {
        (topFrequencyWord + i)->wordStr = NULL;
        (topFrequencyWord + i)->wordFrequency = 0;
    }
    // scan the whole table; any word more frequent than the current minimum enters the top list
    for (i = 0; i < alphabet; i++)
    {
        for (j = 0; j < (dictionary + i)->presentWordArrayLength; j++)
        {
            if (((dictionary + i)->wordArray + j)->frequency > minWordFrequency)
                updateTopFrequencyWord(topFrequencyWord, ((dictionary + i)->wordArray + j), minWordFrequency);
        }
    }
    sortTopFrequencyWord(topFrequencyWord);
    puts("Top 10 word:");
    for (i = 0; i < topFrequencyWordNum; i++)
        printf("%s\t%d\n", (topFrequencyWord + i)->wordStr, (topFrequencyWord + i)->wordFrequency);
    printf("\n");
}

void updateTopFrequencyWord(struct wordStatisticsResult* topFrequencyWord, struct wordInfo* dictionary_i_j, int &minWordFrequency)
{
    int i = 0;
    // replace the first entry that holds the current minimum
    for (i = 0; i < topFrequencyWordNum; i++)
    {
        if ((topFrequencyWord + i)->wordFrequency == minWordFrequency)
        {
            (topFrequencyWord + i)->wordStr = dictionary_i_j->wordStr;
            (topFrequencyWord + i)->wordFrequency = dictionary_i_j->frequency;
            minWordFrequency = dictionary_i_j->frequency;
            // stop here; otherwise a later entry with the same frequency as the new word would be overwritten by it
            break;
        }
    }
    // recompute the minimum of the list
    for (i = 0; i < topFrequencyWordNum; i++)
    {
        if ((topFrequencyWord + i)->wordFrequency < minWordFrequency)
            minWordFrequency = (topFrequencyWord + i)->wordFrequency;
    }
}

void sortTopFrequencyWord(struct wordStatisticsResult* topFrequencyWord)
{
    int i = 0, j = 0;
    int minWordFrequency;
    int minWordFrequencyPosition;
    struct wordStatisticsResult tempWord;
    // selection sort: move the smallest remaining entry to the end, giving descending order
    for (i = 0; i < topFrequencyWordNum - 1; i++)
    {
        minWordFrequency = topFrequencyWord->wordFrequency;
        minWordFrequencyPosition = 0;
        for (j = 0; j < topFrequencyWordNum - i; j++)
        {
            if ((topFrequencyWord + j)->wordFrequency < minWordFrequency)
            {
                minWordFrequency = (topFrequencyWord + j)->wordFrequency;
                minWordFrequencyPosition = j;
            }
        }
        tempWord.wordStr = (topFrequencyWord + minWordFrequencyPosition)->wordStr;
        tempWord.wordFrequency = minWordFrequency;
        (topFrequencyWord + minWordFrequencyPosition)->wordStr = (topFrequencyWord + topFrequencyWordNum - i - 1)->wordStr;
        (topFrequencyWord + minWordFrequencyPosition)->wordFrequency = (topFrequencyWord + topFrequencyWordNum - i - 1)->wordFrequency;
        (topFrequencyWord + topFrequencyWordNum - i - 1)->wordStr = tempWord.wordStr;
        (topFrequencyWord + topFrequencyWordNum - i - 1)->wordFrequency = tempWord.wordFrequency;
    }
}

void topFrequencyPhaseStatistics(struct alphaArray* dictionary, struct phaseStatisticsResult* topFrequencyPhase)
{
    int i = 0, j = 0, k = 0;
    int minPhaseFrequency = 0;
    for (i = 0; i < topFrequencyPhaseNum; i++)
    {
        (topFrequencyPhase + i)->firstStr = NULL;
        (topFrequencyPhase + i)->secondStr = NULL;
        (topFrequencyPhase + i)->phaseFrequency = 0;
    }
    // scan every following-word entry of every word in the table
    for (i = 0; i < alphabet; i++)
    {
        for (j = 0; j < (dictionary + i)->presentWordArrayLength; j++)
        {
            for (k = 0; k < ((dictionary + i)->wordArray + j)->presentNextWordNum; k++)
            {
                if (*(((dictionary + i)->wordArray + j)->nextWordFrequency + k) > minPhaseFrequency)
                    updateTopFrequencyPhase(topFrequencyPhase, ((dictionary + i)->wordArray + j), k, minPhaseFrequency);
            }
        }
    }
    sortTopFrequencyPhase(topFrequencyPhase);
    puts("Top 10 phase:");
    for (i = 0; i < topFrequencyPhaseNum; i++)
        printf("%s %s\t%d\n", (topFrequencyPhase + i)->firstStr, (topFrequencyPhase + i)->secondStr, (topFrequencyPhase + i)->phaseFrequency);
    printf("\n");
}

void updateTopFrequencyPhase(struct phaseStatisticsResult* topFrequencyPhase, wordInfo* dictionary_i_j, int offset, int &minPhaseFrequency)
{
    int i = 0;
    // replace the first entry that holds the current minimum
    for (i = 0; i < topFrequencyPhaseNum; i++)
    {
        if ((topFrequencyPhase + i)->phaseFrequency == minPhaseFrequency)
        {
            (topFrequencyPhase + i)->firstStr = dictionary_i_j->wordStr;
            (topFrequencyPhase + i)->secondStr = *(dictionary_i_j->nextWordPoint + offset);
            (topFrequencyPhase + i)->phaseFrequency = *(dictionary_i_j->nextWordFrequency + offset);
            minPhaseFrequency = (topFrequencyPhase + i)->phaseFrequency;
            // stop here; otherwise a later entry with the same frequency as the new phrase would be overwritten by it
            break;
        }
    }
    // recompute the minimum of the list
    for (i = 0; i < topFrequencyPhaseNum; i++)
    {
        if ((topFrequencyPhase + i)->phaseFrequency < minPhaseFrequency)
            minPhaseFrequency = (topFrequencyPhase + i)->phaseFrequency;
    }
}

void sortTopFrequencyPhase(struct phaseStatisticsResult* topFrequencyPhase)
{
    int i = 0, j = 0;
    int minPhaseFrequency;
    int minPhaseFrequencyPosition;
    struct phaseStatisticsResult tempPhase;
    // selection sort: move the smallest remaining entry to the end, giving descending order
    for (i = 0; i < topFrequencyPhaseNum - 1; i++)
    {
        minPhaseFrequency = topFrequencyPhase->phaseFrequency;
        minPhaseFrequencyPosition = 0;
        for (j = 0; j < topFrequencyPhaseNum - i; j++)
        {
            if ((topFrequencyPhase + j)->phaseFrequency < minPhaseFrequency)
            {
                minPhaseFrequency = (topFrequencyPhase + j)->phaseFrequency;
                minPhaseFrequencyPosition = j;
            }
        }
        tempPhase.firstStr = (topFrequencyPhase + minPhaseFrequencyPosition)->firstStr;
        tempPhase.secondStr = (topFrequencyPhase + minPhaseFrequencyPosition)->secondStr;
        tempPhase.phaseFrequency = minPhaseFrequency;
        (topFrequencyPhase + minPhaseFrequencyPosition)->firstStr = (topFrequencyPhase + topFrequencyPhaseNum - i - 1)->firstStr;
        (topFrequencyPhase + minPhaseFrequencyPosition)->secondStr = (topFrequencyPhase + topFrequencyPhaseNum - i - 1)->secondStr;
        (topFrequencyPhase + minPhaseFrequencyPosition)->phaseFrequency = (topFrequencyPhase + topFrequencyPhaseNum - i - 1)->phaseFrequency;
        (topFrequencyPhase + topFrequencyPhaseNum - i - 1)->firstStr = tempPhase.firstStr;
        (topFrequencyPhase + topFrequencyPhaseNum - i - 1)->secondStr = tempPhase.secondStr;
        (topFrequencyPhase + topFrequencyPhaseNum - i - 1)->phaseFrequency = tempPhase.phaseFrequency;
    }
}

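
For comparison, the selection of the top entries can be written much more briefly with the standard library. This is a sketch of an alternative and not the code my program runs; it assumes the counts are already available as (word, count) pairs, and WordCount and topK are names made up for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using WordCount = std::pair<std::string, int>;

// Return the k most frequent entries, highest count first.
// Entries with equal counts are ordered by dictionary order of the word.
std::vector<WordCount> topK(std::vector<WordCount> entries, std::size_t k)
{
    k = std::min(k, entries.size());
    auto byFrequency = [](const WordCount& a, const WordCount& b) {
        if (a.second != b.second) return a.second > b.second;
        return a.first < b.first;
    };
    // Only the first k positions are put in order; the rest stay unsorted.
    std::partial_sort(entries.begin(),
                      entries.begin() + static_cast<std::ptrdiff_t>(k),
                      entries.end(), byFrequency);
    entries.resize(k);
    return entries;
}
```

Asking for more entries than exist simply returns all of them in sorted order, so a list with fewer than ten words never contains empty slots.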

(8) Output

void outputResult(struct alphaArray* dictionary)
{
    int i = 0, j = 0, k = 0;
    puts("Statistics result:");
    printf("characterNum:%lu\n", characterNum);
    printf("wordNum:%lu\n", wordNum);
    printf("lineNum:%lu\n\n", lineNum);
}

void outputToFile(struct wordStatisticsResult* topFrequencyWord,struct phaseStatisticsResult* topFrequencyPhase)
{
    int i = 0;
    FILE* fp;
    fopen_s(&fp,"D:\\RGhw\\result.txt","wb");
    if (fp == NULL) exit(-1);
    fputs("characterNum:", fp);
    fprintf(fp, "%lu\r\n", characterNum);
    fputs("wordNum:", fp);
    fprintf(fp, "%lu\r\n", wordNum);
    fputs("lineNum:", fp);
    fprintf(fp, "%lu\r\n\r\n", lineNum);
    fputs("Top 10 frequency words:\r\n", fp);
    for (i = 0; i < topFrequencyWordNum; i++)
        fprintf(fp,"%s:  %d\r\n", (topFrequencyWord + i)->wordStr, (topFrequencyWord + i)->wordFrequency);
    fputs("\r\n",fp);
    fputs("Top 10 frequency phases:\r\n", fp);
    for (i = 0; i < topFrequencyPhaseNum; i++)
        fprintf(fp,"%s %s:  %d\r\n", (topFrequencyPhase + i)->firstStr, (topFrequencyPhase + i)->secondStr, (topFrequencyPhase + i)->phaseFrequency);
    fputs("\r\n", fp);
    fclose(fp);
}


(9) Free the memory

void dictionaryDestroy(struct alphaArray* dictionary)
{
    int i, j;
    for (i = 0; i < alphabet; i++)
    {
        for (j = 0; j < ((dictionary + i)->presentWordArrayLength); j++)
        {
            free(((dictionary + i)->wordArray + j)->wordStr);
            free(((dictionary + i)->wordArray + j)->nextWordPoint);
            free(((dictionary + i)->wordArray + j)->nextWordFrequency);
        }
        free((dictionary + i)->wordArray);
    }
}


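5. Performance Analysis

(1) CPU and GPU usage

(2) CPU usage by function

(3) CPU usage inside main

(4) CPU usage inside the traverse-and-count function (traverseFileandCount)

(5) CPU usage inside the counting function (count)

Analysis:

The profile shows that almost all of the running time is spent in the counting function.
Inside the counting function, fgetc() takes the most time, which shows that reading the file one byte at a time is very inefficient. I later considered switching to fread() (reading the whole file into an array in one go) to speed this up, but ran out of time and did not make the change.
Besides fgetc(), the word-storing function also takes a fair amount of time, probably because it uses dynamic memory: every call has to check whether there is enough space and request more when there is not. For the sake of robustness, I consider this cost unavoidable.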
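
The fread() idea that I did not get to implement might look roughly like the sketch below. It reads the file in 64 KiB blocks rather than all at once, and it only counts printable ASCII bytes and newline characters. Both choices are assumptions made for the example, since the exact counting rule of my characterNumandLineNum function is not shown in this post. FileCounts and countWithFread are hypothetical names.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

struct FileCounts {
    unsigned long printableChars = 0; // bytes in the range 32..126
    unsigned long newlines = 0;       // '\n' bytes
};

// Read the file in 64 KiB blocks with fread instead of one fgetc call per byte.
// Returns false if the file cannot be opened.
bool countWithFread(const char* path, FileCounts& counts)
{
    std::FILE* fp = std::fopen(path, "rb");
    if (fp == nullptr) return false;
    std::vector<unsigned char> buffer(64 * 1024);
    std::size_t got = 0;
    while ((got = std::fread(buffer.data(), 1, buffer.size(), fp)) > 0)
    {
        for (std::size_t i = 0; i < got; ++i)
        {
            unsigned char c = buffer[i];
            if (c == '\n') ++counts.newlines;
            else if (c >= 32 && c <= 126) ++counts.printableChars;
        }
    }
    std::fclose(fp);
    return true;
}
```

The word and phrase logic would then run over the same buffer, taking care that a word can straddle two blocks.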

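6. Test Cases and Analysis

(1) The test set provided by the teaching assistant

    Running time: 32 seconds (release build). The results are shown below (my program's output on the left, the teaching assistant's on the right, as in all later comparisons; note that my line count and word count appear in a different order from the TA's):

    The first three figures each differ by about 100, which may be due to differences in the counting method.

    The word and phrase statistics match the TA's.

(2) Empty folder

(3) Empty file

(4) File containing a single word

(5) Words of the same class are output in dictionary order

    File contents:

    Result:

(6) Phrases are output in dictionary order

    File contents:

    Result:

(7) Files of different types

    Folder:

    Result:

(8) Invalid path

    My program exits immediately (exit(-1)) without printing an error message.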
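
A friendlier behavior would be to report why the path could not be opened before exiting. The sketch below is one possible shape for that, shown for a single file; explainOpenFailure is a hypothetical helper, and it assumes that fopen sets errno on failure, which POSIX specifies but the C standard does not guarantee.

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>

// Try to open `path` for reading. An empty result means the file opened
// successfully; otherwise the result is a human-readable reason.
std::string explainOpenFailure(const char* path)
{
    errno = 0;
    std::FILE* fp = std::fopen(path, "r");
    if (fp != nullptr)
    {
        std::fclose(fp);
        return std::string();
    }
    return std::string("cannot open '") + path + "': " + std::strerror(errno);
}
```

The caller could print the message to stderr and then exit with a non-zero status, which keeps the current exit behavior but tells the user what went wrong.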

(9) First version of the test set

(10) Image file

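7. Known Problems

After the program first ran successfully, I ran it on the test set and found that the word THAT was output twice. In other words, the same word was stored in two different places. At first this was baffling. I eventually traced the problem to dynamic memory. To keep the program robust I used dynamic memory: when the word table cannot hold another word, the program requests twice the space. What I overlooked is that when the capacity of the table changes, the storage position derived from a word's hash changes too, so the same word ends up stored in different places. The solution I came up with is to search in turn at one times the initial capacity, two times the initial capacity, and so on, which guarantees that each word has exactly one position. However, by the time I found the problem the deadline was close, so I simply enlarged the initial capacity to work around it.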
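
Another common way to keep each word in exactly one place is to reinsert every stored word whenever the table grows. The sketch below illustrates that rehash-on-grow idea; it is not the fix I described above and not my program's code. It uses std::string, std::hash and linear probing instead of ELFHash and quadratic probing, and WordTable is a made-up class.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Open-addressing table of word counts that rehashes every entry when it grows,
// so a word always has exactly one slot.
class WordTable
{
public:
    explicit WordTable(std::size_t capacity = 8)
        : slots_(capacity < 2 ? 2 : capacity), used_(0) {}

    // Increment the count of `word`, inserting it if it is new.
    void add(const std::string& word)
    {
        // Keep the table at most half full so that probing always terminates.
        if ((used_ + 1) * 2 > slots_.size()) grow();
        Slot& slot = slots_[findSlot(slots_, word)];
        if (slot.count == 0) { slot.word = word; ++used_; }
        ++slot.count;
    }

    // Return how many times `word` was added (0 if never).
    int count(const std::string& word) const
    {
        return slots_[findSlot(slots_, word)].count;
    }

private:
    struct Slot { std::string word; int count = 0; };

    // Linear probing: start at the hash position and walk to the first slot
    // that is empty or already holds `word`.
    static std::size_t findSlot(const std::vector<Slot>& slots, const std::string& word)
    {
        std::size_t pos = std::hash<std::string>()(word) % slots.size();
        while (slots[pos].count != 0 && slots[pos].word != word)
            pos = (pos + 1) % slots.size();
        return pos;
    }

    // Double the capacity and reinsert every stored word at its new position.
    void grow()
    {
        std::vector<Slot> bigger(slots_.size() * 2);
        for (const Slot& old : slots_)
            if (old.count != 0)
                bigger[findSlot(bigger, old.word)] = old;
        slots_.swap(bigger);
    }

    std::vector<Slot> slots_;
    std::size_t used_;
};
```

Rehashing costs one pass over the table per growth step, but because the capacity doubles each time, the total extra work stays proportional to the number of stored words.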

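8. Summary and Reflection

Overall, because I made a rough plan at the start, the work went fairly smoothly. I got stuck twice: on the dynamic memory code and on using the virtual machine. The word table uses dynamic memory, so the code has to check whether there is enough memory and reallocate when there is not. My thinking was not clear enough when I wrote this part, so it took a lot of time. Once the program ran correctly, I started on the portability changes. To test them I installed an Ubuntu virtual machine. After a successful test the virtual machine suddenly died; I reinstalled it three times and it still failed (exhausting). As a result, I could not verify the function that writes the output file.

On coding standards, I have improved a little. This time I paid particular attention to variable and function names to make the code more readable. I also split long functions into smaller ones wherever I could, although some functions still run to forty or fifty lines.

On time management, all I can say is that I did the software engineering assignment first and the assignments for my other courses afterwards.

Shortcomings: I am not proficient with virtual machines and could not resolve problems quickly; the performance analysis is not detailed enough; the code is verbose and hard to read.

I will keep practicing and improving in future programming work.

Appendix: GitHub repository for the code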

posted @ 2018-03-30 17:38 _最冷一天
