Analyzing the frequency of each word in a text file (an English article) and printing the ten most frequent words

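When our teacher assigned this project, it gave me a headache from the start: I am not a strong programmer, and I had not done any serious programming for a long time. The problem did look familiar, though. A classmate reminded me that last semester's compilers course included a lexical-analysis program, so I dug it out, read through it, and got a rough idea of how to recognize words. Recognizing each word, storing it, and sorting the results still worried me. I searched Baidu, but what I found was either beyond me or contained errors. When I finally sat down to finish the work, I read many of my classmates' posts. I have to admit that this blog has been very useful to us: it keeps our own programs and lets us see other people's. Because I am not very familiar with C# or Java, I chose C.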

I. Programming approach

1. To analyze a text file, first read it in. The English article must already be saved in that file.

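FILE * in;
char infile[50];
printf("Enter the file path:\n");
scanf("%49s",infile); // limit the input to the size of infile

if((in=fopen(infile,"r"))==NULL)
{
    printf("Cannot open this file\n");
    exit(1); // a nonzero exit status signals failure
}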

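2. Next, recognize the words. A word is recognized with a character comparison test, and each word is stored together with its occurrence count in a struct so that it can be printed later.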

struct s // holds a word and its count
{
    int key; // number of occurrences of the word
    char w[20]; // the word itself
};

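3. When recognizing words, the most important thing is to get the condition right. Once the start of a word has been found, keep reading: as long as the next character is a letter, it belongs to the word; the first non-letter means the word is complete.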
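The rule above (letters extend the word, the first non-letter ends it) can be sketched as a standalone helper. This is only a sketch and not the program's own code: the function name next_word and its parameters are hypothetical, and it swaps the manual range checks for isalpha and tolower from <ctype.h>.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of the original program): reads the next
   word from fp into buf, lowercased. cap is the size of buf including the
   terminating '\0'; letters that do not fit are dropped, so the buffer
   cannot overflow. Returns 1 if a word was read, 0 at end of file. */
int next_word(FILE *fp, char *buf, size_t cap)
{
    int ch;            /* int, not char, so EOF can be told apart from data */
    size_t len = 0;

    assert(fp != NULL && buf != NULL && cap > 0);

    /* Skip everything that is not a letter. */
    while ((ch = fgetc(fp)) != EOF && !isalpha(ch))
        ;
    if (ch == EOF)
        return 0;

    /* Collect letters; the first non-letter (or EOF) ends the word. */
    do {
        if (len + 1 < cap)
            buf[len++] = (char)tolower(ch);
    } while ((ch = fgetc(fp)) != EOF && isalpha(ch));

    buf[len] = '\0';
    return 1;
}
```

Calling next_word(in, c, sizeof c) in a loop until it returns 0 would visit every word in the file. Note that an apostrophe counts as a non-letter here, so "It's" is split into "it" and "s".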

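void coppy() // main body of the program
{
    int ch; // int rather than char, so that EOF can be distinguished from data
    int k,flag=0,n=0,m=0,j=0; // flag: whether the word was seen before; k: index used to clear the word buffer; j: index into the struct array
    for(m=0;m<500;m++)
        words[m].key=0; // occurrence count of each word; 0 marks an unused slot
    char infile[50];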

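Each time a word has been read, it is compared with the words read so far. If a match is found, that word's counter is incremented; if no earlier word matches, the new word is stored in the struct array. This is repeated until the end of the file.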

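while((ch=fgetc(in))!=EOF)
{
    for(k=0;k<20;k++) // clear the word buffer; words are assumed to be shorter than 20 letters
        c[k]='\0';
    int i=0;
    while((ch>='A'&&ch<='Z')||(ch>='a'&&ch<='z')) // is it a letter?
    {
        if(ch>='A'&&ch<='Z') // >= and <= so that 'A' and 'Z' are converted too
            ch=ch+32; // uppercase to lowercase
        if(i<19) // keep room for the terminating '\0'
            c[i++]=ch; // append the letter to the word buffer
        ch=fgetc(in); // read the next character
    }
    if(c[0]=='\0')
        continue;
    else
    {
        for(j=0;j<500&&words[j].key!=0;j++) // compare with every stored word; increment the count on a match
        {
            if(strcmp(words[j].w,c)==0)
            {
                words[j].key=words[j].key+1;
                flag=1; // flag=1: the word was already stored
                break;
            }
        }
        if(flag==0&&j<500) // flag=0: a new word, store it in the first free slot
        {
            words[j].key=1;
            strcpy(words[j].w,c);
        }
        flag=0;
    }
}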
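The compare-then-store step can also be isolated into one function, which makes it easier to check on its own. The following is a minimal sketch under stated assumptions: count_word, MAX_WORDS, MAX_LEN and the explicit used counter are hypothetical names introduced for illustration (the program above instead treats key == 0 as the end of the table), and the word passed in is assumed to be at most MAX_LEN - 1 characters long.

```c
#include <assert.h>
#include <string.h>

#define MAX_WORDS 500   /* same table size as the program above */
#define MAX_LEN   20    /* same word buffer size as the program above */

struct s {
    int  key;           /* number of occurrences */
    char w[MAX_LEN];    /* the word itself */
};

/* Hypothetical helper: records one occurrence of word in table, where
   *used is the number of slots already filled. Returns the index of the
   word, or -1 if the word is new and the table is full. */
int count_word(struct s *table, int *used, const char *word)
{
    int j;

    assert(table != NULL && used != NULL && word != NULL);

    /* Linear search through the words stored so far. */
    for (j = 0; j < *used; j++) {
        if (strcmp(table[j].w, word) == 0) {
            table[j].key++;
            return j;
        }
    }

    /* Not found: append it, unless there is no room left. */
    if (*used >= MAX_WORDS)
        return -1;
    strncpy(table[*used].w, word, MAX_LEN - 1);
    table[*used].w[MAX_LEN - 1] = '\0';
    table[*used].key = 1;
    return (*used)++;
}
```

The linear search costs one pass over the table per word, which is acceptable for a table of 500 entries; a hash table would be the usual choice for much larger inputs.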

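II. Development log

February 22:

Dug out my old program, read it until I understood it, and worked out how to read a file and recognize words.

Looked up some material and wrote a small part of the program.

February 23:

Searched Baidu for some programs and understood the basic algorithm.

Wrote a rough version of the whole program, but it contained a large number of errors.

February 28:

Looked up more material and debugged. The program no longer reported errors, but the results were wrong.

March 2:

Consulted references and the textbook, and the program finally worked.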

III. Source code and output

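// 分析文本文件.cpp : Defines the entry point for the console application.
// First major assignment: analyze the frequency of each word in a text file (an English article) and print the ten most frequent words
//

#include "stdafx.h" // Visual Studio precompiled header; remove this line when building with another compiler
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

struct s // holds a word and its count
{
    int key; // number of occurrences of the word
    char w[20]; // the word itself
};

char c[20]; // buffer for the word currently being read

struct s words[500]; // table of the distinct words in the article (at most 500)

FILE * in; // file pointer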

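void coppy() // main body of the program
{
    int ch; // int rather than char, so that EOF can be distinguished from data
    int k,flag=0,n=0,m=0,j=0; // flag: whether the word was seen before; k: index used to clear the word buffer; j: index into the struct array
    for(m=0;m<500;m++)
        words[m].key=0; // occurrence count of each word; 0 marks an unused slot
    char infile[50];
    printf("Enter the file path:\n"); // open the text file to analyze
    scanf("%49s",infile); // limit the input to the size of infile
    if((in=fopen(infile,"r"))==NULL)
    {
        printf("Cannot open this file\n");
        exit(1); // a nonzero exit status signals failure
    }
    while((ch=fgetc(in))!=EOF)
    {
        for(k=0;k<20;k++) // clear the word buffer; words are assumed to be shorter than 20 letters
            c[k]='\0';
        int i=0;
        while((ch>='A'&&ch<='Z')||(ch>='a'&&ch<='z')) // is it a letter?
        {
            if(ch>='A'&&ch<='Z') // >= and <= so that 'A' and 'Z' are converted too
                ch=ch+32; // uppercase to lowercase
            if(i<19) // keep room for the terminating '\0'
                c[i++]=ch; // append the letter to the word buffer
            ch=fgetc(in); // read the next character
        }
        if(c[0]=='\0')
            continue;
        else
        {
            for(j=0;j<500&&words[j].key!=0;j++) // compare with every stored word; increment the count on a match
            {
                if(strcmp(words[j].w,c)==0)
                {
                    words[j].key=words[j].key+1;
                    flag=1; // flag=1: the word was already stored
                    break;
                }
            }
            if(flag==0&&j<500) // flag=0: a new word, store it in the first free slot
            {
                words[j].key=1;
                strcpy(words[j].w,c);
            }
            flag=0;
        }
    }
    fclose(in); // close the file
}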

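void sort() // simple exchange sort (a selection-sort variant rather than a true bubble sort)
{
    int i,j,max;
    char ch2[20];
    for(i=0;i<10&&words[i].key!=0;i++) // only the top ten are needed; stop early if there are fewer than ten words
    {
        max=words[i].key;
        for(j=i+1;j<500&&words[j].key!=0;j++)
            if(words[j].key>max)
            {
                // swap entries i and j (both count and word)
                max=words[j].key;
                words[j].key=words[i].key;
                words[i].key=max;
                strcpy(ch2,words[j].w);
                strcpy(words[j].w,words[i].w);
                strcpy(words[i].w,ch2);
            }
        printf("%d\t",words[i].key);
        printf("%s\n",words[i].w);
    }
}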

int main(void)
{
    coppy();
    sort();
    return 0;
}

The output is as follows:

The input file is as follows:

It is not difficult to imagine a world short of ambition. It would probably be a kinder world: with out demands, without abrasions, without disappointments. People would have time for reflection. Such work as they did would not be for themselves but for the collectivity. Competition would never enter in. conflict would be eliminated, tension become a thing of the past. The stress of creation would be at an end. Art would no longer be troubling, but purely celebratory in its functions. Longevity would be increased, for fewer people would die of heart attack or stroke caused by tumultuous endeavor. Anxiety would be extinct. Time would stretch on and on, with ambition long departed from the human heart.
Ahhow unrelieved boring life would be

There is a strong view that holds that success is a myth, and ambition therefore a sham. Does this mean that success does not really exist? That achievement is at bottom empty? That the efforts of men and women are of no significance alongside the force of movements and events now not all success, obviously, is worth esteeming, nor all ambition worth cultivating. Which are and which are not is something one soon enough learns on one’s own. But even the most cynical secretly admit that success exists; that achievement counts for a great deal; and that the true myth is that the actions of men and women are useless. To believe otherwise is to take on a point of view that is likely to be deranging. It is, in its implications, to remove all motives for competence, interest in attainment, and regard for posterity.

We do not choose to be born. We do not choose our parents. We do not choose our historical epoch, the country of our birth, or the immediate circumstances of our upbringing. We do not, most of us, choose to die; nor do we choose the time or conditions of our death. But within all this realm of choicelessness, we do choose how we shall live: courageously or in cowardice, honorably or dishonorably, with purpose or in drift. We decide what is important and what is trivial in life. We decide that what makes us significant is either what we do or what we refuse to do. But no matter how indifferent the universe may be to our choices and decisions, these choices and decisions are ours to make. We decide. We choose. And as we decide and choose, so are our lives formed. In the end, forming our own destiny is what ambition is about.

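IV. Lessons learned

The most important part of this program is to get the idea clear and settle on the algorithm: split the problem up and solve it one module at a time.

When recognizing words, the most important thing is to get the condition right. Once the start of a word has been found, keep reading: as long as the next character is a letter, it belongs to the word; the first non-letter means the word is complete.

For the final output, sort the words by their occurrence counts and then print the result.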
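As an alternative to a hand-written sort, the sort-and-print step could be sketched with the standard library's qsort. This is a swapped-in technique, not what the program above does; print_top and by_count_desc are hypothetical names, and the caller is assumed to know how many table entries are in use.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct s {
    int  key;       /* number of occurrences */
    char w[20];     /* the word itself */
};

/* Comparison function for qsort: entries with larger counts come first. */
static int by_count_desc(const void *a, const void *b)
{
    const struct s *x = (const struct s *)a;
    const struct s *y = (const struct s *)b;
    return (y->key > x->key) - (y->key < x->key);
}

/* Hypothetical helper: sorts the first used entries of table by count in
   descending order, prints at most limit of them as "count<TAB>word",
   and returns how many lines were printed. */
int print_top(struct s *table, int used, int limit)
{
    int i, n;

    assert(table != NULL && used >= 0 && limit >= 0);

    qsort(table, (size_t)used, sizeof table[0], by_count_desc);

    n = used < limit ? used : limit;
    for (i = 0; i < n; i++)
        printf("%d\t%s\n", table[i].key, table[i].w);
    return n;
}
```

With the program's table this would be called as print_top(words, used, 10), where used is the number of distinct words stored. qsort is not guaranteed to be stable, so words with equal counts may appear in any order.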

posted @ 2014-03-02 17:43 徐梦迪迪
