Word Frequency Count: Training Exercise 1

Code for this assignment: http://gitee.com/changchundehui/09-08/blob/master/Wf.java

Assignment description: https://edu.cnblogs.com/campus/cvit/16NetworkEngineering/homework/2509

Team members: 16012008 李雪, 16012009 徐小东

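Design:

　　1. Implement word frequency counting in Java.

　　2. The program has three parts: the first reads the file and extracts every word, the second counts how often each word occurs (using a tree, here a TreeMap), and the third sorts the results and prints them.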
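The three parts described above can be sketched on an in-memory string. This is a minimal illustration rather than the code in Wf.java: the class name WordFreqPipeline and the method names tokenize, count and sortByCount are hypothetical, and Map.merge is used here in place of the get/put pair of the original.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class and method names, used only for this illustration.
class WordFreqPipeline {

    // Part 1: split the text on non-letter characters and keep the non-empty tokens.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.split("[^a-zA-Z]")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // Part 2: count occurrences; a TreeMap keeps its keys in sorted order.
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Part 3: order the entries by count, highest first.
    static List<Map.Entry<String, Integer>> sortByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return entries;
    }

    public static void main(String[] args) {
        List<String> words = tokenize("the cat and the dog and the bird");
        for (Map.Entry<String, Integer> e : sortByCount(count(words))) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

Because the sort is stable and the TreeMap already iterates in key order, words with the same count stay in alphabetical order.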

Code implementation:

　　　　Three modes provide three features: enter 1 for feature one, 2 for feature two, and 3 for feature three.
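One detail worth noting about the mode selection: Scanner.nextInt() throws InputMismatchException when the user types something that is not a number. A possible way to validate the choice first is sketched below; the class ModeParser and the method parseMode are hypothetical names and do not appear in the original program.

```java
// Hypothetical helper for validating the menu choice; not part of Wf.java.
class ModeParser {

    // Returns 0..3 for a valid menu choice, or -1 for anything else.
    static int parseMode(String line) {
        if (line == null) {
            return -1;
        }
        try {
            int mode = Integer.parseInt(line.trim());
            return (mode >= 0 && mode <= 3) ? mode : -1;
        } catch (NumberFormatException ex) {
            return -1; // not a number at all
        }
    }

    public static void main(String[] args) {
        // Fixed sample inputs instead of console input, so the demo runs unattended.
        for (String input : new String[] {"1", " 3 ", "7", "abc"}) {
            System.out.println("'" + input + "' -> " + parseMode(input));
        }
    }
}
```

With a helper like this, the main loop could read a whole line and simply ask again whenever -1 comes back.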

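I. Main program entry:

public static void main(String[] args) {
        // Main program loop
        // One Scanner for the whole session; creating a new one per iteration can lose buffered input
        Scanner readerScanner = new Scanner(System.in);
        while (true) {
            System.out.println("This program has three modes: 1. single-line processing; 2. single-file processing; 3. batch processing; 0. exit\nEnter 1, 2 or 3 to choose a mode. Mode 2 looks for the file in the root of drive D:");
            int flag = readerScanner.nextInt();
            if (flag == 0) {
                break;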

II. Feature one

　　To keep the test realistic, Lao Wu (老五) can be asked to type a line of English at the console.

// Process a single line of input

            else if (flag == 1) {
                try {
                    System.out.println("Single-line mode: please enter the sentence to analyze");
                    BufferedReader bf = new BufferedReader(new InputStreamReader(System.in)); // read one line from the console
                    String s = bf.readLine();
                    LineCode(s);
                } catch (IOException ex) {
                    System.out.println("Please enter the sentence on a single line");
                }

III. Feature two

　　Have Lao Wu write an English article himself, then enter its file name.

else if (flag == 2) {
                System.out.println("Single-file mode: please enter the file name, in the form aaa.txt");
                String s = readerScanner.next();
                try {
                    TxtCode(s);
                } catch (Exception ex) {
                    System.out.println("Please enter a valid file name and make sure the file exists in the root of drive D:");
                }

IV. Feature three

This mode processes files in batch, so enter a directory path rather than a series of file names.

            else if (flag == 3) {
                System.out.println("Batch mode: please enter the directory path, in the form d:/ljr");
                String path = readerScanner.next();
                File file = new File(path);
                if (file.isDirectory()) {
                    File[] filelist = file.listFiles();
                    for (File file1 : filelist) {
                        try {
                            String s = file1.getPath(); // full path of the current file
                            System.out.println(s);
                            FileCode(s);
                        } catch (Exception ex) {
                            System.out.println("Please enter a valid path; if the program does not finish, run it again");
                        }
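The batch excerpt above passes every entry of the directory to FileCode, including subdirectories, and File.listFiles() returns null when the path cannot be listed. A more defensive listing step might look like the sketch below; BatchListing and listRegularFiles are hypothetical names chosen for illustration, and sorting the entries is an added assumption to make the output order predictable.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of the original Wf.java.
class BatchListing {

    // Returns the regular files directly inside dir, skipping subdirectories.
    static List<File> listRegularFiles(File dir) {
        List<File> result = new ArrayList<>();
        File[] children = dir.listFiles(); // null if dir is not a directory or cannot be read
        if (children == null) {
            return result;
        }
        Arrays.sort(children); // listFiles() does not guarantee any particular order
        for (File child : children) {
            if (child.isFile()) {
                result.add(child);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Lists the given directory, or the current directory when no argument is passed.
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (File f : listRegularFiles(dir)) {
            System.out.println(f.getPath());
        }
    }
}
```

Each returned file could then be handed to FileCode inside the same try/catch as in the excerpt.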

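V. Reading, counting and sorting:

Reading: open the file and read it through a buffered reader, split each line into words with String.split() and a regular expression, and add the words to a list. Counting: create a TreeMap and traverse the list (lists), using each word as the key and its frequency as the value, which yields the word counts. Sorting and output: oldmap.entrySet() exposes the mappings of the tree as a Set, the entries are copied into a list of Map.Entry, the library method Collections.sort sorts that list, and finally the list is traversed and printed from the highest count to the lowest.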
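To see why the code filters out zero-length strings after splitting, here is a small sketch of how String.split behaves with the pattern used in this program. The class name SplitDemo and the sample sentence are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical demo class; the sample sentence is invented.
class SplitDemo {

    // Same pattern as the program: every single non-letter character is a delimiter.
    static String[] rawSplit(String line) {
        return line.split("[^a-zA-Z]");
    }

    // Variant that treats a run of non-letters as one delimiter.
    static String[] runSplit(String line) {
        return line.split("[^a-zA-Z]+");
    }

    public static void main(String[] args) {
        String line = "Hello, world! It's 2018.";
        // Adjacent delimiters such as ", " produce empty strings between them.
        System.out.println(Arrays.toString(rawSplit(line))); // [Hello, , world, , It, s]
        // With "+", the empty strings in the middle disappear.
        System.out.println(Arrays.toString(runSplit(line))); // [Hello, world, It, s]
    }
}
```

Trailing empty strings are dropped by split itself, but a line that begins with a non-letter still yields a leading empty string even with the "+" variant, so the length check remains useful. Note also that "It's" is split into "It" and "s" under this pattern.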

 // Process a single file
    public static void TxtCode(String txtname) throws Exception {
        // The file is expected in the root of drive D:, as the menu prompt states
        BufferedReader br = new BufferedReader(new FileReader("D:/" + txtname));
        List<String> lists = new ArrayList<String>();  // list of the filtered words
        String readLine = null;
        while ((readLine = br.readLine()) != null) {
            String[] wordsArr1 = readLine.split("[^a-zA-Z]");  // keep only runs of letters
            for (String word : wordsArr1) {
                if (word.length() != 0) {  // skip empty strings produced by the split
                    lists.add(word);
                }
            }
        }
        br.close();
        StatisticalCode(lists);       
    }

    // Process a single line
    public static void LineCode(String args) {
        List<String> lists = new ArrayList<String>();  // list of the filtered words
        String[] wordsArr1 = args.split("[^a-zA-Z]");  // keep only runs of letters
        for (String word : wordsArr1) {
            if (word.length() != 0) {  // skip empty strings produced by the split
                lists.add(word);
            }
        }
        StatisticalCode(lists);    
    }
    
    // Process one file given its full path (used by batch mode)
    public static void FileCode(String args) throws FileNotFoundException, IOException {
        BufferedReader br = new BufferedReader(new FileReader(args));
        List<String> lists = new ArrayList<String>();  // list of the filtered words
        String readLine = null;
        while ((readLine = br.readLine()) != null) {
            String[] wordsArr1 = readLine.split("[^a-zA-Z]");  // keep only runs of letters
            for (String word : wordsArr1) {
                if (word.length() != 0) {  // skip empty strings produced by the split
                    lists.add(word);
                }
            }
        }
        br.close();
        StatisticalCode(lists);       
    }
    
    // Count the words, then sort and print them
    public static void StatisticalCode(List<String> lists) {
        Map<String, Integer> wordsCount = new TreeMap<String, Integer>();  // key: word, value: number of occurrences
        // Count how often each word occurs
        for (String li : lists) {
            if (wordsCount.get(li) != null) {
                wordsCount.put(li, wordsCount.get(li) + 1);
            } else {
                wordsCount.put(li, 1);
            }
        }
        // System.out.println("wordcount.Wordcount.main()");
        SortMap(wordsCount);    // sort by value
    }
    // Sort by value (count) and print
    public static void SortMap(Map<String, Integer> oldmap) {

        ArrayList<Map.Entry<String, Integer>> list = new ArrayList<Map.Entry<String, Integer>>(oldmap.entrySet());
        Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2) {
                return Integer.compare(o2.getValue(), o1.getValue());  // descending order
            }
        });
        for (int i = 0; i < list.size(); i++) {
            System.out.println(list.get(i).getKey() + ": " + list.get(i).getValue());
        }
    }
}
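TxtCode and FileCode above share the same reading loop and close the reader by hand, so the reader stays open if readLine throws. One possible consolidation, shown only as a sketch, is a single helper that uses try-with-resources. The names WordReader and readWords are hypothetical, and reading the file as UTF-8 is an assumption; the original FileReader uses the platform default charset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the original Wf.java.
class WordReader {

    // Reads a file line by line and returns every run of letters as a word.
    static List<String> readWords(Path file) throws IOException {
        List<String> words = new ArrayList<>();
        // try-with-resources closes the reader even if readLine() throws.
        // UTF-8 is an assumption made for this sketch.
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                for (String token : line.split("[^a-zA-Z]")) {
                    if (!token.isEmpty()) {
                        words.add(token);
                    }
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Writes a small temporary file so the demo does not depend on drive D:.
        Path tmp = Files.createTempFile("wf-demo", ".txt");
        Files.write(tmp, "one two\ntwo three".getBytes(StandardCharsets.UTF_8));
        System.out.println(readWords(tmp)); // [one, two, two, three]
        Files.delete(tmp);
    }
}
```

Both file modes could then call the same helper and pass the resulting list to StatisticalCode.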

VI. Verification examples: