java词频统计——改进后的单元测试
测试项目
博客文章地址:[http://www.cnblogs.com/jx8zjs/p/5862269.html]
工程地址:https://coding.net/u/jx8zjs/p/wordCount/git
ssh://git@git.coding.net:jx8zjs/wordCount.git
测试用例:
1.
1 My English is very very pool
2.地址 [http://www.gutenberg.org/files/2600/2600-0.txt]
待测单元1:统计输入文件的词频到目标文件
前四行代码为输入文件和输出文件地址,文件1是测试用例1,文件2是测试用例2.
1 String filename1 = "D://text/pool.txt"; 2 String filename2 = "D://text/2600-0.txt"; 3 String filenamedes1 = "D://pooltest.txt"; 4 String filenamedes2 = "D://2600-0test.txt"; 5 private static FileWordUtil fu = new FileWordUtil(); 6 7 public void testPrintSortedWordGroupCountToFileBufferedStringString() { 8 fu.printSortedWordGroupCountToFile(filename1, filenamedes1); 9 fu.printSortedWordGroupCountToFile(filename2, filenamedes2); 10 } 11 12 public void printSortedWordGroupCountToFile(String filename, String destinationFilename) { 13 List<String[]> result = getSortedWordGroupCount(filename); 14 if (result == null) { 15 System.out.println("no result"); 16 return; 17 } 18 try { 19 FileWriter fr = new FileWriter(destinationFilename); 20 for (String[] sa : result) { 21 fr.write(sa[1] + ": " + sa[0] + "\r\n"); 22 } 23 fr.close(); 24 } catch (IOException e) { 25 e.printStackTrace(); 26 return; 27 } 28 29 }
核心词频统计代码(2016.9.26优化版):
1 public Map<String, Integer> getWordGroupCountBuffered(String filename) { 2 try { 3 FileReader fr = new FileReader(filename); 4 BufferedReader br = new BufferedReader(fr); 5 StringBuffer content = new StringBuffer(""); 6 Map<String, Integer> result = new HashMap<String, Integer>(); 7 char[] ch = new char[128]; 8 int bs = 0; 9 int idx; 10 boolean added = false; 11 boolean split = false; 12 total = 0; 13 while ((bs = br.read(ch)) > 0) { 14 for (idx = 0; idx < bs; idx++) { // char 15 if (isCharacter(ch[idx]) == 1) { 16 if (split == false) { 17 content.append(ch[idx]); 18 added = false; 19 } else { 20 String key = content.toString().toLowerCase(); 21 split = false; 22 total++; 23 added = true; 24 content = new StringBuffer(""); 25 content.append(ch[idx]); 26 if (result.containsKey(key)) { 27 result.put(key, result.get(key) + 1); 28 continue; 29 } else { 30 result.put(key, 1); 31 continue; 32 } 33 } 34 } else if (isCharacter(ch[idx]) == 2) { // digital 35 if (added == true) { 36 continue; 37 } else { 38 content.append(ch[idx]); 39 } 40 } else { // not char or digital 41 split = true; 42 continue; 43 } 44 } 45 } 46 String key = content.toString().toLowerCase(); 47 if (result.containsKey(key)) { 48 result.put(key, result.get(key) + 1); 49 } else { 50 result.put(key, 1); 51 } 52 total++; 53 br.close(); 54 fr.close(); 55 return result; 56 } catch ( 57 58 FileNotFoundException e) { 59 System.out.println("failed to open file:" + filename); 60 e.printStackTrace(); 61 } catch (Exception e) { 62 System.out.println("some expection occured"); 63 e.printStackTrace(); 64 } 65 return null; 66 }
测试结果:
pooltest.txt
2600-0test.txt
待测单元2:统计输入文件的词频到控制台或终端
测试用例1结果:
单元测试总结:
在单元测试的时候偶然间发现了在上文提到的连接中的分词核心函数在某些情况下回遗漏文章最后一个单词,经过反复改进和思考后重写了分析读出字符的逻辑,使测试结果也能满足于预期结果,更令我意外的是算法的效率也提升了近40%(原版本在本机的执行时间平均在490-550ms,新版本运行时间在276-343ms),原因也是引入了新的boolean变量帮助优化逻辑,也减少了一些判定条件。
代码覆盖率:
测试类:
1 public class FileWordUtilTest { 2 3 private static FileWordUtil fu = new FileWordUtil(); 4 String filename1 = "D://text/pool.txt"; 5 String filename2 = "D://text/2600-0.txt"; 6 String filenamedes1 = "D://pooltest.txt"; 7 String filenamedes2 = "D://2600-0test.txt"; 8 9 @Before 10 public void setUp() throws Exception { 11 } 12 13 @After 14 public void tearDown() throws Exception { 15 } 16 17 18 @Test 19 public void testGetSortedWordGroupCountBufferedString() { 20 fu.getSortedWordGroupCountBuffered(filename1); 21 fu.getSortedWordGroupCountBuffered(filename2); 22 } 23 24 @Test 25 public void testPrintSortedWordGroupCountToFileBufferedStringString() { 26 fu.printSortedWordGroupCountToFileBuffered(filename1, filenamedes1); 27 fu.printSortedWordGroupCountToFileBuffered(filename2, filenamedes2); 28 } 29 30 @Test 31 public void testPrintSortedWordGroupCountBufferedString() { 32 fu.printSortedWordGroupCountBuffered(filename1); 33 fu.printSortedWordGroupCountBuffered(filename2); 34 } 35 36 @Test 37 public void testPrintSortedWordGroupCountToFileBufferedFileArrayString() { 38 fu.printSortedWordGroupCountToFileBuffered(filename1, filenamedes1); 39 fu.printSortedWordGroupCountToFileBuffered(filename2, filenamedes2); 40 } 41 42 }
覆盖率结果:
覆盖率分析:
测试中使用上述两个测试用例来进行的代码行覆盖统计,分别测试了getSortedWordGroupCountBuffered 89.0%,printSortedWordGroupCountToFileBuffered 88.9%,printSortedWordGroupCountBuffered 87.3%。
其中未测试到的部分就是catch块,或者旧版本api,null值检测等。所以所选的测试用例基本可以证明当前代码测试完全。
工程地址:https://coding.net/u/jx8zjs/p/wordCount/git
ssh://git@git.coding.net:jx8zjs/wordCount.git