Extracting the Main Text of a News Article
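I looked at quite a few other people's algorithms, but they seemed too abstruse, so I wrote my own. It appears to work reasonably well, although a fair amount of noise still ends up in the output.
I have not measured the success rate yet; I will verify it later.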
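    import org.jsoup.nodes.Document;

    // JsoupUitl is the author's own helper class (not shown here); readUrl is
    // expected to fetch the page and return it as a parsed jsoup Document.
    public static String extractContent(String url) {
        Document document = JsoupUitl.readUrl(url);
        // Lowercase the whole page so the case-sensitive checks below work.
        // Note that this also lowercases the article text itself.
        String orderHtml = document.toString().toLowerCase();
        orderHtml = orderHtml.replaceAll("(?is)<!DOCTYPE.*?>", "");           // remove the doctype
        orderHtml = orderHtml.replaceAll("(?is)<!--.*?-->", "");              // remove html comments
        orderHtml = orderHtml.replaceAll("(?is)<script.*?>.*?</script>", ""); // remove scripts
        orderHtml = orderHtml.replaceAll("(?is)<style.*?>.*?</style>", "");   // remove stylesheets
        // \\b keeps this from also matching <article>, <aside>, <abbr>, ...
        orderHtml = orderHtml.replaceAll("(?is)<a\\b.*?>.*?</a>", "");        // remove links
        orderHtml = orderHtml.replaceAll("&.{2,5};|&#.{2,5};", "");           // remove entities such as &nbsp;
        // Strip every remaining tag except those whose name starts with td, tr, img, br or p.
        orderHtml = orderHtml.replaceAll("<(?!\\/?(td|tr|img|br|p)).*?>", "");

        // Keep only lines with more than 20 characters of content,
        // skipping lines that contain an empty paragraph ("></p>").
        String[] eleList = orderHtml.split("\n");
        StringBuilder sb = new StringBuilder();
        for (String line : eleList) {
            if (line.trim().length() > 20 && !line.contains("></p>")) {
                sb.append(line);
            }
        }
        orderHtml = sb.toString();
        // Debug: System.out.println(Jsoup.parse(orderHtml));
        return orderHtml;
    }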
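To try the filtering steps without fetching a page or depending on jsoup, the same idea can be sketched against an HTML string that is already in memory. This is only a rough illustration: the class name `ContentFilterSketch`, the method name `clean`, and the sample HTML are made up for this example and are not part of the original code. It also writes the link filter as `<a\b` so that tags such as `<article>` are not mistaken for links.

```java
class ContentFilterSketch {

    // Hypothetical helper: applies the regex filters to an in-memory HTML string.
    static String clean(String html) {
        String s = html.toLowerCase();                           // also lowercases the text itself
        s = s.replaceAll("(?is)<!DOCTYPE.*?>", "");              // doctype
        s = s.replaceAll("(?is)<!--.*?-->", "");                 // html comments
        s = s.replaceAll("(?is)<script.*?>.*?</script>", "");    // scripts
        s = s.replaceAll("(?is)<style.*?>.*?</style>", "");      // stylesheets
        s = s.replaceAll("(?is)<a\\b.*?>.*?</a>", "");           // links, but not <article> etc.
        s = s.replaceAll("&.{2,5};|&#.{2,5};", "");              // entities such as &nbsp;
        s = s.replaceAll("<(?!\\/?(td|tr|img|br|p)).*?>", "");   // all other tags

        // Keep only lines with more than 20 characters that are not empty paragraphs.
        StringBuilder sb = new StringBuilder();
        for (String line : s.split("\n")) {
            if (line.trim().length() > 20 && !line.contains("></p>")) {
                sb.append(line);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample page: navigation, script and style should all disappear.
        String html = String.join("\n",
                "<!DOCTYPE html>",
                "<html><head><title>Demo</title>",
                "<style>p { color: red; }</style>",
                "<script>var tracker = 1;</script>",
                "</head><body>",
                "<a href=\"/home\">Home page navigation link</a>",
                "<p>This is the first paragraph of the news story.</p>",
                "<div>short</div>",
                "</body></html>");
        System.out.println(clean(html));
    }
}
```

Because the 20-character threshold is applied per line, the result depends heavily on how the page source is broken into lines; a page delivered as one long line would pass through the length check as a single unit.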
A test example; the results look fairly good: