Java dynamic crawler: using jsoup and regular expressions
1. jsoup is an HTML parser for Java. It can parse a URL or a string of HTML directly. Official site: http://jsoup.org/
2. Parsing a URL
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    Document doc = Jsoup
            .connect(url)
            .userAgent(
                    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0") // set the User-Agent
            .timeout(5000) // request timeout in milliseconds
            .get();
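    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.jsoup.select.Elements;

    // Select elements by class name and by tag name
    Elements elements = doc.getElementsByClass("desc");
    Elements subelements = elements.get(0).getElementsByTag("li");
    // eachDayElement and firstElement are Element objects obtained elsewhere (not shown here)
    Elements dayElements = eachDayElement.getElementsByTag("tr");
    Elements firstSubElements = firstElement.getElementsByTag("td");
    // Text content of the first matched element
    String text = elements.get(0).text();

    // Class-level fields: matches "issued by the Central Meteorological Observatory at HH:mm"
    private static String regEx_publishDate = "由中央气象台\\s*(\\d+):(\\d+)\\s*发布的";
    private static Pattern pattern_publishDate = Pattern.compile(regEx_publishDate);

    // Extract the hour and minute from the text
    Matcher matcher = pattern_publishDate.matcher(text);
    if (matcher.find()) {
        int hour = Integer.parseInt(matcher.group(1));
        int minute = Integer.parseInt(matcher.group(2));
    }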
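The regex step above can be tried on its own, without a network request. The following is a minimal sketch that reuses the same pattern; the class name PublishTimeDemo, the method parsePublishTime, and the sample sentence are made up for illustration, and on a real page the text would come from elements.get(0).text().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PublishTimeDemo {
    // Same pattern as in the snippet above; whitespace around the time is optional.
    private static final Pattern PUBLISH_TIME =
            Pattern.compile("由中央气象台\\s*(\\d+):(\\d+)\\s*发布的");

    /** Returns {hour, minute}, or null when the phrase is not found in the text. */
    static int[] parsePublishTime(String text) {
        Matcher m = PUBLISH_TIME.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }

    public static void main(String[] args) {
        // Hypothetical sample text, not taken from a real page.
        String text = "以下是由中央气象台 11:00 发布的天气预报";
        int[] time = parsePublishTime(text);
        System.out.println(time == null ? "no match" : time[0] + ":" + time[1]);
    }
}
```

Returning null on a failed match keeps the caller responsible for the "page layout changed" case, which tends to be the common failure when scraping.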
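3. The jsoup jar must be on the classpath.
4. Regex basics (in a Java string literal each backslash is doubled, so \d is written "\\d"):
   \s  matches any whitespace character
   \S  matches any non-whitespace character
   \d  matches a digit
   +   repeats the preceding item one or more times
   *   repeats the preceding item zero or more times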
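To see how these building blocks behave, here is a small sketch using only java.util.regex from the standard library. The class name RegexBasicsDemo and its two helper methods are illustrative and not part of jsoup or of the original notes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexBasicsDemo {
    /** True when the whole input matches the regex. */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    /** Returns the first substring matched by the regex, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(matches("\\d+", "2015"));     // true: one or more digits
        System.out.println(matches("\\d+", ""));         // false: + needs at least one digit
        System.out.println(matches("\\d*", ""));         // true: * accepts zero digits
        System.out.println(matches("a\\s+b", "a \t b")); // true: \s covers spaces and tabs
        System.out.println(firstMatch("\\S+", "  hello world")); // hello
    }
}
```

Note the difference between matches, which tests the whole input, and find, which looks for the pattern anywhere inside the input. The scraping snippets in this post use find.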
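Demo patterns, written as Java string contents (发布 = "published", 过敏 = "allergy", 感冒 = "common cold", 首要污染物 = "primary pollutant"):
    (\\d{4})-(\\d{2})-(\\d{2})\\s+(\\d{2}):(\\d{2})发布
    (\\S+过敏\\S+):\\s+(\\S+)\\s+(\\S+)
    \\s+(感冒\\S+):\\s+(\\S+)\\s+(\\S+)
    \\s*(\\S+)\\s*
    首要污染物:\\s*(\\S+)\\s*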
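As an illustration of how such patterns might be used, the sketch below applies the first and the last demo pattern to sample strings. The class name DemoPatternUsage, the method names, and the sample inputs are assumptions made for this example; real page text may differ, for instance by using a full-width colon.

```java
import java.time.LocalDateTime;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DemoPatternUsage {
    // First demo pattern: date, whitespace, time, then the literal "发布".
    private static final Pattern PUBLISHED_AT = Pattern.compile(
            "(\\d{4})-(\\d{2})-(\\d{2})\\s+(\\d{2}):(\\d{2})发布");

    // Last demo pattern: the value that follows "首要污染物:".
    private static final Pattern PRIMARY_POLLUTANT =
            Pattern.compile("首要污染物:\\s*(\\S+)\\s*");

    /** Returns the publish timestamp, or null when the text does not match. */
    static LocalDateTime parsePublishedAt(String text) {
        Matcher m = PUBLISHED_AT.matcher(text);
        if (!m.find()) {
            return null;
        }
        // LocalDateTime.of throws DateTimeException for impossible values such as month 13.
        return LocalDateTime.of(
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)),
                Integer.parseInt(m.group(4)),
                Integer.parseInt(m.group(5)));
    }

    /** Returns the pollutant name, or null when the text does not match. */
    static String parsePrimaryPollutant(String text) {
        Matcher m = PRIMARY_POLLUTANT.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical inputs, not copied from a real page.
        System.out.println(parsePublishedAt("2015-04-20 08:00发布"));
        System.out.println(parsePrimaryPollutant("首要污染物: PM2.5"));
    }
}
```

Each capture group is read by its position, so group(1) is always the first parenthesized part of the pattern, counted from the left.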
Regular expression syntax reference:
https://msdn.microsoft.com/zh-cn/library/ae5bf541%28v=vs.80%29.aspx