[HTML Parser] Parsing HTML with the Third-Party Library Jsoup

Jsoup official site: http://jsoup.org

Apache HttpComponents official site: http://hc.apache.org/index.html

 

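1. Fetching the HTML content

Here we use the HttpClient library from Apache HttpComponents to request the remote HTML for a given URL. The method returns null if the request fails or the server does not respond with 200 OK.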

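import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class WebCrawler {

    // Fetch the HTML at the given URL; returns null if the request fails
    // or the server does not answer with 200 OK.
    public static String getHTMLFromURL(String url) {
        String html = null;
        HttpClient httpClient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        try {
            HttpResponse httpResponse = httpClient.execute(httpGet);
            int statusCode = httpResponse.getStatusLine().getStatusCode();
            if (statusCode == HttpStatus.SC_OK) {
                HttpEntity entity = httpResponse.getEntity();
                if (entity != null) {
                    html = EntityUtils.toString(entity);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release the connections held by the client
            httpClient.getConnectionManager().shutdown();
        }
        return html;
    }
}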
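
DefaultHttpClient has been deprecated since HttpClient 4.3. As an alternative, the same fetch can be sketched with the HTTP client built into the JDK (Java 11 and later). This is not the approach used in this post, only a minimal sketch under stated assumptions: the class name HtmlFetcher, the helper buildRequest, and the 10-second timeout are all illustrative choices.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper class; the names here are illustrative, not from the post.
final class HtmlFetcher {

    // Build a plain GET request for the given URL (the timeout value is an assumption).
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    // Return the response body for a 200 response, or null otherwise,
    // mirroring the null-on-failure behavior of the Apache-based method.
    static String getHTMLFromURL(String url) {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        try {
            HttpResponse<String> response =
                    client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200 ? response.body() : null;
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            return null;
        }
    }
}
```

Calling HtmlFetcher.getHTMLFromURL("http://www.baidu.com") is meant to behave like the Apache-based method. BodyHandlers.ofString() decodes the body using the charset named in the Content-Type response header and falls back to UTF-8 when none is given.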

 

2. Parsing the HTML

Example: print the title of the Baidu home page

> Parse the HTML to obtain a Document object

Document doc = Jsoup.parse(html);

> Select elements with a CSS or jQuery-like selector

Elements elements = doc.select("title");

> Print the text content of the selected elements

System.out.println(elements.text());

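import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Putting the steps together: fetch the page, parse it, and print every <title> element
String html = WebCrawler.getHTMLFromURL("http://www.baidu.com");
if (html != null) {
    Document doc = Jsoup.parse(html);
    Elements elements = doc.select("title");
    for (Element element : elements) {
        System.out.println(element.text());
    }
}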
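
The same selector approach works for elements other than the title. The sketch below parses an inline HTML string so that it runs without network access, and collects the absolute URL of every link. It assumes jsoup is on the classpath; the class name SelectorDemo, the sample markup, and the base URI http://www.example.com/ are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; the sample HTML below is invented for illustration.
final class SelectorDemo {

    static final String SAMPLE_HTML =
            "<html><head><title>Sample page</title></head>"
          + "<body><a href=\"/news\">News</a> <a href=\"http://jsoup.org\">jsoup</a></body></html>";

    // Parse with a base URI so that relative links can be resolved.
    static Document parseSample() {
        return Jsoup.parse(SAMPLE_HTML, "http://www.example.com/");
    }

    // Collect the absolute URL of every <a> element that has an href attribute.
    static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            links.add(link.absUrl("href"));
        }
        return links;
    }
}
```

Passing a base URI to Jsoup.parse is what lets absUrl("href") turn a relative link such as /news into a full URL. Without a base URI, absUrl returns an empty string for relative links.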

 

Output:

 

posted on 2013-03-20 11:38 by Anthony Li

This cnblogs blog is no longer updated. Blog address: dyinigbleed.com