HttpClient (2) -- Simulating a Browser to Fetch Web Pages

I. Setting the User-Agent request header to simulate a browser

   1. When the code from Part 1 is used to access Tuicool (推酷), the site returns the following:

Page content: <!DOCTYPE html>
<html>
    <head>
          <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    </head>
    <body>
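        <p>The system has detected that you are not a real person. Due to limited system resources, we can only refuse your request. If you have any questions, you can contact us via Weibo at http://weibo.com/tuicool2012/ .</p>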
    </body>
</html>

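  This happens because the site restricts crawlers. One way around it is to set the User-Agent request header so that the request looks like it comes from a browser. The code is as follows: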

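import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

// The class name is illustrative; the original snippet showed only main().
public class UserAgentDemo {

    /**
     * Fetches a web page with a GET request.
     */
    public static void main(String[] args) throws IOException {
        // Create the HttpGet instance
        HttpGet httpGet = new HttpGet("http://www.tuicool.com");
        // Send a browser-like User-Agent so the site does not reject the request
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
        // try-with-resources closes the response and the client even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity();   // response body; null if there is none
            if (entity != null) {
                String result = EntityUtils.toString(entity, "UTF-8");
                System.out.println("Page content: " + result);
            }
        }
    }
}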

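   Setting the User-Agent header on the HttpGet request is enough to make the request look like it comes from a browser, at least for sites that check only this header.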

II. Getting the response Content-Type

   Use entity.getContentType().getValue() to get the Content-Type. The code is as follows:

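import java.io.IOException;

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// The class name is illustrative; the original snippet showed only main().
public class ContentTypeDemo {

    public static void main(String[] args) throws IOException {
        // Create the HttpGet instance
        HttpGet httpGet = new HttpGet("http://www.tuicool.com");
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
        // try-with-resources closes the response and the client even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity();   // response body; null if there is none
            // getContentType() returns null when the header is absent
            Header contentType = (entity != null) ? entity.getContentType() : null;
            if (contentType != null) {
                System.out.println("Content-Type: " + contentType.getValue());
            }
        }
    }
}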
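
   The value returned by getValue() is the raw header string, typically of the form text/html; charset=utf-8 (the same shape as the meta tag in the page shown earlier). The following is a minimal sketch, using only the JDK, of how the charset parameter could be extracted from such a string, for example to choose the encoding passed to EntityUtils.toString. The class ContentTypeParser and the method charsetOf are hypothetical names introduced for illustration; they are not part of HttpClient.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; not part of HttpClient.
public class ContentTypeParser {

    /**
     * Returns the charset named in a Content-Type header value,
     * or the given fallback if none is present or it is not supported.
     */
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by ';'
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback;   // malformed or unsupported charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

   The fallback parameter is a design choice in this sketch: the caller decides what to assume when the server does not declare a charset.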

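III. Getting the response status

  200 -- OK

  403 -- Forbidden (the server refuses the request)

  500 -- Internal server error

  404 -- Page not found

  Use response.getStatusLine().getStatusCode() to get the response status code. The code is as follows: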

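import java.io.IOException;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// The class name is illustrative; the original snippet showed only main().
public class StatusCodeDemo {

    public static void main(String[] args) throws IOException {
        // Create the HttpGet instance
        HttpGet httpGet = new HttpGet("http://www.tuicool.com");
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
        // try-with-resources closes the response and the client even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            // Numeric HTTP status code taken from the response status line
            int state = response.getStatusLine().getStatusCode();
            System.out.println("Response status: " + state);
        }
    }
}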
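
  A crawler usually checks the status code before reading the body. The following is a minimal sketch, using only the JDK, of grouping a status code into a coarse category by its first digit, which covers the codes listed above. The class StatusCodes and its methods are hypothetical names introduced for illustration; they are not part of HttpClient.

```java
// Hypothetical helper for illustration; not part of HttpClient.
public class StatusCodes {

    /** Maps an HTTP status code to a coarse category based on its first digit. */
    public static String category(int statusCode) {
        switch (statusCode / 100) {
            case 1: return "informational";
            case 2: return "success";
            case 3: return "redirection";
            case 4: return "client error";
            case 5: return "server error";
            default: return "unknown";
        }
    }

    /** Returns true for codes in the 2xx range. */
    public static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    public static void main(String[] args) {
        int[] codes = {200, 403, 404, 500};
        for (int code : codes) {
            System.out.println(code + " -> " + category(code));
        }
    }
}
```

  With the value returned by getStatusCode(), a call such as StatusCodes.isSuccess(state) could guard the step that reads the entity.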

IV. HttpClient learning resources

  Open-source blog system - HttpClient

 

posted @ 2017-09-11 23:11  小葱拌豆腐~