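With some spare time on my hands I looked into Apache HttpClient and wrote a simple page crawler with it. These are short notes for future reference.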

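First, the jar files the examples depend on: httpclient-4.5.2.jar, httpcore-4.4.4.jar, commons-logging-1.2.jar and commons-codec-1.9.jar for HttpClient itself, plus commons-io-2.5.jar (used to save files) and jsoup-1.10.2.jar (used to parse HTML).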

Straight to the code:

1. Fetch a page while imitating a browser, and print the response status and content type:

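import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class Demo {

    public static void main(String[] args) throws Exception {
        HttpGet httpGet = new HttpGet("http://www.baidu.com/"); // create the GET request
        // Send a browser-like User-Agent so the request looks like it comes from a browser
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
        // try-with-resources closes the response and the client even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault(); // create the client
             CloseableHttpResponse response = httpClient.execute(httpGet)) { // execute the GET request
            System.out.println("Status:" + response.getStatusLine().getStatusCode());
            HttpEntity entity = response.getEntity(); // response body, may be null
            Header contentType = entity == null ? null : entity.getContentType(); // null if the header is absent
            System.out.println("Content-Type:" + (contentType == null ? "unknown" : contentType.getValue()));
            // To print the page body (needs org.apache.http.util.EntityUtils):
            // System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8"));
        }
    }
}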
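
The Content-Type value printed above often carries a charset parameter, for example `text/html; charset=utf-8`. As a rough illustration of how that value can be picked apart, here is a minimal sketch that uses only the JDK. The class `ContentTypeCharset` and its method `charsetOf` are hypothetical names made up for this note, not part of HttpClient, and falling back to UTF-8 is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper (not part of HttpClient): returns the charset named in a
// Content-Type header value, or UTF-8 when none is present or usable.
class ContentTypeCharset {

    static Charset charsetOf(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        // Parameters follow the media type and are separated by ';'
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: use the assumed fallback
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));
        System.out.println(charsetOf("text/html"));
    }
}
```

In the demo, the string passed in would be the value returned by `entity.getContentType().getValue()`.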

2. Download an image and save it locally:

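import java.io.File;
import java.io.InputStream;

import org.apache.commons.io.FileUtils;
import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class Demo1 {

    public static void main(String[] args) throws Exception {
        HttpGet httpGet = new HttpGet("http://www.baidu.com/jd4.gif"); // create the GET request
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
        // try-with-resources closes the response and the client even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault(); // create the client
             CloseableHttpResponse response = httpClient.execute(httpGet)) { // execute the GET request
            HttpEntity entity = response.getEntity(); // response body, may be null
            if (entity != null) {
                Header contentType = entity.getContentType(); // null if the header is absent
                System.out.println("ContentType:" + (contentType == null ? "unknown" : contentType.getValue()));
                InputStream inputStream = entity.getContent();
                // Stream the response body into a local file
                FileUtils.copyToFile(inputStream, new File("D:/jd4.gif"));
            }
        }
    }
}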
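
The example above saves whatever the server returns under a hard-coded file name. A minimal JDK-only sketch of two checks one might add before saving: confirm that the response looks like an image, and derive the local file name from the URL. `DownloadTarget`, `looksLikeImage`, `targetFor` and the fallback name `download.bin` are all hypothetical and chosen just for illustration.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper: decides whether a response is worth saving and where.
class DownloadTarget {

    // True only for a 200 response whose Content-Type starts with "image/"
    static boolean looksLikeImage(int statusCode, String contentType) {
        return statusCode == 200
                && contentType != null
                && contentType.trim().toLowerCase(Locale.ROOT).startsWith("image/");
    }

    // Uses the last path segment of the URL as the file name inside targetDir
    static Path targetFor(String url, String targetDir) {
        String path = URI.create(url).getPath(); // the query string is not part of the path
        String name = path == null ? "" : path.substring(path.lastIndexOf('/') + 1);
        if (name.isEmpty()) {
            name = "download.bin"; // assumed fallback when the URL has no file name
        }
        return Paths.get(targetDir, name);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeImage(200, "image/gif"));
        System.out.println(targetFor("http://www.baidu.com/jd4.gif", "downloads"));
    }
}
```

The status code and content type would come from `response.getStatusLine().getStatusCode()` and `entity.getContentType()`, as in the first example.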

3. Use a proxy IP with HttpClient:

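import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo01 {

    public static void main(String[] args) throws Exception {
        HttpGet httpGet = new HttpGet("http://www.tuicool.com/"); // create the GET request
        HttpHost proxy = new HttpHost("10.155.213.235", 8888); // proxy host and port
        RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
        httpGet.setConfig(config); // route this request through the proxy
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
        // try-with-resources closes the response and the client even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault(); // create the client
             CloseableHttpResponse response = httpClient.execute(httpGet)) { // execute the GET request
            HttpEntity entity = response.getEntity(); // response body
            System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8")); // read the page body
        }
    }
}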
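
The example above uses a single fixed proxy. A crawler that has several proxies available may want to rotate through them; the following is a minimal JDK-only sketch of round-robin selection over `host:port` strings. `ProxyPool` and `Endpoint` are hypothetical names, and the second address in `main` is invented for the demonstration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: hands out proxy endpoints in round-robin order.
class ProxyPool {

    // Plain host/port pair; the values can be passed to new HttpHost(host, port)
    static final class Endpoint {
        final String host;
        final int port;

        Endpoint(String host, int port) {
            this.host = host;
            this.port = port;
        }
    }

    private final List<Endpoint> endpoints = new ArrayList<>();
    private final AtomicInteger counter = new AtomicInteger();

    ProxyPool(List<String> addresses) {
        for (String address : addresses) {
            endpoints.add(parse(address));
        }
        if (endpoints.isEmpty()) {
            throw new IllegalArgumentException("at least one proxy address is required");
        }
    }

    // Parses "host:port"; rejects missing parts and out-of-range ports
    static Endpoint parse(String address) {
        int colon = address.lastIndexOf(':');
        if (colon <= 0 || colon == address.length() - 1) {
            throw new IllegalArgumentException("expected host:port but got " + address);
        }
        String host = address.substring(0, colon).trim();
        int port = Integer.parseInt(address.substring(colon + 1).trim());
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        return new Endpoint(host, port);
    }

    // Returns the next endpoint, wrapping around at the end of the list
    Endpoint next() {
        int index = Math.floorMod(counter.getAndIncrement(), endpoints.size());
        return endpoints.get(index);
    }

    public static void main(String[] args) {
        // The second address is made up for the demonstration
        ProxyPool pool = new ProxyPool(Arrays.asList("10.155.213.235:8888", "10.0.0.2:3128"));
        for (int i = 0; i < 3; i++) {
            Endpoint e = pool.next();
            System.out.println(e.host + ":" + e.port);
        }
    }
}
```

Each request would then build its `HttpHost` from the endpoint returned by `next()` and pass it to `RequestConfig.custom().setProxy(...)` as shown in the example.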

4. Fetch a page and parse it with jsoup to extract the parts you want:

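import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HttpClientDemo {

    public static void main(String[] args) throws Exception {
        HttpGet httpGet = new HttpGet("http://www.cnblogs.com/"); // create the GET request
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(10000) // connection timeout: 10 seconds
                .setSocketTimeout(20000)  // read (socket) timeout: 20 seconds
                .build();
        httpGet.setConfig(config);
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // imitate a browser

        String content;
        // try-with-resources closes the response and the client even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault(); // create the client
             CloseableHttpResponse response = httpClient.execute(httpGet)) { // execute the GET request
            HttpEntity entity = response.getEntity(); // response body
            content = EntityUtils.toString(entity, "utf-8"); // read the page body as text
        }

        Document doc = Jsoup.parse(content); // parse the HTML into a document object

        // Select the DOM elements of all blog post links with a CSS selector
        Elements linkElements = doc.select("#post_list .post_item .post_item_body h3 a");
        for (Element e : linkElements) {
            System.out.println("Title: " + e.text());
            System.out.println("URL: " + e.attr("href"));
            System.out.println("target: " + e.attr("target"));
        }

        // first() returns null when nothing matches, so check before using the element
        Element linkElement = doc.select("#friend_link").first();
        if (linkElement != null) {
            System.out.println("Text: " + linkElement.text());
            System.out.println("Html: " + linkElement.html());
        }
    }
}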
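
The `href` values printed by the example above may be relative (for example `/cate/java/`), so a crawler usually has to turn them into absolute URLs before following them. The following is a minimal JDK-only sketch of that step using `java.net.URI`. `LinkResolver` and `absolute` are hypothetical names, and returning `null` for unusable links is an assumption made for this sketch. jsoup can also resolve links itself when a base URI is passed to `Jsoup.parse(html, baseUri)` and the attribute is read as `abs:href`.

```java
import java.net.URI;

// Hypothetical helper: turns an href found on a page into an absolute URL.
class LinkResolver {

    // Resolves href against the URL of the page it was found on.
    // Returns null for empty hrefs or hrefs that are not valid URI references.
    static String absolute(String pageUrl, String href) {
        if (href == null || href.trim().isEmpty()) {
            return null;
        }
        try {
            return URI.create(pageUrl).resolve(href.trim()).toString();
        } catch (IllegalArgumentException e) {
            return null; // for example, an href containing spaces
        }
    }

    public static void main(String[] args) {
        System.out.println(absolute("http://www.cnblogs.com/", "/cate/java/"));
        System.out.println(absolute("http://www.cnblogs.com/cate/java/", "p/1.html"));
    }
}
```

In the loop over `linkElements`, `e.attr("href")` would be the second argument and the page URL passed to `HttpGet` the first.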

Some jsoup methods for querying the DOM:

getElementById(String id): find an element by its id
getElementsByTag(String tagName): find elements by tag name
getElementsByClass(String className): find elements by CSS class name
getElementsByAttribute(String key): find elements that have the given attribute
getElementsByAttributeValue(String key, String value): find elements by attribute name and value

 

posted on 2018-07-06 14:53 by 王衙内