In some spare time I looked into HttpClient and used it to do some simple page crawling. These are quick notes so I can refer back to them later.
First, the jar dependencies needed for HttpClient: httpclient-4.5.2.jar, httpcore-4.4.4.jar, commons-logging-1.2.jar, commons-codec-1.9.jar, commons-io-2.5.jar, jsoup-1.10.2.jar
Straight to the code:
1. Simulate a browser to fetch a page, and read the response content type and status code:
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class Demo {
    public static void main(String[] args) throws Exception {
        CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
        HttpGet httpGet = new HttpGet("http://www.baidu.com/"); // create the HttpGet instance
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // pretend to be a browser
        CloseableHttpResponse response = httpClient.execute(httpGet); // execute the HTTP GET request
        System.out.println("Status:" + response.getStatusLine().getStatusCode());
        HttpEntity entity = response.getEntity(); // get the response entity
        System.out.println("Content-Type:" + entity.getContentType().getValue());
        // System.out.println("Page content:" + EntityUtils.toString(entity, "utf-8")); // dump the page content as a string
        response.close(); // close the response
        httpClient.close(); // close the client
    }
}
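Side note: since CloseableHttpClient and CloseableHttpResponse both implement Closeable, the manual close() calls above can be replaced with try-with-resources. A minimal sketch of that variant (the class name DemoTryWithResources is just for illustration):

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class DemoTryWithResources {
    public static void main(String[] args) throws Exception {
        HttpGet httpGet = new HttpGet("http://www.baidu.com/");
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
        // try-with-resources closes the client and the response automatically
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            System.out.println("Status:" + response.getStatusLine().getStatusCode());
            HttpEntity entity = response.getEntity();
            System.out.println("Content-Type:" + entity.getContentType().getValue());
        }
    }
}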
2. Fetch an image and save it locally:
import java.io.File;
import java.io.InputStream;

import org.apache.commons.io.FileUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class Demo1 {
    public static void main(String[] args) throws Exception {
        CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
        HttpGet httpGet = new HttpGet("http://www.baidu.com/jd4.gif"); // create the HttpGet instance
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
        CloseableHttpResponse response = httpClient.execute(httpGet); // execute the HTTP GET request
        HttpEntity entity = response.getEntity(); // get the response entity
        if (entity != null) {
            System.out.println("ContentType:" + entity.getContentType().getValue());
            InputStream inputStream = entity.getContent();
            FileUtils.copyToFile(inputStream, new File("D://jd4.gif")); // write the image stream to a local file
        }
        response.close(); // close the response
        httpClient.close(); // close the client
    }
}
3. Using a proxy IP with HttpClient:
import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo01 {
    public static void main(String[] args) throws Exception {
        CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
        HttpGet httpGet = new HttpGet("http://www.tuicool.com/"); // create the HttpGet instance
        HttpHost proxy = new HttpHost("10.155.213.235", 8888); // proxy host and port
        RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
        httpGet.setConfig(config); // route this request through the proxy
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
        CloseableHttpResponse response = httpClient.execute(httpGet); // execute the HTTP GET request
        HttpEntity entity = response.getEntity(); // get the response entity
        System.out.println("Page content:" + EntityUtils.toString(entity, "utf-8")); // read the page content
        response.close(); // close the response
        httpClient.close(); // close the client
    }
}
4. Fetch a page and parse it with jsoup to extract the parts you want:
import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HttpClientDemo {
    public static void main(String[] args) throws Exception {
        CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
        HttpGet httpGet = new HttpGet("http://www.cnblogs.com/"); // create the HttpGet instance
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(10000) // connection timeout: 10 seconds
                .setSocketTimeout(20000)  // read (socket) timeout: 20 seconds
                .build();
        httpGet.setConfig(config);
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // pretend to be a browser
        CloseableHttpResponse response = httpClient.execute(httpGet); // execute the HTTP GET request
        HttpEntity entity = response.getEntity(); // get the response entity
        // System.out.println("Page content:" + EntityUtils.toString(entity, "utf-8")); // dump the raw page content
        String content = EntityUtils.toString(entity, "utf-8");
        response.close(); // close the response
        Document doc = Jsoup.parse(content); // parse the page into a Document
        Elements linkElements = doc.select("#post_list .post_item .post_item_body h3 a"); // select all blog post link elements
        for (Element e : linkElements) {
            System.out.println("Title:" + e.text());
            System.out.println("URL:" + e.attr("href"));
            System.out.println("target:" + e.attr("target"));
        }
        Element linkElement = doc.select("#friend_link").first();
        System.out.println("Plain text:" + linkElement.text());
        System.out.println("Html:" + linkElement.html());
        httpClient.close(); // close the client
    }
}
Some useful jsoup lookup methods:
getElementById(String id): look up a DOM element by id
getElementsByTag(String tagName): look up DOM elements by tag name
getElementsByClass(String className): look up DOM elements by class (style) name
getElementsByAttribute(String key): look up DOM elements by attribute name
getElementsByAttributeValue(String key, String value): look up DOM elements by attribute name and value
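To make these concrete, here is a minimal sketch of the lookup methods above applied to a small HTML fragment; the HTML string, the class name "titlelnk", and the variable names are made up for this example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupMethodsDemo {
    public static void main(String[] args) {
        // A made-up HTML fragment used only to demonstrate the lookup methods
        String html = "<div id='post_list'>"
                + "<a class='titlelnk' href='http://www.cnblogs.com/a' target='_blank'>Post A</a>"
                + "<a class='titlelnk' href='http://www.cnblogs.com/b' target='_blank'>Post B</a>"
                + "</div>";
        Document doc = Jsoup.parse(html);

        Element byId = doc.getElementById("post_list");                              // by id
        Elements byTag = doc.getElementsByTag("a");                                   // by tag name
        Elements byClass = doc.getElementsByClass("titlelnk");                        // by class name
        Elements byAttr = doc.getElementsByAttribute("href");                         // by attribute name
        Elements byAttrValue = doc.getElementsByAttributeValue("target", "_blank");   // by attribute name and value

        System.out.println("byId text: " + byId.text());
        System.out.println("byTag count: " + byTag.size());
        System.out.println("byClass count: " + byClass.size());
        System.out.println("byAttr count: " + byAttr.size());
        System.out.println("byAttrValue count: " + byAttrValue.size());
    }
}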