(2) Fetching Web Pages by Simulating a Browser
Section 1: Setting the User-Agent request header to simulate a browser
Take a request to www.tuicool.com as an example.
Using the code from the previous part:
package com.javaxk.httpclient.chap02;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo1 {

    public static void main(String[] args) throws Exception {
        CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
        HttpGet httpGet = new HttpGet("http://www.tuicool.com/"); // create the GET request
        CloseableHttpResponse response = httpClient.execute(httpGet); // execute the GET request
        HttpEntity entity = response.getEntity(); // get the response entity
        System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8")); // read the body as UTF-8 text
        response.close(); // close the response
        httpClient.close(); // close the client
    }

}

Response:
Page content: <!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
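<p>The system has detected that you are not behaving like a real person. Due to limited system resources, we have to refuse your request. If you have any questions, you can contact us on Weibo at http://weibo.com/tuicool2012/.</p>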
</body>
</html>
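The site has identified the request as non-human and refused it. To simulate a browser, set the User-Agent request header by adding this line:
httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // set the User-Agent request header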
package com.javaxk.httpclient.chap02;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo1 {

    public static void main(String[] args) throws Exception {
        CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
        HttpGet httpGet = new HttpGet("http://www.tuicool.com/"); // create the GET request
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // set the User-Agent request header
        CloseableHttpResponse response = httpClient.execute(httpGet); // execute the GET request
        HttpEntity entity = response.getEntity(); // get the response entity
        System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8")); // read the body as UTF-8 text
        response.close(); // close the response
        httpClient.close(); // close the client
    }

}
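Run it again; with the browser-like User-Agent the site should now return the actual page instead of the rejection notice.
Firebug in Firefox also shows the other request headers that a browser sends.
Any of them can be set as a key/value pair with the setHeader method to make the request look more like one from a browser.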
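As a minimal sketch of setting several headers at once, the helper below keeps a browser-like header set in one map. The class name BrowserHeaders and the method firefoxLike are hypothetical, and the Accept and Accept-Language values are illustrative assumptions rather than values captured from the Firebug session; only the User-Agent string comes from the example above. The sketch uses only the Java standard library, so the HttpClient call appears as a comment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that keeps a browser-like set of request headers in one place. */
final class BrowserHeaders {

    private BrowserHeaders() {}

    /** Returns header names mapped to values, in insertion order. */
    static Map<String, String> firefoxLike() {
        Map<String, String> headers = new LinkedHashMap<>();
        // User-Agent taken from the example above.
        headers.put("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
        // Illustrative values (assumptions), not copied from a real browser session.
        headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", "en-US,en;q=0.5");
        return headers;
    }

    public static void main(String[] args) {
        // With HttpClient, each entry would be applied as: httpGet.setHeader(name, value);
        firefoxLike().forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```

In the Demo1 program, the single setHeader line could then be replaced by a loop over this map that calls httpGet.setHeader for each entry.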
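Section 2: Getting the response content type (Content-Type)
Every response body has a type, given by the Content-Type header.
Firebug in Firefox shows it among the response headers.
HttpClient exposes it as well:
calling getContentType().getValue() on the HttpEntity returns the content type.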
package com.javaxk.httpclient.chap02;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo2 {

    public static void main(String[] args) throws Exception {
        CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
        HttpGet httpGet = new HttpGet("http://www.javaxk.com"); // create the GET request
        //HttpGet httpGet = new HttpGet("http://central.maven.org/maven2/HTTPClient/HTTPClient/0.3-3/HTTPClient-0.3-3.jar"); // alternative: request a jar file
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // set the User-Agent request header
        CloseableHttpResponse response = httpClient.execute(httpGet); // execute the GET request
        HttpEntity entity = response.getEntity(); // get the response entity
        System.out.println("Content-Type:" + entity.getContentType().getValue()); // print the response content type
        //System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8")); // read the body as UTF-8 text
        response.close(); // close the response
        httpClient.close(); // close the client
    }

}

Output:
Content-Type:text/html; charset=utf-8
Ordinary web pages are text/html, and some also carry a charset parameter.
Requesting www.tuicool.com, for example, prints:
Content-Type:text/html; charset=utf-8
For a JavaScript file, such as http://www.javaxk.com/include/dedeajax2.js
the output is:
Content-Type:application/javascript
For a file download, such as http://central.maven.org/maven2/HTTPClient/HTTPClient/0.3-3/HTTPClient-0.3-3.jar
the output is:
Content-Type:application/java-archive
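There are many more Content-Type values. They matter to a crawler because, while crawling, the
Content-Type can be used to pick out the pages worth fetching or to filter out the ones that should be skipped.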
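One possible way to do this filtering is sketched below, using only the Java standard library. The class ContentTypeFilter and its methods are hypothetical names, and treating only text/html as worth parsing is an assumption made for illustration. The input string stands for the value returned by entity.getContentType().getValue(); note that in HttpClient 4.x getContentType() can return null when the server sends no Content-Type header, so a real crawler would check for that before calling getValue().

```java
import java.util.Locale;

/** Hypothetical helper that decides from a Content-Type value whether a response is worth parsing. */
final class ContentTypeFilter {

    private ContentTypeFilter() {}

    /** Strips parameters such as "; charset=utf-8" and lower-cases the media type. */
    static String mediaType(String contentType) {
        if (contentType == null) {
            return ""; // no Content-Type header available
        }
        int semicolon = contentType.indexOf(';');
        String type = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return type.trim().toLowerCase(Locale.ROOT);
    }

    /** Assumption for this sketch: only HTML pages are parsed, everything else is skipped. */
    static boolean isHtml(String contentType) {
        return mediaType(contentType).equals("text/html");
    }

    public static void main(String[] args) {
        // The three Content-Type values seen in the examples above.
        String[] samples = {
            "text/html; charset=utf-8",
            "application/javascript",
            "application/java-archive"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + (isHtml(sample) ? "parse" : "skip"));
        }
    }
}
```

Comparing only the media type, without its parameters, means that text/html; charset=utf-8 and text/html;charset=UTF-8 are treated the same way.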
Section 3: Getting the response status (Status)
200 OK
403 Forbidden (request refused)
404 Not Found (page does not exist)
500 Internal Server Error
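When HttpClient sends a request to a server,
a successful request normally returns status code 200.
Not every request succeeds, though.
If the requested address does not exist, the server returns 404.
If the server hits an internal error, it returns 500.
Some servers have anti-scraping protection: if you collect data too frequently, they return 403 and refuse the request.
There is a way around that, proxy IPs, which a later part covers.
To read the status code, call getStatusLine().getStatusCode() on the CloseableHttpResponse object: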
package com.javaxk.httpclient.chap02;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo2 {

    public static void main(String[] args) throws Exception {
        CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
        HttpGet httpGet = new HttpGet("http://www.javaxk.com"); // create the GET request
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // set the User-Agent request header
        CloseableHttpResponse response = httpClient.execute(httpGet); // execute the GET request
        System.out.println("Status:" + response.getStatusLine().getStatusCode()); // print the response status code
        HttpEntity entity = response.getEntity(); // get the response entity
        System.out.println("Content-Type:" + entity.getContentType().getValue()); // print the response content type
        //System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8")); // read the body as UTF-8 text
        response.close(); // close the response
        httpClient.close(); // close the client
    }

}

Output:
Status:200
Content-Type:text/html;charset=UTF-8
Now switch to another page, http://www.javaxk.com/a.jsp
Because that page does not exist,
the server returns 404.
package com.javaxk.httpclient.chap02;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo2 {

    public static void main(String[] args) throws Exception {
        CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
        HttpGet httpGet = new HttpGet("http://www.javaxk.com/a.jsp"); // create the GET request for a page that does not exist
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // set the User-Agent request header
        CloseableHttpResponse response = httpClient.execute(httpGet); // execute the GET request
        System.out.println("Status:" + response.getStatusLine().getStatusCode()); // print the response status code
        HttpEntity entity = response.getEntity(); // get the response entity
        System.out.println("Content-Type:" + entity.getContentType().getValue()); // print the response content type
        //System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8")); // read the body as UTF-8 text
        response.close(); // close the response
        httpClient.close(); // close the client
    }

}

Output:
Status:404
Content-Type:text/html
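A crawler usually needs to act on the status code rather than just print it. The sketch below shows one way the four codes discussed in this section could drive that decision. The class StatusPolicy, the Action enum, and the mapping itself are assumptions made for illustration; the int argument stands for the value returned by response.getStatusLine().getStatusCode().

```java
/** Hypothetical policy that maps an HTTP status code to what a crawler does next. */
final class StatusPolicy {

    /** Possible follow-up actions; the names are illustrative. */
    enum Action { PARSE, SKIP, RETRY_LATER, BACK_OFF }

    private StatusPolicy() {}

    static Action decide(int statusCode) {
        if (statusCode == 200) {
            return Action.PARSE;       // request succeeded, the body can be read
        }
        if (statusCode == 403) {
            return Action.BACK_OFF;    // refused, possibly by anti-scraping protection
        }
        if (statusCode == 404) {
            return Action.SKIP;        // page does not exist
        }
        if (statusCode >= 500 && statusCode <= 599) {
            return Action.RETRY_LATER; // server-side error, may be temporary
        }
        return Action.SKIP;            // anything else is ignored in this sketch
    }

    public static void main(String[] args) {
        int[] codes = {200, 403, 404, 500};
        for (int code : codes) {
            System.out.println(code + " -> " + decide(code));
        }
    }
}
```

In the Demo2 program, the body would then be read with EntityUtils.toString only when decide returns PARSE.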